Evaluation API ============== This page documents the evaluation APIs in CausalFM. Metrics ------- compute_pehe ~~~~~~~~~~~~ **Function:** ``causalfm.evaluation.metrics.compute_pehe(predictions, ground_truth)`` Compute Precision in Estimation of Heterogeneous Effects (PEHE). .. math:: \text{PEHE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (\tau_i - \hat{\tau}_i)^2} :param predictions: Predicted CATE values :type predictions: np.ndarray or torch.Tensor :param ground_truth: True ITE values :type ground_truth: np.ndarray or torch.Tensor :return: PEHE score :rtype: float Example: .. code-block:: python from causalfm.evaluation import compute_pehe import numpy as np true_ite = np.array([1.5, 2.3, 0.8, -0.5, 1.2]) pred_cate = np.array([1.4, 2.1, 0.9, -0.3, 1.1]) pehe = compute_pehe(pred_cate, true_ite) print(f"PEHE: {pehe:.4f}") compute_ate_error ~~~~~~~~~~~~~~~~~ **Function:** ``causalfm.evaluation.metrics.compute_ate_error(predictions, ground_truth)`` Compute Average Treatment Effect error. .. math:: \text{ATE Error} = \left|\frac{1}{n}\sum_{i=1}^n \tau_i - \frac{1}{n}\sum_{i=1}^n \hat{\tau}_i\right| :param predictions: Predicted CATE values :type predictions: np.ndarray or torch.Tensor :param ground_truth: True ITE values :type ground_truth: np.ndarray or torch.Tensor :return: ATE error :rtype: float Example: .. code-block:: python from causalfm.evaluation import compute_ate_error ate_error = compute_ate_error(pred_cate, true_ite) print(f"ATE Error: {ate_error:.4f}") compute_mse ~~~~~~~~~~~ **Function:** ``causalfm.evaluation.metrics.compute_mse(predictions, ground_truth)`` Compute Mean Squared Error. .. math:: \text{MSE} = \frac{1}{n}\sum_{i=1}^n (\tau_i - \hat{\tau}_i)^2 :param predictions: Predicted values :type predictions: np.ndarray or torch.Tensor :param ground_truth: True values :type ground_truth: np.ndarray or torch.Tensor :return: MSE :rtype: float Example: .. code-block:: python from causalfm.evaluation import compute_mse mse = compute_mse(pred_cate, true_ite) print(f"MSE: {mse:.4f}") compute_rmse ~~~~~~~~~~~~ **Function:** ``causalfm.evaluation.metrics.compute_rmse(predictions, ground_truth)`` Compute Root Mean Squared Error. .. math:: \text{RMSE} = \sqrt{\text{MSE}} :param predictions: Predicted values :type predictions: np.ndarray or torch.Tensor :param ground_truth: True values :type ground_truth: np.ndarray or torch.Tensor :return: RMSE :rtype: float Example: .. code-block:: python from causalfm.evaluation import compute_rmse rmse = compute_rmse(pred_cate, true_ite) print(f"RMSE: {rmse:.4f}") Basic Usage ----------- Computing Multiple Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from causalfm.evaluation import ( compute_pehe, compute_ate_error, compute_mse, compute_rmse ) import numpy as np # Your predictions and ground truth predictions = np.random.randn(100) ground_truth = np.random.randn(100) # Compute all metrics pehe = compute_pehe(predictions, ground_truth) ate_error = compute_ate_error(predictions, ground_truth) mse = compute_mse(predictions, ground_truth) rmse = compute_rmse(predictions, ground_truth) print(f"PEHE: {pehe:.4f}") print(f"ATE Error: {ate_error:.4f}") print(f"MSE: {mse:.4f}") print(f"RMSE: {rmse:.4f}") With PyTorch Tensors ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import torch from causalfm.evaluation import compute_pehe # Works with torch tensors pred = torch.randn(100) true = torch.randn(100) pehe = compute_pehe(pred, true) print(f"PEHE: {pehe:.4f}") Model Evaluation ---------------- Evaluating a Model on a Dataset ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from causalfm.models import StandardCATEModel from causalfm.evaluation import compute_pehe, compute_ate_error import pandas as pd import torch # Load model model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth") # Load test data df = pd.read_csv("data/test/test_dataset_1.csv") # Extract features x_cols = [c for c in df.columns if c.startswith('x')] X = torch.FloatTensor(df[x_cols].values) A = torch.FloatTensor(df['treatment'].values).unsqueeze(1) Y = torch.FloatTensor(df['outcome'].values).unsqueeze(1) true_ite = df['ite'].values # Split n_train = int(0.8 * len(X)) x_train, x_test = X[:n_train], X[n_train:] a_train, y_train = A[:n_train], Y[:n_train] ite_test = true_ite[n_train:] # Predict result = model.estimate_cate(x_train, a_train, y_train, x_test) pred_cate = result['cate'].cpu().numpy() # Evaluate pehe = compute_pehe(pred_cate, ite_test) ate_error = compute_ate_error(pred_cate, ite_test) print(f"PEHE: {pehe:.4f}") print(f"ATE Error: {ate_error:.4f}") Evaluating Multiple Datasets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from pathlib import Path import pandas as pd # Evaluate on all test datasets test_dir = Path("data/test/") results = [] for file in test_dir.glob("test_*.csv"): df = pd.read_csv(file) # ... prepare data and predict ... pehe = compute_pehe(pred_cate, ite_test) ate_error = compute_ate_error(pred_cate, ite_test) results.append({ 'dataset': file.name, 'pehe': pehe, 'ate_error': ate_error }) # Aggregate results results_df = pd.DataFrame(results) print(results_df) print(f"\nAverage PEHE: {results_df['pehe'].mean():.4f} ± {results_df['pehe'].std():.4f}") Advanced Evaluation ------------------- Uncertainty Quantification ~~~~~~~~~~~~~~~~~~~~~~~~~~ Evaluate calibration of uncertainty estimates: .. code-block:: python import numpy as np from causalfm.models import StandardCATEModel model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth") # Get predictions with uncertainty result = model.estimate_cate(x_train, a_train, y_train, x_test) pred_cate = result['cate'].cpu().numpy() pi = result['gmm_pi'].cpu().numpy() mu = result['gmm_mu'].cpu().numpy() sigma = result['gmm_sigma'].cpu().numpy() # Compute predictive variance variance = (pi * (sigma**2 + mu**2)).sum(axis=-1) - pred_cate**2 std_dev = np.sqrt(variance) # Check calibration errors = pred_cate - true_ite standardized_errors = errors / std_dev print(f"Mean standardized error: {standardized_errors.mean():.4f}") print(f"Std standardized error: {standardized_errors.std():.4f}") Coverage Analysis ~~~~~~~~~~~~~~~~~ .. code-block:: python # Sample from GMM for confidence intervals n_samples = 10000 samples = np.zeros((len(pred_cate), n_samples)) for i in range(len(pred_cate)): components = np.random.choice(len(pi[i]), size=n_samples, p=pi[i]) for k in range(len(pi[i])): mask = (components == k) n_k = mask.sum() if n_k > 0: samples[i, mask] = np.random.normal(mu[i, k], sigma[i, k], n_k) # Compute 95% CI ci_lower = np.percentile(samples, 2.5, axis=1) ci_upper = np.percentile(samples, 97.5, axis=1) # Check coverage coverage = np.mean((true_ite >= ci_lower) & (true_ite <= ci_upper)) print(f"95% CI Coverage: {coverage:.2%}") Visualization ------------- Plotting Results ~~~~~~~~~~~~~~~~ .. code-block:: python import matplotlib.pyplot as plt # Predicted vs True plt.figure(figsize=(8, 6)) plt.scatter(true_ite, pred_cate, alpha=0.6) min_val = min(true_ite.min(), pred_cate.min()) max_val = max(true_ite.max(), pred_cate.max()) plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect') plt.xlabel('True ITE') plt.ylabel('Predicted CATE') plt.title(f'PEHE: {pehe:.4f}') plt.legend() plt.grid(True, alpha=0.3) plt.savefig('predictions.png') Error Distribution ~~~~~~~~~~~~~~~~~~ .. code-block:: python errors = pred_cate - true_ite plt.figure(figsize=(10, 4)) # Histogram plt.subplot(1, 2, 1) plt.hist(errors, bins=30, edgecolor='black', alpha=0.7) plt.xlabel('Prediction Error') plt.ylabel('Frequency') plt.title('Error Distribution') plt.axvline(0, color='r', linestyle='--') # Box plot plt.subplot(1, 2, 2) plt.boxplot(errors) plt.ylabel('Prediction Error') plt.title('Error Box Plot') plt.axhline(0, color='r', linestyle='--') plt.tight_layout() plt.savefig('errors.png') Comparison ---------- Comparing Multiple Models ~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from causalfm.models import StandardCATEModel, IVModel from causalfm.evaluation import compute_pehe # Load models models = { 'Standard': StandardCATEModel.from_pretrained("checkpoints/standard.pth"), 'IV': IVModel.from_pretrained("checkpoints/iv.pth") } # Evaluate each results = {} for name, model in models.items(): result = model.estimate_cate(x_train, a_train, y_train, x_test) pred = result['cate'].cpu().numpy() pehe = compute_pehe(pred, true_ite) results[name] = pehe print("Model Comparison:") for name, pehe in results.items(): print(f" {name}: PEHE={pehe:.4f}") Baseline Comparison ~~~~~~~~~~~~~~~~~~~ .. code-block:: python import numpy as np # CausalFM causalfm_pehe = compute_pehe(pred_cate, true_ite) # Baseline 1: ATE for all ate_pred = np.full_like(true_ite, true_ite.mean()) ate_pehe = compute_pehe(ate_pred, true_ite) # Baseline 2: Zero effect zero_pred = np.zeros_like(true_ite) zero_pehe = compute_pehe(zero_pred, true_ite) print("Comparison:") print(f" CausalFM: {causalfm_pehe:.4f}") print(f" ATE Baseline: {ate_pehe:.4f}") print(f" Zero Baseline: {zero_pehe:.4f}") Best Practices -------------- Evaluation Guidelines ~~~~~~~~~~~~~~~~~~~~~ 1. **Multiple test datasets**: Use 10+ test datasets for robust estimates 2. **Report statistics**: Include mean ± std deviation 3. **Check calibration**: Verify uncertainty estimates are well-calibrated 4. **Compare baselines**: Always compare with simple baselines 5. **Visualize**: Create plots to identify systematic biases Complete Evaluation Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from causalfm.models import StandardCATEModel from causalfm.evaluation import compute_pehe, compute_ate_error, compute_rmse from pathlib import Path import pandas as pd # Load model model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth") # Evaluate all test datasets results = [] for file in Path("data/test/").glob("test_*.csv"): # ... load and prepare data ... # Predict result = model.estimate_cate(x_train, a_train, y_train, x_test) pred = result['cate'].cpu().numpy() # Evaluate results.append({ 'dataset': file.name, 'pehe': compute_pehe(pred, true_ite), 'ate_error': compute_ate_error(pred, true_ite), 'rmse': compute_rmse(pred, true_ite) }) # Summary df = pd.DataFrame(results) print("\nFinal Results:") print(f"PEHE: {df['pehe'].mean():.4f} ± {df['pehe'].std():.4f}") print(f"ATE Error: {df['ate_error'].mean():.4f} ± {df['ate_error'].std():.4f}") # Save df.to_csv("evaluation_results.csv", index=False) See Also -------- * :doc:`../user_guide/evaluation` - Detailed evaluation guide * :doc:`models` - Model API reference * :doc:`../examples/standard_cate` - Complete evaluation example