Evaluation
==========

CausalFM provides evaluation tools for assessing model performance on causal inference tasks.

.. important::

   **Data Normalization Required**

   If your model was trained on normalized data, you **must** normalize the test data
   in the same way before evaluation. Otherwise predictions and ground-truth effects
   are on different scales and the metrics are not comparable.

   .. code-block:: python

      from causalfm.data import normalize_data

      # Normalize test data
      X_norm, Y_norm, x_scaler, y_scaler = normalize_data(
          X_test, Y_test, Y0_test, Y1_test
      )

Evaluation Metrics
------------------

Standard Metrics
~~~~~~~~~~~~~~~~

CausalFM implements the following metrics for evaluating causal effect estimates:

**PEHE (Precision in Estimation of Heterogeneous Effects)**

Measures the accuracy of individual treatment effect (ITE) predictions:

.. math::

   \text{PEHE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (\tau_i - \hat{\tau}_i)^2}

where :math:`\tau_i` is the true ITE and :math:`\hat{\tau}_i` is the predicted CATE.

**ATE Error (Average Treatment Effect Error)**

Measures the accuracy of average treatment effect estimation:

.. math::

   \text{ATE Error} = \left|\frac{1}{n}\sum_{i=1}^n \tau_i - \frac{1}{n}\sum_{i=1}^n \hat{\tau}_i\right|

**MSE (Mean Squared Error)**

Mean squared difference between true and predicted treatment effects:

.. math::

   \text{MSE} = \frac{1}{n}\sum_{i=1}^n (\tau_i - \hat{\tau}_i)^2

**RMSE (Root Mean Squared Error)**

Square root of MSE, expressed in the same units as the treatment effect:

.. math::

   \text{RMSE} = \sqrt{\text{MSE}}

With these definitions, RMSE computed on treatment effects equals PEHE.

Basic Usage
~~~~~~~~~~~

.. code-block:: python

   from causalfm.evaluation import (
       compute_pehe,
       compute_ate_error,
       compute_mse,
       compute_rmse
   )
   import numpy as np

   # Ground truth ITEs
   true_ite = np.array([1.5, 2.3, 0.8, -0.5, 1.2])

   # Model predictions
   pred_cate = np.array([1.4, 2.1, 0.9, -0.3, 1.1])

   # Compute metrics
   pehe = compute_pehe(pred_cate, true_ite)
   ate_error = compute_ate_error(pred_cate, true_ite)
   mse = compute_mse(pred_cate, true_ite)
   rmse = compute_rmse(pred_cate, true_ite)

   print(f"PEHE: {pehe:.4f}")
   print(f"ATE Error: {ate_error:.4f}")
   print(f"MSE: {mse:.4f}")
   print(f"RMSE: {rmse:.4f}")

With PyTorch Tensors
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import torch
   from causalfm.evaluation import compute_pehe

   # Metrics work with both numpy arrays and torch tensors
   true_ite = torch.randn(100)
   pred_cate = torch.randn(100)

   pehe = compute_pehe(pred_cate, true_ite)
   print(f"PEHE: {pehe:.4f}")

Model Evaluation
----------------

Evaluating a Single Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from causalfm.models import StandardCATEModel
   from causalfm.evaluation import compute_pehe, compute_ate_error
   from causalfm.data import normalize_data, normalize_ite
   import pandas as pd
   import torch

   # Load model
   model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth")

   # Load test data
   df = pd.read_csv("data/test/test_dataset_1.csv")

   # Extract and normalize features
   x_cols = [c for c in df.columns if c.startswith('x')]
   X_norm, Y_norm, x_scaler, y_scaler = normalize_data(
       df[x_cols].values,
       df['outcome'].values,
       df['y0'].values,
       df['y1'].values
   )

   X = torch.FloatTensor(X_norm)
   A = torch.FloatTensor(df['treatment'].values).unsqueeze(1)
   Y = torch.FloatTensor(Y_norm).unsqueeze(1)

   # Normalize ITE for evaluation
   true_ite, _ = normalize_ite(df['y0'].values, df['y1'].values, y_scaler)

   # Split into train/test for in-context learning
   n_train = int(0.8 * len(X))
   x_train, x_test = X[:n_train], X[n_train:]
   a_train, y_train = A[:n_train], Y[:n_train]
   ite_test = true_ite[n_train:]

   # Predict
   result = model.estimate_cate(x_train, a_train, y_train, x_test)
   pred_cate = result['cate'].cpu().numpy()

   # Evaluate
   pehe = compute_pehe(pred_cate, ite_test)
   ate_error = compute_ate_error(pred_cate, ite_test)

   print("Dataset: test_dataset_1")
   print(f"  PEHE: {pehe:.4f}")
   print(f"  ATE Error: {ate_error:.4f}")

Evaluating Multiple Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from causalfm.models import StandardCATEModel
   from causalfm.evaluation import compute_pehe, compute_ate_error
   import pandas as pd
   import torch
   from pathlib import Path

   # Load model
   model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth")

   # Evaluate on multiple test datasets
   test_dir = Path("data/test/")
   test_files = sorted(test_dir.glob("test_*.csv"))

   results = []
   for file in test_files:
       df = pd.read_csv(file)

       # Extract features
       # (if the model was trained on normalized data, normalize here
       # as in the single-dataset example)
       x_cols = [c for c in df.columns if c.startswith('x')]
       X = torch.FloatTensor(df[x_cols].values)
       A = torch.FloatTensor(df['treatment'].values).unsqueeze(1)
       Y = torch.FloatTensor(df['outcome'].values).unsqueeze(1)
       true_ite = df['ite'].values

       # Split
       n_train = int(0.8 * len(X))
       x_train, x_test = X[:n_train], X[n_train:]
       a_train, y_train = A[:n_train], Y[:n_train]
       ite_test = true_ite[n_train:]

       # Predict and evaluate
       result = model.estimate_cate(x_train, a_train, y_train, x_test)
       pred_cate = result['cate'].cpu().numpy()

       pehe = compute_pehe(pred_cate, ite_test)
       ate_error = compute_ate_error(pred_cate, ite_test)

       results.append({
           'dataset': file.name,
           'pehe': pehe,
           'ate_error': ate_error
       })

   # Create results DataFrame
   results_df = pd.DataFrame(results)
   print(results_df)
   print(f"\nAverage PEHE: {results_df['pehe'].mean():.4f} ± {results_df['pehe'].std():.4f}")
   print(f"Average ATE Error: {results_df['ate_error'].mean():.4f} ± {results_df['ate_error'].std():.4f}")

Automated Evaluation
~~~~~~~~~~~~~~~~~~~~

For convenience, use the built-in evaluation utilities:

.. code-block:: python

   from causalfm.evaluation.metrics import evaluate_model_on_dataset

   # Evaluate single dataset
   result = evaluate_model_on_dataset(
       model,
       data_path="data/test/test_dataset_1.csv",
       train_ratio=0.8
   )

   print(f"PEHE: {result['pehe']:.4f}")
   print(f"ATE Error: {result['ate_error']:.4f}")

.. code-block:: python

   from causalfm.evaluation.metrics import evaluate_model_on_directory

   # Evaluate all datasets in a directory
   results_df = evaluate_model_on_directory(
       model,
       data_dir="data/test/",
       file_pattern="test_*.csv",
       train_ratio=0.8
   )

   print(results_df)
   print("\nSummary:")
   print(results_df[['pehe', 'ate_error']].describe())

Comparing Models
----------------

Comparing Multiple Models
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from causalfm.models import StandardCATEModel, IVModel, FrontdoorModel
   from causalfm.evaluation import compute_pehe
   import pandas as pd

   # Load models
   models = {
       'Standard': StandardCATEModel.from_pretrained("checkpoints/standard.pth"),
       'IV': IVModel.from_pretrained("checkpoints/iv.pth"),
       'Frontdoor': FrontdoorModel.from_pretrained("checkpoints/frontdoor.pth")
   }

   # Evaluate each model
   comparison_results = []
   for name, model in models.items():
       # ... load and prepare data ...

       result = model.estimate_cate(x_train, a_train, y_train, x_test)
       pred_cate = result['cate'].cpu().numpy()

       pehe = compute_pehe(pred_cate, true_ite)

       comparison_results.append({
           'model': name,
           'pehe': pehe
       })

   comparison_df = pd.DataFrame(comparison_results)
   print(comparison_df)

Baseline Comparisons
~~~~~~~~~~~~~~~~~~~~

Compare with simple baselines:

.. code-block:: python

   import numpy as np
   from causalfm.evaluation import compute_pehe

   # CausalFM prediction
   causalfm_pehe = compute_pehe(pred_cate, true_ite)

   # Baseline 1: Predict ATE for everyone
   # (an oracle baseline: it uses the true ATE, which is unknown in practice)
   ate_baseline = np.full_like(true_ite, true_ite.mean())
   baseline_ate_pehe = compute_pehe(ate_baseline, true_ite)

   # Baseline 2: Random predictions
   random_pred = np.random.randn(len(true_ite))
   random_pehe = compute_pehe(random_pred, true_ite)

   # Baseline 3: Zero effect
   zero_pred = np.zeros_like(true_ite)
   zero_pehe = compute_pehe(zero_pred, true_ite)

   print(f"CausalFM PEHE: {causalfm_pehe:.4f}")
   print(f"ATE Baseline PEHE: {baseline_ate_pehe:.4f}")
   print(f"Random PEHE: {random_pehe:.4f}")
   print(f"Zero Effect PEHE: {zero_pehe:.4f}")

Uncertainty Evaluation
----------------------

Calibration Analysis
~~~~~~~~~~~~~~~~~~~~

Evaluate the calibration of uncertainty estimates:

.. code-block:: python

   import matplotlib.pyplot as plt
   import numpy as np
   from scipy import stats

   from causalfm.models import StandardCATEModel

   model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth")

   # Get predictions with uncertainty
   # (x_train, a_train, y_train, x_test are prepared as in the examples above;
   # true_ite holds the ground-truth ITEs for x_test, e.g. ite_test)
   result = model.estimate_cate(x_train, a_train, y_train, x_test)
   pred_cate = result['cate'].cpu().numpy()

   # GMM parameters
   pi = result['gmm_pi'].cpu().numpy()
   mu = result['gmm_mu'].cpu().numpy()
   sigma = result['gmm_sigma'].cpu().numpy()

   # Predictive variance of the mixture: E[tau^2] - E[tau]^2
   mixture_mean = (pi * mu).sum(axis=-1)
   variance = (pi * (sigma**2 + mu**2)).sum(axis=-1) - mixture_mean**2
   # Clip to guard against tiny negative values from floating-point round-off
   std_dev = np.sqrt(np.clip(variance, 1e-12, None))

   # Compute standardized errors
   errors = pred_cate - true_ite
   standardized_errors = errors / std_dev

   # Check if standardized errors follow N(0,1)
   print(f"Mean of standardized errors: {standardized_errors.mean():.4f}")
   print(f"Std of standardized errors: {standardized_errors.std():.4f}")

   # Calibration plot
   plt.figure(figsize=(10, 5))

   # Plot 1: Predicted std vs absolute error
   plt.subplot(1, 2, 1)
   plt.scatter(std_dev, np.abs(errors), alpha=0.5)
   plt.xlabel('Predicted Std Dev')
   plt.ylabel('Absolute Error')
   plt.title('Uncertainty Calibration')

   # Plot 2: QQ plot of standardized errors
   plt.subplot(1, 2, 2)
   stats.probplot(standardized_errors, dist="norm", plot=plt)
   plt.title('Q-Q Plot')

   plt.tight_layout()
   plt.savefig('calibration.png')

Coverage Analysis
~~~~~~~~~~~~~~~~~

Check whether the 95% intervals implied by the GMM contain the true ITE at the
nominal rate. This example reuses ``pi``, ``mu``, ``sigma``, ``pred_cate`` and
``true_ite`` from the calibration analysis.

.. code-block:: python

   import numpy as np

   # Draw samples from each test point's predictive GMM
   n_samples = 10000
   n_test = len(pred_cate)

   samples = np.zeros((n_test, n_samples))
   for i in range(n_test):
       # Re-normalize the weights in float64: np.random.choice rejects
       # float32 probabilities that do not sum to 1 within its tolerance
       p = pi[i].astype(np.float64)
       p /= p.sum()

       # Sample component indices
       components = np.random.choice(len(p), size=n_samples, p=p)

       # Sample from selected components
       for k in range(len(p)):
           mask = (components == k)
           n_k = mask.sum()
           if n_k > 0:
               samples[i, mask] = np.random.normal(
                   mu[i, k], sigma[i, k], n_k
               )

   # Compute 95% intervals from the sampled predictive distribution
   ci_lower = np.percentile(samples, 2.5, axis=1)
   ci_upper = np.percentile(samples, 97.5, axis=1)

   # Check coverage
   coverage = np.mean((true_ite >= ci_lower) & (true_ite <= ci_upper))
   print(f"95% CI Coverage: {coverage:.2%}")
   # Expected: ~95% for well-calibrated model

Visualization
-------------

Plotting Predictions
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import matplotlib.pyplot as plt
   import numpy as np

   # Scatter plot: predicted vs true
   plt.figure(figsize=(8, 6))
   plt.scatter(true_ite, pred_cate, alpha=0.6)

   # Perfect prediction line
   min_val = min(true_ite.min(), pred_cate.min())
   max_val = max(true_ite.max(), pred_cate.max())
   plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect Prediction')

   plt.xlabel('True ITE')
   plt.ylabel('Predicted CATE')
   plt.title(f'CATE Predictions (PEHE: {pehe:.4f})')
   plt.legend()
   plt.grid(True, alpha=0.3)
   plt.savefig('predictions.png')

Error Distribution
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import matplotlib.pyplot as plt

   errors = pred_cate - true_ite

   plt.figure(figsize=(10, 4))

   # Histogram
   plt.subplot(1, 2, 1)
   plt.hist(errors, bins=30, edgecolor='black', alpha=0.7)
   plt.xlabel('Prediction Error')
   plt.ylabel('Frequency')
   plt.title('Error Distribution')
   plt.axvline(0, color='r', linestyle='--', label='Zero Error')
   plt.legend()

   # Box plot
   plt.subplot(1, 2, 2)
   plt.boxplot(errors)
   plt.ylabel('Prediction Error')
   plt.title('Error Box Plot')
   plt.axhline(0, color='r', linestyle='--')

   plt.tight_layout()
   plt.savefig('error_distribution.png')

Uncertainty Visualization
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import matplotlib.pyplot as plt
   import numpy as np

   # ci_lower and ci_upper come from the coverage analysis above

   # Sort by predicted CATE
   sorted_idx = np.argsort(pred_cate)

   plt.figure(figsize=(12, 6))
   x = np.arange(len(sorted_idx))

   # Plot predictions with uncertainty bands
   plt.plot(x, pred_cate[sorted_idx], label='Predicted CATE', color='blue')
   plt.fill_between(x, ci_lower[sorted_idx], ci_upper[sorted_idx],
                    alpha=0.3, label='95% CI')
   plt.scatter(x, true_ite[sorted_idx], s=10, alpha=0.5,
               color='red', label='True ITE')

   plt.xlabel('Sample (sorted by prediction)')
   plt.ylabel('Treatment Effect')
   plt.title('CATE Predictions with Uncertainty')
   plt.legend()
   plt.grid(True, alpha=0.3)
   plt.savefig('uncertainty.png')

Real-World Evaluation
---------------------

Jobs Dataset Example
~~~~~~~~~~~~~~~~~~~~

Real-world data has no ground-truth ITEs, so PEHE cannot be computed. Inspect the
estimated effects instead:

.. code-block:: python

   from causalfm.models import StandardCATEModel
   import pandas as pd
   import torch

   # Load model
   model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth")

   # Load Jobs dataset (real-world data)
   df = pd.read_csv("DATA_standard/jobs_data/jobs_data.csv")

   # Prepare data
   feature_cols = ['age', 'education', 'black', 'hispanic',
                   'married', 'nodegree', 're74', 're75']
   X = torch.FloatTensor(df[feature_cols].values)
   A = torch.FloatTensor(df['treat'].values).unsqueeze(1)
   Y = torch.FloatTensor(df['re78'].values).unsqueeze(1)

   # Split
   n_train = int(0.8 * len(X))
   x_train, x_test = X[:n_train], X[n_train:]
   a_train, y_train = A[:n_train], Y[:n_train]

   # Estimate treatment effects
   result = model.estimate_cate(x_train, a_train, y_train, x_test)
   cate = result['cate'].cpu().numpy()

   # Analyze results
   print(f"Estimated ATE: {cate.mean():.2f}")
   print(f"CATE range: [{cate.min():.2f}, {cate.max():.2f}]")
   print(f"Percentage with positive effect: {(cate > 0).mean():.1%}")

Best Practices
--------------

Evaluation Guidelines
~~~~~~~~~~~~~~~~~~~~~

1. **Use multiple test datasets** (10+ recommended) to get robust estimates
2. **Report standard deviations** along with mean metrics
3. **Check uncertainty calibration** if using GMM predictions
4. **Compare with baselines** to demonstrate improvement
5. **Visualize predictions** to identify systematic biases

Example Complete Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from causalfm.models import StandardCATEModel
   from causalfm.evaluation import compute_pehe, compute_ate_error, compute_rmse
   import pandas as pd
   import numpy as np
   from pathlib import Path

   # Load model
   model = StandardCATEModel.from_pretrained("checkpoints/best_model.pth")

   # Evaluate on all test datasets
   test_dir = Path("data/test/")
   results = []

   for file in sorted(test_dir.glob("test_*.csv")):
       # Load and prepare data
       df = pd.read_csv(file)
       # ... (data preparation) ...

       # Predict
       result = model.estimate_cate(x_train, a_train, y_train, x_test)
       pred_cate = result['cate'].cpu().numpy()

       # Compute metrics
       results.append({
           'dataset': file.name,
           'pehe': compute_pehe(pred_cate, true_ite),
           'ate_error': compute_ate_error(pred_cate, true_ite),
           'rmse': compute_rmse(pred_cate, true_ite),
           'n_test': len(true_ite)
       })

   # Aggregate results
   results_df = pd.DataFrame(results)

   print("=" * 60)
   print("EVALUATION RESULTS")
   print("=" * 60)
   print(f"\nNumber of test datasets: {len(results_df)}")
   print("\nMetric Summary:")
   print(results_df[['pehe', 'ate_error', 'rmse']].describe())

   print("\nFinal Results:")
   print(f"  PEHE: {results_df['pehe'].mean():.4f} ± {results_df['pehe'].std():.4f}")
   print(f"  ATE Error: {results_df['ate_error'].mean():.4f} ± {results_df['ate_error'].std():.4f}")
   print(f"  RMSE: {results_df['rmse'].mean():.4f} ± {results_df['rmse'].std():.4f}")

   # Save results
   results_df.to_csv("evaluation_results.csv", index=False)

API Reference
-------------

For complete API documentation, see:

* :func:`causalfm.evaluation.compute_pehe`
* :func:`causalfm.evaluation.compute_ate_error`
* :func:`causalfm.evaluation.compute_mse`
* :func:`causalfm.evaluation.compute_rmse`
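
To sanity-check the functions listed above, the formulas from *Standard Metrics* can
be reproduced in a few lines of plain NumPy. The following is a minimal sketch, not
CausalFM's actual implementation: the ``*_sketch`` function names are hypothetical,
and the argument order (predictions first, ground truth second) is assumed only from
the usage examples on this page.

.. code-block:: python

   import numpy as np

   def _as_flat(a):
       # Accept lists or arrays of any shape and flatten to 1-D floats
       return np.asarray(a, dtype=float).ravel()

   def mse_sketch(pred_cate, true_ite):
       # Mean of squared differences between true and predicted effects
       pred, true = _as_flat(pred_cate), _as_flat(true_ite)
       return float(np.mean((true - pred) ** 2))

   def pehe_sketch(pred_cate, true_ite):
       # Square root of the mean squared ITE error
       return float(np.sqrt(mse_sketch(pred_cate, true_ite)))

   def ate_error_sketch(pred_cate, true_ite):
       # Absolute difference between the true and predicted average effects
       pred, true = _as_flat(pred_cate), _as_flat(true_ite)
       return float(abs(true.mean() - pred.mean()))

   # Same toy values as in the Basic Usage example
   true_ite = np.array([1.5, 2.3, 0.8, -0.5, 1.2])
   pred_cate = np.array([1.4, 2.1, 0.9, -0.3, 1.1])

   print(f"PEHE: {pehe_sketch(pred_cate, true_ite):.4f}")
   print(f"ATE Error: {ate_error_sketch(pred_cate, true_ite):.4f}")
   print(f"MSE: {mse_sketch(pred_cate, true_ite):.4f}")

If the library functions follow the formulas on this page, their outputs on the same
inputs should agree with this sketch up to floating-point error. A large discrepancy
usually points to a shape mismatch (for example ``(n, 1)`` against ``(n,)``) or to
comparing normalized predictions against unnormalized ground truth.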