University of Southern Denmark
Correlation \(\neq\) causation
We often see variables move together. But this correlation does not necessarily imply that one variable causes the other.
A third factor, called a confounder, can increase or decrease both variables, creating a non-causal association.
Identification and estimation
Quantifying causal effects requires disentangling two distinct tasks: identification (what the data could reveal about the estimand with unlimited observations) and estimation (how to learn it from a finite sample).
Identification strategies
Several identification strategies exist:
Potential outcomes model
Causality is often formalized through the potential outcomes framework (Neyman (1923); Rubin (1974)).
Under this model, the individual causal effect is \(Y_i(1) - Y_i(0)\).
→ Never observable, as the counterfactual is missing.
Therefore, we focus on estimands that are identifiable under (somewhat) reasonable assumptions.
Two important special cases of \(\tau_g\):
Naive comparisons won’t cut it
One might be tempted to simply compare mean outcomes for treated (\(D_i = 1\)) and control (\(D_i = 0\)) units. However,
\[ \begin{aligned} \mathbb{E} [ Y_i | D_i = 1] - \mathbb{E} [ Y_i | D_i = 0] & = \mathbb{E} [ Y_i(1) | D_i = 1] - \mathbb{E} [ Y_i(0) | D_i = 0] \\ &= \mathbb{E} [ Y_i(1) | D_i = 1] + \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1]} - \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1]} - \mathbb{E} [ Y_i(0) | D_i = 0] \\ &= \underbrace{\mathbb{E} [ Y_i(1) - Y_i (0)| D_i = 1]}_{\text{ATT}} + \underbrace{\mathbb{E} [Y_i (0) | D_i = 1] - \mathbb{E} [ Y_i(0) | D_i = 0]}_{\text{Selection bias}}. \end{aligned} \tag{1}\]
Exercise: Randomization kills selection bias (\(\approx 5\) minutes)
Show that, if treatment is randomly assigned, that is, if
\[ Y_i(1), Y_i(0) \perp D_i, \] then \(\mathbb{E} [ Y_i | D_i = 1] - \mathbb{E} [ Y_i | D_i = 0] = \text{ATT}\).
Solution. Under randomization, \(\mathbb{E} [Y_i (d) | D_i = d] = \mathbb{E} [Y_i (d) | D_i = 1 - d] = \mathbb{E} [Y_i (d)]\) for \(d = 0, 1\). Therefore, it’s simply a matter of “dropping the conditioning” from (1).
As a corollary, randomization ensures that \(\text{ATT} = \text{ATNT} = \text{ATE}\).
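To make the decomposition in (1) concrete, here is a minimal simulation sketch; the data-generating process with a single confounder is an illustrative assumption, not taken from the lecture. Under confounded assignment the naive mean difference equals the ATT plus selection bias, while under random assignment the bias term vanishes.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Potential outcomes with one confounder X: X raises Y(0) and (below) also
# the probability of treatment. The individual effect is 1 for every unit.
X = rng.normal(size=n)
Y0 = X + rng.normal(size=n)
Y1 = Y0 + 1.0

def naive_diff(D):
    """Naive mean comparison of observed outcomes between treated and controls."""
    Y = np.where(D == 1, Y1, Y0)
    return Y[D == 1].mean() - Y[D == 0].mean()

# Confounded assignment: high-X units are more likely to be treated.
D_conf = rng.binomial(1, 1 / (1 + np.exp(-2 * X)))
# Random assignment: independent of the potential outcomes.
D_rand = rng.binomial(1, 0.5, size=n)

att = (Y1 - Y0)[D_conf == 1].mean()                          # ATT (equals 1 by construction)
sel_bias = Y0[D_conf == 1].mean() - Y0[D_conf == 0].mean()   # selection bias term in (1)

print("confounded: naive =", round(naive_diff(D_conf), 3),
      "| ATT =", round(att, 3), "| selection bias =", round(sel_bias, 3))
print("randomized: naive =", round(naive_diff(D_rand), 3), "| ATT = 1 (no bias)")
```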
Observational data
Randomization is ideal but often infeasible (costs, ethics, logistics) and may limit external validity. We can still identify causal effects using observational data with appropriate designs and assumptions.
Selection-on-observables
Selection-on-observables research designs are used when randomization is infeasible, but we are willing to assume that all relevant confounders are observed.
This approach is based on the following assumptions (see, e.g., Imbens & Rubin (2015)):
Assumption 1 (Unconfoundedness). \(Y_i(1), Y_i(0) \perp D_i | \boldsymbol{X}_i\).
→ \(\boldsymbol{X}_i\) fully accounts for selection into treatment; we observe all confounders.
Assumption 2 (Positivity). \(0 < \pi(\boldsymbol{X}_i) < 1\), where \(\pi(\boldsymbol{x}) := \mathbb{P}(D_i = 1 | \boldsymbol{X}_i = \boldsymbol{x})\) is the propensity score.
→ “Valid comparisons” exist.
Together, Assumptions 1\(\text{-}\)2 allow for the identification of \(\tau\) and its conditional versions.
→ Meaningful comparisons between treated and control units with similar observed characteristics.
Even naive conditional comparisons won’t cut it
Without assuming that \(\boldsymbol{X}_i\) fully accounts for selection into treatment, contrasts between treated and control units with similar observed characteristics remain biased:
\[ \begin{aligned} \mu(1, \boldsymbol{x}) - \mu(0, \boldsymbol{x}) & = \mathbb{E} [ Y_i(1) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}] - \mathbb{E} [ Y_i(0) | D_i = 0, \boldsymbol{X}_i = \boldsymbol{x}] \\ &= \mathbb{E} [ Y_i(1) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}] - \mathbb{E} [ Y_i(0) | D_i = 0, \boldsymbol{X}_i = \boldsymbol{x}] \\ & + \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}]} - \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}]} \\ &= \underbrace{\mathbb{E} [ Y_i(1) - Y_i (0)| D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}]}_{\text{CATT}(\boldsymbol{x})} + \underbrace{\mathbb{E} [Y_i (0) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}] - \mathbb{E} [ Y_i(0) | D_i = 0, \boldsymbol{X}_i = \boldsymbol{x}]}_{\text{Selection bias}(\boldsymbol{x})}. \end{aligned} \tag{2}\]
Exercise: Selection-on-observables kills conditional selection bias (\(\approx 5\) minutes)
Show that, under Assumptions 1\(\text{-}\)2,
\[ \mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i) = \tau ( \boldsymbol{X}_i ). \]
Solution. Under Assumption 1,
\[ \mathbb{E} [Y_i (d) | D_i = d, \boldsymbol{X}_i = \boldsymbol{x}] = \mathbb{E} [Y_i (d) | D_i = 1 - d, \boldsymbol{X}_i = \boldsymbol{x}] = \mathbb{E} [Y_i (d) | \boldsymbol{X}_i = \boldsymbol{x}], \quad \text{for } d = 0, 1. \] Therefore, it’s simply a matter of “dropping the conditioning” from (2).
→ Assumption 2 is needed to ensure that all the conditional expectations are well-defined for all values within the support of \(\boldsymbol{X}_i\).
As a corollary, Assumptions 1\(\text{-}\)2 ensure that \(\text{CATT} = \text{CATNT} = \text{CATE}\).
Exercise: Understanding DAGs (\(\approx 5\) minutes)
Discuss and answer the following questions.
Solution.
→ We call \(Z\) an instrument: it affects the treatment but has no direct effect on the outcome.
→ We call \(C\) a collider: it is a common effect of two other variables, and conditioning on it opens a non-causal path between them.
Exercise: Selection-on-observables trade-offs (\(\approx 5\) minutes)
Can you think of any trade-off between Assumptions 1 and 2? Think about what happens if \(\text{dim}(\boldsymbol{X}_i) \uparrow\).
Solution. As \(\text{dim}(\boldsymbol{X}_i) \uparrow\), we (may) strengthen the credibility of unconfoundedness; however, we simultaneously make positivity less plausible.
→ Curse of dimensionality: the covariate space becomes sparse, and treated and control units may no longer overlap well.
ATE identification under selection-on-observables
Under Assumptions 1\(\text{-}\)2, the ATE can be written in three equivalent ways:
\[ \tau = \mathbb{E}[\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i)]. \] \[ \tau = \mathbb{E} \left[\frac{D_i Y_i}{\pi ( \boldsymbol{X}_i)} - \frac{(1 - D_i) Y_i}{1 - \pi ( \boldsymbol{X}_i)} \right]. \] \[ \tau = \mathbb{E} \left[\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i) + \frac{D_i (Y_i - \mu(1, \boldsymbol{X}_i))}{\pi ( \boldsymbol{X}_i)} - \frac{(1 - D_i) (Y_i - \mu(0, \boldsymbol{X}_i))}{1 - \pi ( \boldsymbol{X}_i)} \right]. \]
These equalities point to different estimation strategies, all relying on learning unknown nuisance functions.
→ We might need a high-dimensional \(\boldsymbol{X}_i\) to satisfy Assumption 1. Machine learning is thus attractive.
Cross-fitting
Naive plug-in ML estimation of the nuisance functions can introduce overfitting bias. Cross-fitting avoids this: the sample is split into \(K\) folds, the nuisances for observations in fold \(k\) are estimated using only the other \(K - 1\) folds, and the resulting out-of-fold predictions \(\hat\mu_{-k(i)}\) and \(\hat\pi_{-k(i)}\) are plugged into the estimator.
Exercise: Suitable estimators (\(\approx 3\) minutes)
Discuss and list suitable estimators for \(\mu(\cdot, \cdot)\) and \(\pi(\cdot)\).
Solution. Notice that \(\pi(\boldsymbol{X}_i) = \mathbb{P} (D_i = 1 | \boldsymbol{X}_i)\): the propensity score equals the “probability of success” of \(D_i\).
→ We already learnt how LPM/Logit/Probit can help us here.
→ ML provides flexible alternatives (e.g., classification trees/forests/boosting, penalized regressions).
The estimation of \(\mu(\cdot, \cdot)\) can be carried out differently according to the outcome’s nature.
→ (Penalized) linear models or regression trees/forests/boosting if \(Y_i\) is continuous.
→ Classification algorithms if \(Y_i\) is binary.
→ Ordered Logit/Probit or Ordered Correlation Forest (Di Francesco (2025)) if \(Y_i\) is ordered.
Exercise: Cross-fitting trade-offs (\(\approx 3\) minutes)
Discuss trade-offs when changing the number of folds \(K\).
Solution. As \(K\) increases, we expect more stable nuisance fits, as the training size increases. Carrying the argument to its extreme, one could implement a “leave-one-out” approach by setting \(K = n\).
However, computational time increases in \(K\), as nuisances must be fit \(K\) times; this could lead to infeasibility if \(n\) is large and we use “heavy” ML methodologies.
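Before turning to the specific estimators, here is a minimal sketch of cross-fitting in Python; the toy data-generating process and the random-forest nuisance learners are illustrative assumptions, not the choices used later in the slides. For each fold, the nuisances are fit on the remaining \(K - 1\) folds and evaluated on the held-out fold, producing the out-of-fold predictions \(\hat\mu_{-k(i)}(d, \boldsymbol{X}_i)\) and \(\hat\pi_{-k(i)}(\boldsymbol{X}_i)\) reused in the estimator sketches below.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p, K = 2000, 5, 5

# Toy DGP (illustrative): X1 confounds treatment and outcome, true ATE = 1.
X = rng.normal(size=(n, p))
D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X[:, 0] + D * 1.0 + rng.normal(size=n)

# Out-of-fold (cross-fitted) nuisance predictions.
mu1_hat = np.zeros(n)   # \hat\mu_{-k(i)}(1, X_i)
mu0_hat = np.zeros(n)   # \hat\mu_{-k(i)}(0, X_i)
pi_hat = np.zeros(n)    # \hat\pi_{-k(i)}(X_i)

for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    treated, control = train[D[train] == 1], train[D[train] == 0]
    m1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[treated], Y[treated])
    m0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[control], Y[control])
    ps = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[train], D[train])
    mu1_hat[test] = m1.predict(X[test])
    mu0_hat[test] = m0.predict(X[test])
    pi_hat[test] = ps.predict_proba(X[test])[:, 1]

# Practical safeguard (not part of the formal assumptions): trim extreme
# propensity estimates so the IPW-type terms stay well behaved.
pi_hat = np.clip(pi_hat, 0.01, 0.99)
```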
Outcome regression identification of ATE
Under selection-on-observables, we can write
\[ \tau = \mathbb{E}[\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i)]. \tag{3}\]
Equation (3) motivates the plug-in estimator
\[ \hat{\tau}^{OR} = \frac{1}{n} \sum_{i = 1}^n \left[ \hat{\mu}_{-k(i)}(1, \boldsymbol{X}_i) - \hat{\mu}_{-k(i)}(0, \boldsymbol{X}_i) \right]. \tag{4}\]
Asymptotic Normality of OR Estimator
Suppose \(\mathrm{RMSE}(\hat{\mu}_{-k}) = o_p(n^{-1/2})\) for \(k = 1, \dots, K\). Then, \[ \sqrt{n} ( \hat{\tau}^{OR} - \tau) \xrightarrow{d} \mathcal{N}\left(0, \mathbb{V}\left(\mu(1, \boldsymbol{X}) - \mu(0, \boldsymbol{X}) \right) \right). \] Furthermore, \[ \widehat{\mathbb{V}}_n^{OR} := \frac{1}{n} \sum_{i = 1}^n \left( \hat{\mu}_{-k(i)} (1, \boldsymbol{X}_i) - \hat{\mu}_{-k(i)} (0, \boldsymbol{X}_i) - \hat{\tau}^{OR} \right)^2 \xrightarrow{p} \mathbb{V}\left(\mu(1, \boldsymbol{X}) - \mu(0, \boldsymbol{X})\right). \]
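Given the out-of-fold predictions from the cross-fitting sketch above (this snippet assumes the arrays mu1_hat, mu0_hat, and the sample size n from that block are in scope), the OR estimator (4) and the variance estimator \(\widehat{\mathbb{V}}_n^{OR}\) amount to a few lines:

```python
import numpy as np

or_scores = mu1_hat - mu0_hat          # \hat\mu_{-k(i)}(1, X_i) - \hat\mu_{-k(i)}(0, X_i)
tau_or = or_scores.mean()              # Eq. (4)
se_or = np.sqrt(np.mean((or_scores - tau_or) ** 2) / n)    # sqrt(V_hat / n)

print(f"OR estimate: {tau_or:.3f}  "
      f"(95% CI: {tau_or - 1.96 * se_or:.3f}, {tau_or + 1.96 * se_or:.3f})")
```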
IPW identification of ATE
Under selection-on-observables, we can write
\[ \tau = \mathbb{E} \left[ \frac{D_i Y_i}{\pi(\boldsymbol{X}_i)} - \frac{(1 - D_i) Y_i}{1 - \pi(\boldsymbol{X}_i)}\right]. \tag{5}\]
Equation (5) motivates the plug-in estimator
\[ \hat{\tau}^{IPW} = \frac{1}{n}\sum_{i=1}^n \left[\frac{D_i Y_i}{\hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} - \frac{(1 - D_i) Y_i}{1 - \hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} \right]. \tag{6}\]
Asymptotic Normality of IPW Estimator
Suppose \(0 < \hat\pi(\boldsymbol{X}_i) < 1\) and \(\mathrm{RMSE} (\hat{\pi}_{-k}) = o_p(n^{-1/2})\) for \(k = 1, \dots, K\). Then \[ \sqrt{n} (\hat{\tau}^{IPW}- \tau) \xrightarrow{d} \mathcal{N} \left(0, \mathbb{V} \left( \frac{D Y}{\pi(\boldsymbol{X})} - \frac{(1-D) Y}{1-\pi(\boldsymbol{X})}\right) \right). \]
Furthermore, \[ \widehat{\mathbb{V}}^{IPW}_n := \frac{1}{n}\sum_{i=1}^n \left( \frac{D_i Y_i}{\hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} - \frac{(1-D_i) Y_i}{1-\hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} - \hat{\tau}^{IPW} \right)^2 \xrightarrow{p} \mathbb{V}\left( \frac{D Y}{\pi(\boldsymbol{X})} - \frac{(1-D) Y}{1-\pi(\boldsymbol{X})} \right). \]
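Similarly, reusing D, Y, pi_hat, and n from the cross-fitting sketch, the IPW estimator (6) and its variance estimator:

```python
import numpy as np

ipw_scores = D * Y / pi_hat - (1 - D) * Y / (1 - pi_hat)   # summands in Eq. (6)
tau_ipw = ipw_scores.mean()
se_ipw = np.sqrt(np.mean((ipw_scores - tau_ipw) ** 2) / n)

print(f"IPW estimate: {tau_ipw:.3f}  (std. error: {se_ipw:.3f})")
```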
AIPW identification of ATE
Define \[ \psi \left( \boldsymbol{X}_i \right) := \underbrace{\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i)}_{\text{OR}} + \underbrace{\frac{D_i \left( Y_i - \mu(1, \boldsymbol{X}_i) \right)}{\pi(\boldsymbol{X}_i)} - \frac{(1 - D_i) \left( Y_i - \mu(0, \boldsymbol{X}_i) \right)}{1 - \pi(\boldsymbol{X}_i)}}_{\text{IPW on residuals}}. \tag{7}\]
Under selection-on-observables, we can write
\[ \tau = \mathbb{E} \left[ \psi \left( \boldsymbol{X}_i \right) \right]. \tag{8}\]
Equation (8) motivates the plug-in estimator
\[ \hat{\tau}^{AIPW} = \frac{1}{n}\sum_{i=1}^n \hat\psi_{-k(i)} (\boldsymbol{X}_i). \tag{9}\]
Asymptotic Normality of AIPW Estimator
Suppose \(0 < \hat\pi(\boldsymbol{X}_i) < 1\) and \(\mathrm{RMSE}(\hat{\mu}_{-k}) \cdot \mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/2})\) for \(k = 1, \dots, K\). Then \[ \sqrt{n} \left( \hat{\tau}^{AIPW} - \tau \right) \xrightarrow{d} \mathcal{N} \left( 0, \mathbb{V} \left( \psi \left( \boldsymbol{X} \right) \right) \right). \] Furthermore, \[ \widehat{\mathbb{V}}^{AIPW}_n := \frac{1}{n} \sum_{i = 1}^n \left( \hat \psi_{-k(i)} \left( \boldsymbol{X}_i \right) - \hat \tau^{AIPW} \right)^2 \xrightarrow{p} \mathbb{V} \left( \psi \left( \boldsymbol{X} \right) \right). \]
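Again reusing the cross-fitted nuisances from the sketch above, the AIPW score (7), the estimator (9), and the variance estimator:

```python
import numpy as np

psi_hat = (mu1_hat - mu0_hat                               # OR part of Eq. (7)
           + D * (Y - mu1_hat) / pi_hat                    # IPW on residuals, treated
           - (1 - D) * (Y - mu0_hat) / (1 - pi_hat))       # IPW on residuals, controls
tau_aipw = psi_hat.mean()                                  # Eq. (9)
se_aipw = np.sqrt(np.mean((psi_hat - tau_aipw) ** 2) / n)

print(f"AIPW estimate: {tau_aipw:.3f}  (std. error: {se_aipw:.3f})")
```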
Decomposition of OR, IPW, and AIPW estimators
Let \(W_i := (Y_i, D_i, \boldsymbol{X}_i)\) and let \(\eta := (\mu, \pi)\) collect the nuisance functions. The ATE estimators can be written as \[ \hat{\tau} = \frac{1}{n} \sum_{i = 1}^n m (W_i;\, \hat{\eta}_{-k(i)}), \tag{10}\]
where \(m(\cdot; \eta)\) is the estimator’s score.
Adding and subtracting the oracle score \(m(W_i; \eta)\) and rearranging, \[ \hat{\tau} = \underbrace{\frac{1}{n} \sum_{i = 1}^n m \left(W_i; \eta \right)}_{\text{oracle term}} + \underbrace{\frac{1}{n} \sum_{i = 1}^n \left[ m ( W_i; \hat{\eta}_{-k(i)}) - m \left(W_i; \eta\right) \right]}_{\text{plug-in bias}}. \tag{11}\]
Comparing requirements on nuisance estimation
To obtain asymptotically well-behaved estimates, OR, IPW, and AIPW need nuisances to be estimated accurately enough to control the plug-in bias.
A sufficient condition for AIPW is \(\mathrm{RMSE}(\hat{\mu}_{-k}) = o_p(n^{-1/4})\) and \(\mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/4})\) for \(k = 1, \dots, K\).
→ AIPW controls the plug-in bias under weaker requirements on nuisance estimation accuracy.
Data-generating process
We run a simple Monte Carlo to illustrate the benefits of cross-fitting for AIPW estimation of the ATE.
We construct \(\hat\tau^{AIPW}\) using regression/classification trees for nuisance estimation.
Exercise: Understanding the DGP (\(\approx 5\) minutes)
Discuss and answer the following questions.
Solution.
→ \(X_{i1}\) enters the models of both \(Y_i\) and \(D_i\).
→ Controlling for \(X_{i1}\) is enough for identification.
→ Effects are larger when \(X_{i1} > 0\).
Effect heterogeneity matters
The ATE quantifies the average impact of the policy on the reference population.
→ Straightforward to interpret and communicate.
However, the ATE lacks information regarding effect heterogeneity.
→ Who benefits most/least? Any harmed groups? Are effects monotone?
To study heterogeneity, we can target
Exercise: How to tackle effect heterogeneity (\(\approx 3\) minutes)
Discuss and answer the following questions.
Solution.
GATE analysis
A common approach to tackle effect heterogeneity is to report the ATEs across different subgroups defined by observable covariates:
\[ \tau_g := \mathbb{E} [Y_i(1) - Y_i(0) | G_i = g], \] with \(G_i = 1, \dots, G\) a discrete “group” indicator built from \(\boldsymbol{X}_i\).
→ GATEs with continuous \(G_i\) are more challenging to handle (Zimmert & Lechner (2019); Fan et al. (2022)).
If groups are “exogenous,” i.e., pre-specified, GATEs can be estimated by applying ATE estimators separately for each group:
\[ \hat{\tau}_{\color{#c1121f}{g}} = \frac{1}{n_{\color{#c1121f}{g}}} \sum_{i = 1}^n \color{#c1121f}{\mathbf{1} \{G_i = g\}} \, m (W_i;\, \hat{\eta}_{-k(i)}), \quad g = 1, \dots, G. \]
This approach is simple to implement, but can be inefficient with many small groups.
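A minimal sketch of the group-wise approach, reusing the AIPW scores psi_hat and the covariates X from the earlier sketches; splitting on the sign of the first covariate is a purely illustrative choice of pre-specified groups:

```python
import numpy as np

# Illustrative pre-specified groups: g = 1 if X1 <= 0, g = 2 if X1 > 0.
G = np.where(X[:, 0] > 0, 2, 1)

for g in (1, 2):
    scores_g = psi_hat[G == g]
    tau_g = scores_g.mean()
    se_g = scores_g.std(ddof=1) / np.sqrt(len(scores_g))
    print(f"group {g}: GATE = {tau_g:.3f}  (std. error: {se_g:.3f}, n_g = {len(scores_g)})")
```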
A more efficient alternative is to project the CATEs on “group dummies.”
→ \(\tau_g = \mathbb{E} [\tau(\boldsymbol{X}_i) | G_i = g]\) motivates modelling the best linear predictor of \(\tau(\cdot)\) given \(G_i\).
However, \(\tau(\cdot)\) is unobserved. Fortunately, we can proxy it with \(\psi(\cdot)\) (Semenova & Chernozhukov (2021)).
→ \(\mathbb{E} [ \psi(\boldsymbol{X}_i) | \boldsymbol{X}_i] = \tau(\boldsymbol{X}_i)\).
\[ \hat\psi_{-k(i)}(\boldsymbol{X}_i) = \sum_{g = 1}^G \mathbf{1} \{G_i = g\} \beta_g + \epsilon_i. \tag{12}\]
Under selection-on-observables, \(\beta_g = \tau_g\). With cross-fitting, Semenova & Chernozhukov ((2021)) show that the OLS estimator \(\hat\beta_g\) is root-\(n\) consistent and asymptotically normal, provided \(\mathrm{RMSE}(\hat{\mu}_{-k}) \cdot \mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/2})\).
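The projection (12) can be run as an OLS regression of the cross-fitted scores on the group dummies with heteroskedasticity-robust standard errors. Below is a sketch continuing from the previous blocks; the use of statsmodels is an assumption about the toolkit, not something prescribed in the slides.

```python
import numpy as np
import statsmodels.api as sm

# Group dummies (no intercept), so each coefficient is the GATE of its group.
dummies = np.column_stack([(G == g).astype(float) for g in (1, 2)])
blp = sm.OLS(psi_hat, dummies).fit(cov_type="HC1")    # Eq. (12) by OLS, robust SEs

print("GATE estimates: ", np.round(blp.params, 3))
print("standard errors:", np.round(blp.bse, 3))
```

With exhaustive, mutually exclusive dummies and no intercept, the OLS coefficients coincide with the group-wise averages above; the regression formulation is convenient for joint inference and for testing equality of GATEs across groups.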
Data-driven groups
So far, we assumed groups were pre-specified. Pre-specifying groups is simple but risks missing unexpected heterogeneity.
Researchers often want to “let the data speak” and discover data-driven heterogeneous subgroups.
→ Tree-based approaches are popular because of their piece-wise constant structure.
However, data-driven groups are endogenous, and naive reuse of the same sample for grouping and estimation biases results.
A remedy is to combine tree-based approaches with honesty (Athey & Imbens (2016)).
Causal trees (Athey & Imbens (2016))
Grow the tree by adapting the splitting criterion to target effect heterogeneity, and estimate the GATEs by
\[ Y_i = \sum_{l = 1}^L L_{il} \gamma_l + \sum_{l = 1}^L L_{il} D_i \beta_l + \epsilon_i. \]
Aggregation trees (Di Francesco (2024))
Grow the tree by applying standard CART to the estimated CATEs, and estimate the GATEs by \[ \hat\psi_{-k(i)}(\boldsymbol{X}_i) = \sum_{l = 1}^L L_{il} \beta_l + \epsilon_i. \]
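Here is a rough two-step sketch in the spirit of aggregation trees, reusing objects from the earlier blocks: a standard CART is fit to preliminary CATE estimates (here simply the cross-fitted contrast \(\hat\mu_{-k(i)}(1, \boldsymbol{X}_i) - \hat\mu_{-k(i)}(0, \boldsymbol{X}_i)\), an illustrative choice), and GATEs are then obtained by averaging the AIPW scores within each leaf. This only illustrates the logic; the actual method also involves honest sample splitting and a full sequence of nested groupings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Step 1: standard CART on preliminary CATE estimates (here, the cross-fitted
# outcome-regression contrast; any CATE estimator could be used instead).
cate_prelim = mu1_hat - mu0_hat
tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=100, random_state=0)
tree.fit(X, cate_prelim)

# Step 2: GATEs by averaging the AIPW scores within each leaf.
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    scores_l = psi_hat[leaves == leaf]
    se_l = scores_l.std(ddof=1) / np.sqrt(len(scores_l))
    print(f"leaf {leaf}: GATE = {scores_l.mean():.3f}  "
          f"(std. error: {se_l:.3f}, n = {len(scores_l)})")
```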
Exercise: Honesty trade-offs (\(\approx 3\) minutes)
Discuss and answer the following questions.
Solution.
Q1. Yes. Sample splitting uses fewer observations for each task (splitting vs. effect estimation), which increases variance and typically reduces predictive accuracy in finite samples.
Q2.
CATE estimation
To “fully” tackle effect heterogeneity, we can focus on the CATEs \[ \tau (\boldsymbol{X}_i) := \mathbb{E} [ Y_i(1) - Y_i(0) | \boldsymbol{X}_i]. \] CATEs provide information at the finest level of granularity achievable with the covariates we observe.
→ They let us relate effect heterogeneity directly to observable covariates.
Moving from ATE/GATEs to CATEs means moving from the estimation of a low-dimensional parameter to the estimation of a high-dimensional function.
→ This is essentially a prediction problem, but the algorithm must output causal effects rather than outcome predictions.
Two broad strategies:
Decomposing the CATE problem
Meta-learners (Künzel et al. (2019)) exploit the fact that the CATE problem can be recast as a sequence of standard prediction tasks, which can then be solved using any supervised learning algorithm.
S-learner
Fit a single model \(\hat\mu(d, \boldsymbol{x})\) that treats the treatment indicator as just another feature, then predict \(\hat\tau(\boldsymbol{x}) = \hat\mu(1, \boldsymbol{x}) - \hat\mu(0, \boldsymbol{x})\).
T-learner
Fit two separate models, \(\hat\mu_1(\boldsymbol{x})\) on the treated units and \(\hat\mu_0(\boldsymbol{x})\) on the control units, then predict \(\hat\tau(\boldsymbol{x}) = \hat\mu_1(\boldsymbol{x}) - \hat\mu_0(\boldsymbol{x})\).
X-learner
Impute individual effects by contrasting each unit's observed outcome with the cross-arm prediction (\(Y_i - \hat\mu_0(\boldsymbol{X}_i)\) for treated units, \(\hat\mu_1(\boldsymbol{X}_i) - Y_i\) for control units), fit a CATE model within each arm on these imputed effects, and combine the two with a weighting function \(g(\boldsymbol{x})\).
→ A common choice is \(g(\boldsymbol{x}) = \hat\pi(\boldsymbol{x})\).
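A minimal sketch of the three meta-learners, reusing X, D, Y, pi_hat, and n from the cross-fitting block and random forests as base learners (an illustrative choice); for simplicity the fits are not cross-fitted here, although in practice they typically would be.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf():
    return RandomForestRegressor(n_estimators=200, random_state=0)

# S-learner: one model, with the treatment appended as an extra feature.
s = rf().fit(np.column_stack([X, D]), Y)
tau_s = (s.predict(np.column_stack([X, np.ones(n)]))
         - s.predict(np.column_stack([X, np.zeros(n)])))

# T-learner: one model per treatment arm.
m1 = rf().fit(X[D == 1], Y[D == 1])
m0 = rf().fit(X[D == 0], Y[D == 0])
tau_t = m1.predict(X) - m0.predict(X)

# X-learner: impute individual effects with cross-arm predictions, model them
# within each arm, and blend the two CATE models with g(x) = pi_hat.
d1 = Y[D == 1] - m0.predict(X[D == 1])     # imputed effects, treated arm
d0 = m1.predict(X[D == 0]) - Y[D == 0]     # imputed effects, control arm
tau1 = rf().fit(X[D == 1], d1)
tau0 = rf().fit(X[D == 0], d0)
tau_x = pi_hat * tau0.predict(X) + (1 - pi_hat) * tau1.predict(X)

print({name: round(float(t.mean()), 3) for name, t in
       [("S", tau_s), ("T", tau_t), ("X", tau_x)]})
```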
Exercise: Meta-learner trade-offs (\(\approx 5\) minutes)
Discuss the pros and cons of the S-, T-, and X-learner.
Solution.
Data-generating process
We run a simple Monte Carlo to compare S-, T-, and X-learners.
We use regression forests as base learners, setting \(p = 5\) and \(n = 800\).
Exercise: Reading the meta-learner simulation (\(\approx 5\) minutes)
Discuss and answer the following questions.
Solution.
→ The S-learner artificially shrinks effects towards zero; it’s “lucky” when the treatment has no effect.
→ The T-learner fits each model on only one arm; in regions where that arm is rare, the model must extrapolate heavily, leading to noisy and unstable predictions.
High variance of CATE estimates \(\neq\) heterogeneity
Looking at the distribution of the estimated CATEs is not an effective strategy for assessing effect heterogeneity.
→ High variation in predictions due to estimation noise does not necessarily imply heterogeneous effects.
Emerging literature on CATE validation
An emerging literature on CATE validation develops methods to quantify how much of the heterogeneity captured by a given CATE estimator reflects genuine signal rather than estimation noise.
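One simple diagnostic in the spirit of this literature, sketched here under the assumptions of the earlier blocks, is a best-linear-predictor calibration check: regress the cross-fitted AIPW scores on the demeaned CATE predictions. A heterogeneity coefficient close to 1 suggests the predictions track genuine effect heterogeneity, while a coefficient near 0 indicates mostly noise. This is only an illustration of one such diagnostic, not a full account of the validation literature.

```python
import numpy as np
import statsmodels.api as sm

# Regress the AIPW scores on an intercept and the demeaned CATE predictions
# (here, the X-learner predictions from the sketch above). The intercept
# estimates the ATE; the slope is a calibration-style heterogeneity check.
cate_pred = tau_x
design = np.column_stack([np.ones(n), cate_pred - cate_pred.mean()])
fit = sm.OLS(psi_hat, design).fit(cov_type="HC1")

print(f"mean effect: {fit.params[0]:.3f}, "
      f"heterogeneity coefficient: {fit.params[1]:.3f}")
# A coefficient near 1 points to well-calibrated heterogeneity; near 0, to noise.
# In practice, the CATE predictions should themselves be computed out-of-fold.
```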