Causal machine learning

Riccardo Di Francesco

University of Southern Denmark

Introduction

Correlation \(\neq\) causation

We often see variables move together. But this correlation does not necessarily imply that one variable causes the other.


A third factor (called a confounder) can increase/decrease both, creating a non-causal association.

Identification vs. estimation

Identification and estimation

Quantifying causal effects requires disentangling

  • Identification, the question of what can be learned about causal effects, from
  • Estimation, which concerns how these effects can be learned efficiently from data.

Often, units can strategically choose whether to “take the treatment.” This behavior is called self-selection and might lead to selection bias.
    → Identification kills such bias by leveraging different data structures and assumptions, thus ensuring valid apples-to-apples comparisons.

Once identification is deemed credible, we move to estimation. This is where ML kicks in.
    → Most estimation problems reduce to prediction tasks, which we want to solve flexibly.

Identification strategies

Several identification strategies exist:

  • Randomized controlled trials
  • Selection-on-observables
  • Instrumental variables
  • Regression discontinuity
  • Difference-in-differences

Potential outcomes and estimands

Potential outcomes model

Causality is often formalized through the potential outcomes framework (Neyman (1923); Rubin (1974)).

  • \(Y_i\): observed outcome.
  • \(D_i \in \{0,1\}\): binary treatment indicator.
  • \(\boldsymbol{X}_i\): pre-treatment covariates.
  • \(\mu (d, \boldsymbol{x}) := \mathbb{E} [ Y_i | D_i = d, \boldsymbol{X}_i = \boldsymbol{x}]\): conditional mean outcome.
  • \(\pi (\boldsymbol{x}) := \mathbb{P} [ D_i = 1 | \boldsymbol{X}_i = \boldsymbol{x}]\): propensity score.
  • \(Y_i(0), Y_i(1)\): potential outcomes without and with treatment.
  • \(Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)\): observed outcome.

Under this model, the individual causal effect is \(Y_i(1) - Y_i(0)\).

→ Never observable, as the counterfactual is missing.

Therefore, we focus on estimands that are identifiable under (somewhat) reasonable assumptions.

  • \(\text{ATE} → \tau := \mathbb{E}[Y_i(1) - Y_i(0)]\).
  • \(\text{GATE} → \tau_\color{#06b6d4}{g} := \mathbb{E}[Y_i(1) - Y_i(0) | \color{#06b6d4}{G_i = g}]\).
  • \(\text{CATE} → \tau(\color{#f59e0b}{\boldsymbol{x}}) := \mathbb{E}[Y_i(1) - Y_i(0) | \color{#f59e0b}{\boldsymbol{X}_i = \boldsymbol{x}}]\).

Two important special cases of \(\tau_g\):

  • \(\text{ATT} := \mathbb{E}[Y_i(1) - Y_i(0) | \color{#c1121f}{D_i = 1}]\).
  • \(\text{ATNT} := \mathbb{E}[Y_i(1) - Y_i(0) | \color{#c1121f}{D_i = 0}]\).

Mean difference decomposition

Naive comparisons won’t cut it

One might be tempted to simply compare mean outcomes for treated (\(D_i = 1\)) and control (\(D_i = 0\)) units. However,

\[ \begin{aligned} \mathbb{E} [ Y_i | D_i = 1] - \mathbb{E} [ Y_i | D_i = 0] & = \mathbb{E} [ Y_i(1) | D_i = 1] - \mathbb{E} [ Y_i(0) | D_i = 0] \\ &= \mathbb{E} [ Y_i(1) | D_i = 1] + \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1]} - \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1]} - \mathbb{E} [ Y_i(0) | D_i = 0] \\ &= \underbrace{\mathbb{E} [ Y_i(1) - Y_i (0)| D_i = 1]}_{\text{ATT}} + \underbrace{\mathbb{E} [Y_i (0) | D_i = 1] - \mathbb{E} [ Y_i(0) | D_i = 0]}_{\text{Selection bias}}. \end{aligned} \tag{1}\]

Exercise: Randomization kills selection bias (\(\approx 5\) minutes)

Show that, if treatment is randomly assigned, that is, if

\[ Y_i(1), Y_i(0) \perp D_i, \] then \(\mathbb{E} [ Y_i | D_i = 1] - \mathbb{E} [ Y_i | D_i = 0] = \text{ATT}\).


Solution. Under randomization, \(\mathbb{E} [Y_i (d) | D_i = d] = \mathbb{E} [Y_i (d) | D_i = 1 - d] = \mathbb{E} [Y_i (d)]\) for \(d = 0, 1\). Therefore, it’s simply a matter of “dropping the conditioning” from (1).

As a corollary, randomization ensures that \(\text{ATT} = \text{ATNT} = \text{ATE}\).
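The decomposition in (1) can also be checked numerically. Below is a minimal pure-Python sketch under an illustrative DGP (all numbers are assumptions): the confounder \(X\) raises both treatment take-up and the baseline outcome, while the true individual effect is \(1\) for everyone.

```python
import random

def naive_diff(randomize, n=100_000, seed=0):
    """Naive treated-control mean difference under a DGP where the true
    individual effect is 1 and X drives both take-up and the baseline."""
    rng = random.Random(seed)
    yt, yc = [], []
    for _ in range(n):
        x = rng.random()                       # confounder in [0, 1]
        p = 0.5 if randomize else x            # P(D = 1 | X)
        d = rng.random() < p
        y0 = 2 * x + rng.gauss(0, 1)           # Y(0) increases in X
        y = y0 + 1 if d else y0                # Y(1) = Y(0) + 1
        (yt if d else yc).append(y)
    return sum(yt) / len(yt) - sum(yc) / len(yc)

print(naive_diff(randomize=False))   # ATT (= 1) plus selection bias (= 2/3 here)
print(naive_diff(randomize=True))    # randomization: recovers the effect of 1
```

In this DGP, \(\mathbb{E}[Y_i(0) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0] = 2(2/3 - 1/3) = 2/3\), so the naive contrast overshoots the ATT by exactly the selection bias; randomization drives that term to zero.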

Selection-on-observables

Observational data

Randomization is ideal but often infeasible (costs, ethics, logistics) and may limit external validity. We can still identify causal effects using observational data with appropriate designs and assumptions.

Selection-on-observables

Selection-on-observables research designs are used when

  • units are observed at a specific point in time,
  • some receive treatment, and
  • data on both pre-treatment characteristics and post-treatment outcomes are available.

This approach is based on the following assumptions (see, e.g., Imbens & Rubin (2015)):

Assumption 1 (Unconfoundedness). \(Y_i(1), Y_i(0) \perp D_i | \boldsymbol{X}_i\).

\(\boldsymbol{X}_i\) fully accounts for selection into treatment: we observe all confounders.

Assumption 2 (Positivity). \(0 < \pi(\boldsymbol X_i) < 1\).

→ “Valid comparisons” exist.

Together, Assumptions 1-2 allow for the identification of \(\tau\) and its conditional versions.

→ Meaningful comparisons exist between treated and control units with similar observed characteristics.

Conditional mean difference decomposition

Even naive conditional comparisons won’t cut it

Without assuming that \(\boldsymbol{X}_i\) fully accounts for selection into treatment, contrasts between treated and control units with similar observed characteristics are still biased:

\[ \begin{aligned} \mu(1, \boldsymbol{x}) - \mu(0, \boldsymbol{x}) & = \mathbb{E} [ Y_i(1) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}] - \mathbb{E} [ Y_i(0) | D_i = 0, \boldsymbol{X}_i = \boldsymbol{x}] \\ &= \mathbb{E} [ Y_i(1) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}] - \mathbb{E} [ Y_i(0) | D_i = 0, \boldsymbol{X}_i = \boldsymbol{x}] \\ & + \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}]} - \color{#06b6d4}{\mathbb{E} [Y_i (0) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}]} \\ &= \underbrace{\mathbb{E} [ Y_i(1) - Y_i (0)| D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}]}_{\text{CATT}(\boldsymbol{x})} + \underbrace{\mathbb{E} [Y_i (0) | D_i = 1, \boldsymbol{X}_i = \boldsymbol{x}] - \mathbb{E} [ Y_i(0) | D_i = 0, \boldsymbol{X}_i = \boldsymbol{x}]}_{\text{Selection bias}(\boldsymbol{x})}. \end{aligned} \tag{2}\]

Exercise: Selection-on-observables kills conditional selection bias (\(\approx 5\) minutes)

Show that, under Assumptions 1-2,

\[ \mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i) = \tau ( \boldsymbol{X}_i ). \]


Solution. Under Assumption 1,

\[ \mathbb{E} [Y_i (d) | D_i = d, \boldsymbol{X}_i = \boldsymbol{x}] = \mathbb{E} [Y_i (d) | D_i = 1 - d, \boldsymbol{X}_i = \boldsymbol{x}] = \mathbb{E} [Y_i (d) | \boldsymbol{X}_i = \boldsymbol{x}], \quad \text{for } d = 0, 1. \] Therefore, it’s simply a matter of “dropping the conditioning” from (2).

→ Assumption 2 is needed to ensure that all the conditional expectations are well-defined for all values within the support of \(\boldsymbol{X}_i\).

As a corollary, Assumptions 1-2 ensure that \(\text{CATT}(\boldsymbol{x}) = \text{CATNT}(\boldsymbol{x}) = \text{CATE}(\boldsymbol{x})\).

Exercise

Exercise: Understanding DAGs (\(\approx 5\) minutes)

Discuss and answer the following questions.

  • Q1. In each DAG, is the naive difference in means an unbiased ATE estimator?
  • Q2. For each DAG, which variable(s) act as confounders?
  • Q3. In the top DAG, do we need to control for \(Z\)?
  • Q4. In the bottom DAG, should we control for \(C\)?

Solution.

  • The difference in means is biased in both DAGs, as \(X\) affects both \(D\) and \(Y\).
  • \(X\) is a confounder in both DAGs.
  • No; \(Z\) directly affects only \(D\) and impacts \(Y\) only through \(D\), so it is not a confounder.

We call \(Z\) an instrument.

  • No; doing so “opens the path” and creates a spurious association between \(D\) and \(Y\).

We call \(C\) a collider.

Exercise: Selection-on-observables trade-offs (\(\approx 5\) minutes)

Can you think of any trade-off between Assumptions 1 and 2? Think about what happens if \(\text{dim}(\boldsymbol{X}_i) \uparrow\).


Solution. As \(\text{dim}(\boldsymbol{X}_i) \uparrow\), we (may) strengthen the credibility of unconfoundedness; however, we simultaneously make positivity less plausible.

→ Curse of dimensionality: the covariate space becomes sparse, and treated and control units may no longer overlap well.
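This overlap problem can be visualized with a quick simulation. The pure-Python sketch below tracks the share of units with propensity scores outside \([0.05, 0.95]\) as \(\text{dim}(\boldsymbol{X}_i)\) grows; the per-covariate coefficient of \(0.4\) on the treatment logit is an arbitrary assumption for illustration.

```python
import math, random

def extreme_share(p, n=20_000, coef=0.4, seed=1):
    """Share of units with propensity outside [0.05, 0.95] when each of p
    independent standard-normal covariates shifts the treatment logit by
    `coef` (an arbitrary value chosen for illustration)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        index = sum(coef * rng.gauss(0, 1) for _ in range(p))
        pi = 1 / (1 + math.exp(-index))        # logistic propensity score
        hits += pi < 0.05 or pi > 0.95
    return hits / n

for p in (1, 5, 20, 50):
    print(p, round(extreme_share(p), 3))       # the share grows with p
```

As more covariates enter the treatment equation, the variance of the logit index grows, pushing propensity scores toward 0 and 1 and thinning out the region where Assumption 2 is plausible.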

Summary figure

Contour plots of the probability of Assumption 2 holding for different values of \(\text{dim}(\boldsymbol{X}_i)\) (\(x\)-axis) and \(\mathbb{E} [D_i]\) (\(y\)-axis), with \(n\) fixed at \(1000\).

Building blocks for ATE estimation

ATE identification under selection-on-observables

Under Assumptions 1-2, the ATE can be written in three equivalent ways:

\[ \tau = \mathbb{E}[\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i)]. \] \[ \tau = \mathbb{E} \left[\frac{D_i Y_i}{\pi ( \boldsymbol{X}_i)} - \frac{(1 - D_i) Y_i}{1 - \pi ( \boldsymbol{X}_i)} \right]. \] \[ \tau = \mathbb{E} \left[\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i) + \frac{D_i (Y_i - \mu(1, \boldsymbol{X}_i))}{\pi ( \boldsymbol{X}_i)} - \frac{(1 - D_i) (Y_i - \mu(0, \boldsymbol{X}_i))}{1 - \pi ( \boldsymbol{X}_i)} \right]. \]

These equalities point to different estimation strategies, all relying on learning unknown nuisance functions.

→ We might need a high-dimensional \(\boldsymbol{X}_i\) to satisfy Assumption 1. Machine learning is thus attractive.

Cross-fitting

Naive plug-in ML of nuisance functions can introduce overfitting bias.

  • Split data into \(I_1, \dots, I_K\) (\(\approx\) equal size).
  • Let \(I_k^c = \{1, \dots, n\} \setminus I_k\) and \(k(i)\) be \(i\)’s fold.
  • For each \(k\):
    1. Train \(\hat\mu_{-k}, \hat\pi_{-k}\) on \(I_k^c\);
    2. Predict on \(I_k\) using \(\hat\mu_{-k}, \hat\pi_{-k}\).
  • Stack all out-of-fold scores; average for the final estimator.
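The recipe above can be sketched as a generic helper. In this pure-Python sketch, `fit` and `predict` are hypothetical callables standing in for any ML learner:

```python
import random

def cross_fit(data, fit, predict, K=5, seed=0):
    """K-fold cross-fitting: each unit is scored by a model trained
    without that unit's fold.  `fit` maps a list of observations to a
    model; `predict` maps (model, observation) to a score."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]        # ~equal-sized folds I_1..I_K
    out = [None] * len(data)
    for k in range(K):
        holdout = set(folds[k])
        model = fit([data[i] for i in idx if i not in holdout])  # train on I_k^c
        for i in folds[k]:
            out[i] = predict(model, data[i])     # predict on I_k
    return out

# Toy usage: the "model" is just the training mean of the outcomes.
rng = random.Random(42)
ys = [rng.gauss(0, 1) for _ in range(100)]
scores = cross_fit(ys, fit=lambda train: sum(train) / len(train),
                   predict=lambda model, _: model)
print(sum(scores) / len(scores))
```

Averaging the stacked out-of-fold scores then yields the final estimator.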

Exercise

Exercise: Suitable estimators (\(\approx 3\) minutes)

Discuss and list suitable estimators for \(\mu(\cdot, \cdot)\) and \(\pi(\cdot)\).


Solution. Notice that \(\pi(\boldsymbol{X}_i) = \mathbb{P} (D_i = 1 | \boldsymbol{X}_i)\): the propensity score equals the “probability of success” of \(D_i\).

→ We already learnt how LPM/Logit/Probit can help us here.

→ ML provides flexible alternatives (e.g., classification trees/forests/boosting, penalized regressions).

The estimation of \(\mu(\cdot, \cdot)\) can be carried out differently according to the outcome’s nature.

→ (Penalized) linear models or regression trees/forests/boosting if \(Y_i\) is continuous.

→ Classification algorithms if \(Y_i\) is binary.

→ Ordered Logit/Probit or Ordered Correlation Forest (Di Francesco (2025)) if \(Y_i\) is ordered.

Exercise: Cross-fitting trade-offs (\(\approx 3\) minutes)

Discuss trade-offs when changing the number of folds \(K\).


Solution. As \(K\) increases, we expect more stable nuisance fits, as the training size increases. Carrying the argument to its extreme, one could implement a “leave-one-out” approach by setting \(K = n\).

However, computational time increases in \(K\), as nuisances must be fit \(K\) times; this could lead to infeasibility if \(n\) is large and we use “heavy” ML methodologies.

Summary figure

Outcome regression

Outcome regression identification of ATE

Under selection-on-observables, we can write

\[ \tau = \mathbb{E}[\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i)]. \tag{3}\]

Equation (3) motivates the plug-in estimator

\[ \hat{\tau}^{OR} = \frac{1}{n} \sum_{i = 1}^n \left[ \hat{\mu}_{-k(i)}(1, \boldsymbol{X}_i) - \hat{\mu}_{-k(i)}(0, \boldsymbol{X}_i) \right]. \tag{4}\]

Asymptotic Normality of OR Estimator

Suppose \(\mathrm{RMSE}(\hat{\mu}_{-k}) = o_p(n^{-1/2})\) for \(k = 1, \dots, K\). Then, \[ \sqrt{n} ( \hat{\tau}^{OR} - \tau) \xrightarrow{d} \mathcal{N}\left(0, \mathbb{V}\left(\mu(1, \boldsymbol{X}) - \mu(0, \boldsymbol{X}) \right) \right). \] Furthermore, \[ \widehat{\mathbb{V}}_n^{OR} := \frac{1}{n} \sum_{i = 1}^n \left( \hat{\mu}_{-k(i)} (1, \boldsymbol{X}_i) - \hat{\mu}_{-k(i)} (0, \boldsymbol{X}_i) - \hat{\tau}^{OR} \right)^2 \xrightarrow{p} \mathbb{V}\left(\mu(1, \boldsymbol{X}) - \mu(0, \boldsymbol{X})\right). \]
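A minimal pure-Python sketch of (4), under an illustrative toy DGP with a binary confounder so that \(\hat\mu\) can be estimated by cross-fitted cell means (all numbers below are assumptions for illustration):

```python
import random

rng = random.Random(3)
n, K = 4000, 2
data = []
for i in range(n):
    x = rng.randint(0, 1)                      # binary confounder
    d = int(rng.random() < 0.3 + 0.4 * x)      # pi(x) = 0.3 + 0.4 x
    y = x + d * (1 + x) + rng.gauss(0, 1)      # tau(0) = 1, tau(1) = 2, ATE = 1.5
    data.append((y, d, x))

def mu_hat(train):
    """Outcome nuisance: cell means of Y within each (d, x) cell."""
    cells = {}
    for y, d, x in train:
        cells.setdefault((d, x), []).append(y)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

scores = []
for k in range(K):
    mu = mu_hat([data[i] for i in range(n) if i % K != k])   # fit on I_k^c
    for i in range(k, n, K):                                 # predict on I_k
        x = data[i][2]
        scores.append(mu[(1, x)] - mu[(0, x)])

tau_or = sum(scores) / n
print(round(tau_or, 2))    # close to the true ATE of 1.5
```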

Inverse probability weighting

IPW identification of ATE

Under selection-on-observables, we can write

\[ \tau = \mathbb{E} \left[ \frac{D_i Y_i}{\pi(\boldsymbol{X}_i)} - \frac{(1 - D_i) Y_i}{1 - \pi(\boldsymbol{X}_i)}\right]. \tag{5}\]

Equation (5) motivates the plug-in estimator

\[ \hat{\tau}^{IPW} = \frac{1}{n}\sum_{i=1}^n \left[\frac{D_i Y_i}{\hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} - \frac{(1 - D_i) Y_i}{1 - \hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} \right]. \tag{6}\]

Asymptotic Normality of IPW Estimator

Suppose \(0 < \hat\pi(\boldsymbol{X}_i) < 1\) and \(\mathrm{RMSE} (\hat{\pi}_{-k}) = o_p(n^{-1/2})\) for \(k = 1, \dots, K\). Then \[ \sqrt{n} (\hat{\tau}^{IPW}- \tau) \xrightarrow{d} \mathcal{N} \left(0, \mathbb{V} \left( \frac{D Y}{\pi(\boldsymbol{X})} - \frac{(1-D) Y}{1-\pi(\boldsymbol{X})}\right) \right). \]

Furthermore, \[ \widehat{\mathbb{V}}^{IPW}_n := \frac{1}{n}\sum_{i=1}^n \left( \frac{D_i Y_i}{\hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} - \frac{(1-D_i) Y_i}{1-\hat{\pi}_{-k(i)}(\boldsymbol{X}_i)} - \hat{\tau}^{IPW} \right)^2 \xrightarrow{p} \mathbb{V}\left( \frac{D Y}{\pi(\boldsymbol{X})} - \frac{(1-D) Y}{1-\pi(\boldsymbol{X})} \right). \]
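A minimal pure-Python sketch of (6), under an illustrative toy DGP with a binary confounder so that \(\hat\pi\) can be estimated by cross-fitted treated shares (all numbers are assumptions for illustration):

```python
import random

rng = random.Random(7)
n, K = 4000, 2
data = []
for i in range(n):
    x = rng.randint(0, 1)                      # binary confounder
    d = int(rng.random() < 0.3 + 0.4 * x)      # pi(x) = 0.3 + 0.4 x
    y = x + d * (1 + x) + rng.gauss(0, 1)      # ATE = 1.5
    data.append((y, d, x))

def pi_hat(train):
    """Propensity nuisance: treated share within each level of x."""
    num, den = {0: 0, 1: 0}, {0: 0, 1: 0}
    for _, d, x in train:
        num[x] += d
        den[x] += 1
    return {x: num[x] / den[x] for x in (0, 1)}

scores = []
for k in range(K):
    pi = pi_hat([data[i] for i in range(n) if i % K != k])   # fit on I_k^c
    for i in range(k, n, K):                                 # predict on I_k
        y, d, x = data[i]
        scores.append(d * y / pi[x] - (1 - d) * y / (1 - pi[x]))

tau_ipw = sum(scores) / n
print(round(tau_ipw, 2))   # close to the true ATE of 1.5
```

Note how each observed outcome is reweighted by the inverse of its arm's (estimated) assignment probability, rather than being modeled directly.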

Augmented inverse probability weighting

AIPW identification of ATE

Define \[ \psi \left( \boldsymbol{X}_i \right) := \underbrace{\mu(1, \boldsymbol{X}_i) - \mu(0, \boldsymbol{X}_i)}_{\text{OR}} + \underbrace{\frac{D_i \left( Y_i - \mu(1, \boldsymbol{X}_i) \right)}{\pi(\boldsymbol{X}_i)} - \frac{(1 - D_i) \left( Y_i - \mu(0, \boldsymbol{X}_i) \right)}{1 - \pi(\boldsymbol{X}_i)}}_{\text{IPW on residuals}}. \tag{7}\]

Under selection-on-observables, we can write

\[ \tau = \mathbb{E} \left[ \psi \left( \boldsymbol{X}_i \right) \right]. \tag{8}\]

Equation (8) motivates the plug-in estimator

\[ \hat{\tau}^{AIPW} = \frac{1}{n}\sum_{i=1}^n \hat\psi_{-k(i)} (\boldsymbol{X}_i). \tag{9}\]

Asymptotic Normality of AIPW Estimator

Suppose \(0 < \hat\pi(\boldsymbol{X}_i) < 1\) and \(\mathrm{RMSE}(\hat{\mu}_{-k}) \cdot \mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/2})\) for \(k = 1, \dots, K\). Then \[ \sqrt{n} \left( \hat{\tau}^{AIPW} - \tau \right) \xrightarrow{d} \mathcal{N} \left( 0, \mathbb{V} \left( \psi \left( \boldsymbol{X} \right) \right) \right), \] \[ \widehat{\mathbb{V}}^{AIPW}_n := \frac{1}{n} \sum_{i = 1}^n \left( \hat \psi_{-k(i)} \left( \boldsymbol{X}_i \right) - \hat \tau^{AIPW} \right)^2 \xrightarrow{p} \mathbb{V} \left( \psi \left( \boldsymbol{X} \right) \right). \]
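A minimal pure-Python sketch of (9), under an illustrative toy DGP with a binary confounder, estimating both nuisances by cross-fitted cell means and treated shares (all numbers are assumptions for illustration):

```python
import random

rng = random.Random(11)
n, K = 4000, 2
data = []
for i in range(n):
    x = rng.randint(0, 1)                      # binary confounder
    d = int(rng.random() < 0.3 + 0.4 * x)      # pi(x) = 0.3 + 0.4 x
    y = x + d * (1 + x) + rng.gauss(0, 1)      # ATE = 1.5
    data.append((y, d, x))

def nuisances(train):
    """Cell means for mu(d, x) and treated shares for pi(x)."""
    cells, num, den = {}, {0: 0, 1: 0}, {0: 0, 1: 0}
    for y, d, x in train:
        cells.setdefault((d, x), []).append(y)
        num[x] += d
        den[x] += 1
    mu = {cell: sum(v) / len(v) for cell, v in cells.items()}
    pi = {x: num[x] / den[x] for x in (0, 1)}
    return mu, pi

psi = []
for k in range(K):
    mu, pi = nuisances([data[i] for i in range(n) if i % K != k])
    for i in range(k, n, K):
        y, d, x = data[i]
        psi.append(mu[(1, x)] - mu[(0, x)]
                   + d * (y - mu[(1, x)]) / pi[x]
                   - (1 - d) * (y - mu[(0, x)]) / (1 - pi[x]))

tau_aipw = sum(psi) / n
se = (sum((s - tau_aipw) ** 2 for s in psi) / n) ** 0.5 / n ** 0.5
print(f"{tau_aipw:.2f} (se {se:.2f})")
```

The same stacked scores \(\hat\psi_{-k(i)}\) feed both the point estimate and the plug-in variance estimator.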

Oracle + plug-in bias

Decomposition of OR, IPW, and AIPW estimators

Let \(W_i := (Y_i, D_i, \boldsymbol{X}_i)\) and let \(\eta := (\mu, \pi)\) collect the nuisance functions. The ATE estimators can be written as \[ \hat{\tau} = \frac{1}{n} \sum_{i = 1}^n m (W_i;\, \hat{\eta}_{-k(i)}), \tag{10}\]

where \(m(\cdot; \eta)\) is the estimator’s score.

Adding and subtracting the oracle score \(m(W_i; \eta)\) and rearranging, \[ \hat{\tau} = \underbrace{\frac{1}{n} \sum_{i = 1}^n m \left(W_i; \eta \right)}_{\text{oracle term}} + \underbrace{\frac{1}{n} \sum_{i = 1}^n \left[ m ( W_i; \hat{\eta}_{-k(i)}) - m \left(W_i; \eta\right) \right]}_{\text{plug-in bias}}. \tag{11}\]

Comparing requirements on nuisance estimation

To obtain asymptotically well-behaved estimates, OR, IPW, and AIPW need nuisances to be estimated accurately enough to control the plug-in bias.

  • OR requires \(\mathrm{RMSE}(\hat{\mu}_{-k}) = o_p(n^{-1/2})\).
  • IPW requires \(\mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/2})\).
  • AIPW requires \(\mathrm{RMSE}(\hat{\mu}_{-k}) \cdot \mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/2})\).

A sufficient condition for AIPW is \(\mathrm{RMSE}(\hat{\mu}_{-k}) = o_p(n^{-1/4})\) and \(\mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/4})\).

AIPW controls the plug-in bias under weaker requirements on nuisance estimation accuracy.

Toy simulation

Data-generating process

We run a simple Monte Carlo to illustrate the benefits of cross-fitting for AIPW estimation of the ATE.

  • Generate \(p\) i.i.d. covariates \(X_{ij} \sim \mathcal N(0,1)\).
  • Propensity score defined as \(\pi(\boldsymbol X_i) = \Lambda (0.6 X_{i1} + 0.8 \sin(X_{i3}))\), with \(\Lambda\) the logistic CDF.
  • Treatment assignment follows a Bernoulli process \(D_i \sim \mathrm{Bernoulli}(\pi(\boldsymbol{X}_i))\).
  • Control regression function is \(\mu(0, \boldsymbol{X}_i) = 2\sin(X_{i1}) + 0.5 X_{i2}^2\).
  • CATE function specified as \(\tau(\boldsymbol{X}_i) = 1 + 0.5 \cdot \mathbf{1} \{X_{i1} > 0\}\).
  • Generate observed outcomes as \(Y_i = \mu(0, \boldsymbol{X}_i) + D_i \tau(\boldsymbol{X}_i) + \mathcal N(0,1)\).

We construct \(\hat\tau^{AIPW}\) using regression/classification trees for nuisance estimation.
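This DGP can be sketched in pure Python (trees are omitted here). As a sanity check, plugging the true nuisances into the AIPW score recovers the target \(\tau = \mathbb{E}[1 + 0.5 \cdot \mathbf{1}\{X_{i1} > 0\}] = 1.25\):

```python
import math, random

def simulate(n=2000, p=15, seed=0):
    """One draw from the DGP above; each row also carries the true nuisances."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]
        pi = 1 / (1 + math.exp(-(0.6 * x[0] + 0.8 * math.sin(x[2]))))  # Lambda(.)
        d = int(rng.random() < pi)
        mu0 = 2 * math.sin(x[0]) + 0.5 * x[1] ** 2
        tau = 1 + 0.5 * (x[0] > 0)
        y = mu0 + d * tau + rng.gauss(0, 1)
        rows.append((y, d, pi, mu0, tau))
    return rows

# Oracle AIPW: plug the *true* nuisances into the score psi.
rows = simulate()
psi = [tau + d * (y - (mu0 + tau)) / pi - (1 - d) * (y - mu0) / (1 - pi)
       for y, d, pi, mu0, tau in rows]
ate_oracle = sum(psi) / len(psi)
print(round(ate_oracle, 2))    # close to the true ATE of 1.25
```

In the Monte Carlo proper, the true \(\mu\) and \(\pi\) are of course replaced by cross-fitted tree estimates.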

Exercise: Understanding the DGP (\(\approx\) 5 minutes)

Discuss and answer the following questions.

  • Q1. Is this DGP confounded?
  • Q2. Which variable(s) are confounders?
  • Q3. Is there treatment-effect heterogeneity?
  • Q4. What’s the model of \(\mu(1, \boldsymbol{X}_i)\)?

Solution.

  • Yes.

\(X_{i1}\) enters the models of both \(Y_i\) and \(D_i\).

  • \(X_{i1}\) is the only confounder.

→ Controlling for \(X_{i1}\) is enough for identification.

  • Yes.

→ Effects are larger when \(X_{i1} > 0\).

  • Notice that \(\mu(1, \boldsymbol{X}_i) = \tau(\boldsymbol{X}_i) + \mu(0, \boldsymbol{X}_i)\).
Summary figure
Results for \(p = 15\), \(n = 500\), and \(K = 2\), with \(500\) replications (\(\approx\) 1 minute on my laptop).
Summary figure
Results for \(p = 15\), \(n = 500\), and \(\color{#c1121f}{K = 5}\), with \(500\) replications (\(\approx\) 2:30 minutes on my laptop).

ATE vs. GATEs vs. CATEs

Effect heterogeneity matters

The ATE quantifies the average impact of the policy on the reference population.

→ Straightforward to interpret and communicate.

However, the ATE lacks information regarding effect heterogeneity.

→ Who benefits most/least? Any harmed groups? Are effects monotone?

To study heterogeneity, we can target

  • GATEs (group ATEs), or
  • CATEs (conditional ATEs).

Exercise: How to tackle effect heterogeneity (\(\approx\) 3 minutes)

Discuss and answer the following questions.

  • Q1. Can you think of some “natural” grouping rules?
  • Q2. What are the trade-offs when changing the number of groups?
  • Q3. What are the trade-offs between targeting GATEs vs CATEs?

Solution.

  • Standard examples are age bands or gender/ethnicity strata.
  • More groups → finer detail but smaller cells (noisy/fragile) and harder interpretation; fewer groups → stable but coarse.
  • GATEs are simple to interpret, but need group choice and can miss within-group patterns. CATEs are more flexible, but harder to communicate and estimate.

GATEs overview

GATE analysis

A common approach to tackle effect heterogeneity is to report the ATEs across different subgroups defined by observable covariates:

\[ \tau_g := \mathbb{E} [Y_i(1) - Y_i(0) | G_i = g], \] with \(G_i \in \{1, \dots, G\}\) a discrete “group” indicator built from \(\boldsymbol{X}_i\).

→ GATEs with continuous \(G_i\) are more challenging to handle (Zimmert & Lechner (2019); Fan et al. (2022)).


If groups are “exogenous” (i.e., pre-specified), GATEs can be estimated by applying ATE estimators separately for each group:

\[ \hat{\tau}_{\color{#c1121f}{g}} = \frac{1}{n_{\color{#c1121f}{g}}} \sum_{i = 1}^n \color{#c1121f}{\mathbf{1} \{G_i = g\}} \, m (W_i;\, \hat{\eta}_{-k(i)}), \quad g = 1, \dots, G. \]

This approach is simple to implement, but can be inefficient with many small groups.


A more efficient alternative is to project the CATEs on “group dummies.”

\(\tau_g = \mathbb{E} [\tau(\boldsymbol{X}_i) | G_i = g]\) motivates modelling the best linear predictor of \(\tau(\cdot)\) given \(G_i\).

However, \(\tau(\cdot)\) is unobserved. Fortunately, we can proxy it with \(\psi(\cdot)\) (Semenova & Chernozhukov (2021)).

\(\mathbb{E} [ \psi(\boldsymbol{X}_i) | \boldsymbol{X}_i] = \tau(\boldsymbol{X}_i)\).

\[ \hat\psi_{-k(i)}(\boldsymbol{X}_i) = \sum_{g = 1}^G \mathbf{1} \{G_i = g\} \beta_g + \epsilon_i. \tag{12}\]

Under selection-on-observables, \(\beta_g = \tau_g\). With cross-fitting, Semenova & Chernozhukov (2021) show that the OLS estimator \(\hat\beta_g\) is root-\(n\) consistent and asymptotically normal, provided \(\mathrm{RMSE}(\hat{\mu}_{-k}) \cdot \mathrm{RMSE}(\hat{\pi}_{-k}) = o_p(n^{-1/2})\).
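Because the dummy design in (12) is saturated, the OLS coefficient for group \(g\) is simply the group mean of the scores. A pure-Python sketch, where the scores are simulated stand-ins for cross-fitted \(\hat\psi\) values with hypothetical true GATEs of 1 and 2:

```python
import random, statistics

rng = random.Random(5)
# Hypothetical cross-fitted AIPW scores: two groups with true GATEs 1 and 2.
groups = [rng.randint(0, 1) for _ in range(2000)]
psi = [1 + g + rng.gauss(0, 2) for g in groups]

def gate_ols(psi, groups):
    """OLS of psi on a saturated set of group dummies: the coefficient for
    group g is the group mean of the scores, with its standard error."""
    out = {}
    for g in set(groups):
        s = [p for p, gi in zip(psi, groups) if gi == g]
        beta = sum(s) / len(s)
        se = statistics.stdev(s) / len(s) ** 0.5
        out[g] = (beta, se)
    return out

for g, (beta, se) in sorted(gate_ols(psi, groups).items()):
    print(f"group {g}: {beta:.2f} ± {1.96 * se:.2f}")
```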

How to form groups

Data-driven groups

So far, we assumed groups were pre-specified. Pre-specifying groups is simple but risks missing unexpected heterogeneity.

Researchers often want to “let the data speak” and discover data-driven heterogeneous subgroups.

→ Tree-based approaches are popular because of their piece-wise constant structure.

However, data-driven groups are endogenous, and naive reuse of the same sample for grouping and estimation biases results.


A remedy is to combine tree-based approaches with honesty (Athey & Imbens (2016)).

  1. Split the data into training and honest subsamples.
  2. Construct a tree using training data.
  3. Estimate leaf-ATEs using honest data.
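A stylized pure-Python version of the three steps, replacing the tree with a single honest split chosen over a crude grid of thresholds; the DGP (randomized treatment, effect jumping from 0 to 2 at \(x = 0\)) is an assumption for illustration:

```python
import random

rng = random.Random(9)
# Illustrative DGP: randomized treatment, effect jumps from 0 to 2 at x = 0.
data = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    d = rng.randint(0, 1)
    y = d * (2 if x > 0 else 0) + rng.gauss(0, 1)
    data.append((y, d, x))

train, honest = data[:1000], data[1000:]        # 1. sample split

def leaf_ate(rows):
    """Difference in mean outcomes between treated and controls in a leaf."""
    t = [y for y, d, _ in rows if d]
    c = [y for y, d, _ in rows if not d]
    return sum(t) / len(t) - sum(c) / len(c)

# 2. "Grow the tree" on training data: here, a single split chosen to
#    maximize the gap between the two leaf ATEs over a crude grid.
best = max((abs(leaf_ate([r for r in train if r[2] <= c])
               - leaf_ate([r for r in train if r[2] > c])), c)
           for c in (-1, -0.5, 0, 0.5, 1))[1]

# 3. Estimate leaf ATEs on the honest sample.
left = leaf_ate([r for r in honest if r[2] <= best])
right = leaf_ate([r for r in honest if r[2] > best])
print(best, round(left, 2), round(right, 2))
```

Because the leaf ATEs come from data the split never saw, reusing the same sample twice does not bias them.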

Causal trees (Athey & Imbens (2016))

Grow tree by adapting splitting criterion to target effect heterogeneity, and estimate GATEs by

\[ Y_i = \sum_{l = 1}^L L_{il} \gamma_l + \sum_{l = 1}^L L_{il} D_i \beta_l + \epsilon_i. \]

Aggregation trees (Di Francesco (2024))

Grow tree by applying standard CART to estimated CATEs, and estimate GATEs by \[ \hat\psi_{-k(i)}(\boldsymbol{X}_i) = \sum_{l = 1}^L L_{il} \beta_l + \epsilon_i. \]

Summary figure

Exercise

Exercise: Honesty trade-offs (\(\approx 3\) minutes)

Discuss and answer the following questions.

  • Q1. Does honesty come at a price?
  • Q2. What are the trade-offs when varying the fraction of the sample allocated to the training vs. honest subsamples?

Solution.

  • Q1. Yes. Sample splitting uses fewer observations for each task (splitting vs. effect estimation), which increases variance and typically reduces predictive accuracy in finite samples.

  • Q2.

    • More data to training ⇒ better splits, but fewer observations in the honest sample ⇒ noisier effect estimates within each leaf.
    • More data to honesty ⇒ more precise effect estimates within leaves, but splits are learned from less data ⇒ coarser/less stable partitions.
    • We balance these two sources of error: too little training data leads to bad trees; too little honest data leads to unreliable GATE estimates.

CATE overview

CATE estimation

To “fully” tackle effect heterogeneity, we can focus on the CATEs \[ \tau (\boldsymbol{X}_i) := \mathbb{E} [ Y_i(1) - Y_i(0) | \boldsymbol{X}_i]. \] CATEs provide information at the finest level of granularity achievable with the covariates we observe.

→ They let us relate effect heterogeneity directly to observable covariates.


Moving from ATE/GATEs to CATEs means moving from the estimation of a low-dimensional parameter to the estimation of a high-dimensional function.

→ This is essentially a prediction problem, but the algorithm must output causal effects rather than outcome predictions.

Two broad strategies:

  • Recast the CATE problem as a sequence of standard prediction tasks, or
  • Modify ML algorithms so that they directly target causal quantities.

CATE metalearners

Decomposing the CATE problem

Meta-learners (Künzel et al. (2019)) exploit the fact that the CATE problem can be recast as a sequence of standard prediction tasks, which can then be solved using any supervised learning algorithm.

S-learner

  1. Train a single model \(\hat\mu(d, \boldsymbol{x})\) by regressing \(Y_i\) on \((D_i, \boldsymbol{X}_i)\) using all units.
  2. For any \(\boldsymbol{x}\), construct \(\hat\tau^{S}(\boldsymbol{x}) = \hat\mu(1, \boldsymbol{x}) - \hat\mu(0, \boldsymbol{x})\).

T-learner

  1. Train two models \(\hat\mu(1, \cdot)\) and \(\hat\mu(0, \cdot)\) by regressing \(Y_i\) on \(\boldsymbol{X}_i\) separately in the treated and control samples.
  2. For any \(\boldsymbol{x}\), construct \(\hat\tau^{T}(\boldsymbol{x}) = \hat\mu(1, \boldsymbol{x}) - \hat\mu(0, \boldsymbol{x})\).

X-learner

  1. Train \(\hat\mu(1, \cdot)\) and \(\hat\mu(0, \cdot)\) as in the T-learner.
  2. Impute treatment effects \(\tilde\tau_{1,i} = Y_i - \hat\mu(0, \boldsymbol{X}_i)\) and \(\tilde\tau_{0,i} = \hat\mu(1, \boldsymbol{X}_i) - Y_i\).
  3. Fit two CATE models \(\hat\tau_1(\boldsymbol{x})\) and \(\hat\tau_0(\boldsymbol{x})\) by regressing \(\tilde\tau_{d,i}\) on \(\boldsymbol{X}_i\) in the corresponding subsamples.
  4. Construct \(\hat\tau^{X}(\boldsymbol{x}) = g(\boldsymbol{x}) \hat\tau_0(\boldsymbol{x}) + (1 - g(\boldsymbol{x})) \hat\tau_1(\boldsymbol{x})\) using any weight function \(g(\boldsymbol{x}) \in [0, 1]\).

→ A common choice is \(g(\boldsymbol{x}) = \hat\pi(\boldsymbol{x})\).
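The three recipes can be sketched with a hand-rolled OLS base learner in pure Python. The DGP with \(\tau(x) = 1 + x\), the randomized treatment, and the S-learner's lack of a \(D\)-\(X\) interaction are all assumptions chosen to make the learners' behavior visible:

```python
import random

def ols(X, y):
    """Least squares via normal equations and Gaussian elimination.
    X is a list of feature rows; an intercept is prepended internally."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    for i in range(p):                           # forward elimination
        for j in range(i + 1, p):
            f = A[j][i] / A[i][i]
            A[j] = [a - f * c for a, c in zip(A[j], A[i])]
            b[j] -= f * b[i]
    beta = [0.0] * p
    for i in reversed(range(p)):                 # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return beta                                  # [intercept, slopes...]

rng = random.Random(1)
n = 4000
x = [rng.gauss(0, 1) for _ in range(n)]
d = [rng.randint(0, 1) for _ in range(n)]                 # randomized, pi = 1/2
y = [xi + di * (1 + xi) + rng.gauss(0, 1) for xi, di in zip(x, d)]  # tau(x) = 1 + x

# S-learner: one model with D as just another covariate (no D-X interaction).
bs = ols([[di, xi] for di, xi in zip(d, x)], y)
tau_s = lambda xq: bs[1]                                  # constant in x

# T-learner: separate regressions of Y on X in each arm.
b1 = ols([[xi] for xi, di in zip(x, d) if di],
         [yi for yi, di in zip(y, d) if di])
b0 = ols([[xi] for xi, di in zip(x, d) if not di],
         [yi for yi, di in zip(y, d) if not di])
tau_t = lambda xq: (b1[0] + b1[1] * xq) - (b0[0] + b0[1] * xq)

# X-learner: regress imputed effects on X in each arm, then blend with g = 1/2.
t1 = ols([[xi] for xi, di in zip(x, d) if di],
         [yi - (b0[0] + b0[1] * xi) for yi, xi, di in zip(y, x, d) if di])
t0 = ols([[xi] for xi, di in zip(x, d) if not di],
         [(b1[0] + b1[1] * xi) - yi for yi, xi, di in zip(y, x, d) if not di])
tau_x = lambda xq: 0.5 * (t0[0] + t0[1] * xq) + 0.5 * (t1[0] + t1[1] * xq)

print(tau_s(1.0), tau_t(1.0), tau_x(1.0))   # true tau(1) = 2
```

Because the S-learner's model here has no \(D\)-\(X\) interaction, its CATE is constant in \(x\); the T- and X-learner recover \(\tau(x) = 1 + x\).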

Exercise

Exercise: Meta-learners trade-offs (\(\approx\) 5 minutes)

Discuss the pros and cons of the S-, T-, and X-learner.


Solution.

  • S-learner
    • ✅ Trains a single model on the full sample.
    • ✅ Very simple to implement: just treat \(D_i\) as another covariate.
    • ❌ The learner may heavily regularize the contribution of \(D_i\), effectively shrinking treatment effects toward zero.
    • ❌ Can perform poorly when treatment assignment is highly imbalanced.
  • T-learner
    • ✅ Fits separate models for treated and control units, allowing very flexible and potentially very different response surfaces.
    • ✅ Avoids directly regularizing \(D_i\) inside a single model.
    • ❌ Each model uses only one arm, so the effective sample size per model is smaller ⇒ higher variance, especially if one arm is small.
    • ❌ With poor overlap, one of the two models must extrapolate heavily.
  • X-learner
    • ✅ Can explicitly exploit imbalance/overlap by letting each arm contribute more where it is better informed about local effects.
    • ✅ Often performs well when treatment is strongly imbalanced and base learners are flexible.
    • ❌ More complex: requires two outcome models, two “effect” models, and a weighting scheme \(g(\boldsymbol{x})\).
    • ❌ Sensitive to first-stage quality: poor outcome models propagate bias into the imputed treatment effects.

Toy simulation

Data-generating process

We run a simple Monte Carlo to compare S-, T-, and X-learners.

  • Generate \(p\) i.i.d. covariates \(X_{ij} \sim \mathcal N(0,1)\).
  • Propensity score defined as \(e(\boldsymbol X_i) = \Lambda(-1.5 + 1.5 X_{i1})\).
  • Treatment assignment follows a Bernoulli process \(D_i \sim \mathrm{Bernoulli}(e(\boldsymbol{X}_i))\).
  • Baseline regression function is \(\mu(0, \boldsymbol{X}_i) = 2\sin(\pi X_{i1} X_{i2}) + 0.5 X_{i3}^2 + X_{i4}\).
  • Generate observed outcomes as \(Y_i = \mu(0, \boldsymbol{X}_i) + D_i \tau(\boldsymbol{X}_i) + \mathcal N(0,1)\).

We use regression forests as base learners, setting \(p = 5\) and \(n = 800\).

Exercise: Reading the meta-learner simulation (\(\approx\) 5 minutes)

Discuss and answer the following questions.

  • Q1. Which learner would you expect to have the lowest RMSE for \(\tau(\boldsymbol X_i) = 0\)?
  • Q2. Why might the T-learner perform relatively poorly?

Solution.

  • S-learner.

→ S-learner artificially shrinks effects towards zero: it’s “lucky” when treatment has no effect.

  • T-learner suffers from high variance because treatment is imbalanced (\(\mathbb{E} [D_i] \approx 0.25\)).

→ Each model uses only one arm; in regions where one arm is rare, it must extrapolate heavily, leading to noisy and unstable predictions.

Toy simulation: Results

Summary figure
Results for \(\tau(\boldsymbol{X_i}) = 0\), with \(100\) replications (\(\approx\) 15 seconds on my laptop).
Summary figure
Results for \(\tau(\boldsymbol{X_i}) = 0\), with \(100\) replications (\(\approx\) 15 seconds on my laptop).
Summary figure
Results for \(\tau(\boldsymbol{X_i}) = X_{i1}\), with \(100\) replications (\(\approx\) 15 seconds on my laptop).
Summary figure
Results for \(\tau(\boldsymbol{X_i}) = X_{i1}\), with \(100\) replications (\(\approx\) 15 seconds on my laptop).

CATE validation

High variance of CATE estimates \(\neq\) heterogeneity

Looking at the distribution of the estimated CATEs is not an effective strategy for assessing effect heterogeneity.

→ High variation in predictions due to estimation noise does not necessarily imply heterogeneous effects.

Emerging literature on CATE validation

There is an emerging literature on CATE validation that introduces methodologies to quantify how much signal a given CATE estimator is actually capturing.

Summary figure
Results for \(\tau(\boldsymbol{X_i}) = 0\), with \(100\) replications (\(\approx\) 15 seconds on my laptop).

References

Athey, S., and G. W. Imbens. (2016): “Recursive partitioning for heterogeneous causal effects,” Proceedings of the National Academy of Sciences, 113, 7353–60.
Chernozhukov, V., M. Demirer, E. Duflo, and I. Fernández-Val. (2017): “Generic machine learning inference on heterogenous treatment effects in randomized experiments,” arXiv preprint arXiv:1712.04802.
Di Francesco, R. (2024): “Aggregation trees,” arXiv preprint arXiv:2410.11408.
Di Francesco, R. (2025): “Ordered correlation forest,” Econometric Reviews, 44, 416–32.
Fan, Q., Y.-C. Hsu, R. P. Lieli, and Y. Zhang. (2022): “Estimation of conditional average treatment effects with high-dimensional data,” Journal of Business & Economic Statistics, 40, 313–27.
Imai, K., and M. L. Li. (2025): “Statistical inference for heterogeneous treatment effects discovered by generic machine learning in randomized experiments,” Journal of Business & Economic Statistics, 43, 256–68.
Imbens, G. W., and D. B. Rubin. (2015): Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press.
Künzel, S. R., J. S. Sekhon, P. J. Bickel, and B. Yu. (2019): “Metalearners for estimating heterogeneous treatment effects using machine learning,” Proceedings of the National Academy of Sciences, 116, 4156–65.
Neyman, J. (1923): “Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes,” Roczniki Nauk Rolniczych, 10, 1–51.
Rubin, D. B. (1974): “Estimating causal effects of treatments in randomized and nonrandomized studies,” Journal of Educational Psychology, 66, 688–701.
Semenova, V., and V. Chernozhukov. (2021): “Debiased machine learning of conditional average treatment effects and other causal functions,” The Econometrics Journal, 24, 264–89.
Yadlowsky, S., S. Fleming, N. Shah, E. Brunskill, and S. Wager. (2025): “Evaluating treatment prioritization rules via rank-weighted average treatment effects,” Journal of the American Statistical Association, 120, 38–51.
Zimmert, M., and M. Lechner. (2019): “Nonparametric estimation of causal heterogeneity under high-dimensional confounding,” arXiv preprint arXiv:1908.08779.