In this tutorial, we show how to use the aggTrees package to discover heterogeneous subgroups in a selection-on-observables setting.

Methodology Overview

The approach consists of three steps:

  1. Estimate the conditional average treatment effects (CATEs);
  2. Approximate the CATEs by a decision tree;
  3. Prune the tree.

This way, we generate a sequence of groupings, one for each granularity level.

The resulting sequence is nested in the sense that subgroups formed at a given level of granularity are never split apart at coarser levels. This guarantees consistency of the results across the different granularity levels, generally considered a basic requirement that every classification system should satisfy. Moreover, each grouping is optimal in that it minimizes the loss in explained heterogeneity resulting from aggregation.
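To make the nesting property concrete, here is a minimal sketch that uses rpart directly on simulated stand-ins for the CATEs. The data, the object names, and the use of rpart's cost-complexity table are assumptions made for illustration; this is not the package's internal code.

```r
## Sketch: a nested sequence of groupings from cost-complexity pruning.
## All data and object names below are illustrative assumptions.
library(rpart)

set.seed(1)
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
cates <- x2 + 0.1 * rnorm(n) # Stand-in for estimated CATEs.
dta <- data.frame(cates, x1, x2)

## Grow a tree that approximates the CATEs.
tree <- rpart(cates ~ x1 + x2, data = dta, method = "anova",
              control = rpart.control(cp = 0, maxdepth = 3))

## One grouping per pruning step, ordered from coarsest to finest.
cp_values <- tree$cptable[, "CP"]
partitions <- lapply(cp_values, function(cp) prune(tree, cp = cp)$where)

## A finer grouping is nested in a coarser one if each fine group
## falls entirely within a single coarse group.
is_nested <- function(fine, coarse) {
  all(tapply(coarse, fine, function(g) length(unique(g))) == 1)
}
nested <- vapply(seq_along(partitions)[-1], function(i) {
  is_nested(partitions[[i]], partitions[[i - 1]])
}, logical(1))

n_groups <- vapply(partitions, function(g) length(unique(g)), integer(1))
print(n_groups)
print(all(nested))
```

Each pruning step merges groups but never reassigns units across existing groups, which is why every check in `nested` should hold.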

Given the sequence of groupings, we can estimate the group average treatment effects (GATEs) as we like. The package supports two estimators, based on differences in mean outcomes between treated and control units (unbiased only in randomized experiments) and on debiased machine learning procedures (unbiased also in observational studies).

The package also lets us obtain standard errors for the GATEs by estimating appropriate linear models via OLS. Then, under an “honesty” condition, we can use the estimated standard errors to conduct valid inference about the GATEs as usual, e.g., by constructing conventional confidence intervals.[1]
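As a rough illustration of the idea, the sketch below estimates GATEs in a simulated randomized experiment in two ways: by differences in mean outcomes within each group, and by an OLS regression of the outcome on group dummies and their interactions with the treatment. The data, the group labels, and the specific linear model are assumptions for this example, and conventional OLS standard errors are used for brevity; the inference vignette describes what the package actually does.

```r
## Sketch: GATEs as differences in means, with standard errors from OLS.
## Simulated randomized experiment; all names are illustrative assumptions.
set.seed(2)
n <- 1000
leaf <- factor(sample(c("leaf1", "leaf2", "leaf3"), n, replace = TRUE))
D <- rbinom(n, size = 1, prob = 0.5)
tau <- unname(c(leaf1 = -0.5, leaf2 = 0, leaf3 = 1)[as.character(leaf)])
Y <- D * tau + rnorm(n)

## Differences in mean outcomes between treated and control units.
gates_dm <- sapply(levels(leaf), function(l) {
  mean(Y[leaf == l & D == 1]) - mean(Y[leaf == l & D == 0])
})

## Same point estimates from a saturated linear model.
fit <- lm(Y ~ 0 + leaf + leaf:D)
coefs <- summary(fit)$coefficients
gate_rows <- grep(":D$", rownames(coefs))
gates_ols <- coefs[gate_rows, "Estimate"]
ses <- coefs[gate_rows, "Std. Error"]

## Conventional 95% confidence intervals.
ci <- cbind(lower = gates_ols - 1.96 * ses, upper = gates_ols + 1.96 * ses)
print(cbind(gate = gates_ols, ci))
```

Because the model is saturated in group membership and treatment, the interaction coefficients coincide with the within-group differences in means.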

Code

For illustration purposes, let us generate some data. We also split the observed sample into a training sample and an honest sample of equal sizes, as this will be necessary to achieve valid inference about the GATEs later on.

## Load packages.
library(aggTrees)
library(grf) # Provides causal_forest.

## Generate data.
set.seed(1986)

n <- 500 # Small sample size due to compliance with CRAN notes.
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))
D <- rbinom(n, size = 1, prob = 0.5)
mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]
Y <- mu0 + D * (mu1 - mu0) + rnorm(n)

## Sample split.
splits <- sample_split(length(Y), training_frac = 0.5)
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx

Y_tr <- Y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]

Y_hon <- Y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]
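
What matters for honesty is that the two samples are disjoint and together cover the observed sample. The sketch below shows one way to produce such a split in base R; `split_sketch` is a hypothetical helper written for this example and is not the package's sample_split.

```r
## Sketch: a random split into disjoint training and honest samples.
## 'split_sketch' is a hypothetical stand-in, not the package function.
split_sketch <- function(n, training_frac = 0.5) {
  training_idx <- sort(sample(seq_len(n), size = floor(n * training_frac)))
  honest_idx <- setdiff(seq_len(n), training_idx)
  list(training_idx = training_idx, honest_idx = honest_idx)
}

set.seed(1986)
s <- split_sketch(500)

## No unit appears in both samples.
print(length(intersect(s$training_idx, s$honest_idx)))
```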

CATEs Estimation

First, we need to estimate the CATEs. This can be achieved with any estimator we like. Here we use the causal forest estimator. The CATEs are estimated using only the training sample.

## Estimate the CATEs. Use only training sample.
forest <- causal_forest(X_tr, Y_tr, D_tr) 

cates_tr <- predict(forest, X_tr)$predictions
cates_hon <- predict(forest, X_hon)$predictions
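
Since any CATE estimator can be used, here is a sketch of a simple alternative: two separate linear regressions, one per treatment arm, fitted on the training sample only. The simulated data and the object names (`cates_tr_alt`, `cates_hon_alt`) are assumptions for illustration; the linear specification is chosen for brevity, not because it is recommended.

```r
## Sketch: CATEs from arm-specific linear regressions (a "T-learner").
## Simulated data; all names are illustrative assumptions.
set.seed(3)
n <- 500
X <- matrix(rnorm(n * 3), ncol = 3)
colnames(X) <- paste0("x", 1:3)
D <- rbinom(n, size = 1, prob = 0.5)
Y <- 0.5 * X[, 1] + D * X[, 2] + rnorm(n)
dta <- data.frame(Y = Y, X)
train <- seq_len(n) <= n / 2 # First half is the training sample.

## Fit one outcome model per arm. Use only training sample.
fit1 <- lm(Y ~ x1 + x2 + x3, data = dta[train & D == 1, ])
fit0 <- lm(Y ~ x1 + x2 + x3, data = dta[train & D == 0, ])

## Predicted CATEs are differences in predicted outcomes.
cates_all <- predict(fit1, newdata = dta) - predict(fit0, newdata = dta)
cates_tr_alt <- cates_all[train]
cates_hon_alt <- cates_all[!train]
```

Vectors built this way could be supplied in place of the forest predictions, as long as the model is fitted without touching the honest sample.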

Constructing the Sequence of Groupings

Now we use the build_aggtree function to construct the sequence of groupings. This function approximates the estimated CATEs by a decision tree using only the training sample and computes node predictions (i.e., the GATEs) using only the honest sample. build_aggtree allows the user to choose between two GATE estimators:

  1. If we set method = "raw", the GATEs are estimated by taking the differences between the mean outcomes of treated and control units in each node. This is an unbiased estimator (only) in randomized experiments;
  2. If we set method = "aipw", the GATEs are estimated by averaging doubly-robust scores in each node. This is an unbiased estimator also in observational studies under particular conditions on the construction of the scores.[2]

The doubly-robust scores are estimated internally using 5-fold cross-fitting and only observations from the honest sample.
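
To illustrate what cross-fitted doubly-robust scores look like, the sketch below builds AIPW scores with 5 folds on simulated observational data and averages them within two groups. The nuisance models (a logistic regression for the propensity score and linear regressions for the outcomes), the data, and the grouping are assumptions for this example; the package's internal learners may differ.

```r
## Sketch: 5-fold cross-fitted AIPW scores, averaged within groups.
## Simulated data and simple nuisance models; illustrative assumptions only.
set.seed(4)
n <- 600
x1 <- rnorm(n)
x2 <- rnorm(n)
D <- rbinom(n, size = 1, prob = plogis(0.5 * x1)) # Selection on x1.
Y <- 0.5 * x1 + D * x2 + rnorm(n)
dta <- data.frame(Y, D, x1, x2)

K <- 5
fold <- sample(rep(seq_len(K), length.out = n))
scores <- numeric(n)

for (k in seq_len(K)) {
  fit_idx <- fold != k # Fit nuisances on the other folds.
  eval_idx <- fold == k # Evaluate scores on the held-out fold.

  ps_fit <- glm(D ~ x1 + x2, family = binomial(), data = dta[fit_idx, ])
  mu1_fit <- lm(Y ~ x1 + x2, data = dta[fit_idx & dta$D == 1, ])
  mu0_fit <- lm(Y ~ x1 + x2, data = dta[fit_idx & dta$D == 0, ])

  new <- dta[eval_idx, ]
  e <- predict(ps_fit, newdata = new, type = "response")
  m1 <- predict(mu1_fit, newdata = new)
  m0 <- predict(mu0_fit, newdata = new)

  scores[eval_idx] <- m1 - m0 +
    new$D * (new$Y - m1) / e -
    (1 - new$D) * (new$Y - m0) / (1 - e)
}

## GATEs as average scores in each group.
leaf <- ifelse(x2 < 0, "low_x2", "high_x2")
gates_aipw <- tapply(scores, leaf, mean)
print(gates_aipw)
```

Each unit's score uses nuisance estimates fitted without that unit's fold, which is the point of cross-fitting.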

## Construct the sequence. Use doubly-robust scores (default option).
groupings <- build_aggtree(Y_tr, D_tr, X_tr, # Training sample. 
                           Y_hon, D_hon, X_hon, # Honest sample.
                           cates_tr = cates_tr, cates_hon = cates_hon) # Predicted CATEs.

## Print.
print(groupings)
#> Honest estimates: TRUE 
#> n= 250 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 250 123.9466000  0.13700170  
#>    2) x2< 0.4110066 169  17.4697700 -0.26060540  
#>      4) x2< -0.2808111 98   1.5706020 -0.58001140 *
#>      5) x2>=-0.2808111 71   2.2189360  0.24512080  
#>       10) x2< 0.1101004 38   0.3944950  0.39862120 *
#>       11) x2>=0.1101004 33   0.3364562 -0.06187981 *
#>    3) x2>=0.4110066 81   2.1095900  0.78572910  
#>      6) x2< 0.6244741 16   0.2477193 -0.36697840 *
#>      7) x2>=0.6244741 65   0.2205796  1.11284900 *

## Plot.
plot(groupings) # Try also setting 'sequence = TRUE'.

Further Analysis

Now that we have a whole sequence of optimal groupings, we can pick the grouping associated with our preferred granularity level and call the inference_aggtree function. This function does the following:

  1. It obtains standard errors for the GATEs by estimating appropriate linear models via OLS using the honest sample. The choice of the linear model depends on the method we used when we called build_aggtree;[3]
  2. It tests the null hypotheses that the differences in the GATEs across all pairs of groups equal zero. Here, we account for multiple hypothesis testing by adjusting the p-values using Holm’s procedure;
  3. It computes the average characteristics of the units in each group.
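
The last two steps can be sketched in base R as follows. The simulated data, the linear model, and the normal-approximation tests are assumptions made for illustration; they show one way to carry out pairwise comparisons with Holm's adjustment and to compute average characteristics, not necessarily how inference_aggtree does it.

```r
## Sketch: pairwise GATE differences with Holm-adjusted p-values,
## plus average characteristics by group. Illustrative assumptions only.
set.seed(5)
n <- 800
leaf <- factor(sample(paste0("leaf", 1:4), n, replace = TRUE))
D <- rbinom(n, size = 1, prob = 0.5)
tau <- unname(c(leaf1 = -0.6, leaf2 = -0.4,
                leaf3 = 0.2, leaf4 = 1.1)[as.character(leaf)])
Y <- D * tau + rnorm(n)
x1 <- rnorm(n)

## GATEs and their covariance matrix from a saturated linear model.
fit <- lm(Y ~ 0 + leaf + leaf:D)
gate_idx <- grep(":D$", names(coef(fit)))
beta <- coef(fit)[gate_idx]
V <- vcov(fit)[gate_idx, gate_idx]

## All pairwise differences and their standard errors.
pairs <- t(combn(length(beta), 2))
i <- pairs[, 1]
j <- pairs[, 2]
diffs <- beta[j] - beta[i]
ses <- sqrt(V[cbind(j, j)] + V[cbind(i, i)] - 2 * V[cbind(i, j)])

## Two-sided p-values, then Holm's adjustment.
p_raw <- 2 * pnorm(-abs(diffs / ses))
p_holm <- p.adjust(p_raw, method = "holm")

## Average characteristics: regress a covariate on group dummies.
avg_char <- coef(lm(x1 ~ 0 + leaf))
print(round(cbind(diff = diffs, p_holm = p_holm), 3))
```

Holm's procedure never lowers a p-value, so the adjusted values are at least as large as the unadjusted ones.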

To report the results, we can print nice LaTeX tables.

## Inference with 4 groups.
results <- inference_aggtree(groupings, n_groups = 4)

## LaTeX.
print(results, table = "diff")
#> \begingroup
#>   \setlength{\tabcolsep}{8pt}
#>   \renewcommand{\arraystretch}{1.2}
#>   \begin{table}[b!]
#>     \centering
#>     \begin{adjustbox}{width = 1\textwidth}
#>     \begin{tabular}{@{\extracolsep{5pt}}l c c c c}
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex] 
#> 
#>       & \textit{Leaf 1} & \textit{Leaf 2} & \textit{Leaf 3} & \textit{Leaf 4} \\
#>       \addlinespace[2pt]
#>       \hline \\[-1.8ex] 
#> 
#>       \multirow{3}{*}{GATEs} & -0.580 & -0.367 &  0.245 &  1.113 \\
#>       & [-1.017, -0.143] & [-1.396,  0.662] & [-0.298,  0.788] & [ 0.633,  1.593] \\
#>       & \{NA, NA\} & \{NA, NA\} & \{NA, NA\} & \{NA, NA\} \\ 
#> 
#>       \addlinespace[2pt]
#>       \hline \\[-1.8ex] 
#> 
#>       \textit{Leaf 1} & NA & NA & NA & NA \\
#>             & (NA) & (NA) & (NA) & (NA) \\ 
#>       \textit{Leaf 2} & 0.21 &   NA &   NA & NA \\
#>             & (0.709) & (   NA) & (   NA) & (NA) \\ 
#>       \textit{Leaf 3} & 0.83 & 0.61 &   NA & NA \\
#>             & (0.079) & (0.607) & (   NA) & (NA) \\ 
#>       \textit{Leaf 4} & 1.69 & 1.48 & 0.87 & NA \\
#>             & (0.000) & (0.056) & (0.079) & (NA) \\ 
#> 
#>       \addlinespace[3pt]
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex]
#>     \end{tabular}
#>     \end{adjustbox}
#>     \caption{Point estimates and $95\%$ confidence intervals for the GATEs based on asymptotic normality (in square brackets) and on the percentiles of the bootstrap distribution (in curly braces). Leaves are sorted in increasing order of the GATEs. Additionally, the GATE differences across all pairs of leaves are displayed. $p$-values testing the null hypothesis that a single difference is zero are adjusted using Holm's procedure and reported in parenthesis under each point estimate.}
#>     \label{table_differences_gates}
#>     \end{table}
#> \endgroup

print(results, table = "avg_char")
#> \begingroup
#>   \setlength{\tabcolsep}{8pt}
#>   \renewcommand{\arraystretch}{1.1}
#>   \begin{table}[b!]
#>     \centering
#>     \begin{adjustbox}{width = 1\textwidth}
#>     \begin{tabular}{@{\extracolsep{5pt}}l c c c c c c c c }
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex]
#>       & \multicolumn{2}{c}{\textit{Leaf 1}} & \multicolumn{2}{c}{\textit{Leaf 2}} & \multicolumn{2}{c}{\textit{Leaf 3}} & \multicolumn{2}{c}{\textit{Leaf 4}} \\\cmidrule{2-3} \cmidrule{4-5} \cmidrule{6-7} \cmidrule{8-9} 
#>       & Mean & (S.E.) & Mean & (S.E.) & Mean & (S.E.) & Mean & (S.E.) \\
#>       \addlinespace[2pt]
#>       \hline \\[-1.8ex] 
#> 
#>       \texttt{x1} &  0.125 & (0.097) & -0.030 & (0.202) & -0.089 & (0.148) &  0.083 & (0.106) \\ 
#>       \texttt{x2} & -0.941 & (0.057) &  0.505 & (0.012) &  0.020 & (0.026) &  1.222 & (0.050) \\ 
#>       \texttt{x3} &  0.088 & (0.109) & -0.542 & (0.197) &  0.316 & (0.141) & -0.036 & (0.107) \\ 
#> 
#>       \addlinespace[3pt]
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex]
#>     \end{tabular}
#>     \end{adjustbox}
#>     \caption{Average characteristics of units in each leaf, obtained by regressing each covariate on a set of dummies denoting leaf membership . Standard errors are estimated via the Eicker-Huber-White estimator. Leaves are sorted in increasing order of the GATEs.}
#>     \label{table_average_characteristics_leaves}
#>     \end{table}
#> \endgroup

  1. Check the inference vignette for more details.

  2. See footnote 1.

  3. See footnote 1.