Aggregation Trees

Nonparametric data-driven approach to discovering heterogeneous subgroups in a selection-on-observables framework. The approach constructs a sequence of groupings, one for each level of granularity. Groupings are nested and feature an optimality property. For each grouping, we obtain point estimation and standard errors for the group average treatment effects (GATEs) using debiased machine learning procedures. Additionally, we assess whether systematic heterogeneity is found by testing the hypotheses that the differences in the GATEs across all pairs of groups are zero. Finally, we investigate the driving mechanisms of effect heterogeneity by computing the average characteristics of units in each group.

build_aggtree(
  Y_tr,
  D_tr,
  X_tr,
  Y_hon = NULL,
  D_hon = NULL,
  X_hon = NULL,
  cates_tr = NULL,
  cates_hon = NULL,
  method = "aipw",
  scores = NULL,
  ...
)

inference_aggtree(object, n_groups, boot_ci = FALSE, boot_R = 2000)

Arguments

Y_tr: Outcome vector for training sample.
D_tr: Treatment vector for training sample.
X_tr: Covariate matrix (no intercept) for training sample.
Y_hon: Outcome vector for honest sample.
D_hon: Treatment vector for honest sample.
X_hon: Covariate matrix (no intercept) for honest sample.
cates_tr: Optional, predicted CATEs for training sample. If not provided by the user, CATEs are estimated internally via a causal_forest.
cates_hon: Optional, predicted CATEs for honest sample. If not provided by the user, CATEs are estimated internally via a causal_forest.
method: Either "raw" or "aipw", controls how node predictions are computed.
scores: Optional, vector of scores to be used in computing node predictions. Useful to save computational time if scores have already been estimated. Ignored if method == "raw".
...: Further arguments from rpart.control.
object: An aggTrees object.
n_groups: Number of desired groups.
boot_ci: Logical, whether to compute bootstrap confidence intervals.
boot_R: Number of bootstrap replications. Ignored if boot_ci == FALSE.

Value

build_aggtree returns an aggTrees object.

inference_aggtree returns an aggTrees.inference object, which in turn contains the aggTrees object used in the call.

Details

Aggregation trees are a three-step procedure. First, the conditional average treatment effects (CATEs) are estimated using any estimator. Second, a tree is grown to approximate the CATEs. Third, the tree is pruned to derive a nested sequence of optimal groupings, one for each granularity level. For each level of granularity, we can obtain point estimation and inference about the GATEs.

To implement this methodology, the user can rely on two core functions that handle the various steps.

Constructing the Sequence of Groupings

build_aggtree constructs the sequence of groupings (i.e., the tree) and estimate the GATEs in each node. The GATEs can be estimated in several ways. This is controlled by the method argument. If method == "raw", we compute the difference in mean outcomes between treated and control observations in each node. This is an unbiased estimator in randomized experiment. If method == "aipw", we construct doubly-robust scores and average them in each node. This is unbiased also in observational studies. Honest regression forests and 5-fold cross fitting are used to estimate the propensity score and the conditional mean function of the outcome (unless the user specifies the argument scores).

The user can provide a vector of the estimated CATEs via the cates_tr and cates_hon arguments. If no CATEs are provided, these are estimated internally via a causal_forest using only the training sample, that is, Y_tr, D_tr, and X_tr.

GATEs Estimation and Inference

inference_aggtree takes as input an aggTrees object constructed by build_aggtree. Then, for the desired granularity level, chosen via the n_groups argument, it provides point estimation and standard errors for the GATEs. Additionally, it performs some hypothesis testing to assess whether we find systematic heterogeneity and computes the average characteristics of the units in each group to investigate the driving mechanisms.

Point estimates and standard errors for the GATEs

GATEs and their standard errors are obtained by fitting an appropriate linear model. If method == "raw", we estimate via OLS the following:

$$Y_i = \sum_{l = 1}^{|T|} L_{i, l} \gamma_l + \sum_{l = 1}^{|T|} L_{i, l} D_i \beta_l + \epsilon_i$$

with L_{i, l} a dummy variable equal to one if the i-th unit falls in the l-th group, and |T| the number of groups. If the treatment is randomly assigned, one can show that the betas identify the GATE of each group. However, this is not true in observational studies due to selection into treatment. In this case, the user is expected to use method == "aipw" when calling build_aggtree. In this case, inference_aggtree uses the scores in the following regression:

$$score_i = \sum_{l = 1}^{|T|} L_{i, l} \beta_l + \epsilon_i$$

This way, betas again identify the GATEs.

Regardless of method, standard errors are estimated via the Eicker-Huber-White estimator.

If boot_ci == TRUE, the routine also computes asymmetric bias-corrected and accelerated 95% confidence intervals using 2000 bootstrap samples. Particularly useful when the honest sample is small-ish.

Hypothesis testing

inference_aggtree uses the standard errors obtained by fitting the linear models above to test the hypotheses that the GATEs are different across all pairs of leaves. Here, we adjust p-values to account for multiple hypotheses testing using Holm's procedure.

Average Characteristics

inference_aggtree regresses each covariate on a set of dummies denoting group membership. This way, we get the average characteristics of units in each leaf, together with a standard error. Leaves are ordered in increasing order of their predictions (from most negative to most positive). Standard errors are estimated via the Eicker-Huber-White estimator.

Caution on Inference

Regardless of the chosen method, both functions estimate the GATEs, the linear models, and the average characteristics of units in each group using only observations in the honest sample. If the honest sample is empty (this happens when the user either does not provide Y_hon, D_hon, and X_hon or sets them to NULL), the same data used to construct the tree are used to estimate the above quantities. This is fine for prediction but invalidates inference.

References

Di Francesco, R. (2022). Aggregation Trees. CEIS Research Paper, 546. doi:10.2139/ssrn.4304256 .

Author

Riccardo Di Francesco

Examples

## Generate data.
set.seed(1986)

n <- 1000
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))
D <- rbinom(n, size = 1, prob = 0.5)
mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]
Y <- mu0 + D * (mu1 - mu0) + rnorm(n)

## Training-honest sample split.
honest_frac <- 0.5
splits <- sample_split(length(Y), training_frac = (1 - honest_frac))
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx

Y_tr <- Y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]

Y_hon <- Y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]

## Construct sequence of groupings. CATEs estimated internally.
groupings <- build_aggtree(Y_tr, D_tr, X_tr, # Training sample.
                           Y_hon, D_hon, X_hon) # Honest sample.

## Alternatively, we can estimate the CATEs and pass them.
library(grf)
forest <- causal_forest(X_tr, Y_tr, D_tr) # Use training sample.
cates_tr <- predict(forest, X_tr)$predictions
cates_hon <- predict(forest, X_hon)$predictions

groupings <- build_aggtree(Y_tr, D_tr, X_tr, # Training sample.
                           Y_hon, D_hon, X_hon, # Honest sample.
                           cates_tr, cates_hon) # Predicted CATEs.

## We have compatibility with generic S3-methods.
summary(groupings)
#> Honest estimates: TRUE 
#> Call:
#> rpart::rpart(formula = cates ~ ., data = data.frame(cates = cates_tr, 
#>     X_tr), method = "anova", model = TRUE, control = rpart::rpart.control(...))
#>   n= 500 
#> 
#>           CP nsplit  rel error     xerror        xstd
#> 1 0.73492873      0 1.00000000 1.00426271 0.039701297
#> 2 0.18616919      1 0.26507127 0.26966430 0.011952010
#> 3 0.02841686      2 0.07890208 0.08421419 0.006999559
#> 4 0.01028333      3 0.05048522 0.05456498 0.003529531
#> 5 0.01000721      4 0.04020189 0.04544464 0.002907779
#> 6 0.01000000      5 0.03019468 0.04247355 0.002758597
#> 
#> Variable importance
#> x2 x1 x3 
#> 92  7  1 
#> 
#> Node number 1: 500 observations,    complexity param=0.7349287
#>   mean=-0.1288164, MSE=0.6432082 
#>   left son=2 (226 obs) right son=3 (274 obs)
#>   Primary splits:
#>       x2 < -0.1540737 to the left,  improve=0.734928700, (0 missing)
#>       x1 < -0.2090187 to the left,  improve=0.044737920, (0 missing)
#>       x3 < 0.07957051 to the right, improve=0.005591071, (0 missing)
#>   Surrogate splits:
#>       x1 < -0.2090187 to the left,  agree=0.582, adj=0.075, (0 split)
#>       x3 < -1.12924   to the left,  agree=0.554, adj=0.013, (0 split)
#> 
#> Node number 2: 226 observations,    complexity param=0.02841686
#>   mean=-1.331517, MSE=0.07323178 
#>   left son=4 (191 obs) right son=5 (35 obs)
#>   Primary splits:
#>       x2 < -0.380934  to the left,  improve=0.55219140, (0 missing)
#>       x1 < -0.7607663 to the left,  improve=0.16945840, (0 missing)
#>       x3 < -0.6784445 to the right, improve=0.01956249, (0 missing)
#> 
#> Node number 3: 274 observations,    complexity param=0.1861692
#>   mean=0.7279016, MSE=0.2507213 
#>   left son=6 (151 obs) right son=7 (123 obs)
#>   Primary splits:
#>       x2 < 0.6427372  to the left,  improve=0.87154070, (0 missing)
#>       x1 < -0.6991158 to the left,  improve=0.02507393, (0 missing)
#>       x3 < 0.1097978  to the right, improve=0.02332171, (0 missing)
#>   Surrogate splits:
#>       x1 < 0.9938645  to the left,  agree=0.569, adj=0.041, (0 split)
#>       x3 < 2.15878    to the left,  agree=0.558, adj=0.016, (0 split)
#> 
#> Node number 4: 191 observations,    complexity param=0.01000721
#>   mean=-1.50897, MSE=0.03120289 
#>   left son=8 (51 obs) right son=9 (140 obs)
#>   Primary splits:
#>       x1 < -0.6186045 to the left,  improve=0.54001560, (0 missing)
#>       x2 < -0.9480893 to the left,  improve=0.28070410, (0 missing)
#>       x3 < -0.4244729 to the right, improve=0.03014027, (0 missing)
#>   Surrogate splits:
#>       x2 < -2.860039  to the left,  agree=0.743, adj=0.039, (0 split)
#>       x3 < 2.618685   to the right, agree=0.743, adj=0.039, (0 split)
#> 
#> Node number 5: 35 observations
#>   mean=-0.4543899, MSE=0.04147575 
#> 
#> Node number 6: 151 observations
#>   mean=0.1031714, MSE=0.02774137 
#> 
#> Node number 7: 123 observations,    complexity param=0.01028333
#>   mean=1.464524, MSE=0.03769027 
#>   left son=14 (35 obs) right son=15 (88 obs)
#>   Primary splits:
#>       x2 < 0.9046113  to the left,  improve=0.7133799, (0 missing)
#>       x3 < 0.5597582  to the right, improve=0.2409727, (0 missing)
#>       x1 < -0.1088159 to the left,  improve=0.1054401, (0 missing)
#>   Surrogate splits:
#>       x3 < 0.5597582  to the right, agree=0.724, adj=0.029, (0 split)
#> 
#> Node number 8: 51 observations
#>   mean=-1.80463, MSE=0.01630033 
#> 
#> Node number 9: 140 observations
#>   mean=-1.401881, MSE=0.0136434 
#> 
#> Node number 14: 35 observations
#>   mean=0.599371, MSE=0.01761238 
#> 
#> Node number 15: 88 observations
#>   mean=1.859485, MSE=0.008094427 
#> 
print(groupings)
#> Honest estimates: TRUE 
#> n= 500 
#> 
#> node), split, n, deviance, yval
#>       * denotes terminal node
#> 
#>  1) root 500 321.6041000 -0.1288164  
#>    2) x2< -0.1540737 226  16.5503800 -1.3315170  
#>      4) x2< -0.380934 191   5.9597530 -1.5089700  
#>        8) x1< -0.6186045 51   0.8313171 -1.8046300 *
#>        9) x1>=-0.6186045 140   1.9100760 -1.4018810 *
#>      5) x2>=-0.380934 35   1.4516510 -0.4543899 *
#>    3) x2>=-0.1540737 274  68.6976200  0.7279016  
#>      6) x2< 0.6427372 151   4.1889470  0.1031714 *
#>      7) x2>=0.6427372 123   4.6359030  1.4645240  
#>       14) x2< 0.9046113 35   0.6164333  0.5993710 *
#>       15) x2>=0.9046113 88   0.7123096  1.8594850 *
plot(groupings) # Try also setting 'sequence = TRUE'.


## To predict, do the following.
tree <- subtree(groupings$tree, cv = TRUE) # Select by cross-validation.
head(predict(tree, data.frame(X_hon)))
#>          1          2          3          4          5          6 
#> -0.4543899 -1.4018805 -1.8046302  1.8594849 -1.8046302  0.1031714 

## Inference with 4 groups.
results <- inference_aggtree(groupings, n_groups = 4)

summary(results$model) # Coefficient of leafk is GATE in k-th leaf.
#> 
#> Call:
#> estimatr::lm_robust(formula = scores ~ 0 + leaf, data = data.frame(scores = scores, 
#>     leaf = leaves), se_type = "HC1")
#> 
#> Standard error type:  HC1 
#> 
#> Coefficients:
#>       Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
#> leaf1  -1.5090     0.1714 -8.8034 2.221e-17  -1.8457  -1.1722 496
#> leaf2  -0.4544     0.2827 -1.6072 1.087e-01  -1.0099   0.1011 496
#> leaf3   0.1032     0.1683  0.6131 5.401e-01  -0.2275   0.4338 496
#> leaf4   1.4645     0.1833  7.9880 9.657e-15   1.1043   1.8247 496
#> 
#> Multiple R-squared:  0.2332 ,	Adjusted R-squared:  0.227 
#> F-statistic: 36.07 on 4 and 496 DF,  p-value: < 2.2e-16

results$gates_diff_pairs$gates_diff # GATEs differences.
#>          leaf1     leaf2    leaf3 leaf4
#> leaf1       NA        NA       NA    NA
#> leaf2 1.054580        NA       NA    NA
#> leaf3 1.612141 0.5575613       NA    NA
#> leaf4 2.973494 1.9189138 1.361352    NA
results$gates_diff_pairs$holm_pvalues # leaves 1-2 not statistically different.
#>              [,1]         [,2]         [,3] [,4]
#> [1,]           NA           NA           NA   NA
#> [2,] 3.029458e-03           NA           NA   NA
#> [3,] 2.640297e-10 9.077709e-02           NA   NA
#> [4,] 6.595850e-28 8.481167e-08 2.141865e-07   NA

## LATEX.
print(results, table = "diff")
#> \begingroup
#>   \setlength{\tabcolsep}{8pt}
#>   \renewcommand{\arraystretch}{1.2}
#>   \begin{table}[b!]
#>     \centering
#>     \begin{adjustbox}{width = 1\textwidth}
#>     \begin{tabular}{@{\extracolsep{5pt}}l c c c c}
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex] 
#> 
#>       & \textit{Leaf 1} & \textit{Leaf 2} & \textit{Leaf 3} & \textit{Leaf 4} \\
#>       \addlinespace[2pt]
#>       \hline \\[-1.8ex] 
#> 
#>       \multirow{3}{*}{GATEs} & -1.509 & -0.454 &  0.103 &  1.465 \\
#>       & [-1.844, -1.174] & [-1.009,  0.101] & [-0.226,  0.432] & [ 1.106,  1.824] \\
#>       & \{NA, NA\} & \{NA, NA\} & \{NA, NA\} & \{NA, NA\} \\ 
#> 
#>       \addlinespace[2pt]
#>       \hline \\[-1.8ex] 
#> 
#>       \textit{Leaf 1} & NA & NA & NA & NA \\
#>             & (NA) & (NA) & (NA) & (NA) \\ 
#>       \textit{Leaf 2} & 1.05 &   NA &   NA & NA \\
#>             & (0.003) & (   NA) & (   NA) & (NA) \\ 
#>       \textit{Leaf 3} & 1.61 & 0.56 &   NA & NA \\
#>             & (0.000) & (0.091) & (   NA) & (NA) \\ 
#>       \textit{Leaf 4} & 2.97 & 1.92 & 1.36 & NA \\
#>             & (0.000) & (0.000) & (0.000) & (NA) \\ 
#> 
#>       \addlinespace[3pt]
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex]
#>     \end{tabular}
#>     \end{adjustbox}
#>     \caption{Point estimates and $95\%$ confidence intervals for the GATEs based on asymptotic normality (in square brackets) and on the percentiles of the bootstrap distribution (in curly braces). Leaves are sorted in increasing order of the GATEs. Additionally, the GATE differences across all pairs of leaves are displayed. $p$-values testing the null hypothesis that a single difference is zero are adjusted using Holm's procedure and reported in parenthesis under each point estimate.}
#>     \label{table_differences_gates}
#>     \end{table}
#> \endgroup 
#> 
print(results, table = "avg_char")
#> \begingroup
#>   \setlength{\tabcolsep}{8pt}
#>   \renewcommand{\arraystretch}{1.1}
#>   \begin{table}[b!]
#>     \centering
#>     \begin{adjustbox}{width = 1\textwidth}
#>     \begin{tabular}{@{\extracolsep{5pt}}l c c c c c c c c }
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex]
#>       & \multicolumn{2}{c}{\textit{Leaf 1}} & \multicolumn{2}{c}{\textit{Leaf 2}} & \multicolumn{2}{c}{\textit{Leaf 3}} & \multicolumn{2}{c}{\textit{Leaf 4}} \\\cmidrule{2-3} \cmidrule{4-5} \cmidrule{6-7} \cmidrule{8-9} 
#>       & Mean & (S.E.) & Mean & (S.E.) & Mean & (S.E.) & Mean & (S.E.) \\
#>       \addlinespace[2pt]
#>       \hline \\[-1.8ex] 
#> 
#>       \texttt{x1} & -0.018 & (0.073) &  0.051 & (0.179) & -0.021 & (0.079) & -0.023 & (0.082) \\ 
#>       \texttt{x2} & -1.056 & (0.044) & -0.261 & (0.013) &  0.226 & (0.017) &  1.272 & (0.045) \\ 
#>       \texttt{x3} &  0.043 & (0.075) & -0.114 & (0.154) & -0.149 & (0.081) &  0.029 & (0.088) \\ 
#> 
#>       \addlinespace[3pt]
#>       \\[-1.8ex]\hline
#>       \hline \\[-1.8ex]
#>     \end{tabular}
#>     \end{adjustbox}
#>     \caption{Average characteristics of units in each leaf, obtained by regressing each covariate on a set of dummies denoting leaf membership . Standard errors are estimated via the Eicker-Huber-White estimator. Leaves are sorted in increasing order of the GATEs.}
#>     \label{table_average_characteristics_leaves}
#>     \end{table}
#> \endgroup 
#>