Obtains point estimates and standard errors for the group average treatment effects (GATEs), where groups correspond to the leaves of an rpart object. Additionally, tests whether the GATEs differ across all pairs of leaves.

causal_ols_rpart(
  tree,
  Y,
  D,
  X,
  method = "aipw",
  scores = NULL,
  boot_ci = FALSE,
  boot_R = 2000
)

Arguments

tree

An rpart object.

Y

Outcome vector.

D

Treatment assignment vector.

X

Covariate matrix (no intercept).

method

Either "raw" or "aipw", defines the outcome used in the regression.

scores

Optional, vector of scores to be used in the regression. Useful to save computational time if scores have already been estimated. Ignored if method == "raw".

boot_ci

Logical, whether to compute bootstrap confidence intervals.

boot_R

Number of bootstrap replications. Ignored if boot_ci == FALSE.

Value

A list storing:

model

The model fitted to get point estimates and standard errors for the GATEs, as an lm_robust object.

gates_diff_pairs

Results of testing whether GATEs differ across all pairs of leaves. This is a list storing GATE differences and p-values adjusted using Holm's procedure (see p.adjust). NULL if the tree consists of a root only.

boot_ci

Bootstrap confidence intervals (an empty list if boot_ci == FALSE).

scores

Vector of doubly robust scores. NULL if method == 'raw'.

Details

Point estimates and standard errors for the GATEs

The GATEs and their standard errors are obtained by fitting an appropriate linear model. If method == "raw", we estimate via OLS the following:

$$Y_i = \sum_{l = 1}^{|T|} L_{i, l} \gamma_l + \sum_{l = 1}^{|T|} L_{i, l} D_i \beta_l + \epsilon_i$$

with L_{i, l} a dummy variable equal to one if the i-th unit falls in the l-th leaf of tree, and |T| the number of groups. If the treatment is randomly assigned, one can show that the betas identify the GATE in each leaf. However, this is not true in observational studies due to selection into treatment. In this case, the user is expected to use method == "aipw" to run the following regression:

$$score_i = \sum_{l = 1}^{|T|} L_{i, l} \beta_l + \epsilon_i$$

where score_i are doubly robust scores constructed via honest regression forests and 5-fold cross-fitting (unless the user supplies the scores argument). This way, the betas again identify the GATEs.
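A minimal sketch of how such scores can be constructed, assuming the honest sample Y, D, X and regression forests from the grf package; the helper name dr_scores_sketch is hypothetical and the package's internal implementation (e.g., its explicit 5-fold cross-fitting) may differ:

library(grf)

## Hypothetical helper: AIPW scores from honest regression forests.
dr_scores_sketch <- function(Y, D, X) {
  ## Propensity scores e(x) = P(D = 1 | X = x); out-of-bag predictions
  ## act as cross-fitted estimates.
  e_hat <- predict(regression_forest(X, D))$predictions

  ## Conditional means mu_d(x) = E[Y | D = d, X = x], fitted by arm.
  forest_0 <- regression_forest(X[D == 0, , drop = FALSE], Y[D == 0])
  forest_1 <- regression_forest(X[D == 1, , drop = FALSE], Y[D == 1])
  mu0_hat <- predict(forest_0, X)$predictions
  mu1_hat <- predict(forest_1, X)$predictions

  ## Doubly robust (AIPW) scores.
  mu1_hat - mu0_hat +
    D * (Y - mu1_hat) / e_hat -
    (1 - D) * (Y - mu0_hat) / (1 - e_hat)
}

scores <- dr_scores_sketch(Y, D, X)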

Regardless of method, standard errors are estimated via the Eicker-Huber-White estimator.
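A minimal sketch of the two regressions, assuming a factor leaves recording each honest unit's leaf and, for the second fit, precomputed doubly robust scores; the lm_robust call with se_type = "HC1" mirrors the fitted call shown in the Examples below:

library(estimatr)

## method == "raw": leaf dummies plus leaf-by-treatment interactions.
fit_raw <- lm_robust(Y ~ 0 + leaf + D:leaf,
                     data = data.frame(Y = Y, leaf = leaves, D = D),
                     se_type = "HC1")

## method == "aipw": regress the doubly robust scores on the leaf dummies.
fit_aipw <- lm_robust(scores ~ 0 + leaf,
                      data = data.frame(scores = scores, leaf = leaves),
                      se_type = "HC1")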

If boot_ci == TRUE, the routine also computes asymmetric bias-corrected and accelerated (BCa) 95% confidence intervals using boot_R bootstrap samples.
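A minimal sketch of a BCa interval for a single leaf's GATE, assuming the scores and leaves objects from the sketches above; the routine itself may bootstrap all leaves jointly:

library(boot)

## GATE in the first leaf, recomputed on each bootstrap resample.
gate_first_leaf <- function(data, idx) {
  d <- data[idx, ]
  mean(d$scores[d$leaf == levels(d$leaf)[1]])
}

boot_out <- boot(data.frame(scores = scores, leaf = leaves),
                 gate_first_leaf, R = 2000)
boot.ci(boot_out, conf = 0.95, type = "bca")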

If tree consists of a root only, causal_ols_rpart regresses Y on a constant and D if method == "raw", or regresses the doubly robust scores on a constant if method == "aipw". This way, we get an estimate of the overall average treatment effect.

Hypothesis testing

causal_ols_rpart uses the standard errors obtained by fitting the linear models above to test the hypotheses that the GATEs differ across all pairs of leaves. p-values are adjusted to account for multiple hypothesis testing using Holm's procedure.
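A minimal sketch of this testing step, assuming hypothetical vectors gates and ses of leaf-level point estimates and standard errors and a normal approximation to the test statistic:

## Pairwise z-statistics from leaf-level estimates and standard errors.
n_leaves <- length(gates)
pvalues <- c()
for (l1 in seq_len(n_leaves - 1)) {
  for (l2 in seq(l1 + 1, n_leaves)) {
    z <- (gates[l1] - gates[l2]) / sqrt(ses[l1]^2 + ses[l2]^2)
    pvalues <- c(pvalues, 2 * pnorm(-abs(z)))
  }
}

## Holm's adjustment for multiple hypothesis testing.
holm_pvalues <- p.adjust(pvalues, method = "holm")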

Caution on Inference

"honesty" is a necessary requirement to get valid inference. Thus, observations in Y, D, and X must not have been used to construct the tree and the scores.

Author

Riccardo Di Francesco

Examples

## Generate data.
set.seed(1986)

n <- 1000
k <- 3

X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))
D <- rbinom(n, size = 1, prob = 0.5)
mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]
Y <- mu0 + D * (mu1 - mu0) + rnorm(n)

## Split the sample.
splits <- sample_split(length(Y), training_frac = 0.5)
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx

Y_tr <- Y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]

Y_hon <- Y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]

## Construct a tree using training sample.
library(rpart)
tree <- rpart(Y ~ ., data = data.frame("Y" = Y_tr, X_tr), maxdepth = 2)

## Estimate GATEs in each leaf using honest sample.
results <- causal_ols_rpart(tree, Y_hon, D_hon, X_hon, method = "raw")

summary(results$model) # Coefficient of leafk:D is GATE in k-th leaf.
#> 
#> Call:
#> estimatr::lm_robust(formula = Y ~ 0 + leaf + D:leaf, data = data.frame(Y = Y, 
#>     leaf = leaves, D = D), se_type = "HC1")
#> 
#> Standard error type:  HC1 
#> 
#> Coefficients:
#>          Estimate Std. Error  t value  Pr(>|t|) CI Lower CI Upper  DF
#> leaf1   -0.036996    0.12751 -0.29014 7.718e-01  -0.2875   0.2135 492
#> leaf2    0.477702    0.19133  2.49681 1.286e-02   0.1018   0.8536 492
#> leaf3   -0.001513    0.09483 -0.01595 9.873e-01  -0.1878   0.1848 492
#> leaf4    0.814105    0.21189  3.84217 1.379e-04   0.3978   1.2304 492
#> leaf1:D -1.508032    0.20002 -7.53928 2.293e-13  -1.9010  -1.1150 492
#> leaf2:D -1.110904    0.27575 -4.02865 6.497e-05  -1.6527  -0.5691 492
#> leaf3:D  0.555771    0.14146  3.92889 9.755e-05   0.2778   0.8337 492
#> leaf4:D  1.003031    0.35536  2.82259 4.957e-03   0.3048   1.7012 492
#> 
#> Multiple R-squared:  0.2944 ,	Adjusted R-squared:  0.2829 
#> F-statistic: 25.03 on 8 and 492 DF,  p-value: < 2.2e-16

results$gates_diff_pairs$gates_diff # GATE differences.
#>           leaf1    leaf2     leaf3 leaf4
#> leaf1        NA       NA        NA    NA
#> leaf2 0.3971281       NA        NA    NA
#> leaf3 2.0638027 1.666675        NA    NA
#> leaf4 2.5110626 2.113935 0.4472599    NA
results$gates_diff_pairs$holm_pvalues # leaves 1-2 and 3-4 not statistically different.
#>              [,1]         [,2]      [,3] [,4]
#> [1,]           NA           NA        NA   NA
#> [2,] 4.856388e-01           NA        NA   NA
#> [3,] 2.413800e-15 4.669980e-07        NA   NA
#> [4,] 7.666574e-09 1.015841e-05 0.4856388   NA
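
## For observational data, the doubly robust variant can be used on the same
## honest sample; here treatment is randomized, so this is for illustration
## only. With method = "aipw", scores are estimated internally unless
## supplied via the scores argument.
results_aipw <- causal_ols_rpart(tree, Y_hon, D_hon, X_hon, method = "aipw")
summary(results_aipw$model) # Coefficient of leafk is GATE in k-th leaf.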