R/rpart-functions.R
causal_ols_rpart.Rd
Obtains point estimates and standard errors for the group average treatment effects (GATEs), where groups correspond to the
leaves of an rpart
object. Additionally, performs some hypothesis testing.
causal_ols_rpart(
tree,
Y,
D,
X,
method = "aipw",
scores = NULL,
boot_ci = FALSE,
boot_R = 2000
)
An rpart
object.
Outcome vector.
Treatment assignment vector
Covariate matrix (no intercept).
Either "raw"
or "aipw"
, defines the outcome used in the regression.
Optional, vector of scores to be used in the regression. Useful to save computational time if scores have already been estimated. Ignored if method == "raw"
.
Logical, whether to compute bootstrap confidence intervals.
Number of bootstrap replications. Ignored if boot_ci == FALSE
.
A list storing:
model
The model fitted to get point estimates and standard errors for the GATEs, as an lm_robust
object.
gates_diff_pairs
Results of testing whether GATEs differ across all pairs of leaves. This is a list storing GATEs differences and p-values adjusted using Holm's procedure (check p.adjust
). NULL
if the tree consists of a root only.
boot_ci
Bootstrap confidence intervals (this is an empty list if boot_ci == FALSE
.
scores
Vector of doubly robust scores. NULL
if method == 'raw'
.
The GATEs and their standard errors are obtained by fitting an appropriate linear model. If method == "raw"
, we
estimate via OLS the following:
$$Y_i = \sum_{l = 1}^{|T|} L_{i, l} \gamma_l + \sum_{l = 1}^{|T|} L_{i, l} D_i \beta_l + \epsilon_i$$
with L_{i, l}
a dummy variable equal to one if the i-th unit falls in the l-th leaf of tree
, and |T| the number of
groups. If the treatment is randomly assigned, one can show that the betas identify the GATE in each leaf. However, this is not true
in observational studies due to selection into treatment. In this case, the user is expected to use method == "aipw"
to run
the following regression:
$$score_i = \sum_{l = 1}^{|T|} L_{i, l} \beta_l + \epsilon_i$$
where score_i are doubly-robust scores constructed via honest regression forests and 5-fold cross fitting (unless the user specifies
the argument scores
). This way, betas again identify the GATEs.
Regardless of method
, standard errors are estimated via the Eicker-Huber-White estimator.
If boot_ci == TRUE
, the routine also computes asymmetric bias-corrected and accelerated 95% confidence intervals using 2000 bootstrap
samples.
If tree
consists of a root only, causal_ols_rpart
regresses y
on a constant and D
if
method == "raw"
, or regresses the doubly-robust scores on a constant if method == "aipw"
. This way,
we get an estimate of the overall average treatment effect.
Di Francesco, R. (2022). Aggregation Trees. CEIS Research Paper, 546. doi:10.2139/ssrn.4304256 .
## Generate data.
set.seed(1986)
n <- 1000
k <- 3
X <- matrix(rnorm(n * k), ncol = k)
colnames(X) <- paste0("x", seq_len(k))
D <- rbinom(n, size = 1, prob = 0.5)
mu0 <- 0.5 * X[, 1]
mu1 <- 0.5 * X[, 1] + X[, 2]
Y <- mu0 + D * (mu1 - mu0) + rnorm(n)
## Split the sample.
splits <- sample_split(length(Y), training_frac = 0.5)
training_idx <- splits$training_idx
honest_idx <- splits$honest_idx
Y_tr <- Y[training_idx]
D_tr <- D[training_idx]
X_tr <- X[training_idx, ]
Y_hon <- Y[honest_idx]
D_hon <- D[honest_idx]
X_hon <- X[honest_idx, ]
## Construct a tree using training sample.
library(rpart)
tree <- rpart(Y ~ ., data = data.frame("Y" = Y_tr, X_tr), maxdepth = 2)
## Estimate GATEs in each node (internal and terminal) using honest sample.
results <- causal_ols_rpart(tree, Y_hon, D_hon, X_hon, method = "raw")
summary(results$model) # Coefficient of leafk:D is GATE in k-th leaf.
#>
#> Call:
#> estimatr::lm_robust(formula = Y ~ 0 + leaf + D:leaf, data = data.frame(Y = Y,
#> leaf = leaves, D = D), se_type = "HC1")
#>
#> Standard error type: HC1
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
#> leaf1 -0.036996 0.12751 -0.29014 7.718e-01 -0.2875 0.2135 492
#> leaf2 0.477702 0.19133 2.49681 1.286e-02 0.1018 0.8536 492
#> leaf3 -0.001513 0.09483 -0.01595 9.873e-01 -0.1878 0.1848 492
#> leaf4 0.814105 0.21189 3.84217 1.379e-04 0.3978 1.2304 492
#> leaf1:D -1.508032 0.20002 -7.53928 2.293e-13 -1.9010 -1.1150 492
#> leaf2:D -1.110904 0.27575 -4.02865 6.497e-05 -1.6527 -0.5691 492
#> leaf3:D 0.555771 0.14146 3.92889 9.755e-05 0.2778 0.8337 492
#> leaf4:D 1.003031 0.35536 2.82259 4.957e-03 0.3048 1.7012 492
#>
#> Multiple R-squared: 0.2944 , Adjusted R-squared: 0.2829
#> F-statistic: 25.03 on 8 and 492 DF, p-value: < 2.2e-16
results$gates_diff_pair$gates_diff # GATEs differences.
#> leaf1 leaf2 leaf3 leaf4
#> leaf1 NA NA NA NA
#> leaf2 0.3971281 NA NA NA
#> leaf3 2.0638027 1.666675 NA NA
#> leaf4 2.5110626 2.113935 0.4472599 NA
results$gates_diff_pair$holm_pvalues # leaves 1-2 and 3-4 not statistically different.
#> [,1] [,2] [,3] [,4]
#> [1,] NA NA NA NA
#> [2,] 4.856388e-01 NA NA NA
#> [3,] 2.413800e-15 4.669980e-07 NA NA
#> [4,] 7.666574e-09 1.015841e-05 0.4856388 NA