Generate Qualitative Data (Selection-on-Observables)

Generate a synthetic data set with qualitative outcomes under a selection-on-observables design. The data include a binary treatment indicator and a matrix of covariates. Depending on the assignment argument, the treatment is either unconditionally independent of potential outcomes or independent of them conditional on the covariates.

generate_qualitative_data_soo(n, assignment, outcome_type)

Arguments

n: Sample size.
assignment: String controlling treatment assignment. Must be either "randomized" (random assignment) or "observational" (random assignment conditional on the generated covariates).
outcome_type: String controlling the outcome type. Must be either "multinomial" or "ordered". Affects how potential outcomes are generated.

Value

A list storing a data frame with the observed data, the true propensity score, and the true probabilities of shift.

Details

Outcome type

Potential outcomes are generated differently according to outcome_type. If outcome_type == "multinomial", generate_qualitative_data_soo computes linear predictors for each class using the covariates:

$$\eta_{mi} (d) = \beta_{m1}^d X_{i1} + \beta_{m2}^d X_{i2} + \beta_{m3}^d X_{i3}, \quad d = 0, 1,$$

and then transforms $\eta_{mi} (d)$ into valid probability distributions using the softmax function:

$$P(Y_i(d) = m | X_i) = \frac{\exp(\eta_{mi} (d))}{\sum_{m'} \exp(\eta_{m'i}(d))}, \quad d = 0, 1.$$

It then generates potential outcomes $Y_i(1)$ and $Y_i(0)$ by sampling from {1, 2, 3} using $P(Y_i(d) = m | X_i), \, d = 0, 1$.
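The multinomial steps above can be sketched in a few lines of R. This is only an illustration of the softmax-and-sample recipe, not the package's internal code: the coefficient matrix beta0 is hypothetical (this page does not report the values of $\beta_{mj}^d$), and only the $d = 0$ case is shown.

```r
set.seed(1)
n <- 5

# Three independent U(0,1) covariates, as described on this page.
X <- matrix(runif(n * 3), nrow = n, ncol = 3)

# Hypothetical coefficients for d = 0 (assumption, not the package's values):
# rows index covariates, columns index outcome classes.
beta0 <- matrix(c(1, 0, 0,
                  0, 1, 0,
                  0, 0, 1), nrow = 3, byrow = TRUE)

# Linear predictors eta_{mi}(0), stored as an n x 3 matrix.
eta0 <- X %*% beta0

# Row-wise softmax; subtracting the row maximum avoids numerical overflow.
softmax_rows <- function(eta) {
  z <- exp(eta - apply(eta, 1, max))
  z / rowSums(z)
}
probs0 <- softmax_rows(eta0)

# Draw Y_i(0) from {1, 2, 3} using unit-specific class probabilities.
Y0 <- apply(probs0, 1, function(p) sample(1:3, size = 1, prob = p))
```

Repeating the same steps with a second (again hypothetical) coefficient matrix for $d = 1$ would give $Y_i(1)$.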

If instead outcome_type == "ordered", generate_qualitative_data_soo first generates latent potential outcomes:

$$Y_i^* (d) = \tau d + X_{i1} + X_{i2} + X_{i3} + N (0, 1), \quad d = 0, 1,$$

with $\tau = 2$. It then constructs $Y_i (d)$ by discretizing $Y_i^* (d)$ using threshold parameters $\zeta_1 = 2$ and $\zeta_2 = 4$, with the conventions $\zeta_0 = -\infty$ and $\zeta_3 = +\infty$. Then,

$$P(Y_i(d) = m | X_i) = P(\zeta_{m-1} < Y_i^*(d) \leq \zeta_m | X_i) = \Phi (\zeta_m - \sum_j X_{ij} - \tau d) - \Phi (\zeta_{m-1} - \sum_j X_{ij} - \tau d), \quad d = 0, 1,$$

which allows us to analytically compute the probabilities of shift.
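As a rough sketch of this analytical computation, the R code below evaluates the class probabilities from the formula above, using the stated values of $\tau$, $\zeta_1$, and $\zeta_2$. It assumes that the probability of shift for class $m$ is $P(Y_i(1) = m) - P(Y_i(0) = m)$, averaged over the sample; this page does not spell out that definition. The helper class_probs is hypothetical, and the code is not taken from the package, so its numbers need not match the package's output.

```r
set.seed(1)
n <- 1000
tau <- 2
zeta <- c(-Inf, 2, 4, Inf)  # zeta_0, zeta_1, zeta_2, zeta_3

# Three independent U(0,1) covariates and their row sums.
X <- matrix(runif(n * 3), nrow = n, ncol = 3)
index <- rowSums(X)

# P(Y_i(d) = m | X_i) for m = 1, 2, 3, returned as an n x 3 matrix.
# Hypothetical helper, written directly from the formula on this page.
class_probs <- function(d) {
  sapply(1:3, function(m) {
    pnorm(zeta[m + 1] - index - tau * d) - pnorm(zeta[m] - index - tau * d)
  })
}

# Assumed definition: shift for class m is P(Y(1) = m) - P(Y(0) = m),
# averaged over the covariate draws.
pshifts <- colMeans(class_probs(1) - class_probs(0))
```

Because the class probabilities sum to one under both treatment states, the three shifts computed this way sum to zero.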

Treatment assignment

Treatment is always assigned as $D_i \sim \text{Bernoulli}(\pi(X_i))$. If assignment == "randomized", then the propensity score is specified as $\pi(X_i) = P ( D_i = 1 | X_i) = 0.5$. If instead assignment == "observational", then $\pi(X_i) = (X_{i1} + X_{i3}) / 2$.
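The two assignment rules can be illustrated with a minimal R sketch. The variable names (pscore_randomized, pscore_observational, D) are chosen here for illustration and are not necessarily those used inside the package.

```r
set.seed(1)
n <- 1000

# Three independent U(0,1) covariates.
X <- matrix(runif(n * 3), nrow = n, ncol = 3)

# Propensity score under each design.
pscore_randomized <- rep(0.5, n)
pscore_observational <- (X[, 1] + X[, 3]) / 2

# D_i ~ Bernoulli(pi(X_i)), shown here for the observational design.
D <- rbinom(n, size = 1, prob = pscore_observational)
```

Since the covariates lie in the unit interval, the observational propensity score is always a valid probability.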

Other details

The function always generates three independent covariates from $U(0,1)$. Observed outcomes $Y_i$ are always constructed using the usual observational rule $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$.

If assignment == "observational", the propensity score depends only on $X_{i1}$ and $X_{i3}$, so controlling for these two covariates is sufficient for selection-on-observables to hold.

Author

Riccardo Di Francesco

Examples

## Generate synthetic data.
set.seed(1986)

data <- generate_qualitative_data_soo(100,
                                      assignment = "observational",
                                      outcome_type = "ordered")

## Inspect the true probabilities of shift.
data$pshifts
#> [1] -0.577876162 -0.006807437  0.584683599