Generate a synthetic data set with an ordered non-numeric outcome, together with conditional probabilities and covariates' marginal effects.

generate_ordered_data(n)

Arguments

n

Sample size.

Value

A list with three elements: sample, a data frame with the observed data; true_probs, a matrix of true conditional probabilities; and me_at_mean, a matrix of true marginal effects at the mean of the covariates.

Details

First, a latent outcome is generated as follows:

$$Y_i^* = g ( X_i ) + \epsilon_i$$

with:

$$g ( X_i ) = X_i^T \beta$$

$$X_i := (X_{i, 1}, X_{i, 2}, X_{i, 3}, X_{i, 4}, X_{i, 5}, X_{i, 6})$$

$$X_{i, 1}, X_{i, 3}, X_{i, 5} \sim \mathcal{N} \left( 0, 1 \right)$$

$$X_{i, 2}, X_{i, 4}, X_{i, 6} \sim \textit{Bernoulli} \left( 0, 1 \right)$$

$$\beta = \left( 1, 1, 1/2, 1/2, 0, 0 \right)$$

$$\epsilon_i \sim \textit{Logistic} \left( 0, 1 \right)$$
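
The latent-outcome step above can be sketched in a few lines of base R. This is an illustrative sketch, not the package's internal code: the helper simulate_latent and its argument p_binary are hypothetical names, and the success probability of the binary covariates is an assumed value, since it is not spelled out above.

```r
# Illustrative sketch of the latent-outcome step (not the package's own code).
# `simulate_latent` and `p_binary` are hypothetical; p_binary = 0.5 is an
# assumed success probability for the binary covariates.
simulate_latent <- function(n, p_binary = 0.5) {
  beta <- c(1, 1, 1 / 2, 1 / 2, 0, 0)

  # Odd-numbered covariates are standard normal, even-numbered ones are binary.
  X <- cbind(
    x1 = rnorm(n),
    x2 = rbinom(n, size = 1, prob = p_binary),
    x3 = rnorm(n),
    x4 = rbinom(n, size = 1, prob = p_binary),
    x5 = rnorm(n),
    x6 = rbinom(n, size = 1, prob = p_binary)
  )

  # Linear index g(X_i) = X_i' beta plus standard logistic noise.
  g <- drop(X %*% beta)
  y_star <- g + rlogis(n, location = 0, scale = 1)

  list(X = X, g = g, y_star = y_star)
}

set.seed(1986)
latent <- simulate_latent(1000)
```

Because the last two entries of beta are zero, x5 and x6 do not enter the latent outcome and act as pure noise covariates.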

Second, the observed outcomes are obtained by discretizing the latent outcome into three classes using uniformly spaced threshold parameters.
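
Written out, the discretization step can be sketched as follows. The symbols for the two threshold parameters and the boundary conventions are notation assumed here for illustration; the threshold values themselves are not stated in this documentation, and which side of each inequality is strict is a matter of convention.

$$Y_i = m \iff \zeta_{m-1} < Y_i^* \leq \zeta_m, \quad m = 1, 2, 3$$

$$- \infty = \zeta_0 < \zeta_1 < \zeta_2 < \zeta_3 = + \infty$$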

Third, the conditional probabilities and the covariates' marginal effects at the mean are generated using standard textbook formulas. Marginal effects are approximated using a sample of 1,000,000 observations.
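
For reference, the textbook ordered logit formulas take the following form. This is a sketch under assumed notation that may differ from the package's internals: thresholds are written $\zeta_1 < \zeta_2$ with $\zeta_0 = -\infty$ and $\zeta_3 = +\infty$, $\Lambda$ is the standard logistic distribution function, $\lambda$ is its density, and $\bar{x}$ is the mean of the covariates.

$$P ( Y_i = m | X_i ) = \Lambda ( \zeta_m - X_i^T \beta ) - \Lambda ( \zeta_{m-1} - X_i^T \beta )$$

$$\frac{\partial P ( Y_i = m | X_i = \bar{x} )}{\partial x_j} = \beta_j \left[ \lambda ( \zeta_{m-1} - \bar{x}^T \beta ) - \lambda ( \zeta_m - \bar{x}^T \beta ) \right]$$

The derivative applies to continuous covariates. For a binary covariate, a common textbook convention replaces it with the difference in class probabilities between $x_j = 1$ and $x_j = 0$, holding the other covariates at their means; whether that convention is used here is not stated. In either case, the marginal effects of a given covariate sum to zero across the three classes, because the class probabilities always sum to one.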

References

  • Di Francesco, R. (2023). Ordered Correlation Forest. arXiv preprint arXiv:2309.08755.

Author

Riccardo Di Francesco

Examples

## Load the package.
library(ocf)

## Generate synthetic data.
set.seed(1986)

data <- generate_ordered_data(1000)

head(data$true_probs)
#>         P(Y=1)    P(Y=2)    P(Y=3)
#> [1,] 0.1780543 0.3900898 0.4318559
#> [2,] 0.4252436 0.3927161 0.1820403
#> [3,] 0.3558784 0.4145215 0.2296001
#> [4,] 0.3128062 0.4215497 0.2656441
#> [5,] 0.3413966 0.4175279 0.2410755
#> [6,] 0.4785603 0.3693182 0.1521215
data$me_at_mean
#>       P'(Y=1)       P'(Y=2)   P'(Y=3)
#> x1 -0.2051655 -0.0003337965 0.2054993
#> x2 -0.1947837 -0.0166079075 0.2113916
#> x3 -0.1025828 -0.0001668983 0.1027497
#> x4 -0.1001678 -0.0044354927 0.1046033
#> x5  0.0000000  0.0000000000 0.0000000
#> x6  0.0000000  0.0000000000 0.0000000

sample <- data$sample
Y <- sample$Y
X <- sample[, -1]

## Fit ocf.
forests <- ocf(Y, X)