Estimates conditional choice probabilities for ordered non-numeric outcomes.
ordered_ml(Y = NULL, X = NULL, learner = "forest", scale = TRUE)
Returns an object of class oml.
Ordered machine learning expresses conditional choice probabilities as the difference between the cumulative probabilities of two adjacent classes, which in turn can be expressed as conditional expectations of binary variables:
$$p_m \left( X_i \right) = \mathbb{E} \left[ 1 \left( Y_i \leq m \right) | X_i \right] - \mathbb{E} \left[ 1 \left( Y_i \leq m - 1 \right) | X_i \right]$$
We can then estimate each expectation separately using any regression algorithm and take the difference between the m-th and the (m-1)-th estimated surfaces to estimate the conditional probabilities.
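As a minimal sketch of this differencing strategy, the snippet below uses base R logistic regressions as a stand-in for "any regression algorithm" on simulated data. The data-generating process, the variable names, and the final clipping and renormalization step are assumptions made for illustration; this is not the ordered_ml implementation.

```r
## Illustrative sketch only: glm stands in for an arbitrary regression algorithm.
set.seed(1986)
n <- 500
x <- rnorm(n)
latent <- x + rlogis(n)
Y <- cut(latent, breaks = c(-Inf, -1, 1, Inf), labels = FALSE)  # classes 1, 2, 3
M <- max(Y)
dat <- data.frame(x = x)

## Estimate E[1(Y <= m) | X] separately for m = 1, ..., M - 1.
cum_probs <- sapply(seq_len(M - 1), function(m) {
  fit <- glm(as.integer(Y <= m) ~ x, family = binomial(), data = dat)
  predict(fit, newdata = dat, type = "response")
})

## Boundary surfaces: 1(Y <= 0) = 0 and 1(Y <= M) = 1 by construction.
cum_probs <- cbind(0, cum_probs, 1)

## Difference adjacent surfaces to obtain one probability per class.
probs <- cum_probs[, -1, drop = FALSE] - cum_probs[, -ncol(cum_probs), drop = FALSE]

## Separate fits need not be monotone in m, so clip negative values and
## renormalize each row (an assumption of this sketch).
probs <- pmax(probs, 0)
probs <- probs / rowSums(probs)
colnames(probs) <- paste0("P(Y=", seq_len(M), ")")
head(probs)
```

Because each cumulative surface is fitted on its own, nothing forces the m-th surface to lie above the (m-1)-th one, which is why the sketch guards against negative differences before normalizing.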
ordered_ml combines this strategy with either regression forests or penalized logistic regressions with an L1 penalty, according to the user-specified parameter learner.
If learner == "forest", then the orf function is called from an external package, as this estimator has already been proposed by Lechner and Okasa (2019).
If learner == "l1", the penalty parameters are chosen via 10-fold cross-validation and model.matrix is used to handle non-numeric covariates. Additionally, if scale == TRUE, the covariates are scaled to have zero mean and unit variance.
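The covariate handling described for learner == "l1" can be illustrated with base R alone. The sketch below uses a hypothetical data frame (the columns age, income, and region are made up) and shows how model.matrix expands a factor into dummy columns and how scale standardizes the result. It is an assumption about the general technique, not a copy of the package's internal code.

```r
## Illustrative sketch only: hypothetical covariates, base R preprocessing.
set.seed(1986)
X <- data.frame(
  age = rnorm(50, mean = 40, sd = 10),
  income = rexp(50),
  region = factor(rep_len(c("north", "south", "west"), 50))
)

## model.matrix expands the factor into dummy columns; drop the intercept.
X_design <- model.matrix(~ ., data = X)[, -1, drop = FALSE]

## Standardize each column to zero mean and unit variance.
X_scaled <- scale(X_design)

colnames(X_scaled)
round(colMeans(X_scaled), 10)
apply(X_scaled, 2, sd)

## X_scaled could then be passed to an L1-penalized logistic learner with a
## cross-validated penalty, for example glmnet::cv.glmnet(..., family = "binomial",
## alpha = 1, nfolds = 10), assuming the glmnet package is installed.
```

Standardizing matters for an L1 penalty because the penalty acts on coefficient magnitudes, which depend on the units of each covariate.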
Di Francesco, R. (2025). Ordered Correlation Forest. Econometric Reviews, 1–17. doi:10.1080/07474938.2024.2429596
## Generate synthetic data.
set.seed(1986)
data <- generate_ordered_data(100)
sample <- data$sample
Y <- sample$Y
X <- sample[, -1]
## Training-test split.
train_idx <- sample(seq_len(length(Y)), floor(length(Y) * 0.5))
Y_tr <- Y[train_idx]
X_tr <- X[train_idx, ]
Y_test <- Y[-train_idx]
X_test <- X[-train_idx, ]
## Fit ordered machine learning on training sample using two different learners.
ordered_forest <- ordered_ml(Y_tr, X_tr, learner = "forest")
ordered_l1 <- ordered_ml(Y_tr, X_tr, learner = "l1")
## Predict out of sample.
predictions_forest <- predict(ordered_forest, X_test)
predictions_l1 <- predict(ordered_l1, X_test)
## Compare predictions.
cbind(head(predictions_forest), head(predictions_l1))
#>         P(Y=1)    P(Y=2)     P(Y=3)     P(Y=1)    P(Y=2)     P(Y=3)
#> [1,] 0.4208750 0.4629083 0.11621667 0.29341958 0.6154394 0.09114097
#> [2,] 0.4666167 0.4243833 0.10900000 0.33450837 0.5539351 0.11155656
#> [3,] 0.1424000 0.3843583 0.47324167 0.07811249 0.3029698 0.61891769
#> [4,] 0.6361333 0.3146500 0.04921667 0.63353157 0.3260411 0.04042733
#> [5,] 0.4543667 0.3177833 0.22785000 0.29489112 0.4336047 0.27150420
#> [6,] 0.6192250 0.3248750 0.05590000 0.37336853 0.5365585 0.09007298