Predictive modeling for discrete outcomes

Riccardo Di Francesco

University of Southern Denmark

Introduction

Linear model for continuous outcomes

So far, you learnt how to (linearly) model continuous outcomes:

\[ Y_i = \underbrace{\boldsymbol{X}_i^\top \boldsymbol{\beta}}_{\mathbb E[Y_i | \boldsymbol{X}_i]} + \epsilon_i. \]

However, \(Y_i\) is often discrete.

Linear models are inappropriate for predictive modeling of discrete outcomes

We want our model to be consistent with the data. A linear model here produces:
  • Predictions out of range.
  • Fractional predictions (might not be problematic for binary outcomes).

Binary outcomes\(—\)Unconditional model

Expectation \(=\) probability of success

Let \(Y_i \in \{0, 1\}\) be a binary outcome.

🪙 Coin flip (\(1=\) heads)
🧪 Test result (\(1=\) positive)
🎓 Exam outcome (\(1=\) pass)

The following equality holds:

\[ \mathbb E[Y_i] = 1 \cdot \color{red}{\mathbb P (Y_i = 1)} + 0 \cdot \mathbb P (Y_i = 0). \tag{1}\]

Thus modeling \(\mathbb E[Y_i]\) is equivalent to modeling the probability of success.

Bernoulli distribution to model binary outcomes

A natural (unconditional) model for binary outcomes is the Bernoulli distribution, indexed by a single parameter \(\pi \in (0, 1)\)\(—\)the probability of success.

\[ Y_i \sim Bernoulli(\pi).\]


Useful properties are:

\(\mathbb P(Y_i = y )=\pi^y (1 - \pi)^{1 - y}\).
\(\mathbb E[Y_i]=\pi\).
\(\mathbb V(Y_i)=\pi(1-\pi)\).

Binary outcomes\(—\)Unconditional model (estimation)

Maximum likelihood estimation of the Bernoulli model

Under an i.i.d. assumption, the sample log-likelihood is

\[ \mathcal L ( \pi) = \sum_{i = 1}^n \{ Y_i \ln(\pi) + [1 - Y_i] \ln(1 - \pi) \}. \tag{2}\]

Assuming \(\mathcal L ( \cdot)\) is strictly concave, the MLE of \(\pi\) is obtained by setting \(\partial \mathcal L ( \pi) / \partial \pi = 0\). Simple algebra yields

\[ \hat{\pi} = \frac{1}{n} \sum_{i = 1}^n Y_i. \tag{3}\]

Exercise\(—\)Bernoulli MLE (\(\approx 5\) minutes)

Consider the vector \(\boldsymbol{Y} = (1, 0, 1, 0, 1)\).

1. Write \(\mathcal L(\pi)\).
2. Compute \(\hat{\pi}\).
3. Plot \(\mathcal L(\pi)\) for \(\pi \in (0, 1)\) and mark \(\hat{\pi}\).

Solution. Replacing \(\boldsymbol{Y}\) in the equations above,

1. \(\mathcal L(\pi) = 3 \log(\pi) + 2 \log(1 - \pi)\).
2. \(\hat{\pi} = 0.6\).
3. See plot on the right.

Binary outcomes\(—\)Conditional model

Conditional expectation \(=\) conditional probability of success

Suppose we further observe a set of covariates \(\boldsymbol{X}_i\). Then, the equality in Equation 1 can be extended to

\[ \mathbb E[Y_i | \boldsymbol{X}_i] = 1 \cdot \color{red}{\mathbb P (Y_i = 1 | \boldsymbol{X}_i)} + 0 \cdot \mathbb P (Y_i = 0 | \boldsymbol{X}_i). \tag{4}\]

Thus modeling \(\mathbb E[Y_i | \boldsymbol{X}_i]\) is the same as modeling the “probability of success” conditional on \(\boldsymbol{X}_i\).

Conditional Bernoulli distribution to model binary outcomes

We can extend the Bernoulli model by allowing the probability of success to depend on \(\boldsymbol{X}_i\):

\[ Y_i \sim Bernoulli(\color{red}{\pi(\boldsymbol{X}_i)} ).\]


Useful properties are:

\(\mathbb P[Y_i = y | \boldsymbol{X}_i]=\pi(\boldsymbol{X}_i)^y [1 - \pi(\boldsymbol{X}_i)]^{1 - y}\).
\(\mathbb E[Y_i | \boldsymbol{X}_i]=\pi(\boldsymbol{X}_i)\).
\(\mathbb V(Y_i | \boldsymbol{X}_i)=\pi(\boldsymbol{X}_i)[1-\pi(\boldsymbol{X}_i)]\).

Binary outcomes\(—\)Conditional model (estimation)

Maximum likelihood estimation of the conditional Bernoulli model

Under an i.i.d. assumption, the sample log-likelihood is

\[ \mathcal L ( \boldsymbol{\pi}) = \sum_{i = 1}^n \{ Y_i \ln(\pi(\boldsymbol{X}_i)) + [1 - Y_i] \ln(1 - \pi(\boldsymbol{X}_i)) \}. \tag{5}\]

Assuming \(\mathcal L ( \cdot)\) is strictly concave, the MLE estimator of \(\pi(\boldsymbol{X}_i)\) is obtained by setting \(\partial \mathcal L ( \boldsymbol{\pi}) / \partial \pi(\boldsymbol{X}_i) = 0\) for all \(i\).


However, this (nonparametric) estimation problem might be quite complex. A way out is to assume a parametric model for \(\pi(\cdot)\).

Linear index models for \(\pi(\cdot)\)

A popular way to parametrize \(\pi(\cdot)\) is through a linear index model:

\[ \pi(\boldsymbol{X}_i) = G ( \boldsymbol{X}_i^\top \boldsymbol{\beta} ), \tag{6}\]

with \(G : \mathbb R \to (0, 1)\) a smooth and strictly increasing link function.


Model Link
LPM \(G(u) = u\)
Logit \(\displaystyle G(u) = (1+e^{-u})^{-1}\)
Probit \(G(u) = \Phi(u)\)

Binary outcomes\(—\)Exercise

Exercise\(—\)Comparison between LPM, Logit, and Probit (\(\approx 5\) minutes)

Simulate a small data set following the DGP below and fit an LPM, Logit, and Probit.

\(X \sim Uniform(-2, 2)\).
\(\pi(x) = (1 + e^{-5x})^{-1}\).
\(Y \sim Bernolli(\pi(x))\).

Then plot the true and predicted probabilities against \(X\).

## Fit models.
lpm <- lm(Y ~ X)
logit <- glm(Y ~ X, family = binomial("logit"))
probit <- glm(Y ~ X, family = binomial("probit"))

## Predict probabilities on some X.
X_grid <- data.frame("X" = seq(min(X), max(X), length.out = 400))

lpm_preds <- predict(lpm, newdata = X_grid)
logit_preds <- predict(logit, newdata = X_grid, type = "response")
probit_preds <- predict(probit, newdata = X_grid, type = "response")

## Set seed.
set.seed(1986)

## Define true probabilities.
pix <- function(x) {
  (1 + exp(-x))^{-1}
}

## Generate data.
n <- 100
beta1 <- 5
X <- runif(n, min = -2, max = 2)
linear_index <- beta0 + beta1 * X
Y <- rbinom(n, size = 1, prob = pix(linear_index))

Binary outcomes\(—\)Conditional model (coefficient interpretation)

Marginal effects

For a continuous \(X_j\), we define the (pointwise) marginal effect as \[ \nabla^{j}\pi(\boldsymbol{x}) := \frac{\partial \pi(\boldsymbol{x})}{\partial x_j} = \beta_j G^\prime (\boldsymbol{x}^\top \boldsymbol{\beta}), \tag{7}\]

while for a binary \(X_j\),

\[ \nabla^{j}\pi(\boldsymbol{x}) := \pi(\boldsymbol{x}_{j=1}) - \pi(\boldsymbol{x}_{j=0}) = G (\boldsymbol{x}_{j=1}^\top \boldsymbol{\beta}) - G (\boldsymbol{x}_{j=0}^\top \boldsymbol{\beta}), \tag{8}\]

where \(\boldsymbol{x}_{j=d}\) corresponds to \(\boldsymbol{x}\) with element \(j\) set to \(d\).

Properties of marginal effects

In Logit and Probit

\(\nabla^{j}\pi(\cdot)\) is not constant.
\(\nabla^{j}\pi(\overline{\boldsymbol{x}}) \neq n^{-1} \sum_{i = 1}^n \nabla^{j}\pi(\boldsymbol{x}_i)\) .

Marginal effects \(\neq\) causal effects

\(\nabla^{j}\pi(\boldsymbol{x})\) tells you how \(\pi(\cdot)\) changes as \(x_j\) varies, holding other regressors fixed. This is an association, not a causal effect.

Binary outcomes\(—\)Exercise

Exercise\(—\)Marginal effects with a single binary covariate (\(\approx 5\) minutes)

Compute \(\nabla \pi(x)\) for a single binary regressor \(X\).

What’s your interpretation of the result?


Solution. Applying Equation 8,

\[ \begin{aligned} \nabla \pi(x) &= G (\color{red}{1} \cdot \beta) - G (\color{red}{0} \cdot \beta) \\ &= G (\beta) - G (0). \end{aligned} \]

\(\nabla \pi(x)\) is constant across observations in this one-binary-covariate setup: every flip from \(X = 0\) to \(X = 1\) is associated to a change in probability by the same amount \(G (\beta) - G (0)\).

This further implies that marginal effects at the mean and mean marginal effects are the same.

Binary outcomes\(—\)Exercise

Exercise\(—\)Marginal effects at the mean vs. mean marginal effects (\(\approx 5\) minutes)

Simulate a small data set following the DGP below and fit an LPM, Logit, and Probit.

\(X \sim Uniform(-2, 2)\).
\(\pi(x) = (1 + e^{-5x})^{-1}\).
\(Y \sim Bernolli(\pi(x))\).

Then compute and display the marginal effects at the mean and mean marginal effects.

## Load library.
library(marginaleffects)

## Individual marginal effects.
lpm_mes <- slopes(lpm)
logit_mes <- slopes(logit) 
probit_mes <- slopes(probit) 
  
## Mean marginal effects.
ame_lpm <- avg_slopes(lpm)  
ame_logit <- avg_slopes(logit)  
ame_probit <- avg_slopes(probit)  

## Marginal effects at mean.
mem_lpm <- slopes(lpm, newdata = "mean")
mem_logit  <- slopes(logit, newdata = "mean")
mem_probit <- slopes(probit, newdata = "mean")

Ordered outcomes\(—\)Data generating process

Ordered outcomes are not numeric

Let \(Y_i \in \{1, 2, \dots, M\}\), with outcome classes featuring a natural ordering\(—\)though lacking a cardinal interpretation.

⭐ Movie ratings (\(1<2<3<4<5\))
🎓 Education level (\(HS < BA < MA < PhD\))
⚖️ Satisfaction \(\{ \text{Very unhappy} < \dots < \text{Very happy}\)}.

Strictly speaking, \(Y_i\) is non-numeric\(—\)it represents a qualitative outcome. We can use numbers to represent its classes, but these are just labels.

Data-generating process of \(Y_i\)

We regard \(Y_i\) as a “discretization” of a continuous latent outcome \(Y_i^*\).

The data-generating process consists of a (linear) model for \(Y_i^*\):

\[ Y_i^* = \boldsymbol{X}_i^\top \boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim F, \quad \epsilon_i \perp \boldsymbol{X}_i, \tag{9}\]

and an observational rule for \(Y_i\):

\[ Y_i = \sum_{m = 1}^M m \mathbb{1} (\zeta_{m - 1} < Y_i^* \leq \zeta_m), \tag{10}\]

with \(-\infty = \zeta_0 < \zeta_1 < \dots < \zeta_M = \infty\) unknown “threshold parameters.”

Ordered outcomes\(—\)Exercise

Exercise\(—\)Generating ordered outcomes (\(\approx 3\) minutes)

Generate:

• A single covariate \(X_i \sim Uniform(-2, 2)\).
• A continuous outcome \(Y_i^* = 2 X_i + \epsilon_i\), with \(\epsilon_i \sim \mathcal{N} (0, 1)\).

Then discretize \(Y_i^*\) into three classes using fixed thresholds \(\boldsymbol{\zeta} = (-1, 1)\) and the cut function.


Solution.

## Set seed for reproducibility.
set.seed(1986)

## Set sample size, coefficients, and thresholds.
n <- 1000
beta <- 2
zetas  <- c(-1, 1)  
labels <- c("Bad", "Average", "Good")

## Generate single uniform covariate.
X <- runif(n, min = -2, max = 2)

## Generate latent outcome.
Y_star <- beta * X + rnorm(n)

## Discretize.
Y <- cut(Y_star, breaks  = c(-Inf, zetas, Inf), labels  = labels, ordered_result = TRUE)

Ordered outcomes\(—\)Statistical targets

We cannot average qualitative outcomes

Since \(Y_i\) is not numeric, arithmetic operations on outcome categories (e.g., averaging) lack meaningful interpretation.

\(\Rightarrow\) It does not make sense to talk about expectations.

Statistical targets when \(Y_i\) is ordered

When \(Y_i\) is ordered, we are typically interested in the conditional choice probabilities

\[ \pi_{\color{red}{m}} (\boldsymbol{X}_i) := \mathbb P (Y_i = \color{red}{m} | \boldsymbol{X}_i), \quad m = 1, \dots, M. \tag{11}\]

Notice we are targeting \(M\) functions.


For a continuous \(X_j\), we define the (pointwise) marginal effect on \(\pi_m (\cdot)\) as

\[ \nabla^{j}\pi_{\color{red}{m}}(\boldsymbol{x}) := \frac{\partial \pi_{\color{red}{m}}(\boldsymbol{x})}{\partial x_j}, \tag{12}\]

while for a binary \(X_j\),

\[ \nabla^{j}\pi_{\color{red}{m}}(\boldsymbol{x}) := \pi_{\color{red}{m}}(\boldsymbol{x}_{j=1}) - \pi_{\color{red}{m}}(\boldsymbol{x}_{j=0}). \tag{13}\]

Again, for each \(\boldsymbol{x}\), we are targeting \(M\) marginal effects.

Ordered outcomes\(—\)Ordered logit/probit

Useful expressions for choice probabilities and marginal effects

Combining the regression model for \(Y_i^*\) (Equation 9) and the observational rule for \(Y_i\) (Equation 10),

\[ \pi_m(\boldsymbol{x}) = F (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta}), \tag{14}\]

with \(F(\cdot)\) the cumulative distribution function of \(\epsilon_i\).


It follows that, for a continuous covariate \(X_j\),

\[ \nabla^{j}\pi_m(\boldsymbol{x}) = - \beta_j [f (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - f (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})], \tag{15}\]

while for a binary \(X_j\),

\[ \nabla^{j}\pi_m(\boldsymbol{x}) = \underbrace{[F (\zeta_m - \boldsymbol{x}_{j=1}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}_{j=1}^\top \boldsymbol{\beta})]}_{= \pi_m(\boldsymbol{x}_{j=1})} - \underbrace{[F (\zeta_{m} - \boldsymbol{x}_{j=0}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}_{j=0}^\top \boldsymbol{\beta})]}_{= \pi_m(\boldsymbol{x}_{j=0})}, \tag{16}\]

with \(f(\cdot)\) the probability density function associated with \(F(\cdot)\).

Maximum likelihood estimation of choice probabilities and marginal effects

Assuming a parametric model for \(F(\cdot)\) allows us to estimate \(\pi_m(\cdot)\) using standard maximum likelihood methods. Popular choices are:

• Logistic (ordered logit model).
• Normal (ordered probit model).

Ordered outcomes\(—\)Exercise

Sum of choice probabilities (\(\approx 5\) minutes)

Consider an ordered logit/probit model. Compute \(\sum_{m=1}^{M} \pi_m(\boldsymbol{x})\).

What’s your interpretation of the result?


Solution. Applying Equation 14,

\[ \begin{aligned} \sum_{m = 1}^M \pi_m(\boldsymbol{x}) = & \sum_{m = 1}^M F (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta}) \\ = & \color{#06b6d4}{F (\zeta_1 - \boldsymbol{x}^\top \boldsymbol{\beta})} - F (\zeta_{0} - \boldsymbol{x}^\top \boldsymbol{\beta}) + \\ & F (\zeta_2 - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#06b6d4}{F (\zeta_{1} - \boldsymbol{x}^\top \boldsymbol{\beta})} + \\ & \dots \\ & \color{#f59e0b}{F (\zeta_{M - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})} - F (\zeta_{M - 2} - \boldsymbol{x}^\top \boldsymbol{\beta}) + \\ & F (\zeta_{M} - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#f59e0b}{F (\zeta_{M - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})} \\ = & F (\zeta_{M} - \boldsymbol{x}^\top \boldsymbol{\beta}) - F (\zeta_{0} - \boldsymbol{x}^\top \boldsymbol{\beta}). \end{aligned} \tag{17}\]

By definition, \(\zeta_0 = - \infty\) and \(\zeta_M = \infty\). For logistic/normal \(F(\cdot)\), \(\lim_{u \to \infty} F(u) = 1\) and \(\lim_{u \to - \infty} F(u) = 0\). Therefore,

\[ \sum_{m = 1}^M \pi_m(\boldsymbol{x}) = 1. \tag{18}\]

The categories are mutually exclusive and exhaustive, so their probabilities must sum to one for any \(\boldsymbol{x}\).

Ordered outcomes\(—\)Exercise

Sum of marginal effects (\(\approx 5\) minutes)

Let \(X_j\) be a continuous regressor in an ordered logit/probit. Compute \(\sum_m^M \nabla^j \pi_m(\boldsymbol{x})\).

What’s your interpretation of the result?


Solution. Applying Equation 15,

\[ \begin{aligned} \sum_{m = 1}^M \nabla^j \pi_m(\boldsymbol{x}) = & \sum_{m = 1}^M - \beta_j [f (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - f (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})] \\ = & - \beta_j [\color{#06b6d4}{f (\zeta_1 - \boldsymbol{x}^\top \boldsymbol{\beta})} - f (\zeta_{0} - \boldsymbol{x}^\top \boldsymbol{\beta})] \\ & - \beta_j [f (\zeta_2 - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#06b6d4}{f (\zeta_{1} - \boldsymbol{x}^\top \boldsymbol{\beta})}] \\ & \dots \\ & - \beta_j [\color{#f59e0b}{f (\zeta_{M - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})} - f (\zeta_{M-2} - \boldsymbol{x}^\top \boldsymbol{\beta})] \\ & - \beta_j [f (\zeta_M - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#f59e0b}{f (\zeta_{M-1} - \boldsymbol{x}^\top \boldsymbol{\beta})}] \\ = & - \beta_j [f (\zeta_{M} - \boldsymbol{x}^\top \boldsymbol{\beta}) - f (\zeta_0 - \boldsymbol{x}^\top \boldsymbol{\beta})]. \end{aligned} \tag{19}\]

By definition, \(\zeta_0 = - \infty\) and \(\zeta_M = \infty\). For logistic/normal \(f(\cdot)\), \(\lim_{u \to \pm \infty} f(u) = 0\). Therefore,

\[ \sum_{m = 1}^M \nabla^j \pi_m(\boldsymbol{x}) = 0. \tag{20}\]

Changing \(x_j\) only redistributes probability mass across classes; increases in some \(\pi_m (\cdot)\) are exactly offset by decreases in others, so the total change is zero. Put differently, being more likely to belong to one class necessarily means being less likely to belong to others

Ordered outcomes\(—\)Exercise

Exercise\(—\)Comparison between Ordered Logit and Probit (\(\approx 5\) minutes)

Use the data previously generated to fit an ordered logit and probit. Then plot the predicted probabilities against \(X\).

What’s your interpretation of the results?


Solution.

## Fit models.
library(MASS)

ologit <- polr(Y ~ X, method = "logistic")
oprobit <- polr(Y ~ X, method = "probit")

## Predict probabilities on some X.
X_grid <- data.frame("X" = seq(min(X), max(X), length.out = 400))

ologit_preds <- predict(ologit, newdata = X_grid, type = "probs")
oprobit_preds <- predict(oprobit, newdata = X_grid, type = "probs")

Ordered outcomes\(—\)Fancy trick

Decomposing choice‐probability estimation into sub-prediction tasks

We can decompose the estimation of \(\pi_m(\cdot)\) using the identity

\[ \mathbb{P} (Y_i = m | \boldsymbol{X}_i) = \mathbb{P} (Y_i \leq m | \boldsymbol{X}_i) - \mathbb{P} (Y_i \leq m - 1 | \boldsymbol{X}_i). \tag{21}\]

Equation 21 says that \(\pi_m(\cdot)\) equals the difference between the cumulative distribution functions for adjacent classes \(m\) and \(m-1\).


Further note that

\[ \mathbb{P} (Y_i \leq m | \boldsymbol{X}_i) = \mathbb{P} (\mathbb{1} ( Y_i \leq m ) = 1 | \boldsymbol{X}_i). \tag{22}\]

Equation 22 states that \(\mathbb{P} (Y_i \leq m | \boldsymbol{X}_i)\) is the probability of success of the binary variable \(\mathbb{1} ( Y_i \leq m )\).


Thus, we can fit separate LPM/Logit/Probit models for the two binary outcomes \(\mathbb{1}(Y_i \leq \color{red}{m})\) and \(\mathbb{1}(Y_i \leq \color{red}{m-1})\), obtain predicted probabilities, and estimate

\[ \hat{\pi}_m(\boldsymbol{x}) = \widehat{\mathbb{P}}(\mathbb{1}(Y_i \leq \color{red}{m}) = 1 | \boldsymbol{X}_i = \boldsymbol{x}) - \widehat{\mathbb{P}}(\mathbb{1}(Y_i \leq \color{red}{m-1}) = 1 | \boldsymbol{X}_i = \boldsymbol{x}). \] This maps the ordered-outcome problem into \(M-1\) binary-outcome prediction tasks (notice that \(\mathbb{P} (Y_i \leq M | \boldsymbol{X}_i) = 1\) by construction).


Three main caveats.
1. Nothing ensures \(\widehat{\mathbb{P}}(\mathbb{1}(Y_i \leq m) = 1 | \boldsymbol{X}_i = \boldsymbol{x}) \geq \widehat{\mathbb{P}}(\mathbb{1}(Y_i \leq m - 1) = 1 | \boldsymbol{X}_i = \boldsymbol{x})\); thus \(\hat{\pi}_m(\boldsymbol{x})\) might be negative.
2. Nothing ensures \(\sum_{m = 1}^M \hat{\pi}_m(\boldsymbol{x}) = 1\).
3. Marginal effect estimation is not as straightforward as with ordered logit/probit.

1 and 2 can be solved by “normalizing” \(\hat{\pi}_m(\boldsymbol{x})\).

Ordered outcomes\(—\)Model comparison

Comparison between Ordered Logit/Probit and our fancy trick

We use the data previously generated to fit an ordered logit and probit.

We further apply our fancy trick using Logit/Probit models to estimate \(\mathbb{P} (\mathbb{1} ( Y_i \leq m ) = 1 | \boldsymbol{X}_i)\) for \(m = 1, \dots, M-1\).

We then plot the true and predicted probabilities against \(X\).


## Fit ordered models.
library(MASS)

ologit <- polr(Y ~ X, method = "logistic")
oprobit <- polr(Y ~ X, method = "probit")

## Fit M-1 logit/probits.
Y1 <- as.integer(Y <= labels[1])
Y2 <- as.integer(Y <= labels[2])

fancy_logit1 <- glm(Y1 ~ X, family = binomial(link = "logit"))
fancy_logit2 <- glm(Y2 ~ X, family = binomial(link = "logit"))

## Predict probabilities on some X.
X_grid <- data.frame("X" = seq(min(X), max(X), length.out = 400))

ologit_preds <- predict(ologit, newdata = X_grid, type = "probs")
oprobit_preds <- predict(oprobit, newdata = X_grid, type = "probs")

fancy_logit1_preds <- as.numeric(predict(fancy_logit1, newdata = X_grid, type = "response"))
fancy_logit2_preds <- as.numeric(predict(fancy_logit2, newdata = X_grid, type = "response"))

fancy_logit_preds <- cbind("Bad" = fancy_logit1_preds,
                           "Average" = fancy_logit2_preds - fancy_logit1_preds,
                           "Good" = 1 - fancy_logit2_preds)

The Good, the Bad, & the Ordinal\(—\)Case brief

Case brief

Christian’s movie log just landed on our desk. Our task is to predict his rating from basic film info by modeling \(\mathbb P(Rating=m | X)\).


Objective: movie-night triage. Rank new films by

  • a watchlist score \(\mathbb P(Rating \geq 7 | X)\) (likely good), and
  • the disaster risk \(\mathbb P(Rating \leq 3 | X)\) (likely painful).

Important

Today we’re modeling associations, not causation. If A24 titles score higher, that could be curation or selection (Christian chooses A24 because he expects to like them), not “A24 causes higher ratings.”

Tip

We’ll follow a standard pipeline:

  1. Load & peek\(—\)confirm columns, types, and granularity.
  2. Sanity checks\(—\)missing and impossible values.
  3. Descriptive statistics\(—\)Summary stats for scale and plausibility.
  4. Fit models\(—\)Ordered Logit and Probit for \(\mathbb P(Rating=m | X)\).
  5. Explain\(—\)report coefficients sensibly.
  6. Deliver\(—\)a short, readable summary with the few visuals that actually deliver insights.

The Good, the Bad, & the Ordinal\(—\)Load & peek

Exercise\(—\)Data loading and peeking (\(\approx 3\) minutes)

Let’s open the box and see what Christian actually gave us. Load the data (ChristianMovieRatings.Rds) and display a few rows (hint: use the readRDS function). Then ask yourself:

  • What does one row represent (i.e., what is the unit of observation)?
  • Do we have a stable key to uniquely identify rows?

## Load data.
path <- "../data/ChristianMovieRatings.Rds" # Edit as necessary.
dta <- readRDS(path)

## Kable. 
dta %>% 
  slice_head(n = 7) %>% 
  kbl() %>% 
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = my_font_size)
title year my_rating imdb_rating director country genre owned budget box_office director_affinity
10 Cloverfield Lane 2016 7 7.2 Dan Trachtenberg USA Gyser 1 15 110 2
12 Angry Men 1957 9 9.0 Sidney Lumet USA Drama 1 0 NA 4
12 Monkeys 1995 7 8.0 Terry Gilliam USA Sci-Fi 1 29 169 4
12 Years a Slave 2013 8 8.1 Steve McQueen USA Drama 1 20 188 5
13th 2016 6 8.2 Ava DuVernay USA Dokumentar 0 NA NA 1
17 Again 2009 2 6.4 Burr Steers USA Romcom 0 20 136 1
1917 2019 8 8.2 Sam Mendes UK Krig 1 95 384 7

The Good, the Bad, & the Ordinal\(—\)Sanity checks

“Garbage in, garbage out.”

Even a fancy model can’t fix broken inputs. If the data is messy, models will confidently learn nonsense.

Exercise\(—\)Missingness (\(\approx 5\) minutes)

Compute and report the number of missing observations in each column. How would to treat the missing observations?


## Missingness.
miss_tbl <- dta %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing") %>%
  mutate(n_rows = nrow(dta),
         missing = as.integer(missing),
         frac_missing = missing / n_rows,
         pct_missing = percent(frac_missing, accuracy = 0.1)) %>%
  arrange(desc(frac_missing))

## Kable...
Variable Missing (n) Missing (%)
budget 1365 53.9%
box_office 594 23.5%
director 2 0.1%
director_affinity 2 0.1%
imdb_rating 1 0.0%
country 1 0.0%
## Removed missing director, country, or imdb_rating, and budget and box_office columns.
dta_model <- dta %>%
  filter(!is.na(director), !is.na(country), !is.na(imdb_rating)) %>%
  dplyr::select(-budget, -box_office)

The Good, the Bad, & the Ordinal\(—\)Data descriptives

Exercise\(—\)Sizing your data (\(\approx 3\) minutes)

Before we fit anything, we need to understand our data. We start by sizing the data set. Compute and report the number of rows and the number of unique titles.

What’s your intepretation of the results?


## Core counts.
counts <- tibble(observations = nrow(dta_model),
                 movie_unique = n_distinct(dta_model$title))

## Kable...
N. rows N. movies
2529 2492
N N. (title, year)
2529 2529
title year my_rating imdb_rating director country genre owned director_affinity
Alice in Wonderland 1951 6 7.3 Clyde Geronimi USA Animation 0 4
Alice in Wonderland 2010 5 6.4 Tim Burton USA Familie 1 17
All Quiet on the Western Front 1930 6 8.1 Lewis Milestone USA Krig 1 1
All Quiet on the Western Front 2022 5 7.8 Edward Berger Tyskland Krig 0 2
Bad Education 2004 6 7.4 Pedro Almodóvar Spanien Drama 0 7
Bad Education 2019 6 7.1 Cory Finley USA Drama 0 2
Batman 1966 6 6.5 Leslie H. Martinson USA Action 0 1
Batman 1989 8 7.5 Tim Burton USA Action 1 17
Beauty and the Beast 1991 6 8.0 Gary Trousdale & Kirk Wise USA Animation 0 2
Beauty and the Beast 2017 5 7.1 Bill Condon USA Familie 1 1

The Good, the Bad, & the Ordinal\(—\)Data descriptives (cont’d)

Exercise\(—\)Summary statistics (\(\approx 5\) minutes)

Next, we investigate scale and plausibility. If a variable’s range is off (e.g., negative rating or future years), models will faithfully learn nonsense. Looking at all summary stats (min/max/mean/sd) is a fast “smell test” for outliers or unit mistakes.

Compute and report summary stats for:

  • my_rating,
  • imdb_rating,
  • owned, and
  • director_affinity.

What’s your interpretation of the results? (Hint: recall that my_rating is ordinal).


## Compute summary table using our function.
stats_table <- summarize_vars(dta_model, vars = c("my_rating", "imdb_rating", "owned", "director_affinity"))

## Kable....
variable n missing min mean sd max
my_rating 2529 0 1.0 5.857 1.610 10.0
imdb_rating 2529 0 1.5 7.050 0.914 9.3
owned 2529 0 0.0 0.391 0.488 1.0
director_affinity 2529 0 0.0 5.375 6.750 33.0

The Good, the Bad, & the Ordinal\(—\)Fit models

Exercise\(—\)Fit models (\(\approx 5\) minutes)

Now we get to the core task: fit simple, transparent models for \(\mathbb P(Rating=m | X)\).

We’ll keep it simple: fit an ordered Logit and Probit using only imdb_rating, owned, and director_affinity. Then report the estimated coefficients.

What’s your interpretation of the results?


## Fit baseline models.
ologit_base <- polr(factor(my_rating) ~ imdb_rating + owned + director_affinity, data = dta_model, method = "logistic")  
oprobit_base <- polr(factor(my_rating) ~ imdb_rating + owned + director_affinity, data = dta_model, method = "probit")

## Kable...
Variable Logit Probit
imdb_rating 1.978 [1.866, 2.089] 1.076 [1.018, 1.134]
owned 1.386 [1.218, 1.553] 0.778 [0.685, 0.872]
director_affinity 0.045 [0.034, 0.057] 0.026 [0.020, 0.033]

The Good, the Bad, & the Ordinal\(—\)Explain

Exercise\(—\)Marginal effects (\(\approx 5\) minutes)

Compute and report mean marginal effects.

What’s your interpretation of the results?


## Mean marginal effects.
ame_logit <- avg_slopes(ologit_base, type = "probs")
ame_probit <- avg_slopes(oprobit_base, type = "probs")

## Kable...
Variable 1 2 3 4 5 6 7 8 9 10
Panel A: Logit
imdb_rating -0.016 -0.035 -0.062 -0.074 -0.073 -0.028 0.141 0.113 0.030 0.004
director_affinity 0.000 -0.001 -0.001 -0.002 -0.002 -0.001 0.003 0.003 0.001 0.000
owned -0.009 -0.019 -0.039 -0.053 -0.062 -0.041 0.119 0.086 0.018 0.002
Panel B: Probit
imdb_rating -0.016 -0.034 -0.061 -0.070 -0.070 -0.026 0.135 0.107 0.031 0.005
director_affinity 0.000 -0.001 -0.001 -0.002 -0.002 -0.001 0.003 0.003 0.001 0.000
owned -0.009 -0.020 -0.040 -0.053 -0.061 -0.038 0.114 0.086 0.019 0.002

The Good, the Bad, & the Ordinal\(—\)Delivery

Exercise\(—\)Movie night triage (\(\approx 5\) minutes)

We turn the models into something we can use for our movie nights. For each film, compute \(\mathbb P(Rating \geq 7 | X)\) (watchlist score) and \(\mathbb P(Rating \leq 3 | X)\) (disaster risk). Then report the top 10 ‘Watch soon’ and ‘Probably skip’ movies. Choose either the fittted Logit or Probit model.


## Select fit and extract predicted probabilities.
fit <- ologit_base
preds <- as.data.frame(predict(fit, newdata = dta_model, type = "probs"))

## Compute watchlist score and disaster risk.
classes <- as.integer(colnames(preds))

p_good7 <- rowSums(preds[, classes >= 7, drop = FALSE])
p_bad3 <- rowSums(preds[, classes <= 3, drop = FALSE])