Linear model for continuous outcomes
So far, you learnt how to (linearly) model continuous outcomes:
\[ Y_i = \underbrace{\boldsymbol{X}_i^\top \boldsymbol{\beta}}_{\mathbb E[Y_i | \boldsymbol{X}_i]} + \epsilon_i. \]
However, \(Y_i\) is often discrete.
Linear models are inappropriate for predictive modeling of discrete outcomes
Expectation \(=\) probability of success
Let \(Y_i \in \{0, 1\}\) be a binary outcome.
The following equality holds:
\[ \mathbb E[Y_i] = 1 \cdot \color{red}{\mathbb P (Y_i = 1)} + 0 \cdot \mathbb P (Y_i = 0). \tag{1}\]
Thus modeling \(\mathbb E[Y_i]\) is equivalent to modeling the probability of success.
Bernoulli distribution to model binary outcomes
A natural (unconditional) model for binary outcomes is the Bernoulli distribution, indexed by a single parameter \(\pi \in (0, 1)\), the probability of success.
\[ Y_i \sim Bernoulli(\pi).\]
Useful properties are \(\mathbb E[Y_i] = \pi\) and \(\mathbb V[Y_i] = \pi (1 - \pi)\).
Maximum likelihood estimation of the Bernoulli model
Under an i.i.d. assumption, the sample log-likelihood is
\[ \mathcal L ( \pi) = \sum_{i = 1}^n \{ Y_i \ln(\pi) + [1 - Y_i] \ln(1 - \pi) \}. \tag{2}\]
Assuming \(\mathcal L ( \cdot)\) is strictly concave, the MLE of \(\pi\) is obtained by setting \(\partial \mathcal L ( \pi) / \partial \pi = 0\). Simple algebra yields
\[ \hat{\pi} = \frac{1}{n} \sum_{i = 1}^n Y_i. \tag{3}\]
Exercise\(—\)Bernoulli MLE (\(\approx 5\) minutes)
Consider the vector \(\boldsymbol{Y} = (1, 0, 1, 0, 1)\). Compute the MLE \(\hat{\pi}\).
Solution. Replacing \(\boldsymbol{Y}\) in the equations above,
\[ \hat{\pi} = \frac{1}{5} (1 + 0 + 1 + 0 + 1) = \frac{3}{5} = 0.6. \]
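A quick numerical check in R (a sketch): the closed-form MLE in Equation 3 coincides with the numerical maximizer of the log-likelihood in Equation 2.
## Bernoulli MLE: closed form vs. numerical maximization.
Y <- c(1, 0, 1, 0, 1)
mean(Y)  # Closed-form MLE (Equation 3): 0.6.
loglik <- function(p) sum(Y * log(p) + (1 - Y) * log(1 - p))
optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum  # Approximately 0.6.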
Conditional expectation \(=\) conditional probability of success
Suppose we further observe a set of covariates \(\boldsymbol{X}_i\). Then, the equality in Equation 1 can be extended to
\[ \mathbb E[Y_i | \boldsymbol{X}_i] = 1 \cdot \color{red}{\mathbb P (Y_i = 1 | \boldsymbol{X}_i)} + 0 \cdot \mathbb P (Y_i = 0 | \boldsymbol{X}_i). \tag{4}\]
Thus modeling \(\mathbb E[Y_i | \boldsymbol{X}_i]\) is the same as modeling the “probability of success” conditional on \(\boldsymbol{X}_i\).
Conditional Bernoulli distribution to model binary outcomes
We can extend the Bernoulli model by allowing the probability of success to depend on \(\boldsymbol{X}_i\):
\[ Y_i \sim Bernoulli(\color{red}{\pi(\boldsymbol{X}_i)} ).\]
Useful properties are \(\mathbb E[Y_i | \boldsymbol{X}_i] = \pi(\boldsymbol{X}_i)\) and \(\mathbb V[Y_i | \boldsymbol{X}_i] = \pi(\boldsymbol{X}_i) [1 - \pi(\boldsymbol{X}_i)]\).
Maximum likelihood estimation of the conditional Bernoulli model
Under an i.i.d. assumption, the sample log-likelihood is
\[ \mathcal L ( \boldsymbol{\pi}) = \sum_{i = 1}^n \{ Y_i \ln(\pi(\boldsymbol{X}_i)) + [1 - Y_i] \ln(1 - \pi(\boldsymbol{X}_i)) \}. \tag{5}\]
Assuming \(\mathcal L ( \cdot)\) is strictly concave, the MLE estimator of \(\pi(\boldsymbol{X}_i)\) is obtained by setting \(\partial \mathcal L ( \boldsymbol{\pi}) / \partial \pi(\boldsymbol{X}_i) = 0\) for all \(i\).
However, this (nonparametric) estimation problem might be quite complex. A way out is to assume a parametric model for \(\pi(\cdot)\).
Linear index models for \(\pi(\cdot)\)
A popular way to parametrize \(\pi(\cdot)\) is through a linear index model:
\[ \pi(\boldsymbol{X}_i) = G ( \boldsymbol{X}_i^\top \boldsymbol{\beta} ), \tag{6}\]
with \(G : \mathbb R \to (0, 1)\) a smooth and strictly increasing link function. (The LPM uses the identity link, which does not respect this range: its predicted probabilities can fall outside \([0, 1]\).)
| Model | Link |
|---|---|
| LPM | \(G(u) = u\) |
| Logit | \(G(u) = (1+e^{-u})^{-1}\) |
| Probit | \(G(u) = \Phi(u)\) |
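In R, these links correspond to the identity, the logistic CDF, and the standard normal CDF; a minimal sketch:
## Link functions in R.
G_lpm <- function(u) u   # Identity (LPM).
G_logit <- plogis        # Logistic CDF: 1 / (1 + exp(-u)).
G_probit <- pnorm        # Standard normal CDF.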
Exercise\(—\)Comparison between LPM, Logit, and Probit (\(\approx 5\) minutes)
Simulate a small data set following the DGP below and fit an LPM, Logit, and Probit.
Then plot the true and predicted probabilities against \(X\).
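The DGP is not reproduced here; below is a minimal simulation sketch, assuming a single uniform covariate and a logistic success probability (the exact DGP used in the original exercise may differ).
## Set seed for reproducibility.
set.seed(1986)
## Simulate a single covariate and a binary outcome (hypothetical DGP).
n <- 1000
X <- runif(n, min = -2, max = 2)
pi_true <- plogis(1 + 2 * X)
Y <- rbinom(n, size = 1, prob = pi_true)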
## Fit models.
lpm <- lm(Y ~ X)
logit <- glm(Y ~ X, family = binomial("logit"))
probit <- glm(Y ~ X, family = binomial("probit"))
## Predict probabilities on some X.
X_grid <- data.frame("X" = seq(min(X), max(X), length.out = 400))
lpm_preds <- predict(lpm, newdata = X_grid)
logit_preds <- predict(logit, newdata = X_grid, type = "response")
probit_preds <- predict(probit, newdata = X_grid, type = "response")

Marginal effects
For a continuous \(X_j\), we define the (pointwise) marginal effect as \[ \nabla^{j}\pi(\boldsymbol{x}) := \frac{\partial \pi(\boldsymbol{x})}{\partial x_j} = \beta_j G^\prime (\boldsymbol{x}^\top \boldsymbol{\beta}), \tag{7}\]
while for a binary \(X_j\),
\[ \nabla^{j}\pi(\boldsymbol{x}) := \pi(\boldsymbol{x}_{j=1}) - \pi(\boldsymbol{x}_{j=0}) = G (\boldsymbol{x}_{j=1}^\top \boldsymbol{\beta}) - G (\boldsymbol{x}_{j=0}^\top \boldsymbol{\beta}), \tag{8}\]
where \(\boldsymbol{x}_{j=d}\) corresponds to \(\boldsymbol{x}\) with element \(j\) set to \(d\).
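For concreteness, here is a minimal sketch computing Equation 7 by hand for the Logit and Probit fits from the previous exercise; the evaluation point x0 is an arbitrary choice, and dlogis and dnorm are the derivatives of the respective links.
## Pointwise marginal effect of X at x0, computed by hand (Equation 7).
x0 <- 0.5
b_logit <- coef(logit)    # (Intercept, X).
b_probit <- coef(probit)
me_logit <- b_logit["X"] * dlogis(b_logit[1] + b_logit[2] * x0)
me_probit <- b_probit["X"] * dnorm(b_probit[1] + b_probit[2] * x0)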
Properties of marginal effects
In Logit and Probit, \(G^\prime(\cdot) > 0\), so \(\nabla^{j}\pi(\boldsymbol{x})\) has the same sign as \(\beta_j\), but its magnitude depends on \(\boldsymbol{x}\) through \(G^\prime(\boldsymbol{x}^\top \boldsymbol{\beta})\). Marginal effects therefore vary across observations and must be summarized, e.g., evaluated at the covariate means or averaged over the sample.
Marginal effects \(\neq\) causal effects
\(\nabla^{j}\pi(\boldsymbol{x})\) tells you how \(\pi(\cdot)\) changes as \(x_j\) varies, holding other regressors fixed. This is an association, not a causal effect.
Exercise\(—\)Marginal effects with a single binary covariate (\(\approx 5\) minutes)
Compute \(\nabla \pi(x)\) for a single binary regressor \(X\).
What’s your interpretation of the result?
Solution. Applying Equation 8,
\[ \begin{aligned} \nabla \pi(x) &= G (\color{red}{1} \cdot \beta) - G (\color{red}{0} \cdot \beta) \\ &= G (\beta) - G (0). \end{aligned} \]
\(\nabla \pi(x)\) is constant across observations in this one-binary-covariate setup: every flip from \(X = 0\) to \(X = 1\) is associated with a change in probability of the same amount, \(G (\beta) - G (0)\).
This further implies that marginal effects at the mean and mean marginal effects are the same.
Exercise\(—\)Marginal effects at the mean vs. mean marginal effects (\(\approx 5\) minutes)
Simulate a small data set following the DGP below and fit an LPM, Logit, and Probit.
Then compute and display the marginal effects at the mean and the mean marginal effects (the code below reuses the lpm, logit, and probit fits from the previous exercise).
## Load library.
library(marginaleffects)
## Individual marginal effects.
lpm_mes <- slopes(lpm)
logit_mes <- slopes(logit)
probit_mes <- slopes(probit)
## Mean marginal effects.
ame_lpm <- avg_slopes(lpm)
ame_logit <- avg_slopes(logit)
ame_probit <- avg_slopes(probit)
## Marginal effects at mean.
mem_lpm <- slopes(lpm, newdata = "mean")
mem_logit <- slopes(logit, newdata = "mean")
mem_probit <- slopes(probit, newdata = "mean")

Ordered outcomes are not numeric
Let \(Y_i \in \{1, 2, \dots, M\}\), with outcome classes featuring a natural ordering, though lacking a cardinal interpretation.
Strictly speaking, \(Y_i\) is non-numeric: it represents a qualitative outcome. We can use numbers to represent its classes, but these are just labels.
Data-generating process of \(Y_i\)
We regard \(Y_i\) as a “discretization” of a continuous latent outcome \(Y_i^*\).
The data-generating process consists of a (linear) model for \(Y_i^*\):
\[ Y_i^* = \boldsymbol{X}_i^\top \boldsymbol{\beta} + \epsilon_i, \quad \epsilon_i \sim F, \quad \epsilon_i \perp \boldsymbol{X}_i, \tag{9}\]
and an observational rule for \(Y_i\):
\[ Y_i = \sum_{m = 1}^M m \mathbb{1} (\zeta_{m - 1} < Y_i^* \leq \zeta_m), \tag{10}\]
with \(-\infty = \zeta_0 < \zeta_1 < \dots < \zeta_M = \infty\) unknown “threshold parameters.”
Exercise\(—\)Generating ordered outcomes (\(\approx 3\) minutes)
Generate \(n = 1000\) observations of a single covariate \(X_i \sim \text{Uniform}(-2, 2)\) and a latent outcome \(Y_i^* = 2 X_i + \epsilon_i\), with \(\epsilon_i \sim N(0, 1)\).
Then discretize \(Y_i^*\) into three classes using fixed thresholds \(\boldsymbol{\zeta} = (-1, 1)\) and the cut function.
Solution.
## Set seed for reproducibility.
set.seed(1986)
## Set sample size, coefficients, and thresholds.
n <- 1000
beta <- 2
zetas <- c(-1, 1)
labels <- c("Bad", "Average", "Good")
## Generate single uniform covariate.
X <- runif(n, min = -2, max = 2)
## Generate latent outcome.
Y_star <- beta * X + rnorm(n)
## Discretize.
Y <- cut(Y_star, breaks = c(-Inf, zetas, Inf), labels = labels, ordered_result = TRUE)

We cannot average qualitative outcomes
Since \(Y_i\) is not numeric, arithmetic operations on outcome categories (e.g., averaging) lack meaningful interpretation.
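A quick illustration with the ordered factor Y generated above:
## Averaging a factor is not meaningful; R returns NA with a warning.
mean(Y)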
Statistical targets when \(Y_i\) is ordered
When \(Y_i\) is ordered, we are typically interested in the conditional choice probabilities
\[ \pi_{\color{red}{m}} (\boldsymbol{X}_i) := \mathbb P (Y_i = \color{red}{m} | \boldsymbol{X}_i), \quad m = 1, \dots, M. \tag{11}\]
Notice we are targeting \(M\) functions.
For a continuous \(X_j\), we define the (pointwise) marginal effect on \(\pi_m (\cdot)\) as
\[ \nabla^{j}\pi_{\color{red}{m}}(\boldsymbol{x}) := \frac{\partial \pi_{\color{red}{m}}(\boldsymbol{x})}{\partial x_j}, \tag{12}\]
while for a binary \(X_j\),
\[ \nabla^{j}\pi_{\color{red}{m}}(\boldsymbol{x}) := \pi_{\color{red}{m}}(\boldsymbol{x}_{j=1}) - \pi_{\color{red}{m}}(\boldsymbol{x}_{j=0}). \tag{13}\]
Again, for each \(\boldsymbol{x}\), we are targeting \(M\) marginal effects.
Useful expressions for choice probabilities and marginal effects
Combining the regression model for \(Y_i^*\) (Equation 9) and the observational rule for \(Y_i\) (Equation 10), and using \(\epsilon_i \perp \boldsymbol{X}_i\), we have \(\mathbb P (Y_i = m | \boldsymbol{X}_i = \boldsymbol{x}) = \mathbb P (\zeta_{m-1} - \boldsymbol{x}^\top \boldsymbol{\beta} < \epsilon_i \leq \zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta})\), so that
\[ \pi_m(\boldsymbol{x}) = F (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta}), \tag{14}\]
with \(F(\cdot)\) the cumulative distribution function of \(\epsilon_i\).
It follows that, for a continuous covariate \(X_j\),
\[ \nabla^{j}\pi_m(\boldsymbol{x}) = - \beta_j [f (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - f (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})], \tag{15}\]
while for a binary \(X_j\),
\[ \nabla^{j}\pi_m(\boldsymbol{x}) = \underbrace{[F (\zeta_m - \boldsymbol{x}_{j=1}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}_{j=1}^\top \boldsymbol{\beta})]}_{= \pi_m(\boldsymbol{x}_{j=1})} - \underbrace{[F (\zeta_{m} - \boldsymbol{x}_{j=0}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}_{j=0}^\top \boldsymbol{\beta})]}_{= \pi_m(\boldsymbol{x}_{j=0})}, \tag{16}\]
with \(f(\cdot)\) the probability density function associated with \(F(\cdot)\).
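As a sketch, Equation 14 applied to the simulated DGP from the earlier exercise (beta = 2, zetas = c(-1, 1), standard normal errors) gives the true choice probabilities at any point x:
## True choice probabilities (Equation 14) for the simulated DGP.
true_probs <- function(x, beta = 2, zetas = c(-1, 1)) {
  cuts <- c(-Inf, zetas, Inf)
  sapply(2:length(cuts), function(m) pnorm(cuts[m] - beta * x) - pnorm(cuts[m - 1] - beta * x))
}
true_probs(0)  # P("Bad"), P("Average"), P("Good") at X = 0; sums to one.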
Maximum likelihood estimation of choice probabilities and marginal effects
Assuming a parametric model for \(F(\cdot)\) allows us to estimate \(\pi_m(\cdot)\) using standard maximum likelihood methods. Popular choices are the logistic distribution (ordered Logit) and the standard normal distribution (ordered Probit).
Sum of choice probabilities (\(\approx 5\) minutes)
Consider an ordered logit/probit model. Compute \(\sum_{m=1}^{M} \pi_m(\boldsymbol{x})\).
What’s your interpretation of the result?
Solution. Applying Equation 14,
\[ \begin{aligned} \sum_{m = 1}^M \pi_m(\boldsymbol{x}) = & \sum_{m = 1}^M F (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - F (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta}) \\ = & \color{#06b6d4}{F (\zeta_1 - \boldsymbol{x}^\top \boldsymbol{\beta})} - F (\zeta_{0} - \boldsymbol{x}^\top \boldsymbol{\beta}) + \\ & F (\zeta_2 - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#06b6d4}{F (\zeta_{1} - \boldsymbol{x}^\top \boldsymbol{\beta})} + \\ & \dots \\ & \color{#f59e0b}{F (\zeta_{M - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})} - F (\zeta_{M - 2} - \boldsymbol{x}^\top \boldsymbol{\beta}) + \\ & F (\zeta_{M} - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#f59e0b}{F (\zeta_{M - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})} \\ = & F (\zeta_{M} - \boldsymbol{x}^\top \boldsymbol{\beta}) - F (\zeta_{0} - \boldsymbol{x}^\top \boldsymbol{\beta}). \end{aligned} \tag{17}\]
By definition, \(\zeta_0 = - \infty\) and \(\zeta_M = \infty\). For logistic/normal \(F(\cdot)\), \(\lim_{u \to \infty} F(u) = 1\) and \(\lim_{u \to - \infty} F(u) = 0\). Therefore,
\[ \sum_{m = 1}^M \pi_m(\boldsymbol{x}) = 1. \tag{18}\]
The categories are mutually exclusive and exhaustive, so their probabilities must sum to one for any \(\boldsymbol{x}\).
Sum of marginal effects (\(\approx 5\) minutes)
Let \(X_j\) be a continuous regressor in an ordered logit/probit. Compute \(\sum_{m=1}^{M} \nabla^j \pi_m(\boldsymbol{x})\).
What’s your interpretation of the result?
Solution. Applying Equation 15,
\[ \begin{aligned} \sum_{m = 1}^M \nabla^j \pi_m(\boldsymbol{x}) = & \sum_{m = 1}^M - \beta_j [f (\zeta_m - \boldsymbol{x}^\top \boldsymbol{\beta}) - f (\zeta_{m - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})] \\ = & - \beta_j [\color{#06b6d4}{f (\zeta_1 - \boldsymbol{x}^\top \boldsymbol{\beta})} - f (\zeta_{0} - \boldsymbol{x}^\top \boldsymbol{\beta})] \\ & - \beta_j [f (\zeta_2 - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#06b6d4}{f (\zeta_{1} - \boldsymbol{x}^\top \boldsymbol{\beta})}] \\ & \dots \\ & - \beta_j [\color{#f59e0b}{f (\zeta_{M - 1} - \boldsymbol{x}^\top \boldsymbol{\beta})} - f (\zeta_{M-2} - \boldsymbol{x}^\top \boldsymbol{\beta})] \\ & - \beta_j [f (\zeta_M - \boldsymbol{x}^\top \boldsymbol{\beta}) - \color{#f59e0b}{f (\zeta_{M-1} - \boldsymbol{x}^\top \boldsymbol{\beta})}] \\ = & - \beta_j [f (\zeta_{M} - \boldsymbol{x}^\top \boldsymbol{\beta}) - f (\zeta_0 - \boldsymbol{x}^\top \boldsymbol{\beta})]. \end{aligned} \tag{19}\]
By definition, \(\zeta_0 = - \infty\) and \(\zeta_M = \infty\). For logistic/normal \(f(\cdot)\), \(\lim_{u \to \pm \infty} f(u) = 0\). Therefore,
\[ \sum_{m = 1}^M \nabla^j \pi_m(\boldsymbol{x}) = 0. \tag{20}\]
Changing \(x_j\) only redistributes probability mass across classes; increases in some \(\pi_m (\cdot)\) are exactly offset by decreases in others, so the total change is zero. Put differently, being more likely to belong to one class necessarily means being less likely to belong to others.
Exercise\(—\)Comparison between Ordered Logit and Probit (\(\approx 5\) minutes)
Use the data previously generated to fit an ordered logit and probit. Then plot the predicted probabilities against \(X\).
What’s your interpretation of the results?
Solution.
## Fit models.
library(MASS)
ologit <- polr(Y ~ X, method = "logistic")
oprobit <- polr(Y ~ X, method = "probit")
## Predict probabilities on some X.
X_grid <- data.frame("X" = seq(min(X), max(X), length.out = 400))
ologit_preds <- predict(ologit, newdata = X_grid, type = "probs")
oprobit_preds <- predict(oprobit, newdata = X_grid, type = "probs")

Decomposing choice-probability estimation into sub-prediction tasks
We can decompose the estimation of \(\pi_m(\cdot)\) using the identity
\[ \mathbb{P} (Y_i = m | \boldsymbol{X}_i) = \mathbb{P} (Y_i \leq m | \boldsymbol{X}_i) - \mathbb{P} (Y_i \leq m - 1 | \boldsymbol{X}_i). \tag{21}\]
Equation 21 says that \(\pi_m(\cdot)\) equals the difference between the cumulative distribution functions for adjacent classes \(m\) and \(m-1\).
Further note that
\[ \mathbb{P} (Y_i \leq m | \boldsymbol{X}_i) = \mathbb{P} (\mathbb{1} ( Y_i \leq m ) = 1 | \boldsymbol{X}_i). \tag{22}\]
Equation 22 states that \(\mathbb{P} (Y_i \leq m | \boldsymbol{X}_i)\) is the probability of success of the binary variable \(\mathbb{1} ( Y_i \leq m )\).
Thus, we can fit separate LPM/Logit/Probit models for the two binary outcomes \(\mathbb{1}(Y_i \leq \color{red}{m})\) and \(\mathbb{1}(Y_i \leq \color{red}{m-1})\), obtain predicted probabilities, and estimate
\[ \hat{\pi}_m(\boldsymbol{x}) = \widehat{\mathbb{P}}(\mathbb{1}(Y_i \leq \color{red}{m}) = 1 | \boldsymbol{X}_i = \boldsymbol{x}) - \widehat{\mathbb{P}}(\mathbb{1}(Y_i \leq \color{red}{m-1}) = 1 | \boldsymbol{X}_i = \boldsymbol{x}). \] This maps the ordered-outcome problem into \(M-1\) binary-outcome prediction tasks (notice that \(\mathbb{P} (Y_i \leq M | \boldsymbol{X}_i) = 1\) by construction).
Note that, because the binary models are fit separately, the estimated CDFs \(\widehat{\mathbb{P}}(Y_i \leq m | \boldsymbol{X}_i)\) need not be monotone in \(m\), so some \(\hat{\pi}_m(\boldsymbol{x})\) can turn out negative. Such issues can be solved by “normalizing” \(\hat{\pi}_m(\boldsymbol{x})\).
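One simple normalization, shown as a sketch (not necessarily the scheme intended in the original slides): clip negative estimates at zero and rescale each row of predicted probabilities to sum to one. It can be applied, e.g., to the matrix of predictions produced by the code below.
## Normalize a matrix of predicted probabilities (rows = observations, columns = classes).
normalize_probs <- function(p) {
  p <- pmax(p, 0)    # Clip negative estimates at zero.
  p / rowSums(p)     # Rescale each row to sum to one.
}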
Comparison between Ordered Logit/Probit and our fancy trick
We use the data previously generated to fit an ordered logit and probit.
We further apply our fancy trick using Logit/Probit models to estimate \(\mathbb{P} (\mathbb{1} ( Y_i \leq m ) = 1 | \boldsymbol{X}_i)\) for \(m = 1, \dots, M-1\).
We then plot the true and predicted probabilities against \(X\).
## Fit ordered models.
library(MASS)
ologit <- polr(Y ~ X, method = "logistic")
oprobit <- polr(Y ~ X, method = "probit")
## Fit M-1 logit/probits.
Y1 <- as.integer(Y <= labels[1])
Y2 <- as.integer(Y <= labels[2])
fancy_logit1 <- glm(Y1 ~ X, family = binomial(link = "logit"))
fancy_logit2 <- glm(Y2 ~ X, family = binomial(link = "logit"))
## Predict probabilities on some X.
X_grid <- data.frame("X" = seq(min(X), max(X), length.out = 400))
ologit_preds <- predict(ologit, newdata = X_grid, type = "probs")
oprobit_preds <- predict(oprobit, newdata = X_grid, type = "probs")
fancy_logit1_preds <- as.numeric(predict(fancy_logit1, newdata = X_grid, type = "response"))
fancy_logit2_preds <- as.numeric(predict(fancy_logit2, newdata = X_grid, type = "response"))
fancy_logit_preds <- cbind("Bad" = fancy_logit1_preds,
"Average" = fancy_logit2_preds - fancy_logit1_preds,
"Good" = 1 - fancy_logit2_preds)Case brief
Christian’s movie log just landed on our desk. Our task is to predict his rating from basic film info by modeling \(\mathbb P(Rating=m | X)\).
Objective: movie-night triage. Rank new films by the probability of a high rating, \(\mathbb P(Rating \geq 7 | X)\) (the “watchlist score”), and the probability of a low rating, \(\mathbb P(Rating \leq 3 | X)\) (the “disaster risk”).
Important
Today we’re modeling associations, not causation. If A24 titles score higher, that could be curation or selection (Christian chooses A24 because he expects to like them), not “A24 causes higher ratings.”
Tip
We’ll follow a standard pipeline: load and peek at the data, audit missingness, size the sample, compute summary statistics, fit the models, and interpret marginal effects and predicted probabilities.
Exercise\(—\)Data loading and peeking (\(\approx 3\) minutes)
Let’s open the box and see what Christian actually gave us. Load the data (ChristianMovieRatings.Rds) and display a few rows (hint: use the readRDS function). Then ask yourself: do the columns have the types and values you would expect?
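A minimal loading sketch, assuming the file sits in the working directory (dta is the object name used in the later code):
## Load the data and peek at the first rows.
dta <- readRDS("ChristianMovieRatings.Rds")
head(dta)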
| title | year | my_rating | imdb_rating | director | country | genre | owned | budget | box_office | director_affinity |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 Cloverfield Lane | 2016 | 7 | 7.2 | Dan Trachtenberg | USA | Gyser | 1 | 15 | 110 | 2 |
| 12 Angry Men | 1957 | 9 | 9.0 | Sidney Lumet | USA | Drama | 1 | 0 | NA | 4 |
| 12 Monkeys | 1995 | 7 | 8.0 | Terry Gilliam | USA | Sci-Fi | 1 | 29 | 169 | 4 |
| 12 Years a Slave | 2013 | 8 | 8.1 | Steve McQueen | USA | Drama | 1 | 20 | 188 | 5 |
| 13th | 2016 | 6 | 8.2 | Ava DuVernay | USA | Dokumentar | 0 | NA | NA | 1 |
| 17 Again | 2009 | 2 | 6.4 | Burr Steers | USA | Romcom | 0 | 20 | 136 | 1 |
| 1917 | 2019 | 8 | 8.2 | Sam Mendes | UK | Krig | 1 | 95 | 384 | 7 |
“Garbage in, garbage out.”
Even a fancy model can’t fix broken inputs. If the data is messy, models will confidently learn nonsense.
Exercise\(—\)Missingness (\(\approx 5\) minutes)
Compute and report the number of missing observations in each column. How would you treat the missing observations?
## Load libraries.
library(dplyr)
library(tidyr)
library(scales)
## Missingness.
miss_tbl <- dta %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "missing") %>%
mutate(n_rows = nrow(dta),
missing = as.integer(missing),
frac_missing = missing / n_rows,
pct_missing = percent(frac_missing, accuracy = 0.1)) %>%
arrange(desc(frac_missing))
## Kable...

| Variable | Missing (n) | Missing (%) |
|---|---|---|
| budget | 1365 | 53.9% |
| box_office | 594 | 23.5% |
| director | 2 | 0.1% |
| director_affinity | 2 | 0.1% |
| imdb_rating | 1 | 0.0% |
| country | 1 | 0.0% |
Exercise\(—\)Sizing your data (\(\approx 3\) minutes)
Before we fit anything, we need to understand our data. We start by sizing the data set. Compute and report the number of rows and the number of unique titles.
What’s your interpretation of the results?
| N. rows | N. unique titles |
|---|---|
| 2529 | 2492 |
| N. rows | N. unique (title, year) pairs |
|---|---|
| 2529 | 2529 |
| title | year | my_rating | imdb_rating | director | country | genre | owned | director_affinity |
|---|---|---|---|---|---|---|---|---|
| Alice in Wonderland | 1951 | 6 | 7.3 | Clyde Geronimi | USA | Animation | 0 | 4 |
| Alice in Wonderland | 2010 | 5 | 6.4 | Tim Burton | USA | Familie | 1 | 17 |
| All Quiet on the Western Front | 1930 | 6 | 8.1 | Lewis Milestone | USA | Krig | 1 | 1 |
| All Quiet on the Western Front | 2022 | 5 | 7.8 | Edward Berger | Tyskland | Krig | 0 | 2 |
| Bad Education | 2004 | 6 | 7.4 | Pedro Almodóvar | Spanien | Drama | 0 | 7 |
| Bad Education | 2019 | 6 | 7.1 | Cory Finley | USA | Drama | 0 | 2 |
| Batman | 1966 | 6 | 6.5 | Leslie H. Martinson | USA | Action | 0 | 1 |
| Batman | 1989 | 8 | 7.5 | Tim Burton | USA | Action | 1 | 17 |
| Beauty and the Beast | 1991 | 6 | 8.0 | Gary Trousdale & Kirk Wise | USA | Animation | 0 | 2 |
| Beauty and the Beast | 2017 | 5 | 7.1 | Bill Condon | USA | Familie | 1 | 1 |
Exercise\(—\)Summary statistics (\(\approx 5\) minutes)
Next, we investigate scale and plausibility. If a variable’s range is off (e.g., negative rating or future years), models will faithfully learn nonsense. Looking at all summary stats (min/max/mean/sd) is a fast “smell test” for outliers or unit mistakes.
Compute and report summary stats for:
my_rating, imdb_rating, owned, and director_affinity. What’s your interpretation of the results? (Hint: recall that my_rating is ordinal.)
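A possible sketch using dplyr/tidyr (variable names as in the data shown earlier; dplyr::select is used explicitly to avoid masking by MASS):
## Summary statistics for the selected variables.
library(dplyr)
library(tidyr)
dta %>%
  dplyr::select(my_rating, imdb_rating, owned, director_affinity) %>%
  pivot_longer(everything(), names_to = "variable") %>%
  group_by(variable) %>%
  summarise(n = n(),
            missing = sum(is.na(value)),
            min = min(value, na.rm = TRUE),
            mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE))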
| variable | n | missing | min | mean | sd | max |
|---|---|---|---|---|---|---|
| my_rating | 2529 | 0 | 1.0 | 5.857 | 1.610 | 10.0 |
| imdb_rating | 2529 | 0 | 1.5 | 7.050 | 0.914 | 9.3 |
| owned | 2529 | 0 | 0.0 | 0.391 | 0.488 | 1.0 |
| director_affinity | 2529 | 0 | 0.0 | 5.375 | 6.750 | 33.0 |
Exercise\(—\)Fit models (\(\approx 5\) minutes)
Now we get to the core task: fit simple, transparent models for \(\mathbb P(Rating=m | X)\).
We’ll keep it simple: fit an ordered Logit and Probit using only imdb_rating, owned, and director_affinity. Then report the estimated coefficients.
What’s your interpretation of the results?
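A possible fitting sketch; the exact preprocessing is an assumption, while dta_model and ologit_base are the object names used in the later triage code.
## Keep complete cases for the modeling variables and treat the rating as ordered.
library(MASS)
vars <- c("my_rating", "imdb_rating", "owned", "director_affinity")
dta_model <- dta[complete.cases(dta[, vars]), ]
dta_model$my_rating <- factor(dta_model$my_rating, ordered = TRUE)
## Fit ordered logit and probit.
ologit_base <- polr(my_rating ~ imdb_rating + owned + director_affinity,
                    data = dta_model, method = "logistic", Hess = TRUE)
oprobit_base <- polr(my_rating ~ imdb_rating + owned + director_affinity,
                     data = dta_model, method = "probit", Hess = TRUE)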
| Variable | Logit | Probit |
|---|---|---|
| imdb_rating | 1.978 [1.866, 2.089] | 1.076 [1.018, 1.134] |
| owned | 1.386 [1.218, 1.553] | 0.778 [0.685, 0.872] |
| director_affinity | 0.045 [0.034, 0.057] | 0.026 [0.020, 0.033] |
Exercise\(—\)Marginal effects (\(\approx 5\) minutes)
Compute and report mean marginal effects.
What’s your interpretation of the results?
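A sketch using the marginaleffects package, assuming it handles the polr fits above as it did the binary models earlier:
## Mean marginal effects for the ordered models.
library(marginaleffects)
ame_ologit <- avg_slopes(ologit_base)
ame_oprobit <- avg_slopes(oprobit_base)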
| Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Panel A: Logit | ||||||||||
| imdb_rating | -0.016 | -0.035 | -0.062 | -0.074 | -0.073 | -0.028 | 0.141 | 0.113 | 0.030 | 0.004 |
| director_affinity | 0.000 | -0.001 | -0.001 | -0.002 | -0.002 | -0.001 | 0.003 | 0.003 | 0.001 | 0.000 |
| owned | -0.009 | -0.019 | -0.039 | -0.053 | -0.062 | -0.041 | 0.119 | 0.086 | 0.018 | 0.002 |
| Panel B: Probit | ||||||||||
| imdb_rating | -0.016 | -0.034 | -0.061 | -0.070 | -0.070 | -0.026 | 0.135 | 0.107 | 0.031 | 0.005 |
| director_affinity | 0.000 | -0.001 | -0.001 | -0.002 | -0.002 | -0.001 | 0.003 | 0.003 | 0.001 | 0.000 |
| owned | -0.009 | -0.020 | -0.040 | -0.053 | -0.061 | -0.038 | 0.114 | 0.086 | 0.019 | 0.002 |
Exercise\(—\)Movie night triage (\(\approx 5\) minutes)
We turn the models into something we can use for our movie nights. For each film, compute \(\mathbb P(Rating \geq 7 | X)\) (watchlist score) and \(\mathbb P(Rating \leq 3 | X)\) (disaster risk). Then report the top 10 ‘Watch soon’ and ‘Probably skip’ movies. Choose either the fitted Logit or Probit model.
## Select fit and extract predicted probabilities.
fit <- ologit_base
preds <- as.data.frame(predict(fit, newdata = dta_model, type = "probs"))
## Compute watchlist score and disaster risk.
classes <- as.integer(colnames(preds))
p_good7 <- rowSums(preds[, classes >= 7, drop = FALSE])
p_bad3 <- rowSums(preds[, classes <= 3, drop = FALSE])
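To complete the triage, a minimal ranking sketch (column and object names follow the code above):
## Rank films by watchlist score and disaster risk.
triage <- data.frame(title = dta_model$title,
                     watch_score = p_good7,
                     disaster_risk = p_bad3)
## Top 10 "Watch soon" and top 10 "Probably skip".
head(triage[order(-triage$watch_score), ], 10)
head(triage[order(-triage$disaster_risk), ], 10)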