The Good, the Bad & the Ordinal

A mini tour of ordered models

Author

Riccardo Di Francesco

Show code
rm(list = ls())
set.seed(1986)

pkgs <- c("tibble", "dplyr", "kableExtra", "ggplot2", "scales", "tidyr", "broom", "MASS", "patchwork", "ocf", "rsample", "ggrepel")
pkgs_to_install <- pkgs[!(pkgs %in% installed.packages()[,"Package"])]
if(length(pkgs_to_install)) install.packages(pkgs_to_install, quiet = TRUE)
invisible(lapply(pkgs, library, character.only = TRUE))

Case brief

Christian’s movie log just landed on our desk. He asked: “I need an econometrician to predict my rating from basic film info (year, box office, etc.).” We’ll help him by modeling \(\mathbb P(Rating=m | X)\), where \(m\) is Christian’s rating (from \(1\) to \(10\)).

Why? A concrete use-case:

  • Movie-night triage: Surface what actually drives Christian’s 7–10s and rank new films by a watchlist score \(\mathbb P(Rating \geq 7 | X)\) (likely good) and the disaster risk \(\mathbb P(Rating \leq 3 | X)\) (likely painful)\(—\)so we avoid duds at the department’s movie nights.
Tip

Today we’re modeling associations, not causation. If A24 titles score higher, that could be curation or selection (Christian chooses A24 because he expects to like them), not “A24 causes higher ratings.”

We’ll follow a standard pipeline:

  1. Load & peek\(—\)confirm columns, types, and granularity.
  2. Sanity checks\(—\)missing and impossible values.
  3. Descriptive statistics\(—\)Summary stats for scale and plausibility.
  4. Fit models\(—\)Ordered Logit and Probit for \(\mathbb P(Rating=m | X)\); compare on held-out data.
  5. Explain\(—\)report coefficients sensibly.
  6. Deliver\(—\)a short, readable summary with the few visuals that actually deliver insights.

Load & peek

First things first: let’s open the box and see what Christian actually gave us. This step isn’t about modeling—it’s orientation. We want to be sure we understand the basic structure of the data.

We load the data.

Show code
## Load data.
path <- "../data/ChristianMovieRatings.Rds" # If you preserved the structure of my folder, this should work; otherwise, edit.
dta <- readRDS(path)

Now let’s peek at a few rows. We’re not hunting for patterns yet—just confirming that what we think we have is what we actually have. From Christian’s documentation, we expect to see the following variables:

  • title: movie’s title (identifier).
  • year: year of release.
  • my_rating: outcome.
  • imdb_rating: IMDB rating (averaged).
  • director, country, genre: director, country of production, and genre of the movie.
  • owned: whether Christian owns a DVD of the movie (\(=1\) if yes).
  • budget, box_office: budget and revenue generated from the movie (million $).
  • director_affinity: how many movies from the same director Christian already watched before.
Tip

While you peek, ask yourself:

  • What does one row represent (i.e., what is the unit of observation)?
  • Do we have a stable key to uniquely identify rows?

In our data, one row is a movie observation, and title looks like a unique key.

Show code
## Kable. 
dta %>% 
  slice_head(n = 7) %>% 
  kbl(caption = "Christian's movie data") %>% 
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Christian's movie data
title year my_rating imdb_rating director country genre owned budget box_office director_affinity
10 Cloverfield Lane 2016 7 7.2 Dan Trachtenberg USA Gyser 1 15 110 2
12 Angry Men 1957 9 9.0 Sidney Lumet USA Drama 1 0 NA 4
12 Monkeys 1995 7 8.0 Terry Gilliam USA Sci-Fi 1 29 169 4
12 Years a Slave 2013 8 8.1 Steve McQueen USA Drama 1 20 188 5
13th 2016 6 8.2 Ava DuVernay USA Dokumentar 0 NA NA 1
17 Again 2009 2 6.4 Burr Steers USA Romcom 0 20 136 1
1917 2019 8 8.2 Sam Mendes UK Krig 1 95 384 7
Note

my_rating is ordinal: the numbers are category labels (from very bad to very good), not measurements on a numeric scale. Distances between labels aren’t meaningful. So arithmetic on the labels is invalid: you can’t say “an 8 is twice a 4,” nor assume the gap 7→8 equals the gap 4→5.


Relatedly, imdb_rating is a (weighted) average of users’ ordinal ratings. Beyond lacking a meaningful interpretation, averaging ordinal labels mixes users who use the scale differently (each rater has their own cutpoints). For analysis, prefer ordinal models or report medians or top-box shares \(\mathbb P (Rating \geq m)\).

Important

If we (artificially) treat an ordered outcome \(Y \in \{1, \dots, M \}\) as numeric, its expectation is \[ \mathbb{E} [Y] = \sum_{m=1}^M m \mathbb P (Y = m). \] \(\mathbb{E} [Y]\) is a weighted average of class labels \(m\) with weights \(\mathbb P (Y = m)\). However, the numbers \(1, \dots, M\) are mere labels, not units. Therefore, \(\mathbb{E} [Y]\) depends on how you code the categories: any order-preserving relabeling generally changes it.
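
A quick toy illustration (numbers invented): an order-preserving relabeling of the categories changes the mean, while rank-based summaries like the median and top-box shares stay consistent.

Show code
## Toy illustration: monotone relabeling of ordinal categories.
y <- c(1, 1, 2, 4, 4)                            # Ratings on a 1-4 scale.
recode <- c(`1` = 1, `2` = 5, `3` = 6, `4` = 10) # Order-preserving relabeling.
y_new <- unname(recode[as.character(y)])

mean(y)       # 2.4.
mean(y_new)   # 5.4: same data, different "average".
median(y)     # 2, which maps to recode["2"] = 5 = median(y_new).
mean(y >= 2)  # 0.6; mean(y_new >= 5) gives the same top-box share.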

Next, we check column types. Some models behave very differently depending on types, and R will happily coerce behind your back if you let it.

Warning: Why column types matter

If my_rating is numeric instead of (ordered) factor, many models will raise errors or produce odd results. If qualitative variables like genre arrive as numbers, the model may treat them as numeric, which is wrong.
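
A minimal sketch of the usual fixes (illustrative only; in this post we keep my_rating numeric in the data and wrap factor(my_rating) inside the model formulas instead):

Show code
## Illustrative type fixes (not used below): declare the outcome as an ordered
## factor and make sure qualitative variables are factors.
dta_typed <- dta %>%
  mutate(my_rating = factor(my_rating, levels = 1:10, ordered = TRUE),
         genre = as.factor(genre))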

Show code
## Column types.
str(dta)
tibble [2,533 × 11] (S3: tbl_df/tbl/data.frame)
 $ title            : Factor w/ 2496 levels "10 Cloverfield Lane",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year             : num [1:2533] 2016 1957 1995 2013 2016 ...
 $ my_rating        : num [1:2533] 7 9 7 8 6 2 8 5 7 4 ...
 $ imdb_rating      : num [1:2533] 7.2 9 8 8.1 8.2 6.4 8.2 5.8 7.1 5.9 ...
 $ director         : Factor w/ 1219 levels "Aamir Khan","Aaron Blaise & Robert Walker",..: 229 1059 1115 1086 95 148 1022 1091 782 573 ...
 $ country          : Factor w/ 47 levels "Argentina","Australien",..: 45 45 45 45 45 45 42 45 42 45 ...
 $ genre            : Factor w/ 19 levels "Action","Animation",..: 8 4 16 4 3 15 10 9 16 1 ...
 $ owned            : num [1:2533] 1 1 1 1 0 0 1 1 1 1 ...
 $ budget           : num [1:2533] 15 0 29 20 NA 20 95 35 4 76 ...
 $ box_office       : num [1:2533] 110 NA 169 188 NA 136 384 92 8 236 ...
 $ director_affinity: num [1:2533] 2 4 4 5 1 1 7 33 1 2 ...

Sanity checks

If the data is messy, models will confidently learn nonsense. So before visuals or modeling, we audit the basics.

Important: “Garbage in, garbage out.”

Even a fancy model can’t fix broken inputs.

Missingness

If key fields are missing, fix that now—before plots and models.

Tip

If you find missing values, typical options are:

  1. drop rows (safe only when few and not systematic),
  2. simple imputations for non-outcomes, or
  3. model-based imputations.

Be especially careful if the outcome is missing—removing or imputing those rows can bias results.

Because only a few rows are affected, we’ll drop observations with missing director, country, imdb_rating, or director_affinity; we display below what we’re dropping. However, budget and box_office have many missing values relative to the sample size, so we’ll exclude these variables from modeling rather than drop more rows.

Show code
## Missingness.
miss_tbl <- dta %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing") %>%
  mutate(n_rows = nrow(dta),
         missing = as.integer(missing),
         frac_missing = missing / n_rows,
         pct_missing = percent(frac_missing, accuracy = 0.1)) %>%
  arrange(desc(frac_missing))

## Kable.
miss_tbl %>%
  dplyr::select(variable, missing, pct_missing) %>%
  kbl(caption = "Missing values per column",
      col.names = c("Variable", "Missing (n)", "Missing (%)"),
      align = c("l", "r", "r")) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Missing values per column
Variable Missing (n) Missing (%)
budget 1365 53.9%
box_office 594 23.5%
director 2 0.1%
director_affinity 2 0.1%
imdb_rating 1 0.0%
country 1 0.0%
title 0 0.0%
year 0 0.0%
my_rating 0 0.0%
genre 0 0.0%
owned 0 0.0%
Show code
## New data set: drop rows with missing director, country, imdb_rating, or director_affinity; drop the budget and box_office columns.
dta_model <- dta %>%
  filter(!is.na(director), !is.na(country), !is.na(imdb_rating), !is.na(director_affinity)) %>%
  dplyr::select(-budget, -box_office)

## Display.
dta %>%
  filter(is.na(director) | is.na(country) | is.na(imdb_rating) | is.na(director_affinity)) %>%
  kbl(caption = "Observations we dropped") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Observations we dropped
title year my_rating imdb_rating director country genre owned budget box_office director_affinity
Batman: Gotham Knight 2008 5 6.7 NA USA Animation 0 NA NA NA
Democrats (Independent Lens) 2016 7 NA Camilla Nielsson Danmark Dokumentar 0 NA NA 0
Holiday Secrets 2019 3 6.5 NA Tyskland Drama 0 NA NA NA
Ravenous 1999 4 6.9 Antonia Bird NA Gyser 0 12 2 1
Tip

When we drop many observations, manual inspection becomes infeasible. Instead, one can report descriptive statistics for both the full sample and the restricted (post-drop) sample and compare them. Differences tell us how the analysis sample deviates from the source data\(—\)useful for understanding the reference population and the external validity of our results.
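
As a minimal sketch of such a comparison, reusing dta and dta_model from above:

Show code
## Compare a few means across the full and restricted samples.
bind_rows(dta %>%
            summarise(across(c(year, my_rating, imdb_rating), ~ mean(.x, na.rm = TRUE))) %>%
            mutate(sample = "Full"),
          dta_model %>%
            summarise(across(c(year, my_rating, imdb_rating), ~ mean(.x, na.rm = TRUE))) %>%
            mutate(sample = "Restricted"))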

Nonsensical values

Now we apply plain common sense. By construction, my_rating and imdb_rating cannot be negative or larger than \(10\). If they are, something went wrong upstream.

Tip

Build a small checklist of “obvious rules” for every project. If a rule is violated, decide whether to fix, filter, or flag the row—and write down that decision.
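
As a minimal sketch (the rules and their labels are illustrative):

Show code
## Tiny rule checklist; each rule returns TRUE for valid rows.
check_rule <- function(df, rule, label) {
  tibble(rule = label, violations = sum(!rule(df), na.rm = TRUE))
}

bind_rows(check_rule(dta_model, function(d) d$my_rating >= 1 & d$my_rating <= 10, "my_rating in [1, 10]"),
          check_rule(dta_model, function(d) d$imdb_rating >= 0 & d$imdb_rating <= 10, "imdb_rating in [0, 10]"))
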

Show code
## Nonsensical values.
neg_or_g10_counts <- sapply(dta_model[, c("my_rating", "imdb_rating")], function(x) sum(x < 0 | x > 10))
tibble(variable = names(neg_or_g10_counts),
       n_rows = nrow(dta_model), # Denominator: the sample we actually check.
       impossible_n = as.integer(neg_or_g10_counts),
       impossible_perc = percent(impossible_n / n_rows, accuracy = 0.1)) %>%
  dplyr::select(-n_rows) %>%
  kbl(caption = "Impossible values per column", col.names = c("Variable", "Impossible (n)", "Impossible (%)"), align = c("l", "r", "r")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Impossible values per column
Variable Impossible (n) Impossible (%)
my_rating 0 0.0%
imdb_rating 0 0.0%

Moreover, movies cannot come from future years (not yet). If any show up, those rows need attention.

Show code
## Nonsensical values.
current_year <- as.integer(format(Sys.Date(), "%Y")) # As integer, so the comparison below is numeric, not lexicographic.
future_years <- sapply(dta_model[, c("year")], function(x) sum(x > current_year))
tibble(n_rows = nrow(dta_model),
       impossible_n = as.integer(future_years),
       impossible_perc = percent(impossible_n / n_rows, accuracy = 0.1)) %>%
  dplyr::select(-n_rows) %>%
  kbl(caption = "Future values", col.names = c("Impossible (n)", "Impossible (%)"), align = c("r", "r")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Future values
Impossible (n) Impossible (%)
0 0.0%

Data descriptives

Before we fit anything, we need to understand our data. We start by sizing the data set. How many rows do we have overall? How many unique movies?

Show code
## Core counts.
counts <- tibble(observations = nrow(dta_model),
                 movie_unique = n_distinct(dta_model$title))

## Kable.
kbl(counts, caption = "Dataset size and uniqueness", col.names = c("N","N. movies")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Dataset size and uniqueness
N N. movies
2529 2492

We have more rows than unique titles, so some titles repeat. That alone doesn’t mean the data are bad\(—\)remakes, re-releases, or alternate cuts can share a title.

We check uniqueness by the (title, year) pair; now duplicates disappear. Earlier we claimed title was a unique key\(—\)it turns out that was wrong. The correct unique key is (title, year). We display some repeated titles to confirm this intuition.

Show code
## Core counts.
counts <- tibble(observations = nrow(dta_model),
                 movie_year_unique = dta_model %>% 
                   distinct(title, year) %>% 
                   nrow())


kbl(counts, caption = "Dataset size and uniqueness", col.names = c("N","N. (title, year)")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover","condensed","responsive"))
Dataset size and uniqueness
N N. (title, year)
2529 2529
Show code
## Select duplicate titles.
dup_titles <- dta_model %>%
  add_count(title, name = "n_title") %>%
  filter(n_title > 1) %>%
  arrange(desc(n_title), title, year) %>%
  dplyr::select(-n_title) %>%
  slice_head(n = 10) 

## Kable.
dup_titles %>%
  kbl(caption = "Repeating titles (first 10 shown)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Repeating titles (first 10 shown)
title year my_rating imdb_rating director country genre owned director_affinity
Alice in Wonderland 1951 6 7.3 Clyde Geronimi USA Animation 0 4
Alice in Wonderland 2010 5 6.4 Tim Burton USA Familie 1 17
All Quiet on the Western Front 1930 6 8.1 Lewis Milestone USA Krig 1 1
All Quiet on the Western Front 2022 5 7.8 Edward Berger Tyskland Krig 0 2
Bad Education 2004 6 7.4 Pedro Almodóvar Spanien Drama 0 7
Bad Education 2019 6 7.1 Cory Finley USA Drama 0 2
Batman 1966 6 6.5 Leslie H. Martinson USA Action 0 1
Batman 1989 8 7.5 Tim Burton USA Action 1 17
Beauty and the Beast 1991 6 8.0 Gary Trousdale & Kirk Wise USA Animation 0 2
Beauty and the Beast 2017 5 7.1 Bill Condon USA Familie 1 1

Next, we investigate scale and plausibility. Min/max/mean/sd tell us if values live where we expect and whether any column needs attention (e.g., huge spread, suspicious extremes).

Warning: Why summary stats matter

If a variable’s range is off (e.g., negative rating or future years), models will faithfully learn nonsense. Moreover, a “nice” mean can hide a broken column (e.g., many zeros + a few giant values). Looking at all summary stats (min/max/mean/sd) is a fast “smell test” for outliers, unit mistakes, or columns that need transformation.
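
A toy example of a “nice” mean hiding a broken column (numbers invented):

Show code
## 95 zeros plus a few giant values: the mean alone looks plausible.
x <- c(rep(0, 95), rep(100, 5))
mean(x)                  # 5.
c(min(x), sd(x), max(x)) # 0, ~21.9, 100: the spread gives it away.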

Here is a useful function to compute summary stats; you can also copy-paste it into your future projects.

Show code
## Useful function.
summarize_vars <- function(df, 
                           vars = character(),
                           digits = 3) {
  ## Keep only columns that exist; warn on missing names.
  exist <- intersect(vars, names(df))

  if (length(setdiff(vars, exist)) > 0) {
    warning("These vars were not found and were ignored: ",
            paste(setdiff(vars, exist), collapse = ", "))
  }

  ## Summary stats, one row per (existing) variable.
  vars_list <- lapply(exist, function(v) {
    x <- df[[v]]

    tibble(variable = v,
           n = sum(!is.na(x)),
           missing = sum(is.na(x)),
           min = suppressWarnings(min(x, na.rm = TRUE)),
           mean = mean(x, na.rm = TRUE),
           sd = sd(x, na.rm = TRUE),
           max = suppressWarnings(max(x, na.rm = TRUE)))
  })

  ## Bundle in a table and return.
  if (length(vars_list)) {
    bind_rows(vars_list) %>%
      mutate(across(c(min, mean, sd, max), ~ round(.x, digits)))
  } else {
    tibble()
  }
}
Tip

For qualitative variables (e.g., genre), computing min/mean/sd/max doesn’t make sense. Instead, we dummy-encode each level into a 0/1 indicator. This is often helpful for modeling too, since coefficients become easier to interpret. To keep summaries readable, we avoid dummy-expanding very high-cardinality variables (e.g., directors) at this stage. Furthermore, since \(\approx 66\%\) of titles are from the US (you can verify this yourself), we’ll summarize country with a single indicator us (\(= 1\) if the movie is from the US).

Important

Below we display the means of my_rating and imdb_rating. As noted above, these are ordinal scores, so their averages have limited interpretability. We report them here only for (coding) convenience.

Show code
## Expand `genre` into dummy columns.
genre_dummies <- as_tibble(model.matrix(~ dta_model$genre - 1), .name_repair = "minimal")
names(genre_dummies) <- levels(dta_model$genre)
genre_dummies <- genre_dummies %>%
  rename(Noir = `Film-Noir`,
         SciFi = `Sci-Fi`)

## Bind dummies to the original data. Drop original genre.
new_dta <- bind_cols(dta_model, genre_dummies) %>%
  dplyr::select(-genre)

## Code US variable and drop country.
new_dta <- new_dta %>%
  mutate(us = as.numeric(ifelse(country == "USA", 1, 0))) %>%
  dplyr::select(-country)

## Compute summary table using our function.
stats_table <- summarize_vars(new_dta, vars = c("my_rating", "imdb_rating", "owned", "director_affinity", "us", colnames(genre_dummies)))

## Kable.
stats_table %>%
  kbl(caption = "Descriptive statistics") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Descriptive statistics
variable n missing min mean sd max
my_rating 2529 0 1.0 5.857 1.610 10.0
imdb_rating 2529 0 1.5 7.050 0.914 9.3
owned 2529 0 0.0 0.391 0.488 1.0
director_affinity 2529 0 0.0 5.375 6.750 33.0
us 2529 0 0.0 0.661 0.474 1.0
Action 2529 0 0.0 0.178 0.382 1.0
Animation 2529 0 0.0 0.073 0.260 1.0
Dokumentar 2529 0 0.0 0.012 0.110 1.0
Drama 2529 0 0.0 0.191 0.393 1.0
Familie 2529 0 0.0 0.021 0.143 1.0
Fantasy 2529 0 0.0 0.013 0.115 1.0
Noir 2529 0 0.0 0.018 0.132 1.0
Gyser 2529 0 0.0 0.130 0.336 1.0
Komedie 2529 0 0.0 0.156 0.363 1.0
Krig 2529 0 0.0 0.012 0.110 1.0
Krimi 2529 0 0.0 0.008 0.086 1.0
Musical 2529 0 0.0 0.011 0.106 1.0
Musik 2529 0 0.0 0.006 0.074 1.0
Romantik 2529 0 0.0 0.019 0.136 1.0
Romcom 2529 0 0.0 0.016 0.125 1.0
SciFi 2529 0 0.0 0.039 0.194 1.0
Sport 2529 0 0.0 0.007 0.082 1.0
Thriller 2529 0 0.0 0.080 0.272 1.0
Western 2529 0 0.0 0.011 0.103 1.0

Fit models

Now we get to the core task: fit simple, transparent models for \(\mathbb P(Rating=m | X)\). Before we press “go,” let’s quickly peek at the current data set.

Show code
## Kable.
new_dta %>%
  slice_head(n = 7) %>%
  kbl(caption = "Estimation sample") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Estimation sample
title year my_rating imdb_rating director owned director_affinity Action Animation Dokumentar Drama Familie Fantasy Noir Gyser Komedie Krig Krimi Musical Musik Romantik Romcom SciFi Sport Thriller Western us
10 Cloverfield Lane 2016 7 7.2 Dan Trachtenberg 1 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1
12 Angry Men 1957 9 9.0 Sidney Lumet 1 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
12 Monkeys 1995 7 8.0 Terry Gilliam 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
12 Years a Slave 2013 8 8.1 Steve McQueen 1 5 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
13th 2016 6 8.2 Ava DuVernay 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
17 Again 2009 2 6.4 Burr Steers 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
1917 2019 8 8.2 Sam Mendes 1 7 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

Baseline models

We start with a baseline: model \(\mathbb P (Rating = m | X)\) using only imdb_rating, owned, and us. We fit two standard ordered-response models: ordered logit and ordered probit. We train the models on one split (\(80\%\)) and evaluate their performance on a held-out test set (\(20\%\)). That way, we measure how well the models generalize to data they haven’t seen.

Tip: Held-out test \(=\) honest assessment

Fitting and judging on the same data rewards memorization (overfitting). A held-out test set simulates future, unseen observations.

Warning: Train on past, test on future

To mimic forecasting and avoid “peeking into the future,” we could split by release year: train on films released up to a cutoff year, test on films released after it. However, recent films differ systematically from older ones, so a time split would mix covariate shift with model performance. We thus use a standard random split.
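
For reference, such a time split is a one-liner (the cutoff year is arbitrary; not used below):

Show code
## Year-based split (illustrative, not used in what follows).
train_time <- new_dta %>% filter(year <= 2010)
test_time <- new_dta %>% filter(year > 2010)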

Tip

imdb_rating is a (weighted) average of user scores, so we treat it as a numeric covariate. An alternative would be to round to the nearest integer and treat it as ordered.

Show code
## Train/test split.
spl <- initial_split(new_dta, prop = 0.8, strata = my_rating) # Keep class balance.
train <- training(spl)
test <- testing(spl)
Show code
## Fit baseline models.
ologit_base <- polr(factor(my_rating) ~ imdb_rating + owned + us, data = train, method = "logistic")  
oprobit_base <- polr(factor(my_rating) ~ imdb_rating + owned + us, data = train, method = "probit")  

We evaluate on the held-out test set to get an honest sense of generalization. We use two performance measures:

\[ MSE(\hat{\pi}) = \frac{1}{n} \sum_{i = 1}^n \sum_{m = 1}^M [\mathbb{1} (Y_i = m) - \hat{\pi}_m (\boldsymbol{X_i})]^2,\] \[ RPS(\hat{\pi}) = \frac{1}{n} \sum_{i = 1}^n \frac{1}{M - 1} \sum_{m = 1}^M [ \mathbb{1} (Y_i \leq m) - \sum_{j = 1}^m \hat{\pi}_j (\boldsymbol{X_i})]^2.\] Lower is better for both.
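
To make the formulas concrete, here is a hand-rolled sketch of both metrics, assuming the classes are coded \(1, \dots, M\) and the columns of the probability matrix follow that order; the functions implement exactly the two expressions above and should agree with the ocf metrics used below, up to floating-point error.

Show code
## Hand-rolled metrics (sketch). y is the observed class, p an n x M matrix.
mse_ordered <- function(y, p) {
  M <- ncol(p)
  ind <- outer(y, seq_len(M), `==`) * 1     # Indicators 1(Y_i = m).
  mean(rowSums((ind - p)^2))
}

rps_ordered <- function(y, p) {
  M <- ncol(p)
  cum_ind <- outer(y, seq_len(M), `<=`) * 1 # Indicators 1(Y_i <= m).
  cum_p <- t(apply(p, 1, cumsum))           # Cumulative predicted probabilities.
  mean(rowSums((cum_ind - cum_p)^2) / (M - 1))
}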

Results indicate that ordered Logit and Probit perform practically identically.

Show code
## Predict on the test set. 
ologit_base_preds <- predict(ologit_base, newdata = test, type = "probs")
oprobit_base_preds <- predict(oprobit_base, newdata = test, type = "probs")

## Outcome on test set.
Y_test <- test$my_rating

## Metrics.
eval_metrics <- function(p, y) {
  tibble(mse = ocf::mean_squared_error(y, p), # MSE for ordered probability predictions.
         rps = ocf::mean_ranked_score(y, p)) # Rank probability score.
}

performance_base <- bind_rows(eval_metrics(ologit_base_preds,  Y_test) %>% 
                                mutate(model = "Logit"),
                    eval_metrics(oprobit_base_preds, Y_test) %>% 
                      mutate(model = "Probit")) %>%
  dplyr::select(model, mse, rps) %>%
  mutate(model_type = "Baseline")

## Plot.
performance_base_long <- performance_base %>%
  pivot_longer(cols = c(mse, rps), names_to = "metric", values_to = "value") %>%
  mutate(metric = recode(metric, mse = "MSE", rps = "RPS"),
         model = factor(model, levels = c("Logit", "Probit")))

ggplot(performance_base_long, aes(x = model, y = value, fill = model_type)) +
  geom_col(width = 0.68) +
  facet_grid(cols = vars(factor(metric, levels = c("MSE", "RPS"))), scales = "free_y") +
  labs(title = "Held-out test performance", x = NULL, y = "Score") +
  coord_cartesian(ylim = c(0, max(performance_base_long$value) * 1.12)) +
  theme_bw(base_size = 13) +
  theme(panel.grid.minor = element_blank(),
        axis.text.x = element_text(angle = 15, hjust = 0.5),
        legend.position = "none")

Adding more covariates

We extend the baseline by adding genre dummies, so the model can shift rating probabilities by genre. To avoid the dummy-variable trap (perfect multicollinearity), we omit one category and use it as the reference. We choose Gyser (horror) because Christian is objectively wrong about what counts as horror.

Results barely change after adding genre dummies. Apparently, Christian’s ratings aren’t strongly genre-driven.

Tip

Parsimony is a key desideratum in model building: when two models perform similarly on held-out data, prefer the simpler one for interpretability. Our results show that adding genre dummies yields no meaningful gain in performance. Therefore, in the next models we omit genre dummies.

Show code
## Fit new models.
these_genres <- setdiff(colnames(genre_dummies), "Gyser") 
rhs <- paste(c("imdb_rating + owned + us", sprintf("`%s`", these_genres)), collapse = " + ")
fml <- as.formula(paste("factor(my_rating) ~", rhs))

ologit_genre <- polr(fml, data = train, method = "logistic")  
oprobit_genre <- polr(fml, data = train, method = "probit")  

## Predict on the test set. 
ologit_genre_preds <- predict(ologit_genre, newdata = test, type = "probs")
oprobit_genre_preds <- predict(oprobit_genre, newdata = test, type = "probs")

## Metrics.
performance_genre <- bind_rows(eval_metrics(ologit_genre_preds, Y_test) %>% 
                                mutate(model = "Logit"),
                    eval_metrics(oprobit_genre_preds, Y_test) %>% 
                      mutate(model = "Probit")) %>%
  dplyr::select(model, mse, rps) %>%
  mutate(model_type = "+ Genre controls")

## Plot.
performance_genre_long <- performance_genre %>%
  pivot_longer(cols = c(mse, rps), names_to = "metric", values_to = "value") %>%
  mutate(metric = recode(metric, mse = "MSE", rps = "RPS"),
         model = factor(model, levels = c("Logit", "Probit")))

ggplot(bind_rows(performance_base_long, performance_genre_long), aes(x = model, y = value, fill = model_type)) +
  geom_col(width = 0.68) +
  facet_grid(rows = vars(factor(metric, levels = c("MSE", "RPS"))), 
             cols = vars(factor(model_type, levels = c("Baseline", "+ Genre controls"))),
             scales = "free_y") +
  labs(title = "Held-out test performance", x = NULL, y = "Score") +
  coord_cartesian(ylim = c(0, max(max(performance_base_long$value, performance_genre_long$value)) * 1.12)) +
  theme_bw(base_size = 13) +
  theme(panel.grid.minor = element_blank(),
        axis.text.x = element_text(angle = 15, hjust = 0.5),
        legend.position = "none")

Year context via feature engineering

Sometimes better predictions require feature engineering—being creative and thoughtful about new covariates.

Here we extend the baseline by adding decade dummies so the model can shift rating probabilities by release era (1950s, 1960s, \(\dots\)). We omit the 1920s dummy to avoid multicollinearity.

Again, results barely change after adding decade dummies.

Show code
## Construct director experience. Currently not used.
director_dta <- new_dta %>%
  count(director, year, name = "n_dir_year") %>%
  arrange(director, year) %>%
  group_by(director) %>%
  mutate(director_experience = dplyr::lag(cumsum(n_dir_year), 1, default = 0L)) %>%
  ungroup() 

train <- train %>%
  left_join(director_dta, by = c("director", "year"))
test <- test %>%
  left_join(director_dta, by = c("director", "year"))

## Construct decade dummies. Fix one common set of levels so that train and
## test produce identical dummy columns (otherwise predict() can fail when a
## rare decade lands in only one of the two splits).
decade_levels <- sort(unique(paste0(floor(new_dta$year / 10) * 10, "s")))
make_decade <- function(df) {
  df %>%
    mutate(decade_num = floor(year / 10) * 10,
           decade_lab = ifelse(is.na(decade_num), NA, paste0(decade_num, "s")),
           decade = factor(decade_lab, levels = decade_levels, ordered = TRUE))
}
train <- make_decade(train)
test <- make_decade(test)

mm_train <- model.matrix(~ decade - 1, data = train, na.action = na.pass) %>%
  as_tibble()
mm_test <- model.matrix(~ decade - 1, data = test,  na.action = na.pass) %>%
  as_tibble()

train <- bind_cols(train, mm_train)
test <- bind_cols(test,  mm_test)

## Fit new models.
these_decades <- setdiff(colnames(mm_train), "decade1920s") 
rhs <- paste(c("imdb_rating + owned + us", sprintf("`%s`", these_decades)), collapse = " + ")
fml <- as.formula(paste("factor(my_rating) ~", rhs))

ologit_decade <- polr(fml, data = train, method = "logistic")  
oprobit_decade <- polr(fml, data = train, method = "probit")  

## Predict on the test set. 
ologit_decade_preds <- predict(ologit_decade, newdata = test, type = "probs")
oprobit_decade_preds <- predict(oprobit_decade, newdata = test, type = "probs")

## Metrics.
performance_decade <- bind_rows(eval_metrics(ologit_decade_preds, Y_test) %>% 
                    mutate(model = "Logit"),
                  eval_metrics(oprobit_decade_preds, Y_test) %>% 
                    mutate(model = "Probit")) %>%
  dplyr::select(model, mse, rps) %>%
  mutate(model_type = "+ Decade controls")

## Plot.
performance_decade_long <- performance_decade %>%
  pivot_longer(cols = c(mse, rps), names_to = "metric", values_to = "value") %>%
  mutate(metric = recode(metric, mse = "MSE", rps = "RPS"),
         model = factor(model, levels = c("Logit", "Probit")))

ggplot(bind_rows(performance_base_long, performance_genre_long, performance_decade_long), aes(x = model, y = value, fill = model_type)) +
  geom_col(width = 0.68) +
  facet_grid(rows = vars(factor(metric, levels = c("MSE", "RPS"))), 
             cols = vars(factor(model_type, levels = c("Baseline", "+ Genre controls", "+ Decade controls"))),
             scales = "free_y") +
  labs(title = "Held-out test performance", x = NULL, y = "Score") +
  coord_cartesian(ylim = c(0, max(max(performance_base_long$value, performance_genre_long$value, performance_decade_long$value)) * 1.12)) +
  theme_bw(base_size = 13) +
  theme(panel.grid.minor = element_blank(),
        axis.text.x = element_text(angle = 15, hjust = 0.5),
        legend.position = "none")

Explain

We now focus on the baseline specification\(—\)the simplest specification with competitive performance. To maximize accuracy for our end goal, we refit the models on the full data set (train + test) and report their raw summaries and interpretable outputs. All subsequent interpretations are based on these refit models.

Show code
## Fit models on full data.
ologit_base_full <- polr(factor(my_rating) ~ imdb_rating + owned + us, data = new_dta, method = "logistic", Hess = TRUE)  
oprobit_base_full <- polr(factor(my_rating) ~ imdb_rating + owned + us, data = new_dta, method = "probit", Hess = TRUE)  

Let’s peek at the estimated coefficients. Keep in mind: ordered logit/probit coefficients aren’t on the probability scale, so you can’t read them as “probability changes.” That’s why we’ll focus on marginal effects next.
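
That said, ordered-logit coefficients do admit one convenient transformation: under the proportional-odds assumption, \(e^{\beta_j}\) is a cumulative odds ratio. For example, a one-unit increase in a covariate multiplies the odds of a rating above any cutoff \(m\) by \(e^{\beta_j}\), the same factor at every \(m\).

Show code
## Ordered logit only: exp(beta) is a cumulative odds ratio (same at every cutoff).
exp(coef(ologit_base_full))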

Show code
## Tidy fit summaries.
tidy_wald <- function(fit, model_name) {
  s <- summary(fit)
  co <- as.data.frame(s$coefficients)
  co$term <- rownames(co)
  co <- co %>% 
    filter(!grepl("\\|", term))
  z <- qnorm(0.975)
  co %>%
    transmute(term,
              estimate = Value,
              conf.low = Value - z * `Std. Error`,
              conf.high = Value + z * `Std. Error`,
              model = model_name)
}

tidy_logit <- tidy_wald(ologit_base_full,  "Logit")
tidy_probit <- tidy_wald(oprobit_base_full, "Probit")

all_tidy <- bind_rows(tidy_logit, tidy_probit)

## Format each cell as "\hat{\beta}\n[low, high]".
fmt_cell <- function(est, lo, hi, digits = 3) {
  paste0(formatC(est, format = "f", digits = digits), "\n[",
         formatC(lo, format = "f", digits = digits), ", ",
         formatC(hi, format = "f", digits = digits), "]")
}

tab_cells <- all_tidy %>%
  mutate(cell = fmt_cell(estimate, conf.low, conf.high)) %>%
  dplyr::select(Variable = term, model, cell) %>%
  pivot_wider(names_from = model, values_from = cell)

## Kable.
tab_cells %>%
  kbl(align = c("l", "c", "c"), booktabs = TRUE, escape = FALSE, caption = "Coefficient estimates and 95% confidence intervals") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  column_spec(1, width = "8cm")
Coefficient estimates and 95% confidence intervals
Variable Logit Probit
imdb_rating 1.968 [1.857, 2.079] 1.068 [1.010, 1.125]
owned 1.546 [1.383, 1.709] 0.867 [0.776, 0.958]
us -0.204 [-0.356, -0.052] -0.139 [-0.226, -0.052]

We now use the fitted models to compute mean marginal effects.

Tip

Mean marginal effects and marginal effects at the mean generally differ, and they answer slightly different questions. A mean marginal effect averages individual marginal effects \(\nabla^j \pi(\boldsymbol{x})\) over the sample, so it reflects “typical” behavior across real covariate profiles and is usually the most interpretable quantity to report. A marginal effect at the mean evaluates \(\nabla^j \pi(\cdot)\) at the mean covariate vector; it’s quick, but the “average person” may not exist (especially awkward with dummies), so its interpretation is thinner. As a rule of thumb, lean on mean marginal effects for substantive reporting and keep marginal effects at the mean as a lightweight check.
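
For contrast with the mean marginal effects computed below, here is a minimal sketch of a marginal effect at the mean for imdb_rating, via a finite difference (the step size is arbitrary):

Show code
## Marginal effect at the mean (sketch). Covariates sit at their sample means,
## so owned and us take non-integer values: the "average person" caveat in action.
x_bar <- new_dta %>%
  summarise(imdb_rating = mean(imdb_rating), owned = mean(owned), us = mean(us))
eps <- 0.01
p_hi <- predict(ologit_base_full, newdata = mutate(x_bar, imdb_rating = imdb_rating + eps), type = "probs")
p_lo <- predict(ologit_base_full, newdata = x_bar, type = "probs")
(p_hi - p_lo) / eps # Per-class marginal effects at the mean.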

Logit and Probit tell the same story: IMDb matters most, “owned” helps a bit, and US origin adds essentially no extra signal. At the decision margin, a bump in IMDb mostly moves mass out of 3–6 and into 7–8; it barely touches 10. Owned titles lean the same way\(—\)classic selection: Christian tends to buy what he likes.

Show code
## Useful function: mean marginal effects for a polr model.
ame_polr <- function(fit, data, vars_cont = character(), vars_bin = character()) {
  ## Handling inputs and checks.
  stopifnot(inherits(fit, "polr"))

  P0 <- predict(fit, newdata = data, type = "probs")
  classes <- colnames(P0)
  M <- length(classes)

  X <- model.matrix(fit, data)[, -1]
  beta <- coef(fit)               
  eta <- as.numeric(X %*% beta)  
  zetas <- as.numeric(fit$zeta)    
  bounds <- c(-Inf, zetas, Inf)

  link <- fit$method
  pdf_fun <- if (link == "logistic") {
    function(u) plogis(u) * (1 - plogis(u))
  } else if (link == "probit") {
    function(u) dnorm(u)
  } else stop("Unsupported link: ", link)

  out_rows <- list()

  ## Continuous variables: analytic derivative & average.
  for (v in vars_cont) {
    if (!v %in% names(beta)) next
    bj <- beta[[v]]
    ame_m <- sapply(seq_len(M), function(m) {
      f_left <- pdf_fun(bounds[m] - eta)   
      f_right <- pdf_fun(bounds[m+1] - eta)   
      mean(bj * (f_left - f_right), na.rm = TRUE)
    })
    out_rows[[v]] <- setNames(ame_m, classes)
  }

  ## Binary variables: discrete change & average.
  for (v in vars_bin) {
    if (!v %in% colnames(data)) next
    d1 <- data
    d0 <- data
    d1[[v]] <- 1L; d0[[v]] <- 0L
    P1 <- predict(fit, newdata = d1, type = "probs")
    P0 <- predict(fit, newdata = d0, type = "probs")
    ame_m <- colMeans(P1 - P0, na.rm = TRUE)
    out_rows[[v]] <- setNames(ame_m, classes)
  }

  ## Bind into a table and return.
  ame_mat <- do.call(rbind, out_rows)
  ame_tbl <- as_tibble(ame_mat, rownames = "Variable")
  ame_tbl
}

## Apply the function.
continuous_vars <- c("imdb_rating")
binary_vars <- c("owned", "us")    

ame_logit <- ame_polr(ologit_base_full, new_dta, continuous_vars, binary_vars) 
ame_probit <- ame_polr(oprobit_base_full, new_dta, continuous_vars, binary_vars) 

## Two-panel table: classes as columns, variables as rows.
tab_logit <- ame_logit
tab_probit <- ame_probit 

n_logit <- nrow(tab_logit)
n_probit <- nrow(tab_probit)
tab_all <- bind_rows(tab_logit, tab_probit)

## Kable.
kbl(tab_all, align = c("l", rep("c", ncol(tab_all)-2), "l"), digits = 3, booktabs = TRUE,
    caption = "Average marginal effects") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  group_rows("Panel A: Logit",  1, n_logit) %>%
  group_rows("Panel B: Probit", n_logit + 1, n_logit + n_probit)
Average marginal effects
Variable 1 2 3 4 5 6 7 8 9 10
Panel A: Logit
imdb_rating -0.016 -0.035 -0.062 -0.073 -0.073 -0.030 0.138 0.116 0.031 0.004
owned -0.009 -0.022 -0.044 -0.060 -0.071 -0.048 0.132 0.098 0.021 0.002
us 0.002 0.004 0.006 0.008 0.008 0.003 -0.014 -0.012 -0.003 0.000
Panel B: Probit
imdb_rating -0.016 -0.035 -0.061 -0.069 -0.069 -0.028 0.133 0.109 0.032 0.005
owned -0.009 -0.022 -0.045 -0.059 -0.070 -0.044 0.127 0.098 0.023 0.002
us 0.002 0.004 0.008 0.009 0.009 0.004 -0.017 -0.014 -0.004 -0.001

Deliver

We turn the models into something we can use for our department’s movie nights. For each film we compute two personalized scores:

  • Watchlist score \(\Rightarrow \mathbb P(Rating \geq 7 | X)\).
  • Disaster risk \(\Rightarrow \mathbb P(Rating \leq 3 | X)\).

We then rank candidates by the watchlist score (high → low) and, separately, by disaster risk (high → low). Think of it as a two-list triage: what to watch next, and what to avoid.

Tip

This is where we turn patterns into decisions. The models summarized how Christian rates; now we use that to produce actionable lists\(—\)a short watch soon list and a short skip list.

Tip

Logit and Probit deliver nearly identical predictive performance, so we don’t expect downstream results to meaningfully differ. For clarity, we report only the logit results. A full report would include the probit tables/figures in an appendix as a robustness check.

Christian and IMDb mostly vibe\(—\)the top-right cluster (Shawshank, The Godfather, LOTR, 12 Angry Men, \(\dots\)) screams watch now, while the bottom-left rogues’ gallery (Sharknado, Disaster Movie, Catwoman, Home Alone 4, \(\dots\)) politely says hard pass.

Show code
## Select fit and compute watchlist score and disaster risk.
fit <- ologit_base_full

preds <- as.data.frame(predict(fit, newdata = new_dta, type = "probs"))
classes <- as.integer(colnames(preds))
p_good7 <- rowSums(preds[, classes >= 7, drop = FALSE])
p_bad3 <- rowSums(preds[, classes <= 3, drop = FALSE])

pred <- new_dta %>%
  mutate(p_good7 = p_good7, p_bad3 = p_bad3) %>%
  dplyr::select(title, year, director, imdb_rating, my_rating, p_good7, p_bad3)

best <- pred %>% arrange(desc(p_good7)) %>% slice_head(n = 10) %>%
  mutate(kind = "Watch soon", score = p_good7)
worst <- pred %>% arrange(desc(p_bad3))  %>% slice_head(n = 10) %>%
  mutate(kind = "Probably skip", score = p_bad3)

highlight <- bind_rows(best, worst)

## Plot.
ggplot() +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = 7, ymax = Inf, fill = "steelblue", alpha = 0.06) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = 3, fill = "firebrick", alpha = 0.06) +
  geom_point(data = new_dta, aes(imdb_rating, my_rating), color = "grey80", alpha = 0.35, size = 1.3) +
  geom_point(data = highlight, aes(imdb_rating, my_rating, color = kind, size = score), alpha = 0.95) +
  geom_hline(yintercept = c(3,7), linetype = "dashed", linewidth = 0.5, color = "grey60") +
  geom_text_repel(data = highlight, aes(imdb_rating, my_rating, 
                                        label = title, color = kind), 
                  size = 2.8, seed = 1975, box.padding = 1.35, point.padding = 0.55,
                  min.segment.length = 0, max.overlaps = Inf, show.legend = FALSE) +
  scale_color_manual(values = c("Watch soon" = "steelblue", "Probably skip" = "firebrick")) +
  scale_size(name = "Confidence", range = c(2.5, 7), breaks = c(0.4, 0.6, 0.8), labels = percent_format(accuracy = 1)) +
  scale_x_continuous("IMDb rating", limits = c(min(new_dta$imdb_rating, na.rm = TRUE) - 0.1, max(new_dta$imdb_rating, na.rm=TRUE) + 0.1)) +
  scale_y_continuous("Christian’s rating", breaks = 1:10, limits = c(1, 10)) +
  labs(caption = "Each dot is a film. Colored points label the top 10 'Watch soon' (blue) and 'Probably skip' (red). Point size is 
proportional to the corresponding probability.") +
  guides(color = guide_legend(override.aes = list(size = 4, alpha = 1))) +
  theme_bw(base_size = 13) +
  theme(panel.grid.minor = element_blank(), legend.position = "none", plot.title.position = "plot",
        legend.title = element_blank(), plot.caption.position = "plot", plot.caption = element_text(hjust = 0))