# really hard work here to suppress messages
suppressWarnings(library(ggplot2))
suppressPackageStartupMessages(suppressWarnings(library(dplyr)))
suppressPackageStartupMessages(suppressWarnings(library(gridExtra)))
suppressPackageStartupMessages(suppressWarnings(library(GGally)))
suppressPackageStartupMessages(suppressWarnings(library(statsr)))
load("movies.Rdata")
The data set comprises 651 randomly sampled movies produced and released prior to 2016. It contains a variety of information about these movies, some descriptive, such as title or release date, and some post-release, such as ratings and awards.
Generalizability: Since the movies are randomly sampled and independent of one another, the results of this study are generalizable to the population of movies produced and released prior to 2016.
Causality: Since this is not an experiment and no random assignment was performed, causal relations cannot be established based on this study.
IMDB and "Rotten Tomatoes" (ROTM) are two different movie-related websites. While both publish audience ratings for movies, ROTM also provides a separate rating by professional movie critics. Although the number of audience voters on both websites can be quite large, the audience scores for the same movie often differ noticeably between the two sites. This suggests that certain features of the websites or of their rating methodologies cause the discrepancy.
Research question: I will try to predict the rating on IMDB based on several factors, some of which come from ROTM data.
While this doesn't have great practical implications, the investigation will shed some light on the question of "ROTM vs IMDB - which one is more reliable?", which is often debated in social media.
What movie types are present in the data set?
summary(movies$title_type)
Since the data set is dominated by "Feature Film", we'll remove all "Documentary" and "TV Movie" entries to avoid unnecessary variability in the data.
movies = movies %>%
filter(title_type == "Feature Film")
The IMDB rating is given on a 0-10 scale. We'll rescale it to the range 0-100:
movies$imdb_rating = movies$imdb_rating*10
There are three variables for movie scores:
audience_score
: The score the audience gave on ROTM.
critics_score
: The score the critics gave on ROTM.
imdb_rating
: The score the audience gave on IMDB.
We will also be interested in imdb_num_votes - the number of votes on IMDB.
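As a quick sanity check, here is a minimal sketch that glances at these four columns with dplyr's select() and glimpse():
# Quick look at the four variables of interest: types and a few sample values
movies %>%
  select(audience_score, critics_score, imdb_rating, imdb_num_votes) %>%
  glimpse()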
Let us look at the correlation between these.
options(repr.plot.width=6, repr.plot.height=6)
ggpairs(movies, columns = c("audience_score", "critics_score", "imdb_num_votes", "imdb_rating"))
audience_score vs critics_score: these are moderately correlated (corr = 0.675); however, if one looks at the graph, both have a "bump" towards lower scores. This "bump" does not exist for imdb_rating. imdb_rating is strongly correlated with audience_score (corr = 0.849) and also correlated with critics_score (corr = 0.74). imdb_num_votes, on the other hand, is only weakly correlated with the three rating variables.
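To double-check the correlation values quoted above, a minimal sketch that computes the correlation matrix directly (complete observations only):
# Pairwise correlations between the score variables and the vote count;
# these should match the values shown in the ggpairs panel above.
movies %>%
  select(audience_score, critics_score, imdb_num_votes, imdb_rating) %>%
  cor(use = "complete.obs") %>%
  round(3)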
### imdb_rating
Let's plot the distribution of imdb_rating:
options(repr.plot.width=4, repr.plot.height=3)
ggplot(movies, aes(imdb_rating)) + geom_histogram(binwidth=4)
summary(movies$imdb_rating)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.00   58.50   65.00   63.87   71.00   90.00
The distribution is approximately normal with a left skew: the majority of ratings are centered around the median, with a rather thin tail towards the lower values.
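As a simple numeric check of the left skew, the mean should sit below the median; a minimal sketch:
# For a left-skewed distribution the mean is pulled below the median
# by the thin tail of low ratings, so this difference should be negative.
with(movies, mean(imdb_rating) - median(imdb_rating))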
### imdb_rating compared to audience_score and critics_score
Let's look at summary statistics of these three variables:
summary(movies$imdb_rating)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.00   58.50   65.00   63.87   71.00   90.00
summary(movies$audience_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.00   44.50   62.00   60.47   78.00   97.00
summary(movies$critics_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   31.00   57.00   54.78   79.00  100.00
The imdb_rating variable has the highest mean and median values and critics_score has the lowest. However, the 3rd quartile is higher for critics_score and audience_score than for imdb_rating.
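A side-by-side boxplot makes this comparison easier to see; a sketch assuming the tidyr package is installed for reshaping the three score columns into long format:
library(tidyr)  # assumed to be installed; used only for reshaping here
# Compare the distributions of the three score variables side by side
movies %>%
  select(imdb_rating, audience_score, critics_score) %>%
  gather(key = "score_type", value = "score") %>%
  ggplot(aes(x = score_type, y = score)) +
  geom_boxplot() +
  xlab("Score variable") +
  ylab("Score (0-100)")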
I will now attempt to build a linear model to predict imdb_rating based on other variables. The variables that will be considered are: critics_score, audience_score, best_pic_nom, imdb_num_votes and best_dir_win.
The rest of the variables are less likely to have an effect. To select the best predictors, and since the number of candidate variables is quite small, a bidirectional stepwise search (direction = "both") is used - to cover all possibilities.
test.model = lm(data=movies, imdb_rating ~ 1)
final.model = step(test.model,
direction='both',
scope=(~ critics_score +
audience_score +
best_pic_nom +
imdb_num_votes +
best_dir_win ))
Start:  AIC=2786.54
imdb_rating ~ 1

                 Df Sum of Sq   RSS    AIC
+ audience_score  1     47354 18381 2035.4
+ critics_score   1     36042 29692 2318.8
+ imdb_num_votes  1     11078 54657 2679.5
+ best_pic_nom    1      4242 61493 2749.1
+ best_dir_win    1      1986 63749 2770.4
<none>                        65735 2786.5

Step:  AIC=2035.4
imdb_rating ~ audience_score

                 Df Sum of Sq   RSS    AIC
+ critics_score   1      3403 14977 1916.4
+ imdb_num_votes  1       819 17561 2010.5
+ best_dir_win    1       284 18096 2028.2
+ best_pic_nom    1       139 18241 2032.9
<none>                        18381 2035.4
- audience_score  1     47354 65735 2786.5

Step:  AIC=1916.4
imdb_rating ~ audience_score + critics_score

                 Df Sum of Sq   RSS    AIC
+ imdb_num_votes  1     658.2 14319 1891.8
+ best_dir_win    1      97.7 14880 1914.5
<none>                        14977 1916.4
+ best_pic_nom    1      42.2 14935 1916.7
- critics_score   1    3403.2 18381 2035.4
- audience_score  1   14715.0 29692 2318.8

Step:  AIC=1891.84
imdb_rating ~ audience_score + critics_score + imdb_num_votes

                 Df Sum of Sq   RSS    AIC
<none>                        14319 1891.8
+ best_dir_win    1      40.3 14279 1892.2
+ best_pic_nom    1       0.0 14319 1893.8
- imdb_num_votes  1     658.2 14977 1916.4
- critics_score   1    3242.2 17561 2010.5
- audience_score  1   12407.1 26726 2258.7
Analysis: the algorithm chose audience_score, critics_score and imdb_num_votes as the best predictors for imdb_rating. best_pic_nom and best_dir_win were found unimportant in this context.
Let's examine the final model:
summary(final.model)
Call:
lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes,
    data = movies)

Residuals:
     Min       1Q   Median       3Q      Max
-25.4125  -1.8368   0.3983   2.9442  11.6168

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.745e+01  6.642e-01  56.378  < 2e-16 ***
audience_score 3.231e-01  1.433e-02  22.552  < 2e-16 ***
critics_score  1.147e-01  9.945e-03  11.529  < 2e-16 ***
imdb_num_votes 9.749e-06  1.877e-06   5.194 2.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.939 on 587 degrees of freedom
Multiple R-squared:  0.7822,    Adjusted R-squared:  0.7811
F-statistic: 702.6 on 3 and 587 DF,  p-value: < 2.2e-16
Analysis: all three selected predictors are significant and positively associated with imdb_rating, which means that higher scores on Rotten Tomatoes correspond to higher scores on IMDB - which makes sense. The model also predicts that the larger the number of people who voted on the IMDB website, the higher the expected IMDB score - which also makes sense, since good movies are more likely to be popular.
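To quantify "positively associated", we can look at 95% confidence intervals for the coefficients; a minimal sketch using base R:
# 95% confidence intervals for the fitted coefficients; all three predictor
# intervals should lie entirely above zero, consistent with the analysis above.
confint(final.model, level = 0.95)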
Although it came out in 2017, I've tested the model by predicting the IMDB score of "Guardians of the Galaxy Vol. 2". Let's build a data frame for the new observation, with values retrieved from the IMDB and Rotten Tomatoes websites:
new_movie = data.frame(audience_score=86, critics_score=71, imdb_num_votes=125714)
Let's do the prediction:
predict(final.model, new_movie)
predict(final.model, new_movie, interval = "prediction", level = 0.95)
|   | fit      | lwr      | upr      |
|---|----------|----------|----------|
| 1 | 74.59693 | 64.87352 | 84.32034 |
The model predicts the IMDB score of "Guardians of the Galaxy Vol. 2" to be 74.6, with a 95% prediction interval of (64.87, 84.32). Indeed, the observed IMDB score of "Guardians of the Galaxy Vol. 2" is 81, which falls within the predicted interval.
To validate our analysis, let's plot the residuals:
ggplot(data = final.model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
While the residuals are (more or less) normally distributed around 0, they show a funnel shape: the spread becomes smaller for higher fitted values, and several outliers exist in the 40-65 range of fitted values. This means that the predictions are less accurate for lower scores.
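Two additional diagnostic plots can back up the normality claim; a sketch in the same style as the residual plot above:
# Histogram of residuals: should be roughly symmetric and centred on zero
ggplot(data = final.model, aes(x = .resid)) +
  geom_histogram(binwidth = 2) +
  xlab("Residuals")
# Normal probability (Q-Q) plot of residuals: points close to a straight
# line indicate approximate normality
ggplot(data = final.model, aes(sample = .resid)) +
  stat_qq()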
Conclusion: the qualitative insight obtained from the regression model is that IMDB ratings have a more "critical" nature than the audience rating on ROTM - perhaps because the IMDB website does not make the distinction between "critics" and "audience".
There are two important shortcomings to this study. First, most of the data in the movies dataset is very sparse and categorical, such as the names of actors and directors, which makes statistical reasoning very challenging. This could be improved by including more observations in the data set, perhaps also from earlier years. Second (and related to the first shortcoming), the small amount of dense numerical data is correlated, so making good predictions is quite challenging and the model's performance may be a result of over-fitting.
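One way to quantify the collinearity concern is the variance inflation factor; a sketch assuming the car package is available:
library(car)  # assumed to be installed; provides vif()
# Variance inflation factors for the final model; values well above ~5
# would indicate problematic collinearity between the predictors.
vif(final.model)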