# really hard work here to suppress messages
suppressWarnings(library(ggplot2))
suppressPackageStartupMessages(suppressWarnings(library(dplyr)))
suppressPackageStartupMessages(suppressWarnings(library(gridExtra)))
suppressPackageStartupMessages(suppressWarnings(library(GGally)))
suppressPackageStartupMessages(suppressWarnings(library(statsr)))
load("movies.Rdata")
The data set comprises 651 randomly sampled movies produced and released prior to 2016. It contains a variety of information about these movies, some descriptive, such as title or release date, and some post-release, such as ratings and awards.
Generalizability: Since the movies are randomly sampled and independent of one another, the results of this study are generalizable to the population of movies produced and released prior to 2016.
Causality: Since this is not an experiment and no random assignment was performed, causal relations cannot be established based on this study.
IMDB and "Rotten Tomatoes" (ROTM) are two different movie-related websites. While both publish audience ratings for movies, ROTM also provides a separate rating by professional movie critics. Although the number of audience voters on both websites can be quite large, the audience scores for the same movie often differ noticeably between the two sites. This suggests that certain features of the websites or of their rating methodologies cause the discrepancy.
Research question: I will try to predict the rating on IMDB based on several factors, some of which come from ROTM data.
While this doesn't have great practical implications, the investigation will shed some light on the question of "ROTM vs IMDB - which one is more reliable?", which is often debated in social media.
What movie types are present in the data set?
summary(movies$title_type)
Since the data set is dominated by "Feature Film", we'll remove all "Documentary" and "TV Movie" entries to avoid unnecessary variability in the data.
movies = movies %>%
filter(title_type == "Feature Film")
The IMDB rating is given on a 0-10 scale. We'll rescale it to the range 0-100:
movies$imdb_rating = movies$imdb_rating*10
There are three variables for movie scores:
audience_score
: The score the audience gave on ROTM.
critics_score
: The score the critics gave on ROTM.
imdb_rating
: The score the audience gave on IMDB.
We will also be interested in imdb_num_votes - the number of votes on IMDB.
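As a quick sanity check, here is a minimal sketch that glances at these four columns with dplyr's select() and glimpse():
# Quick look at the four variables of interest: types and a few sample values
movies %>%
  select(audience_score, critics_score, imdb_rating, imdb_num_votes) %>%
  glimpse()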
Let us look at the correlation between these.
options(repr.plot.width=6, repr.plot.height=6)
ggpairs(movies, columns = c("audience_score", "critics_score", "imdb_num_votes", "imdb_rating"))
audience_score vs critics_score: these are moderately correlated (corr = 0.675); however, if one looks at the graph, both have a "bump" towards lower scores. This "bump" does not exist for imdb_rating. imdb_rating is strongly correlated with audience_score (corr = 0.849) and also correlated with critics_score (corr = 0.74). imdb_num_votes, on the other hand, is only weakly correlated with the three rating variables.
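To double-check the correlation values quoted above, a minimal sketch that computes the correlation matrix directly (complete observations only):
# Pairwise correlations between the score variables and the vote count;
# these should match the values shown in the ggpairs panel above.
movies %>%
  select(audience_score, critics_score, imdb_num_votes, imdb_rating) %>%
  cor(use = "complete.obs") %>%
  round(3)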
### imdb_rating
Let's plot the distribution of imdb_rating:
options(repr.plot.width=4, repr.plot.height=3)
ggplot(movies, aes(imdb_rating)) + geom_histogram(binwidth=4)
summary(movies$imdb_rating)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.00   58.50   65.00   63.87   71.00   90.00
The distribution is approximately normal with a left skew: the majority of ratings are centered around the median, with a rather thin tail towards the lower values.
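As a simple numeric check of the left skew, the mean should sit below the median; a minimal sketch:
# For a left-skewed distribution the mean is pulled below the median
# by the thin tail of low ratings, so this difference should be negative.
with(movies, mean(imdb_rating) - median(imdb_rating))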
### imdb_rating compared to audience_score and critics_score
Let's look at summary statistics of these three variables:
summary(movies$imdb_rating)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.00   58.50   65.00   63.87   71.00   90.00
summary(movies$audience_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.00   44.50   62.00   60.47   78.00   97.00
summary(movies$critics_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   31.00   57.00   54.78   79.00  100.00
The imdb_rating variable has the highest mean and median values and critics_score has the lowest. However, the 3rd quartile is higher for critics_score and audience_score than for imdb_rating.
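A side-by-side boxplot makes this comparison easier to see; a sketch assuming the tidyr package is installed for reshaping the three score columns into long format:
library(tidyr)  # assumed to be installed; used only for reshaping here
# Compare the distributions of the three score variables side by side
movies %>%
  select(imdb_rating, audience_score, critics_score) %>%
  gather(key = "score_type", value = "score") %>%
  ggplot(aes(x = score_type, y = score)) +
  geom_boxplot() +
  xlab("Score variable") +
  ylab("Score (0-100)")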
I will now attempt to build a linear model to predict imdb_rating based on other variables. The variables that will be considered are: critics_score, audience_score, best_pic_nom, imdb_num_votes and best_dir_win.
The rest of the variables are less likely to have an effect. To select the best predictors, and since the number of candidate variables is quite small, a bidirectional stepwise search (direction = "both") is used - to cover all possibilities.
test.model = lm(data=movies, imdb_rating ~ 1)
final.model = step(test.model,
direction='both',
scope=(~ critics_score +
audience_score +
best_pic_nom +
imdb_num_votes +
best_dir_win ))
Start:  AIC=2786.54
imdb_rating ~ 1

                 Df Sum of Sq   RSS    AIC
+ audience_score  1     47354 18381 2035.4
+ critics_score   1     36042 29692 2318.8
+ imdb_num_votes  1     11078 54657 2679.5
+ best_pic_nom    1      4242 61493 2749.1
+ best_dir_win    1      1986 63749 2770.4
<none>                        65735 2786.5

Step:  AIC=2035.4
imdb_rating ~ audience_score

                 Df Sum of Sq   RSS    AIC
+ critics_score   1      3403 14977 1916.4
+ imdb_num_votes  1       819 17561 2010.5
+ best_dir_win    1       284 18096 2028.2
+ best_pic_nom    1       139 18241 2032.9
<none>                        18381 2035.4
- audience_score  1     47354 65735 2786.5

Step:  AIC=1916.4
imdb_rating ~ audience_score + critics_score

                 Df Sum of Sq   RSS    AIC
+ imdb_num_votes  1     658.2 14319 1891.8
+ best_dir_win    1      97.7 14880 1914.5
<none>                        14977 1916.4
+ best_pic_nom    1      42.2 14935 1916.7
- critics_score   1    3403.2 18381 2035.4
- audience_score  1   14715.0 29692 2318.8

Step:  AIC=1891.84
imdb_rating ~ audience_score + critics_score + imdb_num_votes

                 Df Sum of Sq   RSS    AIC
<none>                        14319 1891.8
+ best_dir_win    1      40.3 14279 1892.2
+ best_pic_nom    1       0.0 14319 1893.8
- imdb_num_votes  1     658.2 14977 1916.4
- critics_score   1    3242.2 17561 2010.5
- audience_score  1   12407.1 26726 2258.7
Analysis: the algorithm chose audience_score, critics_score and imdb_num_votes as the best predictors for imdb_rating. best_pic_nom and best_dir_win were found unimportant in this context.
Let's examine the final model:
summary(final.model)
Call:
lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes,
    data = movies)

Residuals:
     Min       1Q   Median       3Q      Max
-25.4125  -1.8368   0.3983   2.9442  11.6168

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.745e+01  6.642e-01  56.378  < 2e-16 ***
audience_score 3.231e-01  1.433e-02  22.552  < 2e-16 ***
critics_score  1.147e-01  9.945e-03  11.529  < 2e-16 ***
imdb_num_votes 9.749e-06  1.877e-06   5.194 2.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.939 on 587 degrees of freedom
Multiple R-squared:  0.7822,    Adjusted R-squared:  0.7811
F-statistic: 702.6 on 3 and 587 DF,  p-value: < 2.2e-16
Analysis: all three selected predictors are significant and positively associated with imdb_rating, which means that higher scores on Rotten Tomatoes correspond to higher scores on IMDB - which makes sense. The model also predicts that the larger the number of people who voted on the IMDB website, the higher the expected IMDB score - which also makes sense, since good movies are more likely to be popular.
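To quantify "positively associated", we can look at 95% confidence intervals for the coefficients; a minimal sketch using base R:
# 95% confidence intervals for the fitted coefficients; all three predictor
# intervals should lie entirely above zero, consistent with the analysis above.
confint(final.model, level = 0.95)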
Although it came out in 2017, I've tested the model by predicting the IMDB score of "Guardians of the Galaxy Vol. 2". Let's build a data frame for the new observation, with values retrieved from the IMDB and Rotten Tomatoes websites:
new_movie = data.frame(audience_score=86, critics_score=71, imdb_num_votes=125714)
Let's do the prediction:
predict(final.model, new_movie)
predict(final.model, new_movie, interval = "prediction", level = 0.95)
|   | fit      | lwr      | upr      |
|---|----------|----------|----------|
| 1 | 74.59693 | 64.87352 | 84.32034 |
The model predicts the IMDB score of "Guardians of the Galaxy Vol. 2" to be 74.6, with a 95% prediction interval of (64.87, 84.32). Indeed, the observed IMDB score of "Guardians of the Galaxy Vol. 2" is 81, which falls within the predicted interval.
To validate our analysis, let's plot the residuals:
ggplot(data = final.model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
While the residuals are (more or less) normally distributed around 0, they show a funnel shape: the spread becomes smaller for higher fitted values, and several outliers exist in the 40-65 range of fitted values. This means that the predictions are less accurate for lower scores.
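Two additional diagnostic plots can back up the normality claim; a sketch in the same style as the residual plot above:
# Histogram of residuals: should be roughly symmetric and centred on zero
ggplot(data = final.model, aes(x = .resid)) +
  geom_histogram(binwidth = 2) +
  xlab("Residuals")
# Normal probability (Q-Q) plot of residuals: points close to a straight
# line indicate approximate normality
ggplot(data = final.model, aes(sample = .resid)) +
  stat_qq()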
Conclusion: the qualitative insight obtained from the regression model is that IMDB ratings have a more "critical" nature than the audience rating on ROTM - perhaps because the IMDB website does not make the distinction between "critics" and "audience".
There are two important shortcomings to this study. First, most of the data in the movies dataset is very sparse and categorical, such as the names of actors and directors, which makes statistical reasoning very challenging. This could be improved by including more observations in the data set, perhaps also from earlier years. Second (and related to the first shortcoming), the small amount of dense numerical data is correlated, so making good predictions is quite challenging and the model's performance may be a result of over-fitting.
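One way to quantify the collinearity concern is the variance inflation factor; a sketch assuming the car package is available:
library(car)  # assumed to be installed; provides vif()
# Variance inflation factors for the final model; values well above ~5
# would indicate problematic collinearity between the predictors.
vif(final.model)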