In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence to suggest that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in this article.
Fandango displays a 5-star rating system on their website, where the minimum rating is 0 stars and the maximum is 5 stars.
Hickey found that there's a significant discrepancy between the number of stars displayed to users and the actual rating, which he was able to find in the HTML of the page. He was able to find that:
From the image we can clearly see that:
Fandango's officials replied that the biased rounding off was caused by a bug in their system rather than being intentional, and they promised to fix the bug as soon as possible. Presumably, this has already happened, although we can't tell for sure since the actual rating value doesn't seem to be displayed anymore in the pages' HTML.
Our main goal of this project is to analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.
To analyze whether the ratings have been corrected, we need two datasets:
Walt Hickey made the data he analyzed publicly available on GitHub. We'll use the data he collected to analyze the characteristics of Fandango's rating system prior to his analysis.
The data collected after Hickey's analysis, which was gathered by one of Dataquest's team members and is publicly available on GitHub.
Hickey's dataset contains every film that has a Rotten Tomatoes rating, a RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango. The data from Fandango was pulled on Aug. 24, 2015.
## Importing all the required libraries for our analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from numpy import arange
hickey_main = pd.read_csv('fandango_score_comparison.csv')
print(hickey_main.columns)
print('The shape of the dataset: ', hickey_main.shape)
hickey_main.head()
Index(['FILM', 'RottenTomatoes', 'RottenTomatoes_User', 'Metacritic', 'Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue', 'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom', 'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round', 'Metacritic_norm_round', 'Metacritic_user_norm_round', 'IMDB_norm_round', 'Metacritic_user_vote_count', 'IMDB_user_vote_count', 'Fandango_votes', 'Fandango_Difference'], dtype='object')
The shape of the dataset: (146, 22)
FILM | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | ... | IMDB_norm | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Avengers: Age of Ultron (2015) | 74 | 86 | 66 | 7.1 | 7.8 | 5.0 | 4.5 | 3.70 | 4.3 | ... | 3.90 | 3.5 | 4.5 | 3.5 | 3.5 | 4.0 | 1330 | 271107 | 14846 | 0.5 |
1 | Cinderella (2015) | 85 | 80 | 67 | 7.5 | 7.1 | 5.0 | 4.5 | 4.25 | 4.0 | ... | 3.55 | 4.5 | 4.0 | 3.5 | 4.0 | 3.5 | 249 | 65709 | 12640 | 0.5 |
2 | Ant-Man (2015) | 80 | 90 | 64 | 8.1 | 7.8 | 5.0 | 4.5 | 4.00 | 4.5 | ... | 3.90 | 4.0 | 4.5 | 3.0 | 4.0 | 4.0 | 627 | 103660 | 12055 | 0.5 |
3 | Do You Believe? (2015) | 18 | 84 | 22 | 4.7 | 5.4 | 5.0 | 4.5 | 0.90 | 4.2 | ... | 2.70 | 1.0 | 4.0 | 1.0 | 2.5 | 2.5 | 31 | 3136 | 1793 | 0.5 |
4 | Hot Tub Time Machine 2 (2015) | 14 | 28 | 29 | 3.4 | 5.1 | 3.5 | 3.0 | 0.70 | 1.4 | ... | 2.55 | 0.5 | 1.5 | 1.5 | 1.5 | 2.5 | 88 | 19560 | 1021 | 0.5 |
5 rows × 22 columns
Let's see the columns the dataset contains:
Column | Definition |
---|---|
FILM | The film in question. |
RottenTomatoes | The Rotten Tomatoes Tomatometer score for the film |
RottenTomatoes_User | The Rotten Tomatoes user score for the film |
Metacritic | The Metacritic critic score for the film |
Metacritic_User | The Metacritic user score for the film |
IMDB | The IMDb user score for the film |
Fandango_Stars | The number of stars the film had on its Fandango movie page |
Fandango_Ratingvalue | The Fandango ratingValue for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained. |
RT_norm | The Rotten Tomatoes Tomatometer score for the film, normalized to a 0 to 5 point system |
RT_user_norm | The Rotten Tomatoes user score for the film, normalized to a 0 to 5 point system |
Metacritic_norm | The Metacritic critic score for the film, normalized to a 0 to 5 point system |
Metacritic_user_nom | The Metacritic user score for the film, normalized to a 0 to 5 point system |
IMDB_norm | The IMDb user score for the film, normalized to a 0 to 5 point system |
RT_norm_round | The Rotten Tomatoes Tomatometer score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star |
RT_user_norm_round | The Rotten Tomatoes user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star |
Metacritic_norm_round | The Metacritic critic score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star |
Metacritic_user_norm_round | The Metacritic user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star |
IMDB_norm_round | The IMDb user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star |
Metacritic_user_vote_count | The number of user votes the film had on Metacritic |
IMDB_user_vote_count | The number of user votes the film had on IMDb |
Fandango_votes | The number of user votes the film had on Fandango |
Fandango_Difference | The difference between the presented Fandango_Stars and the actual Fandango_Ratingvalue |
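To make the last column concrete, here is a minimal sketch in plain Python that recomputes Fandango_Difference for three of the rows shown above and compares the displayed stars with what rounding to the nearest half-star would give. The helper names `inflation` and `nearest_half_star` are illustrative and not part of the dataset or of Hickey's code.

```python
import math

# (film, Fandango_Stars, Fandango_Ratingvalue) copied from the rows shown above
rows = [
    ("Avengers: Age of Ultron (2015)", 5.0, 4.5),
    ("Cinderella (2015)", 5.0, 4.5),
    ("Hot Tub Time Machine 2 (2015)", 3.5, 3.0),
]

def inflation(stars, rating_value):
    """Displayed stars minus the value found in the HTML (Fandango_Difference)."""
    return round(stars - rating_value, 2)

def nearest_half_star(value):
    """Hypothetical 'fair' display: round half up to the nearest 0.5."""
    return math.floor(value * 2 + 0.5) / 2

differences = [inflation(stars, value) for _, stars, value in rows]
fair_stars = [nearest_half_star(value) for _, _, value in rows]

print(differences)  # gap between displayed stars and the HTML value
print(fair_stars)   # what nearest-half-star rounding would have displayed
```

For these three rows the displayed stars sit half a star above the value that nearest-half rounding would produce, which is the kind of gap the Fandango_Difference column records.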
For our analysis we only need FILM, Fandango_Stars, Fandango_Ratingvalue, Fandango_votes and Fandango_Difference. (The code below also keeps RottenTomatoes and IMDB, which are used later to compare Fandango with other websites.)
The FILM column contains the name as well as the year the film was released. Since this analysis depends on the time difference between the datasets, let's separate the year into its own column for ease of use:
# editing our dataframe to only contain columns that we need:
hickey = hickey_main[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference','RottenTomatoes','IMDB']]
# extracting the year from the name of the first movie in the dataset
name = 'Avengers: Age of Ultron (2015)'
print(name[-5:-1])
print(name[:-6].strip()) # slicing leaves the space before '(', so strip it
# creating a new column to store the values:
hickey = hickey.copy()
hickey['Year'] = hickey['FILM'].str[-5:-1]
hickey['FILM'] = hickey['FILM'].str[:-6].str.strip() # drop ' (YYYY)' including the trailing space
hickey['Year'].head()
2015
Avengers: Age of Ultron
0    2015
1    2015
2    2015
3    2015
4    2015
Name: Year, dtype: object
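Fixed-position slicing assumes that every title ends in a space followed by a four-digit year in parentheses. A regular expression states that assumption explicitly and lets us notice titles that do not follow it. The sketch below uses only the standard library; `split_title_year` and `TITLE_YEAR` are illustrative names, not something the dataset or pandas provides.

```python
import re

# A title followed by a four-digit year in parentheses at the end of the string.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(film):
    """Return (title, year) or (title, None) when no trailing year is found."""
    match = TITLE_YEAR.match(film)
    if match is None:
        return film.strip(), None
    return match.group("title"), match.group("year")

examples = [
    "Avengers: Age of Ultron (2015)",
    "Do You Believe? (2015)",
    "A Title Without A Year",  # made-up string to show the fallback
]
parsed = [split_title_year(film) for film in examples]
for title, year in parsed:
    print(repr(title), year)
```

In pandas, the same pattern could be passed to `Series.str.extract`, which returns one column per named group; whether that is preferable to slicing here is a matter of taste, since every title in this dataset appears to follow the same layout.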
The second dataset's README describes it as follows: **It contains movie ratings data for 214 of the most popular movies (with a significant number of votes) released in 2016 and 2017. As of March 22, 2017, the ratings were up to date. Significant changes should be expected mostly for movies released in 2017.**
p_hickey = pd.read_csv('movie_ratings_16_17.csv')
print(p_hickey.columns)
print('The shape of the dataset: ', p_hickey.shape)
p_hickey.head()
Index(['movie', 'year', 'metascore', 'imdb', 'tmeter', 'audience', 'fandango', 'n_metascore', 'n_imdb', 'n_tmeter', 'n_audience', 'nr_metascore', 'nr_imdb', 'nr_tmeter', 'nr_audience'], dtype='object')
The shape of the dataset: (214, 15)
movie | year | metascore | imdb | tmeter | audience | fandango | n_metascore | n_imdb | n_tmeter | n_audience | nr_metascore | nr_imdb | nr_tmeter | nr_audience | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10 Cloverfield Lane | 2016 | 76 | 7.2 | 90 | 79 | 3.5 | 3.80 | 3.60 | 4.50 | 3.95 | 4.0 | 3.5 | 4.5 | 4.0 |
1 | 13 Hours | 2016 | 48 | 7.3 | 50 | 83 | 4.5 | 2.40 | 3.65 | 2.50 | 4.15 | 2.5 | 3.5 | 2.5 | 4.0 |
2 | A Cure for Wellness | 2016 | 47 | 6.6 | 40 | 47 | 3.0 | 2.35 | 3.30 | 2.00 | 2.35 | 2.5 | 3.5 | 2.0 | 2.5 |
3 | A Dog's Purpose | 2017 | 43 | 5.2 | 33 | 76 | 4.5 | 2.15 | 2.60 | 1.65 | 3.80 | 2.0 | 2.5 | 1.5 | 4.0 |
4 | A Hologram for the King | 2016 | 58 | 6.1 | 70 | 57 | 3.0 | 2.90 | 3.05 | 3.50 | 2.85 | 3.0 | 3.0 | 3.5 | 3.0 |
Let's see the columns in the dataset:
Column | Description |
---|---|
movie | the name of the movie |
year | the release year of the movie |
metascore | the Metacritic rating of the movie (the "metascore" - critic score) |
imdb | the IMDB rating of the movie (user score) |
tmeter | the Rotten Tomatoes rating of the movie (the "tomatometer" - critic score) |
audience | the Rotten Tomatoes rating of the movie (user score) |
fandango | the Fandango rating of the movie (user score) |
n_metascore | the metascore normalized to a 0-5 scale |
n_imdb | the IMDB rating normalized to a 0-5 scale |
n_tmeter | the tomatometer normalized to a 0-5 scale |
n_audience | the Rotten Tomatoes user score normalized to a 0-5 scale |
nr_metascore | the metascore normalized to a 0-5 scale and rounded to the nearest 0.5 |
nr_imdb | the IMDB rating normalized to a 0-5 scale and rounded to the nearest 0.5 |
nr_tmeter | the tomatometer normalized to a 0-5 scale and rounded to the nearest 0.5 |
nr_audience | the Rotten Tomatoes user score normalized to a 0-5 scale and rounded to the nearest 0.5 |
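For our analysis we only need movie, year and fandango from the dataset. (The code below also keeps tmeter and imdb, which are used later to compare Fandango with other websites.)
# editing our dataframe to only contain columns that we need: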
p_hickey = p_hickey[['movie', 'year', 'fandango','tmeter','imdb']]
Our goal: to determine whether there has been any change in Fandango's rating system after Hickey's analysis.
Going through the README files of both datasets, we can see that:
The two datasets don't contain the same movies sampled at different times; rather, they contain different movies sampled at different times, with:
Since the datasets were captured at different times, let's check whether they overlap by looking at the release years of the movies in each:
# for the dataset used by hickey:
hickey['Year'].value_counts()
2015    129
2014     17
Name: Year, dtype: int64
# for the dataset used after:
p_hickey['year'].value_counts()
2016    191
2017     23
Name: year, dtype: int64
We can see that the datasets don't capture releases from the same years. The goal of our analysis is to see the changes made to the Fandango rating system. Since the datasets do not contain information about the same movies, it is evident that:
Now, to match the needs of our analysis, we have two options:
Tweaking our goal is a much faster choice than collecting new data. Also, it's nearly impossible to collect a new sample from before Hickey's analysis at this point in time.
To determine whether there has been any change in Fandango's rating system after Hickey's analysis *by using a system of Aggregates.*
We know that the datasets do not contain information about the same movies. By using aggregates, we will look at the bigger picture, i.e. the overall ratings of popular movies in both datasets, to see if there has been any shift in Fandango's star rating system.
Before we continue with our analysis, it is important to recognize that popularity is a relative term. For our analysis we will use the top-voted movies in both datasets: following Hickey's benchmark, we count a movie as popular only if it has 30 fan ratings or more on Fandango's website.
For the dataset containing information about the movies released after Hickey's analysis, the README states that it contains movie ratings data for 214 of the most popular movies (with a significant number of votes).
Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical and ask whether this sample is truly representative and contains popular movies (movies with 30 fan ratings or more).
One quick way to check how representative this sample is, is to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 of the 10 movies should have 30 fan ratings or more.
p_hickey.sample(10, random_state=1)
movie | year | fandango | tmeter | imdb | |
---|---|---|---|---|---|
108 | Mechanic: Resurrection | 2016 | 4.0 | 29 | 5.6 |
206 | Warcraft | 2016 | 4.0 | 28 | 7.0 |
106 | Max Steel | 2016 | 3.5 | 0 | 4.6 |
107 | Me Before You | 2016 | 4.5 | 58 | 7.4 |
51 | Fantastic Beasts and Where to Find Them | 2016 | 4.5 | 73 | 7.5 |
33 | Cell | 2016 | 3.0 | 11 | 4.3 |
59 | Genius | 2016 | 3.5 | 51 | 6.5 |
152 | Sully | 2016 | 4.5 | 85 | 7.5 |
4 | A Hologram for the King | 2016 | 3.0 | 70 | 6.1 |
31 | Captain America: Civil War | 2016 | 4.5 | 90 | 7.9 |
Above we used a value of 1 as the random seed. This is good practice because it suggests that we weren't trying out various random seeds just to get a favorable sample.
As of April 2018, these are the fan ratings we found:
Movie | Fan ratings |
---|---|
Mechanic: Resurrection | 2247 |
Warcraft | 7271 |
Max Steel | 493 |
Me Before You | 5263 |
Fantastic Beasts and Where to Find Them | 13400 |
Cell | 17 |
Genius | 127 |
Sully | 11877 |
A Hologram for the King | 500 |
Captain America: Civil War | 35057 |
90% of the movies in our sample are popular (only Cell, with 17 fan ratings, falls below the threshold), which satisfies our condition. So let's move further with the analysis and check the number of fan ratings in Hickey's dataset.
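The 90% figure can be reproduced from the fan-rating counts in the table above. This is a small standard-library sketch; the threshold of 30 is Hickey's benchmark, while the variable names are only illustrative.

```python
# Fan-rating counts copied from the table above (as of April 2018)
fan_ratings = {
    "Mechanic: Resurrection": 2247,
    "Warcraft": 7271,
    "Max Steel": 493,
    "Me Before You": 5263,
    "Fantastic Beasts and Where to Find Them": 13400,
    "Cell": 17,
    "Genius": 127,
    "Sully": 11877,
    "A Hologram for the King": 500,
    "Captain America: Civil War": 35057,
}

THRESHOLD = 30  # Hickey's benchmark for a "popular" movie

popular = {movie for movie, count in fan_ratings.items() if count >= THRESHOLD}
not_popular = sorted(set(fan_ratings) - popular)
share_popular = len(popular) / len(fan_ratings)

print(f"{share_popular:.0%} popular; below threshold: {not_popular}")
```

Nine of the ten sampled movies clear the threshold, which is above the 8-out-of-10 bar we set for ourselves. A sample of ten is small, so this is a sanity check rather than proof that the whole dataset is representative.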
sum(hickey['Fandango_votes'] < 30)
0
Hence, we can be reasonably confident that the movies in both datasets are popular.
Earlier, we saw that across the datasets we have information about the movie ratings for 4 years, namely, 2014, 2015, 2016 and 2017. If there has been a change in the overall rating system on Fandango, we would notice a big shift in the rating values between the years of 2015 and 2016.
Interestingly, having data for 4 years, we can quantitatively verify the shift by plotting the estimates year by year.
From the data provided in the datasets, we can take two approaches to quantify the shift in the ratings:
Let's start with the first step of our analysis:
# plotting kernel density plots for better visualisation of Hickey's and post-hickey's dataset :
plt.style.use('fivethirtyeight')
hickey['Fandango_Stars'].plot.kde(label = '2014-2015', legend = True, figsize = (10,6))
p_hickey['fandango'].plot.kde(label = '2016-2017', legend = True)
plt.title("Distribution of Fandango's ratings \n using Hickey's and post-Hickey datasets",
y = 1.07) # the `y` parameter pads the title upward
plt.xlabel('Stars')
plt.xlim(0,5) # because ratings start at 0 and end at 5
plt.xticks(arange(0,5.1,.5))
# printing the mean rating value during 2014-2015 for easy reference :
plt.axvline(hickey['Fandango_Stars'].mean(), label='Hickey_mean', color = 'blue')
print('\033[1m' + "The mean ratings during 2014-2015 in hickey's dataset is : " + str(np.round(hickey['Fandango_Stars'].mean(),2)) + '\033[0m')
# printing the mean rating value during 2016-2017 for easy reference:
plt.axvline(p_hickey['fandango'].mean(), label='Post_Hickey_mean', color = 'red')
print('\033[1m' + "The mean ratings during 2016-2017 in post hickey's dataset is : " + str(np.round(p_hickey['fandango'].mean(),2)) + '\033[0m')
plt.legend(framealpha = 0, loc = 'upper center')
plt.show()
The mean ratings during 2014-2015 in hickey's dataset is : 4.09
The mean ratings during 2016-2017 in post hickey's dataset is : 3.89
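Because the two samples have different sizes (146 and 214 movies), raw counts per star value are not directly comparable, but percentages are. The sketch below shows the idea with `collections.Counter`. Note that the two short lists are only the first five Fandango values visible in the previews of each dataset in this notebook, so the resulting numbers illustrate the mechanics and say nothing about the full distributions.

```python
from collections import Counter

# First five Fandango star values from each dataset preview (NOT the full data)
hickey_head = [5.0, 5.0, 5.0, 5.0, 3.5]
post_hickey_head = [3.5, 4.5, 3.0, 4.5, 3.0]

def percentage_distribution(ratings):
    """Share of each star value, in percent, ordered by star value."""
    counts = Counter(ratings)
    total = len(ratings)
    return {star: round(100 * count / total, 1) for star, count in sorted(counts.items())}

hickey_pct = percentage_distribution(hickey_head)
post_pct = percentage_distribution(post_hickey_head)

print(hickey_pct)
print(post_pct)
```

On the real DataFrames, `value_counts(normalize=True)` on the `Fandango_Stars` and `fandango` columns should give the same kind of table, which can then be compared row by row alongside the density plot.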
Now that we have seen an overall shift in the ratings of movies on the Fandango website, let's use **summary statistics to get a more precise picture of the direction of the difference**, i.e. how the ratings shifted from 2015 to 2016 (Hickey's dataset vs. the post-Hickey dataset):
p_hickey.head()
movie | year | fandango | tmeter | imdb | |
---|---|---|---|---|---|
0 | 10 Cloverfield Lane | 2016 | 3.5 | 90 | 7.2 |
1 | 13 Hours | 2016 | 4.5 | 50 | 7.3 |
2 | A Cure for Wellness | 2016 | 3.0 | 40 | 6.6 |
3 | A Dog's Purpose | 2017 | 4.5 | 33 | 5.2 |
4 | A Hologram for the King | 2016 | 3.0 | 70 | 6.1 |
# calculating the mean,median and mode, and using a grouped bar plot for better visualisation of direction of change :
## mean
hickey_2015_mean = np.round(hickey['Fandango_Stars'][hickey.Year == '2015'].mean(),2)
p_hickey['year'] = p_hickey['year'].astype('str')
p_hickey_2016_mean = np.round(p_hickey['fandango'][p_hickey.year == '2016'].mean(),2)
## median
hickey_2015_median = np.round(hickey['Fandango_Stars'][hickey.Year == '2015'].median(),2)
p_hickey_2016_median = np.round(p_hickey['fandango'][p_hickey.year == '2016'].median(),2)
## mode
hickey_2015_mode = hickey['Fandango_Stars'][hickey.Year == '2015'].mode()[0]
p_hickey_2016_mode = p_hickey['fandango'][p_hickey.year == '2016'].mode()[0]
# Summary metrics:
metrics_df = pd.DataFrame()
metrics_df['2015'] = [hickey_2015_mean,hickey_2015_median,hickey_2015_mode]
metrics_df['2016'] = [p_hickey_2016_mean,p_hickey_2016_median,p_hickey_2016_mode]
metrics_df.index = ['Mean','Median','Mode']
metrics_df['Difference in ratings'] = metrics_df['2016'] - metrics_df['2015']
metrics_df
2015 | 2016 | Difference in ratings | |
---|---|---|---|
Mean | 4.09 | 3.89 | -0.2 |
Median | 4.00 | 4.00 | 0.0 |
Mode | 4.50 | 4.00 | -0.5 |
# plotting the data:
metrics_df.plot.bar(figsize = (8,5))
plt.legend(framealpha = 0, loc = 'upper center')
plt.ylabel('Fandango Stars')
plt.title('Summary Statistics : 2015 vs 2016')
plt.show()
Although our analysis suggests that Fandango's overall ratings have changed since Hickey's analysis, there are major underlying problems with our approach, including (but not limited to):
# During Hickey's analysis :
hickey.head()
FILM | Fandango_Stars | Fandango_Ratingvalue | Fandango_votes | Fandango_Difference | RottenTomatoes | IMDB | Year | |
---|---|---|---|---|---|---|---|---|
0 | Avengers: Age of Ultron | 5.0 | 4.5 | 14846 | 0.5 | 74 | 7.8 | 2015 |
1 | Cinderella | 5.0 | 4.5 | 12640 | 0.5 | 85 | 7.1 | 2015 |
2 | Ant-Man | 5.0 | 4.5 | 12055 | 0.5 | 80 | 7.8 | 2015 |
3 | Do You Believe? | 5.0 | 4.5 | 1793 | 0.5 | 18 | 5.4 | 2015 |
4 | Hot Tub Time Machine 2 | 3.5 | 3.0 | 1021 | 0.5 | 14 | 5.1 | 2015 |
# after hickey's analysis:
p_hickey.head()
movie | year | fandango | tmeter | imdb | |
---|---|---|---|---|---|
0 | 10 Cloverfield Lane | 2016 | 3.5 | 90 | 7.2 |
1 | 13 Hours | 2016 | 4.5 | 50 | 7.3 |
2 | A Cure for Wellness | 2016 | 3.0 | 40 | 6.6 |
3 | A Dog's Purpose | 2017 | 4.5 | 33 | 5.2 |
4 | A Hologram for the King | 2016 | 3.0 | 70 | 6.1 |
We need to normalize Rotten Tomatoes' and IMDB's ratings to the same scale as Fandango_Stars (i.e. out of 5, in half-star steps).
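One detail worth checking in this conversion is how exact halves are rounded. Python 3's built-in `round` rounds ties to the nearest even integer, so a normalized score of 4.25 becomes 4.0 rather than 4.5. The sketch below contrasts the built-in behaviour with a round-half-up variant; both helper names are illustrative. The assumption that the dataset's own rounded columns were produced with half-up rounding is inferred from the Cinderella row (RT_norm 4.25, RT_norm_round 4.5) and is not stated in the README.

```python
import math

def to_half_star_builtin(score, scale_max):
    """Rescale to 0-5 and round with Python's built-in round (ties go to even)."""
    return round(((score / scale_max) * 5) * 2) / 2

def to_half_star_half_up(score, scale_max):
    """Rescale to 0-5 and round ties upward to the next half star."""
    return math.floor(((score / scale_max) * 5) * 2 + 0.5) / 2

# Cinderella: Tomatometer 85 -> 4.25 on a 0-5 scale, an exact tie between 4.0 and 4.5
cinderella_builtin = to_half_star_builtin(85, 100)
cinderella_half_up = to_half_star_half_up(85, 100)

# Avengers: Tomatometer 74 -> 3.7, no tie, so both methods agree
avengers_builtin = to_half_star_builtin(74, 100)
avengers_half_up = to_half_star_half_up(74, 100)

print(cinderella_builtin, cinderella_half_up)
print(avengers_builtin, avengers_half_up)
```

Only scores that land exactly on a quarter-star boundary are affected, so the effect on the averages is probably small, but it may explain minor differences between converted values and the pre-rounded columns shipped with the datasets.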
# converting hickey's ratings :
hickey = hickey.copy()
hickey['RottenTomatoes_conversion'] = hickey['RottenTomatoes'].apply(lambda x: round(((x/100)*5)*2)/2)
hickey['IMDB_conversion'] = hickey['IMDB'].apply(lambda x: round(((x/10)*5)*2)/2)
# difference in ratings during Hickey's analysis :
hickey['fan_rot_difference_hickey']= hickey['Fandango_Stars'] - hickey['RottenTomatoes_conversion']
hickey['fan_imbd_difference_hickey']= hickey['Fandango_Stars'] - hickey['IMDB_conversion']
# converting p_hickey's ratings:
p_hickey = p_hickey.copy()
p_hickey['tmeter_conversion'] = p_hickey['tmeter'].apply(lambda x: round(((x/100)*5)*2)/2)
p_hickey['imdb_conversion'] = p_hickey['imdb'].apply(lambda x: round(((x/10)*5)*2)/2)
# difference in ratings post Hickey's analysis :
p_hickey['fan_rot_difference_phickey']= p_hickey['fandango'] - p_hickey['tmeter_conversion']
p_hickey['fan_imbd_difference_phickey']= p_hickey['fandango'] - p_hickey['imdb_conversion']
# average difference in ratings :
hickey_rot_mean = np.round(hickey['fan_rot_difference_hickey'].mean(),2)
hickey_imdb_mean = np.round(hickey['fan_imbd_difference_hickey'].mean(),2)
p_hickey_rot_mean = np.round(p_hickey['fan_rot_difference_phickey'].mean(),2)
p_hickey_imdb_mean = np.round(p_hickey['fan_imbd_difference_phickey'].mean(),2)
# creating the dataframe to hold the difference values :
final_df = pd.DataFrame()
final_df['Hickey'] = [hickey_rot_mean, hickey_imdb_mean]
final_df['Post_hickey'] = [p_hickey_rot_mean, p_hickey_imdb_mean]
final_df.index = ['Fandango_Rotten_Tomatoes_difference', 'Fandango_IMDB_difference']
final_df['Difference_between_datasets'] = final_df['Hickey'] - final_df['Post_hickey']
final_df
Hickey | Post_hickey | Difference_between_datasets | |
---|---|---|---|
Fandango_Rotten_Tomatoes_difference | 1.04 | 1.20 | -0.16 |
Fandango_IMDB_difference | 0.72 | 0.68 | 0.04 |
# plotting the data:
final_df.plot.bar(figsize = (8,5))
plt.legend(framealpha = 0, loc = 'upper center')
plt.ylabel('Fandango Stars')
plt.title("Mean Rating Difference: Fandango vs Other Websites")
plt.show()
To understand these findings better, we need to plot the rating distributions and mean ratings of the three websites (i.e. Fandango, Rotten Tomatoes and IMDB).
# plotting kernel density plots for better visualisation of Hickey's dataset :
plt.style.use('fivethirtyeight')
hickey['Fandango_Stars'].plot.kde(label = 'Fandango_ratings', legend = True, figsize = (10,6))
hickey['RottenTomatoes_conversion'].plot.kde(label = 'Rotten_tomatoes_ratings', legend = True, figsize = (10,6))
hickey['IMDB_conversion'].plot.kde(label = 'IMDB_ratings', legend = True, figsize = (10,6))
plt.title("Distribution of ratings across websites \n using Hickey's dataset",
y = 1.07)
plt.xlabel('Stars')
plt.xlim(0,5)
plt.xticks(arange(0,5.1,.5))
# visualising the mean rating value for different websites for easy reference :
plt.axvline(hickey['Fandango_Stars'].mean(), label='Fandango_mean', color = 'blue')
plt.axvline(hickey['RottenTomatoes_conversion'].mean(), label='Rotten_mean', color = 'red')
plt.axvline(hickey['IMDB_conversion'].mean(), label='IMDB_mean', color = 'yellow')
plt.legend(framealpha = 0, loc = 'upper left')
plt.show()
# plotting kernel density plots for better visualisation of post-hickey's dataset :
plt.style.use('fivethirtyeight')
p_hickey['fandango'].plot.kde(label = 'Fandango_ratings', legend = True, figsize = (10,6))
p_hickey['tmeter_conversion'].plot.kde(label = 'Rotten_ratings', legend = True)
p_hickey['imdb_conversion'].plot.kde(label = 'IMDB_ratings', legend = True)
plt.title("Distribution of ratings across websites \n using the post-Hickey dataset",
y = 1.07)
plt.xlabel('Stars')
plt.xlim(0,5)
plt.xticks(arange(0,5.1,.5))
# visualising the mean rating value for different websites for easy reference :
plt.axvline(p_hickey['fandango'].mean(), label='Fandango_mean', color = 'blue')
plt.axvline(p_hickey['tmeter_conversion'].mean(), label='Rotten_mean', color = 'red')
plt.axvline(p_hickey['imdb_conversion'].mean(), label='IMDB_mean', color = 'yellow')
plt.legend(framealpha = 0, loc = 'upper left')
plt.show()
# mean difference dataframe between the three websites:
mean_df = pd.DataFrame()
# hickey's analysis :
h_fandango_mean = np.round(hickey['Fandango_Stars'].mean(),2)
h_rotten_mean = np.round(hickey['RottenTomatoes_conversion'].mean(),2)
h_imdb_mean = np.round(hickey['IMDB_conversion'].mean(),2)
# p_hickey's analysis :
p_fandango_mean = np.round(p_hickey['fandango'].mean(),2)
p_rotten_mean = np.round(p_hickey['tmeter_conversion'].mean(),2)
p_imdb_mean = np.round(p_hickey['imdb_conversion'].mean(),2)
mean_df['Hickey'] = [h_fandango_mean,h_rotten_mean,h_imdb_mean]
mean_df['Post_Hickey'] = [p_fandango_mean,p_rotten_mean,p_imdb_mean]
mean_df.index = ['Fandango','Rotten_Tomatoes','IMDB']
mean_df['Ratings_Difference'] = mean_df['Hickey'] - mean_df['Post_Hickey']
mean_df['Relative_Ratings(%)'] = np.round(((mean_df['Hickey'] - mean_df['Post_Hickey'])*100/mean_df['Hickey']),2)
mean_df
Hickey | Post_Hickey | Ratings_Difference | Relative_Ratings(%) | |
---|---|---|---|---|
Fandango | 4.09 | 3.89 | 0.20 | 4.89 |
Rotten_Tomatoes | 3.04 | 2.69 | 0.35 | 11.51 |
IMDB | 3.37 | 3.21 | 0.16 | 4.75 |
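The relative change in the last column is the drop in the mean divided by the earlier mean, expressed as a percentage. The following sketch recomputes it from the rounded means in the table above, using plain Python; `relative_drop` is an illustrative helper name.

```python
# (mean in Hickey's dataset, mean in the post-Hickey dataset), from the table above
means = {
    "Fandango": (4.09, 3.89),
    "Rotten_Tomatoes": (3.04, 2.69),
    "IMDB": (3.37, 3.21),
}

def relative_drop(before, after):
    """Percentage decrease relative to the earlier mean."""
    return round((before - after) * 100 / before, 2)

drops = {site: relative_drop(before, after) for site, (before, after) in means.items()}
largest = max(drops, key=drops.get)

for site, drop in drops.items():
    print(f"{site}: {drop}%")
print("Largest relative drop:", largest)
```

Seen this way, Fandango's decrease is of about the same relative size as IMDB's and less than half of Rotten Tomatoes', which is why the drop in Fandango's mean alone is weak evidence of a corrected rating system.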
Since the same movies are rated by all three websites within each dataset, we can now interpret the figures obtained earlier.
Analysis 1:
Analysis 2:
In conclusion, although the average ratings of the movies fell on Fandango's website, the ratings on other websites (especially Rotten Tomatoes) fell even further, indicating that the rating system on Fandango's website might not have been corrected.