Investigating Fandango Movie Ratings

Problem Description:

In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence to suggest that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in this article.

Fandango displays a 5-star rating system on their website, where the minimum rating is 0 stars and the maximum is 5 stars.

Hickey found a significant discrepancy between the number of stars displayed to users and the actual rating, which he was able to find in the HTML of the page. He found that:

- The actual rating was almost always rounded up to the nearest half-star. For instance, a 4.1 movie would be rounded up to 4.5 stars, not down to 4 stars, as you might expect.
- For 8% of the ratings analyzed, the rounding up was done to the nearest whole star. For instance, a 4.5 rating would be rounded up to 5 stars.
- For one movie rating, the rounding was completely bizarre: from a rating of 4 in the HTML of the page to a displayed rating of 5 stars.
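To make the first finding concrete, here is a minimal sketch contrasting "always round up to the next half-star" with ordinary rounding to the nearest half-star. The function names are hypothetical and only illustrate the arithmetic; this is not Fandango's actual code.

```python
import math

def round_up_to_half_star(rating):
    """Round a rating up to the next half-star (the behaviour Hickey reported)."""
    return math.ceil(rating * 2) / 2

def round_to_nearest_half_star(rating):
    """Round a rating to the nearest half-star (what a reader would expect)."""
    return math.floor(rating * 2 + 0.5) / 2

# Compare the two rules on the example from the text
for rating in (4.1, 4.5):
    print(rating, round_up_to_half_star(rating), round_to_nearest_half_star(rating))
```

For a 4.1 rating the first rule gives 4.5 stars while the second gives 4.0, which is the half-star inflation described above.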

From Hickey's plot of the two distributions (displayed stars versus actual ratings), we can clearly see that:

- Both distributions are strongly left-skewed, suggesting that movie ratings on Fandango are generally high or very high.
- There is no rating under 2 stars in the sample Hickey analyzed.
- The distribution of displayed ratings is clearly shifted to the right compared to the actual rating distribution, strongly suggesting that Fandango inflates the ratings under the hood.

AIM:

Fandango's officials replied that the biased rounding off was caused by a bug in their system rather than being intentional, and they promised to fix the bug as soon as possible. Presumably, this has already happened, although we can't tell for sure since the actual rating value doesn't seem to be displayed anymore in the pages' HTML.

The main goal of this project is to analyze more recent movie ratings data to determine whether Fandango's rating system changed after Hickey's analysis.

Dataset Description:

To analyse whether the ratings have been corrected, we need two datasets:

- Walt Hickey made the data he analyzed publicly available on GitHub. We'll use the data he collected to analyze the characteristics of Fandango's rating system before his analysis.
- The data from after Hickey's analysis, which was collected by one of Dataquest's team members and is also publicly available on GitHub.

1. Hickey's Dataset:

It contains every film that has a Rotten Tomatoes rating, an RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango. The data from Fandango was pulled on Aug. 24, 2015.

In [1]:

## Importing all the required libraries for our analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from numpy import arange

In [2]:

hickey_main = pd.read_csv('fandango_score_comparison.csv')
print(hickey_main.columns)
print('The shape of the dataset: ', hickey_main.shape)
hickey_main.head()

Index(['FILM', 'RottenTomatoes', 'RottenTomatoes_User', 'Metacritic',
       'Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Metacritic_user_vote_count', 'IMDB_user_vote_count',
       'Fandango_votes', 'Fandango_Difference'],
      dtype='object')
The shape of the dataset:  (146, 22)

Out[2]:

	FILM	RottenTomatoes	RottenTomatoes_User	Metacritic	Metacritic_User	IMDB	Fandango_Stars	Fandango_Ratingvalue	RT_norm	RT_user_norm	...	IMDB_norm	RT_norm_round	RT_user_norm_round	Metacritic_norm_round	Metacritic_user_norm_round	IMDB_norm_round	Metacritic_user_vote_count	IMDB_user_vote_count	Fandango_votes	Fandango_Difference
0	Avengers: Age of Ultron (2015)	74	86	66	7.1	7.8	5.0	4.5	3.70	4.3	...	3.90	3.5	4.5	3.5	3.5	4.0	1330	271107	14846	0.5
1	Cinderella (2015)	85	80	67	7.5	7.1	5.0	4.5	4.25	4.0	...	3.55	4.5	4.0	3.5	4.0	3.5	249	65709	12640	0.5
2	Ant-Man (2015)	80	90	64	8.1	7.8	5.0	4.5	4.00	4.5	...	3.90	4.0	4.5	3.0	4.0	4.0	627	103660	12055	0.5
3	Do You Believe? (2015)	18	84	22	4.7	5.4	5.0	4.5	0.90	4.2	...	2.70	1.0	4.0	1.0	2.5	2.5	31	3136	1793	0.5
4	Hot Tub Time Machine 2 (2015)	14	28	29	3.4	5.1	3.5	3.0	0.70	1.4	...	2.55	0.5	1.5	1.5	1.5	2.5	88	19560	1021	0.5

5 rows × 22 columns

Let's see the columns the dataset contains :

Column	Definition
`FILM`	The film in question.
`RottenTomatoes`	The Rotten Tomatoes Tomatometer score for the film
`RottenTomatoes_User`	The Rotten Tomatoes user score for the film
`Metacritic`	The Metacritic critic score for the film
`Metacritic_User`	The Metacritic user score for the film
`IMDB`	The IMDb user score for the film
`Fandango_Stars`	The number of stars the film had on its Fandango movie page
`Fandango_Ratingvalue`	The Fandango ratingValue for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.
`RT_norm`	The Rotten Tomatoes Tomatometer score for the film , normalized to a 0 to 5 point system
`RT_user_norm`	The Rotten Tomatoes user score for the film , normalized to a 0 to 5 point system
`Metacritic_norm`	The Metacritic critic score for the film, normalized to a 0 to 5 point system
`Metacritic_user_nom`	The Metacritic user score for the film, normalized to a 0 to 5 point system
`IMDB_norm`	The IMDb user score for the film, normalized to a 0 to 5 point system
`RT_norm_round`	The Rotten Tomatoes Tomatometer score for the film , normalized to a 0 to 5 point system and rounded to the nearest half-star
`RT_user_norm_round`	The Rotten Tomatoes user score for the film , normalized to a 0 to 5 point system and rounded to the nearest half-star
`Metacritic_norm_round`	The Metacritic critic score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star
`Metacritic_user_norm_round`	The Metacritic user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star
`IMDB_norm_round`	The IMDb user score for the film, normalized to a 0 to 5 point system and rounded to the nearest half-star
`Metacritic_user_vote_count`	The number of user votes the film had on Metacritic
`IMDB_user_vote_count`	The number of user votes the film had on IMDb
`Fandango_votes`	The number of user votes the film had on Fandango
`Fandango_Difference`	The difference between the presented Fandango_Stars and the actual Fandango_Ratingvalue

For our analysis we mainly need FILM, Fandango_Stars, Fandango_Ratingvalue, Fandango_votes and Fandango_Difference. We also keep RottenTomatoes and IMDB, which are used later as a yardstick.

The FILM column contains the name as well as the year the film was released. Since this analysis depends on the time difference between the datasets, let's separate the year into its own column for ease of use:

In [3]:

# keep only the columns we need (RottenTomatoes and IMDB are used later as yardsticks):
hickey = hickey_main[['FILM', 'Fandango_Stars', 'Fandango_Ratingvalue', 'Fandango_votes', 'Fandango_Difference','RottenTomatoes','IMDB']].copy()

# checking the slicing on the name of the first movie in the dataset:
name = 'Avengers: Age of Ultron (2015)'
print(name[-5:-1])        # the four digits inside the trailing parentheses
print(name[:-6].strip())  # the title; strip() drops the space left before '('

# storing the year in a new column and keeping only the title in FILM:
hickey['Year'] = hickey['FILM'].str[-5:-1]
hickey['FILM'] = hickey['FILM'].str[:-6].str.strip()
hickey['Year'].head()

2015
Avengers: Age of Ultron

Out[3]:

0    2015
1    2015
2    2015
3    2015
4    2015
Name: Year, dtype: object
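The fixed-position slicing above relies on every title ending in exactly "(YYYY)". As a hedged alternative, the sketch below splits the title and year with a regular expression and returns `None` for the year when the pattern is absent. The helper `split_title_year` is a hypothetical name, not part of the original notebook.

```python
import re

# Title, optional whitespace, then a four-digit year in parentheses at the end.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(film):
    """Return (title, year) for strings like 'Name (2015)'; year is None if absent."""
    match = TITLE_YEAR.match(film)
    if match is None:
        return film, None
    return match.group("title"), match.group("year")

print(split_title_year("Avengers: Age of Ultron (2015)"))
print(split_title_year("Do You Believe? (2015)"))
```

In the notebook this could be applied with `hickey_main['FILM'].apply(split_title_year)`, at the cost of being slower than vectorised string slicing.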

2. Post Hickey's analysis Dataset:

**It contains movie ratings data for 214 of the most popular movies (with a significant number of votes) released in 2016 and 2017. As of March 22, 2017, the ratings were up to date. Significant changes should be expected mostly for movies released in 2017.**

In [4]:

p_hickey = pd.read_csv('movie_ratings_16_17.csv')
print(p_hickey.columns)
print('The shape of the dataset: ', p_hickey.shape)
p_hickey.head()

Index(['movie', 'year', 'metascore', 'imdb', 'tmeter', 'audience', 'fandango',
       'n_metascore', 'n_imdb', 'n_tmeter', 'n_audience', 'nr_metascore',
       'nr_imdb', 'nr_tmeter', 'nr_audience'],
      dtype='object')
The shape of the dataset:  (214, 15)

Out[4]:

	movie	year	metascore	imdb	tmeter	audience	fandango	n_metascore	n_imdb	n_tmeter	n_audience	nr_metascore	nr_imdb	nr_tmeter	nr_audience
0	10 Cloverfield Lane	2016	76	7.2	90	79	3.5	3.80	3.60	4.50	3.95	4.0	3.5	4.5	4.0
1	13 Hours	2016	48	7.3	50	83	4.5	2.40	3.65	2.50	4.15	2.5	3.5	2.5	4.0
2	A Cure for Wellness	2016	47	6.6	40	47	3.0	2.35	3.30	2.00	2.35	2.5	3.5	2.0	2.5
3	A Dog's Purpose	2017	43	5.2	33	76	4.5	2.15	2.60	1.65	3.80	2.0	2.5	1.5	4.0
4	A Hologram for the King	2016	58	6.1	70	57	3.0	2.90	3.05	3.50	2.85	3.0	3.0	3.5	3.0

Let's see the columns in the dataset:

Column	Description
`movie`	the name of the movie
`year`	the release year of the movie
`metascore`	the Metacritic rating of the movie (the "metascore" - critic score)
`imdb`	the IMDB rating of the movie (user score)
`tmeter`	the Rotten Tomatoes rating of the movie (the "tomatometer" - critic score)
`audience`	the Rotten Tomatoes rating of the movie (user score)
`fandango`	the Fandango rating of the movie (user score)
`n_metascore`	the metascore normalized to a 0-5 scale
`n_imdb`	the IMDB rating normalized to a 0-5 scale
`n_tmeter`	the tomatometer normalized to a 0-5 scale
`n_audience`	the Rotten Tomatoes user score normalized to a 0-5 scale
`nr_metascore`	the metascore normalized to a 0-5 scale and rounded to the nearest 0.5
`nr_imdb`	the IMDB rating normalized to a 0-5 scale and rounded to the nearest 0.5
`nr_tmeter`	the tomatometer normalized to a 0-5 scale and rounded to the nearest 0.5
`nr_audience`	the Rotten Tomatoes user score normalized to a 0-5 scale and rounded to the nearest 0.5

For our analysis we mainly need movie, year and fandango from the dataset. We also keep tmeter and imdb, which are used later as a yardstick.

In [5]:

# editing our dataframe to only contain columns that we need:
p_hickey = p_hickey[['movie', 'year', 'fandango','tmeter','imdb']]

*Does the population match the interest for our goal?*

Our goal: to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

Going through the README files of both datasets, we can see that:

The two datasets don't contain data about the same movies pulled at different times; rather, they contain data about different movies pulled at different times:
- Hickey's dataset contains information about 146 movies, pulled on Aug. 24, 2015
- The post-Hickey dataset contains information about 214 of the most popular movies, pulled on March 22, 2017

Since the two datasets were captured at different times, let's check whether they overlap by looking at the release years of the movies in each:

In [6]:

# for the dataset used by hickey:
hickey['Year'].value_counts()

Out[6]:

2015    129
2014     17
Name: Year, dtype: int64

In [7]:

# for the dataset used after:
p_hickey['year'].value_counts()

Out[7]:

2016    191
2017     23
Name: year, dtype: int64

We can see that the datasets don't capture releases from the same years. The goal of our analysis is to detect changes in Fandango's rating system. Since the datasets do not contain information about the same movies, it is evident that:

- the population doesn't completely match the population of interest for our goal.
- the sampling is not random: the release year determines which movies end up in each dataset, so the samples are subject to temporal trends.

Now, to match the needs of our analysis, we have two options:

- Collect an entirely new dataset that includes the same movies used in Hickey's analysis.
- Tweak the goal slightly by placing some limitations on it.

Tweaking our goal is much faster than collecting new data. Also, it's quasi-impossible to collect a new sample from before Hickey's analysis at this moment in time.

Our New Goal:

To determine whether there has been any change in Fandango's rating system after Hickey's analysis *by using a system of aggregates.*

What we mean:

We know that the datasets do not contain information about the same movies. By using aggregates, we will look at the bigger picture, i.e. the overall ratings of popular movies in both datasets, to see if there has been any shift in Fandango's star rating system.

Data Analysis:

Before we continue, it is important to recognise that popularity is a relative term. For our analysis we will use the most-voted movies in both datasets. We'll use Hickey's benchmark and count a movie as popular only if it has 30 fan ratings or more on Fandango's website.

For the dataset containing information about the movies after Hickey's analysis, the README file states that it contains movie ratings data for 214 of the most popular movies (with a significant number of votes).

Although one of the sampling criteria in our second sample is movie popularity, the sample doesn't provide information about the number of fan ratings. We should be skeptical and ask whether this sample is truly representative and contains popular movies (movies with 30 or more fan ratings).

One quick way to check the representativeness of this sample is to randomly sample 10 movies from it and then check the number of fan ratings ourselves on Fandango's website. Ideally, at least 8 out of the 10 movies have 30 fan ratings or more.

In [8]:

p_hickey.sample(10, random_state=1)

Out[8]:

	movie	year	fandango	tmeter	imdb
108	Mechanic: Resurrection	2016	4.0	29	5.6
206	Warcraft	2016	4.0	28	7.0
106	Max Steel	2016	3.5	0	4.6
107	Me Before You	2016	4.5	58	7.4
51	Fantastic Beasts and Where to Find Them	2016	4.5	73	7.5
33	Cell	2016	3.0	11	4.3
59	Genius	2016	3.5	51	6.5
152	Sully	2016	4.5	85	7.5
4	A Hologram for the King	2016	3.0	70	6.1
31	Captain America: Civil War	2016	4.5	90	7.9

Above we used a value of 1 as the random seed. This is good practice because it suggests that we weren't trying out various random seeds just to get a favorable sample.
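The effect of fixing the seed can be shown with a small standard-library sketch. It uses Python's `random` module rather than pandas' `random_state`, so the drawn items will not match the table above, and the movie names are hypothetical placeholders.

```python
import random

# Hypothetical placeholder titles standing in for the 214 movies
movies = [f"movie_{i}" for i in range(214)]

# Two independent generators seeded with the same value draw the same sample
first_draw = random.Random(1).sample(movies, 10)
second_draw = random.Random(1).sample(movies, 10)

print(first_draw == second_draw)  # the two draws are identical
```

Because the sample is fully determined by the seed, anyone rerunning the cell gets the same 10 movies and can verify the fan-rating counts independently.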

As of April 2018, these are the fan ratings we found:

Movie	Fan ratings
Mechanic: Resurrection	2247
Warcraft	7271
Max Steel	493
Me Before You	5263
Fantastic Beasts and Where to Find Them	13400
Cell	17
Genius	127
Sully	11877
A Hologram for the King	500
Captain America: Civil War	35057
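The share of sampled movies that meet the 30-rating threshold can be computed directly from the counts in the table above. This is a small sketch; the variable names are illustrative.

```python
# Fan-rating counts as listed in the table above (checked in April 2018)
fan_ratings = {
    "Mechanic: Resurrection": 2247,
    "Warcraft": 7271,
    "Max Steel": 493,
    "Me Before You": 5263,
    "Fantastic Beasts and Where to Find Them": 13400,
    "Cell": 17,
    "Genius": 127,
    "Sully": 11877,
    "A Hologram for the King": 500,
    "Captain America: Civil War": 35057,
}

THRESHOLD = 30  # Hickey's benchmark for a "popular" movie

not_popular = [movie for movie, n in fan_ratings.items() if n < THRESHOLD]
share_popular = (len(fan_ratings) - len(not_popular)) / len(fan_ratings)

print(not_popular, share_popular)
```

Only one movie ("Cell") falls below the threshold, so 9 of the 10 sampled movies count as popular.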

90% of the movies in our sample (9 out of 10) are popular, which satisfies our condition. So let's move further with the analysis and check the number of fan ratings in Hickey's dataset.

In [9]:

sum(hickey['Fandango_votes'] < 30)

Out[9]:

Hence, we can be reasonably confident that the movies in both datasets are popular.

Earlier, we saw that across the datasets we have movie ratings for four release years: 2014, 2015, 2016 and 2017. If there has been a change in the overall rating system on Fandango, we would expect to see a clear shift in the rating values between 2015 and 2016.

Having data for four years, we can quantitatively verify the shift by plotting the estimates year-wise.

From the data provided in the datasets, we can take two approaches to quantify the shift in the ratings:

1. Calculating the yearwise shift in the ratings:

- Using Hickey's dataset to plot the ratings for 2014 and 2015.
- Using the post-Hickey dataset to plot the ratings for 2016 and 2017.

Let's start with the first step of our analysis:

In [10]:

# plotting kernel density plots for better visualisation of Hickey's and post-hickey's dataset :
plt.style.use('fivethirtyeight')

hickey['Fandango_Stars'].plot.kde(label = '2014-2015', legend = True, figsize = (10,6))
p_hickey['fandango'].plot.kde(label = '2016-2017', legend = True)

plt.title("Distribution of Fandango's ratings \n in Hickey's and post-Hickey's datasets",
          y = 1.07) # the `y` parameter pads the title upward
plt.xlabel('Stars')
plt.xlim(0,5) # because ratings start at 0 and end at 5
plt.xticks(arange(0,5.1,.5))

# printing the mean rating value during 2014-2015 for easy reference :
plt.axvline(hickey['Fandango_Stars'].mean(), label='Hickey_mean', color = 'blue')
print('\033[1m' + "The mean ratings during 2014-2015 in hickey's dataset is : " + str(np.round(hickey['Fandango_Stars'].mean(),2)) + '\033[0m')

# printing the mean rating value during 2016-2017 for easy reference:
plt.axvline(p_hickey['fandango'].mean(), label='Post_Hickey_mean', color = 'red')
print('\033[1m' + "The mean ratings during 2016-2017 in post hickey's dataset is : " + str(np.round(p_hickey['fandango'].mean(),2)) + '\033[0m')

plt.legend(framealpha = 0, loc = 'upper center')

plt.show()

The mean ratings during 2014-2015 in hickey's dataset is : 4.09
The mean ratings during 2016-2017 in post hickey's dataset is : 3.89

Observations from the graph:

- Skewness: Both distributions are strongly left-skewed (most ratings are high, with a tail toward lower values). The 2016-2017 distribution is shifted to the left relative to the 2014-2015 one, indicating that the overall ratings on Fandango decreased from 2014-2015 to 2016-2017.
- Mean: The mean rating decreased from about 4.09 in Hickey's dataset to 3.89 in the post-Hickey dataset, indicating that movies in 2016-2017 were rated lower than movies in 2014-2015.
- Tails: The left tail of the 2016-2017 curve suggests a wider overall range of rating values than in 2014-2015.

Now that we have seen an overall shift in movie ratings on the Fandango website, let's use **summary statistics to get a more precise picture of the direction of the difference**, i.e. how the ratings shifted from 2015 to 2016 (Hickey's dataset vs the post-Hickey dataset):
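The direction of the skew can also be checked numerically rather than read off the plot. The sketch below computes the moment-based skewness coefficient on a small list of hypothetical star ratings (not taken from either dataset); a negative value indicates a left-skewed distribution.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical ratings: mostly high, with a few low values forming a left tail
ratings = [2.5, 3.0, 3.5, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0, 5.0]

print(sample_skewness(ratings))  # negative for a left-skewed sample
```

With pandas, `Series.skew()` gives a comparable (bias-corrected) estimate and could be applied to `hickey['Fandango_Stars']` and `p_hickey['fandango']` directly.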

In [11]:

p_hickey.head()

Out[11]:

	movie	year	fandango	tmeter	imdb
0	10 Cloverfield Lane	2016	3.5	90	7.2
1	13 Hours	2016	4.5	50	7.3
2	A Cure for Wellness	2016	3.0	40	6.6
3	A Dog's Purpose	2017	4.5	33	5.2
4	A Hologram for the King	2016	3.0	70	6.1

In [12]:

# calculating the mean,median and mode, and using a grouped bar plot for better visualisation of direction of change :
## mean
hickey_2015_mean = np.round(hickey['Fandango_Stars'][hickey.Year == '2015'].mean(),2)
p_hickey['year'] = p_hickey['year'].astype('str')
p_hickey_2016_mean = np.round(p_hickey['fandango'][p_hickey.year == '2016'].mean(),2)

## median
hickey_2015_median = np.round(hickey['Fandango_Stars'][hickey.Year == '2015'].median(),2)
p_hickey_2016_median = np.round(p_hickey['fandango'][p_hickey.year == '2016'].median(),2)

## mode
hickey_2015_mode = hickey['Fandango_Stars'][hickey.Year == '2015'].mode()[0]
p_hickey_2016_mode = p_hickey['fandango'][p_hickey.year == '2016'].mode()[0]

# Summary metrics:
metrics_df = pd.DataFrame()

metrics_df['2015'] = [hickey_2015_mean,hickey_2015_median,hickey_2015_mode]
metrics_df['2016'] = [p_hickey_2016_mean,p_hickey_2016_median,p_hickey_2016_mode]
metrics_df.index = ['Mean','Median','Mode']
metrics_df['Difference in ratings'] = metrics_df['2016'] - metrics_df['2015']

metrics_df

Out[12]:

	2015	2016	Difference in ratings
Mean	4.09	3.89	-0.2
Median	4.00	4.00	0.0
Mode	4.50	4.00	-0.5

In [13]:

# plotting the data:
metrics_df.plot.bar(figsize = (8,5))
plt.legend(framealpha = 0, loc = 'upper center')
plt.ylabel('Fandango Stars')
plt.title('Summary Statistics : 2015 vs 2016')
plt.show()

Observation:

- The mean movie rating on the website fell by 0.2 stars from 2015 to 2016.
- The median stayed at 4.0 stars in both years.
- The mode fell by 0.5 stars, from 4.5 in 2015 to 4.0 in 2016.
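It may seem odd that the mean and mode drop while the median does not move. The sketch below uses two small hypothetical rating lists (not the real data) to show how that combination can arise when high ratings become less frequent but the middle value stays the same.

```python
from statistics import mean, median, mode

# Hypothetical half-star ratings for two years
ratings_a = [3.0, 3.5, 4.0, 4.0, 4.5, 4.5, 4.5]
ratings_b = [3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5]

for label, ratings in (("year A", ratings_a), ("year B", ratings_b)):
    print(label, round(mean(ratings), 2), median(ratings), mode(ratings))
```

In this toy example the median is 4.0 in both lists, while the mode moves from 4.5 to 4.0 and the mean falls, mirroring the pattern in the summary table.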

Problem with our previous analysis:

Although our analysis suggests that Fandango's overall ratings changed after Hickey's analysis, there are major underlying problems with our approach, including (but not limited to):

- Not using a yardstick to measure the ratings drop: Several scenarios could have made the ratings drop over the years without the website's rating system having been corrected. For instance, fans may simply have rated movies less highly after Hickey's analysis was published. Hence, using the ratings from another website for the same movies and period as a reference would provide more accurate results for our analysis.
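The yardstick idea amounts to measuring, per movie, how far Fandango's stars sit above a reference site on the same 0-5 scale, and then averaging that gap. As a minimal sketch, the values below are the `Fandango_Stars` and `RT_norm_round` columns of the five rows shown in the preview of Hickey's dataset; five rows are far too few to draw conclusions and serve only to illustrate the calculation.

```python
from statistics import mean

# First five rows of Hickey's dataset, as shown in the preview table
fandango_stars = [5.0, 5.0, 5.0, 5.0, 3.5]
rt_norm_round = [3.5, 4.5, 4.0, 1.0, 0.5]  # Rotten Tomatoes on a 0-5 half-star scale

# Per-movie gap between Fandango and the reference site
gaps = [f - r for f, r in zip(fandango_stars, rt_norm_round)]
mean_gap = mean(gaps)

print(gaps, mean_gap)
```

If Fandango's system had been corrected, one would expect this average gap to shrink between the two datasets, independently of whether movies in general were rated higher or lower.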

2. Using relative ratings to compare Fandango's ratings:

In [14]:

# During Hickey's analysis :
hickey.head()

Out[14]:

	FILM	Fandango_Stars	Fandango_Ratingvalue	Fandango_votes	Fandango_Difference	RottenTomatoes	IMDB	Year
0	Avengers: Age of Ultron	5.0	4.5	14846	0.5	74	7.8	2015
1	Cinderella	5.0	4.5	12640	0.5	85	7.1	2015
2	Ant-Man	5.0	4.5	12055	0.5	80	7.8	2015
3	Do You Believe?	5.0	4.5	1793	0.5	18	5.4	2015
4	Hot Tub Time Machine 2	3.5	3.0	1021	0.5	14	5.1	2015

In [15]:

# after hickey's analysis:
p_hickey.head()

Out[15]:

	movie	year	fandango	tmeter	imdb
0	10 Cloverfield Lane	2016	3.5	90	7.2
1	13 Hours	2016	4.5	50	7.3
2	A Cure for Wellness	2016	3.0	40	6.6
3	A Dog's Purpose	2017	4.5	33	5.2
4	A Hologram for the King	2016	3.0	70	6.1

We need to normalise Rotten Tomatoes' and IMDB's ratings to the same rating system as that of Fandango_stars (i.e out of 5)

In [16]:

# converting hickey's ratings :
hickey = hickey.copy()
hickey['RottenTomatoes_conversion'] = hickey['RottenTomatoes'].apply(lambda x: round(((x/100)*5)*2)/2)
hickey['IMDB_conversion'] = hickey['IMDB'].apply(lambda x: round(((x/10)*5)*2)/2)

# difference in ratings during Hickey's analysis :
hickey['fan_rot_difference_hickey']= hickey['Fandango_Stars'] - hickey['RottenTomatoes_conversion']
hickey['fan_imbd_difference_hickey']= hickey['Fandango_Stars'] - hickey['IMDB_conversion']

# converting p_hickey's ratings:
p_hickey = p_hickey.copy()
p_hickey['tmeter_conversion'] = p_hickey['tmeter'].apply(lambda x: round(((x/100)*5)*2)/2)
p_hickey['imdb_conversion'] = p_hickey['imdb'].apply(lambda x: round(((x/10)*5)*2)/2)

# difference in ratings post Hickey's analysis :
p_hickey['fan_rot_difference_phickey']= p_hickey['fandango'] - p_hickey['tmeter_conversion']
p_hickey['fan_imbd_difference_phickey']= p_hickey['fandango'] - p_hickey['imdb_conversion']

# average difference in ratings :
hickey_rot_mean = np.round(hickey['fan_rot_difference_hickey'].mean(),2)
hickey_imdb_mean = np.round(hickey['fan_imbd_difference_hickey'].mean(),2)

p_hickey_rot_mean = np.round(p_hickey['fan_rot_difference_phickey'].mean(),2)
p_hickey_imdb_mean = np.round(p_hickey['fan_imbd_difference_phickey'].mean(),2)

# creating the dataframe to hold the difference values :
final_df = pd.DataFrame()
final_df['Hickey'] = [hickey_rot_mean, hickey_imdb_mean]
final_df['Post_hickey'] = [p_hickey_rot_mean, p_hickey_imdb_mean]
final_df.index = ['Fandango_Rotten_Tomatoes_difference', 'Fandango_IMDB_difference']
final_df['Difference_between_datasets'] = final_df['Hickey'] - final_df['Post_hickey']
final_df

Out[16]:

	Hickey	Post_hickey	Difference_between_datasets
Fandango_Rotten_Tomatoes_difference	1.04	1.20	-0.16
Fandango_IMDB_difference	0.72	0.68	0.04
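One detail of the conversion above is worth knowing: Python's built-in `round` resolves exact ties to the nearest even integer, so a normalised score that falls exactly on a quarter-star may round down. The sketch below compares that behaviour with a "round half up" rule; both helper names are hypothetical. The value 4.25 is the `RT_norm` of "Cinderella (2015)" in the preview of Hickey's dataset, whose `RT_norm_round` is listed as 4.5, which suggests (but does not prove) that the dataset's own rounded columns used half-up rounding.

```python
import math

def nearest_half_star_builtin(stars):
    """Nearest half-star using Python's round(), which sends ties to the even integer."""
    return round(stars * 2) / 2

def nearest_half_star_half_up(stars):
    """Nearest half-star with ties always rounded up."""
    return math.floor(stars * 2 + 0.5) / 2

print(nearest_half_star_builtin(4.25), nearest_half_star_half_up(4.25))
```

The two rules only disagree on exact quarter-star ties, so the effect on the averages is likely small, but it can explain occasional mismatches with the datasets' pre-computed rounded columns.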

In [17]:

# plotting the data:
final_df.plot.bar(figsize = (8,5))
plt.legend(framealpha = 0, loc = 'upper center')
plt.ylabel('Difference in stars')
plt.title("Mean difference between Fandango's ratings and other websites")
plt.show()

Observation:

- During Hickey's analysis: There was an average difference of 1.04 stars between Fandango's ratings and Rotten Tomatoes' ratings, and an average difference of 0.72 stars between Fandango and IMDb, for the same movies.
- Post Hickey's analysis: There was an average difference of 1.20 stars between Fandango's ratings and Rotten Tomatoes' ratings, and an average difference of 0.68 stars between Fandango and IMDb.
- After Hickey's analysis, the average difference between Fandango and Rotten Tomatoes *increased by 0.16 stars*, whereas the average difference between Fandango and IMDb *fell by 0.04 stars*.

To understand these findings better, we need to plot the rating distributions and average ratings of the three websites (i.e. Fandango, Rotten Tomatoes and IMDb).

In [18]:

# plotting kernel density plots for better visualisation of Hickey's dataset :
plt.style.use('fivethirtyeight')

hickey['Fandango_Stars'].plot.kde(label = 'Fandango_ratings', legend = True, figsize = (10,6))
hickey['RottenTomatoes_conversion'].plot.kde(label = 'Rotten_tomatoes_ratings', legend = True)
hickey['IMDB_conversion'].plot.kde(label = 'IMDB_ratings', legend = True)


plt.title("Distribution of ratings across websites \n in Hickey's dataset",
          y = 1.07) 
plt.xlabel('Stars')
plt.xlim(0,5) 
plt.xticks(arange(0,5.1,.5))

# visualising the mean rating value for different websites for easy reference :
plt.axvline(hickey['Fandango_Stars'].mean(), label='Fandango_mean', color = 'blue')
plt.axvline(hickey['RottenTomatoes_conversion'].mean(), label='Rotten_mean', color = 'red')
plt.axvline(hickey['IMDB_conversion'].mean(), label='IMDB_mean', color = 'yellow')

plt.legend(framealpha = 0, loc = 'upper left')
plt.show()

In [19]:

# plotting kernel density plots for better visualisation of post-hickey's dataset :
plt.style.use('fivethirtyeight')

p_hickey['fandango'].plot.kde(label = 'Fandango_ratings', legend = True, figsize = (10,6))
p_hickey['tmeter_conversion'].plot.kde(label = 'Rotten_ratings', legend = True)
p_hickey['imdb_conversion'].plot.kde(label = 'IMDB_ratings', legend = True)

plt.title("Distribution of ratings across websites \n in the post-Hickey dataset",
          y = 1.07) 
plt.xlabel('Stars')
plt.xlim(0,5) 
plt.xticks(arange(0,5.1,.5))

# visualising the mean rating value for different websites for easy reference :
plt.axvline(p_hickey['fandango'].mean(), label='Fandango_mean', color = 'blue')
plt.axvline(p_hickey['tmeter_conversion'].mean(), label='Rotten_mean', color = 'red')
plt.axvline(p_hickey['imdb_conversion'].mean(), label='IMDB_mean', color = 'yellow')

plt.legend(framealpha = 0, loc = 'upper left')

plt.show()

In [20]:

# dataframe of mean ratings and their differences for the three websites :
mean_df = pd.DataFrame()

# hickey's analysis :
h_fandango_mean = np.round(hickey['Fandango_Stars'].mean(),2)
h_rotten_mean = np.round(hickey['RottenTomatoes_conversion'].mean(),2)
h_imdb_mean = np.round(hickey['IMDB_conversion'].mean(),2)

# p_hickey's analysis :
p_fandango_mean = np.round(p_hickey['fandango'].mean(),2)
p_rotten_mean = np.round(p_hickey['tmeter_conversion'].mean(),2)
p_imdb_mean = np.round(p_hickey['imdb_conversion'].mean(),2)

mean_df['Hickey'] = [h_fandango_mean,h_rotten_mean,h_imdb_mean]
mean_df['Post_Hickey'] = [p_fandango_mean,p_rotten_mean,p_imdb_mean] 


mean_df.index = ['Fandango','Rotten_Tomatoes','IMDB']
mean_df['Ratings_Difference'] = mean_df['Hickey'] - mean_df['Post_Hickey']
mean_df['Relative_Ratings(%)'] = np.round(((mean_df['Hickey'] - mean_df['Post_Hickey'])*100/mean_df['Hickey']),2)
mean_df

Out[20]:

	Hickey	Post_Hickey	Ratings_Difference	Relative_Ratings(%)
Fandango	4.09	3.89	0.20	4.89
Rotten_Tomatoes	3.04	2.69	0.35	11.51
IMDB	3.37	3.21	0.16	4.75

Observation:

- Fandango ratings: There was a drop of ~5% in Fandango's mean rating between the Hickey and post-Hickey datasets.
- Rotten Tomatoes ratings: There was a drop of ~11.5% in Rotten Tomatoes' mean rating between the Hickey and post-Hickey datasets.
- IMDb ratings: There was a drop of ~5% in IMDb's mean rating between the Hickey and post-Hickey datasets.
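The percentages above follow from the rounded means in the table: the relative drop is (before - after) × 100 / before. A small sketch reproducing the arithmetic (the helper name is illustrative):

```python
# Rounded mean ratings (Hickey, post-Hickey) as reported in the table above
means = {
    "Fandango": (4.09, 3.89),
    "Rotten_Tomatoes": (3.04, 2.69),
    "IMDB": (3.37, 3.21),
}

def relative_drop(before, after):
    """Percentage drop from `before` to `after`, rounded to two decimals."""
    return round((before - after) * 100 / before, 2)

for site, (before, after) in means.items():
    print(site, relative_drop(before, after))
```

This yields 4.89% for Fandango, 11.51% for Rotten Tomatoes and 4.75% for IMDb, matching the `Relative_Ratings(%)` column.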

Since, within each dataset, the same movies are rated by all three websites, these drops help explain the differences we obtained earlier.

- IMDb's and Fandango's ratings dropped by a similar amount, indicating that either IMDb itself changed its rating system, or movies were overall rated lower during 2016-2017 than during 2014-2015.
- Surprisingly, Rotten Tomatoes' ratings fell considerably more than Fandango's (~11.5% vs ~4.9%).

Conclusion

Analysis 1:
- Using only Fandango's ratings of different movies, we found that the average rating on the website fell by 0.2 stars from 2014-2015 (Hickey's dataset) to 2016-2017 (post-Hickey dataset).
- Using only this, one could prematurely conclude that the rating system on Fandango's website was corrected after Hickey's analysis.
Analysis 2:
- Using relative ratings to compare Fandango's ratings with the ratings of the same movies on other websites, we found that, between Hickey's dataset and the post-Hickey dataset (i.e. between 2014-2015 and 2016-2017), the average difference between Fandango and Rotten Tomatoes increased by 0.16 stars, whereas the average difference between Fandango and IMDb fell by 0.04 stars.
- Secondly, the relative ratings show that between 2014-2015 and 2016-2017 the ratings fell on all three websites, suggesting that movies overall were rated lower during 2016-2017.

In conclusion, though the average rating of movies fell on Fandango's website, the ratings on other websites (especially Rotten Tomatoes) fell even further, indicating that the rating system on Fandango's website might not have been corrected.
