Investigating Fandango Movie Ratings

1. Is Fandango Still Inflating Ratings?

  • In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in this article. Fandango displays a 5-star rating system on its website, where the minimum rating is 0 stars and the maximum is 5 stars.
  • In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

2. Understanding the Data

  • One of the best ways to figure out whether there has been any change in Fandango's rating system after Hickey's analysis is to compare the system's characteristics before and after the analysis.
  • We will explore two data sets:
    • fandango_score_comparison.csv: the rating system's characteristics before Hickey's analysis; you can find the file here.
    • movie_ratings_16_17.csv: the rating system's characteristics after Hickey's analysis; you can find the file here.

Start Coding

  1. Create a function to read the data sets: read_dataset() takes two parameters (file_path, df_name) and returns a dataframe.
  2. Explore the data sets.
In [1]:
# Basic imports
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
In [2]:
# create read_dataset()
def read_dataset(file_path,df_name):
    df = pd.read_csv(file_path)
    df.name = df_name
    return df

Apply the read_dataset() function to both data sets:

  • fandango_score_comparison.csv: assign the result to the previous variable
  • movie_ratings_16_17.csv: assign the result to the after variable
In [3]:
# apply read_dataset() on fandango_score_comparison. 
previous = read_dataset("fandango_score_comparison.csv","previous")

# apply read_dataset() on movie_ratings_16_17. 
after = read_dataset("movie_ratings_16_17.csv","after")

2- Explore the data sets

- Create a print_rows() function that takes two parameters:
    1- df: the dataframe
    2- num_rows: the number of rows to display; the function prints the head and the tail with that many rows each
- Create an explore_df() function that takes one parameter:
    1- df: the dataframe
        It prints the dataframe's information and description.
In [4]:
# create print_rows(df,num_rows) function 
def print_rows(df,num_rows):
    print(df.name,"head {}:".format(num_rows))
    display(df.head(num_rows))
    print(df.name,"tail {}:".format(num_rows))
    display(df.tail(num_rows))
In [5]:
# create explore_df(df) function
def explore_df(df):
    print(df.name,"information & describtion")
    # df.info() prints its own report and returns None, so display() also shows "None"
    display(df.info(),df.describe())
In [6]:
# apply print_rows() function on previous dataset
print_rows(previous,5)
previous head 5:
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm ... IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
0 Avengers: Age of Ultron (2015) 74 86 66 7.1 7.8 5.0 4.5 3.70 4.3 ... 3.90 3.5 4.5 3.5 3.5 4.0 1330 271107 14846 0.5
1 Cinderella (2015) 85 80 67 7.5 7.1 5.0 4.5 4.25 4.0 ... 3.55 4.5 4.0 3.5 4.0 3.5 249 65709 12640 0.5
2 Ant-Man (2015) 80 90 64 8.1 7.8 5.0 4.5 4.00 4.5 ... 3.90 4.0 4.5 3.0 4.0 4.0 627 103660 12055 0.5
3 Do You Believe? (2015) 18 84 22 4.7 5.4 5.0 4.5 0.90 4.2 ... 2.70 1.0 4.0 1.0 2.5 2.5 31 3136 1793 0.5
4 Hot Tub Time Machine 2 (2015) 14 28 29 3.4 5.1 3.5 3.0 0.70 1.4 ... 2.55 0.5 1.5 1.5 1.5 2.5 88 19560 1021 0.5

5 rows × 22 columns

previous tail 5:
FILM RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm ... IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
141 Mr. Holmes (2015) 87 78 67 7.9 7.4 4.0 4.0 4.35 3.90 ... 3.70 4.5 4.0 3.5 4.0 3.5 33 7367 1348 0.0
142 '71 (2015) 97 82 83 7.5 7.2 3.5 3.5 4.85 4.10 ... 3.60 5.0 4.0 4.0 4.0 3.5 60 24116 192 0.0
143 Two Days, One Night (2014) 97 78 89 8.8 7.4 3.5 3.5 4.85 3.90 ... 3.70 5.0 4.0 4.5 4.5 3.5 123 24345 118 0.0
144 Gett: The Trial of Viviane Amsalem (2015) 100 81 90 7.3 7.8 3.5 3.5 5.00 4.05 ... 3.90 5.0 4.0 4.5 3.5 4.0 19 1955 59 0.0
145 Kumiko, The Treasure Hunter (2015) 87 63 68 6.4 6.7 3.5 3.5 4.35 3.15 ... 3.35 4.5 3.0 3.5 3.0 3.5 19 5289 41 0.0

5 rows × 22 columns

In [7]:
# apply explore_df() function on previous data frame
explore_df(previous)
previous information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 22 columns):
FILM                          146 non-null object
RottenTomatoes                146 non-null int64
RottenTomatoes_User           146 non-null int64
Metacritic                    146 non-null int64
Metacritic_User               146 non-null float64
IMDB                          146 non-null float64
Fandango_Stars                146 non-null float64
Fandango_Ratingvalue          146 non-null float64
RT_norm                       146 non-null float64
RT_user_norm                  146 non-null float64
Metacritic_norm               146 non-null float64
Metacritic_user_nom           146 non-null float64
IMDB_norm                     146 non-null float64
RT_norm_round                 146 non-null float64
RT_user_norm_round            146 non-null float64
Metacritic_norm_round         146 non-null float64
Metacritic_user_norm_round    146 non-null float64
IMDB_norm_round               146 non-null float64
Metacritic_user_vote_count    146 non-null int64
IMDB_user_vote_count          146 non-null int64
Fandango_votes                146 non-null int64
Fandango_Difference           146 non-null float64
dtypes: float64(15), int64(6), object(1)
memory usage: 25.2+ KB
None
RottenTomatoes RottenTomatoes_User Metacritic Metacritic_User IMDB Fandango_Stars Fandango_Ratingvalue RT_norm RT_user_norm Metacritic_norm ... IMDB_norm RT_norm_round RT_user_norm_round Metacritic_norm_round Metacritic_user_norm_round IMDB_norm_round Metacritic_user_vote_count IMDB_user_vote_count Fandango_votes Fandango_Difference
count 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 ... 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000 146.000000
mean 60.849315 63.876712 58.808219 6.519178 6.736986 4.089041 3.845205 3.042466 3.193836 2.940411 ... 3.368493 3.065068 3.226027 2.972603 3.270548 3.380137 185.705479 42846.205479 3848.787671 0.243836
std 30.168799 20.024430 19.517389 1.510712 0.958736 0.540386 0.502831 1.508440 1.001222 0.975869 ... 0.479368 1.514600 1.007014 0.990961 0.788116 0.502767 316.606515 67406.509171 6357.778617 0.152665
min 5.000000 20.000000 13.000000 2.400000 4.000000 3.000000 2.700000 0.250000 1.000000 0.650000 ... 2.000000 0.500000 1.000000 0.500000 1.000000 2.000000 4.000000 243.000000 35.000000 0.000000
25% 31.250000 50.000000 43.500000 5.700000 6.300000 3.500000 3.500000 1.562500 2.500000 2.175000 ... 3.150000 1.500000 2.500000 2.125000 3.000000 3.000000 33.250000 5627.000000 222.250000 0.100000
50% 63.500000 66.500000 59.000000 6.850000 6.900000 4.000000 3.900000 3.175000 3.325000 2.950000 ... 3.450000 3.000000 3.500000 3.000000 3.500000 3.500000 72.500000 19103.000000 1446.000000 0.200000
75% 89.000000 81.000000 75.000000 7.500000 7.400000 4.500000 4.200000 4.450000 4.050000 3.750000 ... 3.700000 4.500000 4.000000 4.000000 4.000000 3.500000 168.500000 45185.750000 4439.500000 0.400000
max 100.000000 94.000000 94.000000 9.600000 8.600000 5.000000 4.800000 5.000000 4.700000 4.700000 ... 4.300000 5.000000 4.500000 4.500000 5.000000 4.500000 2375.000000 334164.000000 34846.000000 0.500000

8 rows × 21 columns

  • After exploring the previous data frame, we can see that many columns hold movie ratings from other sources, which we don't need.
  • We are interested only in the Fandango data, so these are the only columns we will work with:
    • FILM: The film in question.
    • Fandango_Stars: The number of stars the film had on its Fandango movie page.
    • Fandango_Ratingvalue: The Fandango ratingValue for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.
    • Fandango_votes: The number of user votes the film had on Fandango.
    • Fandango_Difference: The difference between the presented Fandango_Stars and the actual Fandango_Ratingvalue.
  • We will build another data frame containing only the columns above, to make the work easier.
  • You can find the description of all columns here.
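To make the Fandango_Difference column concrete, the sketch below recomputes it for four rows copied from the head and tail tables above. The helper name star_gap is made up for illustration, and the sketch assumes the column is simply the displayed stars minus the rating value, rounded to one decimal.

```python
def star_gap(stars, rating_value):
    """Hypothetical helper: displayed stars minus the underlying rating value."""
    return round(stars - rating_value, 1)

# (film, Fandango_Stars, Fandango_Ratingvalue, Fandango_Difference) copied from the tables above
rows = [
    ("Avengers: Age of Ultron (2015)", 5.0, 4.5, 0.5),
    ("Hot Tub Time Machine 2 (2015)", 3.5, 3.0, 0.5),
    ("Mr. Holmes (2015)", 4.0, 4.0, 0.0),
    ("Kumiko, The Treasure Hunter (2015)", 3.5, 3.5, 0.0),
]

gaps = [star_gap(stars, value) for _, stars, value, _ in rows]
for (film, _, _, reported), gap in zip(rows, gaps):
    print(f"{film}: computed {gap}, reported {reported}")
```

For these four rows the computed gap agrees with the reported column, which is consistent with the column description above.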
In [8]:
# apply print_rows() function on after data frame
print_rows(after,5) 
after head 5:
movie year metascore imdb tmeter audience fandango n_metascore n_imdb n_tmeter n_audience nr_metascore nr_imdb nr_tmeter nr_audience
0 10 Cloverfield Lane 2016 76 7.2 90 79 3.5 3.80 3.60 4.50 3.95 4.0 3.5 4.5 4.0
1 13 Hours 2016 48 7.3 50 83 4.5 2.40 3.65 2.50 4.15 2.5 3.5 2.5 4.0
2 A Cure for Wellness 2016 47 6.6 40 47 3.0 2.35 3.30 2.00 2.35 2.5 3.5 2.0 2.5
3 A Dog's Purpose 2017 43 5.2 33 76 4.5 2.15 2.60 1.65 3.80 2.0 2.5 1.5 4.0
4 A Hologram for the King 2016 58 6.1 70 57 3.0 2.90 3.05 3.50 2.85 3.0 3.0 3.5 3.0
after tail 5:
movie year metascore imdb tmeter audience fandango n_metascore n_imdb n_tmeter n_audience nr_metascore nr_imdb nr_tmeter nr_audience
209 X-Men: Apocalypse 2016 52 7.1 48 67 4.0 2.6 3.55 2.40 3.35 2.5 3.5 2.5 3.5
210 XX 2017 64 4.7 71 17 3.0 3.2 2.35 3.55 0.85 3.0 2.5 3.5 1.0
211 xXx: Return of Xander Cage 2017 42 5.4 43 45 4.0 2.1 2.70 2.15 2.25 2.0 2.5 2.0 2.0
212 Zoolander 2 2016 34 4.8 23 21 2.5 1.7 2.40 1.15 1.05 1.5 2.5 1.0 1.0
213 Zootopia 2016 78 8.1 98 92 4.5 3.9 4.05 4.90 4.60 4.0 4.0 5.0 4.5
In [9]:
# apply explore_df() function on after data frame 
explore_df(after)
after information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 15 columns):
movie           214 non-null object
year            214 non-null int64
metascore       214 non-null int64
imdb            214 non-null float64
tmeter          214 non-null int64
audience        214 non-null int64
fandango        214 non-null float64
n_metascore     214 non-null float64
n_imdb          214 non-null float64
n_tmeter        214 non-null float64
n_audience      214 non-null float64
nr_metascore    214 non-null float64
nr_imdb         214 non-null float64
nr_tmeter       214 non-null float64
nr_audience     214 non-null float64
dtypes: float64(10), int64(4), object(1)
memory usage: 25.2+ KB
None
year metascore imdb tmeter audience fandango n_metascore n_imdb n_tmeter n_audience nr_metascore nr_imdb nr_tmeter nr_audience
count 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000 214.000000
mean 2016.107477 53.266355 6.440654 53.621495 58.626168 3.894860 2.663318 3.220327 2.681075 2.931308 2.658879 3.214953 2.691589 2.915888
std 0.310444 17.843585 1.030056 30.242396 21.100040 0.516781 0.892179 0.515028 1.512120 1.055002 0.924619 0.526803 1.519273 1.060352
min 2016.000000 11.000000 3.500000 0.000000 11.000000 2.500000 0.550000 1.750000 0.000000 0.550000 0.500000 2.000000 0.000000 0.500000
25% 2016.000000 39.000000 5.825000 27.000000 43.250000 3.500000 1.950000 2.912500 1.350000 2.162500 2.000000 3.000000 1.500000 2.000000
50% 2016.000000 53.500000 6.500000 56.500000 60.500000 4.000000 2.675000 3.250000 2.825000 3.025000 2.500000 3.000000 3.000000 3.000000
75% 2016.000000 66.000000 7.200000 83.000000 76.750000 4.500000 3.300000 3.600000 4.150000 3.837500 3.500000 3.500000 4.000000 4.000000
max 2017.000000 99.000000 8.500000 99.000000 93.000000 5.000000 4.950000 4.250000 4.950000 4.650000 5.000000 4.000000 5.000000 4.500000
  • After exploring the after data frame, we can see that many columns hold movie ratings from other sources, which we don't need.
  • We are interested only in the Fandango data, so these are the only columns we will work with:
    • movie: the name of the movie.
    • year: the release year of the movie.
    • fandango: the Fandango rating of the movie (user score).
  • We will build another data frame containing only the columns above, to make the work easier.
  • You can find the description of all columns here.
  • As mentioned above, we will copy only the columns related to our investigation into another data frame, so let us create a new_df() function with 3 parameters:
    • df: the source dataframe.
    • col_index: the positions of the columns to select.
    • df_name: the name to give the new data frame.
    • The function returns the new dataframe.
In [10]:
def new_df(df,col_index,df_name):
    # select columns by position; copy so later edits don't touch the source frame
    subset = df.iloc[:,col_index].copy()
    subset.name = df_name
    return subset
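Selecting by position works as long as the column order never changes. As a hedged alternative, the sketch below selects by column name instead. The helper select_columns is hypothetical (it is not part of this notebook), and the toy frame only borrows a few column names and two rows from movie_ratings_16_17.csv.

```python
import pandas as pd

def select_columns(df, columns):
    """Hypothetical helper: return a copy of df restricted to the named columns."""
    return df.loc[:, columns].copy()

# Small stand-in frame; the values are the first two rows shown for the after data set
toy = pd.DataFrame({
    "movie": ["10 Cloverfield Lane", "13 Hours"],
    "year": [2016, 2016],
    "metascore": [76, 48],
    "fandango": [3.5, 4.5],
})

subset = select_columns(toy, ["movie", "year", "fandango"])
print(list(subset.columns))
```

A misspelled name raises a KeyError here, whereas a wrong position in iloc silently selects a different column.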
In [11]:
# apply new_df() function on previous data frame , select columns we mentioned above
new_previous = new_df(previous,[0,6,7,-2,-1],"new_previous")
In [12]:
# apply new_df() function on after data frame , select columns we mentioned above
new_after = new_df(after,[0,1,6],"new_after")
In [13]:
# apply explore_df() on new previous to confirm our result
explore_df(new_previous)
new_previous information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 5 columns):
FILM                    146 non-null object
Fandango_Stars          146 non-null float64
Fandango_Ratingvalue    146 non-null float64
Fandango_votes          146 non-null int64
Fandango_Difference     146 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 5.8+ KB
None
Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
count 146.000000 146.000000 146.000000 146.000000
mean 4.089041 3.845205 3848.787671 0.243836
std 0.540386 0.502831 6357.778617 0.152665
min 3.000000 2.700000 35.000000 0.000000
25% 3.500000 3.500000 222.250000 0.100000
50% 4.000000 3.900000 1446.000000 0.200000
75% 4.500000 4.200000 4439.500000 0.400000
max 5.000000 4.800000 34846.000000 0.500000
In [14]:
# apply explore_df() on new_after to confirm our result
explore_df(new_after)
new_after information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 3 columns):
movie       214 non-null object
year        214 non-null int64
fandango    214 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.1+ KB
None
year fandango
count 214.000000 214.000000
mean 2016.107477 3.894860
std 0.310444 0.516781
min 2016.000000 2.500000
25% 2016.000000 3.500000
50% 2016.000000 4.000000
75% 2016.000000 4.500000
max 2017.000000 5.000000
  • Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis.
  • Our population of interest is all movie ratings on Fandango's website, regardless of release year.
  • Because we want to compare the system before and after the analysis, we have two data sets, one for each period.
  • Both data sets are samples, so we need to make sure they are representative of the population; otherwise we will get a large sampling error and draw wrong conclusions.
  • From the Fandango repository we can see that the author used the following sampling criterion: the movie had at least 30 fan reviews on Fandango. The data from Fandango was pulled on Aug. 24, 2015.
    • This sampling was clearly not random, because not every movie had the same chance of being included in the sample; some movies had no chance at all (those with fewer than 30 fan ratings).
  • From the Movie ratings (2016 and 2017) repository we can see that the author used the following sampling criterion: movie ratings data for 214 of the most popular movies (with a significant number of votes) released in 2016 and 2017.
  • From all of the above we can see that neither data set is representative of the population. Each sample suited its author's own research; this kind of sampling is called purposive sampling (or judgmental/selective/subjective sampling). Because both samples are non-random and shaped by the purpose they were collected for, they are not good enough for our goal.
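The effect of a purposive sample on an estimate can be illustrated with a small simulation. Everything in the sketch below is synthetic and is not Fandango data: the vote counts, the ratings, and the assumption that movies with 30 or more votes are rated about one star higher are invented purely to show the mechanism.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population (NOT Fandango data): each movie has a vote count and a rating.
# Assumption for illustration only: movies with 30+ votes are rated about one star higher.
population = []
for _ in range(10_000):
    votes = int(random.expovariate(1 / 50))
    rating = 3.0 + (1.0 if votes >= 30 else 0.0) + random.uniform(-0.5, 0.5)
    population.append((votes, rating))

population_mean = statistics.mean([rating for _, rating in population])

# Purposive sample: keep only the movies that meet the 30-vote threshold
purposive_mean = statistics.mean([rating for votes, rating in population if votes >= 30])

# Simple random sample: every movie has the same chance of being selected
random_sample = random.sample(population, 500)
random_mean = statistics.mean([rating for _, rating in random_sample])

print(f"population mean:       {population_mean:.2f}")
print(f"purposive sample mean: {purposive_mean:.2f}")
print(f"random sample mean:    {random_mean:.2f}")
```

Under these assumptions the random sample lands close to the population mean, while the threshold-based sample overshoots it. That gap is the kind of sampling error the bullets above warn about.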

Result of our exploration

  • At this point, we have at least two alternatives:
      * either we collect new data,
      * or we change the goal of our analysis by placing some limitations on it.

Changing our goal is much faster than collecting new data. Besides, it is practically impossible at this point to collect a new sample from before Hickey's analysis.

Our new goal

  • Is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.

Working on our new goal

  • With the new goal, we now have two populations that we want to describe and compare with each other:
    • All Fandango ratings for popular movies released in 2015.
    • All Fandango ratings for popular movies released in 2016.

The term "popular" is vague and we need to define it with precision before continuing. We'll use Hickey's benchmark of 30 fan ratings and consider a movie "popular" only if it has 30 fan ratings or more on Fandango's website.

Let's start coding

  • First of all, we need to check whether both samples contain popular movies.
  • Create a check_sample() function that draws a random sample from a dataframe so we can inspect the number of fan ratings. The function takes 2 parameters:
    • df: the data frame.
    • num_sample: the number of rows to sample.
In [15]:
def check_sample(df,num_sample):
    display(df.sample(num_sample,random_state=1))
In [16]:
# apply check_sample() on new_previous 
check_sample(new_previous,10)
FILM Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference
98 Get Hard (2015) 4.0 3.9 5933 0.1
66 The Gift (2015) 4.0 3.7 2680 0.3
53 Hot Pursuit (2015) 4.0 3.7 2618 0.3
75 San Andreas (2015) 4.5 4.3 9749 0.2
121 The Stanford Prison Experiment (2015) 4.0 3.9 51 0.1
74 The Hobbit: The Battle of the Five Armies (2014) 4.5 4.3 15337 0.2
119 Phoenix (2015) 3.5 3.4 70 0.1
128 Mission: Impossible – Rogue Nation (2015) 4.5 4.4 8357 0.1
44 Kingsman: The Secret Service (2015) 4.5 4.2 15205 0.3
58 Seventh Son (2015) 3.5 3.2 1213 0.3
In [17]:
# apply check_sample() on new_after 
check_sample(new_after,5)
movie year fandango
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
51 Fantastic Beasts and Where to Find Them 2016 4.5

For new_previous, the author selected the data using the criterion that a movie had at least 30 fan reviews on Fandango. Let us still confirm that no movie in it has fewer than 30 fan ratings. (new_after has no vote-count column, so it cannot be checked the same way.)

In [18]:
# count the movies with fewer than 30 fan ratings
less_30_fan = (new_previous["Fandango_votes"] < 30).sum()
less_30_fan
Out[18]:
0
  • As we can see, the new_previous data set has no movie with fewer than 30 fan ratings.
  • Now let us confirm that the movies in both data sets match the years we will work with: 2015 movies for new_previous and 2016 movies for new_after.
In [19]:
# check release date on new_previous 
previous_movies_year = new_previous["FILM"].str.split("(").str[-1]
print(previous_movies_year.unique())
['2015)' '2014)']
In [20]:
# check release date on new_after
after_movies_year = new_after["year"].unique()
after_movies_year
Out[20]:
array([2016, 2017])
  • From the output above we can see that the new_previous data frame has movies from 2014 and 2015.
  • We can also see that the new_after data frame has movies from 2016 and 2017.
  • We need to isolate only the sample points that belong to our populations of interest.
    • Isolate the movies released in 2015 in a separate data set.
    • Isolate the movies released in 2016 in another separate data set.
  • First we will add a year column to the new_previous data frame so that we can easily isolate data by year.
In [21]:
# add year column on new_previous data frame
# regex=False treats ")" as a literal character rather than a regex pattern
new_previous["year"] = new_previous["FILM"].str.split("(").str[-1].str.replace(")","",regex=False).astype("int")

# check year column 
new_previous["year"].unique()
Out[21]:
array([2015, 2014])
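The same year column can also be obtained in one step with a regular expression. The sketch below is an alternative, not the method used in this notebook; it assumes every title ends with a four-digit year in parentheses, as the three titles copied from the previous data set do.

```python
import pandas as pd

# Three titles copied from the previous data set
films = pd.Series([
    "Avengers: Age of Ultron (2015)",
    "The Hobbit: The Battle of the Five Armies (2014)",
    "'71 (2015)",
])

# Capture the four digits inside the trailing parentheses, then convert to int
years = films.str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
print(years.tolist())
```

A title without a trailing year would produce a missing value, and the astype(int) call would then fail instead of silently storing a wrong year.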

Create an isolate_year() function with 4 parameters:

* df: the data frame
* col_year: the name of the column that holds the release year
* year: the year to keep, as an int
* df_name: a string with the name for the new data frame
In [22]:
def isolate_year(df,col_year,year,df_name):
    df_year = df.loc[df[col_year].astype("int")== year].copy()
    df_year.name = df_name
    return df_year
In [23]:
# apply isolate_year() function on new_previous
previous_2015 = isolate_year(new_previous,"year",2015,"previous_2015")
previous_2015["year"].unique()
Out[23]:
array([2015])
In [24]:
# apply isolate_year() function on new_after
after_2016 = isolate_year(new_after,"year",2016,"after_2016")
after_2016["year"].unique()
Out[24]:
array([2016])
  • We can now start analyzing the two samples we isolated. Once again, our goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016.
  • We start simple, with a high-level comparison between the shapes of the two rating distributions.
  • We generate two kernel density plots on the same figure, one for the distribution of movie ratings in each sample.
In [25]:
explore_df(previous_2015)
previous_2015 information & describtion
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 0 to 145
Data columns (total 6 columns):
FILM                    129 non-null object
Fandango_Stars          129 non-null float64
Fandango_Ratingvalue    129 non-null float64
Fandango_votes          129 non-null int64
Fandango_Difference     129 non-null float64
year                    129 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 7.1+ KB
None
Fandango_Stars Fandango_Ratingvalue Fandango_votes Fandango_Difference year
count 129.000000 129.000000 129.000000 129.000000 129.0
mean 4.085271 3.846512 3761.891473 0.238760 2015.0
std 0.538096 0.505446 6543.601748 0.152741 0.0
min 3.000000 2.700000 35.000000 0.000000 2015.0
25% 3.500000 3.500000 210.000000 0.100000 2015.0
50% 4.000000 3.900000 1415.000000 0.200000 2015.0
75% 4.500000 4.200000 4045.000000 0.400000 2015.0
max 5.000000 4.800000 34846.000000 0.500000 2015.0
In [26]:
explore_df(after_2016)
after_2016 information & describtion
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 213
Data columns (total 3 columns):
movie       191 non-null object
year        191 non-null int64
fandango    191 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.0+ KB
None
year fandango
count 191.0 191.000000
mean 2016.0 3.887435
std 0.0 0.509540
min 2016.0 2.500000
25% 2016.0 3.500000
50% 2016.0 4.000000
75% 2016.0 4.250000
max 2016.0 5.000000
In [27]:
plt.style.use('fivethirtyeight')
previous_2015["Fandango_Stars"].plot.kde(label="2015",legend=True)
after_2016["fandango"].plot.kde(label="2016",legend=True)
plt.title("Fandango's ratings distribution for popular movies in\n(2015 Vs 2016)",fontsize=20,y=1.05)
plt.xlim(0,5)
plt.xticks(np.arange(0,5.1,0.5))
plt.show()
  • Both the 2015 and the 2016 rating distributions are left-skewed, which shows that Fandango ratings are mostly high; this needs more analysis.
  • Although the shapes are similar, the 2015 curve peaks at about 4.5 stars while the 2016 curve peaks at about 4 stars, roughly half a star lower, which suggests that ratings in 2016 were slightly lower than in 2015.
  • Let us now analyze more granular information.
  • Because the data sets have different numbers of movies, we normalize the frequency tables and show percentages instead.
  • Create a calc_per() function that normalizes the ratings and returns percentages. The function takes 2 parameters:
    • df: the data frame.
    • col_name: the column to normalize.
In [28]:
# create calc_per()
def calc_per(df,col_name):
    ratings_per = df[col_name].value_counts(normalize=True).sort_index() * 100
    return ratings_per
In [29]:
# apply calc_per() on previous_2015 , ratings column "Fandango_Stars"
ratings_per_2015 = calc_per(previous_2015,"Fandango_Stars")
ratings_per_2015
Out[29]:
3.0     8.527132
3.5    17.829457
4.0    28.682171
4.5    37.984496
5.0     6.976744
Name: Fandango_Stars, dtype: float64
In [30]:
# apply calc_per() on after_2016 , ratings column "fandango"
ratings_per_2016 = calc_per(after_2016,"fandango")
ratings_per_2016
Out[30]:
2.5     3.141361
3.0     7.329843
3.5    24.083770
4.0    40.314136
4.5    24.607330
5.0     0.523560
Name: fandango, dtype: float64
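The normalization step can be reproduced without pandas. The counts in the sketch below are not printed anywhere in this notebook; they are back-calculated from the percentages in Out[29] and Out[30] and the sample sizes 129 and 191, so treat them as a reconstruction. The helper name to_percent is made up for illustration.

```python
# Counts per star level, back-calculated from Out[29]/Out[30] (a reconstruction)
counts_2015 = {3.0: 11, 3.5: 23, 4.0: 37, 4.5: 49, 5.0: 9}          # 129 movies
counts_2016 = {2.5: 6, 3.0: 14, 3.5: 46, 4.0: 77, 4.5: 47, 5.0: 1}  # 191 movies

def to_percent(counts):
    """Hypothetical helper: turn raw counts into percentages of the total."""
    total = sum(counts.values())
    return {stars: 100 * n / total for stars, n in sorted(counts.items())}

pct_2015 = to_percent(counts_2015)
pct_2016 = to_percent(counts_2016)

# Side-by-side comparison; a missing star level counts as 0%
for stars in sorted(set(pct_2015) | set(pct_2016)):
    p15 = pct_2015.get(stars, 0.0)
    p16 = pct_2016.get(stars, 0.0)
    print(f"{stars:.1f} stars: 2015 {p15:5.1f}%   2016 {p16:5.1f}%   change {p16 - p15:+5.1f} pts")
```

Raw counts would make 2016 look larger at almost every star level simply because that sample has more movies; percentages put the two tables on the same scale.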
  • The minimum rating is lower in 2016: 2.5 stars (about 3% of movies) instead of 3 stars, the 2015 minimum (about 8.5% of movies).
  • The maximum rating in both data sets is 5 stars, but its frequency differs sharply: under 1% of movies in 2016 versus about 7% in 2015.
  • The most frequent rating is also lower: 4 stars in 2016 versus 4.5 stars in 2015.
  • The smaller share of top ratings in 2016 shows up as larger shares of 3.5-star and 4-star ratings than in 2015.
  • In spite of all the points above, we are still not sure about the direction of the change, so we'll take a couple of summary statistics to get a more precise picture of the direction of the difference.
  • We'll take each distribution of movie ratings and compute its mean, median, and mode, and then compare these statistics to determine what they tell about the direction of the difference.
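The left skew noted for the density plot can also be checked numerically. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) from star-level counts. The counts are reconstructed from the percentage tables in Out[29] and Out[30], and the function name skewness is illustrative.

```python
# Counts per star level, reconstructed from Out[29]/Out[30] and n = 129, n = 191
counts_2015 = {3.0: 11, 3.5: 23, 4.0: 37, 4.5: 49, 5.0: 9}
counts_2016 = {2.5: 6, 3.0: 14, 3.5: 46, 4.0: 77, 4.5: 47, 5.0: 1}

def skewness(counts):
    """Moment-based skewness m3 / m2**1.5 for a frequency table."""
    n = sum(counts.values())
    mean = sum(x * c for x, c in counts.items()) / n
    m2 = sum(c * (x - mean) ** 2 for x, c in counts.items()) / n
    m3 = sum(c * (x - mean) ** 3 for x, c in counts.items()) / n
    return m3 / m2 ** 1.5

skew_2015 = skewness(counts_2015)
skew_2016 = skewness(counts_2016)
print(f"2015 skewness: {skew_2015:.2f}")
print(f"2016 skewness: {skew_2016:.2f}")
```

A negative coefficient means the longer tail is on the low-rating side, which is what "left-skewed" describes.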

Determining the Direction of the Change

  • Compute the mean, median, and mode for each distribution.
      * We will create a calc_state() function that takes two parameters:
          * df: the data frame
          * col_name: the name of the column to compute the statistics on
In [38]:
def calc_state(df,col_name):
    x_mean = round(df[col_name].mean(),2)
    x_median = round(df[col_name].median(),2)
    # mode() returns a Series (there can be ties), so take its first value with [0]
    x_mode = round(df[col_name].mode()[0],2)
    return (x_mean,x_median,x_mode)
In [32]:
# apply calc_state on previous_2015 on Fandango_Stars & assign it as a list on summary_2015 variable
summary_2015 = list(calc_state(previous_2015,"Fandango_Stars"))
summary_2015
Out[32]:
[4.09, 4.0, 4.5]
In [37]:
# apply calc_state on after_2016 on fandango & assign it as a list on summary_2016 variable
summary_2016 = list(calc_state(after_2016,"fandango"))
In [34]:
# create summary data frame   
summary = pd.DataFrame(summary_2015,index=["mean","median","mode"],columns=["2015"])
summary
Out[34]:
2015
mean 4.09
median 4.00
mode 4.50
In [35]:
# add 2016 on summary data frame 
summary["2016"] = summary_2016
summary
Out[35]:
2015 2016
mean 4.09 3.89
median 4.00 4.00
mode 4.50 4.00
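As a cross-check that does not depend on pandas, the same three statistics can be recomputed with Python's statistics module. The star-level counts below are reconstructed from the percentage tables in Out[29] and Out[30] and the sample sizes 129 and 191; the helper expand and the relative-drop figure are additions for illustration.

```python
from statistics import mean, median, mode

# Counts per star level, reconstructed from Out[29]/Out[30] (not printed in the notebook)
counts_2015 = {3.0: 11, 3.5: 23, 4.0: 37, 4.5: 49, 5.0: 9}
counts_2016 = {2.5: 6, 3.0: 14, 3.5: 46, 4.0: 77, 4.5: 47, 5.0: 1}

def expand(counts):
    """Hypothetical helper: turn a frequency table back into a sorted list of ratings."""
    return [stars for stars, n in sorted(counts.items()) for _ in range(n)]

ratings_2015 = expand(counts_2015)
ratings_2016 = expand(counts_2016)

stats_2015 = (round(mean(ratings_2015), 2), median(ratings_2015), mode(ratings_2015))
stats_2016 = (round(mean(ratings_2016), 2), median(ratings_2016), mode(ratings_2016))

# Relative drop of the mean from 2015 to 2016, in percent
relative_drop = 100 * (mean(ratings_2015) - mean(ratings_2016)) / mean(ratings_2015)

print("2015 (mean, median, mode):", stats_2015)
print("2016 (mean, median, mode):", stats_2016)
print(f"relative drop in the mean: {relative_drop:.1f}%")
```

The recomputed values agree with the summary table, and the relative drop puts the 0.2-star difference in the mean at a little under 5%.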
  • It appears that ratings moved downward in 2016: the mean fell from 4.09 to 3.89 and the mode from 4.5 to 4.0, while the median stayed at 4.0.
  • Let us plot this using a pandas bar plot.
In [48]:
plt.style.use("fivethirtyeight")
summary["2015"].plot.bar(color = '#0066FF', align = 'center', label = '2015', width = .25)
summary["2016"].plot.bar(color = '#CC0000', align = 'edge',label='2016',width=.25,rot=0,figsize=(8,5))

plt.title("Comparing summary statistics: 2015 vs 2016",y=1.07)
plt.ylim(0,5.5)
plt.yticks(np.arange(0,5.1,.5))
plt.ylabel("Stars")
plt.legend(framealpha = 0, loc = 'upper center')
plt.show()
  • Conclusion:
    • Our analysis showed that there is indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.
    • Even so, we cannot be sure that the decrease in ratings was caused by Fandango fixing its biased rating system after Hickey's analysis.