Investigating Fandango Movie Ratings

** 1. Is Fandango Still Inflating Ratings?**

In October 2015, a data journalist named Walt Hickey analyzed movie ratings data and found strong evidence suggesting that Fandango's rating system was biased and dishonest (Fandango is an online movie ratings aggregator). He published his analysis in this article.

Fandango displays a 5-star rating system on their website, where the minimum rating is 0 stars and the maximum is 5 stars.

In this project, we'll analyze more recent movie ratings data to determine whether there has been any change in Fandango's rating system after Hickey's analysis.

** 2. Understanding the Data**

One of the best ways to figure out whether there has been any change in Fandango's rating system after Hickey's analysis is to compare the system's characteristics before and after the analysis.
We will explore two data sets:
- fandango_score_comparison.csv: rating system characteristics before Hickey's analysis. You can find the documentation here.
- movie_ratings_16_17.csv: rating system characteristics after Hickey's analysis. You can find the documentation here.

Start Coding

1. Create a function to read the data sets.
2. Explore the data sets.

In [1]:

# Basic imports
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

In [2]:

# create read_dataset()
def read_dataset(file_path,df_name):
    df = pd.read_csv(file_path)
    df.name = df_name
    return df
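
A side note on `df.name = df_name`: this sets a plain Python attribute on the DataFrame object rather than storing anything in the data, and such an attribute is generally not carried over to data frames derived from it. As a hedged alternative sketch (not what this notebook uses), pandas 1.0 and later provides the experimental `DataFrame.attrs` dictionary for this kind of metadata. The helper name `read_dataset_with_attrs` and the inline CSV text are assumptions made for illustration.

```python
import io
import pandas as pd

def read_dataset_with_attrs(source, df_name):
    """Hypothetical variant of read_dataset() that stores the name in df.attrs."""
    df = pd.read_csv(source)
    # attrs is an experimental metadata dictionary (pandas >= 1.0)
    df.attrs["name"] = df_name
    return df

# Tiny inline CSV standing in for a real file; the two rows are copied
# from the head of the `after` data set shown in this notebook.
csv_text = "movie,year,fandango\n10 Cloverfield Lane,2016,3.5\n13 Hours,2016,4.5\n"
demo = read_dataset_with_attrs(io.StringIO(csv_text), "after")
print(demo.attrs["name"], demo.shape)
```

The rest of the notebook keeps the simpler `df.name` approach and re-assigns the name whenever a new data frame is created.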

Apply the read_dataset() function to both data sets:

- fandango_score_comparison.csv: assign the result to the previous variable.
- movie_ratings_16_17.csv: assign the result to the after variable.

In [3]:

# apply read_dataset() on fandango_score_comparison. 
previous = read_dataset("fandango_score_comparison.csv","previous")

# apply read_dataset() on movie_ratings_16_17. 
after = read_dataset("movie_ratings_16_17.csv","after")

2. Explore the data sets.
- Create a print_rows() function that prints the head and tail of a data frame with the same number of rows. It takes two parameters:
  1. df: the data frame.
  2. num_rows: the number of sample rows to display.
- Create an explore_df() function that prints a data frame's information and description. It takes one parameter:
  1. df: the data frame.

In [4]:

# create print_rows(df,num_rows) function 
def print_rows(df,num_rows):
    print(df.name,"head {}:".format(num_rows))
    display(df.head(num_rows))
    print(df.name,"tail {}:".format(num_rows))
    display(df.tail(num_rows))

In [5]:

# create explore_df(df) function
def explore_df(df):
    print(df.name,"information & describtion")
    display(df.info(),df.describe())
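
`DataFrame.info()` prints its summary and returns `None`, so passing its result to `display()` is what produces the stray `None` in this helper's output. A minimal sketch of a variant that avoids it by writing the summary to a text buffer through the `buf` parameter of `info()`. The helper name `explore_df_clean` is hypothetical, and the two demo rows are taken from the tail of the `after` data set.

```python
import io
import pandas as pd

def explore_df_clean(df, df_name):
    """Hypothetical variant of explore_df(): returns the info text and the describe() table."""
    buffer = io.StringIO()
    df.info(buf=buffer)            # info() writes into the buffer and returns None
    info_text = buffer.getvalue()
    print(df_name, "information & description")
    print(info_text)
    return info_text, df.describe()

demo = pd.DataFrame({"movie": ["Zoolander 2", "Zootopia"], "fandango": [2.5, 4.5]})
info_text, stats = explore_df_clean(demo, "demo")
print(stats)
```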

In [6]:

# apply print_rows() function on previous dataset
print_rows(previous,5)

previous head 5:

	FILM	RottenTomatoes	RottenTomatoes_User	Metacritic	Metacritic_User	IMDB	Fandango_Stars	Fandango_Ratingvalue	RT_norm	RT_user_norm	...	IMDB_norm	RT_norm_round	RT_user_norm_round	Metacritic_norm_round	Metacritic_user_norm_round	IMDB_norm_round	Metacritic_user_vote_count	IMDB_user_vote_count	Fandango_votes	Fandango_Difference
0	Avengers: Age of Ultron (2015)	74	86	66	7.1	7.8	5.0	4.5	3.70	4.3	...	3.90	3.5	4.5	3.5	3.5	4.0	1330	271107	14846	0.5
1	Cinderella (2015)	85	80	67	7.5	7.1	5.0	4.5	4.25	4.0	...	3.55	4.5	4.0	3.5	4.0	3.5	249	65709	12640	0.5
2	Ant-Man (2015)	80	90	64	8.1	7.8	5.0	4.5	4.00	4.5	...	3.90	4.0	4.5	3.0	4.0	4.0	627	103660	12055	0.5
3	Do You Believe? (2015)	18	84	22	4.7	5.4	5.0	4.5	0.90	4.2	...	2.70	1.0	4.0	1.0	2.5	2.5	31	3136	1793	0.5
4	Hot Tub Time Machine 2 (2015)	14	28	29	3.4	5.1	3.5	3.0	0.70	1.4	...	2.55	0.5	1.5	1.5	1.5	2.5	88	19560	1021	0.5

5 rows × 22 columns

previous tail 5:

	FILM	RottenTomatoes	RottenTomatoes_User	Metacritic	Metacritic_User	IMDB	Fandango_Stars	Fandango_Ratingvalue	RT_norm	RT_user_norm	...	IMDB_norm	RT_norm_round	RT_user_norm_round	Metacritic_norm_round	Metacritic_user_norm_round	IMDB_norm_round	Metacritic_user_vote_count	IMDB_user_vote_count	Fandango_votes
141	Mr. Holmes (2015)	87	78	67	7.9	7.4	4.0	4.0	4.35	3.90	...	3.70	4.5	4.0	3.5	4.0	3.5	33	7367	1348
142	'71 (2015)	97	82	83	7.5	7.2	3.5	3.5	4.85	4.10	...	3.60	5.0	4.0	4.0	4.0	3.5	60	24116	192
143	Two Days, One Night (2014)	97	78	89	8.8	7.4	3.5	3.5	4.85	3.90	...	3.70	5.0	4.0	4.5	4.5	3.5	123	24345	118
144	Gett: The Trial of Viviane Amsalem (2015)	100	81	90	7.3	7.8	3.5	3.5	5.00	4.05	...	3.90	5.0	4.0	4.5	3.5	4.0	19	1955	59
145	Kumiko, The Treasure Hunter (2015)	87	63	68	6.4	6.7	3.5	3.5	4.35	3.15	...	3.35	4.5	3.0	3.5	3.0	3.5	19	5289	41

5 rows × 22 columns

In [7]:

# apply explore_df() function on previous data frame
explore_df(previous)

previous information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 22 columns):
FILM                          146 non-null object
RottenTomatoes                146 non-null int64
RottenTomatoes_User           146 non-null int64
Metacritic                    146 non-null int64
Metacritic_User               146 non-null float64
IMDB                          146 non-null float64
Fandango_Stars                146 non-null float64
Fandango_Ratingvalue          146 non-null float64
RT_norm                       146 non-null float64
RT_user_norm                  146 non-null float64
Metacritic_norm               146 non-null float64
Metacritic_user_nom           146 non-null float64
IMDB_norm                     146 non-null float64
RT_norm_round                 146 non-null float64
RT_user_norm_round            146 non-null float64
Metacritic_norm_round         146 non-null float64
Metacritic_user_norm_round    146 non-null float64
IMDB_norm_round               146 non-null float64
Metacritic_user_vote_count    146 non-null int64
IMDB_user_vote_count          146 non-null int64
Fandango_votes                146 non-null int64
Fandango_Difference           146 non-null float64
dtypes: float64(15), int64(6), object(1)
memory usage: 25.2+ KB

None

	RottenTomatoes	RottenTomatoes_User	Metacritic	Metacritic_User	IMDB	Fandango_Stars	Fandango_Ratingvalue	RT_norm	RT_user_norm	Metacritic_norm	...	IMDB_norm	RT_norm_round	RT_user_norm_round	Metacritic_norm_round	Metacritic_user_norm_round	IMDB_norm_round	Metacritic_user_vote_count	IMDB_user_vote_count	Fandango_votes	Fandango_Difference
count	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	...	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000	146.000000
mean	60.849315	63.876712	58.808219	6.519178	6.736986	4.089041	3.845205	3.042466	3.193836	2.940411	...	3.368493	3.065068	3.226027	2.972603	3.270548	3.380137	185.705479	42846.205479	3848.787671	0.243836
std	30.168799	20.024430	19.517389	1.510712	0.958736	0.540386	0.502831	1.508440	1.001222	0.975869	...	0.479368	1.514600	1.007014	0.990961	0.788116	0.502767	316.606515	67406.509171	6357.778617	0.152665
min	5.000000	20.000000	13.000000	2.400000	4.000000	3.000000	2.700000	0.250000	1.000000	0.650000	...	2.000000	0.500000	1.000000	0.500000	1.000000	2.000000	4.000000	243.000000	35.000000	0.000000
25%	31.250000	50.000000	43.500000	5.700000	6.300000	3.500000	3.500000	1.562500	2.500000	2.175000	...	3.150000	1.500000	2.500000	2.125000	3.000000	3.000000	33.250000	5627.000000	222.250000	0.100000
50%	63.500000	66.500000	59.000000	6.850000	6.900000	4.000000	3.900000	3.175000	3.325000	2.950000	...	3.450000	3.000000	3.500000	3.000000	3.500000	3.500000	72.500000	19103.000000	1446.000000	0.200000
75%	89.000000	81.000000	75.000000	7.500000	7.400000	4.500000	4.200000	4.450000	4.050000	3.750000	...	3.700000	4.500000	4.000000	4.000000	4.000000	3.500000	168.500000	45185.750000	4439.500000	0.400000
max	100.000000	94.000000	94.000000	9.600000	8.600000	5.000000	4.800000	5.000000	4.700000	4.700000	...	4.300000	5.000000	4.500000	4.500000	5.000000	4.500000	2375.000000	334164.000000	34846.000000	0.500000

8 rows × 21 columns

After exploring the previous data frame, we can see that many columns hold movie ratings from other sources, which we don't care about here.
We are interested only in the Fandango data, so we will work with the following columns only:
- FILM: The film in question.
- Fandango_Stars: The number of stars the film had on its Fandango movie page.
- Fandango_Ratingvalue: The Fandango ratingValue for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.
- Fandango_votes: The number of user votes the film had on Fandango.
- Fandango_Difference: The difference between the presented Fandango_Stars and the actual Fandango_Ratingvalue.
We will create another data frame with only the columns listed above to make the work easier.
You can find all column descriptions here.

In [8]:

# apply print_rows() function on after data frame
print_rows(after,5) 

after head 5:

	movie	year	metascore	imdb	tmeter	audience	fandango	n_metascore	n_imdb	n_tmeter	n_audience	nr_metascore	nr_imdb	nr_tmeter	nr_audience
0	10 Cloverfield Lane	2016	76	7.2	90	79	3.5	3.80	3.60	4.50	3.95	4.0	3.5	4.5	4.0
1	13 Hours	2016	48	7.3	50	83	4.5	2.40	3.65	2.50	4.15	2.5	3.5	2.5	4.0
2	A Cure for Wellness	2016	47	6.6	40	47	3.0	2.35	3.30	2.00	2.35	2.5	3.5	2.0	2.5
3	A Dog's Purpose	2017	43	5.2	33	76	4.5	2.15	2.60	1.65	3.80	2.0	2.5	1.5	4.0
4	A Hologram for the King	2016	58	6.1	70	57	3.0	2.90	3.05	3.50	2.85	3.0	3.0	3.5	3.0

after tail 5:

	movie	year	metascore	imdb	tmeter	audience	fandango	n_metascore	n_imdb	n_tmeter	n_audience	nr_metascore	nr_imdb	nr_tmeter	nr_audience
209	X-Men: Apocalypse	2016	52	7.1	48	67	4.0	2.6	3.55	2.40	3.35	2.5	3.5	2.5	3.5
210	XX	2017	64	4.7	71	17	3.0	3.2	2.35	3.55	0.85	3.0	2.5	3.5	1.0
211	xXx: Return of Xander Cage	2017	42	5.4	43	45	4.0	2.1	2.70	2.15	2.25	2.0	2.5	2.0	2.0
212	Zoolander 2	2016	34	4.8	23	21	2.5	1.7	2.40	1.15	1.05	1.5	2.5	1.0	1.0
213	Zootopia	2016	78	8.1	98	92	4.5	3.9	4.05	4.90	4.60	4.0	4.0	5.0	4.5

In [9]:

# apply explore_df() function on after data frame 
explore_df(after)

after information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 15 columns):
movie           214 non-null object
year            214 non-null int64
metascore       214 non-null int64
imdb            214 non-null float64
tmeter          214 non-null int64
audience        214 non-null int64
fandango        214 non-null float64
n_metascore     214 non-null float64
n_imdb          214 non-null float64
n_tmeter        214 non-null float64
n_audience      214 non-null float64
nr_metascore    214 non-null float64
nr_imdb         214 non-null float64
nr_tmeter       214 non-null float64
nr_audience     214 non-null float64
dtypes: float64(10), int64(4), object(1)
memory usage: 25.2+ KB

None

	year	metascore	imdb	tmeter	audience	fandango	n_metascore	n_imdb	n_tmeter	n_audience	nr_metascore	nr_imdb	nr_tmeter	nr_audience
count	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000	214.000000
mean	2016.107477	53.266355	6.440654	53.621495	58.626168	3.894860	2.663318	3.220327	2.681075	2.931308	2.658879	3.214953	2.691589	2.915888
std	0.310444	17.843585	1.030056	30.242396	21.100040	0.516781	0.892179	0.515028	1.512120	1.055002	0.924619	0.526803	1.519273	1.060352
min	2016.000000	11.000000	3.500000	0.000000	11.000000	2.500000	0.550000	1.750000	0.000000	0.550000	0.500000	2.000000	0.000000	0.500000
25%	2016.000000	39.000000	5.825000	27.000000	43.250000	3.500000	1.950000	2.912500	1.350000	2.162500	2.000000	3.000000	1.500000	2.000000
50%	2016.000000	53.500000	6.500000	56.500000	60.500000	4.000000	2.675000	3.250000	2.825000	3.025000	2.500000	3.000000	3.000000	3.000000
75%	2016.000000	66.000000	7.200000	83.000000	76.750000	4.500000	3.300000	3.600000	4.150000	3.837500	3.500000	3.500000	4.000000	4.000000
max	2017.000000	99.000000	8.500000	99.000000	93.000000	5.000000	4.950000	4.250000	4.950000	4.650000	5.000000	4.000000	5.000000	4.500000

After exploring the after data frame, we can see that many columns hold movie ratings from other sources, which we don't care about here.
We are interested only in the Fandango data, so we will work with the following columns only:

- movie: the name of the movie.
- year: the release year of the movie.
- fandango: the Fandango rating of the movie (user score).

We will create another data frame with only the columns listed above to make the work easier.
You can find all column descriptions here.

As mentioned above, we will copy only the columns related to our investigation into a new data frame.

So let us create a new_df() function that returns the new data frame and takes three parameters:
- df: the source data frame.
- col_index: the positional indexes of the columns to select.
- df_name: the name to give the new data frame.

In [10]:

def new_df(df,col_index,df_name):
    new_df = df.iloc[:,col_index].copy()
    new_df.name = df_name
    return new_df
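
Selecting by position works here, but it would silently pick the wrong columns if the CSV layout ever changed. A sketch of a label-based counterpart, assuming a hypothetical helper name `select_columns`. The small demo frame only mimics a few columns of the real data, and its `metascore` values are placeholders.

```python
import pandas as pd

def select_columns(df, columns):
    """Hypothetical label-based counterpart of new_df()."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # .copy() so later edits do not touch the original data frame
    return df.loc[:, columns].copy()

demo = pd.DataFrame({
    "movie": ["Warcraft", "Max Steel"],
    "year": [2016, 2016],
    "metascore": [0, 0],        # placeholder values, not real scores
    "fandango": [4.0, 3.5],
})
subset = select_columns(demo, ["movie", "year", "fandango"])
print(subset)
```

Asking for a column that does not exist raises a `KeyError` immediately instead of returning unexpected data.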

In [11]:

# apply new_df() function on previous data frame , select columns we mentioned above
new_previous = new_df(previous,[0,6,7,-2,-1],"new_previous")

In [12]:

# apply new_df() function on after data frame , select columns we mentioned above
new_after = new_df(after,[0,1,6],"new_after")

In [13]:

# apply explore_df() on new previous to confirm our result
explore_df(new_previous)

new_previous information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 5 columns):
FILM                    146 non-null object
Fandango_Stars          146 non-null float64
Fandango_Ratingvalue    146 non-null float64
Fandango_votes          146 non-null int64
Fandango_Difference     146 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 5.8+ KB

None

	Fandango_Stars	Fandango_Ratingvalue	Fandango_votes	Fandango_Difference
count	146.000000	146.000000	146.000000	146.000000
mean	4.089041	3.845205	3848.787671	0.243836
std	0.540386	0.502831	6357.778617	0.152665
min	3.000000	2.700000	35.000000	0.000000
25%	3.500000	3.500000	222.250000	0.100000
50%	4.000000	3.900000	1446.000000	0.200000
75%	4.500000	4.200000	4439.500000	0.400000
max	5.000000	4.800000	34846.000000	0.500000

In [14]:

# apply explore_df() on new_after to confirm our result
explore_df(new_after)

new_after information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 3 columns):
movie       214 non-null object
year        214 non-null int64
fandango    214 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.1+ KB

None

	year	fandango
count	214.000000	214.000000
mean	2016.107477	3.894860
std	0.310444	0.516781
min	2016.000000	2.500000
25%	2016.000000	3.500000
50%	2016.000000	4.000000
75%	2016.000000	4.500000
max	2017.000000	5.000000

Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis.
Our population of interest is all movie ratings on Fandango's website, regardless of release year.
Because we want to compare the rating system before and after the analysis, we have two data sets, one for each period.
The data we are working with was sampled, so we need to make sure the samples are representative of the population. Otherwise we should expect a large sampling error and, ultimately, wrong conclusions.
From the Fandango repository, we can see that Hickey used the following sampling criteria:

- The movie had at least 30 fan reviews on Fandango.
- The data from Fandango was pulled on Aug. 24, 2015.

It follows that the sampling was clearly not random, because not every movie had the same chance of being included in the sample. Some movies had no chance at all (like those with under 30 fan ratings).
From the movie ratings (2016 and 2017) repository, we can see that its author used the following sampling criteria:

- Movie ratings data for 214 of the most popular movies (with a significant number of votes) released in 2016 and 2017.

From all of the above, we can conclude that neither data set is representative of our population. Both samples were good for the research they were collected for, and this kind of sampling is called purposive sampling (or judgmental/selective/subjective sampling). It is not good enough for our goal, though, because neither sample is random and each is biased toward the purpose it was selected for.
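
To make the idea of sampling error concrete, here is a small simulation. It is only a sketch on synthetic data: the population, the vote counts, and the assumed link between votes and ratings are all made up for illustration and say nothing about Fandango itself. It compares the sampling error (population mean minus sample mean) of a simple random sample with that of a purposive sample restricted to heavily voted movies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population (made up for illustration, not Fandango data).
n_movies = 10_000
votes = rng.integers(1, 1000, size=n_movies)
# Assumption: movies with more votes tend to be rated somewhat higher.
ratings = np.clip(2.5 + 1.5 * votes / 1000 + rng.normal(0, 0.5, size=n_movies), 0, 5)

population_mean = ratings.mean()

# Simple random sample: every movie has the same chance of being selected.
random_sample = rng.choice(ratings, size=200, replace=False)

# Purposive sample: only movies above a vote threshold can be selected.
purposive_sample = ratings[votes >= 500]

random_error = population_mean - random_sample.mean()
purposive_error = population_mean - purposive_sample.mean()
print(f"random sampling error:    {random_error:+.3f}")
print(f"purposive sampling error: {purposive_error:+.3f}")
```

Under these assumptions the purposive sample overestimates the population mean, which is the kind of bias the paragraph above warns about.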

** Result of our exploration **

At this point, we have at least two alternatives:
- collect new data, or
- change the goal of our analysis by placing some limitations on it.

Tweaking our goal seems a much faster choice compared to collecting new data. Also, it's quasi-impossible to collect a new sample previous to Hickey's analysis at this moment in time.

** Our new goal**

Our new goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. This new goal should also be a fairly good proxy for our initial goal.

** Working on our new goal**

With the new goal, we now have two populations that we want to describe and compare with each other:

- All Fandango's ratings for popular movies released in 2015.
- All Fandango's ratings for popular movies released in 2016.

The term "popular" is vague and we need to define it with precision before continuing. We'll use Hickey's benchmark of 30 fan ratings and consider a movie "popular" only if it has 30 fan ratings or more on Fandango's website.

** Let's start coding **

First of all, we need to check whether both samples contain popular movies.
Create a check_sample() function that displays a random sample from a data frame so we can check the number of fan ratings.

The function takes two parameters:
- df: the data frame.
- num_sample: the number of rows to sample.

In [15]:

def check_sample(df,num_sample):
    display(df.sample(num_sample,random_state=1))

In [16]:

# apply check_sample() on new_previous 
check_sample(new_previous,10)

	FILM	Fandango_Stars	Fandango_Ratingvalue	Fandango_votes	Fandango_Difference
98	Get Hard (2015)	4.0	3.9	5933	0.1
66	The Gift (2015)	4.0	3.7	2680	0.3
53	Hot Pursuit (2015)	4.0	3.7	2618	0.3
75	San Andreas (2015)	4.5	4.3	9749	0.2
121	The Stanford Prison Experiment (2015)	4.0	3.9	51	0.1
74	The Hobbit: The Battle of the Five Armies (2014)	4.5	4.3	15337	0.2
119	Phoenix (2015)	3.5	3.4	70	0.1
128	Mission: Impossible – Rogue Nation (2015)	4.5	4.4	8357	0.1
44	Kingsman: The Secret Service (2015)	4.5	4.2	15205	0.3
58	Seventh Son (2015)	3.5	3.2	1213	0.3

In [17]:

# apply check_sample() on new_after 
check_sample(new_after,5)

	movie	year	fandango
108	Mechanic: Resurrection	2016	4.0
206	Warcraft	2016	4.0
106	Max Steel	2016	3.5
107	Me Before You	2016	4.5
51	Fantastic Beasts and Where to Find Them	2016	4.5

For new_previous, the author selected the data according to the criterion that movies had at least 30 fan reviews on Fandango, but let us still confirm that no movie has fewer than 30 fan ratings. The new_after data frame has no vote-count column, so the same check cannot be run on it directly.

In [18]:

# count the movies with fewer than 30 fan ratings
less_30_fan = (new_previous["Fandango_votes"] < 30).sum()
less_30_fan

Out[18]:

0

As we can see, the new_previous data set doesn't contain any movie with fewer than 30 fan ratings.
Now let us confirm that the movies in both data sets were released in the years we want to work with:

- new_previous: 2015 movies
- new_after: 2016 movies

In [19]:

# check release date on new_previous 
previous_movies_year = new_previous["FILM"].str.split("(").str[-1]
print(previous_movies_year.unique())

['2015)' '2014)']

In [20]:

# check release date on new_after
after_movies_year = new_after["year"].unique()
after_movies_year

Out[20]:

array([2016, 2017])

From the above, we can see that the new_previous data frame has movies from 2014 and 2015, and the new_after data frame has movies from 2016 and 2017.
We need to isolate only the sample points that belong to our populations of interest.
- Isolate the movies released in 2015 in a separate data set.
- Isolate the movies released in 2016 in another separate data set.
First, we will add a year column to the new_previous data frame so we can easily isolate the data by year.

In [21]:

# add year column on new_previous data frame
new_previous["year"] = new_previous["FILM"].str.split("(").str[-1].str.replace(")","",regex=False).astype("int")

# check year column 
new_previous["year"].unique()

Out[21]:

array([2015, 2014])
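
The split-and-replace chain above relies on the year being the last parenthesized part of the title. An alternative sketch that states this assumption explicitly uses `Series.str.extract` with a regular expression. The three titles are taken from the previous data set, and the variable names are only illustrative.

```python
import pandas as pd

films = pd.Series([
    "Avengers: Age of Ultron (2015)",
    "Two Days, One Night (2014)",
    "'71 (2015)",
])

# Capture exactly four digits inside the parentheses that end the title.
years = films.str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
print(years.tolist())
```

A title that does not end in a four-digit year would yield a missing value here, which makes the `astype(int)` call fail loudly rather than produce a wrong year.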

Create an isolate_year() function that takes four parameters:
- df: the data frame.
- col_year: the name of the column that holds the release year.
- year: the year to keep, as an int.
- df_name: a string with the name of the resulting data frame.

In [22]:

def isolate_year(df,col_year,year,df_name):
    df_year = df.loc[df[col_year].astype("int")== year].copy()
    df_year.name = df_name
    return df_year

In [23]:

# apply isolate_year() function on new_previous
previous_2015 = isolate_year(new_previous,"year",2015,"previous_2015")
previous_2015["year"].unique()

Out[23]:

array([2015])

In [24]:

# apply isolate_year() function on new_after
after_2016 = isolate_year(new_after,"year",2016,"after_2016")
after_2016["year"].unique()

Out[24]:

array([2016])

We can now start analyzing the two samples we isolated. Once again, our goal is to determine whether there's any difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016.
We will start simple by making a high-level comparison between the shapes of the distributions of movie ratings for both samples.
To do that, we generate two kernel density plots on the same figure, one for the distribution of movie ratings of each sample.

In [25]:

explore_df(previous_2015)

previous_2015 information & describtion
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 0 to 145
Data columns (total 6 columns):
FILM                    129 non-null object
Fandango_Stars          129 non-null float64
Fandango_Ratingvalue    129 non-null float64
Fandango_votes          129 non-null int64
Fandango_Difference     129 non-null float64
year                    129 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 7.1+ KB

None

	Fandango_Stars	Fandango_Ratingvalue	Fandango_votes	Fandango_Difference	year
count	129.000000	129.000000	129.000000	129.000000	129.0
mean	4.085271	3.846512	3761.891473	0.238760	2015.0
std	0.538096	0.505446	6543.601748	0.152741	0.0
min	3.000000	2.700000	35.000000	0.000000	2015.0
25%	3.500000	3.500000	210.000000	0.100000	2015.0
50%	4.000000	3.900000	1415.000000	0.200000	2015.0
75%	4.500000	4.200000	4045.000000	0.400000	2015.0
max	5.000000	4.800000	34846.000000	0.500000	2015.0

In [26]:

explore_df(after_2016)

after_2016 information & describtion
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 213
Data columns (total 3 columns):
movie       191 non-null object
year        191 non-null int64
fandango    191 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.0+ KB

None

	year	fandango
count	191.0	191.000000
mean	2016.0	3.887435
std	0.0	0.509540
min	2016.0	2.500000
25%	2016.0	3.500000
50%	2016.0	4.000000
75%	2016.0	4.250000
max	2016.0	5.000000

In [27]:

plt.style.use('fivethirtyeight')
previous_2015["Fandango_Stars"].plot.kde(label="2015",legend=True)
after_2016["fandango"].plot.kde(label="2016",legend=True)
plt.title("Fandango's ratings distribution for popular movies in\n(2015 Vs 2016)",fontsize=20,y=1.05)
plt.xlim(0,5)
plt.xticks(np.arange(0,5.1,0.5))
plt.show()

The 2015 and 2016 distributions of Fandango's ratings are both left skewed, which shows that movies on Fandango mostly receive high fan ratings. This needs more analysis.
Although the shapes are similar, the 2015 distribution peaks at around 4.5 stars while the 2016 distribution peaks at around 4.0 stars, so the peak sits about half a star lower in 2016. This suggests that ratings in 2016 were slightly lower than in 2015.
Let us now analyze more granular information.
Because the data sets have different numbers of movies, we normalize the frequency tables and show percentages instead.
Create a calc_per() function to normalize the ratings and return percentages. The function takes two parameters:
- df: the data frame.
- col_name: the column to normalize.

In [28]:

# create calc_per()
def calc_per(df,col_name):
    ratings_per = df[col_name].value_counts(normalize=True).sort_index() * 100
    return ratings_per

In [29]:

# apply calc_per() on previous_2015 , ratings column "Fandango_Stars"
ratings_per_2015 = calc_per(previous_2015,"Fandango_Stars")
ratings_per_2015

Out[29]:

3.0     8.527132
3.5    17.829457
4.0    28.682171
4.5    37.984496
5.0     6.976744
Name: Fandango_Stars, dtype: float64

In [30]:

# apply calc_per() on after_2016 , ratings column "fandango"
ratings_per_2016 = calc_per(after_2016,"fandango")
ratings_per_2016

Out[30]:

2.5     3.141361
3.0     7.329843
3.5    24.083770
4.0    40.314136
4.5    24.607330
5.0     0.523560
Name: fandango, dtype: float64
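
The two percentage tables are easier to compare side by side. The sketch below rebuilds them from star counts that are inferred from the percentages above and the sample sizes (129 movies for 2015, 191 for 2016), so the counts are a reconstruction rather than values read from the CSV files. The column labels are illustrative.

```python
import pandas as pd

stars = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
# Counts inferred from the percentage tables (a reconstruction, not raw data).
counts_2015 = pd.Series([0, 11, 23, 37, 49, 9], index=stars)
counts_2016 = pd.Series([6, 14, 46, 77, 47, 1], index=stars)

comparison = pd.DataFrame({
    "2015 (%)": counts_2015 / counts_2015.sum() * 100,
    "2016 (%)": counts_2016 / counts_2016.sum() * 100,
})
# Difference in percentage points: positive means more frequent in 2016.
comparison["change (pp)"] = comparison["2016 (%)"] - comparison["2015 (%)"]
print(comparison.round(2))
```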

The minimum rating is lower in 2016: 2.5 stars instead of 3.0 stars, the minimum in 2015. About 3% of the 2016 movies have the 2016 minimum of 2.5 stars, while about 8.5% of the 2015 movies have the 2015 minimum of 3.0 stars.
The maximum rating in both data sets is 5.0 stars, but with a big difference in frequency: fewer than 1% of the 2016 movies have 5 stars, compared with about 7% of the 2015 movies.
The most frequent rating also dropped, from 4.5 stars in 2015 to 4.0 stars in 2016.
In turn, the percentages of the middle ratings (3.5 and 4.0 stars) are higher in 2016 than in 2015.
Despite all the points above, we are still not sure about the direction of the change, so we'll take a couple of summary statistics to get a more precise picture of it.
We'll take each distribution of movie ratings and compute its mean, median, and mode, and then compare these statistics to determine what they tell about the direction of the difference.

Determining the Direction of the Change

Compute the mean, median, and mode for each distribution.
- Create a calc_state() function that takes two parameters:
  - df: the data frame.
  - col_name: the name of the column to compute the statistics on.

In [38]:

def calc_state(df,col_name):
    x_mean = round(df[col_name].mean(),2)
    x_median = round(df[col_name].median(),2)
    # mode() returns a Series (there can be ties), so take its first value
    x_mode = round(df[col_name].mode()[0],2)
    return (x_mean,x_median,x_mode)

In [32]:

# apply calc_state on previous_2015 on Fandango_Stars & assign it as a list on summary_2015 variable
summary_2015 = list(calc_state(previous_2015,"Fandango_Stars"))
summary_2015

Out[32]:

[4.09, 4.0, 4.5]

In [37]:

# apply calc_state on after_2016 on fandango & assign it as a list to the summary_2016 variable
summary_2016 = list(calc_state(after_2016,"fandango"))

In [34]:

# create summary data frame   
summary = pd.DataFrame(summary_2015,index=["mean","median","mode"],columns=["2015"])
summary

Out[34]:

	2015
mean	4.09
median	4.00
mode	4.50

In [35]:

# add 2016 on summary data frame 
summary["2016"] = summary_2016
summary

Out[35]:

	2015	2016
mean	4.09	3.89
median	4.00	4.00
mode	4.50	4.00
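
As a cross-check, the same statistics can be recomputed from the frequency tables alone, together with the relative change between the two years. This is a sketch under the assumption that the star counts below, inferred from the percentage tables and the sample sizes of 129 and 191 movies, match the real data. The name `summary_check` is illustrative.

```python
import numpy as np
import pandas as pd

stars = np.array([2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
# Counts inferred from the percentage tables (a reconstruction, not raw data).
ratings_2015 = pd.Series(np.repeat(stars, [0, 11, 23, 37, 49, 9]))
ratings_2016 = pd.Series(np.repeat(stars, [6, 14, 46, 77, 47, 1]))

summary_check = pd.DataFrame({
    "2015": ratings_2015.agg(["mean", "median"]),
    "2016": ratings_2016.agg(["mean", "median"]),
})
# mode() returns a Series because of possible ties; take the first value.
summary_check.loc["mode"] = [ratings_2015.mode()[0], ratings_2016.mode()[0]]
# Relative change of each statistic from 2015 to 2016, in percent.
summary_check["change (%)"] = (summary_check["2016"] / summary_check["2015"] - 1) * 100
print(summary_check.round(2))
```

The relative change puts the drop in the mean at roughly 5%, while the median does not change at all.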

The ratings appear to be lower in 2016 than in 2015: the mean dropped from 4.09 to 3.89 and the mode dropped from 4.5 to 4.0, while the median stayed at 4.0 in both years.
Let us plot the summary using a pandas bar plot.

In [48]:

plt.style.use("fivethirtyeight")
summary["2015"].plot.bar(color = '#0066FF', align = 'center', label = '2015', width = .25)
summary["2016"].plot.bar(color = '#CC0000', align = 'edge',label='2016',width=.25,rot=0,figsize=(8,5))

plt.title("Comparing summary statistics: 2015 vs 2016",y=1.07)
plt.ylim(0,5.5)
plt.yticks(np.arange(0,5.1,.5))
plt.ylabel("Stars")
plt.legend(framealpha = 0, loc = 'upper center')
plt.show()

Conclusion:
- Our analysis showed that there's indeed a slight difference between Fandango's ratings for popular movies in 2015 and Fandango's ratings for popular movies in 2016. We also determined that, on average, popular movies released in 2016 were rated lower on Fandango than popular movies released in 2015.
- In spite of that, we cannot be sure that the decrease in ratings was caused by Fandango fixing its biased rating system after Hickey's analysis.
