**1. Is Fandango Still Inflating Ratings?**
Fandango displays a 5-star rating system on their website, where the minimum rating is 0 stars and the maximum is 5 stars.
**2. Understanding the Data**
# Basic imports
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# create read_dataset()
def read_dataset(file_path, df_name):
    df = pd.read_csv(file_path)
    df.name = df_name
    return df
Apply the read_dataset() function to both data sets.
# apply read_dataset() on fandango_score_comparison.
previous = read_dataset("fandango_score_comparison.csv","previous")
# apply read_dataset() on movie_ratings_16_17.
after = read_dataset("movie_ratings_16_17.csv","after")
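Setting `df.name` as above stores a plain Python attribute on the object, so the label is not carried over to frames derived from it (which is why the later helpers assign it again). As a hedged alternative, recent pandas versions offer `DataFrame.attrs`, a dictionary for metadata that the pandas documentation marks as experimental. The sketch below assumes pandas 1.0 or later; the helper name `read_dataset_attrs` and the tiny inline CSV are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

def read_dataset_attrs(source, df_name):
    """Read a CSV and record a display label in DataFrame.attrs (experimental API)."""
    df = pd.read_csv(source)
    df.attrs["name"] = df_name
    return df

# Illustrative two-row CSV; the notebook itself reads the real files from disk.
csv_text = "movie,year,fandango\nZootopia,2016,4.5\nXX,2017,3.0\n"
demo = read_dataset_attrs(io.StringIO(csv_text), "after")
print(demo.attrs["name"], demo.shape)
```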
2- Explore the datasets
- Create a print_rows() function that takes two parameters and prints the head and tail of the data frame:
  1- df: the data frame.
  2- num_rows: the number of rows to display from each end.
- Create an explore_df() function that takes one parameter and prints the data frame's information and description:
  1- df: the data frame.
# create print_rows(df, num_rows) function
def print_rows(df, num_rows):
    print(df.name, "head {}:".format(num_rows))
    display(df.head(num_rows))
    print(df.name, "tail {}:".format(num_rows))
    display(df.tail(num_rows))
# create explore_df(df) function
def explore_df(df):
    print(df.name, "information & describtion")
    # df.info() prints its report and returns None, so display() also shows "None"
    display(df.info(), df.describe())
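`DataFrame.info()` prints its report and returns `None`, which is why a stray `None` appears in the output of `explore_df()`. A minimal sketch of a variant that avoids this by capturing the report as text through the `buf` parameter of `info()`; the function name `explore_df_text` and the small frame are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

def explore_df_text(df, df_name):
    """Print the info() report and the describe() table without a stray None."""
    buffer = io.StringIO()
    df.info(buf=buffer)            # write the report into the buffer instead of stdout
    info_text = buffer.getvalue()
    stats = df.describe()
    print(df_name, "information & description")
    print(info_text)
    print(stats)
    return info_text, stats

# Made-up three-row frame, just to exercise the function.
demo = pd.DataFrame({"fandango": [3.5, 4.5, 3.0], "year": [2016, 2016, 2017]})
info_text, stats = explore_df_text(demo, "demo")
```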
# apply print_rows() function on previous dataset
print_rows(previous,5)
previous head 5:
| | FILM | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | ... | IMDB_norm | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Avengers: Age of Ultron (2015) | 74 | 86 | 66 | 7.1 | 7.8 | 5.0 | 4.5 | 3.70 | 4.3 | ... | 3.90 | 3.5 | 4.5 | 3.5 | 3.5 | 4.0 | 1330 | 271107 | 14846 | 0.5 |
1 | Cinderella (2015) | 85 | 80 | 67 | 7.5 | 7.1 | 5.0 | 4.5 | 4.25 | 4.0 | ... | 3.55 | 4.5 | 4.0 | 3.5 | 4.0 | 3.5 | 249 | 65709 | 12640 | 0.5 |
2 | Ant-Man (2015) | 80 | 90 | 64 | 8.1 | 7.8 | 5.0 | 4.5 | 4.00 | 4.5 | ... | 3.90 | 4.0 | 4.5 | 3.0 | 4.0 | 4.0 | 627 | 103660 | 12055 | 0.5 |
3 | Do You Believe? (2015) | 18 | 84 | 22 | 4.7 | 5.4 | 5.0 | 4.5 | 0.90 | 4.2 | ... | 2.70 | 1.0 | 4.0 | 1.0 | 2.5 | 2.5 | 31 | 3136 | 1793 | 0.5 |
4 | Hot Tub Time Machine 2 (2015) | 14 | 28 | 29 | 3.4 | 5.1 | 3.5 | 3.0 | 0.70 | 1.4 | ... | 2.55 | 0.5 | 1.5 | 1.5 | 1.5 | 2.5 | 88 | 19560 | 1021 | 0.5 |
5 rows × 22 columns
previous tail 5:
| | FILM | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | ... | IMDB_norm | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
141 | Mr. Holmes (2015) | 87 | 78 | 67 | 7.9 | 7.4 | 4.0 | 4.0 | 4.35 | 3.90 | ... | 3.70 | 4.5 | 4.0 | 3.5 | 4.0 | 3.5 | 33 | 7367 | 1348 | 0.0 |
142 | '71 (2015) | 97 | 82 | 83 | 7.5 | 7.2 | 3.5 | 3.5 | 4.85 | 4.10 | ... | 3.60 | 5.0 | 4.0 | 4.0 | 4.0 | 3.5 | 60 | 24116 | 192 | 0.0 |
143 | Two Days, One Night (2014) | 97 | 78 | 89 | 8.8 | 7.4 | 3.5 | 3.5 | 4.85 | 3.90 | ... | 3.70 | 5.0 | 4.0 | 4.5 | 4.5 | 3.5 | 123 | 24345 | 118 | 0.0 |
144 | Gett: The Trial of Viviane Amsalem (2015) | 100 | 81 | 90 | 7.3 | 7.8 | 3.5 | 3.5 | 5.00 | 4.05 | ... | 3.90 | 5.0 | 4.0 | 4.5 | 3.5 | 4.0 | 19 | 1955 | 59 | 0.0 |
145 | Kumiko, The Treasure Hunter (2015) | 87 | 63 | 68 | 6.4 | 6.7 | 3.5 | 3.5 | 4.35 | 3.15 | ... | 3.35 | 4.5 | 3.0 | 3.5 | 3.0 | 3.5 | 19 | 5289 | 41 | 0.0 |
5 rows × 22 columns
# apply explore_df() function on previous data frame
explore_df(previous)
previous information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 22 columns):
FILM 146 non-null object
RottenTomatoes 146 non-null int64
RottenTomatoes_User 146 non-null int64
Metacritic 146 non-null int64
Metacritic_User 146 non-null float64
IMDB 146 non-null float64
Fandango_Stars 146 non-null float64
Fandango_Ratingvalue 146 non-null float64
RT_norm 146 non-null float64
RT_user_norm 146 non-null float64
Metacritic_norm 146 non-null float64
Metacritic_user_nom 146 non-null float64
IMDB_norm 146 non-null float64
RT_norm_round 146 non-null float64
RT_user_norm_round 146 non-null float64
Metacritic_norm_round 146 non-null float64
Metacritic_user_norm_round 146 non-null float64
IMDB_norm_round 146 non-null float64
Metacritic_user_vote_count 146 non-null int64
IMDB_user_vote_count 146 non-null int64
Fandango_votes 146 non-null int64
Fandango_Difference 146 non-null float64
dtypes: float64(15), int64(6), object(1)
memory usage: 25.2+ KB
None
| | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | Metacritic_norm | ... | IMDB_norm | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | ... | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 | 146.000000 |
mean | 60.849315 | 63.876712 | 58.808219 | 6.519178 | 6.736986 | 4.089041 | 3.845205 | 3.042466 | 3.193836 | 2.940411 | ... | 3.368493 | 3.065068 | 3.226027 | 2.972603 | 3.270548 | 3.380137 | 185.705479 | 42846.205479 | 3848.787671 | 0.243836 |
std | 30.168799 | 20.024430 | 19.517389 | 1.510712 | 0.958736 | 0.540386 | 0.502831 | 1.508440 | 1.001222 | 0.975869 | ... | 0.479368 | 1.514600 | 1.007014 | 0.990961 | 0.788116 | 0.502767 | 316.606515 | 67406.509171 | 6357.778617 | 0.152665 |
min | 5.000000 | 20.000000 | 13.000000 | 2.400000 | 4.000000 | 3.000000 | 2.700000 | 0.250000 | 1.000000 | 0.650000 | ... | 2.000000 | 0.500000 | 1.000000 | 0.500000 | 1.000000 | 2.000000 | 4.000000 | 243.000000 | 35.000000 | 0.000000 |
25% | 31.250000 | 50.000000 | 43.500000 | 5.700000 | 6.300000 | 3.500000 | 3.500000 | 1.562500 | 2.500000 | 2.175000 | ... | 3.150000 | 1.500000 | 2.500000 | 2.125000 | 3.000000 | 3.000000 | 33.250000 | 5627.000000 | 222.250000 | 0.100000 |
50% | 63.500000 | 66.500000 | 59.000000 | 6.850000 | 6.900000 | 4.000000 | 3.900000 | 3.175000 | 3.325000 | 2.950000 | ... | 3.450000 | 3.000000 | 3.500000 | 3.000000 | 3.500000 | 3.500000 | 72.500000 | 19103.000000 | 1446.000000 | 0.200000 |
75% | 89.000000 | 81.000000 | 75.000000 | 7.500000 | 7.400000 | 4.500000 | 4.200000 | 4.450000 | 4.050000 | 3.750000 | ... | 3.700000 | 4.500000 | 4.000000 | 4.000000 | 4.000000 | 3.500000 | 168.500000 | 45185.750000 | 4439.500000 | 0.400000 |
max | 100.000000 | 94.000000 | 94.000000 | 9.600000 | 8.600000 | 5.000000 | 4.800000 | 5.000000 | 4.700000 | 4.700000 | ... | 4.300000 | 5.000000 | 4.500000 | 4.500000 | 5.000000 | 4.500000 | 2375.000000 | 334164.000000 | 34846.000000 | 0.500000 |
8 rows × 21 columns
# apply print_rows() function on after data frame
print_rows(after,5)
after head 5:
| | movie | year | metascore | imdb | tmeter | audience | fandango | n_metascore | n_imdb | n_tmeter | n_audience | nr_metascore | nr_imdb | nr_tmeter | nr_audience |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10 Cloverfield Lane | 2016 | 76 | 7.2 | 90 | 79 | 3.5 | 3.80 | 3.60 | 4.50 | 3.95 | 4.0 | 3.5 | 4.5 | 4.0 |
1 | 13 Hours | 2016 | 48 | 7.3 | 50 | 83 | 4.5 | 2.40 | 3.65 | 2.50 | 4.15 | 2.5 | 3.5 | 2.5 | 4.0 |
2 | A Cure for Wellness | 2016 | 47 | 6.6 | 40 | 47 | 3.0 | 2.35 | 3.30 | 2.00 | 2.35 | 2.5 | 3.5 | 2.0 | 2.5 |
3 | A Dog's Purpose | 2017 | 43 | 5.2 | 33 | 76 | 4.5 | 2.15 | 2.60 | 1.65 | 3.80 | 2.0 | 2.5 | 1.5 | 4.0 |
4 | A Hologram for the King | 2016 | 58 | 6.1 | 70 | 57 | 3.0 | 2.90 | 3.05 | 3.50 | 2.85 | 3.0 | 3.0 | 3.5 | 3.0 |
after tail 5:
| | movie | year | metascore | imdb | tmeter | audience | fandango | n_metascore | n_imdb | n_tmeter | n_audience | nr_metascore | nr_imdb | nr_tmeter | nr_audience |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
209 | X-Men: Apocalypse | 2016 | 52 | 7.1 | 48 | 67 | 4.0 | 2.6 | 3.55 | 2.40 | 3.35 | 2.5 | 3.5 | 2.5 | 3.5 |
210 | XX | 2017 | 64 | 4.7 | 71 | 17 | 3.0 | 3.2 | 2.35 | 3.55 | 0.85 | 3.0 | 2.5 | 3.5 | 1.0 |
211 | xXx: Return of Xander Cage | 2017 | 42 | 5.4 | 43 | 45 | 4.0 | 2.1 | 2.70 | 2.15 | 2.25 | 2.0 | 2.5 | 2.0 | 2.0 |
212 | Zoolander 2 | 2016 | 34 | 4.8 | 23 | 21 | 2.5 | 1.7 | 2.40 | 1.15 | 1.05 | 1.5 | 2.5 | 1.0 | 1.0 |
213 | Zootopia | 2016 | 78 | 8.1 | 98 | 92 | 4.5 | 3.9 | 4.05 | 4.90 | 4.60 | 4.0 | 4.0 | 5.0 | 4.5 |
# apply explore_df() function on after data frame
explore_df(after)
after information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 15 columns):
movie 214 non-null object
year 214 non-null int64
metascore 214 non-null int64
imdb 214 non-null float64
tmeter 214 non-null int64
audience 214 non-null int64
fandango 214 non-null float64
n_metascore 214 non-null float64
n_imdb 214 non-null float64
n_tmeter 214 non-null float64
n_audience 214 non-null float64
nr_metascore 214 non-null float64
nr_imdb 214 non-null float64
nr_tmeter 214 non-null float64
nr_audience 214 non-null float64
dtypes: float64(10), int64(4), object(1)
memory usage: 25.2+ KB
None
| | year | metascore | imdb | tmeter | audience | fandango | n_metascore | n_imdb | n_tmeter | n_audience | nr_metascore | nr_imdb | nr_tmeter | nr_audience |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 | 214.000000 |
mean | 2016.107477 | 53.266355 | 6.440654 | 53.621495 | 58.626168 | 3.894860 | 2.663318 | 3.220327 | 2.681075 | 2.931308 | 2.658879 | 3.214953 | 2.691589 | 2.915888 |
std | 0.310444 | 17.843585 | 1.030056 | 30.242396 | 21.100040 | 0.516781 | 0.892179 | 0.515028 | 1.512120 | 1.055002 | 0.924619 | 0.526803 | 1.519273 | 1.060352 |
min | 2016.000000 | 11.000000 | 3.500000 | 0.000000 | 11.000000 | 2.500000 | 0.550000 | 1.750000 | 0.000000 | 0.550000 | 0.500000 | 2.000000 | 0.000000 | 0.500000 |
25% | 2016.000000 | 39.000000 | 5.825000 | 27.000000 | 43.250000 | 3.500000 | 1.950000 | 2.912500 | 1.350000 | 2.162500 | 2.000000 | 3.000000 | 1.500000 | 2.000000 |
50% | 2016.000000 | 53.500000 | 6.500000 | 56.500000 | 60.500000 | 4.000000 | 2.675000 | 3.250000 | 2.825000 | 3.025000 | 2.500000 | 3.000000 | 3.000000 | 3.000000 |
75% | 2016.000000 | 66.000000 | 7.200000 | 83.000000 | 76.750000 | 4.500000 | 3.300000 | 3.600000 | 4.150000 | 3.837500 | 3.500000 | 3.500000 | 4.000000 | 4.000000 |
max | 2017.000000 | 99.000000 | 8.500000 | 99.000000 | 93.000000 | 5.000000 | 4.950000 | 4.250000 | 4.950000 | 4.650000 | 5.000000 | 4.000000 | 5.000000 | 4.500000 |
For the after data set, the columns of interest are:
* movie: the name of the movie.
* year: the release year of the movie.
* fandango: the Fandango rating of the movie (user score).
So let us create a new_df() function that takes three parameters and returns a new data frame:
* df: the data frame.
* col_index: the indices of the columns to select.
* df_name: the name of the new data frame.
def new_df(df, col_index, df_name):
    subset = df.iloc[:, col_index].copy()
    subset.name = df_name
    return subset
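The new_df() helper selects columns by position, which silently picks the wrong columns if the file layout ever changes. A hedged sketch of the same idea with column labels instead; `new_df_by_name` is a hypothetical helper, and the one-row frame only mimics a few columns of the real data.

```python
import pandas as pd

def new_df_by_name(df, columns):
    """Return an independent copy of the requested columns, selected by label."""
    return df.loc[:, columns].copy()

# One illustrative row with a handful of the original column names.
demo = pd.DataFrame({
    "FILM": ["Cinderella (2015)"],
    "IMDB": [7.1],
    "Fandango_Stars": [5.0],
    "Fandango_Ratingvalue": [4.5],
    "Fandango_votes": [12640],
})
subset = new_df_by_name(demo, ["FILM", "Fandango_Stars", "Fandango_votes"])
print(subset)
```

Selecting by label also raises a `KeyError` when a requested column is missing, which makes layout changes visible immediately.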
# apply new_df() function on previous data frame, select the columns we mentioned above
new_previous = new_df(previous,[0,6,7,-2,-1],"new_previous")
# apply new_df() function on after data frame, select the columns we mentioned above
new_after = new_df(after,[0,1,6],"new_after")
# apply explore_df() on new_previous to confirm our result
explore_df(new_previous)
new_previous information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 5 columns):
FILM 146 non-null object
Fandango_Stars 146 non-null float64
Fandango_Ratingvalue 146 non-null float64
Fandango_votes 146 non-null int64
Fandango_Difference 146 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 5.8+ KB
None
| | Fandango_Stars | Fandango_Ratingvalue | Fandango_votes | Fandango_Difference |
---|---|---|---|---|
count | 146.000000 | 146.000000 | 146.000000 | 146.000000 |
mean | 4.089041 | 3.845205 | 3848.787671 | 0.243836 |
std | 0.540386 | 0.502831 | 6357.778617 | 0.152665 |
min | 3.000000 | 2.700000 | 35.000000 | 0.000000 |
25% | 3.500000 | 3.500000 | 222.250000 | 0.100000 |
50% | 4.000000 | 3.900000 | 1446.000000 | 0.200000 |
75% | 4.500000 | 4.200000 | 4439.500000 | 0.400000 |
max | 5.000000 | 4.800000 | 34846.000000 | 0.500000 |
# apply explore_df() on new_after to confirm our result
explore_df(new_after)
new_after information & describtion
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 3 columns):
movie 214 non-null object
year 214 non-null int64
fandango 214 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 5.1+ KB
None
| | year | fandango |
---|---|---|
count | 214.000000 | 214.000000 |
mean | 2016.107477 | 3.894860 |
std | 0.310444 | 0.516781 |
min | 2016.000000 | 2.500000 |
25% | 2016.000000 | 3.500000 |
50% | 2016.000000 | 4.000000 |
75% | 2016.000000 | 4.500000 |
max | 2017.000000 | 5.000000 |
Our goal is to determine whether there has been any change in Fandango's rating system after Hickey's analysis.
Our population of interest is all movie ratings on Fandango's website, regardless of release year.
Because we want to compare the rating system before and after the analysis, we have two data sets, one for each period.
Both data sets are samples, so we need to make sure they are representative of the population; otherwise the sampling error will be large and our conclusions may be wrong.
From the repository of the Fandango data set we can see that Hickey used the following sampling criteria:
The movie had at least 30 fan reviews on Fandango. The data from Fandango was pulled on Aug. 24, 2015.
The second data set contains movie ratings data for 214 of the most popular movies (with a significant number of votes) released in 2016 and 2017.
**Result of our exploration**
Tweaking our goal seems a much faster choice compared to collecting new data. Also, it's quasi-impossible to collect a new sample previous to Hickey's analysis at this moment in time.
**Our new goal**
**Working on our new goal**
We now have two populations of interest: all of Fandango's ratings for popular movies released in 2015, and all of Fandango's ratings for popular movies released in 2016.
The term "popular" is vague and we need to define it with precision before continuing. We'll use Hickey's benchmark of 30 fan ratings and consider a movie "popular" only if it has 30 fan ratings or more on Fandango's website.
**Let's code**
The check_sample() function takes two parameters:
* df: the data frame.
* num_sample: the number of rows to sample.
def check_sample(df, num_sample):
    display(df.sample(num_sample, random_state=1))
# apply check_sample() on new_previous
check_sample(new_previous,10)
| | FILM | Fandango_Stars | Fandango_Ratingvalue | Fandango_votes | Fandango_Difference |
---|---|---|---|---|---|
98 | Get Hard (2015) | 4.0 | 3.9 | 5933 | 0.1 |
66 | The Gift (2015) | 4.0 | 3.7 | 2680 | 0.3 |
53 | Hot Pursuit (2015) | 4.0 | 3.7 | 2618 | 0.3 |
75 | San Andreas (2015) | 4.5 | 4.3 | 9749 | 0.2 |
121 | The Stanford Prison Experiment (2015) | 4.0 | 3.9 | 51 | 0.1 |
74 | The Hobbit: The Battle of the Five Armies (2014) | 4.5 | 4.3 | 15337 | 0.2 |
119 | Phoenix (2015) | 3.5 | 3.4 | 70 | 0.1 |
128 | Mission: Impossible – Rogue Nation (2015) | 4.5 | 4.4 | 8357 | 0.1 |
44 | Kingsman: The Secret Service (2015) | 4.5 | 4.2 | 15205 | 0.3 |
58 | Seventh Son (2015) | 3.5 | 3.2 | 1213 | 0.3 |
# apply check_sample() on new_after
check_sample(new_after,5)
| | movie | year | fandango |
---|---|---|---|
108 | Mechanic: Resurrection | 2016 | 4.0 |
206 | Warcraft | 2016 | 4.0 |
106 | Max Steel | 2016 | 3.5 |
107 | Me Before You | 2016 | 4.5 |
51 | Fantastic Beasts and Where to Find Them | 2016 | 4.5 |
For new_previous, the author selected movies that had at least 30 fan reviews on Fandango, but let us confirm that no movie in the sample has fewer than 30 fan ratings.
# count the rows with fewer than 30 fan ratings (len() counts rows, whereas .size counts cells)
less_30_fan = len(new_previous[new_previous["Fandango_votes"] < 30])
less_30_fan
0
Next we isolate the movies of interest: 2015 movies in new_previous and 2016 movies in new_after.
# check release date on new_previous
previous_movies_year = new_previous["FILM"].str.split("(").str[-1]
print(previous_movies_year.unique())
['2015)' '2014)']
# check release date on new_after
after_movies_year = new_after["year"].unique()
after_movies_year
array([2016, 2017])
# add year column on new_previous data frame
# take the text after the last "(", drop the literal ")" and convert to int
new_previous["year"] = new_previous["FILM"].str.split("(").str[-1].str.replace(")", "", regex=False).astype("int")
# check year column
new_previous["year"].unique()
array([2015, 2014])
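Splitting on `(` works for these titles, but a regular expression states the intent more directly. The sketch below assumes every title ends with a four-digit year in parentheses; the three titles are taken from the rows shown earlier, and the variable names are illustrative.

```python
import pandas as pd

films = pd.Series(
    ["Avengers: Age of Ultron (2015)", "'71 (2015)", "Two Days, One Night (2014)"],
    name="FILM",
)
# Capture the four digits inside the final pair of parentheses.
years = films.str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
print(years.tolist())
```

A title that does not match the pattern would yield a missing value, so the conversion to `int` would fail loudly instead of producing a wrong year.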
Create an isolate_year() function that takes four parameters:
* df: the data frame.
* col_year: the name of the column that holds the release year.
* year: the year to keep, as an int.
* df_name: a string with the name of the new data frame.
def isolate_year(df, col_year, year, df_name):
    df_year = df.loc[df[col_year].astype("int") == year].copy()
    df_year.name = df_name
    return df_year
# apply isolate_year() function on new_previous
previous_2015 = isolate_year(new_previous,"year",2015,"previous_2015")
previous_2015["year"].unique()
array([2015])
# apply isolate_year() function on new_after
after_2016 = isolate_year(new_after,"year",2016,"after_2016")
after_2016["year"].unique()
array([2016])
explore_df(previous_2015)
previous_2015 information & describtion
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129 entries, 0 to 145
Data columns (total 6 columns):
FILM 129 non-null object
Fandango_Stars 129 non-null float64
Fandango_Ratingvalue 129 non-null float64
Fandango_votes 129 non-null int64
Fandango_Difference 129 non-null float64
year 129 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 7.1+ KB
None
| | Fandango_Stars | Fandango_Ratingvalue | Fandango_votes | Fandango_Difference | year |
---|---|---|---|---|---|
count | 129.000000 | 129.000000 | 129.000000 | 129.000000 | 129.0 |
mean | 4.085271 | 3.846512 | 3761.891473 | 0.238760 | 2015.0 |
std | 0.538096 | 0.505446 | 6543.601748 | 0.152741 | 0.0 |
min | 3.000000 | 2.700000 | 35.000000 | 0.000000 | 2015.0 |
25% | 3.500000 | 3.500000 | 210.000000 | 0.100000 | 2015.0 |
50% | 4.000000 | 3.900000 | 1415.000000 | 0.200000 | 2015.0 |
75% | 4.500000 | 4.200000 | 4045.000000 | 0.400000 | 2015.0 |
max | 5.000000 | 4.800000 | 34846.000000 | 0.500000 | 2015.0 |
explore_df(after_2016)
after_2016 information & describtion
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 213
Data columns (total 3 columns):
movie 191 non-null object
year 191 non-null int64
fandango 191 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.0+ KB
None
| | year | fandango |
---|---|---|
count | 191.0 | 191.000000 |
mean | 2016.0 | 3.887435 |
std | 0.0 | 0.509540 |
min | 2016.0 | 2.500000 |
25% | 2016.0 | 3.500000 |
50% | 2016.0 | 4.000000 |
75% | 2016.0 | 4.250000 |
max | 2016.0 | 5.000000 |
plt.style.use('fivethirtyeight')
previous_2015["Fandango_Stars"].plot.kde(label="2015",legend=True)
after_2016["fandango"].plot.kde(label="2016",legend=True)
plt.title("Fandango's ratings distribution for popular movies in\n(2015 Vs 2016)",fontsize=20,y=1.05)
plt.xlim(0,5)
plt.xticks(np.arange(0,5.1,0.5))
plt.show()
# create calc_per()
def calc_per(df, col_name):
    ratings_per = df[col_name].value_counts(normalize=True).sort_index() * 100
    return ratings_per
# apply calc_per() on previous_2015 , ratings column "Fandango_Stars"
ratings_per_2015 = calc_per(previous_2015,"Fandango_Stars")
ratings_per_2015
3.0 8.527132
3.5 17.829457
4.0 28.682171
4.5 37.984496
5.0 6.976744
Name: Fandango_Stars, dtype: float64
# apply calc_per() on after_2016 , ratings column "fandango"
ratings_per_2016 = calc_per(after_2016,"fandango")
ratings_per_2016
2.5 3.141361
3.0 7.329843
3.5 24.083770
4.0 40.314136
4.5 24.607330
5.0 0.523560
Name: fandango, dtype: float64
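The two percentage tables are easier to compare side by side. A minimal sketch, re-entering the percentages printed above by hand (rounded to two decimals) so that the block runs on its own; the names `pct_2015`, `pct_2016`, `comparison`, and `high_share` are illustrative.

```python
import pandas as pd

pct_2015 = pd.Series({3.0: 8.53, 3.5: 17.83, 4.0: 28.68, 4.5: 37.98, 5.0: 6.98})
pct_2016 = pd.Series({2.5: 3.14, 3.0: 7.33, 3.5: 24.08, 4.0: 40.31, 4.5: 24.61, 5.0: 0.52})

# Align on the star rating; a rating absent in one year counts as 0 %.
comparison = pd.DataFrame({"2015": pct_2015, "2016": pct_2016}).fillna(0).sort_index()
comparison["change"] = comparison["2016"] - comparison["2015"]
print(comparison)

# Share of movies rated 4.5 stars or higher in each year.
high_share = comparison.loc[4.5:, ["2015", "2016"]].sum()
print(high_share)
```

With these figures, roughly 45 % of the 2015 movies have 4.5 stars or more, against roughly 25 % of the 2016 movies.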
def calc_state(df, col_name):
    x_mean = round(df[col_name].mean(), 2)
    x_median = round(df[col_name].median(), 2)
    # mode() returns a Series, so take its first element with index 0
    x_mode = round(df[col_name].mode()[0], 2)
    return (x_mean, x_median, x_mode)
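`Series.mode()` returns a Series rather than a single number because a distribution can have several modes; the result is sorted, so indexing with `[0]` keeps the smallest of them. A small sketch with made-up ratings (not taken from the data sets) shows the behaviour:

```python
import pandas as pd

ratings = pd.Series([3.0, 4.0, 4.0, 4.5, 4.5])  # made-up values with two modes
modes = ratings.mode()
print(modes.tolist())

first_mode = modes[0]  # same indexing as in calc_state()
print(first_mode)
```

If ties matter for the analysis, it may be worth inspecting the whole result of `mode()` before reducing it to one value.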
# apply calc_state on previous_2015 on Fandango_Stars & assign it as a list on summary_2015 variable
summary_2015 = list(calc_state(previous_2015,"Fandango_Stars"))
summary_2015
[4.09, 4.0, 4.5]
# apply calc_state on after_2016 on fandango & assign it as a list to the summary_2016 variable
summary_2016 = list(calc_state(after_2016,"fandango"))
# create summary data frame
summary = pd.DataFrame(summary_2015,index=["mean","median","mode"],columns=["2015"])
summary
| | 2015 |
---|---|
mean | 4.09 |
median | 4.00 |
mode | 4.50 |
# add 2016 on summary data frame
summary["2016"] = summary_2016
summary
| | 2015 | 2016 |
---|---|---|
mean | 4.09 | 3.89 |
median | 4.00 | 4.00 |
mode | 4.50 | 4.00 |
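To put the shift in the mean into perspective, the relative change can be worked out from the rounded values in the summary table. A minimal sketch, with the two means copied from the table above and illustrative variable names:

```python
mean_2015 = 4.09
mean_2016 = 3.89

absolute_change = mean_2016 - mean_2015
relative_change = absolute_change / mean_2015 * 100  # in percent of the 2015 mean
print(round(absolute_change, 2), round(relative_change, 1))
```

That is a drop of about 0.2 stars, or close to 5 % relative to the 2015 mean, while the median stays at 4.0 and the mode moves from 4.5 to 4.0.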
plt.style.use("fivethirtyeight")
summary["2015"].plot.bar(color = '#0066FF', align = 'center', label = '2015', width = .25)
summary["2016"].plot.bar(color = '#CC0000', align = 'edge',label='2016',width=.25,rot=0,figsize=(8,5))
plt.title("Comparing summary statistics: 2015 vs 2016",y=1.07)
plt.ylim(0,5.5)
plt.yticks(np.arange(0,5.1,.5))
plt.ylabel("Stars")
plt.legend(framealpha = 0, loc = 'upper center')
plt.show()
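Drawing the two columns separately and offsetting them with `align` works, but `DataFrame.plot.bar()` can produce grouped bars directly. A hedged sketch that rebuilds the summary table from the values shown above and uses the non-interactive `Agg` backend so that it also runs outside a notebook; both choices are assumptions made to keep the example self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Summary statistics copied from the table above.
summary = pd.DataFrame(
    {"2015": [4.09, 4.00, 4.50], "2016": [3.89, 4.00, 4.00]},
    index=["mean", "median", "mode"],
)

# One call draws both years as grouped bars, one group per statistic.
ax = summary.plot.bar(color=["#0066FF", "#CC0000"], rot=0, figsize=(8, 5))
ax.set_title("Comparing summary statistics: 2015 vs 2016")
ax.set_ylim(0, 5.5)
ax.set_ylabel("Stars")
ax.legend(framealpha=0, loc="upper center")

n_bars = len(ax.patches)  # three statistics times two years
plt.close(ax.figure)
```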