Background
While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
Task
For this project, you'll clean and explore the dataset in a Jupyter notebook. To see a sample notebook containing all of the answers, visit the project's GitHub repository.
Info about Data Set
Get Data Set from here:
https://github.com/fivethirtyeight/data/tree/master/star-wars-survey
FiveThirtyEight article:
https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/
Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
Check Data
star_wars[star_wars.columns[3:15]].sample(5)
Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1067 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | 5.0 | 6.0 | 3.0 | 1.0 | 2.0 |
319 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
937 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | 5.0 | 6.0 | 1.0 | 2.0 | 3.0 |
337 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
527 | Star Wars: Episode I The Phantom Menace | NaN | NaN | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | 5.0 | 6.0 | 2.0 | 1.0 | 3.0 |
star_wars.head(5)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
# columns converted:
# - Have you seen the 6 star wars movies?
# - Do you consider yourself a fan of star wars?
yes_no ={'Yes':True, 'No':False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
#check
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
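A minimal sketch of how `Series.map` with a dictionary behaves, using a made-up `answers` Series rather than the survey data: values that are not keys of the dictionary, including NaN for skipped questions, come out as NaN. This is why the check above reports three groups instead of two.

```python
import numpy as np
import pandas as pd

# Toy answers (hypothetical data), including one skipped question
answers = pd.Series(["Yes", "No", np.nan, "Yes"])
yes_no = {"Yes": True, "No": False}

# Keys found in the dict are replaced; everything else becomes NaN
mapped = answers.map(yes_no)

n_yes = int((mapped == True).sum())
n_no = int((mapped == False).sum())
n_missing = int(mapped.isna().sum())
print(n_yes, n_no, n_missing)
```

Because of the remaining NaN values the mapped column is not a pure boolean column, so later filters have to compare against `True` or `False` explicitly.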
Create a dictionary to rename the checkbox columns of the question "Which of the following Star Wars films have you seen?"
col_rn_dict = dict(zip(star_wars.columns[3:9],
['seen_1','seen_2','seen_3','seen_4','seen_5','seen_6']))
star_wars = star_wars.rename(columns=col_rn_dict)
Convert "seen_1" column values
eps1_no = {"Star Wars: Episode I The Phantom Menace": True,
           np.nan: False}
star_wars['seen_1'] = star_wars['seen_1'].map(eps1_no)
Convert "seen_2" column values
eps2_no = {"Star Wars: Episode II Attack of the Clones": True,
           np.nan: False}
star_wars['seen_2'] = star_wars['seen_2'].map(eps2_no)
Convert "seen_3" column values
eps3_no = {"Star Wars: Episode III Revenge of the Sith": True,
           np.nan: False}
star_wars['seen_3'] = star_wars['seen_3'].map(eps3_no)
Convert "seen_4" column values
eps4_no = {"Star Wars: Episode IV A New Hope": True,
           np.nan: False}
star_wars['seen_4'] = star_wars['seen_4'].map(eps4_no)
Convert "seen_5" column values
eps5_no = {"Star Wars: Episode V The Empire Strikes Back": True,
           np.nan: False}
star_wars['seen_5'] = star_wars['seen_5'].map(eps5_no)
Convert "seen_6" column values
eps6_no = {"Star Wars: Episode VI Return of the Jedi": True,
           np.nan: False}
star_wars['seen_6'] = star_wars['seen_6'].map(eps6_no)
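Since each checkbox column holds either the film title or NaN, the six dictionaries could probably be replaced by one `notna()` call per column. A minimal sketch on a toy frame, assuming that no other values occur in these columns (the `toy` data is made up; only the column names follow the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the "seen" columns (hypothetical rows)
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "seen_2": [np.nan, np.nan,
               "Star Wars: Episode II Attack of the Clones"],
})

# A title means "seen", NaN means "not seen"
for col in ["seen_1", "seen_2"]:
    toy[col] = toy[col].notna()

print(toy)
```

The result is a true boolean column, so `sum()` counts the viewers and `mean()` gives the share of viewers directly.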
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
Rename the ranking columns to ranking_1 ... ranking_6
col_rn_dict = dict(zip(star_wars.columns[9:15],
['ranking_1','ranking_2','ranking_3','ranking_4','ranking_5','ranking_6']))
star_wars = star_wars.rename(columns=col_rn_dict)
star_wars.iloc[:,9:15].mean()
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64
As Kwu points out in his post:
https://community.dataquest.io/t/guided-project-star-wars-opinion-wars-never-neglect-to-learn-about-the-dataset/553526
The data is biased by the choice of default value. If a respondent had not seen a movie, its ranking was filled with a default number equal to the episode number.
For example, if Episode III had not been seen, the value 3 was filled in automatically.
The FiveThirtyEight team dealt with this by analysing only the subset of respondents who had seen all six movies.
https://fivethirtyeight.com/features/americas-favorite-star-wars-movies-and-least-favorite-characters/
I came up with a way to remove the default values (1-6) for unseen movies by setting them to np.nan.
# Correct default ratings by setting them to NaN,
# so that the ratings of unseen movies = NaN
for coln, sni in zip(star_wars.columns[9:15], star_wars.columns[3:9]):
    star_wars[coln + '_corr'] = star_wars[coln]
    ns_bool = star_wars[sni] == False  # rows where the episode was not seen
    star_wars.loc[ns_bool, coln + '_corr'] = np.nan
fig,axs = plt.subplots()
star_wars.iloc[:,9:15].mean().plot(marker='v')
star_wars.iloc[:,38:44].mean().plot(marker='o')
axs.grid(True)
axs.set_xticklabels(axs.get_xticklabels(),rotation=45)
axs.set_title('Compare average ratings')
axs.set_ylabel('Average Rating [-]')
axs.set_xlabel('Ranked Episode')
axs.legend(['Orig. Biased Ratings','NaN Corrected Ratings'])
The difference between the biased and the corrected average ratings grows with the episode number (1 -> 6).
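The correction can also be written with `Series.where`, which keeps a value where the condition holds and puts NaN elsewhere. The sketch below uses a made-up `toy` frame with one episode, not the survey data, to show how masking the default values shifts the mean:

```python
import pandas as pd

# Toy data: two respondents have not seen Episode III and carry
# the default ranking 3 (hypothetical values)
toy = pd.DataFrame({
    "seen_3": [True, False, True, False],
    "ranking_3": [2.0, 3.0, 6.0, 3.0],
})

# Keep the ranking only where the episode was seen, NaN otherwise
toy["ranking_3_corr"] = toy["ranking_3"].where(toy["seen_3"])

biased_mean = toy["ranking_3"].mean()          # includes the defaults
corrected_mean = toy["ranking_3_corr"].mean()  # NaN is skipped by mean()
print(biased_mean, corrected_mean)
```

`mean()` skips NaN by default, so the corrected average is based only on respondents who have actually seen the episode.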
def calc_seen_perc(col):
    # Percentage of True values in a boolean "seen" column
    temp = col.value_counts()
    return temp.loc[True]/(temp[True]+temp[False])*100
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(6, 6))
star_wars[star_wars.columns[3:9]].apply(calc_seen_perc).sort_values().plot.barh(ax=axs[0])
axs[0].set_title("Watched by subjects [%]")
axs[0].set_xlabel('Percentage - [%]')
axs[0].set_yticklabels(['3rd Episode','2nd Episode','4th Episode',
'1st Episode','6th Episode','5th Episode'])
axs[0].grid(True)
#star_wars.iloc[:,9:15].mean().iloc[[0 1 2 3 5 4]].plot.barh(ax=axs[1],color='r')
star_wars.iloc[:,9:15].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1],color='tab:blue' ,width=0.15 ,align='edge')
star_wars.iloc[:,38:44].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1],color='tab:orange',width=-0.15,align='edge')
axs[1].set_title("Avg Rating by subjects [1-6]")
axs[1].set_xlabel('Rating - [-]')
axs[1].set_yticklabels([])
axs[1].grid(True)
axs[1].legend(['biased','corrected'])
axs[1].set_ylim([-.5, 5.5])
It seems that the most-seen Star Wars movies (Episodes V+VI) are ranked among the best, whereas the less-seen movies (Episodes II+III) received the worst rankings.
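For a purely boolean column, the view percentage is simply the mean of the column times 100. The sketch below checks this on a made-up `seen` Series; it also covers the case where everybody has seen the film, in which a lookup of the `False` count in `value_counts()` would fail with a `KeyError`.

```python
import pandas as pd

# Toy "seen" column (hypothetical data)
seen = pd.Series([True, True, False, True])

# Explicit count-based percentage
n_seen = int(seen.sum())
n_total = int(seen.count())
pct_counts = n_seen / n_total * 100

# Same number via the mean of the boolean column
pct_mean = seen.mean() * 100

# Edge case: no False entries at all
all_seen = pd.Series([True, True])
pct_all = all_seen.mean() * 100

print(pct_counts, pct_mean, pct_all)
```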
def calc_seen_abs(col):
    # Absolute number of True values in a boolean "seen" column
    temp = col.value_counts()
    return temp.loc[True]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(6, 6))
star_wars[star_wars.columns[3:9]].apply(calc_seen_abs).sort_values().plot.barh(ax=axs[0])
axs[0].set_title("Watched by subjects [-]")
axs[0].set_xlabel('Absolute views - [-]')
axs[0].set_yticklabels(['3rd Episode','2nd Episode','4th Episode',
'1st Episode','6th Episode','5th Episode'])
axs[0].grid(True)
#star_wars.iloc[:,9:15].mean().iloc[[0 1 2 3 5 4]].plot.barh(ax=axs[1],color='r')
star_wars.iloc[:,9:15].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1],color='tab:blue' ,width=0.15 ,align='edge')
star_wars.iloc[:,38:44].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1],color='tab:orange',width=-0.15 ,align='edge')
axs[1].set_title("Avg Rating by subjects [1-6]")
axs[1].set_xlabel('Rating - [-]')
axs[1].set_yticklabels([])
axs[1].grid(True)
axs[1].set_ylim([-.5, 5.5])
axs[1].legend(['biased','corrected'])
The same trend can be seen in the absolute views. Movies that have been seen by most of the interviewed persons get the best ratings (Episodes V+VI), whereas movies with a low view count seem to get the worst average ratings (Episodes II+III).
fig,axs = plt.subplots()
star_wars[star_wars.columns[3:9]].apply(sum).sort_values().plot.barh(ax=axs)
axs.set_yticklabels(['3rd Episode','2nd Episode','4th Episode',
'1st Episode','6th Episode','5th Episode'])
axs.grid(True)
The order of the view counts seems to follow the release history: Episode V gets the most views, as the sequel to Episode IV, which created a lot of buzz with its initial release.
Episode I got a lot of media attention as the relaunch of the Skywalker saga and drew many viewers into the cinemas; this attention then seems to have dropped for the following Episodes II+III.
sw_fans = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==True]
sw_nofans = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==False]
#quick check amount of fans / no fans
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(6, 6),constrained_layout=True)
sw_fans[sw_fans.columns[3:9]].apply(calc_seen_abs).sort_values().plot.barh(ax=axs[0,0])
sw_nofans[sw_nofans.columns[3:9]].apply(calc_seen_abs).sort_values().plot.barh(ax=axs[1,0],color='tab:red')
axs[0,0].set_title("Watched by SW fans [-]")
axs[0,0].set_xlabel('Absolute views - [-]')
axs[0,0].set_yticklabels(['3rd Episode','2nd Episode','4th Episode',
'1st Episode','6th Episode','5th Episode'])
axs[1,0].set_title("Watched by NO SW fans [-]")
axs[1,0].set_xlabel('Absolute views - [-]')
axs[1,0].set_yticklabels(['3rd Episode','2nd Episode','4th Episode',
'1st Episode','6th Episode','5th Episode'])
sw_fans.iloc[:,9:15].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[0,1],width=0.2 ,align='edge')
sw_fans.iloc[:,38:44].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[0,1],width=-0.2 ,align='edge',color='tab:orange')
sw_nofans.iloc[:,9:15].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1,1],color='tab:red',width=0.2 ,align='edge')
sw_nofans.iloc[:,38:44].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1,1],color='maroon',width=-0.2 ,align='edge')
axs[0,1].set_title("Avg Rating by SW Fans [1-6]")
axs[0,1].set_xlabel('Rating - [-]')
axs[0,1].set_yticklabels([])
axs[0,1].set_ylim([-.5, 5.5])
axs[0,1].legend(['biased','corrected'])
axs[1,1].set_title("Avg Rating by NO SW Fans [1-6]")
axs[1,1].set_xlabel('Rating - [-]')
axs[1,1].set_yticklabels([])
axs[1,1].set_ylim([-.5, 5.5])
axs[1,1].legend(['biased','corrected'])
axs[0,0].grid(True)
axs[0,1].grid(True)
axs[1,0].grid(True)
axs[1,1].grid(True)
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(6, 6),constrained_layout=True)
sw_fans[sw_fans.columns[3:9]].apply(calc_seen_abs).iloc[[4,5,3,0,1,2]].plot.barh(ax=axs[0,0])
sw_nofans[sw_nofans.columns[3:9]].apply(calc_seen_abs).iloc[[4,0,5,1,3,2]].plot.barh(ax=axs[1,0],color='r')
axs[0,0].set_title("Watched by SW fans [-]")
axs[0,0].set_xlabel('Absolute views - [-]')
axs[0,0].set_yticklabels(['5th Episode','6th Episode','4th Episode',
'1st Episode','2nd Episode','3rd Episode'])
axs[1,0].set_title("Watched by NO SW fans [-]")
axs[1,0].set_xlabel('Absolute views - [-]')
axs[1,0].set_yticklabels(['5th Episode','1st Episode','6th Episode',
'2nd Episode','4th Episode','3rd Episode'])
#sw_nofans.iloc[:,9:15].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[1,1],color='r')
#sw_fans.iloc[:,9:15].mean().iloc[[2,1,3,0,5,4]].plot.barh(ax=axs[0,1])
sw_fans.iloc[:,9:15].mean().sort_values().plot.barh(ax=axs[0,1],color='tab:blue',width=0.2 ,align='edge')
sw_fans.iloc[:,38:44].mean().sort_values().plot.barh(ax=axs[0,1],color='tab:orange',width=-0.2 ,align='edge')
sw_nofans.iloc[:,9:15].mean().sort_values().plot.barh(ax=axs[1,1],color='tab:red',width=0.2 ,align='edge')
sw_nofans.iloc[:,38:44].mean().sort_values().plot.barh(ax=axs[1,1],color='maroon',width=-0.2 ,align='edge')
axs[0,1].set_title("Avg Rating by SW Fans [1-6]")
axs[0,1].set_xlabel('Rating - [-]')
axs[0,1].set_yticklabels([])
axs[0,1].set_ylim([-.5,5.5])
axs[1,1].set_title("Avg Rating by NO SW Fans [1-6]")
axs[1,1].set_xlabel('Rating - [-]')
axs[1,1].set_yticklabels([])
axs[1,1].set_ylim([-.5,5.5])
axs[0,0].grid(True)
axs[0,1].grid(True)
axs[1,0].grid(True)
axs[1,1].grid(True)
fig,axs = plt.subplots(figsize=(6,4),constrained_layout=True)
eps_name = ['1st Episode','2nd Episode','3rd Episode',
'4th Episode','5th Episode','6th Episode']
X = sw_fans[sw_fans.columns[3:9]].apply(calc_seen_perc)
#Y = sw_fans.iloc[:,9:15].mean()
Y = sw_fans.iloc[:,38:44].mean()
plt.scatter(X,Y)
for i, label in enumerate(eps_name):
    plt.annotate(label, (X.iloc[i], Y.iloc[i]))
X = sw_nofans[sw_nofans.columns[3:9]].apply(calc_seen_perc)
#Y = sw_nofans.iloc[:,9:15].mean()
Y = sw_nofans.iloc[:,38:44].mean()
plt.scatter(X,Y)
for i, label in enumerate(eps_name):
    plt.annotate(label, (X.iloc[i], Y.iloc[i]))
axs.legend(['Star Wars Fans', 'Not a SW fan'])
axs.grid(True)
axs.set_title('Star Wars Fans/No Fans Cross plot: view vs rating')
axs.set_ylabel('Ratings [1-6]')
axs.set_xlabel('View percentage [%]')
This cross plot separates the two groups, their respective view rates and their ratings, much more clearly:
1. The view percentages of non-fans are clearly below those of fans
2. Episode V gets the highest ranking and the highest view rate in both groups
3. SW fans rate the original trilogy (IV+V+VI) much better than the later-released prequels (I+II+III)
4. Non-SW fans rate Episode I second best
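The two coordinates of each point in the cross plot, view percentage and average corrected ranking, can also be computed per group in a single `groupby`. A minimal sketch for one episode on made-up data (the `group` column and its labels are hypothetical; the other column names follow the notebook):

```python
import numpy as np
import pandas as pd

# Toy data for Episode V (hypothetical respondents)
toy = pd.DataFrame({
    "group": ["fan", "fan", "non_fan", "non_fan"],
    "seen_5": [True, True, True, False],
    "ranking_5_corr": [1.0, 2.0, 3.0, np.nan],
})

# x coordinate: share of viewers, y coordinate: mean corrected ranking
summary = toy.groupby("group").agg(
    view_pct=("seen_5", "mean"),
    avg_rank=("ranking_5_corr", "mean"),
)
summary["view_pct"] = summary["view_pct"] * 100
print(summary)
```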
males = star_wars[star_wars['Gender']=='Male']
females = star_wars[star_wars['Gender']=='Female']
fig,axs = plt.subplots()
eps_name = ['1st Episode','2nd Episode','3rd Episode',
'4th Episode','5th Episode','6th Episode']
X = males[males.columns[3:9]].apply(calc_seen_perc)
#Y = males.iloc[:,9:15].mean()
Y = males.iloc[:,38:44].mean()
plt.scatter(X,Y)
for i, label in enumerate(eps_name):
    plt.annotate(label, (X.iloc[i], Y.iloc[i]))
X = females[females.columns[3:9]].apply(calc_seen_perc)
#Y = females.iloc[:,9:15].mean()
Y = females.iloc[:,38:44].mean()
plt.scatter(X,Y,color='tab:red')
for i, label in enumerate(eps_name):
    plt.annotate(label, (X.iloc[i], Y.iloc[i]))
axs.legend(['Males', 'Females'])
axs.grid(True)
axs.set_title('Star Wars Males/Females Cross plot: view vs rating')
axs.set_ylabel('Ratings [1-6]')
axs.set_xlabel('View percentage [%]')
Males rate the original trilogy noticeably better than the prequels (compare the Star Wars fans/non-fans cross plot).
Females rate Episode I much better, and their order of Episodes II and III is inverted compared with the male ratings.
The view rate differs between males and females by about 15 percentage points.
fig,axs = plt.subplots(nrows=1, ncols=2)
sw_fans['Gender'].value_counts().plot.bar(ax=axs[0])
sw_nofans['Gender'].value_counts().sort_values(ascending=True).plot.bar(ax=axs[1],color='tab:red')
axs[0].set_title('Star Wars Fans')
axs[1].set_title('No Star Wars Fans')
We observe more Star Wars fans among males, therefore the male ratings resemble those of the Star Wars fans.
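Raw counts per gender can be misleading when the groups differ in size, so it may help to look at the share of fans within each gender. A minimal sketch with `pd.crosstab` on made-up data (the `fan_status` column and its labels are hypothetical):

```python
import pandas as pd

# Toy respondents (hypothetical data)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "fan_status": ["fan", "fan", "not_fan", "fan", "not_fan", "not_fan"],
})

# normalize="index" turns each row into shares that sum to 1
share = pd.crosstab(toy["Gender"], toy["fan_status"], normalize="index")
print(share)
```

Each row of `share` sums to 1, so the `fan` column can be compared directly between genders.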
trekkie = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?']=='Yes']
no_trekkie = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?']=='No']
fig,axs = plt.subplots()
eps_name = ['1st Episode','2nd Episode','3rd Episode',
'4th Episode','5th Episode','6th Episode']
X = trekkie[trekkie.columns[3:9]].apply(calc_seen_perc)
#Y = trekkie.iloc[:,9:15].mean()
Y = trekkie.iloc[:,38:44].mean()
plt.scatter(X,Y)
for i, label in enumerate(eps_name):
    plt.annotate(label, (X.iloc[i], Y.iloc[i]))
X = no_trekkie[no_trekkie.columns[3:9]].apply(calc_seen_perc)
#Y = no_trekkie.iloc[:,9:15].mean()
Y = no_trekkie.iloc[:,38:44].mean()
plt.scatter(X,Y,color='m')
for i, label in enumerate(eps_name):
    plt.annotate(label, (X.iloc[i], Y.iloc[i]))
axs.legend(['Trekkie', 'Not a Trekkie'])
axs.grid(True)
axs.set_title('Star Wars Trekkie/no Trekkie Cross plot: view vs rating')
axs.set_ylabel('Ratings [1-6]')
axs.set_xlabel('View percentage [%]')
Trekkies and non-Trekkies have the same rating order (except for Episodes II+III).
The Trekkie ratings, however, show a gap between Episode VI and Episode I similar to the one seen for the Star Wars fans and the males.
The view percentage differs strongly: at ~20%, this is the largest gap observed so far.
fig,axs = plt.subplots()
#grouped = star_wars[star_wars.columns[9:15]].groupby(star_wars['Gender'])
#grouped = star_wars[star_wars.columns[38:44]].groupby(star_wars['Gender'])
#grouped.mean().plot().bar(ax=axs)
grouped = star_wars[star_wars.columns[3:9]].groupby(star_wars['Gender'])
grouped.agg(calc_seen_perc).plot.bar(ax=axs)
axs.grid(axis='y')
We can observe a general shift between the female and male view rates, while the order of the episodes by view rate seems to match. Most viewed episodes: V, VI, I, ...
g = sns.FacetGrid(star_wars, col='Gender',
hue='Do you consider yourself to be a fan of the Star Wars film franchise?')
g.map(sns.distplot,"ranking_1")#,kde=False)
g.add_legend(title='Star Wars Fan?')
Female and male ratings seem to depend strongly on whether the person is a Star Wars fan. Non-fans seem to rate Episode 1 evenly across the scale, whereas most fans rate the movie 4 or higher.
g = sns.FacetGrid(star_wars, col='Gender',
hue='Do you consider yourself to be a fan of the Star Wars film franchise?')
g.map(sns.distplot,"ranking_2")#,kde=False)
g.add_legend(title='Star Wars Fan?')
Both female and male SW fans seem to dislike Episode 2, rating it 5 (male > 50%, female > 25%).
Non-SW fans seem to rate the movie evenly, somewhere between 2 and 5.
g = sns.FacetGrid(star_wars, col='Gender',
hue='Do you consider yourself to be a fan of the Star Wars film franchise?')
g.map(sns.distplot,"ranking_3")#,kde=False)
g.add_legend(title='Star Wars Fan?')
SW fans (female & male) rate Episode 3 or higher.
Non fans ratings are more evenly distributed
g = sns.FacetGrid(star_wars, col='Gender',
hue='Do you consider yourself to be a fan of the Star Wars film franchise?')
g.map(sns.distplot,"ranking_4")#,kde=False)
g.add_legend(title='Star Wars Fan?')
The SW fans' ratings of Episode 4 show a clear peak just below 2,
whereas non-SW fans seem to rate Episode 4 at 3 or higher.
g = sns.FacetGrid(star_wars, col='Gender',
hue='Do you consider yourself to be a fan of the Star Wars film franchise?')
g.map(sns.distplot,"ranking_5")#,kde=False)
g.add_legend(title='Star Wars Fan?')
Episode 5 seems to receive the same rating distribution regardless of gender and of whether the person is a Star Wars fan.
g = sns.FacetGrid(star_wars, col='Gender',
hue='Do you consider yourself to be a fan of the Star Wars film franchise?')
g.map(sns.distplot,"ranking_6")#,kde=False)
g.add_legend(title='Star Wars Fan?')
For SW fans, most ratings of Episode 6 are below 3, independently of gender.
Non-fans fall into two camps, with either very good (1) or very poor (6) ratings, for both female and male subjects.
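Statements such as "most fans rate the movie 4 or higher" can be cross-checked numerically instead of being read off the density plots. A minimal sketch on made-up data, assuming a hypothetical `fan_status` column; it tabulates the share of each rank per group and the share of ranks of 4 or higher:

```python
import pandas as pd

# Toy rankings of Episode I (hypothetical data)
toy = pd.DataFrame({
    "fan_status": ["fan", "fan", "fan", "fan", "non_fan", "non_fan"],
    "ranking_1": [4.0, 5.0, 5.0, 6.0, 1.0, 6.0],
})

# Share of each rank within each group, one row per group
dist = (
    toy.groupby("fan_status")["ranking_1"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

# Share of respondents per group who gave rank 4 or higher
share_4_plus = (toy["ranking_1"] >= 4).groupby(toy["fan_status"]).mean()
print(dist)
print(share_4_plus)
```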
fig,axs = plt.subplots()#(nrows=2)
#grouped = star_wars[star_wars.columns[9:15]].groupby(star_wars['Education'])
#grouped.mean().plot.bar(ax=axs[0])
#grouped.min().plot.bar(ax=axs[1])
grouped = star_wars[star_wars.columns[3:9]].groupby(star_wars['Education'])
grouped.agg(calc_seen_perc).plot.bar(ax=axs)
axs.grid(axis='y')
The view percentage seems to increase with the level of education, from about 40% to about 70%.
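A trend over education levels is easier to read when the groups appear in their natural order rather than alphabetically. One possible way is an ordered categorical; the sketch below uses made-up rows, and the level list is an assumption (in particular 'Graduate degree' does not appear in the excerpts shown in this notebook):

```python
import pandas as pd

# Assumed order of the education levels, lowest to highest
levels = [
    "Less than high school degree",
    "High school degree",
    "Some college or Associate degree",
    "Bachelor degree",
    "Graduate degree",  # assumed label
]

# Toy respondents (hypothetical data)
toy = pd.DataFrame({
    "Education": ["Graduate degree", "High school degree",
                  "Bachelor degree", "High school degree"],
    "seen_4": [True, False, True, True],
})

# Ordered categorical: groupby results follow the level order
toy["Education"] = pd.Categorical(toy["Education"], categories=levels, ordered=True)
view_pct = toy.groupby("Education", observed=True)["seen_4"].mean() * 100
print(view_pct)
```

With `observed=True` only the levels present in the data are returned, in the order given by `levels`.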
fig,axs = plt.subplots()#(nrows=2)
grouped = star_wars[star_wars.columns[9:15]].groupby(star_wars['Education'])
grouped.mean().plot.bar(ax=axs)
#grouped.min().plot.bar(ax=axs[1])
#grouped = star_wars[star_wars.columns[3:9]].groupby(star_wars['Education'])
#grouped.agg(calc_seen_perc).plot.bar(ax=axs)
axs.grid(axis='y')
axs.legend(loc=4)
The relative ranking of the episodes is largely maintained across educational degrees, although some inversions can be seen.
Episode 5 gets an average ranking of 1 among "Less than high school degree" respondents, in contrast to 2-3 in the other educational groups.
fig,axs = plt.subplots()
grouped = star_wars[star_wars.columns[9:15]].groupby(star_wars['Location (Census Region)'])
#grouped.mean().plot.bar(ax=axs)
grouped = star_wars[star_wars.columns[3:9]].groupby(star_wars['Location (Census Region)'])
grouped.agg(calc_seen_perc).mean(axis=1).sort_values(ascending=False).plot.bar(ax=axs)
#axs.legend(loc=4)
axs.grid(axis='y')
axs.set_title('Average view rate of all Episodes')
axs.set_ylabel('Avg view percentage [%]')
fig,axs = plt.subplots()
grouped = star_wars[star_wars.columns[9:15]].groupby(star_wars['Location (Census Region)'])
#grouped.mean().plot.bar(ax=axs)
grouped = star_wars[star_wars.columns[3:9]].groupby(star_wars['Location (Census Region)'])
grouped.agg(calc_seen_perc).plot.bar(ax=axs)
axs.legend(loc=4)
axs.grid(axis='y')
axs.set_title('Average view rate of each Episode by region')
axs.set_ylabel('Avg view percentage [%]')
The highest view percentages of all episodes are achieved in
The East South Central region shows the highest view rates for the original trilogy (IV,V,VI).
The West South Central region shows the lowest view rates for the new trilogy (I,II,III).
fig,axs = plt.subplots()
grouped = star_wars[star_wars.columns[9:15]].groupby(star_wars['Which character shot first?'])
#grouped.mean().plot.bar(ax=axs)
grouped = star_wars[star_wars.columns[3:9]].groupby(star_wars['Which character shot first?'])
grouped.agg(calc_seen_perc).plot.bar(ax=axs)
axs.grid(axis='y')
axs.legend(loc=3)
def calc_who_shot_perc(col):
    # Share of "I don't understand this question" among all given answers, in percent
    temp = col.value_counts()
    dont_understand = temp.loc["I don't understand this question"]
    return dont_understand/(dont_understand+temp['Han']+temp['Greedo'])*100
fig,axs = plt.subplots()
grouped = star_wars['Which character shot first?'].groupby(star_wars['Age'])
grouped.agg(calc_who_shot_perc).plot.bar(ax=axs)
axs.grid(axis='y')
axs.legend(loc=3)
axs.set_title("Percentage: I don't understand this question" )
Subjects that answered "I don't understand this question" show the lowest view percentage for all episodes, especially for Episode IV.
More than 50% of the respondents in the oldest age group answered "I don't understand this question". The percentage seems to increase with the age of the respondent.
This is surprising, as older respondents can be expected to have seen the original cut of Episode IV in 1977.
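The share of "I don't understand this question" per age group can also be computed without looking up every answer option by name. A minimal sketch on made-up data (the `shot_first` column name and the "> 60" label are assumptions); unanswered rows are dropped first so that the denominator matches the `value_counts` approach, which ignores NaN:

```python
import numpy as np
import pandas as pd

option = "I don't understand this question"

# Toy answers (hypothetical data), including skipped questions
toy = pd.DataFrame({
    "Age": ["18-29", "18-29", "18-29", "> 60", "> 60", "> 60", "> 60"],
    "shot_first": ["Han", "Greedo", np.nan, option, option, "Han", np.nan],
})

# Only respondents who answered the question count towards the total
answered = toy.dropna(subset=["shot_first"])

# Mean of a boolean flag per age group = share of that answer
pct = answered["shot_first"].eq(option).groupby(answered["Age"]).mean() * 100
print(pct)
```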
star_wars.columns[15:29]
Index(['Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28'], dtype='object')
# create a renaming dictionary based on the info published by FiveThirtyEight
character_rn_dict = dict(zip(star_wars.columns[15:29],
['Han Solo','Luke Skywalker','Princess Leia Organa','Anakin Skywalker','Obi Wan Kenobi','Emperor Palpatine','Darth Vader','Lando Calrissian','Boba Fett','C-3P0','R2 D2','Jar Jar Binks','Padme Amidala','Yoda']))
# ['Luke','Han','Leia','Obi Wan','Yoda','R2-D2','C-3PO','Anakin','Darth Vader','Lando','Padme Amidala','Boba Fett','Palpatine','Jar Jar Binks']))
star_wars = star_wars.rename(columns=character_rn_dict)
star_wars['Han Solo'].value_counts()#['Very favorably']+
Very favorably                                 610
Somewhat favorably                             151
Neither favorably nor unfavorably (neutral)     44
Unfamiliar (N/A)                                15
Somewhat unfavorably                             8
Very unfavorably                                 1
Name: Han Solo, dtype: int64
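def get_fav_perc(col):
    # Percentage of favorable votes (very + somewhat), rounded to two decimals
    temp = col.value_counts()
    return round((temp['Very favorably']+temp['Somewhat favorably'])/temp.sum()*10000)/100
def get_unfav_perc(col):
    # Percentage of unfavorable votes (very + somewhat), rounded to two decimals
    temp = col.value_counts()
    return round((temp['Very unfavorably']+temp['Somewhat unfavorably'])/temp.sum()*10000)/100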
fav = star_wars[star_wars.columns[15:29]].agg(get_fav_perc)
unfav = star_wars[star_wars.columns[15:29]].agg(get_unfav_perc)
fig,axs = plt.subplots()
fav.sort_values(ascending=False).plot.bar(ax=axs,color='tab:blue',width=0.4 ,align='edge')
axs.grid(axis='y')
axs.set_title('Most liked Characters')
axs.set_ylabel('Percentage [%]')
fig,axs = plt.subplots()
unfav.sort_values().plot.bar(ax=axs,color='tab:red',width=-0.4 ,align='edge')
axs.grid(axis='y')
axs.set_title('Most disliked Characters')
axs.set_ylabel('Percentage [%]')
fig,axs = plt.subplots()
fav.plot.bar(ax=axs,color='tab:blue',width=0.4 ,align='edge')
unfav.plot.bar(ax=axs,color='tab:red',width=-0.4 ,align='edge')
axs.grid(axis='y')
axs.set_title('Most controversial Characters')
axs.set_ylabel('Percentage [%]')
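The favorable and unfavorable percentages used in these plots rely on every answer option being present for every character; a character that nobody rated 'Very unfavorably' would raise a `KeyError` in `get_fav_perc` or `get_unfav_perc`. A minimal sketch of a more forgiving variant on made-up votes (the helper `share_pct` is hypothetical, not part of the notebook):

```python
import numpy as np
import pandas as pd

def share_pct(col, labels):
    """Percentage of non-missing votes that fall into `labels`."""
    shares = col.value_counts(normalize=True)  # NaN votes are ignored
    # reindex fills options that never occur with 0 instead of failing
    return round(shares.reindex(labels, fill_value=0).sum() * 100, 2)

# Toy votes for one character (hypothetical data)
votes = pd.Series(["Very favorably", "Somewhat favorably",
                   "Very favorably", "Unfamiliar (N/A)", np.nan])

fav = share_pct(votes, ["Very favorably", "Somewhat favorably"])
unfav = share_pct(votes, ["Very unfavorably", "Somewhat unfavorably"])
print(fav, unfav)
```

As in the notebook, the denominator is the number of non-missing votes, including neutral and unfamiliar answers.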