Star Wars Opinion Wars¶

Coincidentally, May the Fourth [Be With You] just passed!

Which Star Wars movie is the most beloved among fans? FiveThirtyEight conducted a survey to provide some insight.

I would guess that it's one of the original 3, but is it The Empire Strikes Back per FiveThirtyEight's prediction? Do only some people get to weigh in seriously on this? What's the epic story?

Peek at Data Set¶

There are 38 survey columns and 1186 respondents. At a glance, the survey questions focus on how dedicated a Star Wars fan the respondent is, which movies they have watched, their preferences and favorites, as well as some general demographic information.

The data file, originally compiled by FiveThirtyEight, was provided in the guided project and can be downloaded from GitHub.

In [1]:

import pandas as pd
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

In [2]:

## Get # respondents and # of questions
star_wars.shape

Out[2]:

(1186, 38)

In [3]:

## Get a feel for the data by reviewing the first 10 rows
star_wars[:10]

Out[3]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3292879998	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1.0	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3292719380	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	1.0	...	Very favorably	Han	Yes	No	Yes	Male	18-29	$25,000 - $49,999	Bachelor degree	Middle Atlantic
6	3292684787	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	6.0	...	Very favorably	Han	Yes	No	No	Male	18-29	NaN	High school degree	East North Central
7	3292663732	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	4.0	...	Very favorably	Han	No	NaN	Yes	Male	18-29	NaN	High school degree	South Atlantic
8	3292654043	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Somewhat favorably	Han	No	NaN	No	Male	18-29	$0 - $24,999	Some college or Associate degree	South Atlantic
9	3292640424	Yes	No	NaN	Star Wars: Episode II Attack of the Clones	NaN	NaN	NaN	NaN	1.0	...	Very favorably	I don't understand this question	No	NaN	No	Male	18-29	$25,000 - $49,999	Some college or Associate degree	Pacific

10 rows × 38 columns

In [4]:

# List the survey questions
for col in star_wars.columns :
    print(col)

RespondentID
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
Which of the following Star Wars films have you seen? Please select all that apply.
Unnamed: 4
Unnamed: 5
Unnamed: 6
Unnamed: 7
Unnamed: 8
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
Unnamed: 10
Unnamed: 11
Unnamed: 12
Unnamed: 13
Unnamed: 14
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.
Unnamed: 16
Unnamed: 17
Unnamed: 18
Unnamed: 19
Unnamed: 20
Unnamed: 21
Unnamed: 22
Unnamed: 23
Unnamed: 24
Unnamed: 25
Unnamed: 26
Unnamed: 27
Unnamed: 28
Which character shot first?
Are you familiar with the Expanded Universe?
Do you consider yourself to be a fan of the Expanded Universe?
Do you consider yourself to be a fan of the Star Trek franchise?
Gender
Age
Household Income
Education
Location (Census Region)

'Boolean-ating' Some Columns¶

Columns at indexes 1 and 2 store Yes/No responses which I am converting to True/False.

In [5]:

# store long column header for easier reference
seen_any_col = star_wars.columns[1]
seen_any_col

Out[5]:

'Have you seen any of the 6 films in the Star Wars franchise?'

In [6]:

# store long column header for easier reference
fan_col = star_wars.columns[2]
fan_col

Out[6]:

'Do you consider yourself to be a fan of the Star Wars film franchise?'

In [7]:

#define boolean map to apply to seen_any_col and fan_col
yes_no = {
    "Yes" : True,
    "No" : False
}

Q1: Every Respondent indicated whether they had seen any of the 6 Star Wars Films

In [8]:

## print responses before transforming them to boolean values
star_wars[seen_any_col].value_counts(dropna=False)

Out[8]:

Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [9]:

## apply transformation map
star_wars[seen_any_col] = star_wars[seen_any_col].map(yes_no)

## validate result by comparing with values before transformation
print(star_wars[seen_any_col].value_counts(dropna=False))

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

Q2: About 30% of Respondents did NOT indicate whether they consider themselves a Fan of the Franchise

In [10]:

## print responses before transforming them to boolean values
star_wars[fan_col].value_counts(dropna=False)

Out[10]:

Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

In [11]:

## apply transformation map - ignore but keep null values
star_wars[fan_col] = star_wars[fan_col].map(yes_no, na_action="ignore")

## validate result by comparing with values before transformation
print(star_wars[fan_col].value_counts())

## show proportion of null values:
print("Null:", round((star_wars[fan_col].isnull().sum()/len(star_wars[fan_col])*100), 2), "%")

True     552
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
Null: 29.51 %
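
The Yes/No conversion above can be sketched on toy data. This is only an illustration: the `answers` values, including the stray "Maybe", are made up and do not come from the survey file. It shows why comparing the value counts before and after mapping is a useful check: any value missing from the dictionary silently becomes null.

```python
import numpy as np
import pandas as pd

# Toy answers (hypothetical, not taken from the survey file)
answers = pd.Series(["Yes", "No", np.nan, "Yes", "Maybe"])

yes_no = {"Yes": True, "No": False}

# Values missing from the dictionary (here "Maybe") come back as NaN,
# and values that were already NaN stay NaN.
converted = answers.map(yes_no)

print(converted.tolist())
print("Null:", round(converted.isnull().mean() * 100, 2), "%")
```

If the null share after mapping is larger than it was before, some response was spelled differently from the keys in the map.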

'Boolean-ating' Values not Stored as Yes/No¶

Questions 3-8 (6 columns) store which Star Wars Episodes the respondent has seen. Each column represents an Episode: if the value is the name of the Episode, the respondent has seen it. Otherwise the value is null to indicate the respondent has NOT seen the Episode.

Here's what that looks like:

In [12]:

star_wars.iloc[:,3:9].head(3)

Out[12]:

	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
0	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
1	NaN	NaN	NaN	NaN	NaN	NaN
2	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN

Q3-8: These columns will have changes applied so that:

1. Headers are meaningful

2. Values are True/False

Generating Transformation Maps without Copy + Paste¶

I don't want to repetitively copy paste anything EVER if I don't have to.

Seen This Episode? Column Header Maps

I am going to generate the map.

In [13]:

## isolate columns 
seen_cols = star_wars.columns[3:9]

In [14]:

## iterate through the column names at indexes 3-8 to create a mapping dictionary
map_episode_header = {}
for i, header in enumerate(seen_cols, start=1) :
    map_episode_header[header] = "seen_" + str(i)

In [15]:

## verify the mapping dictionary looks correct
map_episode_header

Out[15]:

{'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
 'Unnamed: 4': 'seen_2',
 'Unnamed: 5': 'seen_3',
 'Unnamed: 6': 'seen_4',
 'Unnamed: 7': 'seen_5',
 'Unnamed: 8': 'seen_6'}

In [16]:

## apply the map to rename the columns
star_wars=star_wars.rename(columns=map_episode_header)

## store the new columns headers and validate result
seen_cols = star_wars.columns[3:9]
seen_cols

Out[16]:

Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
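
The same header map could probably be written as a one-line dictionary comprehension. A minimal sketch, assuming a toy frame whose column names are placeholders rather than the real survey headers:

```python
import pandas as pd

# Toy frame with hypothetical placeholder headers
df = pd.DataFrame(columns=["id", "Long question text?", "Unnamed: 2", "Unnamed: 3"])

old_headers = df.columns[1:4]

# enumerate(..., start=1) numbers the headers 1, 2, 3
header_map = {old: f"seen_{i}" for i, old in enumerate(old_headers, start=1)}

df = df.rename(columns=header_map)
print(df.columns.tolist())
```

Columns that are not keys in the map (here `id`) are left untouched by `rename`.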

Seen This Episode? Column Value Map

I now generate the map that will transform the column values to boolean.

This seems a little trickier. I bet there is a quicker, cleaner method but I have broken out the steps.

First I need to get the values for True in each column (= the 6 Episode Names). I am assuming it is the most popular non-null value in each column.

I am putting all the episodes titles into a single mapping dictionary.
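
One candidate for a quicker method is `notna()`, which treats every non-null entry as True. This is a sketch on made-up data, not the approach used in this notebook, and it assumes that any non-null value really means "seen". It also does not collect the episode titles, which I reuse later.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the "seen" columns (hypothetical values)
seen = pd.DataFrame({
    "seen_1": ["Episode I", np.nan, "Episode I"],
    "seen_2": ["Episode II", np.nan, np.nan],
})

# Any non-null entry means the episode was ticked
seen_bool = seen.notna()

# Summing booleans counts the True values per column
print(seen_bool.sum().tolist())
```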

In [17]:

## Value counts to use for validating transformation
for episode in star_wars[seen_cols] :
    print(star_wars[episode].value_counts(dropna=False))    

Star Wars: Episode I  The Phantom Menace    673
NaN                                         513
Name: seen_1, dtype: int64
NaN                                            615
Star Wars: Episode II  Attack of the Clones    571
Name: seen_2, dtype: int64
NaN                                            636
Star Wars: Episode III  Revenge of the Sith    550
Name: seen_3, dtype: int64
Star Wars: Episode IV  A New Hope    607
NaN                                  579
Name: seen_4, dtype: int64
Star Wars: Episode V The Empire Strikes Back    758
NaN                                             428
Name: seen_5, dtype: int64
Star Wars: Episode VI Return of the Jedi    738
NaN                                         448
Name: seen_6, dtype: int64

In [18]:

## first entry in the mapping dictionary is False for null
import numpy as np
bool_episode_map = {np.nan : False}
episode_titles = []

## next add each title to the dictionary mapping to True
for episode in seen_cols :
    title = star_wars[episode].value_counts().idxmax()
    episode_titles.append(title)
    bool_episode_map[title] = True

# Validate mapping dictionary
bool_episode_map

Out[18]:

{nan: False,
 'Star Wars: Episode I  The Phantom Menace': True,
 'Star Wars: Episode II  Attack of the Clones': True,
 'Star Wars: Episode III  Revenge of the Sith': True,
 'Star Wars: Episode IV  A New Hope': True,
 'Star Wars: Episode V The Empire Strikes Back': True,
 'Star Wars: Episode VI Return of the Jedi': True}

In [19]:

## apply map and validate result
for episode in seen_cols :
    star_wars[episode] = star_wars[episode].map(bool_episode_map)
    print(star_wars[episode].value_counts())

True     673
False    513
Name: seen_1, dtype: int64
False    615
True     571
Name: seen_2, dtype: int64
False    636
True     550
Name: seen_3, dtype: int64
True     607
False    579
Name: seen_4, dtype: int64
True     758
False    428
Name: seen_5, dtype: int64
True     738
False    448
Name: seen_6, dtype: int64

Minor Clean up of Episode Rankings¶

There are 6 columns that store the rank of each Episode relative to the others. Rank of 1 = favourite, Rank of 6 = least favourite. Otherwise the value is null, presumably because the respondent has NOT seen the episode.

Here's what that looks like:

In [20]:

star_wars.iloc[:3,9:15]

Out[20]:

	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0

Q9-14: These columns will have changes applied so that:

1. Headers are meaningful

2. Values are Float Type

The code below renames the headers using the same method as above.

In [21]:

## store the old headers
ranking_cols = star_wars.columns[9:15]
ranking_cols

Out[21]:

Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],
      dtype='object')

In [22]:

## create the map
map_ranking_header = {}
for (i, rank) in zip(range(6), ranking_cols):
    map_ranking_header[rank] = "ranking_" + str(i+1)
    
## Validate the mapping dictionary looks correct
map_ranking_header

Out[22]:

{'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1',
 'Unnamed: 10': 'ranking_2',
 'Unnamed: 11': 'ranking_3',
 'Unnamed: 12': 'ranking_4',
 'Unnamed: 13': 'ranking_5',
 'Unnamed: 14': 'ranking_6'}

In [23]:

## apply the map to rename the columns
star_wars = star_wars.rename(columns=map_ranking_header)

## store the new columns headers and validate result
ranking_cols = star_wars.columns[9:15]
ranking_cols

Out[23]:

Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')

In [24]:

## convert values to float type and validate 
star_wars[ranking_cols] = star_wars[ranking_cols].astype(float)
type(star_wars[ranking_cols[0]][0])

Out[24]:

numpy.float64

Plot Mean Ranking¶

The mean ranking of each movie will indicate how popular it is: the lower the mean, the more popular the movie.

In [25]:

# store the mean ranking value for each episode
episode_rankings = round(star_wars[ranking_cols].mean(),2)
episode_rankings

Out[25]:

ranking_1    3.73
ranking_2    4.09
ranking_3    4.34
ranking_4    3.27
ranking_5    2.51
ranking_6    3.05
dtype: float64
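
One detail worth keeping in mind for these means: `mean()` skips null values by default, so each column is averaged only over the respondents who entered a rank. A small sketch on made-up rankings:

```python
import numpy as np
import pandas as pd

# Toy rankings (hypothetical values)
rankings = pd.DataFrame({
    "ranking_1": [1.0, 3.0, np.nan, 2.0],
    "ranking_2": [2.0, 1.0, np.nan, np.nan],
})

means = rankings.mean()      # skipna=True by default
answered = rankings.count()  # non-null answers per column

print(means.tolist(), answered.tolist())
```

Printing `count()` next to the mean shows how many answers each mean is based on.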

In [26]:

episodes = []
for i in range(6) :
    episodes.append("Episode " + str(i+1))
xlabels = episodes
episodes

Out[26]:

['Episode 1', 'Episode 2', 'Episode 3', 'Episode 4', 'Episode 5', 'Episode 6']

In [27]:

# plot the mean ranking values
import matplotlib.pyplot as plt
%matplotlib inline

episode_rankings.plot.bar()
plt.title("Mean Ranking per Episode")
plt.xlabel("Lower Value = Higher Ranking")
plt.ylabel("Mean Ranking")
plt.xticks(ticks=range(6), labels=xlabels, rotation = 30)
plt.show()

In [28]:

# print full title of Episode 5
print("The Best Ranked Episode is: ", episode_titles[4])

The Best Ranked Episode is:  Star Wars: Episode V The Empire Strikes Back

Episode Ranking Observations¶

Respondents ranked Episode 5 the highest, which is indeed The Empire Strikes Back.

This could be because more respondents saw this movie than any other movie in the series:

In [29]:

# count how many respondents saw each movie
episode_views = star_wars[seen_cols].sum()
print(episode_views.tolist())

[673, 571, 550, 607, 758, 738]

In [30]:

## plot the bar graph
episode_views.plot.bar()
plt.title("Views per Episode")
plt.ylabel("Views")
plt.xticks(ticks=range(6), labels=xlabels, rotation = 30)
plt.show()

Collecting 'per Episode' Info¶

I want to consolidate all we've done already into a dataframe. Then I can use these results to compare with subgroup results.

episode_stats is a dataframe compiling the information we've generated for each Episode so far and additional per-episode information will be added as I go along.

In [31]:

episode_stats = pd.DataFrame()
episode_stats["episode_num"] = pd.Series(xlabels)
episode_stats["episode_title"] = pd.Series(episode_titles)
episode_stats["seen_count"] = pd.Series(episode_views.tolist())
episode_stats["episode_ranking"] = pd.Series(episode_rankings.tolist())
episode_stats

Out[31]:

	episode_num	episode_title	seen_count	episode_ranking
0	Episode 1	Star Wars: Episode I The Phantom Menace	673	3.73
1	Episode 2	Star Wars: Episode II Attack of the Clones	571	4.09
2	Episode 3	Star Wars: Episode III Revenge of the Sith	550	4.34
3	Episode 4	Star Wars: Episode IV A New Hope	607	3.27
4	Episode 5	Star Wars: Episode V The Empire Strikes Back	758	2.51
5	Episode 6	Star Wars: Episode VI Return of the Jedi	738	3.05
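
As a quick cross-check of the table, the best-ranked and most-seen episodes can be looked up with `idxmin()` and `idxmax()`. A minimal sketch that rebuilds part of the table from a dictionary, with the numbers copied from the outputs above; the name `episode_stats_sketch` is only for illustration:

```python
import pandas as pd

# Values copied from the outputs above
episode_stats_sketch = pd.DataFrame({
    "episode_num": [f"Episode {i}" for i in range(1, 7)],
    "seen_count": [673, 571, 550, 607, 758, 738],
    "episode_ranking": [3.73, 4.09, 4.34, 3.27, 2.51, 3.05],
})

# Lowest mean ranking = best liked; highest count = most viewed
best_row = episode_stats_sketch["episode_ranking"].idxmin()
most_row = episode_stats_sketch["seen_count"].idxmax()

best_ranked = episode_stats_sketch.loc[best_row, "episode_num"]
most_seen = episode_stats_sketch.loc[most_row, "episode_num"]
print(best_ranked, most_seen)
```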

Star Wars Gender Wars¶

I will check whether male and female respondents prefer different movies.

In [32]:

## divide data set
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

In [33]:

## How many males have seen each episode:

male_seen = males[seen_cols].sum()
print(male_seen)

seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64

In [34]:

## How many females have seen each episode:

female_seen = females[seen_cols].sum()
print(female_seen)

seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64

In [35]:

## calculate the mean male rankings

male_rankings = round(males[ranking_cols].mean(),2)
print(male_rankings)   

ranking_1    4.04
ranking_2    4.22
ranking_3    4.27
ranking_4    3.00
ranking_5    2.46
ranking_6    3.00
dtype: float64

In [36]:

## calculate the mean female rankings

female_rankings = round(females[ranking_cols].mean(),2)
print(female_rankings)   

ranking_1    3.43
ranking_2    3.95
ranking_3    4.42
ranking_4    3.54
ranking_5    2.57
ranking_6    3.08
dtype: float64
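
The male/female split could also be done in one step with `groupby`. This is a sketch on made-up rows, not the method used above. Note that `groupby` drops rows with a missing Gender by default, which matches filtering on "Male" and "Female" explicitly.

```python
import numpy as np
import pandas as pd

# Toy respondents (hypothetical values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "seen_1": [True, True, False, True, True],
    "ranking_1": [2.0, 1.0, np.nan, 3.0, 4.0],
})

# One row per gender: views are summed, rankings are averaged
by_gender = toy.groupby("Gender").agg(
    seen_1=("seen_1", "sum"),
    ranking_1=("ranking_1", "mean"),
)
print(by_gender)
```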

Consolidating this into our dataframe of interesting per-Episode information:

In [37]:

episode_stats["male_count"] = pd.Series(male_seen.tolist())
episode_stats["female_count"] = pd.Series(female_seen.tolist())
episode_stats["male_ranking"] = pd.Series(male_rankings.tolist())
episode_stats["female_ranking"] = pd.Series(female_rankings.tolist())
print(episode_stats)

  episode_num                                 episode_title  seen_count  \
0   Episode 1      Star Wars: Episode I  The Phantom Menace         673   
1   Episode 2   Star Wars: Episode II  Attack of the Clones         571   
2   Episode 3   Star Wars: Episode III  Revenge of the Sith         550   
3   Episode 4             Star Wars: Episode IV  A New Hope         607   
4   Episode 5  Star Wars: Episode V The Empire Strikes Back         758   
5   Episode 6      Star Wars: Episode VI Return of the Jedi         738   

   episode_ranking  male_count  female_count  male_ranking  female_ranking  
0             3.73         361           298          4.04            3.43  
1             4.09         323           237          4.22            3.95  
2             4.34         317           222          4.27            4.42  
3             3.27         342           255          3.00            3.54  
4             2.51         392           353          2.46            2.57  
5             3.05         387           338          3.00            3.08

Fan Favourites¶

I would like to know if there are different answers depending on whether the respondent indicated they were a Star Wars fan or a Star Trek fan. But first, how much information do we have on this?

In [38]:

## identify the columns for Star Wars and Star Trek fans
#(print(star_wars.columns))
fan_col = str(star_wars.columns[2])
print(fan_col)
trek_col = star_wars.columns[32]
print(trek_col)

Do you consider yourself to be a fan of the Star Wars film franchise?
Do you consider yourself to be a fan of the Star Trek franchise?

In [39]:

print(star_wars[fan_col].value_counts(dropna=False))
print()
print(star_wars[trek_col].value_counts(dropna=False))

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

No     641
Yes    427
NaN    118
Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64

Fans vs. not Fans¶

Do non-fans even watch many Star Wars movies?

Below I find out and put the answer directly into my master dataframe.

In [40]:

notfans = star_wars[star_wars[fan_col] == False]
episode_stats["notfan_count"] = pd.Series(notfans[seen_cols].sum().tolist())
episode_stats["notfan_count"]

Out[40]:

0    173
1    108
2    100
3    124
4    220
5    201
Name: notfan_count, dtype: int64

In [41]:

fans = star_wars[star_wars[fan_col] == True]
episode_stats["fan_count"] = pd.Series(fans[seen_cols].sum().tolist())
episode_stats["fan_count"]

Out[41]:

0    500
1    463
2    450
3    483
4    538
5    537
Name: fan_count, dtype: int64

Non-fans have not seen many of the episodes.

Episode 5 was seen the most often (220 respondents) but Episode 3 was seen less than half as often (100 respondents).

Among fans, Episode 5 was also the most seen (538 respondents). Episode 3 was again seen by the fewest respondents who identified as fans, but its viewer count was still proportionally high (450 respondents), only about 16% lower.
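
The proportions quoted above come from comparing the least-seen count with the most-seen count in each group. A quick check of the arithmetic, using the counts from the two outputs above:

```python
# View counts copied from the fan and non-fan outputs above
nonfan_most, nonfan_least = 220, 100
fan_most, fan_least = 538, 450

# Relative drop from the most-seen to the least-seen episode, in percent
nonfan_drop = (nonfan_most - nonfan_least) / nonfan_most * 100
fan_drop = (fan_most - fan_least) / fan_most * 100

print(round(nonfan_drop, 1), round(fan_drop, 1))
```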

In [42]:

## calculate the mean fan ranking

episode_stats["fan_ranking"] = pd.Series(round((fans[ranking_cols].mean()),2).tolist())
episode_stats["fan_ranking"]

Out[42]:

0    4.14
1    4.34
2    4.42
3    2.93
4    2.33
5    2.83
Name: fan_ranking, dtype: float64

In [43]:

## calculate the mean non-fan ranking

episode_stats["notfan_ranking"] = pd.Series(round((notfans[ranking_cols].mean()),2).tolist())
episode_stats["notfan_ranking"]

Out[43]:

0    2.94
1    3.59
2    4.19
3    3.93
4    2.86
5    3.47
Name: notfan_ranking, dtype: float64

In [44]:

episode_stats

Out[44]:

	episode_num	episode_title	seen_count	episode_ranking	male_count	female_count	male_ranking	female_ranking	notfan_count	fan_count	fan_ranking	notfan_ranking
0	Episode 1	Star Wars: Episode I The Phantom Menace	673	3.73	361	298	4.04	3.43	173	500	4.14	2.94
1	Episode 2	Star Wars: Episode II Attack of the Clones	571	4.09	323	237	4.22	3.95	108	463	4.34	3.59
2	Episode 3	Star Wars: Episode III Revenge of the Sith	550	4.34	317	222	4.27	4.42	100	450	4.42	4.19
3	Episode 4	Star Wars: Episode IV A New Hope	607	3.27	342	255	3.00	3.54	124	483	2.93	3.93
4	Episode 5	Star Wars: Episode V The Empire Strikes Back	758	2.51	392	353	2.46	2.57	220	538	2.33	2.86
5	Episode 6	Star Wars: Episode VI Return of the Jedi	738	3.05	387	338	3.00	3.08	201	537	2.83	3.47

Synthesizing Information for Viewing and Concluding¶

There are 2 modifications I would like to perform on the data to help better visualize some insights:

For every column with view counts, add a column with percentages.
Invert the mean ranking scores so that a higher value equates to a higher ranking.

1. Add columns with percentages

To do this I will define a function. Each percentage is an episode's share of the column's total view count.

In [45]:

# First I perform the manipulations on one column in my episode stats dataframe
episode_stats["seen_percent"] = round((episode_stats["seen_count"] / episode_stats["seen_count"].sum() * 100), 2)
episode_stats.columns

Out[45]:

Index(['episode_num', 'episode_title', 'seen_count', 'episode_ranking',
       'male_count', 'female_count', 'male_ranking', 'female_ranking',
       'notfan_count', 'fan_count', 'fan_ranking', 'notfan_ranking',
       'seen_percent'],
      dtype='object')

In [46]:

# now that I know how to process one count column into percentages, I wrote a function to convert any count column.
def percents (col_count) :
    episode_stats[col_count.replace("count","percent")] = round((episode_stats[col_count] / episode_stats[col_count].sum() * 100), 2)

In [47]:

## now I store the other columns with counts that I want to change to percentages
seen_bool = episode_stats.columns.str.contains("count")
count_cols = episode_stats.columns[seen_bool]
count_cols

Out[47]:

Index(['seen_count', 'male_count', 'female_count', 'notfan_count',
       'fan_count'],
      dtype='object')

In [48]:

## process all count columns into a percent column
for col in count_cols :
    percents(col)
    
## view new columns
episode_stats[count_cols.str.replace("count","percent")]

Out[48]:

	seen_percent	male_percent	female_percent	notfan_percent	fan_percent
0	17.27	17.01	17.50	18.68	16.83
1	14.65	15.22	13.92	11.66	15.58
2	14.11	14.94	13.04	10.80	15.15
3	15.58	16.12	14.97	13.39	16.26
4	19.45	18.47	20.73	23.76	18.11
5	18.94	18.24	19.85	21.71	18.07
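
These percentages are shares of all views in a column, so each column adds up to roughly 100. A different, equally reasonable denominator would be the number of respondents. The sketch below compares the two, using the overall counts from above and the 1186 respondents from `star_wars.shape`. The respondent-based share is my own addition and is not used elsewhere in this notebook.

```python
import pandas as pd

# Overall view counts copied from the outputs above
seen_count = pd.Series(
    [673, 571, 550, 607, 758, 738],
    index=[f"Episode {i}" for i in range(1, 7)],
)
n_respondents = 1186  # from star_wars.shape

# Share of all reported views (what the percents function computes)
share_of_views = (seen_count / seen_count.sum() * 100).round(2)

# Share of respondents who saw each episode (alternative denominator)
share_of_respondents = (seen_count / n_respondents * 100).round(2)

print(pd.DataFrame({"views_%": share_of_views, "respondents_%": share_of_respondents}))
```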

Reset Index to the Episode Number

In [49]:

## I don't like the index as 0-5 so I'm going to set it to the Episode Number
episode_stats = episode_stats.set_index("episode_num")

In [50]:

## view a subset of the columns to validate, ie. the female stats
bool_female = episode_stats.columns.str.contains("female")
female_stats = episode_stats.columns[bool_female]
episode_stats[female_stats]

Out[50]:

	female_count	female_ranking	female_percent
episode_num
Episode 1	298	3.43	17.50
Episode 2	237	3.95	13.92
Episode 3	222	4.42	13.04
Episode 4	255	3.54	14.97
Episode 5	353	2.57	20.73
Episode 6	338	3.08	19.85

2. Invert Mean Ranking Score

Low ranking scores are better, but visually that doesn't translate.

A film can have a rank up to 6 so I will make my inverted mean rank = 6 - mean rank.

In [51]:

## Create a function that inverts a mean ranking

def invert (mean_rank) :
    return(round(6-mean_rank,2))

In [52]:

## identify columns with mean rankings that I want to invert
rank_bool = episode_stats.columns.str.contains("ranking")
rank_cols = episode_stats.columns[rank_bool]
rank_cols

Out[52]:

Index(['episode_ranking', 'male_ranking', 'female_ranking', 'fan_ranking',
       'notfan_ranking'],
      dtype='object')

In [53]:

## invert rankings of all relevant columns
for col in rank_cols :
    episode_stats[col+"_inverted"] = invert(episode_stats[col])
episode_stats

Out[53]:

	episode_title	seen_count	episode_ranking	male_count	female_count	male_ranking	female_ranking	notfan_count	fan_count	fan_ranking	...	seen_percent	male_percent	female_percent	notfan_percent	fan_percent	episode_ranking_inverted	male_ranking_inverted	female_ranking_inverted	fan_ranking_inverted	notfan_ranking_inverted
episode_num
Episode 1	Star Wars: Episode I The Phantom Menace	673	3.73	361	298	4.04	3.43	173	500	4.14	...	17.27	17.01	17.50	18.68	16.83	2.27	1.96	2.57	1.86	3.06
Episode 2	Star Wars: Episode II Attack of the Clones	571	4.09	323	237	4.22	3.95	108	463	4.34	...	14.65	15.22	13.92	11.66	15.58	1.91	1.78	2.05	1.66	2.41
Episode 3	Star Wars: Episode III Revenge of the Sith	550	4.34	317	222	4.27	4.42	100	450	4.42	...	14.11	14.94	13.04	10.80	15.15	1.66	1.73	1.58	1.58	1.81
Episode 4	Star Wars: Episode IV A New Hope	607	3.27	342	255	3.00	3.54	124	483	2.93	...	15.58	16.12	14.97	13.39	16.26	2.73	3.00	2.46	3.07	2.07
Episode 5	Star Wars: Episode V The Empire Strikes Back	758	2.51	392	353	2.46	2.57	220	538	2.33	...	19.45	18.47	20.73	23.76	18.11	3.49	3.54	3.43	3.67	3.14
Episode 6	Star Wars: Episode VI Return of the Jedi	738	3.05	387	338	3.00	3.08	201	537	2.83	...	18.94	18.24	19.85	21.71	18.07	2.95	3.00	2.92	3.17	2.53

6 rows × 21 columns
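
The inversion should only flip the scale, not change the order of the episodes. A small sanity check, using the overall mean rankings from above:

```python
import pandas as pd

# Overall mean rankings copied from the outputs above
episode_ranking = pd.Series(
    [3.73, 4.09, 4.34, 3.27, 2.51, 3.05],
    index=[f"Episode {i}" for i in range(1, 7)],
)
inverted = (6 - episode_ranking).round(2)

# Sorting ascending on the original and descending on the inverted
# score should give the same episode order.
order_original = episode_ranking.sort_values().index.tolist()
order_inverted = inverted.sort_values(ascending=False).index.tolist()

print(order_original == order_inverted)
```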

Visualizing Survey Results¶

This is where I get to have fun with plotting :-)

I made the dataframe to store the information because it gives me the flexibility to sort on various columns and also helps to apply the correct Episode labels on the plots.

In [54]:

## import my seaborn library ... just in case?:
import seaborn as sns
import matplotlib.style as style

style.use('fivethirtyeight')

The bar plot was created using this example from matplotlib.org

In [55]:

fig, ax1 = plt.subplots()
xlabels = ['I', 'II', 'III', 'IV', 'V', 'VI']
fan_inverted = episode_stats["fan_ranking_inverted"].tolist()
notfan_inverted = episode_stats["notfan_ranking_inverted"].tolist()


x = np.arange(len(xlabels))  # the label locations
width = 0.35  # the width of the bars

rects1 = ax1.bar(x - width/2, fan_inverted, width, label='Fans')
rects2 = ax1.bar(x + width/2, notfan_inverted, width, label='Not Fans')

ax1.set_title("Star Wars Favorites: Depends Who You Ask!")
ax1.set_ylabel("Inverted Mean Ranking\n(Highest Relative Value = Favorite)", fontsize='small')
ax1.set_xlabel("Episodes")
ax1.set_xticks(x)
ax1.set_xticklabels(xlabels)
ax1.set_facecolor('white')
ax1.grid(False)
ax1.legend(loc='upper center', bbox_to_anchor=(0.35, 0.95), edgecolor='white', facecolor='white')

plt.show()

Conclusion - Guided Project¶

So far what we've discovered does not provide much more insight into the data and survey results.

We have ascertained that overall Episode V: The Empire Strikes Back is the most viewed and most popular. Has it become the most viewed because it's the most recommended ... or is it the most liked because it has been viewed the most? A person cannot rank a film they have not seen. Do Data Scientists have these chicken-or-egg and self-fulfilling-prophecy debates about whether observing a trend can influence and exaggerate that trend?

It was suggested in the guided portion to continue with comparing results between fans and non-fans, star trek fan preferences, gender or demographic information. A quick glance regarding this showed:

Not much difference in preference between the genders
Many non-fans didn't see many of the films
Star Wars fans rank Episodes 4-6 high (the original 3 movies) and Episodes 1-3 much lower

*As Plotted Above: Not-fans ranked Episode 1 as their second favorite, almost beating Episode 5. Episode 1 was not nearly as popular in any other subgroup, especially not among the self-identified fans.

Project Continuation - My Story¶

I am curious to prove or disprove the following statements:

Star Wars fans are predominantly male
Younger Star Wars fans will score the 2nd trilogy (Episodes 1-3) higher, while mature (Gen X & Y) fans will prefer the original trilogy (Episodes 4-6).
There will be more interesting insights on preferences if instead I look at the proportion of #1 rankings for each episode.

Rankings from a Different Perspective¶

Since many respondents didn't see all episodes, I would like to know what it looks like to plot the ranking counts for each episode instead. I intended to count up all of the #1 rankings for Episode 1, all the #2 rankings for Episode 1, etc. and do the same for each episode.

However, in doing so I noticed something strange - all films had 835 ranking values. How can that be if I already know that some episodes have been viewed more often than others?
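
A minimal sketch of this kind of count on made-up rankings (the values are hypothetical). `count()` gives the number of non-null rankings per episode, and applying `value_counts` to every column gives a rank-by-episode table:

```python
import numpy as np
import pandas as pd

# Toy rankings (hypothetical values)
toy_rankings = pd.DataFrame({
    "ranking_1": [1.0, 2.0, 1.0, np.nan],
    "ranking_2": [2.0, 1.0, 2.0, np.nan],
})

# Non-null answers per column: identical counts hint that respondents
# ranked every film, whether or not they reported seeing it.
print(toy_rankings.count().tolist())

# Rows = rank value, columns = episode, cells = number of respondents
rank_counts = (
    toy_rankings.apply(lambda col: col.value_counts())
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(rank_counts)
```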

I checked for invalid episode ranking results when the corresponding episode had not been seen for Episode 3 because it was among the least viewed. I found 286 invalid rankings entered!

In [56]:

## begin with just one episode: find rankings given for Episode 3 by respondents who did not see it
compare_3 = star_wars[["seen_3","ranking_3"]]
invalid = compare_3.loc[(compare_3["seen_3"] == False) & (compare_3["ranking_3"].notnull())]
invalid

Out[56]:

	seen_3	ranking_3
9	False	3.0
16	False	2.0
50	False	5.0
59	False	6.0
61	False	3.0
...	...	...
1169	False	6.0
1172	False	4.0
1176	False	3.0
1179	False	5.0
1185	False	2.0

286 rows × 2 columns
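The same check can be run for every episode at once. The sketch below uses a small toy frame; `seen_cols` and `ranking_cols` are redefined here as plain lists for illustration, and the values are made up.

```python
import pandas as pd

# Toy frame with two episodes only (hypothetical values).
toy = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_2": [True, False, False],
    "ranking_1": [1.0, 2.0, 1.0],
    "ranking_2": [2.0, 1.0, 2.0],
})
seen_cols = ["seen_1", "seen_2"]
ranking_cols = ["ranking_1", "ranking_2"]

# A ranking is suspect when it is present but the matching episode was not seen.
not_seen = ~toy[seen_cols].to_numpy()
has_rank = toy[ranking_cols].notna().to_numpy()
invalid_counts = pd.Series((not_seen & has_rank).sum(axis=0), index=ranking_cols)
print(invalid_counts)
```

Converting to NumPy arrays compares the columns by position, so `seen_1` lines up with `ranking_1` even though the labels differ.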

Removing Invalid Ranking Scores¶

I expect there to be a large difference in the mean scores if I remove the invalid rankings for each episode!

In [57]:

star_wars_rankings = star_wars.copy()
star_wars_rankings.head(5)

Out[57]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3292879998	True	True	True	True	True	True	True	True	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	False	NaN	False	False	False	False	False	False	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	True	False	True	True	True	False	False	False	1.0	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	True	True	True	True	True	True	True	True	5.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	True	True	True	True	True	True	True	True	5.0	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

5 rows × 38 columns

To find invalid rankings, I will:

Ignore rows where the respondent did not see any of the movies. In the process I will check:
- did the respondent answer No to Question 1, which asks whether they had seen any of the movies?
- did the respondent answer False to seeing each of the episodes?
- did the respondent's survey have null answers in the episode ranking columns?
Compare the episodes seen with the rankings ... more on this later

Separate survey responses into those with valid rankings and those with invalid rankings

Check for null responses

In [58]:

## if respondent said they did not watch any of the episodes then seen_1-6 are all False as expected
star_wars_rankings.loc[(star_wars_rankings[seen_any_col] == False), seen_cols].value_counts()

Out[58]:

seen_1  seen_2  seen_3  seen_4  seen_5  seen_6
False   False   False   False   False   False     250
dtype: int64

In [59]:

## similarly, rankings 1-6 are all null as expected (value_counts drops rows with nulls, so the result is empty)
star_wars_rankings.loc[(star_wars_rankings[seen_any_col] == False), ranking_cols].value_counts()

Out[59]:

Series([], dtype: int64)

In [60]:

## initiate a column to store the rank_validity in the dataframe
## (an explicit dtype avoids the DeprecationWarning for an empty Series)
star_wars_rankings["rank_validity"] = pd.Series(dtype="object")

The function below takes a row from the dataframe as input and returns the row with a string code that designates its validity classification:

If seen_1-6 are all False, set "all_null"
If seen_1-6 are all True (count = 6), set "all_valid"
Otherwise, if any seen episode has a ranking greater than the count of episodes seen, set "some_invalid"; if not, set "some_valid"

some_valid rows still need their rankings removed where seen is not True; the function handles this as well.

In [61]:

# remind ourselves of what the seen_1-6 column data looks like:
star_wars_rankings[seen_cols][:5]

Out[61]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6
0	True	True	True	True	True	True
1	False	False	False	False	False	False
2	True	True	True	False	False	False
3	True	True	True	True	True	True
4	True	True	True	True	True	True

In [62]:

## identify respondents who have seen all and any movies to attribute validity values
bool_all = star_wars_rankings[seen_cols].all(axis=1)
bool_any = star_wars_rankings[seen_cols].any(axis=1)

In [63]:

## check how many respondents saw all or some of the episodes:
print("# Respondents who have seen ALL Episodes: ", len(star_wars_rankings.loc[bool_all]))

## some = any but not all
bool_some = bool_any & (~bool_all)
print("# Respondents who have seen SOME but not all Episodes: ", len(star_wars_rankings.loc[bool_some]))

# Respondents who have seen ALL Episodes:  471
# Respondents who have seen SOME but not all Episodes:  364

Amazingly we have 471 respondents who have seen all of the episodes!

So we have data to work with regardless of how many potentially valid responses I pull out from the 364 responses with not all movies seen.

In [64]:

## quick reference to the seen and ranking columns we have to review for each row
seen_ranking_cols = seen_cols.tolist()+ranking_cols.tolist()

In [65]:

star_wars_rankings.loc[bool_some, seen_ranking_cols][:10]

Out[65]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
2	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0
9	False	True	False	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0
16	False	False	False	True	False	False	4.0	1.0	2.0	3.0	5.0	6.0
17	True	True	True	False	False	True	1.0	2.0	3.0	4.0	5.0	6.0
21	True	True	True	True	True	False	3.0	4.0	5.0	1.0	2.0	6.0
33	False	True	True	False	False	False	6.0	1.0	2.0	3.0	4.0	5.0
50	True	False	False	True	True	True	4.0	6.0	5.0	3.0	1.0	2.0
59	True	True	False	True	True	True	5.0	4.0	6.0	1.0	3.0	2.0
61	False	False	False	True	False	False	1.0	2.0	3.0	4.0	5.0	6.0
76	False	False	False	False	True	True	3.0	4.0	5.0	6.0	1.0	2.0

Next I need to clean the rankings: if a ranking is valid, the value is retained; if it is invalid, it is replaced by null.

In [66]:

## I admit I do not understand why, when I pass an individual row to the function, I need to use row[seen].bool()
## but when I apply the function on my dataframe, I need to remove .bool()

import numpy as np

## This function:
## 1. sets the rank_validity flag
## 2. removes ranking values that are considered invalid

def rerank_df(row):
    ## assume there are some good rankings unless proven otherwise
    row["rank_validity"] = "some_valid"

    ## count how many episodes respondent has seen
    seen_count = 0
    for seen in seen_cols:
        if row[seen]:
            seen_count += 1

    ## no episodes seen: per-episode rankings are all null, return them unchanged
    if seen_count == 0:
        row["rank_validity"] = "all_null"
        return row
    ## all episodes seen: per-episode rankings are valid, return them unchanged
    elif seen_count == 6:
        row["rank_validity"] = "all_valid"
        return row

    ## some but not all episodes have been seen: iterate through the episodes again
    for seen, ranking in zip(seen_cols, ranking_cols):
        if row[seen]:
            ## ranking of a seen episode is outside the range of # episodes seen
            if float(row[ranking]) > seen_count:
                ## override the ranking with a null value and overwrite the flag
                row[ranking] = np.nan
                row["rank_validity"] = "some_invalid"
        else:
            ## respondent has not seen the episode: override ranking with null value
            row[ranking] = np.nan

    ## if some rankings are invalid, set all rankings to null
    if row["rank_validity"] == "some_invalid":
        row[ranking_cols] = np.nan
    ## return row with changes
    return row
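A likely explanation for the `.bool()` puzzle in the opening comment of the cell above, offered as an assumption because the single-row test call is not shown: `apply(axis=1)` passes each row as a Series, so `row[seen]` is a plain scalar, whereas a one-row slice of the dataframe is still a DataFrame, so `row[seen]` is a one-element Series that cannot be used directly in an `if`. The toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"seen_1": [True, False], "ranking_1": [1.0, 2.0]})

# What apply(axis=1) hands to the function: one row as a Series.
row_as_series = toy.iloc[0]
scalar_value = row_as_series["seen_1"]      # a single truthy/falsy value

# What a slice gives: a one-row DataFrame.
row_as_frame = toy.iloc[0:1]
series_value = row_as_frame["seen_1"]       # a one-element Series

print(bool(scalar_value))
print(type(series_value).__name__)
print(series_value.item())                  # extract the single element explicitly
```

Using `.item()` (or passing `toy.iloc[0]` instead of a slice) avoids the need for `.bool()`, which newer pandas versions deprecate.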

In [67]:

## apply function to every row; rows where none or all episodes were seen only get the flag set
star_wars_rankings = star_wars_rankings.apply(rerank_df, axis=1)
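Row-wise `apply` is easy to follow but can be slow on larger frames. Below is a vectorized sketch of the same rules on toy data with three episodes; the column lists and values are assumptions for illustration. One difference from `rerank_df`: unseen rankings are blanked for every row, including rows where no episode was seen.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): all seen, none seen, some valid, some invalid.
toy = pd.DataFrame({
    "seen_1": [True, False, True, False],
    "seen_2": [True, False, True, True],
    "seen_3": [True, False, False, False],
    "ranking_1": [2.0, np.nan, 1.0, 1.0],
    "ranking_2": [1.0, np.nan, 2.0, 2.0],
    "ranking_3": [3.0, np.nan, 3.0, 3.0],
})
seen_cols = ["seen_1", "seen_2", "seen_3"]
ranking_cols = ["ranking_1", "ranking_2", "ranking_3"]

seen = toy[seen_cols].to_numpy(dtype=bool)
ranks = toy[ranking_cols].to_numpy(dtype=float)
seen_count = seen.sum(axis=1)

# Blank out rankings for episodes that were not seen.
cleaned = np.where(seen, ranks, np.nan)

# A remaining rank larger than the number of episodes seen makes the whole row suspect.
suspect = (cleaned > seen_count[:, None]).any(axis=1)
cleaned[suspect] = np.nan

# Conditions are checked in order, mirroring the if/elif chain in rerank_df.
validity = np.select(
    [seen_count == 0, seen_count == len(seen_cols), suspect],
    ["all_null", "all_valid", "some_invalid"],
    default="some_valid",
)

toy[ranking_cols] = cleaned
toy["rank_validity"] = validity
print(toy[ranking_cols + ["rank_validity"]])
```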

In [68]:

## check our counts - we expect 471 valid rankings where all episodes were seen
len(star_wars_rankings.loc[star_wars_rankings["rank_validity"] == "all_valid"])

Out[68]:

In [69]:

## check the results - I expect less than 364 valid rankings where only some episodes were seen
len(star_wars_rankings.loc[star_wars_rankings["rank_validity"] == "some_valid"])

Out[69]:

In [70]:

## admire the result
star_wars_rankings[bool_some][seen_ranking_cols][:10]

Out[70]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
2	True	True	True	False	False	False	1.0	2.0	3.0	NaN	NaN	NaN
9	False	True	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN
16	False	False	False	True	False	False	NaN	NaN	NaN	NaN	NaN	NaN
17	True	True	True	False	False	True	NaN	NaN	NaN	NaN	NaN	NaN
21	True	True	True	True	True	False	3.0	4.0	5.0	1.0	2.0	NaN
33	False	True	True	False	False	False	NaN	1.0	2.0	NaN	NaN	NaN
50	True	False	False	True	True	True	4.0	NaN	NaN	3.0	1.0	2.0
59	True	True	False	True	True	True	5.0	4.0	NaN	1.0	3.0	2.0
61	False	False	False	True	False	False	NaN	NaN	NaN	NaN	NaN	NaN
76	False	False	False	False	True	True	NaN	NaN	NaN	NaN	1.0	2.0

In [71]:

## if some rankings are invalid then all rankings on that survey are set to null
star_wars_rankings.loc[star_wars_rankings["rank_validity"] == 'some_invalid', seen_ranking_cols][:5]

Out[71]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
9	False	True	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN
16	False	False	False	True	False	False	NaN	NaN	NaN	NaN	NaN	NaN
17	True	True	True	False	False	True	NaN	NaN	NaN	NaN	NaN	NaN
61	False	False	False	True	False	False	NaN	NaN	NaN	NaN	NaN	NaN
142	False	False	False	False	True	True	NaN	NaN	NaN	NaN	NaN	NaN

In [72]:

star_wars_rankings.loc[star_wars_rankings["rank_validity"] == 'some_valid', seen_ranking_cols][:5]

Out[72]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
2	True	True	True	False	False	False	1.0	2.0	3.0	NaN	NaN	NaN
21	True	True	True	True	True	False	3.0	4.0	5.0	1.0	2.0	NaN
33	False	True	True	False	False	False	NaN	1.0	2.0	NaN	NaN	NaN
50	True	False	False	True	True	True	4.0	NaN	NaN	3.0	1.0	2.0
59	True	True	False	True	True	True	5.0	4.0	NaN	1.0	3.0	2.0

In [73]:

star_wars_rankings.loc[star_wars_rankings["rank_validity"] == 'all_valid', seen_ranking_cols][:5]

Out[73]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
0	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0
3	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0
4	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0
5	True	True	True	True	True	True	1.0	4.0	3.0	6.0	5.0	2.0
6	True	True	True	True	True	True	6.0	5.0	4.0	3.0	1.0	2.0

Recap on Ranking Value Changes¶

Ranking Issue Summarized¶

The survey responses for rankings were misleading due to invalid information: the per-episode ranking columns contained rankings for episodes that respondents had not seen.

I presume this is an oversight in the survey, where all ranking columns had to be filled in unless the respondent indicated in Question 1 that they had not seen ANY of the episodes, in which case null was applied to all per-episode ranking columns.

If respondents had seen any of the episodes, the survey appears to retain the default rankings of 1-6 for Episodes 1-6 unless the respondent changed them.

Ranking Issue Solution¶

I modified the per-episode ranking columns so that:

if the episode was not seen, the ranking was converted to null
if a seen episode has a rank value higher than the # of episodes seen by the respondent, all rankings on this survey are considered suspect and invalid and are all converted to null.

Resulting Data Subset¶

A subset of 471 rankings by respondents who have seen all the movies
A subset of 282 rankings by respondents who have only seen some of the movies.
A flag in the dataset to indicate validity of the rankings ["all_valid", "all_null", "some_valid", "some_invalid"]

Plot New Rankings in a New Style¶

I am not a huge fan of plotting the mean ranking values because this information is both flat (no depth to it) and confusing (a lower value is better).

I prefer to visualize this in a way that also reflects the number of viewings of the episode.

I followed the recommendations in this Towards Data Science article

In [74]:

import seaborn as sns

In [75]:

## store the count of each rank by episode
rank_count = pd.DataFrame()

In [76]:

for (epi,rank) in zip(episodes, ranking_cols) :
    rank_count[epi] = star_wars_rankings[rank].value_counts().sort_index()

In [77]:

per_episode_rank_count = rank_count.transpose().iloc[::-1]
per_episode_rank_count

Out[77]:

	1.0	2.0	3.0	4.0	5.0	6.0
Episode 6	125	219	200	50	29	53
Episode 5	276	210	101	36	52	16
Episode 4	190	126	105	64	32	60
Episode 3	31	38	89	137	114	116
Episode 2	23	68	68	103	198	84
Episode 1	109	47	80	173	84	142
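The counts above also give the proportion of #1 rankings per episode, which is what I set out to look at, alongside the mean rank for comparison. The sketch below rebuilds the table by hand from Out[77]; the variable names are my own for illustration.

```python
import pandas as pd

# Counts copied from the table above (rows: episodes, columns: rank 1-6).
counts = pd.DataFrame.from_dict(
    {
        "Episode 6": [125, 219, 200, 50, 29, 53],
        "Episode 5": [276, 210, 101, 36, 52, 16],
        "Episode 4": [190, 126, 105, 64, 32, 60],
        "Episode 3": [31, 38, 89, 137, 114, 116],
        "Episode 2": [23, 68, 68, 103, 198, 84],
        "Episode 1": [109, 47, 80, 173, 84, 142],
    },
    orient="index",
    columns=[1, 2, 3, 4, 5, 6],
)

n_rankings = counts.sum(axis=1)               # valid rankings per episode
share_first = counts[1] / n_rankings          # proportion of #1 rankings

# Mean rank = sum(rank * count) / total count, via a dot product with the rank values.
rank_values = pd.Series(counts.columns, index=counts.columns)
mean_rank = counts.dot(rank_values) / n_rankings

summary = pd.DataFrame({"n": n_rankings, "share_first": share_first, "mean_rank": mean_rank})
print(summary.round(2))
```

Dividing by each episode's own total keeps the comparison fair even though the episodes have different numbers of valid rankings.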

I originally had to run the code below TWICE to see the expected plot: after the first run the horizontal bar plot was shrunk, and only the second run resized it to fit the figure. The cause is that rcParams only apply to figures created after they are set, so set_sizes now runs before plt.subplots().

In [79]:

def set_sizes(fig_size, font_size):
    plt.rcParams["figure.figsize"] = fig_size
    plt.rcParams["font.size"] = font_size
    plt.rcParams["xtick.labelsize"] = font_size
    plt.rcParams["ytick.labelsize"] = font_size+4
    plt.rcParams["axes.labelsize"] = font_size+2
    plt.rcParams["axes.titlesize"] = font_size+6
    plt.rcParams["legend.fontsize"] = font_size+2

## set the sizes BEFORE creating the figure so they take effect on this run
set_sizes((12,6), 10)

fig, axes = plt.subplots()

per_episode_rank_count.plot.barh(stacked=True, ax=axes, width=0.8)

axes.legend(["#1 Ranked", "#2 Ranked", "#3 Ranked", "#4 Ranked", "#5 Ranked", "#6 Ranked"],
            loc='center right',
            bbox_to_anchor=(0.99,0.6))

axes.set_title("Per Episode Ranking Counts")
axes.text(s=((episode_stats["episode_title"][4])+ " is by far the most seen, most beloved, least disliked"),
          x=9, y=0.9,
          fontsize=14,
          color="white")
axes.text(s=((episode_stats["episode_title"][0])+ " is divided by two forces, both loved and hated"),
          x=9, y=4.9,
          fontsize=14,
          color="white")

plt.show()
## from the article - a future ambition to annotate with values!!
# import matplotlib
# import os
# from dataclasses import dataclass
# from typing import Callable, Tuple

# Patch = matplotlib.patches.Patch
# PosVal = Tuple[float, Tuple[float, float]] 
# Axis = matplotlib.axes.Axes
# PosValFunc = Callable[[Patch], PosVal]
# @dataclass
# class AnnotateBars:
#     font_size: int = 10
#     color: str = "black"
#     n_dec: int = 2
#     def horizontal(self, ax: Axis, centered=False):
#         def get_vals(p: Patch) -> PosVal:
#             value = p.get_width()
#             div = 2 if centered else 1
#             pos = (
#                 p.get_x() + p.get_width() / div,
#                 p.get_y() + p.get_height() / 2,
#             )
#             return value, pos
#         ha = "center" if centered else  "left"
#         self._annotate(ax, get_vals, ha=ha, va="center")
#     def _annotate(self, ax, func: PosValFunc, **kwargs):
#         cfg = {"color": self.color, 
#                "fontsize": self.font_size, **kwargs}
#         for p in ax.patches:
#             value, pos = func(p)
#             ax.annotate(f"{value:.{self.n_dec}f}", pos, **cfg)

Final Words¶

I prefer my distribution bar plot of ranking counts because it captures more information than just plotting the mean, which could not convey how polarized opinions on Episode 1 are. This analysis is more in line with FiveThirtyEight's Top Third / Middle Third / Bottom Third ranking distribution groupings.

Future Improvements¶

In this Towards Data Science article I really liked how the values were annotated on the plot. Ideally my plot would also show these values.
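One possible shortcut, which is not the article's approach: Matplotlib 3.4 and later provide `Axes.bar_label`, which can annotate each segment of a stacked bar plot. The sketch below uses made-up counts and assumes a recent Matplotlib version.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy ranking counts (hypothetical values, two episodes and three ranks).
toy_counts = pd.DataFrame(
    {"#1 Ranked": [30, 10], "#2 Ranked": [20, 25], "#3 Ranked": [5, 40]},
    index=["Episode 5", "Episode 1"],
)

fig, ax = plt.subplots(figsize=(8, 3))
toy_counts.plot.barh(stacked=True, ax=ax, width=0.8)

# Each rank column is drawn as one BarContainer; label every segment at its centre.
for container in ax.containers:
    ax.bar_label(container, label_type="center", color="white", fontsize=10)

ax.set_title("Per Episode Ranking Counts (toy data)")
fig.tight_layout()
```

With `label_type="center"` the label shows the length of each segment, not the running total, which is what a stacked count plot needs.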

Footnote on Reading about the Data!!¶

With this all done and my lovely stacked bar plot created to reflect how often a valid ranking was even attributed to an episode, I was ready to wrap up and present my project.

THAT’S when I decided to take a look at the information from FiveThirtyEight about the dataset

THAT’S when I learned they had explicitly only taken into consideration the rankings by the 471 respondents who indicated they had seen ALL of the films.

While I do not consider the time spent exercising dataset cleaning and manipulations as wasted time, I would have preferred the efficiency of learning this information by just reading about the dataset!!!
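For comparison, restricting the analysis to respondents who saw every film takes only a boolean mask. This is a sketch on toy data with two episodes; the column lists and values are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): the third respondent skipped Episode 1.
toy = pd.DataFrame({
    "seen_1": [True, True, False],
    "seen_2": [True, True, True],
    "ranking_1": [2.0, 1.0, 1.0],
    "ranking_2": [1.0, 2.0, 2.0],
})
seen_cols = ["seen_1", "seen_2"]
ranking_cols = ["ranking_1", "ranking_2"]

# Keep only respondents who have seen all episodes, then average their rankings.
saw_everything = toy[seen_cols].all(axis=1)
mean_rank_all_seen = toy.loc[saw_everything, ranking_cols].mean()
print(mean_rank_all_seen)
```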
