Coincidentally, May the Fourth [Be With You] just passed!
Which Star Wars movie is the most beloved among fans? FiveThirtyEight conducted a survey to provide some insight on the
I would guess that it's one of the original 3, but is it The Empire Strikes Back, per FiveThirtyEight's prediction? Do only some people get to weigh in seriously on this? What's the epic story?
There are 38 columns of survey data (including the respondent ID) and 1186 respondents. At a glance, the survey questions focus on how dedicated a Star Wars fan the respondent is, which movies they have watched, their preferences and favorites, as well as some general demographic information.
The data file, originally compiled by FiveThirtyEight, was provided in the guided project and can be downloaded from GitHub.
import pandas as pd
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
## Get # respondents and # of questions
star_wars.shape
(1186, 38)
## Get a feel for the data by reviewing the first 10 rows
star_wars[:10]
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3292719380 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1.0 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
6 | 3292684787 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6.0 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
7 | 3292663732 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
8 | 3292654043 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
9 | 3292640424 | Yes | No | NaN | Star Wars: Episode II Attack of the Clones | NaN | NaN | NaN | NaN | 1.0 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
# List the survey questions
for col in star_wars.columns :
    print(col)
RespondentID
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
Which of the following Star Wars films have you seen? Please select all that apply.
Unnamed: 4
Unnamed: 5
Unnamed: 6
Unnamed: 7
Unnamed: 8
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
Unnamed: 10
Unnamed: 11
Unnamed: 12
Unnamed: 13
Unnamed: 14
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.
Unnamed: 16
Unnamed: 17
Unnamed: 18
Unnamed: 19
Unnamed: 20
Unnamed: 21
Unnamed: 22
Unnamed: 23
Unnamed: 24
Unnamed: 25
Unnamed: 26
Unnamed: 27
Unnamed: 28
Which character shot first?
Are you familiar with the Expanded Universe?
Do you consider yourself to be a fan of the Expanded Universe?
Do you consider yourself to be a fan of the Star Trek franchise?
Gender
Age
Household Income
Education
Location (Census Region)
Columns at indexes 1 and 2 store Yes/No responses which I am converting to True/False.
# store long column header for easier reference
seen_any_col = star_wars.columns[1]
seen_any_col
'Have you seen any of the 6 films in the Star Wars franchise?'
# store long column header for easier reference
fan_col = star_wars.columns[2]
fan_col
'Do you consider yourself to be a fan of the Star Wars film franchise?'
#define boolean map to apply to seen_any_col and fan_col
yes_no = {
"Yes" : True,
"No" : False
}
Q1: Every respondent indicated whether they had seen any of the 6 Star Wars films
## print responses before transforming them to boolean values
star_wars[seen_any_col].value_counts(dropna=False)
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
## apply transformation map
star_wars[seen_any_col] = star_wars[seen_any_col].map(yes_no)
## validate result by comparing with values before transformation
print(star_wars[seen_any_col].value_counts(dropna=False))
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
Q2: 29.5% of respondents did NOT indicate whether they consider themselves a fan of the franchise
## print responses before transforming them to boolean values
star_wars[fan_col].value_counts(dropna=False)
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
## apply transformation map - ignore but keep null values
star_wars[fan_col] = star_wars[fan_col].map(yes_no, na_action="ignore")
## validate result by comparing with values before transformation
print(star_wars[fan_col].value_counts())
## show proportion of null values:
print("Null:", round((star_wars[fan_col].isnull().sum()/len(star_wars[fan_col])*100), 2), "%")
True     552
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
Null: 29.51 %
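To see why the nulls survive the transformation, here is a minimal sketch on a made-up Series (the `answers` values are hypothetical, not taken from the survey file). It only shows how a dictionary passed to `Series.map` treats values that are missing from the dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers; not taken from the survey file
answers = pd.Series(["Yes", "No", np.nan, "Yes"])
yes_no = {"Yes": True, "No": False}

mapped = answers.map(yes_no)

# Missing answers stay missing instead of silently becoming False
print(mapped.tolist())
print(mapped.isnull().sum())
```

Keeping the nulls matters here because "did not answer" and "not a fan" are different groups, and I want to be able to count them separately.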
Questions 3-8 (6 columns) store which Star Wars Episodes the respondent has seen. Each column represents an Episode, and if the value is the name of the Episode, the respondent has seen the Episode. Otherwise the value is null to indicate the respondent has NOT seen the Episode.
Here's what that looks like:
star_wars.iloc[:,3:9].head(3)
| | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 |
---|---|---|---|---|---|---|
0 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN |
Q3-8: These columns will have changes applied so that:
1. Headers are meaningful
2. Values are True/False
I don't want to repetitively copy paste anything EVER if I don't have to.
Seen This Episode? Column Header Maps
I am going to generate the map.
## isolate columns
seen_cols = star_wars.columns[3:9]
## iterate through the column names at indexes 3-8 to create a mapping dictionary
map_episode_header = {}
for (i,header) in zip(range(6),star_wars[seen_cols]) :
    map_episode_header[header] = "seen_" + str(i+1)
## verify the mapping dictionary looks correct
map_episode_header
{'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1', 'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5', 'Unnamed: 8': 'seen_6'}
## apply the map to rename the columns
star_wars=star_wars.rename(columns=map_episode_header)
## store the new columns headers and validate result
seen_cols = star_wars.columns[3:9]
seen_cols
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
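The same header map can also be built in one line. This is only a sketch on a made-up, empty DataFrame (the column names in `df` are stand-ins I invented), using `enumerate` with `start=1` so the episode number does not need the `i+1` adjustment.

```python
import pandas as pd

# Hypothetical stand-in for three of the survey's "seen" columns
df = pd.DataFrame(columns=["Which films have you seen?", "Unnamed: 4", "Unnamed: 5"])

# enumerate(..., start=1) supplies the episode number directly
header_map = {old: f"seen_{i}" for i, old in enumerate(df.columns, start=1)}
df = df.rename(columns=header_map)

print(list(df.columns))
```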
Seen This Episode? Column Value Map
I now generate the map that will transform the column values to boolean.
This seems a little trickier. I bet there is a quicker, cleaner method but I have broken out the steps.
First I need to get the values for True in each column (= the 6 Episode Names). I am assuming it is the most popular non-null value in each column.
I am putting all the episodes titles into a single mapping dictionary.
## Value counts to use for validating transformation
for episode in star_wars[seen_cols] :
    print(star_wars[episode].value_counts(dropna=False))
Star Wars: Episode I The Phantom Menace    673
NaN                                        513
Name: seen_1, dtype: int64
NaN                                           615
Star Wars: Episode II Attack of the Clones    571
Name: seen_2, dtype: int64
NaN                                           636
Star Wars: Episode III Revenge of the Sith    550
Name: seen_3, dtype: int64
Star Wars: Episode IV A New Hope    607
NaN                                 579
Name: seen_4, dtype: int64
Star Wars: Episode V The Empire Strikes Back    758
NaN                                             428
Name: seen_5, dtype: int64
Star Wars: Episode VI Return of the Jedi    738
NaN                                         448
Name: seen_6, dtype: int64
## first entry in the mapping dictionary is False for null
import numpy as np
## np.nan is used because the np.NaN alias was removed in NumPy 2.0
bool_episode_map = {np.nan : False}
episode_titles = []
## next add each title to the dictionary mapping to True
for episode in seen_cols :
    episode_titles.append(star_wars[episode].value_counts().idxmax())
    bool_episode_map[star_wars[episode].value_counts().idxmax()] = True
# Validate mapping dictionary
bool_episode_map
{nan: False, 'Star Wars: Episode I The Phantom Menace': True, 'Star Wars: Episode II Attack of the Clones': True, 'Star Wars: Episode III Revenge of the Sith': True, 'Star Wars: Episode IV A New Hope': True, 'Star Wars: Episode V The Empire Strikes Back': True, 'Star Wars: Episode VI Return of the Jedi': True}
## apply map and validate result
for episode in seen_cols :
    star_wars[episode] = star_wars[episode].map(bool_episode_map)
    print(star_wars[episode].value_counts())
True     673
False    513
Name: seen_1, dtype: int64
False    615
True     571
Name: seen_2, dtype: int64
False    636
True     550
Name: seen_3, dtype: int64
True     607
False    579
Name: seen_4, dtype: int64
True     758
False    428
Name: seen_5, dtype: int64
True     738
False    448
Name: seen_6, dtype: int64
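As for the quicker, cleaner method: one possibility, sketched here on a made-up frame, is `DataFrame.notna()`. It relies on the same assumption I made above, namely that the only non-null value in each column is that episode's title, so it would not catch a stray unexpected value the way an explicit map would.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same shape of data as the "seen" columns
seen = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "seen_2": [np.nan, np.nan, "Star Wars: Episode II Attack of the Clones"],
})

# Any non-null cell holds an episode title, so notna() is already the answer
seen_bool = seen.notna()
print(seen_bool.sum().tolist())
```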
There are 6 columns that store the rank of each Episode relative to the others. Rank of 1 = favourite, Rank of 6 = least favourite. Otherwise the value is null to indicate the respondent has NOT seen the episode.
Here's what that looks like:
star_wars.iloc[:3,9:15]
| | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 |
---|---|---|---|---|---|---|
0 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
Q9-14: These columns will have changes applied so that:
1. Headers are meaningful
2. Values are Float Type
The below changes the header names using same method as above.
## store the old headers
ranking_cols = star_wars.columns[9:15]
ranking_cols
Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], dtype='object')
## create the map
map_ranking_header = {}
for (i, rank) in zip(range(6), ranking_cols):
    map_ranking_header[rank] = "ranking_" + str(i+1)
## Validate the mapping dictionary looks correct
map_ranking_header
{'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1', 'Unnamed: 10': 'ranking_2', 'Unnamed: 11': 'ranking_3', 'Unnamed: 12': 'ranking_4', 'Unnamed: 13': 'ranking_5', 'Unnamed: 14': 'ranking_6'}
## apply the map to rename the columns
star_wars = star_wars.rename(columns=map_ranking_header)
## store the new columns headers and validate result
ranking_cols = star_wars.columns[9:15]
ranking_cols
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'], dtype='object')
## convert values to float type and validate
star_wars[ranking_cols] = star_wars[ranking_cols].astype(float)
type(star_wars[ranking_cols[0]][0])
numpy.float64
The mean ranking of each movie will indicate how popular it is.
# store the mean ranking value for each episode
episode_rankings = round(star_wars[ranking_cols].mean(),2)
episode_rankings
ranking_1    3.73
ranking_2    4.09
ranking_3    4.34
ranking_4    3.27
ranking_5    2.51
ranking_6    3.05
dtype: float64
episodes = []
for i in range(6) :
    episodes.append("Episode " + str(i+1))
xlabels = episodes
episodes
['Episode 1', 'Episode 2', 'Episode 3', 'Episode 4', 'Episode 5', 'Episode 6']
# plot the mean ranking values
import matplotlib.pyplot as plt
%matplotlib inline
episode_rankings.plot.bar()
plt.title("Mean Ranking per Episode")
plt.xlabel("Lower Value = Higher Ranking")
plt.ylabel("Mean Ranking")
plt.xticks(ticks=range(6), labels=xlabels, rotation = 30)
plt.show()
# print full title of Episode 5
print("The Best Ranked Episode is: ", episode_titles[4])
The Best Ranked Episode is: Star Wars: Episode V The Empire Strikes Back
Respondents ranked Episode 5 the highest, which is indeed The Empire Strikes Back.
This could be because more respondents saw this movie than any other movie in the series:
# count how many respondents saw each movie
episode_views = star_wars[seen_cols].sum()
print(episode_views.tolist())
[673, 571, 550, 607, 758, 738]
## plot the bar graph
episode_views.plot.bar()
plt.title("Views per Episode")
plt.ylabel("Views")
plt.xticks(ticks=range(6), labels=xlabels, rotation = 30)
plt.show()
I want to consolidate all we've done already into a dataframe. Then I can use these results to compare with subgroup results.
episode_stats is a dataframe compiling the information we've generated for each Episode so far and additional per-episode information will be added as I go along.
episode_stats = pd.DataFrame()
episode_stats["episode_num"] = pd.Series(xlabels)
episode_stats["episode_title"] = pd.Series(episode_titles)
episode_stats["seen_count"] = pd.Series(episode_views.tolist())
episode_stats["episode_ranking"] = pd.Series(episode_rankings.tolist())
episode_stats
| | episode_num | episode_title | seen_count | episode_ranking |
---|---|---|---|---|
0 | Episode 1 | Star Wars: Episode I The Phantom Menace | 673 | 3.73 |
1 | Episode 2 | Star Wars: Episode II Attack of the Clones | 571 | 4.09 |
2 | Episode 3 | Star Wars: Episode III Revenge of the Sith | 550 | 4.34 |
3 | Episode 4 | Star Wars: Episode IV A New Hope | 607 | 3.27 |
4 | Episode 5 | Star Wars: Episode V The Empire Strikes Back | 758 | 2.51 |
5 | Episode 6 | Star Wars: Episode VI Return of the Jedi | 738 | 3.05 |
I will check whether male and female respondents prefer different movies.
## divide data set
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
## How many males have seen each episode:
male_seen = []
male_seen = males[seen_cols].sum()
print(male_seen)
seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64
## How many females have seen each episode:
female_seen = []
female_seen = females[seen_cols].sum()
print(female_seen)
seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64
## calculate the mean male rankings
male_rankings = round(males[ranking_cols].mean(),2)
print(male_rankings)
ranking_1    4.04
ranking_2    4.22
ranking_3    4.27
ranking_4    3.00
ranking_5    2.46
ranking_6    3.00
dtype: float64
## calculate the mean female rankings
female_rankings = round(females[ranking_cols].mean(),2)
print(female_rankings)
ranking_1    3.43
ranking_2    3.95
ranking_3    4.42
ranking_4    3.54
ranking_5    2.57
ranking_6    3.08
dtype: float64
Consolidating this into our dataframe of interesting per-Episode information:
episode_stats["male_count"] = pd.Series(male_seen.tolist())
episode_stats["female_count"] = pd.Series(female_seen.tolist())
episode_stats["male_ranking"] = pd.Series(male_rankings.tolist())
episode_stats["female_ranking"] = pd.Series(female_rankings.tolist())
print(episode_stats)
  episode_num                                 episode_title  seen_count  \
0   Episode 1       Star Wars: Episode I The Phantom Menace         673
1   Episode 2    Star Wars: Episode II Attack of the Clones         571
2   Episode 3    Star Wars: Episode III Revenge of the Sith         550
3   Episode 4              Star Wars: Episode IV A New Hope         607
4   Episode 5  Star Wars: Episode V The Empire Strikes Back         758
5   Episode 6      Star Wars: Episode VI Return of the Jedi         738

   episode_ranking  male_count  female_count  male_ranking  female_ranking
0             3.73         361           298          4.04            3.43
1             4.09         323           237          4.22            3.95
2             4.34         317           222          4.27            4.42
3             3.27         342           255          3.00            3.54
4             2.51         392           353          2.46            2.57
5             3.05         387           338          3.00            3.08
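An alternative to filtering the data into separate male and female frames is to let `groupby` do the split. The sketch below uses a made-up frame (`toy` and its values are hypothetical) with the same column naming as above, and named aggregation to get the seen count and the mean ranking per gender in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy responses; column names follow the ones used above
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "seen_1": [True, True, False, True, True],
    "ranking_1": [4.0, 2.0, np.nan, 3.0, 1.0],
})

# One row per gender; rows with a missing Gender are dropped by default
by_gender = toy.groupby("Gender").agg(
    seen_1_count=("seen_1", "sum"),
    ranking_1_mean=("ranking_1", "mean"),
)
print(by_gender)
```

Like the boolean filters above, this leaves out respondents who did not state a gender, so the per-gender counts will not add up to the overall counts.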
I would like to know if there are different answers depending on whether the respondent indicated they were a Star Wars fan or a Star Trek fan. But first, how much information do we have on this?
## identify the columns for Star Wars and Star Trek fans
#(print(star_wars.columns))
fan_col = str(star_wars.columns[2])
print(fan_col)
trek_col = star_wars.columns[32]
print(trek_col)
Do you consider yourself to be a fan of the Star Wars film franchise?
Do you consider yourself to be a fan of the Star Trek franchise?
print(star_wars[fan_col].value_counts(dropna=False))
print()
print(star_wars[trek_col].value_counts(dropna=False))
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

No     641
Yes    427
NaN    118
Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
Do non-fans even watch many Star Wars movies?
Below I find out and put the answer directly into my master dataframe.
notfans = star_wars[star_wars[fan_col] == False]
episode_stats["notfan_count"] = pd.Series(notfans[seen_cols].sum().tolist())
episode_stats["notfan_count"]
0    173
1    108
2    100
3    124
4    220
5    201
Name: notfan_count, dtype: int64
fans = star_wars[star_wars[fan_col] == True]
episode_stats["fan_count"] = pd.Series(fans[seen_cols].sum().tolist())
episode_stats["fan_count"]
0    500
1    463
2    450
3    483
4    538
5    537
Name: fan_count, dtype: int64
Non-fans have not seen many of the episodes.
Episode 5 was seen the most often (220 respondents), but Episode 3 was seen less than half as often (100 respondents).
In contrast, Episode 5 was also the most seen among fans (538 respondents), and while Episode 3 was still seen by the fewest respondents who identified as fans, its viewer count was still proportionally high (450 respondents), only 16% fewer viewers.
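The gap between the most-seen and least-seen episode in each group can be checked with a few lines of arithmetic. The counts are copied from the `notfan_count` and `fan_count` outputs above; `drop_from_max` is just a helper name I made up for this check.

```python
# Counts copied from the notfan_count and fan_count outputs above
notfan_counts = [173, 108, 100, 124, 220, 201]
fan_counts = [500, 463, 450, 483, 538, 537]

def drop_from_max(counts):
    """Percent by which the least-seen episode trails the most-seen one."""
    return round((max(counts) - min(counts)) / max(counts) * 100, 2)

print(drop_from_max(notfan_counts))
print(drop_from_max(fan_counts))
```

For non-fans the least-seen episode trails the most-seen one by more than half, while for fans the drop is only about 16%.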
## calculate the mean fan ranking
episode_stats["fan_ranking"] = pd.Series(round((fans[ranking_cols].mean()),2).tolist())
episode_stats["fan_ranking"]
0    4.14
1    4.34
2    4.42
3    2.93
4    2.33
5    2.83
Name: fan_ranking, dtype: float64
## calculate the mean non-fan ranking
episode_stats["notfan_ranking"] = pd.Series(round((notfans[ranking_cols].mean()),2).tolist())
episode_stats["notfan_ranking"]
0    2.94
1    3.59
2    4.19
3    3.93
4    2.86
5    3.47
Name: notfan_ranking, dtype: float64
episode_stats
| | episode_num | episode_title | seen_count | episode_ranking | male_count | female_count | male_ranking | female_ranking | notfan_count | fan_count | fan_ranking | notfan_ranking |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Episode 1 | Star Wars: Episode I The Phantom Menace | 673 | 3.73 | 361 | 298 | 4.04 | 3.43 | 173 | 500 | 4.14 | 2.94 |
1 | Episode 2 | Star Wars: Episode II Attack of the Clones | 571 | 4.09 | 323 | 237 | 4.22 | 3.95 | 108 | 463 | 4.34 | 3.59 |
2 | Episode 3 | Star Wars: Episode III Revenge of the Sith | 550 | 4.34 | 317 | 222 | 4.27 | 4.42 | 100 | 450 | 4.42 | 4.19 |
3 | Episode 4 | Star Wars: Episode IV A New Hope | 607 | 3.27 | 342 | 255 | 3.00 | 3.54 | 124 | 483 | 2.93 | 3.93 |
4 | Episode 5 | Star Wars: Episode V The Empire Strikes Back | 758 | 2.51 | 392 | 353 | 2.46 | 2.57 | 220 | 538 | 2.33 | 2.86 |
5 | Episode 6 | Star Wars: Episode VI Return of the Jedi | 738 | 3.05 | 387 | 338 | 3.00 | 3.08 | 201 | 537 | 2.83 | 3.47 |
There are 2 modifications I would like to perform on the data to help better visualize some insights:
1. Add columns with percentages
To do this I will define a function
# First I perform the manipulations on one column in my episode stats dataframe
episode_stats["seen_percent"] = round((episode_stats["seen_count"] / episode_stats["seen_count"].sum() * 100), 2)
episode_stats.columns
Index(['episode_num', 'episode_title', 'seen_count', 'episode_ranking', 'male_count', 'female_count', 'male_ranking', 'female_ranking', 'notfan_count', 'fan_count', 'fan_ranking', 'notfan_ranking', 'seen_percent'], dtype='object')
# now that I know how to process one count column into percentages, I wrote a function to convert any count column.
def percents (col_count) :
    episode_stats[col_count.replace("count","percent")] = round((episode_stats[col_count] / episode_stats[col_count].sum() * 100), 2)
## now I store the other columns with counts that I want to change to percentages
seen_bool = episode_stats.columns.str.contains("count")
count_cols = episode_stats.columns[seen_bool]
count_cols
Index(['seen_count', 'male_count', 'female_count', 'notfan_count', 'fan_count'], dtype='object')
## process all count columns into a percent column
for col in count_cols :
    percents(col)
## view new columns
episode_stats[count_cols.str.replace("count","percent")]
| | seen_percent | male_percent | female_percent | notfan_percent | fan_percent |
---|---|---|---|---|---|
0 | 17.27 | 17.01 | 17.50 | 18.68 | 16.83 |
1 | 14.65 | 15.22 | 13.92 | 11.66 | 15.58 |
2 | 14.11 | 14.94 | 13.04 | 10.80 | 15.15 |
3 | 15.58 | 16.12 | 14.97 | 13.39 | 16.26 |
4 | 19.45 | 18.47 | 20.73 | 23.76 | 18.11 |
5 | 18.94 | 18.24 | 19.85 | 21.71 | 18.07 |
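It is worth being clear about what these percentages measure: each value is an episode's share of all views recorded in that column, not the share of respondents who saw it. Here is a sketch of the same calculation as a function that returns a new Series instead of writing into `episode_stats` (`to_percent` is a hypothetical name), checked against the `seen_count` values from the table above.

```python
import pandas as pd

def to_percent(counts: pd.Series) -> pd.Series:
    """Share of the column total held by each row, in percent."""
    return (counts / counts.sum() * 100).round(2)

# seen_count values from the episode_stats table above
seen_count = pd.Series([673, 571, 550, 607, 758, 738])
print(to_percent(seen_count).tolist())
```

A function that returns its result can be reused on any count column, or on a different dataframe, without depending on a global variable.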
Reset Index to the Episode Number
## I don't like the index as 0-5 so I'm going to set it to the Episode Number
episode_stats = episode_stats.set_index("episode_num")
## view a subset of the columns to validate, ie. the female stats
bool_female = episode_stats.columns.str.contains("female")
female_stats = episode_stats.columns[bool_female]
episode_stats[female_stats]
episode_num | female_count | female_ranking | female_percent |
---|---|---|---|
Episode 1 | 298 | 3.43 | 17.50 |
Episode 2 | 237 | 3.95 | 13.92 |
Episode 3 | 222 | 4.42 | 13.04 |
Episode 4 | 255 | 3.54 | 14.97 |
Episode 5 | 353 | 2.57 | 20.73 |
Episode 6 | 338 | 3.08 | 19.85 |
2. Invert Mean Ranking Score
Low ranking scores are better, but visually that doesn't translate.
A film can have a rank up to 6 so I will make my inverted mean rank = 6 - mean rank.
## Create a function to aggregate on
def invert (mean_rank) :
    return(round(6-mean_rank,2))
## identify columns with mean rankings that I want to invert
rank_bool = episode_stats.columns.str.contains("ranking")
rank_cols = episode_stats.columns[rank_bool]
rank_cols
Index(['episode_ranking', 'male_ranking', 'female_ranking', 'fan_ranking', 'notfan_ranking'], dtype='object')
## invert rankings of all relevant columns
for col in rank_cols :
    episode_stats[col+"_inverted"] = invert(episode_stats[col])
episode_stats
episode_num | episode_title | seen_count | episode_ranking | male_count | female_count | male_ranking | female_ranking | notfan_count | fan_count | fan_ranking | ... | seen_percent | male_percent | female_percent | notfan_percent | fan_percent | episode_ranking_inverted | male_ranking_inverted | female_ranking_inverted | fan_ranking_inverted | notfan_ranking_inverted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Episode 1 | Star Wars: Episode I The Phantom Menace | 673 | 3.73 | 361 | 298 | 4.04 | 3.43 | 173 | 500 | 4.14 | ... | 17.27 | 17.01 | 17.50 | 18.68 | 16.83 | 2.27 | 1.96 | 2.57 | 1.86 | 3.06 |
Episode 2 | Star Wars: Episode II Attack of the Clones | 571 | 4.09 | 323 | 237 | 4.22 | 3.95 | 108 | 463 | 4.34 | ... | 14.65 | 15.22 | 13.92 | 11.66 | 15.58 | 1.91 | 1.78 | 2.05 | 1.66 | 2.41 |
Episode 3 | Star Wars: Episode III Revenge of the Sith | 550 | 4.34 | 317 | 222 | 4.27 | 4.42 | 100 | 450 | 4.42 | ... | 14.11 | 14.94 | 13.04 | 10.80 | 15.15 | 1.66 | 1.73 | 1.58 | 1.58 | 1.81 |
Episode 4 | Star Wars: Episode IV A New Hope | 607 | 3.27 | 342 | 255 | 3.00 | 3.54 | 124 | 483 | 2.93 | ... | 15.58 | 16.12 | 14.97 | 13.39 | 16.26 | 2.73 | 3.00 | 2.46 | 3.07 | 2.07 |
Episode 5 | Star Wars: Episode V The Empire Strikes Back | 758 | 2.51 | 392 | 353 | 2.46 | 2.57 | 220 | 538 | 2.33 | ... | 19.45 | 18.47 | 20.73 | 23.76 | 18.11 | 3.49 | 3.54 | 3.43 | 3.67 | 3.14 |
Episode 6 | Star Wars: Episode VI Return of the Jedi | 738 | 3.05 | 387 | 338 | 3.00 | 3.08 | 201 | 537 | 2.83 | ... | 18.94 | 18.24 | 19.85 | 21.71 | 18.07 | 2.95 | 3.00 | 2.92 | 3.17 | 2.53 |
6 rows × 21 columns
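A quick sanity check of the inversion, using the overall mean rankings from the `episode_ranking` column above: subtracting from 6 flips the order, so the episode with the lowest mean rank ends up with the tallest bar. The Roman-numeral index is only for readability in this sketch.

```python
import pandas as pd

# Overall mean rankings from the episode_ranking column above
mean_rank = pd.Series(
    [3.73, 4.09, 4.34, 3.27, 2.51, 3.05],
    index=["I", "II", "III", "IV", "V", "VI"],
)

inverted = (6 - mean_rank).round(2)

# The episode with the lowest mean rank should have the highest inverted score
print(mean_rank.idxmin(), inverted.idxmax())
```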
This is where I get to have fun with plotting :-)
The reason I made the dataframe to store the information is because it gives me the flexibility to sort on various columns and also helps to apply the correct Episode labels on the plots.
## import my seaborn library ... just in case?:
import seaborn as sns
import matplotlib.style as style
style.use('fivethirtyeight')
The bar plot was created using this example from matplotlib.org
fig, ax1 = plt.subplots()
xlabels = ['I', 'II', 'III', 'IV', 'V', 'VI']
fan_inverted = episode_stats["fan_ranking_inverted"].tolist()
notfan_inverted = episode_stats["notfan_ranking_inverted"].tolist()
x = np.arange(len(xlabels)) # the label locations
width = 0.35 # the width of the bars
rects1 = ax1.bar(x - width/2, fan_inverted, width, label='Fans')
rects2 = ax1.bar(x + width/2, notfan_inverted, width, label='Not Fans')
ax1.set_title("Star Wars Favorites: Depends Who You Ask!")
ax1.set_ylabel("Inverted Mean Ranking\n(Highest Relative Value = Favorite)", fontsize='small')
ax1.set_xlabel("Episodes")
ax1.set_xticks(x)
ax1.set_xticklabels(xlabels)
ax1.set_facecolor('white')
ax1.grid(None)
ax1.legend(loc='upper center', bbox_to_anchor=(0.35, 0.95), edgecolor='white', facecolor='white')
plt.show()
So far what we've discovered does not provide much more insight into the data and survey results.
We have ascertained that overall Episode V: The Empire Strikes Back is the most viewed and most popular. Has it become the most viewed because it's the most recommended ... or is it the most liked because it has been viewed the most? A person cannot rank a film they have not seen. Do data scientists have these chicken-or-the-egg and self-fulfilling-prophecy debates about whether observing trends potentially influences and exaggerates those trends?
It was suggested in the guided portion to continue with comparing results between fans and non-fans, Star Trek fan preferences, gender or demographic information. A quick glance regarding this showed:
* As plotted above: non-fans ranked Episode 1 as their second favorite, almost beating Episode 5. Episode 1 was not nearly as popular in any other subgroup, especially not among the self-identified fans.
I am curious to prove or disprove the following statements:
Since many respondents didn't see all episodes, I would like to know what it looks like to plot the ranking counts for each episode instead. I intended to count up all of the #1 rankings for Episode 1, all the #2 rankings for Episode 1, etc. and do the same for each episode.
However, in doing so I noticed something strange: all films had 835 ranking values. How can that be if I already know that some episodes have been viewed more often than others?
I picked Episode 3, because it was among the least viewed, and checked for rankings entered by respondents who had not seen it. I found 286 invalid rankings!
## begin with deriving the ranking counts for just one episode
compare_3 = star_wars[["seen_3","ranking_3"]]
invalid = compare_3.loc[(compare_3["seen_3"] == False) & (compare_3["ranking_3"].notnull())]
invalid
| | seen_3 | ranking_3 |
---|---|---|
9 | False | 3.0 |
16 | False | 2.0 |
50 | False | 5.0 |
59 | False | 6.0 |
61 | False | 3.0 |
... | ... | ... |
1169 | False | 6.0 |
1172 | False | 4.0 |
1176 | False | 3.0 |
1179 | False | 5.0 |
1185 | False | 2.0 |
286 rows × 2 columns
I expect there to be a large difference in the mean scores if I remove the invalid rankings for each episode!
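To illustrate how much a mean can move once rankings without a matching viewing are excluded, here is a minimal sketch on made-up data (the `toy` values are hypothetical). It uses `Series.where` to blank out rankings where the episode was not marked as seen, which is one possible way to do the filtering.

```python
import numpy as np
import pandas as pd

# Hypothetical toy responses using the seen_N / ranking_N naming from above
toy = pd.DataFrame({
    "seen_3":    [True, False, True, False],
    "ranking_3": [2.0,  6.0,   4.0,  5.0],
})

# Keep a ranking only where the episode was marked as seen; others become NaN
valid_rank = toy["ranking_3"].where(toy["seen_3"])

print(toy["ranking_3"].mean())   # mean over all rankings
print(valid_rank.mean())         # mean over rankings backed by a viewing
```

Because `mean()` skips NaN values, the second figure only reflects respondents who said they had seen the episode.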
star_wars_rankings = star_wars.copy()
star_wars_rankings.head(5)
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | True | True | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | True | False | True | True | True | False | False | False | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | True | True | True | True | True | True | True | True | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | True | True | True | True | True | True | True | True | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
To find invalid rankings, I will:
Separate survey responses into those with valid rankings and those with invalid rankings
Check for null responses
## if the respondent said they did not watch any of the episodes then seen 1-6 are all False as expected
star_wars_rankings.loc[(star_wars_rankings[seen_any_col] == False), seen_cols].value_counts()
seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 False False False False False False 250 dtype: int64
## similarly rankings 1-6 are all null as expected (value_counts() skips rows containing nulls, so the result is empty)
star_wars_rankings.loc[(star_wars_rankings[seen_any_col] == False), ranking_cols].value_counts()
Series([], dtype: int64)
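An empty result from value_counts() is consistent with every ranking being null, because rows containing NaN are dropped by default. A more explicit check could look like the sketch below; the `toy` frame is a made-up stand-in for the ranking columns of respondents who saw no films.

```python
import numpy as np
import pandas as pd

# Toy frame: ranking columns that are entirely null (illustrative only)
toy = pd.DataFrame({
    "ranking_1": [np.nan, np.nan],
    "ranking_2": [np.nan, np.nan],
})

# value_counts() skips rows containing NaN, so nothing is counted
counts = toy.value_counts()

# Explicit check that every ranking cell is null
all_null = toy.isnull().all().all()
print(len(counts), all_null)
```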
## initiate a series to store the rank_validity in the dataframe
## dtype is given explicitly to avoid the pandas DeprecationWarning for empty Series
star_wars_rankings["rank_validity"] = pd.Series(dtype="object")
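The function below takes a row from the dataframe as input, sets a string code in rank_validity to designate its validity classification, and returns the row:
all_null: no episodes seen, so the per-episode rankings are all null
all_valid: all six episodes seen, so the per-episode rankings are all kept
some_valid: some episodes seen, and every seen episode is ranked within the number of episodes seen
some_invalid: some episodes seen, but at least one seen episode is ranked outside that range
Rows flagged some_valid still have to be processed to remove rankings where seen is not True.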
# remind ourselves of what the seen_1-6 column data looks like:
star_wars_rankings[seen_cols][:5]
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | |
---|---|---|---|---|---|---|
0 | True | True | True | True | True | True |
1 | False | False | False | False | False | False |
2 | True | True | True | False | False | False |
3 | True | True | True | True | True | True |
4 | True | True | True | True | True | True |
## identify respondents who have seen all and any movies to attribute validity values
bool_all = star_wars_rankings[seen_cols].all(axis=1)
bool_any = star_wars_rankings[seen_cols].any(axis=1)
## check how many respondents saw all or some of the episodes:
print("# Respondents who have seen ALL Episodes: ", len(star_wars_rankings.loc[bool_all]))
## some = any but not all
bool_some = bool_any & (~bool_all)
print("# Respondents who have seen SOME but not all Episodes: ", len(star_wars_rankings.loc[bool_some]))
# Respondents who have seen ALL Episodes: 471
# Respondents who have seen SOME but not all Episodes: 364
Amazingly, we have 471 respondents who have seen all of the episodes!
So there is data to work with regardless of how many potentially valid responses I can pull out of the 364 responses where not all movies were seen.
## quick reference to the seen and ranking columns we have to review for each row
seen_ranking_cols = seen_cols.tolist()+ranking_cols.tolist()
star_wars_rankings.loc[bool_some, seen_ranking_cols][:10]
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
9 | False | True | False | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
16 | False | False | False | True | False | False | 4.0 | 1.0 | 2.0 | 3.0 | 5.0 | 6.0 |
17 | True | True | True | False | False | True | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
21 | True | True | True | True | True | False | 3.0 | 4.0 | 5.0 | 1.0 | 2.0 | 6.0 |
33 | False | True | True | False | False | False | 6.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
50 | True | False | False | True | True | True | 4.0 | 6.0 | 5.0 | 3.0 | 1.0 | 2.0 |
59 | True | True | False | True | True | True | 5.0 | 4.0 | 6.0 | 1.0 | 3.0 | 2.0 |
61 | False | False | False | True | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
76 | False | False | False | False | True | True | 3.0 | 4.0 | 5.0 | 6.0 | 1.0 | 2.0 |
Next I need to clean the rankings: a valid ranking keeps its value, and an invalid ranking is replaced by null.
## I admit I do not understand why, when I pass an individual row to the function, I need to use row[seen].bool()
## but when I apply the function on my dataframe, I need to remove .bool()
## This function:
## 1. sets the rank_validity flag
## 2. removes ranking values that are considered invalid
import numpy as np  ## np.nan is used below; the np.NaN alias was removed in NumPy 2.0

def rerank_df(row):
    seen_count = 0
    ## assume there are some good rankings unless proven otherwise
    row["rank_validity"] = "some_valid"
    ## count how many episodes the respondent has seen
    for seen in seen_cols:
        if row[seen]:
            seen_count += 1
    ## check if no episodes seen - means that per-episode rankings are all null
    if seen_count == 0:
        ## set flag and return rankings unchanged
        row["rank_validity"] = "all_null"
        return row
    ## check if all episodes seen - means that per-episode rankings are valid
    elif seen_count == 6:
        ## set flag and return rankings unchanged
        row["rank_validity"] = "all_valid"
        return row
    ## if some but not all episodes have been seen, iterate through the episodes again
    for (seen, ranking) in zip(seen_cols, ranking_cols):
        ## if the respondent has seen the episode
        if row[seen]:
            ## but the ranking of the episode doesn't make sense because it's outside of the range of # episodes seen
            if float(row[ranking]) > seen_count:
                ## override the ranking with a null value
                row[ranking] = np.nan
                ## ranking has proven to be invalid so flag is overwritten
                row["rank_validity"] = "some_invalid"
        ## if the respondent has not seen the episode then override the ranking with a null value
        else:
            row[ranking] = np.nan
    ## if some rankings are invalid, set all rankings to null
    if row["rank_validity"] == "some_invalid":
        row[ranking_cols] = np.nan
    ## return row with changes
    return row
## apply function to every row; rows where none or all episodes were seen are only flagged, not changed
star_wars_rankings = star_wars_rankings.apply(rerank_df, axis=1)
## check our counts - we expect 471 valid rankings where all episodes were seen
len(star_wars_rankings.loc[star_wars_rankings["rank_validity"] == "all_valid"])
471
## check the results - I expect fewer than 364 valid rankings where only some episodes were seen
len(star_wars_rankings.loc[star_wars_rankings["rank_validity"] == "some_valid"])
282
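On the .bool() question raised in the comment above the function, one likely explanation is the shape of what gets passed in. The sketch below assumes the individual row was pulled out with a slice (the notebook does not show that call, so this is a guess): a slice is still a DataFrame, so a column lookup returns a one-element Series, whereas apply(axis=1) passes each row as a Series, so the same lookup returns a plain scalar. The `toy` frame is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"seen_1": [True, False], "ranking_1": [1.0, 2.0]})

# Slicing keeps the DataFrame shape, so the lookup yields a one-element Series
one_row_frame = toy[0:1]
as_series = one_row_frame["seen_1"]

# apply(axis=1) hands each row over as a Series, so the lookup yields a scalar
scalar_types = toy.apply(lambda row: type(row["seen_1"]).__name__, axis=1)

print(type(as_series).__name__)  # Series
print(as_series.item())          # True
```

A one-element Series has no unambiguous truth value in an if statement, which is why it needs to be unwrapped first. Since .bool() has been deprecated in recent pandas releases, .item() is used here as the unwrapping step.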
## admire the result
star_wars_rankings[bool_some][seen_ranking_cols][:10]
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | NaN | NaN | NaN |
9 | False | True | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN |
16 | False | False | False | True | False | False | NaN | NaN | NaN | NaN | NaN | NaN |
17 | True | True | True | False | False | True | NaN | NaN | NaN | NaN | NaN | NaN |
21 | True | True | True | True | True | False | 3.0 | 4.0 | 5.0 | 1.0 | 2.0 | NaN |
33 | False | True | True | False | False | False | NaN | 1.0 | 2.0 | NaN | NaN | NaN |
50 | True | False | False | True | True | True | 4.0 | NaN | NaN | 3.0 | 1.0 | 2.0 |
59 | True | True | False | True | True | True | 5.0 | 4.0 | NaN | 1.0 | 3.0 | 2.0 |
61 | False | False | False | True | False | False | NaN | NaN | NaN | NaN | NaN | NaN |
76 | False | False | False | False | True | True | NaN | NaN | NaN | NaN | 1.0 | 2.0 |
## if some rankings are invalid then all of that respondent's rankings have been set to null
star_wars_rankings.loc[star_wars_rankings["rank_validity"] == 'some_invalid', seen_ranking_cols][:5]
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | False | True | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN |
16 | False | False | False | True | False | False | NaN | NaN | NaN | NaN | NaN | NaN |
17 | True | True | True | False | False | True | NaN | NaN | NaN | NaN | NaN | NaN |
61 | False | False | False | True | False | False | NaN | NaN | NaN | NaN | NaN | NaN |
142 | False | False | False | False | True | True | NaN | NaN | NaN | NaN | NaN | NaN |
star_wars_rankings.loc[star_wars_rankings["rank_validity"] == 'some_valid', seen_ranking_cols][:5]
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | NaN | NaN | NaN |
21 | True | True | True | True | True | False | 3.0 | 4.0 | 5.0 | 1.0 | 2.0 | NaN |
33 | False | True | True | False | False | False | NaN | 1.0 | 2.0 | NaN | NaN | NaN |
50 | True | False | False | True | True | True | 4.0 | NaN | NaN | 3.0 | 1.0 | 2.0 |
59 | True | True | False | True | True | True | 5.0 | 4.0 | NaN | 1.0 | 3.0 | 2.0 |
star_wars_rankings.loc[star_wars_rankings["rank_validity"] == 'all_valid', seen_ranking_cols][:5]
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
3 | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
4 | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
5 | True | True | True | True | True | True | 1.0 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 |
6 | True | True | True | True | True | True | 6.0 | 5.0 | 4.0 | 3.0 | 1.0 | 2.0 |
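For comparison, the row-by-row cleaning could also be expressed with array operations. This is only a sketch under stated assumptions: it uses a made-up three-episode `toy` frame rather than the survey data, and it is not the method used in this notebook, although it follows the same two rules (null out unseen episodes, and null out the whole response if a seen episode is ranked beyond the number of episodes seen).

```python
import numpy as np
import pandas as pd

seen_cols = ["seen_1", "seen_2", "seen_3"]
ranking_cols = ["ranking_1", "ranking_2", "ranking_3"]

# Toy data with three episodes instead of six (illustrative values only)
toy = pd.DataFrame({
    "seen_1": [True, True, False],
    "seen_2": [True, True, True],
    "seen_3": [True, False, False],
    "ranking_1": [2.0, 1.0, 1.0],
    "ranking_2": [1.0, 2.0, 2.0],
    "ranking_3": [3.0, 3.0, 3.0],
})

seen = toy[seen_cols].to_numpy()
ranks = toy[ranking_cols].to_numpy()
seen_count = seen.sum(axis=1)

# Rule 1: a ranking only counts if the episode was seen
cleaned = np.where(seen, ranks, np.nan)

# Rule 2: a seen episode ranked beyond the number of episodes seen
# invalidates the whole response, so all of its rankings are dropped
out_of_range = (cleaned > seen_count[:, None]).any(axis=1)
cleaned[out_of_range] = np.nan

toy[ranking_cols] = cleaned
toy["rank_validity"] = np.select(
    [seen_count == 0, seen_count == len(seen_cols), out_of_range],
    ["all_null", "all_valid", "some_invalid"],
    default="some_valid",
)
print(toy)
```

Comparisons against NaN evaluate to False, so unseen episodes never trigger the second rule.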
The survey responses for rankings were misleading due to invalid information: the per-episode ranking columns contained rankings for episodes that respondents had not seen.
I presume this is an oversight in the survey design, where all ranking columns had to be filled in unless the respondent indicated in Question 1 that they had not seen ANY of the episodes, in which case null was applied to all per-episode ranking columns.
If respondents had seen any of the episodes, the survey appears to have retained the default rankings of 1-6 for Episodes 1-6 unless the respondent changed them.
I modified the per-episode ranking columns so that:
rankings for episodes the respondent had not seen are null
all of a respondent's rankings are null if any seen episode has a ranking greater than the number of episodes they had seen
rankings from respondents who had seen all six episodes are unchanged
I am not a huge fan of plotting the mean ranking values because this information is both flat (no depth to it) and confusing (a lower value is better).
I prefer to visualize this in a way that also reflects the number of viewings of the episode.
I followed the recommendations in this Towards Data Science article
import seaborn as sns
## store the count of each rank by episode
rank_count = pd.DataFrame()
for (epi, rank) in zip(episodes, ranking_cols):
    rank_count[epi] = star_wars_rankings[rank].value_counts().sort_index()
per_episode_rank_count = rank_count.transpose().iloc[::-1]
per_episode_rank_count
1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | |
---|---|---|---|---|---|---|
Episode 6 | 125 | 219 | 200 | 50 | 29 | 53 |
Episode 5 | 276 | 210 | 101 | 36 | 52 | 16 |
Episode 4 | 190 | 126 | 105 | 64 | 32 | 60 |
Episode 3 | 31 | 38 | 89 | 137 | 114 | 116 |
Episode 2 | 23 | 68 | 68 | 103 | 198 | 84 |
Episode 1 | 109 | 47 | 80 | 173 | 84 | 142 |
At first I had to run the code below TWICE to see the expected plot: after the first run the horizontal bar plot was shrunk although the text was correct, and after the second run the plot resized to fit the figure. The reason is that plt.rcParams settings such as figure.figsize only apply to figures created after they are set, so set_sizes has to be called before plt.subplots().
def set_sizes(fig_size, font_size):
    plt.rcParams["figure.figsize"] = fig_size
    plt.rcParams["font.size"] = font_size
    plt.rcParams["xtick.labelsize"] = font_size
    plt.rcParams["ytick.labelsize"] = font_size+4
    plt.rcParams["axes.labelsize"] = font_size+2
    plt.rcParams["axes.titlesize"] = font_size+6
    plt.rcParams["legend.fontsize"] = font_size+2
## set the sizes BEFORE creating the figure so they take effect on the first run
set_sizes((12,6), 10)
fig, axes = plt.subplots()
per_episode_rank_count.plot.barh(stacked=True, ax=axes, width=0.8)
axes.legend(["#1 Ranked", "#2 Ranked", "#3 Ranked", "#4 Ranked", "#5 Ranked", "#6 Ranked"],
            loc='center right',
            bbox_to_anchor=(0.99,0.6))
axes.set_title("Per Episode Ranking Counts")
axes.text(s=((episode_stats["episode_title"][4])+ " is by far the most seen, most beloved, least disliked"),
          x=9, y=0.9,
          fontsize=14,
          color="white")
axes.text(s=((episode_stats["episode_title"][0])+ " is divided by two forces, both loved and hated"),
          x=9, y=4.9,
          fontsize=14,
          color="white")
plt.show()
## from the article - a future ambition to annotate with values!!
# import matplotlib
# import os
# from dataclasses import dataclass
# from typing import Callable, Tuple
#
# Patch = matplotlib.patches.Patch
# PosVal = Tuple[float, Tuple[float, float]]
# Axis = matplotlib.axes.Axes
# PosValFunc = Callable[[Patch], PosVal]
#
# @dataclass
# class AnnotateBars:
#     font_size: int = 10
#     color: str = "black"
#     n_dec: int = 2
#
#     def horizontal(self, ax: Axis, centered=False):
#         def get_vals(p: Patch) -> PosVal:
#             value = p.get_width()
#             div = 2 if centered else 1
#             pos = (
#                 p.get_x() + p.get_width() / div,
#                 p.get_y() + p.get_height() / 2,
#             )
#             return value, pos
#         ha = "center" if centered else "left"
#         self._annotate(ax, get_vals, ha=ha, va="center")
#
#     def _annotate(self, ax, func: PosValFunc, **kwargs):
#         cfg = {"color": self.color,
#                "fontsize": self.font_size, **kwargs}
#         for p in ax.patches:
#             value, pos = func(p)
#             ax.annotate(f"{value:.{self.n_dec}f}", pos, **cfg)
I prefer my distribution bar plot of ranking counts because it captures more information than just plotting the mean, which could not convey the polarity in opinions on Episode 1. This analysis is more in line with FiveThirtyEight's Top Third / Middle Third / Bottom Third ranking distribution groupings.
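As a rough sketch of that kind of grouping, the rank counts from the table above can be collapsed into thirds and converted to shares. The counts are copied from the per-episode table; the names `rank_counts`, `thirds` and `shares` are hypothetical, and this is not necessarily how FiveThirtyEight computed their groupings.

```python
import pandas as pd

# Rank counts copied from the per-episode table above
rank_counts = pd.DataFrame(
    {
        1: [125, 276, 190, 31, 23, 109],
        2: [219, 210, 126, 38, 68, 47],
        3: [200, 101, 105, 89, 68, 80],
        4: [50, 36, 64, 137, 103, 173],
        5: [29, 52, 32, 114, 198, 84],
        6: [53, 16, 60, 116, 84, 142],
    },
    index=[f"Episode {n}" for n in (6, 5, 4, 3, 2, 1)],
)

# Collapse the six ranks into three groups
thirds = pd.DataFrame({
    "top_third": rank_counts[[1, 2]].sum(axis=1),
    "middle_third": rank_counts[[3, 4]].sum(axis=1),
    "bottom_third": rank_counts[[5, 6]].sum(axis=1),
})

# Convert counts to each episode's share of its own valid rankings
shares = thirds.div(thirds.sum(axis=1), axis=0).round(2)
print(shares)
```

Using shares rather than raw counts keeps episodes with fewer valid rankings comparable to the more widely seen ones.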
In this Towards Data Science article I really liked how the values were annotated on the plot. Ideally my plot would also show these values.
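One possible way to get such annotations without the helper class from the article is matplotlib's Axes.bar_label, available in matplotlib 3.4 and later. The sketch below is an assumption about how it could be applied here, using only three episodes and three ranks copied from the counts table; the `counts` frame and the white label color are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# A subset of the rank counts from the table above (rows: episodes, columns: rank)
counts = pd.DataFrame(
    {1.0: [125, 276, 190], 2.0: [219, 210, 126], 3.0: [200, 101, 105]},
    index=["Episode 6", "Episode 5", "Episode 4"],
)

fig, ax = plt.subplots(figsize=(12, 6))
counts.plot.barh(stacked=True, ax=ax, width=0.8)

# Each rank column is drawn as one bar container;
# label every segment with its count, centered inside the segment
annotations = []
for container in ax.containers:
    annotations.extend(
        ax.bar_label(container, label_type="center", color="white")
    )
```

In a notebook the Agg backend line can be dropped so the figure displays inline.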
With all this done and my lovely stacked bar plot created to reflect how often a valid ranking was even attributed to an episode, I was ready to wrap up and present my project.
THAT'S when I decided to take a look at the information from FiveThirtyEight about the dataset.
THAT'S when I learned they had explicitly considered only the rankings of the 471 respondents who indicated they had seen ALL of the films.
While I do not consider the time spent practicing dataset cleaning and manipulation wasted, I would have preferred the efficiency of learning this by just reading about the dataset first!!!
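For reference, restricting the analysis to respondents who saw every film takes only a couple of lines. This sketch uses a made-up two-episode `toy` frame with the notebook's column naming; it illustrates the filtering idea and is not FiveThirtyEight's actual code.

```python
import pandas as pd

seen_cols = ["seen_1", "seen_2"]
ranking_cols = ["ranking_1", "ranking_2"]

# Toy data: the third respondent has not seen Episode 1 (illustrative only)
toy = pd.DataFrame({
    "seen_1": [True, True, False],
    "seen_2": [True, True, True],
    "ranking_1": [2.0, 1.0, 1.0],
    "ranking_2": [1.0, 2.0, 2.0],
})

# Keep only respondents who have seen every episode
seen_all = toy[seen_cols].all(axis=1)

# Mean ranking per episode among those respondents
mean_rank_all_seen = toy.loc[seen_all, ranking_cols].mean()
print(mean_rank_all_seen)
```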