While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you download from their GitHub repository.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
# Explore the data to look for any strange values
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 rows × 38 columns
# review the column name
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
RespondentID
is "NaN"¶# Select rows where the RespondentID column is not null
star_wars = star_wars[star_wars["RespondentID"].notnull()]
# check again and confirm there's no null values in RespondentID any more
star_wars["RespondentID"].isnull().value_counts(dropna=False)
False 1186 Name: RespondentID, dtype: int64
For the next two columns,
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
Both represent YES/NO
quetions. They can also be NaN
if respondent chooses not to answer a question. We'd like to convert the string value into boolean value True/False
without having to do a string comparison.
# build a dictionary and map in the column
yes_no_dict = {
"Yes":True,
"No":False
}
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"] = star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].map(yes_no_dict)
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].value_counts(dropna=False)
True 936 False 250 Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
# use the same dictionary to map in the column
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"] = star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].map(yes_no_dict)
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].value_counts(dropna=False)
True 552 NaN 350 False 284 Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
For analysis purpose, we need to convert value in a cell into True/False
, besides, column names are too complicated that we will rename them with df.rename(columns=dict)
# carefully check every value you need to map to avoid typo
mapping_dict = {
"Star Wars: Episode I The Phantom Menace":True,
"Star Wars: Episode II Attack of the Clones":True,
"Star Wars: Episode III Revenge of the Sith":True,
"Star Wars: Episode IV A New Hope":True,
"Star Wars: Episode V The Empire Strikes Back":True,
"Star Wars: Episode VI Return of the Jedi":True,
np.NaN:False
}
# use a for-loop to map values one column by another
for i in range(3,9):
star_wars.iloc[:,i] = star_wars.iloc[:,i].map(mapping_dict)
# Check every colmun if all values are corrected as we expect
star_wars.iloc[:,6].value_counts(dropna=False)
True 607 False 579 Name: Unnamed: 6, dtype: int64
# build a column rename dictionary and use df.rename(columns=dict)
rename_dict = {
"Which of the following Star Wars films have you seen? Please select all that apply.":"seen_1",
"Unnamed: 4":"seen_2",
"Unnamed: 5":"seen_3",
"Unnamed: 6":"seen_4",
"Unnamed: 7":"seen_5",
"Unnamed: 8":"seen_6",
}
star_wars = star_wars.rename(columns=rename_dict)
# check again if column name updated
star_wars.columns[3:9]
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
The next 6 columns ask the respondent to rank the Star Wars movies in order of least favorite(6) to most favorite(1). Fotunately, we only need to convert the value into numeric type.
Column names is too complicated so we need to rename with ranking_1
, ranking_2
and so on.
# Convert each of the column to float type
star_wars.iloc[:,9:15] = star_wars.iloc[:,9:15].astype(float)
# Check again if the value is float type now
star_wars.iloc[:,9].value_counts()
4.0 237 6.0 168 3.0 130 1.0 129 5.0 100 2.0 71 Name: Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film., dtype: int64
# rename the ranking columns
rename_dict_2 = {
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.":"ranking_1",
"Unnamed: 10":"ranking_2",
"Unnamed: 11":"ranking_3",
"Unnamed: 12":"ranking_4",
"Unnamed: 13":"ranking_5",
"Unnamed: 14":"ranking_6"
}
star_wars = star_wars.rename(columns=rename_dict_2)
star_wars.columns[9:15]
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'], dtype='object')
Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To achieve this, take the mean of each of the ranking columns using pandas.DataFrame.mean()method on dataframes
# select specified columns, calculate mean value
# make a bar plot
star_wars.iloc[:,9:15].mean().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fc82855b4a8>
Remember that value 1 represents favorite and value 5 represents least favorite. The smaller value means respondents love this movie more.
We have a conclusion that respondents love "Star Wars: Episode V The Empire Strikes Back" most(2.51), and love "Star Wars: Episode III Revenge of the Sith" least(4.34).
# select columns we need
# use a sum() to calculate how many times
# they go to see this movie
# make a bar plot
star_wars.iloc[:,3:9].sum().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7fc8285bf518>
We'd like to split a dataframe into two groups by gender. Go deeper to analyze questions like "the most viewed movie", "the highest-ranked movie" and other statistic separately for each group.
# split a dataframe into two groups by gender
male = star_wars[star_wars["Gender"]=="Male"]
female = star_wars[star_wars["Gender"]=="Female"]
# find the most viewed movie by gender
fig = plt.figure(figsize = (10, 5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.bar(range(1,7),male.iloc[:,3:9].sum())
ax2.bar(range(1,7),female.iloc[:,3:9].sum())
ax1.set_xticklabels(male.columns[3:9], rotation=90)
ax2.set_xticklabels(female.columns[3:9], rotation=90)
ax1.set_title("Most viewed movie by male")
ax2.set_title("Most viewed movie by female")
<matplotlib.text.Text at 0x7fc8263ad7f0>
We can find both genders watched Star Wars: Episode V The Empire Strikes Back
the most, and watched Star Wars: Episode III Revenge of the Sith
the least.
However, the difference in male is quite smaller than female.
# find the highest-ranked movie
fig = plt.figure(figsize = (10, 5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax1.bar(range(1,7),male.iloc[:,9:15].mean())
ax2.bar(range(1,7),female.iloc[:,9:15].mean())
ax1.set_xticklabels(male.columns[9:15], rotation=90)
ax2.set_xticklabels(female.columns[9:15], rotation=90)
ax1.set_title("Highest-ranked movie by male")
ax2.set_title("Highest-ranked movie by female")
<matplotlib.text.Text at 0x7fc8262e2400>
The lower in ranked column, respondents love the moview more.
We can find both genders love Star Wars: Episode V The Empire Strikes Back
the most and love Star Wars: Episode III Revenge of the Sith
the least.
We find one thing interesting, male loves Star Wars: Episode I The Phantom Menace
about 0.5 more than female, while female loves Star Wars: Episode IV A New Hope
about 0.5 more than male.
Maybe if we look deeper in movie content and we can realize the reasons make this difference.
Education
, Location (Census Region)
and Which character shot first?
¶# make a copy to avoid changing original dataset
star_wars_1 = star_wars.copy()
# review the value counts of Education
star_wars_1["Education"].value_counts()
Some college or Associate degree 328 Bachelor degree 321 Graduate degree 275 High school degree 105 Less than high school degree 7 Name: Education, dtype: int64
# Since only 7 respondent are less than high school degree
# we'd like to combine them with high school
star_wars_1 = star_wars_1.replace("Less than high school degree","High school degree or less").replace("High school degree","High school degree or less")
star_wars_1["Education"].value_counts()
Some college or Associate degree 328 Bachelor degree 321 Graduate degree 275 High school degree or less 112 Name: Education, dtype: int64
# split data into 4 groups by education degree
graduate = star_wars_1[star_wars["Education"]=="Graduate degree"]
bachelor = star_wars_1[star_wars_1["Education"]=="Bachelor degree"]
college = star_wars_1[star_wars["Education"]=="Some college or Associate degree"]
high_school = star_wars_1[star_wars_1["Education"]=="High school degree or less"]
# find the most-viewed movie by education degree
fig = plt.figure(figsize = (15, 12))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)
ax1.bar(range(1,7),graduate.iloc[:,3:9].sum())
ax2.bar(range(1,7),bachelor.iloc[:,3:9].sum())
ax3.bar(range(1,7),college.iloc[:,3:9].sum())
ax4.bar(range(1,7),high_school.iloc[:,3:9].sum())
ax1.set_xticklabels(graduate.columns[3:9], rotation=90)
ax2.set_xticklabels(bachelor.columns[3:9], rotation=90)
ax3.set_xticklabels(college.columns[3:9], rotation=90)
ax4.set_xticklabels(high_school.columns[3:9], rotation=90)
ax1.set_title("Highest-ranked movie by graduate degree")
ax2.set_title("Highest-ranked movie by bachelor degree")
ax3.set_title("Highest-ranked movie by college ot associated degree", y=1)
ax4.set_title("Highest-ranked movie by high school degree or less")
plt.tight_layout()
plt.show()
For respondents who are graduate degree and bachelor degree, most of them only saw episode 1, 5 and 6 and seldom saw episode 3.
However, for respondents who are college degree and high school or less degree, they've seen every episode which is very different from previous group.
Maybe we should go deeper for age of respondent to find relationship. Let's find out the age distribution who've seen episode 3 by education degree below.
graduate[graduate["seen_3"] == True]["Age"].value_counts()
45-60 50 30-44 48 > 60 26 18-29 15 Name: Age, dtype: int64
bachelor[bachelor["seen_3"] == True]["Age"].value_counts()
45-60 52 18-29 49 30-44 41 > 60 21 Name: Age, dtype: int64
college[college["seen_3"] == True]["Age"].value_counts()
18-29 59 30-44 46 > 60 45 45-60 33 Name: Age, dtype: int64
high_school[high_school["seen_3"] == True]["Age"].value_counts()
18-29 20 30-44 17 45-60 7 > 60 7 Name: Age, dtype: int64
Among college and high school degree, most respondents who've seen episode 3 are 18-44 years old.
For bachelor degree, those who've seen episode 3 are 45-60 and 18-29. For graduate degree, 30-60 years old respondents have seen episode 3.
I only have these finding and can't make any conclusions until now...