While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
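The data has several columns, including RespondentID, Gender, Age, Household Income, Education, and Location (Census Region), along with two Yes/No questions: 'Have you seen any of the 6 films in the Star Wars franchise?' and 'Do you consider yourself to be a fan of the Star Wars film franchise?'.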
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
stars = pd.read_csv("StarWars.csv", encoding="ISO-8859-1")
stars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?æ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
# The first row holds sub-headers rather than a response, so drop it
stars.drop(index=0, inplace=True)
# The index is left as-is (not reset), so the rows below are numbered from 1
stars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?æ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
Now every row has a RespondentID. Next, we're going to clean the 'Have you seen any of the 6 films in the Star Wars franchise?' and 'Do you consider yourself to be a fan of the Star Wars film franchise?' columns by mapping Yes to True and No to False. Missing answers stay NaN.
(
stars[
"Do you consider yourself to be a fan of the Star Wars film franchise?"
].value_counts(dropna=False)
)
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
(
stars["Have you seen any of the 6 films in the Star Wars franchise?"].value_counts(
dropna=False
)
)
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
mapping = {"Yes": True, "No": False}
stars["Do you consider yourself to be a fan of the Star Wars film franchise?"] = stars[
"Do you consider yourself to be a fan of the Star Wars film franchise?"
].map(mapping)
stars["Have you seen any of the 6 films in the Star Wars franchise?"] = stars[
"Have you seen any of the 6 films in the Star Wars franchise?"
].map(mapping)
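As a small illustration of what the mapping step above does, here is a sketch on a toy Series. The values and the names `answers`, `yes_no`, and `converted` are made up for illustration and are not taken from the survey. It shows that `Series.map` with a dictionary leaves missing answers as NaN rather than turning them into False.

```python
import numpy as np
import pandas as pd

# Toy answers (made up for illustration; not survey data)
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Same Yes/No dictionary as in the cell above
yes_no = {"Yes": True, "No": False}

# Values without a matching key (here: NaN) come back as NaN
converted = answers.map(yes_no)

print(converted)
print("Missing after mapping:", converted.isna().sum())
```

This is why the fan column can still contain NaN after the conversion: respondents who skipped the question are kept as missing.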
The next six columns record whether the respondent has seen each of the six Star Wars movies. For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
We'll need to convert each of these columns to a Boolean, then rename the column something more intuitive. We can convert the values the same way we did earlier, except that we'll need to include the movie title and NaN in the mapping dictionary.
print(stars.iloc[:, 3:9].columns)
print("\n")
print(stars.iloc[:, 9:15].columns)
Index(['Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'], dtype='object')


Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], dtype='object')
# Rename the Seen columns to Seen SW# and Rank columns to Rank SW#
new_columns = [
"RespondentID",
"Have you seen any of the 6 films in the Star Wars franchise?",
"Do you consider yourself to be a fan of the Star Wars film franchise?",
"Seen SWI?",
"Seen SWII?",
"Seen SWIII?",
"Seen SWIV?",
"Seen SWV?",
"Seen SWVI?",
"Rank SWI",
"Rank SWII",
"Rank SWIII",
"Rank SWIV",
"Rank SWV",
"Rank SWVI",
"Please state whether you view the following characters",
"Unnamed: 16",
"Unnamed: 17",
"Unnamed: 18",
"Unnamed: 19",
"Unnamed: 20",
"Unnamed: 21",
"Unnamed: 22",
"Unnamed: 23",
"Unnamed: 24",
"Unnamed: 25",
"Unnamed: 26",
"Unnamed: 27",
"Unnamed: 28",
"Which character shot first?",
"Are you familiar with the Expanded Universe?",
"Do you consider yourself to be a fan of the Expanded Universe?",
"Do you consider yourself to be a fan of the Star Trek franchise?",
"Gender",
"Age",
"Household Income",
"Education",
"Location (Census Region)",
]
stars.columns = new_columns
# Change the values in columns to boolean
# Create the new value mapping dict
mapping = {np.nan: False}
for i in range(3, 9):
movie = stars.iloc[0, i]
mapping[movie] = True
# Map the values
for col in stars.iloc[:, 3:9].columns:
stars[col] = stars[col].map(mapping)
stars
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Seen SWI? | Seen SWII? | Seen SWIII? | Seen SWIV? | Seen SWV? | Seen SWVI? | Rank SWI | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1182 | 3.288389e+09 | True | True | True | True | True | True | True | True | 5 | ... | Very favorably | Han | No | NaN | Yes | Female | 18-29 | $0 - $24,999 | Some college or Associate degree | East North Central |
1183 | 3.288379e+09 | True | True | True | True | True | True | True | True | 4 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Female | 30-44 | $50,000 - $99,999 | Bachelor degree | Mountain |
1184 | 3.288375e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | No | Female | 30-44 | $50,000 - $99,999 | Bachelor degree | Middle Atlantic |
1185 | 3.288373e+09 | True | True | True | True | True | True | True | True | 4 | ... | Very favorably | Han | No | NaN | Yes | Female | 45-60 | $100,000 - $149,999 | Some college or Associate degree | East North Central |
1186 | 3.288373e+09 | True | False | True | True | False | False | True | True | 6 | ... | Very unfavorably | I don't understand this question | No | NaN | No | Female | > 60 | $50,000 - $99,999 | Graduate degree | Pacific |
1186 rows × 38 columns
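Because every non-missing cell in the six "seen" columns holds a movie title, the same True/False result can probably also be reached without building a dictionary. The following is a sketch of that alternative on a toy frame, not the approach used above; the rows and the names `toy` and `seen_flags` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two of the renamed columns (rows are made up, not survey data)
toy = pd.DataFrame(
    {
        "Seen SWI?": [
            "Star Wars: Episode I The Phantom Menace",
            np.nan,
            "Star Wars: Episode I The Phantom Menace",
        ],
        "Seen SWII?": [
            np.nan,
            np.nan,
            "Star Wars: Episode II Attack of the Clones",
        ],
    }
)

# notna() is True where a title is present and False where the cell is NaN
seen_flags = toy.notna()

print(seen_flags)
```

The explicit dictionary used above has one advantage: an unexpected value in a cell would map to NaN and stand out, whereas `notna()` would silently count it as seen.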
The next six columns contain each respondent's ranking of the six Star Wars movies from 1 to 6, with 1 being the favorite and 6 the least favorite. We need to convert them to numerical data. I'm also going to drop the character columns to shorten the DataFrame, since I won't use their data for analysis.
# Convert column data to floats
stars[stars.columns[9:15]] = stars[stars.columns[9:15]].astype(float)
stars[stars.columns[9:15]]
Rank SWI | Rank SWII | Rank SWIII | Rank SWIV | Rank SWV | Rank SWVI | |
---|---|---|---|---|---|---|
1 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
4 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
5 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... |
1182 | 5.0 | 4.0 | 6.0 | 3.0 | 2.0 | 1.0 |
1183 | 4.0 | 5.0 | 6.0 | 2.0 | 3.0 | 1.0 |
1184 | NaN | NaN | NaN | NaN | NaN | NaN |
1185 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 | 1.0 |
1186 | 6.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
1186 rows × 6 columns
# Drop the character favorability columns
cols = stars.columns[15:29]
stars.drop(columns=cols, inplace=True)
stars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Seen SWI? | Seen SWII? | Seen SWIII? | Seen SWIV? | Seen SWV? | Seen SWVI? | Rank SWI | ... | Rank SWVI | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | 6.0 | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | 6.0 | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 3.0 | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 3.0 | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 24 columns
Now we're going to visualize the results as a bar graph to see which movie is the highest ranked.
ranking = stars.iloc[:, 9:15].mean()
ranking.plot(kind="bar", title="Star Wars Movie Ranking", ylim=(0, 6))
plt.xticks(
np.arange(6),
["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"],
)
plt.show()
From the bar graph above, we can see that Star Wars: Episode V The Empire Strikes Back has the lowest mean rank. Since a rank of 1 means favorite, the lowest mean makes it the highest-ranked movie.
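Reading "lowest bar means best movie" off a chart is easy to get backwards, so the same conclusion can be checked numerically. The sketch below uses made-up ranks for hypothetical respondents, not the survey data; `ranks`, `mean_rank`, `best`, and `ordered` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up ranks (1 = favorite); the last respondent did not answer
ranks = pd.DataFrame(
    {
        "Rank SWIV": [2.0, 1.0, 2.0, np.nan],
        "Rank SWV": [1.0, 2.0, 1.0, np.nan],
        "Rank SWVI": [3.0, 3.0, 3.0, np.nan],
    }
)

# mean() skips NaN by default, so non-respondents do not affect the result
mean_rank = ranks.mean()

# The smallest mean rank marks the best-liked movie
best = mean_rank.idxmin()
ordered = mean_rank.sort_values()

print(ordered)
print("Best ranked:", best)
```

On the real data, the equivalent call would be `stars.iloc[:, 9:15].mean().idxmin()`.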
seen = stars.iloc[:, 3:9].sum()
seen.plot(
kind="bar", title="Star Wars Movie Number of Views",
)
plt.xticks(
np.arange(6),
["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"],
)
plt.show()
We can see from the graph that Star Wars: Episode V The Empire Strikes Back was viewed the most.
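The view counts work because summing a Boolean column counts its True values. A minimal sketch on made-up flags (not survey data) shows this, along with the share of respondents per movie; `seen_flags`, `view_counts`, `view_share`, and `most_viewed` are illustrative names.

```python
import pandas as pd

# Made-up Boolean "seen" flags for four hypothetical respondents
seen_flags = pd.DataFrame(
    {
        "Seen SWIV?": [True, False, True, False],
        "Seen SWV?": [True, True, True, False],
        "Seen SWVI?": [True, False, False, False],
    }
)

# True counts as 1, so sum() gives the number of viewers per movie
view_counts = seen_flags.sum()

# mean() of Booleans gives the fraction of respondents who saw each movie
view_share = seen_flags.mean()

# Column label with the largest count
most_viewed = view_counts.idxmax()

print(view_counts)
print("Most viewed:", most_viewed)
```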
# Females ranking and seen times
females = stars[stars["Gender"] == "Female"]
ranking = females.iloc[:, 9:15].mean()
ranking.plot(kind="bar", title="Star Wars Movie Female Ranking", ylim=(0, 6))
plt.xticks(
np.arange(6),
["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"],
)
plt.show()
seen = females.iloc[:, 3:9].sum()
seen.plot(
kind="bar", title="Star Wars Movie Number of Female Views",
)
plt.xticks(
np.arange(6),
["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"],
)
plt.show()
From the bar graphs, we can see that Star Wars: Episode V The Empire Strikes Back is still the most viewed and highest-ranked movie of the Star Wars franchise among female respondents.
# Males ranking and seen times
males = stars[stars["Gender"] == "Male"]
ranking = males.iloc[:, 9:15].mean()
ranking.plot(kind="bar", title="Star Wars Movie Male Ranking", ylim=(0, 6))
plt.xticks(
np.arange(6),
["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"],
)
plt.show()
seen = males.iloc[:, 3:9].sum()
seen.plot(
kind="bar", title="Star Wars Movie Number of Male Views",
)
plt.xticks(
np.arange(6),
["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"],
)
plt.show()
From the bar graphs, we can see that, by a close margin in views, Star Wars: Episode V The Empire Strikes Back is the most viewed and highest-ranked movie of the Star Wars franchise among male respondents.
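The female and male cells above repeat the same steps. As a possible shortcut, pandas `groupby` can compute both summaries per gender in one pass. This is a sketch on made-up data, not the approach used above; the toy rows and the names `toy`, `by_gender`, `views`, and `mean_rank` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up respondents; the last one did not state a gender
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Male", "Female", np.nan],
        "Seen SWV?": [True, True, False, True, True],
        "Rank SWV": [1.0, 2.0, np.nan, 3.0, 1.0],
    }
)

# groupby drops rows with a missing key by default;
# named aggregation builds one output column per summary
by_gender = toy.groupby("Gender").agg(
    views=("Seen SWV?", "sum"),
    mean_rank=("Rank SWV", "mean"),
)

print(by_gender)
```

Respondents with a missing Gender are left out of both groups, which matches the filtering with `stars["Gender"] == "Female"` and `stars["Gender"] == "Male"` used above.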