Star War Survey¶

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you download from their GitHub repository.

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read data into dataframe¶

In [2]:

star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
# Explore the data to look for any strange values
star_wars.head(10)

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	...	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
6	3.292719e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	1	...	Very favorably	Han	Yes	No	Yes	Male	18-29	$25,000 - $49,999	Bachelor degree	Middle Atlantic
7	3.292685e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	6	...	Very favorably	Han	Yes	No	No	Male	18-29	NaN	High school degree	East North Central
8	3.292664e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	4	...	Very favorably	Han	No	NaN	Yes	Male	18-29	NaN	High school degree	South Atlantic
9	3.292654e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Somewhat favorably	Han	No	NaN	No	Male	18-29	$0 - $24,999	Some college or Associate degree	South Atlantic

10 rows × 38 columns

In [3]:

# review the column name
star_wars.columns

Out[3]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

Remove any rows of `RespondentID` is "NaN"¶

In [4]:

# Select rows where the RespondentID column is not null
star_wars = star_wars[star_wars["RespondentID"].notnull()]

# check again and confirm there's no null values in RespondentID any more
star_wars["RespondentID"].isnull().value_counts(dropna=False)

Out[4]:

False    1186
Name: RespondentID, dtype: int64

Convert string "YES" & "NO" to boolean value for analysis¶

For the next two columns,

Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?

Both represent YES/NO quetions. They can also be NaN if respondent chooses not to answer a question. We'd like to convert the string value into boolean value True/False without having to do a string comparison.

In [5]:

# build a dictionary and map in the column
yes_no_dict = {
    "Yes":True, 
    "No":False
}

star_wars["Have you seen any of the 6 films in the Star Wars franchise?"] = star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].map(yes_no_dict)
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].value_counts(dropna=False)

Out[5]:

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [6]:

# use the same dictionary to map in the column 
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"] = star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].map(yes_no_dict)
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].value_counts(dropna=False)

Out[6]:

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

Convert each columns name and values¶

For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.

For analysis purpose, we need to convert value in a cell into True/False, besides, column names are too complicated that we will rename them with df.rename(columns=dict)

In [7]:

# carefully check every value you need to map to avoid typo
mapping_dict = {
    "Star Wars: Episode I  The Phantom Menace":True,
    "Star Wars: Episode II  Attack of the Clones":True,
    "Star Wars: Episode III  Revenge of the Sith":True,
    "Star Wars: Episode IV  A New Hope":True,
    "Star Wars: Episode V The Empire Strikes Back":True,
    "Star Wars: Episode VI Return of the Jedi":True,
    np.NaN:False
}

# use a for-loop to map values one column by another
for i in range(3,9):
    star_wars.iloc[:,i] = star_wars.iloc[:,i].map(mapping_dict)

In [8]:

# Check every colmun if all values are corrected as we expect
star_wars.iloc[:,6].value_counts(dropna=False)

Out[8]:

True     607
False    579
Name: Unnamed: 6, dtype: int64

In [9]:

# build a column rename dictionary and use df.rename(columns=dict)
rename_dict = {
    "Which of the following Star Wars films have you seen? Please select all that apply.":"seen_1",
    "Unnamed: 4":"seen_2",
    "Unnamed: 5":"seen_3",
    "Unnamed: 6":"seen_4",
    "Unnamed: 7":"seen_5",
    "Unnamed: 8":"seen_6",
}

star_wars = star_wars.rename(columns=rename_dict)
# check again if column name updated
star_wars.columns[3:9]

Out[9]:

Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')

Cleaning the Ranking Column¶

The next 6 columns ask the respondent to rank the Star Wars movies in order of least favorite(6) to most favorite(1). Fotunately, we only need to convert the value into numeric type.

Column names is too complicated so we need to rename with ranking_1, ranking_2 and so on.

In [10]:

# Convert each of the column to float type
star_wars.iloc[:,9:15] = star_wars.iloc[:,9:15].astype(float)
# Check again if the value is float type now
star_wars.iloc[:,9].value_counts()

Out[10]:

4.0    237
6.0    168
3.0    130
1.0    129
5.0    100
2.0     71
Name: Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film., dtype: int64

In [11]:

# rename the ranking columns
rename_dict_2 = {
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.":"ranking_1",
    "Unnamed: 10":"ranking_2",
    "Unnamed: 11":"ranking_3",
    "Unnamed: 12":"ranking_4",
    "Unnamed: 13":"ranking_5",
    "Unnamed: 14":"ranking_6"
}

    
star_wars = star_wars.rename(columns=rename_dict_2)
star_wars.columns[9:15]

Out[11]:

Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')

Finding the Highest-Ranked Movie¶

Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To achieve this, take the mean of each of the ranking columns using pandas.DataFrame.mean()method on dataframes

In [12]:

# select specified columns, calculate mean value
# make a bar plot
star_wars.iloc[:,9:15].mean().plot.bar()

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc82855b4a8>

Remember that value 1 represents favorite and value 5 represents least favorite. The smaller value means respondents love this movie more.

We have a conclusion that respondents love "Star Wars: Episode V The Empire Strikes Back" most(2.51), and love "Star Wars: Episode III Revenge of the Sith" least(4.34).

Finding the Most Viewed Movie¶

In [13]:

# select columns we need
# use a sum() to calculate how many times 
# they go to see this movie
# make a bar plot
star_wars.iloc[:,3:9].sum().plot.bar()

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fc8285bf518>

Exploring the Data by Binary Segments¶

We'd like to split a dataframe into two groups by gender. Go deeper to analyze questions like "the most viewed movie", "the highest-ranked movie" and other statistic separately for each group.

In [14]:

# split a dataframe into two groups by gender
male = star_wars[star_wars["Gender"]=="Male"]
female = star_wars[star_wars["Gender"]=="Female"]

In [15]:

# find the most viewed movie by gender
fig = plt.figure(figsize = (10, 5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.bar(range(1,7),male.iloc[:,3:9].sum())
ax2.bar(range(1,7),female.iloc[:,3:9].sum())
ax1.set_xticklabels(male.columns[3:9], rotation=90)
ax2.set_xticklabels(female.columns[3:9], rotation=90)
ax1.set_title("Most viewed movie by male")
ax2.set_title("Most viewed movie by female")

Out[15]:

<matplotlib.text.Text at 0x7fc8263ad7f0>

We can find both genders watched Star Wars: Episode V The Empire Strikes Back the most, and watched Star Wars: Episode III Revenge of the Sith the least.

However, the difference in male is quite smaller than female.

In [16]:

# find the highest-ranked movie
fig = plt.figure(figsize = (10, 5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.bar(range(1,7),male.iloc[:,9:15].mean())
ax2.bar(range(1,7),female.iloc[:,9:15].mean())
ax1.set_xticklabels(male.columns[9:15], rotation=90)
ax2.set_xticklabels(female.columns[9:15], rotation=90)
ax1.set_title("Highest-ranked movie by male")
ax2.set_title("Highest-ranked movie by female")

Out[16]:

<matplotlib.text.Text at 0x7fc8262e2400>

The lower in ranked column, respondents love the moview more.

We can find both genders love Star Wars: Episode V The Empire Strikes Back the most and love Star Wars: Episode III Revenge of the Sith the least.

We find one thing interesting, male loves Star Wars: Episode I The Phantom Menace about 0.5 more than female, while female loves Star Wars: Episode IV A New Hope about 0.5 more than male.

Maybe if we look deeper in movie content and we can realize the reasons make this difference.

Explore more on `Education`, `Location (Census Region)` and `Which character shot first?`¶

In [17]:

# make a copy to avoid changing original dataset
star_wars_1 = star_wars.copy()

In [18]:

# review the value counts of Education
star_wars_1["Education"].value_counts()

Out[18]:

Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64

In [19]:

# Since only 7 respondent are less than high school degree
# we'd like to combine them with high school

star_wars_1 = star_wars_1.replace("Less than high school degree","High school degree or less").replace("High school degree","High school degree or less")
star_wars_1["Education"].value_counts()

Out[19]:

Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree or less          112
Name: Education, dtype: int64

In [20]:

# split data into 4 groups by education degree
graduate = star_wars_1[star_wars["Education"]=="Graduate degree"]
bachelor = star_wars_1[star_wars_1["Education"]=="Bachelor degree"]
college = star_wars_1[star_wars["Education"]=="Some college or Associate degree"]
high_school = star_wars_1[star_wars_1["Education"]=="High school degree or less"]

In [21]:

# find the most-viewed movie by education degree
fig = plt.figure(figsize = (15, 12))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

ax1.bar(range(1,7),graduate.iloc[:,3:9].sum())
ax2.bar(range(1,7),bachelor.iloc[:,3:9].sum())
ax3.bar(range(1,7),college.iloc[:,3:9].sum())
ax4.bar(range(1,7),high_school.iloc[:,3:9].sum())
ax1.set_xticklabels(graduate.columns[3:9], rotation=90)
ax2.set_xticklabels(bachelor.columns[3:9], rotation=90)
ax3.set_xticklabels(college.columns[3:9], rotation=90)
ax4.set_xticklabels(high_school.columns[3:9], rotation=90)
ax1.set_title("Highest-ranked movie by graduate degree")
ax2.set_title("Highest-ranked movie by bachelor degree")
ax3.set_title("Highest-ranked movie by college ot associated degree", y=1)
ax4.set_title("Highest-ranked movie by high school degree or less")
plt.tight_layout()
plt.show()

For respondents who are graduate degree and bachelor degree, most of them only saw episode 1, 5 and 6 and seldom saw episode 3.

However, for respondents who are college degree and high school or less degree, they've seen every episode which is very different from previous group.

Maybe we should go deeper for age of respondent to find relationship. Let's find out the age distribution who've seen episode 3 by education degree below.

In [22]:

graduate[graduate["seen_3"] == True]["Age"].value_counts()

Out[22]:

45-60    50
30-44    48
> 60     26
18-29    15
Name: Age, dtype: int64

In [23]:

bachelor[bachelor["seen_3"] == True]["Age"].value_counts()

Out[23]:

45-60    52
18-29    49
30-44    41
> 60     21
Name: Age, dtype: int64

In [24]:

college[college["seen_3"] == True]["Age"].value_counts()

Out[24]:

18-29    59
30-44    46
> 60     45
45-60    33
Name: Age, dtype: int64

In [25]:

high_school[high_school["seen_3"] == True]["Age"].value_counts()

Out[25]:

18-29    20
30-44    17
45-60     7
> 60      7
Name: Age, dtype: int64

Among college and high school degree, most respondents who've seen episode 3 are 18-44 years old.

For bachelor degree, those who've seen episode 3 are 45-60 and 18-29. For graduate degree, 30-60 years old respondents have seen episode 3.

I only have these finding and can't make any conclusions until now...

In [ ]: