The aim is to verify whether *The Empire Strikes Back* is really the best *Star Wars* movie out there.
The data consists of responses collected through SurveyMonkey for a survey about the *Star Wars* movies conducted by the team at FiveThirtyEight. It includes the following columns:
Header | Description |
---|---|
RespondentID | An anonymized ID for the respondent (person taking the survey) |
Gender | The respondent's gender |
Age | The respondent's age |
Household Income | The respondent's income |
Education | The respondent's education level |
Location (Census Region) | The respondent's location |
Have you seen any of the 6 films in the Star Wars franchise? | Has a Yes or No response |
Do you consider yourself to be a fan of the Star Wars film franchise? | Has a Yes or No response |
They received 835 total responses, which can be downloaded from their GitHub repository.
## Let's first import all the required libraries :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
## Importing the data into a dataframe :
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
# Exploring the data :
star_wars.head(10)
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3292719380 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1.0 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
6 | 3292684787 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6.0 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
7 | 3292663732 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
8 | 3292654043 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
9 | 3292640424 | Yes | No | NaN | Star Wars: Episode II Attack of the Clones | NaN | NaN | NaN | NaN | 1.0 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
## Let's see all the columns of the dataframe:
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
The columns `Have you seen any of the 6 films in the Star Wars franchise?` and `Do you consider yourself to be a fan of the Star Wars film franchise?` seem to contain `Yes` and `No` values. We can convert these values into Booleans, which will make the data cleaning process much easier.
## First, let's verify our observation :
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
## Now that we have verified the values in the columns, let's convert them into boolean:
yes_no = {'Yes': True, 'No': False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
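The `NaN` count survives the conversion because `Series.map` leaves any value without an entry in the dictionary as missing. Here is a minimal sketch of that behaviour on a toy dataframe; the short column names `seen_any` and `is_fan` are made up for illustration and are not part of the survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the column names are hypothetical.
toy = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes", "Yes"],
    "is_fan": ["Yes", np.nan, "No", "Yes"],
})

yes_no = {"Yes": True, "No": False}

# Apply the same mapping to every Yes/No column in one loop.
for col in ["seen_any", "is_fan"]:
    toy[col] = toy[col].map(yes_no)

# Values with no entry in the dict (here the missing answer) remain NaN.
missing_fans = int(toy["is_fan"].isna().sum())
print(toy)
print("Missing fan answers:", missing_fans)
```

Looping over a list of column names keeps the mapping in one place if more Yes/No columns need the same treatment later.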
We see above that some of the columns have headers like `Unnamed: 4` or `Which of the following Star Wars films have you seen? Please select all that apply.`
Let's change these headers into something better.
## Let's change the column headers :
headers = {'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5',
'Unnamed: 8': 'seen_6'}
star_wars = star_wars.rename(columns=headers)
# checking the headers
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
The values in these columns appear to come from checkboxes, each holding the name of a movie. Rather than storing the names of the movies, we can change these columns to contain Boolean values indicating whether the respondent has seen the movie.
## We see that the first row has all the movies, let's use that to create a dict to change values:
to_change = {star_wars.iloc[0,3] : True, star_wars.iloc[0,4] : True, star_wars.iloc[0,5] : True,
             star_wars.iloc[0,6] : True, star_wars.iloc[0,7] : True, star_wars.iloc[0,8] : True,
             np.nan : False
            }
for series in star_wars.columns[3:9]:
    star_wars[series] = star_wars[series].map(to_change)
# checking the values :
star_wars.head()
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | True | True | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | True | False | True | True | True | False | False | False | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | True | True | True | True | True | True | True | True | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | True | True | True | True | True | True | True | True | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
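Since each checkbox column holds either a movie name or nothing, the same conversion can also be done without building a dictionary from the first row. The sketch below uses `Series.notna()` on a toy dataframe; it assumes that any non-missing value means the film was ticked, and it is an alternative to the approach used above, not a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy checkbox columns: a film title when ticked, NaN otherwise.
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "seen_2": ["Star Wars: Episode II Attack of the Clones", np.nan, np.nan],
})

seen_cols = ["seen_1", "seen_2"]

# Assumption: any non-missing value means the respondent ticked that film.
for col in seen_cols:
    toy[col] = toy[col].notna()

# Booleans sum as 0/1, so a row-wise sum counts the films each respondent has seen.
films_seen_per_respondent = toy[seen_cols].sum(axis=1)
print(toy)
print(films_seen_per_respondent)
```

This version does not depend on the first row containing every title, which may matter if the file is ever sorted or filtered before cleaning.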
## let's look at the next 6 columns :
star_wars.iloc[:, 9:15]
| | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 |
---|---|---|---|---|---|---|
0 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
3 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
4 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... |
1181 | 5.0 | 4.0 | 6.0 | 3.0 | 2.0 | 1.0 |
1182 | 4.0 | 5.0 | 6.0 | 2.0 | 3.0 | 1.0 |
1183 | NaN | NaN | NaN | NaN | NaN | NaN |
1184 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 | 1.0 |
1185 | 6.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
1186 rows × 6 columns
These columns hold the rank that each respondent gave to each of the *Star Wars* movies. The headers of these columns are unintuitive, so let's change them into more suitable ones.
## changing the headers:
i = 1
for header in star_wars.columns[9:15]:
    new_header = 'ranking_{}'.format(i)
    star_wars = star_wars.rename(columns={header : new_header})
    i += 1
# checking the new headers:
star_wars.columns[9:15]
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'], dtype='object')
## Let's look at the column dtype :
star_wars.iloc[:, 9:15].dtypes
ranking_1    float64
ranking_2    float64
ranking_3    float64
ranking_4    float64
ranking_5    float64
ranking_6    float64
dtype: object
rankings = star_wars.iloc[:, 9:15].mean()
rankings.plot.bar(title= 'Movie rankings (Lower is better!)', rot=0,color='red')
We see that `ranking_4` (*Star Wars: Episode IV A New Hope*), `ranking_5` (*Star Wars: Episode V The Empire Strikes Back*) and `ranking_6` (*Star Wars: Episode VI Return of the Jedi*) have been rated the best out of the bunch. These movies are older and tend to have a staunch fan following.
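Reading the chart means remembering which `ranking_N` column belongs to which film. A small sketch of one way to attach titles to the mean ranks is shown below. The mapping for `ranking_1` to `ranking_3` is an assumption (it follows the episode order of the `seen_N` columns), and the numbers are only the eight complete rows visible in the preview earlier, so the result is not representative of the full survey.

```python
import pandas as pd

# Mapping from ranking column to film; ranking_1..ranking_3 are assumed
# to follow the same episode order as the seen_N columns.
film_titles = {
    "ranking_1": "Episode I The Phantom Menace",
    "ranking_2": "Episode II Attack of the Clones",
    "ranking_3": "Episode III Revenge of the Sith",
    "ranking_4": "Episode IV A New Hope",
    "ranking_5": "Episode V The Empire Strikes Back",
    "ranking_6": "Episode VI Return of the Jedi",
}

# The eight complete rows shown in the preview of the ranking columns;
# this is a tiny sample, not the full survey.
sample = pd.DataFrame(
    [
        [3, 2, 1, 4, 5, 6],
        [1, 2, 3, 4, 5, 6],
        [5, 6, 1, 2, 4, 3],
        [5, 4, 6, 2, 1, 3],
        [5, 4, 6, 3, 2, 1],
        [4, 5, 6, 2, 3, 1],
        [4, 3, 6, 5, 2, 1],
        [6, 1, 2, 3, 4, 5],
    ],
    columns=list(film_titles),
)

# Mean rank per film, relabelled with titles and sorted so the best (lowest) comes first.
mean_rank = sample.mean().rename(index=film_titles).sort_values()
print(mean_rank)
```

On the real data, the same two calls (`rename(index=...)` and `sort_values()`) could be applied to `rankings` before plotting so the bars carry film titles.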
movie_seen = star_wars.iloc[:, 3:9].sum()
movie_seen.plot.bar(title= 'How many people have seen the movie', rot=0)
We again see that the older movies have been watched more than the newer ones, adding to our findings from the rankings.
## grouping the data by gender and summing the values in each column:
star_gender = star_wars.groupby('Gender').agg(np.sum)
# Plotting the most watched film:
star_gender.iloc[:,1:8].plot.bar(title = 'Most watched film by gender')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
# Plotting the best ranked film:
star_gender.iloc[:,8:].plot.bar(title = 'The best ranked film by gender (Lower is better!)')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
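The grouped plots above rely on positional slices and on summed values, so a group with more respondents naturally gets larger totals. As a hedged alternative, the sketch below selects columns by name and reports means next to the counts. The toy data and the output column names (`respondents`, `seen_count`, `seen_share`, `mean_rank`) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one "seen" flag and one ranking column, plus a missing gender.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", np.nan],
    "seen_5": [True, True, True, False, True, True],
    "ranking_5": [1.0, 2.0, 1.0, np.nan, 3.0, 2.0],
})

# Named aggregation: each output column is (input column, aggregation).
# Rows with a missing Gender are dropped by groupby by default.
by_gender = toy.groupby("Gender").agg(
    respondents=("seen_5", "size"),
    seen_count=("seen_5", "sum"),
    seen_share=("seen_5", "mean"),    # fraction of the group that has seen the film
    mean_rank=("ranking_5", "mean"),  # average rank, ignoring missing rankings
)
print(by_gender)
```

Shares and mean ranks are comparable across groups of different sizes, which raw sums are not.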
By Gender :
## creating a subset dataframe that contains only fan values:
star_wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']== True]
print ('The number of fans of the Star wars franchise :', star_wars_fan.shape[0])
## grouping the data by whether the respondent is a Star Wars fan:
star_fan = star_wars.groupby('Do you consider yourself to be a fan of the Star Wars film franchise?').agg(np.sum)
# Plotting the most watched film:
star_fan.iloc[:,1:8].plot.bar(title = 'Most watched film by fandom')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
The number of fans of the Star wars franchise : 552
# Plotting the best ranked film:
star_fan.iloc[:,8:].plot.bar(title = 'The best ranked film by fandom (Lower is better!)')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
## creating a subset dataframe that contains only Star trek fan values:
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?']== 'Yes']
# Let's see the number of fans of the franchise:
print ('Number of people who are fans of the Star Trek franchise :', star_trek_fan.shape[0])
## grouping the data by whether the respondent is a Star Trek fan:
star_trek_fan = star_wars.groupby('Do you consider yourself to be a fan of the Star Trek franchise?').agg(np.sum)
# Plotting the most watched film:
star_trek_fan.iloc[:,2:8].plot.bar(title = 'Most watched film by Star Trekkers')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
Number of people who are fans of the Star Trek franchise : 427
Being a fan of the *Star Wars* franchise or of *Star Trek* doesn't seem to change the result. Two films from the original trilogy, *Star Wars: Episode V The Empire Strikes Back* and *Star Wars: Episode VI Return of the Jedi*, remain the most popular *Star Wars* films.
loc = star_wars['Location (Census Region)'].value_counts()
loc.plot.bar(title = 'Number of respondents based on location')
We see that East North Central had the most respondents and East South Central had the fewest respondents for the survey.
star_group = star_wars.groupby('Location (Census Region)').agg(np.sum)
star_group = star_group.drop('RespondentID', axis=1)
# Let's plot the values to visually understand the result better:
star_group.iloc[:,1].plot(kind='bar', title= 'Number of films watched by location')
We see that people from the Pacific region seem more excited about *Star Wars* films, as most of the respondents there had seen the films. On the other hand, East South Central has the fewest viewers.
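One caveat with raw viewer counts is that they depend on how many people responded from each region. A rough sketch of how counts and shares can be compared side by side follows; the toy rows and the column names `region`, `seen_any`, `viewers` and `share` are made up for illustration, and only the region labels are borrowed from the survey.

```python
import pandas as pd

# Toy data: four respondents from one region, two from another.
toy = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Pacific", "Pacific",
               "East South Central", "East South Central"],
    "seen_any": [True, True, False, True, True, True],
})

summary = toy.groupby("region")["seen_any"].agg(
    respondents="size",  # how many people answered from the region
    viewers="sum",       # how many of them have seen at least one film
    share="mean",        # viewers divided by respondents
).sort_values("share", ascending=False)
print(summary)
```

In this toy example the region with fewer viewers has the higher share, which shows why the two measures can rank regions differently.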
star_group.iloc[:,1:7].plot(kind='bar', title= 'Films watched by location')
We again witness the popularity of the older films, as they have been watched more than the newer ones all over the country.
star_wars['Which character shot first?'].value_counts().plot.bar(title='Which character shot first')
The age-old ambiguity about who fired first still holds, though I personally believe that Han fired first at Greedo in the cantina.
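To judge how split the answers are, proportions can be easier to read than raw counts. The sketch below uses `value_counts(normalize=True, dropna=False)` on a handful of invented answers, so respondents who skipped the question are counted as well; the values are illustrative and not taken from the survey.

```python
import numpy as np
import pandas as pd

# Invented answers, including one respondent who skipped the question.
answers = pd.Series(
    ["Han", "Greedo", "Han", "I don't understand this question", np.nan, "Han"],
    name="Which character shot first?",
)

# normalize=True returns fractions; dropna=False keeps the skipped answer as its own row.
shares = answers.value_counts(normalize=True, dropna=False)
print(shares)
```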
Let's condense our findings from above :
From our analysis we clearly see that *Star Wars: Episode V The Empire Strikes Back* and *Star Wars: Episode VI Return of the Jedi* have been rated the best out of the bunch (view rate and rankings). These movies are older and tend to have a staunch fan following (including me).
Genderwise:
Men seem to follow the *Star Wars* franchise more than women.
Fandom Trivia :
*Star Wars: Episode V The Empire Strikes Back* and *Star Wars: Episode VI Return of the Jedi* are unbeatable!
By Location :
Who Shot First :