by Raghav_A
While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 1186 total responses, which can be downloaded from their GitHub repository.
For this project, I'll be cleaning and exploring the data set in a Jupyter notebook and answering some very interesting questions (if you are a Star Wars fan!).
Let's get started.
First, I will import the relevant Python libraries and read the dataset into a pandas DataFrame object -
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
In order to get an understanding of the shape, the column types, and the nature of the data in our dataset, we will explore it using the df.head() and df.info() methods and the df.shape and df.columns attributes -
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
star_wars.shape
(1187, 38)
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
character_names = pd.DataFrame(star_wars.iloc[0,15:29])
character_names
0 | |
---|---|
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Han Solo |
Unnamed: 16 | Luke Skywalker |
Unnamed: 17 | Princess Leia Organa |
Unnamed: 18 | Anakin Skywalker |
Unnamed: 19 | Obi Wan Kenobi |
Unnamed: 20 | Emperor Palpatine |
Unnamed: 21 | Darth Vader |
Unnamed: 22 | Lando Calrissian |
Unnamed: 23 | Boba Fett |
Unnamed: 24 | C-3P0 |
Unnamed: 25 | R2 D2 |
Unnamed: 26 | Jar Jar Binks |
Unnamed: 27 | Padme Amidala |
Unnamed: 28 | Yoda |
star_wars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   RespondentID  1186 non-null  float64
 1   Have you seen any of the 6 films in the Star Wars franchise?  1187 non-null  object
 2   Do you consider yourself to be a fan of the Star Wars film franchise?  837 non-null  object
 3   Which of the following Star Wars films have you seen? Please select all that apply.  674 non-null  object
 4   Unnamed: 4  572 non-null  object
 5   Unnamed: 5  551 non-null  object
 6   Unnamed: 6  608 non-null  object
 7   Unnamed: 7  759 non-null  object
 8   Unnamed: 8  739 non-null  object
 9   Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.  836 non-null  object
 10  Unnamed: 10  837 non-null  object
 11  Unnamed: 11  836 non-null  object
 12  Unnamed: 12  837 non-null  object
 13  Unnamed: 13  837 non-null  object
 14  Unnamed: 14  837 non-null  object
 15  Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.  830 non-null  object
 16  Unnamed: 16  832 non-null  object
 17  Unnamed: 17  832 non-null  object
 18  Unnamed: 18  824 non-null  object
 19  Unnamed: 19  826 non-null  object
 20  Unnamed: 20  815 non-null  object
 21  Unnamed: 21  827 non-null  object
 22  Unnamed: 22  821 non-null  object
 23  Unnamed: 23  813 non-null  object
 24  Unnamed: 24  828 non-null  object
 25  Unnamed: 25  831 non-null  object
 26  Unnamed: 26  822 non-null  object
 27  Unnamed: 27  815 non-null  object
 28  Unnamed: 28  827 non-null  object
 29  Which character shot first?  829 non-null  object
 30  Are you familiar with the Expanded Universe?  829 non-null  object
 31  Do you consider yourself to be a fan of the Expanded Universe?Âæ  214 non-null  object
 32  Do you consider yourself to be a fan of the Star Trek franchise?  1069 non-null  object
 33  Gender  1047 non-null  object
 34  Age  1047 non-null  object
 35  Household Income  859 non-null  object
 36  Education  1037 non-null  object
 37  Location (Census Region)  1044 non-null  object
dtypes: float64(1), object(37)
memory usage: 352.5+ KB
Before we can proceed with the analysis and subsequent visualisation of the data, the dataset needs to be checked for any inconsistencies and bad data that might affect our analysis.
Due to the nature of this data, I decided that it is best to move and inspect the data column-by-column.
Columns 2 and 3 contain only Yes, No and NaN values. For the sake of our analysis, we will convert the Yes values to True, and the No values to False. Also, if a surveyee has not answered the question Do you consider yourself to be a fan of the Star Wars film franchise?, then we will assume that he/she is not a fan of Star Wars, and will change the NaN values to False. We will also remove the one row with a null RespondentID, since it holds the answer labels rather than a survey response. Let's make the changes -
# Removing Null RespondentID rows
star_wars = star_wars[star_wars['RespondentID'].notnull()]
# Displaying the top 6 rows of the first 3 columns...
star_wars[star_wars.columns[:3]].head(6)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | |
---|---|---|---|
1 | 3.292880e+09 | Yes | Yes |
2 | 3.292880e+09 | No | NaN |
3 | 3.292765e+09 | Yes | No |
4 | 3.292763e+09 | Yes | Yes |
5 | 3.292731e+09 | Yes | Yes |
6 | 3.292719e+09 | Yes | Yes |
# Converting the Yes into Boolean True and No & NaN into Boolean False...(for Columns 2 & 3)
cols_1_to_3 = star_wars[star_wars.columns[1:3]].applymap(lambda element: True if element=='Yes' else False).copy()
star_wars[star_wars.columns[1:3]] = cols_1_to_3
# Displaying the top 5 rows of the transformed first 3 columns...
star_wars[star_wars.columns[:3]].head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | |
---|---|---|---|
1 | 3.292880e+09 | True | True |
2 | 3.292880e+09 | False | False |
3 | 3.292765e+09 | True | False |
4 | 3.292763e+09 | True | True |
5 | 3.292731e+09 | True | True |
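The Yes/No conversion above can also be written without a per-element lambda. The sketch below runs on a small made-up frame (the column names seen_any and is_fan are placeholders, not the survey's real headers). It relies on the fact that comparing NaN with 'Yes' gives False, which matches the assumption that an unanswered question means "not a fan".

```python
import numpy as np
import pandas as pd

# Toy stand-in for two Yes/No survey columns (hypothetical names and values).
toy = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes"],
    "is_fan": ["Yes", np.nan, "No"],
})

# Element-wise equality: 'Yes' -> True, 'No' -> False, NaN -> False.
toy_bool = toy.eq("Yes")
print(toy_bool)
```

The trade-off is the same as in the lambda version: a skipped question and an explicit No become indistinguishable after the conversion.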
Columns 4 to 9 contain either NaN values or non-null values (the name of the movie). For the sake of our analysis, we will convert the non-null values to True, and the NaN values to False. We will also rename these columns: Seen_1 for when the surveyee has seen The Phantom Menace (Episode 1), Seen_2 for when the surveyee has seen Attack of the Clones (Episode 2) and so on till Episode 6 (column no. 9). Let's make the changes -
# Displaying the top 5 rows of column indexes 3 to 8...
star_wars[star_wars.columns[3:9]].head()
Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | |
---|---|---|---|---|---|---|
1 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN |
4 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
5 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
# Displaying the Value-Counts of columns 4 to 9...
[star_wars[col].value_counts(dropna=False) for col in star_wars.columns[3:9]]
[Star Wars: Episode I The Phantom Menace    673
 NaN                                        513
 Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64,
 NaN                                           615
 Star Wars: Episode II Attack of the Clones    571
 Name: Unnamed: 4, dtype: int64,
 NaN                                            636
 Star Wars: Episode III Revenge of the Sith     550
 Name: Unnamed: 5, dtype: int64,
 Star Wars: Episode IV A New Hope    607
 NaN                                 579
 Name: Unnamed: 6, dtype: int64,
 Star Wars: Episode V The Empire Strikes Back    758
 NaN                                             428
 Name: Unnamed: 7, dtype: int64,
 Star Wars: Episode VI Return of the Jedi    738
 NaN                                         448
 Name: Unnamed: 8, dtype: int64]
# Assigning Boolean True and False to values of columns 4 to 9
cols_4_to_9 = star_wars[star_wars.columns[3:9]].applymap(lambda element: True if 'Star Wars' in str(element) else False).copy()
star_wars[star_wars.columns[3:9]] = cols_4_to_9
# Displaying top 5 rows of columns 4 to 9...
star_wars[star_wars.columns[3:9]].head()
Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | |
---|---|---|---|---|---|---|
1 | True | True | True | True | True | True |
2 | False | False | False | False | False | False |
3 | True | True | True | False | False | False |
4 | True | True | True | True | True | True |
5 | True | True | True | True | True | True |
# Changing names of columns 4 to 9 to 'Seen_1', 'Seen_2', and so on till 'Seen_6'...
bool_dict = {
'Which of the following Star Wars films have you seen? Please select all that apply.':'Seen_1',
'Unnamed: 4': 'Seen_2',
'Unnamed: 5': 'Seen_3',
'Unnamed: 6': 'Seen_4',
'Unnamed: 7': 'Seen_5',
'Unnamed: 8': 'Seen_6'}
star_wars = star_wars.rename(columns = bool_dict)
# Displaying the top 5 rows of columns 4 to 9 (to view/check changed column names)
star_wars.iloc[:5,3:9]
Seen_1 | Seen_2 | Seen_3 | Seen_4 | Seen_5 | Seen_6 | |
---|---|---|---|---|---|---|
1 | True | True | True | True | True | True |
2 | False | False | False | False | False | False |
3 | True | True | True | False | False | False |
4 | True | True | True | True | True | True |
5 | True | True | True | True | True | True |
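As a cross-check on the 'Star Wars' substring test used above, the same True/False flags can be derived from missingness alone, since each of these columns holds either a film title or NaN. A minimal sketch on made-up data (two columns instead of six, values purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy version of the "which films have you seen" columns (made-up answers).
toy = pd.DataFrame({
    "Seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "Seen_2": ["Star Wars: Episode II Attack of the Clones", np.nan, np.nan],
})

# Flag 1: any non-null answer counts as "seen".
by_missing = toy.notna()

# Flag 2: the substring test, with missing answers treated as False.
by_substring = toy.apply(lambda col: col.str.contains("Star Wars", na=False))

# Percentage of toy respondents who have seen each film.
seen_share = by_missing.mean() * 100
print(seen_share)
```

If the two flag frames ever disagree on the real data, that would point to an unexpected value in one of the columns.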
Columns 10 to 15 contain either NaN values or numeric non-null values (the rank of the movie). We will leave these values as they are, apart from converting them from strings to floats. We will also rename these columns: ranking_1 for when the surveyee has ranked The Phantom Menace (Episode 1) on a scale of 1 to 6 (1 being the best, 6 being the worst), ranking_2 for when the surveyee has ranked Attack of the Clones (Episode 2) and so on till Episode 6 (column no. 15). Let's make the changes -
# Displaying the top 5 rows of columns 10 to 15
star_wars[star_wars.columns[9:15]].head()
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | |
---|---|---|---|---|---|---|
1 | 3 | 2 | 1 | 4 | 5 | 6 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1 | 2 | 3 | 4 | 5 | 6 |
4 | 5 | 6 | 1 | 2 | 4 | 3 |
5 | 5 | 4 | 6 | 2 | 1 | 3 |
# Renaming the columns 10 to 15 as per the names stated below...
bool_dict = {'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'ranking_1',
'Unnamed: 10':'ranking_2',
'Unnamed: 11':'ranking_3',
'Unnamed: 12':'ranking_4',
'Unnamed: 13':'ranking_5',
'Unnamed: 14':'ranking_6'
}
star_wars = star_wars.rename(columns = bool_dict)
star_wars['ranking_1'].dtype
dtype('O')
# converting dtype of columns ranking_1 to ranking_6 from 'Object'(str) to Float64
for i in range(1,7):
    star_wars['ranking_'+str(i)] = star_wars['ranking_'+str(i)].astype(float)
star_wars['ranking_6']
1       6.0
2       NaN
3       6.0
4       3.0
5       3.0
       ...
1182    1.0
1183    1.0
1184    NaN
1185    1.0
1186    5.0
Name: ranking_6, Length: 1186, dtype: float64
# Displaying the Transformed columns 10 to 15...
star_wars[star_wars.columns[9:15]]
ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|
1 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
4 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
5 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... |
1182 | 5.0 | 4.0 | 6.0 | 3.0 | 2.0 | 1.0 |
1183 | 4.0 | 5.0 | 6.0 | 2.0 | 3.0 | 1.0 |
1184 | NaN | NaN | NaN | NaN | NaN | NaN |
1185 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 | 1.0 |
1186 | 6.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
1186 rows × 6 columns
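The column-by-column loop used for the float conversion can also be expressed as a single astype call with a per-column dtype mapping. This is only a sketch on a toy frame (two ranking columns, invented values), not a replacement for the code above:

```python
import numpy as np
import pandas as pd

# Rankings arrive as strings because the first CSV row held text labels.
toy = pd.DataFrame({
    "ranking_1": ["3", np.nan, "1"],
    "ranking_2": ["2", np.nan, "6"],
})

rank_cols = ["ranking_1", "ranking_2"]

# Convert every ranking column to float in one call; NaN stays NaN.
toy = toy.astype({col: float for col in rank_cols})
print(toy.dtypes)
```

Float (rather than int) is the natural target here because the columns still contain NaN for respondents who did not rank the films.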
Columns 16 to 29 contain NaN, Very favorably, Somewhat favorably, Neither favorably nor unfavorably (neutral), Somewhat unfavorably, Very unfavorably and Unfamiliar (N/A) values. These values don't seem to have any bad data in them, so it's best to leave them as they are. The columns, however, should be renamed after the Star Wars character they represent. For instance, column 16 should be renamed to Han Solo, column 17 to Luke Skywalker and so on till column 29. Let's make the changes -
star_wars[star_wars.columns[15:29]].head()
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) |
4 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably |
5 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably |
# Displaying Character Names corresponding to the columns...
character_names
0 | |
---|---|
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Han Solo |
Unnamed: 16 | Luke Skywalker |
Unnamed: 17 | Princess Leia Organa |
Unnamed: 18 | Anakin Skywalker |
Unnamed: 19 | Obi Wan Kenobi |
Unnamed: 20 | Emperor Palpatine |
Unnamed: 21 | Darth Vader |
Unnamed: 22 | Lando Calrissian |
Unnamed: 23 | Boba Fett |
Unnamed: 24 | C-3P0 |
Unnamed: 25 | R2 D2 |
Unnamed: 26 | Jar Jar Binks |
Unnamed: 27 | Padme Amidala |
Unnamed: 28 | Yoda |
# Renaming the columns 16 to 29 as per the dictionary values given below...
character_dict={'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.':'Han Solo',
'Unnamed: 16': 'Luke Skywalker',
'Unnamed: 17': 'Princess Leia Organa',
'Unnamed: 18': 'Anakin Skywalker',
'Unnamed: 19': 'Obi Wan Kenobi',
'Unnamed: 20': 'Emperor Palpatine',
'Unnamed: 21': 'Darth Vader',
'Unnamed: 22': 'Lando Calrissian',
'Unnamed: 23': 'Boba Fett',
'Unnamed: 24': 'C-3P0',
'Unnamed: 25': 'R2 D2',
'Unnamed: 26': 'Jar Jar Binks',
'Unnamed: 27': 'Padme Amidala',
'Unnamed: 28': 'Yoda'
}
star_wars = star_wars.rename(columns = character_dict)
# Displaying top 5 rows of the transformed columns 16 to 29...
star_wars[star_wars.columns[15:29]].head()
Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably |
2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) |
4 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably |
5 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably |
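Typing the rename dictionary by hand works, but the same mapping is already present in the survey's first row (the one captured earlier as character_names). Below is a hedged sketch of building it programmatically, using a toy frame with the same kind of problem; the column name "Favorability question" and the values are placeholders.

```python
import pandas as pd

# Toy frame: row 0 holds the real labels, later rows hold the answers.
toy = pd.DataFrame({
    "Favorability question": ["Han Solo", "Very favorably"],
    "Unnamed: 16": ["Luke Skywalker", "Somewhat favorably"],
})

# Map each current column name to the label stored in the first row.
rename_map = toy.iloc[0].to_dict()

# Drop the label row, then apply the mapping.
toy = toy.iloc[1:].rename(columns=rename_map)
print(toy.columns.tolist())
```

In the notebook this would have to happen before the label row is removed, or use the saved character_names frame instead.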
Let's make the changes -
star_wars[star_wars.columns[29:]].head()
Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|
1 | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
star_wars = star_wars.rename(columns = {star_wars.columns[29]:'Who shot first - Han or Greedo?',
star_wars.columns[31]:'Do you consider yourself to be a fan of the Expanded Universe?'})
star_wars[star_wars.columns[30:33]]
Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | |
---|---|---|---|
1 | Yes | No | No |
2 | NaN | NaN | Yes |
3 | No | NaN | No |
4 | No | NaN | Yes |
5 | Yes | No | No |
... | ... | ... | ... |
1182 | No | NaN | Yes |
1183 | No | NaN | Yes |
1184 | NaN | NaN | No |
1185 | No | NaN | Yes |
1186 | No | NaN | No |
1186 rows × 3 columns
# Changing columns 30 to 32 values to Boolean True and False...
cols_30_to_32 = star_wars[star_wars.columns[30:33]].applymap(lambda value: True if value == 'Yes' else False).copy()
star_wars[star_wars.columns[30:33]] = cols_30_to_32
# Displaying top 5 rows of columns 30 to 32
star_wars[star_wars.columns[30:33]].head()
Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | |
---|---|---|---|
1 | True | False | False |
2 | False | False | True |
3 | False | False | False |
4 | False | False | True |
5 | True | False | False |
In order to get some useful insights from our data, we should isolate our data and consider only those surveyees who have seen all the Star Wars movies. Here, a surveyee counts as having seen all of them if he/she has ranked all 6 movies, i.e. none of the ranking_1 to ranking_6 columns is NaN. We have a total of 834 such surveyees in our dataset (out of 1186 total responders), which is a number we can work with.
So, our first step should be to create a new dataset with only those responders who have ranked all 6 Star Wars movies.
seen_all_movies = star_wars[star_wars[star_wars.columns[9:15]].notnull().all(axis = 1)]
# Displaying top 5 rows of the new dataset comprising responders who've seen all 6 movies...
seen_all_movies.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Seen_1 | Seen_2 | Seen_3 | Seen_4 | Seen_5 | Seen_6 | ranking_1 | ... | Yoda | Who shot first - Han or Greedo? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | True | False | False | Male | 18-29 | NaN | High school degree | South Atlantic |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | False | False | False | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | Very favorably | I don't understand this question | False | False | True | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | Somewhat favorably | Greedo | True | False | False | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | True | True | True | True | True | True | True | True | 1.0 | ... | Very favorably | Han | True | False | True | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
5 rows × 38 columns
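One caveat about this filter: it checks that all six ranking columns are filled in, which is not necessarily the same as having all six Seen_ flags set to True. The respondent at index 3 above, for example, ranked every film but is flagged as having seen only Episodes 1 to 3. The sketch below contrasts the two filters on made-up data with two films instead of six.

```python
import numpy as np
import pandas as pd

# Toy respondents: 0 saw and ranked both films, 1 ranked both but saw one,
# 2 saw and ranked nothing. All values are invented for illustration.
toy = pd.DataFrame({
    "Seen_1": [True, True, False],
    "Seen_2": [True, False, False],
    "ranking_1": [1.0, 2.0, np.nan],
    "ranking_2": [2.0, 1.0, np.nan],
})

# Filter used in this post: every ranking column is non-null.
ranked_all = toy[toy[["ranking_1", "ranking_2"]].notna().all(axis=1)]

# Stricter alternative: every Seen_ flag is True.
seen_all = toy[toy[["Seen_1", "Seen_2"]].all(axis=1)]

print(len(ranked_all), len(seen_all))
```

Which filter is appropriate depends on the question being asked; the results in this post are all based on the ranking-based one.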
A total of 835 respondents have seen at least 1 Star Wars movie.
# Number of surveyees who have seen at least 1 Star Wars movie
star_wars[star_wars.columns[3:9]].any(axis = 1).sum()
835
A total of 834 respondents have ranked all 6 Star Wars movies; these are the respondents treated as having seen all of them.
# Total non-null responders
seen_all_movies.shape[0]
834
Turns out, The Empire Strikes Back is the most viewed Star Wars movie, with about 64% of responders (758 out of 1186) having watched it. See for yourselves -
episode_dict = {'Seen_1': 'Episode 1 The Phantom Menace',
'Seen_2': 'Episode 2 Attack of the Clones',
'Seen_3': 'Episode 3 Revenge of the Sith',
'Seen_4': 'Episode 4 A New Hope',
'Seen_5': 'Episode 5 The Empire Strikes Back',
'Seen_6': 'Episode 6 Return of the Jedi'}
percent_viewers = star_wars[star_wars.columns[3:9]].mean()*100
percent_viewers = round(percent_viewers)
percent_viewers = percent_viewers.rename(episode_dict).sort_index(ascending = False)
percent_viewers.plot.barh()
plt.title('Percentage of responders who have seen Star Wars')
plt.xlabel('Percentage')
plt.show()
# Highest Rated Star Wars Movie -
episode_dict2= {'ranking_1': 'Episode 1 The Phantom Menace',
'ranking_2': 'Episode 2 Attack of the Clones',
'ranking_3': 'Episode 3 Revenge of the Sith',
'ranking_4': 'Episode 4 A New Hope',
'ranking_5': 'Episode 5 The Empire Strikes Back',
'ranking_6': 'Episode 6 Return of the Jedi'}
most_fav = seen_all_movies[seen_all_movies.columns[9:15]]
most_fav = most_fav.applymap(lambda element: element == 1).mean()*100
most_fav = most_fav.rename(episode_dict2).sort_index(ascending = False)
most_fav.plot.barh()
plt.title('Best Star Wars Movie')
plt.xlabel('Percentage')
plt.show()
# Least Favourite Star Wars Movie -
least_fav = seen_all_movies[seen_all_movies.columns[9:15]]
least_fav = least_fav.applymap(lambda element: element == 6).mean()*100
least_fav = least_fav.rename(episode_dict2).sort_index(ascending = False)
least_fav.plot.barh()
plt.title('Least Favourite Star Wars Movie')
plt.xlabel('Percentage')
plt.show()
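Both charts above rest on the same calculation: for each ranking column, the share of respondents who gave that film rank 1 (or the last rank). A compact sketch on invented rankings, with three films and four respondents instead of the real six films:

```python
import pandas as pd

# Each row is one respondent's ranking of three films (a permutation of 1..3).
toy = pd.DataFrame({
    "ranking_1": [3.0, 3.0, 1.0, 2.0],
    "ranking_2": [2.0, 1.0, 3.0, 3.0],
    "ranking_3": [1.0, 2.0, 2.0, 1.0],
})

n_films = toy.shape[1]

# Percentage of respondents who put each film first / last.
top_share = toy.eq(1).mean() * 100
bottom_share = toy.eq(n_films).mean() * 100

print(top_share)
print(bottom_share)
```

Because every respondent picks exactly one favourite, the first-place shares add up to 100 when nobody has missing ranks, which makes a handy sanity check.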
NOTE: For this analysis, I have reviewed only those respondents who've seen all 6 movies (the seen_all_movies dataset), since it only makes sense to review those particular rows.
# Displaying the first few rows of the dataset we will be analysing...
seen_all_movies[seen_all_movies.columns[15:29]].head()
Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably |
3 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) |
4 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably |
5 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably |
6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Neither favorably nor unfavorably (neutral) | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably |
sns.set(style="whitegrid")
char_sequence = ['Emperor Palpatine','Jar Jar Binks','Boba Fett','Lando Calrissian','Padme Amidala','Anakin Skywalker',
'Darth Vader','C-3P0','Princess Leia Organa','Luke Skywalker','R2 D2','Obi Wan Kenobi','Yoda','Han Solo']
favorability = ['Very favorably', 'Somewhat favorably','Neither favorably nor unfavorably (neutral)',
'Somewhat unfavorably','Very unfavorably']
favorable = ['Very Favorable', 'Somewhat Favorable','Neutral',
'Somewhat Unfavorable','Very Unfavorable']
colors = ['green','blue','grey','orange','red']
per_count = []
fig = plt.figure(figsize = (15,5))
for i in range(1,6):
    ax = fig.add_subplot(1,5,i)
    char_favorability = seen_all_movies[seen_all_movies.columns[15:29]]
    char_favorability = char_favorability.applymap(lambda value: True if value==favorability[i-1] else False)
    char_favorability = 100*char_favorability.mean()
    for char in char_sequence:
        per_count.append(char_favorability[char])
    for c in range(1,15):
        ax.text(per_count[c-1]+3,char_sequence[c-1],int(round(per_count[c-1])))
    ax.set_xlim(0,75)
    ax.barh(char_sequence,per_count, color=colors[i-1])
    ax.set_title(favorable[i-1])
    if i>1:
        ax.set_yticklabels([])
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlabel('Percentage')
    per_count = []
# plt.title('Star Wars Characters Favorability Ratings')
male = seen_all_movies[seen_all_movies['Gender']=='Male']
female = seen_all_movies[seen_all_movies['Gender']=='Female']
sns.set(style="whitegrid")
char_sequence = ['Emperor Palpatine','Jar Jar Binks','Boba Fett','Lando Calrissian','Padme Amidala','Anakin Skywalker',
'Darth Vader','C-3P0','Princess Leia Organa','Luke Skywalker','R2 D2','Obi Wan Kenobi','Yoda','Han Solo']
favorability = ['Very favorably','Very unfavorably']
per_count_male = []
per_count_female = []
fig = plt.figure(figsize = (14,7))
for i in range(0,2):
    ax = fig.add_subplot(1,2,i+1)
    unfav_char_male = male[male.columns[15:29]]
    unfav_char_male = unfav_char_male.applymap(lambda value: True if value==favorability[i] else False)
    unfav_char_male = 100*unfav_char_male.mean()
    unfav_char_female = female[female.columns[15:29]]
    unfav_char_female = unfav_char_female.applymap(lambda value: True if value==favorability[i] else False)
    unfav_char_female = 100*unfav_char_female.mean()
    for char in char_sequence:
        per_count_male.append(unfav_char_male[char])
        per_count_female.append(unfav_char_female[char])
    length = np.arange(len(char_sequence))
    width=0.4
    ax.barh(length+0.2, per_count_female, width, label = 'Female')
    ax.barh(length-0.2, per_count_male, width, label = 'Male')
    ax.set_yticks(length)
    ax.set_yticklabels(char_sequence)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    if i == 0:
        plt.title('Most Favourable Star Wars Character (Audience Gender Wise)')
    else:
        plt.title('Most Unfavourable Star Wars Character (Audience Gender Wise)')
        ax.set_yticklabels([])
    per_count_male = []
    per_count_female = []
    ax.set_xlabel('Percentage')
plt.legend()
plt.show()
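The male and female percentages plotted above can also be computed in a single groupby instead of two separately filtered frames. A sketch on made-up answers, with one character column only:

```python
import pandas as pd

# Invented answers for a single character, split by respondent gender.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Yoda": ["Very favorably", "Somewhat favorably",
             "Very favorably", "Very favorably"],
})

# True where the respondent answered 'Very favorably'.
is_very_fav = toy["Yoda"].eq("Very favorably")

# Mean of a boolean Series per group = share of True answers, in percent.
share_by_gender = is_very_fav.groupby(toy["Gender"]).mean() * 100
print(share_by_gender)
```

Respondents with a missing Gender value drop out of the groupby result, just as they drop out of the 'Male' and 'Female' filters.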
Want to know what percentage of American Star Wars viewers are actual fans of Star Wars?
Or Fans of Star Trek?
Or Fans of the Expanded Universe?
Or even know about the Star Wars Expanded Universe?
See for yourselves!
sns.set(style="whitegrid")
fig,ax = plt.subplots(figsize = (5,2))
q_list = ['Do you consider yourself to be a fan of the Star Wars film franchise?',
'Are you familiar with the Expanded Universe?',
'Do you consider yourself to be a fan of the Expanded Universe?',
'Do you consider yourself to be a fan of the Star Trek franchise?']
a_list = []
for i in range(0,4):
    a_list.append(100*seen_all_movies[q_list[i]].mean())
length = np.arange(0,4)
width = 0.8
ax.barh(length,a_list, width)
ax.set_yticks(length)
ax.set_yticklabels(['Fans of Star Wars','Familiar with Expanded Universe','Fans of Expanded Universe',
'Fans of Star Trek'])
ax.set_xlim(0,100)
ax.set_xlabel('Percentage')
Text(0.5, 0, 'Percentage')
And lastly, the graph below depicts the impact historical revisionism can have on a society (for those who "don't understand the question", watch the 1977 original Star Wars - A New Hope and the 1997 special edition of the same movie). For those who aren't interested in watching, you can check out this brief article on Wikipedia that throws light on the controversy surrounding the subtle change made in the 1997 special edition of A New Hope, which showed Han in a less morally ambiguous light (much to some hardcore fans' dismay).
fig,ax = plt.subplots()
# Share of all respondents in seen_all_movies (834 rows), in percent
(100 * seen_all_movies['Who shot first - Han or Greedo?'].value_counts() / len(seen_all_movies)).plot.barh()
ax.set_xlabel('Percentage')
Text(0.5, 0, 'Percentage')
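A note on the denominator: the bars above are shares of every row in seen_all_movies, including respondents who skipped the question. Passing normalize=True to value_counts would instead divide by the number of people who actually answered. The difference, on made-up answers:

```python
import numpy as np
import pandas as pd

# Four invented respondents, one of whom skipped the question.
answers = pd.Series(["Han", "Greedo", "Han", np.nan])

# Share of all respondents (skipped answers stay in the denominator).
share_of_all = answers.value_counts() / len(answers) * 100

# Share of those who answered (NaN is excluded before normalising).
share_of_answered = answers.value_counts(normalize=True) * 100

print(share_of_all)
print(share_of_answered)
```

With the first approach the bars need not add up to 100, which is worth keeping in mind when reading the chart.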
But of course if one is Yoda, one might have a third opinion on this.
Episode 5 - The Empire Strikes Back is by far the favourite movie among the viewers in the Star Wars franchise.
Revenge of the Sith (Episode 3) is the least liked movie among the viewers.
Jar Jar Binks is more hated than the personifications of evil in the galaxy - Darth Vader and Emperor Palpatine (although women view Darth Vader a bit more unfavourably than Jar Jar).
Han Solo and Obi Wan Kenobi are most liked by the male audience, whereas Yoda and R2 D2 are most liked by the female audience.
Even though Anakin Skywalker becomes Darth Vader, the male audience likes Darth Vader more, whereas the female audience likes Anakin Skywalker more. Weird, but interesting.