Star Wars Survey
Reading in the data
import pandas as pd
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
Exploring the data set
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 rows × 38 columns
print(star_wars.columns)
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
star_wars.shape
(1187, 38)
Data cleaning
We will start by removing the rows where RespondentID is null, since every respondent is meant to have a unique ID.
star_wars['RespondentID'].notnull().sum()
1186
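# .copy() makes the filtered frame independent, so later column assignments do not raise SettingWithCopyWarning
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()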
star_wars.shape
(1186, 38)
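Since RespondentID is meant to be unique, it may be worth confirming that after dropping the null row. Here is a minimal sketch on a toy frame; the IDs and the `toy` names are made up for illustration and are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the IDs are hypothetical.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 101.0, 102.0, 103.0],
    "Gender": ["Response", "Male", "Female", "Male"],
})

# Keep only the rows that have an ID.
toy_clean = toy[toy["RespondentID"].notnull()].copy()

# True when no ID appears more than once.
ids_unique = toy_clean["RespondentID"].is_unique
print(toy_clean.shape, ids_unique)
```

The same `is_unique` check could be run on `star_wars['RespondentID']` once the null row is gone.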
We will convert the next two columns from Yes/No to True/False to make them easier to work with. After that, we will rename the columns that record which films were seen and how they were ranked so that they are easier to understand.
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
yes_no = {'Yes': True, 'No': False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
star_wars[star_wars.columns[3]].value_counts(dropna=False)
Star Wars: Episode I The Phantom Menace    673
NaN                                        513
Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64
# Any recorded film title means "seen"; a missing value means "not seen".
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
dic_map = {'Star Wars: Episode I The Phantom Menace': True, 'Star Wars: Episode II Attack of the Clones': True, 'Star Wars: Episode III Revenge of the Sith': True, 'Star Wars: Episode IV A New Hope': True, 'Star Wars: Episode V The Empire Strikes Back': True, 'Star Wars: Episode VI Return of the Jedi': True, np.nan: False}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(dic_map)
    #print(star_wars[col])
star_wars[star_wars.columns[8]].value_counts(dropna=False)
True     738
False    448
Name: Unnamed: 8, dtype: int64
print(star_wars.columns[3:9])
Index(['Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'], dtype='object')
star_wars = star_wars.rename(columns={'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1', 'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5', 'Unnamed: 8': 'seen_6'})
print(star_wars.columns[3:9])
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
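Because every non-null entry in the seen columns is a film title, the dictionary mapping above amounts to a null check. A shorter alternative would be `notnull()`, sketched here on a toy column; `toy_seen` and `title_map` are hypothetical names for illustration.

```python
import numpy as np
import pandas as pd

# Toy column: a film title where the film was seen, NaN otherwise.
toy_seen = pd.Series(
    ["Star Wars: Episode I The Phantom Menace", np.nan,
     "Star Wars: Episode I The Phantom Menace"],
    dtype=object,
)

# Dictionary approach, as in the loop above.
title_map = {"Star Wars: Episode I The Phantom Menace": True, np.nan: False}
mapped = toy_seen.map(title_map)

# Null-check approach: any recorded title counts as "seen".
checked = toy_seen.notnull()

print(mapped.tolist(), checked.tolist())
```

The null check avoids listing all six titles, at the cost of treating any unexpected non-null value as "seen".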
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
print(star_wars.columns[9:15])
Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], dtype='object')
star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1', 'Unnamed: 10': 'ranking_2', 'Unnamed: 11': 'ranking_3', 'Unnamed: 12': 'ranking_4', 'Unnamed: 13': 'ranking_5', 'Unnamed: 14': 'ranking_6'})
print(star_wars.columns[9:15])
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'], dtype='object')
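The two rename dictionaries can also be built from column positions instead of being typed out. This is a sketch on a toy frame with hypothetical column names; for the survey data the slice would be `star_wars.columns[3:9]` or `star_wars.columns[9:15]`.

```python
import pandas as pd

# Toy frame with made-up headers that mimic the survey layout.
toy = pd.DataFrame(columns=["id", "Which films?", "Unnamed: 2", "Unnamed: 3"])

# Pair each column in the slice with a numbered label: seen_1, seen_2, ...
seen_names = {
    old: f"seen_{i}" for i, old in enumerate(toy.columns[1:4], start=1)
}
toy = toy.rename(columns=seen_names)

print(list(toy.columns))
```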
Analyze data
We will start by computing the mean of each ranking column and plotting the means as a bar chart. Then we will compute the sum of each seen column and plot the sums as a bar chart as well.
%matplotlib inline
ranking_mean = star_wars.iloc[:,9:15].mean()
ranking_mean.plot(kind='bar', title='Mean rankings', ylim=(0,5))
[Bar chart: Mean rankings]
From the chart above, the highest-ranked Star Wars movie is Star Wars: Episode V The Empire Strikes Back, since it has the lowest mean ranking (1 is the favorite). The lowest-ranked is Star Wars: Episode III Revenge of the Sith, since it has the highest mean ranking.
seen_sum = star_wars.iloc[:,3:9].sum()
print(seen_sum)
seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies', ylim=(500,800))
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64
[Bar chart: Sum of each seen Star Wars movies]
As the chart shows, Star Wars: Episode V The Empire Strikes Back is the most seen (758 respondents), which I believe may be related to its high ranking. The least seen, unsurprisingly, is Star Wars: Episode III Revenge of the Sith (550 respondents), which may be related to the low ranking it received.
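Raw counts can be easier to compare as shares. The sketch below re-enters the printed counts by hand and divides by the 936 respondents who said they had seen any of the films; using 936 as the denominator is an assumption about which base is most meaningful, and `viewers` and `seen_share` are illustrative names.

```python
import pandas as pd

# Counts copied from the printed seen_sum output.
seen_sum = pd.Series(
    [673, 571, 550, 607, 758, 738],
    index=[f"seen_{i}" for i in range(1, 7)],
)

# Respondents who answered "Yes" to having seen any of the 6 films.
viewers = 936

# Percentage of those viewers who have seen each film.
seen_share = (seen_sum / viewers * 100).round(1)

print(seen_share)
print(seen_share.idxmax(), seen_share.idxmin())
```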
Split the data into two groups by gender
Let's split the data into two groups by gender and repeat the analysis to see whether any interesting patterns emerge.
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
males_ranking_mean = males.iloc[:,9:15].mean()
print(males_ranking_mean)
males_ranking_mean.plot(kind='bar', title='Mean rankings by Males', ylim=(0,5))
ranking_1    4.037825
ranking_2    4.224586
ranking_3    4.274882
ranking_4    2.997636
ranking_5    2.458629
ranking_6    3.002364
dtype: float64
[Bar chart: Mean rankings by Males]
females_ranking_mean = females.iloc[:,9:15].mean()
print(females_ranking_mean)
females_ranking_mean.plot(kind='bar', title='Mean rankings by Females', ylim=(0,5))
ranking_1    3.429293
ranking_2    3.954660
ranking_3    4.418136
ranking_4    3.544081
ranking_5    2.569270
ranking_6    3.078086
dtype: float64
[Bar chart: Mean rankings by Females]
males_seen_sum = males.iloc[:,3:9].sum()
print(males_seen_sum)
males_seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies by Males', ylim=(200,400))
seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64
[Bar chart: Sum of each seen Star Wars movies by Males]
females_seen_sum = females.iloc[:,3:9].sum()
print(females_seen_sum)
females_seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies by Females', ylim=(200,400))
seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64
[Bar chart: Sum of each seen Star Wars movies by Females]
Splitting the data by gender did not change the pattern. In both groups, Star Wars: Episode V The Empire Strikes Back is the highest ranked and the most seen, and Star Wars: Episode III Revenge of the Sith is the lowest ranked and the least seen.
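The gender comparison can also be checked side by side rather than by reading two charts. The sketch below re-enters the printed mean rankings by hand; the `by_gender` frame and its column labels are illustrative and not part of the original analysis.

```python
import pandas as pd

cols = [f"ranking_{i}" for i in range(1, 7)]

# Mean rankings copied from the printed output for each group.
by_gender = pd.DataFrame(
    {
        "male": [4.037825, 4.224586, 4.274882, 2.997636, 2.458629, 3.002364],
        "female": [3.429293, 3.954660, 4.418136, 3.544081, 2.569270, 3.078086],
    },
    index=cols,
)

best = by_gender.idxmin()   # lowest mean ranking = most preferred film
worst = by_gender.idxmax()  # highest mean ranking = least preferred film

print(best, worst, sep="\n")
```

Both groups agree on the best and worst film, although the middle of the order differs: the female mean for ranking_1 is lower than the male mean, while the male mean for ranking_4 is lower than the female mean.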