In this project, we address a question about the Star Wars series: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
To answer it, we are going to analyse the data FiveThirtyEight collected by surveying Star Wars fans with the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid", {'axes.grid' : False})
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
The data has several columns, including:
`RespondentID` - An anonymized ID for the respondent (person taking the survey)
`Gender` - The respondent's gender
`Age` - The respondent's age
`Household Income` - The respondent's income
`Education` - The respondent's education level
`Location (Census Region)` - The respondent's location
`Have you seen any of the 6 films in the Star Wars franchise?` - Has a `Yes` or `No` response
`Do you consider yourself to be a fan of the Star Wars film franchise?` - Has a `Yes` or `No` response
`Which of the following Star Wars films have you seen? Please select all that apply.`
We will now check for any strange values in the dataset.
star_wars.head(5)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
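The first row of the result above has `NaN` in `RespondentID`: it holds a second line of header labels (`Response`, film titles, character names) rather than a survey response. We should remove rows with a missing `RespondentID` before proceeding with our analysis.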
star_wars = star_wars[pd.notnull(star_wars["RespondentID"])]
star_wars.head(3)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 rows × 38 columns
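The same filter can also be written with `DataFrame.dropna` restricted to one column. The sketch below uses a small made-up frame (the values are illustrative, not taken from the survey file) to show that only the row with a missing `RespondentID` is removed, while `NaN` values in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data: row 0 mimics the extra header row.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 3292879998.0, 3292879538.0],
    "Gender": ["Response", "Male", np.nan],
})

# Drop rows only where RespondentID is missing; other NaN values are kept.
cleaned = toy.dropna(subset=["RespondentID"])

print(cleaned)
```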
Some of the columns represent `Yes`/`No` questions. Bear in mind that they can also contain `NaN` where a respondent chose not to answer a question. The columns in question are:
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
Let's jump straightaway into cleaning these columns.
# dictionary to define a mapping for each value in the series
# map value Yes to boolean value True and No to False
yes_no = {
"Yes": True,
"No": False
}
# function to map and convert column values to Boolean
def convert_to_bool(col):
    return col.map(yes_no)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = convert_to_bool(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] )
print(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False))
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = convert_to_bool(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'])
print(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False))
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
The values in both columns are now `True`, `False`, or `NaN`.
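`Series.map` with a dictionary returns `NaN` for any value that is not one of the keys, which is why unanswered rows stay `NaN` here. A minimal sketch on made-up answers:

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Illustrative answers, including one missing response.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])
mapped = answers.map(yes_no)

# dropna=False keeps the missing responses visible in the tally.
counts = mapped.value_counts(dropna=False)
print(counts)
```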
If we check the column names, we notice that six columns represent a single checkbox question. The columns are:
`Which of the following Star Wars films have you seen? Please select all that apply.` - Whether or not the respondent saw `Star Wars: Episode I The Phantom Menace`.
`Unnamed: 4` - Whether or not the respondent saw `Star Wars: Episode II Attack of the Clones`.
`Unnamed: 5` - Whether or not the respondent saw `Star Wars: Episode III Revenge of the Sith`.
`Unnamed: 6` - Whether or not the respondent saw `Star Wars: Episode IV A New Hope`.
`Unnamed: 7` - Whether or not the respondent saw `Star Wars: Episode V The Empire Strikes Back`.
`Unnamed: 8` - Whether or not the respondent saw `Star Wars: Episode VI Return of the Jedi`.
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is `NaN`, the respondent either didn't answer or didn't see the movie. We'll convert each of these columns to a Boolean, then rename the column for sanity purposes. 🤓
# mapping dictionary for movies
movie_mapping = {
"Star Wars: Episode I The Phantom Menace": True,
np.nan: False,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True
}
# map values and convert to Boolean values.
# positions 3 through 8 (the slice 3:9) hold the columns in question
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)
# rename columns
star_wars = star_wars.rename(columns={
"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6"
})
star_wars.head(2)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 rows × 38 columns
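Because every non-missing cell in these columns is a film title, the same Boolean conversion can be expressed without a mapping dictionary by testing for presence with `notna()`. This is an alternative sketch on a toy frame, not the approach used above; the column names mirror the renamed ones.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan, np.nan],
    "seen_5": ["Star Wars: Episode V The Empire Strikes Back",
               "Star Wars: Episode V The Empire Strikes Back", np.nan],
})

# A title means the film was selected; NaN means it was not (or no answer).
seen = toy.notna()

# Summing Booleans counts the respondents who saw each film.
views = seen.sum()
print(views)
```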
Next, we have the columns in which respondents rank the Star Wars films in order of preference, 1 being the most favorite and 6 being the least favorite. The following columns hold the rankings:
`Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.` - How much the respondent liked `Star Wars: Episode I The Phantom Menace`
`Unnamed: 10` - How much the respondent liked `Star Wars: Episode II Attack of the Clones`
`Unnamed: 11` - How much the respondent liked `Star Wars: Episode III Revenge of the Sith`
`Unnamed: 12` - How much the respondent liked `Star Wars: Episode IV A New Hope`
`Unnamed: 13` - How much the respondent liked `Star Wars: Episode V The Empire Strikes Back`
`Unnamed: 14` - How much the respondent liked `Star Wars: Episode VI Return of the Jedi`
We'll convert each column to a numeric type and then rename the columns. These columns sit at positions 9 through 14 (the slice `9:15`).
# Convert each of the columns above to a float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
# rename the columns
star_wars = star_wars.rename(columns={
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
star_wars.head(2)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 rows × 38 columns
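`astype(float)` raises an error if a cell holds text that cannot be parsed as a number. If that is a concern, `pd.to_numeric` with `errors="coerce"` turns such cells into `NaN` instead. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Illustrative ranking column stored as strings, with one stray label.
raw = pd.Series(["3", "1", np.nan, "Response"])

# Unparseable entries become NaN rather than raising ValueError.
ranks = pd.to_numeric(raw, errors="coerce")
print(ranks)
```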
# find the highest-ranked movie by taking the mean rank of each film (lower is better)
star_wars[star_wars.columns[9:15]].mean()
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64
# plot the mean values
star_wars[star_wars.columns[9:15]].mean().plot(kind='bar')
sns.despine()
plt.show()
From the plot, we can say that `ranking_5` has the lowest mean rank, i.e. `Star Wars: Episode V The Empire Strikes Back` is the favorite movie. Remember that the ranking values run from 1 through 6, where 1 means the film was the respondent's favorite and 6 means it was the least favorite, so a lower mean is better.
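Instead of reading the winner off the bar chart, `Series.idxmin` returns the label with the smallest mean. The sketch below re-enters the mean ranks printed earlier by hand rather than recomputing them from the survey file.

```python
import pandas as pd

# Mean ranks copied from the printed output (lower is better).
mean_ranks = pd.Series({
    "ranking_1": 3.732934,
    "ranking_2": 4.087321,
    "ranking_3": 4.341317,
    "ranking_4": 3.272727,
    "ranking_5": 2.513158,
    "ranking_6": 3.047847,
})

best = mean_ranks.idxmin()          # label of the best-ranked film
ordered = mean_ranks.sort_values()  # best to worst
print(best)
print(list(ordered.index))
```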
We have already cleaned up the seen columns and converted their values to the Boolean type. Now let's find the most viewed movie from the series.
# positions 3 through 8 (the slice 3:9) hold the seen columns
star_wars[star_wars.columns[3:9]].sum()
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64
# plot the values
star_wars[star_wars.columns[3:9]].sum().plot(kind='bar')
sns.despine()
plt.show()
`seen_5`, or `Star Wars: Episode V The Empire Strikes Back`, is the most viewed movie: more respondents watched it than any other movie in the Star Wars series. This may help explain why it is also the highest-ranked movie, although the view counts alone cannot show that one causes the other.
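Raw counts are easier to compare as shares. Dividing the view counts printed earlier by the 936 respondents who said they had seen at least one film gives the fraction of viewers who saw each episode. The numbers are typed in from the printed outputs, and the variable names are only illustrative.

```python
import pandas as pd

views = pd.Series({
    "seen_1": 673, "seen_2": 571, "seen_3": 550,
    "seen_4": 607, "seen_5": 758, "seen_6": 738,
})
seen_any = 936  # respondents who answered True to having seen any film

share = (views / seen_any).round(3)
print(share)
```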
Let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
`Do you consider yourself to be a fan of the Star Wars film franchise?` - `True` or `False`
`Do you consider yourself to be a fan of the Star Trek franchise?` - `Yes` or `No`
`Gender` - `Male` or `Female`
We can compute the most viewed movie, the highest-ranked movie, and other statistics separately for each group.
# split the data into two groups based on gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
# Highest-ranked movie - Male respondents and plot the values
print("Highest-ranked movie - Male respondents \n\n",males[males.columns[9:15]].mean())
males[males.columns[9:15]].mean().plot(kind='bar', title="Movie ranking by Male respondents")
plt.show()
# Highest-ranked movie - Female respondents and plot the values
print("Highest-ranked movie - Female respondents \n\n",females[females.columns[9:15]].mean())
females[females.columns[9:15]].mean().plot(kind='bar', title="Movie ranking by Female respondents")
sns.despine()
plt.show()
Highest-ranked movie - Male respondents
ranking_1    4.037825
ranking_2    4.224586
ranking_3    4.274882
ranking_4    2.997636
ranking_5    2.458629
ranking_6    3.002364
dtype: float64
Highest-ranked movie - Female respondents
ranking_1    3.429293
ranking_2    3.954660
ranking_3    4.418136
ranking_4    3.544081
ranking_5    2.569270
ranking_6    3.078086
dtype: float64
# Most viewed movie - Male respondents, and plot the values
print("Most viewed - Male respondents\n\n",males[males.columns[3:9]].sum())
males[males.columns[3:9]].sum().plot(kind='bar',title="Most viewed movie by Male respondents")
plt.show()
# Most viewed movie - Female respondents, and plot the values
print("Most viewed - Female respondents\n\n",females[females.columns[3:9]].sum())
females[females.columns[3:9]].sum().plot(kind='bar',title="Most viewed movie by Female respondents")
sns.despine()
plt.show()
Most viewed - Male respondents
seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64
Most viewed - Female respondents
seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64
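From the plots, episode 5 received the best mean rank and the most views from both men and women. More men than women watched episodes 1-3, yet men ranked episodes 1 and 2 lower than women did (mean ranks of 4.04 and 4.22 versus 3.43 and 3.95); for episode 3 the two groups are close (4.27 versus 4.42). Episodes 5 and 6 have the most views among both men and women.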
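The two filtered frames can also be replaced by a single `groupby`, which returns one row of statistics per gender. A minimal sketch on made-up rows (the column names follow the renamed ones, the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", np.nan],
    "ranking_5": [1.0, 3.0, 2.0, 4.0, 6.0],
    "seen_5": [True, True, True, False, False],
})

# Rows with a missing Gender are excluded by groupby by default.
by_gender = toy.groupby("Gender").agg(
    mean_rank=("ranking_5", "mean"),
    views=("seen_5", "sum"),
)
print(by_gender)
```

Keeping both groups in one table makes it easier to compare them side by side or to plot them on shared axes.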
# analysis based on column-
# Do you consider yourself to be a fan of the Star Wars film franchise?
# rename the column in both male and female dataset we grouped in the previous step
male_fans = males.rename(columns={
'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})
# count NaN values (no answer) as False rather than dropping them
print(male_fans['fan_or_not'].value_counts(dropna=False))
male_fans['fan_or_not']= male_fans['fan_or_not'].fillna(False)
print('\nafter removing NaN values\n',male_fans['fan_or_not'].value_counts(dropna=False,normalize=True))
# plot values
male_fans['fan_or_not'].value_counts(dropna=False,normalize=True).plot(kind='bar', title='Star Wars Fan or not - Male')
sns.despine()
plt.show()
True     303
False    120
NaN       74
Name: fan_or_not, dtype: int64
after removing NaN values
True     0.609658
False    0.390342
Name: fan_or_not, dtype: float64
female_fans = females.rename(columns={
'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})
# count NaN values (no answer) as False rather than dropping them
print(female_fans['fan_or_not'].value_counts(dropna=False))
female_fans['fan_or_not']= female_fans['fan_or_not'].fillna(False)
print('\nafter removing NaN values\n',female_fans['fan_or_not'].value_counts(dropna=False,normalize=True))
# plot values
female_fans['fan_or_not'].value_counts(dropna=False,normalize=True).plot(kind='bar', title='Star Wars Fan or not - Female')
sns.despine()
plt.show()
True     238
False    159
NaN      152
Name: fan_or_not, dtype: int64
after removing NaN values
False    0.566485
True     0.433515
Name: fan_or_not, dtype: float64
# combined plot - Gender, fan_or_not
all_fans =star_wars.rename(columns={
'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})
# considering NaN values as False
all_fans['fan_or_not']= all_fans['fan_or_not'].fillna(False)
# group by columns in question and plot
fans_by_gender=all_fans.groupby(['Gender','fan_or_not']).size()
df=fans_by_gender.unstack()
df.plot(kind='bar', title="Are you a fan of Star Wars?")
sns.despine()
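With unanswered responses counted as `False`, the plots above show that about 61% of male respondents consider themselves fans of the Star Wars series, compared with about 43% of female respondents.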
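How the unanswered responses are treated changes the picture. Using the counts printed earlier for each gender, the sketch below computes the share of fans twice: once counting `NaN` as not a fan (as in the code above) and once among only those who answered. The column labels `fan`, `not_fan` and `no_answer` are names chosen for this example.

```python
import pandas as pd

# Counts copied from the value_counts outputs for each gender.
counts = pd.DataFrame(
    {"fan": [303, 238], "not_fan": [120, 159], "no_answer": [74, 152]},
    index=["Male", "Female"],
)

# Share of fans when missing answers count as "not a fan".
share_all = counts["fan"] / counts.sum(axis=1)

# Share of fans among respondents who actually answered the question.
share_answered = counts["fan"] / (counts["fan"] + counts["not_fan"])

print(share_all.round(3))
print(share_answered.round(3))
```

Among the women who answered, roughly 60% call themselves fans (versus about 72% of men), so the gap between the genders narrows but does not disappear.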
# check the values in the column
star_wars['Education'].value_counts()
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64
# rename the column
star_wars = star_wars.rename(columns={
'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})
# create a pivot table
ranking_by_education = star_wars.pivot_table(index="Education", values=star_wars.columns[9:15])
print(ranking_by_education)
# plot the data
ranking_by_education.plot(kind='bar', title='Ranking by education', figsize=(20,10),fontsize=10)
sns.despine()
plt.show()
# sns heatmap
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranking_by_education, annot=True, linewidths=.5, ax=ax)
ax.set_title('Ranking by education')
Education | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 |
---|---|---|---|---|---|---|
Bachelor degree | 3.828244 | 4.290076 | 4.521073 | 3.114504 | 2.309160 | 2.931298 |
Graduate degree | 3.822222 | 4.225664 | 4.500000 | 3.199115 | 2.323009 | 2.920354 |
High school degree | 3.802817 | 3.746479 | 4.126761 | 3.211268 | 2.873239 | 3.239437 |
Less than high school degree | 5.000000 | 5.333333 | 3.666667 | 2.666667 | 1.000000 | 3.333333 |
Some college or Associate degree | 3.551181 | 3.885827 | 4.102362 | 3.503937 | 2.783465 | 3.173228 |
views_by_education = star_wars.pivot_table(index="Education", values=star_wars.columns[3:9])
print(views_by_education)
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_by_education*100, annot=True,fmt='.1f' ,linewidths=.5, ax=ax)
ax.set_title('Views by education')
Education | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 |
---|---|---|---|---|---|---|
Bachelor degree | 0.641745 | 0.529595 | 0.507788 | 0.607477 | 0.757009 | 0.728972 |
Graduate degree | 0.650909 | 0.541818 | 0.505455 | 0.592727 | 0.752727 | 0.730909 |
High school degree | 0.542857 | 0.457143 | 0.457143 | 0.504762 | 0.580952 | 0.571429 |
Less than high school degree | 0.428571 | 0.428571 | 0.428571 | 0.428571 | 0.428571 | 0.428571 |
Some college or Associate degree | 0.643293 | 0.567073 | 0.557927 | 0.548780 | 0.692073 | 0.679878 |
The data above shows that respondents with less than a high school education gave episode 5 the best mean rank (1.0), but only 43% of them watched it. This group contains just 7 respondents, so its averages rest on very few answers. In contrast, almost 76% of respondents with a bachelor's degree watched episode 5, and they gave it a mean rank of 2.3.
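A mean says little when a group is tiny, so it helps to print the number of answers next to each average. The sketch below does this with `groupby(...).agg` on made-up data; the education labels are reused from the survey, the ranking values are not.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Bachelor degree"] * 4 + ["Less than high school degree"],
    "ranking_5": [2.0, 3.0, 1.0, 2.0, 1.0],
})

# Mean and count side by side; count ignores missing rankings.
summary = toy.groupby("Education")["ranking_5"].agg(["mean", "count"])
print(summary)
```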
# check values in location column
star_wars['Location (Census Region)'].value_counts()
East North Central    181
Pacific               175
South Atlantic        170
Middle Atlantic       122
West South Central    110
West North Central     93
Mountain               79
New England            75
East South Central     38
Name: Location (Census Region), dtype: int64
ranking_by_location = star_wars.pivot_table(index="Location (Census Region)", values=star_wars.columns[9:15])
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranking_by_location, annot=True, linewidths=.5, ax=ax)
ax.set_title('Ranking by region')
plt.show()
# views by location
views_location = star_wars.pivot_table(index="Location (Census Region)", values=star_wars.columns[3:9])
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_location, annot=True, linewidths=.5, ax=ax)
ax.set_title('Views by region')
plt.show()
From our analysis of the location data, we see that respondents across all the regions gave episode 5 a better ranking than the other episodes. Approximately 82% of respondents from the East South Central region viewed episode 5. Taking a closer look, the data shows that more than 50% of respondents watched episodes 1, 4, 5 and 6 across the regions.
Which character shot first?
# check the values in the column
star_wars['Which character shot first?'].value_counts(dropna=False)
NaN                                 358
Han                                 325
I don't understand this question    306
Greedo                              197
Name: Which character shot first?, dtype: int64
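# replace NaN values with the "I don't understand this question" label
# (assign the result back; calling fillna with inplace=True on a selected column may not update the DataFrame)
star_wars['Which character shot first?'] = star_wars['Which character shot first?'].fillna("I don't understand this question")
print(star_wars['Which character shot first?'].value_counts(normalize=True))
star_wars['Which character shot first?'].value_counts(normalize=True).plot(kind='bar', title='Who shot first - all respondents')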
sns.despine()
plt.show()
I don't understand this question    0.559865
Han                                 0.274030
Greedo                              0.166105
Name: Which character shot first?, dtype: float64
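Folding `NaN` into "I don't understand this question" mixes two different groups. One alternative, sketched here with the counts from the first `value_counts` output for this column, is to give missing answers their own label so both shares remain visible. The label `No answer` is a placeholder chosen for this example.

```python
import numpy as np
import pandas as pd

# Rebuild a column with the same value counts as the original column.
shot_first = pd.Series(
    [np.nan] * 358 + ["Han"] * 325
    + ["I don't understand this question"] * 306 + ["Greedo"] * 197
)

# fillna returns a new Series; assign it back instead of using inplace=True.
labelled = shot_first.fillna("No answer")
shares = labelled.value_counts(normalize=True).round(3)
print(shares)
```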
Which character shot first?: Response based on Gender
star_wars['Which character shot first?'] = star_wars['Which character shot first?'].fillna("I don't understand this question")
print("Who shot first? - all respondents \n",star_wars['Which character shot first?'].value_counts())
grouped=star_wars.groupby(['Gender','Which character shot first?']).size()
df=grouped.unstack()
df.plot(kind='bar')
sns.despine()
# sns.set_style( {'axes.grid' : False})
Who shot first? - all respondents
I don't understand this question    664
Han                                 325
Greedo                              197
Name: Which character shot first?, dtype: int64
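Among both male and female respondents, `Han` was the character most often named as shooting first; however, more female respondents said they did not understand the question (a category that here also includes the respondents who gave no answer).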
star_wars['Age'].value_counts(normalize=True)
45-60    0.278203
> 60     0.257170
30-44    0.256214
18-29    0.208413
Name: Age, dtype: float64
# views by age
views_by_age = star_wars.pivot_table(index="Age", values=star_wars.columns[3:9])
print(views_by_age)
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_by_age*100, annot=True,fmt='.1f' ,linewidths=.5, ax=ax)
ax.set_title('Views by Age')
Age | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 |
---|---|---|---|---|---|---|
18-29 | 0.733945 | 0.678899 | 0.665138 | 0.697248 | 0.733945 | 0.733945 |
30-44 | 0.652985 | 0.589552 | 0.567164 | 0.656716 | 0.735075 | 0.735075 |
45-60 | 0.621993 | 0.508591 | 0.487973 | 0.567010 | 0.756014 | 0.721649 |
> 60 | 0.531599 | 0.394052 | 0.371747 | 0.386617 | 0.624535 | 0.587361 |
We can see that at least 66% of respondents aged 18-29 watched each of the episodes, and 73.4% of them watched episode 5. The series was least watched by respondents above 60 years of age; even so, 62.5% of them watched episode 5, the highest share for any episode in this age group.
More than 73% of respondents in the age groups 18-29, 30-44 and 45-60 watched episode 5. In every age group, episode 5 was the most viewed episode or tied for most viewed: it ties with episodes 1 and 6 among 18-29-year-olds and with episode 6 among 30-44-year-olds.
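`idxmax` reports only the first column when several share the maximum, so ties are easy to miss. The sketch below re-enters the views-by-age table printed earlier and lists every episode that reaches each age group's maximum.

```python
import pandas as pd

# Shares copied from the printed views-by-age table.
views_by_age = pd.DataFrame(
    {
        "seen_1": [0.733945, 0.652985, 0.621993, 0.531599],
        "seen_2": [0.678899, 0.589552, 0.508591, 0.394052],
        "seen_3": [0.665138, 0.567164, 0.487973, 0.371747],
        "seen_4": [0.697248, 0.656716, 0.567010, 0.386617],
        "seen_5": [0.733945, 0.735075, 0.756014, 0.624535],
        "seen_6": [0.733945, 0.735075, 0.721649, 0.587361],
    },
    index=["18-29", "30-44", "45-60", "> 60"],
)

# Compare every cell with its row maximum to find all top episodes.
is_top = views_by_age.eq(views_by_age.max(axis=1), axis=0)
top_by_age = {age: list(row[row].index) for age, row in is_top.iterrows()}
print(top_by_age)
```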
# rankings by age
ranks_by_age = star_wars.pivot_table(index="Age", values=star_wars.columns[9:15])
print(ranks_by_age)
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranks_by_age, annot=True,fmt='.1f' ,linewidths=.5, ax=ax)
ax.set_title('Rankings by Age')
Age | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 |
---|---|---|---|---|---|---|
18-29 | 4.100000 | 4.100000 | 3.966667 | 2.994444 | 2.722222 | 3.116667 |
30-44 | 4.347826 | 4.309179 | 4.475728 | 2.932367 | 2.212560 | 2.714976 |
45-60 | 3.541667 | 4.170833 | 4.537500 | 3.308333 | 2.437500 | 3.004167 |
> 60 | 3.010417 | 3.761658 | 4.316062 | 3.808290 | 2.730570 | 3.357513 |
Clearly, episode 5 is the highest-ranked movie in every age range, with mean ranks between 2.2 and 2.7, or about 2.5 on average.
We started our analysis of the survey data collected by FiveThirtyEight to answer the question: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
From our analysis of the survey results of 835 responses, `Star Wars: The Empire Strikes Back` comes out as the best of all the episodes in the Star Wars franchise. It was not only the most watched movie but also the episode with the best average ranking. We also found that a larger share of men than women described themselves as fans of the Star Wars movies.