This project examines whether The Empire Strikes Back is the best movie of the Star Wars franchise. We will explore data collected from a survey of avid fans of the series to discuss whether The Empire Strikes Back is the greatest Star Wars movie (at least prior to the release of The Force Awakens).
The project also serves as a showcase of data cleaning techniques, and it centers on one question: is The Empire Strikes Back the best film in the franchise?
We will start by reading in our data set and taking an initial look at the data frame that we create.
import pandas as pd
star_wars = pd.read_csv('star_wars.csv',encoding='ISO-8859-1')
star_wars.head(10)
# Keep only the rows that have a RespondentID (filter rows rather than overwrite the column)
star_wars = star_wars[star_wars['RespondentID'].notnull()]
star_wars.isnull().sum()
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].unique()
Our last two steps began cleaning up the initial data. We first removed all rows whose RespondentID was null. Now we are going to convert the 'Yes' and 'No' answers to the boolean values True and False.
yes_no = {
'Yes': True,
'No': False
}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].unique()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].unique()
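The two mapping steps above can also be written as a single loop. The following is a minimal sketch on a toy data frame; the short column names `seen_any` and `is_fan` are hypothetical stand-ins for the two long survey questions, and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical column names and values)
toy = pd.DataFrame({
    'seen_any': ['Yes', 'No', 'Yes'],
    'is_fan': ['Yes', None, 'No'],
})

yes_no = {'Yes': True, 'No': False}

# Apply the same Yes/No mapping to each column in turn;
# answers that are missing stay missing (NaN) after the mapping
for col in ['seen_any', 'is_fan']:
    toy[col] = toy[col].map(yes_no)

print(toy)
```

Missing answers are left as NaN rather than forced to False, which keeps "did not answer" distinct from "No".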
After changing the values to boolean values, we will now move on to the next columns, which all deal with answers related to which Star Wars movies each respondent has seen. We will also rename each of the columns to match the movie that is in question, rather than the full question as it is stated.
star_wars.columns
star_wars['Unnamed: 4'].unique()
import numpy as np
# np.nan (lowercase) is used here because the np.NaN alias was removed in NumPy 2.0
ep_1 = {
    'Star Wars: Episode I The Phantom Menace': True,
    np.nan: False
}
ep_2 = {
    'Star Wars: Episode II Attack of the Clones': True,
    np.nan: False
}
ep_3 = {
    'Star Wars: Episode III Revenge of the Sith': True,
    np.nan: False
}
ep_4 = {
    'Star Wars: Episode IV A New Hope': True,
    np.nan: False
}
ep_5 = {
    'Star Wars: Episode V The Empire Strikes Back': True,
    np.nan: False
}
ep_6 = {
    'Star Wars: Episode VI Return of the Jedi': True,
    np.nan: False
}
star_wars = star_wars.rename(columns={'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1','Unnamed: 4':'seen_2','Unnamed: 5':'seen_3','Unnamed: 6':'seen_4','Unnamed: 7':'seen_5','Unnamed: 8':'seen_6'})
star_wars['seen_1'] = star_wars['seen_1'].map(ep_1)
star_wars['seen_2'] = star_wars['seen_2'].map(ep_2)
star_wars['seen_3'] = star_wars['seen_3'].map(ep_3)
star_wars['seen_4'] = star_wars['seen_4'].map(ep_4)
star_wars['seen_5'] = star_wars['seen_5'].map(ep_5)
star_wars['seen_6'] = star_wars['seen_6'].map(ep_6)
star_wars.head(10)
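Because each of these columns holds either the movie title or a missing value, the same True/False result can be sketched more compactly with `notnull()`. This is an alternative under that assumption, shown on a toy data frame with made-up rows, not the approach used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in: each cell is either the movie title or NaN (made-up rows)
toy = pd.DataFrame({
    'seen_1': ['Star Wars: Episode I The Phantom Menace', np.nan],
    'seen_5': [np.nan, 'Star Wars: Episode V The Empire Strikes Back'],
})

seen_cols = ['seen_1', 'seen_5']

# A non-null cell means the respondent selected that movie
toy[seen_cols] = toy[seen_cols].notnull()

print(toy)
```

The dictionary-based mapping is stricter, since any unexpected string would become NaN instead of True, so it may still be the safer choice when the column contents are not known in advance.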
Now that we have columns that all display boolean values as well as new column names corresponding to each movie in the series, we move to the next 6 columns, which rank the Star Wars movies from 1-6. We will now convert the string values to numeric so that we can analyze this data more effectively.
# Removing the first line, which is not one of the respondents' surveys
star_wars = star_wars.loc[1:]
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars[star_wars.columns[9:15]]
star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'ep_1_ranking','Unnamed: 10':'ep_2_ranking','Unnamed: 11':'ep_3_ranking','Unnamed: 12':'ep_4_ranking','Unnamed: 13':'ep_5_ranking','Unnamed: 14':'ep_6_ranking'})
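To illustrate what the conversion does, here is a small sketch on a toy data frame with made-up values. The rankings start as strings, and `astype(float)` turns them into numbers while leaving missing answers as NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in: rankings stored as strings, with some missing answers (made-up values)
toy = pd.DataFrame({
    'ep_1_ranking': ['3', '1', np.nan],
    'ep_2_ranking': ['2', np.nan, '6'],
})

rank_cols = ['ep_1_ranking', 'ep_2_ranking']

# Convert the string rankings to floats; NaN stays NaN
toy[rank_cols] = toy[rank_cols].astype(float)

print(toy.dtypes)
```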
All the rankings are now converted to numbers and our columns have been renamed to match which movie is being ranked. Now, we will be able to see the average ranking for each movie.
import matplotlib.pyplot as plt
from numpy import arange
%matplotlib inline
cols = ['ep_1_ranking','ep_2_ranking','ep_3_ranking','ep_4_ranking','ep_5_ranking','ep_6_ranking']
bar_widths = [star_wars[i].mean(skipna=True) for i in cols]
bar_positions = arange(6) + 0.75
tick_positions = range(1,8)
colors = ['gray', 'gray', 'gray', 'gray', 'blue', 'gray']
fig, ax = plt.subplots()
rects = ax.barh(bar_positions,bar_widths,align='center',color = colors, tick_label = cols)
ax.set_title('Star Wars Movie Rankings')
ax.set_ylabel('Movies')
ax.set_xlabel('Average Ranking')
# Hide all four spines for a cleaner chart
for key, spine in ax.spines.items():
    spine.set_visible(False)
# Turn off tick marks using booleans rather than 'off' strings
ax.tick_params(bottom=False, left=False, right=False, top=False)
plt.show()
As we see in our bar chart above, Episode V: The Empire Strikes Back has the lowest average ranking of all the movies on the list. Because respondents gave their favorite film a ranking of 1, the lowest average means it is the best-ranked movie in the Star Wars world according to our survey!
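The reading of the chart can also be checked numerically. The sketch below uses a toy data frame with made-up rankings for three of the films (not the survey results) to show how the column with the lowest mean ranking can be picked out with `idxmin()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in with made-up rankings (1 = favorite); not the real survey data
toy = pd.DataFrame({
    'ep_4_ranking': [2.0, 1.0, 3.0, np.nan],
    'ep_5_ranking': [1.0, 2.0, 1.0, 1.0],
    'ep_6_ranking': [3.0, 3.0, 2.0, 2.0],
})

# Mean ranking per movie; missing values are skipped by default
mean_rank = toy.mean()

# The lowest mean ranking corresponds to the best-ranked movie
best = mean_rank.idxmin()

print(mean_rank)
print(best)
```

On the real data, the same two lines applied to the six `ep_*_ranking` columns would give the values plotted in the chart.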
As a further check on its popularity, we should also look at the columns that record whether each movie has been seen. We will explore this information below.
cols = ['seen_1','seen_2','seen_3','seen_4','seen_5','seen_6']
bar_widths = [star_wars[i].sum() for i in cols]
bar_positions = arange(6) + 0.75
tick_positions = range(1,8)
colors = ['gray', 'gray', 'gray', 'gray', 'blue', 'gray']
fig, ax = plt.subplots()
rects = ax.barh(bar_positions,bar_widths,align='center',color = colors, tick_label = cols)
ax.set_title('Star Wars Movie Viewership')
ax.set_ylabel('Movies')
ax.set_xlabel('Number of Viewers')
# Hide all four spines for a cleaner chart
for key, spine in ax.spines.items():
    spine.set_visible(False)
# Turn off tick marks using booleans rather than 'off' strings
ax.tick_params(bottom=False, left=False, right=False, top=False)
plt.show()
Not only did we find that it is the best-ranked movie, but the chart above shows that it is also the most-watched movie among our respondents out of the first 6 episodes!
Suppose we want to go one step further and split the respondents into male and female groups to see whether there is any difference in the most-viewed movies or the rankings. We can do this by making two data frames, male and female, one for each subgroup.
male = star_wars[star_wars['Gender'] == 'Male']
female = star_wars[star_wars['Gender'] == 'Female']
cols_seen = ['seen_1','seen_2','seen_3','seen_4','seen_5','seen_6']
bar_widths_seen = [male[i].sum() for i in cols_seen]
bar_widths_seen_f = [female[i].sum() for i in cols_seen]
bar_positions = arange(6) + 0.75
tick_positions = range(1,8)
colors = ['gray', 'gray', 'gray', 'gray', 'blue', 'gray']
fig = plt.figure(figsize=(15,10))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)
# ax1 plots viewer counts for men, so the title and x-axis label describe viewership
ax1.barh(bar_positions,bar_widths_seen,align='center',color = colors, tick_label = cols_seen)
ax1.set_title('Star Wars Movie Viewership For Men')
ax1.set_ylabel('Movies')
ax1.set_xlabel('Number of Viewers')
for key, spine in ax1.spines.items():
    spine.set_visible(False)
ax1.tick_params(bottom=False, left=False, right=False, top=False)
cols = ['ep_1_ranking','ep_2_ranking','ep_3_ranking','ep_4_ranking','ep_5_ranking','ep_6_ranking']
bar_widths = [male[i].mean(skipna=True) for i in cols]
bar_positions = arange(6) + 0.75
tick_positions = range(1,8)
colors = ['gray', 'gray', 'gray', 'gray', 'blue', 'gray']
# ax2 plots average rankings for men, so the title describes rankings
ax2.barh(bar_positions,bar_widths,align='center',color = colors, tick_label = cols)
ax2.set_title('Star Wars Movie Rankings For Men')
# ax2.set_ylabel('Movies')
ax2.set_xlabel('Average Ranking')
for key, spine in ax2.spines.items():
    spine.set_visible(False)
ax2.tick_params(bottom=False, left=False, right=False, top=False)
# ax3 plots viewer counts for women, so the title and x-axis label describe viewership
ax3.barh(bar_positions,bar_widths_seen_f,align='center',color = colors, tick_label = cols_seen)
ax3.set_title('Star Wars Movie Viewership For Women')
ax3.set_ylabel('Movies')
ax3.set_xlabel('Number of Viewers')
for key, spine in ax3.spines.items():
    spine.set_visible(False)
ax3.tick_params(bottom=False, left=False, right=False, top=False)
cols = ['ep_1_ranking','ep_2_ranking','ep_3_ranking','ep_4_ranking','ep_5_ranking','ep_6_ranking']
bar_widths_f = [female[i].mean(skipna=True) for i in cols]
bar_positions = arange(6) + 0.75
tick_positions = range(1,8)
colors = ['gray', 'gray', 'gray', 'gray', 'blue', 'gray']
# ax4 plots average rankings for women, so the title describes rankings
ax4.barh(bar_positions,bar_widths_f,align='center',color = colors, tick_label = cols)
ax4.set_title('Star Wars Movie Rankings For Women')
# ax4.set_ylabel('Movies')
ax4.set_xlabel('Average Ranking')
for key, spine in ax4.spines.items():
    spine.set_visible(False)
ax4.tick_params(bottom=False, left=False, right=False, top=False)
plt.show()
The 4 charts above show a few patterns, which we summarize here.
What we found in this exploration of the Star Wars survey data is that The Empire Strikes Back reigns supreme. Among our respondents, it is not only the most-watched movie in the franchise but also the highest-ranked one. We also found that female and male respondents are almost identically interested in the Star Wars movies. The only noticeable difference is that female viewers tend to like Episode I slightly more than their male counterparts, while male viewers tend to like Episode IV slightly more than female viewers.
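The male and female comparison can also be summarized in a table instead of separate data frames. The following is a rough sketch using `groupby` with named aggregation on a toy data frame; the rows are made up for illustration, and only the Episode V columns are included to keep it short.

```python
import pandas as pd

# Toy stand-in with made-up rows; not the real survey data
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'seen_5': [True, True, False, True],
    'ep_5_ranking': [1.0, 2.0, 3.0, 1.0],
})

# One row per gender: number of viewers and mean ranking for Episode V
by_gender = toy.groupby('Gender').agg(
    viewers=('seen_5', 'sum'),
    mean_rank=('ep_5_ranking', 'mean'),
)

print(by_gender)
```

A table like this makes small differences between the groups easier to compare than reading bar lengths across four charts.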
For future analysis, we will look to explore other demographic comparisons, including education levels, locations, and others.