Whether or not you are a Star Wars fan, the Star Wars saga is undeniably a staple of movie theaters and a topic of great debate among friends: which of the six original movies is the best, and who shot first?
In this project, we will analyze existing data to answer those questions.
The data shows that Star Wars: Episode V The Empire Strikes Back
is the best movie and Star Wars: Episode III Revenge of the Sith
is the least favorite.
While this is clearly true for Star Wars fans, non-fans rank Star Wars: Episode I The Phantom Menace as highly as Star Wars: Episode V The Empire Strikes Back. Star Wars fans hold the opposite view of Star Wars: Episode I The Phantom Menace, ranking it as one of their least favorites.
Moreover, we found that the majority of non-Star Wars Fans don't understand the controversy of Who shot first, while Star Wars Fans agree that Han Solo shot first. For more details, please refer to the full analysis below.
For this project, we will use data collected by the team at FiveThirtyEight. They surveyed Star Wars fans using SurveyMonkey and received 855 responses, which can be found in their [GitHub repository][1].

[1]: https://github.com/dataquestio/solutions/blob/master/Mission201Solution.ipynb

Let's start by importing and exploring the data set.
import pandas as pd
import numpy as np
star_wars = pd.read_csv("star_wars.csv",encoding="ISO-8859-1")
star_wars.head(10)
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 rows × 38 columns
In the step above, we have printed the first ten rows of the dataset.
The first column holds the respondent ID and should contain unique values. However, we can see that the first row has a null value. The next two columns are Yes/No questions. The following columns consist of a question followed by columns named Unnamed; the values of those columns seem to hold the answers to the question that precedes them.
In the next step, we will print the names of the columns.
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
As we can see, some of the columns are called Unnamed
. Those columns follow questions where the respondent has to rank or choose between movies or characters; these columns will need to be cleaned in order to have tidy data for our analysis.
As we saw before, there is a row with a null value for RespondentID. As this column should contain unique values, we will drop the row with the null value.
# dropping rows with a null value for RespondentID
star_wars = star_wars.dropna(axis=0,subset=['RespondentID'])
star_wars['RespondentID'].isnull().sum()
0
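Since RespondentID is meant to identify each response, it may also be worth confirming that the remaining IDs are unique, not only non-null. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its IDs are invented for illustration and are not taken from the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one null ID, mimicking the extra header row in the survey file
toy = pd.DataFrame({
    "RespondentID": [np.nan, 101.0, 102.0, 103.0],
    "answer": ["Response", "Yes", "No", "Yes"],
})

# Drop rows whose ID is missing, then check the remaining IDs
toy = toy.dropna(subset=["RespondentID"])
null_ids = toy["RespondentID"].isnull().sum()   # number of null IDs left
ids_unique = toy["RespondentID"].is_unique      # True if no ID repeats
```

The same two checks could be run on `star_wars['RespondentID']` after the drop.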
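The columns below are Yes/No questions:
- Have you seen any of the 6 films in the Star Wars franchise?
- Do you consider yourself to be a fan of the Star Wars film franchise?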
Let's explore the content of those columns.
# Use value_counts methods to see all unique values in a column
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
Both columns contain Yes/No values, and one of them also contains NaN values. To simplify the analysis, we will convert both columns to Boolean. For the time being, we will keep the NaN values as they are and map "Yes" to True and "No" to False.
# mapping dictionary
yes_no ={"Yes": True,"No": False}
# calling the function map() to perform the mapping
star_wars['Have you seen any of the 6 films in the Star Wars franchise?']=star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] =star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
#Checking that the mapping was correct
star_wars.head()
| | RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
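To see why the NaN values survive the mapping, here is a minimal sketch on a hypothetical toy Series (the answers are invented for illustration). When `Series.map` is given a dictionary, any value that is not one of its keys comes back as NaN, so missing responses stay missing:

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Hypothetical answers, including one missing response
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Values found in the dictionary are replaced; everything else becomes NaN
mapped = answers.map(yes_no)

n_true = (mapped == True).sum()     # count of "Yes" answers
n_missing = mapped.isnull().sum()   # missing responses are preserved
```

One consequence is that a misspelled answer such as "yes" would silently become NaN too, so checking `value_counts(dropna=False)` before and after the mapping is a reasonable safeguard.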
As we saw above, some of the columns are named Unnamed
. Those correspond to checkbox questions.
The first question is: Which of the following Star Wars films have you seen? Please select all that apply.
The values of this column give us information on whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
The next 5 columns give us the following information:
- Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
- Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
- Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
- Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
- Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each column, if the value is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the film. We'll assume that they didn't see the movie.
In the next step we will convert these columns to Boolean.
# variable holding the names of the six "seen" columns
col_name = star_wars.columns[3:9]
# list for the individual movie names
movie_names = []
# loop to extract the movie name from each column
# (this relies on the first row having a non-null value in every column)
for col in col_name:
    movie_names.append(star_wars[col].unique()[0])
# dictionary that will be used for the Boolean mapping
mapper = {np.nan: False}
# loop to add each movie to the dictionary
for movie in movie_names:
    mapper[movie] = True
# mapping each column in the dataset to the dictionary
for col in col_name:
    star_wars[col] = star_wars[col].map(mapper)
#Printing the first 5 rows to check that the mapping is correct
star_wars[col_name].head()
| | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 |
---|---|---|---|---|---|---|
1 | True | True | True | True | True | True |
2 | False | False | False | False | False | False |
3 | True | True | True | False | False | False |
4 | True | True | True | True | True | True |
5 | True | True | True | True | True | True |
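As a shorter alternative, under the same assumption that a missing value means the film was not seen, the raw checkbox columns could be converted with `notna()`, since every non-null entry in them is a movie title. This is only a sketch on a hypothetical toy frame, and it applies to the columns before any mapping has been done:

```python
import numpy as np
import pandas as pd

# Hypothetical raw checkbox columns: a title if seen, NaN otherwise
raw = pd.DataFrame({
    "Which of the following Star Wars films have you seen? Please select all that apply.": [
        "Star Wars: Episode I The Phantom Menace",
        np.nan,
        "Star Wars: Episode I The Phantom Menace",
    ],
    "Unnamed: 4": [
        "Star Wars: Episode II Attack of the Clones",
        np.nan,
        np.nan,
    ],
})

# Any non-null value means the respondent ticked the box
seen_flags = raw.notna()
```

The dictionary approach used above is stricter, because an unexpected string would map to NaN instead of True, which makes data-entry problems easier to spot.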
In the step above, we have successfully converted the values to Boolean; however, the column names remain confusing. We will rename those columns to seen_1, seen_2, etc.
# dictionary for the mapping of the column names
column_map = {}
# loop to fill the dictionary with the desired column name format
for col in range(len(col_name)):
    column_map[col_name[col]] = 'seen_{}'.format(col + 1)
# calling the rename() method to update the column names
star_wars = star_wars.rename(columns=column_map)
star_wars.columns[3:9]  # checking that the column names are correct
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
The next six columns ask the respondent to rank the Star Wars movies in order of preference, from most favorite to least favorite. Let's explore the values in those columns.
# print the first 5 rows of columns 9 to 14
star_wars[star_wars.columns[9:15]].head()
| | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 |
---|---|---|---|---|---|---|
1 | 3 | 2 | 1 | 4 | 5 | 6 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1 | 2 | 3 | 4 | 5 | 6 |
4 | 5 | 6 | 1 | 2 | 4 | 3 |
5 | 5 | 4 | 6 | 2 | 1 | 3 |
Each column contains values from 1 to 6, where 1 is the most favorite and 6 is the least favorite. We will need to convert these values to a numeric type.
Furthermore, the column names are not intuitive; therefore, we will rename them using the format ranking_1, ranking_2, etc. Each column represents the following:
- Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
- Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
- Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
- Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
- Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
- Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi

col_name = star_wars.columns[9:15]
# convert the values to float
star_wars[col_name] = star_wars[col_name].astype(float)
# column name mapper
column_name = {}
# adding each column name and its new format to the dictionary
for col in range(len(col_name)):
    column_name[col_name[col]] = "ranking_{}".format(col + 1)
# updating the column names in the dataframe
star_wars = star_wars.rename(columns=column_name)
star_wars.columns[9:15]  # checking the new names
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'], dtype='object')
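The conversion above uses `astype(float)`, which raises an error if a column holds a string that is not a number. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns such entries into NaN instead. The toy Series below is hypothetical; the survey columns may not contain any stray values at all:

```python
import pandas as pd

# Hypothetical ranking answers: two valid ranks, one missing, one stray string
raw = pd.Series(["3", "1", None, "n/a"])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

n_missing = numeric.isna().sum()   # the missing and the stray entry
```

Coercion hides bad input, so it is worth comparing the NaN count before and after the conversion.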
# getting the average value for each movie
ranking =star_wars[star_wars.columns[9:15]].mean()
%matplotlib inline
#creating a barchart of the average ranking
ax =ranking.plot(kind='bar',colormap='winter',title='Average movie rank')
#removing the spines
for key, spine in ax.spines.items():
    spine.set_visible(False)
#removing the axes ticks
ax.tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False)
The bar chart above shows the average ranking for each of the 6 Star Wars movies. In the survey, 1 represents the respondent's favorite movie; therefore, the movie with the lowest average ranking is the highest-ranked movie.
As we can see, movie 5, Star Wars: Episode V The Empire Strikes Back, is the highest-ranked, with an average rank of 2.5; on the other hand, the third movie, Star Wars: Episode III Revenge of the Sith, is the lowest-ranked movie, with an average rank above 4.
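Because a lower average means a better-liked movie, sorting the averages in ascending order lists the movies from most to least favorite. A minimal sketch on hypothetical rankings (the three respondents and their answers are invented for illustration):

```python
import pandas as pd

# Hypothetical rankings from three respondents (1 = favorite, 6 = least favorite)
toy = pd.DataFrame({
    "ranking_1": [3.0, 1.0, 5.0],
    "ranking_3": [6.0, 5.0, 6.0],
    "ranking_5": [1.0, 2.0, 1.0],
})

avg_rank = toy.mean().sort_values()   # ascending: best-liked movie first
best = avg_rank.idxmin()              # column with the lowest average rank
worst = avg_rank.idxmax()             # column with the highest average rank
```

In this toy data `best` is ranking_5 and `worst` is ranking_3, mirroring the pattern in the chart.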
Next, let's look at a box plot to get a better idea of the spread of the ranking data for each movie.
star_wars[star_wars.columns[9:15]].plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x7ff6db41a240>
The box plot shows that the 5th movie ranks in the top 3 50% of the time, while the 3rd movie ranks between 3rd and 6th 50% of the time.
This data confirms that Star Wars: Episode V The Empire Strikes Back
is the highest-ranked and Star Wars: Episode III Revenge of the Sith
is the least valued movie.
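The box in a box plot spans the first to the third quartile, so the middle 50% of the rankings can also be read off numerically with `quantile`. A small sketch on hypothetical rankings for a single movie (values invented for illustration):

```python
import pandas as pd

# Hypothetical rankings for a single movie (1 = favorite, 6 = least favorite)
ranks = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0])

# The box in a box plot runs from the first to the third quartile
q1, median, q3 = ranks.quantile([0.25, 0.5, 0.75])
```

Applied to the real data, `star_wars[star_wars.columns[9:15]].quantile([0.25, 0.5, 0.75])` would give the box limits for all six movies at once.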
In the next steps, we will try to find which movie is the most seen.
seen = star_wars[star_wars.columns[3:9]].sum()
ax = seen.plot(kind='barh',colormap="winter", title="Number of views")
#removing the spines
for key, spine in ax.spines.items():
    spine.set_visible(False)
#removing the axes ticks
ax.tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False,labelbottom= False)
The bar chart above shows the number of views for each movie. We can see that the 5th and 6th movies have the highest number of views, while the 2nd and 3rd movies have the lowest.
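The counts come from summing Boolean columns: True counts as 1 and False as 0, so `sum()` gives the number of respondents who saw each film, and `mean()` gives the share. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical "seen" flags for four respondents
toy = pd.DataFrame({
    "seen_1": [True, False, True, True],
    "seen_2": [True, False, False, False],
})

views = toy.sum()    # number of respondents who saw each film
share = toy.mean()   # fraction of respondents who saw each film
```

Shares are easier to compare than raw counts when the number of respondents differs between groups.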
For the rest of our analysis, we will explore the differences in the highest-ranked movie between Star Wars fans and non-fans.
# splitting the dataframe into Star Wars fans and non-fans
wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==True]
non_wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==False]
# getting the average ranking for Star Wars fans
wars_avg = wars_fan[wars_fan.columns[9:15]].mean()
# getting the average ranking for non-Star Wars fans
non_wars_avg = non_wars_fan[non_wars_fan.columns[9:15]].mean()
# creating a DataFrame combining both averages
star_wars_avgs = pd.DataFrame({'Fans':wars_avg,'Non Fans':non_wars_avg})
# Plotting the average movie rank
ax=star_wars_avgs.plot(kind='bar',title='Average movie rank for Fans and Non-Fans')
#removing the spines
for key, spine in ax.spines.items():
    spine.set_visible(False)
#removing the axes ticks
ax.tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False)
The bar chart above shows the average movie rank for Star Wars Fans and Non-Star Wars Fans.
As we can see, both groups tend to agree that the 5th movie is the favorite and the 3rd movie is the least favorite. However, there are some differences in preferences.
For non-fans, the 1st movie is also one of the favorites, with an average rank similar to that of the 5th movie. Star Wars fans, however, rank this movie among their least favorites, with an average rank of about 4.
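The same comparison could be made without building two separate DataFrames, by grouping on the fan column. The sketch below uses a hypothetical toy frame and a shortened, invented column name `is_fan`. By default `groupby` drops rows whose key is NaN, so respondents who did not answer the fan question are excluded, as they are in the split above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two fans, two non-fans, one respondent with no answer
toy = pd.DataFrame({
    "is_fan": [True, True, False, False, np.nan],
    "ranking_1": [5.0, 4.0, 2.0, 3.0, 1.0],
    "ranking_5": [1.0, 2.0, 2.0, 1.0, 6.0],
})

# One row per group (False, True); the NaN row is left out
by_fan = toy.groupby("is_fan")[["ranking_1", "ranking_5"]].mean()
```

Transposing the result with `by_fan.T` would give one row per movie and one column per group, the same layout as the combined DataFrame plotted above.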
There is some controversy in the world of Star Wars. It revolves around a scene, changed several times over the years, in which Han Solo and Greedo have a shootout. In some versions it appears that Greedo shot first, whereas in others it seems that Han Solo shot first.
In the next step, we will explore differences in the answers to Which character shot first? between fans and non-fans.
#creating frequency distribution for Fans and Non Fans
fan_shot = wars_fan['Which character shot first?'].value_counts(normalize=True)
non_fan_shot = non_wars_fan['Which character shot first?'].value_counts(normalize=True)
# creating a DataFrame with the fan and non-fan distributions
shot = pd.DataFrame({"Fans": fan_shot,"Non Fans":non_fan_shot})
#plotting the frequency of each answer
ax = shot.plot(kind='bar', title= 'Who shot first?')
#removing the spines
for key, spine in ax.spines.items():
    spine.set_visible(False)
#removing the axes ticks
ax.tick_params(axis='both',which='both',bottom=False,top=False,left=False,right=False)
The bar chart above shows that the majority of Star Wars fans think that Han Solo shot first, whereas the majority of non-fans do not understand the question.
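The comparison relies on `value_counts(normalize=True)`, which returns proportions instead of counts, so groups of different sizes can be plotted on the same scale. A minimal sketch with hypothetical answers (invented for illustration):

```python
import pandas as pd

# Hypothetical answers from four fans and two non-fans
fans = pd.Series(["Han", "Han", "Greedo", "Han"])
non_fans = pd.Series(["I don't understand this question", "Han"])

# Proportions within each group; the two Series are aligned on the answer labels
shot = pd.DataFrame({
    "Fans": fans.value_counts(normalize=True),
    "Non Fans": non_fans.value_counts(normalize=True),
})
```

An answer that never occurs in one group shows up as NaN in that column; calling `shot.fillna(0)` before plotting would make the zero explicit.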
In this project, we have analyzed the results of a Star Wars survey to find which is the best Star Wars movie. The data shows that Star Wars: Episode V The Empire Strikes Back
is the best movie and Star Wars: Episode III Revenge of the Sith
is the least favorite.
While this is clearly true for Star Wars fans, non-fans rank Star Wars: Episode I The Phantom Menace as highly as Star Wars: Episode V The Empire Strikes Back. Star Wars fans hold the opposite view of Star Wars: Episode I The Phantom Menace, ranking it as one of their least favorites.
Moreover, we found that the majority of non-Star Wars Fans don't understand the controversy of Who shot first, while Star Wars Fans agree that Han Solo shot first.