While waiting for the Star Wars VII: The Force Awakens to come out, the team at FiveThirtyEight collected data through online survey to assess the popularity of each individual film in Star Wars. They published their data on github, which you can download from their Github repository.
Based on their data, we will try to determine the most favorite film and character in Star Wars franchise.
We will first try to set up our environment by importing tools that can support us along the way
import numpy as np ##Importing numpy library
import matplotlib.pyplot as plt ##Importing matplotlib library
import seaborn as sns ##Importing seaborn library
import pandas as pd ##Importing pandas library
%matplotlib inline
sns.set_style('darkgrid') ##Set our general style to 'darkgrid'
Importing our dataset from csv file and set our display to 50 columns so that we can see our dataset as a whole. We will then look at our first 3 rows from our dataset
star_wars = pd.read_csv('StarWars.csv',encoding='ISO-8859-1') ##Importing our excel file
pd.options.display.max_columns = 50 ##Enable us to see all columms
star_wars.head(3)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?æ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
Lets understand more about our dataset:
Column Name | Description |
---|---|
RespondentID |
An anonymized ID for the respondent (person taking the survey) |
Gender |
The respondent's gender |
Age |
The respondent's age |
Household Income |
The respondent's income |
Education |
The respondent's education level |
Location (Census Region) |
The respondent's location |
Have you seen any of the 6 films in the Star Wars franchise? |
Has a Yes or No response |
Do you consider yourself to be a fan of the Star Wars film franchise? |
Has a Yes or No response |
After reading the first few columns of the data frame. We found out that there are *more than 20 Unnamed columns* that might be obstacles for us to analyse the data. Hence, we will try to change our column names by looking for relevant data. For example, if we look at *Unnamed 16 to Unnamed 28*, we found out that the first row contains the character names while other rows contains different strings.
We will change Unnamed 16 to Unnamed 28 columns names by replacing with character names.
column_names = list(star_wars.iloc[:,15:29].columns) ##Extracting the column names
char_names = list(star_wars.iloc[0,15:29]) ##Extracting the character names
char_names_dict = {} ##Creating an empty dictionary
i = 0
for c in column_names: ##Loop over column names
char_names_dict[c] = char_names[i] ##Appending keys and values to
i += 1
star_wars.rename(char_names_dict,axis=1,inplace=True) ##Renaming from Unnamed 16 to Unnamed 28 with character names
Since our first row has few nonnull datas, while some of it contains irrelevant data. We decided not to use it and delete from our dataset
star_wars = star_wars[star_wars['RespondentID'].notnull()] ##First row is the only with with no value for RespondentID
As we look to other column names, we will notice that there are still few other Unnamed columns such as *Unnamed 4 to Unnamed 14*.
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Han Solo', 'Luke Skywalker', 'Princess Leia Organa', 'Anakin Skywalker', 'Obi Wan Kenobi', 'Emperor Palpatine', 'Darth Vader', 'Lando Calrissian', 'Boba Fett', 'C-3P0', 'R2 D2', 'Jar Jar Binks', 'Padme Amidala', 'Yoda', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?æ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
Based on FiveThirtyEight github, six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
The columns for this question are:
columns | descriptions |
---|---|
Which of the following Star Wars films have you seen? Please select all that apply. | Whether or not the respondent saw Star Wars: Episode I The Phantom Menace. |
Unnamed: 4 | Whether or not the respondent saw Star Wars: Episode II Attack of the Clones. |
Unnamed: 5 | Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith. |
Unnamed: 6 | Whether or not the respondent saw Star Wars: Episode IV A New Hope. |
Unnamed: 7 | Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back. |
Unnamed: 8 | Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi. |
We will change these column names with seen_?
where '?' is the episode of the franchise
column_names = list(star_wars.iloc[:,3:9].columns) ##Extracting our column names
rename_dict = {} ##Create an empty dictionary
i = 1
for cn in column_names: ##Loop over column names
rename_dict[cn] = 'seen_{}'.format(i) ##Generating a dictionary with column names as keys and seen_? as values
i += 1
star_wars.rename(rename_dict,axis=1,inplace=True) ##Changing our column names
The next six columns ask the respondents to rank the Star Wars movies in order of least favorite to most favorite. 1 means the film was the most favorite, and 6 means it was the least favorite movie. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
Columns | Descriptions |
---|---|
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | How much the respondent liked Star Wars: Episode I The Phantom Menace |
Unnamed: 10 | How much the respondent liked Star Wars: Episode II Attack of the Clones |
Unnamed: 11 | How much the respondent liked Star Wars: Episode III Revenge of the Sith |
Unnamed: 12 | How much the respondent liked Star Wars: Episode IV A New Hope |
Unnamed: 13 | How much the respondent liked Star Wars: Episode V The Empire Strikes Back |
Unnamed: 14 | How much the respondent liked Star Wars: Episode VI Return of the Jedi |
Therefore, we will changing from Unnamed columns to rank_? where '?' is the episode in the franchise
rank_cols = list(star_wars.iloc[:,9:15].columns)
rank_dict = {}
i = 1
for rc in rank_cols: ##Generating a dictionary contains our old names and new names
rank_dict[rc] = 'rank_{}'.format(i)
i += 1
star_wars.rename(rank_dict,axis=1,inplace=True)
star_wars.head(5)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | rank_1 | rank_2 | rank_3 | rank_4 | rank_5 | rank_6 | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?æ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
2,3,31,32,33
contains Yes/No/NaN
answers, which could be turned into a boolean values True/False/NaN.name of the movie is True
, while NaN is False
##Lets first change our columns 2,3,31,32,33 from Yes/No/NaN to True/False/NaN
yes_no = {
'Yes' : True,
'No' : False,
}
cols = [1,2,30,31,32]
for c in cols: ##Loop over each column
star_wars.iloc[:,c] = star_wars.iloc[:,c].map(yes_no) ##Changing our values into boolean values
for c in cols:
print(star_wars.iloc[:,c].value_counts(dropna=False)) ##Checking if we had done correctly
print('\n')
True 936 False 250 Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64 True 552 NaN 350 False 284 Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64 False 615 NaN 358 True 213 Name: Are you familiar with the Expanded Universe?, dtype: int64 NaN 973 False 114 True 99 Name: Do you consider yourself to be a fan of the Expanded Universe?æ, dtype: int64 False 641 True 427 NaN 118 Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
We will continue to work with *column 4 to 9* by changing Movie name and NaN values into boolean values
movie_names = list(star_wars.iloc[0,3:9]) ##Extracting movie names from column 4 to 9
movie_names
def to_bool(movie): ##Creating definition
if movie in movie_names: ##If the cell contains movie names then it is True
return True
else: ##If the cell not contains movie names then it is False
return False
star_wars.iloc[:,3:9] = star_wars.iloc[:,3:9].applymap(to_bool)
After finish cleaning our first 9 columns, we shall check our *next 6 columns*(Column 9 to 15). Since there only contains numeric values, the data itself seems to be clean. However, we should check if it is in a string type or integer type
star_wars[star_wars.columns[9:15]].dtypes ##Checking types of columns 9 to 14
rank_1 object rank_2 object rank_3 object rank_4 object rank_5 object rank_6 object dtype: object
Unfotunately, it is in object or called the *string type. Therefore, we will have to change it into the float type*.
star_wars.iloc[:,9:15] = star_wars.iloc[:,9:15].astype(float) ##Changing type of columns 9 to 14 to float type
Lets check our dataset again
star_wars.head(5)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | rank_1 | rank_2 | rank_3 | rank_4 | rank_5 | rank_6 | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?æ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | True | False | False | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | False | NaN | False | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | False | NaN | True | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably | Greedo | True | False | False | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
Our first 15 columns has been changed into usable data for analysing. Looking into character names columns, we notice that our results are in string objects, which will not help much to our analysing process. Therefore, we shall transform our string objects into numerical values.Lets check our data to see how we going to transform.
star_wars.iloc[:,15].value_counts(dropna=False) ##Checking unique values from character name columns
Very favorably 610 NaN 357 Somewhat favorably 151 Neither favorably nor unfavorably (neutral) 44 Unfamiliar (N/A) 15 Somewhat unfavorably 8 Very unfavorably 1 Name: Han Solo, dtype: int64
As we can see, our unique values can be divided into 4 groups: Favorable,Unfavorable,Neutral,Unfamiliar. We can transform this into numerical values
Type | Numerical Values |
---|---|
Favorable | 1 |
Unfavorable | 2 |
Neutral | 3 |
Unfamiliar | 4 |
def to_number(string):
if 'Very favorably' == string or 'Somewhat favorably' == string: ##Return 1 if somewhat love the character
return 1
elif 'Somewhat unfavorably' == string or 'Very unfavorably' == string: ##Return 2 if somewhat hate the character
return 2
elif 'Neither favorably nor unfavorably (neutral)' == string: ##Return 3 if not have strong sentiment
return 3
elif 'Unfamiliar (N/A)' or np.nan == string: ##Return 4 if not familar or lost value
return 4
star_wars.iloc[:,15:29] = star_wars.iloc[:,15:29].applymap(to_number)
star_wars.head(3) ##checking our works
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | rank_1 | rank_2 | rank_3 | rank_4 | rank_5 | rank_6 | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?æ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | I don't understand this question | True | False | False | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | NaN | NaN | NaN | True | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | I don't understand this question | False | NaN | False | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
Moving to analysis stage, we would like to see which Star Wars film is believed to be the worst/best and which one is most viewed by the respondents.
starwars_names = [] ##Generating the Star War movie names
for i in ['I','II','III','IV','V','VI']:
name = 'Star Wars {}'.format(i)
starwars_names.append(name)
starwars_names ##Checking our starwars names
['Star Wars I', 'Star Wars II', 'Star Wars III', 'Star Wars IV', 'Star Wars V', 'Star Wars VI']
In this part, we are going to analyse the survey provided by FiveThirtyEight to see which Star Wars film does the viewer hate the most. We are going to create a simple bar plot for each movie. To remind us, the survey asked a total of *1186 individuals* how much do they *hate* each film in the Star Wars franchise. Therefore, the higher the rating, the worse the movie is.
plt.figure(figsize=(10,6))
ax = star_wars.iloc[:,9:15].mean().plot.bar(ylim=(0,6),position=0) ###Generating a bar plot with a limit of 6 y-values
num = list(star_wars.iloc[:,9:15].mean().astype(float)) ###Generating a list of percentage number
for i in range(0,6): ###Adding percentage number to each bar
x = i + 0.05
y = num[i] + 0.2
string = str(round(num[i],2))
ax.text(x,y,string,fontsize=16)
ax.get_children()[2].set_color('r') ###Set 2nd plot to color red
ax.grid(False) ###Remove grid
ax.set_yticks([]) ###Set y tick labels
ax.set_xticks(np.arange(0,6)+0.25) ###Set x tick postions
ax.set_xticklabels(starwars_names,fontsize=13,rotation=0) ###Set x tick labels
plt.title('Worst movie in Star Wars franchise',fontsize=20,fontweight=525) ###Set plot title
plt.show()
We are going to select respondent who have answered our gender questions. Therefore we will filter according to our critera
males = star_wars[star_wars['Gender']=='Male']
females = star_wars[star_wars['Gender'] =='Female']
fig = plt.figure(figsize=(10,6))
###Generating first graph
ax = males.iloc[:,9:15].mean().plot.bar(color='Green',ylim=(0,5),position=0)
###Adding numbers to graph
num = list(males.iloc[:,9:15].mean().astype(float))
for i in range(0,6):
x = i + 0.05
y = num[i] + 0.2
string = str(round(num[i],2))
plt.text(x,y,string,fontsize=16)
###Customizing our grids and x,y ticks
ax.grid(False)
ax.set_xticks(np.arange(0,6)+0.25)
ax.set_xticklabels(starwars_names,fontsize=13,rotation=0)
plt.xlabel('Male',fontsize=15)
plt.ylabel('Hate Rates',fontsize=15)
plt.title('Most hated Star Wars movie according to male fans',fontsize=20)
plt.show()
###Generating second graph
fig = plt.figure(figsize=(10,6))
ax = females.iloc[:,9:15].mean().plot.bar(color='Blue',ylim=(0,5),position=0)
###Adding numbers to graph
num = list(females.iloc[:,9:15].mean().astype(float))
for i in range(0,6):
x = i + 0.05
y = num[i] + 0.2
string = str(round(num[i],2))
plt.text(x,y,string,fontsize=16)
###Customizing our grids and x,y ticks
ax.set_xticks(np.arange(0,6)+0.25)
ax.set_xticklabels(starwars_names,fontsize=13,rotation=0)
ax.grid(False)
plt.xlabel('female',fontsize=15)
plt.ylabel('Hate Rates',fontsize=15)
plt.title('Most hated Star Wars movie according to female fans',fontsize=20)
plt.show()
Based on our result, we can see that the 3 oldest Star Wars
movie were not as good as the later 3
. While the most hated
one was 'Star Wars III: Revenge of the Sith'
at *4.34, viewers loved 'Star Wars V: The Empire Strikes Back'
most at 2.51*
If you are a fan of Star Wars film franchise, you would be curious to know which movie was the hottest so that you could have a conversation with your friends about it. Fortunately, FiveThirtyEight provided us with data to see which movies have they watched. We want to be selective about this data, we going to choose rows/responder who have watched at least 1 movie in the Star Wars movies franchise.
We going to select rows that based on 'Have you seen any of the 6 films in the Star Wars franchise?'
and we only going to *choose rows with True values*.
seen_any = star_wars[star_wars['Have you seen any of the 6 films in the Star Wars franchise?']]
seen_any.shape
(936, 38)
Since there are 936 rows, there are *936 viewers* who have watched Star Wars. We going to *generate bar plots* to see the percentage of 936 viewers who have watched each Star Wars movie.
plt.figure(figsize=(10,6)) ##
ax = seen_any.iloc[:,3:9].sum().plot.bar(ylim=(0,936),position=0) ###Generating bar plots with a limit of 936 for y values
num = list(seen_any.iloc[:,3:9].sum().astype(int)) ###Generating a list of percentage numbers
for i in range(0,6): ###Adding text to each bar
x = i
y = num[i]+20
string = str(round(num[0+i]/936 *100,2)) +'%'
ax.text(x,y,string,fontsize=16)
ax.get_children()[4].set_color('#D62728') ###Set fifth bar plot to #D62728
ax.grid(False) ###Remove grids
ax.set_yticks([]) ###Remove y tick labels
ax.set_xticks(np.arange(0,6)+0.25) ###Set x tick labels positions
ax.set_xticklabels(starwars_names,rotation=0,fontsize=13) ###Set x tick labels name
plt.title('Which \'Star Wars\' Movies Have You Seen? \n (based on 936 respondents)',fontsize=18,fontweight=525)
plt.show()
As we might have guessed from our previous analyse, *'Star Wars V: The Empire Strikes Back'* was the most popular movie within the franchise, while *'Star Wars III: Revenge of the Sith'* remained the least successful of all 6 movies. The next 2 popular movies were *Star Wars VI* and *Star Wars I*. This may have proved that many of the viewers had skipped the other middle movies.
We may have known the most popular and the most loved movies. But the survey provided information whether the viewer had watched all of the Star Wars movies. This could help us to get a better view of the Star Wars movies by determining the best 1 out of 6 based on individuals who have watched all 6 of them
We would want to *select rows that all 6 'seen_?' columns were answered True*
star_wars['seen_all'] = star_wars.iloc[:,3:9].all(axis=1)
print('There are',star_wars['seen_all'].sum(),'out of 1186 viewers who have watched all 6 movies')
There are 471 out of 1186 viewers who have watched all 6 movies
Lets check how the first 5 people chose their favorite movie, for 1 is the most favorited
while 6 is the most hated
rank_seen_all = star_wars[star_wars['seen_all']].iloc[:,9:15]
rank_seen_all.head(5)
rank_1 | rank_2 | rank_3 | rank_4 | rank_5 | rank_6 | |
---|---|---|---|---|---|---|
1 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
4 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
5 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
6 | 1.0 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 |
7 | 6.0 | 5.0 | 4.0 | 3.0 | 1.0 | 2.0 |
We should create rows where it shows the sum of rows that contains 1 for each movie(column). Ultimately, we would like to create percentage rows to compared 6 movies as a whole.
rank_seen_all.loc['1_counts'] = (rank_seen_all == 1).sum(axis=0) ###Counting 1 for each column
rank_seen_all.loc['1_counts_per'] = round(rank_seen_all.loc['1_counts']/ 471 *100,2) ### round (1_counts/471 * 100) to 2 figures after ,
pd.DataFrame(rank_seen_all.loc[['1_counts','1_counts_per']])
rank_1 | rank_2 | rank_3 | rank_4 | rank_5 | rank_6 | |
---|---|---|---|---|---|---|
1_counts | 47.00 | 18.00 | 27.00 | 128.00 | 169.00 | 82.00 |
1_counts_per | 9.98 | 3.82 | 5.73 | 27.18 | 35.88 | 17.41 |
Since it is easier to compare through visualizations, we shall generate 4 bar plots to see which movie was the most favored based on people who have watched all 6 movies.
plt.figure(figsize=(12,8))
ax = rank_seen_all.loc['1_counts_per'].plot.bar(ylim=(0,50),position=0)
ax.set_xticks(np.arange(0,6))
num = list(rank_seen_all.loc['1_counts_per'])
for i in range(0,6): ###Generating percentage numbers for each figure
x = i
y = num[i]+3
string = str(num[i])
string = string + '%'
ax.text(x,y,string,fontsize=18)
ax.get_children()[4].set_color('#D62728') ###Fifth bar will be set to a #D62728 color
ax.grid(False) ###Remove grid
ax.set_yticks([]) ###Remove y ticks
ax.set_xticks(np.arange(0,6)+0.25) ###Set positions for x ticks labels
ax.set_xticklabels(starwars_names,rotation=0,fontsize=15) ###Set x tick labels
plt.title('BEST STAR WARS MOVIE \n according to fans ',fontsize=22,fontweight=525)
plt.show()
It is no doubt that *Star Wars: The Empire Strikes Back* was again have the highest rates of people who loved it. However, people who have watched all 6 movies believed that *Star Wars II* was worsen than *Star Wars III*, which we got a different result compared to our last 2 graphs.
We have done our researches for movies, we move on to how the viewers felt about a number of the franchise's characters in terms of favorability. We will be selecting those respondents who have actually watched a Star Wars movie which is *936 viewers*, based on our previous findings.
We will filter rows and columns based on our criteria, choosing character columns and rows that contains True value in *'Have you seen any of the 6 films in the Star Wars franchise?'*
all_characters = seen_any.iloc[:,15:29].T ## We will work with a subset of our dataset called all_characters
all_characters
1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 27 | 28 | ... | 1156 | 1157 | 1159 | 1161 | 1162 | 1163 | 1164 | 1165 | 1166 | 1167 | 1168 | 1170 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | 1178 | 1180 | 1181 | 1182 | 1183 | 1185 | 1186 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Han Solo | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 4 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Luke Skywalker | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 3 | 1 |
Princess Leia Organa | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Anakin Skywalker | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 1 | 4 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 3 | 3 | 2 | 1 | 1 | 1 | ... | 1 | 1 | 3 | 1 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 3 | 1 | 1 | 2 | 1 | 2 |
Obi Wan Kenobi | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Emperor Palpatine | 1 | 4 | 1 | 2 | 3 | 1 | 2 | 1 | 2 | 4 | 1 | 4 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 3 | 2 | 1 | 1 | 1 | 2 | ... | 4 | 1 | 3 | 1 | 1 | 1 | 4 | 1 | 4 | 2 | 4 | 1 | 1 | 3 | 1 | 4 | 3 | 3 | 3 | 3 | 1 | 1 | 3 | 3 | 2 |
Darth Vader | 1 | 4 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 4 | 1 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 2 | ... | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 3 | 3 | 2 | 1 | 1 | 2 | 1 | 1 |
Lando Calrissian | 4 | 4 | 1 | 3 | 3 | 1 | 3 | 1 | 2 | 4 | 3 | 4 | 2 | 1 | 3 | 3 | 1 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | ... | 1 | 3 | 3 | 1 | 3 | 3 | 1 | 1 | 1 | 4 | 4 | 1 | 2 | 3 | 4 | 4 | 3 | 3 | 4 | 3 | 1 | 1 | 1 | 1 | 2 |
Boba Fett | 4 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 4 | 1 | 3 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 2 | 3 | 1 | 1 | 2 | ... | 1 | 1 | 3 | 1 | 3 | 2 | 2 | 3 | 1 | 1 | 4 | 1 | 2 | 3 | 1 | 4 | 1 | 3 | 4 | 3 | 1 | 1 | 4 | 1 | 4 |
C-3P0 | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 4 | 1 | 1 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
R2 D2 | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Jar Jar Binks | 1 | 4 | 1 | 2 | 1 | 2 | 2 | 2 | 1 | 4 | 1 | 4 | 1 | 2 | 3 | 1 | 2 | 2 | 2 | 3 | 2 | 1 | 1 | 2 | 1 | ... | 1 | 1 | 3 | 1 | 2 | 3 | 1 | 2 | 4 | 2 | 4 | 2 | 3 | 3 | 1 | 4 | 3 | 3 | 1 | 3 | 1 | 1 | 2 | 1 | 2 |
Padme Amidala | 1 | 4 | 1 | 1 | 3 | 1 | 2 | 2 | 2 | 4 | 1 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 3 | 3 | 3 | 3 | 1 | 1 | 1 | ... | 4 | 1 | 3 | 1 | 3 | 1 | 1 | 2 | 1 | 1 | 3 | 2 | 3 | 3 | 1 | 4 | 3 | 3 | 4 | 3 | 1 | 1 | 2 | 1 | 3 |
Yoda | 1 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
14 rows × 936 columns
Since we going to analyse which character is most loved and hated, it would be easier for us to have columns that show percentage of each favorability for each character according to our viewers.
We shall be generating 2 groups of 4 collumns, 1 group for counts while other percentage of favorabilities
cols = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']
cols_perc = ['Favorable_perc', 'Unfavorable_perc', 'Neutral_perc', 'Unfamiliar_perc']
i = 0
for c in cols: ###Creating our sum columns
i += 1
all_characters[c] = (all_characters == i).sum(axis=1) ###Count columns that contains 1,2,3 or 4
i = 0
for cp in cols_perc: ###Creating our percentage columns
all_characters[cp] = all_characters[cols[i]] / 936 * 100 ###Sum columns/936 * 100
all_characters[cp] = round(all_characters[cp],1) ###Round the percentage columns to 4 sig figures
i += 1
Lets check the data for our newly generated columns
all_characters[cols+cols_perc].sort_values('Favorable_perc',ascending=True)
Favorable | Unfavorable | Neutral | Unfamiliar | Favorable_perc | Unfavorable_perc | Neutral_perc | Unfamiliar_perc | |
---|---|---|---|---|---|---|---|---|
Jar Jar Binks | 242 | 306 | 164 | 224 | 25.9 | 32.7 | 17.5 | 23.9 |
Emperor Palpatine | 253 | 192 | 213 | 278 | 27.0 | 20.5 | 22.8 | 29.7 |
Boba Fett | 291 | 141 | 248 | 256 | 31.1 | 15.1 | 26.5 | 27.4 |
Padme Amidala | 351 | 92 | 207 | 286 | 37.5 | 9.8 | 22.1 | 30.6 |
Lando Calrissian | 365 | 71 | 236 | 264 | 39.0 | 7.6 | 25.2 | 28.2 |
Darth Vader | 481 | 251 | 84 | 120 | 51.4 | 26.8 | 9.0 | 12.8 |
Anakin Skywalker | 514 | 122 | 135 | 165 | 54.9 | 13.0 | 14.4 | 17.6 |
C-3P0 | 703 | 30 | 80 | 124 | 75.1 | 3.2 | 8.5 | 13.2 |
R2 D2 | 747 | 17 | 57 | 116 | 79.8 | 1.8 | 6.1 | 12.4 |
Yoda | 749 | 17 | 51 | 120 | 80.0 | 1.8 | 5.4 | 12.8 |
Obi Wan Kenobi | 750 | 16 | 43 | 128 | 80.1 | 1.7 | 4.6 | 13.7 |
Princess Leia Organa | 757 | 19 | 48 | 113 | 80.9 | 2.0 | 5.1 | 12.1 |
Han Solo | 762 | 9 | 44 | 122 | 81.4 | 1.0 | 4.7 | 13.0 |
Luke Skywalker | 771 | 17 | 38 | 112 | 82.4 | 1.8 | 4.1 | 12.0 |
From the percentage columns, we going to determine the most hated as well as the most loved character in the Star Wars film franchise. We will generate 4 horizontal bar plots that showed percentage of how each character was voted on whether being *Favorable, Unfavorable, Neutral* or *Unfamiliar*
all_characters = all_characters.sort_values('Favorable',ascending=True)
sns.set_style('whitegrid') ###Set to a whitegrid style from seaborn library
char_names = all_characters.index
popularity = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']
popularity_perc = ['Favorable_perc', 'Unfavorable_perc', 'Neutral_perc', 'Unfamiliar_perc']
color = ['#D81B60','#1E88E5','#FFC107','#004D40']
###Add a layout to prepare our plots
fig = plt.figure(figsize=(11,6))
i = 0
for pp in popularity_perc:
i += 1
ax = fig.add_subplot(1,4,i) ###Generating 4 plots with 4 columns and 1 row
ax.barh(char_names,all_characters[pp],color=color[i-1]) ###Generating bar plots with character names and color
ax.grid(False) ###Remove grid
ax.set_xticks([]) ###Remove X tick labels
string = popularity[i-1]
ax.set_title(string ,fontsize=16,fontweight=525) ###Set title for each of 4 bar plots
if i != 1: ###Only the first bar plot have character names
ax.set_yticks([])
perc_labels = list(all_characters[pp]) ###Add percentage numbers for all figures
y = -0.1
for pl in perc_labels:
x = pl + 1
string = str(pl) + '%'
ax.text(x,y,string,fontsize=12)
y += 1
for key,spine in ax.spines.items(): ###Remove all 4 borders
spine.set_visible(False)
plt.suptitle('\'Star Wars\' Character Favorability',fontweight='bold',fontsize=20) ###Creating an overall title
plt.show()
Based on our findings, *Luke Skywalker,Han Solo were the most favorable characters while *Jar Jar Binks* and *Darth Vader* was the most hated character in the Star Wars movies franchise. Most people werent familar with *Padme Amidala and Emperor Palpatine whereas Princess Leia Organa,Yoda and R2 D2 were among the most popular characters.
star_wars_pv = star_wars.pivot_table(values=star_wars.iloc[:,9:15],index='Location (Census Region)')
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(star_wars_pv,cmap='viridis',annot=True,lw=2,ax=ax)
ax.set_xticklabels(starwars_names,fontsize=13,rotation=0)
plt.title('Most hated Star Wars movie in the franchise according to each region',fontsize=18,fontweight=525)
plt.show()
New England
and East South Central
respondents disagreed and chose *Star Wars 2* is even worse than *Star Wars III*.star_wars_pv = star_wars.pivot_table(values=star_wars.iloc[:,3:9],index='Location (Census Region)')
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(star_wars_pv,cmap='viridis',annot=True,lw=2,ax=ax)
ax.set_xticklabels(starwars_names,fontsize=13,rotation=0)
plt.title('Most viewed Star Wars movie in the franchise according to each region',fontsize=18,fontweight=525)
plt.show()
East North Central, South Atlantic
and West South Central
are less familar with Star Wars compared to other regionsEast South Central
have the *highest rate* that had watched *Star Wars V, with 82% respondents* have watched.Based on the online survey from `FiveThirtyEight` and our research,we saw that the fifth installment of the film franchise, The Empire Strikes Back, was considered to be the best movie out of the other 5 mentioned films, whereas the third movie, Revenge of the Sith, was believed to be the least favorite movie in the franchise. We also found out that Luke Skywalker, the main protagonist was the top most favorable character while Jar Jar Binks was the utmost disliked character of all