A long time ago in a pre-pandemic world...
The data has several columns, including:
Column | Description |
---|---|
RespondentID | An anonymized ID for the respondent (person taking the survey) |
Gender | The respondent's gender |
Age | The respondent's age |
Household Income | The respondent's income |
Education | The respondent's education level |
Location (Census Region) | The respondent's location |
Have you seen any of the 6 films in the Star Wars franchise? | Has a Yes or No response |
Do you consider yourself to be a fan of the Star Wars film franchise? | Has a Yes or No response |
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
import pandas as pd
import numpy as np
import missingno as msno # check for missing records
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
pd.set_option('display.max_columns', 38)
pd.set_option('max_colwidth', 150)
pd.options.display.float_format = '{:,.3f}'.format
#read in the file
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | nan | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3,292,879,998.000 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3,292,879,538.000 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3,292,765,271.000 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3,292,763,116.000 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
It seems that the first row is not an entry but a row of subtitles for some of the columns. We'll use it later to rename the columns.
#save the 1st row for later
aux_col_names = star_wars.iloc[0, :]
#drop the 1st row
star_wars = star_wars.iloc[1:, :]
aux_col_names.reset_index()
index | 0 | |
---|---|---|
0 | RespondentID | NaN |
1 | Have you seen any of the 6 films in the Star Wars franchise? | Response |
2 | Do you consider yourself to be a fan of the Star Wars film franchise? | Response |
3 | Which of the following Star Wars films have you seen? Please select all that apply. | Star Wars: Episode I The Phantom Menace |
4 | Unnamed: 4 | Star Wars: Episode II Attack of the Clones |
5 | Unnamed: 5 | Star Wars: Episode III Revenge of the Sith |
6 | Unnamed: 6 | Star Wars: Episode IV A New Hope |
7 | Unnamed: 7 | Star Wars: Episode V The Empire Strikes Back |
8 | Unnamed: 8 | Star Wars: Episode VI Return of the Jedi |
9 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Star Wars: Episode I The Phantom Menace |
10 | Unnamed: 10 | Star Wars: Episode II Attack of the Clones |
11 | Unnamed: 11 | Star Wars: Episode III Revenge of the Sith |
12 | Unnamed: 12 | Star Wars: Episode IV A New Hope |
13 | Unnamed: 13 | Star Wars: Episode V The Empire Strikes Back |
14 | Unnamed: 14 | Star Wars: Episode VI Return of the Jedi |
15 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Han Solo |
16 | Unnamed: 16 | Luke Skywalker |
17 | Unnamed: 17 | Princess Leia Organa |
18 | Unnamed: 18 | Anakin Skywalker |
19 | Unnamed: 19 | Obi Wan Kenobi |
20 | Unnamed: 20 | Emperor Palpatine |
21 | Unnamed: 21 | Darth Vader |
22 | Unnamed: 22 | Lando Calrissian |
23 | Unnamed: 23 | Boba Fett |
24 | Unnamed: 24 | C-3P0 |
25 | Unnamed: 25 | R2 D2 |
26 | Unnamed: 26 | Jar Jar Binks |
27 | Unnamed: 27 | Padme Amidala |
28 | Unnamed: 28 | Yoda |
29 | Which character shot first? | Response |
30 | Are you familiar with the Expanded Universe? | Response |
31 | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Response |
32 | Do you consider yourself to be a fan of the Star Trek franchise? | Response |
33 | Gender | Response |
34 | Age | Response |
35 | Household Income | Response |
36 | Education | Response |
37 | Location (Census Region) | Response |
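As a rough illustration of how a saved subtitle row could drive the renaming, the sketch below builds a rename mapping on a tiny made-up frame. The toy column names, the `seen` prefix and the helper `build_rename_map` are assumptions for illustration only; the notebook itself renames the columns by hand.

```python
import pandas as pd

# Toy frame mimicking the survey layout: row 0 holds subtitles, not answers.
toy = pd.DataFrame({
    'Which films have you seen?': ['Episode I', 'Episode I', None],
    'Unnamed: 1': ['Episode II', None, 'Episode II'],
})

def build_rename_map(df, prefix):
    """Hypothetical helper: map each column to '<prefix>: <subtitle from row 0>'."""
    subtitles = df.iloc[0]
    return {col: f'{prefix}: {subtitles[col]}' for col in df.columns}

rename_map = build_rename_map(toy, 'seen')

# Drop the subtitle row, then apply the generated names.
cleaned = toy.iloc[1:].rename(columns=rename_map)
print(list(cleaned.columns))
```

Generated names like these stay descriptive but tend to be long, which is one reason to prefer short hand-written names when the columns are referenced often in code.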
#initial data exploration
star_wars.describe(include='all')
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1,186.000 | 1186 | 836 | 673 | 571 | 550 | 607 | 758 | 738 | 835 | 836 | 835 | 836 | 836 | 836 | 829 | 831 | 831 | 823 | 825 | 814 | 826 | 820 | 812 | 827 | 830 | 821 | 814 | 826 | 828 | 828 | 213 | 1068 | 1046 | 1046 | 858 | 1036 | 1043 |
unique | nan | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 3 | 2 | 2 | 2 | 2 | 4 | 5 | 5 | 9 |
top | nan | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | 5 | 6 | 1 | 1 | 2 | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Neither favorably nor unfavorably (neutral) | Neither favorably nor unfavorably (neutral) | Very favorably | Very favorably | Very unfavorably | Neither favorably nor unfavorably (neutral) | Very favorably | Han | No | No | No | Female | 45-60 | $50,000 - $99,999 | Some college or Associate degree | East North Central |
freq | nan | 936 | 552 | 673 | 571 | 550 | 607 | 758 | 738 | 237 | 300 | 217 | 204 | 289 | 232 | 610 | 552 | 547 | 269 | 591 | 213 | 310 | 236 | 248 | 474 | 562 | 204 | 207 | 605 | 325 | 615 | 114 | 641 | 549 | 291 | 298 | 328 | 181 |
mean | 3,290,128,200.533 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 1,055,638.908 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 3,288,372,923.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 3,289,450,962.750 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 3,290,147,175.500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 3,290,814,462.500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 3,292,879,998.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Noticeable data cleanliness issues:
- Several column ranges have uninformative names ('Unnamed: 4' : 'Unnamed: 8', 'Unnamed: 10' : 'Unnamed: 14').
- The 'Do you consider yourself to be a fan of the Expanded Universe?' column has more than 80% of its values null.

In this episode we'll rename the columns and also fix some of the values.
We want the new column names to be shorter, so that they are easier to refer to in the code, but still descriptive.
#renaming the column names
star_wars = star_wars.rename(columns={
'Have you seen any of the 6 films in the Star Wars franchise?': 'seen_any',
'Do you consider yourself to be a fan of the Star Wars film franchise?': 'sw_fan',
'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_ep1',
'Unnamed: 4' : 'seen_ep2',
'Unnamed: 5' : 'seen_ep3',
'Unnamed: 6' : 'seen_ep4',
'Unnamed: 7' : 'seen_ep5',
'Unnamed: 8' : 'seen_ep6',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_ep1',
'Unnamed: 10': 'ranking_ep2',
'Unnamed: 11': 'ranking_ep3',
'Unnamed: 12': 'ranking_ep4',
'Unnamed: 13': 'ranking_ep5',
'Unnamed: 14': 'ranking_ep6',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'Han Solo',
'Unnamed: 16': 'Luke Skywalker',
'Unnamed: 17': 'Princess Leia Organa',
'Unnamed: 18': 'Anakin Skywalker',
'Unnamed: 19': 'Obi Wan Kenobi',
'Unnamed: 20': 'Emperor Palpatine',
'Unnamed: 21': 'Darth Vader',
'Unnamed: 22': 'Lando Calrissian',
'Unnamed: 23': 'Boba Fett',
'Unnamed: 24': 'C-3P0',
'Unnamed: 25': 'R2 D2',
'Unnamed: 26': 'Jar Jar Binks',
'Unnamed: 27': 'Padme Amidala',
'Unnamed: 28': 'Yoda',
'Which character shot first?': 'shot_first',
'Are you familiar with the Expanded Universe?': 'know_eu',
'Do you consider yourself to be a fan of the Expanded Universe?Âæ': 'eu_fan',
'Do you consider yourself to be a fan of the Star Trek franchise?': 'st_fan'
})
star_wars.head(5)
RespondentID | seen_any | sw_fan | seen_ep1 | seen_ep2 | seen_ep3 | seen_ep4 | seen_ep5 | seen_ep6 | ranking_ep1 | ranking_ep2 | ranking_ep3 | ranking_ep4 | ranking_ep5 | ranking_ep6 | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | shot_first | know_eu | eu_fan | st_fan | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3,292,879,998.000 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3,292,879,538.000 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3,292,765,271.000 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3,292,763,116.000 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3,292,731,220.000 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
The columns 'seen_any', 'sw_fan', 'know_eu', 'eu_fan' and 'st_fan' contain either 'Yes' or 'No' values, with some missing values in between. For ease of use throughout the analysis, these values are mapped to Booleans.
# convert Yes/No to boolean
yes_no_mapping = {'Yes': True, 'No': False}
yes_no_cols = ['seen_any', 'sw_fan', 'know_eu', 'eu_fan', 'st_fan']
for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(yes_no_mapping)
star_wars.head()
RespondentID | seen_any | sw_fan | seen_ep1 | seen_ep2 | seen_ep3 | seen_ep4 | seen_ep5 | seen_ep6 | ranking_ep1 | ranking_ep2 | ranking_ep3 | ranking_ep4 | ranking_ep5 | ranking_ep6 | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | shot_first | know_eu | eu_fan | st_fan | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3,292,879,998.000 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | True | False | False | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3,292,879,538.000 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3,292,765,271.000 | True | False | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | False | NaN | False | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3,292,763,116.000 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | False | NaN | True | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3,292,731,220.000 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 4 | 6 | 2 | 1 | 3 | Very favorably | Somewhat favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very unfavorably | Somewhat favorably | Neither favorably nor unfavorably (neutral) | Very favorably | Somewhat favorably | Somewhat favorably | Very unfavorably | Somewhat favorably | Somewhat favorably | Greedo | True | False | False | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
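One detail of this mapping is easy to miss: answers that are missing stay missing, because `Series.map` returns NaN for anything not found in the dictionary. The sketch below shows this on a small made-up Series (the values are illustrative, not taken from the survey).

```python
import pandas as pd

# Made-up answers, including one missing response.
answers = pd.Series(['Yes', 'No', None, 'Yes'])
mapped = answers.map({'Yes': True, 'No': False})

# 'Yes'/'No' become True/False; the missing answer is still null.
print(mapped)
print('missing after mapping:', mapped.isnull().sum())
```

This is why the missing values in these columns still have to be handled explicitly later on.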
The columns 'seen_ep1' : 'seen_ep6' indicate whether the respondent saw the corresponding movie or not. If the movie name is listed, the respondent saw the episode; NaN values indicate that either the question was not answered or the respondent didn't see the movie. We'll convert these columns to Boolean type as well. The values with the movie name will be converted to True and the null values to False.
#convert 'seen_ep?' columns to boolean
star_wars.loc[:,'seen_ep1':'seen_ep6'] = star_wars.loc[:,'seen_ep1':'seen_ep6'].notnull()
star_wars.loc[:,'seen_ep1':'seen_ep6'].head()
seen_ep1 | seen_ep2 | seen_ep3 | seen_ep4 | seen_ep5 | seen_ep6 | |
---|---|---|---|---|---|---|
1 | True | True | True | True | True | True |
2 | False | False | False | False | False | False |
3 | True | True | True | False | False | False |
4 | True | True | True | True | True | True |
5 | True | True | True | True | True | True |
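The same conversion can be tried on a toy frame to see what it does and why it is convenient. The data below is made up; only the column naming follows the notebook.

```python
import pandas as pd

# Made-up answers: a film title means "seen", a null means "not seen or not answered".
toy = pd.DataFrame({
    'seen_ep1': ['Star Wars: Episode I The Phantom Menace', None],
    'seen_ep2': [None, None],
})

seen_flags = toy.notnull()
print(seen_flags)

# Summing Boolean columns counts the True values, i.e. the viewers per film.
viewers = seen_flags.sum()
print(viewers)
```

Note that after this step an unanswered question and an unseen film can no longer be told apart, which matches the caveat stated above.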
The next six columns ask the respondent to rank the Star Wars movies in order of preference. 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of these columns can contain the value 1, 2, 3, 4, 5, 6, or NaN. For the further analysis we'll convert these columns to numeric type and invert the ranking, so that 6 means the most favorite film and 1 the least favorite.
#convert the rankings to float and invert them (7 - rank), so that 6 = most favorite
rankings = star_wars.loc[:,'ranking_ep1':'ranking_ep6'].astype(float)
star_wars.loc[:,'ranking_ep1':'ranking_ep6'] = 7 - rankings
star_wars.loc[:,'ranking_ep1':'ranking_ep6'].head()
ranking_ep1 | ranking_ep2 | ranking_ep3 | ranking_ep4 | ranking_ep5 | ranking_ep6 | |
---|---|---|---|---|---|---|
1 | 4.000 | 5.000 | 6.000 | 3.000 | 2.000 | 1.000 |
2 | nan | nan | nan | nan | nan | nan |
3 | 6.000 | 5.000 | 4.000 | 3.000 | 2.000 | 1.000 |
4 | 2.000 | 1.000 | 6.000 | 5.000 | 3.000 | 4.000 |
5 | 2.000 | 3.000 | 1.000 | 5.000 | 6.000 | 4.000 |
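The inversion is simply `7 - rank`, and it leaves missing rankings untouched. A minimal sketch on made-up values (the ranks are stored as strings here, as they are right after reading the CSV):

```python
import numpy as np
import pandas as pd

# Made-up rankings as read from a CSV: strings plus a missing value.
ranks = pd.Series(['1', '6', np.nan, '3'])

# Convert to float, then invert: 1 -> 6, 6 -> 1, 3 -> 4; NaN stays NaN.
inverted = 7 - ranks.astype(float)
print(inverted)

# Applying the same transformation again restores the original ranking.
restored = 7 - inverted
```

After the inversion a higher number means a better-liked film, so a higher mean can be read directly as "more popular".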
The main Star Wars character columns are answers to the question 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her'.
We will convert the rating system to numbers so that we can make calculations:
- Very favorably: for each rating the character will receive 2 points
- Somewhat favorably: for each rating the character will receive 1 point
- Neither favorably nor unfavorably (neutral): 0 points
- Somewhat unfavorably: for each rating the character will be deducted 1 point
- Very unfavorably: for each rating the character will be deducted 2 points
- Unfamiliar (N/A): 0 points

points = {
'Very favorably': 2,
'Somewhat favorably': 1,
'Neither favorably nor unfavorably (neutral)': 0,
'Somewhat unfavorably': -1,
'Very unfavorably': -2,
'Unfamiliar (N/A)': 0,
np.nan: 0  #unanswered ratings also count as 0 points
}
for col in star_wars.loc[:, 'Han Solo' : 'Yoda']:
    star_wars[col] = star_wars[col].map(points)
star_wars.loc[:, 'Han Solo' : 'Yoda'].head()
Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | -1 | 2 | 2 | 2 | 2 | 2 |
5 | 2 | 1 | 1 | -1 | 2 | -2 | 1 | 0 | 2 | 1 | 1 | -2 | 1 | 1 |
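Once the answers are numeric, a net score per character is just a column sum. The sketch below shows the idea on made-up answers for two characters; the resulting numbers say nothing about the real survey.

```python
import pandas as pd

points = {
    'Very favorably': 2,
    'Somewhat favorably': 1,
    'Neither favorably nor unfavorably (neutral)': 0,
    'Somewhat unfavorably': -1,
    'Very unfavorably': -2,
    'Unfamiliar (N/A)': 0,
}

# Made-up answers from three respondents.
toy = pd.DataFrame({
    'Han Solo': ['Very favorably', 'Somewhat favorably', 'Unfamiliar (N/A)'],
    'Jar Jar Binks': ['Very unfavorably', 'Somewhat favorably', 'Very unfavorably'],
})

# Map every column to points, then add them up per character.
scores = toy.apply(lambda col: col.map(points))
totals = scores.sum().sort_values(ascending=False)
print(totals)
```

One design consequence worth keeping in mind: 'Unfamiliar (N/A)' and a neutral opinion both score 0, so the totals alone cannot separate unknown characters from characters people feel indifferent about.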
This episode is all about dealing with missing values. We'll start by identifying them.
#overview of missing values
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)
- 'eu_fan' is the column with the most missing values.
- The 'seen_any' column might help us deal with part of the missing data. The respondents who haven't seen any of the Star Wars movies probably left unanswered the questions where they were asked to rank each movie, and didn't understand the question about who shot first.
print('The number of respondents who haven´t seen any of the Star Wars movies:', star_wars[star_wars['seen_any'] == False].shape[0])
print('They left the following questions without an answer:')
star_wars[star_wars['seen_any'] == False].isnull().sum()
The number of respondents who haven´t seen any of the Star Wars movies: 250
They left the following questions without an answer:
RespondentID 0
seen_any 0
sw_fan 250
seen_ep1 0
seen_ep2 0
seen_ep3 0
seen_ep4 0
seen_ep5 0
seen_ep6 0
ranking_ep1 250
ranking_ep2 250
ranking_ep3 250
ranking_ep4 250
ranking_ep5 250
ranking_ep6 250
Han Solo 0
Luke Skywalker 0
Princess Leia Organa 0
Anakin Skywalker 0
Obi Wan Kenobi 0
Emperor Palpatine 0
Darth Vader 0
Lando Calrissian 0
Boba Fett 0
C-3P0 0
R2 D2 0
Jar Jar Binks 0
Padme Amidala 0
Yoda 0
shot_first 250
know_eu 250
eu_fan 250
st_fan 10
Gender 24
Age 24
Household Income 67
Education 30
Location (Census Region) 25
dtype: int64
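The check above can be sketched on a toy frame, here using the share of missing values rather than the raw count, which is easier to compare across groups of different size. The data is made up, and the toy 'seen_any' column is a clean Boolean column, so `~` can be used for negation (with missing values present, the `== False` comparison used in the notebook is the safer choice).

```python
import numpy as np
import pandas as pd

# Made-up respondents: two of them have not seen any film.
toy = pd.DataFrame({
    'seen_any': [True, False, False],
    'ranking_ep1': [6.0, np.nan, np.nan],
    'Gender': ['Male', 'Female', None],
})

not_seen = toy[~toy['seen_any']]

# Fraction of missing values per column within the "not seen" group.
missing_share = not_seen.isnull().mean()
print(missing_share)
```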
This confirms that the entries from respondents who haven't seen any of the Star Wars movies will not contribute to the analysis, so we'll continue only with the answers from the respondents who saw at least one episode of the saga.
#.copy() gives an independent frame, so the assignments below don't act on a view of the original
star_wars = star_wars[star_wars['seen_any'] == True].copy()
The null values in the 'sw_fan', 'know_eu' and 'st_fan' columns can be imputed with False, and the null values in the 'shot_first' column with "I don't understand this question".
star_wars[['sw_fan', 'know_eu', 'st_fan']] = star_wars[['sw_fan', 'know_eu', 'st_fan']].fillna(False)
#use the same spelling (straight apostrophe) as the existing survey answer, so no extra category is created
star_wars['shot_first'] = star_wars['shot_first'].fillna("I don't understand this question")
As for the 'eu_fan' column, similarly to 'sw_fan', we assume that a person who is not familiar with the Expanded Universe can't be a fan of it.
star_wars.loc[star_wars['know_eu'] == False, 'eu_fan'] = False
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)
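The rule used for 'eu_fan' (not familiar with the Expanded Universe implies not a fan) is a conditional imputation: only the rows matching a condition in one column are filled in another. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up respondents: the last two are not familiar with the Expanded Universe.
toy = pd.DataFrame({
    'know_eu': [True, False, False],
    'eu_fan': [True, np.nan, np.nan],
})

# Fill 'eu_fan' only where 'know_eu' is False; other rows are left as they are.
toy.loc[toy['know_eu'] == False, 'eu_fan'] = False
print(toy)
```

Rows where 'know_eu' is True but 'eu_fan' is missing would stay null under this rule and need a separate decision.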
There might be a group of respondents who answered the first question (whether they had seen any movie of the saga) but then lost interest in completing the survey and left the other questions unanswered.
#rows where all six ranking columns are missing
star_wars[star_wars.loc[:, 'ranking_ep1':'ranking_ep6'].isnull().all(axis=1)].describe(include='all')
RespondentID | seen_any | sw_fan | seen_ep1 | seen_ep2 | seen_ep3 | seen_ep4 | seen_ep5 | seen_ep6 | ranking_ep1 | ranking_ep2 | ranking_ep3 | ranking_ep4 | ranking_ep5 | ranking_ep6 | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | shot_first | know_eu | eu_fan | st_fan | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 100.000 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100.000 | 100 | 100 | 100 | 100 | 0 | 0 | 0 | 0 | 0 |
unique | nan | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
top | nan | True | False | False | False | False | False | False | False | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | I don´t understand this question | False | False | False | NaN | NaN | NaN | NaN | NaN |
freq | nan | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 100 | 100 | 100 | 100 | NaN | NaN | NaN | NaN | NaN |
mean | 3,290,145,657.870 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 886,869.706 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 3,288,455,900.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 3,289,614,702.500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 3,290,353,222.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 3,290,725,892.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 3,292,637,870.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | nan | nan | nan | nan | nan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Having confirmed the above hypothesis, we can drop those rows without hesitation.
star_wars.drop(star_wars[star_wars['ranking_ep1'].isnull()].index, inplace=True)
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)
The data in the 'Household Income' column is categorical. Given the significant number of missing values in it, we wouldn't want to lose that much information. For now we'll fill them with 'no info', which effectively introduces a new category. The rest of the missing values will be replaced by the most common value in each column.
star_wars['Household Income'] = star_wars['Household Income'].fillna('no info')
star_wars = star_wars.fillna(star_wars.mode().iloc[0])
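To make the two-step fill easier to follow, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the survey; only the pattern (`fillna` with a label, then `fillna` with `mode().iloc[0]`) mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two categorical columns
toy = pd.DataFrame({
    'Household Income': ['$0 - $24,999', np.nan, '$50,000 - $99,999', np.nan],
    'Education': ['Bachelor degree', 'Bachelor degree', np.nan, 'Graduate degree'],
})

# Step 1: keep missing incomes as their own category
toy['Household Income'] = toy['Household Income'].fillna('no info')

# Step 2: mode() returns one row per tied mode; iloc[0] takes the first row,
# i.e. a Series holding the most common value of each column.
# fillna() then matches that Series to the frame by column name.
most_common = toy.mode().iloc[0]
toy_filled = toy.fillna(most_common)

print(toy_filled)
```

Note that the mode-based fill quietly assigns the majority answer to everyone who skipped a question, so it is probably best kept for columns with few gaps.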
Not only do the films compete with each other for people's love and recognition, but so do the trilogies: Episodes IV - VI, originally released between 1977 and 1983, and the prequel trilogy, released later, from 1999 through 2005.
First, we'll find out how many respondents saw each movie and thus which movie is the most seen.
#full movie titles, to be used for plot labeling
titles = [
'Episode I The Phantom Menace',
'Episode II Attack of the Clones',
'Episode III Revenge of the Sith',
'Episode IV A New Hope',
'Episode V The Empire Strikes Back',
'Episode VI Return of the Jedi'
]
#columns indicating whether a respondent has seen each movie
seen_cols = ['seen_ep1', 'seen_ep2', 'seen_ep3', 'seen_ep4', 'seen_ep5', 'seen_ep6']
#the columns which refer to movie rankings
rank_cols = ['ranking_ep1', 'ranking_ep2', 'ranking_ep3', 'ranking_ep4', 'ranking_ep5', 'ranking_ep6']
#prepare the dataframe for plotting
seen = pd.DataFrame(data=[titles,
star_wars.loc[:,seen_cols].sum(),
(star_wars.loc[:,seen_cols].sum()/star_wars.loc[:,seen_cols].shape[0]).round(2)]).T
seen.columns = ['Star Wars movie', 'Number of views', 'Views_per']
#define the average views per trilogy
trilogy_views = ['',] * 6
trilogy_views[:3] = [seen.loc[0:2, 'Views_per'].sum()/3, ] * 3
trilogy_views[-3:] = [seen.loc[3:5, 'Views_per'].sum()/3, ] * 3
seen['views_per_trilogy'] = trilogy_views
#plot
fig = px.bar(seen, x='Number of views', y='Star Wars movie', orientation='h',
custom_data=['Views_per', 'views_per_trilogy'],
category_orders={'Star Wars movie':titles})
#plot aesthetics
##color map highlighting only the most seen movie
colors=[]
for val in seen['Number of views']:
    colors.append('rgb(252, 128, 14)' if val == seen['Number of views'].max() else 'rgb(137, 137, 137)')
fig.update_traces(hovertemplate='<i>Views:</i> %{x} <br><i>seen by %{customdata[0]:.0%} of respondents</i> <br> (the trilogy seen by %{customdata[1]:.0%} of respondents) ',
marker_color=colors)
fig.update_layout(title={'text':'<b>Views received by each movie in the Star Wars franchise</b><br>based on 835 respondents',
'font':{'size':22}},
yaxis = {'ticksuffix': ' ',
'tickfont':{'size':16}})
fig.show()
Episode V is the most-seen movie of the saga (seen by 91% of respondents), followed by Episode VI. On average, the original trilogy has more views than the prequel trilogy (84% vs. 72%).
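The per-movie and per-trilogy shares quoted above can be computed quite compactly, because the mean of a boolean column is the fraction of `True` values. The sketch below uses four made-up respondents, not the survey data; only the column names follow `seen_cols`.

```python
import pandas as pd

# Hypothetical boolean "seen" flags for four respondents
toy_seen = pd.DataFrame({
    'seen_ep1': [True, False, True, False],
    'seen_ep2': [True, False, False, False],
    'seen_ep3': [True, False, True, False],
    'seen_ep4': [True, True, True, False],
    'seen_ep5': [True, True, True, True],
    'seen_ep6': [True, True, True, False],
})

counts = toy_seen.sum()    # True counts as 1, so this is the number of viewers
shares = toy_seen.mean()   # fraction of respondents who saw each movie

# Average share per trilogy
prequel_share = shares[['seen_ep1', 'seen_ep2', 'seen_ep3']].mean()
original_share = shares[['seen_ep4', 'seen_ep5', 'seen_ep6']].mean()

# Column name of the most-seen movie
most_seen = counts.idxmax()

print(most_seen, round(prequel_share, 2), round(original_share, 2))
```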
Now, let's proceed to the rankings. Remember that although the FiveThirtyEight team asked respondents to score their favorite movie with 1 and their least favorite with 6, we inverted the rankings earlier for a clearer visual presentation.
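As a reminder of what such an inversion looks like, here is a tiny sketch. It assumes the inversion was done by subtracting each rank from 7; the exact expression used earlier in the notebook may differ, and the sample ranks are made up.

```python
import pandas as pd

# Hypothetical raw survey ranks for one respondent: 1 = favorite, 6 = least favorite
raw = pd.Series([3, 2, 1, 4, 6, 5],
                index=[f'ranking_ep{i}' for i in range(1, 7)])

# Assumed inversion: subtracting from 7 maps 1 -> 6, 2 -> 5, ..., 6 -> 1,
# so a longer bar now means a more liked movie
inverted = 7 - raw

print(inverted)
```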
#prepare the data for plotting
ranking = star_wars[rank_cols].mean()
ranking.index = titles
#plot
fig = px.bar(ranking, x=ranking.values, y=ranking.index, orientation='h',
labels={'x': 'average rank', 'y': ''},
category_orders={'Star Wars movie':titles})
#plot aesthetics
##color map highlighting only the best ranked movie
colors=[]
for val in ranking.values:
    colors.append('rgb(252, 128, 14)' if val == ranking.values.max() else 'rgb(137, 137, 137)')
fig.update_traces(hovertemplate='<i>Average rank:</i> %{x:.2f}', marker_color=colors)
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise</b><br>based on 835 respondents',
'font':{'size':22}},
yaxis = {'ticksuffix': ' ',
'tickfont':{'size':16}})
fig.show()
The favorite movie, without a doubt, is Episode V The Empire Strikes Back, with an average rank of 4.49. The three favorite films are those of the original trilogy.
Let's now see whether the respondents' scores differ much between socio-demographic groups. To do this, we'll need to transform our dataframe into a long format.
# grouping the columns
id_vars = ['sw_fan', 'Gender', 'Age', 'Location (Census Region)']
# long-type table for movie ranks
ranks_melt = pd.melt(star_wars, id_vars=id_vars, var_name='movie',
value_vars=rank_cols, value_name='movie_rank')
movie_dict = {'ranking_ep1': 'Episode I The Phantom Menace',
'ranking_ep2': 'Episode II Attack of the Clones',
'ranking_ep3': 'Episode III Revenge of the Sith',
'ranking_ep4': 'Episode IV A New Hope',
'ranking_ep5': 'Episode V The Empire Strikes Back',
'ranking_ep6': 'Episode VI Return of the Jedi'}
ranks_melt['movie'] = ranks_melt['movie'].map(movie_dict)
#check the result
ranks_melt.head()
sw_fan | Gender | Age | Location (Census Region) | movie | movie_rank | |
---|---|---|---|---|---|---|
0 | True | Male | 18-29 | South Atlantic | Episode I The Phantom Menace | 4.000 |
1 | False | Male | 18-29 | West North Central | Episode I The Phantom Menace | 6.000 |
2 | True | Male | 18-29 | West North Central | Episode I The Phantom Menace | 2.000 |
3 | True | Male | 18-29 | West North Central | Episode I The Phantom Menace | 2.000 |
4 | True | Male | 18-29 | Middle Atlantic | Episode I The Phantom Menace | 6.000 |
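For readers less familiar with `pd.melt`, here is a minimal sketch of the same wide-to-long reshaping on a two-row toy frame. The values are made up and only two ranking columns are used to keep the output short.

```python
import pandas as pd

# Wide format: one row per respondent, one column per movie ranking
toy_wide = pd.DataFrame({
    'Gender': ['Male', 'Female'],
    'ranking_ep1': [4, 6],
    'ranking_ep5': [6, 5],
})

# Long format: one row per (respondent, movie) pair.
# id_vars are repeated on every row; value_vars are stacked into two columns:
# 'movie' (the former column name) and 'movie_rank' (the former cell value).
toy_long = pd.melt(toy_wide, id_vars=['Gender'],
                   value_vars=['ranking_ep1', 'ranking_ep5'],
                   var_name='movie', value_name='movie_rank')

print(toy_long)
```

The long table has as many rows as respondents times movies, which is what lets a single `movie` column drive the facets and axes in the plots.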
Now, let's check whether anything stands out when the responses are split by gender.
#prepare data for plotting
ranks_sex = ranks_melt.pivot_table(index=['movie', 'Gender'], values='movie_rank').reset_index()
#plot
fig = px.bar(ranks_sex, x='movie_rank', y='movie', orientation='h',
facet_col = 'Gender', facet_col_wrap=1, barmode='group',
labels = {'movie_rank': 'average rank',
'movie': 'Star Wars movie'},
category_orders={'movie':titles}
)
#plot aesthetics
##color map highlighting the best ranked movie and other points of interest
colors_of_interest = list(colors)
colors_of_interest[0] = 'rgb(95, 158, 209)'
fig.update_traces(hovertemplate='<i>Average rank:</i> %{x}', marker_color = colors) #highlight the best ranked movie
fig.update_traces(row=0, marker_color = colors_of_interest) #apply 'colors_of_interest' to the first facet plot
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise, by gender</b>',
'font':{'size':22}})
fig.update_yaxes(ticksuffix=' ', tickfont={'size':14})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1]))
##draw the legend manually
###'best ranked' rectangle
fig.add_shape(type='rect',
xref='paper', yref='paper',
x0=3.5, x1=3.9, y0=4.6, y1=5.0,
col=1, row=1,
line={'width': 0.8,
'color': 'rgb(252, 128, 14)' },
fillcolor= 'rgb(252, 128, 14)')
###'best ranked' text
fig.add_annotation(text='Best ranked', xref='paper', yref='paper', x=4.3, y=4.8, align='left',
col=1, row=1, showarrow=False)
###'of interest' rectangle
fig.add_shape(type='rect',
xref='paper', yref='paper',
x0=3.5, x1=3.9, y0=3.9, y1=4.3,
col=1, row=1,
line={'width': 0.8,
'color': 'rgb(95, 158, 209)'},
fillcolor='rgb(95, 158, 209)')
###'of interest' text
fig.add_annotation(text=' Of interest', xref='paper', yref='paper', x=4.3, y=4.1, align='left',
col=1, row=1, showarrow=False)
fig.show()
Men and women agree on the favorite movie, which is Episode V. Interestingly, the male respondents seem quite clear about which trilogy they prefer, giving the best scores to the original one: none of the prequel movies has an average ranking higher than 3. Although the female respondents also seem to like Episodes IV - VI more, their third favorite movie is Episode I The Phantom Menace.
Let's see if being a Star Wars fan changes anything.
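A note on how the averages behind these bars are obtained: `pivot_table` aggregates with the mean by default, so the call used above is equivalent to a `groupby(...).mean()`. The sketch below checks this on made-up ranks (shortened movie titles, hypothetical values).

```python
import pandas as pd

# Hypothetical long-format ranks
toy_long = pd.DataFrame({
    'movie': ['Episode I', 'Episode I', 'Episode I',
              'Episode V', 'Episode V', 'Episode V'],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female'],
    'movie_rank': [2.0, 4.0, 5.0, 6.0, 5.0, 6.0],
})

# pivot_table with no aggfunc argument averages the values per index group
by_pivot = toy_long.pivot_table(index=['movie', 'Gender'],
                                values='movie_rank').reset_index()

# The same numbers via an explicit groupby
by_groupby = (toy_long.groupby(['movie', 'Gender'])['movie_rank']
              .mean().reset_index())

print(by_pivot)
```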
ranks_sex_sw_fan = ranks_melt.pivot_table(index=['movie', 'Gender', 'sw_fan'], values='movie_rank').reset_index()
fig = px.bar(ranks_sex_sw_fan, x='movie_rank', y='movie', orientation='h',
facet_col = 'Gender', facet_row='sw_fan', barmode='group',
labels = {'movie_rank': 'average rank',
'movie': 'Star Wars movie'},
category_orders={'movie':titles}
)
fig.update_traces(hovertemplate='<i>Average rank:</i> %{x}', marker_color = colors)
fig.update_traces(row=0, marker_color = colors_of_interest)
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise, by gender and fans</b>',
'font':{'size':22}})
fig.update_yaxes(ticksuffix=' ', tickfont={'size':13})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1].replace('False', 'Not Fan').replace('True', 'Fan')))
fig.add_shape(type='rect',
xref='paper', yref='paper',
x0=3.1, x1=3.5, y0=4.6, y1=5.0,
col=2, row=1,
line={'width': 0.8,
'color': 'rgb(252, 128, 14)' },
fillcolor= 'rgb(252, 128, 14)'
)
fig.add_annotation(text='Best ranked', xref='paper', yref='paper', x=4.2, y=4.8, align='left', col=2, row=1, showarrow=False)
fig.add_shape(type='rect',
xref='paper', yref='paper',
x0=3.1, x1=3.5, y0=3.9, y1=4.3,
col=2, row=1,
line={'width': 0.8,
'color': 'rgb(95, 158, 209)'},
fillcolor='rgb(95, 158, 209)'
)
fig.add_annotation(text='Of interest', xref='paper', yref='paper', x=4.3, y=4.1, align='left', col=2, row=1, showarrow=False)
fig.show()
Both male and female fans are loyal to the original trilogy, although they score the prequel movies somewhat differently. Those who don't consider themselves Star Wars fans give very high scores to Episode I. It's the second favorite movie among the male non-fans and shares first place with Episode V among the female non-fans.
It's also interesting to see how respondents of different ages rate the movies.
#prepare data for plotting
ranks_age = ranks_melt.pivot_table(index=['movie', 'Age'], values='movie_rank').reset_index()
#plot
fig = px.bar(ranks_age, x='movie_rank', y='movie', orientation='h',
facet_col = 'Age', facet_col_wrap=2,
category_orders={'movie':titles}
)
#plot aesthetics
fig.update_traces(hovertemplate='<i>Average rank:</i> %{x}', marker_color = colors)
fig.update_traces(row=1, col=2, marker_color = colors_of_interest)
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise, by age</b>',
'font':{'size':22}})
fig.update_yaxes(ticksuffix=' ', tickfont={'size':13})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1] + ' ' + 'y.o.'))
##draw the legend manually
###'best ranked' rectangle
fig.add_shape(type='rect',
xref='paper', yref='paper',
x0=3.1, x1=3.5, y0=4.6, y1=5.0,
col=1, row=0,
line={'width': 0.8,
'color': 'rgb(252, 128, 14)' },
fillcolor= 'rgb(252, 128, 14)')
###'best ranked' text
fig.add_annotation(text='Best ranked', xref='paper', yref='paper', x=4.2, y=4.8,
align='left', col=1, row=0, showarrow=False)
###'of interest' rectangle
fig.add_shape(type='rect',
xref='paper', yref='paper',
x0=3.1, x1=3.5, y0=3.9, y1=4.3,
col=1, row=0,
line={'width': 0.8,
'color': 'rgb(95, 158, 209)'},
fillcolor='rgb(95, 158, 209)')
###'of interest' text
fig.add_annotation(text='Of interest', xref='paper', yref='paper', x=4.3, y=4.1,
align='left', col=1, row=0, showarrow=False)
fig.show()
The surprising Episode I The Phantom Menace phenomenon also holds for respondents older than 60.
We definitely can't leave this phenomenon without further investigation. Let's check who gives this episode a high score (4 to 6) and who gives it a low score (1 to 3).
#define the number of each score given by different groups of respondents
rank_per_age = pd.crosstab([ranks_melt['sw_fan'], ranks_melt['Age'], ranks_melt['Gender'],
ranks_melt['movie']], ranks_melt['movie_rank'], colnames=[''])
#retrieve the info related to the 'Episode I' only
rank_per_age_ep1 = rank_per_age.xs('Episode I The Phantom Menace', level='movie').reset_index()
#calculate the number of high scores and low scores
rank_per_age_ep1['low_rank'] = rank_per_age_ep1[[1.0, 2.0, 3.0]].sum(axis=1)
rank_per_age_ep1['high_rank'] = rank_per_age_ep1[[4.0, 5.0, 6.0]].sum(axis=1)
#drop the individual score columns
rank_per_age_ep1 = rank_per_age_ep1.drop(rank_per_age_ep1.columns[3:9], axis=1)
#melt the data to a long-type format
rank_per_age_ep1 = pd.melt(rank_per_age_ep1, id_vars=['sw_fan', 'Age', 'Gender'], value_vars=['low_rank', 'high_rank'],
var_name='rank', value_name='number of votes')
#plot
fig = px.bar(rank_per_age_ep1, x='Age', y='number of votes', color='rank', barmode='group',
facet_row = 'sw_fan', facet_col='Gender', facet_row_spacing=0.08,
color_discrete_map={'low_rank': 'rgb(0, 107, 164)', 'high_rank': 'rgb(200, 82, 0)'})
#plot aesthetics
fig.update_traces(hovertemplate='<i>Number of votes:</i> %{y}')
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1].replace('False', 'Not Fan').replace('True', 'Fan')))
fig.update_layout(title={'text':'<b>Episode I ranks distribution, by age and gender</b>',
'font':{'size':22}})
fig.update_xaxes(showticklabels=True)
fig.update_yaxes(ticksuffix=' ')
fig.show()
Female non-fans of all ages tend to give Episode I higher scores, while the same is true only for male non-fans older than 45. These higher scores from non-fans are more than offset by the lower scores from both male and female Star Wars fans of all ages except those older than 60, where the higher and lower scores are split almost equally.
And finally, let's see if the preferences vary across U.S. regions.
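The low/high bucketing behind this chart can be sketched on a handful of made-up votes. The `group` column here is a hypothetical stand-in for the fan/age/gender grouping. The `reindex` step is an extra guard that the cell above does not need on the full data: it only matters when some score was never given.

```python
import pandas as pd

# Hypothetical Episode I scores (already inverted: 6 = best) for two groups
toy = pd.DataFrame({
    'group': ['Fan', 'Fan', 'Fan', 'Not fan', 'Not fan', 'Not fan'],
    'movie_rank': [1.0, 2.0, 5.0, 4.0, 6.0, 3.0],
})

# One row per group, one column per score; unused combinations become 0
score_counts = pd.crosstab(toy['group'], toy['movie_rank'])

# Guarantee that all six score columns exist, even if a score was never used
score_counts = score_counts.reindex(columns=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                                    fill_value=0)

# Collapse the six scores into two buckets
buckets = pd.DataFrame({
    'low_rank': score_counts[[1.0, 2.0, 3.0]].sum(axis=1),
    'high_rank': score_counts[[4.0, 5.0, 6.0]].sum(axis=1),
})

print(buckets)
```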
#prepare the data for plotting (movie rankings by the U.S. region)
rank_per_region = ranks_melt.pivot_table(index=['Location (Census Region)', 'movie'], values='movie_rank').unstack(level=0)
#plot
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(rank_per_region,cmap='viridis',annot=True,lw=2,ax=ax, cbar=False)
#plot aesthetics
ax.set_xticklabels(rank_per_region.columns.droplevel(),fontsize=13,rotation=90)
ax.tick_params(left=False, bottom=False)
ax.set(xlabel='U.S. Census Region', ylabel='')
plt.title('Best Star Wars movie in the franchise according to the U.S. region',fontsize=18,fontweight=525)
The Empire Strikes Back remains the favorite movie across the different regions of the United States. Curiously, the East South Central respondents are more extreme in their judgments: both the highest and the lowest average rankings belong to them, given to Episode V and Episode II respectively. And West South Central is the only region where a movie from the prequel trilogy, Episode I in this case, is placed among the top 3 favorite movies of the whole saga.
We already know what the respondents think about the movies of the Star Wars franchise; now it's time to discover which characters are the most liked and the most hated.
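Observations like these can also be read off programmatically rather than by eye. The sketch below builds a small movie-by-region matrix from made-up ranks and locates the best movie per region and the regions holding the extreme averages. Unlike the cell above, it selects the `movie_rank` column before unstacking so the resulting columns are flat; that is a convenience for the sketch, not how the notebook does it.

```python
import pandas as pd

# Hypothetical long-format ranks for two regions and two movies
toy_long = pd.DataFrame({
    'Location (Census Region)': ['Pacific', 'Pacific', 'Mountain',
                                 'Mountain', 'Pacific', 'Mountain'],
    'movie': ['Episode I', 'Episode V', 'Episode I',
              'Episode V', 'Episode V', 'Episode I'],
    'movie_rank': [3.0, 5.0, 3.0, 6.0, 4.0, 1.0],
})

# Rows: movies, columns: regions, cells: average rank
matrix = (toy_long
          .pivot_table(index=['Location (Census Region)', 'movie'],
                       values='movie_rank')['movie_rank']
          .unstack(level=0))

top_per_region = matrix.idxmax()           # best movie in each region
region_of_highest = matrix.max().idxmax()  # region holding the highest average
region_of_lowest = matrix.min().idxmin()   # region holding the lowest average

print(matrix)
print(top_per_region)
```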
#calculate the average rating of each character
char_rating = star_wars.loc[:, 'Han Solo' : 'Yoda'].mean().sort_values(ascending=True)
#plot
fig = px.bar(char_rating, x=char_rating.values, y=char_rating.index, orientation='h',
labels={'x': 'average rating', 'y': 'Star Wars character'})
#plot aesthetics
##color map highlighting the positive and negative ratings
colors=[]
for val in char_rating.values:
    colors.append('#FC4040' if val < 0 else '#009999')
fig.update_layout(title={'text':'<b>Rating of the popular characters from the Star Wars franchise</b>',
'font':{'size':22}})
fig.update_xaxes(range=[-.5, 2])
fig.update_traces(hovertemplate='<i>Average rating:</i> %{x}', marker_color=colors)
fig.show()
Han Solo, Yoda, Obi-Wan Kenobi, Luke Skywalker, R2-D2, Princess Leia and C-3PO are generally viewed as likable characters, while the others are seen as controversial, neutral or negative. Han Solo has the highest rating and Jar Jar Binks the lowest.
We won't settle for the general overview: different socio-demographic groups might think differently.
char_cols=star_wars.columns[15:29]
#calculate mean rating by gender and fanship
chars_per_fan = star_wars.pivot_table(index=['Gender', 'sw_fan'], values=char_cols).reset_index()
chars_per_fan = pd.melt(chars_per_fan, id_vars=['Gender', 'sw_fan'], var_name='character',
value_vars=char_cols, value_name='char_rating')
#plot
fig = px.bar(chars_per_fan, x='char_rating', y='character', orientation='h',
color = 'Gender', facet_col = 'sw_fan', barmode='group',
labels = {'char_rating': 'average rating',
'character': ''},
category_orders = {'character': char_rating.index[::-1]},
color_discrete_map={'Male': 'rgb(0, 107, 164)', 'Female': 'rgb(200, 82, 0)'},
height=600)
#plot aesthetics
fig.update_traces(hovertemplate='%{y}, <i>average rating:</i> %{x:.2f}')
fig.update_xaxes(matches=None)
fig.update_layout(title={'text':'<b>Rating of the popular Star Wars characters, by gender and fanship </b>',
'font':{'size':22}})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1].replace('False', 'Not Fan').replace('True', 'Fan')))
fig.show()
As expected, men and women seem to have somewhat different opinions on this point. Jar Jar Binks is the most hated character, but not among women: female fans think the worst character is Emperor Palpatine, and female non-fans see Darth Vader as the worst one. Jar Jar Binks is also more hated by the male fans than by the male non-fans. As for the favorite character, the female non-fans like R2-D2 most, while female fans seem to like Yoda a bit more than the others. Male fans and non-fans seem to agree on their favorite, Han Solo.
Let's see what respondents of different ages think about the saga's characters.
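The "most liked" and "most hated" character per group can be extracted directly from the group means with `idxmax` and `idxmin` along the columns. The sketch below uses three characters and made-up ratings; the numeric scale from -2 to 2 is an assumption about how the favorability answers were encoded earlier in the notebook.

```python
import pandas as pd

# Hypothetical numeric ratings (assumed scale: -2 = very unfavorable, 2 = very favorable)
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Han Solo': [2, 1, 1, 2],
    'Jar Jar Binks': [-2, -1, 0, 1],
    'Darth Vader': [1, 0, -1, -2],
})
toy_char_cols = ['Han Solo', 'Jar Jar Binks', 'Darth Vader']

# Average rating of each character within each group
group_means = toy.groupby('Gender')[toy_char_cols].mean()

# For every group (row), the column name with the lowest / highest mean
least_liked = group_means.idxmin(axis=1)
most_liked = group_means.idxmax(axis=1)

print(least_liked)
print(most_liked)
```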
#calculate the mean character rating by gender and age
chars_per_age = star_wars.pivot_table(index=['Gender', 'Age'], values=char_cols).reset_index()
chars_per_age = pd.melt(chars_per_age, id_vars=['Gender', 'Age'], var_name='character',
value_vars=char_cols, value_name='char_rating')
#plot
fig = px.bar(chars_per_age, x='char_rating', y='character', orientation='h', barmode='group',
color = 'Gender', facet_col = 'Age', facet_col_wrap=2, facet_row_spacing=0.03,
labels = {'char_rating': 'average rating',
'character': 'Star Wars character'},
category_orders = {'character': char_rating.index[::-1]},
color_discrete_map={'Male': 'rgb(0, 107, 164)', 'Female': 'rgb(200, 82, 0)'},
height = 1100)
#plot aesthetics
fig.update_traces(hovertemplate='%{y}, <i>average rating:</i> %{x:.2f}')
fig.update_layout(title={'text':'<b>Rating of the popular Star Wars characters, by gender and age </b>',
'font':{'size':22}})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1] + ' ' + 'y.o.'))
fig.show()
Most male respondents younger than 45 seem to hate Jar Jar Binks: his rating is firmly negative. Older men are more restrained in their judgment of this character, although they still name him as their least favorite. And women can't decide whether Emperor Palpatine or Jar Jar Binks is their least favorite.
The last episode of this project looks at the distribution of the survey participants by gender, age and whether or not they are Star Wars fans.
#build a hierarchical table for the sunburst plot
distributions = star_wars.groupby(['sw_fan', 'Gender', 'Age']).size().reset_index()
distributions.rename({0: 'Number of respondents'}, axis=1, inplace=True)
#map the values to be used as labels on the plot
distributions['sw_fan'] = distributions['sw_fan'].map({True: 'Fans', False: 'Not fans'})
#plot
fig = px.sunburst(distributions, path=['sw_fan', 'Gender', 'Age'], values='Number of respondents',
height=700, template='seaborn')
#plot aesthetics
fig.update_traces(hovertemplate='%{percentParent:.2%} of %{currentPath}', selector=dict(type='sunburst'))
fig.update_layout(title={'text':'<b>Distribution of respondents in various categories</b><br>Fanship, gender and age distribution',
'font':{'size':21}})
fig.show()
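The hover text of the sunburst shows each segment as a share of its parent segment. For anyone who wants those numbers as a table, here is a minimal sketch on made-up respondents; the `share_of_parent` column name is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical respondents
toy = pd.DataFrame({
    'sw_fan': [True, True, True, False, False],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female'],
})

# Count respondents in every observed (sw_fan, Gender) combination
counts = (toy.groupby(['sw_fan', 'Gender']).size()
          .reset_index(name='Number of respondents'))

# Size of the parent segment (all respondents with the same sw_fan value),
# broadcast back to every row with transform()
parent_total = counts.groupby('sw_fan')['Number of respondents'].transform('sum')
counts['share_of_parent'] = counts['Number of respondents'] / parent_total

print(counts)
```

Combinations that never occur (male non-fans in this toy data) simply do not appear in the table, which is also why they would be absent from the chart.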
Out of more than 1,100 respondents who participated in the survey, we analyzed the answers of the 835 respondents who had actually seen at least one movie of the saga. Here is what we discovered: