
Introduction

A long time ago in a pre-pandemic world...

It's a period of unclouded joy and optimistic anticipation. Disney has recently bought Lucasfilm.
The much-loved Star Wars saga is finally to be continued, and the fans are waiting for the 'Star Wars VII: The Force Awakens' film to come out. To make the wait shorter,
the FiveThirtyEight team surveyed more than 1,100 Americans on their views about the franchise. They used the SurveyMonkey online tool to collect the data and
not only shared the results of their analysis in a nice article but also published the
dataset they collected, which we'll use to perform our own analysis.

Prologue

The data has several columns, including:

Column	Description
`RespondentID`	An anonymized ID for the respondent (person taking the survey)
`Gender`	The respondent's gender
`Age`	The respondent's age
`Household Income`	The respondent's income
`Education`	The respondent's education level
`Location (Census Region)`	The respondent's location
`Have you seen any of the 6 films in the Star Wars franchise?`	Has a Yes or No response
`Do you consider yourself to be a fan of the Star Wars film franchise?`	Has a Yes or No response

There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.

In [1]:

import pandas as pd
import numpy as np
import missingno as msno # check for missing records
import warnings

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

pd.set_option('display.max_columns', 38)
pd.set_option('max_colwidth', 150)
pd.options.display.float_format = '{:,.3f}'.format

In [2]:

#read in the file
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head()

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	nan	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3,292,879,998.000	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	2	1	4	5	6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3,292,879,538.000	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3,292,765,271.000	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	2	3	4	5	6	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3,292,763,116.000	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	6	1	2	4	3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

It seems that the first row is not a survey entry but a row of subtitles for some of the columns. We'll use it later to rename the columns.

In [3]:

#save the 1st row for later 
aux_col_names = star_wars.iloc[0, :]

#drop the 1st row
star_wars = star_wars.iloc[1:, :]

aux_col_names.reset_index()

Out[3]:

	index	0
0	RespondentID	NaN
1	Have you seen any of the 6 films in the Star Wars franchise?	Response
2	Do you consider yourself to be a fan of the Star Wars film franchise?	Response
3	Which of the following Star Wars films have you seen? Please select all that apply.	Star Wars: Episode I The Phantom Menace
4	Unnamed: 4	Star Wars: Episode II Attack of the Clones
5	Unnamed: 5	Star Wars: Episode III Revenge of the Sith
6	Unnamed: 6	Star Wars: Episode IV A New Hope
7	Unnamed: 7	Star Wars: Episode V The Empire Strikes Back
8	Unnamed: 8	Star Wars: Episode VI Return of the Jedi
9	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Star Wars: Episode I The Phantom Menace
10	Unnamed: 10	Star Wars: Episode II Attack of the Clones
11	Unnamed: 11	Star Wars: Episode III Revenge of the Sith
12	Unnamed: 12	Star Wars: Episode IV A New Hope
13	Unnamed: 13	Star Wars: Episode V The Empire Strikes Back
14	Unnamed: 14	Star Wars: Episode VI Return of the Jedi
15	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Han Solo
16	Unnamed: 16	Luke Skywalker
17	Unnamed: 17	Princess Leia Organa
18	Unnamed: 18	Anakin Skywalker
19	Unnamed: 19	Obi Wan Kenobi
20	Unnamed: 20	Emperor Palpatine
21	Unnamed: 21	Darth Vader
22	Unnamed: 22	Lando Calrissian
23	Unnamed: 23	Boba Fett
24	Unnamed: 24	C-3P0
25	Unnamed: 25	R2 D2
26	Unnamed: 26	Jar Jar Binks
27	Unnamed: 27	Padme Amidala
28	Unnamed: 28	Yoda
29	Which character shot first?	Response
30	Are you familiar with the Expanded Universe?	Response
31	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Response
32	Do you consider yourself to be a fan of the Star Trek franchise?	Response
33	Gender	Response
34	Age	Response
35	Household Income	Response
36	Education	Response
37	Location (Census Region)	Response

In [4]:

#initial data exploration
star_wars.describe(include='all')

Out[4]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
count	1,186.000	1186	836	673	571	550	607	758	738	835	836	835	836	836	836	829	831	831	823	825	814	826	820	812	827	830	821	814	826	828	828	213	1068	1046	1046	858	1036	1043
unique	nan	2	2	1	1	1	1	1	1	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	6	3	2	2	2	2	4	5	5	9
top	nan	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	4	5	6	1	1	2	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Neither favorably nor unfavorably (neutral)	Neither favorably nor unfavorably (neutral)	Very favorably	Very favorably	Very unfavorably	Neither favorably nor unfavorably (neutral)	Very favorably	Han	No	No	No	Female	45-60	$50,000 - $99,999	Some college or Associate degree	East North Central
freq	nan	936	552	673	571	550	607	758	738	237	300	217	204	289	232	610	552	547	269	591	213	310	236	248	474	562	204	207	605	325	615	114	641	549	291	298	328	181
mean	3,290,128,200.533	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
std	1,055,638.908	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
min	3,288,372,923.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25%	3,289,450,962.750	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
50%	3,290,147,175.500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75%	3,290,814,462.500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
max	3,292,879,998.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Noticeable data cleanliness issues:

many columns are not named properly (not descriptive, too long, etc.),
some columns have only one unique value (e.g. 'Unnamed: 4' through 'Unnamed: 8'),
some columns store numbers as strings (e.g. 'Unnamed: 10' through 'Unnamed: 14'),
binary variables ('Yes'/'No') are not stored as booleans,
more than 80% of the values in the 'Do you consider yourself to be a fan of the Expanded Universe?' column are null (only 213 of 1,186 are filled in).
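
The checks behind this list can be sketched roughly as follows. This is a minimal illustration on a made-up toy frame (the `toy` data and the variable names are assumptions, not part of the survey file); the same two expressions could be applied to `star_wars`.

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({
    'seen_any': ['Yes', 'No', 'Yes', 'Yes'],
    'Unnamed: 4': ['Episode II', np.nan, 'Episode II', np.nan],
    'eu_fan': [np.nan, np.nan, 'No', np.nan],
})

# Share of missing values per column, highest first.
null_share = toy.isnull().mean().sort_values(ascending=False)

# Columns whose non-null values are all identical (checkbox-style columns).
single_valued = [col for col in toy.columns if toy[col].nunique(dropna=True) == 1]

print(null_share)
print(single_valued)
```

Note that a single-valued column is not necessarily useless: in checkbox-style columns the presence or absence of the value is the information.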

Cleaning the Data. Episode I

In this episode we'll rename the columns and fix some of the values.

We want the new column names to be shorter, so that they are easier to reference in code, but still descriptive.

In [5]:

#renaming the column names
star_wars = star_wars.rename(columns={
    'Have you seen any of the 6 films in the Star Wars franchise?': 'seen_any',
    'Do you consider yourself to be a fan of the Star Wars film franchise?': 'sw_fan',
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_ep1',
    'Unnamed: 4' : 'seen_ep2',
    'Unnamed: 5' : 'seen_ep3',
    'Unnamed: 6' : 'seen_ep4',
    'Unnamed: 7' : 'seen_ep5',
    'Unnamed: 8' : 'seen_ep6',
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_ep1',
    'Unnamed: 10': 'ranking_ep2',
    'Unnamed: 11': 'ranking_ep3',
    'Unnamed: 12': 'ranking_ep4',
    'Unnamed: 13': 'ranking_ep5',
    'Unnamed: 14': 'ranking_ep6',
    'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'Han Solo',
    'Unnamed: 16': 'Luke Skywalker',
    'Unnamed: 17': 'Princess Leia Organa',
    'Unnamed: 18': 'Anakin Skywalker',
    'Unnamed: 19': 'Obi Wan Kenobi',
    'Unnamed: 20': 'Emperor Palpatine',
    'Unnamed: 21': 'Darth Vader',
    'Unnamed: 22': 'Lando Calrissian',
    'Unnamed: 23': 'Boba Fett',
    'Unnamed: 24': 'C-3P0',
    'Unnamed: 25': 'R2 D2',
    'Unnamed: 26': 'Jar Jar Binks',
    'Unnamed: 27': 'Padme Amidala',
    'Unnamed: 28': 'Yoda',
    'Which character shot first?': 'shot_first',
    'Are you familiar with the Expanded Universe?': 'know_eu',
    'Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦': 'eu_fan',
    'Do you consider yourself to be a fan of the Star Trek franchise?': 'st_fan'
})

star_wars.head(5)

Out[5]:

	RespondentID	seen_any	sw_fan	seen_ep1	seen_ep2	seen_ep3	seen_ep4	seen_ep5	seen_ep6	ranking_ep1	ranking_ep2	ranking_ep3	ranking_ep4	ranking_ep5	ranking_ep6	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	shot_first	know_eu	eu_fan	st_fan	Gender	Age	Household Income	Education	Location (Census Region)
1	3,292,879,998.000	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	2	1	4	5	6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3,292,879,538.000	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3,292,765,271.000	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	2	3	4	5	6	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3,292,763,116.000	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	6	1	2	4	3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3,292,731,220.000	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	4	6	2	1	3	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

The columns 'seen_any', 'sw_fan', 'know_eu', 'eu_fan' and 'st_fan' contain either 'Yes' or 'No', with some missing values in between. For ease of use throughout the analysis, we map these values to booleans.

In [6]:

# convert Yes/No to boolean
yes_no_mapping = {'Yes': True, 'No': False}

yes_no_cols = ['seen_any', 'sw_fan', 'know_eu', 'eu_fan', 'st_fan']

for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(yes_no_mapping)

star_wars.head()

Out[6]:

	RespondentID	seen_any	sw_fan	seen_ep1	seen_ep2	seen_ep3	seen_ep4	seen_ep5	seen_ep6	ranking_ep1	ranking_ep2	ranking_ep3	ranking_ep4	ranking_ep5	ranking_ep6	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	shot_first	know_eu	eu_fan	st_fan	Gender	Age	Household Income	Education	Location (Census Region)
1	3,292,879,998.000	True	True	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	2	1	4	5	6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	True	False	False	Male	18-29	NaN	High school degree	South Atlantic
2	3,292,879,538.000	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	True	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3,292,765,271.000	True	False	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	2	3	4	5	6	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	False	NaN	False	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3,292,763,116.000	True	True	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	6	1	2	4	3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	False	NaN	True	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3,292,731,220.000	True	True	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	4	6	2	1	3	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably	Greedo	True	False	False	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

The columns 'seen_ep1' through 'seen_ep6' indicate whether the respondent saw the corresponding movie. If the movie name is listed, the respondent saw the episode; a NaN value means that either the question was not answered or the respondent didn't see the movie. We'll convert these columns to Boolean type as well: values with the movie name become True and null values become False.

In [7]:

#convert 'seen_ep?' columns to boolean
star_wars.loc[:,'seen_ep1':'seen_ep6'] = star_wars.loc[:,'seen_ep1':'seen_ep6'].notnull()
star_wars.loc[:,'seen_ep1':'seen_ep6'].head()

Out[7]:

	seen_ep1	seen_ep2	seen_ep3	seen_ep4	seen_ep5	seen_ep6
1	True	True	True	True	True	True
2	False	False	False	False	False	False
3	True	True	True	False	False	False
4	True	True	True	True	True	True
5	True	True	True	True	True	True

The next six columns ask the respondent to rank the Star Wars movies in order of preference: 1 means the film was the favorite, and 6 means it was the least favorite. Each of these columns can contain the value 1, 2, 3, 4, 5, 6, or NaN. For further analysis we'll convert these columns to numeric type and invert the ranking, so that 6 means the favorite and 1 the least favorite.

In [8]:

#convert the rankings to float and invert them in one vectorized step (7 - rank)
star_wars.loc[:,'ranking_ep1':'ranking_ep6'] = 7 - star_wars.loc[:,'ranking_ep1':'ranking_ep6'].astype(float)
star_wars.loc[:,'ranking_ep1':'ranking_ep6'].head()

Out[8]:

	ranking_ep1	ranking_ep2	ranking_ep3	ranking_ep4	ranking_ep5	ranking_ep6
1	4.000	5.000	6.000	3.000	2.000	1.000
2	nan	nan	nan	nan	nan	nan
3	6.000	5.000	4.000	3.000	2.000	1.000
4	2.000	1.000	6.000	5.000	3.000	4.000
5	2.000	3.000	1.000	5.000	6.000	4.000
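
As a quick sanity check on the inversion, one could verify that every complete answer is still a permutation of 1 to 6. The sketch below uses invented rankings (the `raw` frame is an assumption for illustration only):

```python
import pandas as pd
import numpy as np

# Made-up rankings for three respondents; the second skipped the question.
raw = pd.DataFrame(
    [['3', '2', '1', '4', '5', '6'],
     [np.nan] * 6,
     ['5', '6', '1', '2', '4', '3']],
    columns=[f'ranking_ep{i}' for i in range(1, 7)],
)

# 7 - rank turns 1 (favorite) into 6 and 6 (least favorite) into 1.
inverted = 7 - raw.astype(float)

# A complete answer must still be a permutation of 1..6 after the inversion.
answered = inverted.dropna()
is_permutation = answered.apply(lambda row: sorted(row) == [1, 2, 3, 4, 5, 6], axis=1)

print(inverted)
print(is_permutation.all())
```

Missing rankings stay NaN, because subtracting NaN from 7 yields NaN.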

The main Star Wars character columns are answers to the question 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her'.

We will convert the answers to points so that we can make calculations:

Very favorably: the character receives 2 points
Somewhat favorably: the character receives 1 point
Neither favorably nor unfavorably (neutral): 0 points
Somewhat unfavorably: 1 point is deducted from the character
Very unfavorably: 2 points are deducted from the character
Unfamiliar (N/A): 0 points

Missing answers also count as 0 points.

In [9]:

points = {
    'Very favorably': 2,
    'Somewhat favorably': 1,
    'Neither favorably nor unfavorably (neutral)': 0,
    'Somewhat unfavorably': -1,
    'Very unfavorably': -2,
    'Unfamiliar (N/A)': 0,
    np.nan: 0  # np.NaN was removed in NumPy 2.0; np.nan works in every version
    }

for col in star_wars.loc[:, 'Han Solo' : 'Yoda']:
    star_wars[col] = star_wars[col].map(points)
    
star_wars.loc[:, 'Han Solo' : 'Yoda'].head()

Out[9]:

	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda
1	2	2	2	2	2	2	2	0	0	2	2	2	2	2
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	1	1	1	1	1	0	0	0	0	0	0	0	0	0
4	2	2	2	2	2	1	2	1	-1	2	2	2	2	2
5	2	1	1	-1	2	-2	1	0	2	1	1	-2	1	1
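
One calculation these points allow is a net favorability score per character. The following is only a sketch on hypothetical answers (the `answers` frame and the `net_score` name are assumptions); it also shows that values missing from the mapping come back as NaN and must be filled explicitly.

```python
import pandas as pd

# Hypothetical answers from four respondents about two characters.
answers = pd.DataFrame({
    'Han Solo': ['Very favorably', 'Somewhat favorably', 'Unfamiliar (N/A)', 'Very favorably'],
    'Jar Jar Binks': ['Very unfavorably', 'Somewhat favorably', None, 'Very unfavorably'],
})

points = {
    'Very favorably': 2,
    'Somewhat favorably': 1,
    'Neither favorably nor unfavorably (neutral)': 0,
    'Somewhat unfavorably': -1,
    'Very unfavorably': -2,
    'Unfamiliar (N/A)': 0,
}

# Unmapped values (here: the missing answer) become NaN, so fill them with 0.
scores = answers.apply(lambda col: col.map(points)).fillna(0).astype(int)

# Net score per character: positive means favorable votes outweigh unfavorable ones.
net_score = scores.sum()
print(net_score)
```

Because 'Unfamiliar (N/A)', neutral and missing answers all score 0, a low net score can mean either a divisive character or a little-known one.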

Cleaning the Data. Episode II

This episode is all about dealing with missing values. We'll start by identifying them.

In [10]:

#overview of missing values
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)

as we've seen earlier, 'eu_fan' is the column with the most missing values;
there are several columns with about 30% of their values missing; we'll check whether they share a common pattern and what can be done about it;
some respondents (about 10%) preferred not to answer the questions about their sociodemographic status; it will probably be possible to impute those missing values.

The 'seen_any' column might help us deal with part of the missing data. Respondents who haven't seen any of the Star Wars movies probably skipped the questions that asked them to rank each movie, as well as the question about who shot first.

In [11]:

print('The number of respondents who haven´t seen any of the Star Wars movies:', star_wars[star_wars['seen_any'] == False].shape[0])
print('They left the following questions without an answer:')
star_wars[star_wars['seen_any'] == False].isnull().sum()

The number of respondents who haven´t seen any of the Star Wars movies: 250
They left the following questions without an answer:

Out[11]:

RespondentID                  0
seen_any                      0
sw_fan                      250
seen_ep1                      0
seen_ep2                      0
seen_ep3                      0
seen_ep4                      0
seen_ep5                      0
seen_ep6                      0
ranking_ep1                 250
ranking_ep2                 250
ranking_ep3                 250
ranking_ep4                 250
ranking_ep5                 250
ranking_ep6                 250
Han Solo                      0
Luke Skywalker                0
Princess Leia Organa          0
Anakin Skywalker              0
Obi Wan Kenobi                0
Emperor Palpatine             0
Darth Vader                   0
Lando Calrissian              0
Boba Fett                     0
C-3P0                         0
R2 D2                         0
Jar Jar Binks                 0
Padme Amidala                 0
Yoda                          0
shot_first                  250
know_eu                     250
eu_fan                      250
st_fan                       10
Gender                       24
Age                          24
Household Income             67
Education                    30
Location (Census Region)     25
dtype: int64
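
To compare both groups side by side rather than inspecting the non-viewers alone, the missing values can be counted per group. A minimal sketch on a made-up miniature of the survey (the `toy` frame and the group labels are assumptions):

```python
import pandas as pd
import numpy as np

# Made-up miniature of the survey: two viewers, two non-viewers.
toy = pd.DataFrame({
    'seen_any': [True, False, True, False],
    'ranking_ep1': [4.0, np.nan, np.nan, np.nan],
    'Gender': ['Male', np.nan, 'Female', 'Female'],
})

# Readable group labels instead of True/False.
group = toy['seen_any'].map({True: 'viewer', False: 'non-viewer'})

# Count missing answers separately for viewers and non-viewers.
nulls_by_group = (
    toy.drop(columns='seen_any')
       .isnull()
       .groupby(group)
       .sum()
)
print(nulls_by_group)
```

A column that is missing for every non-viewer but only for some viewers points to a question that was skipped by design rather than left blank at random.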

This confirms that the entries from respondents who haven't seen any of the Star Wars movies will not contribute to the analysis, so we'll continue only with the answers from respondents who saw at least one episode of the saga.

In [12]:

#take an explicit copy so that later column assignments don't raise SettingWithCopyWarning
star_wars = star_wars[star_wars['seen_any'] == True].copy()

The null values in the 'sw_fan', 'know_eu' and 'st_fan' columns can be imputed with False, and the null values in 'shot_first' with the existing category 'I don't understand this question'.

In [13]:

star_wars[['sw_fan', 'know_eu', 'st_fan']] = star_wars[['sw_fan', 'know_eu', 'st_fan']].fillna(False)
#use a straight apostrophe so the filler matches the existing category exactly
star_wars['shot_first'] = star_wars['shot_first'].fillna("I don't understand this question")
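
When imputing a text category, it is worth checking that the filler string matches the existing category character for character: a typographic apostrophe silently creates a second category. A small sketch on made-up values (the `shot_first` Series here is a stand-in, not the survey column):

```python
import pandas as pd

# Two spellings that look alike but are different strings.
shot_first = pd.Series([
    "I don't understand this question",   # straight apostrophe, as in the raw data
    'I don´t understand this question',   # acute accent, easy to type by accident
    'Han',
])

# The two spellings count as separate categories.
print(shot_first.nunique())

# Normalise the acute accent to a straight apostrophe before grouping or plotting.
normalised = shot_first.str.replace('´', "'", regex=False)
print(normalised.value_counts())
```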

As for the 'eu_fan' column, we assume, as with 'sw_fan', that a person who is not familiar with the Expanded Universe can't be a fan of it.

In [14]:

star_wars.loc[star_wars['know_eu'] == False, 'eu_fan'] = False

In [15]:

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)

There might be a group of respondents who answered the first question (whether they had seen any movie of the saga) but then lost interest in the survey and left the other questions unanswered.

In [16]:

#rows where all six rankings are missing
star_wars[star_wars.loc[:, 'ranking_ep1':'ranking_ep6'].isnull().all(axis=1)].describe(include='all')

Out[16]:

	RespondentID	seen_any	sw_fan	seen_ep1	seen_ep2	seen_ep3	seen_ep4	seen_ep5	seen_ep6	ranking_ep1	ranking_ep2	ranking_ep3	ranking_ep4	ranking_ep5	ranking_ep6	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	shot_first	know_eu	eu_fan	st_fan	Gender	Age	Household Income	Education	Location (Census Region)
count	100.000	100	100	100	100	100	100	100	100	0.000	0.000	0.000	0.000	0.000	0.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100.000	100	100	100	100	0	0	0	0	0
unique	nan	1	1	1	1	1	1	1	1	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	1	1	1	1	0	0	0	0	0
top	nan	True	False	False	False	False	False	False	False	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	I don´t understand this question	False	False	False	NaN	NaN	NaN	NaN	NaN
freq	nan	100	100	100	100	100	100	100	100	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	100	100	100	100	NaN	NaN	NaN	NaN	NaN
mean	3,290,145,657.870	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
std	886,869.706	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
min	3,288,455,900.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25%	3,289,614,702.500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
50%	3,290,353,222.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75%	3,290,725,892.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
max	3,292,637,870.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	nan	nan	nan	nan	nan	nan	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

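Having confirmed the above hypothesis, we can drop those rows without hesitation. Note that the next cell filters on 'ranking_ep1' alone, so it also removes one respondent who ranked other films but left Episode I blank; 835 rows remain.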

In [17]:

star_wars.drop(star_wars[star_wars['ranking_ep1'].isnull()].index, inplace=True)

In [18]:

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)

'Household Income' holds categorical data. Given the significant number of missing values in it, we don't want to lose that much information, so for now we'll fill the gaps with 'no info' as a new category. The remaining missing values will be replaced by the most common value in each column.

In [19]:

star_wars['Household Income'] = star_wars['Household Income'].fillna('no info')
star_wars = star_wars.fillna(star_wars.mode().iloc[0])
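
How the mode-based fill works can be seen on a tiny example. This is a sketch with invented values (the `toy` frame is an assumption): `mode()` returns one row per tied mode, and `fillna` with a Series fills each column with the value stored under that column's name.

```python
import pandas as pd
import numpy as np

# Toy demographic columns with gaps (values are invented).
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', np.nan, 'Female'],
    'Age': ['18-29', np.nan, '45-60', '45-60'],
})

# mode() returns one row per tied mode; iloc[0] keeps the first for each column.
most_common = toy.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name.
filled = toy.fillna(most_common)
print(filled)
```

Imputing with the mode inflates the most frequent category, so it is best suited to columns with few gaps.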

Analysis. Episode I. Most seen and best ranked

Not only do the films compete with each other for people's love and recognition, but so do the trilogies: episodes IV - VI, originally released between 1977 and 1983, and the prequel trilogy, released from 1999 through 2005.

First, we'll find out how many respondents saw each movie and thus which movie is the most seen.

In [20]:

#full movie titles, used for plot labels
titles = [
'Episode I The Phantom Menace',
'Episode II Attack of the Clones',
'Episode III Revenge of the Sith',
'Episode IV A New Hope',
'Episode V The Empire Strikes Back',
'Episode VI Return of the Jedi'
] 

#columns which refer to if a movie was seen by a respondent
seen_cols = ['seen_ep1', 'seen_ep2', 'seen_ep3', 'seen_ep4', 'seen_ep5', 'seen_ep6']

#the columns which refer to movie rankings
rank_cols = ['ranking_ep1', 'ranking_ep2', 'ranking_ep3', 'ranking_ep4', 'ranking_ep5', 'ranking_ep6']

In [21]:

#prepare the dataframe for plotting
seen = pd.DataFrame(data=[titles, 
                          star_wars.loc[:,seen_cols].sum(), 
                          (star_wars.loc[:,seen_cols].sum()/star_wars.loc[:,seen_cols].shape[0]).round(2)]).T

seen.columns = ['Star Wars movie', 'Number of views', 'Views_per']

#define the average views per trilogy
trilogy_views = ['',] * 6
trilogy_views[:3] = [seen.loc[0:2, 'Views_per'].sum()/3, ] * 3
#.loc slices are label-based and inclusive, so 3:5 covers episodes IV-VI
trilogy_views[-3:] = [seen.loc[3:5, 'Views_per'].sum()/3, ] * 3
seen['views_per_trilogy'] = trilogy_views

#plot
fig = px.bar(seen, x='Number of views', y='Star Wars movie', orientation='h', 
             custom_data=['Views_per', 'views_per_trilogy'], 
             category_orders={'Star Wars movie':titles})

#plot aesthetics
##color map highlighting only the most seen movie
colors=[] 
for val in seen['Number of views']:
    colors.append('rgb(252, 128, 14)' if val ==  seen['Number of views'].max() else 'rgb(137, 137, 137)')

fig.update_traces(hovertemplate='<i>Views:</i> %{x} <br><i>seen by %{customdata[0]:.0%} of respondents</i> <br> (the trilogy seen by %{customdata[1]:.0%} of respondents) ', 
                  marker_color=colors)

fig.update_layout(title={'text':'<b>Views received by each movie in the Star Wars franchise</b><br>based on 835 respondents',
                         'font':{'size':22}},
                  yaxis = {'ticksuffix': '  ',
                           'tickfont':{'size':16}})

fig.show()

Episode V is the most seen movie of the saga (seen by 91% of respondents), followed by Episode VI. On average, the original trilogy has been seen by more respondents than the prequel trilogy (84% vs. 72%).
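
The share calculation behind these figures can be sketched on a toy table. The flags below are invented, and the sketch assumes the `seen_*` columns hold True/False values; it is an illustration rather than the exact code used for the plot.

```python
import pandas as pd

# toy 'seen' flags for four respondents; invented values
toy_seen = pd.DataFrame({
    'seen_ep1': [True, False, True, False],
    'seen_ep2': [True, False, False, False],
    'seen_ep3': [True, False, True, False],
    'seen_ep4': [True, True, True, False],
    'seen_ep5': [True, True, True, True],
    'seen_ep6': [True, True, True, False],
})

# the mean of a boolean column is the share of True values
share_seen = toy_seen.mean()

# average share per trilogy: first three columns vs. last three
prequel_share = share_seen.iloc[:3].mean()
original_share = share_seen.iloc[3:].mean()

print(share_seen.round(2))
print(f'prequels: {prequel_share:.0%}, originals: {original_share:.0%}')
```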

Now, let's proceed to the rankings. Remember that although the FiveThirtyEight team asked respondents to give their favorite movie a 1 and their least favorite a 6, we inverted the rankings earlier for a clearer visual presentation, so a higher value now means a better-liked movie.
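
As a reminder of what such an inversion looks like, here is a small sketch. The inversion step itself is not part of this section, so the `7 - rank` formula and the toy values are assumptions used only for illustration.

```python
import pandas as pd

# survey-style ranks: 1 = favorite, 6 = least favorite (toy values)
raw_ranks = pd.Series([1, 6, 3, 2], name='ranking_ep5')

# flip the 1-6 scale so that a higher number means a better-liked movie
inverted = 7 - raw_ranks

print(inverted.tolist())
```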

In [22]:

#prepare the data for plotting
ranking = star_wars[rank_cols].mean()
ranking.index = titles

#plot
fig = px.bar(ranking, x=ranking.values, y=ranking.index, orientation='h', 
             labels={'x': 'average rank', 'y': ''},
             category_orders={'Star Wars movie':titles})

#plot aesthetics
##color map highlighting only the best ranked movie
colors=[] 
for val in ranking.values:
    colors.append('rgb(252, 128, 14)' if val ==  ranking.values.max() else 'rgb(137, 137, 137)')

fig.update_traces(hovertemplate='<i>Average rank:</i> %{x:.2f}', marker_color=colors)
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise</b><br>based on 835 respondents',
                         'font':{'size':22}},
                  yaxis = {'ticksuffix': '  ',
                           'tickfont':{'size':16}})
fig.show()

The favorite movie, without a doubt, is Episode V The Empire Strikes Back, with an average rank of 4.49. The three favorite films are all from the original trilogy.

Let's now see whether the respondents' scores differ much between socio-demographic groups. To do this, we'll need to transform our dataframe to a long format.

In [23]:

# grouping the columns
id_vars = ['sw_fan', 'Gender', 'Age', 'Location (Census Region)']


# long-type table for movie ranks
ranks_melt = pd.melt(star_wars, id_vars=id_vars, var_name='movie',
                    value_vars=rank_cols, value_name='movie_rank')

movie_dict = {'ranking_ep1': 'Episode I The Phantom Menace', 
              'ranking_ep2': 'Episode II Attack of the Clones',
              'ranking_ep3': 'Episode III Revenge of the Sith',
              'ranking_ep4': 'Episode IV A New Hope',
              'ranking_ep5': 'Episode V The Empire Strikes Back',
              'ranking_ep6': 'Episode VI Return of the Jedi'}
ranks_melt['movie'] = ranks_melt['movie'].map(movie_dict) 

#check the result
ranks_melt.head()

Out[23]:

	sw_fan	Gender	Age	Location (Census Region)	movie	movie_rank
0	True	Male	18-29	South Atlantic	Episode I The Phantom Menace	4.000
1	False	Male	18-29	West North Central	Episode I The Phantom Menace	6.000
2	True	Male	18-29	West North Central	Episode I The Phantom Menace	2.000
3	True	Male	18-29	West North Central	Episode I The Phantom Menace	2.000
4	True	Male	18-29	Middle Atlantic	Episode I The Phantom Menace	6.000
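
The plots that follow rely on `pivot_table` averaging the long table by group. A minimal sketch on invented rows, shaped like `ranks_melt`, suggests that this is equivalent to an explicit `groupby` followed by `mean`:

```python
import pandas as pd

# toy long-format table shaped like ranks_melt; values are invented
toy_long = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female',
               'Male', 'Male', 'Female', 'Female'],
    'movie': ['Episode I', 'Episode V', 'Episode I', 'Episode V',
              'Episode I', 'Episode V', 'Episode I', 'Episode V'],
    'movie_rank': [2.0, 6.0, 4.0, 5.0, 4.0, 5.0, 3.0, 6.0],
})

# pivot_table aggregates with the mean by default ...
by_pivot = toy_long.pivot_table(index=['movie', 'Gender'],
                                values='movie_rank').reset_index()

# ... which matches an explicit groupby + mean
by_group = toy_long.groupby(['movie', 'Gender'], as_index=False)['movie_rank'].mean()

print(by_pivot)
```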

Now, let's check whether anything stands out in the responses when they are split by gender.

In [24]:

#prepare data for plotting
ranks_sex = ranks_melt.pivot_table(index=['movie', 'Gender'], values='movie_rank').reset_index()

#plot
fig = px.bar(ranks_sex, x='movie_rank', y='movie', orientation='h', 
             facet_col = 'Gender', facet_col_wrap=1, barmode='group',
             labels = {'movie_rank': 'average rank',
                       'movie': 'Star Wars movie'},
             category_orders={'movie':titles}
             )

#plot aesthetics
##color map highlighting the best ranked movie and other points of interest
colors_of_interest = list(colors)
colors_of_interest[0] = 'rgb(95, 158, 209)'

fig.update_traces(hovertemplate='<i>Average rank:</i> %{x}', marker_color = colors) #highlight the best ranked movie
fig.update_traces(row=0, marker_color = colors_of_interest) #apply 'colors_of_interest' to the first facet plot
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise, by gender</b>',
                         'font':{'size':22}})
fig.update_yaxes(ticksuffix='  ', tickfont={'size':14})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1]))

##draw the legend manually
###'best ranked' rectangle
fig.add_shape(type='rect',
    xref='paper', yref='paper',
    x0=3.5, x1=3.9, y0=4.6, y1=5.0,
    col=1, row=1,
    line={'width': 0.8,
          'color': 'rgb(252, 128, 14)' },
    fillcolor= 'rgb(252, 128, 14)')
###'best ranked' text
fig.add_annotation(text='Best ranked', xref='paper', yref='paper', x=4.3, y=4.8, align='left', 
                   col=1, row=1, showarrow=False)
###'of interest' rectangle
fig.add_shape(type='rect',
    xref='paper', yref='paper',
    x0=3.5, x1=3.9, y0=3.9, y1=4.3,
    col=1, row=1,
    line={'width': 0.8,
          'color': 'rgb(95, 158, 209)'},
    fillcolor='rgb(95, 158, 209)')
###'of interest' text
fig.add_annotation(text='  Of interest', xref='paper', yref='paper', x=4.3, y=4.1, align='left', 
                   col=1, row=1, showarrow=False)

fig.show()

Men and women agree on the favorite movie, which is Episode V. Interestingly, the male respondents seem quite clear about which trilogy they like more, giving the best scores to the original one: none of the prequel movies has an average rank higher than 3. The female respondents also seem to prefer Episodes IV-VI, but their third favorite movie is Episode I The Phantom Menace.

Let's see if being a Star Wars fan changes anything.

In [25]:

ranks_sex_sw_fan = ranks_melt.pivot_table(index=['movie', 'Gender', 'sw_fan'], values='movie_rank').reset_index()

fig = px.bar(ranks_sex_sw_fan, x='movie_rank', y='movie', orientation='h', 
             facet_col = 'Gender', facet_row='sw_fan', barmode='group',
             labels = {'movie_rank': 'average rank',
                       'movie': 'Star Wars movie'},
             category_orders={'movie':titles}
             )


fig.update_traces(hovertemplate='<i>Average rank:</i> %{x}', marker_color = colors)
fig.update_traces(row=0, marker_color = colors_of_interest)
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise, by gender and fans</b>',
                         'font':{'size':22}})
fig.update_yaxes(ticksuffix='  ', tickfont={'size':13})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1].replace('False', 'Not Fan').replace('True', 'Fan')))

fig.add_shape(type='rect',
    xref='paper', yref='paper',
    x0=3.1, x1=3.5, y0=4.6, y1=5.0,
    col=2, row=1,
    line={'width': 0.8,
          'color': 'rgb(252, 128, 14)' },
    fillcolor= 'rgb(252, 128, 14)'
          )

fig.add_annotation(text='Best ranked', xref='paper', yref='paper', x=4.2, y=4.8, align='left', col=2, row=1, showarrow=False)

fig.add_shape(type='rect',
    xref='paper', yref='paper',
    x0=3.1, x1=3.5, y0=3.9, y1=4.3,
    col=2, row=1,
    line={'width': 0.8,
          'color': 'rgb(95, 158, 209)'},
    fillcolor='rgb(95, 158, 209)'
)

fig.add_annotation(text='Of interest', xref='paper', yref='paper', x=4.3, y=4.1, align='left', col=2, row=1, showarrow=False)


fig.show()

Both male and female fans are loyal to the original trilogy, although they score the prequel movies somewhat differently. Those who don't consider themselves Star Wars fans give very high scores to Episode I: it is the second favorite movie among male non-fans and shares first place with Episode V among female non-fans.

It's also interesting to see how respondents of different ages rate the movies.

In [26]:

#prepare data for plotting
ranks_age = ranks_melt.pivot_table(index=['movie', 'Age'], values='movie_rank').reset_index()

#plot
fig = px.bar(ranks_age, x='movie_rank', y='movie', orientation='h', 
             facet_col = 'Age', facet_col_wrap=2,
             category_orders={'movie':titles}
             )

#plot aesthetics
fig.update_traces(hovertemplate='<i>Average rank:</i> %{x}', marker_color = colors)
fig.update_traces(row=1, col=2, marker_color = colors_of_interest)
fig.update_layout(title={'text':'<b>Most favorite movie in the Star Wars franchise, by age</b>',
                         'font':{'size':22}})
fig.update_yaxes(ticksuffix='  ', tickfont={'size':13})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1] + ' ' + 'y.o.'))

##draw the legend manually
###'best ranked' rectangle
fig.add_shape(type='rect',
    xref='paper', yref='paper',
    x0=3.1, x1=3.5, y0=4.6, y1=5.0,
    col=1, row=0,
    line={'width': 0.8,
          'color': 'rgb(252, 128, 14)' },
    fillcolor= 'rgb(252, 128, 14)')
###'best ranked' text
fig.add_annotation(text='Best ranked', xref='paper', yref='paper', x=4.2, y=4.8, 
                   align='left', col=1, row=0, showarrow=False)
###'of interest' rectangle
fig.add_shape(type='rect',
    xref='paper', yref='paper',
    x0=3.1, x1=3.5, y0=3.9, y1=4.3,
    col=1, row=0,
    line={'width': 0.8,
          'color': 'rgb(95, 158, 209)'},
    fillcolor='rgb(95, 158, 209)')
###'of interest' text
fig.add_annotation(text='Of interest', xref='paper', yref='paper', x=4.3, y=4.1, 
                   align='left', col=1, row=0, showarrow=False)

fig.show()

The surprising popularity of Episode I The Phantom Menace also shows up among respondents older than 60.

We definitely can't leave this phenomenon without further investigation. Let's check who gives this episode a high score (4 to 6) and who gives it a low score (1 to 3).

In [27]:

#define the number of each score given by different groups of respondents 
rank_per_age = pd.crosstab([ranks_melt['sw_fan'], ranks_melt['Age'], ranks_melt['Gender'], 
                            ranks_melt['movie']], ranks_melt['movie_rank'], colnames=[''])

#retrieve the info related to 'Episode I' only
rank_per_age_ep1 = rank_per_age.xs('Episode I The Phantom Menace', level='movie').reset_index()

#calculate the number of high scores and low scores
rank_per_age_ep1['low_rank'] = rank_per_age_ep1[[1.0, 2.0, 3.0]].sum(axis=1) 
rank_per_age_ep1['high_rank'] = rank_per_age_ep1[[4.0, 5.0, 6.0]].sum(axis=1) 

#drop the individual score columns
rank_per_age_ep1 = rank_per_age_ep1.drop(rank_per_age_ep1.columns[3:9], axis=1)

#melt the data to a long-type format
rank_per_age_ep1 = pd.melt(rank_per_age_ep1, id_vars=['sw_fan', 'Age', 'Gender'], value_vars=['low_rank', 'high_rank'],
                     var_name='rank', value_name='number of votes')
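
The same low/high split can also be sketched by labelling each score first and counting afterwards. This is an alternative illustration on invented scores (the `group` column and its labels are hypothetical), not the approach used in the cell above.

```python
import pandas as pd

# toy Episode I scores on the inverted 1-6 scale; values are invented
toy_ep1 = pd.DataFrame({
    'group': ['Fan', 'Fan', 'Fan', 'Not fan', 'Not fan', 'Not fan'],
    'movie_rank': [1.0, 2.0, 5.0, 4.0, 6.0, 3.0],
})

# scores 1-3 count as low, scores 4-6 as high
toy_ep1['rank'] = toy_ep1['movie_rank'].ge(4).map({True: 'high_rank', False: 'low_rank'})

# count the votes per group and label
votes = pd.crosstab(toy_ep1['group'], toy_ep1['rank'])

print(votes)
```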

In [28]:

#plot
fig = px.bar(rank_per_age_ep1, x='Age', y='number of votes', color='rank', barmode='group', 
             facet_row = 'sw_fan', facet_col='Gender', facet_row_spacing=0.08,
             color_discrete_map={'low_rank': 'rgb(0, 107, 164)', 'high_rank': 'rgb(200, 82, 0)'})

#plot aesthetics
fig.update_traces(hovertemplate='<i>Number of votes:</i> %{y}')
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1].replace('False', 'Not Fan').replace('True', 'Fan')))
fig.update_layout(title={'text':'<b>Episode I ranks distribution, by age and gender</b>',
                         'font':{'size':22}})
fig.update_xaxes(showticklabels=True)
fig.update_yaxes(ticksuffix='  ')

fig.show()

Female non-fans of all ages tend to give Episode I higher scores, while among male non-fans this holds only for those older than 45. These higher scores from non-fans are more than offset by the lower scores from both male and female Star Wars fans of all ages except those older than 60, where high and low scores are split almost equally.

And finally, let's see whether the preferences vary across U.S. regions.

In [29]:

#prepare the data for plotting (movie rankings by the U.S. region)
rank_per_region = ranks_melt.pivot_table(index=['Location (Census Region)', 'movie'], values='movie_rank').unstack(level=0)

#plot
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(rank_per_region,cmap='viridis',annot=True,lw=2,ax=ax, cbar=False)

#plot aesthetics
ax.set_xticklabels(rank_per_region.columns.droplevel(),fontsize=13,rotation=90)
ax.tick_params(left=False, bottom=False)
ax.set(xlabel='U.S. Census Region', ylabel='')
plt.title('Best Star Wars movie in the franchise according to the U.S. region',fontsize=18,fontweight=525)

Out[29]:

Text(0.5, 1.0, 'Best Star Wars movie in the franchise according to the U.S. region')

The Empire Strikes Back remains the favorite movie across the regions of the United States. Curiously, the East South Central respondents are the most extreme in their judgments: both the highest and the lowest average rankings come from them, given to Episode V and Episode II respectively. West South Central is the only region where a prequel movie, Episode I in this case, places among the top three of the whole saga.
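
Readings like these can be cross-checked numerically instead of by eye. The sketch below uses an invented movie-by-region table (not the survey averages) to show how `idxmax` and `idxmin` locate the favorite per region and the overall extremes.

```python
import pandas as pd

# toy average ranks per movie (rows) and region (columns); invented values
toy_region = pd.DataFrame(
    {'Pacific': [3.2, 2.5, 4.4], 'New England': [2.9, 2.7, 4.6]},
    index=['Episode I', 'Episode II', 'Episode V'],
)

# best ranked movie in each region
best_per_region = toy_region.idxmax()

# (region, movie) pairs holding the overall highest and lowest averages
overall_max = toy_region.unstack().idxmax()
overall_min = toy_region.unstack().idxmin()

print(best_per_region)
print(overall_max, overall_min)
```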

Analysis. Episode II. Characters¶

We already know what the respondents think about the movies of the Star Wars franchise; now it's time to find out which characters are the most liked and the most disliked.

In [30]:

#calculate the average rating of each character
char_rating = star_wars.loc[:, 'Han Solo' : 'Yoda'].mean().sort_values(ascending=True)

#plot
fig = px.bar(char_rating, x=char_rating.values, y=char_rating.index, orientation='h', 
             labels={'x': 'average rating', 'y': 'Star Wars character'})

#plot aesthetics
##color map highlighting the positive and negative ratings
colors=[] 
for val in char_rating.values:
    colors.append('#FC4040' if val < 0 else '#009999')

fig.update_layout(title={'text':'<b>Rating of the popular characters from the Star Wars franchise</b>',
                         'font':{'size':22}})    
fig.update_xaxes(range=[-.5, 2])
fig.update_traces(hovertemplate='<i>Average rating:</i> %{x}', marker_color=colors)

fig.show()

Han Solo, Yoda, Obi Wan Kenobi, Luke Skywalker, R2 D2, Princess Leia and C-3P0 are generally viewed as likable characters, while the others are seen as controversial, neutral or negative. Han Solo has the highest rating and Jar Jar Binks the lowest.
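
The average rating relies on the favorability answers having been converted to numbers between -2 and 2. That conversion is not part of this section, so the mapping below is a hypothetical sketch (in particular, treating 'Unfamiliar (N/A)' as 0 is an assumption), applied to invented answers.

```python
import pandas as pd

# hypothetical answer-to-score mapping; the notebook's own mapping is defined
# earlier and may differ (especially for 'Unfamiliar (N/A)')
score_map = {
    'Very favorably': 2,
    'Somewhat favorably': 1,
    'Neither favorably nor unfavorably (neutral)': 0,
    'Unfamiliar (N/A)': 0,
    'Somewhat unfavorably': -1,
    'Very unfavorably': -2,
}

# toy answers for two characters; invented values
toy_chars = pd.DataFrame({
    'Han Solo': ['Very favorably', 'Somewhat favorably', 'Very favorably'],
    'Jar Jar Binks': ['Very unfavorably', 'Unfamiliar (N/A)', 'Somewhat unfavorably'],
})

# translate every answer to its score, then average per character
toy_scores = toy_chars.apply(lambda col: col.map(score_map))
avg_rating = toy_scores.mean().sort_values()

print(avg_rating)
```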

We won't settle for the general overview, since different socio-demographic groups may think differently.

In [31]:

char_cols=star_wars.columns[15:29]

In [32]:

#calculate mean rating by gender and fanship
chars_per_fan = star_wars.pivot_table(index=['Gender', 'sw_fan'], values=char_cols).reset_index()
chars_per_fan = pd.melt(chars_per_fan, id_vars=['Gender', 'sw_fan'], var_name='character',
                    value_vars=char_cols, value_name='char_rating')
#plot
fig = px.bar(chars_per_fan, x='char_rating', y='character', orientation='h',
             color = 'Gender', facet_col = 'sw_fan', barmode='group',
             labels = {'char_rating': 'average rating',
                       'character': ''},
             category_orders = {'character': char_rating.index[::-1]},
             color_discrete_map={'Male': 'rgb(0, 107, 164)', 'Female': 'rgb(200, 82, 0)'},
             height=600)

#plot aesthetics
fig.update_traces(hovertemplate='%{y}, <i>average rating:</i> %{x:.2f}')
fig.update_xaxes(matches=None)
fig.update_layout(title={'text':'<b>Rating of the popular Star Wars characters, by gender and fanship </b>',
                         'font':{'size':22}})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1].replace('False', 'Not Fan').replace('True', 'Fan')))



fig.show()

As expected, men and women have somewhat different opinions here. Jar Jar Binks is the most disliked character overall, but not among women: female fans rate Emperor Palpatine the lowest, and female non-fans rate Darth Vader the lowest. Jar Jar Binks is also more disliked by male fans than by male non-fans. As for the favorite character, female non-fans like R2 D2 the most, while female fans seem to like Yoda a bit more than the others. Male fans and non-fans agree on their favorite, Han Solo.
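
Picking out each group's favorite and least liked character can also be done programmatically. The sketch below works on an invented table of mean ratings (the numbers are assumptions, not the survey results) and uses row-wise `idxmax` and `idxmin`.

```python
import pandas as pd

# toy mean ratings per group (rows) and character (columns); invented values
toy_means = pd.DataFrame(
    {'Han Solo': [1.6, 1.2],
     'Yoda': [1.5, 1.4],
     'Jar Jar Binks': [-0.4, 0.3],
     'Emperor Palpatine': [0.2, -0.1]},
    index=['Male', 'Female'],
)

# best and worst rated character for every group (row-wise)
favorite = toy_means.idxmax(axis=1)
least_liked = toy_means.idxmin(axis=1)

print(pd.DataFrame({'favorite': favorite, 'least liked': least_liked}))
```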

Let's see what respondents of different ages think about the saga's characters.

In [33]:

#calculate the mean character rating by gender and age
chars_per_age = star_wars.pivot_table(index=['Gender', 'Age'], values=char_cols).reset_index()
chars_per_age = pd.melt(chars_per_age, id_vars=['Gender', 'Age'], var_name='character',
                    value_vars=char_cols, value_name='char_rating')

#plot
fig = px.bar(chars_per_age, x='char_rating', y='character', orientation='h', barmode='group',
             color = 'Gender', facet_col = 'Age', facet_col_wrap=2, facet_row_spacing=0.03,
             labels = {'char_rating': 'average rating',
                       'character': 'Star Wars character'},
             category_orders = {'character': char_rating.index[::-1]},
             color_discrete_map={'Male': 'rgb(0, 107, 164)', 'Female': 'rgb(200, 82, 0)'},
             height = 1100)

#plot aesthetics
fig.update_traces(hovertemplate='%{y}, <i>average rating:</i> %{x:.2f}')
fig.update_layout(title={'text':'<b>Rating of the popular Star Wars characters, by gender and age </b>',
                         'font':{'size':22}})
fig.for_each_annotation(lambda a: a.update(text=a.text.split('=')[-1] + ' ' + 'y.o.'))

fig.show()

Most male respondents aged 44 or younger seem to dislike Jar Jar Binks strongly: his rating is firmly negative. Older men are more restrained in their judgment of this character, although he is still their least favorite. Women are split between Emperor Palpatine and Jar Jar Binks as their least favorite.

Analysis. Episode III. Survey Respondent Portrait¶

The last episode of this project looks at the distribution of the survey participants by gender, age and whether or not they are Star Wars fans.

In [34]:

#build a hierarchical table for the sunburst plot
distributions = star_wars.groupby(['sw_fan', 'Gender', 'Age']).size().reset_index()
distributions.rename({0: 'Number of respondents'}, axis=1, inplace=True)

#map the values to be used as labels on the plot
distributions['sw_fan'] = distributions['sw_fan'].map({True: 'Fans', False: 'Not fans'}) 

#plot
fig = px.sunburst(distributions, path=['sw_fan', 'Gender', 'Age'], values='Number of respondents', 
                  height=700, template='seaborn')

#plot aesthetics
fig.update_traces(hovertemplate='%{percentParent:.2%} of %{currentPath}', selector=dict(type='sunburst'))
fig.update_layout(title={'text':'<b>Distribution of respondents in various categories</b><br>Fanship, gender and age distribution',
                         'font':{'size':21}})

fig.show()

Two thirds of the respondents consider themselves fans of the Star Wars movies, which is no surprise. The data was collected through a survey, so it's logical to assume that those who don't like the movies were less likely to participate.
There are slightly more men than women among the fans, while in the non-fan group it is the other way around: slightly more women than men.
The age distribution is more or less even. It is worth mentioning, though, that about 57% of the fans are middle-aged (between 30 and 60 years old), while the non-fans are a bit older: about 60% of them are older than 45.
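
The within-group percentages quoted here correspond to what the sunburst hover shows as the share of the parent segment. A minimal sketch on invented respondents shows how such shares can be computed directly with pandas:

```python
import pandas as pd

# toy respondent table; invented values
toy_resp = pd.DataFrame({
    'sw_fan': ['Fans', 'Fans', 'Fans', 'Not fans', 'Not fans', 'Fans'],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
})

# respondents per (fan status, gender) pair
counts = toy_resp.groupby(['sw_fan', 'Gender']).size()

# share of each gender within its fan-status group (the 'percent of parent')
share_of_parent = counts / counts.groupby(level='sw_fan').transform('sum')

print(share_of_parent.round(2))
```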

Conclusion¶

Out of more than 1,100 respondents who participated in the survey, we analyzed the answers of the 835 who had seen at least one movie of the saga. Here is what we discovered:

Episode V The Empire Strikes Back is without a doubt the most watched movie and the one rated best by the survey participants.
The Episode I phenomenon was discovered. This episode is rated highly by non-fan respondents and by respondents older than 60, getting a rating almost as high as that of Episode V.
Han Solo, Yoda, Obi Wan Kenobi, Luke Skywalker, R2 D2, Princess Leia and C-3P0 are generally viewed as the most likable characters, with some fluctuation in their ratings depending on the socio-demographic group of the respondents.
The least favorite character overall is Jar Jar Binks, judging by the average character ratings. He gets especially low ratings from male respondents, while female respondents in some groups rate Emperor Palpatine and Darth Vader even lower.
Two thirds of the respondents included in the analysis consider themselves fans of the Star Wars saga; men are slightly in the majority among the fans, and women among the non-fans.
More than half of the Star Wars fans are between 30 and 60 years old, while more than half of the non-fans are 45 or older.

	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda
1	2	2	2	2	2	2	2	0	0	2	2	2	2	2
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	1	1	1	1	1	0	0	0	0	0	0	0	0	0
4	2	2	2	2	2	1	2	1	-1	2	2	2	2	2
5	2	1	1	-1	2	-2	1	0	2	1	1	-2	1	1
