Introduction

A long time ago in a pre-pandemic world...

It's a period of unclouded joy and optimistic expectation. Disney has recently bought Lucasfilm.
The Star Wars saga, beloved by many, is finally going to be continued, and the fans are waiting for
'Star Wars: Episode VII The Force Awakens' to come out. To make the wait shorter,
the team at FiveThirtyEight surveyed more than 1,100 Americans on their views about
the franchise. They used the SurveyMonkey online tool to collect the data and
not only shared the results of their analysis in a nice article but also published the
dataset they collected, which we'll use to perform our own analysis.

Prologue

The data has several columns, including:

Column                                                                 Description
RespondentID                                                           An anonymized ID for the respondent (person taking the survey)
Gender                                                                 The respondent's gender
Age                                                                    The respondent's age
Household Income                                                       The respondent's income
Education                                                              The respondent's education level
Location (Census Region)                                               The respondent's location
Have you seen any of the 6 films in the Star Wars franchise?           Has a Yes or No response
Do you consider yourself to be a fan of the Star Wars film franchise?  Has a Yes or No response

There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
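To make the checkbox problem concrete, here is a minimal sketch on made-up answers. The toy column names and the three respondents are assumptions for illustration only, not part of the survey file. A "select all that apply" question is typically exported as one column per option: a cell holds the option's label if the box was ticked and is empty otherwise.

```python
import numpy as np
import pandas as pd

# Toy export of a "select all that apply" question: one column per option.
# A ticked box stores the option label, an unticked box is left empty (NaN).
toy = pd.DataFrame({
    'option_1': ['Episode I', np.nan, 'Episode I'],
    'option_2': ['Episode II', np.nan, np.nan],
})

# A box counts as ticked exactly when its cell is not null.
ticked = toy.notnull()

# Number of boxes each respondent ticked.
boxes_per_respondent = ticked.sum(axis=1)

print(ticked)
print(boxes_per_respondent.tolist())
```

Note that an empty cell cannot tell "did not tick the box" apart from "skipped the question", which is one reason this kind of data needs careful cleaning.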

In [1]:
import pandas as pd
import numpy as np
import missingno as msno # check for missing records
import warnings

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

pd.set_option('display.max_columns', 38)
pd.set_option('max_colwidth', 150)
pd.options.display.float_format = '{:,.3f}'.format
In [2]:
#read in the file
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head()
Out[2]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 nan Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda Response Response Response Response Response Response Response Response Response
1 3,292,879,998.000 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3,292,879,538.000 No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3,292,765,271.000 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3,292,763,116.000 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

It seems that the first row is not an entry but holds subtitles for some of the columns. We'll use it later to rename the columns.

In [3]:
#save the 1st row for later 
aux_col_names = star_wars.iloc[0, :]

#drop the 1st row
star_wars = star_wars.iloc[1:, :]

aux_col_names.reset_index()
Out[3]:
index 0
0 RespondentID NaN
1 Have you seen any of the 6 films in the Star Wars franchise? Response
2 Do you consider yourself to be a fan of the Star Wars film franchise? Response
3 Which of the following Star Wars films have you seen? Please select all that apply. Star Wars: Episode I The Phantom Menace
4 Unnamed: 4 Star Wars: Episode II Attack of the Clones
5 Unnamed: 5 Star Wars: Episode III Revenge of the Sith
6 Unnamed: 6 Star Wars: Episode IV A New Hope
7 Unnamed: 7 Star Wars: Episode V The Empire Strikes Back
8 Unnamed: 8 Star Wars: Episode VI Return of the Jedi
9 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Star Wars: Episode I The Phantom Menace
10 Unnamed: 10 Star Wars: Episode II Attack of the Clones
11 Unnamed: 11 Star Wars: Episode III Revenge of the Sith
12 Unnamed: 12 Star Wars: Episode IV A New Hope
13 Unnamed: 13 Star Wars: Episode V The Empire Strikes Back
14 Unnamed: 14 Star Wars: Episode VI Return of the Jedi
15 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Han Solo
16 Unnamed: 16 Luke Skywalker
17 Unnamed: 17 Princess Leia Organa
18 Unnamed: 18 Anakin Skywalker
19 Unnamed: 19 Obi Wan Kenobi
20 Unnamed: 20 Emperor Palpatine
21 Unnamed: 21 Darth Vader
22 Unnamed: 22 Lando Calrissian
23 Unnamed: 23 Boba Fett
24 Unnamed: 24 C-3P0
25 Unnamed: 25 R2 D2
26 Unnamed: 26 Jar Jar Binks
27 Unnamed: 27 Padme Amidala
28 Unnamed: 28 Yoda
29 Which character shot first? Response
30 Are you familiar with the Expanded Universe? Response
31 Do you consider yourself to be a fan of the Expanded Universe?ξ Response
32 Do you consider yourself to be a fan of the Star Trek franchise? Response
33 Gender Response
34 Age Response
35 Household Income Response
36 Education Response
37 Location (Census Region) Response
In [4]:
#initial data exploration
star_wars.describe(include='all')
Out[4]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
count 1,186.000 1186 836 673 571 550 607 758 738 835 836 835 836 836 836 829 831 831 823 825 814 826 820 812 827 830 821 814 826 828 828 213 1068 1046 1046 858 1036 1043
unique nan 2 2 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 3 2 2 2 2 4 5 5 9
top nan Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 5 6 1 1 2 Very favorably Very favorably Very favorably Somewhat favorably Very favorably Neither favorably nor unfavorably (neutral) Very favorably Neither favorably nor unfavorably (neutral) Neither favorably nor unfavorably (neutral) Very favorably Very favorably Very unfavorably Neither favorably nor unfavorably (neutral) Very favorably Han No No No Female 45-60 $50,000 - $99,999 Some college or Associate degree East North Central
freq nan 936 552 673 571 550 607 758 738 237 300 217 204 289 232 610 552 547 269 591 213 310 236 248 474 562 204 207 605 325 615 114 641 549 291 298 328 181
mean 3,290,128,200.533 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 1,055,638.908 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 3,288,372,923.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 3,289,450,962.750 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 3,290,147,175.500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 3,290,814,462.500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 3,292,879,998.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Noticeable data cleanliness issues:

  • many columns are not named properly (not descriptive, too long, etc.),
  • some columns have only 1 unique value (e.g. 'Unnamed: 4' through 'Unnamed: 8'),
  • some columns hold numbers that are stored as strings (e.g. 'Unnamed: 10' through 'Unnamed: 14'),
  • binary variables ('Yes/No') are not formatted as booleans,
  • about 82% of the values in the 'Do you consider yourself to be a fan of the Expanded Universe?' column are null (only 213 of 1,186 are filled in).
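Issues of this kind can also be surfaced programmatically rather than by reading the summary table. The following is only a sketch on a toy frame; the column names and values are hypothetical and merely mimic the problems listed above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with the same kinds of problems as the survey.
toy = pd.DataFrame({
    'single_value': ['Episode I', np.nan, 'Episode I', 'Episode I'],
    'numbers_as_text': ['3', '1', np.nan, '2'],
    'mostly_missing': [np.nan, np.nan, np.nan, 'Yes'],
})

# Columns with only one distinct non-null value.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Share of missing values per column.
null_share = toy.isnull().mean()

# Columns whose non-null values can all be parsed as numbers.
numeric_like = [
    col for col in toy.columns
    if toy[col].notnull().any()
    and pd.to_numeric(toy[col], errors='coerce').notnull().sum() == toy[col].notnull().sum()
]

print(constant_cols)
print(null_share)
print(numeric_like)
```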

Cleaning the Data. Episode I

In this episode we'll rename the columns and fix some of the values.

We want the new column names to be shorter, so that they are easier to refer to in the code, but still descriptive.
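For the repetitive parts of such a mapping (one column per episode), the new names could also be generated by position instead of being typed out. This is a sketch on a toy frame with placeholder column names, which are assumptions; it is not the renaming used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the survey: an ID column followed by
# three "seen" columns with unhelpful names (hypothetical data).
toy = pd.DataFrame(
    [[1, 'Episode I', None, 'Episode III']],
    columns=['RespondentID', 'Which films have you seen?', 'Unnamed: 2', 'Unnamed: 3'],
)

# Build the old-name -> new-name mapping from the column positions.
old_names = toy.columns[1:4]
rename_map = {old: f'seen_ep{i}' for i, old in enumerate(old_names, start=1)}

toy = toy.rename(columns=rename_map)
print(toy.columns.tolist())
```

An explicit dictionary is longer but easier to audit; a generated one is shorter but relies on the column order staying fixed.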

In [5]:
#renaming the column names
star_wars = star_wars.rename(columns={
    'Have you seen any of the 6 films in the Star Wars franchise?': 'seen_any',
    'Do you consider yourself to be a fan of the Star Wars film franchise?': 'sw_fan',
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_ep1',
    'Unnamed: 4' : 'seen_ep2',
    'Unnamed: 5' : 'seen_ep3',
    'Unnamed: 6' : 'seen_ep4',
    'Unnamed: 7' : 'seen_ep5',
    'Unnamed: 8' : 'seen_ep6',
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_ep1',
    'Unnamed: 10': 'ranking_ep2',
    'Unnamed: 11': 'ranking_ep3',
    'Unnamed: 12': 'ranking_ep4',
    'Unnamed: 13': 'ranking_ep5',
    'Unnamed: 14': 'ranking_ep6',
    'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'Han Solo',
    'Unnamed: 16': 'Luke Skywalker',
    'Unnamed: 17': 'Princess Leia Organa',
    'Unnamed: 18': 'Anakin Skywalker',
    'Unnamed: 19': 'Obi Wan Kenobi',
    'Unnamed: 20': 'Emperor Palpatine',
    'Unnamed: 21': 'Darth Vader',
    'Unnamed: 22': 'Lando Calrissian',
    'Unnamed: 23': 'Boba Fett',
    'Unnamed: 24': 'C-3P0',
    'Unnamed: 25': 'R2 D2',
    'Unnamed: 26': 'Jar Jar Binks',
    'Unnamed: 27': 'Padme Amidala',
    'Unnamed: 28': 'Yoda',
    'Which character shot first?': 'shot_first',
    'Are you familiar with the Expanded Universe?': 'know_eu',
    'Do you consider yourself to be a fan of the Expanded Universe?ξ': 'eu_fan',
    'Do you consider yourself to be a fan of the Star Trek franchise?': 'st_fan'
})

star_wars.head(5)
Out[5]:
RespondentID seen_any sw_fan seen_ep1 seen_ep2 seen_ep3 seen_ep4 seen_ep5 seen_ep6 ranking_ep1 ranking_ep2 ranking_ep3 ranking_ep4 ranking_ep5 ranking_ep6 Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda shot_first know_eu eu_fan st_fan Gender Age Household Income Education Location (Census Region)
1 3,292,879,998.000 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3,292,879,538.000 No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3,292,765,271.000 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3,292,763,116.000 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3,292,731,220.000 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 4 6 2 1 3 Very favorably Somewhat favorably Somewhat favorably Somewhat unfavorably Very favorably Very unfavorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat favorably Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

The columns 'seen_any', 'sw_fan', 'know_eu', 'eu_fan' and 'st_fan' contain either 'Yes' or 'No', with some missing values in between. For ease of use throughout the analysis, these values are mapped to booleans.
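A small sketch of how such a mapping treats values that are not in the dictionary (the four answers below are made up for illustration): they come out as NaN, so missing answers stay missing and any unexpected spelling becomes visible as a new gap.

```python
import numpy as np
import pandas as pd

# Hypothetical answers, including a missing one and a differently spelled 'yes'.
answers = pd.Series(['Yes', 'No', np.nan, 'yes'])

# Series.map looks every value up in the dictionary;
# values without a matching key (NaN, or the lowercase 'yes') become NaN.
mapped = answers.map({'Yes': True, 'No': False})

print(mapped.tolist())
```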

In [6]:
# convert Yes/No to boolean
yes_no_mapping = {'Yes': True, 'No': False}

yes_no_cols = ['seen_any', 'sw_fan', 'know_eu', 'eu_fan', 'st_fan']

for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(yes_no_mapping)

star_wars.head()
Out[6]:
RespondentID seen_any sw_fan seen_ep1 seen_ep2 seen_ep3 seen_ep4 seen_ep5 seen_ep6 ranking_ep1 ranking_ep2 ranking_ep3 ranking_ep4 ranking_ep5 ranking_ep6 Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda shot_first know_eu eu_fan st_fan Gender Age Household Income Education Location (Census Region)
1 3,292,879,998.000 True True Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question True False False Male 18-29 NaN High school degree South Atlantic
2 3,292,879,538.000 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN True Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3,292,765,271.000 True False Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question False NaN False Male 18-29 $0 - $24,999 High school degree West North Central
4 3,292,763,116.000 True True Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question False NaN True Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3,292,731,220.000 True True Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 4 6 2 1 3 Very favorably Somewhat favorably Somewhat favorably Somewhat unfavorably Very favorably Very unfavorably Somewhat favorably Neither favorably nor unfavorably (neutral) Very favorably Somewhat favorably Somewhat favorably Very unfavorably Somewhat favorably Somewhat favorably Greedo True False False Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

The columns 'seen_ep1' through 'seen_ep6' indicate whether the respondent saw the corresponding movie. If the movie name is listed, the respondent saw the episode; a NaN value means that either the question was not answered or the respondent didn't see the movie. We'll convert these columns to Boolean type as well: cells with a movie name become True and null cells become False.

In [7]:
#convert 'seen_ep?' columns to boolean
star_wars.loc[:,'seen_ep1':'seen_ep6'] = star_wars.loc[:,'seen_ep1':'seen_ep6'].notnull()
star_wars.loc[:,'seen_ep1':'seen_ep6'].head()
Out[7]:
seen_ep1 seen_ep2 seen_ep3 seen_ep4 seen_ep5 seen_ep6
1 True True True True True True
2 False False False False False False
3 True True True False False False
4 True True True True True True
5 True True True True True True

The next six columns ask the respondent to rank the Star Wars movies in order of preference: 1 means the film was the favorite, and 6 means it was the least favorite. Each of these columns can contain the value 1, 2, 3, 4, 5, 6, or NaN. For further analysis we'll convert these columns to numeric ones and invert the ranking, so that 6 means the favorite and 1 the least favorite.
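The inversion itself is plain arithmetic: on a 1 to 6 scale, 7 - x swaps the two ends. A minimal sketch on made-up rankings (the toy column names and values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy rankings stored as text, as in the raw survey (hypothetical values).
ranks = pd.DataFrame({
    'ranking_a': ['1', '6', np.nan],
    'ranking_b': ['2', '5', '3'],
})

# Convert to numbers, then invert: 7 - x maps 1 -> 6, 2 -> 5, ..., 6 -> 1.
# Missing rankings stay missing because 7 - NaN is NaN.
inverted = 7 - ranks.astype(float)

print(inverted)
```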

In [8]:
#convert the rankings to numbers and invert them: 7 - x maps 1 -> 6, ..., 6 -> 1 (NaN stays NaN)
ranking_cols = star_wars.loc[:, 'ranking_ep1':'ranking_ep6'].columns
star_wars[ranking_cols] = 7 - star_wars[ranking_cols].astype(float)
star_wars[ranking_cols].head()
Out[8]:
ranking_ep1 ranking_ep2 ranking_ep3 ranking_ep4 ranking_ep5 ranking_ep6
1 4.000 5.000 6.000 3.000 2.000 1.000
2 nan nan nan nan nan nan
3 6.000 5.000 4.000 3.000 2.000 1.000
4 2.000 1.000 6.000 5.000 3.000 4.000
5 2.000 3.000 1.000 5.000 6.000 4.000

The main Star Wars character columns are answers to the question 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her'.

We will convert the answers to points so that we can make calculations:

  • Very favorably - the character receives 2 points
  • Somewhat favorably - the character receives 1 point
  • Neither favorably nor unfavorably (neutral) - 0 points
  • Somewhat unfavorably - 1 point is deducted from the character
  • Very unfavorably - 2 points are deducted from the character
  • Unfamiliar (N/A) or no answer - 0 points
In [9]:
points = {
    'Very favorably': 2,
    'Somewhat favorably': 1,
    'Neither favorably nor unfavorably (neutral)': 0,
    'Somewhat unfavorably': -1,
    'Very unfavorably': -2,
    'Unfamiliar (N/A)': 0,
    np.nan: 0
    }

for col in star_wars.loc[:, 'Han Solo' : 'Yoda']:
    star_wars[col] = star_wars[col].map(points)
    
star_wars.loc[:, 'Han Solo' : 'Yoda'].head()
Out[9]:
Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda
1 2 2 2 2 2 2 2 0 0 2 2 2 2 2
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 0 0 0 0 0 0 0 0 0
4 2 2 2 2 2 1 2 1 -1 2 2 2 2 2
5 2 1 1 -1 2 -2 1 0 2 1 1 -2 1 1
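The mapping above gives 0 points to a neutral opinion, to 'Unfamiliar (N/A)' and to a missing answer alike, so a little-known character is pulled towards 0. A possible alternative, shown here only as a sketch on made-up answers and not used in this notebook, is to leave the unfamiliar and missing answers as NaN so that averages cover only the respondents who have an opinion.

```python
import numpy as np
import pandas as pd

# Hypothetical answers about one character.
answers = pd.Series([
    'Very favorably', 'Unfamiliar (N/A)', 'Very unfavorably', np.nan, 'Somewhat favorably',
])

points_known_only = {
    'Very favorably': 2,
    'Somewhat favorably': 1,
    'Neither favorably nor unfavorably (neutral)': 0,
    'Somewhat unfavorably': -1,
    'Very unfavorably': -2,
    # 'Unfamiliar (N/A)' is deliberately left out, so it maps to NaN
}

scores = answers.map(points_known_only)

# Series.mean skips NaN, so this average covers only respondents with an opinion.
mean_known_only = scores.mean()            # (2 - 2 + 1) / 3
# Counting unfamiliar and missing answers as 0 spreads the same points over everyone.
mean_with_zeros = scores.fillna(0).mean()  # (2 + 0 - 2 + 0 + 1) / 5

print(mean_known_only, mean_with_zeros)
```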

Cleaning the Data. Episode II

This episode is all about dealing with missing values. We'll start by identifying them.

In [10]:
#overview of missing values
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)
  • as we've seen earlier, 'eu_fan' is the column with the most missing values;
  • there are several columns with about 30% of their values missing; we'll check whether they share a common pattern and what can be done about it;
  • some respondents (about 10%) preferred not to answer the questions about their socio-demographic status; it will probably be possible to impute those missing values.

The 'seen_any' column might help us deal with part of the missing data. The respondents who haven't seen any of the Star Wars movies probably skipped the questions that asked them to rank each movie, as well as the question about who shot first.

In [11]:
not_seen = star_wars[star_wars['seen_any'] == False]
print("The number of respondents who haven't seen any of the Star Wars movies:", not_seen.shape[0])
print('They left the following questions without an answer:')
not_seen.isnull().sum()
The number of respondents who haven't seen any of the Star Wars movies: 250
They left the following questions without an answer:
Out[11]:
RespondentID                  0
seen_any                      0
sw_fan                      250
seen_ep1                      0
seen_ep2                      0
seen_ep3                      0
seen_ep4                      0
seen_ep5                      0
seen_ep6                      0
ranking_ep1                 250
ranking_ep2                 250
ranking_ep3                 250
ranking_ep4                 250
ranking_ep5                 250
ranking_ep6                 250
Han Solo                      0
Luke Skywalker                0
Princess Leia Organa          0
Anakin Skywalker              0
Obi Wan Kenobi                0
Emperor Palpatine             0
Darth Vader                   0
Lando Calrissian              0
Boba Fett                     0
C-3P0                         0
R2 D2                         0
Jar Jar Binks                 0
Padme Amidala                 0
Yoda                          0
shot_first                  250
know_eu                     250
eu_fan                      250
st_fan                       10
Gender                       24
Age                          24
Household Income             67
Education                    30
Location (Census Region)     25
dtype: int64

This confirms that the entries from those who haven't seen any of the Star Wars movies will not contribute to the analysis: all 250 of them left the fan, ranking, 'shot first' and Expanded Universe questions unanswered. So we'll continue only with the answers from the respondents who saw at least one episode of the saga.

In [12]:
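#keep an independent copy so that later column assignments don't act on a slice of the original
star_wars = star_wars[star_wars['seen_any'] == True].copy()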

The null values in the 'sw_fan', 'know_eu' and 'st_fan' columns can be imputed with False and the null values in 'shot_first' with "I don't understand this question".

In [13]:
bool_cols = ['sw_fan', 'know_eu', 'st_fan']
star_wars[bool_cols] = star_wars[bool_cols].fillna(False)
#use the same spelling (straight apostrophe) as the existing answer, so no extra category is created
star_wars['shot_first'] = star_wars['shot_first'].fillna("I don't understand this question")

As for the 'eu_fan' column, similarly to 'sw_fan', we assume that a person who is not familiar with the Expanded Universe can't be a fan of it.

In [14]:
star_wars.loc[star_wars['know_eu'] == False, 'eu_fan'] = False
In [15]:
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)

There might be a group of respondents who answered the first question (whether they had seen any movie of the saga) but then lost interest in the survey and left the other questions unanswered.

In [16]:
#rows where all six rankings are missing
star_wars[star_wars.loc[:, 'ranking_ep1':'ranking_ep6'].isnull().all(axis=1)].describe(include='all')
Out[16]:
RespondentID seen_any sw_fan seen_ep1 seen_ep2 seen_ep3 seen_ep4 seen_ep5 seen_ep6 ranking_ep1 ranking_ep2 ranking_ep3 ranking_ep4 ranking_ep5 ranking_ep6 Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda shot_first know_eu eu_fan st_fan Gender Age Household Income Education Location (Census Region)
count 100.000 100 100 100 100 100 100 100 100 0.000 0.000 0.000 0.000 0.000 0.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100.000 100 100 100 100 0 0 0 0 0
unique nan 1 1 1 1 1 1 1 1 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1 1 1 1 0 0 0 0 0
top nan True False False False False False False False nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan I don´t understand this question False False False NaN NaN NaN NaN NaN
freq nan 100 100 100 100 100 100 100 100 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 100 100 100 100 NaN NaN NaN NaN NaN
mean 3,290,145,657.870 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 886,869.706 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 3,288,455,900.000 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 3,289,614,702.500 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 3,290,353,222.000 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 3,290,725,892.000 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 3,292,637,870.000 NaN NaN NaN NaN NaN NaN NaN NaN nan nan nan nan nan nan 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN

With the above hypothesis confirmed, we can drop those rows without hesitation.

In [17]:
star_wars.drop(star_wars[star_wars['ranking_ep1'].isnull()].index, inplace=True)
In [18]:
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    msno.bar(star_wars)

The data in 'Household Income' is categorical. Given the significant number of missing values in it, we don't want to lose that much information, so for now we'll fill the gaps with 'no info', effectively introducing a new category. The rest of the missing values will be replaced by the most common value in their column.
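The two strategies can be sketched on a toy frame first. The rows below are made up; only the column names and the income labels are borrowed from the survey.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Household Income': ['$0 - $24,999', np.nan, '$50,000 - $99,999', np.nan],
    'Gender': ['Female', 'Female', np.nan, 'Male'],
})

# Many gaps: keep them visible as a category of their own.
toy['Household Income'] = toy['Household Income'].fillna('no info')

# Few gaps: fill each remaining column with its most common value.
# mode() returns a DataFrame; its first row holds one mode per column.
toy = toy.fillna(toy.mode().iloc[0])

print(toy)
```

Filling with the mode keeps every row but slightly inflates the most common category, which is acceptable only while the share of imputed values is small.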

In [19]:
star_wars['Household Income'] = star_wars['Household Income'].fillna('no info')
star_wars = star_wars.fillna(star_wars.mode().iloc[0])

Analysis. Episode I. Most seen and best ranked

Not only do the films compete with each other for people's love and recognition, but so do the trilogies: the original trilogy (episodes IV - VI), released between 1977 and 1983, and the prequel trilogy (episodes I - III), released between 1999 and 2005.

First, we'll find out how many respondents saw each movie and thus which movie is the most seen one.
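Because the 'seen' columns are boolean, the count of viewers is a column sum and the share of viewers is a column mean. A sketch on four made-up respondents (the flags below are hypothetical) shows the per-movie share, the average per trilogy and the most seen movie:

```python
import pandas as pd

# Toy "seen" flags for four respondents and six episodes (hypothetical data).
seen_flags = pd.DataFrame({
    'seen_ep1': [True, True, False, True],
    'seen_ep2': [True, False, False, True],
    'seen_ep3': [True, False, False, True],
    'seen_ep4': [True, True, True, True],
    'seen_ep5': [True, True, True, True],
    'seen_ep6': [True, True, False, True],
})

# True counts as 1, so the sum is the number of viewers and the mean is their share.
views = seen_flags.sum()
share = seen_flags.mean().round(2)

# Average share per trilogy: episodes I-III and episodes IV-VI.
prequels = share.iloc[:3].mean()
originals = share.iloc[3:].mean()

# idxmax returns the first column with the highest share.
most_seen = share.idxmax()

print(views.tolist(), share.tolist(), prequels, originals, most_seen)
```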

In [20]:
#complete movie titles, to be used for plot labeling
titles = [
'Episode I The Phantom Menace',
'Episode II Attack of the Clones',
'Episode III Revenge of the Sith',
'Episode IV A New Hope',
'Episode V The Empire Strikes Back',
'Episode VI Return of the Jedi'
] 

#columns which refer to if a movie was seen by a respondent
seen_cols = ['seen_ep1', 'seen_ep2', 'seen_ep3', 'seen_ep4', 'seen_ep5', 'seen_ep6']

#the columns which refer to movie rankings
rank_cols = ['ranking_ep1', 'ranking_ep2', 'ranking_ep3', 'ranking_ep4', 'ranking_ep5', 'ranking_ep6']
In [21]:
#prepare the dataframe for plotting
seen = pd.DataFrame(data=[titles, 
                          star_wars.loc[:,seen_cols].sum(), 
                          (star_wars.loc[:,seen_cols].sum()/star_wars.loc[:,seen_cols].shape[0]).round(2)]).T

seen.columns = ['Star Wars movie', 'Number of views', 'Views_per']

#define the average views per trilogy
trilogy_views = ['',] * 6
trilogy_views[:3] = [seen.loc[0:2, 'Views_per'].sum()/3, ] * 3
trilogy_views[-3:] = [seen.loc[3:5, 'Views_per'].sum()/3, ] * 3
seen['views_per_trilogy'] = trilogy_views

#plot
fig = px.bar(seen, x='Number of views', y='Star Wars movie', orientation='h', 
             custom_data=['Views_per', 'views_per_trilogy'], 
             category_orders={'Star Wars movie':titles})

#plot aesthetics
##color map highlighting only the most seen movie
colors=[] 
for val in seen['Number of views']:
    colors.append('rgb(252, 128, 14)' if val ==  seen['Number of views'].max() else 'rgb(137, 137, 137)')

fig.update_traces(hovertemplate='<i>Views:</i> %{x} <br><i>seen by %{customdata[0]:.0%} of respondents</i> <br> (the trilogy seen by %{customdata[1]:.0%} of respondents) ', 
                  marker_color=colors)

fig.update_layout(title={'text':'<b>Views received by each movie in the Star Wars franchise</b><br>based on 835 respondents',
                         'font':{'size':22}},
                  yaxis = {'ticksuffix': '  ',
                           'tickfont':{'size':16}})

fig.show()