A long time ago in a pre-pandemic world...
The data has several columns, including:
Column | Descritpion |
---|---|
RespondentID |
An anonymized ID for the respondent (person taking the survey) |
Gender |
The respondent's gender |
Age |
The respondent's age |
Household Income |
The respondent's income |
Education |
The respondent's education level |
Location (Census Region) |
The respondent's location |
Have you seen any of the 6 films in the Star Wars franchise? |
Has a Yes or No response |
Do you consider yourself to be a fan of the Star Wars film franchise? |
Has a Yes or No response |
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
import pandas as pd
import numpy as np
import missingno as msno # check for missing records
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
pd.set_option('display.max_columns', 38)
pd.set_option('max_colwidth', 150)
pd.options.display.float_format = '{:,.3f}'.format
#read in the file
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head()
It seems that the first row is not an entry but a subtitle for some of the columns instead. We´ll use it later to rename the columns.
#save the 1st row for later
aux_col_names = star_wars.iloc[0, :]
#drop the 1st row
star_wars = star_wars.iloc[1:, :]
aux_col_names.reset_index()
#initial data exploration
star_wars.describe(include='all')
Noticeable data cleanliness issues:
'Unnamed: 4' : 'Unnamed: 8'
),'Unnamed: 10' : 'Unnamed: 14'
),'Do you consider yourself to be a fan of the Expanded Universe?'
column has got almost 80% of its values null. In this episode we´ll rename the columns and also fix some of the values.
We want new column names be shorter for an easier referring to them in the code but still descriptive.
#renaming the column names
star_wars = star_wars.rename(columns={
'Have you seen any of the 6 films in the Star Wars franchise?': 'seen_any',
'Do you consider yourself to be a fan of the Star Wars film franchise?': 'sw_fan',
'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_ep1',
'Unnamed: 4' : 'seen_ep2',
'Unnamed: 5' : 'seen_ep3',
'Unnamed: 6' : 'seen_ep4',
'Unnamed: 7' : 'seen_ep5',
'Unnamed: 8' : 'seen_ep6',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'ranking_ep1',
'Unnamed: 10': 'ranking_ep2',
'Unnamed: 11': 'ranking_ep3',
'Unnamed: 12': 'ranking_ep4',
'Unnamed: 13': 'ranking_ep5',
'Unnamed: 14': 'ranking_ep6',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'Han Solo',
'Unnamed: 16': 'Luke Skywalker',
'Unnamed: 17': 'Princess Leia Organa',
'Unnamed: 18': 'Anakin Skywalker',
'Unnamed: 19': 'Obi Wan Kenobi',
'Unnamed: 20': 'Emperor Palpatine',
'Unnamed: 21': 'Darth Vader',
'Unnamed: 22': 'Lando Calrissian',
'Unnamed: 23': 'Boba Fett',
'Unnamed: 24': 'C-3P0',
'Unnamed: 25': 'R2 D2',
'Unnamed: 26': 'Jar Jar Binks',
'Unnamed: 27': 'Padme Amidala',
'Unnamed: 28': 'Yoda',
'Which character shot first?': 'shot_first',
'Are you familiar with the Expanded Universe?': 'know_eu',
'Do you consider yourself to be a fan of the Expanded Universe?Âæ': 'eu_fan',
'Do you consider yourself to be a fan of the Star Trek franchise?': 'st_fan'
})
star_wars.head(5)
The values 'seen_any'
, 'sw_fan'
,'know_eu'
, 'eu_fan'
and 'st_fan'
contain either 'Yes' or 'No' values, with some missing values in between. For ease of usage throughout the analysis, these values are mapped to boolean.
# convert Yes/No to boolean
yes_no_mapping = {'Yes': True, 'No': False}
yes_no_cols = ['seen_any', 'sw_fan', 'know_eu', 'eu_fan', 'st_fan']
for col in yes_no_cols:
star_wars[col] = star_wars[col].map(yes_no_mapping)
star_wars.head()
The columns 'seen_ep1' : 'seen_ep6'
indicate whether the respondent saw the correspondent movie or no. If the movie name is listed, the respondent saw the episode and the NaN
values indicate that either the question was not answered or the respondent didn´t see the movie. We´ll convert these columns to Boolean type as well. The values with the movie name will be converted to True
and the null values to False
.
#convert 'seen_ep?' columns to boolean
star_wars.loc[:,'seen_ep1':'seen_ep6'] = star_wars.loc[:,'seen_ep1':'seen_ep6'].notnull()
star_wars.loc[:,'seen_ep1':'seen_ep6'].head()
The next six columns ask the respondent to rank the Star Wars movies in order of least favorite to most favorite. 1
means the film was the most favorite, and 6
means it was the least favorite. Each of the following columns can contain the value 1
, 2
, 3
, 4
, 5
, 6
, or NaN
. For the further analysis we´ll convert these column to numeric ones and also we are going to invert the rating so that '6'
would mean the most favorite and '1'
- the least favorite.
star_wars.loc[:,'ranking_ep1':'ranking_ep6'] = star_wars.loc[:,'ranking_ep1':'ranking_ep6'].astype(float)
star_wars.loc[:,'ranking_ep1':'ranking_ep6'] = star_wars.loc[:,'ranking_ep1':'ranking_ep6'].applymap(lambda x: 7-x)
star_wars.loc[:,'ranking_ep1':'ranking_ep6'].head()
The main Star Wars character columns are answers to the question 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her'.
We will convert the ranking system to numbers so that we can make calculations:
Very favorably
- for each rating the character will receive 2 pointsSomewhat favorably
- for each rating the character will receive 1 pointNeither favorably nor unfavorably (neutral)
- 0 pointsSomewhat unfavorably
- for each rating the character will be deducted 1 pointVery unfavorably
- for each rating the character will be deducted 2 pointsUnfamiliar (N/A)
- 0 pointspoints = {
'Very favorably': 2,
'Somewhat favorably': 1,
'Neither favorably nor unfavorably (neutral)': 0,
'Somewhat unfavorably': -1,
'Very unfavorably': -2,
'Unfamiliar (N/A)': 0,
np.NaN: 0
}
for col in star_wars.loc[:, 'Han Solo' : 'Yoda']:
star_wars[col] = star_wars[col].map(points)
star_wars.loc[:, 'Han Solo' : 'Yoda'].head()
This episode is all about dealing with missing values. We´ll start with identifying them.
#overview of missing values
with warnings.catch_warnings():
warnings.filterwarnings('ignore')
msno.bar(star_wars)
'eu_fan'
is the column with most values missing;The 'seen_any'
column might help us to deal with a part of missing data. Probably the respondents who haven´t seen any of the Star Wars movies left without an answer the questions where they were asked to rate each movie and didn´t understand the question about who shot first.
print('The number of respondents who haven´t seen any of the Star Wars movies:', star_wars[star_wars['seen_any'] == False].shape[0])
print('They left the following questions without an answer:')
star_wars[star_wars['seen_any'] == False].isnull().sum()
It´s confirmed now that the entries from those who haven´t seen any of the Star Wars movies will not contribute to the analysis. So we´ll continue only with the answers from the respondents who saw at least one episode of the saga.
star_wars = star_wars[star_wars['seen_any'] == True]
The null values in the 'sw_fan
', 'know_eu'
and 'st_fan'
columns can be imputed with False
and the null values in the 'shot_first'
with 'I don´t understand this question'
.
star_wars[['sw_fan', 'know_eu', 'st_fan']] = star_wars[['sw_fan', 'know_eu', 'st_fan']].fillna(False).fillna('')
star_wars['shot_first'] = star_wars['shot_first'].fillna('I don´t understand this question')
As for the 'eu_fan'
column, similarly to the 'sw_fan'
we suppose that if a person not familiar with Expanded Universe can´t be a fan of it.
star_wars.loc[star_wars['know_eu'] == False, 'eu_fan'] = False
with warnings.catch_warnings():
warnings.filterwarnings('ignore')
msno.bar(star_wars)
There might be a group of respondents who answered the first question if they had seen any movie of the saga but then lost their interest in completing the survey and left other questions without an answer.
star_wars[star_wars.loc[:, 'ranking_ep1':'ranking_ep6'].isnull().apply(lambda x: all(x), axis=1)].describe(include='all')
After confirming the above hypothesis we can drop those rows without any hesitations.
star_wars.drop(star_wars[star_wars['ranking_ep1'].isnull()].index, inplace=True)
with warnings.catch_warnings():
warnings.filterwarnings('ignore')
msno.bar(star_wars)
The data in the 'Household income'
is categorical data. Considering a significant amount of missing values in it, we wouldn´t want to loose that much information. At the moment we´ll fill it with 'no info'
introducing like a new category. And the rest of the missing values will be replaced by a most common value in the column.
star_wars['Household Income'] = star_wars['Household Income'].fillna('no info')
star_wars = star_wars.fillna(star_wars.mode().iloc[0])
Not only the films compete with each other for the people´s love and recognition, but also the trilogies: the episodes IV - VI, originally released between 1977 and 1983, and the prequel trilogy released later in 1999 through 2005.
First, we´ll discover which how many respondents saw each movie and thus find out which movie is a most seen one.
#complete movies´ titles, to be used for plot labeling
titles = [
'Episode I The Phantom Menace',
'Episode II Attack of the Clones',
'Episode III Revenge of the Sith',
'Episode IV A New Hope',
'Episode V The Empire Strikes Back',
'Episode VI Return of the Jedi'
]
#columns which refer to if a movie was seen by a respondent
seen_cols = ['seen_ep1', 'seen_ep2', 'seen_ep3', 'seen_ep4', 'seen_ep5', 'seen_ep6']
#the columns which refer to movie rankings
rank_cols = ['ranking_ep1', 'ranking_ep2', 'ranking_ep3', 'ranking_ep4', 'ranking_ep5', 'ranking_ep6']
#prepare the dataframe for plotting
seen = pd.DataFrame(data=[titles,
star_wars.loc[:,seen_cols].sum(),
(star_wars.loc[:,seen_cols].sum()/star_wars.loc[:,seen_cols].shape[0]).round(2)]).T
seen.columns = ['Star Wars movie', 'Number of views', 'Views_per']
#define the average niews per trilogy
trilogy_views = ['',] * 6
trilogy_views[:3] = [seen.loc[0:2, 'Views_per'].sum()/3, ] * 3
trilogy_views[-3:] = [seen.loc[3:6, 'Views_per'].sum()/3, ] * 3
seen['views_per_trilogy'] = trilogy_views
#plot
fig = px.bar(seen, x='Number of views', y='Star Wars movie', orientation='h',
custom_data=['Views_per', 'views_per_trilogy'],
category_orders={'Star Wars movie':titles})
#plot aesthetics
##color map highlighting only the most seen movie
colors=[]
for val in seen['Number of views']:
colors.append('rgb(252, 128, 14)' if val == seen['Number of views'].max() else 'rgb(137, 137, 137)')
fig.update_traces(hovertemplate='<i>Views:</i> %{x} <br><i>seen by %{customdata[0]:.0%} of respondents</i> <br> (the trilogy seen by %{customdata[1]:.0%} of respondents) ',
marker_color=colors)
fig.update_layout(title={'text':'<b>Views recieved by each movie in the Star Wars franchise</b><br>based on 835 respondents',
'font':{'size':22}},
yaxis = {'ticksuffix': ' ',
'tickfont':{'size':16}})
fig.show()