America's Favourite Star Wars Movies¶

(and Least Favourite Characters)¶

by Raghav_A

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 1186 total responses, which can be downloaded from their GitHub repository.

For this project, I'll clean and explore the dataset in a Jupyter notebook and answer some questions that should interest any Star Wars fan.

Let's get started.

Reading Dataset¶

First, I will import the relevant Python libraries and read the dataset into a pandas DataFrame object -

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

Exploring the Dataset¶

To understand the shape of the dataset, the dtypes of its columns, and the nature of its data, we will explore it using the df.head() and df.info() methods and the df.shape and df.columns attributes -

In [2]:

star_wars.head()

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	...	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

5 rows × 38 columns

In [3]:

star_wars.shape

Out[3]:

(1187, 38)

In [4]:

star_wars.columns

Out[4]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

In [5]:

character_names = pd.DataFrame(star_wars.iloc[0,15:29])

In [6]:

character_names

Out[6]:

	0
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Han Solo
Unnamed: 16	Luke Skywalker
Unnamed: 17	Princess Leia Organa
Unnamed: 18	Anakin Skywalker
Unnamed: 19	Obi Wan Kenobi
Unnamed: 20	Emperor Palpatine
Unnamed: 21	Darth Vader
Unnamed: 22	Lando Calrissian
Unnamed: 23	Boba Fett
Unnamed: 24	C-3P0
Unnamed: 25	R2 D2
Unnamed: 26	Jar Jar Binks
Unnamed: 27	Padme Amidala
Unnamed: 28	Yoda

In [7]:

star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   float64
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1187 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                                                                          837 non-null    object 
 3   Which of the following Star Wars films have you seen? Please select all that apply.                                                            674 non-null    object 
 4   Unnamed: 4                                                                                                                                     572 non-null    object 
 5   Unnamed: 5                                                                                                                                     551 non-null    object 
 6   Unnamed: 6                                                                                                                                     608 non-null    object 
 7   Unnamed: 7                                                                                                                                     759 non-null    object 
 8   Unnamed: 8                                                                                                                                     739 non-null    object 
 9   Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.  836 non-null    object 
 10  Unnamed: 10                                                                                                                                    837 non-null    object 
 11  Unnamed: 11                                                                                                                                    836 non-null    object 
 12  Unnamed: 12                                                                                                                                    837 non-null    object 
 13  Unnamed: 13                                                                                                                                    837 non-null    object 
 14  Unnamed: 14                                                                                                                                    837 non-null    object 
 15  Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.                                 830 non-null    object 
 16  Unnamed: 16                                                                                                                                    832 non-null    object 
 17  Unnamed: 17                                                                                                                                    832 non-null    object 
 18  Unnamed: 18                                                                                                                                    824 non-null    object 
 19  Unnamed: 19                                                                                                                                    826 non-null    object 
 20  Unnamed: 20                                                                                                                                    815 non-null    object 
 21  Unnamed: 21                                                                                                                                    827 non-null    object 
 22  Unnamed: 22                                                                                                                                    821 non-null    object 
 23  Unnamed: 23                                                                                                                                    813 non-null    object 
 24  Unnamed: 24                                                                                                                                    828 non-null    object 
 25  Unnamed: 25                                                                                                                                    831 non-null    object 
 26  Unnamed: 26                                                                                                                                    822 non-null    object 
 27  Unnamed: 27                                                                                                                                    815 non-null    object 
 28  Unnamed: 28                                                                                                                                    827 non-null    object 
 29  Which character shot first?                                                                                                                    829 non-null    object 
 30  Are you familiar with the Expanded Universe?                                                                                                   829 non-null    object 
 31  Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦                                                                             214 non-null    object 
 32  Do you consider yourself to be a fan of the Star Trek franchise?                                                                               1069 non-null   object 
 33  Gender                                                                                                                                         1047 non-null   object 
 34  Age                                                                                                                                            1047 non-null   object 
 35  Household Income                                                                                                                               859 non-null    object 
 36  Education                                                                                                                                      1037 non-null   object 
 37  Location (Census Region)                                                                                                                       1044 non-null   object 
dtypes: float64(1), object(37)
memory usage: 352.5+ KB

Inspecting and Cleaning the Dataset¶

Before we can proceed with the analysis and subsequent visualisation of the data, the dataset needs to be checked for any inconsistencies and bad data that might affect our analysis.

Due to the nature of this data, I decided that it is best to move and inspect the data column-by-column.

Cleaning Columns No. 1 to 3¶

Column 1 - Only one RespondentID value is null, in the first row of the dataset. That row holds the answer labels (sub-headers) rather than a response, so we delete it and exclude it from our analysis.
Columns 2 & 3 - These 2 columns contain Yes, No and NaN values. For the sake of our analysis, we will convert the Yes values to True and the No values to False. Also, if a surveyee has not answered the question Do you consider yourself to be a fan of the Star Wars film franchise?, we will assume that they are not a fan of Star Wars and change the NaN value to False.

Let's make the changes -

In [8]:

# Removing Null RespondentID rows
star_wars = star_wars[star_wars['RespondentID'].notnull()]

In [9]:

# Displaying the top 6 rows of the first 3 columns...
star_wars[star_wars.columns[:3]].head(6)

Out[9]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?
1	3.292880e+09	Yes	Yes
2	3.292880e+09	No	NaN
3	3.292765e+09	Yes	No
4	3.292763e+09	Yes	Yes
5	3.292731e+09	Yes	Yes
6	3.292719e+09	Yes	Yes

In [10]:

# Converting the Yes into Boolean True and No & NaN into Boolean False...(for Columns 2 & 3)
cols_1_to_3 = star_wars[star_wars.columns[1:3]].applymap(lambda element: True if element=='Yes' else False).copy()
star_wars[star_wars.columns[1:3]] = cols_1_to_3

In [11]:

# Displaying the top 5 rows of the transformed first 3 columns...
star_wars[star_wars.columns[:3]].head()

Out[11]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?
1	3.292880e+09	True	True
2	3.292880e+09	False	False
3	3.292765e+09	True	False
4	3.292763e+09	True	True
5	3.292731e+09	True	True
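
The Yes/No rule described above (Yes becomes True, while No and missing answers become False) can also be sketched with a vectorised comparison instead of an element-wise lambda. This is a minimal illustration on toy data; the frame `toy` and its column names are hypothetical and are not part of the survey file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two Yes/No survey columns (hypothetical names and values).
toy = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes"],
    "is_fan": ["Yes", np.nan, "No"],
})

# Comparing against "Yes" gives True only for "Yes";
# both "No" and NaN compare as False, matching the rule above.
toy_bool = toy.eq("Yes")

print(toy_bool)
```

Because NaN never compares equal to a string, no separate fillna step is needed for this particular rule.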

Cleaning Columns No. 4 to 9¶

Columns 4 to 9 contain either NaN or a non-null value (the name of the movie). For the sake of our analysis, we will convert the non-null values to True and the NaN values to False.
A more suitable name for these columns would be Seen_1 for when the surveyee has seen The Phantom Menace (Episode 1), Seen_2 for when the surveyee has seen Attack of the Clones (Episode 2) and so on till Episode 6 (column no. 9).

Let's make the changes -

In [12]:

# Displaying the top 5 rows of column indexes 3 to 8...
star_wars[star_wars.columns[3:9]].head()

Out[12]:

	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
1	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
2	NaN	NaN	NaN	NaN	NaN	NaN
3	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN
4	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
5	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi

In [13]:

# Displaying the Value-Counts of columns 4 to 9...
[star_wars[col].value_counts(dropna=False) for col in star_wars.columns[3:9]]

Out[13]:

[Star Wars: Episode I  The Phantom Menace    673
 NaN                                         513
 Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64,
 NaN                                            615
 Star Wars: Episode II  Attack of the Clones    571
 Name: Unnamed: 4, dtype: int64,
 NaN                                            636
 Star Wars: Episode III  Revenge of the Sith    550
 Name: Unnamed: 5, dtype: int64,
 Star Wars: Episode IV  A New Hope    607
 NaN                                  579
 Name: Unnamed: 6, dtype: int64,
 Star Wars: Episode V The Empire Strikes Back    758
 NaN                                             428
 Name: Unnamed: 7, dtype: int64,
 Star Wars: Episode VI Return of the Jedi    738
 NaN                                         448
 Name: Unnamed: 8, dtype: int64]

In [14]:

# Assigning Boolean True and False to values of columns 4 to 9
cols_4_to_9 = star_wars[star_wars.columns[3:9]].applymap(lambda element: True if 'Star Wars' in str(element) else False).copy()
star_wars[star_wars.columns[3:9]] = cols_4_to_9

In [15]:

# Displaying top 5 rows of columns 4 to 9...
star_wars[star_wars.columns[3:9]].head()

Out[15]:

	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
1	True	True	True	True	True	True
2	False	False	False	False	False	False
3	True	True	True	False	False	False
4	True	True	True	True	True	True
5	True	True	True	True	True	True

In [16]:

# Changing names of columns 4 to 9 to 'Seen_1', 'Seen_2', and so on till 'Seen_6'...
bool_dict = {
    'Which of the following Star Wars films have you seen? Please select all that apply.':'Seen_1',
    'Unnamed: 4': 'Seen_2',
    'Unnamed: 5': 'Seen_3',
    'Unnamed: 6': 'Seen_4',
    'Unnamed: 7': 'Seen_5',
    'Unnamed: 8': 'Seen_6'}

star_wars = star_wars.rename(columns = bool_dict)
# Displaying the top 5 rows of columns 4 to 9 (to view/check changed column names)
star_wars.iloc[:5,3:9]

Out[16]:

	Seen_1	Seen_2	Seen_3	Seen_4	Seen_5	Seen_6
1	True	True	True	True	True	True
2	False	False	False	False	False	False
3	True	True	True	False	False	False
4	True	True	True	True	True	True
5	True	True	True	True	True	True
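
Since each of these columns holds either a film title or NaN, the same True/False conversion can be sketched with a null check rather than a substring test. The example below uses a hypothetical two-column toy frame; it assumes, as the value counts above suggest, that a cell is non-null exactly when the film was selected.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the "which films have you seen" columns (hypothetical data).
toy = pd.DataFrame({
    "Seen_1": ["Star Wars: Episode I  The Phantom Menace", np.nan,
               "Star Wars: Episode I  The Phantom Menace"],
    "Seen_2": ["Star Wars: Episode II  Attack of the Clones", np.nan, np.nan],
})

# A cell is non-null exactly when the respondent ticked that film.
seen = toy.notna()

# Row-wise count of films each toy respondent has seen.
films_seen = seen.sum(axis=1)

print(seen)
print(films_seen)
```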

Cleaning Columns No. 10 to 15¶

Columns 10 to 15 contain either NaN or a numeric, non-null value (the rank of the movie). We will leave these values as they are.
A more suitable name for these columns would be ranking_1 for the rank the surveyee gave The Phantom Menace (Episode 1) on a scale of 1 to 6 (1 being the best, 6 being the worst), ranking_2 for the rank given to Attack of the Clones (Episode 2) and so on till Episode 6 (column no. 15).
Although the values are numeric in nature, the values in columns 10 to 15 are stored as strings (object dtype). We will convert them to float so that we can do calculations on them later on.

Let's make the changes -

In [17]:

# Displaying the top 5 rows of columns 10 to 15
star_wars[star_wars.columns[9:15]].head()

Out[17]:

	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14
1	3	2	1	4	5	6
2	NaN	NaN	NaN	NaN	NaN	NaN
3	1	2	3	4	5	6
4	5	6	1	2	4	3
5	5	4	6	2	1	3

In [18]:

# Renaming the columns 10 to 15 as per the names stated below...
ranking_dict = {'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'ranking_1',
                'Unnamed: 10':'ranking_2',
                'Unnamed: 11':'ranking_3',
                'Unnamed: 12':'ranking_4',
                'Unnamed: 13':'ranking_5',
                'Unnamed: 14':'ranking_6'
               }
star_wars = star_wars.rename(columns = ranking_dict)

In [19]:

star_wars['ranking_1'].dtype

Out[19]:

dtype('O')

In [20]:

# converting dtype of columns ranking_1 to ranking_6 from 'Object'(str) to Float64
for i in range(1,7): 
    star_wars['ranking_'+str(i)] = star_wars['ranking_'+str(i)].astype(float)
star_wars['ranking_6']

Out[20]:

1       6.0
2       NaN
3       6.0
4       3.0
5       3.0
       ... 
1182    1.0
1183    1.0
1184    NaN
1185    1.0
1186    5.0
Name: ranking_6, Length: 1186, dtype: float64

In [21]:

# Displaying the Transformed columns 10 to 15...
star_wars[star_wars.columns[9:15]]

Out[21]:

	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
1	3.0	2.0	1.0	4.0	5.0	6.0
2	NaN	NaN	NaN	NaN	NaN	NaN
3	1.0	2.0	3.0	4.0	5.0	6.0
4	5.0	6.0	1.0	2.0	4.0	3.0
5	5.0	4.0	6.0	2.0	1.0	3.0
...	...	...	...	...	...	...
1182	5.0	4.0	6.0	3.0	2.0	1.0
1183	4.0	5.0	6.0	2.0	3.0	1.0
1184	NaN	NaN	NaN	NaN	NaN	NaN
1185	4.0	3.0	6.0	5.0	2.0	1.0
1186	6.0	1.0	2.0	3.0	4.0	5.0

1186 rows × 6 columns
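
The string-to-float conversion of the ranking columns could also be done in one pass with pd.to_numeric, which is a sketch of an alternative to the astype(float) loop above rather than what this notebook runs. The toy frame and its values are hypothetical; errors="coerce" is an assumption that any unparseable entry should become NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two ranking columns stored as strings (hypothetical data).
toy = pd.DataFrame({
    "ranking_1": ["3", np.nan, "1"],
    "ranking_2": ["2", np.nan, "6"],
})

ranking_cols = ["ranking_1", "ranking_2"]

# Convert every ranking column at once; missing answers stay NaN,
# so the resulting columns are float64.
numeric = toy[ranking_cols].apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric.mean())  # NaN values are skipped when averaging
```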

Cleaning Columns No. 16 to 29¶

Columns 16 to 29 contain NaN, Very favorably, Somewhat favorably, Neither favorably nor unfavorably (neutral), Somewhat unfavorably, Very unfavorably and Unfamiliar (N/A) values. These values don't seem to contain any bad data, so it's best to leave them as they are.
As these columns ask the surveyee whether they view a certain character favourably or not, it is more suitable to name the columns after the Star Wars character they represent. For instance, column 16 should be renamed to Han Solo, column 17 to Luke Skywalker and so on till column 29.

Let's make the changes -

In [22]:

star_wars[star_wars.columns[15:29]].head()

Out[22]:

	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28
1	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)
4	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
5	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably

In [23]:

# Displaying Character Names corresponding to the columns...
character_names

Out[23]:

	0
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Han Solo
Unnamed: 16	Luke Skywalker
Unnamed: 17	Princess Leia Organa
Unnamed: 18	Anakin Skywalker
Unnamed: 19	Obi Wan Kenobi
Unnamed: 20	Emperor Palpatine
Unnamed: 21	Darth Vader
Unnamed: 22	Lando Calrissian
Unnamed: 23	Boba Fett
Unnamed: 24	C-3P0
Unnamed: 25	R2 D2
Unnamed: 26	Jar Jar Binks
Unnamed: 27	Padme Amidala
Unnamed: 28	Yoda

In [24]:

# Renaming the columns 16 to 29 as per the dictionary values given below...
character_dict={'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.':'Han Solo',
               'Unnamed: 16': 'Luke Skywalker',
               'Unnamed: 17': 'Princess Leia Organa',
               'Unnamed: 18': 'Anakin Skywalker',
               'Unnamed: 19': 'Obi Wan Kenobi',
               'Unnamed: 20': 'Emperor Palpatine',
               'Unnamed: 21': 'Darth Vader',
               'Unnamed: 22': 'Lando Calrissian',
               'Unnamed: 23': 'Boba Fett',
               'Unnamed: 24': 'C-3P0',
               'Unnamed: 25': 'R2 D2',
               'Unnamed: 26': 'Jar Jar Binks',
               'Unnamed: 27': 'Padme Amidala',
               'Unnamed: 28': 'Yoda'
               }
star_wars = star_wars.rename(columns = character_dict)

In [25]:

# Displaying top 5 rows of the transformed columns 16 to 29...
star_wars[star_wars.columns[15:29]].head()

Out[25]:

	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda
1	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)
4	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
5	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably
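
The character-name mapping above was typed out by hand. Because the raw file keeps those names in its first row (the row saved earlier as character_names), the rename dictionary could instead be built from that row. The sketch below illustrates the idea on a hypothetical three-column toy frame; the column labels and values are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the raw CSV: row 0 holds the sub-headers (character names),
# and the following rows hold the answers. All names here are hypothetical.
raw = pd.DataFrame(
    [
        ["Han Solo", "Luke Skywalker", "Yoda"],
        ["Very favorably", "Somewhat favorably", "Very favorably"],
    ],
    columns=["Question text", "Unnamed: 16", "Unnamed: 17"],
)

# Map each original column label to the character name found in row 0.
character_map = raw.iloc[0].to_dict()

# Drop the sub-header row and apply the mapping.
cleaned = raw.iloc[1:].rename(columns=character_map)

print(cleaned)
```

This only works while the sub-header row is still present, so in this notebook the mapping would have to be built before the row with the null RespondentID is removed.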

Cleaning Columns No. 30 to 38¶

Column 32 has unknown characters at the end of its name (possibly due to an encoding error), which should be removed.
The values in columns 31 to 33 should be converted to Boolean True and Boolean False.
Column 30 intends to ask the surveyee "Who shot first - Han or Greedo?", so the column should be renamed to this instead.

Let's make the changes -

In [26]:

star_wars[star_wars.columns[29:]].head()

Out[26]:

	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

In [27]:

star_wars = star_wars.rename(columns = {star_wars.columns[29]:'Who shot first - Han or Greedo?',
                                       star_wars.columns[31]:'Do you consider yourself to be a fan of the Expanded Universe?'})

In [28]:

star_wars[star_wars.columns[30:33]]

Out[28]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?
1	Yes	No	No
2	NaN	NaN	Yes
3	No	NaN	No
4	No	NaN	Yes
5	Yes	No	No
...	...	...	...
1182	No	NaN	Yes
1183	No	NaN	Yes
1184	NaN	NaN	No
1185	No	NaN	Yes
1186	No	NaN	No

1186 rows × 3 columns

In [29]:

# Changing the values of columns 31 to 33 (indexes 30 to 32) to Boolean True and False...
cols_30_to_32 = star_wars[star_wars.columns[30:33]].applymap(lambda value: True if value == 'Yes' else False).copy()
star_wars[star_wars.columns[30:33]] = cols_30_to_32

In [30]:

# Displaying top 5 rows of columns 31 to 33 (indexes 30 to 32)
star_wars[star_wars.columns[30:33]].head()

Out[30]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?
1	True	False	False
2	False	False	True
3	False	False	False
4	False	False	True
5	True	False	False
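
The stray trailing characters in the column name were removed above by renaming the column by position. A more general sketch, assuming the debris consists only of non-ASCII characters, is to strip every non-ASCII character from all column names. The toy column names below are hypothetical.

```python
import pandas as pd

# Toy frame whose first column name ends in encoding debris (hypothetical names).
toy = pd.DataFrame(columns=["Are you a fan?\u00c2\u00c3\u00a6", "Gender"])

# Encode to ASCII while ignoring characters that cannot be represented,
# then decode back to str and trim any leftover whitespace.
toy.columns = [
    name.encode("ascii", errors="ignore").decode("ascii").strip()
    for name in toy.columns
]

print(list(toy.columns))
```

This would also strip legitimate accented characters, so it is only appropriate when the column names are expected to be plain ASCII.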

Analysis¶

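In order to get some useful insights from our data, we should isolate our data and consider only those surveyees who have seen all the Star Wars movies. As a proxy, the filter below keeps the surveyees who ranked all six films, i.e. whose six ranking columns are all non-null. We have a total of 834 such surveyees in our dataset (out of 1186 total responders), which is a number we can work with. Note that the proxy is imperfect: a surveyee can rank films they have not marked as seen, as row 3 in the output below shows.

So, our first step should be to create a new dataset with only those responders.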

In [31]:

seen_all_movies = star_wars[star_wars[star_wars.columns[9:15]].notnull().all(axis = 1)]

In [32]:

# Displaying top 5 rows of the new dataset of responders selected by the filter above...
seen_all_movies.head()

Out[32]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Seen_1	Seen_2	Seen_3	Seen_4	Seen_5	Seen_6	ranking_1	...	Yoda	Who shot first - Han or Greedo?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	...	Very favorably	I don't understand this question	True	False	False	Male	18-29	NaN	High school degree	South Atlantic
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	...	Unfamiliar (N/A)	I don't understand this question	False	False	False	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	...	Very favorably	I don't understand this question	False	False	True	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	...	Somewhat favorably	Greedo	True	False	False	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
6	3.292719e+09	True	True	True	True	True	True	True	True	1.0	...	Very favorably	Han	True	False	True	Male	18-29	$25,000 - $49,999	Bachelor degree	Middle Atlantic

5 rows × 38 columns
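
The filter in In [31] keeps rows whose six ranking columns are all non-null. A stricter alternative, assuming the Seen_1 to Seen_6 booleans built earlier are the ground truth for viewership, would require every Seen flag to be True. The toy sketch below (hypothetical data, only two films) contrasts the two filters; it is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy respondents: the second one ranked both films but has seen only one.
toy = pd.DataFrame({
    "Seen_1": [True, True, False],
    "Seen_2": [True, False, False],
    "ranking_1": [1.0, 2.0, np.nan],
    "ranking_2": [2.0, 1.0, np.nan],
})

seen_cols = ["Seen_1", "Seen_2"]
ranking_cols = ["ranking_1", "ranking_2"]

# Filter used in the notebook: every ranking column is non-null.
ranked_all = toy[toy[ranking_cols].notnull().all(axis=1)]

# Stricter alternative: every Seen flag is True.
watched_all = toy[toy[seen_cols].all(axis=1)]

print(len(ranked_all), len(watched_all))
```

On the toy data the two filters disagree for the second respondent, which is the same situation as the respondent at index 3 in the output above.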

How many responders have seen the Star Wars movies?¶

Q: How many have seen at least 1 Star Wars movie?¶

A total of 835 respondents have seen at least 1 Star Wars movie.

In [33]:

# Number of surveyees who have seen at least 1 Star Wars movie
star_wars[star_wars.columns[3:9]].any(axis = 1).sum()

Out[33]:

Q: How many respondents have seen all 6 Star Wars movies?¶

A total of 834 respondents ranked all 6 Star Wars movies, which this analysis uses as a proxy for having seen all of them.

In [34]:

# Number of responders whose six ranking columns are all non-null
seen_all_movies.shape[0]

Out[34]:

Q: What's the Movie-Wise division of Star Wars movies' viewership?¶

Turns out, The Empire Strikes Back is the most viewed Star Wars movie, with about 64% of responders having watched it (758 of 1186). See for yourselves -

In [35]:

episode_dict = {'Seen_1': 'Episode 1 The Phantom Menace',
               'Seen_2': 'Episode 2 Attack of the Clones',
               'Seen_3': 'Episode 3 Revenge of the Sith',
               'Seen_4': 'Episode 4 A New Hope',
               'Seen_5': 'Episode 5 The Empire Strikes Back',
               'Seen_6': 'Episode 6 Return of the Jedi'}
percent_viewers = star_wars[star_wars.columns[3:9]].mean()*100
percent_viewers = round(percent_viewers)
percent_viewers = percent_viewers.rename(episode_dict).sort_index(ascending = False)
percent_viewers.plot.barh()
plt.title('Percentage of responders who have seen Star Wars')
plt.xlabel('Percentage')
plt.show()
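The percentage calculation above relies on the fact that the mean of a boolean column is the share of `True` values. A small sketch on toy data (hypothetical values and shortened labels) shows the same steps without the plot:

```python
import pandas as pd

# Toy data: four hypothetical respondents, two films
toy = pd.DataFrame({
    "Seen_1": [True, False, True, False],
    "Seen_2": [True, True, True, False],
})
labels = {"Seen_1": "Episode 1", "Seen_2": "Episode 2"}

# mean() of booleans gives the fraction of True values per column;
# multiply by 100 for a percentage, then relabel the index
percent_seen = (toy.mean() * 100).round().rename(labels)
print(percent_seen)
```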

Q: Which is the Favourite Star Wars movie?¶

In [36]:

# Highest Rated Star Wars Movie - 
episode_dict2= {'ranking_1': 'Episode 1 The Phantom Menace',
               'ranking_2': 'Episode 2 Attack of the Clones',
               'ranking_3': 'Episode 3 Revenge of the Sith',
               'ranking_4': 'Episode 4 A New Hope',
               'ranking_5': 'Episode 5 The Empire Strikes Back',
               'ranking_6': 'Episode 6 Return of the Jedi'}
most_fav = seen_all_movies[seen_all_movies.columns[9:15]]
# Share of respondents who ranked each film first (rank 1)
most_fav = most_fav.eq(1).mean()*100
most_fav = most_fav.rename(episode_dict2).sort_index(ascending = False)
most_fav.plot.barh()
plt.title('Best Star Wars Movie')
plt.xlabel('Percentage')
plt.show()

Q: ... And Which is the Least Favourite?¶

In [37]:

# Least Favourite Star Wars Movie - 
least_fav = seen_all_movies[seen_all_movies.columns[9:15]]
# Share of respondents who ranked each film last (rank 6)
least_fav = least_fav.eq(6).mean()*100
least_fav = least_fav.rename(episode_dict2).sort_index(ascending = False)
least_fav.plot.barh()
plt.title('Least Favourite Star Wars Movie')
plt.xlabel('Percentage')
plt.show()
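Both charts come from the same idea: compare every ranking to a target rank and average the resulting booleans. Here is a minimal sketch with three hypothetical films ranked 1 to 3 by four made-up respondents (so the worst rank is 3 rather than 6):

```python
import pandas as pd

# Toy rankings: each row is one respondent's ordering of three films
toy = pd.DataFrame({
    "ranking_1": [3, 3, 2, 3],
    "ranking_2": [2, 1, 3, 2],
    "ranking_3": [1, 2, 1, 1],
})

# Percentage of respondents who put each film first
top_share = toy.eq(1).mean() * 100

# Percentage of respondents who put each film last (rank 3 in this toy)
bottom_share = toy.eq(3).mean() * 100

print(top_share)
print(bottom_share)
```

In this toy frame `ranking_3` is picked first most often and `ranking_1` is picked last most often.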

Q: Who is the Most (& Least) Favourite Star Wars Character?¶

NOTE: This analysis covers only the respondents who have seen all 6 movies, since it only makes sense to review those rows.

In [38]:

# Displaying the first few rows of the dataset we will be analysing...
seen_all_movies[seen_all_movies.columns[15:29]].head()

Out[38]:

	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda
1	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
3	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)
4	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
5	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably
6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Neither favorably nor unfavorably (neutral)	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably

In [39]:

sns.set(style="whitegrid")

char_sequence = ['Emperor Palpatine','Jar Jar Binks','Boba Fett','Lando Calrissian','Padme Amidala','Anakin Skywalker',
                 'Darth Vader','C-3P0','Princess Leia Organa','Luke Skywalker','R2 D2','Obi Wan Kenobi','Yoda','Han Solo']
favorability  = ['Very favorably', 'Somewhat favorably','Neither favorably nor unfavorably (neutral)',
                 'Somewhat unfavorably','Very unfavorably']
favorable     = ['Very Favorable', 'Somewhat Favorable','Neutral',
                 'Somewhat Unfavorable','Very Unfavorable']
colors        = ['green','blue','grey','orange','red']
per_count     = []

fig = plt.figure(figsize = (15,5))

for i in range(1,6):
    ax = fig.add_subplot(1,5,i)
    # Percentage of respondents who gave each character this rating
    char_favorability = seen_all_movies[seen_all_movies.columns[15:29]]
    char_favorability = char_favorability.eq(favorability[i-1])
    char_favorability = 100*char_favorability.mean()
    for char in char_sequence:
        per_count.append(char_favorability[char])
    for c in range(1,15):
        ax.text(per_count[c-1]+3,char_sequence[c-1],int(round(per_count[c-1])))
    ax.set_xlim(0,75)
    ax.barh(char_sequence,per_count, color=colors[i-1])
    ax.set_title(favorable[i-1])
    
    if i>1:
        ax.set_yticklabels([])
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlabel('Percentage')
        
    per_count = []
# plt.title('Star Wars Characters Favorability Ratings')
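The numbers behind these five panels can also be collected into a single table, which is handy for checking the bars. This is a sketch on toy data (two characters, three rating levels, invented answers), not the survey itself. As in the cell above, missing answers count in the denominator.

```python
import pandas as pd

# Toy data: hypothetical ratings for two characters, one missing answer
toy = pd.DataFrame({
    "Han Solo": ["Very favorably", "Very favorably", "Somewhat favorably", None],
    "Jar Jar Binks": ["Very unfavorably", "Very favorably",
                      "Very unfavorably", "Very unfavorably"],
})
levels = ["Very favorably", "Somewhat favorably", "Very unfavorably"]

# For each rating level, compute the percentage of rows equal to it,
# then transpose so rows are rating levels and columns are characters
share = pd.DataFrame({level: toy.eq(level).mean() * 100 for level in levels}).T
print(share)
```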

Q: Do Women and Men like (or dislike) the same characters?¶

In [40]:

male = seen_all_movies[seen_all_movies['Gender']=='Male']
female = seen_all_movies[seen_all_movies['Gender']=='Female']

In [41]:

sns.set(style="whitegrid")

char_sequence = ['Emperor Palpatine','Jar Jar Binks','Boba Fett','Lando Calrissian','Padme Amidala','Anakin Skywalker',
                 'Darth Vader','C-3P0','Princess Leia Organa','Luke Skywalker','R2 D2','Obi Wan Kenobi','Yoda','Han Solo']
favorability = ['Very favorably','Very unfavorably']
per_count_male = []
per_count_female = []

fig = plt.figure(figsize = (14,7))

for i in range(0,2):
    ax = fig.add_subplot(1,2,i+1)
    # Percentage of male respondents who gave each character this rating
    unfav_char_male = male[male.columns[15:29]]
    unfav_char_male = unfav_char_male.eq(favorability[i])
    unfav_char_male = 100*unfav_char_male.mean()

    # Same calculation for female respondents
    unfav_char_female = female[female.columns[15:29]]
    unfav_char_female = unfav_char_female.eq(favorability[i])
    unfav_char_female = 100*unfav_char_female.mean()

    for char in char_sequence:
        per_count_male.append(unfav_char_male[char])
        per_count_female.append(unfav_char_female[char])

    length = np.arange(len(char_sequence))
    width=0.4
    ax.barh(length+0.2, per_count_female, width, label = 'Female')
    ax.barh(length-0.2, per_count_male, width, label = 'Male')
    ax.set_yticks(length)
    ax.set_yticklabels(char_sequence)
    for key, spine in ax.spines.items():
        spine.set_visible(False)
    if i == 0: 
        plt.title('Most Favourable Star Wars Character (Audience Gender Wise)')
    else:
        plt.title('Most Unfavourable Star Wars Character (Audience Gender Wise)')
        ax.set_yticklabels([])
    per_count_male = []
    per_count_female = []
    ax.set_xlabel('Percentage')
plt.legend()
plt.show()
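As an alternative to splitting the data into `male` and `female` frames by hand, the same percentages can be sketched with a `groupby`. The data below is invented and the two character columns are just examples. The `Gender` column name matches the one used above.

```python
import pandas as pd

# Toy data: four hypothetical respondents
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Yoda": ["Very favorably", "Somewhat favorably",
             "Very favorably", "Very favorably"],
    "Darth Vader": ["Very favorably", "Very favorably",
                    "Very unfavorably", "Somewhat favorably"],
})
characters = ["Yoda", "Darth Vader"]

# Boolean frame: did the respondent rate the character "Very favorably"?
very_fav = toy[characters].eq("Very favorably")

# Group the booleans by gender; the mean is the share of True per group
by_gender = very_fav.groupby(toy["Gender"]).mean() * 100
print(by_gender)
```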

Bonus...¶

Want to know what percentage of American Star Wars viewers are actual fans of Star Wars?
Or Fans of Star Trek?
Or Fans of the Expanded Universe?
Or even know about the Star Wars Expanded Universe?

See for yourselves!

In [42]:

sns.set(style="whitegrid")
fig,ax = plt.subplots(figsize = (5,2))
q_list = ['Do you consider yourself to be a fan of the Star Wars film franchise?',
         'Are you familiar with the Expanded Universe?',
         'Do you consider yourself to be a fan of the Expanded Universe?',
         'Do you consider yourself to be a fan of the Star Trek franchise?']
a_list = []

for i in range(0,4):
    a_list.append(100*seen_all_movies[q_list[i]].mean())
length = np.arange(0,4)
width = 0.8
ax.barh(length,a_list, width)
ax.set_yticks(length)
ax.set_yticklabels(['Fans of Star Wars','Familiar with Expanded Universe','Fans of Expanded Universe',
                  'Fans of Star Trek'])
ax.set_xlim(0,100)
ax.set_xlabel('Percentage')

Out[42]:

Text(0.5, 0, 'Percentage')

Han Shot First!¶

And lastly, the graph below depicts the impact historical revisionism can have on a society. (For those who "don't understand the question", watch the 1977 original Star Wars - A New Hope and the 1997 special edition of the same movie.) If you aren't interested in watching, you can check out this brief article on Wikipedia, which sheds light on the controversy surrounding a subtle change made in the 1997 special edition of A New Hope that showed Han in a less morally ambiguous light (much to some hardcore fans' dismay).

In [43]:

fig,ax = plt.subplots()
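# Convert counts to percentages of all rows in seen_all_movies (834 respondents)
(seen_all_movies['Who shot first - Han or Greedo?'].value_counts()*100/len(seen_all_movies)).plot.barh()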
ax.set_xlabel('Percentage')

Out[43]:

Text(0.5, 0, 'Percentage')
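A note on the percentages: dividing the counts by the total number of rows keeps respondents who skipped the question in the denominator, whereas `value_counts(normalize=True)` divides by the number of people who answered. A small sketch with invented answers shows the difference:

```python
import pandas as pd

# Toy answers: five hypothetical respondents, one of whom skipped the question
answers = pd.Series(["Han", "Han", "Greedo",
                     "I don't understand this question", None])

# Percentage of ALL respondents (missing answers count in the denominator)
share_all = answers.value_counts() * 100 / len(answers)

# Percentage of respondents who answered (missing answers are excluded)
share_answered = answers.value_counts(normalize=True) * 100

print(share_all)
print(share_answered)
```

Which denominator is right depends on the question being asked, so it is worth stating the choice next to the chart.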

But of course if one is Yoda, one might have a third opinion on this.

Conclusion¶

Episode 5 - The Empire Strikes Back is by far the viewers' favourite movie in the Star Wars franchise.
Revenge of the Sith (Episode 3) is the least liked movie among the viewers.
Jar Jar Binks is more hated than the personifications of evil in the galaxy, Darth Vader and Emperor Palpatine (although women view Darth Vader a bit more unfavourably than Jar Jar).
Han Solo and Obi Wan Kenobi are most liked by the male audience, whereas Yoda and R2 D2 are most liked by the female audience.
(Spoiler Alert:) Although Anakin Skywalker becomes Darth Vader, the male audience likes Darth Vader more, whereas the female audience likes Anakin Skywalker more. Weird, but interesting.
Han Shot First - and it's good to know that most of the audience agrees!
