Using the force to analyse a Star Wars survey¶

In this project we're going to explore data from a survey conducted by FiveThirtyEight on the Star Wars films. The survey was taken in 2014, at which point only six Star Wars films had been released, the original trilogy (Episodes IV, V, and VI) and the prequel trilogy (Episodes I, II, and III).

The survey, which received 835 total responses, sought to ascertain:

how many people are Star Wars fans
which film in the series is best/worst
which characters are most loved/hated
who shot first?

Raw survey data available on Github.

In this project, we're going to clean up the raw data to make it easier to work with, and then try and perform some of our own analysis, including segmenting the survey population around certain demographics.

The anchor links below can be used to navigate to specific sections of the report:

Reading and exploring the data
Initial observations
Data cleaning
Column reference
Analysis
Conclusion

In [1]:

# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import ticker
from rich import print

# draw plots inline
%matplotlib inline

# set global default font sizes for plots
plt.rcParams['axes.labelsize'] = 13
plt.rcParams['axes.titlesize'] = 18
plt.rcParams['xtick.labelsize']= 13
plt.rcParams['ytick.labelsize']= 13

# set top-level seaborn styles
sns.set_style("dark")
sns.set_palette("deep")

# centre-align all plots
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Out[1]:

Reading and exploring the data¶

Back to top

In [2]:

star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1") # need to specify encoding as dataset has some 
                                                                # characters not present in default utf-8

star_wars.head()

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?æ	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	...	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

5 rows × 38 columns

In [3]:

star_wars.shape

Out[3]:

(1187, 38)

Initial observations¶

Back to top

the dataset contains 38 columns and 1187 rows
each row of the dataset appears to be a single response to the survey tied to a unique ID, RespondentID
- the exception to this is the first row, which does not appear to contain actual response data
several columns contain Yes/No values, we should try and convert these to Boolean values
there are a number of columns with the prefix Unnamed, these will probably need renaming to something more intuitive
- the first row looks like it might contain the values that correspond to the Unnamed columns, we'll explore this further in the cleanup stage

Since we have 38 columns in our dataset, some of which have very long names, it will be much easier to reference our columns by their index value (as opposed to referencing by column label) using the DataFrame.iloc() method. Let's create a function to print out every column label and the correpsonding index.

In [4]:

# function to print each column with its corresponding index number for easy reference
def print_cols():
    for col in star_wars.columns:
        index_no = star_wars.columns.get_loc(col)
        print("{} - {}".format(index_no, col))
        
print_cols()

0 - RespondentID

1 - Have you seen any of the 6 films in the Star Wars franchise?

2 - Do you consider yourself to be a fan of the Star Wars film franchise?

3 - Which of the following Star Wars films have you seen? Please select all that apply.

4 - Unnamed: 4

5 - Unnamed: 5

6 - Unnamed: 6

7 - Unnamed: 7

8 - Unnamed: 8

9 - Please rank the Star Wars films in order of preference with 1 being your favorite film in
the franchise and 6 being your least favorite film.

10 - Unnamed: 10

11 - Unnamed: 11

12 - Unnamed: 12

13 - Unnamed: 13

14 - Unnamed: 14

15 - Please state whether you view the following characters favorably, unfavorably, or are 
unfamiliar with him/her.

16 - Unnamed: 16

17 - Unnamed: 17

18 - Unnamed: 18

19 - Unnamed: 19

20 - Unnamed: 20

21 - Unnamed: 21

22 - Unnamed: 22

23 - Unnamed: 23

24 - Unnamed: 24

25 - Unnamed: 25

26 - Unnamed: 26

27 - Unnamed: 27

28 - Unnamed: 28

29 - Which character shot first?

30 - Are you familiar with the Expanded Universe?

31 - Do you consider yourself to be a fan of the Expanded Universe?æ

32 - Do you consider yourself to be a fan of the Star Trek franchise?

33 - Gender

34 - Age

35 - Household Income

36 - Education

37 - Location (Census Region)

Data cleaning¶

Back to top

Remove first row from dataset and set aside¶

The first row of our dataset does not contain actual respondent data, but does contain information that will be useful to us for cleaning up column names. We'll create a copy of the first row so we can refer back to it later, and then delete it from the main dataset.

In [5]:

# store the first row in a separate dataframe
first_row = star_wars.iloc[0].copy().to_frame()

# drop the first row from the dataset and reset index
star_wars = star_wars.drop(0).reset_index(drop=True)

# validate change
star_wars.head(1)

Out[5]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?æ	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic

1 rows × 38 columns

Clean missing characters¶

One column in our dataset contains characters that could not be interpreted by the ISO-8859-1 encoding. The missing characters do not appear to be important so we'll remove them now to prevent possible issues later.

In [6]:

# Rename column with missing characters
star_wars = star_wars.rename(columns={
    'Do you consider yourself to be a fan of the Expanded Universe?æ': 
    'Do you consider yourself to be a fan of the Expanded Universe?'
    })

# validate change
print(star_wars.columns[31])

Do you consider yourself to be a fan of the Expanded Universe?

Clean yes/no answer columns¶

The following columns represent Yes/No-type questions:

Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
Are you familiar with the Expanded Universe?
Do you consider yourself to be a fan of the Expanded Universe?
Do you consider yourself to be a fan of the Star Trek franchise?

They all contain the string values Yes and No (and NaN). We can make the data a bit easier to analyse later by converting each column to a Boolean with only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.

In [7]:

# create mapping dictionary
bool_map = {
    "Yes": True,
    "No": False,
}

yes_no_cols = [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?",
    "Are you familiar with the Expanded Universe?",
    "Do you consider yourself to be a fan of the Expanded Universe?",
    "Do you consider yourself to be a fan of the Star Trek franchise?"
]

# map new values to columns
for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(bool_map)

# validate change
star_wars[yes_no_cols].head()

Out[7]:

	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?
0	True	True	True	False	False
1	False	NaN	NaN	NaN	True
2	True	False	False	NaN	False
3	True	True	False	NaN	True
4	True	True	True	False	False

Clean question on seen Star Wars films¶

These six columns in our star_wars dataset represent a single checkbox question in the survey asking which of the six Star Wars films the respondent has seen.

In [8]:

# show next six columns
star_wars.iloc[:5,3:9]

Out[8]:

	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
0	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
1	NaN	NaN	NaN	NaN	NaN	NaN
2	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN
3	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
4	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi

Which of the following Star Wars films have you seen? Please select all that apply. — whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
Unnamed: 4 — whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
Unnamed: 5 — whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
Unnamed: 6 — whether or not the respondent saw Star Wars: Episode IV A New Hope.
Unnamed: 7 — whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
Unnamed: 8 — whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each of these columns, it appears that if the value of the cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. For cleaning purposes, we'll assume NaN means they didn't see the movie.

As before, let's convert each of these columns to contain Boolean values.

In [9]:

# create mapping dictionary
film_map = {
    np.nan: False,
    "Star Wars: Episode I  The Phantom Menace": True,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}

# remap values in columns to bools and rename columns titles to be more intuitive
for count, col in enumerate(star_wars[star_wars.columns[3:9]]):
    star_wars[col] = star_wars[col].map(film_map)
    star_wars.rename(columns={col: "seen_" + str(count + 1)}, inplace=True)

# check output
star_wars.iloc[:5,3:9]

Out[9]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6
0	True	True	True	True	True	True
1	False	False	False	False	False	False
2	True	True	True	False	False	False
3	True	True	True	True	True	True
4	True	True	True	True	True	True

Clean film ranking columns¶

The next six columns in the dataset relate to a question asking the respondent to rank the Star Wars films in order from least, 6, to most favourite 1. Possible values are therefore: 1, 2, 3, 4, 5, 6, and NaN.

In [10]:

star_wars.iloc[:5,9:15]

Out[10]:

	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14
0	3	2	1	4	5	6
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1	2	3	4	5	6
3	5	6	1	2	4	3
4	5	4	6	2	1	3

Here's what each column references:

Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
Unnamed: 10 — How much the respondent liked Star Wars: Episode II Attack of the Clones
Unnamed: 11 — How much the respondent liked Star Wars: Episode III Revenge of the Sith
Unnamed: 12 — How much the respondent liked Star Wars: Episode IV A New Hope
Unnamed: 13 — How much the respondent liked Star Wars: Episode V The Empire Strikes Back
Unnamed: 14 — How much the respondent liked Star Wars: Episode VI Return of the Jedi

As before, we will need to rename these columns so that they are more intuitive. We will also need to convert each column to a numeric datatype.

In [11]:

# convert column values to float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

# rename columns
for count, col in enumerate(star_wars[star_wars.columns[9:15]]):
    star_wars.rename(columns={col: "ranking_" + str(count + 1)}, inplace=True)

# check output
star_wars.iloc[:5,9:15]

Out[11]:

	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0
3	5.0	6.0	1.0	2.0	4.0	3.0
4	5.0	4.0	6.0	2.0	1.0	3.0

Clean character favourability columns¶

Columns 15 through 28 relate to a question asking how the respondents rated various characters in the Star Wars films.

In [12]:

star_wars.iloc[:5,15:29]

Out[12]:

	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28
0	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)
3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably
4	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably

If we slice the first_row we created earlier by the same index, we can see which character maps to which column.

In [13]:

first_row[15:29]

Out[13]:

	0
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Han Solo
Unnamed: 16	Luke Skywalker
Unnamed: 17	Princess Leia Organa
Unnamed: 18	Anakin Skywalker
Unnamed: 19	Obi Wan Kenobi
Unnamed: 20	Emperor Palpatine
Unnamed: 21	Darth Vader
Unnamed: 22	Lando Calrissian
Unnamed: 23	Boba Fett
Unnamed: 24	C-3P0
Unnamed: 25	R2 D2
Unnamed: 26	Jar Jar Binks
Unnamed: 27	Padme Amidala
Unnamed: 28	Yoda

Let's update columns 15:29 in our star_wars dataframe with the corresponding character name in first_row.

In [14]:

# convert slice of first_row to a dictionary
character_map = first_row[15:29].to_dict()

# this gives us a nested dictionary -  we will need to remove the outer key, '0', 
# for df.rename() to work, while keeping the rest of the dictionary intact
print("[bold]Nested dictionary[/bold] \n")
print(character_map)

# use dictionary comprehension to keep only the inner key-value pairs
character_map = dict(ele for sub in character_map.values() for ele in sub.items())

print("\n[bold]Top level removed[/bold]\n")
print(character_map)

# rename unamed columns with character names using dictionary
star_wars = star_wars.rename(columns=character_map)

print("\n[bold]Updated column names[/bold]\n")
print(star_wars.columns[15:29])

Nested dictionary

{
    0: {
        'Please state whether you view the following characters favorably, unfavorably, or 
are unfamiliar with him/her.': 'Han Solo',
        'Unnamed: 16': 'Luke Skywalker',
        'Unnamed: 17': 'Princess Leia Organa',
        'Unnamed: 18': 'Anakin Skywalker',
        'Unnamed: 19': 'Obi Wan Kenobi',
        'Unnamed: 20': 'Emperor Palpatine',
        'Unnamed: 21': 'Darth Vader',
        'Unnamed: 22': 'Lando Calrissian',
        'Unnamed: 23': 'Boba Fett',
        'Unnamed: 24': 'C-3P0',
        'Unnamed: 25': 'R2 D2',
        'Unnamed: 26': 'Jar Jar Binks',
        'Unnamed: 27': 'Padme Amidala',
        'Unnamed: 28': 'Yoda'
    }
}

Top level removed

{
    'Please state whether you view the following characters favorably, unfavorably, or are 
unfamiliar with him/her.': 'Han Solo',
    'Unnamed: 16': 'Luke Skywalker',
    'Unnamed: 17': 'Princess Leia Organa',
    'Unnamed: 18': 'Anakin Skywalker',
    'Unnamed: 19': 'Obi Wan Kenobi',
    'Unnamed: 20': 'Emperor Palpatine',
    'Unnamed: 21': 'Darth Vader',
    'Unnamed: 22': 'Lando Calrissian',
    'Unnamed: 23': 'Boba Fett',
    'Unnamed: 24': 'C-3P0',
    'Unnamed: 25': 'R2 D2',
    'Unnamed: 26': 'Jar Jar Binks',
    'Unnamed: 27': 'Padme Amidala',
    'Unnamed: 28': 'Yoda'
}

Updated column names

Index(['Han Solo', 'Luke Skywalker', 'Princess Leia Organa',
       'Anakin Skywalker', 'Obi Wan Kenobi', 'Emperor Palpatine',
       'Darth Vader', 'Lando Calrissian', 'Boba Fett', 'C-3P0', 'R2 D2',
       'Jar Jar Binks', 'Padme Amidala', 'Yoda'],
      dtype='object')

Now let's take a look at what unique values appear across the character columns.

In [15]:

# identify all unique values in the character columns
char_vals = (star_wars.iloc[:, 15:29].values  # df.values() converts df to np array 
                                     .ravel() # np.ravel flattens the array
            )                   
unique_vals =  pd.unique(char_vals) # can then supply the np.array to pd.unique()
print(unique_vals)

['Very favorably' 'Unfamiliar (N/A)' nan 'Somewhat favorably'
 'Somewhat unfavorably' 'Very unfavorably'
 'Neither favorably nor unfavorably (neutral)']

Unlike the film rankings, the rating scale for characters is not numerical. If we wanted to do any kind of aggregation beyond counting, we'll need to convert the character rating scale to a numerical one.

Let's apply a numerical rating to the character columns. We can adopt the same order as the film rankings, where a lower number is better/more favourable and a higher number is worse/less favourable. We'll leave the value Unfamiliar (N/A) in place, but we'll trim it's name to just Unfamiliar.

In [16]:

# dictionary to remap character favourability to a number
char_dict = {
    'Very favorably': 1,
    'Somewhat favorably': 2,
    'Neither favorably nor unfavorably (neutral)': 3,
    'Somewhat unfavorably': 4,
    'Very unfavorably': 5,
    'Unfamiliar (N/A)': 'Unfamiliar'
}

# update character values
star_wars.iloc[:, 15:29] = star_wars.iloc[:, 15:29].replace(char_dict)

# spot check a column to validate changes
print("Random sample to validate changes")
display(star_wars.iloc[:,15:29].sample(n=5))

Random sample to validate changes

	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda
896	1	1	1	4	1	1	1	1	1	1	1	5	3	1
1002	1	1	1	2	1	Unfamiliar	5	4	Unfamiliar	1	1	Unfamiliar	1	1
91	1	2	2	5	1	Unfamiliar	3	Unfamiliar	3	1	2	5	Unfamiliar	1
176	1	2	2	3	2	2	2	3	3	2	2	5	2	2
623	1	1	1	1	2	2	3	1	3	1	1	2	2	NaN

That should be it for data cleaning. Let's review our updated columns.

Column reference¶

Back to top

In [17]:

print_cols()

0 - RespondentID

1 - Have you seen any of the 6 films in the Star Wars franchise?

2 - Do you consider yourself to be a fan of the Star Wars film franchise?

3 - seen_1

4 - seen_2

5 - seen_3

6 - seen_4

7 - seen_5

8 - seen_6

9 - ranking_1

10 - ranking_2

11 - ranking_3

12 - ranking_4

13 - ranking_5

14 - ranking_6

15 - Han Solo

16 - Luke Skywalker

17 - Princess Leia Organa

18 - Anakin Skywalker

19 - Obi Wan Kenobi

20 - Emperor Palpatine

21 - Darth Vader

22 - Lando Calrissian

23 - Boba Fett

24 - C-3P0

25 - R2 D2

26 - Jar Jar Binks

27 - Padme Amidala

28 - Yoda

29 - Which character shot first?

30 - Are you familiar with the Expanded Universe?

31 - Do you consider yourself to be a fan of the Expanded Universe?

32 - Do you consider yourself to be a fan of the Star Trek franchise?

33 - Gender

34 - Age

35 - Household Income

36 - Education

37 - Location (Census Region)

Analysis¶

Back to top

Computing the mean rank¶

Now that we've cleaned up the ranking columns it will be easier to find the highest-ranked film. To do this, we will need to take the mean of each ranking column.

In [18]:

# calculate and print mean of each rank column
mean_ranking = star_wars.iloc[:,9:15].mean()
print("Mean film rank values")
print("="*25)
print(mean_ranking)

# plot bar chart of mean rankings
fig, ax = plt.subplots(figsize=(12,6))
ax = sns.barplot(x=mean_ranking, 
                 y=mean_ranking.index, 
                 color="#1f77b4")
ax.set_xlabel('Mean rank')
ax.set_title("Mean rank of Star Wars films")

# create function to relabel ticks with the epiosde number to make plots more user-friendly
def relabel(axis):
    positions = (0, 1, 2, 3, 4, 5)
    labels = ("Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI")
    for ax in fig.get_axes():
        if axis == "y":
            ax.set_yticks(positions)
            ax.set_yticklabels(labels)
        elif axis == "x":
            ax.set_xticks(positions)
            ax.set_xticklabels(labels)
        else:
            print("Invalid axis, enter either 'x' or 'y'")
            
relabel("y") # relabel y-ticks with episode numbers
    
plt.show()

Mean film rank values

=========================

ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64

Recall that the ranking is ordered from 1 - most favourite to 6 - least favourite. Therefore the lower the mean ranking the better. The bar chart reveals that respondents rated films in the original triology higher than films in the prequel trilogy, with Star Wars: Episode V The Empire Strikes Back (ranking_5) being, on average, the overall favourite film. Star Wars: Episode III Revenge of the Sith (ranking_3) was the least favourite, which in the author's opinion is quite frankly absurd, but nevermind.

How many people have seen each film¶

Earlier we cleaned up the 'seen' columns and converted their values to the Boolean type. Pandas methods that aggregate data treat Booleans like integers, i.e. True = 1 and False = 0. This means we can figure out how many people have seen each film just by taking the sum of the column.

In [19]:

# sum each `seen` column to get count of people who have seen each film
sum_seen = star_wars.iloc[:,3:9].sum()
print(sum_seen)

# plot data on bar chart
fig, ax = plt.subplots(figsize=(12,6))
ax = sns.barplot(x=sum_seen, 
                 y=sum_seen.index, 
                 color="#1f77b4")
ax.set_xlabel('Number seen')
ax.set_title("Number of respondents who have seen each Star Wars film")

relabel("y") # relabel y-ticks with episode numbers

plt.show()

seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64

We can see that most survey respondents (64%) have seen Star Wars: Episode V The Empire Strikes Back, while Star Wars: Episode III Revenge of the Sith has been seen by the fewest respondents (46%). Could this be related to how people rank each film? Respondents would most likely rank a film they have seen higher than one they haven't. Perhaps it would be better to remove respondents that haven't seen all the films - we'll take a look at this in the next section.

Segmenting the audience¶

We now know how the survey population has ranked each film, but we can take a more granular look at the data by examining how certain segments of the survey population responded.

Respondents who have seen all the films¶

To see if there is any bias in the answers to film rankings by respondents who have not seen all the films, we'll segment a subset of the population who have seen all the films, and then see if there is any difference in how the films are ranked between this cohort and the general population.

In [20]:

## identify respondents who have seen all the films ##

# use df.all() to create a new column indicating whether a respondent has seen all films
star_wars["seen_all"] = star_wars.iloc[:, 3:9].all(axis='columns')

# confirm output with random sample
star_wars[["seen_1", "seen_2", "seen_3", "seen_4", "seen_5", "seen_6", "seen_all"]].sample(10, random_state=5)

Out[20]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	seen_all
1153	False	False	False	False	False	False	False
448	True	True	True	True	True	True	True
664	False	False	False	True	False	False	False
1135	False	False	False	False	False	False	False
11	False	False	False	False	False	False	False
149	False	False	False	False	True	False	False
50	True	False	False	True	True	True	False
1138	True	False	True	False	True	True	False
72	True	True	True	True	True	True	True
843	True	True	True	True	True	True	True

Looks like that worked, now let's create our subset and plot the mean rank for each film again. Too make it easier to compare the subset of respondents who have seen all the films with the entire survey population, we'll plot the mean ranking of both groups on the same chart.

In [21]:

# create subset dataframe of respondents who have seen all the films
seen_all = star_wars[star_wars["seen_all"] == True]

# calc percent of survey pop that has seen Star Wars films
percent_seen_all = seen_all.shape[0] / star_wars.shape[0]
print("{:.2f} percent of respondents have seen all Star Wars films".format(percent_seen_all * 100))

# determine mean film rankings of the 'seen all' cohort
mean_ranking_seen_all = seen_all.iloc[:,9:15].mean()

# combine film rankings for 'seen all' cohort with total survey pop into one dataframe
combined_ranks = pd.concat([mean_ranking, mean_ranking_seen_all], axis=1).rename(
                                                                                 columns={0:"total_pop", 
                                                                                          1:"seen_all"}
                                                                                )

print("[bold]Mean film ranks by 'seen_all' cohort and the total survey population[/bold]")
print(combined_ranks)


# plot bar chart
fig, ax = plt.subplots(figsize=(12,6))
combined_ranks.plot.barh(ax=ax)
ax.invert_yaxis() # invert y-axis to keep consistent with seaborn barplots
ax.set_title("""Comparing mean rank of Star Wars films between subset of respondents 
             who have seen all films and the total survey population""")

relabel("y") # relabel y-ticks with episode numbers

plt.show()

39.71 percent of respondents have seen all Star Wars films

Mean film ranks by 'seen_all' cohort and the total survey population

           total_pop  seen_all
ranking_1   3.732934  4.237792
ranking_2   4.087321  4.326964
ranking_3   4.341317  4.253191
ranking_4   3.272727  2.866242
ranking_5   2.513158  2.380042
ranking_6   3.047847  2.932059

Out of all 1187 respondents, only 471 (40%) have seen all the Star Wars films.

The general pattern in film rankings is very similar, however the seen_all cohort tends to rank the original triology higher, and the prequel triology lower. Also, the seen_all cohort ranks Episode II the lowest of all episodes, compared with the total population who rate Episode III the lowest. As we suspected earlier, there may well be a bias affect in film rankings caused by respondents who have not seen all the films, specifically for Episode III which has been watched by the fewest respondents. However, the affect is probably not signfiicant enough to warrant removing any respondent who has not seen all the films from further analysis.

Gender and Star Wars¶

Let's check to see if the survey recieved a fairly similar number of male and female respondents. If the survey is heavily weighted towards one gender, it may affect our analysis.

In [22]:

# group dataframe by Gender column
gen = star_wars.groupby(['Gender'])

# plot count of each respondent by gender
fig, ax = plt.subplots(figsize=(6,8))
ax = sns.barplot(x=gen["RespondentID"].count().index,
                 y=gen["RespondentID"].count(),
                 color="#1f77b4")
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_title("Number of respondents by gender")

plt.show()

print(gen["RespondentID"].count())

Gender
Female    549
Male      497
Name: RespondentID, dtype: int64

There are slightly more Female respondents than Male, but they both roughly account for 50% of the total survey population.

Let's take a look at what percentage of males and females identified themselves as a fan of the Star Wars films.

In [23]:

## determine if there is a greater % of male star wars fans (of all male respondents) 
## or female fans (of all female respondents)

# select the Star Wars fan col in the gender-grouped dataset
gen_fans = gen["Do you consider yourself to be a fan of the Star Wars film franchise?"]

# calcualte percent of gender that are fans
gen_fans_p = gen_fans.agg(lambda x: round(x.sum() / x.count() * 100, 2))

# plot percent of fans by gender
fig, ax = plt.subplots(figsize=(6,8))
ax = sns.barplot(x=gen_fans_p.index,
                 y=gen_fans_p,
                 color="#1f77b4")
ax.set_ylabel('Percentage')
ax.set_xlabel('')
ax.set_title("Percent of gender that identify as a fan")

plt.show()

print(gen_fans_p)

Gender
Female    59.95
Male      71.63
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: float64

Approximately 72% of Male respondents identified as a fan of the Star Wars films, compared with 60% of Female respondents. The Star Wars films appear to be more popular with a Male audience.

Let's see if there is any difference between male and female respondents in the Star Wars films they have seen and how they ranked them.

In [24]:

# calculate the percent of each gender that have seen each film
gen_seen = gen[["seen_1", "seen_2", "seen_3", "seen_4", "seen_5", "seen_6"]]
gen_seen_p = gen_seen.agg(lambda x: round(x.sum() / x.count() * 100, 2))
gen_seen_t = gen_seen_p.T # transpose index and columns

# get mean rank of each film by gender
gen_rank = gen[["ranking_1", "ranking_2", "ranking_3", "ranking_4", "ranking_5", "ranking_6"]].mean()
gen_rank_t = gen_rank.T # tranpose index and columns

fig, axs = plt.subplots(nrows = 2, ncols = 1, figsize = (10, 15))

# plot percentage of gender that have watched each film
ax1 = gen_seen_t.plot.barh(title="Percentage of respondents who have seen each film, split by gender", ax = axs[0])
ax1.set_xlabel("Percent")
ax1.invert_yaxis() # invert y-axis to keep consistent with seaborn barplots

# plot rank of films by gender
ax2 = gen_rank_t.plot.barh(title="Mean rank of each film, split by gender", ax = axs[1])
ax2.set_xlabel("Mean rank")
ax2.invert_yaxis() # invert y-axis to keep consistent with seaborn barplots

# relabel y-ticks with episode numbers
relabel("y") 

plt.subplots_adjust(hspace=.3) # increase height space between plots for presentation

plt.show()

The top graph shows us that for all films there is a higher percentage of males that have seen the film compared with the percentage of females. The largest disparity is in Episode III, only 40% of all female respondents have seen the film, compared with 64% of all male respondents. The general pattern in films seen by both genders reflects what we saw earlier for the total survey population.

Looking at the second graph of mean film rankings by gender, we can see that Males seem to view the first couple of films in the prequel trilogy much less favourably than Females. For all other films in the series, Females tend to have a more negative outlook than Males.

Who shot first?¶

While there is a question in the survey that directly asks whether a respondent is a fan of the Star Wars series, perhaps a more telling indication of Star Wars fandom is a respondent's answer to the question, 'Which character shot first?'

For context, a controversial change was made to a scene in the re-release of Episode IV. The scene depicts the character Han Solo in a confrontation with another character, Greedo. In the original version of the scene, Han Solo shoots Greedo dead in cold blood. However, in the re-release of Episode IV, the scene was modified so that Greedo shoots (and misses) Han first before Han retaliates, portraying Han as acting in self-defence rather than being the aggressor. Many fans reacted negatively to this change, arguing that it altered Han Solo's initial moral ambiguity and weakened his story arc, making his later transition from anti-hero to hero less meaningful.

Ultimately, awareness of this change represents somewhat esoteric knowledge of the Star Wars films, which could imply a strong level of fandom.

In [25]:

# get percentage of respondents for each answer
wsf_per = star_wars["Which character shot first?"].value_counts(normalize=True)

# plot pie chart of mean rankings
ax = wsf_per.plot.pie(figsize=(12,8), autopct='%1.0f%%', textprops={'fontsize': 13}, colors = sns.color_palette())
ax.set_ylabel("")
ax.set_title("Who shot first?")
ax.axis('equal') # set equal aspect ratio so pie is a true circle

plt.show()

We can see that the majority of respondents think that either Han shot first (39%) or do not understand the question (37%). A smaller number of respondents think Greedo shot first (24%).

We might expect the respondent's age to be a factor in their answer here. Older respondents are more likely to have seen the original 1977 release of Episode IV where Han shoots first, while younger respondents are perhaps more likely to of seen the 1997 re-release where Greedo is the instigator. Let's take a look at this question segmented by the respondent's age.

In [26]:

# prepare data for plotting - we'll get the percentage of each answer from each age group
wsf_age = star_wars[["Which character shot first?", "Age"]]
wsf_age_grp = wsf_age.groupby(["Age"])
wsf_age_per = (wsf_age_grp["Which character shot first?"].value_counts(normalize=True)
                                                         .rename("Percent")
                                                         .reset_index()
              )

# plot 100 percent stacked bar chart
fig, ax = plt.subplots(figsize=(12,6))

# seaborn has no native support for stacked bar charts
# but we can repurpose the histogram plot to get a 100 percent stacked barchart
ax = sns.histplot(wsf_age_per, x='Age', 
                  hue="Which character shot first?", 
                  weights='Percent',
                  multiple='stack', 
                  shrink=0.8)
ax.set_ylabel("Percentage")
ax.set_title("'Who shot first?', split by age group")

# relocate legend so it's not overlapping bars
legend = ax.get_legend()
legend.set_bbox_to_anchor((0.65, -0.15))

plt.show()

Turns out, contrary to our expectations, that the majority of respondents in the youngest age group (18-29) actually answered Han when asked Which character shot first?. More so than any other age group. Perhaps the 18-29 age group has the most die-hard Star Wars fans.

A similar percentage of respondents answered Greedo in each age group, though this percentage was a little higher in the the 30-44 category. Also of note, people aged over 60 were the least likely to understand what this question meant, even though they would probably have been the target audience at the time of release of Episode IV. However, it's important to remember that this detail is so minor it's easy to forget.

Are fans of Star Wars more likely to be fans of Star Trek too?¶

People have always drawn comparisons between Star Wars and Star Trek, enough so in fact to warrant a Wikipedia article. Both sci-fi franchises have strong fan bases, but just how much fan cross-over is there? Let's find out if fans of Star Wars are more or less likely to be fans of Star Trek.

First we'll identify what percentage of our total survey population are fans of Star Trek.

In [27]:

# identify percentage of Star Trek fans in total survey pop
st_fans = (star_wars.iloc[:,32].value_counts(normalize=True)
                               .rename(index={True: 'Fan', False: "Not fan"}) # rename labels for presentation
                               .sort_index() # sort by index label rather than value so colours are kept consistent
                                             # with the next pie chart
          )

# plot pie chart
ax = st_fans.plot.pie(figsize=(12,8), 
                      autopct='%1.0f%%', 
                      textprops={'fontsize': 13}, 
                      startangle=90)
ax.set_ylabel("")
ax.set_title("Percentage of Star Trek fans/not-fans across all survey respondents")
ax.axis('equal') # set equal aspect ratio so pie is a true circle

plt.show()

The majority of survey respondents are not fans of Star Trek with a roughly 60:40 split of non-fans to fans. Now let's see how this compares with respondents who are fans of Star Wars.

In [28]:

# get subset of Star Wars fans
sw_fans = star_wars[star_wars.iloc[:, 2] == True]

# get percentage of Star Wars fans that are fans/not fans of Star Trek
sw_st_fans = (sw_fans.iloc[:,32].value_counts(normalize=True)
                                .rename(index={True: 'Fan', False: "Not fan"}) # rename labels for presentation
                                .sort_index() # sort by index label rather than value so colours are kept consistent
                                              # with preceding pie chart
             )

# plot pie chart
ax = sw_st_fans.plot.pie(figsize=(12,8), 
                         autopct='%1.0f%%', 
                         textprops={'fontsize': 13},
                         startangle=90,
                         )
ax.set_ylabel("")
ax.set_title("Percentage of Star Wars fans that are also Star Trek fans")
ax.axis('equal') # set equal aspect ratio so pie is a true circle

plt.show()

In a complete flip from the total survey population, we can see that a clear majority of respondents who are Star Wars fans are also fans of Star Trek too.

Character analysis¶

The Star Wars films have a number of iconic and memorable characters, but there are also some that may be less familiar to a general audience. Let's see if we can identify the most and least recognisable characters.

In [29]:

# get count of each response in every character column
char_count = star_wars.iloc[:, 15:29].apply(pd.value_counts)

print("[bold]Count of each rating for every character[/bold]")
display(char_count)

# separate counts into two series - familiar and unfamiliar
unfamiliar = char_count.loc["Unfamiliar"].sort_values(ascending=False) # isolate 'Unfamiliar' row from char_count
familiar = (char_count.loc[char_count.index != "Unfamiliar"]
                                                            .sum() # aggregate all other rows in char_count
                                                            .rename("Familiar") # give a name to new series
           )

# combine both series objects into one df
familiarity = pd.concat([unfamiliar, familiar], axis=1)

# calculate percentage of respondents who are unfamiliar
familiarity["% Unfamiliar"] = round((familiarity["Unfamiliar"] / familiarity.sum(axis=1)) * 100, 2)

# show resulting df
print("[bold]Count of each rating for every character[/bold]")
display(familiarity)

Count of each rating for every character

	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda
1	610	552	547	245	591	110	310	142	138	474	562	112	168	605
2	151	219	210	269	159	143	171	223	153	229	185	130	183	144
3	44	38	48	135	43	213	84	236	248	79	57	164	207	51
Unfamiliar	15	6	8	52	17	156	10	148	132	15	10	109	164	10
4	8	13	12	83	8	68	102	63	96	23	10	102	58	8
5	1	3	6	39	7	124	149	8	45	7	6	204	34	8

Count of each rating for every character

	Unfamiliar	Familiar	% Unfamiliar
Padme Amidala	164	650	20.15
Emperor Palpatine	156	658	19.16
Lando Calrissian	148	672	18.05
Boba Fett	132	680	16.26
Jar Jar Binks	109	712	13.28
Anakin Skywalker	52	771	6.32
Obi Wan Kenobi	17	808	2.06
Han Solo	15	814	1.81
C-3P0	15	812	1.81
Darth Vader	10	816	1.21
R2 D2	10	820	1.20
Yoda	10	816	1.21
Princess Leia Organa	8	823	0.96
Luke Skywalker	6	825	0.72

In [30]:

# plot mean ratings
fig, ax = plt.subplots(figsize=(8,10))

sort_fam = familiarity.sort_values("% Unfamiliar", ascending=True)

ax = sns.stripplot(data=sort_fam, 
                   x="% Unfamiliar", y=sort_fam.index,
                   size=13, palette="flare_r")

ax.yaxis.grid(True) # add y-axis grid lines
ax.set(xlim=(0, 25)) # set x-axis to span the entire range of ranks

ax.set_xlabel("% of respondents unfamiliar with character")
ax.set_title("Unfamiliarity with Star Wars characters", fontsize=20)

plt.show()

We can see that among our survey population, Padme Amidala was the most unfamiliar character in the Star Wars films with 20% of respondents not recognising her. A possible reason could be that she only appears in the prequel films (recall from earlier that fewer respondents have seen episodes I-III) but perhaps people may just be less familiar with the name Padme and more familiar with her title as Queen Amidala. Name familiarity may also explain why Emperor Palpatine is the second most unfamiliar character, not being recognised by 19% of respondents, despite playing a prominent role in both trilogies. In the original trilogy, he is known as 'Darth Sidious' or 'The Emperor', his surname 'Palpatine' is not revealed until later films.

Conversely, Luke Skywalker was the most familiar character, with less than 1% of respondents claiming to not be familiar with the character. Not surprising really given he is the main protagonist in the original films.

Character rating¶

Now let's try and identify which character was rated the highest in terms of favourability. To do this, we will need to calculate the the average favourability rating for each character.

In [31]:

# get mean rating for each character
char_rating = star_wars.iloc[:, 15:29]
char_rating.replace({'Unfamiliar':np.nan}, inplace=True)
char_mean_rating = (char_rating.agg(np.mean).to_frame()
                                            .reset_index()
                                            .rename(columns={"index": "Character", 0: "Rating"})
                   )

# plot mean ratings
fig, ax = plt.subplots(figsize=(8,10))

ax = sns.stripplot(data=char_mean_rating.sort_values("Rating", ascending=True), 
                   x="Rating", y="Character",
                   size=13, palette="flare_r")

ax.yaxis.grid(True) # add y-axis grid lines
ax.set(xlim=(1, 5)) # set x-axis to span the entire range of ranks

ax.set_xlabel("Mean favourability score", fontsize=15)
ax.set_ylabel("")
ax.set_title("Average favourability score of various Star Wars characters", fontsize=20)

plt.show()

Our dot plot reveals that, on average, Han Solo is most favourable character in the Star Wars films, followed closely by Jedi masters Obi Wan Kenobi and Yoda.

Unsurprisingly, the controversial character Jar Jar Binks has the lowest average favourabiity score.

It's interesting that all of the characters in the top half of the favourability scores are generally portrayed as 'good', whereas 57% of the characters in the bottom half are 'villains'. Perhaps people have a bias towards morally good characters.

Character rating by demographic¶

Lets have a deeper look into Han Solo and Jar Jar Binks and see if there is any variation in favourability ratings by age group. While we would expect Han Solo to be favoured by all, perhaps the older generations will be even more fond of the character since they were around at the time of release for the original triology. Additionally, we could expect Jar Jar Binks to be favoured more by the younger respondents, who would of been children at the time of release of the prequel trilogy and would perhaps be more receptive to the character's goofy antics than the older generations.

To make the data easier to digest, we'll convert the five numerical character favourability ratings into three text categories, Positive, Neutral, and Negative.

In [32]:

# create subset of character favourability ratings along with age and gender columns
fav_char = star_wars[["Han Solo", "Jar Jar Binks", "Age", "Gender"]].copy()

# convert 5-point numerical rating into 3-point rating with text categories
replace_vals = fav_char[["Han Solo", "Jar Jar Binks"]].replace(
                                                              {'Unfamiliar':np.nan, 
                                                              1.0:"Positive", 
                                                              2.0:"Positive", 
                                                              3.0:"Neutral", 
                                                              4.0:"Negative", 
                                                              5.0:"Negative"}
                                                              )

# since our rating system is now string-based, we'll make use of Panda's categorical dtype
# so that we can create a logical order for our values and improve the presentation of the plots
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=["Positive", "Neutral", "Negative"], ordered=True)
apply_cats = replace_vals.astype(cat_type)

# update rating columns with new categorical dtypes
fav_char[["Han Solo", "Jar Jar Binks"]] = apply_cats

# group the character ratings by age
age_fav_char = fav_char.groupby(["Age"])

# get percentage of each rating for each character, it's necessary to use unstack() here to
# pivot the rating values into columns so our bar plots look nice
han_fav = age_fav_char["Han Solo"].value_counts(normalize=True).unstack()
jar_fav = age_fav_char["Jar Jar Binks"].value_counts(normalize=True).unstack()

# function to setup bar plot with desired styling
def plot_fav_bar(df, title):
    ax = df.plot.bar(figsize=(15,7), 
                     color={"Positive": "#5f9e6e", "Neutral": "Grey", "Negative": "#b55d60"},
                     rot=0,
                     ylim=(0,1),
                     title=title,
                     ylabel="Percentage",
                     xlabel="")
    plt.show()
    
# create bar plots
plot_fav_bar(han_fav, "Favourability of Han Solo by age group")    
plot_fav_bar(jar_fav, "Favourability of Jar Jar Binks by age group")

The plots show us that Han Solo is universally favoured across all age groups, with very few individuals viewing the character in a neutral or negative light. Jar Jar Binks is much more polarised, it is clear that, and counter to our expectations, the younger generations are much more hostile towards the character, while the older generations are more tolerant. In fact, the majority of respondents aged >60 actually view Jar Jar Binks positively!

It's probably safe to assume that Han Solo is universally loved by all, no matter the demographic. However, Jar Jar Binks is clearly polarising among respondents, so let's see if there's any difference in favourability by Gender for the character.

In [33]:

# group the character ratings by age
fav_gen = fav_char.groupby(["Gender"])

# get percentage of each rating for each character, it's necessary to use unstack() here to
# pivot the rating values into columns so our bar plots look nice
jar_fav = fav_gen["Jar Jar Binks"].value_counts(normalize=True).unstack()

# create bar plots
plot_fav_bar(jar_fav, "Favourability of Jar Jar Binks by gender")

Male respondents maintain the general dislike of Jar Jar Binks, however, the majority of female respondents (40%) actually enjoy the character, compared with 35% who didn't like the character, and 25% who felt neutral.

Which US subregion has the highest ratio of Star Wars fans to non-fans?¶

The survey data includes the location of each respondent in the United States by subregion. Let's take a look at each subregion and identify how many respondents are from each area.

In [34]:

# get respondent count of each subregion
star_wars.iloc[:, 37].value_counts()

Out[34]:

East North Central    181
Pacific               175
South Atlantic        170
Middle Atlantic       122
West South Central    110
West North Central     93
Mountain               79
New England            75
East South Central     38
Name: Location (Census Region), dtype: int64

Most respondents are from states in the East North Central subregion, interestingly, East South Central had the fewest number of respondents, though this is likely related to general population figures.

Plotting a geographic map with geopandas¶

We can use the geopandas library to read in a shapefile of the United States. The shapefile contains geometric boundary data for each state which can be used to plot a map of the USA.

In [35]:

# import necessary libraries
import geopandas as gpd
from shapely.geometry import Point, Polygon

# read in shapefile of USA to geopandas
usa = gpd.read_file('./maps/states_21basic/states.shp')

usa.head()

Out[35]:

	drawseq	state_abbr	state_fips	state_name	sub_region	geometry
0	NaN	HI	15	Hawaii	Pacific	MULTIPOLYGON (((-160.07380 22.00418, -160.0497...
1	NaN	WA	53	Washington	Pacific	MULTIPOLYGON (((-122.40202 48.22522, -122.4628...
2	NaN	MT	30	Montana	Mountain	POLYGON ((-111.47543 44.70216, -111.48080 44.6...
3	NaN	ME	23	Maine	New England	MULTIPOLYGON (((-69.77728 44.07415, -69.85993 ...
4	NaN	ND	38	North Dakota	West North Central	POLYGON ((-98.73044 45.93827, -99.00683 45.939...

We can see our usa geodataframe contains the boundary data for each state, stored as a polygon in the geometry column. Importantly, each state is assigned to a sub_region, let's familarise ourselves with the subregions of the USA.

In [36]:

## plot map of US subregions ##

# exclude Alaska and Hawaii as they are too far away and make the mainland appear small (sorry!)
main_states = usa[(usa["state_abbr"] != "HI") & (usa["state_abbr"] != "AK")]

fig, ax = plt.subplots(figsize=(14,14))

main_states.plot(ax=ax,
                 alpha=.5, # mute colours with transparency so they're less intense
                 column="sub_region", 
                 edgecolor="white", 
                 linewidth=1, 
                 legend=True,
                 legend_kwds={'loc': 'lower right'})

plt.title("United States subregions")

plt.show()

Note: Alaska and Hawaii are part of the Pacific subregion but have been removed from this map to make it easier to view the mainland.

Now that we have a good idea of where each subregion is, let's identify what percentage of respondents from each subregion identify as 'fans' of Star Wars. First, we need to combine our Star Wars survey data with our geodataframe, main_states.

Merge Star Wars survey data with shapefile data¶

In [37]:

# get columns pertaining to fan status and location from survey dataset
fan_loc = star_wars.iloc[:, [2,37]].copy()

# rename columns for easier manipulation
fan_loc = fan_loc.rename(columns={"Do you consider yourself to be a fan of the Star Wars film franchise?": "fan",
                                    "Location (Census Region)": "sub_region"})

# group fans by sub_region and aggregate by percentage of fans to non-fans in each region
fan_loc_per = fan_loc.groupby("sub_region")["fan"].value_counts(normalize=True).unstack().reset_index()

# remove redundant index name
fan_loc_per = fan_loc_per.rename_axis(None, axis=1)

# rename columns
fan_loc_per = fan_loc_per.rename(columns={False: "non_fan", True: "fan"})

# merge star wars fan data with geodataframe taken from the shapefile using 'sub_region' as the key
geo_fans = pd.merge(fan_loc_per, main_states, on="sub_region", indicator=True)

# note - given that we do not have the granularity down to the state-level in our Star Wars dataset, we would ideally
# use a shapefile containing geometry for the sub-regions themselves, rather than individiaul states
# however this has been hard to source
print("First 15 rows")
geo_fans.head(15)

First 15 rows

Out[37]:

	sub_region	non_fan	fan	drawseq	state_abbr	state_fips	state_name	geometry	_merge
0	East North Central	0.373134	0.626866	NaN	WI	55	Wisconsin	MULTIPOLYGON (((-87.74856 44.96162, -87.83999 ...	both
1	East North Central	0.373134	0.626866	22.0	IN	18	Indiana	POLYGON ((-86.34161 38.17729, -86.36435 38.193...	both
2	East North Central	0.373134	0.626866	26.0	OH	39	Ohio	POLYGON ((-83.27276 38.60926, -83.29004 38.596...	both
3	East North Central	0.373134	0.626866	27.0	IL	17	Illinois	POLYGON ((-88.07159 37.51104, -88.08791 37.476...	both
4	East North Central	0.373134	0.626866	50.0	MI	26	Michigan	MULTIPOLYGON (((-88.49753 48.17380, -88.62533 ...	both
5	East South Central	0.375000	0.625000	33.0	KY	21	Kentucky	MULTIPOLYGON (((-86.51067 36.65507, -86.77054 ...	both
6	East South Central	0.375000	0.625000	40.0	TN	47	Tennessee	POLYGON ((-83.95461 35.45554, -84.01256 35.407...	both
7	East South Central	0.375000	0.625000	43.0	AL	01	Alabama	POLYGON ((-85.07007 31.98070, -85.11515 31.907...	both
8	East South Central	0.375000	0.625000	44.0	MS	28	Mississippi	POLYGON ((-88.45080 31.43562, -88.43456 31.120...	both
9	Middle Atlantic	0.311828	0.688172	18.0	PA	42	Pennsylvania	POLYGON ((-77.47579 39.71962, -78.09595 39.725...	both
10	Middle Atlantic	0.311828	0.688172	17.0	NY	36	New York	MULTIPOLYGON (((-79.76324 42.26733, -79.44402 ...	both
11	Middle Atlantic	0.311828	0.688172	21.0	NJ	34	New Jersey	POLYGON ((-75.48928 39.71486, -75.47597 39.720...	both
12	Mountain	0.279412	0.720588	NaN	MT	30	Montana	POLYGON ((-111.47543 44.70216, -111.48080 44.6...	both
13	Mountain	0.279412	0.720588	NaN	WY	56	Wyoming	POLYGON ((-104.05362 41.69822, -104.05550 41.5...	both
14	Mountain	0.279412	0.720588	NaN	ID	16	Idaho	POLYGON ((-117.02630 43.67903, -117.02379 43.7...	both

Plot choropleth map of fan percentage¶

In [38]:

# need to set the 'active geometry' on our pandas dataframe to get a geopandas geodataframe
geo_fans = geo_fans.set_geometry("geometry")

fig, ax = plt.subplots(figsize=(15,15))

# split ax so we can plot the legend on the new axes
from mpl_toolkits.axes_grid1 import make_axes_locatable
divider = make_axes_locatable(ax) # divide the axes labelled 'ax'
cax = divider.append_axes("right", size="5%", pad=0.1) # append a new axes at same height as oriignal axes

# plot cholorepleth map of USA by subregions 
geo_fans.plot(column="fan", 
              ax=ax,
              legend=True, 
              cmap="flare", 
              cax=cax, # cax param is the axes on which to draw the legend when using a color map
              #legend_kwds={"title":"yo"}
             )

ax.set_title("Percentage of respondents identifying as a Star Wars fan by US subregion")

plt.show()

We can see that the Mountain and New England subregions have among the strongest ratios of fans to non- Star Wars fans. Over 70% of the respondents in these two subregions identified as a fan of Star Wars. Conversely, the Pacific subregion has the lowest percentage of respondents who considered themselves a fan.

Conclusion¶

Back to top

General¶

1187 total responses to the survey
On average, Episode V is ranked the highest, while Episode III is ranked the lowest
Episode V has been watched by the greatest number of respondents (758 out of 1187)
Episode III has been watched by the fewest number of respondents (550 out of 1187)

Seen all films segment¶

Only 40% of respondents have seen all the Star Wars films (471 out of 1187)
The seen_all cohort film rankings generally reflects that of the total survey population, with a couple of differences:
- film rankings by the seen_all cohort are more extreme than of total_pop - the original trilogy is viewed even more favourably, and the prequel trilogy is considered even less appealing
- the seen_all cohort actually ranks Episode II the lowest of all films

Gender¶

There is roughly a 50:50 split between Male and Female respondents, with slightly more females
Males are more likely to identify as a fan of Star Wars than females (72% of Male respondents are a fan of Star Wars, compared with 60% of Female respondents)
Male respondents typically have seen more Star Wars films than Female respondents
Male respondents judge the first two films of the Star Wars prequel trilogy more harshly compared with females, while Female respondents view all the other films in the series more negatively than males.

Which character shot first?¶

By a slight majority, most respondents believe that Han shot first, though a near equal number of respondents did not understand the question at all, while a clear minority of respondents answered Greedo
While we expected a greater proportion of respondents in the youngest age group (18-29) to answer Greedo when compared with other age groups, this was not the case. In fact, 30-44 year olds had the greatest representation of Greedo voters.

Star Wars vs Star Trek¶

The majority of survey respondents did not consider themselves to be a fan of Star Trek
People who identify as a fan of Star Wars are much more likely to be a fan of Star Trek than the general population

Characters¶

Unsurprisingly, Luke Skywalker is the most familiar character in the Star Wars series
Respondents were least familiar with Padme Amidala, with up to 20% of respondents unsure of who the character was
Han Solo was viewed most favourably of all characters and is well appreciated across all age groups
Jar Jar Binks was the least favourable character in the Star Wars universe, however, he is actually viewed positively by the majority of respondents aged 60 years and over, and also by the majority of female respondents. It's not clear why this might be the case
The top half of characters in terms of favourability rating are generally portrayed as 'good', whereas the majority of characters in the bottom half by favourability rating are 'villains'. Perhaps people have a favourability bias towards morally good characters.

Fans by location¶

The Mountain and New England subregions of the USA have the highest percentage of respondents identifying as fans (approximately 72%)
The Pacific subregion has the lowest percentage (approximately 60%). This is still fairly high, so it's worth considering the bias that a Star Wars survey is much more likely to be taken by fans of the series than non-fans.

	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0
3	5.0	6.0	1.0	2.0	4.0	3.0
4	5.0	4.0	6.0	2.0	1.0	3.0

	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0
3	5.0	6.0	1.0	2.0	4.0	3.0
4	5.0	4.0	6.0	2.0	1.0	3.0

	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0
3	5.0	6.0	1.0	2.0	4.0	3.0
4	5.0	4.0	6.0	2.0	1.0	3.0