To Star Wars and Beyond:¶

Star Wars Fans Who Connect With More than Star Wars¶

In this study, we will evaluate Star Wars fans in terms of their fanship for the franchise in general, the Star Wars Expanded Universe, and the series Star Trek. We will see if we can discern any meaningful patterns in the said data.

We start by importing the prerequisite modules, and read in our file.

In [1]:

import pandas as pd
import numpy as np
import regex as re
import matplotlib.pyplot as plt
%matplotlib inline
star_wars  = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

In [2]:

star_wars.columns

Out[2]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

In [3]:

star_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   int64  
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1186 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                                                                          836 non-null    object 
 3   Which of the following Star Wars films have you seen? Please select all that apply.                                                            673 non-null    object 
 4   Unnamed: 4                                                                                                                                     571 non-null    object 
 5   Unnamed: 5                                                                                                                                     550 non-null    object 
 6   Unnamed: 6                                                                                                                                     607 non-null    object 
 7   Unnamed: 7                                                                                                                                     758 non-null    object 
 8   Unnamed: 8                                                                                                                                     738 non-null    object 
 9   Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.  835 non-null    float64
 10  Unnamed: 10                                                                                                                                    836 non-null    float64
 11  Unnamed: 11                                                                                                                                    835 non-null    float64
 12  Unnamed: 12                                                                                                                                    836 non-null    float64
 13  Unnamed: 13                                                                                                                                    836 non-null    float64
 14  Unnamed: 14                                                                                                                                    836 non-null    float64
 15  Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.                                 829 non-null    object 
 16  Unnamed: 16                                                                                                                                    831 non-null    object 
 17  Unnamed: 17                                                                                                                                    831 non-null    object 
 18  Unnamed: 18                                                                                                                                    823 non-null    object 
 19  Unnamed: 19                                                                                                                                    825 non-null    object 
 20  Unnamed: 20                                                                                                                                    814 non-null    object 
 21  Unnamed: 21                                                                                                                                    826 non-null    object 
 22  Unnamed: 22                                                                                                                                    820 non-null    object 
 23  Unnamed: 23                                                                                                                                    812 non-null    object 
 24  Unnamed: 24                                                                                                                                    827 non-null    object 
 25  Unnamed: 25                                                                                                                                    830 non-null    object 
 26  Unnamed: 26                                                                                                                                    821 non-null    object 
 27  Unnamed: 27                                                                                                                                    814 non-null    object 
 28  Unnamed: 28                                                                                                                                    826 non-null    object 
 29  Which character shot first?                                                                                                                    828 non-null    object 
 30  Are you familiar with the Expanded Universe?                                                                                                   828 non-null    object 
 31  Do you consider yourself to be a fan of the Expanded Universe?                                                                                 213 non-null    object 
 32  Do you consider yourself to be a fan of the Star Trek franchise?                                                                               1068 non-null   object 
 33  Gender                                                                                                                                         1046 non-null   object 
 34  Age                                                                                                                                            1046 non-null   object 
 35  Household Income                                                                                                                               858 non-null    object 
 36  Education                                                                                                                                      1036 non-null   object 
 37  Location (Census Region)                                                                                                                       1043 non-null   object 
dtypes: float64(6), int64(1), object(31)
memory usage: 352.2+ KB

After surveying the columns and column info, let's do some data transformations to facilitate graps, and evenntually, correlations. We will map the Yes- No strings into bools, and conevrt the number strings into integers.

In [4]:

truth_map_1 = {
    "Yes": True,
    "No": False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(truth_map_1)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(truth_map_1)

We can also make the column names we will be using more helpful.

In [5]:

star_wars.iloc[:,3:9] = star_wars.iloc[:,3:9].replace('Star[.]*', True, regex = True ).replace(np.nan, False)

In [6]:

for x in range(3,9 ):
    star_wars = star_wars.rename(columns = {star_wars.columns[x]:"seen_{}".format(x-2)})
    

In [7]:

star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

In [8]:

for x in range(9, 15):
    star_wars = star_wars.rename(columns = {star_wars.columns[x]: 'movie_{}_rank'.format(x-8)})
    
star_wars.columns

Out[8]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'movie_1_rank', 'movie_2_rank', 'movie_3_rank', 'movie_4_rank',
       'movie_5_rank', 'movie_6_rank',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

To establish a baseline, lets look at what the movie preferences, as well as the frequency of watching, was for the general population. (Note that for the rankings, a lower number indicates a greater preference.)

In [9]:

rankings = star_wars.iloc[:,9:15]
ranking_mean = rankings.mean()

ranking_mean.plot.bar()

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f4b958e7070>

In [10]:

star_wars.iloc[:, 3:9].mean().plot.bar()

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f4bc9b7a9a0>

Now let's take a look at the difference between those who identify as fans of the entire franchise, and those who do not:

In [46]:

star_wars.iloc[:, 1:].groupby('Do you consider yourself to be a fan of the Star Wars film franchise?').mean().T.plot.bar()

Out[46]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f4b935e46d0>

We can note several interesting observations:

People who consider themselves fans of the franchise have consistently seen evry given movie more. This is pretty intuitive.
The most popular movie amongst those who are not true fans (in terms of viewing) was the 5th movie. It would seem that many chose to watch this film (which was the second in the series produced) after the exceptional success and novelty of the first.
Serious fans seem to have a greater appreciation of the second trilogy (Films 1-3) than the second trilogy, while the opposite is true for the non-serious fans.

Now let's look at some numbers for Star Trek fans

In [11]:

star_wars.iloc[:, 1:].groupby('Do you consider yourself to be a fan of the Star Trek franchise?').mean().T.plot.bar()

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f4b9367f9a0>

We can note that the profile and preferences for those who are also Star Trek fans are quite similar to those of Star Wars franchise fans. The preference of the non-fans for the second trilogy is less pronounced, though.

Let's take a look at the numbers for those who are Star Wars Expanded Universe fans:

In [48]:

star_wars['Are you familiar with the Expanded Universe?'].value_counts(dropna =False)

Out[48]:

No     615
NaN    358
Yes    213
Name: Are you familiar with the Expanded Universe?, dtype: int64

In [47]:

star_wars['Do you consider yourself to be a fan of the Expanded Universe?'].value_counts(dropna =False)

Out[47]:

NaN    973
No     114
Yes     99
Name: Do you consider yourself to be a fan of the Expanded Universe?, dtype: int64

We can see that only a very small group of fans are actually fans of the Expanded Universe. Many more simply don't even know what it is. Within such a specific subset of the fanbase, are there any predominant characteristics?

First, let's make a copy of some of the descriptive categories we'd like to dig in to.

In [14]:

my_wars = star_wars.copy().iloc[:, 30:37]

my_wars

Out[14]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education
0	Yes	No	No	Male	18-29	NaN	High school degree
1	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree
2	No	NaN	No	Male	18-29	$0 - $24,999	High school degree
3	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree
4	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree
...	...	...	...	...	...	...	...
1181	No	NaN	Yes	Female	18-29	$0 - $24,999	Some college or Associate degree
1182	No	NaN	Yes	Female	30-44	$50,000 - $99,999	Bachelor degree
1183	NaN	NaN	No	Female	30-44	$50,000 - $99,999	Bachelor degree
1184	No	NaN	Yes	Female	45-60	$100,000 - $149,999	Some college or Associate degree
1185	No	NaN	No	Female	> 60	$50,000 - $99,999	Graduate degree

1186 rows × 7 columns

Now, let's convert all values to numeric or bool to facilitate running the correlations. To do so, we'll make some mappings.

In [15]:

my_wars['Age'].value_counts()

Out[15]:

45-60    291
> 60     269
30-44    268
18-29    218
Name: Age, dtype: int64

In [16]:

my_wars['Household Income'].value_counts()

Out[16]:

$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: Household Income, dtype: int64

In [17]:

my_wars.iloc[:, -1].value_counts()

Out[17]:

Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64

In [18]:

def true_bool(x):
    if x == 'Yes':
        return float(1)
    elif x == 'No':
        return float(0)
    else: 
        return np.nan
gender_map = {'Male': 0, 'Female':1}
age_map = {'18-29': 1,'30-44':2, '45-60': 3, '>60':4 }
income_map = {'$0 - $24,999': 0, '$25,000 - $49,999':1, '$50,000 - $99,999':2, '$100,000 - $149,999':3, '$150,000+':4   }
ed_map = {'Less than high school degree':1, 'High school degree':2, 'Some college or Associate degree': 3, 'Bachelor degree': 4, 'Graduate degree':5 }

In [19]:

my_wars.iloc[:,:3] = my_wars.iloc[:, :3].applymap(true_bool)
my_wars.iloc[:, 3] = my_wars.iloc[:, 3].map(gender_map)
my_wars['Age'] = my_wars['Age'].map(age_map)
my_wars['Household Income'] = my_wars['Household Income'].map(income_map)
my_wars.iloc[:, -1] = my_wars.iloc[:, -1].map(ed_map)

In [20]:

my_wars

Out[20]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education
0	1	0	0	0.0	1.0	NaN	2.0
1	NaN	NaN	1	0.0	1.0	0.0	4.0
2	0	NaN	0	0.0	1.0	0.0	2.0
3	0	NaN	1	0.0	1.0	3.0	3.0
4	1	0	0	0.0	1.0	3.0	3.0
...	...	...	...	...	...	...	...
1181	0	NaN	1	1.0	1.0	0.0	3.0
1182	0	NaN	1	1.0	2.0	2.0	4.0
1183	NaN	NaN	0	1.0	2.0	2.0	4.0
1184	0	NaN	1	1.0	3.0	3.0	3.0
1185	0	NaN	0	1.0	NaN	2.0	5.0

1186 rows × 7 columns

In [21]:

my_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 7 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Are you familiar with the Expanded Universe?                      828 non-null    object 
 1   Do you consider yourself to be a fan of the Expanded Universe?    213 non-null    object 
 2   Do you consider yourself to be a fan of the Star Trek franchise?  1068 non-null   object 
 3   Gender                                                            1046 non-null   float64
 4   Age                                                               777 non-null    float64
 5   Household Income                                                  858 non-null    float64
 6   Education                                                         1036 non-null   float64
dtypes: float64(4), object(3)
memory usage: 65.0+ KB

In [22]:

my_wars.iloc[:, :3] = my_wars.iloc[:, :3].astype(float)

Now, we are ready to run our baseline correlations.

In [23]:

my_wars.corr(method = 'pearson')

Out[23]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education
Are you familiar with the Expanded Universe?	1.000000	NaN	0.189222	-0.193061	-0.076499	-0.004521	-0.066236
Do you consider yourself to be a fan of the Expanded Universe?	NaN	1.000000	0.128644	-0.008796	-0.135259	0.053812	-0.019133
Do you consider yourself to be a fan of the Star Trek franchise?	0.189222	0.128644	1.000000	-0.136584	0.147298	0.050203	0.071583
Gender	-0.193061	-0.008796	-0.136584	1.000000	-0.002160	-0.072513	0.039980
Age	-0.076499	-0.135259	0.147298	-0.002160	1.000000	0.215972	0.195255
Household Income	-0.004521	0.053812	0.050203	-0.072513	0.215972	1.000000	0.285583
Education	-0.066236	-0.019133	0.071583	0.039980	0.195255	0.285583	1.000000

None of the correlations from the this data set are particularly strong. Now, let us limit it to cases where the respondent is familiar with the Expanded Universe

In [49]:

my_wars1 = my_wars[my_wars.iloc[:,0] == 1]
my_wars1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 213 entries, 0 to 1175
Data columns (total 7 columns):
 #   Column                                                            Non-Null Count  Dtype  
---  ------                                                            --------------  -----  
 0   Are you familiar with the Expanded Universe?                      213 non-null    float64
 1   Do you consider yourself to be a fan of the Expanded Universe?    213 non-null    float64
 2   Do you consider yourself to be a fan of the Star Trek franchise?  213 non-null    float64
 3   Gender                                                            212 non-null    float64
 4   Age                                                               171 non-null    float64
 5   Household Income                                                  177 non-null    float64
 6   Education                                                         211 non-null    float64
dtypes: float64(7)
memory usage: 13.3 KB

In [50]:

my_wars1.corr(method = 'pearson')

Out[50]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education
Are you familiar with the Expanded Universe?	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Do you consider yourself to be a fan of the Expanded Universe?	NaN	1.000000	0.128644	-0.008796	-0.135259	0.053812	-0.019133
Do you consider yourself to be a fan of the Star Trek franchise?	NaN	0.128644	1.000000	0.044691	0.276272	0.079401	0.080488
Gender	NaN	-0.008796	0.044691	1.000000	-0.026541	-0.029926	0.167025
Age	NaN	-0.135259	0.276272	-0.026541	1.000000	0.152805	0.221346
Household Income	NaN	0.053812	0.079401	-0.029926	0.152805	1.000000	0.225105
Education	NaN	-0.019133	0.080488	0.167025	0.221346	0.225105	1.000000

The correlations indicate that there is no particular demographic, be it age, gender, degree of schooling, or income, which is more inclined to be a fan of the entire Expanded Universe.

Now let's see if there are any particular prefernces or watching tendencies for those who follow the Expanded Universe

In [43]:

total_wars = pd.concat([my_wars.iloc[:, 0:2],star_wars.iloc[:, 3:15]], axis = 1)
total_wars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 14 columns):
 #   Column                                                          Non-Null Count  Dtype  
---  ------                                                          --------------  -----  
 0   Are you familiar with the Expanded Universe?                    828 non-null    float64
 1   Do you consider yourself to be a fan of the Expanded Universe?  213 non-null    float64
 2   seen_1                                                          1186 non-null   bool   
 3   seen_2                                                          1186 non-null   bool   
 4   seen_3                                                          1186 non-null   bool   
 5   seen_4                                                          1186 non-null   bool   
 6   seen_5                                                          1186 non-null   bool   
 7   seen_6                                                          1186 non-null   bool   
 8   movie_1_rank                                                    835 non-null    float64
 9   movie_2_rank                                                    836 non-null    float64
 10  movie_3_rank                                                    835 non-null    float64
 11  movie_4_rank                                                    836 non-null    float64
 12  movie_5_rank                                                    836 non-null    float64
 13  movie_6_rank                                                    836 non-null    float64
dtypes: bool(6), float64(8)
memory usage: 81.2 KB

In [45]:

total_wars.corr(method = 'pearson')

Out[45]:

	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	movie_1_rank	movie_2_rank	movie_3_rank	movie_4_rank	movie_5_rank	movie_6_rank
Are you familiar with the Expanded Universe?	1.000000	NaN	0.159340	0.260348	0.269342	0.195778	0.114120	0.130095	0.184859	0.066172	-0.072465	-0.075515	-0.069800	-0.026537
Do you consider yourself to be a fan of the Expanded Universe?	NaN	1.000000	0.093491	0.064151	0.100418	0.043833	-0.013946	-0.015669	0.030407	0.015449	-0.040088	-0.025161	0.006173	0.011561
seen_1	0.159340	0.093491	1.000000	0.783358	0.729996	0.665818	0.648044	0.653696	0.067218	0.013792	-0.067711	-0.146503	0.066301	0.079381
seen_2	0.260348	0.064151	0.783358	1.000000	0.883886	0.687882	0.611608	0.642843	0.246639	0.041711	-0.102122	-0.160216	-0.014686	-0.002038
seen_3	0.269342	0.100418	0.729996	0.883886	1.000000	0.698517	0.617805	0.651306	0.308085	0.134838	-0.181001	-0.147843	-0.049921	-0.053451
seen_4	0.195778	0.043833	0.665818	0.687882	0.698517	1.000000	0.734259	0.759477	0.440301	0.365598	0.174842	-0.554932	-0.136834	-0.143364
seen_5	0.114120	-0.013946	0.648044	0.611608	0.617805	0.734259	1.000000	0.910124	0.385813	0.388224	0.248817	-0.130101	-0.422226	-0.368499
seen_6	0.130095	-0.015669	0.653696	0.642843	0.651306	0.759477	0.910124	1.000000	0.431521	0.391197	0.237803	-0.159497	-0.272718	-0.509609
movie_1_rank	0.184859	0.030407	0.067218	0.246639	0.308085	0.440301	0.385813	0.431521	1.000000	0.415511	0.066760	-0.451862	-0.454098	-0.462642
movie_2_rank	0.066172	0.015449	0.013792	0.041711	0.134838	0.365598	0.388224	0.391197	0.415511	1.000000	0.336002	-0.435664	-0.528662	-0.532254
movie_3_rank	-0.072465	-0.040088	-0.067711	-0.102122	-0.181001	0.174842	0.248817	0.237803	0.066760	0.336002	1.000000	-0.299704	-0.452946	-0.421262
movie_4_rank	-0.075515	-0.025161	-0.146503	-0.160216	-0.147843	-0.554932	-0.130101	-0.159497	-0.451862	-0.435664	-0.299704	1.000000	0.003324	-0.043641
movie_5_rank	-0.069800	0.006173	0.066301	-0.014686	-0.049921	-0.136834	-0.422226	-0.272718	-0.454098	-0.528662	-0.452946	0.003324	1.000000	0.312429
movie_6_rank	-0.026537	0.011561	0.079381	-0.002038	-0.053451	-0.143364	-0.368499	-0.509609	-0.462642	-0.532254	-0.421262	-0.043641	0.312429	1.000000

Our correlation table once again indicates that the Expanded Universe fan group is very heterogenous, and does not seem to have any particular watching preferences.

In conclusion, we have not been able to identify a strong marker or indicator of an Expanded Universe fan, despite the small nature of their group.