In this study, we evaluate Star Wars fans in terms of their fanship for the franchise in general, the Star Wars Expanded Universe, and the Star Trek series, and look for meaningful patterns in the data.
We start by importing the required modules and reading in our file.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
star_wars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 38 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   RespondentID   1186 non-null   int64
 1   Have you seen any of the 6 films in the Star Wars franchise?   1186 non-null   object
 2   Do you consider yourself to be a fan of the Star Wars film franchise?   836 non-null   object
 3   Which of the following Star Wars films have you seen? Please select all that apply.   673 non-null   object
 4   Unnamed: 4   571 non-null   object
 5   Unnamed: 5   550 non-null   object
 6   Unnamed: 6   607 non-null   object
 7   Unnamed: 7   758 non-null   object
 8   Unnamed: 8   738 non-null   object
 9   Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.   835 non-null   float64
 10  Unnamed: 10   836 non-null   float64
 11  Unnamed: 11   835 non-null   float64
 12  Unnamed: 12   836 non-null   float64
 13  Unnamed: 13   836 non-null   float64
 14  Unnamed: 14   836 non-null   float64
 15  Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.   829 non-null   object
 16  Unnamed: 16   831 non-null   object
 17  Unnamed: 17   831 non-null   object
 18  Unnamed: 18   823 non-null   object
 19  Unnamed: 19   825 non-null   object
 20  Unnamed: 20   814 non-null   object
 21  Unnamed: 21   826 non-null   object
 22  Unnamed: 22   820 non-null   object
 23  Unnamed: 23   812 non-null   object
 24  Unnamed: 24   827 non-null   object
 25  Unnamed: 25   830 non-null   object
 26  Unnamed: 26   821 non-null   object
 27  Unnamed: 27   814 non-null   object
 28  Unnamed: 28   826 non-null   object
 29  Which character shot first?   828 non-null   object
 30  Are you familiar with the Expanded Universe?   828 non-null   object
 31  Do you consider yourself to be a fan of the Expanded Universe?   213 non-null   object
 32  Do you consider yourself to be a fan of the Star Trek franchise?   1068 non-null   object
 33  Gender   1046 non-null   object
 34  Age   1046 non-null   object
 35  Household Income   858 non-null   object
 36  Education   1036 non-null   object
 37  Location (Census Region)   1043 non-null   object
dtypes: float64(6), int64(1), object(31)
memory usage: 352.2+ KB
After surveying the columns and their info, let's transform the data to facilitate graphs and, eventually, correlations. We will map the Yes/No strings to booleans and make sure the ranking columns are stored as floats.
truth_map_1 = {
    "Yes": True,
    "No": False,
}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(truth_map_1)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(truth_map_1)
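As a minimal sketch of what this mapping does, the toy Series below (hypothetical values, not the survey data) shows that `Series.map` with a dictionary converts the matched strings and leaves anything without a key, including missing answers, as NaN.

```python
import numpy as np
import pandas as pd

# Toy answers standing in for a survey column (hypothetical values).
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Same style of dictionary as truth_map_1 in the notebook.
truth_map = {"Yes": True, "No": False}

mapped = answers.map(truth_map)

# Matched strings become booleans; the missing answer stays missing.
n_true = int(mapped.eq(True).sum())
n_missing = int(mapped.isna().sum())
print(mapped.tolist())
print(n_true, n_missing)
```

Because the missing answers survive as NaN, the mapped column keeps the distinction between "No" and "did not answer".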
We also convert the six "seen" columns to booleans and give the columns we will be using more helpful names.
star_wars.iloc[:,3:9] = star_wars.iloc[:,3:9].replace('Star[.]*', True, regex = True ).replace(np.nan, False)
for x in range(3, 9):
    star_wars = star_wars.rename(columns = {star_wars.columns[x]: "seen_{}".format(x-2)})
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
for x in range(9, 15):
    star_wars = star_wars.rename(columns = {star_wars.columns[x]: 'movie_{}_rank'.format(x-8)})
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'movie_1_rank', 'movie_2_rank', 'movie_3_rank', 'movie_4_rank', 'movie_5_rank', 'movie_6_rank', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
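The two renaming loops could also be written as a single `rename` call. The sketch below uses a small stand-in DataFrame with made-up column names; only the `seen_N` and `movie_N_rank` naming scheme is taken from the notebook.

```python
import pandas as pd

# Stand-in frame: one ID column, two "seen" columns, two "rank" columns.
df = pd.DataFrame(
    [[1, "Film A", None, 2.0, 1.0]],
    columns=["RespondentID", "Which films?", "Unnamed: 2", "Rank films", "Unnamed: 4"],
)

# Build one old-name -> new-name dictionary by column position.
new_names = {}
for i, col in enumerate(df.columns[1:3], start=1):
    new_names[col] = "seen_{}".format(i)
for i, col in enumerate(df.columns[3:5], start=1):
    new_names[col] = "movie_{}_rank".format(i)

# A single rename call applies every change at once.
df = df.rename(columns=new_names)
print(list(df.columns))
```

Collecting the names first avoids rebuilding the DataFrame on every loop iteration and makes the full old-to-new mapping easy to inspect.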
To establish a baseline, let's look at the movie preferences and the viewing frequency for all respondents. (Note that for the rankings, a lower number indicates a greater preference.)
rankings = star_wars.iloc[:,9:15]
ranking_mean = rankings.mean()
ranking_mean.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f4b958e7070>
star_wars.iloc[:, 3:9].mean().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f4bc9b7a9a0>
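Both bar charts plot column means. For the boolean `seen_N` columns, the mean is the share of respondents who saw the film, since `True` counts as 1 and `False` as 0. A toy illustration with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "seen_1": [True, False, True, True],
    "movie_1_rank": [1.0, 3.0, None, 2.0],
})

# Mean of a boolean column: 3 of 4 respondents saw the film.
seen_share = toy["seen_1"].mean()

# Mean of a rank column: the missing rank is skipped, so (1 + 3 + 2) / 3.
mean_rank = toy["movie_1_rank"].mean()

print(seen_share, mean_rank)
```

Note that the two means rest on different denominators: every respondent counts toward the share, while only those who gave a rank count toward the mean rank.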
Now let's take a look at the difference between those who identify as fans of the entire franchise, and those who do not:
star_wars.iloc[:, 1:].groupby('Do you consider yourself to be a fan of the Star Wars film franchise?').mean().T.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f4b935e46d0>
We can note several interesting observations:
Now let's look at some numbers for Star Trek fans:
star_wars.iloc[:, 1:].groupby('Do you consider yourself to be a fan of the Star Trek franchise?').mean().T.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f4b9367f9a0>
We can note that the profile and preferences for those who are also Star Trek fans are quite similar to those of Star Wars franchise fans. The preference of the non-fans for the second trilogy is less pronounced, though.
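The grouped comparison used for both fan questions can be sketched on toy data. The column names and all values below are made up. `numeric_only=True` is added on the assumption of a recent pandas version, where averaging text columns raises an error instead of dropping them silently.

```python
import pandas as pd

toy = pd.DataFrame({
    "fan": ["Yes", "Yes", "No", "No"],
    "seen_1": [True, True, True, False],
    "movie_1_rank": [2.0, 4.0, 5.0, 3.0],
    "Gender": ["Male", "Female", "Male", "Female"],  # text column, excluded from the mean
})

# One row per group (fan / non-fan), one column per numeric or boolean variable.
group_means = toy.groupby("fan").mean(numeric_only=True)

# Transposing puts the variables on the x-axis when plotting, as in the notebook.
print(group_means.T)
```

Calling `.T.plot.bar()` on `group_means` would then draw one pair of bars per variable, one bar for each group.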
Let's take a look at the numbers for those who are Star Wars Expanded Universe fans:
star_wars['Are you familiar with the Expanded Universe?'].value_counts(dropna =False)
No     615
NaN    358
Yes    213
Name: Are you familiar with the Expanded Universe?, dtype: int64
star_wars['Do you consider yourself to be a fan of the Expanded Universe?'].value_counts(dropna =False)
NaN    973
No     114
Yes     99
Name: Do you consider yourself to be a fan of the Expanded Universe?, dtype: int64
We can see that only a small group of respondents are fans of the Expanded Universe: 99 of 1186, against 114 who are familiar with it but are not fans. Far more respondents (615) are not familiar with it at all. Within such a specific subset of the fanbase, are there any predominant characteristics?
First, let's make a copy of some of the descriptive categories we'd like to dig into.
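To put such counts in proportion, `value_counts` can also return shares via `normalize=True`. The sketch below rebuilds a Series with the same counts as the fan question above (99 Yes, 114 No, 973 missing); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Series reproducing the counts reported above for the fan question.
fan = pd.Series(["Yes"] * 99 + ["No"] * 114 + [np.nan] * 973)

# Share of all 1186 respondents, missing answers included.
share_all = fan.value_counts(dropna=False, normalize=True)

# Share among the 213 respondents who answered the question.
share_answered = fan.value_counts(normalize=True)

print(round(share_all["Yes"], 3))       # 99 / 1186
print(round(share_answered["Yes"], 3))  # 99 / 213
```

The two shares answer different questions: how common fans are among all respondents, and how common they are among those who know the Expanded Universe.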
my_wars = star_wars.copy().iloc[:, 30:37]
my_wars
| | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education |
---|---|---|---|---|---|---|---|
0 | Yes | No | No | Male | 18-29 | NaN | High school degree |
1 | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree |
2 | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree |
3 | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree |
4 | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree |
... | ... | ... | ... | ... | ... | ... | ... |
1181 | No | NaN | Yes | Female | 18-29 | $0 - $24,999 | Some college or Associate degree |
1182 | No | NaN | Yes | Female | 30-44 | $50,000 - $99,999 | Bachelor degree |
1183 | NaN | NaN | No | Female | 30-44 | $50,000 - $99,999 | Bachelor degree |
1184 | No | NaN | Yes | Female | 45-60 | $100,000 - $149,999 | Some college or Associate degree |
1185 | No | NaN | No | Female | > 60 | $50,000 - $99,999 | Graduate degree |
1186 rows × 7 columns
Now, let's convert all values to numeric or bool to facilitate running the correlations. To do so, we'll make some mappings.
my_wars['Age'].value_counts()
45-60    291
> 60     269
30-44    268
18-29    218
Name: Age, dtype: int64
my_wars['Household Income'].value_counts()
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: Household Income, dtype: int64
my_wars.iloc[:, -1].value_counts()
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64
def true_bool(x):
    if x == 'Yes':
        return float(1)
    elif x == 'No':
        return float(0)
    else:
        return np.nan
gender_map = {'Male': 0, 'Female':1}
age_map = {'18-29': 1,'30-44':2, '45-60': 3, '>60':4 }
income_map = {'$0 - $24,999': 0, '$25,000 - $49,999':1, '$50,000 - $99,999':2, '$100,000 - $149,999':3, '$150,000+':4 }
ed_map = {'Less than high school degree':1, 'High school degree':2, 'Some college or Associate degree': 3, 'Bachelor degree': 4, 'Graduate degree':5 }
my_wars.iloc[:,:3] = my_wars.iloc[:, :3].applymap(true_bool)
my_wars.iloc[:, 3] = my_wars.iloc[:, 3].map(gender_map)
my_wars['Age'] = my_wars['Age'].map(age_map)
my_wars['Household Income'] = my_wars['Household Income'].map(income_map)
my_wars.iloc[:, -1] = my_wars.iloc[:, -1].map(ed_map)
my_wars
| | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education |
---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0.0 | 1.0 | NaN | 2.0 |
1 | NaN | NaN | 1 | 0.0 | 1.0 | 0.0 | 4.0 |
2 | 0 | NaN | 0 | 0.0 | 1.0 | 0.0 | 2.0 |
3 | 0 | NaN | 1 | 0.0 | 1.0 | 3.0 | 3.0 |
4 | 1 | 0 | 0 | 0.0 | 1.0 | 3.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... |
1181 | 0 | NaN | 1 | 1.0 | 1.0 | 0.0 | 3.0 |
1182 | 0 | NaN | 1 | 1.0 | 2.0 | 2.0 | 4.0 |
1183 | NaN | NaN | 0 | 1.0 | 2.0 | 2.0 | 4.0 |
1184 | 0 | NaN | 1 | 1.0 | 3.0 | 3.0 | 3.0 |
1185 | 0 | NaN | 0 | 1.0 | NaN | 2.0 | 5.0 |
1186 rows × 7 columns
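Because `Series.map` returns NaN for any label that has no exactly matching dictionary key, it can be worth checking each mapping against the labels actually present. Below is a minimal sketch using a hypothetical helper, `unmapped_labels`, and toy data. The labels mirror the Age categories listed above, and the toy dictionary deliberately contains a key that differs from its label by one space.

```python
import pandas as pd

def unmapped_labels(series, mapping):
    """Return the non-missing labels in `series` that have no key in `mapping`."""
    labels = series.dropna().unique()
    return sorted(label for label in labels if label not in mapping)

# Toy column with the four age labels as they appear in the value counts.
age = pd.Series(["18-29", "30-44", "45-60", "> 60", None])

# A mapping whose last key lacks the space, so it does not match "> 60".
toy_age_map = {"18-29": 1, "30-44": 2, "45-60": 3, ">60": 4}

missing = unmapped_labels(age, toy_age_map)
print(missing)

# Mapping with this dictionary turns the unmatched label into NaN,
# in addition to the value that was already missing.
n_nan_after = int(age.map(toy_age_map).isna().sum())
print(n_nan_after)
```

An empty list from the helper indicates that every label present will be mapped; comparing non-null counts before and after mapping is a second, cheaper check.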
my_wars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 7 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Are you familiar with the Expanded Universe?   828 non-null   object
 1   Do you consider yourself to be a fan of the Expanded Universe?   213 non-null   object
 2   Do you consider yourself to be a fan of the Star Trek franchise?   1068 non-null   object
 3   Gender   1046 non-null   float64
 4   Age   777 non-null   float64
 5   Household Income   858 non-null   float64
 6   Education   1036 non-null   float64
dtypes: float64(4), object(3)
memory usage: 65.0+ KB
my_wars.iloc[:, :3] = my_wars.iloc[:, :3].astype(float)
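One caveat before running the correlations: Age now has only 777 non-null values, down from 1046. The data labels the oldest group '> 60' (with a space), which does not match the '>60' key in age_map, so those 269 respondents are mapped to NaN and drop out of every age correlation below. With that in mind, we are ready to run our baseline correlations.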
my_wars.corr(method = 'pearson')
| | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education |
---|---|---|---|---|---|---|---|
Are you familiar with the Expanded Universe? | 1.000000 | NaN | 0.189222 | -0.193061 | -0.076499 | -0.004521 | -0.066236 |
Do you consider yourself to be a fan of the Expanded Universe? | NaN | 1.000000 | 0.128644 | -0.008796 | -0.135259 | 0.053812 | -0.019133 |
Do you consider yourself to be a fan of the Star Trek franchise? | 0.189222 | 0.128644 | 1.000000 | -0.136584 | 0.147298 | 0.050203 | 0.071583 |
Gender | -0.193061 | -0.008796 | -0.136584 | 1.000000 | -0.002160 | -0.072513 | 0.039980 |
Age | -0.076499 | -0.135259 | 0.147298 | -0.002160 | 1.000000 | 0.215972 | 0.195255 |
Household Income | -0.004521 | 0.053812 | 0.050203 | -0.072513 | 0.215972 | 1.000000 | 0.285583 |
Education | -0.066236 | -0.019133 | 0.071583 | 0.039980 | 0.195255 | 0.285583 | 1.000000 |
None of the correlations in this data set are particularly strong; the largest, between household income and education, is only about 0.29. Now, let us limit the data to respondents who are familiar with the Expanded Universe.
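Two properties of `DataFrame.corr` explain the shape of the table above. It uses only the rows where both columns are non-missing, and it returns NaN when one of the columns is constant over those rows. That is why familiarity and fanship show NaN: the fan question has answers only for respondents whose familiarity value is 1. A toy sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # 1.0 = familiar, 0.0 = not familiar
    "familiar": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    # a fan answer exists only where familiar == 1.0
    "fan": [1.0, 0.0, 1.0, 0.0, np.nan, np.nan],
    "age": [1.0, 2.0, 1.0, 3.0, 4.0, 2.0],
})

corr = toy.corr(method="pearson")

# Over the four rows where both are present, "familiar" is constant,
# so its correlation with "fan" is undefined (NaN).
print(corr.loc["familiar", "fan"])

# "fan" vs "age" is computed from the same four complete rows only.
print(corr.loc["fan", "age"])
```

The same pairwise handling means each cell of a correlation table may rest on a different number of respondents, which is worth keeping in mind when comparing coefficients.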
my_wars1 = my_wars[my_wars.iloc[:,0] == 1]
my_wars1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 213 entries, 0 to 1175
Data columns (total 7 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Are you familiar with the Expanded Universe?   213 non-null   float64
 1   Do you consider yourself to be a fan of the Expanded Universe?   213 non-null   float64
 2   Do you consider yourself to be a fan of the Star Trek franchise?   213 non-null   float64
 3   Gender   212 non-null   float64
 4   Age   171 non-null   float64
 5   Household Income   177 non-null   float64
 6   Education   211 non-null   float64
dtypes: float64(7)
memory usage: 13.3 KB
my_wars1.corr(method = 'pearson')
| | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education |
---|---|---|---|---|---|---|---|
Are you familiar with the Expanded Universe? | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Do you consider yourself to be a fan of the Expanded Universe? | NaN | 1.000000 | 0.128644 | -0.008796 | -0.135259 | 0.053812 | -0.019133 |
Do you consider yourself to be a fan of the Star Trek franchise? | NaN | 0.128644 | 1.000000 | 0.044691 | 0.276272 | 0.079401 | 0.080488 |
Gender | NaN | -0.008796 | 0.044691 | 1.000000 | -0.026541 | -0.029926 | 0.167025 |
Age | NaN | -0.135259 | 0.276272 | -0.026541 | 1.000000 | 0.152805 | 0.221346 |
Household Income | NaN | 0.053812 | 0.079401 | -0.029926 | 0.152805 | 1.000000 | 0.225105 |
Education | NaN | -0.019133 | 0.080488 | 0.167025 | 0.221346 | 0.225105 | 1.000000 |
The correlations indicate that no particular demographic, be it age, gender, degree of schooling, or income, is notably more inclined to be a fan of the Expanded Universe. The strongest association is with age, at about -0.14, which is only a weak tendency toward younger respondents.
Now let's see if there are any particular preferences or viewing tendencies among those who follow the Expanded Universe:
total_wars = pd.concat([my_wars.iloc[:, 0:2],star_wars.iloc[:, 3:15]], axis = 1)
total_wars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 14 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Are you familiar with the Expanded Universe?   828 non-null   float64
 1   Do you consider yourself to be a fan of the Expanded Universe?   213 non-null   float64
 2   seen_1   1186 non-null   bool
 3   seen_2   1186 non-null   bool
 4   seen_3   1186 non-null   bool
 5   seen_4   1186 non-null   bool
 6   seen_5   1186 non-null   bool
 7   seen_6   1186 non-null   bool
 8   movie_1_rank   835 non-null   float64
 9   movie_2_rank   836 non-null   float64
 10  movie_3_rank   835 non-null   float64
 11  movie_4_rank   836 non-null   float64
 12  movie_5_rank   836 non-null   float64
 13  movie_6_rank   836 non-null   float64
dtypes: bool(6), float64(8)
memory usage: 81.2 KB
total_wars.corr(method = 'pearson')
| | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | movie_1_rank | movie_2_rank | movie_3_rank | movie_4_rank | movie_5_rank | movie_6_rank |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Are you familiar with the Expanded Universe? | 1.000000 | NaN | 0.159340 | 0.260348 | 0.269342 | 0.195778 | 0.114120 | 0.130095 | 0.184859 | 0.066172 | -0.072465 | -0.075515 | -0.069800 | -0.026537 |
Do you consider yourself to be a fan of the Expanded Universe? | NaN | 1.000000 | 0.093491 | 0.064151 | 0.100418 | 0.043833 | -0.013946 | -0.015669 | 0.030407 | 0.015449 | -0.040088 | -0.025161 | 0.006173 | 0.011561 |
seen_1 | 0.159340 | 0.093491 | 1.000000 | 0.783358 | 0.729996 | 0.665818 | 0.648044 | 0.653696 | 0.067218 | 0.013792 | -0.067711 | -0.146503 | 0.066301 | 0.079381 |
seen_2 | 0.260348 | 0.064151 | 0.783358 | 1.000000 | 0.883886 | 0.687882 | 0.611608 | 0.642843 | 0.246639 | 0.041711 | -0.102122 | -0.160216 | -0.014686 | -0.002038 |
seen_3 | 0.269342 | 0.100418 | 0.729996 | 0.883886 | 1.000000 | 0.698517 | 0.617805 | 0.651306 | 0.308085 | 0.134838 | -0.181001 | -0.147843 | -0.049921 | -0.053451 |
seen_4 | 0.195778 | 0.043833 | 0.665818 | 0.687882 | 0.698517 | 1.000000 | 0.734259 | 0.759477 | 0.440301 | 0.365598 | 0.174842 | -0.554932 | -0.136834 | -0.143364 |
seen_5 | 0.114120 | -0.013946 | 0.648044 | 0.611608 | 0.617805 | 0.734259 | 1.000000 | 0.910124 | 0.385813 | 0.388224 | 0.248817 | -0.130101 | -0.422226 | -0.368499 |
seen_6 | 0.130095 | -0.015669 | 0.653696 | 0.642843 | 0.651306 | 0.759477 | 0.910124 | 1.000000 | 0.431521 | 0.391197 | 0.237803 | -0.159497 | -0.272718 | -0.509609 |
movie_1_rank | 0.184859 | 0.030407 | 0.067218 | 0.246639 | 0.308085 | 0.440301 | 0.385813 | 0.431521 | 1.000000 | 0.415511 | 0.066760 | -0.451862 | -0.454098 | -0.462642 |
movie_2_rank | 0.066172 | 0.015449 | 0.013792 | 0.041711 | 0.134838 | 0.365598 | 0.388224 | 0.391197 | 0.415511 | 1.000000 | 0.336002 | -0.435664 | -0.528662 | -0.532254 |
movie_3_rank | -0.072465 | -0.040088 | -0.067711 | -0.102122 | -0.181001 | 0.174842 | 0.248817 | 0.237803 | 0.066760 | 0.336002 | 1.000000 | -0.299704 | -0.452946 | -0.421262 |
movie_4_rank | -0.075515 | -0.025161 | -0.146503 | -0.160216 | -0.147843 | -0.554932 | -0.130101 | -0.159497 | -0.451862 | -0.435664 | -0.299704 | 1.000000 | 0.003324 | -0.043641 |
movie_5_rank | -0.069800 | 0.006173 | 0.066301 | -0.014686 | -0.049921 | -0.136834 | -0.422226 | -0.272718 | -0.454098 | -0.528662 | -0.452946 | 0.003324 | 1.000000 | 0.312429 |
movie_6_rank | -0.026537 | 0.011561 | 0.079381 | -0.002038 | -0.053451 | -0.143364 | -0.368499 | -0.509609 | -0.462642 | -0.532254 | -0.421262 | -0.043641 | 0.312429 | 1.000000 |
Our correlation table once again indicates that the Expanded Universe fan group is very heterogeneous and does not seem to have any particular viewing preferences: none of its correlations with the viewing or ranking columns exceeds about 0.10 in magnitude.
In conclusion, we have not been able to identify a strong marker or indicator of an Expanded Universe fan, despite the small size of the group.
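A wide correlation matrix is easier to read when a single row is pulled out and sorted by absolute value. The sketch below does this on randomly generated stand-in data; the column names follow the notebook, but the values are synthetic, so the printed coefficients say nothing about the survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data (not the survey responses).
toy = pd.DataFrame({
    "fan": rng.integers(0, 2, size=n).astype(float),
    "seen_1": rng.integers(0, 2, size=n).astype(bool),
    "movie_1_rank": rng.integers(1, 7, size=n).astype(float),
})

target = "fan"

# Correlation of every other column with the target column.
with_target = toy.corr(method="pearson")[target].drop(target)

# Reorder so the largest coefficient in magnitude comes first, keeping the sign.
order = with_target.abs().sort_values(ascending=False).index
ranked = with_target.reindex(order)
print(ranked)
```

Applied to `total_wars`, the same three lines would list the viewing and ranking columns in order of how strongly they track Expanded Universe fanship.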