Scenario: While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses for all, and now we will clean it to see what's the hidden insights of this survey: Are the StarWar's fan feel that The Empire Strikes Back is the best of all series?
To clearly, the Episode: The Empire Strikes Back is created and introduced from 1980's, and while the team FiveThirtyEight created this survey (2015s'), there are total 4 episode is introduced after The Empire Strikes Back. Our task is find the insight from this survey data to answer the question that team FiveThirtyEight interest: we can accept it or reject it depend on the data's result we have. The conclusion at the end of this project will be the final answer for the team.
Because we will work with csv file, but sometime we don't know what's that's file encoding? So, we will make a function to check the file encoding before we load in DataFrame.
# pip install chardet
## Check the file encoding:
from chardet.universaldetector import UniversalDetector
def detect_encode(file_name):
detector = UniversalDetector()
for item in open(file_name, 'rb'):
detector.feed(item)
if detector.done: break
detector.close()
print(detector.result)
Now let's download the file into our local and check the encoding of this.
# # Download the file:
# import opendatasets as od
# page='https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv'
# od.download(page)
## Check the encoding:
# detect_encode('StarWars.csv')
The encoding of file is in Windows-1252
, so we will pass this encoding name into function call our DataFrame, and check file the final times to sure everything is OK.
## Load library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## Load data in:
starwar = pd.read_csv('StarWars.csv', encoding='Windows-1252', delimiter=',')
starwar.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1187 entries, 0 to 1186 Data columns (total 38 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 RespondentID 1186 non-null float64 1 Have you seen any of the 6 films in the Star Wars franchise? 1187 non-null object 2 Do you consider yourself to be a fan of the Star Wars film franchise? 837 non-null object 3 Which of the following Star Wars films have you seen? Please select all that apply. 674 non-null object 4 Unnamed: 4 572 non-null object 5 Unnamed: 5 551 non-null object 6 Unnamed: 6 608 non-null object 7 Unnamed: 7 759 non-null object 8 Unnamed: 8 739 non-null object 9 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. 836 non-null object 10 Unnamed: 10 837 non-null object 11 Unnamed: 11 836 non-null object 12 Unnamed: 12 837 non-null object 13 Unnamed: 13 837 non-null object 14 Unnamed: 14 837 non-null object 15 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. 830 non-null object 16 Unnamed: 16 832 non-null object 17 Unnamed: 17 832 non-null object 18 Unnamed: 18 824 non-null object 19 Unnamed: 19 826 non-null object 20 Unnamed: 20 815 non-null object 21 Unnamed: 21 827 non-null object 22 Unnamed: 22 821 non-null object 23 Unnamed: 23 813 non-null object 24 Unnamed: 24 828 non-null object 25 Unnamed: 25 831 non-null object 26 Unnamed: 26 822 non-null object 27 Unnamed: 27 815 non-null object 28 Unnamed: 28 827 non-null object 29 Which character shot first? 829 non-null object 30 Are you familiar with the Expanded Universe? 829 non-null object 31 Do you consider yourself to be a fan of the Expanded Universe?ξ 214 non-null object 32 Do you consider yourself to be a fan of the Star Trek franchise? 1069 non-null object 33 Gender 1047 non-null object 34 Age 1047 non-null object 35 Household Income 859 non-null object 36 Education 1037 non-null object 37 Location (Census Region) 1044 non-null object dtypes: float64(1), object(37) memory usage: 352.5+ KB
Let's see a few records to know the data more clearly:
# Setting for maximum columns = 60
pd.set_option('display.max_columns',60)
# Check the first 5 records:
starwar.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
We can see the structre of this survey like this:
1. Handle YES/NO columns
We have two filed of YES/NO column:
Have you seen any of the 6 films in the Star Wars franchise?
(1st)Do you consider yourself to be a fan of the Star Wars film franchise?
(2nd)As each field can contains missing value (due to attendance reject to answer), we do the following:
df.info()
function we know the first filed is none of missing, but the 2nd has => we will count again to confirmseries.map()
function.The process will be like the code block below.
# Count the unique value again:
starwar['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Yes 936 No 250 Response 1 Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
starwar['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Yes 552 NaN 350 No 284 Response 1 Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
Skip thorugh the Respond records, we will convert the other value now:
# Convert the value to boolean:
yes_no = {'Yes':True, 'No':False}
col = ['Have you seen any of the 6 films in the Star Wars franchise?',
'Do you consider yourself to be a fan of the Star Wars film franchise?']
for item in col:
starwar[item] = starwar[item].map(yes_no, na_action='ignore')
# Check the result:
starwar['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
True 936 False 250 NaN 1 Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
starwar['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
True 552 NaN 351 False 284 Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
2. Clean Check-box columns
Everything seem OK, now let's take the next step with the next 6 question, we can see that:
Which of the following Star Wars films have you seen? Please select all that apply.
, right below it is title film: Star Wars: Episode I The Phantom Menace
with mean that the 1st question is about the attendance seen the Eposide 1 or notStar Wars: Episode II Attack of the Clones
...With that said, we can do the following below:
Which of the following Star Wars films have you seen? Please select all that apply.
=> seen_1
# Check the unique value on each field:
def check_value(df, start, end):
col_check = df.columns[start:end]
result = []
for item in col_check:
val = df[item].unique()
result.append(val[0])
if val[1] not in result:
val[1] = np.nan
result.append(val[1])
continue
return result
check_value(starwar,3, 9)
['Star Wars: Episode I The Phantom Menace', nan, 'Star Wars: Episode II Attack of the Clones', 'Star Wars: Episode III Revenge of the Sith', 'Star Wars: Episode IV A New Hope', 'Star Wars: Episode V The Empire Strikes Back', 'Star Wars: Episode VI Return of the Jedi']
# Convert the value to boolean:
convert_value = check_value(starwar,3,9)
mapper = {}
for name in convert_value:
if len(str(name))==3:
mapper[name] = False
continue
mapper[name] = True
def convert_value(df, start, end):
cols = df.columns[start:end]
for i in cols:
df[i] = df[i].map(mapper)
return df[i]
convert_value(starwar,3,9)
0 True 1 True 2 False 3 False 4 True ... 1182 True 1183 True 1184 False 1185 True 1186 True Name: Unnamed: 8, Length: 1187, dtype: bool
# Convert the name of column:
def convert_columns(df, start, end, str_, number):
cols = df.columns[start:end]
revert = []
for i in range(1,number+1):
revert.append('{}_{}'.format(str_,i))
mapp = {}
for old, new in zip(cols, revert):
mapp[old] = new
return df.rename(mapp, inplace=True, axis=1)
## Run the function
convert_columns(starwar, 3, 9, 'seen', 6)
## Check some records:
starwar.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | True | True | True | True | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3 | 2 | 1 | 4 | 5 | 6 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1 | 2 | 3 | 4 | 5 | 6 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5 | 6 | 1 | 2 | 4 | 3 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
3. Clean the ranking columns:
Similary as the Check-box columns above, but now we don't need to cleaning as much as the Check-boxs field. Instead, we rename the column to be ranking_n
and cast the value to be numeric type.
# Cast the value:
starwar[starwar.columns[9:15]] = starwar.loc[2:,starwar.columns[9:15]].astype('float')
# Check:
starwar[starwar.columns[9:15]].dtypes
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. float64 Unnamed: 10 float64 Unnamed: 11 float64 Unnamed: 12 float64 Unnamed: 13 float64 Unnamed: 14 float64 dtype: object
# Rename the columns:
# Convert the name of column:
convert_columns(starwar, 9, 15, 'ranking', 6)
## Check:
starwar.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 | Unnamed: 26 | Unnamed: 27 | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | True | True | True | True | True | True | NaN | NaN | NaN | NaN | NaN | NaN | Han Solo | Luke Skywalker | Princess Leia Organa | Anakin Skywalker | Obi Wan Kenobi | Emperor Palpatine | Darth Vader | Lando Calrissian | Boba Fett | C-3P0 | R2 D2 | Jar Jar Binks | Padme Amidala | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | NaN | NaN | NaN | NaN | NaN | NaN | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Somewhat favorably | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | Somewhat favorably | Very favorably | Somewhat favorably | Somewhat unfavorably | Very favorably | Very favorably | Very favorably | Very favorably | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4. Clean the favourable columns
Like the structre of Check-boxs and Ranking field, the favourable field have these structre:
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.
to Unnamed: 28
is the question, co-responed to:Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.
=> Han Solo
Unname: 16
=> Luke Skywalker, and so onVery favourably
,Somewhat favorably
,Unfamiliar
,NA
So, what we should do is:
Very favourably
,Somewhat favorably
,Unfamiliar
, and of course, N/A is 'C'The convert process will be shown as code-block below.
# Get the columns name and the name of actor:
list_actor_name = starwar.iloc[0,15:29].values
actor_name = []
for ac_name in list_actor_name:
names = (ac_name
.replace(' ','_')
.lower()
.strip())
actor_name.append(names)
# modify the actor name:
old_name = starwar.columns[15:29]
mapp1 = {}
for od, nw in zip(old_name, actor_name):
mapp1[od] = nw
starwar.rename(mapper=mapp1, inplace=True, axis=1)
# Check the result:
starwar.columns[15:29]
Index(['han_solo', 'luke_skywalker', 'princess_leia_organa', 'anakin_skywalker', 'obi_wan_kenobi', 'emperor_palpatine', 'darth_vader', 'lando_calrissian', 'boba_fett', 'c-3p0', 'r2_d2', 'jar_jar_binks', 'padme_amidala', 'yoda'], dtype='object')
# Check the unique value in each field:
starwar['han_solo'].value_counts(dropna=False)
Very favorably 610 NaN 357 Somewhat favorably 151 Neither favorably nor unfavorably (neutral) 44 Unfamiliar (N/A) 15 Somewhat unfavorably 8 Han Solo 1 Very unfavorably 1 Name: han_solo, dtype: int64
We will do a little change of our process:
Very favorably
, Somewhat favorably
, Neither favorably nor unfavorably
,Unfamiliar
,Somewhat unfavorably
, Very unfavorably
=> We will change our standardized rank as:Very favorably
Somewhat favorably
Neither favorably nor unfavorably (neutral)
Unfamiliar
or NaN value (because in the decription, Unfamiliar equal to Unknow, can consider as missing value)Somewhat unfavorably
Very unfavorably
#Convert the value:
mapping = {'Very favorably':1, 'Somewhat favorably':2,'Neither favorably nor unfavorably (neutral)':3,
'Unfamiliar (N/A)':4, 'Somewhat unfavorably':5, 'Very unfavorably':6}
colume = starwar.columns[15:29]
for mark in colume:
starwar[mark] = starwar[mark].map(mapping, na_action='ignore')
#Check:
starwar['han_solo']
0 NaN 1 1.0 2 NaN 3 2.0 4 1.0 ... 1182 1.0 1183 1.0 1184 NaN 1185 1.0 1186 1.0 Name: han_solo, Length: 1187, dtype: float64
# Fill missing value
df1 = starwar[colume].copy()
df1.fillna(4, inplace=True, axis=1)
starwar.loc[:,colume] = df1
starwar.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | han_solo | luke_skywalker | princess_leia_organa | anakin_skywalker | obi_wan_kenobi | emperor_palpatine | darth_vader | lando_calrissian | boba_fett | c-3p0 | r2_d2 | jar_jar_binks | padme_amidala | yoda | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | True | True | True | True | True | True | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
For the convenient of further step, we'll clear the first record, which is fill with only 'Respone'... and other guide syntax.
## Drop the first records:
starwar.drop(index=0,axis=0, inplace=True)
We'already have the data that cleaned, now we're going to consider some explantory method about this data:
Male
and Female
about how their taste of actor? What's their (average) ranking about the highest ranking episode?And some of other content could appear in our analysis process.
1. First, let's find the episode with highest ranking:
# Fill the value again for the first record:
#Check data:
starwar.iloc[0,9:15].values
array([nan, nan, nan, nan, nan, nan], dtype=object)
#Replace data:
replace = [3, 2, 1, 4, 5, 6]
for nw, od in zip(replace, starwar.columns[9:15]):
starwar[od].iloc[0] = nw
#Check:
starwar.head()
C:\Users\bieng\anaconda3\lib\site-packages\pandas\core\indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | han_solo | luke_skywalker | princess_leia_organa | anakin_skywalker | obi_wan_kenobi | emperor_palpatine | darth_vader | lando_calrissian | boba_fett | c-3p0 | r2_d2 | jar_jar_binks | padme_amidala | yoda | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 | 1.0 | 2.0 | 2.0 | 5.0 | 1.0 | 6.0 | 2.0 | 3.0 | 1.0 | 2.0 | 2.0 | 6.0 | 2.0 | 2.0 | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
## Find the mean ranking:
#1. Create a new dataframe of ranking
cols_ranking = starwar.columns[9:15]
rank = starwar.copy()[cols_ranking]
#2. Compute the mean:
mean_rank = rank.mean()
#3. Plot the rank:
tick_name = []
for i in range(1,7):
tick_name.append('Episode_{}'.format(i)) #Define the ticks name
mean_rank.plot.bar(legend=True, label='Ranking', rot=30)
plt.xticks(ticks=range(0,6), labels=tick_name)
plt.xlabel('Episode')
plt.ylabel('Mean ranking')
plt.title('Highest ranking of each episode\nLower is better')
plt.show()
We've got the first result: The Episode V (The Empries Strike Backs
) is the episode with highest ranking (by the survey's attendance - as low as best)
Now, let's look into something more deeper, like:
Age
?? by Education
?? or by Location
???To do that, we'll do below:
2: Clean data
## Check the unique value:
#1. Create the list unique value
age_value = []
education_value = []
location_value = []
array = [-1, -2, -4]
for name in starwar.columns[array]:
if name == 'Age':
age_value.append(starwar[name].value_counts(dropna=False).sort_index())
elif name == 'Education':
education_value.append(starwar[name].value_counts(dropna=False).sort_index())
else:
location_value.append(starwar[name].value_counts(dropna=False).sort_index())
#2. Create the dic store data:
store = {}
list_data = [age_value, education_value, location_value]
list_name = ['Age', 'Education', 'Location']
for item, value in zip(list_name, list_data):
store[item] = value
store
{'Age': [18-29 218 30-44 268 45-60 291 > 60 269 NaN 140 Name: Age, dtype: int64], 'Education': [Bachelor degree 321 Graduate degree 275 High school degree 105 Less than high school degree 7 Some college or Associate degree 328 NaN 150 Name: Education, dtype: int64], 'Location': [East North Central 181 East South Central 38 Middle Atlantic 122 Mountain 79 New England 75 Pacific 175 South Atlantic 170 West North Central 93 West South Central 110 NaN 143 Name: Location (Census Region), dtype: int64]}
# Check the missing value of each field:
starwar[starwar['Age'].isnull()]
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | han_solo | luke_skywalker | princess_leia_organa | anakin_skywalker | obi_wan_kenobi | emperor_palpatine | darth_vader | lando_calrissian | boba_fett | c-3p0 | r2_d2 | jar_jar_binks | padme_amidala | yoda | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 3.292638e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | 3.292635e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
81 | 3.291669e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
87 | 3.291650e+09 | False | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | No | NaN | NaN | NaN | NaN | NaN |
97 | 3.291570e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1109 | 3.288512e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1140 | 3.288460e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1142 | 3.288459e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1144 | 3.288456e+09 | True | NaN | False | False | False | False | False | False | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1162 | 3.288418e+09 | True | False | True | True | True | False | True | True | 3.0 | 4.0 | 5.0 | 6.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 5.0 | 3.0 | 3.0 | 2.0 | 2.0 | 5.0 | 3.0 | 3.0 | Greedo | No | NaN | No | NaN | NaN | NaN | NaN | NaN |
140 rows × 38 columns
We can see the odd thing here:
=> In order to get this odd thing clearly, we will use heatmap to check the missing value on this case, with the missing value is show by light color, and the data pattern shown as black color
# Check missing value by heatmap graph:
#1. Import seaborn library:
import seaborn as sns
# #3. Get the graph
def plot_null(df):
# Modify the frame
plt.figure(figsize=(20,10))
# Identify the data
data = df.isnull()
#Plot
sns.heatmap(data, cbar=False, yticklabels=False)
plt.xticks(rotation=90, size='x-large')
plot_null(starwar)
We can see that:
Age
have missing value, it will be missing at the other field too, except actor_favourite
field ( han_solo
to yoda
) because we've convert both Unfimiliar
and NaN
to D
rank => If we remove the missing at Age
, we don't get any loss data at the other fieldDo you consider yourself to be..
, we will convert all the missing value and No
to be False
, and analyze it, due to the missing pattern/ data pattern at two field have the signification different.=> We will remove all the missing at Age
field.
# Remove the missing value at Age field:
starwar.dropna(subset=['Age'], inplace=True)
For the Education
field, we can see the missing case almost same the Age
, and we have get rid of almost these missing value. With the remain missing, we'll find the correlation between Age
field and Education
field, and then fill with the co-respone of Age
value.
For the Household Income
, we will put it later.
# Find the correlation of `Age` and `Education` field:
col_c = ['Age', 'Education']
print(starwar[col_c].notnull().corr())
# Find the value of Education co-respond with `Age`:
starwar[col_c][starwar['Education'].isnull()]
Age Education Age NaN NaN Education NaN 1.0
Age | Education | |
---|---|---|
25 | 18-29 | NaN |
33 | 18-29 | NaN |
88 | > 60 | NaN |
93 | 18-29 | NaN |
263 | 45-60 | NaN |
415 | 45-60 | NaN |
527 | 30-44 | NaN |
673 | 30-44 | NaN |
823 | 45-60 | NaN |
929 | > 60 | NaN |
Since for correlation between 2 field Age
and Education
is None (and for Location
too) => It's better not to fill in, considering by the size of missing value is 10 records of Education
, and the number of records remain after delete these 10 case is not have large effect
=> Let's delete these missing value at Education
too, and we will delete out in Location
if any.
#Check the `Location` missing value:
print(starwar.iloc[:,-1].isnull().sum())
# Delete the missing record in `Education`:
starwar.dropna(subset=['Education'], inplace=True)
3
# Delete the missing record in `Location`:
starwar.dropna(subset=[starwar.columns[-1]], inplace=True)
# Check for the missing value at 3 field we modify:
num = [-1, -2, -4]
order = starwar.columns[num]
starwar[order].isnull().sum()
Location (Census Region) 0 Education 0 Age 0 dtype: int64
The last cleaning on this chapter is for Gender
, but if look back the missing value detect graph, we get the observe that almost the missing records in Age
is in Gender
too; so, for the remain missing value (if any), we can get rid of it now without worry about the loss of data.
# Check the missing value at `Gender`:
starwar['Gender'].isnull().sum()
0
Luckily, the supposing was True, all the missing value in Gender
was blown away along with Age
when cleaning Age
field. Now, let's jump to analysis by:
For the next analysis step, we have some way to perform the aggregate data by:
Age
+ Gender
(aggerate by Age
and filter by Gender
)Location
+ Age
or Garden
+ Location
3: Discover the highest ranking Episode by Age
+ Gender
We've look at the graph above and see what's episode is received the highest ranking (as low as best) => Let's look behind the curtain, we want to know:
Male
and Female
, between each group of Age
, what's the group have the most attendance?? => This result to determine the most result (bad or good) from what's group.#Check the 'gender' value:
starwar['Gender'].value_counts(dropna=False)
Female 545 Male 489 Name: Gender, dtype: int64
# Aggregate by 'Age'
age_group = starwar.copy().groupby('Age')
#Filter by `Male`:
age_male = age_group.apply(lambda x: x[x['Gender']=='Male'])
rank_by_male = age_male[age_male.columns[9:15]]
# Input sum ranking of each age group:
sum_rank_by_male = rank_by_male.reset_index().groupby('Age').agg(np.sum)
result = sum_rank_by_male['ranking_5']
#Filter by `Female':
age_female = age_group.apply(lambda x: x[x['Gender']=='Female'])
rank_by_female = age_female[age_female.columns[9:15]]
# Input sum ranking of each age group:
sum_rank_by_female = rank_by_female.reset_index().groupby('Age').agg(np.sum)
result_2 = sum_rank_by_female['ranking_5']
def plot_the_pie(series_1, series_2, df, title_1, title_2, col_name):
# Define data for each group:
group_1 = list(series_1)
group_2 = list(series_2)
# Create explode data:
explodes = [0.2, 0, 0.3, 0]
# Create auto pct funct:
def get_pct(pct, data):
result = int(pct / 100.*np.sum(data)) ##formula: percentage/100 * sum(data) = item in data
return ("{:.1f}% \n {}".format(pct, result))
# Create wedge properties:
wd = {'linewidth': 1, 'edgecolor':'black'}
# Create label:
key = list(df[col_name].unique())
# Plot the pie:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
# For male group
wedges, texts, autotexts = ax1.pie(x=group_1,
autopct= lambda pct: get_pct(pct, group_1),
explode = explodes,
wedgeprops=wd,
colors=['cyan','pink','violet','green'],
labels=key,
startangle=90)
ax1.legend(wedges, key, title='Age',loc=6, bbox_to_anchor=[1,1]) #Set legend box
plt.setp(autotexts, size = 12, weight ="bold")
ax1.set_title('{}'.format(title_1)) # Set title
# For female group
wedges, texts, autotexts = ax2.pie(x=group_2,
autopct= lambda pct: get_pct(pct, group_2),
explode = explodes,
wedgeprops=wd,
colors=['cyan','pink','violet','green'],
labels=key,
startangle=90)
# ax2.legend(wedges, key, title='Age', loc=6, bbox_to_anchor=[1,1])
plt.setp(autotexts, size = 12, weight ="bold")
ax2.set_title('{}'.format(title_2))
plt.show()
plot_the_pie(result, result_2, starwar,
'Ranking ratio for Episode V by age\nGender: Male', 'Ranking ratio for Episode V by age\nGender: Female',
'Age')
Look at the graph, we can get the overview for our attendance:
After got the ranking ratio, we can focus on the average point that they are ranking for this Episode, and make the bar chart to see what's happend.
## Get the mean point of each gender:
#Male:
mean_rank_by_male = rank_by_male.reset_index().groupby('Age').agg(np.mean)
mean_1 = mean_rank_by_male['ranking_5']
#Female:
mean_rank_by_female = rank_by_female.reset_index().groupby('Age').agg(np.mean)
mean_2 = mean_rank_by_female['ranking_5']
#Plot the bar chat:
plt.barh(y = list(starwar['Age'].unique()),width = list(mean_1), label='Male', height=0.3)
plt.barh(y = list(starwar['Age'].unique()),width = list(mean_2), align='edge', height=-0.3, label='Female')
plt.yticks(ticks=np.arange(0,4), labels=list(starwar['Age'].unique()))
plt.xlabel('Ranking points in average')
plt.ylabel('Age group')
plt.legend()
plt.title('Ranking point by each gender in each age group\n for StarWar Episode V')
plt.show()
Compare with the above pie chart ranking vote number ratio, we should divide the result to 2 area: Love and uhh, normal:
30-44
, so it's the only group by Age
'save' the ranking points of Episode V>60
- though they still like it, but the ranking score seem as quite as between love and normal. In other words, if our Mr. in 30-44
like this Episode so much - then - our Mr. in Senior group seem like feel a quite pale taste than age group 30-44
18-29
: The average point that Ms./Mrs. ranking is closer to 3.0, worse than Mr.'s (2.5) => It seem like our Ms./Mrs. attendance not interest too much in the episode.45-60
: Both Mr. and Mrs.'s result is greater than 2.5 (between 2.5 and 3.0) => Our attendance don't feel like they get a very good taste from the episode, the feeling is urghh, normal, normal and normal.4: Expand for Location
, what's the region with the best ranking for this episode??
In order to expand the result, we will discover it by what does the ranking ratio by Location
? We will perform the two chart: Pie chart for know the raking ratio distribution for each region, and Bar chat to discover the average ranking point that each region has gave out to the episode. Our purpose is:
## Aggregate data:
location = starwar.copy().groupby(starwar.columns[-1])
## Ranking ratio of each region:
sum_by_location = location.agg(np.sum)
to_piechart = sum_by_location['ranking_5']
## Average ranking point by each region:
mean_by_location = location.agg(np.mean)
to_barchart = mean_by_location['ranking_5']
def plot_the_pie_and_bar(data_to_piechart, data_to_barchart,
num_cols , jud_num, df,title_of_legend,titl_piechart,titl_xlabel,titl_barchart,
greater_or_less='greater'):
# Get the data:
piechart = list(data_to_piechart)
barchat = list(data_to_barchart)
# Define pce func for piechart:
def get_func(pct, data):
result = int(pct/100*np.sum(data))
return ('{:.1f}% \n {} votes').format(pct, result)
# Define wedge props:
wd = {'linewidth':1, 'edgecolor':'black'}
# Define labels:
key = list(starwar[starwar.columns[num_cols]].unique())
#Plot process:
#1. Define the frame:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,10))
#2. Plot the pie chart:
wedges, texts, autotexts = ax1.pie(x=piechart,
autopct= lambda pct:get_func(pct,piechart),
startangle= 90,
wedgeprops=wd)
ax1.legend(wedges, key, title='{}'.format(title_of_legend), loc=6, bbox_to_anchor=[-0.2, 0.8])
plt.setp(autotexts, size=12, weight='bold')
ax1.set_title('{}'.format(titl_piechart))
#3. Plot the bar chart:
if greater_or_less=='greater':
color_def = data_to_barchart>jud_num #To filter the region with ranking point greater than 2.5
color_def_fil = color_def.map({True:'orange', False:'grey'})
ax2.barh(y=np.arange(1,len(barchat)+1), width=barchat, align='center', color=color_def_fil)
else:
color_def_2 = data_to_barchart<jud_num
color_def_fil_2 = color_def_2.map({True:'green',False:'grey'})
ax2.barh(y=np.arange(1,len(barchat)+1), width=barchat, align='center', color=color_def_fil_2)
ax2.set_yticks(np.arange(1,len(barchat)+1))
ax2.set_yticklabels(key)
ax2.set_xlabel('{}'.format(titl_xlabel))
ax2.set_title('{}'.format(titl_barchart))
plt.show()
plot_the_pie_and_bar(to_piechart, to_barchart, -1, 2.5, starwar,
'Location', 'Ranking ratio of each region', 'Average ranking (points)',
'The average ranking point of each region', 'greater')
For the convient, the region with average ranking point greater than 2.5 had been marked as Orange color, and look at the result, we got:
Connect with the result of Age
and Gender
we have Mr. in age 30-44
love this episode. Is there their sign in the West South Central? Let's do the quick bar chat to see the average ranking they gave for this Episode.
## Aggregate data by Location: West South Central
west_south = location.get_group('West South Central')
##Filter by `Male` and give it the average points:
west_south_male = west_south[west_south['Gender']=='Male']
ranking_by_age_west_south_male = west_south_male.groupby('Age')[west_south_male.columns[9:15]]
mean_ranking = ranking_by_age_west_south_male.agg(np.mean)
## Plot the barchart:
color_bar = mean_ranking['ranking_5']<2
color_fill = color_bar.map({True:'Green', False:'Grey'})
mean_ranking['ranking_5'].plot.barh(color=color_fill)
plt.xlabel('Average points')
plt.ylabel('Age group')
plt.title('The average ranking point by\nMr. in West South Central region')
plt.show()
Sound like we've captured them, the fan of Eposide V StarWar is Mr. in Worth South Central in age of 30-44
, and sadly that's only them like this Episode and love it in fantasy way. Even in their region, and same gender, our Senior get it like urrgh normal, the rest is oh, it's good, but not in love too much.
Because we've captured the fan of this Episode, let's see what's their favour charachters. About the seen total for each episode, we will put it after the next analysis below, because we can get a invidual topic about this theme
CONCLUSION:
30-44
at West South Central
Somewhat Favourable
to Normal
, especially people whose age in range 45-60
and 18-29
, the group above 60
is equal finePacific
, but they don't feel very fantastic with this Episode.We can sure one thing: Only the fan of Episode V can get crazy for their characters, and now, let's get started with our Mr. attendance in West South Central region.
# Create the condition:
con_1 = starwar['Gender'] == 'Male'
con_3 = starwar['Age'] == '30-44'
total = con_1&con_3
#Filter the data:
start_1 = starwar.copy()[total]
start_2 = start_1[start_1[starwar.columns[-1]]=='West South Central']
compute = start_2[start_2.columns[15:29]]
# Compute the mean:
favour_1 = compute.mean()
#Plot the bar chart:
#Define color bar:
color_fil = favour_1<2
color_tab = color_fil.map({True:'green', False:'grey'})
favour_1.plot.barh(color = color_tab, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by fan of Episode V')
plt.show()
Among many character here, all of them is in degree Somewhat Favourable
to Normal
=> We still can't say about what's the most favourable characters in these group.
We're already find the taste of the fan Episode V, now let's take it back to their region, to see what's the favourite character. We are expecting the result is not different than these because all of the rest have feeling of 50:50, so they likely not focus on any character.
#Filter the data:
west_south_charac = west_south.loc[:,starwar.columns[15:29]]
mean_favou = west_south_charac.agg(np.mean)
#Plot the bar chart:
#Define color bar:
color_fil_2 = mean_favou<2.5
color_tab_2 = color_fil_2.map({True:'green', False:'grey'})
mean_favou.plot.barh(color = color_tab_2, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by West South Central Region')
plt.show()
We still got 7 familiar names like above, but this time, all of them is in degree Somewhat Favourable
. Let's check the optional item: All the Ms/ Mrs in West South Central Region, and finally we dig into all of data to confirm one thing:
Because the ranking for this film is not much as too much loving (above 2, closer to 2.5 points) => the favourite for each character if any is not higher, we can expect it as Somewhat Favourable
#Aggregate data by Ms/ Mrs:
west_south_female = west_south[west_south['Gender']=='Female']
# Compute the mean case 1: Ms/ Mrs in West South Central
female_favour_char = west_south_female[starwar.columns[15:29]].mean()
#Case 2: Get all the mean favour in data
cal_rec = starwar[starwar.columns[15:29]]
cal_mean = cal_rec.mean()
#Plot the bar chart for case 1
#Define color bar:
color_fil_3 = female_favour_char<2.5
color_tab_3 = color_fil_3.map({True:'green', False:'grey'})
female_favour_char.plot.barh(color = color_tab_3, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by all Ms/ Mrs \n in West South Central Region')
plt.show()
The result we see here is the same result in the graph above when we research for West South Central Region. Obvious that because so many people, contain our Ms/ Mrs not much interest in Eposide V, so the favourite ranking somewhat effect by this factor.
#Plot the bar chart for case 1
#Define color bar:
color_fil_4 = cal_mean<2
color_tab_4 = color_fil_4.map({True:'green', False:'grey'})
cal_mean.plot.barh(color = color_tab_4, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by all attendance')
plt.show()
For this finally test, the result was set to show only those have favourable raking less than 2 (closer to Somewhat Favourable
), and we got 2 characters: Han-Solo
and Yoda
. About Han-Solo
, along with Yoda
, we will find only those records with these two characters have favourable ranking less than 2 to see in what region, and which group of age, gender like these two character.
## Get the data only for characters got favourite point (less than 2.0)
favour_cha = starwar.copy()[(starwar['han_solo']<2)|(starwar['yoda']<2)]
## Aggregate data:
#1. By location:
cha_loc = favour_cha.groupby(starwar.columns[-1])
favou_loc = cha_loc[starwar.columns[15:29]].agg(np.mean)
sum_favour_loc = cha_loc[starwar.columns[15:29]].agg(np.sum)
#2. By Age and Gender:
cha_age = favour_cha.groupby(starwar['Age'])
cha_gender = favour_cha.groupby(starwar['Gender'])
# Prepare data:
#Case 1: For `Han-Solo`
to_piechart_case1 = sum_favour_loc['han_solo']
to_barchat_case1 = favou_loc['han_solo']
#Case 2: For 'Yoda':
to_piechart_2 = sum_favour_loc['yoda']
to_barchart_2 = favou_loc['yoda']
plot_the_pie_and_bar(to_piechart_case1, to_barchat_case1, -1, 1.2, starwar,
'Location', 'The favourite ranking for Han-Solo\n by each region',
'Average ranking (points)',
'The average favourute ranking point for Han-Solo of each region', 'less')
plot_the_pie_and_bar(to_piechart_2, to_barchart_2, -1, 1.2, starwar,
'Location', 'The favourite ranking for Yoda\n by each region',
'Average ranking (points)',
'The average favourute ranking point for Yoda of each region', 'less')
Before we get the conclusion, let's see how much records we have for the favourite of Han-Solo
and Yoda
, and notes that the result we got is after we filter only records with favourite ranking points for Han-Solo
and Yoda
is less than 2 points:
#Check the number of records after filter:
total = favour_cha.shape[0]
total_data = starwar.shape[0]
ratio = (total/total_data)*100
print('The number of records for Han-Solo and Yoda = {}'.format(total))
print('The prop votes for Han-Solo and Yoda = {:.1f}%'.format(ratio))
The number of records for Han-Solo and Yoda = 698 The prop votes for Han-Solo and Yoda = 67.5%
Let's imagine that in 100% our attendance, we've got 67,5% among them like Han-Solo
and Yoda
(it is True even to the Fan of Episode V). Now for the group of 67,5%, let's say that we will make it become a different group called Group love Han-Solo and Yoda
with the percentage is 100%. Let's get started:
Han-Solo
: this character seem to populate with more countries than Yoda, it contains: New England
, Mountain
, Pacific
,Middle Atlantic
,West North/West South Central
,South Atlantic
.Yoda
: We got these contries list love this character: East North Central
, Middle Atlantic
, West North/West South Central
, South Atlantic
.Pacific
is prefer Han-Solo
than Yoda
, and so on.Han-Solo
is from Pacific
region, and this is also the region love this characterYoda
is from Mountain
, but as not lucky as Han Solo
, the Mountain region have a little less prefer Yoda
. We also have Pacific
is the second one, but the result is same as Mountain
.We've got the list of which regions love which characters, now let's dig deeper into the representation regions by this rule: We will pick the regions which have mostly votes for the character, with the good ranking for the character. By this rule, we will pick Pacific
for Han-Solo
and South Atlantic
for Yoda
, to answer the question: In these region, how many people by Gender
and Age
love their character?
In the end of this pattern analysis, we will back to the fan of Episode V and see the bar chart for the further analysis.
# Prepare data for `Pacific`
pacific_cha_age = cha_age.apply(lambda g:g[g[g.columns[-1]]=='Pacific'])
pacific_cha_gen = cha_gender.apply(lambda y:y[y[y.columns[-1]]=='Pacific'])
## Compute the sum:
sum_pacific_cha_age = pacific_cha_age[starwar.columns[15:29]].reset_index().groupby('Age').agg(np.sum)
sum_pacific_cha_gender = pacific_cha_gen[starwar.columns[15:29]].reset_index().groupby('Gender').agg(np.sum)
## Compute the mean:
mean_pacific_cha_age = pacific_cha_age[starwar.columns[15:29]].reset_index().groupby('Age').agg(np.mean)
mean_pacific_cha_gen = pacific_cha_gen[starwar.columns[15:29]].reset_index().groupby('Gender').agg(np.mean)
#Prepare data for `South Atlantic`
south_cha_age = cha_age.apply(lambda z:z[z[z.columns[-1]]=='South Atlantic'])
south_cha_gen = cha_gender.apply(lambda j:j[j[j.columns[-1]]=='South Atlantic'])
##Compute the sum:
sum_south_cha_age = south_cha_age[starwar.columns[15:29]].reset_index().groupby('Age').agg(np.sum)
sum_south_cha_gen = south_cha_gen[starwar.columns[15:29]].reset_index().groupby('Gender').agg(np.sum)
##Compute the mean:
mean_south_cha_age = south_cha_age[starwar.columns[15:29]].reset_index().groupby('Age').agg(np.mean)
mean_south_cha_gen = south_cha_gen[starwar.columns[15:29]].reset_index().groupby('Gender').agg(np.mean)
#Prepare data to visualize:
#1. For Han Solo:
paci_pie_age = sum_pacific_cha_age['han_solo']
paci_pie_gen = sum_pacific_cha_gender['han_solo']
paci_bar_age = mean_pacific_cha_age['han_solo']
paci_bar_gen = mean_pacific_cha_gen['han_solo']
#2. For Yoda:
south_pie_age = sum_south_cha_age['yoda']
south_pie_gen = sum_south_cha_gen['yoda']
south_bar_age = mean_south_cha_age['yoda']
south_bar_gen = mean_south_cha_gen['yoda']
## Chart for Han Solo, Pacific, by Age
plot_the_pie_and_bar(paci_pie_age, paci_bar_age, -4,1.2, starwar,'Age',
'The distribution number of votes for Han-Solo by age\n at Pacific',
'Average ranking(points)','The average ranking points for Han-Solo by age\n at Pacific', 'less')
# Chart for Han Solo, Pacific, by Gender
plot_the_pie_and_bar(paci_pie_gen, paci_bar_gen, -5,1.2, starwar,'Gender',
'The distribution number of votes for Han-Solo by gender\n at Pacific',
'Average ranking(points)','The average ranking points for Han-Solo by gender\n at Pacific', 'less')
Note that all the result here is less than 2 points, but in the specific group, we will compare the detail points (less than 1.2 points) to release what's the true fan of character (most love for the character). With that said, let's look at the result, we got:
Age
: While the ranking points as closer to 1 points as loving the character, we have 59 votes from people above 45 years old => Our older people seems to love Han-Solo
than young people.Gender
: In the average points result, our Ms./ Mrs. seems loving Han-Solo
more than our Mr.. In other words, in the Group Love Han-Solo
in Pacific, the survey have Men as attendance more than Women, but the ranking points of Women is better than Men.Now, it's time to see the result at the side of Yoda
# Chart for `Yoda`, South Atlantic, by Age
plot_the_pie_and_bar(south_pie_age, south_bar_age, -4,1.2, starwar,'Age',
'The distribution number of votes for Yoda by age\n at South Atlantic',
'Average ranking(points)','The average ranking points for Yoda by age\n at South Atlantic', 'less')
# Chart for `Yoda`, South Atlantic, by Gender
plot_the_pie_and_bar(south_pie_gen, south_bar_gen, -5,1.2, starwar,'Gender',
'The distribution number of votes for Yoda by gender\n at South Atlantic',
'Average ranking(points)','The average ranking points for Yoda by gender\n at South Atlantic', 'less')
We are looking at the completely opposite result for Yoda
character:
30-44
loving so much Yoda
, but their votes just 18,7% from 100% votes for Yoda
that less than 2 points in their region. Especially, the group ages 45-60
have the mostly ranking votes ratio, but they ranking Yoda
as 1.6 points (closer to Somewhat Favourable
)Genre
ranking Yoda
as their mostly favourable character=> The character Yoda
though get the overall points less than 2 points, but in the actually we can considering Yoda
as Somewhat Favourable
, depend in the actually result we have
*Note: The overall points Pacific
vote for Han-Solo
not as better than South Atlantic
vote for Yoda
, and there is no bias sample (pick-up the side with better point from the first). So that, the different between favourable raking points of Han Solo
and Yoda
isn't the attemp to take down this character or vote up for that character.
Now, let's back to the fan of Episode V. We know that someone who had ranked for Han-Solo
more than 2 points, so that the average overall is up over 2 points, let's see the bar chart to release this:
## Call back the data:
to_barchart_check = compute['han_solo']
## See the overview statistic:
to_barchart_check.describe()
count 10.000000 mean 2.300000 std 1.337494 min 1.000000 25% 1.000000 50% 2.000000 75% 3.750000 max 4.000000 Name: han_solo, dtype: float64
## See how many people with their ranking vote:
to_barchart_check.value_counts().sort_index(ascending=False)
4.0 3 3.0 1 2.0 2 1.0 4 Name: han_solo, dtype: int64
#Plot bar chart:
result = to_barchart_check.value_counts().sort_index(ascending=False)
result.plot.barh()
plt.xticks(ticks=np.arange(0,10,1),label=[0, 1, 2, 3, 4])
plt.xlim([0, 4])
plt.xlabel('The number of fan with their ranking points')
plt.ylabel('The ranking points')
plt.title('The distribution of ranking points\n fan votes for Han-Solo')
plt.legend()
plt.show()
From the statistic distribution for ranking point of Han-Solo
result, we got 50% votes less than 2 points, and the remain is greater than 2 points. This is match for overall data with 67,5% like Han-Solo
and Yoda
among 100% survey data.
CONCLUSION:
Han-Solo
, especially from Ms./Mrs. and people whose age above 45 years oldHan-Solo
but there are 50% is normal => A half of Episode's fan like Han-Solo
Unfamiliar
, and it is normal.Firstly, let's draw a simple Piechart to see the overview distribution of seen votes for each episode:
#Get data:
seen_result = starwar[starwar.columns[3:9]]
sum_seen = seen_result.sum()
#Plot the pie:
#Define data:
to_piechart = list(sum_seen)
#Define autopct:
def get_pct(pct, data):
result = int(pct/100*np.sum(data))
return ('{:.1f}% \n {} votes'.format(pct, result))
#Define key:
key = ['Episode_{}'.format(i) for i in range(1,7)]
#Define the wedges:
wd = {'linewidth' : 1, 'edgecolor':'black'}
#Plot:
fig, ax = plt.subplots(figsize=(8,9))
wedges, texts, autotexts = ax.pie(x = to_piechart,
autopct = lambda pct:get_pct(pct, to_piechart),
wedgeprops = wd,
startangle=90)
ax.legend(wedges, key, title='Episode', loc=6, bbox_to_anchor=[1.0, 0.75])
plt.setp(autotexts, size=12, weight='bold')
ax.set_title('The ratio of seen votes from attendance')
plt.show()
Not far as expect, Episode V: The Empire Strikes Back is the expisode get the most votes as have seen by attendance. Now we have some question to digger in:
Male
and Female
, what gender is mostly seen this episode?For the task above, we will:
Age
and filter by Gender
, take the sum votes of two group and make a Pie chart to see the overview.First, we will answer the first question: Is the region with mostly votes for have seen will be the region with the worst ranking points?
## Prepare data for pie chart:
votes_piechart = sum_by_location['seen_5']
#To plot:
plot_the_pie_and_bar(votes_piechart, to_barchart,-1, 2.5, starwar,
'Location', 'The seen votes of each region', 'Average ranking (points)',
'The average ranking point of each region', 'greater')
The result is fairly to know:
South Atlantic
, and luckily, though the average ranking points isn't perfect but not over than 2.5 points => this region still have some favourable for this EpisodeMountain
, and they also have average ranking points less than 2.5 points like South Atlantic
.Now, move to the next question: What's Age
range, or what's Gender
mostly seen this episode, and what's the ranking they gave for this episode?
#Prepare data for graph
#1. For Male:
seen_by_male = age_male[age_male.columns[3:9]]
# Input sum seen of each age group:
sum_seen_by_male = seen_by_male.reset_index().groupby('Age').agg(np.sum)
seen_votes_pie = sum_seen_by_male['seen_5']
#2. For Female:
seen_by_female = age_female[age_female.columns[3:9]]
# Input sum seen of each age group:
sum_seen_by_female = seen_by_female.reset_index().groupby('Age').agg(np.sum)
seen_votes_pie_female = sum_seen_by_female['seen_5']
##Plot the data for Male:
plot_the_pie_and_bar(seen_votes_pie, mean_1,-4, 2.5, starwar,
'Age', 'The seen votes by Age\n Gender:Male', 'Average ranking (points)',
'The average ranking point of each age group\n Gender: Male', 'less')
##Plot the data for Female:
plot_the_pie_and_bar(seen_votes_pie_female, mean_2,-4, 2.5, starwar,
'Age', 'The seen votes by Age\n Gender:Female', 'Average ranking (points)',
'The average ranking point of each age group\n Gender:Female', 'less')
Combine with the result depend on each age group, divide by Gender
(named: Ranking point by each gender in each age group\n for StarWar Episode V
), we get these observe:
Male
group: We can confirm again our Mr. in age of 30-44
with good ranking points is the mostly group who have seen the Episode V. In addition, our Senior People is the second one, but the ranking points isn't good as expect.Female
group: The mostly votes from our Mrs above 60 years old, with the good ranking points, the second position have two range: 30-44
and 45-60
, but, the answer for question: Who's ranking good points for Episode V? => It's group age of 30-44
.## Prepare data for Overall Gender:
gender_check = starwar.copy().groupby('Gender')
sum_seen = gender_check.agg(np.sum)
mean_seen = gender_check.agg(np.mean)
## To plot:
sum_seen_pie = sum_seen['seen_5']
mean_seen_bar = mean_seen['ranking_5']
#Plot:
plot_the_pie_and_bar(sum_seen_pie, mean_seen_bar,-5, 2.5, starwar,
'Gender', 'The seen votes by Gender',
'Average ranking (points)',
'The average ranking point of each Gender', 'less')
Now, it's time to go back to the fan of Episode V: All of them have seen the episode, OR, they heard from anyone that this Episode so good, and ranking it with good points??
But invidually, I think that the answer is right here: The group of Mr. whose age in 30-44
is the mostly group votes that have seen the Episode (Chart called: The seen votes by Age\n Gender:Male
), and what's the fan of Episode V? It's them
For the detail, read the conclusion of About the ranking of each episode
CONCLUSION:
30-44
and for Men, and above 60
for WomenSouth Atlantic
is the region that got mostly number of people have seen this episode, in other words, Episode V is popular at South Atlantic
We've went through all the main part of survey, now we're going to dive the funny part of survey: Member of fansite, fanchart... First of all, let's check the missing value at these field: Do you consider yourself to be a fan of the Star Wars film franchise?
, Do you consider yourself to be a fan of the Expanded Universe?ξ
, Do you consider yourself to be a fan of the Star Trek franchise?
#Summary the missing value of these field:
num_co = [2, -6, -7]
col = starwar.columns[num_co]
statsic_missing = {}
for item in col:
statsic_missing[item] = starwar[item].isnull().sum()
statsic_missing
{'Do you consider yourself to be a fan of the Star Wars film franchise?': 220, 'Do you consider yourself to be a fan of the Star Trek franchise?': 0, 'Do you consider yourself to be a fan of the Expanded Universe?ξ': 824}
We'll checking carefully at 2 fields have missing value:
# Store the unique value:
for_df = {}
num_col_2 = [2, -7, 30]
col_2 = starwar.columns[num_col_2]
for name in col_2:
for_df[name] = starwar[name].value_counts(dropna=False).sort_index(ascending=True)
df = pd.DataFrame(data = for_df)
df
Do you consider yourself to be a fan of the Star Wars film franchise? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Are you familiar with the Expanded Universe? | |
---|---|---|---|
False | 275.0 | NaN | NaN |
True | 539.0 | NaN | NaN |
NaN | 220.0 | 824.0 | 220.0 |
No | NaN | 112.0 | 604.0 |
Yes | NaN | 98.0 | 210.0 |
Now we know what's the value inside each field, we do:
Yes
as True and No
as FalseDo you consider.. Star Wars film franchise?
to be star_wars_member
, Do you consider... Star Trek franchise?
to be star_trek_member
Are you familiar...
to be know_expanded_ver
, the remain is expanded_universe_mem
.#Convert value:
convert_col_num = [30, -7]
convert_cols = starwar.columns[convert_col_num]
for cols in convert_cols:
starwar[cols] = starwar[cols].map(yes_no, na_action='ignore')
starwar[starwar.columns[-7]].value_counts(dropna=False)
NaN 824 False 112 True 98 Name: Do you consider yourself to be a fan of the Expanded Universe?ξ, dtype: int64
##Fill missing value:
#1. For field have less 220 value:
col_mis_num = [2, 30]
col_mis = starwar.columns[col_mis_num]
modify = starwar[col_mis].fillna(False, axis=1)
starwar.loc[:,col_mis] = modify
#Check:
starwar[col_mis].notnull().sum()
Do you consider yourself to be a fan of the Star Wars film franchise? 1034 Are you familiar with the Expanded Universe? 1034 dtype: int64
#2. For field greater than 220:
col_mis_2 = starwar.columns[-7]
starwar[col_mis_2].fillna(False, inplace=True)
starwar[col_mis_2].notnull().sum()
1034
## Check the final:
for_df_new = {}
for name in col_2:
for_df_new[name] = starwar[name].value_counts(dropna=False).sort_index(ascending=True)
df2 = pd.DataFrame(data = for_df_new)
df2
Do you consider yourself to be a fan of the Star Wars film franchise? | Do you consider yourself to be a fan of the Expanded Universe?ξ | Are you familiar with the Expanded Universe? | |
---|---|---|---|
False | 495 | 936 | 824 |
True | 539 | 98 | 210 |
#Check the last columns:
starwar[starwar.columns[-6]].value_counts(dropna=False)
No 611 Yes 423 Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
#Convert:
col_fi = starwar.columns[-6]
starwar.loc[:,col_fi] = starwar[col_fi].map(yes_no)
#Check:
starwar[starwar.columns[-6]].value_counts(dropna=False)
False 611 True 423 Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
#Rename the columns:
col_na_num = [2, -7, -6, 30]
col_na = starwar.columns[col_na_num]
name_rep = ['star_wars_member','expanded_universe_mem','star_trek_member','know_expanded_ver']
map_rep = {}
for nw, od in zip(name_rep, col_na):
map_rep[od] = nw
starwar.rename(map_rep, axis=1, inplace=True)
starwar.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'star_wars_member', 'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6', 'han_solo', 'luke_skywalker', 'princess_leia_organa', 'anakin_skywalker', 'obi_wan_kenobi', 'emperor_palpatine', 'darth_vader', 'lando_calrissian', 'boba_fett', 'c-3p0', 'r2_d2', 'jar_jar_binks', 'padme_amidala', 'yoda', 'Which character shot first?', 'know_expanded_ver', 'expanded_universe_mem', 'star_trek_member', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
#Check the final result value
for_df_final = {}
for name in name_rep:
for_df_final[name] = starwar[name].value_counts(dropna=False).sort_index(ascending=True)
df3 = pd.DataFrame(data = for_df_final)
df3
star_wars_member | expanded_universe_mem | star_trek_member | know_expanded_ver | |
---|---|---|---|---|
False | 495 | 936 | 611 | 824 |
True | 539 | 98 | 423 | 210 |
The data was cleaned, now let's consider about the analytic. There are some way to analysis:
expanded_universe_mem
seems to low, no one like it? Or it have not get a properly advertise way to everyone?Let's move to question 1: The expanded_universe_mem
seems to low, no one like it? Or it have not get a properly advertise way to everyone?
#Aggregate data :
fan_age = starwar.copy().groupby('Age').agg(np.sum)
gender_group = starwar.copy().groupby('Gender')
fan_gender = gender_group.agg(np.sum)
#To plot:
expand_age_1 = fan_age['expanded_universe_mem']
expand_age_2 = fan_age['know_expanded_ver']
expand_gender_1 = fan_gender['expanded_universe_mem']
expand_gender_2 = fan_gender['know_expanded_ver']
#To bar chart:
plt.figure(figsize=(30,20))
plt.subplot(2,2,1)
plt.barh(y=list(starwar['Age'].unique()),width=expand_age_2, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Age')
plt.title('The popular degree of\n Expanded Universe bye Age')
plt.subplot(2,2,2)
plt.barh(y=list(starwar['Age'].unique()),width=expand_age_1, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Age')
plt.title('The number of member on Expanded Universe')
plt.subplot(2,2,3)
plt.barh(y=list(starwar['Gender'].unique()),width=expand_gender_2, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Gender')
plt.title('The popular degree of\n Expanded Universe by Gender')
plt.subplot(2,2,4)
plt.barh(y=list(starwar['Gender'].unique()),width=expand_gender_1, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Gender')
plt.title('The number of member on Expanded Universe')
plt.show()
In the graph we observe:
Expanded Universe
is popular with Female
than Male
, and popular with age by above 60
+ 18-29
, but with the number is smaller than overall records (150 vs 1034 records)Expanded Universe
mostly Female
and people with age below 30 years oldThe countracy thing is, all of people in age of 18-29
is voted for Episode V with not good points but they're good about activity like Fansign, Fanchart, and Expanded Universe
is one of the example.
So, we know that the scale (and the popular) of this fansign is small (for the less number who know it, and the less member active in). Let's see have they joined in other fansign or have another reason??
## Prepare data for 'StarWar' fansign:
stw_age_1 = fan_age['star_wars_member']
stw_age_2 = fan_age['star_trek_member']
stw_gender_1 = fan_gender['star_wars_member']
stw_gender_2 = fan_gender['star_trek_member']
#Plot bar chart
plt.figure(figsize=(30,20))
plt.subplot(2,2,1)
plt.barh(y=list(starwar['Age'].unique()),width=stw_age_1, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Age')
plt.title('The number of member on StarWar fansign by Age')
plt.subplot(2,2,2)
plt.barh(y=list(starwar['Gender'].unique()),width=stw_gender_1, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Gender')
plt.title('The number of member on StarWar fansign by Gender')
plt.subplot(2,2,3)
plt.barh(y=list(starwar['Age'].unique()),width=stw_age_2, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Age')
plt.title('The number of member on StarTrek fansign by Age')
plt.subplot(2,2,4)
plt.barh(y=list(starwar['Gender'].unique()),width=stw_gender_2, height=0.4)
plt.xlabel('Number of people')
plt.ylabel('Gender')
plt.title('The number of member on StarTrek fansign by Gender')
plt.show()
We can see the larger scale of StarWar
and StarTrek
fan-sign than Expanded Universe
, describe 1 pattern for the question: The expanded_universe_mem seems to low, no one like it? Or it have not get a properly advertise way to everyone? => It's because almost people have join in StarWar
and StarTrek
franchise's fansign, so they have no tend to join on other team:
Female
, for the both of franchise's fansign.above 60
CONCLUSION:
Female
is the mostly member in every franchise's fansign, and depend on which franchise, have mostly people with certain age range: Expanded Universe
is people in age range 18-29
, both remain is above 60
Expand Universe
have small scale is almost people have joined in StarWar
and StarTrek
franchise fansign, especially the strongest fansign is StarWar
fansign => they have no interest to join on other team (except people in 18-29
age range, they want to find the new thing)Pacific
, and they don't taste it very well (average ranking points over 2.5 points => so many people ranking it 3 and 4 points, even 5 points)South Atlantic
and Mountain
, they're ranking in fairly good pointsMale
group with age range of 30-44
in West South Central
ranking it with actually good points => They can be considered to be the fan of StarWar Episode V18-29
though not ranking with good points for this Episode, but they're so actively in activity like Fansign, fanchart with the Expanded Universe
franchise's fansign.Han-Solo
, but Female
attendance seem to love this character than Male
Yoda
is ranked as the second one, but in actually result, this character is Somewhat Favourable
with everyoneNormal
or Unfamiliar
with the attendanceStarWar
is the largest franchise fansign, the smallest is Expanded Universe
franchise fansignStarWar
fansign is largest, the Expanded Universe
franchise fansign have not received notice much more as StarWar
.Female
is the member of these franchise fansignStarWar
and StarTrek
, people with age range in 30-44
is the second one after Senior attendance likely be the fan of StarWar
(contains the fan of Episode V that mentioned above).