While waiting for Star Wars: The Force Awakens to come out in 2015, FiveThirtyEight surveyed Star Wars fans using the online tool SurveyMonkey and collected 835 responses
in order to answer some questions about them. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
You can find the data here.
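The data has several columns, including:
* RespondentID
* Have you seen any of the 6 films in the Star Wars franchise?
* Do you consider yourself to be a fan of the Star Wars film franchise?
* Gender
* Age
* Household Income
* Education
* Location (Census Region)
We will clean and explore the data.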
# Basic Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
%matplotlib inline
# create read_file() func to read data into pandas dataframe
def read_file(file_path, encoding="utf-8"):
    df = pd.read_csv(file_path, encoding=encoding)
    return df
# use read_file() func to read "star_wars.csv" file,
# encoding="ISO-8859-1", assign it to star_wars
star_wars=read_file("star_wars.csv","ISO-8859-1")
# create function to explore data by printing the first rows_num rows
def explore_data(df, rows_num):
    print(df.head(rows_num))
# use explore_data to print top 10 rows
explore_data(star_wars,10)
RespondentID Have you seen any of the 6 films in the Star Wars franchise? \ 0 NaN Response 1 3.292880e+09 Yes 2 3.292880e+09 No 3 3.292765e+09 Yes 4 3.292763e+09 Yes 5 3.292731e+09 Yes 6 3.292719e+09 Yes 7 3.292685e+09 Yes 8 3.292664e+09 Yes 9 3.292654e+09 Yes Do you consider yourself to be a fan of the Star Wars film franchise? \ 0 Response 1 Yes 2 NaN 3 No 4 Yes 5 Yes 6 Yes 7 Yes 8 Yes 9 Yes Which of the following Star Wars films have you seen? Please select all that apply. \ 0 Star Wars: Episode I The Phantom Menace 1 Star Wars: Episode I The Phantom Menace 2 NaN 3 Star Wars: Episode I The Phantom Menace 4 Star Wars: Episode I The Phantom Menace 5 Star Wars: Episode I The Phantom Menace 6 Star Wars: Episode I The Phantom Menace 7 Star Wars: Episode I The Phantom Menace 8 Star Wars: Episode I The Phantom Menace 9 Star Wars: Episode I The Phantom Menace Unnamed: 4 \ 0 Star Wars: Episode II Attack of the Clones 1 Star Wars: Episode II Attack of the Clones 2 NaN 3 Star Wars: Episode II Attack of the Clones 4 Star Wars: Episode II Attack of the Clones 5 Star Wars: Episode II Attack of the Clones 6 Star Wars: Episode II Attack of the Clones 7 Star Wars: Episode II Attack of the Clones 8 Star Wars: Episode II Attack of the Clones 9 Star Wars: Episode II Attack of the Clones Unnamed: 5 \ 0 Star Wars: Episode III Revenge of the Sith 1 Star Wars: Episode III Revenge of the Sith 2 NaN 3 Star Wars: Episode III Revenge of the Sith 4 Star Wars: Episode III Revenge of the Sith 5 Star Wars: Episode III Revenge of the Sith 6 Star Wars: Episode III Revenge of the Sith 7 Star Wars: Episode III Revenge of the Sith 8 Star Wars: Episode III Revenge of the Sith 9 Star Wars: Episode III Revenge of the Sith Unnamed: 6 \ 0 Star Wars: Episode IV A New Hope 1 Star Wars: Episode IV A New Hope 2 NaN 3 NaN 4 Star Wars: Episode IV A New Hope 5 Star Wars: Episode IV A New Hope 6 Star Wars: Episode IV A New Hope 7 Star Wars: Episode IV A New Hope 8 Star Wars: Episode IV A New Hope 9 Star Wars: 
Episode IV A New Hope Unnamed: 7 \ 0 Star Wars: Episode V The Empire Strikes Back 1 Star Wars: Episode V The Empire Strikes Back 2 NaN 3 NaN 4 Star Wars: Episode V The Empire Strikes Back 5 Star Wars: Episode V The Empire Strikes Back 6 Star Wars: Episode V The Empire Strikes Back 7 Star Wars: Episode V The Empire Strikes Back 8 Star Wars: Episode V The Empire Strikes Back 9 Star Wars: Episode V The Empire Strikes Back Unnamed: 8 \ 0 Star Wars: Episode VI Return of the Jedi 1 Star Wars: Episode VI Return of the Jedi 2 NaN 3 NaN 4 Star Wars: Episode VI Return of the Jedi 5 Star Wars: Episode VI Return of the Jedi 6 Star Wars: Episode VI Return of the Jedi 7 Star Wars: Episode VI Return of the Jedi 8 Star Wars: Episode VI Return of the Jedi 9 Star Wars: Episode VI Return of the Jedi Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. \ 0 Star Wars: Episode I The Phantom Menace 1 3 2 NaN 3 1 4 5 5 5 6 1 7 6 8 4 9 5 ... Unnamed: 28 \ 0 ... Yoda 1 ... Very favorably 2 ... NaN 3 ... Unfamiliar (N/A) 4 ... Very favorably 5 ... Somewhat favorably 6 ... Very favorably 7 ... Very favorably 8 ... Very favorably 9 ... Somewhat favorably Which character shot first? \ 0 Response 1 I don't understand this question 2 NaN 3 I don't understand this question 4 I don't understand this question 5 Greedo 6 Han 7 Han 8 Han 9 Han Are you familiar with the Expanded Universe? \ 0 Response 1 Yes 2 NaN 3 No 4 No 5 Yes 6 Yes 7 Yes 8 No 9 No Do you consider yourself to be a fan of the Expanded Universe?Âæ \ 0 Response 1 No 2 NaN 3 NaN 4 NaN 5 No 6 No 7 No 8 NaN 9 NaN Do you consider yourself to be a fan of the Star Trek franchise? 
Gender \ 0 Response Response 1 No Male 2 Yes Male 3 No Male 4 Yes Male 5 No Male 6 Yes Male 7 No Male 8 Yes Male 9 No Male Age Household Income Education \ 0 Response Response Response 1 18-29 NaN High school degree 2 18-29 $0 - $24,999 Bachelor degree 3 18-29 $0 - $24,999 High school degree 4 18-29 $100,000 - $149,999 Some college or Associate degree 5 18-29 $100,000 - $149,999 Some college or Associate degree 6 18-29 $25,000 - $49,999 Bachelor degree 7 18-29 NaN High school degree 8 18-29 NaN High school degree 9 18-29 $0 - $24,999 Some college or Associate degree Location (Census Region) 0 Response 1 South Atlantic 2 West South Central 3 West North Central 4 West North Central 5 West North Central 6 Middle Atlantic 7 East North Central 8 South Atlantic 9 South Atlantic [10 rows x 38 columns]
# review columns
print(star_wars.columns)
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
# create function to check df null values
def check_null(df):
    print(df.isnull().sum())
# use check_null to check star_wars null values
check_null(star_wars)
RespondentID 1 Have you seen any of the 6 films in the Star Wars franchise? 0 Do you consider yourself to be a fan of the Star Wars film franchise? 350 Which of the following Star Wars films have you seen? Please select all that apply. 513 Unnamed: 4 615 Unnamed: 5 636 Unnamed: 6 579 Unnamed: 7 428 Unnamed: 8 448 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. 351 Unnamed: 10 350 Unnamed: 11 351 Unnamed: 12 350 Unnamed: 13 350 Unnamed: 14 350 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. 357 Unnamed: 16 355 Unnamed: 17 355 Unnamed: 18 363 Unnamed: 19 361 Unnamed: 20 372 Unnamed: 21 360 Unnamed: 22 366 Unnamed: 23 374 Unnamed: 24 359 Unnamed: 25 356 Unnamed: 26 365 Unnamed: 27 372 Unnamed: 28 360 Which character shot first? 358 Are you familiar with the Expanded Universe? 358 Do you consider yourself to be a fan of the Expanded Universe?Âæ 973 Do you consider yourself to be a fan of the Star Trek franchise? 118 Gender 140 Age 140 Household Income 328 Education 150 Location (Census Region) 143 dtype: int64
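# explore the DataFrame's structure and summary statistics
def explore_df(df):
    print("Data Frame info")
    df.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print("\nDescribing Data")
    print(df.describe())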
explore_df(star_wars)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1187 entries, 0 to 1186 Data columns (total 38 columns): RespondentID 1186 non-null float64 Have you seen any of the 6 films in the Star Wars franchise? 1187 non-null object Do you consider yourself to be a fan of the Star Wars film franchise? 837 non-null object Which of the following Star Wars films have you seen? Please select all that apply. 674 non-null object Unnamed: 4 572 non-null object Unnamed: 5 551 non-null object Unnamed: 6 608 non-null object Unnamed: 7 759 non-null object Unnamed: 8 739 non-null object Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. 836 non-null object Unnamed: 10 837 non-null object Unnamed: 11 836 non-null object Unnamed: 12 837 non-null object Unnamed: 13 837 non-null object Unnamed: 14 837 non-null object Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. 830 non-null object Unnamed: 16 832 non-null object Unnamed: 17 832 non-null object Unnamed: 18 824 non-null object Unnamed: 19 826 non-null object Unnamed: 20 815 non-null object Unnamed: 21 827 non-null object Unnamed: 22 821 non-null object Unnamed: 23 813 non-null object Unnamed: 24 828 non-null object Unnamed: 25 831 non-null object Unnamed: 26 822 non-null object Unnamed: 27 815 non-null object Unnamed: 28 827 non-null object Which character shot first? 829 non-null object Are you familiar with the Expanded Universe? 829 non-null object Do you consider yourself to be a fan of the Expanded Universe?Âæ 214 non-null object Do you consider yourself to be a fan of the Star Trek franchise? 
1069 non-null object Gender 1047 non-null object Age 1047 non-null object Household Income 859 non-null object Education 1037 non-null object Location (Census Region) 1044 non-null object dtypes: float64(1), object(37) memory usage: 352.5+ KB Data Frame info None Describing Data RespondentID count 1.186000e+03 mean 3.290128e+09 std 1.055639e+06 min 3.288373e+09 25% 3.289451e+09 50% 3.290147e+09 75% 3.290814e+09 max 3.292880e+09
# remove rows where RespondentID is NaN
star_wars=star_wars[pd.notnull(star_wars["RespondentID"])]
# use check_null() func to check star_wars null values
check_null(star_wars)
RespondentID 0 Have you seen any of the 6 films in the Star Wars franchise? 0 Do you consider yourself to be a fan of the Star Wars film franchise? 350 Which of the following Star Wars films have you seen? Please select all that apply. 513 Unnamed: 4 615 Unnamed: 5 636 Unnamed: 6 579 Unnamed: 7 428 Unnamed: 8 448 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. 351 Unnamed: 10 350 Unnamed: 11 351 Unnamed: 12 350 Unnamed: 13 350 Unnamed: 14 350 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. 357 Unnamed: 16 355 Unnamed: 17 355 Unnamed: 18 363 Unnamed: 19 361 Unnamed: 20 372 Unnamed: 21 360 Unnamed: 22 366 Unnamed: 23 374 Unnamed: 24 359 Unnamed: 25 356 Unnamed: 26 365 Unnamed: 27 372 Unnamed: 28 360 Which character shot first? 358 Are you familiar with the Expanded Universe? 358 Do you consider yourself to be a fan of the Expanded Universe?Âæ 973 Do you consider yourself to be a fan of the Star Trek franchise? 118 Gender 140 Age 140 Household Income 328 Education 150 Location (Census Region) 143 dtype: int64
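As a minimal sketch of what check_null() and the RespondentID filter are doing, the toy frame below counts nulls per column and then keeps only the rows that have an ID. The `toy` frame and its `answer` column are made-up stand-ins for illustration, not the survey data.

```python
import numpy as np
import pandas as pd

# Toy frame: the first row mimics the extra header row that has no RespondentID.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 1.0, 2.0],
    "answer": ["Response", "Yes", np.nan],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)

# A boolean mask keeps only the rows whose RespondentID is present.
cleaned = toy[toy["RespondentID"].notnull()]
print(len(cleaned))
```

Note that filtering on RespondentID only removes the header-like row; nulls in the other columns are left in place, which matches the counts printed above.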
# create function to check column data
def explore_uni_data(df, column_name):
    print(df[column_name].value_counts(dropna=False))
# check column[Have you seen any of the 6 films in the Star Wars franchise?] data
explore_uni_data(star_wars,"Have you seen any of the 6 films in the Star Wars franchise?")
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
# check column[Do you consider yourself to be a fan of the Star Wars film franchise?]
explore_uni_data(star_wars,"Do you consider yourself to be a fan of the Star Wars film franchise?")
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
We will convert both columns to Boolean values (True, False) so they are easier to work with. Where a respondent did not answer, the value stays NaN.
# create yes_no mapping dic
yes_no={"Yes":True,"No":False}
# create convert_data() function: map a column's values through mapping_dic
def convert_data(df, column, mapping_dic):
    # return the mapped column and let the caller assign it back
    return df[column].map(mapping_dic)
# use convert_data() func to convert Have you seen any of the 6 films in the Star Wars franchise? column
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"]=convert_data(star_wars,"Have you seen any of the 6 films in the Star Wars franchise?",yes_no)
print(star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].head())
1     True
2    False
3     True
4     True
5     True
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: bool
# use convert_data() func to convert Do you consider yourself to be a fan of the Star Wars film franchise? column
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"]=convert_data(star_wars,"Do you consider yourself to be a fan of the Star Wars film franchise?",yes_no)
print(star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].head())
1     True
2      NaN
3    False
4     True
5     True
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: object
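The difference between the two outputs above (bool versus object) comes from how Series.map treats values that are not in the dictionary. Here is a small sketch on toy answers; the `answers` Series is an assumption for illustration, not the survey column.

```python
import numpy as np
import pandas as pd

# Toy answers standing in for a Yes/No survey column (not the real data).
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Values found in the dict are replaced; anything else, including NaN, becomes NaN.
yes_no = {"Yes": True, "No": False}
converted = answers.map(yes_no)

print(converted.tolist())
print(converted.isna().sum())  # number of unanswered rows that stayed NaN
```

Because True/False and NaN end up in the same column, pandas cannot store it as a plain bool column, which is why the second survey column prints as object.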
Columns [3:9] represent a single checkbox question: the respondent checked off a series of boxes to indicate which films they had seen.
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
# create a dict that maps each raw value to a Boolean
movie_mapping = {
    "Star Wars: Episode I The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}
# iterate over the columns and map their values using the convert_data() func
for col in star_wars.columns[3:9]:
    star_wars[col] = convert_data(star_wars, col, movie_mapping)
# check columns data after converting
print(star_wars.iloc[:,3:9])
Which of the following Star Wars films have you seen? Please select all that apply. \ 1 True 2 False 3 True 4 True 5 True 6 True 7 True 8 True 9 True 10 False 11 False 12 False 13 True 14 True 15 True 16 True 17 False 18 True 19 True 20 True 21 True 22 True 23 True 24 True 25 True 26 False 27 True 28 True 29 True 30 True ... ... 1157 True 1158 False 1159 False 1160 False 1161 True 1162 True 1163 True 1164 False 1165 True 1166 True 1167 True 1168 True 1169 False 1170 True 1171 False 1172 True 1173 False 1174 True 1175 True 1176 True 1177 True 1178 True 1179 False 1180 False 1181 True 1182 True 1183 True 1184 False 1185 True 1186 True Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 1 True True True True True 2 False False False False False 3 True True False False False 4 True True True True True 5 True True True True True 6 True True True True True 7 True True True True True 8 True True True True True 9 True True True True True 10 True False False False False 11 False False False False False 12 False False False False False 13 True True True True True 14 True True True True True 15 True True True True True 16 True True True True True 17 False False True False False 18 True True False False True 19 True True True True True 20 True True True True True 21 True True True True True 22 True True True True False 23 True True True True True 24 True True True True True 25 True True True True True 26 False False False False False 27 True True True True True 28 True True True True True 29 True True True True True 30 True True True True True ... ... ... ... ... ... 
1157 True True True True True 1158 False False False False False 1159 False False False True True 1160 False False False False False 1161 False True False False True 1162 True True False True True 1163 False False True True True 1164 False False True True True 1165 True True False True True 1166 True True True True True 1167 True True True True True 1168 False False False True True 1169 False False False False False 1170 False False True True True 1171 False False False False False 1172 True True True True True 1173 False False True True True 1174 True True True True True 1175 True True False False False 1176 True True True True True 1177 True False False False False 1178 True True False True True 1179 False False False False False 1180 False False False True True 1181 True True True True True 1182 True True True True True 1183 True True True True True 1184 False False False False False 1185 True True True True True 1186 True False False True True [1186 rows x 6 columns]
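Since NaN is being treated as "did not see the movie", the same True/False result can also be reached without listing every title. The sketch below uses notna() on a toy frame; the column names `film_a` and `film_b` are hypothetical, and this is an alternative shown for comparison, not the approach used above.

```python
import numpy as np
import pandas as pd

# Toy checkbox columns (hypothetical names): a title means "seen", NaN means "not seen".
toy = pd.DataFrame({
    "film_a": ["Film A", np.nan, "Film A"],
    "film_b": [np.nan, np.nan, "Film B"],
})

# notna() is True wherever a title is present, False wherever the cell is NaN.
seen = toy.notna()
print(seen)

# True counts as 1, so summing a column gives the number of viewers of that film.
viewers = seen.sum()
print(viewers)
```

The explicit mapping dictionary has one advantage: an unexpected value in a column maps to NaN and stands out, whereas notna() would quietly count it as seen.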
# rename columns [3:9] so the names relate to the question
rename_columns = {}  # maps each old column name to its new one
x = 0
# iterate over the columns to fill the rename_columns dict
for col in star_wars.columns[3:9]:
    rename_columns[col] = "seen_{}".format(x + 1)
    x += 1
# rename columns using rename function
star_wars=star_wars.rename(columns=rename_columns)
# use explore_data() to check data & column name after update
explore_data(star_wars,5)
RespondentID Have you seen any of the 6 films in the Star Wars franchise? \ 1 3.292880e+09 True 2 3.292880e+09 False 3 3.292765e+09 True 4 3.292763e+09 True 5 3.292731e+09 True Do you consider yourself to be a fan of the Star Wars film franchise? \ 1 True 2 NaN 3 False 4 True 5 True seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 \ 1 True True True True True True 2 False False False False False False 3 True True True False False False 4 True True True True True True 5 True True True True True True Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. \ 1 3 2 NaN 3 1 4 5 5 5 ... Unnamed: 28 \ 1 ... Very favorably 2 ... NaN 3 ... Unfamiliar (N/A) 4 ... Very favorably 5 ... Somewhat favorably Which character shot first? \ 1 I don't understand this question 2 NaN 3 I don't understand this question 4 I don't understand this question 5 Greedo Are you familiar with the Expanded Universe? \ 1 Yes 2 NaN 3 No 4 No 5 Yes Do you consider yourself to be a fan of the Expanded Universe?Âæ \ 1 No 2 NaN 3 NaN 4 NaN 5 No Do you consider yourself to be a fan of the Star Trek franchise? Gender \ 1 No Male 2 Yes Male 3 No Male 4 Yes Male 5 No Male Age Household Income Education \ 1 18-29 NaN High school degree 2 18-29 $0 - $24,999 Bachelor degree 3 18-29 $0 - $24,999 High school degree 4 18-29 $100,000 - $149,999 Some college or Associate degree 5 18-29 $100,000 - $149,999 Some college or Associate degree Location (Census Region) 1 South Atlantic 2 West South Central 3 West North Central 4 West North Central 5 West North Central [5 rows x 38 columns]
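The manual counter used for the renaming can be folded into a dictionary comprehension. A minimal sketch on an empty toy frame follows; the toy column names are assumptions that only mimic the pattern of the unnamed checkbox columns.

```python
import pandas as pd

# Toy frame whose columns mimic one question followed by unnamed checkbox columns.
toy = pd.DataFrame(columns=["RespondentID", "Which films?", "Unnamed: 2", "Unnamed: 3"])

# enumerate(..., start=1) yields the 1-based suffix without a separate counter variable.
new_names = {
    old: "seen_{}".format(i)
    for i, old in enumerate(toy.columns[1:4], start=1)
}
toy = toy.rename(columns=new_names)
print(list(toy.columns))
```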
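Columns [9:15] ask the respondent to rank the Star Wars films in order of preference, with 1 being their favorite film and 6 their least favorite.
Each of these columns can contain the value 1, 2, 3, 4, 5, 6, or NaN: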
# create uni_value() func to print a column's value counts and dtype
def uni_value(df, column):
    print(df[column].value_counts(dropna=False))
    print(df[column].dtype)
# iterate over the columns to check their data
for c in star_wars.columns[9:15]:
    uni_value(star_wars, c)
NaN 351 4 237 6 168 3 130 1 129 5 100 2 71 Name: Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film., dtype: int64 object NaN 350 5 300 4 183 2 116 3 103 6 102 1 32 Name: Unnamed: 10, dtype: int64 object NaN 351 6 217 5 203 4 182 3 150 2 47 1 36 Name: Unnamed: 11, dtype: int64 object NaN 350 1 204 6 161 2 135 4 130 3 127 5 79 Name: Unnamed: 12, dtype: int64 object NaN 350 1 289 2 235 5 118 3 106 4 47 6 41 Name: Unnamed: 13, dtype: int64 object NaN 350 2 232 3 220 1 146 6 145 4 57 5 36 Name: Unnamed: 14, dtype: int64 object
# convert columns[9:15] type to float
star_wars[star_wars.columns[9:15]]=star_wars[star_wars.columns[9:15]].astype(float)
# check columns data after converting
for c in star_wars.columns[9:15]:
    uni_value(star_wars, c)
NaN 351 4.0 237 6.0 168 3.0 130 1.0 129 5.0 100 2.0 71 Name: Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film., dtype: int64 float64 NaN 350 5.0 300 4.0 183 2.0 116 3.0 103 6.0 102 1.0 32 Name: Unnamed: 10, dtype: int64 float64 NaN 351 6.0 217 5.0 203 4.0 182 3.0 150 2.0 47 1.0 36 Name: Unnamed: 11, dtype: int64 float64 NaN 350 1.0 204 6.0 161 2.0 135 4.0 130 3.0 127 5.0 79 Name: Unnamed: 12, dtype: int64 float64 NaN 350 1.0 289 2.0 235 5.0 118 3.0 106 4.0 47 6.0 41 Name: Unnamed: 13, dtype: int64 float64 NaN 350 2.0 232 3.0 220 1.0 146 6.0 145 4.0 57 5.0 36 Name: Unnamed: 14, dtype: int64 float64
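The reason for casting to float rather than int is that the ranking columns contain NaN, and NaN is itself a float. A small sketch with made-up rankings stored as text, the way they arrive from the CSV:

```python
import numpy as np
import pandas as pd

# Made-up rankings stored as strings, with one missing answer.
rankings = pd.Series(["3", np.nan, "1", "5"])

# astype(float) parses the digit strings and leaves NaN as NaN.
as_float = rankings.astype(float)
print(as_float.tolist())

# Numeric reductions such as mean() skip NaN by default.
average = as_float.mean()
print(average)
```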
# rename the ranking columns with descriptive names
rename_columns = {}  # maps each old column name to its new one
x = 0
# iterate over columns [9:15] and fill the rename_columns dict
for c in star_wars.columns[9:15]:
    rename_columns[c] = "ranking_{}".format(x + 1)
    x += 1
# rename columns
star_wars=star_wars.rename(columns=rename_columns)
# check columns name
print(star_wars.columns)
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
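# first replace null values with the column mean for numeric columns
# (numeric_only=True keeps mean() from failing on the text columns in recent pandas)
star_wars.fillna(star_wars.mean(numeric_only=True), inplace=True)
# calculate the mean of each ranking column
ranking_mean = star_wars[star_wars.columns[9:15]].mean()
# int() truncates the mean, so e.g. 2.5 becomes 2
ranking_mean = ranking_mean.apply(lambda x: int(x))
ranking_mean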
ranking_1    3
ranking_2    4
ranking_3    4
ranking_4    3
ranking_5    2
ranking_6    3
dtype: int64
# create function to plot data
def plot_data(figsize, data, plot_kind, y_lim, x_ticklabel, title):
    fig, ax = plt.subplots(figsize=figsize)
    # draw on the axes created above
    data.plot(kind=plot_kind, ax=ax)
    # annotate each bar with its (integer) height
    for p in ax.patches:
        if p.get_height() >= 0.2:
            ax.annotate(str(int(p.get_height())), (p.get_x() + 0.1, p.get_height() * 1.05))
    ax.set_ylim(y_lim)
    # hide the frame and the tick marks
    for spine in ax.spines:
        ax.spines[spine].set_visible(False)
    ax.tick_params(left=False, top=False, right=False, bottom=False, labelleft=False)
    ax.set_xticklabels(x_ticklabel, rotation=45)
    ax.set_title(title, fontsize=15)
    plt.show()
# plot ranking mean in bar chart
plot_data((6,4),ranking_mean,"bar",(0,6),ranking_mean.index,"Ranking Star Wars Movies")
Remember that a ranking of 1 means the film was the respondent's favorite and 6 means it was the least favorite, so a lower average is better.
From the graph above, Star Wars: Episode V The Empire Strikes Back (ranking_5) is the favorite, with an average ranking of 2 once truncated to an integer. Episode I, Episode IV, and Episode VI have almost the same rank, each with a truncated average of 3.
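Two details of the averaging step are worth seeing on a small example: int() truncates, so films with different averages can show the same bar, and filling NaN with the column mean leaves the mean unchanged while shrinking the spread. The sketch below uses made-up rankings for two hypothetical films, `ranking_a` and `ranking_b`; it illustrates the arithmetic only and says nothing about the survey results.

```python
import numpy as np
import pandas as pd

# Made-up rankings; NaN marks a respondent who skipped the question.
toy = pd.DataFrame({
    "ranking_a": [1.0, 2.0, 4.0, np.nan],
    "ranking_b": [2.0, 3.0, 3.0, np.nan],
})

means = toy.mean()             # NaN is skipped by default
truncated = means.apply(int)   # int() drops the fractional part
rounded = means.round(2)       # rounding keeps the two films distinguishable
print(truncated.tolist(), rounded.tolist())

# Mean imputation: the averages stay the same, but the spread gets smaller.
filled = toy.fillna(means)
print(filled.mean().round(2).tolist())
print(toy.std().round(2).tolist(), filled.std().round(2).tolist())
```

If the exact ordering of the films matters, plotting the rounded means instead of the truncated ones is one option to consider.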
* Let us find out how many people have seen each movie, just by taking the sum of each seen column.
# compute the sum of each of the seen columns
seen_total = star_wars[star_wars.columns[3:9]].sum()
seen_total = seen_total.apply(lambda x: int(x))
seen_total
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64
# use plot_data() func to plot the seen_total
plot_data((6,4),seen_total,"bar",(0,850),seen_total.index,"Star Wars Movies Seen")
Star Wars: Episode V The Empire Strikes Back is the most-watched film (758 respondents), followed closely by Episode VI (738) and then Episode I (673).
# compute pairwise correlations between the seen and ranking columns
correlation = star_wars[star_wars.columns[3:15]].corr()
print(correlation)
seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 \ seen_1 1.000000 0.783358 0.729996 0.665818 0.648044 0.653696 seen_2 0.783358 1.000000 0.883886 0.687882 0.611608 0.642843 seen_3 0.729996 0.883886 1.000000 0.698517 0.617805 0.651306 seen_4 0.665818 0.687882 0.698517 1.000000 0.734259 0.759477 seen_5 0.648044 0.611608 0.617805 0.734259 1.000000 0.910124 seen_6 0.653696 0.642843 0.651306 0.759477 0.910124 1.000000 ranking_1 0.045018 0.192587 0.245790 0.329289 0.196162 0.239303 ranking_2 0.009260 0.032612 0.107698 0.273855 0.197400 0.217926 ranking_3 -0.045454 -0.079822 -0.144524 0.130938 0.126508 0.132463 ranking_4 -0.098361 -0.125265 -0.118085 -0.415678 -0.066153 -0.088852 ranking_5 0.044514 -0.011483 -0.039873 -0.102497 -0.214689 -0.151924 ranking_6 0.053295 -0.001593 -0.042692 -0.107388 -0.187370 -0.283890 ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6 seen_1 0.045018 0.009260 -0.045454 -0.098361 0.044514 0.053295 seen_2 0.192587 0.032612 -0.079822 -0.125265 -0.011483 -0.001593 seen_3 0.245790 0.107698 -0.144524 -0.118085 -0.039873 -0.042692 seen_4 0.329289 0.273855 0.130938 -0.415678 -0.102497 -0.107388 seen_5 0.196162 0.197400 0.126508 -0.066153 -0.214689 -0.187370 seen_6 0.239303 0.217926 0.132463 -0.088852 -0.151924 -0.283890 ranking_1 1.000000 0.415353 0.066543 -0.451620 -0.453847 -0.462552 ranking_2 0.415353 1.000000 0.335531 -0.435664 -0.528662 -0.532254 ranking_3 0.066543 0.335531 1.000000 -0.299675 -0.452271 -0.421262 ranking_4 -0.451620 -0.435664 -0.299675 1.000000 0.003324 -0.043641 ranking_5 -0.453847 -0.528662 -0.452271 0.003324 1.000000 0.312429 ranking_6 -0.462552 -0.532254 -0.421262 -0.043641 0.312429 1.000000
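To help read a matrix like the one above, here is a small sketch of how .corr() behaves on seen-style flags. The three toy columns are made up; casting to int first is a cautious choice so the calculation does not depend on how a given pandas version treats bool columns.

```python
import pandas as pd

# Made-up viewing flags for six respondents and three hypothetical films.
toy = pd.DataFrame({
    "seen_a": [True, True, False, False, True, False],
    "seen_b": [True, True, False, False, False, False],
    "seen_c": [False, True, True, False, True, False],
})

# Pearson correlation on 0/1 flags: values near 1 mean the same people saw both films.
corr = toy.astype(int).corr()
print(corr.round(3))
```

The matrix is symmetric with 1.0 on the diagonal, which is why each block of the printed output above mirrors another.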
# using seaborn to plot correlation in a matrix
sns.heatmap(correlation)
# create ranking table by Gender (mean ranking, truncated to an integer)
gender_rank = star_wars.pivot_table(values=list(star_wars.columns[9:15]), index="Gender", margins=False).applymap(lambda x: int(x))
# plot data in bar
ax3 = gender_rank.transpose().plot(kind="bar", figsize=(8, 5), cmap="summer")
# add the value above each bar
for p in ax3.patches:
    if p.get_height() >= 0.2:
        ax3.annotate(str(int(p.get_height())), (p.get_x() + 0.1, p.get_height() * 1.05))
ax3.set_ylim(0, 6)
for spine in ax3.spines:
    ax3.spines[spine].set_visible(False)
ax3.tick_params(top=False, right=False, bottom=False, left=False, labelleft=False)
ax3.set_xticklabels(star_wars.columns[9:15], rotation=45)
ax3.set_title("Ranking Star Wars Movies by Gender", fontsize=15)
# create seen table by Gender
gender_seen=star_wars.pivot_table(values=star_wars.columns[3:9],index="Gender",aggfunc=np.sum,margins=False).applymap(lambda x:int(x))
# plot data in bar
ax4=gender_seen.transpose().plot(kind="bar",figsize=(10,6),cmap="summer")
ax4.set_ylim(0,500)
# add the value above each bar
for p in ax4.patches:
    if p.get_height() >= 0.2:
        ax4.annotate(str(int(p.get_height())), (p.get_x(), p.get_height() * 1.05))
for spine in ax4.spines:
    ax4.spines[spine].set_visible(False)
ax4.tick_params(top=False,right=False,bottom=False,left=False,labelleft=False)
ax4.set_xticklabels(gender_seen.columns,rotation=45)
ax4.set_title("Seen Star wars Movies by Gender",fontsize=15)
ax4.legend(loc="upper right")
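The two gender tables are built with pivot_table; the same per-group numbers can also be written as a groupby with named aggregations, which some readers find easier to follow. This is a sketch on made-up rows, and the column names `seen_a` and `ranking_a` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up respondents; the last one did not state a gender.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "seen_a": [True, False, True, True, True],
    "ranking_a": [1.0, 3.0, 2.0, np.nan, 4.0],
})

# Rows with a missing Gender are left out of the groups by default.
by_gender = toy.groupby("Gender").agg(
    viewers=("seen_a", "sum"),        # True counts as 1
    mean_rank=("ranking_a", "mean"),  # NaN rankings are skipped
)
print(by_gender)
```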
By exploring the rankings and the films seen for both male and female respondents, it seems like:
# check the Education column data using the uni_value() func
uni_value(star_wars,"Education")
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
NaN                                 150
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64
object
# create pivot table for education effect on movies viewership
education_seen=star_wars.pivot_table(values=star_wars.columns[3:9],index="Education",aggfunc=np.sum,margins=False).applymap(lambda x:int(x))
# plot pivot table
ax5=education_seen.transpose().plot(kind="bar",figsize=(10,6),cmap="Paired")
ax5.set_ylim(0,300)
ax5.set_xticklabels(labels=education_seen.columns,rotation=45)
ax5.set_title("Star Wars Movies Seen by Education",fontsize=15)
ax5.legend(loc="upper center")
# create pivot table to explore the educational effect on movies evaluation
education_rank=star_wars.pivot_table(values=star_wars.columns[9:15],index="Education",aggfunc=np.mean,margins=False).applymap(lambda x:int(x))
# plot education_rank
ax6=education_rank.transpose().plot(kind="bar",figsize=(10,8),cmap="Paired")
ax6.set_ylim(0,6)
ax6.set_xticklabels(education_rank.columns,rotation=45)
We will create a map by region, based on the respondents who answered True in column index 2, "Do you consider yourself to be a fan of the Star Wars film franchise?"
star_wars_fans=star_wars[star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"]==True]
star_wars_fans["Location (Census Region)"].value_counts(dropna=False)
South Atlantic        88
Pacific               86
East North Central    84
Middle Atlantic       64
West South Central    53
West North Central    53
Mountain              49
New England           44
East South Central    20
NaN                   11
Name: Location (Census Region), dtype: int64
# drop null values; copy() so new columns can be added later without a SettingWithCopyWarning
star_wars_fans_clean = star_wars_fans[pd.notnull(star_wars_fans["Location (Census Region)"])].copy()
star_wars_fans_clean["Location (Census Region)"].value_counts(dropna=False)
South Atlantic        88
Pacific               86
East North Central    84
Middle Atlantic       64
West South Central    53
West North Central    53
Mountain              49
New England           44
East South Central    20
Name: Location (Census Region), dtype: int64
# # how to map Census Region , i need it's llcrnrlat,urcrnrlat,llcrnrlong,urcrnrlong to add them on basemap function
# fig=plt.figure(figsize=(14,8))
# m=Basemap(projection="lcc",width=8E6,height=8E6,lat_0=180,lon_0=0,resolution="i")
# # m.drawmapboundary(fill_color='#85A6D9')
# m.drawcoastlines(color='gray')
# m.drawcountries(color='gray')
# m.drawrivers(color='#6D5F47',linewidth=.4)
# m.drawstates(color='gray')
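# NOTE: several of these coordinates do not look like census-region centers
# (e.g. South Atlantic is placed at latitude -58); verify them before relying on the map
region_lat_lon={"East North Central":[39.964180,-75.157761],"Pacific":[37.030678,-95.638306],"South Atlantic":[-58.024670,-61.755451],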
"Middle Atlantic":[32.777458,-79.926071],"West South Central":[37.771069,-122.247017],"West North Central":[39.973129,-75.155006],
"Mountain":[36.064877,-95.992950],"New England":[46.537289,-102.861549],"East South Central":[37.757030,-122.236603]}
# look up the latitude stored for a region
def lat(region):
    return region_lat_lon[region][0]
star_wars_fans_clean["latitudes"] = star_wars_fans_clean["Location (Census Region)"].apply(lat)
print(star_wars_fans_clean.head())
# print(latitudes)
RespondentID Have you seen any of the 6 films in the Star Wars franchise? \ 1 3.292880e+09 True 4 3.292763e+09 True 5 3.292731e+09 True 6 3.292719e+09 True 7 3.292685e+09 True Do you consider yourself to be a fan of the Star Wars film franchise? \ 1 True 4 True 5 True 6 True 7 True seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 ranking_1 ... \ 1 True True True True True True 3.0 ... 4 True True True True True True 5.0 ... 5 True True True True True True 5.0 ... 6 True True True True True True 1.0 ... 7 True True True True True True 6.0 ... Which character shot first? \ 1 I don't understand this question 4 I don't understand this question 5 Greedo 6 Han 7 Han Are you familiar with the Expanded Universe? \ 1 Yes 4 No 5 Yes 6 Yes 7 Yes Do you consider yourself to be a fan of the Expanded Universe?Âæ \ 1 No 4 NaN 5 No 6 No 7 No Do you consider yourself to be a fan of the Star Trek franchise? Gender \ 1 No Male 4 Yes Male 5 No Male 6 Yes Male 7 No Male Age Household Income Education \ 1 18-29 NaN High school degree 4 18-29 $100,000 - $149,999 Some college or Associate degree 5 18-29 $100,000 - $149,999 Some college or Associate degree 6 18-29 $25,000 - $49,999 Bachelor degree 7 18-29 NaN High school degree Location (Census Region) latitudes 1 South Atlantic -58.024670 4 West North Central 39.973129 5 West North Central 39.973129 6 Middle Atlantic 32.777458 7 East North Central 39.964180 [5 rows x 39 columns]
# look up the longitude stored for a region
def lon(region):
    return region_lat_lon[region][1]
star_wars_fans_clean["longitudes"] = star_wars_fans_clean["Location (Census Region)"].apply(lon)
print(star_wars_fans_clean.head())
RespondentID Have you seen any of the 6 films in the Star Wars franchise? \ 1 3.292880e+09 True 4 3.292763e+09 True 5 3.292731e+09 True 6 3.292719e+09 True 7 3.292685e+09 True Do you consider yourself to be a fan of the Star Wars film franchise? \ 1 True 4 True 5 True 6 True 7 True seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 ranking_1 ... \ 1 True True True True True True 3.0 ... 4 True True True True True True 5.0 ... 5 True True True True True True 5.0 ... 6 True True True True True True 1.0 ... 7 True True True True True True 6.0 ... Are you familiar with the Expanded Universe? \ 1 Yes 4 No 5 Yes 6 Yes 7 Yes Do you consider yourself to be a fan of the Expanded Universe?Âæ \ 1 No 4 NaN 5 No 6 No 7 No Do you consider yourself to be a fan of the Star Trek franchise? Gender \ 1 No Male 4 Yes Male 5 No Male 6 Yes Male 7 No Male Age Household Income Education \ 1 18-29 NaN High school degree 4 18-29 $100,000 - $149,999 Some college or Associate degree 5 18-29 $100,000 - $149,999 Some college or Associate degree 6 18-29 $25,000 - $49,999 Bachelor degree 7 18-29 NaN High school degree Location (Census Region) latitudes longitudes 1 South Atlantic -58.024670 -61.755451 4 West North Central 39.973129 -75.155006 5 West North Central 39.973129 -75.155006 6 Middle Atlantic 32.777458 -79.926071 7 East North Central 39.964180 -75.157761 [5 rows x 40 columns]
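The latitude and longitude columns can also be attached in one step by turning the lookup dictionary into a small table and joining it on the region name. This is a sketch under stated assumptions: the region names and coordinates below are placeholders, not verified census-region centers, and `fans` stands in for the cleaned fans table.

```python
import pandas as pd

# Placeholder (latitude, longitude) pairs for two hypothetical regions.
coords = {"Region A": (40.0, -75.0), "Region B": (35.0, -120.0)}
fans = pd.DataFrame({"region": ["Region A", "Region B", "Region A"]})

# One row per region, indexed by region name.
lookup = pd.DataFrame.from_dict(coords, orient="index", columns=["latitude", "longitude"])

# join(..., on="region") matches each fan's region against the lookup index.
fans = fans.join(lookup, on="region")
print(fans)
```

A region missing from the lookup would get NaN coordinates here instead of raising an error, which makes gaps easy to spot with isnull().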
fig, ax = plt.subplots(figsize=(14, 8))
earth = Basemap(ax=ax)
earth.drawcoastlines(color='#556655', linewidth=0.5)
# plot one point per region, sized by the number of fans in that region
region_counts = star_wars_fans_clean["Location (Census Region)"].value_counts()
region_lats = [region_lat_lon[r][0] for r in region_counts.index]
region_lons = [region_lat_lon[r][1] for r in region_counts.index]
# "red" is a color, not a colormap name, so it is passed as color= rather than cmap=
ax.scatter(region_lons, region_lats, s=region_counts.values, color="red", zorder=10)
ax.set_title("Mapping Star Wars fans by region")