The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
For this project, you'll be cleaning and exploring the data set in a Jupyter notebook. To see a sample notebook containing all of the answers, visit the project's GitHub repository.
We need to specify an encoding because the data set contains some characters that can't be decoded with the default UTF-8 encoding. You can read more about character encodings on developer Joel Spolsky's blog.
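As a rough illustration of why the encoding matters, here is a minimal sketch that uses a made-up byte string rather than the survey file itself. The sample word is an assumption chosen only because it contains a byte that is valid in ISO-8859-1 but not in UTF-8.

```python
# Hypothetical sample (not taken from star_wars.csv): "café" stored as ISO-8859-1 bytes.
raw = "café".encode("ISO-8859-1")

# Decoding with the matching encoding recovers the original text.
decoded = raw.decode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails, because the lone byte 0xE9
# is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

The same kind of error is what pandas.read_csv() would raise on such bytes if no encoding were passed.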
The data has several columns, including RespondentID, Gender, Age, Household Income, Education, and Location (Census Region).
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
#importing the modules
import pandas as pd
import numpy as np
#loading the file into Dataframe
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
#displaying first 10 rows
print(star_wars.head(10))
#inspecting column names
columns = star_wars.columns
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 rows × 38 columns
#Inspecting for null values
star_wars.isnull()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
2 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | False | False | False |
3 | False | False | False | False | False | False | True | True | True | False | ... | False | False | False | True | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
6 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
7 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
8 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | False | False |
9 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
10 | False | False | False | True | False | True | True | True | True | False | ... | False | False | False | True | False | False | False | False | False | False |
11 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
12 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
13 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
14 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
15 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | False | False |
16 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
17 | False | False | False | True | True | True | False | True | True | False | ... | False | False | False | False | False | False | False | False | False | False |
18 | False | False | False | False | False | False | True | True | False | False | ... | False | False | False | True | False | False | False | True | False | False |
19 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | False | False |
20 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
21 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | False | False |
22 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | True | False | False | False | True | False | False |
23 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
24 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
25 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | True | False |
26 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | False | False | False |
27 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
28 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | False | False |
29 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1157 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1158 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | False | False | False |
1159 | False | False | False | True | True | True | True | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1160 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | False | False | False |
1161 | False | False | False | False | True | False | True | True | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1162 | False | False | False | False | False | False | True | False | False | False | ... | False | False | False | True | False | True | True | True | True | True |
1163 | False | False | False | False | True | True | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1164 | False | False | False | True | True | True | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
1165 | False | False | False | False | False | False | True | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1166 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1167 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1168 | False | False | False | False | True | True | True | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
1169 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | False | False | False |
1170 | False | False | False | False | True | True | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1171 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | True | False | False |
1172 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1173 | False | False | False | True | True | True | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
1174 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | True | False | False |
1175 | False | False | False | False | False | False | True | True | True | False | ... | False | False | False | True | False | False | False | True | False | False |
1176 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1177 | False | False | False | False | False | True | True | True | True | False | ... | False | False | False | True | False | False | False | False | False | False |
1178 | False | False | False | False | False | False | True | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1179 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | True | False | False |
1180 | False | False | False | True | True | True | True | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1181 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1182 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1183 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1184 | False | False | True | True | True | True | True | True | True | True | ... | True | True | True | True | False | False | False | False | False | False |
1185 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1186 | False | False | False | False | False | True | True | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
1187 rows × 38 columns
#removing rows without a RespondentID (copy so later column assignments modify this frame directly)
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()
#checking for changes in the DF
star_wars.isnull().sum()
RespondentID    0
Have you seen any of the 6 films in the Star Wars franchise?    0
Do you consider yourself to be a fan of the Star Wars film franchise?    350
Which of the following Star Wars films have you seen? Please select all that apply.    513
Unnamed: 4    615
Unnamed: 5    636
Unnamed: 6    579
Unnamed: 7    428
Unnamed: 8    448
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.    351
Unnamed: 10    350
Unnamed: 11    351
Unnamed: 12    350
Unnamed: 13    350
Unnamed: 14    350
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.    357
Unnamed: 16    355
Unnamed: 17    355
Unnamed: 18    363
Unnamed: 19    361
Unnamed: 20    372
Unnamed: 21    360
Unnamed: 22    366
Unnamed: 23    374
Unnamed: 24    359
Unnamed: 25    356
Unnamed: 26    365
Unnamed: 27    372
Unnamed: 28    360
Which character shot first?    358
Are you familiar with the Expanded Universe?    358
Do you consider yourself to be a fan of the Expanded Universe?Âæ    973
Do you consider yourself to be a fan of the Star Trek franchise?    118
Gender    140
Age    140
Household Income    328
Education    150
Location (Census Region)    143
dtype: int64
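Take a look at the next two columns, which are:
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?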
Both represent Yes/No questions. They can also be NaN where a respondent chooses not to answer a question. We can use the pandas.Series.value_counts() method on a series to see all of the unique values in a column, along with the total number of times each value appears.
Both columns are currently string types, because the main values they contain are Yes and No. We can make the data a bit easier to analyze down the road by converting each column to a Boolean having only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.
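The idea can be sketched on a small toy Series before touching the survey columns. The values below are made up for illustration, and the names answers, as_bool, and said_yes are hypothetical. Note that value_counts() skips NaN unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

# Toy answers standing in for one Yes/No survey column (not real survey data).
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# dropna=False also counts the missing answers, which value_counts() skips by default.
counts = answers.value_counts(dropna=False)

# Map the strings to Booleans; values absent from the dictionary (here NaN) stay NaN.
as_bool = answers.map({"Yes": True, "No": False})

# Select the rows that are True without any string comparison.
# eq(True) treats NaN as "not True", so unanswered rows are left out.
said_yes = answers[as_bool.eq(True)]

print(counts)
print(as_bool.tolist())
print(len(said_yes))
```

Using eq(True) rather than indexing with as_bool directly is a deliberate choice here, since a mask that contains NaN cannot be used for Boolean indexing.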
# inspecting Unique Column Values
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
Yes    552
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
#defining a dictionary to use with the map function
yes_no = {"Yes": True, "No": False}
#applying the map function with the dictionary to both columns
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
#Looking for changes
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
True     552
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
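The columns for this question are:
Which of the following Star Wars films have you seen? Please select all that apply. (Star Wars: Episode I The Phantom Menace)
Unnamed: 4 (Star Wars: Episode II Attack of the Clones)
Unnamed: 5 (Star Wars: Episode III Revenge of the Sith)
Unnamed: 6 (Star Wars: Episode IV A New Hope)
Unnamed: 7 (Star Wars: Episode V The Empire Strikes Back)
Unnamed: 8 (Star Wars: Episode VI Return of the Jedi)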
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
We'll need to convert each of these columns to a Boolean, then rename the column something more intuitive. We can convert the values the same way we did earlier, except that we'll need to include the movie title and NaN in the mapping dictionary.
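As an alternative sketch (a different technique from the mapping dictionary described above), the same conversion can be done with notnull(), which avoids having to match every movie title exactly. The toy frame below is made up and covers only two of the six columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two of the checkbox columns (values are illustrative).
toy = pd.DataFrame({
    "Which of the following Star Wars films have you seen? Please select all that apply.":
        ["Star Wars: Episode I The Phantom Menace", np.nan,
         "Star Wars: Episode I The Phantom Menace"],
    "Unnamed: 4":
        [np.nan, np.nan, "Star Wars: Episode II Attack of the Clones"],
})

# Any non-null cell means the box was checked; NaN becomes False.
seen = toy.notnull()

# Give the columns short, intuitive names: seen_1, seen_2, ...
seen.columns = [f"seen_{i}" for i in range(1, len(seen.columns) + 1)]

print(seen)
```

This relies on the same assumption stated above: a missing value is treated as "did not see the movie".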
#defining dictionary for conversion (keys must match the cell text exactly, including spacing)
true_false = {"Star Wars: Episode I The Phantom Menace": True,
              "Star Wars: Episode II Attack of the Clones": True,
              "Star Wars: Episode III Revenge of the Sith": True,
              "Star Wars: Episode IV A New Hope": True,
              "Star Wars: Episode V The Empire Strikes Back": True,
              "Star Wars: Episode VI Return of the Jedi": True,
              np.nan: False}  # np.nan rather than np.NaN, which NumPy 2.0 removed
#applying the mapping to each of the six checkbox columns
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(true_false)
#renaming columns (one rename call handles all six, so no loop is needed)
star_wars = star_wars.rename(columns={
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6"})
star_wars.head()
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | ranking_1 | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | NaN | False | False | False | NaN | False | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | NaN | False | False | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | NaN | False | False | False | NaN | False | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | NaN | False | False | False | NaN | False | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
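The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite. 1 means the film was the most favorite, and 6 means it was the least favorite. Each of these columns (Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film., followed by Unnamed: 10 through Unnamed: 14) can contain the value 1, 2, 3, 4, 5, 6, or NaN.
# converting columns to float type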
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars[star_wars.columns[9:15]].head()
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 | |
---|---|---|---|---|---|---|
1 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
4 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
5 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
#Renaming columns
star_wars = star_wars.rename(columns={"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.":
"ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
star_wars[star_wars.columns[9:15]].head()
ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|
1 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
4 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
5 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To do this, take the mean of each ranking column using the pandas.DataFrame.mean() method.
#importing visualization library
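A small sketch of that step on made-up rankings follows. The three toy columns and their values are assumptions for illustration only. It shows that mean() skips NaN by default and that the best-ranked film is the one with the lowest mean.

```python
import numpy as np
import pandas as pd

# Toy rankings for three hypothetical respondents (1 = most favorite).
rankings = pd.DataFrame({
    "ranking_1": [3.0, np.nan, 1.0],
    "ranking_2": [2.0, np.nan, 3.0],
    "ranking_3": [1.0, np.nan, 2.0],
})

# mean() works column by column and skips NaN by default (skipna=True).
ranking_mean = rankings.mean()

# The lowest mean is the best-ranked film, because 1 is the top rank.
best = ranking_mean.idxmin()

print(ranking_mean)
print(best)
```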
%matplotlib inline
import matplotlib.pyplot as plt
ranking_mean = star_wars[star_wars.columns[9:15]].mean()
plt.bar(range(6), ranking_mean)
<Container object of 6 artists>
It looks like the "original" movies are rated much more highly than the newer ones. Keep in mind that a lower mean ranking is better here, because 1 means most favorite.
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())
<Container object of 6 artists>
It appears that the original movies were seen by more respondents than the newer movies. This reinforces what we saw in the rankings, where the earlier movies seem to be more popular.
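We know which movies the survey population as a whole has ranked the highest. Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
Do you consider yourself to be a fan of the Star Wars film franchise? (True or False)
Do you consider yourself to be a fan of the Star Trek franchise? (Yes or No)
Gender (Male or Female)
# splitting the respondents into two segments by gender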
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
# plotting the segments
plt.bar(range(6), males[males.columns[9:15]].mean())
plt.show()
plt.bar(range(6), females[females.columns[9:15]].mean())
plt.show()
# finding the most-seen films by gender
plt.bar(range(6), males[males.columns[3:9]].sum())
plt.show()
plt.bar(range(6), females[females.columns[3:9]].sum())
plt.show()
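The gender split above filters the frame once per group. As an alternative sketch (not what the cells above do), groupby can compute the same per-segment statistics in one pass. The toy frame below is made up and only borrows the cleaned column names.

```python
import numpy as np
import pandas as pd

# Toy data using the cleaned column naming (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "ranking_1": [1.0, 3.0, 3.0, 2.0, 2.0],
    "seen_1": [True, True, False, True, True],
})

# groupby drops rows whose Gender is NaN by default and groups the rest.
by_gender = toy.groupby("Gender")

mean_rank = by_gender["ranking_1"].mean()  # average ranking per gender
seen_count = by_gender["seen_1"].sum()     # respondents per gender who saw the film

print(mean_rank)
print(seen_count)
```

The same pattern should extend to the other two-group columns, such as the Star Trek fan question, by changing the grouping key.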