import pandas as pd
import numpy as np
# Reading the dataset into a dataframe
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
# Exploring the first 10 rows of the dataset
print(star_wars.head(10))
# Exploring column names in order to review them
star_wars.columns
RespondentID Have you seen any of the 6 films in the Star Wars franchise? \ 0 NaN Response 1 3.292880e+09 Yes 2 3.292880e+09 No 3 3.292765e+09 Yes 4 3.292763e+09 Yes 5 3.292731e+09 Yes 6 3.292719e+09 Yes 7 3.292685e+09 Yes 8 3.292664e+09 Yes 9 3.292654e+09 Yes Do you consider yourself to be a fan of the Star Wars film franchise? \ 0 Response 1 Yes 2 NaN 3 No 4 Yes 5 Yes 6 Yes 7 Yes 8 Yes 9 Yes Which of the following Star Wars films have you seen? Please select all that apply. \ 0 Star Wars: Episode I The Phantom Menace 1 Star Wars: Episode I The Phantom Menace 2 NaN 3 Star Wars: Episode I The Phantom Menace 4 Star Wars: Episode I The Phantom Menace 5 Star Wars: Episode I The Phantom Menace 6 Star Wars: Episode I The Phantom Menace 7 Star Wars: Episode I The Phantom Menace 8 Star Wars: Episode I The Phantom Menace 9 Star Wars: Episode I The Phantom Menace Unnamed: 4 \ 0 Star Wars: Episode II Attack of the Clones 1 Star Wars: Episode II Attack of the Clones 2 NaN 3 Star Wars: Episode II Attack of the Clones 4 Star Wars: Episode II Attack of the Clones 5 Star Wars: Episode II Attack of the Clones 6 Star Wars: Episode II Attack of the Clones 7 Star Wars: Episode II Attack of the Clones 8 Star Wars: Episode II Attack of the Clones 9 Star Wars: Episode II Attack of the Clones Unnamed: 5 \ 0 Star Wars: Episode III Revenge of the Sith 1 Star Wars: Episode III Revenge of the Sith 2 NaN 3 Star Wars: Episode III Revenge of the Sith 4 Star Wars: Episode III Revenge of the Sith 5 Star Wars: Episode III Revenge of the Sith 6 Star Wars: Episode III Revenge of the Sith 7 Star Wars: Episode III Revenge of the Sith 8 Star Wars: Episode III Revenge of the Sith 9 Star Wars: Episode III Revenge of the Sith Unnamed: 6 \ 0 Star Wars: Episode IV A New Hope 1 Star Wars: Episode IV A New Hope 2 NaN 3 NaN 4 Star Wars: Episode IV A New Hope 5 Star Wars: Episode IV A New Hope 6 Star Wars: Episode IV A New Hope 7 Star Wars: Episode IV A New Hope 8 Star Wars: Episode IV A New Hope 9 Star Wars: 
Episode IV A New Hope Unnamed: 7 \ 0 Star Wars: Episode V The Empire Strikes Back 1 Star Wars: Episode V The Empire Strikes Back 2 NaN 3 NaN 4 Star Wars: Episode V The Empire Strikes Back 5 Star Wars: Episode V The Empire Strikes Back 6 Star Wars: Episode V The Empire Strikes Back 7 Star Wars: Episode V The Empire Strikes Back 8 Star Wars: Episode V The Empire Strikes Back 9 Star Wars: Episode V The Empire Strikes Back Unnamed: 8 \ 0 Star Wars: Episode VI Return of the Jedi 1 Star Wars: Episode VI Return of the Jedi 2 NaN 3 NaN 4 Star Wars: Episode VI Return of the Jedi 5 Star Wars: Episode VI Return of the Jedi 6 Star Wars: Episode VI Return of the Jedi 7 Star Wars: Episode VI Return of the Jedi 8 Star Wars: Episode VI Return of the Jedi 9 Star Wars: Episode VI Return of the Jedi Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. \ 0 Star Wars: Episode I The Phantom Menace 1 3 2 NaN 3 1 4 5 5 5 6 1 7 6 8 4 9 5 ... Unnamed: 28 Which character shot first? \ 0 ... Yoda Response 1 ... Very favorably I don't understand this question 2 ... NaN NaN 3 ... Unfamiliar (N/A) I don't understand this question 4 ... Very favorably I don't understand this question 5 ... Somewhat favorably Greedo 6 ... Very favorably Han 7 ... Very favorably Han 8 ... Very favorably Han 9 ... Somewhat favorably Han Are you familiar with the Expanded Universe? \ 0 Response 1 Yes 2 NaN 3 No 4 No 5 Yes 6 Yes 7 Yes 8 No 9 No Do you consider yourself to be a fan of the Expanded Universe?Âæ \ 0 Response 1 No 2 NaN 3 NaN 4 NaN 5 No 6 No 7 No 8 NaN 9 NaN Do you consider yourself to be a fan of the Star Trek franchise? 
Gender \ 0 Response Response 1 No Male 2 Yes Male 3 No Male 4 Yes Male 5 No Male 6 Yes Male 7 No Male 8 Yes Male 9 No Male Age Household Income Education \ 0 Response Response Response 1 18-29 NaN High school degree 2 18-29 $0 - $24,999 Bachelor degree 3 18-29 $0 - $24,999 High school degree 4 18-29 $100,000 - $149,999 Some college or Associate degree 5 18-29 $100,000 - $149,999 Some college or Associate degree 6 18-29 $25,000 - $49,999 Bachelor degree 7 18-29 NaN High school degree 8 18-29 NaN High school degree 9 18-29 $0 - $24,999 Some college or Associate degree Location (Census Region) 0 Response 1 South Atlantic 2 West South Central 3 West North Central 4 West North Central 5 West North Central 6 Middle Atlantic 7 East North Central 8 South Atlantic 9 South Atlantic [10 rows x 38 columns]
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?Âæ', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
# Removing rows where RespondentID is missing (NaN); .copy() avoids
# SettingWithCopyWarning when columns are reassigned later
star_wars = star_wars[pd.notnull(star_wars["RespondentID"])].copy()
# Checking the new star_wars dataframe with the cleaned RespondentID column
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 | 3.292640e+09 | Yes | No | NaN | Star Wars: Episode II Attack of the Clones | NaN | NaN | NaN | NaN | 1 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
Now the dataset 'star_wars' contains only rows where 'RespondentID' is not missing (NaN), which removes the extra header row at index 0.
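The filtering step can be sketched on a small made-up dataframe. The values below are invented for illustration and are not taken from the survey; the sketch only shows that boolean-mask filtering and `dropna(subset=...)` give the same result.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey (made-up values): the first row plays the
# role of the extra header row, which has no RespondentID.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 1001.0, 1002.0],
    "Gender": ["Response", "Male", "Female"],
})

# Boolean-mask filtering, the approach used in the notebook
by_mask = toy[toy["RespondentID"].notnull()]

# Equivalent alternative: drop rows whose RespondentID is missing
by_dropna = toy.dropna(subset=["RespondentID"])

print(by_mask.equals(by_dropna))
print(len(toy), "->", len(by_mask))
```

Either form keeps the original index labels, which is why the cleaned dataframe starts at index 1.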
Let us convert the values in two columns to Boolean values (True and False): 'Have you seen any of the 6 films in the Star Wars franchise?' and 'Do you consider yourself to be a fan of the Star Wars film franchise?'. Missing answers stay NaN.
# Dictionary which defines the mapping we need for the two columns
boolean_map = {
"Yes": True,
"No": False
}
# Converting the two columns to contain Boolean values.
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"] = star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].map(boolean_map)
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"] = star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].map(boolean_map)
# Could also have used this little loop:
# boolean_map = {"Yes": True, "No": False}
# for col in [
# "Have you seen any of the 6 films in the Star Wars franchise?",
# "Do you consider yourself to be a fan of the Star Wars film franchise?"
# ]:
# star_wars[col] = star_wars[col].map(boolean_map)
# Checking whether the changes took effect
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 | 3.292640e+09 | True | False | NaN | Star Wars: Episode II Attack of the Clones | NaN | NaN | NaN | NaN | 1 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
It seems like my magic worked. Now we have some Boolean types to work with.
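As a small sketch of how `Series.map` behaves with a dictionary: values that are missing, or that do not appear in the dictionary, come out as NaN. The "Maybe" answer below is made up for illustration and does not occur in the survey.

```python
import numpy as np
import pandas as pd

boolean_map = {"Yes": True, "No": False}

# "Maybe" is a hypothetical answer used only to show the fallback behavior
answers = pd.Series(["Yes", "No", np.nan, "Maybe"])

mapped = answers.map(boolean_map)
print(mapped.tolist())      # [True, False, nan, nan]
print(mapped.isna().sum())  # 2
```

This is why respondents who skipped the fan question still show NaN after the conversion.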
The next six columns represent a single checkbox question, in which the respondent was asked: Which of the following Star Wars films have you seen? Please select all that apply.
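The columns, and possible checkbox answers are:
- 'Which of the following Star Wars films have you seen? Please select all that apply.': Star Wars: Episode I The Phantom Menace
- 'Unnamed: 4': Star Wars: Episode II Attack of the Clones
- 'Unnamed: 5': Star Wars: Episode III Revenge of the Sith
- 'Unnamed: 6': Star Wars: Episode IV A New Hope
- 'Unnamed: 7': Star Wars: Episode V The Empire Strikes Back
- 'Unnamed: 8': Star Wars: Episode VI Return of the Jedi
The values inside these columns are the titles of the movies the respondent checked off. If the respondent saw the movie, the cell holds the movie's title as a string. If the respondent didn't see the movie, or didn't answer that checkbox, the value is NaN. I will convert each of these columns to contain only Boolean values, following the same approach as before.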
After I have converted the values of the columns, I will rename the columns, in order to provide more intuitive names.
# Dictionary which defines the mapping we need for the six columns
titles = {
"Star Wars: Episode I The Phantom Menace": True,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True,
np.nan: False
}
# Converting the six columns to contain Boolean values.
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(titles)
# Checking whether the changes took effect
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | True | True | True | True | True | True | True | True | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | True | True | True | True | True | True | True | True | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | True | True | True | True | True | True | True | True | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | True | True | True | True | True | True | True | True | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 | 3.292640e+09 | True | False | False | True | False | False | False | False | 1 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
We now have even more beautiful and easy to analyze columns to work with.
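An alternative to the title dictionary, shown here only as a sketch on made-up rows, is to test each cell for presence with `notna()`. It assumes that any non-missing value in a checkbox column means the respondent ticked it. The column names 'col_a' and 'col_b' are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy checkbox columns (hypothetical names); a title means "ticked"
checkboxes = pd.DataFrame({
    "col_a": ["Star Wars: Episode I The Phantom Menace", np.nan,
              "Star Wars: Episode I The Phantom Menace"],
    "col_b": [np.nan, np.nan,
              "Star Wars: Episode II Attack of the Clones"],
})

# True where a title is present, False where the cell is NaN
seen_bool = checkboxes.notna()
print(seen_bool)
```

The design trade-off: `notna()` does not depend on the exact spelling of each title, whereas the dictionary approach would turn an unexpected string into NaN and so flag it.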
# Renaming the columns
star_wars = star_wars.rename(columns = {
"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6"
})
# Checking whether the changes took effect
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
6 | 3.292719e+09 | True | True | True | True | True | True | True | True | 1 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
7 | 3.292685e+09 | True | True | True | True | True | True | True | True | 6 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
8 | 3.292664e+09 | True | True | True | True | True | True | True | True | 4 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
9 | 3.292654e+09 | True | True | True | True | True | True | True | True | 5 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
10 | 3.292640e+09 | True | False | False | True | False | False | False | False | 1 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
Now that I've converted the values into Boolean values, and changed the column names, the dataset will be much easier to work with, and more intuitive to analyze and understand.
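The rename dictionary can also be generated instead of typed out. A minimal sketch on a toy dataframe, assuming the columns to rename are contiguous and already in movie order; the column name 'Question text' is a placeholder, not the real survey header.

```python
import pandas as pd

# Toy dataframe with placeholder column names
toy = pd.DataFrame(
    [[True, False, True]],
    columns=["Question text", "Unnamed: 4", "Unnamed: 5"],
)

# Build {"old name": "seen_1", ...} from the column positions
new_names = {old: f"seen_{i}" for i, old in enumerate(toy.columns, start=1)}
toy = toy.rename(columns=new_names)

print(list(toy.columns))  # ['seen_1', 'seen_2', 'seen_3']
```

On the real dataframe the same idea would apply to `star_wars.columns[3:9]`.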
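The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite: 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of these columns can contain the value 1, 2, 3, 4, 5, 6, or NaN. The first of the six columns holds the ranking for Episode I, and the columns 'Unnamed: 10' through 'Unnamed: 14' hold the rankings for Episodes II through VI.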
I will now convert each column to a numeric type, and then rename the columns so that the columns will be more intuitive to work with.
# Converting the columns to numeric type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
# Renaming the columns with more descriptive names
star_wars = star_wars.rename(columns = {
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
# Checking whether the changes took effect
star_wars.iloc[:, 9:15].head(10)
ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|
1 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
2 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
4 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
5 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
6 | 1.0 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 |
7 | 6.0 | 5.0 | 4.0 | 3.0 | 1.0 | 2.0 |
8 | 4.0 | 5.0 | 6.0 | 3.0 | 2.0 | 1.0 |
9 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
10 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
Look how good those columns look now. This is going to be great to work with, and much easier to analyze.
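A short sketch of a more forgiving conversion, in case a ranking column ever contained a non-numeric string: `pd.to_numeric` with `errors="coerce"` turns such values into NaN, where `astype(float)` would raise a `ValueError`. The "n/a" value is made up for illustration; nothing in the survey output suggests it occurs.

```python
import numpy as np
import pandas as pd

# Rankings stored as strings, plus one made-up bad value ("n/a")
raw = pd.Series(["3", "1", np.nan, "n/a"])

# Unparseable strings become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")

print(clean.tolist())  # [3.0, 1.0, nan, nan]
print(clean.dtype)     # float64
```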
Let us take a look at the means of each of the ranking columns, in order to determine which movie is ranked the highest in 538's survey.
I will make a bar chart of the mean rankings in order to provide a better overview of the values.
# Computing the mean of each of the rankings
star_wars[star_wars.columns[9:15]].mean()
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64
# Importing matplotlib which we will need to plot the graph
import matplotlib.pyplot as plt
# Allowing plots to be displayed directly in the notebook
%matplotlib inline
# Making a fancy bar chart for each ranking
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
rating = [1, 2, 3, 4, 5, 6]
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (1-6)")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), star_wars[star_wars.columns[9:15]].mean(), color = 'blueviolet')
<BarContainer object of 6 artists>
I have made a graph which shows the mean ranking of each movie. The important point to note is that the lower the mean ranking, the better the movie has been rated by the respondents. We can therefore see that the fifth movie, The Empire Strikes Back, is the highest rated movie. Moreover, the original trilogy ranks better, across the board, than the more recent prequels.
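To make the ordering easier to read than '#1' to '#6', the means can be labeled with movie titles and sorted. This is a sketch that re-enters the means printed above by hand and assumes that 'ranking_1' through 'ranking_6' correspond to Episodes I through VI.

```python
import pandas as pd

# Mean rankings copied from the output above
mean_rankings = pd.Series({
    "ranking_1": 3.732934, "ranking_2": 4.087321, "ranking_3": 4.341317,
    "ranking_4": 3.272727, "ranking_5": 2.513158, "ranking_6": 3.047847,
})

# Assumed mapping from column name to episode title
episode_titles = {
    "ranking_1": "Episode I The Phantom Menace",
    "ranking_2": "Episode II Attack of the Clones",
    "ranking_3": "Episode III Revenge of the Sith",
    "ranking_4": "Episode IV A New Hope",
    "ranking_5": "Episode V The Empire Strikes Back",
    "ranking_6": "Episode VI Return of the Jedi",
}

# Lower mean ranking is better, so ascending order lists the best movie first
best_first = mean_rankings.rename(index=episode_titles).sort_values()
print(best_first)
```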
Let us now turn our attention to the popularity of each movie, in terms of viewership. I will now examine the sum of the seen columns, and make another chart to display the data.
# Summing the columns
star_wars[star_wars.columns[3:9]].sum()
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64
# Making a fancy bar chart for each seen column
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
seen = [525, 575, 625, 675, 725, 775]
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum(), color = 'blueviolet')
<BarContainer object of 6 artists>
We can tell from the graph that the most viewed movie is also the most highly ranked movie: The Empire Strikes Back. The movie seen by the fewest respondents is the third movie, Revenge of the Sith. However, the first movie, The Phantom Menace, has been seen by more respondents than the fourth movie, A New Hope.
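For the fifth and sixth movies, viewership and ranking go together: they are the two most seen movies and have the two best mean rankings. That is not the case for the fourth movie, which has the third-best mean ranking (3.27) but was seen by fewer respondents (607) than the first movie (673). Among the prequels, fewer viewers go hand in hand with a worse mean ranking, from the first movie (673 viewers, 3.73) to the third (550 viewers, 4.34).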
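The relationship between viewership and mean ranking can be put into a single number with a rank correlation. The sketch below re-enters the sums and means printed earlier and computes a Spearman correlation as the Pearson correlation of the ranks; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

episodes = ["I", "II", "III", "IV", "V", "VI"]

# Values copied from the outputs above
seen_counts = pd.Series([673, 571, 550, 607, 758, 738], index=episodes)
mean_rankings = pd.Series(
    [3.732934, 4.087321, 4.341317, 3.272727, 2.513158, 3.047847],
    index=episodes,
)

# Spearman correlation = Pearson correlation computed on the ranks
rho = seen_counts.rank().corr(mean_rankings.rank())
print(round(rho, 3))  # -0.943
```

The sign is negative because a lower ranking number is better: movies seen by more respondents tend to have a lower (better) mean ranking. With only six movies this is a descriptive summary, not a statistical test.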
Let us now look into segments of the data, particularly the differences in response between the genders, and between Star Wars fans and Star Trek fans. I will now split the dataframe into groups, in order to compare these segments.
# Splitting the data by gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
# Computing the mean of each of the rankings by males
males[males.columns[9:15]].mean()
ranking_1    4.037825
ranking_2    4.224586
ranking_3    4.274882
ranking_4    2.997636
ranking_5    2.458629
ranking_6    3.002364
dtype: float64
# Making a fancy bar chart for each ranking by males
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
rating = [1, 2, 3, 4, 5, 6]
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (males)")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), males[males.columns[9:15]].mean(), color = 'steelblue')
<BarContainer object of 6 artists>
Let us redo the analysis for females as well before we draw any conclusions on ranking.
# Computing the mean of each of the rankings by females
females[females.columns[9:15]].mean()
ranking_1    3.429293
ranking_2    3.954660
ranking_3    4.418136
ranking_4    3.544081
ranking_5    2.569270
ranking_6    3.078086
dtype: float64
# Making a fancy bar chart for each ranking by females
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
rating = [1, 2, 3, 4, 5, 6]
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (females)")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), females[females.columns[9:15]].mean(), color = 'salmon')
<BarContainer object of 6 artists>
We can now see that the ranking of the movies differs by gender. Since a lower mean ranking is better, females have rated the first and second movies higher than males have (3.43 vs. 4.04 and 3.95 vs. 4.22), and they have rated the original trilogy slightly lower than males have.
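The per-gender means can also be computed in one step with `groupby`, which puts both groups side by side. A minimal sketch on made-up rows; the numbers are invented and only the column names follow the notebook.

```python
import pandas as pd

# Toy data (made-up rankings) with the notebook's column names
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "ranking_1": [4.0, 3.0, 5.0, 2.0],
    "ranking_5": [1.0, 2.0, 2.0, 3.0],
})

# One row per gender, one column per ranking
by_gender = toy.groupby("Gender")[["ranking_1", "ranking_5"]].mean()
print(by_gender)
```

Rows with a missing 'Gender' are left out by default, which matches the effect of the two equality filters used above.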
Let us look at the number of viewers of each movie, split by gender.
# Summing the columns for males
males[males.columns[3:9]].sum()
seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64
# Making a fancy bar chart for each seen column
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
seen = [525, 575, 625, 675, 725, 775]
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen (males)")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), males[males.columns[3:9]].sum(), color = 'steelblue')
<BarContainer object of 6 artists>
Let us do the analysis for females as well before we dive into the details.
# Summing the columns for females
females[females.columns[3:9]].sum()
seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64
# Making a fancy bar chart for each seen column
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Respondents who have seen the movie")
plt.title("Number of females who have seen each Star Wars movie")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), females[females.columns[3:9]].sum(), color = 'salmon')
<BarContainer object of 6 artists>
There doesn't seem to be any discernible difference in the relative popularity of the movies across the genders. However, fewer females than males have seen each movie.
An interesting point here is that the viewership pattern is relatively similar across genders. Keeping in mind that a lower mean ranking is better, the movies most seen by females (#5, #6 and #1) are also the ones females rank most favorably, so views and ranking appear to move together for females, as they do for males.
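The relationship between views and ranking can be checked numerically rather than by eye. This is a minimal sketch, assuming the female view counts and mean rankings printed above are copied in by hand (the variable names are illustrative). Because a lower mean ranking is better, a negative coefficient means that more-viewed movies are also ranked more favorably.

```python
import numpy as np

# Values copied from the outputs above for females (illustrative variable names)
seen_females = np.array([298, 237, 222, 255, 353, 338])
mean_rank_females = np.array(
    [3.429293, 3.954660, 4.418136, 3.544081, 2.569270, 3.078086]
)

# Pearson correlation between view counts and mean ranking.
# Lower ranking = more favorable, so a negative r means "more views, better ranking".
r_females = np.corrcoef(seen_females, mean_rank_females)[0, 1]
print(f"Pearson r (females): {r_females:.2f}")
```

With only six movies, the coefficient is a rough descriptive summary, not a statistical test.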
Let us take a look at the difference between the Star Wars fanbase and the Star Trek fanbase.
# Splitting the data by fanbase
wars_fan = star_wars[star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"] == True]
trek_fan = star_wars[star_wars["Do you consider yourself to be a fan of the Star Trek franchise?"] == "Yes"]
# Computing the mean of each of the rankings by Jedis
wars_fan[wars_fan.columns[9:15]].mean()
ranking_1    4.141304
ranking_2    4.342391
ranking_3    4.417423
ranking_4    2.932971
ranking_5    2.333333
ranking_6    2.829710
dtype: float64
# Making a fancy bar chart for each ranking by Jedis
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Mean ranking (lower is better)")
plt.title("Mean rankings of each Star Wars movie (Jedis)")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), wars_fan[wars_fan.columns[9:15]].mean(), color = 'seagreen')
<BarContainer object of 6 artists>
Seems like the Star Wars fans know what they like! Original trilogy all the way, baby.
# Computing the mean of each of the rankings by trekkies
trek_fan[trek_fan.columns[9:15]].mean()
ranking_1    3.968675
ranking_2    4.255422
ranking_3    4.403382
ranking_4    3.110843
ranking_5    2.407229
ranking_6    2.850602
dtype: float64
# Making a fancy bar chart for each ranking by trekkies
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Mean ranking (lower is better)")
plt.title("Mean rankings of each Star Wars movie (trekkies)")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), trek_fan[trek_fan.columns[9:15]].mean(), color = 'deeppink')
<BarContainer object of 6 artists>
While the difference is tiny, the Star Trek fans rank the prequels slightly more favorably than their Star Wars fan counterparts do. Star Trek fans also rank The Empire Strikes Back almost as favorably as Star Wars fans do. Overall, the two groups rank the movies very similarly.
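To make the size of these differences concrete, the two sets of means can be placed side by side. This is a small sketch using the values printed above, typed in by hand. The column name `trek_minus_wars` is just an illustrative label.

```python
import pandas as pd

index = [f"ranking_{i}" for i in range(1, 7)]
comparison = pd.DataFrame(
    {
        "wars_fan": [4.141304, 4.342391, 4.417423, 2.932971, 2.333333, 2.829710],
        "trek_fan": [3.968675, 4.255422, 4.403382, 3.110843, 2.407229, 2.850602],
    },
    index=index,
)

# Negative difference: trekkies rank that movie more favorably (lower is better)
comparison["trek_minus_wars"] = comparison["trek_fan"] - comparison["wars_fan"]
print(comparison.round(3))
```

In the notebook, the two columns could be filled from `wars_fan[wars_fan.columns[9:15]].mean()` and `trek_fan[trek_fan.columns[9:15]].mean()` instead.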
Let us take a look at the viewership between the fanbases.
# Summing the seen columns for Jedis
wars_fan[wars_fan.columns[3:9]].sum()
seen_1    500
seen_2    463
seen_3    450
seen_4    483
seen_5    538
seen_6    537
dtype: int64
# Making a fancy bar chart for each seen column by Jedis
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Respondents who have seen the movie")
plt.title("Number of Jedis who have seen each Star Wars movie")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), wars_fan[wars_fan.columns[3:9]].sum(), color = 'seagreen')
<BarContainer object of 6 artists>
And for trekkies:
# Summing the seen columns for trekkies
trek_fan[trek_fan.columns[3:9]].sum()
seen_1    364
seen_2    336
seen_3    322
seen_4    342
seen_5    397
seen_6    396
dtype: int64
# Making a fancy bar chart for each seen column by trekkies
plt.style.use('ggplot')
movies = ['#1', '#2', '#3', '#4', '#5', '#6']
# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Respondents who have seen the movie")
plt.title("Number of trekkies who have seen each Star Wars movie")
plt.xticks(x_pos, movies)
# Plotting the bar graph
plt.bar(range(6), trek_fan[trek_fan.columns[3:9]].sum(), color = 'deeppink')
<BarContainer object of 6 artists>
The most viewed movies among both Star Wars and Star Trek fans are, in order, #5, #6 and #1. In absolute numbers, more Star Wars fans than Star Trek fans have seen each movie, although these are raw counts rather than shares of each group.
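The ordering can be confirmed directly from the counts. This sketch uses the totals printed above, copied by hand. The `relative` table is an added illustration (not part of the original analysis) that scales each group by its most-seen movie, so the viewing patterns can be compared without the raw group sizes getting in the way.

```python
import pandas as pd

index = [f"seen_{i}" for i in range(1, 7)]
seen = pd.DataFrame(
    {
        "wars_fan": [500, 463, 450, 483, 538, 537],
        "trek_fan": [364, 336, 322, 342, 397, 396],
    },
    index=index,
)

# Three most-seen movies per fanbase, most seen first
top3 = {
    group: seen[group].sort_values(ascending=False).index[:3].tolist()
    for group in seen.columns
}
print(top3)

# Views relative to the most-seen movie within each group (1.0 = most seen)
relative = seen / seen.max()
print(relative.round(2))
```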
Overall, this suggests no real difference between the fanbases in ranking, nor in the relationship between ranking and viewership.
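One way to put a number on the relationship between ranking and viewership is a Pearson correlation per fanbase. This is a hedged sketch that reuses the printed totals and means, typed in by hand. The DataFrame names `wars` and `trek` are illustrative, and with only six movies the coefficients are descriptive at best.

```python
import pandas as pd

movies = ["#1", "#2", "#3", "#4", "#5", "#6"]
wars = pd.DataFrame(
    {
        "seen": [500, 463, 450, 483, 538, 537],
        "mean_rank": [4.141304, 4.342391, 4.417423, 2.932971, 2.333333, 2.829710],
    },
    index=movies,
)
trek = pd.DataFrame(
    {
        "seen": [364, 336, 322, 342, 397, 396],
        "mean_rank": [3.968675, 4.255422, 4.403382, 3.110843, 2.407229, 2.850602],
    },
    index=movies,
)

# Series.corr defaults to Pearson. Lower rank = more favorable,
# so a negative value means more-seen movies are ranked more favorably.
r_wars = wars["seen"].corr(wars["mean_rank"])
r_trek = trek["seen"].corr(trek["mean_rank"])
print(f"Jedis: {r_wars:.2f}, trekkies: {r_trek:.2f}")
```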
In this project I cleaned and explored 538's Star Wars survey, in order to gain some insight into which Star Wars movie: (1) is the most highly rated in the franchise; (2) has been viewed the most; (3) is most popular across different fanbases; (4) is preferred by males and females.
Cleaning the dataset gave me a lot of insight into the workings of pandas and numpy. Those tools allowed me to change not only the names of columns, but also the types of values in specific columns, in order to use some Boolean magic to make my analysis easier. I also put a lot of work into the graphs, in order to properly display the exploratory and descriptive points I was making. I hope you, dear reader, learned something.
We've seen some interesting results, and also some rather logical ones, such as more Star Wars fans than Star Trek fans having seen the movies, and likewise more males than females. Among the more interesting results, in my view, was the ranking of the movies in the franchise, where the original trilogy comes out on top by quite a large margin. The difference in ranking between males and females was also quite interesting: females ranked the first and second movies more favorably than males did, and the original trilogy slightly less favorably.
Thank you for reading my analysis of FiveThirtyEight's Star Wars Survey.
/mhj