#!/usr/bin/env python
# coding: utf-8

# # Exploring Star Wars
# ## Cleaning and exploring FiveThirtyEight's data set on Star Wars fans
#
# In this project, I will be cleaning and exploring a data set from FiveThirtyEight, in order to answer some questions about *Star Wars* fans.

# In[1]:

import pandas as pd
import numpy as np

# Reading the dataset into a dataframe
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

# Exploring the first 10 rows of the dataset
print(star_wars.head(10))

# Exploring column names in order to review them
star_wars.columns

# In[2]:

# Removing rows where RespondentID is missing (NaN); .copy() avoids chained-assignment warnings later on
star_wars = star_wars[pd.notnull(star_wars["RespondentID"])].copy()

# Checking the new star_wars dataframe with the cleaned RespondentID column
star_wars.head(10)

# Now the dataset 'star_wars' contains only rows where the column 'RespondentID' is not missing (not 'NaN').

# ## A bit of Boolean magic
# Let us change the values in the following two columns to Boolean values (True & False): *Have you seen any of the 6 films in the Star Wars franchise?* and *Do you consider yourself to be a fan of the Star Wars film franchise?*

# In[3]:

# Dictionary which defines the mapping we need for the two columns
boolean_map = {
    "Yes": True,
    "No": False
}

# Converting the two columns to contain Boolean values.
star_wars["Have you seen any of the 6 films in the Star Wars franchise?"] = star_wars["Have you seen any of the 6 films in the Star Wars franchise?"].map(boolean_map)
star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"] = star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"].map(boolean_map)

# Could also have used this little loop:
# boolean_map = {"Yes": True, "No": False}
# for col in [
#     "Have you seen any of the 6 films in the Star Wars franchise?",
#     "Do you consider yourself to be a fan of the Star Wars film franchise?"
# ]:
#     star_wars[col] = star_wars[col].map(boolean_map)

# Checking whether the changes took effect
star_wars.head(10)

# It seems like my magic worked. Now we have some Boolean types to work with.

# ## Checkbox columns
# The next six columns represent a single checkbox question, in which the respondent was asked: *Which of the following Star Wars films have you seen? Please select all that apply.*
#
# The columns, and possible checkbox answers are:
#
# * Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
# * Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
# * Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
# * Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
# * Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
# * Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.
#
# The values inside these columns are the names of the movies respondents checked off. Thus, if the respondent saw the movie, there will be a string with the title of the particular movie. However, if the respondent didn't see the movie, or didn't answer that particular checkbox, the value will be 'NaN'.
# I will convert each of these columns to contain solely Boolean values, following the same principle as before.
#
# After I have converted the values of the columns, I will rename the columns, in order to provide more intuitive names.
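# As a side note, the same conversion can be sketched without a mapping dictionary. The example below is a minimal sketch on a made-up toy dataframe (`toy` and `toy_bool` are hypothetical names, and the rows are not survey data). It assumes that any non-missing value in a checkbox column means the respondent ticked that film.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the checkbox columns (made-up rows, not the survey itself)
toy = pd.DataFrame({
    "seen_1": [
        "Star Wars: Episode I The Phantom Menace",
        np.nan,
        "Star Wars: Episode I The Phantom Menace",
    ],
    "seen_2": [
        np.nan,
        np.nan,
        "Star Wars: Episode II Attack of the Clones",
    ],
})

# A title string becomes True, a missing value becomes False
toy_bool = toy.notna()
print(toy_bool)
```

# Unlike an explicit dictionary, this treats every non-missing string as True, so it would not flag an unexpected or misspelled title.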
# In[4]:

# Dictionary which defines the mapping we need for the six columns
titles = {
    "Star Wars: Episode I The Phantom Menace": True,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    np.nan: False  # np.nan rather than np.NaN, which was removed in NumPy 2.0
}

# Converting the six columns to contain Boolean values.
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(titles)

# Checking whether the changes took effect
star_wars.head(10)

# We now have even more beautiful and easy to analyze columns to work with.

# In[5]:

# Renaming the columns
star_wars = star_wars.rename(columns = {
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6"
})

# Checking whether the changes took effect
star_wars.head(10)

# Now that I've converted the values into Boolean values, and changed the column names, the dataset will be much easier to work with, and more intuitive to analyze and understand.

# ## Converting the ranking columns to numeric type
# The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite. 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
#
# * Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.
#   - How much the respondent liked Star Wars: Episode I The Phantom Menace
# * Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
# * Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
# * Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
# * Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
# * Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi
#
# I will now convert each column to a numeric type, and then rename the columns so that they are more intuitive to work with.

# In[6]:

# Converting the columns to numeric type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

# Renaming the columns with more descriptive names
star_wars = star_wars.rename(columns = {
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
    "Unnamed: 10": "ranking_2",
    "Unnamed: 11": "ranking_3",
    "Unnamed: 12": "ranking_4",
    "Unnamed: 13": "ranking_5",
    "Unnamed: 14": "ranking_6"
})

# Checking whether the changes took effect
star_wars.iloc[:, 9:15].head(10)

# Look how good those columns look now. This is going to be great to work with, and much easier to analyze.

# ## Highest ranked movie
# Let us take a look at the means of each of the ranking columns, in order to determine which movie is ranked the highest in 538's survey.
#
# I will make a bar chart of the mean rankings in order to provide a better overview of the columns' values.
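# To make the idea of "highest ranked" concrete, here is a minimal sketch on made-up numbers (`toy_rankings`, `mean_rank` and `best` are hypothetical names, and the values are not survey results). Since 1 is the best possible rank, the favorite movie is the one with the lowest mean, which `idxmin` picks out.

```python
import numpy as np
import pandas as pd

# Made-up rankings for three movies from three respondents (not survey data)
toy_rankings = pd.DataFrame({
    "ranking_1": [3.0, 2.0, np.nan],
    "ranking_2": [2.0, 3.0, 3.0],
    "ranking_3": [1.0, 1.0, 2.0],
})

# mean() skips NaN by default, so unanswered rankings do not distort the average
mean_rank = toy_rankings.mean()

# Lowest mean rank = most preferred movie
best = mean_rank.idxmin()
print(mean_rank)
print("Best ranked column:", best)
```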
# In[7]:

# Computing the mean of each of the rankings
star_wars[star_wars.columns[9:15]].mean()

# In[8]:

# Importing matplotlib which we will need to plot the graph
import matplotlib.pyplot as plt

# Allowing plots to be displayed directly in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')

# Making a fancy bar chart for each ranking
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (1-6)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), star_wars[star_wars.columns[9:15]].mean(), color = 'blueviolet')

# I have made a graph which shows the mean ranking of each movie. The important point to note is that the lower the ranking is, the better the movie has been rated by the fans.
# Therefore, we can now see that the 5th movie, *The Empire Strikes Back*, is the highest rated movie. Moreover, we can see that the original trilogy ranks better, across the board, than the more recent prequels.

# ## Analyzing how many people have seen each movie
# Let us now turn our attention to the popularity of each movie, in terms of viewership.
# I will now examine the sum of the seen columns, and make another chart to display the data.
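# Summing a Boolean column works because True counts as 1 and False as 0. The sketch below uses a made-up toy dataframe (`toy_seen`, `counts` and `shares` are hypothetical names, not survey data) to show the sum next to the mean, which gives the share of respondents instead of the raw count.

```python
import pandas as pd

# Made-up "seen" answers from four respondents (not survey data)
toy_seen = pd.DataFrame({
    "seen_1": [True, False, True, True],
    "seen_2": [False, False, True, True],
})

# Number of respondents who have seen each movie
counts = toy_seen.sum()

# Share of respondents who have seen each movie
shares = toy_seen.mean()

print(counts)
print(shares)
```

# The share can be the more useful figure when comparing groups of different sizes, since a larger group will tend to have larger raw counts.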
# In[9]:

# Summing the columns
star_wars[star_wars.columns[3:9]].sum()

# In[10]:

# Making a fancy bar chart for each seen column
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum(), color = 'blueviolet')

# We can tell from the graph that the most viewed movie is also the most highly ranked movie: *The Empire Strikes Back*.
# The least viewed movie is the third movie, *Revenge of the Sith*.
# However, the first movie, *The Phantom Menace*, has been viewed more times than the fourth movie, *A New Hope*.
#
# For the fifth and the sixth movie, the correlation between viewership and ranking is positive.
# However, that is not the case for the fourth movie.
# Moreover, the prequels all show a correlation between the number of views and the rating.

# ## Segmentation
# Let us now look into the segments of the data, particularly the differences in response between the genders, and between Star Wars fans and Star Trek fans.
# I will now split the dataframe into two groups, in order to compare the mentioned segments.
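# Splitting the dataframe by hand is one way to compare segments. An alternative sketch, shown here on a made-up toy dataframe (`toy` and `by_gender` are hypothetical names, and the values are not survey results), uses `groupby` to compute the mean rankings for both genders in a single step.

```python
import pandas as pd

# Made-up rows with the same column names as the cleaned survey (not survey data)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "ranking_1": [4.0, 2.0, 6.0, 4.0],
    "ranking_5": [1.0, 2.0, 1.0, 4.0],
})

# One row per gender, one column per movie
by_gender = toy.groupby("Gender")[["ranking_1", "ranking_5"]].mean()
print(by_gender)
```

# By default `groupby` leaves out rows where the grouping column is missing, which matches filtering on `== "Male"` and `== "Female"`.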
# In[11]:

# Splitting the data by gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

# In[12]:

# Computing the mean of each of the rankings by males
males[males.columns[9:15]].mean()

# In[13]:

# Making a fancy bar chart for each ranking by males
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (males)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), males[males.columns[9:15]].mean(), color = 'steelblue')

# Let us redo the analysis for females as well before we draw any conclusions on ranking.

# In[14]:

# Computing the mean of each of the rankings by females
females[females.columns[9:15]].mean()

# In[15]:

# Making a fancy bar chart for each ranking by females
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (females)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), females[females.columns[9:15]].mean(), color = 'salmon')

# We can now see that the ranking of the movies differs between the genders.
# Females have rated the first and second movie higher than males have, and they have also rated the original trilogy slightly lower than males have.
#
# Let us look at the number of views, split by gender.
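# The chart cells in this notebook repeat the same formatting steps each time. A small helper could wrap them, as in the sketch below. `plot_movie_bars` is a hypothetical function of my own, not something from matplotlib, and the values in `toy_means` are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_movie_bars(values, title, ylabel, color):
    """Hypothetical helper: draw one bar per movie and return the Axes."""
    fig, ax = plt.subplots()
    positions = range(len(values))
    labels = [f"#{i}" for i in range(1, len(values) + 1)]
    ax.bar(positions, values, color=color)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels)
    ax.set_xlabel("Movie")
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return ax


# Made-up mean rankings for six movies (not survey results)
toy_means = pd.Series([3.7, 4.1, 4.3, 3.3, 2.5, 3.0])
ax = plot_movie_bars(toy_means, "Mean rankings (toy data)", "Mean ranking", "steelblue")
```

# With such a helper, each segment would only need one call with its own series, title and color.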
# In[16]:

# Summing the columns for males
males[males.columns[3:9]].sum()

# In[17]:

# Making a fancy bar chart for each seen column
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen (males)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), males[males.columns[3:9]].sum(), color = 'steelblue')

# Let us do the analysis for females as well before we dive into the details.

# In[18]:

# Summing the columns for females
females[females.columns[3:9]].sum()

# In[19]:

# Making a fancy bar chart for each seen column
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen (females)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), females[females.columns[3:9]].sum(), color = 'salmon')

# There doesn't seem to be any discernible difference in the relative popularity of the movies across the genders. However, females have in general viewed the movies less than males have.
#
# An interesting point here is that while the viewership pattern is relatively similar across genders, there does not seem to be a positive correlation between views and rating for females, whereas there is one for males.

# Let us take a look at the difference between the *Star Wars* fanbase, and the *Star Trek* fanbase.
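# The relationship between views and rating is read off the charts above. One way to put a number on it is a rank correlation, sketched below on made-up values (`views`, `mean_rank` and `spearman` are hypothetical names, and none of the numbers come from the survey). Ranking both series and taking the ordinary correlation of the ranks gives Spearman's coefficient without any extra dependency.

```python
import pandas as pd

# Made-up view counts and mean rankings for movies 1 to 6 (not survey results)
movies = [1, 2, 3, 4, 5, 6]
views = pd.Series([500, 400, 450, 600, 750, 700], index=movies)
mean_rank = pd.Series([3.8, 4.4, 4.2, 3.2, 2.4, 3.0], index=movies)

# Spearman correlation = Pearson correlation of the ranks
spearman = views.rank().corr(mean_rank.rank())
print(spearman)
```

# Because a lower mean ranking means a better-liked movie, "more views goes with a better rating" shows up as a negative coefficient here. With only six movies, any such number should be read with caution.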
# In[20]:

# Splitting the data by fanbase
wars_fan = star_wars[star_wars["Do you consider yourself to be a fan of the Star Wars film franchise?"] == True]
trek_fan = star_wars[star_wars["Do you consider yourself to be a fan of the Star Trek franchise?"] == "Yes"]

# In[21]:

# Computing the mean of each of the rankings by Jedis
wars_fan[wars_fan.columns[9:15]].mean()

# In[22]:

# Making a fancy bar chart for each ranking by Jedis
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (Jedis)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), wars_fan[wars_fan.columns[9:15]].mean(), color = 'seagreen')

# Seems like the Star Wars fans know what they like!
# Original trilogy all the way baby.

# In[23]:

# Computing the mean of each of the rankings by trekkies
trek_fan[trek_fan.columns[9:15]].mean()

# In[24]:

# Making a fancy bar chart for each ranking by trekkies
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("Rating")
plt.title("Mean rankings of each Star Wars movie (trekkies)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), trek_fan[trek_fan.columns[9:15]].mean(), color = 'deeppink')

# While it is a tiny difference, the Star Trek fans rank the prequels higher than their Star Wars fan counterparts.
# However, Star Trek fans rate *The Empire Strikes Back* almost as highly as Star Wars fans do.
# Overall, a very similar rating of the movies.

# Let us take a look at the viewership between the fanbases.
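# Small differences between two groups are easier to judge side by side than across two separate charts. The sketch below lines up two series of mean rankings and subtracts them. All names (`wars_means`, `trek_means`, `comparison`) are hypothetical and the numbers are made up, not taken from the survey.

```python
import pandas as pd

# Made-up mean rankings for two movies in each fan group (not survey results)
wars_means = pd.Series({"ranking_1": 4.0, "ranking_5": 2.0})
trek_means = pd.Series({"ranking_1": 3.5, "ranking_5": 2.5})

# One column per group, aligned on the movie name
comparison = pd.DataFrame({"wars_fan": wars_means, "trek_fan": trek_means})

# Negative difference: the Star Trek group ranks that movie better (lower) than the Star Wars group
comparison["difference"] = comparison["trek_fan"] - comparison["wars_fan"]
print(comparison)
```

# Note that the two filters in the notebook are applied independently, so a respondent who is a fan of both franchises appears in both groups.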
# In[25]:

# Summing the seen columns for Jedis
wars_fan[wars_fan.columns[3:9]].sum()

# In[26]:

# Making a fancy bar chart for each seen column by Jedis
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen (Jedis)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), wars_fan[wars_fan.columns[3:9]].sum(), color = 'seagreen')

# And for trekkies:

# In[27]:

# Summing the seen columns for trekkies
trek_fan[trek_fan.columns[3:9]].sum()

# In[28]:

# Making a fancy bar chart for each seen column by trekkies
plt.style.use('ggplot')

movies = ['#1', '#2', '#3', '#4', '#5', '#6']

# Formatting the graph
x_pos = [i for i, _ in enumerate(movies)]
plt.xlabel("Movie")
plt.ylabel("seen")
plt.title("How many times each Star Wars movie has been seen (trekkies)")
plt.xticks(x_pos, movies)

# Plotting the bar graph
plt.bar(range(6), trek_fan[trek_fan.columns[3:9]].sum(), color = 'deeppink')

# The most viewed movies among both Star Wars and Star Trek fans are, in order: #5, #6, #1.
# However, Star Wars fans have seen the movies more than Star Trek fans have.
#
# This suggests no real difference in rating, nor in the correlation between rating and viewership, between the fanbases.

# ## Conclusion
# In this project I cleaned and explored 538's Star Wars survey, in order to gain some insight into which Star Wars movie: (1) is the most highly rated in the franchise; (2) has been viewed the most; (3) is most popular across different fanbases; (4) is preferred by males and females.
#
# The cleaning of the dataset provided us with a lot of insights into the workings of pandas and numpy.
# Those tools allowed me to change not only the names of columns, but also the types of values in specific columns, in order to use some Boolean magic to make my analysis easier.
# I also put a lot of work into the graphs I made use of, in order to properly display the exploratory and descriptive points I was making.
# I hope you, dear reader, learned something.
#
# We've seen some interesting results, and also some rather logical results, such as Star Wars fans having seen the movies more than Star Trek fans, and males more than females. Among the interesting results, I would particularly highlight the rating of the movies in the franchise, where the original trilogy comes out on top by quite a large margin. Moreover, the difference in rating between males and females was also quite interesting: females ranked the first two prequels higher, and the original trilogy slightly lower, than males did.
#
#
# Thank you for reading my analysis of FiveThirtyEight's Star Wars Survey.
#
# /mhj