#!/usr/bin/env python
# coding: utf-8

# ## Guided Project: Star Wars Survey
#
# In this project, we address a question about the Star Wars series: **does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?**
#
# To answer it, we analyse the data collected by [FiveThirtyEight](http://fivethirtyeight.com/) after surveying Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository [here](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).
#
# ### Read data set and overview of the data

# In[39]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

get_ipython().run_line_magic('matplotlib', 'inline')
sns.set_style("whitegrid", {'axes.grid' : False})

star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
star_wars.columns


# The data has several columns, including:
#
# * `RespondentID` - An anonymized ID for the respondent (person taking the survey)
# * `Gender` - The respondent's gender
# * `Age` - The respondent's age
# * `Household Income` - The respondent's income
# * `Education` - The respondent's education level
# * `Location (Census Region)` - The respondent's location
# * `Have you seen any of the 6 films in the Star Wars franchise?` - Has a `Yes` or `No` response
# * `Do you consider yourself to be a fan of the Star Wars film franchise?` - Has a `Yes` or `No` response
# * `Which of the following Star Wars films have you seen? Please select all that apply.`
#
# We will now check for any strange values in the dataset.

# In[40]:


star_wars.head(5)


# The output of the cell above shows that `RespondentID` contains `NaN`, so we should clean this column before proceeding with our analysis.
# In[41]:


# keep only the rows that have a RespondentID; .copy() avoids a
# SettingWithCopyWarning when columns are reassigned later
star_wars = star_wars[pd.notnull(star_wars["RespondentID"])].copy()
star_wars.head(3)


# ### Cleaning and Mapping Yes/No columns
#
# Some of the columns represent `Yes/No` questions. Bear in mind that they can also contain `NaN` where a respondent chose not to answer a question. The columns in question are:
#
# * `Have you seen any of the 6 films in the Star Wars franchise?`
# * `Do you consider yourself to be a fan of the Star Wars film franchise?`
#
# Let's jump straight into cleaning these columns.

# In[42]:


# dictionary to define a mapping for each value in the series
# map value Yes to boolean value True and No to False
yes_no = {
    "Yes": True,
    "No": False
}

# function to map and convert column values to Boolean
def convert_to_bool(col):
    return col.map(yes_no)

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = convert_to_bool(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'])
print(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False))

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = convert_to_bool(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'])
print(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False))


# Both columns now hold `True` and `False` values, with `NaN` wherever a respondent did not answer.

# ### Cleaning and mapping checkbox columns
# If we check our column names, we can see that six columns represent a single checkbox question. The columns are:
#
# * `Which of the following Star Wars films have you seen? Please select all that apply.` - Whether or not the respondent saw `Star Wars: Episode I The Phantom Menace`.
# * `Unnamed: 4` - Whether or not the respondent saw `Star Wars: Episode II Attack of the Clones`.
# * `Unnamed: 5` - Whether or not the respondent saw `Star Wars: Episode III Revenge of the Sith`.
# * `Unnamed: 6` - Whether or not the respondent saw `Star Wars: Episode IV A New Hope`.
# * `Unnamed: 7` - Whether or not the respondent saw `Star Wars: Episode V The Empire Strikes Back`.
# * `Unnamed: 8` - Whether or not the respondent saw `Star Wars: Episode VI Return of the Jedi`.
#
# For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is `NaN`, the respondent either didn't answer or didn't see the movie. We'll convert each of these columns to a Boolean, then rename the column for sanity purposes. 🤓

# In[43]:


# mapping dictionary for movies
movie_mapping = {
    "Star Wars: Episode I The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}

# map values and convert to Boolean values.
# the slice 3:9 selects the columns at positions 3 to 8, i.e. the six columns in question
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)

# rename columns
star_wars = star_wars.rename(columns={
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6"
})

star_wars.head(2)


# ### Cleaning the ranking columns
#
# Next, we have columns that rank the Star Wars films from most favorite to least favorite, 1 being the most favorite and 6 being the least favorite.
# The following are the columns that rank the movies:
#
# * `Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.` - How much the respondent liked `Star Wars: Episode I The Phantom Menace`
# * `Unnamed: 10` - How much the respondent liked `Star Wars: Episode II Attack of the Clones`
# * `Unnamed: 11` - How much the respondent liked `Star Wars: Episode III Revenge of the Sith`
# * `Unnamed: 12` - How much the respondent liked `Star Wars: Episode IV A New Hope`
# * `Unnamed: 13` - How much the respondent liked `Star Wars: Episode V The Empire Strikes Back`
# * `Unnamed: 14` - How much the respondent liked `Star Wars: Episode VI Return of the Jedi`
#
# We'll convert each column to a numeric type and then rename the columns. These columns sit at positions 9 to 14, so the slice is `9:15`.

# In[44]:


# Convert each of the columns above to a float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

# rename the columns
star_wars = star_wars.rename(columns={
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
    "Unnamed: 10": "ranking_2",
    "Unnamed: 11": "ranking_3",
    "Unnamed: 12": "ranking_4",
    "Unnamed: 13": "ranking_5",
    "Unnamed: 14": "ranking_6"
})

star_wars.head(2)


# ### Finding the highest-ranked movie

# In[45]:


# find the highest-ranked movie by finding the mean of each ranking column
star_wars[star_wars.columns[9:15]].mean()


# In[46]:


# plot the mean values
star_wars[star_wars.columns[9:15]].mean().plot(kind='bar')
sns.despine()
plt.show()


# From the plot, `ranking_5` has the lowest mean rank, i.e. `Star Wars: Episode V The Empire Strikes Back` is the most favorite movie. Remember that the ranking values run from 1 through 6: 1 means the film was the most favorite, and 6 means it was the least favorite.
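
# Because a lower number means a better rank, the favorite film is the one with the *smallest* mean, not the largest. A minimal sketch of that reading, using a small hypothetical frame (`toy` and its values are made up for illustration and are not survey data):

```python
import pandas as pd

# Hypothetical rankings from three respondents (1 = favorite, 6 = least favorite).
toy = pd.DataFrame({
    "ranking_4": [2.0, 3.0, 1.0],
    "ranking_5": [1.0, 1.0, 2.0],
    "ranking_6": [3.0, 2.0, 3.0],
})

mean_rank = toy.mean()          # average rank per film
best_film = mean_rank.idxmin()  # smallest mean rank = most preferred film
print(mean_rank)
print("Most preferred:", best_film)
```

# On the real data, the same `idxmin()` call could be applied to `star_wars[star_wars.columns[9:15]].mean()` instead of reading the answer off the bar chart.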
# ### Finding the most viewed movie
#
# We have already cleaned up the seen columns and converted their values to the Boolean type. Now let's find the most viewed movie from the series.

# In[47]:


# the slice 3:9 selects the six seen columns (positions 3 to 8)
star_wars[star_wars.columns[3:9]].sum()


# In[48]:


# plot the values
star_wars[star_wars.columns[3:9]].sum().plot(kind='bar')
sns.despine()
plt.show()


# `seen_5`, i.e. `Star Wars: Episode V The Empire Strikes Back`, is the most viewed movie: more respondents watched it than any other movie in the Star Wars series. This may partly explain why it is also the highest-ranked movie.

# ### Exploring the data by binary segments
# Let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
#
# * `Do you consider yourself to be a fan of the Star Wars film franchise?` - True or False
# * `Do you consider yourself to be a fan of the Star Trek franchise?` - Yes or No
# * `Gender` - Male or Female
#
# We can compute the most viewed movie, the highest-ranked movie, and other statistics separately for each group.
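
# As an aside, per-segment statistics can also be computed in one pass with `groupby`, rather than by splitting the frame by hand. The sketch below is only an illustration under assumptions: `toy` is a hypothetical four-row frame, and the output names `views_5` and `mean_rank_5` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data; the real survey has many more rows and columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "seen_5": [True, True, True, False],
    "ranking_5": [1.0, 2.0, 1.0, 3.0],
})

# Named aggregation: one output column per (input column, function) pair.
by_gender = toy.groupby("Gender").agg(
    views_5=("seen_5", "sum"),          # number of respondents who saw the film
    mean_rank_5=("ranking_5", "mean"),  # average rank (lower is better)
)
print(by_gender)
```

# The cells that follow take the more explicit route of building separate `males` and `females` frames, which keeps each plot easy to follow.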
# In[49]:


# split the data into two groups based on gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

# Highest-ranked movie - Male respondents and plot the values
print("Highest-ranked movie - Male respondents \n\n", males[males.columns[9:15]].mean())
males[males.columns[9:15]].mean().plot(kind='bar', title="Movie ranking by Male respondents")
plt.show()

# Highest-ranked movie - Female respondents and plot the values
print("Highest-ranked movie - Female respondents \n\n", females[females.columns[9:15]].mean())
females[females.columns[9:15]].mean().plot(kind='bar', title="Movie ranking by Female respondents")
sns.despine()
plt.show()


# In[50]:


# Most viewed - Male and plot the values
print("Most viewed - Male respondents\n\n", males[males.columns[3:9]].sum())
males[males.columns[3:9]].sum().plot(kind='bar', title="Most viewed movie by Male respondents")
plt.show()

# Most viewed - Female and plot the values
print("Most viewed - Female respondents\n\n", females[females.columns[3:9]].sum())
females[females.columns[3:9]].sum().plot(kind='bar', title="Most viewed movie by Female respondents")
sns.despine()
plt.show()


# From the plots, episode 5 received the best ranking and the most views from both men and women. More men than women watched episodes 1-3, but men ranked those episodes lower than women did. Episodes 5 and 6 show the most views from both men and women.

# In[51]:


# analysis based on column -
# Do you consider yourself to be a fan of the Star Wars film franchise?
# rename the column in the male dataset we created in the previous step
male_fans = males.rename(columns={
    'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})

# treat NaN values (no answer) as False
print(male_fans['fan_or_not'].value_counts(dropna=False))
male_fans['fan_or_not'] = male_fans['fan_or_not'].fillna(False)
print('\nafter replacing NaN values\n', male_fans['fan_or_not'].value_counts(dropna=False, normalize=True))

# plot values
male_fans['fan_or_not'].value_counts(dropna=False, normalize=True).plot(kind='bar', title='Star Wars Fan or not - Male')
sns.despine()
plt.show()


# In[52]:


female_fans = females.rename(columns={
    'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})

# treat NaN values (no answer) as False
print(female_fans['fan_or_not'].value_counts(dropna=False))
female_fans['fan_or_not'] = female_fans['fan_or_not'].fillna(False)
print('\nafter replacing NaN values\n', female_fans['fan_or_not'].value_counts(dropna=False, normalize=True))

# plot values
female_fans['fan_or_not'].value_counts(dropna=False, normalize=True).plot(kind='bar', title='Star Wars Fan or not - Female')
sns.despine()
plt.show()


# In[53]:


# combined plot - Gender, fan_or_not
all_fans = star_wars.rename(columns={
    'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})

# considering NaN values as False
all_fans['fan_or_not'] = all_fans['fan_or_not'].fillna(False)

# group by columns in question and plot
fans_by_gender = all_fans.groupby(['Gender', 'fan_or_not']).size()
df = fans_by_gender.unstack()
df.plot(kind='bar', title="Are you a fan of Star Wars?")
sns.despine()


# The plots above suggest that, compared to women, more men consider themselves fans of the Star Wars series.

# ### Further analysis
#
# #### 1. Based on Education

# In[54]:


# check the values in the column
star_wars['Education'].value_counts()


# #### Ranking based on Education

# In[55]:


# rename the column
star_wars = star_wars.rename(columns={
    'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})

# create a pivot table (the default aggregation is the mean)
ranking_by_education = star_wars.pivot_table(index="Education", values=star_wars.columns[9:15])
print(ranking_by_education)

# plot the data
ranking_by_education.plot(kind='bar', title='Ranking by education', figsize=(20,10), fontsize=10)
sns.despine()
plt.show()

# sns heatmap
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranking_by_education, annot=True, linewidths=.5, ax=ax)
ax.set_title('Ranking by education')


# #### Views based on Education

# In[56]:


# the mean of a Boolean seen column is the share of respondents who saw the film
views_by_education = star_wars.pivot_table(index="Education", values=star_wars.columns[3:9])
print(views_by_education)

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_by_education*100, annot=True, fmt='.1f', linewidths=.5, ax=ax)
ax.set_title('Views by education')


# The data above shows that respondents with less than a high school education liked episode 5 the most, but only 43% of them watched it. In contrast, almost 78% of respondents with a bachelor's degree watched episode 5 and gave it an average ranking of 2.3.

# #### 2. Based on Location/Region
#

# In[57]:


# check values in location column
star_wars['Location (Census Region)'].value_counts()


# #### Ranking and views based on region

# In[58]:


ranking_by_location = star_wars.pivot_table(index="Location (Census Region)", values=star_wars.columns[9:15])

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranking_by_location, annot=True, linewidths=.5, ax=ax)
ax.set_title('Ranking by region')
plt.show()

# views by location
views_location = star_wars.pivot_table(index="Location (Census Region)", values=star_wars.columns[3:9])

# create a new figure: the previous axes were already rendered by plt.show()
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_location, annot=True, linewidths=.5, ax=ax)
ax.set_title('Views by region')
plt.show()


# From our analysis of the location data, respondents across all the regions gave episode 5 the best ranking. Approximately 82% of respondents from the East South Central region viewed episode 5. Taking a closer look, across the regions more than 50% of respondents watched episodes 1, 4, 5 and 6.
#
# #### 3. Response to `Which character shot first?`

# In[59]:


# check the values in the column
star_wars['Which character shot first?'].value_counts(dropna=False)


# In[60]:


# replace NaN values (no answer) by assigning the result back;
# calling fillna(inplace=True) on a selected column is unreliable in recent pandas
star_wars['Which character shot first?'] = star_wars['Which character shot first?'].fillna("I don't understand this question")
print(star_wars['Which character shot first?'].value_counts(normalize=True))
star_wars['Which character shot first?'].value_counts(normalize=True).plot(kind='bar', title='Who shot first - all respondents')
sns.despine()
plt.show()


# #### `Which character shot first?` : Response based on `Gender`

# In[61]:


# NaN values were already replaced in the previous cell
print("Who shot first? - all fans \n", star_wars['Which character shot first?'].value_counts())
grouped = star_wars.groupby(['Gender', 'Which character shot first?']).size()
df = grouped.unstack()
df.plot(kind='bar')
sns.despine()
# sns.set_style( {'axes.grid' : False})


# Both male and female respondents said `Han` shot first; however, more female respondents said they did not understand the question.

# #### 4. Views and Ranks based on age group

# In[62]:


star_wars['Age'].value_counts(normalize=True)


# In[63]:


# views by age
views_by_age = star_wars.pivot_table(index="Age", values=star_wars.columns[3:9])
print(views_by_age)

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_by_age*100, annot=True, fmt='.1f', linewidths=.5, ax=ax)
ax.set_title('Views by Age')


# We can see that each episode was watched by more than roughly 66% of respondents in the 18-29 age group, and 73.4% of them watched episode 5. The series was least watched by respondents above 60 years of age; even so, 62.5% of them watched episode 5, which is the highest share for any episode in this age group.
#
# More than 73% of respondents in the age groups 18-29, 30-44 and 45-60 watched episode 5, and episode 5 was the most viewed episode in the series across all respondents.

# In[64]:


# rankings by age
ranks_by_age = star_wars.pivot_table(index="Age", values=star_wars.columns[9:15])
print(ranks_by_age)

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranks_by_age, annot=True, fmt='.1f', linewidths=.5, ax=ax)
ax.set_title('Rankings by Age')


# Across all the given age ranges, the highest-ranked movie is episode 5, with an average ranking of about 2.5.
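
# A claim of the form "every group ranks the same film best" can be checked row by row instead of by eye. The sketch below is a hedged illustration only: `toy` is a hypothetical frame with made-up values, built so that the two age groups disagree.

```python
import pandas as pd

# Hypothetical rankings for two age groups (1 = favorite); not survey data.
toy = pd.DataFrame({
    "Age": ["18-29", "18-29", "> 60", "> 60"],
    "ranking_4": [2.0, 3.0, 1.0, 1.0],
    "ranking_5": [1.0, 1.0, 2.0, 2.0],
})

# Mean rank per age group (pivot_table aggregates with the mean by default).
ranks = toy.pivot_table(index="Age", values=["ranking_4", "ranking_5"])

# For each age group, the column with the smallest mean rank is the favorite.
best_per_age = ranks.idxmin(axis=1)
print(best_per_age)
```

# Applied to `ranks_by_age`, the same `idxmin(axis=1)` call would list the best-ranked film for each age range, which makes any exception easy to spot.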
# ### Conclusion
#
# We started our analysis of the survey data collected by [FiveThirtyEight](http://fivethirtyeight.com/) to answer the question `does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?`
#
# From our analysis of the 835 survey responses, respondents rank `Star Wars: The Empire Strikes Back` as the best of all the episodes in the Star Wars franchise. It was not only the most watched movie but also the episode with the best average ranking. We also found that, compared to women, more men were fans of the Star Wars movies.