#!/usr/bin/env python
# coding: utf-8

# # The Best and the Worst of the Star Wars Film Franchise (so far)
# ___
# 
# A decade after the release of Star Wars: Revenge of the Sith, Disney sought to build upon the main storyline of this classic sci-fi series with its seventh installment, Star Wars: The Force Awakens. In anticipation of the movie, [FiveThirtyEight](https://fivethirtyeight.com/) conducted an online survey through SurveyMonkey to assess America's sentiments towards the series so far. The survey was limited to the six main films and did not include other media such as comic books, television series, etc.  
# 
# The data collected from the survey can be downloaded from their [GitHub repository](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey). 
# 
# Based on the survey results, we will try to determine how the respondents ranked the series' first six films and its pivotal characters. 
# 
# ### Results
# 
# Among the first six films of the franchise, *The Empire Strikes Back* ranked at the top as both the most viewed and the best movie. Among the characters, Luke Skywalker was the most favored while Jar Jar Binks was the least favored.  

# # Cleaning Our Data Set
# ___
# 
# Let us familiarize ourselves with some of the columns we will be working with. 
# 
# 
# |Column Name|Description
# |:--------|:----------
# |`RespondentID`|An anonymized ID for the respondent (person taking the survey)
# |`Gender`|The respondent's gender
# |`Age`|The respondent's age
# |`Household Income`|The respondent's income
# |`Education`|The respondent's education level
# |`Location (Census Region)`|The respondent's location
# |`Have you seen any of the 6 films in the Star Wars franchise?`|Has a Yes or No response
# |`Do you consider yourself to be a fan of the Star Wars film franchise?`|Has a Yes or No response
# 
# 
# ## Retaining Valid Respondents
# 
# After reading the dataset into a `pandas` DataFrame, we notice that the first row appears to contain an invalid response, as indicated by the `NaN` value under the `RespondentID` column. A valid response needs to have a unique identifier. Checking the contents of this row in the other columns, we quickly realize that the first row actually contains the answer options or a prompt for each question. The actual answers begin in the second row. 

# In[1]:


import pandas as pd

star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')

pd.options.display.max_columns = 50 # To show all columns of the data set
display(star_wars.head())
star_wars.shape


# We will store the first row under `options` as reference.

# In[2]:


options = star_wars.iloc[0:1,:]
options


# Although we haven't checked the entire dataset, it is possible to have other `NaN` values for `RespondentID`. We will use the `pandas.Series.notnull()` method to filter the dataset to contain only rows with valid ID values. 

# In[3]:


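# .copy() gives an independent DataFrame, so the in-place edits in later cells do not act on a slice of the original
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()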
star_wars.shape


# Comparing the output of the `DataFrame.shape` attribute before and after filtering, the row count dropped from 1187 to 1186, so only the first row contained an invalid ID value. 
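#
# As a minimal sketch of the filtering step above, the toy DataFrame below stands in for the survey file. The ID values and the names `toy_survey` and `toy_valid` are made up for illustration; only the `RespondentID` column name comes from the dataset. The first row mimics the options row, which has no ID.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey file: the first row holds answer options, not a response
toy_survey = pd.DataFrame({
    'RespondentID': [np.nan, 1001.0, 1002.0],  # Hypothetical IDs
    'Gender': ['Response', 'Male', np.nan],
})

# Keep only the rows that carry an ID
toy_valid = toy_survey[toy_survey['RespondentID'].notnull()]

rows_dropped = len(toy_survey) - len(toy_valid)
print(toy_survey.shape, toy_valid.shape, rows_dropped)
```

# On the toy data exactly one row is dropped, which is the same pattern as the single-row difference reported by `DataFrame.shape` above.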

# ## Yes/No Columns
# 
# To make our analysis easier, we will convert the Yes or No responses into Boolean values by replacing Yes with `True` and No with `False`. `NaN` values will remain as such. Boolean values make it easier for us to compute aggregate values through methods such as `Series.mean()` and `Series.sum()`.
# 
# Below are the names of the columns that we will perform our value substitution on. We will access them by position to avoid typing out their long names. 

# In[4]:


star_wars.columns[[1,2,30,31,32]]


# Let's see first how many Yes / No values each column contains. 

# In[5]:


for n in [1,2,30,31,32]:
    display(star_wars.iloc[:,n].value_counts(dropna=False))


# We create a dictionary, `yes_no`, to contain the substitute values. This dictionary is passed to the `Series.map()` method, which takes care of the substitution. Since `NaN` is not a key in the dictionary, missing values are retained as such. The respondents may have been given the option to not answer the question, hence the missing values. 

# In[6]:


# Substitute Yes and No values with True and False, respectively
yes_no = {'Yes': True, 'No': False}

for n in [1,2,30,31,32]:
    star_wars.iloc[:,n] = star_wars.iloc[:,n].map(yes_no)


# We confirm our changes below. 

# In[7]:


for n in [1,2,30,31,32]:
    display(star_wars.iloc[:,n].value_counts(dropna=False))


# In[8]:


display(star_wars.head())
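
# The substitution above can be illustrated on a handful of made-up answers. This is only a sketch: `toy_answers` and `toy_yes_no` are hypothetical names, and dropping the missing answers before aggregating is one possible choice, not necessarily what the later cells do.

```python
import numpy as np
import pandas as pd

toy_answers = pd.Series(['Yes', 'No', np.nan, 'Yes'])  # Made-up answers
toy_yes_no = {'Yes': True, 'No': False}

# Values that are not keys of the dictionary (here NaN) come back as NaN
toy_mapped = toy_answers.map(toy_yes_no)

# Drop the unanswered rows, then treat True as 1 and False as 0
toy_answered = toy_mapped.dropna().astype(bool)
yes_count = int(toy_answered.sum())     # Number of Yes answers
yes_share = float(toy_answered.mean())  # Share of Yes among those who answered

print(toy_mapped.tolist(), yes_count, yes_share)
```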


# ## Checkbox columns

# The columns with indexes 3 to 8 of our data set indicate whether a respondent has seen a particular Star Wars film. From the last sentence of the question, "Please select all that apply" (column with index 3), we can deduce that each film title was listed beside a checkbox. If a box was ticked, the corresponding film's title appears in our dataset. Otherwise, there is a missing value (`NaN`), indicating that the respondent has not seen the film. 
# 
# Similar to the Yes or No columns, we will replace the values with `True` or `False`. 
# 
# Below, the names of the movies are taken from `options` and stored as a list in `movie_names`.

# In[9]:


movie_names = list(options.iloc[0,3:9])

movie_names


# We create a function, `replace()`, which changes a cell's content to `True` if it contains a film title and to `False` otherwise. This function is passed to the `DataFrame.applymap()` method to work on the relevant columns. 

# In[10]:


# Create a function that replaces a movie title with True and other values with False
def replace(movie):
    return movie in movie_names # NaN is not in the list, so unticked boxes become False

star_wars.iloc[:,3:9] = star_wars.iloc[:,3:9].applymap(replace) # Apply the function to the appropriate columns
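
# Because a checkbox cell holds either a title or `NaN`, the same conversion can be sketched without a helper function. The example below uses `DataFrame.notnull()` on made-up data and assumes that a cell never contains anything other than a listed title or a missing value. The names `toy_checkbox` and `toy_seen_flags` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up checkbox answers: a title when the box was ticked, NaN when it was not
toy_checkbox = pd.DataFrame({
    'film_a': ['Film A', np.nan, 'Film A'],
    'film_b': [np.nan, np.nan, 'Film B'],
})

# True where a title is present, False where the cell is empty
toy_seen_flags = toy_checkbox.notnull()

print(toy_seen_flags)
```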


# We will also change the column names to something more understandable and intuitive. The current column names are stored in `seen_cols`.

# In[11]:


# Store the current column names as a list in seen_cols
seen_cols = star_wars.columns[3:9]

seen_cols


# To replace our column names, we need to create a dictionary, which will contain the current names as the keys and the replacement names as the corresponding values. 

# In[12]:


# Create a dictionary to contain replacement names for the checkbox (seen movies) columns
seen_dict = {}

for n, seen in enumerate(seen_cols, start=1): # Number the films from 1 to 6
    seen_dict[seen] = 'seen_{0}'.format(n)
    
seen_dict


# We then use the `DataFrame.rename()` method to replace our column names. 

# In[13]:


star_wars.rename(columns=seen_dict, inplace=True)


# Our changes are confirmed below. 

# In[14]:


star_wars.head()


# ## Ranking Columns

# The next six columns after the checkbox columns let the respondent rank the movies according to preference. Since there are six movies, each movie is ranked on a scale of 1-6, with 1 being the highest.
# 
# We see that each column is of type `object`.

# In[15]:


star_wars[star_wars.columns[9:15]].dtypes


# We will convert them to a numeric type and then rename the columns to something more intuitive like we did with the checkbox columns. 

# In[16]:


# Convert the ranking columns to a numeric type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float) 

# Store the column names as a list in rank_cols
rank_cols = star_wars.columns[9:15]

rank_cols
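
# To show why the conversion matters, here is a small sketch on made-up rankings stored as text. It assumes, as in the survey file, that the rankings arrive as strings with `NaN` for missing answers. The names `toy_rank_text` and `toy_rank_floats` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rankings read in as text, with one respondent who gave no ranking
toy_rank_text = pd.DataFrame(
    {'rank_a': ['3', '1', np.nan], 'rank_b': ['1', '2', np.nan]},
    dtype=object,
)

# After the conversion the columns are numeric, and NaN stays NaN
toy_rank_floats = toy_rank_text.astype(float)

# Numeric columns support aggregates such as the mean rank per film
toy_mean_ranks = toy_rank_floats.mean()
print(toy_rank_floats.dtypes)
print(toy_mean_ranks)
```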


# As with the checkbox columns, we conduct a similar process of creating a dictionary and using the `DataFrame.rename()` method to replace the ranking column names. 
# 
# The results of our changes are displayed after. 

# In[17]:


# Create a dictionary to contain replacement names for the ranking columns
rank_dict = {}
for n, rank in enumerate(rank_cols, start=1): # Number the films from 1 to 6
    rank_dict[rank] = 'ranking_{}'.format(n)
    

# Replace the column names of the ranking columns
star_wars.rename(columns=rank_dict, inplace=True)

star_wars.head()


# ## Character Columns
# 
# Reviewing `options`, we will notice that there are several columns with character names, from Han Solo to Yoda.

# In[18]:


options


# Each of these characters was rated in terms of favorability. We can see the list of possible answers for these columns below. 

# In[19]:


star_wars.iloc[:,15].value_counts(dropna=False)


# For easier visualization, we can divide these ratings into four parts:
# 
# 1. Favorable
# 2. Unfavorable
# 3. Neutral
# 4. Unfamiliar
# 
# We will group the answers into these four categories and convert them to the corresponding numbers above to aid our analysis. 
# 
# Before doing so, let us inspect the reason behind the presence of `NaN` values. These values may be coming from those respondents who have not seen any of the Star Wars films. Let's confirm this below. 

# In[20]:


# Select only rows for respondents who have NOT seen any film
not_seen = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == False, star_wars.columns[15:29]]

not_seen.isnull().describe() # To show the amount of NaN values per column 


# As seen above, none of the 250 respondents who answered *No* to the first question gave a favorability rating to any of the characters, as indicated by a frequency of 250 for all columns. This frequency represents the number of times a `NaN` value appears per column.  
# 
# However, upon further inspection, we will see that `NaN` values appear even for respondents who have seen ANY of the films.  

# In[21]:


# Select only rows for respondents who have SEEN any film
seen = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == True, star_wars.columns[15:29]]


# In[22]:


seen.isnull().describe() # To show the amount of NaN values per column 


# For those who have seen ANY of the films, we can see that even though most of the values are not `NaN`, this value still appears in varying frequencies per column. Since we will not include respondents who have NOT seen any film in our analysis, we can assume that `NaN` values indicate that a respondent was unfamiliar with the character. 
# 
# Below, we create a dictionary for the replacement values and then use the `Series.map()` method within a loop to do the appropriate substitution.

# In[23]:


import numpy as np

# Dictionary containing replacement values for favorability ratings
favor_ratings = {
                'Very favorably' : 1,
                'Somewhat favorably' : 1,
                'Neither favorably nor unfavorably (neutral)' : 3,
                'Somewhat unfavorably' : 2,
                'Very unfavorably' : 2,
                'Unfamiliar (N/A)' : 4, 
                np.nan : 4
                }

# Replace favorability ratings with the corresponding numeric code (NaN becomes 4, Unfamiliar)
for n in range(15,29):
    star_wars.iloc[:,n] = star_wars.iloc[:,n].map(favor_ratings)
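
# The grouping above can be checked on a few made-up answers. The sketch below is a variant rather than the cell's exact method: it leaves `NaN` out of the dictionary and fills the missing codes afterwards with `Series.fillna()`, which should produce the same four codes. The names `toy_codes`, `toy_raw` and `toy_coded` are hypothetical.

```python
import numpy as np
import pandas as pd

toy_codes = {
    'Very favorably': 1, 'Somewhat favorably': 1,        # Favorable
    'Somewhat unfavorably': 2, 'Very unfavorably': 2,    # Unfavorable
    'Neither favorably nor unfavorably (neutral)': 3,    # Neutral
    'Unfamiliar (N/A)': 4,                               # Unfamiliar
}

# Made-up answers for one character, including a skipped question
toy_raw = pd.Series([
    'Very favorably',
    'Somewhat unfavorably',
    np.nan,
    'Unfamiliar (N/A)',
    'Neither favorably nor unfavorably (neutral)',
])

# Map the answers, then treat anything still missing as Unfamiliar (4)
toy_coded = toy_raw.map(toy_codes).fillna(4).astype(int)
print(toy_coded.tolist())
```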


# We will also convert the column names to the actual names of the characters. 

# In[24]:


# Dictionary containing replacement names for character column names
char_dict = {
            'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'Han Solo',
            'Unnamed: 16': 'Luke Skywalker',
            'Unnamed: 17': 'Princess Leia Organa',
            'Unnamed: 18': 'Anakin Skywalker',
            'Unnamed: 19': 'Obi Wan Kenobi',
            'Unnamed: 20': 'Emperor Palpatine',
            'Unnamed: 21': 'Darth Vader',
            'Unnamed: 22': 'Lando Calrissian',
            'Unnamed: 23': 'Boba Fett',
            'Unnamed: 24': 'C-3P0',
            'Unnamed: 25': 'R2 D2',
            'Unnamed: 26': 'Jar Jar Binks',
            'Unnamed: 27': 'Padme Amidala',
            'Unnamed: 28': 'Yoda'
            }

# Replace column names of character columns
star_wars.rename(columns=char_dict, inplace=True)


# The first rows of our final dataset are displayed below. 

# In[25]:


star_wars.head()


# # Most Viewed Movie
# ___

# Now that we have cleaned our dataset, let us first figure out which among the six films was viewed the most. Recall that the titles were assigned to `movie_names`. 

# In[26]:


movie_names


# In our visualization, we will only include the subtitles to avoid cluttering it with text. The subtitle of each movie appears after the Roman numeral (I, II, III, etc.). 
# 
# We will first convert `movie_names` to a Series object, assign it to `titles`, then utilize a Regular Expression to extract the subtitles. These subtitles will then be assigned to the aptly named variable, `subtitles`. 

# In[27]:


import re

pattern = r'Star Wars: Episode .{1,3} +(.+)' # Regular expression to extract the subtitles

titles = pd.Series(movie_names)
subtitles = titles.str.extract(pattern)
subtitles
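
# To see how the pattern behaves, the sketch below applies it to two titles with the standard `re` module. The exact spacing of the titles is an assumption (one with two spaces after the numeral, one with a single space), and `demo_pattern`, `demo_titles` and `demo_subtitles` are hypothetical names.

```python
import re

# Same pattern as above: skip the numeral, skip the spaces, capture the rest
demo_pattern = r'Star Wars: Episode .{1,3} +(.+)'

demo_titles = [
    'Star Wars: Episode I  The Phantom Menace',      # Two spaces after the numeral (assumed)
    'Star Wars: Episode V The Empire Strikes Back',  # One space after the numeral (assumed)
]

# group(1) is the text captured by the parentheses in the pattern
demo_subtitles = [re.match(demo_pattern, title).group(1) for title in demo_titles]
print(demo_subtitles)
```

# The ` +` part of the pattern requires at least one space before the subtitle, which is why both spacings yield a clean subtitle without leading whitespace.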


# The first question of the survey asks whether the respondent has seen any of the six Star Wars films. It makes sense to include in our analysis only those who responded `Yes` to this question. Below, we see that 936 respondents meet this criterion. 

# In[28]:


star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna = False)


# For each film, we will figure out how many of the 936 respondents have seen it. Note that a respondent may have seen all, a few, or just one of the films. This means that each film can have a maximum of 936 `True` values. The film with the highest number of respondents (`True` values) will be considered as the most viewed film among the six. 
# 
# Below, we filter the dataset to include only the 936 respondents and the checkbox columns we previously cleaned. We then use the `DataFrame.T` attribute to make the rows into columns and vice versa. This will allow us to make a new column containing the sum of `True` values for each film. The transposed DataFrame is stored in `seen_any`. 

# In[29]:


seen_any = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == True, star_wars.columns[3:9]].T
seen_any


# We will create a new column, `seen_count`, to contain the sum of the `True` values for each film. The percentages of these sum values are then computed and placed under the `perc` column. 

# In[30]:


seen_any['seen_count'] = seen_any.sum(axis = 1) 
seen_any['perc'] = (seen_any['seen_count'] / 936) * 100 # The denominator 936 is the number of respondents who have seen ANY film
seen_any['perc'] = round(seen_any['perc'])
seen_any_df = seen_any[['seen_count', 'perc']]


# In[31]:


seen_any_df # New DataFrame to be used for the visualization. 
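
# The count and percentage computation can also be sketched without transposing or typing the denominator by hand. The example below uses made-up data and assumes the checkbox columns have a Boolean dtype, so that `DataFrame.mean()` gives the share of `True` values directly. The names `toy_seen`, `toy_seen_counts` and `toy_seen_perc` are hypothetical.

```python
import pandas as pd

# Made-up checkbox columns for four respondents
toy_seen = pd.DataFrame({
    'seen_1': [True, False, True, True],
    'seen_2': [False, False, True, True],
})

toy_seen_counts = toy_seen.sum()                 # Number of True values per film
toy_seen_perc = (toy_seen.mean() * 100).round()  # Share of respondents per film, in percent

print(toy_seen_counts.tolist(), toy_seen_perc.tolist())
```

# Taking the mean ties the denominator to the number of rows in the filtered data, so the percentages stay correct if the filter changes.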


# Now that we have computed, for each film, how many of the 936 respondents have seen it, we can visualize these values through a horizontal bar graph. For a more intuitive approach, we will plot our values in terms of percentage.  
# 
# We can check the available preset graph styles of `matplotlib` by importing its `style` module and accessing `style.available`. 

# In[32]:


import matplotlib.pyplot as plt
import matplotlib.style as style

get_ipython().run_line_magic('matplotlib', 'inline')

style.available


# We will use the `fivethirtyeight` style to imitate [FiveThirtyEight](https://fivethirtyeight.com/)'s graphs. The code block below is used to construct our bar graph. 

# In[33]:


get_ipython().run_line_magic('config', "InlineBackend.figure_format ='retina' # For higher resolution graphs")

# Set essential graph attributes
style.use('fivethirtyeight') # This style will apply to all our succeeding graphs

seen_graph = seen_any_df['perc'].plot.barh(figsize = (7,5), legend = False, color = '#1F77B4', width = 0.7)
seen_graph.tick_params(axis = 'both', which = 'both', labelsize = 16, labelbottom = False)
seen_graph.grid(False)
seen_graph.set_yticklabels(subtitles[0], alpha = 0.8)
seen_graph.set_xlim(left = -2) # Gives space between bars and the y-tick labels
seen_graph.get_children()[4].set_color('#D62728') # Sets a different color for 'The Empire Strikes Back'
seen_graph.invert_yaxis() # Orders the films from top (part 1) to bottom (part 6)

# Label each bar's value in percent 
seen_graph.text(x = 74.5, y = 0.1, s = '72%', alpha = 0.8)
seen_graph.text(x = 64, y = 1.1, s = '61', alpha = 0.8)
seen_graph.text(x = 62, y = 2.1, s = '59', alpha = 0.8)
seen_graph.text(x = 68, y = 3.1, s = '65', alpha = 0.8)
seen_graph.text(x = 84.2, y = 4.1, s = '81  ', alpha = 0.8)
seen_graph.text(x = 81.5, y = 5.1, s = '79', alpha = 0.8)

# Place the heading and subheading
seen_graph.text(x = -43.5, y = -1.6, s = ' The Most Viewed Star Wars Movie', fontsize = 27, weight = 'bold', alpha = 0.75)
seen_graph.text(x = -43, y = -1, s = ' Out of 936 respondents who have seen ANY of the films', fontsize = 18)


# From the graph above, we can see that *The Empire Strikes Back* was the most viewed movie at 81%, followed by *Return of the Jedi* at 79%. 
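#
# The bar labels in the graph were positioned by hand. As a sketch of an alternative, the helper below derives the label text and position from the plotted values. The function name `bar_label_specs` and its default offsets are hypothetical, chosen only for illustration.

```python
def bar_label_specs(values, x_offset=2.5, y_offset=0.1):
    """Return one (x, y, text) tuple per horizontal bar."""
    specs = []
    for position, value in enumerate(values):
        text = '{:.0f}'.format(value)
        if position == 0:
            text += '%'  # Only the first bar carries the percent sign
        # Place the label slightly to the right of the bar end
        specs.append((value + x_offset, position + y_offset, text))
    return specs

demo_specs = bar_label_specs([72, 61, 59])
print(demo_specs)
```

# Each tuple could then be passed to `Axes.text()` as its `x`, `y` and `s` arguments, in the same way as the hand-written calls.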

# # Highest Ranked Movie
# ___

# The next thing we want to find out is which of the films the respondents consider the best. For this, we will have to be more selective about our respondents. A good criterion would be to select only those who have seen ALL of the films, as they are better placed to rank the films from best to worst. 
# 
# To determine if a respondent has seen ALL six films, we will create a new column, `seen_all`. With an `axis` value of 1, the `DataFrame.all()` method will be used on the checkbox (seen) columns to check whether all cells in a particular row contain `True`. If this is the case, `True` is stored in `seen_all`. Otherwise, `False` is stored. 

# In[34]:


star_wars['seen_all'] = star_wars.iloc[:,3:9].all(axis = 1)
star_wars['seen_all'].sum()


# In[35]:


star_wars.iloc[:,[3,4,5,6,7,8,38]].head()


# Above, we see that there are 471 respondents who have seen ALL films.
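#
# As a quick illustration of the row-wise check, the sketch below applies `DataFrame.all()` with `axis=1` to made-up checkbox data for two films. The names `toy_checks` and `toy_n_seen_all` are hypothetical.

```python
import pandas as pd

# Made-up checkbox columns for three respondents
toy_checks = pd.DataFrame({
    'seen_1': [True, True, False],
    'seen_2': [True, False, False],
})

# True only for rows in which every checkbox column is True
toy_checks['seen_all'] = toy_checks[['seen_1', 'seen_2']].all(axis=1)

toy_n_seen_all = int(toy_checks['seen_all'].sum())
print(toy_checks)
print(toy_n_seen_all)
```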
# 
# As we have observed, the films were ranked on a scale of 1-6, with 1 being the highest. To determine the highest ranked film, we will count how many times each film was ranked number 1 and compare the results. Just as with our previous query, we will transpose the DataFrame and create a new column to contain the results. This time, we will use our more selective criteria for the respondents. 

# In[36]:


ranks_seen_all = star_wars.loc[star_wars['seen_all'] == True, star_wars.columns[9:15]].T
ranks_seen_all


# In `ranks_seen_all`, we will create a new column `one_count` to contain the number of times each film was ranked first. Since each film is ranked uniquely by each respondent, we can simply divide each `one_count` value by the column's sum (equal to 471) to determine its percentage. The results are assigned to the `perc` column. 

# In[37]:


ranks_seen_all['one_count'] = (ranks_seen_all == 1).sum(axis = 1)
ranks_seen_all['perc'] = (ranks_seen_all['one_count'] / ranks_seen_all['one_count'].sum()) * 100 
ranks_seen_all['perc'] = round(ranks_seen_all['perc'])
ranks_one_df = ranks_seen_all[['one_count', 'perc']]


# In[38]:


ranks_one_df # New DataFrame to be used for the visualization.
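
# The first-place count can be sketched on made-up rankings of three films. Each row is a complete ranking, so the first-place counts add up to the number of respondents, which is the property the percentage computation above relies on. The names `toy_ranks`, `toy_first_place` and `toy_first_share` are hypothetical.

```python
import pandas as pd

# Made-up rankings: every row uses each of the ranks 1, 2 and 3 exactly once
toy_ranks = pd.DataFrame({
    'ranking_1': [2.0, 3.0, 1.0, 2.0],
    'ranking_2': [1.0, 1.0, 3.0, 3.0],
    'ranking_3': [3.0, 2.0, 2.0, 1.0],
})

toy_first_place = (toy_ranks == 1).sum()                          # Times each film was ranked first
toy_first_share = toy_first_place / toy_first_place.sum() * 100   # Share of first places, in percent

print(toy_first_place.tolist(), toy_first_share.tolist())
```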


# With the `ranks_one_df` DataFrame ready, we use the code block below to visualize our findings. 

# In[39]:


# Set essential graph attributes
ranks_graph = ranks_one_df['perc'].plot.barh(figsize = (7,5), legend = False, color = '#1F77B4', width = 0.7)
ranks_graph.tick_params(axis = 'both', which = 'both', labelsize = 16, labelbottom = False)
ranks_graph.grid(False)
ranks_graph.set_yticklabels(subtitles[0], alpha = 0.8)
ranks_graph.set_xlim(left = -1) # Gives space between bars and the y-tick labels
ranks_graph.get_children()[4].set_color('#D62728') # Sets a different color for 'The Empire Strikes Back'
ranks_graph.invert_yaxis() # Orders the films from top (part 1) to bottom (part 6)

# Label each bar's value in percent 
ranks_graph.text(x = 11, y = 0.1, s = '10%', alpha = 0.8)
ranks_graph.text(x = 5, y = 1.1, s = '4', alpha = 0.8)
ranks_graph.text(x = 7, y = 2.1, s = '6', alpha = 0.8)
ranks_graph.text(x = 28.2, y = 3.1, s = '27', alpha = 0.8)
ranks_graph.text(x = 37, y = 4.1, s = '36  ', alpha = 0.8)
ranks_graph.text(x = 18.4, y = 5.1, s = '17', alpha = 0.8)

# Place the heading and subheading
ranks_graph.text(x = -19.6, y = -1.5, s = ' The Best Star Wars Movie', fontsize = 27, weight = 'bold', alpha = 0.75)
ranks_graph.text(x = -19.2, y = -0.9, s = ' According to 471 respondents who have seen ALL films', fontsize = 18)


# We can see that *The Empire Strikes Back* was the highest ranked at 36%, followed by *A New Hope* at 27%. 
# 
# Even though *The Empire Strikes Back* was the most viewed and was considered the best among the six, these two factors are not necessarily correlated as the other five films fall in different orders when both graphs are compared. For example, *Revenge of the Sith* was the least viewed but is ranked 5th in the graph above. 

# # Character Ratings 
# ___

# Finally, we move on to how the respondents felt about a number of the franchise's characters in terms of favorability. We will select the answers of those respondents who have seen ANY of the films (936 respondents). 
# 
# Below, we filter the dataset according to this criterion, select the character columns, and then transpose the filtered DataFrame. We assign the result to `all_chars`. 

# In[40]:


all_chars = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == True , star_wars.columns[15:29]].T
all_chars


# Recall that we made the following divisions for the ratings:
# 
# 1. Favorable
# 2. Unfavorable
# 3. Neutral
# 4. Unfamiliar
# 
# For each character, we will compute the percentage of respondents that falls under each division. To do this, we will create eight new columns. The first four will contain the total count of each favorability rating while the last four will contain their corresponding percentages. 

# In[41]:


favorability = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']

# Snapshot the raw ratings so that the count columns added below are not themselves counted
ratings_only = all_chars.copy()

# Create the first four columns to contain each rating's total count 
for n, rating in enumerate(favorability, start=1):
    all_chars[rating] = (ratings_only == n).sum(axis = 1) # Counts the cells equal to this rating's code
    
# Create the last four columns for the corresponding percentages 
for rating in favorability:
    all_chars[rating + '_perc'] = round((all_chars[rating] / 936) * 100) # Out of the 936 respondents
    all_chars[rating + '_perc'] = all_chars[rating + '_perc'].astype(int)

char_df = all_chars.iloc[:,936:]
char_df = char_df.sort_values('Favorable', ascending = False) # Order the DataFrame by the Favorable rating


# In[42]:


char_df # New DataFrame to be used for the visualization.
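
# The counting step can be sketched on made-up ratings without transposing. The example below builds one count column per rating code from a small DataFrame of coded answers. The character labels and the names `toy_rating_names`, `toy_ratings`, `toy_rating_counts` and `toy_rating_share` are hypothetical.

```python
import pandas as pd

toy_rating_names = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']  # Codes 1 to 4

# Made-up coded answers of four respondents for two characters
toy_ratings = pd.DataFrame({
    'Character A': [1, 1, 3, 4],
    'Character B': [2, 2, 1, 4],
})

# One column per rating: how many respondents gave that code to each character
toy_rating_counts = pd.DataFrame({
    name: (toy_ratings == code).sum()
    for code, name in enumerate(toy_rating_names, start=1)
})

# Convert the counts to whole percentages of all respondents
toy_rating_share = (toy_rating_counts / len(toy_ratings) * 100).round().astype(int)
print(toy_rating_share)
```

# Counting against the untouched ratings frame keeps the new count columns out of the comparison, so a count that happens to equal a rating code cannot be miscounted.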


# To visualize our table, we will use the percentage columns to place four horizontal bar graphs beside each other. This will make it easy for us to compare the four favorability ratings for each character. 
# 
# Below, we take the names of each character and store them in `char_names`. 

# In[43]:


char_names = char_df.index
char_names


# The code block below allows us to construct our desired visualization. Please refer to the comments to better understand how the combination of graphs was constructed.  

# In[44]:


get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina' # For higher resolution graphs")

favorability = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']
clrs = ['#2CA02C', '#D62728', '#1F77B4', 'grey'] # Colors to be used for each bar graph
favorability_perc = ['Favorable_perc', 'Unfavorable_perc', 'Neutral_perc', 'Unfamiliar_perc'] # Column names to extract percent values

char = plt.figure(figsize = (9, 8)) 

# Set essential attributes for each bar graph
f = 0 # For favorability list order
s = 0 # For the subplot number
for rating in favorability_perc:
    s += 1
    ax = char.add_subplot(1, 4, s)
    ax.barh(char_names, char_df[rating], color = clrs[f])
    ax.text(x = 0.2, y = -1, s = favorability[f], weight = 'bold', size = 18, alpha = 0.7)
    ax.grid(False)
    ax.set_xticklabels([])
    ax.set_xlim((-12,90)) # Gives space between bars and the y-tick labels
    ax.set_yticklabels(char_names, alpha = 0.8)
    ax.invert_yaxis()
    f += 1
    
    # Remove y-tick labels for the three other graphs
    if s != 1:
        ax.set_yticklabels([])
    
    # Bar labels in percent
    bar_labels = char_df[rating].tolist()
    y_loc = 0 # For y-location of label
    for label in bar_labels:
        # Only the top bar label carries the % symbol; every label is drawn exactly once
        text = str(label) + '%' if y_loc == 0 else str(label)
        ax.text(x = label + 5, y = 0.15 + y_loc,  s = text, alpha = 0.8, size = 12)
        y_loc += 1

# Places the heading and subheading
char.axes[0].text(x = -140, y = -2.7, s = ' Favorite Characters from the films', fontsize = 27, weight = 'bold', alpha = 0.75)
char.axes[0].text(x = -137, y = -1.9, s = ' Out of 936 respondents who have seen ANY of the films', fontsize = 18)


# Luke Skywalker was the most favored among the characters at 82%, followed by Han Solo and Princess Leia, both at 81%. At the opposite end, Jar Jar Binks was the least favored, with 33% of respondents viewing him unfavorably. His favorability was also held down by the 24% of respondents who were unfamiliar with him. 

# # Conclusion
# ___

# From the graphs we generated, we saw that *The Empire Strikes Back* (Episode V) ranked at the top as both the most viewed and the best movie. We also saw how the characters were rated in terms of favorability, with Luke Skywalker, the main protagonist of the original Star Wars trilogy, being the most favored.