#!/usr/bin/env python
# coding: utf-8

# # Guided Project: Star Wars
# While waiting for *Star Wars: The Force Awakens* to come out, the team at [FiveThirtyEight](https://fivethirtyeight.com) became interested in answering some questions about Star Wars fans. In particular, they wondered: **does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?**
# 
# The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which we'll be cleaning and exploring.

# ## Exploring and cleaning the data set

# In[1]:


import numpy as np
import pandas as pd

star_wars = pd.read_csv('star_wars.csv', encoding="ISO-8859-1")
star_wars.head(10)


# We can notice some strange values. The `RespondentID` column is supposed to be a unique ID for each respondent, but it's blank in some rows. There are also questions in the survey where the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format.

# In[2]:


# reviewing column names
star_wars.columns


# Some columns have strange names, which we will handle later on. For now, we'll remove any rows where `RespondentID` is `NaN`.

# In[3]:


# removing rows where RespondentID is NaN; .copy() avoids a SettingWithCopyWarning on later column assignments
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()
star_wars.head()


# Three columns represent `Yes/No` questions: 
# - `Have you seen any of the 6 films in the Star Wars franchise?`
# - `Do you consider yourself to be a fan of the Star Wars film franchise?`
# - `Do you consider yourself to be a fan of the Star Trek franchise?`
# 
# Each value is `Yes`, `No`, or `NaN` (when a respondent chose not to answer the question). We can convert `Yes` and `No` to Booleans, which makes the data easier to analyze down the road because we can select the rows that are `True` or `False` without having to do a string comparison.
# 

# In[4]:


# exploring columns
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()


# In[5]:


star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()


# In[6]:


star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'].value_counts()


# In[7]:


# converting 'Yes' and 'No' to Boolean values
new_values = {'Yes': True, 'No': False}

yes_no_cols = [
    'Have you seen any of the 6 films in the Star Wars franchise?',
    'Do you consider yourself to be a fan of the Star Wars film franchise?',
    'Do you consider yourself to be a fan of the Star Trek franchise?',
]
for col in yes_no_cols:
    star_wars[col] = star_wars[col].map(new_values)
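
# The conversion above can be illustrated on a tiny, made-up Series. This is a minimal sketch rather than part of the survey analysis: `demo_answers`, `demo_bool` and `demo_fans` are hypothetical names introduced only for this example. It shows that `Series.map` leaves unanswered (`NaN`) entries as `NaN`, and that a Boolean comparison then selects rows without any string matching.

# In[ ]:


import numpy as np
import pandas as pd

# made-up answers, including one unanswered question
demo_answers = pd.Series(['Yes', 'No', np.nan, 'Yes'])

# values that are not keys of the dictionary (here NaN) stay NaN after mapping
demo_bool = demo_answers.map({'Yes': True, 'No': False})

# NaN == True evaluates to False, so the unanswered row is excluded
demo_fans = demo_answers[demo_bool == True]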


# In[8]:


# exploring the new values
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()


# In[9]:


star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()


# In[10]:


star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'].value_counts()


# The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question *Which of the following Star Wars films have you seen? Please select all that apply*.
# 
# The columns for this question are:
# 
# - `Which of the following Star Wars films have you seen? Please select all that apply.` - Whether or not the respondent saw `Star Wars: Episode I The Phantom Menace`.
# - `Unnamed: 4` - Whether or not the respondent saw `Star Wars: Episode II Attack of the Clones`.
# - `Unnamed: 5` - Whether or not the respondent saw `Star Wars: Episode III Revenge of the Sith`.
# - `Unnamed: 6` - Whether or not the respondent saw `Star Wars: Episode IV A New Hope`.
# - `Unnamed: 7` - Whether or not the respondent saw `Star Wars: Episode V The Empire Strikes Back`.
# - `Unnamed: 8` - Whether or not the respondent saw `Star Wars: Episode VI Return of the Jedi`.
# 
# For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
# 
# We'll convert each of these columns to a Boolean and then rename the column something more intuitive.

# In[11]:


new_values2 = {
    'Star Wars: Episode I  The Phantom Menace': True, 
    'Star Wars: Episode II  Attack of the Clones': True, 
    'Star Wars: Episode III  Revenge of the Sith': True, 
    'Star Wars: Episode IV  A New Hope': True, 
    'Star Wars: Episode V The Empire Strikes Back': True, 
    'Star Wars: Episode VI Return of the Jedi': True, 
    np.nan: False
}


for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(new_values2)
    
# select columns 3 to 8 (star_wars[3:9] would slice rows instead)
star_wars[star_wars.columns[3:9]].head()
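
# The dictionary above depends on the exact title strings, including the double spaces in some of them. Under the same assumption as above (a missing value means the movie was not seen), one possible alternative is to test each cell for a missing value. The sketch below uses made-up data; `demo_seen`, `demo_seen_bool` and `demo_seen_counts` are hypothetical names for this example only.

# In[ ]:


import numpy as np
import pandas as pd

# made-up checkbox columns: the movie name if checked, NaN otherwise
demo_seen = pd.DataFrame({
    'seen_a': ['Movie A', np.nan, 'Movie A'],
    'seen_b': [np.nan, np.nan, 'Movie B'],
})

# True where a title is present, False where the cell is NaN
demo_seen_bool = demo_seen.notnull()

# summing Booleans counts the respondents who saw each movie
demo_seen_counts = demo_seen_bool.sum()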


# In[12]:


# renaming the columns
star_wars = star_wars.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
    'Unnamed: 4': 'seen_2',
    'Unnamed: 5': 'seen_3',
    'Unnamed: 6': 'seen_4',
    'Unnamed: 7': 'seen_5',
    'Unnamed: 8': 'seen_6',
})

star_wars.columns[3:9]


# The next six columns ask the respondent to rank the Star Wars movies in order of preference, from most favorite to least favorite. `1` means the film was the most favorite, and `6` means it was the least favorite. Each of the following columns can contain the value `1`, `2`, `3`, `4`, `5`, `6` or `NaN`:
# 
# - `Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.` - How much the respondent liked `Star Wars: Episode I The Phantom Menace`.
# - `Unnamed: 10` - How much the respondent liked `Star Wars: Episode II Attack of the Clones`.
# - `Unnamed: 11` - How much the respondent liked `Star Wars: Episode III Revenge of the Sith`.
# - `Unnamed: 12` - How much the respondent liked `Star Wars: Episode IV A New Hope`.
# - `Unnamed: 13` - How much the respondent liked `Star Wars: Episode V The Empire Strikes Back`.
# - `Unnamed: 14` - How much the respondent liked `Star Wars: Episode VI Return of the Jedi`.
# 
# We'll convert each column to a numeric type and rename the columns.

# In[13]:


# converting columns to numeric type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

# renaming the columns
star_wars = star_wars.rename(columns={
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1',
    'Unnamed: 10': 'ranking_2',
    'Unnamed: 11': 'ranking_3',
    'Unnamed: 12': 'ranking_4',
    'Unnamed: 13': 'ranking_5',
    'Unnamed: 14': 'ranking_6',
})

star_wars.columns[9:15]


# ## Global Movie Ranking

# In[14]:


# calculating the mean of each ranking column
ranking_mean = star_wars[star_wars.columns[9:15]].mean()
ranking_mean
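
# Because `1` is the best possible rank, the column with the lowest mean is the favorite. The following is a minimal sketch on made-up rankings, not on the survey data; `demo_rankings`, `demo_means`, `demo_favorite` and `demo_order` are hypothetical names. It also shows that `DataFrame.mean` skips `NaN` values by default, so respondents who did not rank a movie do not distort its mean.

# In[ ]:


import numpy as np
import pandas as pd

# made-up rankings from three respondents; one of them skipped 'ranking_a'
demo_rankings = pd.DataFrame({
    'ranking_a': [1.0, 2.0, np.nan],
    'ranking_b': [2.0, 1.0, 3.0],
    'ranking_c': [3.0, 3.0, 1.0],
})

# NaN is ignored, so 'ranking_a' is averaged over two answers only
demo_means = demo_rankings.mean()

# lowest mean = best ranked
demo_favorite = demo_means.idxmin()

# columns ordered from best ranked to worst ranked
demo_order = demo_means.sort_values().index.tolist()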


# In[15]:


# creating a bar chart to plot the different means
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

ranking_mean.plot(kind='bar', title='Ranking Star Wars Movies', colormap='ocean')


# The lower the mean, the higher the respondents ranked the movie. 
# 
# From this bar chart, we can deduce the following:
# - `ranking_5` (Episode V) is the respondents' favorite movie
# - `ranking_3` (Episode III) is the respondents' least favorite movie
# - the last three movies (Episodes IV to VI) are ranked better than the first three (Episodes I to III)
# 
# For the `seen` columns, we'll figure out how many people have seen each movie by taking the sum of each column.

# In[16]:


sum_seen = star_wars[star_wars.columns[3:9]].sum()
sum_seen.plot(kind='bar', title='Number of Respondents Who Have Seen Each Star Wars Movie', colormap='ocean')


# We can deduce from this bar chart that the first movie and the last two movies have been seen by the most respondents. This may help explain why the last two movies (Episodes V and VI) are ranked higher than the others.
# 
# We'll now examine how certain segments of the survey population responded. There are several columns that segment our data into two groups:
# 
# - `Do you consider yourself to be a fan of the Star Wars film franchise?` - `True` or `False`
# - `Do you consider yourself to be a fan of the Star Trek franchise?` - `True` or `False`
# - `Gender` - `Male` or `Female`

# ## Ranking by Groups

# In[17]:


# splitting our data in two groups

fan_star_wars = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == True]
no_fan_star_wars = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == False]

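# the Star Trek column was converted to Booleans earlier, so we compare against True/False rather than 'Yes'/'No';
# the names star_trek_fan and star_trek_no_fan are the ones used by the Star Trek cells further down
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == True]
star_trek_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == False]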

males = star_wars[star_wars['Gender'] == 'Male']
females = star_wars[star_wars['Gender'] == 'Female']
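
# Splitting by hand works well for a few binary columns. As a possible alternative, `DataFrame.groupby` computes the same per-group means in one step. The sketch below uses made-up data and hypothetical names (`demo_survey`, `demo_group_means`); rows whose group value is `NaN` are left out of the result by default.

# In[ ]:


import numpy as np
import pandas as pd

# made-up survey rows; the last respondent did not state a gender
demo_survey = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', np.nan],
    'ranking_a': [1.0, 2.0, 3.0, 2.0, 1.0],
    'ranking_b': [2.0, 1.0, 1.0, 3.0, 2.0],
})

# one row per group, one column per ranking; the NaN group is dropped
demo_group_means = demo_survey.groupby('Gender')[['ranking_a', 'ranking_b']].mean()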


# ### By Star Wars fans

# In[18]:


# calculating the mean ranking values for Star Wars fans and non Star Wars fans
mean_star_wars_fan = fan_star_wars[fan_star_wars.columns[9:15]].mean()
mean_star_wars_no_fan = no_fan_star_wars[no_fan_star_wars.columns[9:15]].mean()

# creating a bar chart for both groups
cols = ['ranking_1', 'ranking_2', 'ranking_3', 
        'ranking_4', 'ranking_5', 'ranking_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, mean_star_wars_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, mean_star_wars_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Ranking')
plt.title('Ranking Star Wars Movies')
plt.legend(loc='upper right')

plt.show()
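
# In the chart above, the tick labels are placed at `pos`, which is the center of the first bar of each pair. If the labels should sit between the two bars instead, one option is to shift the tick positions by half a bar width. This is a minimal sketch of the arithmetic only; `demo_pos`, `demo_bar_width` and `demo_tick_pos` are hypothetical names for this example.

# In[ ]:


import numpy as np

demo_pos = np.arange(6)        # center of the first bar in each pair
demo_bar_width = 0.35          # the second bar is centered at demo_pos + demo_bar_width

# midpoint between the centers of the two bars in each pair
demo_tick_pos = demo_pos + demo_bar_width / 2

# the result could then be passed to plt.xticks in place of pos, for example:
# plt.xticks(demo_tick_pos, cols, rotation=90)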


# The bar chart shows the following trends:
# - Star Wars fans rate the first three movies less favorably than respondents who do not consider themselves fans.
# - the last three movies are much more appreciated by respondents who are Star Wars fans.
# - in general, we can say that the last two movies are the most popular.

# In[25]:


# counting how many Star Wars fans and non Star Wars fans have seen each movie
star_wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == True]
star_wars_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == False]

sum_star_wars_fan = star_wars_fan[star_wars_fan.columns[3:9]].sum()
sum_star_wars_no_fan = star_wars_no_fan[star_wars_no_fan.columns[3:9]].sum()

# creating a bar chart for both groups
cols = ['seen_1', 'seen_2', 'seen_3', 
        'seen_4', 'seen_5', 'seen_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, sum_star_wars_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, sum_star_wars_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Number of Respondents')
plt.title('No of Respondents who have seen each Star Wars Movie')
plt.legend(loc='upper left')

plt.show()


# From the bar chart above, we can deduce the following:
# - many more fans than non-fans have seen each of the Star Wars movies.
# - the last two movies have been seen by the most respondents (fans and non-fans).

# ### By Star Trek fans

# In[30]:


# calculating the mean ranking values for Star Trek fans and non Star Trek Fans
mean_star_trek_fan = star_trek_fan[star_trek_fan.columns[9:15]].mean()
mean_star_trek_no_fan = star_trek_no_fan[star_trek_no_fan.columns[9:15]].mean()

# creating a bar chart for both groups
cols = ['ranking_1', 'ranking_2', 'ranking_3', 
        'ranking_4', 'ranking_5', 'ranking_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, mean_star_trek_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, mean_star_trek_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Ranking')
plt.title('Ranking Star Wars Movies')
plt.legend(loc='upper right')

plt.show()


# We can observe the following trends:
# - the first 3 movies are more appreciated by the respondents who do not consider themselves as fans of the Star Trek franchise.
# - the last 3 movies are more appreciated by the Star Trek fans.
# - overall, we can see that the last 2 movies are the most popular amongst both groups.

# In[21]:


# counting how many Star Trek fans and non Star Trek fans have seen each movie
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == True]
star_trek_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == False]

sum_star_trek_fan = star_trek_fan[star_trek_fan.columns[3:9]].sum()
sum_star_trek_no_fan = star_trek_no_fan[star_trek_no_fan.columns[3:9]].sum()

# creating a bar chart for both groups
cols = ['seen_1', 'seen_2', 'seen_3', 
        'seen_4', 'seen_5', 'seen_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, sum_star_trek_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, sum_star_trek_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Number of Respondents')
plt.title('No of Respondents who have seen each Star Wars Movie')
plt.legend(loc='upper left')

plt.show()


# From the bar chart above, we can deduce the following:
# - more Star Trek fans than non-fans have seen each movie.
# - the last two movies have been seen more than the others by both groups (fans and non-fans).

# ### By Gender

# In[22]:


# calculating the mean ranking values for males and females
mean_males = males[males.columns[9:15]].mean()
mean_females = females[females.columns[9:15]].mean()

# creating a bar chart for both groups
cols = ['ranking_1', 'ranking_2', 'ranking_3', 
        'ranking_4', 'ranking_5', 'ranking_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, mean_males, bar_width, color='green', label='Male')
plt.bar(pos + bar_width, mean_females, bar_width, color='darkblue', label='Female')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Ranking')
plt.title('Ranking Star Wars Movies by Gender')
plt.legend(loc='upper right')

plt.show()


# Male and female respondents show the same overall trend in their rankings. The bar chart shows the following differences:
# - the first 2 movies are rated more favorably by female respondents.
# - the last 4 movies are rated more favorably by male respondents.
# - in general, we can say that the last 2 movies are the most popular.

# In[37]:


# calculating the sum of each `seen` column
seen_male = males[males.columns[3:9]].sum()
seen_female = females[females.columns[3:9]].sum()

# creating a bar chart for both groups
cols = ['seen_1', 'seen_2', 'seen_3', 
        'seen_4', 'seen_5', 'seen_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, seen_male, bar_width, color='green', label='Male')
plt.bar(pos + bar_width, seen_female, bar_width, color='darkblue', label='Female')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Number of Respondents')
plt.title('No of Respondents who have seen the Star Wars movies')
plt.legend(loc='upper center')

plt.show()


# From the bar chart above, we can deduce the following:
# - in general, each movie has been seen more by male than by female respondents.
# - the last two movies have been seen by the most respondents (male and female).
# 
# In conclusion, the more respondents have seen a movie, the higher it tends to rank, regardless of gender or of being a Star Wars or Star Trek fan. The last two movies are by far the most seen and most popular of the six, generally followed by the first movie. Fans of Star Wars and Star Trek have seen more of the movies, and appreciate them more, than non-fans.

# Here are some potential next steps:
# 
# - Try to segment the data based on columns like Education, Location (Census Region), and Which character shot first?, which aren't binary. Are there any interesting patterns?
# - Clean up columns 15 to 29, which contain data on the characters respondents view favorably and unfavorably.
# - Which character do respondents like the most?
# - Which character do respondents dislike the most?
# - Which character is the most controversial (split between likes and dislikes)?
#