#!/usr/bin/env python
# coding: utf-8

# # Guided Project: Finding the Best Markets to Advertise In

# We're working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. We want to promote our product and we'd like to invest some money in advertisement. Our goal in this project is to find the two best markets to advertise our product in.
#
# To reach our goal, we could organize surveys in a couple of different markets to find out which would be the best choices for advertising. This is very costly, however, so it's a good call to explore cheaper options first.
#
# We can try to search existing data that might be relevant for our purpose. One good candidate is the data from [freeCodeCamp's 2017 New Coder Survey](https://www.freecodecamp.org/news/we-asked-20-000-people-who-they-are-and-how-theyre-learning-to-code-fff5d668969/). [freeCodeCamp](https://www.freecodecamp.org/) is a free e-learning platform that offers courses on web development. Because they run [a popular Medium publication](https://www.freecodecamp.org/news/) (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
#
# The survey data is publicly available in [this GitHub repository](https://github.com/freeCodeCamp/2017-new-coder-survey).

# ## Reading in and Exploring our Data

# In[1]:

# read in our data and import the libraries we'll use
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', low_memory=False)
survey.shape

# In[2]:

survey.head()

# Our dataset contains 18,175 rows and 136 columns.
# Some interesting columns for our analysis are `JobRoleInterest`, `HoursLearning` (the time respondents spend studying), `CityPopulation` (the size of the place respondents live in) and `MoneyForLearning` (how much money they are willing to spend on learning).
#
# ## Representativity of our Sample

# As we mentioned earlier, most of the courses we offer are on web and mobile development, but we also cover many other domains, like data science, game development, etc. For the purpose of our analysis, we want to answer questions about a population of new coders that are interested in the subjects we teach. What we'd like to know:
#
# - Where are these new coders located?
# - What are the locations with the greatest number of new coders?
# - How much money are new coders willing to spend on learning?
#
# Before starting to analyze the sample data we have, we need to clarify whether it's representative of our population of interest and whether it contains the right categories of people for our purpose. The `JobRoleInterest` column describes, for every participant, the role(s) they'd be interested in working in.

# In[3]:

# generate a frequency table for the JobRoleInterest column
freq_interest = survey['JobRoleInterest'].value_counts(normalize=True) * 100
freq_interest

# Analyzing our frequency table, we can tell that almost 12% of the people who took the survey are interested in becoming a full-stack web developer, and almost 6% are interested in a role as front-end web developer. When we look at the second half of the table, we notice that many people are interested in more than one subject. We'll plot this data in a pie chart.
# In[4]:

# count the number of students interested in one vs. multiple subjects;
# drop the rows with no answer first, so the totals add up
interests = survey['JobRoleInterest'].dropna()
multiple_courses = interests.str.contains(',')
print("Number of students interested in multiple courses: " + str(multiple_courses.sum()))

all_students = interests.shape[0]
single_course = all_students - multiple_courses.sum()
print("Number of students interested in one single course: " + str(single_course))
print('Total number of students: ' + str(all_students))

# In[5]:

# plot our data in a pie chart
df = pd.DataFrame({'Courses': [multiple_courses.sum(), single_course]},
                  index=['Multiple Courses', 'One Course'])
df.plot.pie(y='Courses', figsize=(5, 5))
plt.ylabel('')
plt.legend(loc='lower right')

# More than a quarter of the students are interested in multiple courses, so it makes sense for our company to offer a variety of course subjects. Let's analyze how many people are interested in web and mobile development - the focus of our courses.

# In[6]:

# create a frequency table with absolute values
survey_web_mobile = survey['JobRoleInterest'].str.contains('Web Developer|Mobile Developer')
survey_web_mobile.value_counts()

# In[7]:

# create a frequency table with percentages
survey_web_mobile_percent = survey_web_mobile.value_counts(normalize=True) * 100

# plot the results in a bar chart
plt.style.use('fivethirtyeight')
survey_web_mobile_percent.plot(kind='bar')
plt.title('Distribution of Students \nby Job Role Interest', size=18)
plt.ylabel('Percentage', size=12)
plt.ylim([0, 100])
plt.xticks([0, 1], ['Web or Mobile \nDevelopment', 'Other Subjects'], rotation=0)
plt.show()

# More than 86% of the people who took the survey indicated that they are interested in web or mobile related subjects. We can conclude that our sample is representative for the purpose of our analysis. Now that we know the sample contains the right categories of people, we can begin analyzing it.
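# The comma-splitting logic used above can be sketched on a toy sample. The values below are hypothetical and only mimic the shape of the `JobRoleInterest` column; they are not taken from the survey:

```python
import pandas as pd

# hypothetical answers mimicking the JobRoleInterest column
toy_interests = pd.Series([
    "Full-Stack Web Developer",
    "Front-End Web Developer, Back-End Web Developer",
    None,
    "Data Scientist, Mobile Developer, Game Developer",
])

# drop missing answers first so the totals add up
answered = toy_interests.dropna()

# respondents who listed more than one role contain a comma in their answer
toy_multiple = answered.str.contains(',').sum()
toy_single = len(answered) - toy_multiple
print(toy_multiple, toy_single)  # -> 2 1
```

# Dropping the nulls before counting matters: taking `shape[0]` on the raw column would count the unanswered rows as "one course" students.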
# ## Location of Potential Customers

# We can start by finding out where these new coders are located and how many of them live in each location. The `CountryCitizen` and `CountryLive` columns describe the country of origin and the country each participant lives in. As we're interested in where people are currently located, we'll use only the latter column. One good indicator of a good market is the number of potential customers - the more potential customers in a market, the better.

# In[8]:

# drop the rows where respondents did not answer what role they are interested in
survey_clean = survey[survey['JobRoleInterest'].notnull()].copy()

# In[9]:

survey_clean.shape

# In[10]:

# generate a frequency table for the `CountryLive` column with absolute values
absolute_values = survey_clean['CountryLive'].value_counts()
absolute_values.head()

# In[11]:

# generate a frequency table for the `CountryLive` column with percentages
percentage = survey_clean['CountryLive'].value_counts(normalize=True) * 100
percentage.head()

# The United States is the top country: more than 45% of the respondents live there. India comes far behind with almost 8% of the respondents, and all other countries stay under 5%. We will certainly choose the United States for advertisement. As a second market, we could choose India, but we should first check whether people living in India are willing to spend more money on learning than people in the numbers 3 and 4 on the list - the United Kingdom and Canada. The `MoneyForLearning` column will be helpful to answer this question.
#
# We'll narrow down our analysis to only four countries: the US, India, the United Kingdom and Canada. These are the countries with the highest absolute frequencies in our sample, which means we have a decent amount of data for each. In addition, our courses are written in English, and English is an official language in all four countries.
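# For presentation, the absolute and relative frequencies computed above can also be combined into a single table. A minimal sketch on made-up country values (not the survey data):

```python
import pandas as pd

# made-up country values, only to illustrate the shape of the combined table
toy_countries = pd.Series(['USA', 'USA', 'USA', 'India', 'UK', 'Canada', 'India'])

# value_counts returns both tables index-aligned, so they combine cleanly
toy_freq_table = pd.DataFrame({
    'Absolute frequency': toy_countries.value_counts(),
    'Percentage': toy_countries.value_counts(normalize=True) * 100,
})
print(toy_freq_table)
```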
# The more people that know English, the better our chances to target the right people with our ads.

# ## Amount of Money Spent on Learning
#
# The `MoneyForLearning` column describes, in US dollars, the amount of money participants spent from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at a price of \$59 per month, so we're interested in finding out how much money each student spends per month. We will therefore divide the `MoneyForLearning` column by the `MonthsProgramming` column, after replacing all values of 0 in the latter (students who had just started when they completed the survey) with 1 to avoid dividing by 0.

# In[12]:

# replace values of 0 with 1
survey_clean['MonthsProgramming'] = survey_clean['MonthsProgramming'].replace(0, 1)

# create a new column that describes the money students spent per month
survey_clean['MonthlySpent'] = round(survey_clean['MoneyForLearning'] / survey_clean['MonthsProgramming'], 2)
survey_clean['MonthlySpent'].head()

# In[13]:

# count the null values in the `MonthlySpent` column
survey_clean['MonthlySpent'].isnull().sum()

# In[14]:

# count the null values in the `CountryLive` column
survey_clean['CountryLive'].isnull().sum()

# In[15]:

# remove the rows with null values in the new column and the `CountryLive` column
survey_clean = survey_clean[survey_clean['MonthlySpent'].notnull()]
survey_clean = survey_clean[survey_clean['CountryLive'].notnull()]

# In[16]:

# check the null values again
print(survey_clean['MonthlySpent'].isnull().sum())
print(survey_clean['CountryLive'].isnull().sum())

# In[17]:

# group the remaining data by country and take the mean of the `MonthlySpent` column
countries_mean = survey_clean.groupby('CountryLive')['MonthlySpent'].mean()

# In[18]:

countries_mean

# We got some surprising results.
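# Before reading too much into these means, it's worth remembering how sensitive the mean is to a single extreme value. A toy sketch with made-up spending values for two hypothetical countries:

```python
import pandas as pd

# made-up monthly spending values; country B has one extreme outlier
toy_df = pd.DataFrame({
    'CountryLive': ['A', 'A', 'A', 'B', 'B', 'B'],
    'MonthlySpent': [20.0, 30.0, 40.0, 10.0, 20.0, 5000.0],
})

# mean, median and sample size per country
toy_stats = toy_df.groupby('CountryLive')['MonthlySpent'].agg(['mean', 'median', 'count'])
print(toy_stats)
```

# For country B the mean lands near 1676 while the median stays at 20: one wrong survey answer can dominate the mean, which is why checking medians and distributions alongside means is worthwhile.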
# Intuitively, we'd expect people in the UK and Canada to spend more on learning than people in India, given socio-economic metrics like GDP per capita. It might be that we don't have enough representative data for the UK, Canada and India, or that outliers (coming from wrong survey answers, for example) are inflating the mean, or the results might simply be correct.
#
# We'll generate four box plots to visualize the distribution of the `MonthlySpent` column for each country and spot possible outliers.

# In[23]:

# keep only the four countries of interest
four_countries = survey_clean[survey_clean['CountryLive'].isin(
    ['United States of America', 'India', 'United Kingdom', 'Canada'])]

# generate a box plot of `MonthlySpent` per country
four_countries.boxplot(column='MonthlySpent', by='CountryLive', figsize=(10, 6))
plt.title('Money Spent per Month by Country')
plt.suptitle('')
plt.ylabel('Money Spent per Month (US dollars)')
plt.show()

# In[ ]:
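# Once the box plots confirm extreme values, one common way to damp them is to trim values above a chosen quantile before recomputing the means. A minimal sketch on made-up values; the 0.95 quantile cutoff is an assumption for illustration, not a value taken from the survey analysis:

```python
import pandas as pd

# made-up monthly spending values with one extreme outlier
toy_spent = pd.Series([30.0, 45.0, 60.0, 80.0, 9000.0])

# keep only values below the 0.95 quantile (assumed cutoff)
toy_cutoff = toy_spent.quantile(0.95)
toy_trimmed = toy_spent[toy_spent < toy_cutoff]
print(toy_trimmed.max(), toy_trimmed.mean())
```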