Finding the Best Markets to Advertise In¶

As an e-learning company that offers courses on programming -- mostly web and mobile development, but also many other domains -- we want to find the two best markets to advertise in with this project.

In [428]:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Read and explore the dataset¶

In [429]:

ncs = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', dtype = {'JobInterestOther':str, 'CodeEventOther':str})

In [430]:

ncs.head()

Out[430]:

	Age	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventConferences	CodeEventDjangoGirls	...	YouTubeFCC	YouTubeFunFunFunction	YouTubeGoogleDev	YouTubeLearnCode	YouTubeLevelUpTuts	YouTubeMIT	YouTubeMozillaHacks	YouTubeOther	YouTubeSimplilearn	YouTubeTheNewBoston
0	27.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	34.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	21.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN
3	26.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN
4	20.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 136 columns

In [431]:

ncs.describe()

Out[431]:

	Age	AttendedBootcamp	BootcampFinish	BootcampLoanYesNo	BootcampRecommend	ChildrenNumber	CodeEventConferences	CodeEventDjangoGirls	CodeEventFCC	CodeEventGameJam	...	YouTubeEngineeredTruth	YouTubeFCC	YouTubeFunFunFunction	YouTubeGoogleDev	YouTubeLearnCode	YouTubeLevelUpTuts	YouTubeMIT	YouTubeMozillaHacks	YouTubeSimplilearn	YouTubeTheNewBoston
count	15367.000000	17709.000000	1069.000000	1079.000000	1073.000000	2314.000000	1609.0	165.0	1708.0	290.0	...	993.0	6036.0	1261.0	3539.0	2662.0	1396.0	3327.0	622.0	201.0	2960.0
mean	27.691872	0.062002	0.699719	0.305839	0.818267	1.832325	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
std	8.559239	0.241167	0.458594	0.460975	0.385805	0.972813	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
min	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
25%	22.000000	0.000000	0.000000	0.000000	1.000000	1.000000	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
50%	26.000000	0.000000	1.000000	0.000000	1.000000	2.000000	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
75%	32.000000	0.000000	1.000000	1.000000	1.000000	2.000000	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0
max	90.000000	1.000000	1.000000	1.000000	1.000000	9.000000	1.0	1.0	1.0	1.0	...	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0

8 rows × 105 columns

In [432]:

ncs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Columns: 136 entries, Age to YouTubeTheNewBoston
dtypes: float64(105), object(31)
memory usage: 18.9+ MB

A few notes on the dataset:

We are using a ready-made dataset instead of organizing a survey: surveys are very costly, so it's good to explore cheaper options first.
This dataset is from freeCodeCamp, a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in this GitHub repository.

Clarify if the dataset is representative for our population of interest¶

In [433]:

# Quick look at the `JobRoleInterest` column
new_coders_survey = ncs.copy()
print(new_coders_survey['JobRoleInterest'].head(), '\n\n\n', new_coders_survey['JobRoleInterest'].describe())

0                                                  NaN
1                             Full-Stack Web Developer
2      Front-End Web Developer, Back-End Web Develo...
3      Front-End Web Developer, Full-Stack Web Deve...
4    Full-Stack Web Developer, Information Security...
Name: JobRoleInterest, dtype: object 


 count                         6992
unique                        3213
top       Full-Stack Web Developer
freq                           823
Name: JobRoleInterest, dtype: object

From a quick look at the JobRoleInterest column, we can tell that respondents could select more than one subject, and that the subjects can be related or overlap in meaning.

In [434]:

# Generate a frequency table for the `JobRoleInterest` column, as percentages in descending order
job_role_interest_pct = pd.DataFrame(data = new_coders_survey['JobRoleInterest'].value_counts(normalize = True, ascending = False)*100)
job_role_interest_pct.head()

Out[434]:

	JobRoleInterest
Full-Stack Web Developer	11.770595
Front-End Web Developer	6.435927
Data Scientist	2.173913
Back-End Web Developer	2.030892
Mobile Developer	1.673341

From the frequency table above, Full-Stack Web Developer and Front-End Web Developer -- two closely related roles -- are the two most popular answers, followed by Data Scientist, Back-End Web Developer, and Mobile Developer, in order of popularity.

But since each answer can contain more than one job role, this frequency table is not very representative.
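
Because one answer can hold several roles, a per-role count is more informative than a per-answer count. The following is only a minimal sketch of that idea: `answers` is a small hypothetical Series standing in for the real `JobRoleInterest` column, and the role strings are illustrative.

```python
import pandas as pd

# Hypothetical answers standing in for the `JobRoleInterest` column
answers = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer, Back-End Web Developer',
    'Full-Stack Web Developer,   Data Scientist',
    None,
])

# Split each answer on commas, put every role on its own row, and trim whitespace
roles = answers.dropna().str.split(',').explode().str.strip()

# Share of each role among all selected roles, in percent
role_pct = roles.value_counts(normalize=True) * 100
print(role_pct)
```

Counting per role this way treats a respondent with three interests as three data points, so the percentages describe selections rather than people.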

In [435]:

# Drop null values 
new_coders_survey['JobRoleInterest'].dropna(inplace = True)

In [436]:

# Find out the percentage of participants that are interested in either web or mobile development
# (`.count()` ignores null answers, so the denominator only includes people who answered)
web_mobile = new_coders_survey['JobRoleInterest'].str.contains(r'Web|Mobile').sum()/new_coders_survey['JobRoleInterest'].count()
web_mobile*100

Out[436]:

86.29862700228833

In [437]:

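# Turn each value in `JobRoleInterest` into a list
# (`regex = True` is needed on recent pandas versions, where `str.replace` treats the pattern literally by default)
new_coders_survey['JobRoleInterest'] = new_coders_survey['JobRoleInterest'].str.strip().str.replace(r', +', ',', regex = True).str.split(',')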
new_coders_survey['JobRoleInterest'].head()

Out[437]:

0                                                  NaN
1                           [Full-Stack Web Developer]
2    [Front-End Web Developer, Back-End Web Develop...
3    [Front-End Web Developer, Full-Stack Web Devel...
4    [Full-Stack Web Developer, Information Securit...
Name: JobRoleInterest, dtype: object

In [438]:

# Drop null values
new_coders_survey['JobRoleInterest'].dropna(inplace = True)

In [439]:

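# Find out the percentage of people that are interested in more than one subject.
# The values are lists at this point, so `.str.contains(',')` returns NaN for every row
# (the 0.0 recorded below came from that call); counting list lengths avoids the problem.
two_plus_pct = (new_coders_survey['JobRoleInterest'].str.len() > 1).sum()/new_coders_survey['JobRoleInterest'].count()
two_plus_pct*100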

Out[439]:

0.0

Since close to 70% of the survey participants are interested in more than one subject, let's explore the correlations between job roles that are selected together.
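
As a rough illustration of what "selected together" means, here is a small sketch that uses only the standard library. The `answers` list is hypothetical and already split into roles; it computes the share of multi-role answers and counts how often each pair of roles shares an answer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical answers, already split into lists of roles
answers = [
    ['Full-Stack Web Developer'],
    ['Front-End Web Developer', 'Back-End Web Developer'],
    ['Full-Stack Web Developer', 'Front-End Web Developer', 'Data Scientist'],
]

# Share of respondents who picked more than one role
multi_share = sum(len(roles) > 1 for roles in answers) / len(answers)

# Count how often each unordered pair of roles appears in the same answer
pair_counts = Counter()
for roles in answers:
    for pair in combinations(sorted(set(roles)), 2):
        pair_counts[pair] += 1

print(round(multi_share * 100, 1))
print(pair_counts.most_common(3))
```

Sorting each answer before pairing keeps `(A, B)` and `(B, A)` under a single key, which is one simple way to get a symmetric count.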

In [440]:

# Unwrap all job role interests in the column (skipping any null answers)
job_roles = []
for roles in new_coders_survey['JobRoleInterest'].dropna():
    for role in roles:
        job_roles.append(role)

job_roles = pd.Series(job_roles)
job_roles.head()

Out[440]:

0    Full-Stack Web Developer
1     Front-End Web Developer
2      Back-End Web Developer
3           DevOps / SysAdmin
4            Mobile Developer
dtype: object

In [441]:

unique_roles = pd.Series(job_roles).unique()

In [442]:

# Create a dataframe of zeros to count combinations of job roles
job_role_combo = pd.DataFrame(0, columns = unique_roles, index = unique_roles)

In [443]:

# Count combo appearance (skipping any null answers)
for roles in new_coders_survey['JobRoleInterest'].dropna():
    job_role_combo.loc[roles, roles]+=1

In [444]:

# Create a frequency table for top job roles 
top_job_roles = (pd.Series(job_roles).value_counts(normalize = True)*100).head(15)
print(top_job_roles,
      '\n\n',
    '* The top 15 job role interests take up ' + str(round(top_job_roles.sum(),2)) + '% of all roles')

Full-Stack Web Developer      18.575221
Front-End Web Developer       15.632743
Back-End Web Developer        12.265487
Mobile Developer              10.194690
Data Scientist                 7.269912
Game Developer                 7.203540
User Experience Designer       6.500000
Information Security           5.867257
Data Engineer                  5.522124
DevOps / SysAdmin              4.101770
Product Manager                3.601770
Quality Assurance Engineer     2.203540
Software Engineer              0.048673
Software Developer             0.026549
Artificial Intelligence        0.017699
dtype: float64 

 * The top 15 job role interests take up 99.03% of all roles

From the frequency table above, we can tell that the top 15 job role interests make up roughly 99% of all job role interests, which is representative enough for the dataset. Let's visualize the frequency table to get a better idea of the distribution of the top job role interests.

In [445]:

# Get combo appearance for top 15 job roles
top_role_combo = job_role_combo.loc[top_job_roles.index, top_job_roles.index]
top_role_combo.iloc[:5,:5]

Out[445]:

	Full-Stack Web Developer	Front-End Web Developer	Back-End Web Developer	Mobile Developer	Data Scientist
Full-Stack Web Developer	4198	2348	2124	1632	988
Front-End Web Developer	2348	3533	1889	1535	731
Back-End Web Developer	2124	1889	2772	1304	825
Mobile Developer	1632	1535	1304	2304	581
Data Scientist	988	731	825	581	1643

In [446]:

# Calculate correlations between top job roles
top_role_corr = top_role_combo.corr()

Visualizations¶

Fig.1 Horizontal Bar Plot On Top 15 Job Role Interest Frequency¶

In [447]:

# Plot a horizontal bar chart of top job role frequencies
fig = plt.figure(figsize = (10,5))
ax = fig.add_subplot(1,1,1)
top_job_roles.sort_values(ascending = True).plot.barh(color = '#348feb',alpha = 0.8, ax = ax, 
                                                      title = 'Horizontal Bar Plot On Top 15 Job Role Interest Frequency')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.tick_params(width = 0)

Fig.2 Heatmap On Top 15 Job Role Interest Correlation¶

In [448]:

# Plot a heatmap for top job role combo correlations
import seaborn as sns 
sns.heatmap(top_role_corr)
plt.title('Top 15 Job Role Interest Correlation Heatmap')

Out[448]:

<matplotlib.text.Text at 0x7ff40a7774a8>

Observations:

86% of people are interested in either web development or mobile development
Close to 70% of the participants of the survey are interested in more than one subject
The dataset is still representative because, from Fig.2, we can tell that most people who are interested in more than one subject are interested in closely related subjects, such as Full-Stack Web Developer with Front-End Web Developer and Back-End Web Developer, or Data Engineer with Data Scientist.
All of the above suggests that this dataset is representative enough for our goal, at least for now.

Analyze the dataset¶

Number of potential coders by country¶

In [449]:

# Drop null values in column `CountryLive`
coder_loc = ncs['CountryLive'].dropna()

In [450]:

# Generate an absolute frequency table
coder_loc.value_counts()

Out[450]:

United States of America         5791
India                            1400
United Kingdom                    757
Canada                            616
Brazil                            364
Germany                           324
Poland                            265
Russia                            263
Australia                         259
France                            228
Spain                             217
Nigeria                           214
Ukraine                           202
Romania                           171
Italy                             164
Mexico                            155
Netherlands (Holland, Europe)     142
Philippines                       135
South Africa                      126
Turkey                            120
Greece                            116
Serbia                            115
Argentina                         113
Pakistan                          109
Kenya                              92
Indonesia                          91
China                              90
Egypt                              87
Sweden                             80
Hungary                            77
                                 ... 
Benin                               2
Malawi                              2
Guadeloupe                          1
Botswana                            1
Hawaii                              1
Vanuatu                             1
Gibraltar                           1
Burundi                             1
Curacao                             1
Sierra Leone                        1
Martinique                          1
Channel Islands                     1
Greenland                           1
New Caledonia                       1
Barbados                            1
Isle of Man                         1
Rwanda                              1
Canary Islands                      1
Anguilla                            1
Gabon                               1
Aruba                               1
Samoa                               1
Korea North                         1
Madagascar                          1
Bermuda                             1
Virgin Islands (British)            1
Liberia                             1
British Indian Ocean Ter            1
Tajikistan                          1
Kuwait                              1
Name: CountryLive, Length: 172, dtype: int64

In [451]:

# Generate a relative frequency table
coder_loc.value_counts(normalize = True)*100

Out[451]:

United States of America         37.760824
India                             9.128847
United Kingdom                    4.936098
Canada                            4.016693
Brazil                            2.373500
Germany                           2.112676
Poland                            1.727960
Russia                            1.714919
Australia                         1.688837
France                            1.486698
Spain                             1.414971
Nigeria                           1.395409
Ukraine                           1.317162
Romania                           1.115023
Italy                             1.069379
Mexico                            1.010694
Netherlands (Holland, Europe)     0.925926
Philippines                       0.880282
South Africa                      0.821596
Turkey                            0.782473
Greece                            0.756390
Serbia                            0.749870
Argentina                         0.736828
Pakistan                          0.710746
Kenya                             0.599896
Indonesia                         0.593375
China                             0.586854
Egypt                             0.567293
Sweden                            0.521648
Hungary                           0.502087
                                   ...    
Benin                             0.013041
Malawi                            0.013041
Guadeloupe                        0.006521
Botswana                          0.006521
Hawaii                            0.006521
Vanuatu                           0.006521
Gibraltar                         0.006521
Burundi                           0.006521
Curacao                           0.006521
Sierra Leone                      0.006521
Martinique                        0.006521
Channel Islands                   0.006521
Greenland                         0.006521
New Caledonia                     0.006521
Barbados                          0.006521
Isle of Man                       0.006521
Rwanda                            0.006521
Canary Islands                    0.006521
Anguilla                          0.006521
Gabon                             0.006521
Aruba                             0.006521
Samoa                             0.006521
Korea North                       0.006521
Madagascar                        0.006521
Bermuda                           0.006521
Virgin Islands (British)          0.006521
Liberia                           0.006521
British Indian Ocean Ter          0.006521
Tajikistan                        0.006521
Kuwait                            0.006521
Name: CountryLive, Length: 172, dtype: float64

Based on the frequency tables, the two biggest markets are the United States and India.
But we need to go more in depth, for example by making sure that these top countries do have our target audience (people interested in web or mobile development) and that those people plan on spending money on learning.
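
One way to check where the target audience lives is to filter on the stated job interest before counting countries. The sketch below is only an illustration under assumptions: `survey` is a tiny hypothetical DataFrame that reuses the `CountryLive` and `JobRoleInterest` column names, and its rows are made up.

```python
import pandas as pd

# Hypothetical miniature of the survey: country plus stated job interest
survey = pd.DataFrame({
    'CountryLive': ['United States of America', 'India', 'India', 'Canada',
                    'United States of America'],
    'JobRoleInterest': ['Full-Stack Web Developer', None, 'Mobile Developer',
                        'Data Scientist', 'Front-End Web Developer'],
})

# Keep respondents who named a role that mentions web or mobile development
is_target = survey['JobRoleInterest'].str.contains('Web|Mobile', na=False)
target = survey[is_target]

# Share of the target audience living in each country, in percent
target_by_country = target['CountryLive'].value_counts(normalize=True) * 100
print(target_by_country)
```

Passing `na=False` makes respondents without an answer count as outside the target audience instead of producing null values in the mask.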

Amount of money potential coders are willing to spend by country¶

We are going to narrow down our analysis to only four countries: the US, India, the United Kingdom, and Canada. Two reasons for this decision are:

These are the countries having the highest absolute frequencies in our sample, which means we have a decent amount of data for each.
Our courses are written in English, and English is an official language in all these four countries. The more people that know English, the better our chances to target the right people with our ads.

In [452]:

# Calculate the money coders spent on learning every month. To avoid dividing by zero, replace all zeros in the `MonthsProgramming` column with ones.
ncs_copy = ncs[ncs['JobRoleInterest'].notnull()].copy()
ncs_copy['MonthsProgramming'] = ncs_copy['MonthsProgramming'].replace(0,1)
ncs_copy['money_spent_monthly'] = ncs_copy['MoneyForLearning']/ncs_copy['MonthsProgramming']
ncs_copy['money_spent_monthly'].head()

Out[452]:

1     13.333333
2    200.000000
3      0.000000
4      0.000000
6      0.000000
Name: money_spent_monthly, dtype: float64

In [453]:

# Find out how many null values there are in the new column
ncs_copy['money_spent_monthly'].isnull().sum()

Out[453]:

In [454]:

# Drop rows with null values in `money_spent_monthly` or `CountryLive`
ncs_copy = ncs_copy.dropna(subset = ['money_spent_monthly', 'CountryLive'])

In [455]:

# Calculate how much money a student spends on average each month in the four countries
top_4_countries = ncs_copy['CountryLive'].value_counts().head(4)
top_countries = top_4_countries.index 
money_by_country = ncs_copy.groupby('CountryLive')['money_spent_monthly'].mean()
money_by_country_4 = pd.DataFrame(money_by_country.loc[top_countries], columns = ['money_spent_monthly'])
money_by_country_4

Out[455]:

	money_spent_monthly
United States of America	227.997996
India	135.100982
United Kingdom	45.534443
Canada	113.510961

The results for the United Kingdom and Canada are surprisingly low relative to the values we see for India. If we consider a few socio-economic metrics (like GDP per capita), we'd intuitively expect people in the UK and Canada to spend more on learning than people in India.

This suggests that we should look deeper into the data.
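
A surprising mean is often a sign of a few extreme values. The following sketch uses made-up numbers (not taken from the survey) to show how a single large value pulls the mean upward while leaving the median almost unchanged.

```python
from statistics import mean, median

# Hypothetical monthly spend values (USD); the last one is an extreme outlier
monthly_spend = [0, 0, 10, 25, 50, 100, 5000]

print(mean(monthly_spend))    # pulled upward by the single large value
print(median(monthly_spend))  # barely affected by it

# The same summary without the outlier
trimmed = [x for x in monthly_spend if x < 1000]
print(mean(trimmed), median(trimmed))
```

Comparing the two summaries side by side is a quick way to decide whether a box plot or a closer look at individual rows is worth the effort.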

Identify possible outliers in the four countries with box plots¶

Fig.3 Box Plots For Each Country On Money Spent Monthly¶

In [456]:

# Generate boxplots for each country
fig = plt.figure(figsize=(14, 10)) 
for i in range(4):
    ax = fig.add_subplot(2,2,i+1)
    sns.boxplot(x = 'CountryLive', y = 'money_spent_monthly', 
                data = ncs_copy[ncs_copy['CountryLive'] == top_countries[i]], ax = ax)   
plt.show()

From Fig.3, we can see that there are extreme outliers in every country. Next, we will eliminate some of the outliers and plot again to 'zoom in'.
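
The cut-offs used in this notebook are picked by eye from the box plots. As a hedged alternative, not the method used here, the sketch below applies the common 1.5 × IQR rule (Tukey's fences) to a hypothetical list of spend values; in pandas, `Series.quantile` would supply the same quartiles.

```python
from statistics import quantiles

# Hypothetical monthly spend values (USD) for one country
monthly_spend = [0, 0, 10, 25, 50, 100, 5000]

# Quartiles and interquartile range
q1, _, q3 = quantiles(monthly_spend, n=4, method='inclusive')
iqr = q3 - q1

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in monthly_spend if x < lower or x > upper]
kept = [x for x in monthly_spend if lower <= x <= upper]

print(outliers)
```

A rule like this is reproducible across countries, but on heavily skewed spend data it may flag many legitimate values, so hand-picked thresholds can still be a reasonable choice.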

In [457]:

# Drop extreme outliers in each country
us = ncs_copy[(ncs_copy['CountryLive'] == 'United States of America') & (ncs_copy['money_spent_monthly'] < 2e4)]
india = ncs_copy[(ncs_copy['CountryLive'] == 'India') & (ncs_copy['money_spent_monthly'] < 2e3)]
uk = ncs_copy[(ncs_copy['CountryLive'] == 'United Kingdom') & (ncs_copy['money_spent_monthly'] < 400)]
canada = ncs_copy[(ncs_copy['CountryLive'] == 'Canada') & (ncs_copy['money_spent_monthly'] < 3000)]
top_4 = [us, india, uk, canada]

Fig.4 Box Plots For Each Country On Money Spent Monthly Without Extreme Outliers¶

In [458]:

# Generate boxplots for each country after eliminating extreme outliers
fig = plt.figure(figsize=(14, 10)) 
for i in range(4):
    ax = fig.add_subplot(2,2,i+1)
    sns.boxplot(x = 'CountryLive', y = 'money_spent_monthly', 
                data = top_4[i], ax = ax)   
    
plt.show()

In [459]:

# Check mean of money spent monthly by country again 
msm_4 = pd.concat([us, india, uk, canada])
msm_4.groupby('CountryLive')['money_spent_monthly'].mean()

Out[459]:

CountryLive
Canada                       93.065400
India                        57.256604
United Kingdom               25.245838
United States of America    183.800110
Name: money_spent_monthly, dtype: float64

Common sense says it's still hard to believe that some people in the US spend as much as 20,000 dollars a month on learning, so let's take a closer look.

In [460]:

msm_4[msm_4['CountryLive'] == 'United States of America']

Out[460]:

	Age	AttendedBootcamp	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventConferences	CodeEventDjangoGirls	...	YouTubeFunFunFunction	YouTubeGoogleDev	YouTubeLearnCode	YouTubeLevelUpTuts	YouTubeMIT	YouTubeMozillaHacks	YouTubeOther	YouTubeSimplilearn	YouTubeTheNewBoston	money_spent_monthly
1	34.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	13.333333
2	21.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	200.000000
15	32.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
16	29.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	16.666667
18	46.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	35.714286
19	31.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	17.857143
21	23.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	100.000000
23	27.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	100.000000
28	19.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	2.416667
30	29.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	66.666667
31	27.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
32	54.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	100.000000
33	24.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	83.333333
35	23.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
40	30.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	1.0	1.0	1.0	1.0	1.0	NaN	Stephen Mayeux	NaN	1.0	25.000000
42	30.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	50.000000
63	31.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	16.666667
66	22.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.777778
67	22.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.785714
68	31.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	357.142857
70	48.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	50.000000
84	28.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	16.666667
97	30.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	166.666667
163	20.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	25.000000
213	23.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	27.777778
226	45.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15.000000
229	23.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
234	25.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	...	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1200.000000
236	29.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	...	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	1.0	16.666667
237	16.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	...	NaN	1.0	1.0	1.0	NaN	NaN	NaN	NaN	1.0	0.833333
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
17965	42.0	0.0	NaN	NaN	NaN	NaN	2.0	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17967	46.0	0.0	NaN	NaN	NaN	NaN	4.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17970	38.0	0.0	NaN	NaN	NaN	NaN	5.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17972	24.0	0.0	NaN	NaN	NaN	NaN	3.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	not applicable	NaN	NaN	0.000000
17974	61.0	0.0	NaN	NaN	NaN	NaN	2.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	66.666667
17979	34.0	0.0	NaN	NaN	NaN	NaN	1.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17980	29.0	0.0	NaN	NaN	NaN	NaN	2.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17981	25.0	0.0	NaN	NaN	NaN	NaN	1.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17987	30.0	0.0	NaN	NaN	NaN	NaN	1.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17991	30.0	0.0	NaN	NaN	NaN	NaN	1.0	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
17995	24.0	0.0	NaN	NaN	NaN	NaN	2.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Not watched any	NaN	NaN	0.000000
17996	32.0	0.0	NaN	NaN	NaN	NaN	2.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18000	44.0	0.0	NaN	NaN	NaN	NaN	2.0	between 100,000 and 1 million	NaN	NaN	...	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	10.416667
18003	30.0	0.0	NaN	NaN	NaN	NaN	3.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	25.000000
18007	29.0	0.0	NaN	NaN	NaN	NaN	1.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	none	NaN	NaN	0.000000
18014	27.0	0.0	NaN	NaN	NaN	NaN	1.0	less than 100,000	NaN	NaN	...	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18015	27.0	0.0	NaN	NaN	NaN	NaN	2.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	5.555556
18017	45.0	0.0	NaN	NaN	NaN	NaN	2.0	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18020	31.0	0.0	NaN	NaN	NaN	NaN	2.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	0.000000
18037	40.0	0.0	NaN	NaN	NaN	NaN	3.0	more than 1 million	1.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	83.333333
18039	31.0	0.0	NaN	NaN	NaN	NaN	1.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	10.000000
18043	29.0	0.0	NaN	NaN	NaN	NaN	1.0	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18044	33.0	0.0	NaN	NaN	NaN	NaN	2.0	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18049	35.0	0.0	NaN	NaN	NaN	NaN	2.0	between 100,000 and 1 million	NaN	NaN	...	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18050	33.0	0.0	NaN	NaN	NaN	NaN	3.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	16.666667
18069	32.0	0.0	NaN	NaN	NaN	NaN	1.0	more than 1 million	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18071	47.0	0.0	NaN	NaN	NaN	NaN	2.0	less than 100,000	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Random videos	NaN	NaN	7.500000
18093	29.0	1.0	0.0	0.0	Dev Bootcamp	1.0	5.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	NaN	1.0	1.0	1.0	NaN	NaN	NaN	27.777778
18113	24.0	0.0	NaN	NaN	NaN	NaN	2.0	between 100,000 and 1 million	NaN	NaN	...	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	0.000000
18130	23.0	0.0	NaN	NaN	NaN	NaN	1.0	between 100,000 and 1 million	NaN	NaN	...	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	0.000000

2931 rows × 137 columns

In [461]:

# Let's look at people who didn't attend any bootcamp and check how much they spend every month
non_bootcamp_attendants = msm_4[(msm_4['AttendedBootcamp'] == 0) & (msm_4['money_spent_monthly']!= 0)]
non_bootcamp_attendants['money_spent_monthly'].value_counts().sort_index(ascending = False)

Out[461]:

16666.666667     1
15000.000000     1
14000.000000     1
12500.000000     1
10833.333333     1
5000.000000      2
4550.000000      1
4500.000000      1
4250.000000      1
4000.000000      2
3333.333333      1
3250.000000      1
3000.000000      2
2666.666667      1
2500.000000      2
2375.000000      1
2000.000000      4
1833.333333      2
1750.000000      2
1666.666667      5
1650.000000      1
1500.000000      7
1428.571429      1
1400.000000      1
1333.333333      1
1285.714286      1
1250.000000      1
1222.222222      1
1200.000000      3
1150.000000      1
                ..
1.208333         1
1.111111         1
1.052632         1
1.041667         3
1.000000         8
0.909091         1
0.900000         1
0.833333        13
0.769231         1
0.714286         1
0.708333         1
0.694444         1
0.672043         1
0.666667         3
0.625000         1
0.555556         3
0.500000         2
0.431034         1
0.428571         1
0.416667        12
0.357143         1
0.333333         2
0.250000         1
0.208333         1
0.192308         1
0.166667         1
0.138889         1
0.066667         1
0.050000         1
0.033333         1
Name: money_spent_monthly, Length: 325, dtype: int64

In [462]:

print('mean of money spent monthly for non-bootcamp-attendants: ' + str(non_bootcamp_attendants['money_spent_monthly'].mean()), 
      '\n\n',
     'median of money spent monthly for non-bootcamp-attendants: ' + str(non_bootcamp_attendants['money_spent_monthly'].median()),
     )

mean of money spent monthly for non-bootcamp-attendants: 175.89802495856682 

 median of money spent monthly for non-bootcamp-attendants: 25.0

Fig.5 Histogram Of Monthly Spend For Non-bootcamp-attendants¶

In [463]:

# Generate a histogram to get a better idea of the distribution
non_bootcamp_attendants['money_spent_monthly'].hist(bins = 15)

Out[463]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff4076c5e48>

In [464]:

# Check what share of participants are non-bootcamp-attendants with non-zero spend
non_bootcamp_attendants.shape[0]/msm_4.shape[0]

Out[464]:

0.4639651192613491

Since non-bootcamp-attendants make up almost half of our data, that's too much to just drop. Also, there are learning platforms other than bootcamps that cost money. From Fig.5, we can see that most non-bootcamp-attendants spend less than 1000 a month, so let's eliminate the rows where non-bootcamp-attendants spend more than 1000 a month.
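
The rule described above combines two conditions. A minimal sketch of that filter follows; the `spend` DataFrame is hypothetical and only reuses the `AttendedBootcamp` and `money_spent_monthly` column names.

```python
import pandas as pd

# Hypothetical rows standing in for the four-country subset
spend = pd.DataFrame({
    'AttendedBootcamp': [0.0, 0.0, 1.0, 0.0],
    'money_spent_monthly': [25.0, 1500.0, 3000.0, 1000.0],
})

# Rows to discard: no bootcamp attended, yet more than 1000 spent per month
implausible = (spend['AttendedBootcamp'] == 0) & (spend['money_spent_monthly'] > 1000)

# Keep everything else; bootcamp attendees with high spend stay in
cleaned = spend[~implausible]
print(cleaned)
```

Building a new DataFrame from the negated mask leaves the original untouched, which can make it easier to compare results before and after the filter.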

In [465]:

to_drop = msm_4[(msm_4['AttendedBootcamp'] == 0) & (msm_4['money_spent_monthly'] > 1000)].index
msm_4.drop(to_drop, inplace = True)

In [466]:

# Check mean of money spent monthly by country again 
msm_4.groupby('CountryLive')['money_spent_monthly'].mean()

Out[466]:

CountryLive
Canada                       62.219219
India                        51.438752
United Kingdom               25.245838
United States of America    130.234528
Name: money_spent_monthly, dtype: float64

Fig.6 Barplot For Number Of Coders And Their Monthly Spend By Country¶

In [467]:

# Generate barplot for number of coders and their monthly spend by country
fig = plt.figure(figsize = (15, 5))
ax1 = fig.add_subplot(1,2,1)
sns.barplot(data = msm_4, x = 'CountryLive', y = 'money_spent_monthly', palette="Blues_d", ax = ax1)

ax2 = fig.add_subplot(1,2,2)
sns.countplot(data = msm_4, x = 'CountryLive', palette = "summer", ax = ax2)

Out[467]:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff4075ef550>

Conclusions:¶

The United States of America should clearly be our first pick of market to advertise in: it has both the largest number of coders and the highest monthly spend.
The second market is harder to pick. India has a lower average monthly spend but a higher number of coders, while Canada has a higher average monthly spend but a lower number of coders. The number of coders in Canada is almost half that of India, while India's average monthly spend is only about $10 less than Canada's and not far from $60 (our subscription costs $59). All things considered, India seems to have more potential than Canada.
It makes sense to split the advertising budget unequally, whether we choose India or Canada as the second market. We can start with a 70% vs 30% split as a test run and adjust the budget along the way.
We should at least show Fig.6 to the marketing team and let them use their domain knowledge to make the best decision. From my experience as an account/project manager in an advertising agency in Beijing, there are always other things for the marketing team to consider. For example, if we choose the US and Canada as the markets to advertise in, we may only need one set of advertising materials, because the two countries are culturally much more similar to each other than either is to India.
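
The trade-off between audience size and spend can be made a little more concrete with a rough proxy: respondents multiplied by average monthly spend. This is only a sketch under assumptions. The counts come from the absolute frequency table for the whole sample, not the filtered subset, and the means are rounded from the last groupby output, so the product is an illustrative ranking aid rather than a revenue estimate.

```python
# Respondent counts from the absolute frequency table (whole sample, so only a rough proxy)
respondents = {
    'United States of America': 5791,
    'India': 1400,
    'United Kingdom': 757,
    'Canada': 616,
}

# Mean monthly spend after removing outliers, rounded from the last groupby output
mean_spend = {
    'United States of America': 130.23,
    'India': 51.44,
    'United Kingdom': 25.25,
    'Canada': 62.22,
}

# Proxy for market size: respondents multiplied by average monthly spend
market_proxy = {c: respondents[c] * mean_spend[c] for c in respondents}
ranking = sorted(market_proxy, key=market_proxy.get, reverse=True)
print(ranking)
```

Under this proxy India ranks ahead of Canada, which agrees with the reasoning above, though the marketing team may weigh other factors differently.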