As a programming e-learning company, we want to advertise our services to aspiring developers. In this project, we are interested in finding the best two countries to do so. Note that these two countries must be English-speaking, as all of our content is in English.
One way to figure out where to advertise would be to organize a survey. However, this could be costly. Luckily, the e-learning platform freeCodeCamp made available the results of its survey of new coders, the New Coder Survey. We will use this as a starting point.
#importing useful libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#reading the data into a pandas DataFrame
new_coders = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')
/dataquest/system/env/python3/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (17,62) have mixed types. Specify dtype option on import or set low_memory=False.
Apparently, some columns have mixed data types, which is strange. We'll keep that in mind; for now, let's do some basic exploration of this dataset.
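If we wanted to silence this warning, read_csv offers two standard options. Here is a small self-contained sketch using an in-memory CSV (the column names are toy ones, not from the survey):

```python
import io

import pandas as pd

# A tiny in-memory CSV whose column 'b' mixes numbers and strings,
# mimicking CodeEventOther / JobInterestOther in the survey file.
csv_text = "a,b\n1,x\n2,3\n"

# Option 1: low_memory=False reads the whole file before inferring dtypes,
# so pandas no longer infers per-chunk and the DtypeWarning disappears.
df1 = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: pin the offending column's dtype explicitly.
df2 = pd.read_csv(io.StringIO(csv_text), dtype={'b': str})
print(df2['b'].tolist())  # both values come back as strings
```

Either option would work on the survey file; since the warning is harmless here, we simply leave the import as-is.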
new_coders.shape
(18175, 136)
new_coders.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18175 entries, 0 to 18174 Columns: 136 entries, Age to YouTubeTheNewBoston dtypes: float64(105), object(31) memory usage: 18.9+ MB
That's a lot of columns! We will focus our analysis on only a few of them at a time.
Before we start, let's look at the two columns with mixed types.
new_coders.columns[17]
'CodeEventOther'
new_coders['CodeEventOther'].value_counts(dropna = False).head(10)
NaN 17605 No 21 Ladies Learning Code 9 Bootcamp 8 General Assembly 8 School 8 Na 6 No One 5 Codebar 5 GDG 4 Name: CodeEventOther, dtype: int64
This column contains any coding event the respondent attended that was not in the survey's list. There are a large number of NaN values, as well as strings denoting a coding event. There are also values such as 'No', meaning the respondent attended no event.
Let's look at the other column.
new_coders.columns[62]
'JobInterestOther'
new_coders['JobInterestOther'].value_counts(dropna = False).head(10)
NaN 17909 Undecided 23 Software Engineer 22 Software Developer 9 Data Analyst 6 Machine Learning Engineer 5 Artificial Intelligence 5 Project Manager 4 Programmer 4 Researcher 3 Name: JobInterestOther, dtype: int64
This seems to be a similar issue. In both columns, the mixed types arise because pandas stores NaN as a float, and these floats are mixed with strings.
As long as we do not use any string-specific method on these two columns, this shouldn't be a problem. We will leave it be for now.
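To see concretely why NaN drags a column into mixed types, consider a toy Series (an illustration, not survey data):

```python
import numpy as np
import pandas as pd

# A column holding both strings and NaN gets dtype 'object', and the
# missing values remain floats under the hood (NaN is a float).
s = pd.Series(['Bootcamp', np.nan, 'School'])
print(s.dtype)       # object
print(type(s[1]))    # float -- NaN itself is a float
```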
Let's print the first five rows:
new_coders.head(5)
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 136 columns
Here we can see that there are a lot of missing values. As these should be dealt with on a column-by-column basis, we will not dive deeper into that issue for now.
Let's try to identify which columns we'll be interested in.
for col in new_coders.columns:
    print(col)
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName BootcampRecommend ChildrenNumber CityPopulation CodeEventConferences CodeEventDjangoGirls CodeEventFCC CodeEventGameJam CodeEventGirlDev CodeEventHackathons CodeEventMeetup CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps CodeEventWomenCode CodeEventWorkshops CommuteTime CountryCitizen CountryLive EmploymentField EmploymentFieldOther EmploymentStatus EmploymentStatusOther ExpectedEarning FinanciallySupporting FirstDevJob Gender GenderOther HasChildren HasDebt HasFinancialDependents HasHighSpdInternet HasHomeMortgage HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning ID.x ID.y Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev IsUnderEmployed JobApplyWhen JobInterestBackEnd JobInterestDataEngr JobInterestDataSci JobInterestDevOps JobInterestFrontEnd JobInterestFullStack JobInterestGameDev JobInterestInfoSec JobInterestMobile JobInterestOther JobInterestProjMngr JobInterestQAEngr JobInterestUX JobPref JobRelocateYesNo JobRoleInterest JobWherePref LanguageAtHome MaritalStatus MoneyForLearning MonthsProgramming NetworkID Part1EndTime Part1StartTime Part2EndTime Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk PodcastTalkPython PodcastTheWebAhead ResourceCodecademy ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX ResourceEgghead ResourceFCC ResourceHackerRank ResourceKA ResourceLynda ResourceMDN ResourceOdinProj ResourceOther ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse ResourceUdacity ResourceUdemy ResourceW3S SchoolDegree SchoolMajor StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain YouTubeCodingTut360 YouTubeComputerphile YouTubeDerekBanas YouTubeDevTips 
YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston
We will probably use only a handful of these columns; we'll select them as the analysis requires.
We first need to figure out whether the respondents could be interested in our services. Let us look at the JobRoleInterest column.
new_coders['JobRoleInterest'].value_counts(normalize = True,dropna=False)
NaN 0.615296 Full-Stack Web Developer 0.045282 Front-End Web Developer 0.024759 Data Scientist 0.008363 Back-End Web Developer 0.007813 Mobile Developer 0.006437 Game Developer 0.006272 Information Security 0.005062 Full-Stack Web Developer, Front-End Web Developer 0.003521 Front-End Web Developer, Full-Stack Web Developer 0.003081 Product Manager 0.003026 Data Engineer 0.002916 User Experience Designer 0.002861 User Experience Designer, Front-End Web Developer 0.002366 Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.002146 Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer 0.001981 DevOps / SysAdmin 0.001981 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.001981 Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.001706 Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer 0.001651 Full-Stack Web Developer, Mobile Developer 0.001596 Front-End Web Developer, User Experience Designer 0.001596 Back-End Web Developer, Full-Stack Web Developer 0.001486 Full-Stack Web Developer, Back-End Web Developer 0.001431 Back-End Web Developer, Front-End Web Developer 0.001100 Data Engineer, Data Scientist 0.001045 Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer 0.001045 Front-End Web Developer, Mobile Developer 0.000990 Full-Stack Web Developer, Data Scientist 0.000935 Mobile Developer, Game Developer 0.000880 ... 
Mobile Developer, Data Engineer, Game Developer, Information Security, User Experience Designer 0.000055 Front-End Web Developer, Full-Stack Web Developer, User Experience Designer, Quality Assurance Engineer, Back-End Web Developer 0.000055 Back-End Web Developer, Front-End Web Developer, Data Engineer, Full-Stack Web Developer, Mobile Developer 0.000055 Back-End Web Developer, Data Scientist, Front-End Web Developer, Full-Stack Web Developer, Data Engineer 0.000055 DevOps / SysAdmin, Full-Stack Web Developer, Back-End Web Developer 0.000055 Full-Stack Web Developer, Quality Assurance Engineer, Front-End Web Developer, Back-End Web Developer, Game Developer 0.000055 Back-End Web Developer, Quality Assurance Engineer, Full-Stack Web Developer 0.000055 Full-Stack Web Developer, Back-End Web Developer, Mobile Developer, Game Developer, Data Engineer 0.000055 Back-End Web Developer, Information Security, Game Developer, Full-Stack Web Developer 0.000055 DevOps / SysAdmin, Game Developer, Full-Stack Web Developer, Information Security, Back-End Web Developer 0.000055 Front-End Web Developer, Data Engineer, DevOps / SysAdmin, Back-End Web Developer, Full-Stack Web Developer, Mobile Developer, User Experience Designer, Information Security 0.000055 DevOps / SysAdmin, Information Security, Data Engineer, Game Developer, Mobile Developer, Back-End Web Developer, Data Scientist, Full-Stack Web Developer 0.000055 Quality Assurance Engineer, Front-End Web Developer, User Experience Designer, Game Developer 0.000055 Data Scientist, DevOps / SysAdmin, Mobile Developer, Front-End Web Developer, Quality Assurance Engineer, Information Security, Product Manager, User Experience Designer, Data Engineer 0.000055 Data Scientist, Information Security, Front-End Web Developer, User Experience Designer 0.000055 User Experience Designer, Information Security, Front-End Web Developer, Mobile Developer, Game Developer 0.000055 Full-Stack Web Developer, Front-End Web Developer, Back-End Web 
Developer, Product Manager, Data Engineer 0.000055 Front-End Web Developer, Game Developer, User Experience Designer, Back-End Web Developer, Full-Stack Web Developer 0.000055 Back-End Web Developer, Full-Stack Web Developer, Data Scientist, Front-End Web Developer 0.000055 Full-Stack Web Developer, Game Developer, Back-End Web Developer, User Experience Designer 0.000055 Full-Stack Web Developer, Back-End Web Developer, Mobile Developer, Front-End Web Developer, Data Scientist, User Experience Designer 0.000055 Data Engineer, Game Developer, Data Scientist, Quality Assurance Engineer 0.000055 Front-End Web Developer, Game Developer, Mobile Developer, Back-End Web Developer, Information Security, DevOps / SysAdmin, Data Scientist, Product Manager 0.000055 Back-End Web Developer, User Experience Designer, Data Scientist, Mobile Developer, Full-Stack Web Developer, Front-End Web Developer 0.000055 Full-Stack Web Developer, Back-End Web Developer, Game Developer, Front-End Web Developer, Mobile Developer, User Experience Designer 0.000055 User Experience Designer, Product Manager, Data Scientist, Front-End Web Developer 0.000055 User Experience Designer, Full-Stack Web Developer, Front-End Web Developer, Mobile Developer 0.000055 Full-Stack Web Developer, Front-End Web Developer, Data Scientist, User Experience Designer, Back-End Web Developer 0.000055 Front-End Web Developer, Product Manager, Data Scientist, Information Security, Mobile Developer, Data Engineer, Quality Assurance Engineer 0.000055 DevOps / SysAdmin, Front-End Web Developer, Mobile Developer, Full-Stack Web Developer 0.000055 Name: JobRoleInterest, Length: 3214, dtype: float64
We remark immediately that over 60% of respondents did not answer this question, and that many answers combine several roles.
Let's find out the number of possible values.
new_coders['JobRoleInterest'].unique().shape
(3214,)
That's a lot of unique values. What would be interesting for us to know at this stage is if enough respondents are interested in web and mobile development, as this is the main content our company offers.
new_coders['JobRoleInterest'].str.contains(r'[Ww]eb|[Mm]obile').value_counts()
True 6035 False 957 Name: JobRoleInterest, dtype: int64
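A note on the counts above: str.contains propagates NaN for missing answers, so True and False only cover respondents who actually answered. A toy illustration (the role strings are made up):

```python
import numpy as np
import pandas as pd

# str.contains returns NaN where the input is NaN, so non-respondents are
# excluded from the True/False tally rather than counted as False.
roles = pd.Series(['Full-Stack Web Developer', np.nan,
                   'Mobile Developer', 'Data Scientist'])
mask = roles.str.contains(r'[Ww]eb|[Mm]obile')
print(mask.tolist())  # [True, nan, True, False]
```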
Of the roughly 18,000 respondents, about a third are interested in web or mobile development. Let's make a pie chart to summarize our findings.
#Setting up labels, sizes and colors
labels = ['Did not Respond', 'Web or Mobile dev', 'Other']
sizes = [new_coders[new_coders['JobRoleInterest'].isnull()].shape[0]
,6035
,957]
colors = ['indianred','seagreen','slateblue']
#Drawing the pie
fig1, ax = plt.subplots()
ax.pie(sizes
, labels=labels
, autopct='%1.f%%'
, explode = [0,0.1,0]
, colors = colors)
ax.set_title('Job Role Interests\n', fontsize= 20)
#This makes sure the axes are of equal length
#And thus that our pie is circular
ax.axis('equal')
plt.show()
As this pie chart makes clear, about a third of the respondents are interested in web or mobile development; most of the remainder did not answer the question at all.
From this we conclude we have a sizable (roughly 6000) sample of new coders interested in web or mobile development.
As we are interested in which country to advertise in, we will now determine which country has the most aspiring developers.
As we want to work only with coders who could be interested in our material, we'll drop any row with a missing value in the 'JobRoleInterest' column.
coders_int = new_coders.dropna(subset = ['JobRoleInterest'])
We can look at relative and absolute frequencies for the CountryLive variable. It indicates which country the respondent currently lives in, and is thus better suited to our purpose than CountryCitizen.
#we use value_counts to generate frequencies
#with normalize = True or False
freq_table = pd.concat([coders_int['CountryLive'].value_counts()
,round(coders_int['CountryLive'].value_counts(normalize = True)*100,1)]
,axis = 1)
#renaming columns
freq_table.columns = ['Total Respondents','Percentage']
freq_table.head(10)
Total Respondents | Percentage | |
---|---|---|
United States of America | 3125 | 45.7 |
India | 528 | 7.7 |
United Kingdom | 315 | 4.6 |
Canada | 260 | 3.8 |
Poland | 131 | 1.9 |
Brazil | 129 | 1.9 |
Germany | 125 | 1.8 |
Australia | 112 | 1.6 |
Russia | 102 | 1.5 |
Ukraine | 89 | 1.3 |
From this table, we clearly see that the United States has the largest number of respondents by far. The second largest is India, with roughly six times fewer respondents.
After that, the differences are smaller: the third country is the U.K., with 315 respondents (compared with 528 for India).
At this stage, it is clear that the U.S. will be one of the countries we advertise in. However, India, the U.K., and Canada could all be legitimate second markets. The choice could depend on the ease of running an effective ad campaign, the competition within each country, and/or the potential spending of customers.
We currently have no data on the first and second points, but our database does contain some info on how much coders spend on learning.
Our dataset gives us access to two relevant variables: MoneyForLearning, the amount of money a respondent spent on learning, and MonthsProgramming, the number of months they have spent programming.
We are interested in how much a respondent would pay for our subscription (currently at $59 a month). Hence, we will divide MoneyForLearning by MonthsProgramming. We'll first replace any 0 in MonthsProgramming with a 1, as we do not want to divide by zero.
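A minimal sketch of this division on toy numbers (the values are made up):

```python
import pandas as pd

# Replace 0 months with 1 before dividing, so brand-new coders do not
# produce a division by zero (or an infinite monthly spending).
toy = pd.DataFrame({'MoneyForLearning': [120.0, 0.0, 300.0],
                    'MonthsProgramming': [12.0, 0.0, 3.0]})
toy['MonthsProgramming'] = toy['MonthsProgramming'].replace(0, 1)
toy['MonthlySpending'] = toy['MoneyForLearning'] / toy['MonthsProgramming']
print(toy['MonthlySpending'].tolist())  # [10.0, 0.0, 100.0]
```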
We'll also only consider the U.S., the U.K., India, and Canada. Let's remove every other row.
coders_4 = coders_int[coders_int['CountryLive'].isin(['India'
                                                      ,'United States of America'
                                                      ,'United Kingdom'
                                                      ,'Canada'])].copy()
#.copy() makes coders_4 an independent DataFrame rather than a view,
#which avoids SettingWithCopyWarning when we modify it below
coders_4.reset_index(inplace = True, drop = True)
coders_4.shape
(4228, 136)
This narrows our focus to 4,228 respondents, still a sizable sample. We can now create a column for monthly money spent, which we'll name MonthlySpending. First we need to replace any zero in the MonthsProgramming column with a 1.
coders_4['MonthsProgramming'].replace(to_replace = 0, value = 1, inplace = True)
#Creating the MonthlySpending column
coders_4['MonthlySpending'] = coders_4['MoneyForLearning']/coders_4['MonthsProgramming']
#Removing any missing value
coders_4.dropna(subset = ['MonthlySpending'], inplace = True)
We now have all the data needed, and we can generate a bar plot to illustrate our results.
#this will store the average monthly spending by country
spending_country = coders_4.groupby('CountryLive')['MonthlySpending'].mean()
#generating the plot
fig1, ax = plt.subplots()
colors = ['indianred','seagreen','slateblue','khaki']
#making the bar plot
ax.bar([0,1,2,3]
, spending_country
, align = 'center'
, width = 0.5
, tick_label = spending_country.index
, color = colors)
#rotating labels for readability
plt.xticks(rotation = 20)
#labels and title
ax.set_ylabel('Mean Monthly Spending USD')
ax.set_title('Mean Monthly Spending by Country\n')
<matplotlib.text.Text at 0x7ff0f74c9518>
The mean spending is highest in the U.S. Somewhat unexpectedly, the second highest is India. This can be surprising, as India has a lower GDP per capita than the other three countries.
As the mean is very sensitive to outliers, these might explain our results. To make our findings more robust, we should remove the outliers from our data. To spot them, we can use box plots.
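For reference, box plots flag outliers with the 1.5 × IQR rule. A quick sketch of that rule on made-up numbers (below, we will instead remove outliers by inspecting the plots directly):

```python
import pandas as pd

# Tukey's rule: anything beyond Q3 + 1.5 * IQR (or below Q1 - 1.5 * IQR)
# is flagged as an outlier.
spend = pd.Series([10, 20, 30, 40, 50, 80000])
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
print(spend[spend <= upper].tolist())  # the 80000 value is dropped
```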
sns.boxplot(x = 'CountryLive'
, y = 'MonthlySpending'
,data = coders_4)
plt.xticks(rotation = 20)
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
We can see on this graph that some extreme outliers for the U.S. stretch out the box plot. We can remove them from our dataset.
coders_4 = coders_4[coders_4['MonthlySpending']<20000]
We create the same box plots:
sns.boxplot(x = 'CountryLive'
, y = 'MonthlySpending'
,data = coders_4)
plt.xticks(rotation = 20)
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
This is still very stretched. Let us remove any respondent who spends more than $6,000 a month.
coders_4 = coders_4[coders_4['MonthlySpending']<6000]
sns.boxplot(x = 'CountryLive'
, y = 'MonthlySpending'
,data = coders_4)
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0fc93d7f0>
Here, we clearly see outliers in India and Canada spending more than $3,000 a month. Let us remove them from the dataset.
coders_4 = coders_4[~((coders_4['MonthlySpending'] > 3000 )&\
((coders_4['CountryLive'] == 'India') |\
(coders_4['CountryLive'] == 'Canada')))]
The remaining high spenders can, after a look at their individual responses, at least partially be explained: some of them took weeks-long online courses, which can be priced in the $10,000 range. Thus, we will leave them in the dataset. We can recompute the mean now.
#this will store the average monthly spending by country
spending_country = coders_4.groupby('CountryLive')['MonthlySpending'].mean()
#generating the plot
fig1, ax = plt.subplots()
colors = ['indianred','seagreen','slateblue','khaki']
#making the bar plot
ax.bar([0,1,2,3]
, spending_country
, align = 'center'
, width = 0.5
, tick_label = spending_country.index
, color = colors)
#rotating labels for readability
plt.xticks(rotation = 20)
#labels and title
ax.set_ylabel('Mean Monthly Spending USD')
ax.set_title('Mean Monthly Spending by Country\n')
<matplotlib.text.Text at 0x7ff0f724ac18>
spending_country
CountryLive Canada 93.065400 India 65.758763 United Kingdom 45.534443 United States of America 142.654608 Name: MonthlySpending, dtype: float64
Based on these means, it would seem that the second country should be Canada. However, we should weigh other factors before making our decision.
In particular, we only sell monthly subscriptions, at 59 dollars a month. Thus, another interesting statistic to look at is the percentage of coders spending 59 dollars or more a month.
#This function takes a series and returns the proportion of values >= 59
def more_than_59_per(s):
    return (s >= 59).sum()/s.shape[0]
#We apply it to the MonthlySpending column, grouped by CountryLive
group = coders_4.groupby(by = 'CountryLive')
more_than_59_data = group['MonthlySpending'].agg(more_than_59_per)*100
more_than_59_data
CountryLive Canada 16.317992 India 15.098468 United Kingdom 15.412186 United States of America 22.294521 Name: MonthlySpending, dtype: float64
#generating the plot
fig1, ax = plt.subplots()
colors = ['indianred','seagreen','slateblue','khaki']
#making the bar plot
ax.bar([0,1,2,3]
, more_than_59_data
, align = 'center'
, width = 0.5
, tick_label = more_than_59_data.index
, color = colors)
#rotating labels for readability
plt.xticks(rotation = 20)
#labels and title
ax.set_ylabel('Percentage')
ax.set_title('Percentage Spending more than 59USD/Month\n')
<matplotlib.text.Text at 0x7ff0f71ddc18>
The U.S. percentage is higher than that of the other three countries, which have very close shares (roughly 15%) of coders spending over 59 dollars a month.
This is where country population may become relevant. Indeed, Canada has a population of 38 million, whereas India has a population of 1.3 billion, roughly 34 times more. If we can reach a similar proportion of the market in Canada and India, India is likely to be more profitable. We could even lower our price in India, to account for the lower mean monthly spending there.
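A back-of-envelope sketch of that argument, using the rough population figures above and the percentages from our table (with one big, explicitly flagged assumption: that the survey shares generalize to the whole population):

```python
# Rough populations and the shares of respondents spending over $59/month.
population = {'Canada': 38e6, 'India': 1.3e9}
share_over_59 = {'Canada': 0.163, 'India': 0.151}

# Hypothetical pool of potential subscribers in each country, assuming the
# survey proportions generalize (a strong assumption, for illustration only).
pool = {c: population[c] * share_over_59[c] for c in population}
print(pool['India'] / pool['Canada'])  # India's pool is roughly 30x larger
```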
At this point in our study, India seems more promising than Canada: its total population is much larger, so we could expect to reach a far greater number of coders there.
We could also advertise only in the U.S., or even in more than two countries. This would require more data, for example on the competition in each of these countries, and the growth of the market for online coding courses.
At this point, it would be advisable to transfer our findings to the marketing department and ask for their opinion. With more data from them, we could make a more informed decision.
Our dataset could also help us choose an advertising medium. For example, it records which YouTube channels and podcasts each respondent used while learning. An advertisement on a popular free resource could be very effective at bringing in new customers, so let's determine which are the most popular.
We will do so by computing the proportion of respondents that used each of them, starting with YouTube channels. Let us look at the names of the columns, as well as the values they store.
#This list will store the name of columns containing the string YouTube
youtube_cols = coders_4.columns[coders_4.columns.str.contains('YouTube')]
for col in youtube_cols:
    print(coders_4[col].value_counts(dropna= False))
NaN 3727 1.0 168 Name: YouTubeCodeCourse, dtype: int64 NaN 3693 1.0 202 Name: YouTubeCodingTrain, dtype: int64 NaN 3499 1.0 396 Name: YouTubeCodingTut360, dtype: int64 NaN 3552 1.0 343 Name: YouTubeComputerphile, dtype: int64 NaN 3455 1.0 440 Name: YouTubeDerekBanas, dtype: int64 NaN 3346 1.0 549 Name: YouTubeDevTips, dtype: int64 NaN 3578 1.0 317 Name: YouTubeEngineeredTruth, dtype: int64 NaN 2438 1.0 1457 Name: YouTubeFCC, dtype: int64 NaN 3651 1.0 244 Name: YouTubeFunFunFunction, dtype: int64 NaN 3266 1.0 629 Name: YouTubeGoogleDev, dtype: int64 NaN 3337 1.0 558 Name: YouTubeLearnCode, dtype: int64 NaN 3631 1.0 264 Name: YouTubeLevelUpTuts, dtype: int64 NaN 3092 1.0 803 Name: YouTubeMIT, dtype: int64 NaN 3810 1.0 85 Name: YouTubeMozillaHacks, dtype: int64 NaN 3655 None 19 none 15 The Net Ninja 7 Traversy Media 7 Simple Programmer 6 LearnWebCode 4 Sentdex 3 Khan Academy 2 Eli the Computer Guy 2 Wes Bos 2 net ninja 2 LambdaSchool 2 Stephen Mayeux 2 mindspace 2 Brackeys 2 TheNetNinja 2 Stanford 2 I haven't watched any youtube videos yet 1 siraj 1 simple programmer 1 none yet 1 CHRIS Sean 1 TechmakerTV 1 I have not yet watched coding-related YouTube videos 1 Random YouTube videos 1 HandMadeHero, Paul Programming 1 Chris Hawkes 1 sentdex 1 Slidenerd 1 ... sentdev 1 Code Ninja 1 None so far... planning to soon 1 random ones 1 Harvard CS50 1 Stanford Engineering course 1 I haven't really 1 idk 1 coding tutorials 360 1 Nptel 1 thenewboston 1 CodeWithNick 1 sentdex , sirajraval 1 Simple Programmer, Chris Hawkes 1 RailsCasts 1 TheRealCasadaro 1 Code Babes 1 Hack Reactor 1 CBT Nuggets 1 random vids I find while googling 1 There was a Sequelize one that was really good, but I can't remember what it was called. 
1 Rem Zolotykh https://www.youtube.com/channel/UCsvMopMspsGw89AWim0FMfw 1 don't watch youtube to look coding up 1 --- 1 Jim Campagno 1 Have not seen any of these youtube videos 1 none of these 1 A Cloud Guru 1 simpleprogrammer 1 FunFunFunction 1 Name: YouTubeOther, Length: 177, dtype: int64 NaN 3852 1.0 43 Name: YouTubeSimplilearn, dtype: int64 NaN 3250 1.0 645 Name: YouTubeTheNewBoston, dtype: int64
All the variables but YouTubeOther have two possible values: NaN, denoting that the respondent did not watch that YouTube channel, and 1.0, denoting that they did. In YouTubeOther, respondents were asked to enter any other coding channel they watched, so it holds many different string values. However, none of the channels listed there is very popular, so we'll simply discard this variable.
del coders_4['YouTubeOther']
#Updating the columns containing the string YouTube
youtube_cols = coders_4.columns[coders_4.columns.str.contains('YouTube')]
#We define a function to convert NaN and 1.0 to booleans
def to_bool(x):
    if pd.isnull(x):
        return False
    elif x == 1.0:
        return True
    else:
        return x

for col in youtube_cols:
    coders_4[col] = coders_4[col].apply(to_bool)
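As an aside, the same conversion can be done without apply: comparing with 1.0 maps NaN to False directly, since NaN == 1.0 evaluates to False. A toy check:

```python
import numpy as np
import pandas as pd

# Equality comparison yields False for NaN, matching what to_bool does
# on these two-valued columns.
col = pd.Series([1.0, np.nan, 1.0, np.nan])
as_bool = col == 1.0
print(as_bool.tolist())  # [True, False, True, False]
```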
We can now use value_counts to find out the proportion of respondents who watched a given channel.
#We use value_counts to get the share of True values in each column
#and a list comprehension to collect them
data = [coders_4[col].value_counts(normalize = True)[True]*100 for col in youtube_cols]
yt_pop = pd.Series(index = youtube_cols
,data = data).sort_values(ascending = False)
yt_pop.plot.bar(figsize = (10,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0f729d898>
A good next step would be to find out whether any of these channels would agree to advertise our company directly. We should contact them in the order shown on this graph.
We can repeat this process for podcasts. The code is exactly the same, so we will move faster this time.
podcast_cols = coders_4.columns[coders_4.columns.str.contains('Podcast')]
for col in podcast_cols:
    print(coders_4[col].value_counts(dropna= False))
NaN 3811 1.0 84 Name: PodcastChangeLog, dtype: int64 NaN 3339 1.0 556 Name: PodcastCodeNewbie, dtype: int64 NaN 3759 1.0 136 Name: PodcastCodePen, dtype: int64 NaN 3719 1.0 176 Name: PodcastDevTea, dtype: int64 NaN 3843 1.0 52 Name: PodcastDotNET, dtype: int64 NaN 3860 1.0 35 Name: PodcastGiantRobots, dtype: int64 NaN 3784 1.0 111 Name: PodcastJSAir, dtype: int64 NaN 3627 1.0 268 Name: PodcastJSJabber, dtype: int64 NaN 3624 1.0 271 Name: PodcastNone, dtype: int64 NaN 3689 Learn to Code with Me 9 Front End Happy Hour 7 Learn to Code With Me 6 Coding Blocks 5 Learn to code with me 4 na 4 Full Stack Radio 4 Coder Radio 3 no 3 coding blocks 3 The Versioning Show 2 Not So Standard Deviations 2 Toolsday 2 c++ 2 I havn't 2 Start Here 2 Coding Tutorials 360 2 Frontend Happy Hour 2 Arrested DevOps 2 Starthere.fm 2 Youtube 2 not listened to any 2 Learn to code with me 2 not yet 2 Late Nights with Trav and Los 2 Front end happy hour 2 Front-end Happy Hour 2 Adventures in Angular 2 Eat Sleep Code 1 ... 
Start Here, Especially Big Data, Hadooponomics, Data Crunch 1 MS Dev Show 1 Not yet 1 Codependant 1 Start here web development 1 Ruby on rails 1 Coding360 1 Breakingintostartups.com 1 various youtubers 1 TravLos podcast 1 Women Who Code 1 Learn To Code With Me 1 Mentoring Developers 1 Learntocodewith.me 1 We Are Lighthouse London 1 freeCodeCamp YouTube Channel 1 Start Here: Web Development 1 DataSkeptic 1 O'reilly show 1 React native 1 Reboot, Start Here FM 1 IDeveloper, merge conflict 1 Frontend Masters 1 Python podcast init 1 Google Cloud Platform Podcast 1 partiallyderivative.com 1 --- 1 JavaScript Jabber 1 LearnWebCode on Youtube 1 Developer on Fire 1 Name: PodcastOther, Length: 151, dtype: int64 NaN 3799 1.0 96 Name: PodcastProgThrowdown, dtype: int64 NaN 3812 1.0 83 Name: PodcastRubyRogues, dtype: int64 NaN 3727 1.0 168 Name: PodcastSEDaily, dtype: int64 NaN 3815 1.0 80 Name: PodcastSERadio, dtype: int64 NaN 3823 1.0 72 Name: PodcastShopTalk, dtype: int64 NaN 3735 1.0 160 Name: PodcastTalkPython, dtype: int64 NaN 3831 1.0 64 Name: PodcastTheWebAhead, dtype: int64
del coders_4['PodcastOther']
podcast_cols = coders_4.columns[coders_4.columns.str.contains('Podcast')]
for col in podcast_cols:
    coders_4[col] = coders_4[col].apply(to_bool)
data_pod = [coders_4[col].value_counts(normalize = True)[True]*100 for col in podcast_cols]
podcast_pop = pd.Series(index = podcast_cols
,data = data_pod).sort_values(ascending = False)
podcast_pop.plot.bar(figsize = (10,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0f715c0f0>
Again, this gives us a roadmap for trying to place advertisements in these podcasts. Also note that podcasts are much less popular than YouTube channels, so we should prioritize YouTube advertising when possible.
One next step would be to segment this data by country. We could then target our advertisements more precisely.