We are an e-learning company that offers programming courses. We focus on web and mobile development; however, we also many other domains, like data science, game development, etc.
We want to spend some money on advertising to promote our product. However, our goal is to find the two best markets to advertise our products as our budget is limited.
To do so, we will try answering the below questions: • Where are new coders located? • What are the locations with the most significant number of new coders? • How much money new coders are willing to spend on learning.
In this project, we analyzed survey data from new coders to find the best two markets to advertise in, based on the potential customer and monthly spend.
The USA is an excellent market to promote as it is the biggest market and has the highest spending per month on learning.
India and Canada are good candidates too. However, we will need input from the Marketing team to make the best decision.
For more details, please refer to the full analysis below.
To avoid spending a lot of money on gathering data, we will explore existing data relevant to our purpose.
One promising candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for our analysis.
The data set is available on [GitHub][1] [1]:https://github.com/freeCodeCamp/2017-new-coder-survey
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv',low_memory=False)
survey.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18175 entries, 0 to 18174 Columns: 136 entries, Age to YouTubeTheNewBoston dtypes: float64(105), object(31) memory usage: 18.9+ MB
# prinint the first 5 rows to inspect the data
survey.head()
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 136 columns
The data has 18 175 rows and 136 columns. Each column represents a question from the survey.
InIn the next step, we will check if the data it's representative of our population of interest and if it has the right categories of people for our purpose.
# explore JobRoleInterest column
survey.JobRoleInterest
0 NaN 1 Full-Stack Web Developer 2 Front-End Web Developer, Back-End Web Develo... 3 Front-End Web Developer, Full-Stack Web Deve... 4 Full-Stack Web Developer, Information Security... 5 NaN 6 Full-Stack Web Developer 7 NaN 8 NaN 9 Full-Stack Web Developer, Quality Assurance ... 10 NaN 11 DevOps / SysAdmin, Data Scientist, Informa... 12 NaN 13 Back-End Web Developer, Full-Stack Web Develop... 14 Full-Stack Web Developer 15 Full-Stack Web Developer 16 Full-Stack Web Developer 17 NaN 18 Full-Stack Web Developer, Front-End Web Deve... 19 Front-End Web Developer, Mobile Developer,... 20 NaN 21 Information Security 22 Full-Stack Web Developer 23 Back-End Web Developer 24 NaN 25 NaN 26 NaN 27 NaN 28 Full-Stack Web Developer 29 Front-End Web Developer, Data Scientist, F... ... 18145 NaN 18146 NaN 18147 NaN 18148 NaN 18149 NaN 18150 NaN 18151 Front-End Web Developer 18152 NaN 18153 Information Security, Full-Stack Web Developer 18154 Full-Stack Web Developer 18155 Full-Stack Web Developer, Front-End Web Deve... 18156 Full-Stack Web Developer 18157 Back-End Web Developer, Data Engineer, Mobil... 18158 NaN 18159 NaN 18160 User Experience Designer 18161 Full-Stack Web Developer 18162 Data Scientist, Game Developer, Quality As... 18163 Back-End Web Developer, Data Engineer, Data ... 18164 NaN 18165 NaN 18166 NaN 18167 NaN 18168 NaN 18169 NaN 18170 NaN 18171 DevOps / SysAdmin, Mobile Developer, Pro... 18172 NaN 18173 NaN 18174 Back-End Web Developer, Data Engineer, Data ... Name: JobRoleInterest, Length: 18175, dtype: object
As we can see, people are interested in more than one subject. Moreover, we can see that we have some null values.
To better understand the data, we are going to calculate the frequency of every job.
job_interest_freq = {'Unknown':0}
# loop over each response in JobRoleInterest
for row in survey["JobRoleInterest"]:
if type(row) is float:
job_interest_freq["Unknown"] += 1 # null Values will be tracked under Unknown
elif type(row) is str:
str_list = row.split(',')# create a list for answer with multiple choices
for n in str_list:
n= n.strip().lower()# clean up the string by removing trailing spaces and Upper case letter
if n in job_interest_freq:
job_interest_freq[n]+=1
else:
job_interest_freq[n]=1
# turn Dictionnary into Dataframe
job_interest_freq = pd.DataFrame.from_dict(job_interest_freq,orient="index")
job_interest_freq.rename(columns={0:"Frequency"}, inplace=True)
# Convert values to percentages
job_interest_freq["Frequency"] = round(job_interest_freq["Frequency"]/job_interest_freq["Frequency"].sum()*100,2)
job_interest_freq.sort_values(by="Frequency",ascending=False,inplace= True)
#Summarize top 10 and others
top_10 = job_interest_freq[:10]
# add the rest as others
others = job_interest_freq[10:].sum()
others = pd.Series(others[0],index=["Other"],name="Frequency")
others= others.to_frame()
top_10 = pd.concat([top_10,others])
top_10.plot.barh()
plt.title("Frequency of Top 10 Job Interest")
plt.xlabel('Frequency (%)')
<matplotlib.text.Text at 0x7f08b62741d0>
As the question is open-ended, there is a lot of different jobs. However, 33.1% of the respondent did not specify any interest. Web developer, which splits into full-stack, front-end, and back-end, is the highest interest role with 31.1% of the respondent. Mobile developer is next with 6.82% of the respondent, followed by Data scientists, game developers.
As we can see, the sample has the right categories of people for our purpose. Therefore we can continue with our analysis.
In the survey, we have information regarding the location of the new coders. Each Country represents an individual market. In the next step, we will be finding the best Country to advertise on, based on new coders density.
We will start by removing all rows that have not entered any JobRoleInterest
to ensure that we focus on the coders we want to target.
#drop rows with unknow job interest
survey = survey.dropna(subset=["JobRoleInterest"])
Next, we will explore the column CountryLive
, which describes the Country the participants currently living.
country = survey["CountryLive"].value_counts(normalize= True)*100
country
United States of America 45.700497 India 7.721556 United Kingdom 4.606610 Canada 3.802281 Poland 1.915765 Brazil 1.886517 Germany 1.828020 Australia 1.637906 Russia 1.491664 Ukraine 1.301550 Nigeria 1.228429 Spain 1.126060 France 1.096812 Romania 1.038315 Netherlands (Holland, Europe) 0.950570 Italy 0.906698 Serbia 0.760456 Philippines 0.760456 Greece 0.672711 Ireland 0.628839 South Africa 0.570342 Mexico 0.541094 Turkey 0.526470 Hungary 0.497221 Singapore 0.497221 New Zealand 0.482597 Argentina 0.467973 Croatia 0.467973 Pakistan 0.453349 Indonesia 0.453349 ... Rwanda 0.014624 Nicaragua 0.014624 Panama 0.014624 Guadeloupe 0.014624 Cayman Islands 0.014624 Channel Islands 0.014624 Qatar 0.014624 Guatemala 0.014624 Liberia 0.014624 Samoa 0.014624 Nambia 0.014624 Somalia 0.014624 Myanmar 0.014624 Gambia 0.014624 Kyrgyzstan 0.014624 Angola 0.014624 Jordan 0.014624 Anguilla 0.014624 Gibraltar 0.014624 Yemen 0.014624 Mozambique 0.014624 Cameroon 0.014624 Botswana 0.014624 Turkmenistan 0.014624 Trinidad & Tobago 0.014624 Papua New Guinea 0.014624 Cuba 0.014624 Sudan 0.014624 Vanuatu 0.014624 Aruba 0.014624 Name: CountryLive, Length: 137, dtype: float64
The survey has respondents from all over the world. However, 72% of the respondent are within the top 10 countries.
Let's visualize the Top 10 countries to get a better understanding of the top markets.
country[:10].plot.barh(color='blue')
plt.title("Frequency of respondent by Country")
plt.xlabel("Frequency (%)")
<matplotlib.text.Text at 0x7f08b62cd4a8>
As we can see, most of the respondents are from the United States of America, followed by India, the United Kingdom, and Canada. Any of these countries could be a potential market for our advertisement. However, to decide which market to advertise on, we need to go more in-depth with the analysis.
In the next step, we will explore how much each market is willing to spend on learning.
For the rest of the analysis, we will focus on the top 4 countries, making up 62% of the respondents. Moreover, our courses are in English, and all 4 are English speaking countries.
Let's explore the column MoneyForLearning
. This column describes in USD, the amount of money spent by participants from the moment they started coding.
survey["MoneyForLearning"].describe()
count 6476.00000 mean 885.31223 std 4893.37377 min 0.00000 25% 0.00000 50% 10.00000 75% 200.00000 max 200000.00000 Name: MoneyForLearning, dtype: float64
Average money spends USD 885. However, the data is very spread with a range of USD 200000. If we look at the third Quartile, 75% of the respondent has spent USD 200 or less. Comparing this with the average, we can say that some extreme outliers likely impact the average.
For our analysis, we are interested in monthly spending, as our subscription is priced per month. We will need to calculate this by using the column MonthsProgramming
.
country_list=["United States of America","India","United Kingdom","Canada"]
# filtering the data for the countries we are intersted in
survey = survey[survey["CountryLive"].isin(country_list)].copy()
# replacing monthsProgramming 0 with 1 to avoid dividing by 0
survey["MonthsProgramming"] = survey["MonthsProgramming"].replace(0,1)
# creating spend per month column
survey["Spend_Month"] = survey["MoneyForLearning"]/survey["MonthsProgramming"]
# Removing rows with Null Spend_month
survey.dropna(subset=["Spend_Month"],inplace= True)
In the next step, we will explore the average monthly spend per Country to determine which market could be the most profitable.
survey.pivot_table(values="Spend_Month",index="CountryLive",aggfunc=[np.mean,np.median])
mean | median | |
---|---|---|
Spend_Month | Spend_Month | |
CountryLive | ||
Canada | 113.510961 | 0.000000 |
India | 135.100982 | 0.000000 |
United Kingdom | 45.534443 | 0.000000 |
United States of America | 227.997996 | 3.333333 |
We can say that Coders in the USA average monthly spending is the highest at USD 228 per month from the table above. United Kingdom has the lowest spend at USD 45 per month. There seems to be a big difference between the average spend for some countries. Therefore we are going to explore the spread of the monthly spend for each Country.
On the other hand, looking at the median, we can see that all countries except the USA have a Median spend of USD 0 per month, suggesting that the data is right-skewed and that 50% of the respondent lear using free learnings.
In the next step, we will explore the outliers to get a better understanding of the average spend per month
sns.boxplot(y ="Spend_Month",x="CountryLive", data = survey)
plt.title("Comparison of monthly spend by Country")
plt.xticks(rotation =45)
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning: remove_na is deprecated and is a private function. Do not use.
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
The Box plots above are not very clear; however, we can see that the USA has some very extreme values impacting the average spend. The same can be said about India and Canada.
To continue with our analysis, we will exclude rows with Average spend per month higher than 10000.
survey = survey[survey["Spend_Month"]< 10000].copy()
sns.boxplot(y ="Spend_Month",x="CountryLive", data = survey)
plt.title("Comparison of monthly spend by Country")
plt.xticks(rotation =45)
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning: remove_na is deprecated and is a private function. Do not use.
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
We can see that there are still quite a lot of outliers.
Let's explore some of the outliers respondents to see if there is a parameter driving this high spend per month.
survey[survey["Spend_Month"]> 2000].head()
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | Spend_Month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
415 | 26.0 | 1.0 | 1.0 | 1.0 | The Iron Yard | 1.0 | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2333.333333 |
441 | 30.0 | 1.0 | 0.0 | 0.0 | Rutgers Coding Bootcamp | 1.0 | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | 1.0 | NaN | NaN | 1.0 | NaN | The Net Ninja | NaN | NaN | 2250.000000 |
484 | 25.0 | 1.0 | 1.0 | 0.0 | Codesmith | 1.0 | NaN | more than 1 million | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2857.142857 |
718 | 26.0 | 1.0 | 0.0 | 0.0 | The Coding Boot Camp at UCLA Extension | 1.0 | NaN | more than 1 million | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8000.000000 |
723 | 22.0 | 1.0 | 1.0 | 0.0 | Dev Bootcamp | 0.0 | NaN | more than 1 million | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4000.000000 |
5 rows × 137 columns
From the data extract above, we can see that respondents with spending higher than USD 2000 seem to have done a Bootcamp.
In the next step, we will split the monthly spend into ten bins to explore how many people per bin attended a Bootcamp.
# Use pd.cut to sort data into Bins and Unstack to create a pivot table
(survey.groupby(["AttendedBootcamp",pd.cut(survey["Spend_Month"],10)]).size().unstack())
Spend_Month | (-9.0, 900.0] | (900.0, 1800.0] | (1800.0, 2700.0] | (2700.0, 3600.0] | (3600.0, 4500.0] | (4500.0, 5400.0] | (5400.0, 6300.0] | (6300.0, 7200.0] | (7200.0, 8100.0] | (8100.0, 9000.0] |
---|---|---|---|---|---|---|---|---|---|---|
AttendedBootcamp | ||||||||||
0.0 | 3557.0 | 44.0 | 12.0 | 5.0 | 5.0 | 6.0 | NaN | NaN | NaN | NaN |
1.0 | 171.0 | 42.0 | 23.0 | 10.0 | 2.0 | 6.0 | 1.0 | 2.0 | 2.0 | 1.0 |
The table above shows the number of people who did or did not attend a Bootcamp ( show by value 1.0 or 0.0) by Monthly spending.
We can see that most people who did not attend a booth camp spend less than USD 900 per month. The bins with USD 900 and more have more people attending Boothcamps than not attending.
Therefore, we are going to exclude rows with spending more than USD 1000 and re-evaluate. We will also exclude all rows with 0 Spend, as we are interested in understanding how much each market is willing to pay for learning.
no_outlier = survey[(survey["Spend_Month"]< 1000) & (survey["Spend_Month"]>1)]
(no_outlier.groupby(["AttendedBootcamp",pd.cut(no_outlier["Spend_Month"],10)]).size().unstack())
Spend_Month | (0.0705, 98.16] | (98.16, 195.278] | (195.278, 292.396] | (292.396, 389.514] | (389.514, 486.632] | (486.632, 583.75] | (583.75, 680.868] | (680.868, 777.986] | (777.986, 875.104] | (875.104, 972.222] |
---|---|---|---|---|---|---|---|---|---|---|
AttendedBootcamp | ||||||||||
0.0 | 1291 | 190 | 81 | 40 | 23 | 26 | 20 | 3 | 5 | 4 |
1.0 | 51 | 17 | 8 | 15 | 13 | 11 | 11 | 7 | 15 | 7 |
As we can see, most people who did not attend a Bootcamp spend less than USD 98 per month on learning. However,there are some significant numbers of people without a Bootcamp on some of the higher spend bins. We will use this data without outliers to re-evaluate the average spend per Country.
sns.boxplot(y ="Spend_Month",x="CountryLive", data = no_outlier)
plt.title("Comparison of monthly spend by Country")
plt.xticks(rotation =45)
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning: remove_na is deprecated and is a private function. Do not use.
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
We can see that the USA has the highest median and the highest 1st Quartile in the box plots above. This suggests that the USA could be the most profitable market to advertise.
Let's calculate the average spend per Country to confirm that the USA is the most profitable market and understand which other markets could be profitable.
no_outlier[["CountryLive","Spend_Month"]].groupby(["CountryLive"]).mean()
Spend_Month | |
---|---|
CountryLive | |
Canada | 104.687955 |
India | 78.786918 |
United Kingdom | 62.797366 |
United States of America | 107.408242 |
USA and Canada have the highest average spend per month and seem to be the most profitable market to advertise.
However, before we conclude, let's review the market size after the removal of the outliers.
no_outlier["CountryLive"].value_counts(normalize=True).plot.barh()
plt.title("Frequency of respondent by Country")
plt.xlabel("Frequency (%)")
<matplotlib.text.Text at 0x7f08b5e83630>
The United States remains the most significant market and has the highest average spend per month; therefore, we should advertise in this market.
Canada is the second market we should focus on, as it has the second-highest spend per month.
India is also a potential market to advertise on; despite a lower average spend of USD 78, it is the second-highest market. The average spend is higher than the price of our subscription.
It is now time to share the results with the Marketing team, to use their domain knowledge in making the final decision
In this project, we analyzed survey data from new coders to find the best two markets to advertise in, based on the potential customer and monthly spend.
The USA is an excellent market to promote as it is the biggest market and has the highest spending per month on learning.
India and Canada are good candidates too. However, we will need input from the Marketing team to make the best decision.