In the present project, we'll assume that we're working for an an e-learning company that offers courses on programming, covering topics such as: web and mobile development, data science, game development, etc. We want to promote our product and we'd like to invest some money in advertisement. Our goal in this project is to find out the two best markets to advertise our product in.
We would like to explore surveys that have been already conducted, instead of organizing one ourselves (which would increase our expenses), and a good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in this GitHub repository.
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the data
raw_data = pd.read_csv("2017-fCC-New-Coders-Survey-Data.csv")
# Setting pandas so we can inspect every column in the output
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# Displaying info of interest
display("Shape of the dataframe:", raw_data.shape)
display(raw_data.head())
#17 and 62
# display(raw_data.iloc[:,17].value_counts())
C:\Users\Alvaro\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3444: DtypeWarning: Columns (17,62) have mixed types.Specify dtype option on import or set low_memory=False. exec(code_obj, self.user_global_ns, self.user_ns)
'Shape of the dataframe:'
(18175, 136)
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | CodeEventFCC | CodeEventGameJam | CodeEventGirlDev | CodeEventHackathons | CodeEventMeetup | CodeEventNodeSchool | CodeEventNone | CodeEventOther | CodeEventRailsBridge | CodeEventRailsGirls | CodeEventStartUpWknd | CodeEventWkdBootcamps | CodeEventWomenCode | CodeEventWorkshops | CommuteTime | CountryCitizen | CountryLive | EmploymentField | EmploymentFieldOther | EmploymentStatus | EmploymentStatusOther | ExpectedEarning | FinanciallySupporting | FirstDevJob | Gender | GenderOther | HasChildren | HasDebt | HasFinancialDependents | HasHighSpdInternet | HasHomeMortgage | HasServedInMilitary | HasStudentDebt | HomeMortgageOwe | HoursLearning | ID.x | ID.y | Income | IsEthnicMinority | IsReceiveDisabilitiesBenefits | IsSoftwareDev | IsUnderEmployed | JobApplyWhen | JobInterestBackEnd | JobInterestDataEngr | JobInterestDataSci | JobInterestDevOps | JobInterestFrontEnd | JobInterestFullStack | JobInterestGameDev | JobInterestInfoSec | JobInterestMobile | JobInterestOther | JobInterestProjMngr | JobInterestQAEngr | JobInterestUX | JobPref | JobRelocateYesNo | JobRoleInterest | JobWherePref | LanguageAtHome | MaritalStatus | MoneyForLearning | MonthsProgramming | NetworkID | Part1EndTime | Part1StartTime | Part2EndTime | Part2StartTime | PodcastChangeLog | PodcastCodeNewbie | PodcastCodePen | PodcastDevTea | PodcastDotNET | PodcastGiantRobots | PodcastJSAir | PodcastJSJabber | PodcastNone | PodcastOther | PodcastProgThrowdown | PodcastRubyRogues | PodcastSEDaily | PodcastSERadio | PodcastShopTalk | PodcastTalkPython | PodcastTheWebAhead | ResourceCodecademy | ResourceCodeWars | ResourceCoursera | ResourceCSS | ResourceEdX | ResourceEgghead | ResourceFCC | ResourceHackerRank | ResourceKA | ResourceLynda | ResourceMDN | ResourceOdinProj | ResourceOther | ResourcePluralSight | ResourceSkillcrush | ResourceSO | ResourceTreehouse | ResourceUdacity | ResourceUdemy | ResourceW3S | SchoolDegree | SchoolMajor | StudentDebtOwe | YouTubeCodeCourse | YouTubeCodingTrain | YouTubeCodingTut360 | YouTubeComputerphile | YouTubeDerekBanas | YouTubeDevTips | YouTubeEngineeredTruth | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 to 29 minutes | Canada | Canada | software development and IT | NaN | Employed for wages | NaN | NaN | NaN | NaN | female | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | 15.0 | 02d9465b21e8bd09374b0066fb2d5614 | eb78c1c3ac6cd9052aec557065070fbf | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | start your own business | NaN | NaN | NaN | English | married or domestic partnership | 150.0 | 6.0 | 6f1fbc6b2b | 2017-03-09 00:36:22 | 2017-03-09 00:32:59 | 2017-03-09 00:59:46 | 2017-03-09 00:36:26 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | United States of America | United States of America | NaN | NaN | Not working but looking for work | NaN | 35000.0 | NaN | NaN | male | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 10.0 | 5bfef9ecb211ec4f518cfc1d2a6f3e0c | 21db37adb60cdcafadfa7dca1b13b6b1 | NaN | 0.0 | 0.0 | 0.0 | NaN | Within 7 to 12 months | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | work for a nonprofit | 1.0 | Full-Stack Web Developer | in an office with other developers | English | single, never married | 80.0 | 6.0 | f8f8be6910 | 2017-03-09 00:37:07 | 2017-03-09 00:33:26 | 2017-03-09 00:38:59 | 2017-03-09 00:37:10 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | 1.0 | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 to 29 minutes | United States of America | United States of America | software development and IT | NaN | Employed for wages | NaN | 70000.0 | NaN | NaN | male | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN | 25.0 | 14f1863afa9c7de488050b82eb3edd96 | 21ba173828fbe9e27ccebaf4d5166a55 | 13000.0 | 1.0 | 0.0 | 0.0 | 0.0 | Within 7 to 12 months | 1.0 | NaN | NaN | 1.0 | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | work for a medium-sized company | 1.0 | Front-End Web Developer, Back-End Web Develo... | no preference | Spanish | single, never married | 1000.0 | 5.0 | 2ed189768e | 2017-03-09 00:37:58 | 2017-03-09 00:33:53 | 2017-03-09 00:40:14 | 2017-03-09 00:38:02 | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | Codenewbie | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | high school diploma or equivalent (GED) | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | I work from home | Brazil | Brazil | software development and IT | NaN | Employed for wages | NaN | 40000.0 | 0.0 | NaN | male | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 40000.0 | 14.0 | 91756eb4dc280062a541c25a3d44cfb0 | 3be37b558f02daae93a6da10f83f0c77 | 24000.0 | 0.0 | 0.0 | 0.0 | 1.0 | Within the next 6 months | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | work for a medium-sized company | NaN | Front-End Web Developer, Full-Stack Web Deve... | from home | Portuguese | married or domestic partnership | 0.0 | 5.0 | dbdc0664d1 | 2017-03-09 00:40:13 | 2017-03-09 00:37:45 | 2017-03-09 00:42:26 | 2017-03-09 00:40:18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Portugal | Portugal | NaN | NaN | Not working but looking for work | NaN | 140000.0 | NaN | NaN | female | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN | 10.0 | aa3f061a1949a90b27bef7411ecd193f | d7c56bbf2c7b62096be9db010e86d96d | NaN | 0.0 | 0.0 | 0.0 | NaN | Within 7 to 12 months | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | work for a multinational corporation | 1.0 | Full-Stack Web Developer, Information Security... | in an office with other developers | Portuguese | single, never married | 0.0 | 24.0 | 11b0f2d8a9 | 2017-03-09 00:42:45 | 2017-03-09 00:39:44 | 2017-03-09 00:45:42 | 2017-03-09 00:42:50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | bachelor's degree | Information Technology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
After a first inspection of the data, we will examine whether this data is actually useful to us, taking into account the population of our interest. The "JobRoleInterest" column describes for every participant the role(s) they'd be interested in working, so we'll filter the data to keep only rows where this question was answered, and we'll see its unique values next:
# Keeping only rows with jobs role of interest
filtered_data = raw_data.dropna(subset=["JobRoleInterest"]).copy()
display("New dataframe's shape:", filtered_data.shape)
# Plotting the job roles that are more interesting to respondants
filtered_data["JobRoleInterest"].value_counts(normalize=True, dropna=False)[:10].plot.barh()
plt.xticks(ticks=[0, .025, .05, .075, .1, .125], labels=[0, 2.5, 5, 7.5, 10, 12.5])
plt.gca().invert_yaxis()
plt.title("% of choices for each job role of interest")
plt.show()
"New dataframe's shape:"
(6992, 136)
Looking at the information above, we can conclude that this data clearly shows professional profiles of our interest.
People can be interested in more than one profile at the same time, so we chose the 10 most popular choices.
Also, we have kept a small piece of the original data, because we're only interested in people that we know they can be interested in jobs that our company offers trainings about.
Two columns are of special interest to us to proceed with our analysis:
We'll begin by analyzing what countries have more people interested in the topics we offer:
# Generating absolute and relative value counts for countries
freq_table_abs = pd.DataFrame.from_dict(filtered_data["CountryLive"].value_counts().reset_index())
freq_table_rel = pd.DataFrame.from_dict(filtered_data["CountryLive"].value_counts(normalize=True).reset_index())
# Merging both types of value counts
freq_table = freq_table_abs.merge(freq_table_rel, on="index")
# Renaming columns and properly calculating percentages
freq_table.columns = ["Country", "Number", "Percentage"]
freq_table["Percentage"] = round(freq_table["Percentage"] * 100, 1)
display(freq_table[:10])
Country | Number | Percentage | |
---|---|---|---|
0 | United States of America | 3125 | 45.7 |
1 | India | 528 | 7.7 |
2 | United Kingdom | 315 | 4.6 |
3 | Canada | 260 | 3.8 |
4 | Poland | 131 | 1.9 |
5 | Brazil | 129 | 1.9 |
6 | Germany | 125 | 1.8 |
7 | Australia | 112 | 1.6 |
8 | Russia | 102 | 1.5 |
9 | Ukraine | 89 | 1.3 |
The table above shows that the USA is, by far, the country that could interest us more - almost *46%* of the respondants live there.
We'll now proceed with the money spent on learning.
Our company sells subscriptions at a price of $59 per month, and for this reason we're interested in finding out how much money each student spends per month. Therefore, we'll need to divide the "MoneyForLearning" column to the "MonthsProgramming" column.
# Replacing "0 months programming" for 1 month learning, as these are people that just started - to avoid dividing by 0
filtered_data["MonthsProgramming"] = filtered_data["MonthsProgramming"].replace(to_replace=0, value=1)
# Creating a "MoneyPerMonth" column
filtered_data["MoneyPerMonth"] = filtered_data["MoneyForLearning"] / filtered_data["MonthsProgramming"]
# Dropping rows with null values in our new column "MoneyPerMonth" or "CountryLive", as these are fundamental for us
filtered_data.dropna(subset=["MoneyPerMonth", "CountryLive"], inplace=True)
# Generating grouped mean and median for money spent per month, per country
money_spent = filtered_data.groupby("CountryLive")["MoneyPerMonth"].agg(["mean", "median"])
# Displaing the results for the countries that are more interesting to us
display(money_spent.loc[["United States of America", "India" ,"United Kingdom", "Canada"]])
mean | median | |
---|---|---|
CountryLive | ||
United States of America | 227.997996 | 3.333333 |
India | 135.100982 | 0.000000 |
United Kingdom | 45.534443 | 0.000000 |
Canada | 113.510961 | 0.000000 |
The table above shows some odd results:
We'll search for outliers and distributions to understand this discrepancies:
# Keeping the countries we're interested in
graph_data = filtered_data[filtered_data["CountryLive"].isin(["United States of America", "India" ,"United Kingdom", "Canada"])]
# Plotting box plots of money spent per month for these countries
sns.boxplot(data=graph_data, x="CountryLive", y="MoneyPerMonth")
<AxesSubplot:xlabel='CountryLive', ylabel='MoneyPerMonth'>
The plot above shows that the data has several outliers, therefore the distributions are completely odd. We'll clean this data to see the actual values and distributions. The value we'll use to filter the data will be 300$ per month, which is actualy a very high price, and see what happens:
# Filtering the data
graph_data = graph_data[graph_data["MoneyPerMonth"] <= 300]
# Displaying grouped statistics
display(graph_data.groupby("CountryLive")["MoneyPerMonth"].agg(["mean", "median"]))
# Plotting "MoneyPerMonth" distributions
sns.boxplot(data=graph_data, x="CountryLive", y="MoneyPerMonth")
mean | median | |
---|---|---|
CountryLive | ||
Canada | 19.655420 | 0.000000 |
India | 17.955836 | 0.000000 |
United Kingdom | 21.842034 | 0.000000 |
United States of America | 25.828829 | 0.416667 |
<AxesSubplot:xlabel='CountryLive', ylabel='MoneyPerMonth'>
Even if we just cleaned the data, we still see several outliers in each distribution. This means that the values are centered in the very low part of the scale.
Still, we can clearly see that *USA* and *United Kingdom* are the countries with higher money per month values.
The analysis we just conducted shows some interesting results:
Therefore, our recommendation for advertisment would be to clearly invest in *USA, and probably India* - even if the average money spent per month is 4 and 2 USD lower than UK and Canada, respectively, there are nearly twice more people compared with those countries.