We're working for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. We want to promote our product and we'd like to invest some money in advertising. Our goal in this project is to find out the two best markets to advertise our product in.
We will use a survey from freeCodeCamp (https://github.com/freeCodeCamp/2017-new-coder-survey) directed at new coders (people who started coding less than 5 years ago). freeCodeCamp runs a popular publication (over 400,000 followers), so the survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
#importing and understanding the data
import pandas as pd
pd.options.display.max_columns = 150
data=pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')
print(data.head())
print(len(data))
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName \ 0 27.0 0.0 NaN NaN NaN 1 34.0 0.0 NaN NaN NaN 2 21.0 0.0 NaN NaN NaN 3 26.0 0.0 NaN NaN NaN 4 20.0 0.0 NaN NaN NaN BootcampRecommend ChildrenNumber CityPopulation \ 0 NaN NaN more than 1 million 1 NaN NaN less than 100,000 2 NaN NaN more than 1 million 3 NaN NaN between 100,000 and 1 million 4 NaN NaN between 100,000 and 1 million CodeEventConferences CodeEventDjangoGirls CodeEventFCC CodeEventGameJam \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN CodeEventGirlDev CodeEventHackathons CodeEventMeetup \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN 1.0 NaN 3 NaN NaN NaN 4 NaN NaN NaN CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 1.0 NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN CodeEventWomenCode CodeEventWorkshops CommuteTime \ 0 NaN NaN 15 to 29 minutes 1 NaN NaN NaN 2 NaN NaN 15 to 29 minutes 3 NaN NaN I work from home 4 NaN NaN NaN CountryCitizen CountryLive \ 0 Canada Canada 1 United States of America United States of America 2 United States of America United States of America 3 Brazil Brazil 4 Portugal Portugal EmploymentField EmploymentFieldOther \ 0 software development and IT NaN 1 NaN NaN 2 software development and IT NaN 3 software development and IT NaN 4 NaN NaN EmploymentStatus EmploymentStatusOther ExpectedEarning \ 0 Employed for wages NaN NaN 1 Not working but looking for work NaN 35000.0 2 Employed for wages NaN 70000.0 3 Employed for wages NaN 40000.0 4 Not working but looking for work NaN 140000.0 FinanciallySupporting FirstDevJob Gender GenderOther HasChildren \ 0 NaN NaN female NaN NaN 1 NaN NaN male NaN NaN 2 NaN NaN male NaN NaN 3 0.0 NaN male NaN 0.0 4 NaN NaN female NaN NaN HasDebt HasFinancialDependents HasHighSpdInternet HasHomeMortgage \ 0 
1.0 0.0 1.0 0.0 1 1.0 0.0 1.0 0.0 2 0.0 0.0 1.0 NaN 3 1.0 1.0 1.0 1.0 4 0.0 0.0 1.0 NaN HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning \ 0 0.0 0.0 NaN 15.0 1 0.0 1.0 NaN 10.0 2 0.0 NaN NaN 25.0 3 0.0 0.0 40000.0 14.0 4 0.0 NaN NaN 10.0 ID.x ID.y \ 0 02d9465b21e8bd09374b0066fb2d5614 eb78c1c3ac6cd9052aec557065070fbf 1 5bfef9ecb211ec4f518cfc1d2a6f3e0c 21db37adb60cdcafadfa7dca1b13b6b1 2 14f1863afa9c7de488050b82eb3edd96 21ba173828fbe9e27ccebaf4d5166a55 3 91756eb4dc280062a541c25a3d44cfb0 3be37b558f02daae93a6da10f83f0c77 4 aa3f061a1949a90b27bef7411ecd193f d7c56bbf2c7b62096be9db010e86d96d Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev \ 0 NaN NaN 0.0 0.0 1 NaN 0.0 0.0 0.0 2 13000.0 1.0 0.0 0.0 3 24000.0 0.0 0.0 0.0 4 NaN 0.0 0.0 0.0 IsUnderEmployed JobApplyWhen JobInterestBackEnd \ 0 0.0 NaN NaN 1 NaN Within 7 to 12 months NaN 2 0.0 Within 7 to 12 months 1.0 3 1.0 Within the next 6 months 1.0 4 NaN Within 7 to 12 months 1.0 JobInterestDataEngr JobInterestDataSci JobInterestDevOps \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN 1.0 3 NaN NaN NaN 4 NaN NaN NaN JobInterestFrontEnd JobInterestFullStack JobInterestGameDev \ 0 NaN NaN NaN 1 NaN 1.0 NaN 2 1.0 1.0 NaN 3 1.0 1.0 NaN 4 1.0 1.0 NaN JobInterestInfoSec JobInterestMobile JobInterestOther \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN 1.0 NaN 3 NaN NaN NaN 4 1.0 1.0 NaN JobInterestProjMngr JobInterestQAEngr JobInterestUX \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN JobPref JobRelocateYesNo \ 0 start your own business NaN 1 work for a nonprofit 1.0 2 work for a medium-sized company 1.0 3 work for a medium-sized company NaN 4 work for a multinational corporation 1.0 JobRoleInterest \ 0 NaN 1 Full-Stack Web Developer 2 Front-End Web Developer, Back-End Web Develo... 3 Front-End Web Developer, Full-Stack Web Deve... 4 Full-Stack Web Developer, Information Security... 
JobWherePref LanguageAtHome \ 0 NaN English 1 in an office with other developers English 2 no preference Spanish 3 from home Portuguese 4 in an office with other developers Portuguese MaritalStatus MoneyForLearning MonthsProgramming \ 0 married or domestic partnership 150.0 6.0 1 single, never married 80.0 6.0 2 single, never married 1000.0 5.0 3 married or domestic partnership 0.0 5.0 4 single, never married 0.0 24.0 NetworkID Part1EndTime Part1StartTime Part2EndTime \ 0 6f1fbc6b2b 2017-03-09 00:36:22 2017-03-09 00:32:59 2017-03-09 00:59:46 1 f8f8be6910 2017-03-09 00:37:07 2017-03-09 00:33:26 2017-03-09 00:38:59 2 2ed189768e 2017-03-09 00:37:58 2017-03-09 00:33:53 2017-03-09 00:40:14 3 dbdc0664d1 2017-03-09 00:40:13 2017-03-09 00:37:45 2017-03-09 00:42:26 4 11b0f2d8a9 2017-03-09 00:42:45 2017-03-09 00:39:44 2017-03-09 00:45:42 Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen \ 0 2017-03-09 00:36:26 NaN NaN NaN 1 2017-03-09 00:37:10 NaN 1.0 NaN 2 2017-03-09 00:38:02 1.0 NaN 1.0 3 2017-03-09 00:40:18 NaN NaN NaN 4 2017-03-09 00:42:50 NaN NaN NaN PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir \ 0 1.0 NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN Codenewbie NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN 1.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN PodcastTalkPython PodcastTheWebAhead ResourceCodecademy \ 0 NaN NaN 1.0 1 NaN NaN 1.0 2 NaN NaN 1.0 3 NaN NaN NaN 4 NaN NaN NaN ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX \ 0 NaN NaN NaN NaN 1 NaN NaN 1.0 NaN 2 NaN NaN 1.0 NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN ResourceEgghead ResourceFCC ResourceHackerRank ResourceKA \ 0 NaN 1.0 NaN NaN 1 NaN 1.0 NaN NaN 2 NaN 1.0 NaN NaN 3 1.0 1.0 NaN NaN 4 NaN NaN NaN NaN ResourceLynda ResourceMDN 
ResourceOdinProj ResourceOther \ 0 NaN 1.0 NaN NaN 1 NaN NaN NaN NaN 2 NaN 1.0 NaN NaN 3 NaN 1.0 NaN NaN 4 NaN NaN NaN NaN ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse \ 0 NaN NaN NaN NaN 1 NaN NaN 1.0 NaN 2 NaN NaN NaN NaN 3 NaN NaN 1.0 NaN 4 NaN NaN 1.0 NaN ResourceUdacity ResourceUdemy ResourceW3S \ 0 NaN 1.0 1.0 1 NaN 1.0 1.0 2 1.0 1.0 NaN 3 NaN NaN NaN 4 NaN NaN NaN SchoolDegree SchoolMajor \ 0 some college credit, no degree NaN 1 some college credit, no degree NaN 2 high school diploma or equivalent (GED) NaN 3 some college credit, no degree NaN 4 bachelor's degree Information Technology StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain YouTubeCodingTut360 \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN 1.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN YouTubeComputerphile YouTubeDerekBanas YouTubeDevTips \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN 1.0 1.0 3 NaN NaN 1.0 4 NaN NaN NaN YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction \ 0 NaN NaN NaN 1 NaN 1.0 NaN 2 NaN NaN NaN 3 NaN 1.0 1.0 4 NaN NaN NaN YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN 1.0 1.0 NaN 3 NaN NaN 1.0 NaN 4 NaN NaN NaN NaN YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN 18175
/dataquest/system/env/python3/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (17,62) have mixed types. Specify dtype option on import or set low_memory=False.
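The DtypeWarning above means that pandas found mixed types in two columns while reading the file in chunks. A minimal sketch of two common ways to avoid it, shown on a made-up in-memory CSV (the sample rows are hypothetical; `CodeEventOther` appears to be the column at position 17 in the output above, but treat that as an assumption):

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the survey file. The second
# column mixes numbers and free text, the situation that triggers a
# DtypeWarning when a large file is parsed chunk by chunk.
csv_text = "Age,CodeEventOther\n27,\n34,meetup\n21,3\n"

# Option 1: low_memory=False lets pandas infer each column's type from
# the whole file at once instead of per chunk.
sample = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the type of the affected column explicitly.
sample_typed = pd.read_csv(io.StringIO(csv_text), dtype={"CodeEventOther": str})

print(sample.dtypes)
print(sample_typed.dtypes)
```

Either option could be passed to the `pd.read_csv` call used for the real file; which one is preferable depends on how the two columns are used later.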
There is no documentation on the different columns of the dataset, but they are pretty self-explanatory.
We want to know:
- Where these new coders are located.
- Which locations have the greatest number of new coders.
- How much money new coders are willing to spend on learning.
#We will start by assessing if the sample is representative of our population of interest (i.e. whether the respondents are interested in jobs in our areas)
print(data['JobRoleInterest'].value_counts(normalize=True)*100)
Full-Stack Web Developer 11.770595 Front-End Web Developer 6.435927 Data Scientist 2.173913 Back-End Web Developer 2.030892 Mobile Developer 1.673341 Game Developer 1.630435 Information Security 1.315789 Full-Stack Web Developer, Front-End Web Developer 0.915332 Front-End Web Developer, Full-Stack Web Developer 0.800915 Product Manager 0.786613 Data Engineer 0.758009 User Experience Designer 0.743707 User Experience Designer, Front-End Web Developer 0.614989 Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.557780 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.514874 DevOps / SysAdmin 0.514874 Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer 0.514874 Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.443364 Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer 0.429062 Full-Stack Web Developer, Mobile Developer 0.414760 Front-End Web Developer, User Experience Designer 0.414760 Back-End Web Developer, Full-Stack Web Developer 0.386156 Full-Stack Web Developer, Back-End Web Developer 0.371854 Back-End Web Developer, Front-End Web Developer 0.286041 Data Engineer, Data Scientist 0.271739 Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer 0.271739 Front-End Web Developer, Mobile Developer 0.257437 Full-Stack Web Developer, Data Scientist 0.243135 Mobile Developer, Game Developer 0.228833 Data Scientist, Data Engineer 0.228833 ... 
Back-End Web Developer, Full-Stack Web Developer, Mobile Developer, Product Manager, Front-End Web Developer 0.014302 Mobile Developer, Back-End Web Developer, Front-End Web Developer, Information Security 0.014302 Game Developer, Data Scientist, Data Engineer 0.014302 Front-End Web Developer, Data Scientist, Full-Stack Web Developer, Quality Assurance Engineer, Mobile Developer, User Experience Designer 0.014302 Quality Assurance Engineer, Front-End Web Developer, Back-End Web Developer 0.014302 Data Engineer, Front-End Web Developer, Mobile Developer, Full-Stack Web Developer, Product Manager, Information Security, Data Scientist, Back-End Web Developer 0.014302 DevOps / SysAdmin, Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.014302 Game Developer, User Experience Designer, Full-Stack Web Developer, Front-End Web Developer, Information Security 0.014302 Data Scientist, Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer, Mobile Developer 0.014302 Game Developer, Front-End Web Developer, Full-Stack Web Developer, Mobile Developer 0.014302 User Experience Designer, Product Manager, Full-Stack Web Developer, Front-End Web Developer 0.014302 Front-End Web Developer, Mobile Developer, Back-End Web Developer, Full-Stack Web Developer, Game Developer 0.014302 Front-End Web Developer, Full-Stack Web Developer, Mobile Developer, Information Security, Back-End Web Developer, DevOps / SysAdmin 0.014302 Back-End Web Developer, Data Scientist, DevOps / SysAdmin, Full-Stack Web Developer, Front-End Web Developer, Mobile Developer, Information Security, Game Developer 0.014302 Product Manager, Full-Stack Web Developer, Game Developer, Front-End Web Developer 0.014302 Information Security, Full-Stack Web Developer, Game Developer, Back-End Web Developer, Data Scientist, Mobile Developer, Front-End Web Developer, Data Engineer 0.014302 Systems Programming 0.014302 User Experience Designer, Front-End Web Developer, Full-Stack Web 
Developer, Mobile Developer, Data Engineer, Product Manager, Game Developer, Back-End Web Developer, Data Scientist, Quality Assurance Engineer, DevOps / SysAdmin 0.014302 Mobile Developer, Product Manager, Information Security, User Experience Designer 0.014302 Information Security, Back-End Web Developer, Front-End Web Developer, DevOps / SysAdmin, Data Engineer 0.014302 Back-End Web Developer, Game Developer, DevOps / SysAdmin, Mobile Developer, Front-End Web Developer 0.014302 Game Developer, Data Scientist, Information Security, Full-Stack Web Developer 0.014302 Technology Management 0.014302 Full-Stack Web Developer, Product Manager, User Experience Designer 0.014302 DevOps / SysAdmin, Full-Stack Web Developer, Information Security, Front-End Web Developer 0.014302 Back-End Web Developer, Quality Assurance Engineer 0.014302 DevOps / SysAdmin, Back-End Web Developer, Full-Stack Web Developer, Data Scientist, Front-End Web Developer 0.014302 Information Security, Front-End Web Developer, DevOps / SysAdmin, Back-End Web Developer, User Experience Designer, Mobile Developer, Full-Stack Web Developer 0.014302 Machine learning engineer 0.014302 Game Developer, DevOps / SysAdmin 0.014302 Name: JobRoleInterest, Length: 3213, dtype: float64
The table is hard to read because several participants are interested in more than one field. Let's simplify the question and check how many people are interested in at least one of our main subjects: web and mobile development.
#remove subjects who did not answer this question
job_int=data['JobRoleInterest'].dropna()
#select subjects interested in web and mobile development
web_mob_mask=job_int.str.contains('Web Developer|Mobile Developer')
web_mob=job_int[web_mob_mask]
#calculate the percentage
percent_web_mob=len(web_mob)/len(job_int)*100
print(percent_web_mob)
86.24141876430205
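Another way to make the multi-answer column readable is to split each answer into its individual roles before counting. The following is a sketch under assumptions, not part of the original analysis: the sample answers are hypothetical and only mimic the comma-separated format of `JobRoleInterest`.

```python
import pandas as pd

# Hypothetical answers in the same comma-separated format as JobRoleInterest.
roles = pd.Series([
    "Full-Stack Web Developer",
    "Front-End Web Developer, Back-End Web Developer",
    "Data Scientist,   Mobile Developer",
    None,
])

# Split every answer into single roles (one row per role) and trim spaces.
single_roles = roles.dropna().str.split(",").explode().str.strip()

# Percentage of respondents who answered the question and mention each role.
# The percentages can add up to more than 100 because one respondent may
# list several roles.
role_share = single_roles.value_counts() / roles.notna().sum() * 100
print(role_share)
```

Applied to `data['JobRoleInterest']`, this would give one row per role instead of one row per combination of roles.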
#plot the percentage of subjects interested in our main areas
import matplotlib.pyplot as plt
%matplotlib inline
#create meaningful descriptions in the mask
web_mob_mask=web_mob_mask.replace(True,'Web and Mobile Development')
web_mob_mask=web_mob_mask.replace(False,'Other Job Interests')
#create pie plot with title and percentages in a single call
web_mob_mask.value_counts().plot.pie(figsize = (6,6), autopct = '%.1f%%', title='Percentage of new developers with potential interest in our main products')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4512927748>
About 86% of the respondents who answered this question are interested in jobs in at least one of our two main areas (web or mobile development), which means that this dataset is useful for us.
We now want to find out which countries are best to advertise in. As the vast majority of the survey population is relevant for us, we will assess the top 4 countries these subjects live in.
#To make sure we are working with a representative sample, drop all the rows where participants didn't answer what role they are interested in. Where a participant didn't respond, we can't know for sure what their interests are, so it's better if we leave out this category of participants
data_with_role=data[data['JobRoleInterest'].notnull()]
#confirming that the lines were dropped
print(data.shape)
print(data_with_role.shape)
(18175, 136) (6992, 136)
#check absolute and relative frequencies of where the participants live
print(data_with_role['CountryLive'].value_counts().head(4))
print((data_with_role['CountryLive'].value_counts(normalize=True)*100).head(4))
United States of America 3125 India 528 United Kingdom 315 Canada 260 Name: CountryLive, dtype: int64 United States of America 45.700497 India 7.721556 United Kingdom 4.606610 Canada 3.802281 Name: CountryLive, dtype: float64
About 46% of the respondents live in the USA. The next most frequent countries are India (8%), the UK (5%), and Canada (4%).
It is a bit worrying that we are only analyzing a little over one third of all respondents (6,992 of 18,175, the ones who stated the type of job they are interested in). I will check whether we find the same main countries when looking at all the respondents.
print(data['CountryLive'].value_counts().head(4))
print((data['CountryLive'].value_counts(normalize=True)*100).head(4))
United States of America 5791 India 1400 United Kingdom 757 Canada 616 Name: CountryLive, dtype: int64 United States of America 37.760824 India 9.128847 United Kingdom 4.936098 Canada 4.016693 Name: CountryLive, dtype: float64
The main countries remain the same and appear in the same order, so, at least regarding country of residence, the subset we were analyzing seems to be representative.
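The comparison between all respondents and the subset with a stated job interest can also be put side by side in one table. A minimal sketch, assuming a hypothetical helper name `country_shares` and made-up toy rows:

```python
import pandas as pd

def country_shares(full, subset, column="CountryLive", top=4):
    """Percentage of rows per country in both frames, ordered by the subset."""
    full_pct = full[column].value_counts(normalize=True) * 100
    subset_pct = subset[column].value_counts(normalize=True) * 100
    table = pd.DataFrame({"all_respondents": full_pct, "with_role": subset_pct})
    return table.sort_values("with_role", ascending=False).head(top)

# Hypothetical toy data with the survey's column names.
toy = pd.DataFrame({
    "CountryLive": ["USA", "USA", "India", "Canada", "USA", "India"],
    "JobRoleInterest": ["Web", None, "Web", None, "Mobile", None],
})
toy_with_role = toy[toy["JobRoleInterest"].notnull()]

result = country_shares(toy, toy_with_role)
print(result)
```

With the real frames the call would be `country_shares(data, data_with_role)`. Large gaps between the two columns would hint that the subset is not representative for that country.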
Before we start advertising in these countries, we should make sure that the coders who live there are willing to pay the price of our subscription: $59 per month.
We will focus on the top 4 countries because they have the highest frequencies, and because our courses are in English.
#Some students answered that they had been learning to code for 0 months (it might be that they had just started when they completed the survey). To avoid dividing by 0, replace all the values of 0 with 1.
data_with_role['MonthsProgramming']=data_with_role['MonthsProgramming'].replace(0.0,1.0)
#confirming that there are no 0s (uncomment to check)
#print(data_with_role['MonthsProgramming'].value_counts())
#calculating the amount each student pays per month
data_with_role['money_per_month']=data_with_role['MoneyForLearning']/data_with_role['MonthsProgramming']
#print(data_with_role['money_per_month'].value_counts())
#Find out how many null values there are in the new column and keep only the non-nulls
data_with_val=data_with_role[data_with_role['money_per_month'].notnull()]
print(data_with_role.shape)
print(data_with_val.shape)
(6992, 137) (6317, 137)
#remove rows with no country of residence
data_with_country=data_with_val[data_with_val['CountryLive'].notnull()]
print(data_with_val.shape)
print(data_with_country.shape)
(6317, 137) (6212, 137)
We lost 780 subjects (from 6,992 down to 6,212) who did not have the necessary responses to proceed with the analysis.
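The cleaning steps above can be sketched compactly on a small frame. This is an illustration under assumptions: the rows are hypothetical, and working on an explicit `.copy()` is one common way to avoid pandas' SettingWithCopyWarning when adding columns to a filtered frame.

```python
import pandas as pd

# Hypothetical rows using the survey's column names.
toy = pd.DataFrame({
    "MoneyForLearning": [150.0, 0.0, 1000.0, None],
    "MonthsProgramming": [6.0, 0.0, 5.0, 12.0],
    "CountryLive": ["Canada", "India", None, "Canada"],
})

# Work on an independent copy so that column assignments are unambiguous.
clean = toy.copy()

# Treat 0 months as 1 month to avoid dividing by zero.
clean["MonthsProgramming"] = clean["MonthsProgramming"].replace(0.0, 1.0)
clean["money_per_month"] = clean["MoneyForLearning"] / clean["MonthsProgramming"]

# Drop rows missing either the spending figure or the country in one step.
clean = clean.dropna(subset=["money_per_month", "CountryLive"])

print(len(toy) - len(clean), "rows dropped")
```

In the notebook the same idea would amount to creating `data_with_role` with `.copy()` and replacing the two `notnull()` filters with a single `dropna(subset=...)`.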
#check how much, on average, each of our top countries spends per month
val_per_country=data_with_country[['CountryLive','money_per_month']].groupby('CountryLive').mean()
print(val_per_country.loc[['United States of America','India','United Kingdom','Canada']])
money_per_month CountryLive United States of America 227.997996 India 135.100982 United Kingdom 45.534443 Canada 113.510961
#check also the median
val_per_country=data_with_country[['CountryLive','money_per_month']].groupby('CountryLive').median()
print(val_per_country.loc[['United States of America','India','United Kingdom','Canada']])
money_per_month CountryLive United States of America 3.333333 India 0.000000 United Kingdom 0.000000 Canada 0.000000
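The mean and the median can also be computed in a single pass, together with the group size, which makes them easier to compare. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical spending values; one large value per country pulls the mean up.
toy = pd.DataFrame({
    "CountryLive": ["Canada", "Canada", "Canada", "India", "India", "India"],
    "money_per_month": [0.0, 0.0, 300.0, 0.0, 10.0, 2000.0],
})

# One row per country with mean, median and number of respondents.
summary = toy.groupby("CountryLive")["money_per_month"].agg(["mean", "median", "count"])
print(summary)
```

A mean far above the median, as in this toy table, is the typical signature of a few very large values.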
While the means are quite high, the medians are all close to 0. Also, the mean for India is high in comparison with the UK and Canada.
That suggests that there are outliers in the analysis. Let's plot box and whisker graphs to confirm.
import seaborn as sns
#create a new dataset which only contains the top 4 countries (we will use this one from now on)
val_top_4=data_with_country[data_with_country['CountryLive'].str.contains('United States of America|India|United Kingdom|Canada')]
#plot the box and whisker graph
sns.boxplot(x = 'CountryLive', y = 'money_per_month', data = val_top_4)
plt.show()
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning: remove_na is deprecated and is a private function. Do not use.
As expected, most values in all 4 countries are very low, but there are a few outliers. We will remove the extreme outliers (i.e. values more than 3 standard deviations above the mean).
#recalculate mean
mean_top_4=val_top_4[['CountryLive','money_per_month']].groupby('CountryLive').mean()
print(mean_top_4)
money_per_month CountryLive Canada 113.510961 India 135.100982 United Kingdom 45.534443 United States of America 227.997996
#calculate std
std_top_4=val_top_4[['CountryLive','money_per_month']].groupby('CountryLive').std(ddof=1)
print(std_top_4)
money_per_month CountryLive Canada 441.014158 India 692.960378 United Kingdom 162.311836 United States of America 1940.245614
#calculate cutoffs
Canada_cuttoff=mean_top_4.loc['Canada']+(std_top_4.loc['Canada']*3)
India_cuttoff=mean_top_4.loc['India']+(std_top_4.loc['India']*3)
UK_cuttoff=mean_top_4.loc['United Kingdom']+(std_top_4.loc['United Kingdom']*3)
USA_cuttoff=mean_top_4.loc['United States of America']+(std_top_4.loc['United States of America']*3)
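As a worked example of the cutoff rule (mean plus 3 standard deviations), the values can be recomputed by hand from the means and standard deviations printed above. This sketch only copies those printed numbers into plain dictionaries; the variable names are illustrative.

```python
# Means and sample standard deviations copied from the printed output above.
means = {
    "Canada": 113.510961,
    "India": 135.100982,
    "United Kingdom": 45.534443,
    "United States of America": 227.997996,
}
stds = {
    "Canada": 441.014158,
    "India": 692.960378,
    "United Kingdom": 162.311836,
    "United States of America": 1940.245614,
}

# Cutoff = mean + 3 * standard deviation, per country.
cutoffs = {country: means[country] + 3 * stds[country] for country in means}

for country, value in cutoffs.items():
    print(f"{country}: {value:.2f}")
```

The resulting cutoffs are roughly $1,437 (Canada), $2,214 (India), $532 (UK) and $6,049 (USA) per month, so only very extreme values are classified as outliers.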
#mark outliers (in the outliers list: 1 if outlier, 0 if not) and outliers who attended bootcamps (in the outliers_boot list: 0 if not an outlier, 1 if an outlier who did not attend a bootcamp, 2 if an outlier who attended a bootcamp)
def outlier_classification(col_country, col_bootcamp, col_money_per_month):
    outliers=[]
    outliers_boot=[]
    for i in range(len(col_country)):
        if col_country.iloc[i]=='Canada':
            if col_money_per_month.iloc[i]>Canada_cuttoff['money_per_month']:
                outliers.append(1)
                if col_bootcamp.iloc[i]==1.0:
                    outliers_boot.append(2)
                else:
                    outliers_boot.append(1)
            else:
                outliers.append(0)
                outliers_boot.append(0)
        elif col_country.iloc[i]=='India':
            if col_money_per_month.iloc[i]>India_cuttoff['money_per_month']:
                outliers.append(1)
                if col_bootcamp.iloc[i]==1.0:
                    outliers_boot.append(2)
                else:
                    outliers_boot.append(1)
            else:
                outliers.append(0)
                outliers_boot.append(0)
        elif col_country.iloc[i]=='United Kingdom':
            if col_money_per_month.iloc[i]>UK_cuttoff['money_per_month']:
                outliers.append(1)
                if col_bootcamp.iloc[i]==1.0:
                    outliers_boot.append(2)
                else:
                    outliers_boot.append(1)
            else:
                outliers.append(0)
                outliers_boot.append(0)
        elif col_country.iloc[i]=='United States of America':
            if col_money_per_month.iloc[i]>USA_cuttoff['money_per_month']:
                outliers.append(1)
                if col_bootcamp.iloc[i]==1.0:
                    outliers_boot.append(2)
                else:
                    outliers_boot.append(1)
            else:
                outliers.append(0)
                outliers_boot.append(0)
    return outliers, outliers_boot
outliers,outliers_boot=outlier_classification(val_top_4['CountryLive'],val_top_4['AttendedBootcamp'],val_top_4['money_per_month'])
#add to the main dataset
val_top_4['outliers']=outliers
val_top_4['outliers_boot']=outliers_boot
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName \ 1 34.0 0.0 NaN NaN NaN 2 21.0 0.0 NaN NaN NaN 6 29.0 0.0 NaN NaN NaN 15 32.0 0.0 NaN NaN NaN 16 29.0 0.0 NaN NaN NaN BootcampRecommend ChildrenNumber CityPopulation \ 1 NaN NaN less than 100,000 2 NaN NaN more than 1 million 6 NaN NaN between 100,000 and 1 million 15 NaN NaN less than 100,000 16 NaN NaN between 100,000 and 1 million CodeEventConferences CodeEventDjangoGirls CodeEventFCC \ 1 NaN NaN NaN 2 NaN NaN NaN 6 1.0 NaN NaN 15 NaN NaN NaN 16 NaN NaN NaN CodeEventGameJam CodeEventGirlDev CodeEventHackathons CodeEventMeetup \ 1 NaN NaN NaN NaN 2 NaN NaN 1.0 NaN 6 NaN NaN NaN 1.0 15 NaN NaN NaN NaN 16 NaN NaN NaN NaN CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge \ 1 NaN NaN NaN NaN 2 1.0 NaN NaN NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN NaN NaN CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps \ 1 NaN NaN NaN 2 NaN NaN NaN 6 1.0 NaN NaN 15 NaN NaN NaN 16 NaN NaN 1.0 CodeEventWomenCode CodeEventWorkshops CommuteTime \ 1 NaN NaN NaN 2 NaN NaN 15 to 29 minutes 6 NaN 1.0 30 to 44 minutes 15 NaN NaN 30 to 44 minutes 16 NaN NaN 30 to 44 minutes CountryCitizen CountryLive \ 1 United States of America United States of America 2 United States of America United States of America 6 United Kingdom United Kingdom 15 United States of America United States of America 16 Lithuania United States of America EmploymentField EmploymentFieldOther \ 1 NaN NaN 2 software development and IT NaN 6 NaN Market research 15 sales NaN 16 finance NaN EmploymentStatus EmploymentStatusOther ExpectedEarning \ 1 Not working but looking for work NaN 35000.0 2 Employed for wages NaN 70000.0 6 Employed for wages NaN 30000.0 15 Employed for wages NaN 40000.0 16 Employed for wages NaN 60000.0 FinanciallySupporting FirstDevJob Gender GenderOther HasChildren \ 1 NaN NaN male NaN NaN 2 NaN NaN male NaN NaN 6 NaN NaN female NaN NaN 15 NaN NaN male NaN NaN 16 NaN NaN male NaN NaN HasDebt 
HasFinancialDependents HasHighSpdInternet HasHomeMortgage \ 1 1.0 0.0 1.0 0.0 2 0.0 0.0 1.0 NaN 6 1.0 0.0 1.0 1.0 15 1.0 0.0 1.0 0.0 16 0.0 0.0 1.0 NaN HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning \ 1 0.0 1.0 NaN 10.0 2 0.0 NaN NaN 25.0 6 0.0 1.0 120000.0 16.0 15 0.0 1.0 NaN 1.0 16 0.0 NaN NaN 6.0 ID.x ID.y \ 1 5bfef9ecb211ec4f518cfc1d2a6f3e0c 21db37adb60cdcafadfa7dca1b13b6b1 2 14f1863afa9c7de488050b82eb3edd96 21ba173828fbe9e27ccebaf4d5166a55 6 5e130f133306abd6c2f9af31467ff37c fe5e9f175fdfbf18bcf6c85d6e042b68 15 cfff58e11d5ab123bd574302ff1b8e8f 044f4310564b902b19f1e2d776b988d6 16 91ec37259e33b1549af3506011142f4a 3d23237a292a1d0441c882ae42830e8c Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev \ 1 NaN 0.0 0.0 0.0 2 13000.0 1.0 0.0 0.0 6 40000.0 NaN 0.0 0.0 15 20000.0 0.0 0.0 0.0 16 60000.0 0.0 0.0 0.0 IsUnderEmployed JobApplyWhen JobInterestBackEnd \ 1 NaN Within 7 to 12 months NaN 2 0.0 Within 7 to 12 months 1.0 6 0.0 I'm already applying NaN 15 1.0 more than 12 months from now NaN 16 0.0 Within the next 6 months NaN JobInterestDataEngr JobInterestDataSci JobInterestDevOps \ 1 NaN NaN NaN 2 NaN NaN 1.0 6 NaN NaN NaN 15 NaN NaN NaN 16 NaN NaN NaN JobInterestFrontEnd JobInterestFullStack JobInterestGameDev \ 1 NaN 1.0 NaN 2 1.0 1.0 NaN 6 NaN 1.0 NaN 15 NaN 1.0 NaN 16 NaN 1.0 NaN JobInterestInfoSec JobInterestMobile JobInterestOther \ 1 NaN NaN NaN 2 NaN 1.0 NaN 6 NaN NaN NaN 15 NaN NaN NaN 16 NaN NaN NaN JobInterestProjMngr JobInterestQAEngr JobInterestUX \ 1 NaN NaN NaN 2 NaN NaN NaN 6 NaN NaN NaN 15 NaN NaN NaN 16 NaN NaN NaN JobPref JobRelocateYesNo \ 1 work for a nonprofit 1.0 2 work for a medium-sized company 1.0 6 work for a medium-sized company NaN 15 work for a nonprofit 1.0 16 work for a medium-sized company 0.0 JobRoleInterest \ 1 Full-Stack Web Developer 2 Front-End Web Developer, Back-End Web Develo... 
6 Full-Stack Web Developer 15 Full-Stack Web Developer 16 Full-Stack Web Developer JobWherePref LanguageAtHome \ 1 in an office with other developers English 2 no preference Spanish 6 no preference English 15 in an office with other developers English 16 in an office with other developers English MaritalStatus MoneyForLearning MonthsProgramming \ 1 single, never married 80.0 6.0 2 single, never married 1000.0 5.0 6 married or domestic partnership 0.0 12.0 15 single, never married 0.0 1.0 16 married or domestic partnership 200.0 12.0 NetworkID Part1EndTime Part1StartTime Part2EndTime \ 1 f8f8be6910 2017-03-09 00:37:07 2017-03-09 00:33:26 2017-03-09 00:38:59 2 2ed189768e 2017-03-09 00:37:58 2017-03-09 00:33:53 2017-03-09 00:40:14 6 f4abfae20d 2017-03-09 00:45:33 2017-03-09 00:41:27 2017-03-09 00:48:49 15 4b5d882f3a 2017-03-09 00:55:36 2017-03-09 00:47:48 2017-03-09 00:57:31 16 049bc4d083 2017-03-09 01:02:14 2017-03-09 01:00:10 2017-03-09 01:04:00 Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen \ 1 2017-03-09 00:37:10 NaN 1.0 NaN 2 2017-03-09 00:38:02 1.0 NaN 1.0 6 2017-03-09 00:45:52 NaN NaN NaN 15 2017-03-09 00:55:39 NaN 1.0 NaN 16 2017-03-09 01:02:17 1.0 1.0 NaN PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir \ 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN 1.0 NaN 1.0 PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown \ 1 NaN NaN NaN NaN 2 NaN NaN Codenewbie NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 1.0 NaN NaN NaN PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk \ 1 NaN NaN NaN NaN 2 NaN NaN NaN 1.0 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 1.0 1.0 1.0 1.0 PodcastTalkPython PodcastTheWebAhead ResourceCodecademy \ 1 NaN NaN 1.0 2 NaN NaN 1.0 6 NaN NaN 1.0 15 NaN NaN 1.0 16 NaN NaN NaN ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX \ 1 NaN NaN 1.0 NaN 2 NaN NaN 1.0 NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN 1.0 NaN ResourceEgghead ResourceFCC 
ResourceHackerRank ResourceKA \ 1 NaN 1.0 NaN NaN 2 NaN 1.0 NaN NaN 6 NaN 1.0 NaN NaN 15 NaN 1.0 NaN NaN 16 NaN 1.0 NaN 1.0 ResourceLynda ResourceMDN ResourceOdinProj ResourceOther \ 1 NaN NaN NaN NaN 2 NaN 1.0 NaN NaN 6 NaN NaN NaN Sololearn 15 NaN NaN NaN NaN 16 NaN 1.0 NaN NaN ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse \ 1 NaN NaN 1.0 NaN 2 NaN NaN NaN NaN 6 1.0 NaN 1.0 NaN 15 NaN NaN NaN NaN 16 1.0 NaN 1.0 NaN ResourceUdacity ResourceUdemy ResourceW3S \ 1 NaN 1.0 1.0 2 1.0 1.0 NaN 6 NaN NaN 1.0 15 NaN NaN NaN 16 NaN NaN NaN SchoolDegree SchoolMajor \ 1 some college credit, no degree NaN 2 high school diploma or equivalent (GED) NaN 6 some college credit, no degree NaN 15 master's degree (non-professional) English 16 master's degree (non-professional) Political Science StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain \ 1 NaN NaN NaN 2 NaN NaN NaN 6 8000.0 NaN NaN 15 25000.0 NaN NaN 16 NaN NaN NaN YouTubeCodingTut360 YouTubeComputerphile YouTubeDerekBanas \ 1 NaN NaN NaN 2 1.0 NaN 1.0 6 NaN NaN NaN 15 NaN NaN NaN 16 NaN NaN NaN YouTubeDevTips YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction \ 1 NaN NaN 1.0 NaN 2 1.0 NaN NaN NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN 1.0 1.0 YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT \ 1 NaN NaN NaN NaN 2 NaN 1.0 1.0 NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN NaN NaN YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston \ 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 6 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN NaN 1.0 money_per_month outliers outliers_boot 1 13.333333 0 0 2 200.000000 0 0 6 0.000000 0 0 15 0.000000 0 0 16 16.666667 0 0
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy /dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
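The per-country loop above can also be expressed with vectorized pandas operations. The following is a sketch under assumptions, not the notebook's method: the toy rows are hypothetical, the country filter uses exact matching with `isin` instead of a regular expression, and the filtered frame is copied explicitly, which is one way to avoid the SettingWithCopyWarning shown above.

```python
import numpy as np
import pandas as pd

top_countries = ["United States of America", "India", "United Kingdom", "Canada"]

# Hypothetical toy data: one extreme spender in Canada (attended a bootcamp)
# and one in India (did not), plus a row from a country outside the top 4.
toy = pd.DataFrame({
    "CountryLive": ["Canada"] * 12 + ["India"] * 12 + ["Brazil"],
    "AttendedBootcamp": [0.0] * 11 + [1.0] + [0.0] * 12 + [0.0],
    "money_per_month": [10.0] * 11 + [5000.0] + [5.0] * 11 + [900.0] + [50.0],
})

# Exact matching keeps only the four countries; .copy() makes the result
# an independent frame that is safe to add columns to.
top = toy[toy["CountryLive"].isin(top_countries)].copy()

# Per-row cutoff: mean + 3 sample standard deviations of the row's country.
grouped = top.groupby("CountryLive")["money_per_month"]
cutoff = grouped.transform("mean") + 3 * grouped.transform("std")

# Same coding as in the loop: outliers is 0/1, outliers_boot is 0/1/2.
is_outlier = top["money_per_month"] > cutoff
top["outliers"] = is_outlier.astype(int)
top["outliers_boot"] = np.where(
    is_outlier, np.where(top["AttendedBootcamp"] == 1.0, 2, 1), 0
)

print(top[top["outliers"] == 1])
```

Because the cutoff is computed per group, the same code would work unchanged if the list of countries were extended.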
#remove outliers
val_top_4_noout=val_top_4[val_top_4['outliers']==0]
print(len(val_top_4))
print(len(val_top_4_noout))
3915 3886
Only 29 outliers were removed from the 4 countries. This is because the standard deviations are quite high, which places the cutoffs far above the means. Let's check if these outliers were due to bootcamp participation.
val_top_4_boot=val_top_4[val_top_4['outliers']==1]
val_top_4_boot=val_top_4_boot[['outliers_boot']]
#create meaningful descriptions
val_top_4_boot['outliers_boot']=val_top_4_boot['outliers_boot'].replace(1,'Did not attend bootcamps')
val_top_4_boot['outliers_boot']=val_top_4_boot['outliers_boot'].replace(2,'Attended bootcamps')
#create pie plot with title and percentages in a single call
val_top_4_boot['outliers_boot'].value_counts().plot.pie(figsize = (6,6), autopct = '%.1f%%', title='Percentage of big spenders who attended bootcamps')
1 18 2 11 Name: outliers_boot, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x7f45082199b0>
About 38% of the extreme outliers (11 of 29) attended a bootcamp.
We will remove both groups of outliers (with and without bootcamp attendance) from the dataset.
Let's see if removing these outliers improved the distribution.
#create box and whisker plots
sns.boxplot(x = 'CountryLive', y = 'money_per_month', data = val_top_4_noout)
plt.show()
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning: remove_na is deprecated and is a private function. Do not use.
The distribution is a bit more homogeneous now, but it still looks very skewed. Let's look at the per-country distributions in KDE plots (the cells below plot val_top_4, i.e. the data with the outliers still included).
#USA
USA=val_top_4[val_top_4['CountryLive'].str.contains('United States of America')]
USA['money_per_month'].plot.kde(xlim=(min(USA['money_per_month']),max(USA['money_per_month'])),title='USA')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4507c52f98>
#India
India=val_top_4[val_top_4['CountryLive'].str.contains('India')]
India['money_per_month'].plot.kde(xlim=(min(India['money_per_month']),max(India['money_per_month'])),title='India')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4508195d30>
#UK
UK=val_top_4[val_top_4['CountryLive'].str.contains('United Kingdom')]
UK['money_per_month'].plot.kde(xlim=(min(UK['money_per_month']),max(UK['money_per_month'])),title='UK')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4516c76f60>
#Canada
Canada=val_top_4[val_top_4['CountryLive'].str.contains('Canada')]
Canada['money_per_month'].plot.kde(xlim=(min(Canada['money_per_month']),max(Canada['money_per_month'])),title='Canada')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4516fa1588>
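Skewed distributions like these can also be summarized numerically with a few quantiles per country, which are less sensitive to extreme values than the mean. A minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Hypothetical right-skewed spending values for two countries.
toy = pd.DataFrame({
    "CountryLive": ["Canada"] * 5 + ["India"] * 5,
    "money_per_month": [0.0, 0.0, 10.0, 50.0, 400.0, 0.0, 0.0, 0.0, 20.0, 1000.0],
})

# Median, 75th and 90th percentile per country, one column per quantile.
quantiles = (
    toy.groupby("CountryLive")["money_per_month"]
    .quantile([0.5, 0.75, 0.9])
    .unstack()
)
print(quantiles)
```

On the real data, replacing `toy` with `val_top_4` (or `val_top_4_noout`) would show at which percentile spending reaches a given amount in each country.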
As expected, the distributions for all 4 countries are extremely right-skewed. Considering that, I do not think that looking at the mean is the best way to analyze this data. Instead, I will look at the number of subjects in each of these countries who spend at least $59 per month on coding resources.
#determine the number of subjects who pay at least $59 per month and plot
total_pay=pd.pivot_table(val_top_4,index='CountryLive',values='money_per_month',aggfunc=lambda x:(x>=59).sum())
total_pay['money_per_month'].plot.bar(title='total number of respondents who pay at least $59/month in the top 4 countries')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4507c38c88>
#determine the percentage of subjects who pay at least $59 per month and plot
percent_pay=pd.pivot_table(val_top_4,index='CountryLive',values='money_per_month',aggfunc=lambda x:(x>=59).sum()/(x>=0).sum()*100)
percent_pay['money_per_month'].plot.bar(title='percentage of respondents who pay at least $59/month in the top 4 countries')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4516f5f588>
The USA has the highest number and percentage of respondents who invest at least $59 per month in their training. The remaining 3 countries are very close in these metrics.
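The count and the percentage of respondents at or above the subscription price can be obtained together from a single boolean column. A minimal sketch with hypothetical toy values (the constant name `PRICE` is illustrative):

```python
import pandas as pd

PRICE = 59  # monthly subscription price stated in the text

# Hypothetical spending values; rows with missing values are assumed to
# have been dropped already, as in the analysis above.
toy = pd.DataFrame({
    "CountryLive": ["Canada", "Canada", "Canada", "Canada", "India", "India", "India"],
    "money_per_month": [0.0, 59.0, 100.0, 20.0, 0.0, 60.0, 10.0],
})

# True where the respondent spends at least the subscription price.
pays = toy["money_per_month"].ge(PRICE)

# Sum of a boolean column = number of payers; mean = share of payers.
summary = pays.groupby(toy["CountryLive"]).agg(["sum", "mean"])
summary.columns = ["count", "percent"]
summary["percent"] = summary["percent"] * 100
print(summary)
```

Keeping both numbers in one table makes it easier to see when a high percentage rests on a small number of respondents.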
The survey we analysed is relevant for our question regarding markets in which we should invest in advertisement, considering that about 86% of the respondents who stated a job interest are interested in web or mobile development.
The top 4 countries responding to this survey were the USA, India, the UK and Canada. Because of that, and also because our courses are taught in English, we focused our analysis on these 4 countries.
The selection of where to advertise was based on an assessment of potential willingness to pay for our monthly subscription ($59).
The USA is the country with the most respondents in the survey. It is also the country where the highest percentage of coders (about 23%) are already investing the value of our subscription ($59) or more per month. Our main advertising investment should thus be directed at this country.
If the goal is to also invest in other markets, I would advise advertising in Canada as well. After the USA, it is the country with the highest percentage of coders investing $59 or more per month. Although this percentage is not significantly higher than in India and the UK, the geographical and cultural proximity to the USA may reduce the necessary investment.
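The statement that Canada's percentage is not significantly higher than India's or the UK's is not backed by a formal test in this notebook. One way to check such a claim is a two-proportion z-test. The sketch below is an assumption about how that could be done, using only the standard library; the counts passed to the function are hypothetical and are not taken from the survey.

```python
from math import erf, sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1, x2: number of respondents at or above the price in each group
    n1, n2: total number of respondents in each group
    """
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Standard normal CDF evaluated at |z|, expressed with the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Hypothetical counts: 30 of 240 payers in one country, 50 of 460 in another.
z, p_value = two_proportion_z(30, 240, 50, 460)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With the real counts per country, a large p-value would support the wording "not significantly higher", while a small one would call for revising it.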