Background and Aim
As marketing analysts at an e-learning company that offers programming courses, we are interested in finding markets for advertising. Most of our courses are on web and mobile development, but we also cover many other domains, such as data science and game development. We want to promote our product and invest some money in advertising. Our goal in this project is to find the two best markets in which to advertise.
Dataset
Because conducting our own survey would be costly, we first explore existing datasets. Here we work with data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in their GitHub repository.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Setting pandas display options for large data
pd.options.display.max_rows = 200
pd.options.display.max_columns = 150
survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', keep_default_na = True, low_memory = False)
print(survey.info())
print(survey.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18175 entries, 0 to 18174 Columns: 136 entries, Age to YouTubeTheNewBoston dtypes: float64(105), object(31) memory usage: 18.9+ MB None Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName \ 0 27.0 0.0 NaN NaN NaN 1 34.0 0.0 NaN NaN NaN 2 21.0 0.0 NaN NaN NaN 3 26.0 0.0 NaN NaN NaN 4 20.0 0.0 NaN NaN NaN BootcampRecommend ChildrenNumber CityPopulation \ 0 NaN NaN more than 1 million 1 NaN NaN less than 100,000 2 NaN NaN more than 1 million 3 NaN NaN between 100,000 and 1 million 4 NaN NaN between 100,000 and 1 million CodeEventConferences CodeEventDjangoGirls CodeEventFCC CodeEventGameJam \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN CodeEventGirlDev CodeEventHackathons CodeEventMeetup \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN 1.0 NaN 3 NaN NaN NaN 4 NaN NaN NaN CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 1.0 NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN CodeEventWomenCode CodeEventWorkshops CommuteTime \ 0 NaN NaN 15 to 29 minutes 1 NaN NaN NaN 2 NaN NaN 15 to 29 minutes 3 NaN NaN I work from home 4 NaN NaN NaN CountryCitizen CountryLive \ 0 Canada Canada 1 United States of America United States of America 2 United States of America United States of America 3 Brazil Brazil 4 Portugal Portugal EmploymentField EmploymentFieldOther \ 0 software development and IT NaN 1 NaN NaN 2 software development and IT NaN 3 software development and IT NaN 4 NaN NaN EmploymentStatus EmploymentStatusOther ExpectedEarning \ 0 Employed for wages NaN NaN 1 Not working but looking for work NaN 35000.0 2 Employed for wages NaN 70000.0 3 Employed for wages NaN 40000.0 4 Not working but looking for work NaN 140000.0 FinanciallySupporting FirstDevJob Gender GenderOther HasChildren \ 0 
NaN NaN female NaN NaN 1 NaN NaN male NaN NaN 2 NaN NaN male NaN NaN 3 0.0 NaN male NaN 0.0 4 NaN NaN female NaN NaN HasDebt HasFinancialDependents HasHighSpdInternet HasHomeMortgage \ 0 1.0 0.0 1.0 0.0 1 1.0 0.0 1.0 0.0 2 0.0 0.0 1.0 NaN 3 1.0 1.0 1.0 1.0 4 0.0 0.0 1.0 NaN HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning \ 0 0.0 0.0 NaN 15.0 1 0.0 1.0 NaN 10.0 2 0.0 NaN NaN 25.0 3 0.0 0.0 40000.0 14.0 4 0.0 NaN NaN 10.0 ID.x ID.y \ 0 02d9465b21e8bd09374b0066fb2d5614 eb78c1c3ac6cd9052aec557065070fbf 1 5bfef9ecb211ec4f518cfc1d2a6f3e0c 21db37adb60cdcafadfa7dca1b13b6b1 2 14f1863afa9c7de488050b82eb3edd96 21ba173828fbe9e27ccebaf4d5166a55 3 91756eb4dc280062a541c25a3d44cfb0 3be37b558f02daae93a6da10f83f0c77 4 aa3f061a1949a90b27bef7411ecd193f d7c56bbf2c7b62096be9db010e86d96d Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev \ 0 NaN NaN 0.0 0.0 1 NaN 0.0 0.0 0.0 2 13000.0 1.0 0.0 0.0 3 24000.0 0.0 0.0 0.0 4 NaN 0.0 0.0 0.0 IsUnderEmployed JobApplyWhen JobInterestBackEnd \ 0 0.0 NaN NaN 1 NaN Within 7 to 12 months NaN 2 0.0 Within 7 to 12 months 1.0 3 1.0 Within the next 6 months 1.0 4 NaN Within 7 to 12 months 1.0 JobInterestDataEngr JobInterestDataSci JobInterestDevOps \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN 1.0 3 NaN NaN NaN 4 NaN NaN NaN JobInterestFrontEnd JobInterestFullStack JobInterestGameDev \ 0 NaN NaN NaN 1 NaN 1.0 NaN 2 1.0 1.0 NaN 3 1.0 1.0 NaN 4 1.0 1.0 NaN JobInterestInfoSec JobInterestMobile JobInterestOther \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN 1.0 NaN 3 NaN NaN NaN 4 1.0 1.0 NaN JobInterestProjMngr JobInterestQAEngr JobInterestUX \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN JobPref JobRelocateYesNo \ 0 start your own business NaN 1 work for a nonprofit 1.0 2 work for a medium-sized company 1.0 3 work for a medium-sized company NaN 4 work for a multinational corporation 1.0 JobRoleInterest \ 0 NaN 1 Full-Stack Web Developer 2 Front-End Web Developer, Back-End Web Develo... 
3 Front-End Web Developer, Full-Stack Web Deve... 4 Full-Stack Web Developer, Information Security... JobWherePref LanguageAtHome \ 0 NaN English 1 in an office with other developers English 2 no preference Spanish 3 from home Portuguese 4 in an office with other developers Portuguese MaritalStatus MoneyForLearning MonthsProgramming \ 0 married or domestic partnership 150.0 6.0 1 single, never married 80.0 6.0 2 single, never married 1000.0 5.0 3 married or domestic partnership 0.0 5.0 4 single, never married 0.0 24.0 NetworkID Part1EndTime Part1StartTime Part2EndTime \ 0 6f1fbc6b2b 2017-03-09 00:36:22 2017-03-09 00:32:59 2017-03-09 00:59:46 1 f8f8be6910 2017-03-09 00:37:07 2017-03-09 00:33:26 2017-03-09 00:38:59 2 2ed189768e 2017-03-09 00:37:58 2017-03-09 00:33:53 2017-03-09 00:40:14 3 dbdc0664d1 2017-03-09 00:40:13 2017-03-09 00:37:45 2017-03-09 00:42:26 4 11b0f2d8a9 2017-03-09 00:42:45 2017-03-09 00:39:44 2017-03-09 00:45:42 Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen \ 0 2017-03-09 00:36:26 NaN NaN NaN 1 2017-03-09 00:37:10 NaN 1.0 NaN 2 2017-03-09 00:38:02 1.0 NaN 1.0 3 2017-03-09 00:40:18 NaN NaN NaN 4 2017-03-09 00:42:50 NaN NaN NaN PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir \ 0 1.0 NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN Codenewbie NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN 1.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN PodcastTalkPython PodcastTheWebAhead ResourceCodecademy \ 0 NaN NaN 1.0 1 NaN NaN 1.0 2 NaN NaN 1.0 3 NaN NaN NaN 4 NaN NaN NaN ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX \ 0 NaN NaN NaN NaN 1 NaN NaN 1.0 NaN 2 NaN NaN 1.0 NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN ResourceEgghead ResourceFCC ResourceHackerRank ResourceKA \ 0 NaN 1.0 NaN 
NaN 1 NaN 1.0 NaN NaN 2 NaN 1.0 NaN NaN 3 1.0 1.0 NaN NaN 4 NaN NaN NaN NaN ResourceLynda ResourceMDN ResourceOdinProj ResourceOther \ 0 NaN 1.0 NaN NaN 1 NaN NaN NaN NaN 2 NaN 1.0 NaN NaN 3 NaN 1.0 NaN NaN 4 NaN NaN NaN NaN ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse \ 0 NaN NaN NaN NaN 1 NaN NaN 1.0 NaN 2 NaN NaN NaN NaN 3 NaN NaN 1.0 NaN 4 NaN NaN 1.0 NaN ResourceUdacity ResourceUdemy ResourceW3S \ 0 NaN 1.0 1.0 1 NaN 1.0 1.0 2 1.0 1.0 NaN 3 NaN NaN NaN 4 NaN NaN NaN SchoolDegree SchoolMajor \ 0 some college credit, no degree NaN 1 some college credit, no degree NaN 2 high school diploma or equivalent (GED) NaN 3 some college credit, no degree NaN 4 bachelor's degree Information Technology StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain YouTubeCodingTut360 \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN 1.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN YouTubeComputerphile YouTubeDerekBanas YouTubeDevTips \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN 1.0 1.0 3 NaN NaN 1.0 4 NaN NaN NaN YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction \ 0 NaN NaN NaN 1 NaN 1.0 NaN 2 NaN NaN NaN 3 NaN 1.0 1.0 4 NaN NaN NaN YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN 1.0 1.0 NaN 3 NaN NaN 1.0 NaN 4 NaN NaN NaN NaN YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN
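Below we generate a frequency table in the form of a dictionary, in order to assess the major categories in which survey respondents expressed interest. Note that the loop runs over the distinct answer strings (the index returned by value_counts), so each count is the number of distinct answers that mention a role, not the number of respondents.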
counts = survey.JobRoleInterest.value_counts(normalize = True)*100
# Coders have given multiple options in the form of strings; we have to split the string
counts_split = counts.index.str.split(',')
# Generate frequency table in the form of a dictionary
freq_table = {}
for row in counts_split:
    for role in row:
        if role in freq_table:
            freq_table[role] += 1
        else:
            freq_table[role] = 1
print(freq_table)
len(freq_table)
{'Full-Stack Web Developer': 460, ' Front-End Web Developer': 416, ' Data Scientist': 266, 'Back-End Web Developer': 382, ' Mobile Developer': 306, 'Game Developer': 256, 'Information Security': 209, ' Front-End Web Developer': 1582, ' Full-Stack Web Developer': 1761, ' Product Manager': 141, 'Data Engineer': 209, ' User Experience Designer': 218, ' Back-End Web Developer': 1509, ' DevOps / SysAdmin': 162, ' Mobile Developer': 1347, ' User Experience Designer': 907, ' Data Scientist': 974, ' Data Engineer': 859, ' Game Developer': 948, ' Quality Assurance Engineer': 86, ' Information Security': 893, ' DevOps / SysAdmin': 643, ' Product Manager': 543, 'Software Engineer': 1, 'Software Developer': 1, ' Quality Assurance Engineer': 375, 'Unsure': 1, 'Not sure': 1, 'Artificial Intelligence': 1, 'undecided': 1, 'Not sure yet': 1, ' data analyst': 2, ' plc': 1, 'Network': 1, ' Teacher': 1, ' software engineer': 1, ' Bioinformatics/science ': 1, ' security expert': 1, ' Business Analyst': 2, 'All - whatever is required to develop tools to revolutionize the mechanical engineering process': 1, ' Software Developer or Front-End Web Developer': 1, 'Developer Evangelist': 1, ' Project Management': 1, ' Technical Writer': 1, 'Research ': 1, 'improving in my current career as a Learning technologist': 1, 'Data Visualization Specialist': 1, 'Data visualisation': 1, ' Desktop Application Developer': 3, ' Software Engineer': 6, ' Product Designer': 2, 'Not Sure': 1, ' I am interested in Game Development': 1, ' Mobile Development': 1, ' Web Design': 1, ' Front End Web Development': 1, ' Program Manager': 1, 'Software Engineers': 1, 'Support Engineer or API Support': 1, 'Front-End Web Designer': 1, ' computer engineer': 1, ' System Software': 1, ' Software engineer': 2, ' milatary engineer': 1, ' Technology-Business Liaison': 1, ' Researcher': 2, 'non-programmer': 1, 'Full Stack Developer ': 1, 'Non technical ': 1, 'Computer Architect': 1, 'Quant (Algorithmic Trader)': 1, ' UI 
Design': 1, ' network admin': 1, " This futurist's dream of using some tech in a way that inspires critical amounts of people to influence the changes we need to protect ": 1, ' Entrepreneur': 1, ' Artificial Intelligence engineer': 1, 'Research': 1, ' Application Support Analyst': 1, ' Networking': 1, ' Machine Learning Engineer ': 1, 'AI': 1, ' Artificial Intelligence ': 1, 'Informatician': 1, 'Web Designer': 1, "I'm just learning code to increase my skill-set. I see it as a literacy issue.": 1, 'Urban Planner': 1, ' Embedded Developer': 1, ' Data Analyst': 2, 'Not Sure Yet': 1, ' virtual reality developer': 1, 'lab scientist': 1, 'Software Engineer (Computer Science Based)': 1, ' Campaign Manager': 1, 'IoT Developer': 1, ' creative coder / generative artist/designer': 1, ' Information Technology': 1, ' UX developer/designer': 1, 'Software Specialist ': 1, 'Security Business Analyst ': 1, ' Desktop applications developer': 1, 'Remote Support': 1, ' Marketing': 1, 'BA or developer': 1, ' Software Development': 1, 'Real-time systems': 1, 'Desktop Applications': 1, ' Bitcoin/Crypto': 1, ' Bioinformatics': 1, ' SWE': 1, 'Digital Humanitites': 1, 'Research and education': 1, 'code developer...in whatever format': 1, ' front-end': 1, ' back-end': 1, ' app dev etc.': 1, ' Analyst': 1, 'full-stack developer': 1, ' support scientific resaerch ': 1, "i don't know what the difference is between most of these soz lol": 1, ' Robotics': 1, ' Web Designer': 1, ' Artificial intelligence': 1, 'Ceo': 1, 'Desings': 1, 'data journalist / data visualist': 1, 'Ethical Hacker': 1, 'idk': 1, ' System Administrator/Network': 1, ' Founder': 1, 'VR Technology developer': 1, 'Technology Management': 1, 'software developer': 1, ' Entreprenuer / Web Dev Hustler ': 1, 'Web development ': 1, 'Any of them.': 1, ' Bioinformatitian': 1, ' Project Manager': 1, ' Operating Systems': 1, ' Compilers': 1, ' etc...': 1, 'Physicist ': 1, ' Software developer': 1, 'Information Developer': 1, ' designer': 
1, "Don't know yet": 1, ' SEO': 1, ' Criminal Defense Attorney-- focusing on cyber crimes ': 1, 'philosopher': 1, 'Web developer': 1, 'Programming': 1, 'Not sure!': 1, 'Pharmaceutical industry': 1, "I don't know yet!": 1, ' Scientific Programming': 1, ' Growth Hacker': 1, 'Network Engineer': 1, ' AI Engineer': 1, ' IoT': 1, ' Cybersecurity': 1, 'College professor': 1, ' Bioinformatics ': 1, 'User Interface Designer': 1, ' Journalist': 1, 'Document Controller': 1, ' Machine Learning': 1, ' developer': 1, ' Artificial Intelligence Engineer': 1, ' Journalist/Graphic Designer/Marketing': 1, 'full stack developer': 1, 'Java developer': 1, 'Astrophysicist': 1, 'Cloud computing ': 1, 'Systems Engineer': 1, 'Data Reporter': 1, ' AI and neuroscience': 1, 'Full Stack Software Engineer': 1, ' Programmer': 2, ' Anything that engages me': 1, 'Marketing Automation ': 1, ' Tech lobbiest': 1, ' Information Architect': 1, 'System Engineer': 1, ' Library Developer': 1, 'Software engineer': 1, 'Teacher. Teaching students to code. 
': 1, ' Databases': 1, ' Ethical Hacker': 1, ' Software enginner': 1, ' Artificial Intelligence': 1, 'Robotics Process Automation Specialist': 1, 'Natural Language Processing': 1, ' Machine Learning Engineer': 2, 'Systems Programmer': 1, 'Systems Programming': 1, ' Tech art': 1, 'Software Developper': 1, ' programmer': 1, ' AI and Machine Learning': 1, ' a job in which I can use coding skills to create valuable portals to advance human rights': 1, ' Machine Learning ': 1, 'Machine learning engineer': 1, ' GIS Database Admin': 1, 'Robotics and AI Engineer': 1, ' IT specialist ': 1, ' User Interface Design': 1, 'Machine learning and AI ': 1, 'Embedded hardware': 1, ' i dunno!!!!': 1, ' Data analyst': 1, ' User Interface Designer': 1, ' I dont yet know': 1, 'AI Developer': 1, ' Python Developer': 1, ' UI Designer': 1, 'Desktop Applications Programmer': 1, 'email coder': 1, 'Front end': 1, ' back end': 1, ' game': 1, ' web': 1, ' mobile developer': 1, 'undeceided': 1, 'Software engineer ': 1, 'programmer': 1, 'GIS Developer': 1, ' Software Developer': 2, ' Java developer': 1, ' Infrastructure Architect ': 1, ' Software Projects Manager': 1, 'Project Manager': 1, 'Data/Interactive Journalist': 1, 'Education': 1, 'Pharmacy tech': 1, 'Project manager': 1, 'Financial Services': 1, 'Software Engineering': 1}
236
The dictionary above contains 236 categories, many of which are duplicates that differ only in spacing, capitalisation or wording.
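One way to avoid these duplicates is to strip whitespace from each role before counting, and to count once per respondent rather than once per distinct answer string. The sketch below uses a small made-up Series; `answers` and the helper `count_roles` are illustrative names, not part of the survey code. It is an alternative to the dictionary loop above, so its numbers would not match the output shown.

```python
import pandas as pd

def count_roles(answers: pd.Series) -> pd.Series:
    """Count how many respondents mention each role (hypothetical helper)."""
    roles = (
        answers.dropna()      # ignore respondents who skipped the question
        .str.split(',')       # one list of roles per respondent
        .explode()            # one row per (respondent, role) pair
        .str.strip()          # remove stray leading/trailing spaces
    )
    return roles[roles != ''].value_counts()

# Made-up answers that mimic the spacing problems in the survey
answers = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer,   Full-Stack Web Developer',
    None,
    ' Data Scientist, Full-Stack Web Developer ',
])
role_counts = count_roles(answers)
print(role_counts)
```

On the real data the same helper could be applied to `survey.JobRoleInterest`; differences in capitalisation and wording would still need separate handling.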
Below, we concentrate on the categories that include 'web developer' or 'mobile developer' as a future role of interest, selecting them with a regular expression.
freq_series = pd.Series(freq_table, index = freq_table.keys()) # uses keys() to get dictionary keys as index of series
web_mobile = freq_series[freq_series.index.str.contains('[wW]eb [dD]evel*|[mM]obile [dD]evel*')] # caters for development and developer
print('sum_web_mobile: ', sum(web_mobile), '\n')
print(web_mobile)
sum_web_mobile: 7769

Full-Stack Web Developer                          460
Front-End Web Developer                           416
Back-End Web Developer                            382
Mobile Developer                                  306
Front-End Web Developer                          1582
Full-Stack Web Developer                         1761
Back-End Web Developer                           1509
Mobile Developer                                 1347
Software Developer or Front-End Web Developer       1
Mobile Development                                  1
Front End Web Development                           1
Web development                                     1
Web developer                                       1
mobile developer                                    1
dtype: int64
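A note on the pattern used above: in a regular expression, `l*` means "zero or more `l` characters", so `[dD]evel*` matches any label containing `deve` or `Deve`, which covers both "Developer" and "Development". A case-insensitive pattern with an explicit stem is a close alternative, although it also accepts other capitalisations. The sketch below compares the two on a few made-up labels; the `labels` Series and the `simpler` pattern are illustrative only.

```python
import re
import pandas as pd

labels = pd.Series([
    'Full-Stack Web Developer',
    ' Mobile Development',
    'web developer',
    'Data Scientist',
    'Web Designer',
])

# Pattern used in the notebook
original = '[wW]eb [dD]evel*|[mM]obile [dD]evel*'
# Alternative: non-capturing group plus a case-insensitive flag
simpler = r'(?:web|mobile) develop'

mask_original = labels.str.contains(original)
mask_simpler = labels.str.contains(simpler, flags=re.IGNORECASE)
print(pd.DataFrame({'label': labels, 'original': mask_original, 'simpler': mask_simpler}))
```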
# Checking null entries
print(survey.JobRoleInterest.notnull().sum())
print(survey.JobRoleInterest.isnull().sum())
6992
11183
The output above shows how inconsistent spacing and formatting fragment the data. Even so, there are 7769 mentions of web and/or mobile development spread over 14 categories. 6992 respondents answered this question, and because many of them selected several roles, a single answer can fall into more than one category. For instance, the following entries contain the string 'Web Developer' more than once.
print(survey.JobRoleInterest.head(5))
0    NaN
1    Full-Stack Web Developer
2    Front-End Web Developer, Back-End Web Develo...
3    Front-End Web Developer, Full-Stack Web Deve...
4    Full-Stack Web Developer, Information Security...
Name: JobRoleInterest, dtype: object
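Because one answer can list several roles, a per-respondent view complements the category counts: each respondent is flagged once if any of their listed roles is web or mobile development. The sketch below does this on made-up answers; the `answers` Series is illustrative, and on the real data the same steps would apply to `survey.JobRoleInterest.dropna()`.

```python
import pandas as pd

answers = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer, Back-End Web Developer',
    'Data Scientist, Game Developer',
    'Mobile Developer, Data Engineer',
])

# True when the answer mentions web or mobile development at least once
is_web_or_mobile = answers.str.contains('Web Developer|Mobile Developer')

# The mean of a boolean Series is the share of True values
share_pct = is_web_or_mobile.mean() * 100
print(f'{share_pct:.1f}% of respondents mention web or mobile development')
```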
Similarly, we can identify the categories which are other than web or mobile development.
not_web_mobile = freq_series[freq_series.index.str.contains('[wW]eb [dD]evel*|[mM]obile [dD]evel*') == False]
print('sum_not_web_mobile: ', sum(not_web_mobile), '\n')
print(not_web_mobile) # single-digit entries are removed in a later step
sum_not_web_mobile: 7911

Data Scientist          266
Game Developer          256
Information Security    209
Product Manager         141
Data Engineer           209
                       ...
Education                 1
Pharmacy tech             1
Project manager           1
Financial Services        1
Software Engineering      1
Length: 222, dtype: int64
There are 7911 expressions of interest across 222 not_web_mobile categories. Most of these have single-digit frequencies, which implies that they are not very popular choices. So below we remove the single-digit categories from both groups, web_mobile as well as not_web_mobile, to focus on the major categories only.
# Removing single digit categories
web_mobile_dd = web_mobile[web_mobile > 9].sort_values(ascending = False) ## dd denotes double digit
print('sum_web_mobile_dd: ', sum(web_mobile_dd), '\n')
print(web_mobile_dd)
sum_web_mobile_dd: 7763

Full-Stack Web Developer    1761
Front-End Web Developer     1582
Back-End Web Developer      1509
Mobile Developer            1347
Full-Stack Web Developer     460
Front-End Web Developer      416
Back-End Web Developer       382
Mobile Developer             306
dtype: int64
# Removing single digit categories
not_web_mobile_dd = not_web_mobile[not_web_mobile > 9].sort_values(ascending = False) ## dd denotes double digit
print('sum_not_web_mobile_dd: ', sum(not_web_mobile_dd), '\n')
print(not_web_mobile_dd)
sum_not_web_mobile_dd: 7689

Data Scientist                974
Game Developer                948
User Experience Designer      907
Information Security          893
Data Engineer                 859
DevOps / SysAdmin             643
Product Manager               543
Quality Assurance Engineer    375
Data Scientist                266
Game Developer                256
User Experience Designer      218
Data Engineer                 209
Information Security          209
DevOps / SysAdmin             162
Product Manager               141
Quality Assurance Engineer     86
dtype: int64
We have removed all single-digit categories, which leaves only 16 categories instead of the original 222. These still contain duplicates: each of the eight roles appears twice because of formatting inconsistencies (for instance Data Scientist, Data Engineer and Game Developer). Even so, they give a broad idea of the popular categories in the not_web_mobile group.
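The duplicated labels can be merged by stripping whitespace from the index and summing the counts that share a label. The sketch below applies this to four values copied from the output above; the exact leading spaces in the sample index are an assumption about how the duplicates differ.

```python
import pandas as pd

# Counts taken from the output above; the exact whitespace is assumed
counts = pd.Series(
    [974, 266, 948, 256],
    index=['  Data Scientist', ' Data Scientist', ' Game Developer', 'Game Developer'],
)

# Group rows whose labels are equal after stripping, then add their counts
merged = counts.groupby(counts.index.str.strip()).sum().sort_values(ascending=False)
print(merged)
```

The same expression could be applied to `not_web_mobile_dd` or `web_mobile_dd` before plotting, so that each role appears as a single bar.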
In many cases, respondents listed these roles together with web_mobile categories in the same answer, as shown below.
print(survey.JobRoleInterest.tail(5))
18170    NaN
18171    DevOps / SysAdmin, Mobile Developer, Pro...
18172    NaN
18173    NaN
18174    Back-End Web Developer, Data Engineer, Data ...
Name: JobRoleInterest, dtype: object
We will now plot web_mobile and not_web_mobile as two bar charts, one for each group.
fig = plt.figure(figsize=(12, 12))
plt.style.use('fivethirtyeight')
plt.subplot(2,1,1)
web_mobile_dd.plot.barh(color = 'red', title = 'Respondents Interest: Development Streams', fontsize = 8, width = 0.25)
plt.xlim(0, 1800)
plt.subplot(2,1,2)
not_web_mobile_dd.plot.barh(color = 'blue', title = 'Respondents Interest: Non-Development Streams', fontsize = 8)
plt.xlabel('frequency', fontsize = 12)
plt.xlim(0, 1800)
plt.tight_layout()
plt.show()
Although the data above could be refined further by combining similar categories, it is already clear from this picture that respondents expressed interest not only in web and mobile development but also in other major categories such as data science, game development and data engineering.
We will now focus on respondents who answered the JobRoleInterest question and build absolute and relative frequency tables of their current location (the CountryLive column), to see how new coders are distributed across countries of residence.
survey_notnull = survey[survey.JobRoleInterest.notnull()]
# generate absolute frequency table by location
abs_freq = survey_notnull.CountryLive.value_counts(ascending = False)
# generate relative frequency table by location
rel_freq = survey_notnull.CountryLive.value_counts(normalize = True, ascending = False)*100
# Combine data in a new dataframe
survey_bylocation = pd.DataFrame({'Frequency': abs_freq, 'Percentage': rel_freq}, index = abs_freq.index)
survey_bylocation
CountryLive | Frequency | Percentage |
---|---|---|
United States of America | 3125 | 45.700497 |
India | 528 | 7.721556 |
United Kingdom | 315 | 4.606610 |
Canada | 260 | 3.802281 |
Poland | 131 | 1.915765 |
Brazil | 129 | 1.886517 |
Germany | 125 | 1.828020 |
Australia | 112 | 1.637906 |
Russia | 102 | 1.491664 |
Ukraine | 89 | 1.301550 |
Nigeria | 84 | 1.228429 |
Spain | 77 | 1.126060 |
France | 75 | 1.096812 |
Romania | 71 | 1.038315 |
Netherlands (Holland, Europe) | 65 | 0.950570 |
Italy | 62 | 0.906698 |
Philippines | 52 | 0.760456 |
Serbia | 52 | 0.760456 |
Greece | 46 | 0.672711 |
Ireland | 43 | 0.628839 |
South Africa | 39 | 0.570342 |
Mexico | 37 | 0.541094 |
Turkey | 36 | 0.526470 |
Hungary | 34 | 0.497221 |
Singapore | 34 | 0.497221 |
New Zealand | 33 | 0.482597 |
Croatia | 32 | 0.467973 |
Argentina | 32 | 0.467973 |
Indonesia | 31 | 0.453349 |
Pakistan | 31 | 0.453349 |
Norway | 31 | 0.453349 |
Sweden | 31 | 0.453349 |
Denmark | 30 | 0.438725 |
Finland | 29 | 0.424101 |
Egypt | 29 | 0.424101 |
Israel | 29 | 0.424101 |
Malaysia | 28 | 0.409476 |
Portugal | 28 | 0.409476 |
Vietnam | 28 | 0.409476 |
China | 28 | 0.409476 |
Czech Republic | 26 | 0.380228 |
Kenya | 26 | 0.380228 |
Japan | 24 | 0.350980 |
Bangladesh | 23 | 0.336356 |
Lithuania | 23 | 0.336356 |
Great Britain | 21 | 0.307107 |
Bosnia & Herzegovina | 20 | 0.292483 |
Belarus | 20 | 0.292483 |
United Arab Emirates | 19 | 0.277859 |
Belgium | 19 | 0.277859 |
Austria | 17 | 0.248611 |
Nepal | 17 | 0.248611 |
Korea South | 17 | 0.248611 |
Colombia | 16 | 0.233987 |
Venezuela | 16 | 0.233987 |
Bulgaria | 15 | 0.219362 |
Taiwan | 14 | 0.204738 |
Republic of Serbia | 14 | 0.204738 |
Switzerland | 14 | 0.204738 |
Thailand | 13 | 0.190114 |
Latvia | 13 | 0.190114 |
Ghana | 12 | 0.175490 |
Hong Kong | 11 | 0.160866 |
Kazakhstan | 11 | 0.160866 |
Macedonia | 10 | 0.146242 |
Sri Lanka | 10 | 0.146242 |
Morocco | 10 | 0.146242 |
Slovenia | 9 | 0.131617 |
Jamaica | 9 | 0.131617 |
Slovakia | 9 | 0.131617 |
Saudi Arabia | 9 | 0.131617 |
Peru | 8 | 0.116993 |
Algeria | 8 | 0.116993 |
Estonia | 8 | 0.116993 |
Dominican Republic | 8 | 0.116993 |
Costa Rica | 7 | 0.102369 |
Puerto Rico | 7 | 0.102369 |
Albania | 6 | 0.087745 |
Chile | 6 | 0.087745 |
Virgin Islands (USA) | 6 | 0.087745 |
Luxembourg | 6 | 0.087745 |
Azerbaijan | 5 | 0.073121 |
Iran | 5 | 0.073121 |
Tunisia | 5 | 0.073121 |
Uruguay | 5 | 0.073121 |
Zimbabwe | 4 | 0.058497 |
Cambodia | 4 | 0.058497 |
Georgia | 4 | 0.058497 |
Afghanistan | 4 | 0.058497 |
Iceland | 3 | 0.043872 |
Niger | 3 | 0.043872 |
Paraguay | 3 | 0.043872 |
Netherland Antilles | 3 | 0.043872 |
Uzbekistan | 3 | 0.043872 |
Senegal | 3 | 0.043872 |
Uganda | 3 | 0.043872 |
Haiti | 2 | 0.029248 |
Mauritius | 2 | 0.029248 |
Cyprus | 2 | 0.029248 |
Ecuador | 2 | 0.029248 |
Moldova | 2 | 0.029248 |
Guam | 2 | 0.029248 |
Lebanon | 2 | 0.029248 |
Bahrain | 2 | 0.029248 |
Iraq | 2 | 0.029248 |
Honduras | 2 | 0.029248 |
Kyrgyzstan | 1 | 0.014624 |
Turkmenistan | 1 | 0.014624 |
Guatemala | 1 | 0.014624 |
Gibraltar | 1 | 0.014624 |
Guadeloupe | 1 | 0.014624 |
Bolivia | 1 | 0.014624 |
Somalia | 1 | 0.014624 |
Panama | 1 | 0.014624 |
Vanuatu | 1 | 0.014624 |
Cameroon | 1 | 0.014624 |
Jordan | 1 | 0.014624 |
Myanmar | 1 | 0.014624 |
Mozambique | 1 | 0.014624 |
Angola | 1 | 0.014624 |
Anguilla | 1 | 0.014624 |
Sudan | 1 | 0.014624 |
Gambia | 1 | 0.014624 |
Aruba | 1 | 0.014624 |
Nicaragua | 1 | 0.014624 |
Papua New Guinea | 1 | 0.014624 |
Nambia | 1 | 0.014624 |
Qatar | 1 | 0.014624 |
Botswana | 1 | 0.014624 |
Channel Islands | 1 | 0.014624 |
Samoa | 1 | 0.014624 |
Liberia | 1 | 0.014624 |
Trinidad & Tobago | 1 | 0.014624 |
Cayman Islands | 1 | 0.014624 |
Cuba | 1 | 0.014624 |
Rwanda | 1 | 0.014624 |
Yemen | 1 | 0.014624 |
The United States is the biggest market by a wide margin, followed by India, the UK and Canada. To illustrate the distribution, we plot below a bar graph of relative frequencies by country, restricted to countries with at least 1% of respondents.
fig = plt.figure(figsize=(12, 6))
plt.style.use('fivethirtyeight')
survey_bylocation[survey_bylocation['Percentage'] >= 1]['Percentage'].plot.barh(label = 'location distribution', legend = True, colormap = 'tab10')
plt.legend()
plt.xlabel('Relative Frequency')
plt.title("Distribution of Coders by Location", fontsize = 16)
plt.show()
An important factor to consider while identifying markets for our products is the financial capacity and preferences of the new coders i.e. how much money they are willing to spend on learning.
The MoneyForLearning column describes, in US dollars, the amount of money spent by participants from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at a price of USD 59 per month, and for this reason we're interested in finding out how much money each student spends per month on learning.
We will also narrow our analysis to the four countries with the most respondents: the US, India, the United Kingdom and Canada. We will find the mean amount of money spent per month by new coders in these four countries and plot a bar chart for comparison.
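The per-month figure is total spending divided by months of programming, with 0 months treated as 1 so that the division is defined. A minimal sketch on made-up rows follows; the `toy` frame and the `above_subscription` column are illustrative, while the other column names follow the survey.

```python
import pandas as pd

toy = pd.DataFrame({
    'MoneyForLearning': [150.0, 1000.0, 0.0, 300.0],
    'MonthsProgramming': [6.0, 5.0, 24.0, 0.0],
})

# Treat "0 months" as one month so that we never divide by zero
months = toy['MonthsProgramming'].replace(0, 1)
toy['money_per_month'] = toy['MoneyForLearning'] / months

# Compare with the USD 59 monthly subscription price
toy['above_subscription'] = toy['money_per_month'] > 59
print(toy)
```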
# Create a new column "money_per_month" based on past spending and the duration spent coding till the time of survey
# Analysis will be restricted to the US, India, the United Kingdom and Canada
# Remove null values from 'CountryLive' column; .copy() gives an independent frame that is safe to modify
survey_notnull = survey_notnull[survey_notnull.CountryLive.notnull()].copy()
# Replace the '0' values in the column 'MonthsProgramming' with 1 to avoid division by zero
survey_notnull['MonthsProgramming'] = survey_notnull['MonthsProgramming'].replace(0, 1)
# Create the column (rows with missing inputs stay NaN and are skipped by mean())
survey_notnull['money_per_month'] = survey_notnull.MoneyForLearning / survey_notnull.MonthsProgramming
# Groupby 'CountryLive', find mean and isolate data to four countries of concern
survey_mean_spending = survey_notnull.groupby('CountryLive')['money_per_month'].mean()[['United States of America',
'India', 'United Kingdom',
'Canada']]
print(survey_mean_spending)
fig = plt.figure(figsize=(12, 6))
plt.style.use('fivethirtyeight')
survey_mean_spending.plot.bar(label = 'mean spending', legend = True, colormap = 'winter', rot = 0)
plt.legend()
plt.ylabel('mean spending')
plt.xlabel('')
plt.title("Mean Spending by Coders by Location", fontsize = 16)
plt.show()
CountryLive
United States of America    227.997996
India                       135.100982
United Kingdom               45.534443
Canada                      113.510961
Name: money_per_month, dtype: float64
The spending profiles above are somewhat inconsistent with socio-economic indicators for the respective countries; for instance, the mean for India is higher than the means for the UK and Canada. We therefore have to check for extreme outliers that may be distorting these means. For this we will draw box plots for the four countries in question.
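To see why a handful of extreme values matters, compare the mean and the median on made-up spending figures: a single very large value moves the mean a lot and the median hardly at all. The numbers below are illustrative, not survey data.

```python
import pandas as pd

typical = pd.Series([0, 10, 20, 30, 40])
with_outlier = pd.Series([0, 10, 20, 30, 40, 5000])

print('mean without outlier:  ', typical.mean())         # 20.0
print('mean with outlier:     ', with_outlier.mean())    # 850.0
print('median without outlier:', typical.median())       # 20.0
print('median with outlier:   ', with_outlier.median())  # 25.0
```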
# isolate four countries of interest
countries = ['United States of America', 'India', 'United Kingdom', 'Canada']
# Stack the rows for each country in the order listed above
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
top_four = pd.concat([survey_notnull[survey_notnull['CountryLive'] == country] for country in countries])
# Plot data for all four countries as boxplots
fig = plt.figure(figsize=(8, 8))
plt.style.use('fivethirtyeight')
sns.boxplot(x = 'CountryLive', y = 'money_per_month', data = top_four)
plt.xlabel('Country')
plt.ylabel('money spent')
plt.title(' Distribution of Money Spent Per Month Per Country')
plt.show()
There are some extreme outliers far above the mean in terms of money spent. We start by removing respondents who report spending USD 10,000 or more per month.
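One side effect of filtering with a comparison such as `money_per_month < 10000` is worth knowing: comparisons against `NaN` evaluate to `False`, so rows with a missing `money_per_month` are dropped along with the outliers. The group means are not affected by those rows, because `mean()` skips missing values anyway. A small sketch on made-up values (the `toy` frame is illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'CountryLive': ['India', 'India', 'India', 'Canada'],
    'money_per_month': [50.0, np.nan, 20000.0, 100.0],
})

# Keeps only rows where the comparison is True: the NaN row and the outlier both go
kept = toy[toy['money_per_month'] < 10000]
print(kept)
```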
# Remove outliers above $10000
top_four = top_four[top_four['money_per_month'] < 10000]
print(top_four.groupby('CountryLive')['money_per_month'].mean())
# Plot data for all four countries as boxplots again
fig = plt.figure(figsize=(8, 8))
plt.style.use('fivethirtyeight')
sns.boxplot(x = 'CountryLive', y = 'money_per_month', data = top_four)
plt.xlabel('Country')
plt.ylabel('money spent')
plt.title(' Distribution of Money Spent Per Month Per Country')
plt.show()
CountryLive
Canada                      113.510961
India                       113.748387
United Kingdom               45.534443
United States of America    155.459187
Name: money_per_month, dtype: float64
This has brought the mean down for the US and India while leaving the UK and Canada unchanged. Some outliers remain for the US and India, and a couple for Canada, and we need to understand their high level of spending. One possible reason for reporting more than USD 2000 per month is attendance at an expensive bootcamp, which we can check.
# Check india's coders who attended bootcamp
india_2000 = top_four[(top_four['CountryLive'] == 'India') & (top_four['money_per_month'] > 2000)]
india_2000['AttendedBootcamp']
1728     0.0
1755     0.0
7989     0.0
8126     0.0
15587    0.0
Name: AttendedBootcamp, dtype: float64
None of these outliers report having attended a bootcamp. The most likely explanation is that they misinterpreted the survey question, which asked for money spent on learning aside from university tuition; perhaps they included university tuition as well. So we are going to remove these outliers. But first, we will check the US and Canadian outliers who spent over USD 2000 per month.
# US Outliers being checked whether they attended bootcamp
us_2000 = top_four[(top_four['CountryLive'] == 'United States of America') & (top_four['money_per_month'] > 2000)]
us_2000['AttendedBootcamp']
415 1.0 441 1.0 484 1.0 718 1.0 723 1.0 1222 1.0 1334 1.0 2432 0.0 2480 0.0 3013 1.0 3144 1.0 3145 1.0 3184 1.0 3260 0.0 3304 1.0 4014 1.0 4884 0.0 5059 1.0 5769 0.0 5894 1.0 6018 0.0 6444 1.0 6528 0.0 6949 1.0 7167 1.0 7194 0.0 7505 1.0 7925 1.0 8030 0.0 8120 0.0 8202 1.0 8901 1.0 9145 1.0 9248 1.0 9559 1.0 9778 1.0 12283 1.0 12877 1.0 13051 0.0 13145 1.0 13357 1.0 13587 1.0 13815 1.0 16211 1.0 16290 0.0 16410 1.0 16616 1.0 16672 1.0 16700 0.0 16719 0.0 16971 0.0 17265 1.0 17361 1.0 Name: AttendedBootcamp, dtype: float64
Most US respondents who reported spending over USD 2,000 per month (38 of 53) attended a bootcamp, which explains their high spending. Those who did not may have misinterpreted the survey question, so we are going to remove them.
# Check whether Canadian high spenders attended a bootcamp
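Reading a long column of 0.0 and 1.0 values by eye is error-prone. As a hedged alternative, the sketch below tabulates bootcamp attendance among high spenders for all countries at once with `pd.crosstab`; the toy frame and its values are assumptions made for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the survey)
toy = pd.DataFrame({
    'CountryLive': ['United States of America'] * 4 + ['India'] * 2 + ['Canada'],
    'money_per_month': [3000.0, 2500.0, 4000.0, 150.0, 5000.0, 60.0, 3500.0],
    'AttendedBootcamp': [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

# Keep only respondents above the USD 2000 threshold used in this section
high_spenders = toy[toy['money_per_month'] > 2000]

# Rows: country; columns: bootcamp flag (0.0 = no, 1.0 = yes); cells: respondent counts
summary = pd.crosstab(high_spenders['CountryLive'], high_spenders['AttendedBootcamp'])
print(summary)
```

Replacing `toy` with `top_four` would produce the same table for the survey respondents.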
can_2000 = top_four[(top_four['CountryLive'] == 'Canada') & (top_four['money_per_month'] > 2000)]
can_2000['AttendedBootcamp']
6590 1.0 13659 1.0 Name: AttendedBootcamp, dtype: float64
The Canadian outliers spent large sums because they attended a bootcamp. Coding bootcamps are very much a part of the coding community's culture, especially in the West, so the money spent by those attending them is plausible. We are going to keep these outliers and remove the Indian and US outliers who did not attend a bootcamp.
# Drop Indian Outliers
top_four = top_four.drop(labels = india_2000.index)
# Drop US Outliers who did not attend bootcamp
us_no_bootcamp = us_2000[us_2000['AttendedBootcamp'] == 0]
top_four = top_four.drop(labels = us_no_bootcamp.index)
# Check again for mean and outliers
print(top_four.groupby('CountryLive')['money_per_month'].mean())
# Plot data for all four countries as boxplots again
fig = plt.figure(figsize=(8, 8))
plt.style.use('fivethirtyeight')
sns.boxplot(x = 'CountryLive', y = 'money_per_month', data = top_four)
plt.xlabel('Country')
plt.ylabel('Money spent per month (USD)')
plt.title('Distribution of Money Spent Per Month Per Country')
plt.show()
CountryLive
Canada                      113.510961
India                        65.758763
United Kingdom               45.534443
United States of America    137.729595
Name: money_per_month, dtype: float64
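The country-by-country removal above follows one rule: drop respondents who spent over USD 2,000 per month unless they attended a bootcamp. A minimal sketch of that rule as a single boolean filter is shown below. The function name `drop_unexplained_high_spenders` is hypothetical. Unlike the steps above, it applies the rule to every country at once (including the United Kingdom, which is not filtered above), and it treats a missing `AttendedBootcamp` value as not having attended, so results on the real data may differ slightly.

```python
import pandas as pd

def drop_unexplained_high_spenders(df, threshold=2000):
    """Drop rows above `threshold` per month unless the respondent attended a bootcamp."""
    high = df['money_per_month'] > threshold
    attended = df['AttendedBootcamp'] == 1
    # Keep everyone at or below the threshold, plus high spenders who attended a bootcamp
    return df[~high | attended]

# Toy data (hypothetical values, not taken from the survey)
toy = pd.DataFrame({
    'CountryLive': ['India', 'India', 'United States of America',
                    'United States of America', 'Canada'],
    'money_per_month': [5000.0, 40.0, 3000.0, 2500.0, 3500.0],
    'AttendedBootcamp': [0.0, 0.0, 1.0, 0.0, 1.0],
})
cleaned = drop_unexplained_high_spenders(toy)
print(cleaned.groupby('CountryLive')['money_per_month'].mean())
```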
With the outliers removed, we can check the absolute and relative frequencies of the respondents left in each country to gauge the size of each market, and set them alongside the average monthly spending.
# Generate frequency tables
abs_market_size = top_four['CountryLive'].value_counts(ascending = False)
relative_market_size = top_four['CountryLive'].value_counts(normalize = True, ascending = False)*100
average_monthly_spending = top_four.groupby('CountryLive')['money_per_month'].mean()
# Combine metrics above into a dataframe for comparison
combined = {'market_size': abs_market_size, 'share_market': relative_market_size, 'spending_power': average_monthly_spending}
market_comparison = pd.DataFrame(data = combined)
market_comparison
CountryLive | market_size | share_market | spending_power |
---|---|---|---|
Canada | 240 | 6.176016 | 113.510961 |
India | 457 | 11.760165 | 65.758763 |
United Kingdom | 279 | 7.179619 | 45.534443 |
United States of America | 2910 | 74.884200 | 137.729595 |
# Plot a pie chart to compare market size
plt.style.use('fivethirtyeight')
# figsize is passed to pandas directly, so a separate plt.figure() call is not needed
market_comparison['market_size'].plot.pie(title = "Comparative Market Size", figsize = (10,10), colormap = "summer", autopct = '%.1f%%')
plt.ylabel('')
plt.xlabel("Country")
plt.show()
# Plot a bar chart to compare spending power
plt.style.use('fivethirtyeight')
# figsize is passed to pandas directly, so a separate plt.figure() call is not needed
market_comparison['spending_power'].plot.bar(title = "Comparative Spending Power", figsize = (10,10), colormap = "winter", rot = 45)
plt.ylabel('Mean money spent per month (USD)')
plt.xlabel("Country")
plt.show()
From the comparisons above, the US has both the largest market share, at about 75%, and the highest spending power, at about USD 138 per month, so it is the clear choice as the top market.
The UK is at the bottom in terms of spending, at only about USD 46 per month, and its market share of about 7% is close to Canada's 6%. It is not a promising market for our products.
That leaves a choice between Canada and India. Canada has a modest market share of about 6.2% but the second-highest spending power after the US, at about USD 114 per month. India has a larger market share, at about 11.8%, but its spending power is only about USD 66 per month. Our subscription costs USD 59 per month, which leaves very little margin in India.
With a more aggressive marketing strategy, Canada's higher spending power can be capitalized on to increase market share, whereas India is a higher-risk market. So, the second recommended market is Canada.
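The comparison between spending power and the USD 59 subscription can be made explicit by dividing one by the other. The sketch below does this with the mean spending values copied from the comparison table; the variable names and the ratio itself are illustrative additions, not a metric used elsewhere in this analysis.

```python
import pandas as pd

SUBSCRIPTION_PRICE = 59  # USD per month, as stated in the text

# Mean monthly spending per country, copied from the comparison table
spending_power = pd.Series({
    'Canada': 113.510961,
    'India': 65.758763,
    'United Kingdom': 45.534443,
    'United States of America': 137.729595,
})

# How many subscriptions the average respondent's monthly spending would cover
price_ratio = (spending_power / SUBSCRIPTION_PRICE).round(2)
print(price_ratio.sort_values(ascending=False))
```

A ratio below 1 means the average respondent already spends less per month than the subscription costs. On these figures that is the case only for the United Kingdom, while India sits just above 1.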