Guided Project : Finding the Best Markets to Advertise In

Search for markets to promote content related to Web Development and Mobile Development for an E-Learning Company

The objective of this project is to strengthen statistical skills such as - summarizing distributions, measuring variability of a distribution, and understanding of z-scores. The project will help in combining these skills together to perform practical statistical data analysis.

Project-Case-Scenario: An E-Learning company offers courses on programming spanning multiple domains - web/mobile development, game development, data science etc. The company wants to identify the best markets to invest in advertising, in order to promote it's products and create a consumer base with learners/ coders.

Approach Taken: To identify the best markets to promote the courses, following becomes pre-requisite information about the potential learners, for the E-Learning company:

What people actually want to study - most preferred courses/ roles?
What is the economic investment range - how much money, learners are willing to invest in online education?
Investments in terms of time - the duration of time learners would like to spend in courses.
Demographic information such as their current residence, current employment status and its type, educational qualifications etc.

Survey Data Set: The above and related information was collected by the E-Learning platform freeCodeCamp in a survey, they conducted in 2017. Since the freeCodeCamp runs a popular Medium Publication with over 400,000 followers, they received responses from coders/ learners with diversified interests and not just web development. They made the results as well as the data of this survey available under free distribution license.

The e-learning company intends to utilize this survey data, instead of conducting a new one - for both economic reasons and also the variety of responses available in dataset.

The dataset can be accessed from the git repository here.

Sample or Population? Whether this data has been considered as population or sample data - will be answered during the workflow. A decent reason to keep scrolling!

Import the libraries

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

#magic line
%matplotlib inline

# to avoid display of truncated columns 
pd.options.display.max_columns = 140
pd.options.display.max_colwidth = 50

Read the data

Pandas is unable to figure out the datatypes of each and every column. As shown by - DtypeWarning: Columns (17,62) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result).

In order to suppress this warning, low_memory option of "read_csv" method, has been set to False.

These columns are ID or survey timestamp related columns and hence dropped at the onset - ID.x, ID.y, NetworkID, Part1EndTime, Part1StartTime, Part2EndTime, Part2StartTime

In [2]:

# read in dataset and store in a dataframe
fcc_survey = pd.read_csv("2017-fCC-New-Coders-Survey-Data.csv", encoding = "latin-1", low_memory=False)

# drop un-necessary columns from the dataset
fcc_survey.drop(columns = ["ID.x", "ID.y", "NetworkID", "Part1EndTime", 
                           "Part1StartTime", "Part2EndTime", "Part2StartTime"], axis = 1, inplace = True)

# display number of rows and columns in dataset
print("Total Rows in survey data :",fcc_survey.shape[0])
print("Total Columns in survey data :",fcc_survey.shape[1])

Total Rows in survey data : 18175
Total Columns in survey data : 129

Size the data - non empty volume of data

The below plot summarizes the actual amount of non-null values present in each of the column of the dataset.

In [3]:

# Plot the non-null proportion of each column

fig, ax = plt.subplots(figsize = (12, 5))
sns.barplot(x = fcc_survey.columns, y = fcc_survey.notnull().sum(), ax = ax, palette = "tab20c_r")
ax.set_xticklabels([])
ax.set_xlabel("Column indices from 0 to 129", fontsize = "large")
ax.set_title("FreeCodeCamp's New Coder Survey Data Availability", fontsize = "x-large")
ax.set_yticklabels([])
ax.set_yticks([False])
ax.set_xticks([False])
sns.despine(top=True, right=True, left=True)
plt.show()

There seems to be lot of columns with empty/ null values. Perhaps familiarity with each of the column and what exactly it captures will help in identifying which fields are useful for the analysis and which fields can be ignored.

Display first 5 records from dataset

In [4]:

fcc_survey.head()

Out[4]:

	Age	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventConferences	CodeEventDjangoGirls	CodeEventFCC	CodeEventGameJam	CodeEventGirlDev	CodeEventHackathons	CodeEventMeetup	CodeEventNodeSchool	CodeEventNone	CodeEventOther	CodeEventRailsBridge	CodeEventRailsGirls	CodeEventStartUpWknd	CodeEventWkdBootcamps	CodeEventWomenCode	CodeEventWorkshops	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentFieldOther	EmploymentStatus	EmploymentStatusOther	ExpectedEarning	FinanciallySupporting	FirstDevJob	Gender	GenderOther	HasChildren	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasHomeMortgage	HasStudentDebt	HomeMortgageOwe	HoursLearning	Income	IsEthnicMinority	IsUnderEmployed	JobApplyWhen	JobInterestBackEnd	JobInterestDataEngr	JobInterestDataSci	JobInterestDevOps	JobInterestFrontEnd	JobInterestFullStack	JobInterestGameDev	JobInterestInfoSec	JobInterestMobile	JobInterestOther	JobInterestProjMngr	JobInterestQAEngr	JobInterestUX	JobPref	JobRelocateYesNo	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	PodcastChangeLog	PodcastCodeNewbie	PodcastCodePen	PodcastDevTea	PodcastDotNET	PodcastGiantRobots	PodcastJSAir	PodcastJSJabber	PodcastNone	PodcastOther	PodcastProgThrowdown	PodcastRubyRogues	PodcastSEDaily	PodcastSERadio	PodcastShopTalk	PodcastTalkPython	PodcastTheWebAhead	ResourceCodecademy	ResourceCodeWars	ResourceCoursera	ResourceCSS	ResourceEdX	ResourceEgghead	ResourceFCC	ResourceHackerRank	ResourceKA	ResourceLynda	ResourceMDN	ResourceOdinProj	ResourceOther	ResourcePluralSight	ResourceSkillcrush	ResourceSO	ResourceTreehouse	ResourceUdacity	ResourceUdemy	ResourceW3S	SchoolDegree	SchoolMajor	StudentDebtOwe	YouTubeCodeCourse	YouTubeCodingTrain	YouTubeCodingTut360	YouTubeComputerphile	YouTubeDerekBanas	YouTubeDevTips	YouTubeEngineeredTruth	YouTubeFCC	YouTubeFunFunFunction	YouTubeGoogleDev	YouTubeLearnCode	YouTubeLevelUpTuts	YouTubeMIT	YouTubeMozillaHacks	YouTubeOther	YouTubeSimplilearn	YouTubeTheNewBoston
0	27.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15 to 29 minutes	Canada	Canada	software development and IT	NaN	Employed for wages	NaN	NaN	NaN	NaN	female	NaN	NaN	1.0	0.0	1.0	0.0	0.0	NaN	15.0	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	start your own business	NaN	NaN	NaN	English	married or domestic partnership	150.0	6.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	34.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	United States of America	United States of America	NaN	NaN	Not working but looking for work	NaN	35000.0	NaN	NaN	male	NaN	NaN	1.0	0.0	1.0	0.0	1.0	NaN	10.0	NaN	0.0	NaN	Within 7 to 12 months	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a nonprofit	1.0	Full-Stack Web Developer	in an office with other developers	English	single, never married	80.0	6.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	1.0	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	21.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	NaN	NaN	NaN	1.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15 to 29 minutes	United States of America	United States of America	software development and IT	NaN	Employed for wages	NaN	70000.0	NaN	NaN	male	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	25.0	13000.0	1.0	0.0	Within 7 to 12 months	1.0	NaN	NaN	1.0	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	work for a medium-sized company	1.0	Front-End Web Developer, Back-End Web Develo...	no preference	Spanish	single, never married	1000.0	5.0	1.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Codenewbie	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	NaN	high school diploma or equivalent (GED)	NaN	NaN	NaN	NaN	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN
3	26.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	I work from home	Brazil	Brazil	software development and IT	NaN	Employed for wages	NaN	40000.0	0.0	NaN	male	NaN	0.0	1.0	1.0	1.0	1.0	0.0	40000.0	14.0	24000.0	0.0	1.0	Within the next 6 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a medium-sized company	NaN	Front-End Web Developer, Full-Stack Web Deve...	from home	Portuguese	married or domestic partnership	0.0	5.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN
4	20.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Portugal	Portugal	NaN	NaN	Not working but looking for work	NaN	140000.0	NaN	NaN	female	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	10.0	NaN	0.0	NaN	Within 7 to 12 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	work for a multinational corporation	1.0	Full-Stack Web Developer, Information Security...	in an office with other developers	Portuguese	single, never married	0.0	24.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	bachelor's degree	Information Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Identify relevant fields for analysis

For the analysis, out of the 129 columns present in dataset, the relevant columns are the ones that can provide information about the coder's:

subject interests
willing to invest in education and the extent of budget
current geographic location
preferred language for communication and/or some related information

Although at this point, certain fields may still not be required for objective of the project, but they might be useful for additional analysis of the data and hence included.

Some of the relevant Columns are (respective actual field names are less formatted):

Age
Country Citizen
Country Live
Employment Field
Employment Status
Hours Learning
Job Role Interest
Language At Home
Money For Learning
Months Programming

Check for Sample Representativity

For the analysis of best markets to promote the online courses, it is essential to understand if the potential coders/ learners are actually interested in learning the subjects/ courses that the E-Company offers.

This is required as without a representative population, this analysis will hold no true value. The dataset should be able to answer:

Are coders interested in Web and/or Mobile development courses
Are coders interested in more than these two courses which are also provided by E-Company

Transform "Job Role Interest" column for better readability

The transformation for role values is required, to avoid same subjects marked as different. For example: Full-Stack Developer and Full stack developer will be updated to full stack developer.

This transformation is conditional and extremely basic. For example the following scenarios, need thorough data clean up. But due to existent lack of data for this column, these scenarios have not been considered for transformation:

answers like
- "all whatever is required to develop tools to revolutionize the mechanical engineering process" or
- "i don't know what the difference is between most of these soz lol" or
- "i'm just learning code to increase my skill set. i see it as a literacy issue."
system engineer or systems engineer
full stack developer or full stack development etc.

In [5]:

print("Total coders who responded for Job Role Interests:",fcc_survey["JobRoleInterest"].notnull().sum(), "\n")

print("Total coders who responded for Job Role Interests (%):",
      int(fcc_survey["JobRoleInterest"].notnull().sum()/fcc_survey.shape[0] * 100))

Total coders who responded for Job Role Interests: 6992 

Total coders who responded for Job Role Interests (%): 38

In [6]:

"""
transform_roles function will take each value/ list of the roles answered by a coder, 
convert into lower case and replace any additional characters that differentiate same subjects
"""
def transform_roles(in_row):
    role_list = []
    if in_row is not np.nan:
        for each_val in in_row.split(","):
            role_list.append(each_val.replace(r"  "," ").replace("-"," ").strip().lower())
#             role_list.append(each_val.replace("development", "developer"))
        in_row = role_list
    return in_row

fcc_survey["JobRoleInterest"] = fcc_survey.JobRoleInterest.apply(transform_roles)

pd.options.display.max_colwidth = 120
# display first 5 records after transformation as sample
print("Sample rows after transformation", "\n")
print(fcc_survey["JobRoleInterest"].head(), "\n")

pd.options.display.max_colwidth = 50

Sample rows after transformation 

0                                                                                                                    NaN
1                                                                                             [full stack web developer]
2       [front end web developer, back end web developer, devops / sysadmin, mobile developer, full stack web developer]
3                                            [front end web developer, full stack web developer, back end web developer]
4    [full stack web developer, information security, mobile developer, front end web developer, back end web developer]
Name: JobRoleInterest, dtype: object

The below code is to prepare for plots to visualize number of courses coders have shown interests in and to get a general idea of the most preferred roles.

In [7]:

"""
Total number of subjects/ roles the coders has shown interest to learn
"""
RolesCount = (
fcc_survey["JobRoleInterest"].apply(lambda x: len(x) 
                                    if x is not np.nan else np.nan).value_counts(normalize = True).mul(100).round(2)
)

"""
Number of times a particular subject/ role has been cited by the coders
"""
roles_dict = {}
for row in fcc_survey["JobRoleInterest"]:
    if row is not np.nan:
        for each_role in row:
            if each_role in roles_dict:
                roles_dict[each_role] += 1
            else:
                roles_dict[each_role] = 1

TopRoles = pd.DataFrame(roles_dict.items())
TopRoles.rename(columns = {0 : "RoleName", 1: "Count"}, inplace = True)

# Compute contribution of each subject percentage wise
TopRoles["Count%"] = (TopRoles["Count"]/ TopRoles["Count"].sum()).mul(100).round(2)

# Select only the top most sought out courses
TopRoles = TopRoles.sort_values("Count%", ascending = False)[:10]

Visualize No. of courses Coders are Interested to Learn & Most preferred Courses

In [8]:

fig = plt.figure(figsize = (18, 6))
ax1, ax2 = fig.add_subplot(1, 2, 1), fig.add_subplot(1,2,2)

# plot number of courses
sns.barplot(x = RolesCount.index.astype(int), y = RolesCount, ax = ax1, palette = "tab20c")

# plot most preferred courses
sns.barplot(y = TopRoles.RoleName, x = TopRoles["Count%"], ax = ax2, palette = "tab20c")

for ax in [ax1, ax2]:
    ax.set_ylabel("")
    if ax == ax1:
        ax.set_ylabel("%(Interested) Coders", fontsize="x-large")
        ax.set_yticks([False])
        ax.set_xlabel("Number of Courses", fontsize="x-large")
        ax.set_title("No. Of Courses vs. % of Interested Coders", fontsize = "x-large")
    else:
        ax.set_xticks([False])
        ax.set_xlabel("%(Interested) Coders", fontsize="x-large")
        ax.set_title("10 Most Preferred Courses", fontsize = "x-large")

sns.despine(top=True, right=True, left=True, bottom=True)

for index, row in RolesCount.items():
    ax1.text(index - 1, row + 0.5, row, color='black', ha="center", fontsize="medium")
    
for p in ax2.patches:
    w, h = p.get_width(), p.get_height()
    ax2.text(w * .5, p.get_y() + 0.5 * h, w, ha='center', va='center', fontsize="medium")    

# set a common title for both the plots
plt.suptitle("Preferred Courses", fontsize="xx-large")
# plt.tight_layout()
plt.show()

Observations based on the above distributions, can be summarized as:

Close to 31% of total coders who answered this survey question, would prefer atleast 1 course.
Collectively about 52% of coders would prefer to learn 2 to 5 courses.
The top 3 of the most preferred courses include some or the other web development related courses.
The 10 most preferred courses also include - mobile development, data science and game development courses.

Identify coders interested in atleast one role - Web Developer or Mobile Developer

The No. of Courses distribution does provide us the information that learners are interested in more than one subject. The Top most preferred distribution provides information on various courses that coders are interested.

It can be easily understood that front end web developer and full stack web developer can be considered specialization under broader term "Web Developer". For now, since the data hasn't undergone thorough cleaning process, the responses have been categorized into two:

Web or Mobile - where coders have mentioned atleast one of the two roles
and Others - where neither of the role has been mentioned

In [9]:

# dummy_pattern = re.compile(r"web\sdev|mobile\sdev")

"""
search_web_mob function : searches for "web dev" or "mobile dev" in the list of roles and classifies as "web or mobile"
                          if either of the two roles are found else returns "others"
"""
def search_web_mob(in_list):
    web_p = re.compile(r"(web\sdeve|mobile\sdeve)")
    if in_list is not np.nan:
        new_list = list(filter(web_p.search, in_list))
        if len(new_list) > 0:
            return "web or mobile"
        else:
            return "others"

fcc_survey["WebOrMobDev"] = fcc_survey["JobRoleInterest"].apply(search_web_mob)

Visualize interest in Two Subjects - Web Development and Mobile Development

In [10]:

fig, ax = plt.subplots(figsize = (12, 1.5))

sns.barplot(ax = ax, y = fcc_survey["WebOrMobDev"].value_counts().index, 
            x = fcc_survey["WebOrMobDev"].value_counts(normalize = True), palette = "tab20c")
for p in ax.patches:
    w, h = p.get_width(), p.get_height()
    ax.text(w * .5, p.get_y() + 0.5 * h, str(round(w*100,2))+"%",
            ha='center', va='center', fontsize="large", color = "white", fontweight="bold")

ax.set_title("Web or Mobile Developer Roles vs Others", fontsize = "x-large")
ax.set_xticks([False])
ax.set_xlabel("")
sns.despine()
plt.show()

Most of the respondent are interested in web and/or mobile development skills.

To answer the representativity of the dataset - based on the above results, it can be concluded that this dataset is respresentative of the population of respondents the E-Company is looking for and hence is valid for further analysis.

New Coders - Locations and Densities

Two distinct columns in dataset provide information related to a coders geographic location - "Country Citizen" and "Country Live" - coder's native and current residence.

The E-Company is only interested in the current location of the coders. Hence the country a coder is currently living at, can be considered as an individual market.

One indicator of a good market is the number of potential customers — the more potential customers in a market, the better. If the ads manage to convince 10% of the 5000 potential customers in market A to buy E-Company's product, then it is better than convincing 100% of the 30 potential customers in market B.

For the analysis, only those coders have been considered who have answered to the Job Role preference.

In [11]:

# drop all rows where coders haven't responded to preferred Job Role Interest Question
fcc_survey = fcc_survey.dropna(subset = ["JobRoleInterest"])

fig, ax = plt.subplots(figsize = (10, 2))

# group countries each having more and equal or less than 1.5% of coder population
sns.barplot(y = ["Above", "Equal or Below"], 
            x = [(fcc_survey["CountryLive"].value_counts(normalize = True) > 0.015).sum(),
                 (fcc_survey["CountryLive"].value_counts(normalize = True) <= 0.015).sum()],
            palette = "tab20c")

ax.set_title("Countries with more than 1.5% respondents Each!", fontsize = "x-large")
ax.set_ylabel("% of Respondents")
ax.set_xticks([False])
for p in ax.patches:
    w, h = p.get_width(), p.get_height()
    ax.text(w * 0.5, p.get_y() + 0.4, int(w), fontsize = "large", color = "white", fontweight = "demibold",
            ha = "center", va = "center")
    
sns.despine()
plt.tight_layout()
plt.show()

In order to have a summarized view, the countries have been grouped together:

There are 8 countries, with each country having more than 1.5% of total respondents.
A total of 129 countries each, have shown responses from 1.5% or below, of the total coders who attempted the survey.

Distribution of coders in the 8 Countries

The following distribution is only to have a granular view for the top 8 countries. The plot and data describe the percentage contribution of coder population for the 8 countries.

The reason behind the waffle chart was my curiosity to replace pie chart. In search for the same came across this stack overflow post and wanted to utilize this opportunity! Many thanks to Guangyang Li for this class definition. This needs installation of pywaffle (pip install pywaffle)

In [12]:

from pywaffle import Waffle
data = dict(fcc_survey["CountryLive"].value_counts(normalize = True).mul(100).astype(int)[:8])

fig = plt.figure(FigureClass=Waffle, rows = 5, values = data, icons = "child", icon_size = 18,
                 legend = {"loc":"right", "bbox_to_anchor": (0,0.45), "fontsize": 12},
                 colors=("#232066", "#983D3D", "#DCB732", "#228587", "#3bdc32", "#4e872d", "#09b4ed", "#e84d86"),
                 figsize = (10,4))
plt.title("Distribution of coders among Top 8 Countries", fontsize = "x-large")
plt.show()
print(data)

{'United States of America': 45, 'India': 7, 'United Kingdom': 4, 'Canada': 3, 'Poland': 1, 'Brazil': 1, 'Germany': 1, 'Australia': 1}

Even with top 8 countries, the dataset is right skewed with maximum survey respondents residing in the USA. The next three countries with substantial number of coders, are India, UK and Canada.

The search for potential markets can be narrowed down to these four countries - USA, India, UK & Canada. However, in order to factor in profitability from these markets, a deeper understanding of investment potential of coders is required.

Interested to Learn - Yes, Willing to Invest - ?

Advertising within markets where most people are only willing to learn for free is extremely unlikely to be profitable for the E-Company. To understand the profit potential of the Four markets, the investment potential of coders needs to be analysed.

For the analysis two columns are most useful:

"Money for Learning" - the amount of money spent by participants from the moment they started coding until the moment they completed the survey (the currency in dataset is USD).
"Months Programming" - the number of months spent by participants to code till the time they completed the survey.

Calculate amount spent per month by coders

In [13]:

fcc_survey = fcc_survey[fcc_survey["CountryLive"].isin(["United States of America", "India", "United Kingdom", "Canada"])]
"""
"MonthsProgramming" has been assigned 1 in case the participant has responded with 0 as value
"""
fcc_survey.loc[fcc_survey["MonthsProgramming"] == 0, "MonthsProgramming"] = 1

# New column for amount of money spent by coder per month
fcc_survey["InvestPerMonth"] = fcc_survey["MoneyForLearning"]/ fcc_survey["MonthsProgramming"]

"""
Restrict survey data to rows with non-null values for InvestPerMonth
"""
fcc_survey = fcc_survey.dropna(subset = ["InvestPerMonth"])

Compare average amount spent per month for the 4 markets

In [14]:

pd.DataFrame(fcc_survey.groupby("CountryLive").InvestPerMonth.mean().sort_values())

Out[14]:

	InvestPerMonth
CountryLive
United Kingdom	45.534443
Canada	113.510961
India	135.100982
United States of America	227.997996

The above results do highlight, that coders are willing to invest monetarily as well. What stands out, is the average amount spent by coders from India is relatively higher than countries like Canada and the UK. To certain extent, this indicates presence of outliers within the data.

Identify extreme outliers

Taking account of few socio-economical metrics like GDP per capita, it would be intuitive to expect people in the UK and Canada to spend more on learning than people in India.

Since the results are otherwise, either there is lack of enough representative data for the three countries or the data has some outliers, or the data could really be correct.

Identify the outliers - Violin and Boxplot Visualizations

One of the primary methods to detect outliers is using the InterQuartile Deviation method (IQD). This method is used by violin and boxplots as well, highlighting the outliers from the median dispersion of the data.

More on IQD

In [15]:

fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (12, 5))
sns.boxplot(ax = ax, y = "CountryLive", data = fcc_survey, x = "InvestPerMonth")
ax.set_ylabel("") #"Markets", fontsize = "large")
ax.set_xlabel("Amount Invested Per Month", fontsize = "large")
ax.set_title("Dispersion of Investment per month for the 4 Markets", fontsize = "x-large")
plt.show()

USA shows two extremely high value for Investments made per month. Although possible, these values seem unlikely and hence are eliminated outright.

In [16]:

# filter dataframe with Invest Per Month values below 20000 USD
fcc_survey = fcc_survey[fcc_survey.InvestPerMonth < 20000]

Isolated boxplot visualization for Each Market - These visualizations are just for learning with fun!

In [17]:

fig, axes = plt.subplots(nrows = 1, ncols = 4, figsize = (18, 5.5))

markets = ["United States of America", "India", "Canada", "United Kingdom"]

for each in markets:
    ax_pos = markets.index(each)
    sns.violinplot(fcc_survey.loc[fcc_survey["CountryLive"] == each, "InvestPerMonth"], orient = "v", 
                ax = axes[ax_pos], palette = "tab20c")
    axes[ax_pos].set_title(str(each), color = "Red", fontsize = "large")
    axes[ax_pos].set_ylabel("")
# plt.tight_layout()
plt.suptitle("Dispersion of Investment per Month for the 4 Markets At Individual Scale", fontsize="xx-large")
plt.show()

fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (12, 5))
sns.violinplot(ax = ax, x = "CountryLive", data = fcc_survey, y = "InvestPerMonth")
ax.set_xlabel("") #"Markets", fontsize = "large")
ax.set_ylabel("Amount Invested Per Month", fontsize = "large")
ax.set_title("Comparison of Investment per month for the 4 Markets", fontsize = "x-large")
# ax.set_ylim(-500, 4000)
sns.despine()
plt.show()

Boxplots highlight the outliers for individual market. The most intriguing observation here is, that even though UK has few outliers, they still do not exceed the lowest outliers present in India or Canada.

Examine outliers for India

This small analysis is to confirm the outliers are really outliers for the country India. The analysis is based on the assumption, that the coder may have participated in boot camps or tech events. Since these camps and events might require course or registration fees the same can be considered as investments. The threshold value for this analysis has been set to 2500 USD.

In [18]:

# dropna to avoid any columns which have nodata
fcc_survey.loc[(fcc_survey.InvestPerMonth >= 2500) & (fcc_survey.CountryLive == "India")].dropna(axis = 1)

Out[18]:

	Age	CityPopulation	CountryCitizen	CountryLive	EmploymentStatus	ExpectedEarning	Gender	HasDebt	HasFinancialDependents	HasHighSpdInternet	HoursLearning	JobApplyWhen	JobPref	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	SchoolDegree	SchoolMajor	WebOrMobDev	InvestPerMonth
1728	24.0	between 100,000 and 1 million	India	India	A stay-at-home parent or homemaker	70000.0	male	0.0	0.0	1.0	30.0	Within the next 6 months	work for a startup	[user experience designer, mobile developer, p...	in an office with other developers	Bengali	single, never married	20000.0	4.0	bachelor's degree	Computer Programming	web or mobile	5000.000000
1755	20.0	more than 1 million	India	India	Not working and not looking for work	100000.0	male	0.0	0.0	1.0	10.0	I haven't decided	work for a multinational corporation	[information security, full stack web develope...	no preference	Hindi	single, never married	50000.0	15.0	bachelor's degree	Computer Science	web or mobile	3333.333333
7989	28.0	between 100,000 and 1 million	India	India	Employed for wages	500000.0	male	1.0	1.0	1.0	20.0	more than 12 months from now	work for a multinational corporation	[user experience designer, back end web develo...	in an office with other developers	Marathi	married or domestic partnership	5000.0	1.0	bachelor's degree	Aerospace and Aeronautical Engineering	web or mobile	5000.000000
8126	22.0	more than 1 million	India	India	Not working but looking for work	80000.0	male	1.0	0.0	1.0	80.0	I'm already applying	work for a startup	[back end web developer, full stack web develo...	in an office with other developers	Malayalam	single, never married	5000.0	1.0	bachelor's degree	Electrical and Electronics Engineering	web or mobile	5000.000000
13398	19.0	more than 1 million	India	India	Unable to work	100000.0	male	0.0	0.0	0.0	30.0	I haven't decided	work for a multinational corporation	[mobile developer]	no preference	Hindi	single, never married	20000.0	2.0	bachelor's degree	Computer Science	web or mobile	10000.000000
15587	27.0	more than 1 million	India	India	Employed for wages	65000.0	male	1.0	1.0	1.0	36.0	I haven't decided	work for a startup	[full stack web developer, data scientist]	from home	Hindi	single, never married	100000.0	24.0	bachelor's degree	Communications	web or mobile	4166.666667

There is a lack of information for the attended camps or events (if any) for these coders. Also, to consider they might have included Student Loan as investments, that information is also missing. Since the validity of the data can't be established for now, these values will be treated as outliers.

Examine outliers for Canada

In [19]:

fcc_survey[(fcc_survey.InvestPerMonth >= 2500) & (fcc_survey.CountryLive == "Canada")].dropna(axis = 1)

Out[19]:

	Age	AttendedBootcamp	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	CityPopulation	CodeEventConferences	CodeEventFCC	CodeEventMeetup	CodeEventWorkshops	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentStatus	ExpectedEarning	Gender	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasHomeMortgage	HasServedInMilitary	HasStudentDebt	HomeMortgageOwe	HoursLearning	Income	IsEthnicMinority	IsReceiveDisabilitiesBenefits	IsSoftwareDev	IsUnderEmployed	JobApplyWhen	JobInterestBackEnd	JobInterestDataSci	JobInterestFrontEnd	JobInterestFullStack	JobInterestGameDev	JobInterestMobile	JobInterestProjMngr	JobInterestUX	JobPref	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	PodcastNone	ResourceCodecademy	ResourceCodeWars	ResourceCoursera	ResourceFCC	ResourceSO	ResourceW3S	SchoolDegree	SchoolMajor	YouTubeFCC	YouTubeMIT	WebOrMobDev	InvestPerMonth
13659	24.0	1.0	0.0	0.0	Bloc.io	1.0	more than 1 million	1.0	1.0	1.0	1.0	30 to 44 minutes	Canada	Canada	finance	Employed for wages	60000.0	male	1.0	0.0	1.0	1.0	0.0	0.0	250000.0	10.0	140000.0	1.0	1.0	0.0	0.0	I haven't decided	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	work for a multinational corporation	[mobile developer, full stack web developer, d...	from home	Yue (Cantonese) Chinese	single, never married	10000.0	2.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	bachelor's degree	Finance	1.0	1.0	web or mobile	5000.0

Taking 2500 USD as threshold, there is only one outlier for Canada data. The coder has responded in positive for boot camp training, but no information is available for any kind of student debt or loan. This could be case of lump-sum fees paid for a course/ boot camp instead of subscription based model.

Examine outliers for USA

In [20]:

fcc_survey[(fcc_survey.InvestPerMonth >= 10000) & (fcc_survey.CountryLive == "United States of America")].dropna(axis = 1)

Out[20]:

	Age	AttendedBootcamp	CityPopulation	CountryCitizen	CountryLive	EmploymentStatus	Gender	HasDebt	HasFinancialDependents	HasHighSpdInternet	HoursLearning	IsEthnicMinority	IsReceiveDisabilitiesBenefits	JobApplyWhen	JobPref	JobRelocateYesNo	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	SchoolDegree	WebOrMobDev	InvestPerMonth
3930	31.0	0.0	between 100,000 and 1 million	United States of America	United States of America	Not working and not looking for work	male	1.0	0.0	1.0	50.0	1.0	0.0	Within the next 6 months	work for a startup	1.0	[devops / sysadmin, front end web developer, f...	no preference	English	married or domestic partnership	65000.0	6.0	bachelor's degree	web or mobile	10833.333333
6805	46.0	1.0	between 100,000 and 1 million	United States of America	United States of America	Not working but looking for work	male	1.0	0.0	1.0	45.0	1.0	0.0	Within the next 6 months	work for a multinational corporation	1.0	[full stack web developer, game developer, pro...	no preference	English	married or domestic partnership	15000.0	1.0	bachelor's degree	web or mobile	15000.000000
7198	32.0	0.0	more than 1 million	United States of America	United States of America	Employed for wages	male	1.0	0.0	1.0	4.0	1.0	0.0	I'm already applying	work for a multinational corporation	0.0	[full stack web developer, back end web develo...	no preference	Spanish	single, never married	70000.0	5.0	professional degree (MBA, MD, JD, etc.)	web or mobile	14000.000000
16650	29.0	0.0	more than 1 million	United States of America	United States of America	Not working but looking for work	male	1.0	1.0	1.0	40.0	1.0	1.0	I'm already applying	work for a multinational corporation	1.0	[product manager, data engineer, full stack we...	in an office with other developers	English	married or domestic partnership	200000.0	12.0	associate's degree	web or mobile	16666.666667
16997	27.0	0.0	more than 1 million	United States of America	United States of America	Employed for wages	female	1.0	1.0	1.0	12.0	1.0	0.0	Within 7 to 12 months	work for a medium-sized company	1.0	[mobile developer, game developer, user experi...	in an office with other developers	English	single, never married	12500.0	1.0	some college credit, no degree	web or mobile	12500.000000
17231	50.0	0.0	less than 100,000	Kenya	United States of America	Not working but looking for work	female	0.0	1.0	1.0	1.0	1.0	0.0	Within the next 6 months	work for a nonprofit	0.0	[front end web developer]	in an office with other developers	English	married or domestic partnership	30000.0	2.0	bachelor's degree	web or mobile	15000.000000

Here again, there is not much information available for BootCamps or CodeEvents for these coders. Time to change the strategy!

Summary of statistics for each market With Extreme Outliers - Following table describes the data distribution in terms of percentiles.

In [21]:

# pd.set_option('display.float_format', lambda x: '%.3f' % x)
fcc_survey.groupby("CountryLive").InvestPerMonth.describe(percentiles = [0.25, 0.50, 0.75, 0.95, 0.99])

Out[21]:

	count	mean	std	min	25%	50%	75%	95%	99%	max
CountryLive
Canada	240.0	113.510961	441.014158	0.0	0.0	0.000000	25.000000	592.803030	1902.500000	5000.000000
India	463.0	135.100982	692.960378	0.0	0.0	0.000000	13.250000	491.650000	3650.000000	10000.000000
United Kingdom	279.0	45.534443	162.311836	0.0	0.0	0.000000	20.416667	200.000000	1044.000000	1400.000000
United States of America	2931.0	183.800110	864.767219	0.0	0.0	3.333333	50.000000	833.333333	3308.333333	16666.666667

The 99th percentile for each of the market, clearly indicates the concentration of data points is well below, the maximum amount spent per month, for the 3 markets - USA, UK & Canada.

Eliminate extreme Outliers

Examine Z-Score for each investment per month made by coders grouped by markets

Another method of identifying outlier is by Mean and Standard Deviation Method (SD) which measures the distance of each value from the mean of all of the sample values.

Although it is more susceptible to changes than IQD method, for the purpose of learning let's rule out outliers based on this method.

Time to answer the question - Sample or Population? For this project, each of the market has been taken as sample data of the respective markets. Why? Because the survey hasn't been answered by every possible learner/ coder from each of the country! These learners are the ones who actively participated in the survey and thus represent a sample from each country/market.

The surveyees may have reached out to their peers before or after the survey who in turn could be potential customers for E-Company there by making the other coders and themselves collectively a population.

Where does this understanding comes into play in this project - Delta Degrees of Freedom should be 0 or 1. For this project ddof is 1.

In [22]:

# import zscore module
from scipy.stats import zscore

# group each market/ country
country_group = fcc_survey.groupby("CountryLive")

# calculate zscore for each value for InvestPerMonth column with ddof = 1
fcc_survey["InvestZScore"] = country_group.InvestPerMonth.transform(lambda x: zscore(x, ddof=1))

In the standard deviation method, the default value for SD is 3. That is, if a data point in a sample variable, has a standard deviation equal to or more than 3, it can be considered an outlier. Taking this into account, following table filters out, all the data points for each of the markets which have SD equal to or more than 3.

It has already been established, that UK's outliers are still well within the lowest outlier data points of Canada and India, hence not included in the table.

In [23]:

fcc_survey.loc[(country_group.InvestZScore.apply(lambda x:x >= 3)) & (fcc_survey.CountryLive != "United Kingdom"),
               ["InvestPerMonth","CountryLive", "InvestZScore"]].sort_values(["CountryLive", "InvestZScore"],
                                                                             ascending = False).reset_index(drop = True)

Out[23]:

	InvestPerMonth	CountryLive	InvestZScore
0	16666.666667	United States of America	19.060466
1	15000.000000	United States of America	17.133166
2	15000.000000	United States of America	17.133166
3	14000.000000	United States of America	15.976785
4	12500.000000	United States of America	14.242214
5	10833.333333	United States of America	12.314913
6	9000.000000	United States of America	10.194882
7	8000.000000	United States of America	9.038502
8	8000.000000	United States of America	9.038502
9	6666.666667	United States of America	7.496661
10	6500.000000	United States of America	7.303931
11	5666.666667	United States of America	6.340280
12	5333.333333	United States of America	5.954820
13	5000.000000	United States of America	5.569360
14	5000.000000	United States of America	5.569360
15	5000.000000	United States of America	5.569360
16	5000.000000	United States of America	5.569360
17	5000.000000	United States of America	5.569360
18	4666.666667	United States of America	5.183900
19	4550.000000	United States of America	5.048989
20	4500.000000	United States of America	4.991170
21	4250.000000	United States of America	4.702075
22	4000.000000	United States of America	4.412979
23	4000.000000	United States of America	4.412979
24	4000.000000	United States of America	4.412979
25	4000.000000	United States of America	4.412979
26	3500.000000	United States of America	3.834789
27	3333.333333	United States of America	3.642059
28	3333.333333	United States of America	3.642059
29	3333.333333	United States of America	3.642059
30	3250.000000	United States of America	3.545694
31	3250.000000	United States of America	3.545694
32	3250.000000	United States of America	3.545694
33	3166.666667	United States of America	3.449329
34	3000.000000	United States of America	3.256599
35	3000.000000	United States of America	3.256599
36	3000.000000	United States of America	3.256599
37	3000.000000	United States of America	3.256599
38	2857.142857	United States of America	3.091402
39	10000.000000	India	14.235877
40	5000.000000	India	7.020458
41	5000.000000	India	7.020458
42	5000.000000	India	7.020458
43	4166.666667	India	5.817888
44	3333.333333	India	4.615318
45	5000.000000	Canada	11.080118
46	2337.166667	Canada	5.042141
47	2000.000000	Canada	4.277616
48	1750.000000	Canada	3.710740
49	1500.000000	Canada	3.143865

Focusing on India and Canada - based on the table above we can observe that the SD jumps after 5 (SD = 5). As a student/learner I can think of these options:

match the results with that of guided project.
choose the standard rule of eliminating data points having more than 3 SD.
or customize the selection of SD for each market.

Taking the liberty of learning phase, SD has been set to 4.

Eliminate data points, away from the mean by 4 standard deviations

In [24]:

usa_df = fcc_survey.query("InvestZScore <= 4 & CountryLive == 'United States of America'")
canada_df = fcc_survey.query("InvestZScore <= 4 & CountryLive == 'Canada'")
india_df = fcc_survey.query("InvestZScore <= 4 & CountryLive == 'India'")
uk_df = fcc_survey.query("CountryLive == 'United Kingdom'")

fcc_survey_filtered = pd.concat([uk_df, usa_df, canada_df, india_df])

Summary of statistics for each market without extreme outliers

In [25]:

fcc_survey_filtered.groupby("CountryLive").InvestPerMonth.describe(percentiles = [0.25, 0.50, 0.75, 0.95, 0.99])

Out[25]:

	count	mean	std	min	25%	50%	75%	95%	99%	max
CountryLive
Canada	237.0	75.550481	241.474691	0.0	0.0	0.0	22.222222	433.333333	1333.333333	1750.0
India	457.0	65.758763	233.452738	0.0	0.0	0.0	12.500000	340.666667	1110.000000	2000.0
United Kingdom	279.0	45.534443	162.311836	0.0	0.0	0.0	20.416667	200.000000	1044.000000	1400.0
United States of America	2905.0	119.306296	382.716521	0.0	0.0	3.0	45.000000	666.666667	2221.333333	3500.0

Visualize amount spent per month for each market without extreme outliers

In [26]:

fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (12, 5))
sns.boxplot(ax = ax, y = "CountryLive", data = fcc_survey_filtered, x = "InvestPerMonth")
ax.set_ylabel("") #"Markets", fontsize = "large")
ax.set_xlabel("Amount Invested Per Month", fontsize = "large")
ax.set_title("Dispersion of Investment per month for the 4 Markets", fontsize = "x-large")
ax.set_xlim(-100, 4000)
sns.despine()
plt.show()

In [27]:

pd.DataFrame({"% of Surveyees": fcc_survey_filtered.CountryLive.value_counts(normalize = True).mul(100).astype(int)})

Out[27]:

	% of Surveyees
United States of America	74
India	11
United Kingdom	7
Canada	6

Conclusion Time! - Choosing the Two Best Markets

First is USA, Which is the second one?

The E-Company follows a monthly subscription model, priced at 59 USD as per month. USA stands out as the clear choice for E-Company to invest in advertisement campaign.

Coming to the question of second market, following observations need to be considered:

there were more participants from UK (7%) than Canada(6%), however coders from UK are willing to spend less than the base price of subscription. - So UK as a market can be dropped from this contest.
between the remaining two markets, Canada and India, India has shown higher percentage of coders(11%) participating in the survey, as compared to Canada(6%).
the average amount to be spent by India is lesser than that of Canada, but it is still competent. also, it is higher than the base price of subscription model. - Both these factors make investing in India as economic as in Canada.
also the rapidly increasing market potential of various industries in India, support for higher innovation and digitalization by the government policies, make compelling reasons for upskill of current, or skill development for future workforce.

What can be done for Second one?

It would be prudent to decide between India and Canada based on so close competitive results. Few options that the E-Company can do:

Select USA and Canada with higher potential for spending but loose on a larger potential consumer base.
Select USA and India with a mix of high and low spending but loose on a willing to afford consumer base.
Select USA as clear choice, and create small test-campaigns for Canada and India. Take the results of these small campaigns into further considerations and then decide the second market.
Select all the 3 - USA, Canada and India, divide the budget proportionately and start teaching! (Makes you wonder what strategies DataQuest employs/ employed!?)

Team member or Team Player?

The last question from the guided project is - If we had a marketing team in our company, would it be better to just send them our results and let them use their domain knowledge to take the best decision?

The entire project is to help the organization, E-Company to increase its business portfolio, which requires all the teams to work together and in conjunction. Assuming that this entire process should aid the marketing team in deciding the best markets, in addition to their domain knowledge, it won't be wrong to keep the marketing team informed of the findings of the project and options available on the basis.

However the final decision can very well be deliberated by the marketing team.

An interesting read for data-science and marketing is here

I genuinely enjoyed working on this project, hope you enjoyed reading it.