The objective of this project is to strengthen statistical skills such as - summarizing distributions, measuring variability of a distribution, and understanding of z-scores. The project will help in combining these skills together to perform practical statistical data analysis.
Project-Case-Scenario: An E-Learning company offers courses on programming spanning multiple domains - web/mobile development, game development, data science etc. The company wants to identify the best markets to invest in advertising, in order to promote it's products and create a consumer base with learners/ coders.
Approach Taken: To identify the best markets to promote the courses, following becomes pre-requisite information about the potential learners, for the E-Learning company:
Survey Data Set: The above and related information was collected by the E-Learning platform freeCodeCamp in a survey, they conducted in 2017. Since the freeCodeCamp runs a popular Medium Publication with over 400,000 followers, they received responses from coders/ learners with diversified interests and not just web development. They made the results as well as the data of this survey available under free distribution license.
The e-learning company intends to utilize this survey data, instead of conducting a new one - for both economic reasons and also the variety of responses available in dataset.
The dataset can be accessed from the git repository here.
Sample or Population? Whether this data has been considered as population or sample data - will be answered during the workflow. A decent reason to keep scrolling!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
#magic line
%matplotlib inline
# to avoid display of truncated columns
pd.options.display.max_columns = 140
pd.options.display.max_colwidth = 50
Pandas is unable to figure out the datatypes of each and every column. As shown by - DtypeWarning: Columns (17,62) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result).
In order to suppress this warning, low_memory option of "read_csv" method, has been set to False.
These columns are ID or survey timestamp related columns and hence dropped at the onset - ID.x, ID.y, NetworkID, Part1EndTime, Part1StartTime, Part2EndTime, Part2StartTime
# read in dataset and store in a dataframe
fcc_survey = pd.read_csv("2017-fCC-New-Coders-Survey-Data.csv", encoding = "latin-1", low_memory=False)
# drop un-necessary columns from the dataset
fcc_survey.drop(columns = ["ID.x", "ID.y", "NetworkID", "Part1EndTime",
"Part1StartTime", "Part2EndTime", "Part2StartTime"], axis = 1, inplace = True)
# display number of rows and columns in dataset
print("Total Rows in survey data :",fcc_survey.shape[0])
print("Total Columns in survey data :",fcc_survey.shape[1])
Total Rows in survey data : 18175 Total Columns in survey data : 129
The below plot summarizes the actual amount of non-null values present in each of the column of the dataset.
# Plot the non-null proportion of each column
fig, ax = plt.subplots(figsize = (12, 5))
sns.barplot(x = fcc_survey.columns, y = fcc_survey.notnull().sum(), ax = ax, palette = "tab20c_r")
ax.set_xticklabels([])
ax.set_xlabel("Column indices from 0 to 129", fontsize = "large")
ax.set_title("FreeCodeCamp's New Coder Survey Data Availability", fontsize = "x-large")
ax.set_yticklabels([])
ax.set_yticks([False])
ax.set_xticks([False])
sns.despine(top=True, right=True, left=True)
plt.show()
There seems to be lot of columns with empty/ null values. Perhaps familiarity with each of the column and what exactly it captures will help in identifying which fields are useful for the analysis and which fields can be ignored.
fcc_survey.head()
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | CodeEventFCC | CodeEventGameJam | CodeEventGirlDev | CodeEventHackathons | CodeEventMeetup | CodeEventNodeSchool | CodeEventNone | CodeEventOther | CodeEventRailsBridge | CodeEventRailsGirls | CodeEventStartUpWknd | CodeEventWkdBootcamps | CodeEventWomenCode | CodeEventWorkshops | CommuteTime | CountryCitizen | CountryLive | EmploymentField | EmploymentFieldOther | EmploymentStatus | EmploymentStatusOther | ExpectedEarning | FinanciallySupporting | FirstDevJob | Gender | GenderOther | HasChildren | HasDebt | HasFinancialDependents | HasHighSpdInternet | HasHomeMortgage | HasServedInMilitary | HasStudentDebt | HomeMortgageOwe | HoursLearning | Income | IsEthnicMinority | IsReceiveDisabilitiesBenefits | IsSoftwareDev | IsUnderEmployed | JobApplyWhen | JobInterestBackEnd | JobInterestDataEngr | JobInterestDataSci | JobInterestDevOps | JobInterestFrontEnd | JobInterestFullStack | JobInterestGameDev | JobInterestInfoSec | JobInterestMobile | JobInterestOther | JobInterestProjMngr | JobInterestQAEngr | JobInterestUX | JobPref | JobRelocateYesNo | JobRoleInterest | JobWherePref | LanguageAtHome | MaritalStatus | MoneyForLearning | MonthsProgramming | PodcastChangeLog | PodcastCodeNewbie | PodcastCodePen | PodcastDevTea | PodcastDotNET | PodcastGiantRobots | PodcastJSAir | PodcastJSJabber | PodcastNone | PodcastOther | PodcastProgThrowdown | PodcastRubyRogues | PodcastSEDaily | PodcastSERadio | PodcastShopTalk | PodcastTalkPython | PodcastTheWebAhead | ResourceCodecademy | ResourceCodeWars | ResourceCoursera | ResourceCSS | ResourceEdX | ResourceEgghead | ResourceFCC | ResourceHackerRank | ResourceKA | ResourceLynda | ResourceMDN | ResourceOdinProj | ResourceOther | ResourcePluralSight | ResourceSkillcrush | ResourceSO | ResourceTreehouse | ResourceUdacity | ResourceUdemy | ResourceW3S | SchoolDegree | SchoolMajor | StudentDebtOwe | YouTubeCodeCourse | YouTubeCodingTrain | YouTubeCodingTut360 | YouTubeComputerphile | YouTubeDerekBanas | YouTubeDevTips | YouTubeEngineeredTruth | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 to 29 minutes | Canada | Canada | software development and IT | NaN | Employed for wages | NaN | NaN | NaN | NaN | female | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | 15.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | start your own business | NaN | NaN | NaN | English | married or domestic partnership | 150.0 | 6.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | United States of America | United States of America | NaN | NaN | Not working but looking for work | NaN | 35000.0 | NaN | NaN | male | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 10.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | Within 7 to 12 months | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | work for a nonprofit | 1.0 | Full-Stack Web Developer | in an office with other developers | English | single, never married | 80.0 | 6.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | 1.0 | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 to 29 minutes | United States of America | United States of America | software development and IT | NaN | Employed for wages | NaN | 70000.0 | NaN | NaN | male | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN | 25.0 | 13000.0 | 1.0 | 0.0 | 0.0 | 0.0 | Within 7 to 12 months | 1.0 | NaN | NaN | 1.0 | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | work for a medium-sized company | 1.0 | Front-End Web Developer, Back-End Web Develo... | no preference | Spanish | single, never married | 1000.0 | 5.0 | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | Codenewbie | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | high school diploma or equivalent (GED) | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | I work from home | Brazil | Brazil | software development and IT | NaN | Employed for wages | NaN | 40000.0 | 0.0 | NaN | male | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 40000.0 | 14.0 | 24000.0 | 0.0 | 0.0 | 0.0 | 1.0 | Within the next 6 months | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | work for a medium-sized company | NaN | Front-End Web Developer, Full-Stack Web Deve... | from home | Portuguese | married or domestic partnership | 0.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Portugal | Portugal | NaN | NaN | Not working but looking for work | NaN | 140000.0 | NaN | NaN | female | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN | 10.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | Within 7 to 12 months | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | work for a multinational corporation | 1.0 | Full-Stack Web Developer, Information Security... | in an office with other developers | Portuguese | single, never married | 0.0 | 24.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | bachelor's degree | Information Technology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
For the analysis, out of the 129 columns present in dataset, the relevant columns are the ones that can provide information about the coder's:
Although at this point, certain fields may still not be required for objective of the project, but they might be useful for additional analysis of the data and hence included.
Some of the relevant Columns are (respective actual field names are less formatted):
For the analysis of best markets to promote the online courses, it is essential to understand if the potential coders/ learners are actually interested in learning the subjects/ courses that the E-Company offers.
This is required as without a representative population, this analysis will hold no true value. The dataset should be able to answer:
Transform "Job Role Interest" column for better readability
The transformation for role values is required, to avoid same subjects marked as different. For example: Full-Stack Developer and Full stack developer will be updated to full stack developer.
This transformation is conditional and extremely basic. For example the following scenarios, need thorough data clean up. But due to existent lack of data for this column, these scenarios have not been considered for transformation:
print("Total coders who responded for Job Role Interests:",fcc_survey["JobRoleInterest"].notnull().sum(), "\n")
print("Total coders who responded for Job Role Interests (%):",
int(fcc_survey["JobRoleInterest"].notnull().sum()/fcc_survey.shape[0] * 100))
Total coders who responded for Job Role Interests: 6992 Total coders who responded for Job Role Interests (%): 38
"""
transform_roles function will take each value/ list of the roles answered by a coder,
convert into lower case and replace any additional characters that differentiate same subjects
"""
def transform_roles(in_row):
role_list = []
if in_row is not np.nan:
for each_val in in_row.split(","):
role_list.append(each_val.replace(r" "," ").replace("-"," ").strip().lower())
# role_list.append(each_val.replace("development", "developer"))
in_row = role_list
return in_row
fcc_survey["JobRoleInterest"] = fcc_survey.JobRoleInterest.apply(transform_roles)
pd.options.display.max_colwidth = 120
# display first 5 records after transformation as sample
print("Sample rows after transformation", "\n")
print(fcc_survey["JobRoleInterest"].head(), "\n")
pd.options.display.max_colwidth = 50
Sample rows after transformation 0 NaN 1 [full stack web developer] 2 [front end web developer, back end web developer, devops / sysadmin, mobile developer, full stack web developer] 3 [front end web developer, full stack web developer, back end web developer] 4 [full stack web developer, information security, mobile developer, front end web developer, back end web developer] Name: JobRoleInterest, dtype: object
The below code is to prepare for plots to visualize number of courses coders have shown interests in and to get a general idea of the most preferred roles.
"""
Total number of subjects/ roles the coders has shown interest to learn
"""
RolesCount = (
fcc_survey["JobRoleInterest"].apply(lambda x: len(x)
if x is not np.nan else np.nan).value_counts(normalize = True).mul(100).round(2)
)
"""
Number of times a particular subject/ role has been cited by the coders
"""
roles_dict = {}
for row in fcc_survey["JobRoleInterest"]:
if row is not np.nan:
for each_role in row:
if each_role in roles_dict:
roles_dict[each_role] += 1
else:
roles_dict[each_role] = 1
TopRoles = pd.DataFrame(roles_dict.items())
TopRoles.rename(columns = {0 : "RoleName", 1: "Count"}, inplace = True)
# Compute contribution of each subject percentage wise
TopRoles["Count%"] = (TopRoles["Count"]/ TopRoles["Count"].sum()).mul(100).round(2)
# Select only the top most sought out courses
TopRoles = TopRoles.sort_values("Count%", ascending = False)[:10]
Visualize No. of courses Coders are Interested to Learn & Most preferred Courses
fig = plt.figure(figsize = (18, 6))
ax1, ax2 = fig.add_subplot(1, 2, 1), fig.add_subplot(1,2,2)
# plot number of courses
sns.barplot(x = RolesCount.index.astype(int), y = RolesCount, ax = ax1, palette = "tab20c")
# plot most preferred courses
sns.barplot(y = TopRoles.RoleName, x = TopRoles["Count%"], ax = ax2, palette = "tab20c")
for ax in [ax1, ax2]:
ax.set_ylabel("")
if ax == ax1:
ax.set_ylabel("%(Interested) Coders", fontsize="x-large")
ax.set_yticks([False])
ax.set_xlabel("Number of Courses", fontsize="x-large")
ax.set_title("No. Of Courses vs. % of Interested Coders", fontsize = "x-large")
else:
ax.set_xticks([False])
ax.set_xlabel("%(Interested) Coders", fontsize="x-large")
ax.set_title("10 Most Preferred Courses", fontsize = "x-large")
sns.despine(top=True, right=True, left=True, bottom=True)
for index, row in RolesCount.items():
ax1.text(index - 1, row + 0.5, row, color='black', ha="center", fontsize="medium")
for p in ax2.patches:
w, h = p.get_width(), p.get_height()
ax2.text(w * .5, p.get_y() + 0.5 * h, w, ha='center', va='center', fontsize="medium")
# set a common title for both the plots
plt.suptitle("Preferred Courses", fontsize="xx-large")
# plt.tight_layout()
plt.show()
Observations based on the above distributions, can be summarized as:
Identify coders interested in atleast one role - Web Developer or Mobile Developer
The No. of Courses distribution does provide us the information that learners are interested in more than one subject. The Top most preferred distribution provides information on various courses that coders are interested.
It can be easily understood that front end web developer and full stack web developer can be considered specialization under broader term "Web Developer". For now, since the data hasn't undergone thorough cleaning process, the responses have been categorized into two:
# dummy_pattern = re.compile(r"web\sdev|mobile\sdev")
"""
search_web_mob function : searches for "web dev" or "mobile dev" in the list of roles and classifies as "web or mobile"
if either of the two roles are found else returns "others"
"""
def search_web_mob(in_list):
web_p = re.compile(r"(web\sdeve|mobile\sdeve)")
if in_list is not np.nan:
new_list = list(filter(web_p.search, in_list))
if len(new_list) > 0:
return "web or mobile"
else:
return "others"
fcc_survey["WebOrMobDev"] = fcc_survey["JobRoleInterest"].apply(search_web_mob)
Visualize interest in Two Subjects - Web Development and Mobile Development
fig, ax = plt.subplots(figsize = (12, 1.5))
sns.barplot(ax = ax, y = fcc_survey["WebOrMobDev"].value_counts().index,
x = fcc_survey["WebOrMobDev"].value_counts(normalize = True), palette = "tab20c")
for p in ax.patches:
w, h = p.get_width(), p.get_height()
ax.text(w * .5, p.get_y() + 0.5 * h, str(round(w*100,2))+"%",
ha='center', va='center', fontsize="large", color = "white", fontweight="bold")
ax.set_title("Web or Mobile Developer Roles vs Others", fontsize = "x-large")
ax.set_xticks([False])
ax.set_xlabel("")
sns.despine()
plt.show()
Most of the respondent are interested in web and/or mobile development skills.
To answer the representativity of the dataset - based on the above results, it can be concluded that this dataset is respresentative of the population of respondents the E-Company is looking for and hence is valid for further analysis.
Two distinct columns in dataset provide information related to a coders geographic location - "Country Citizen" and "Country Live" - coder's native and current residence.
The E-Company is only interested in the current location of the coders. Hence the country a coder is currently living at, can be considered as an individual market.
One indicator of a good market is the number of potential customers — the more potential customers in a market, the better. If the ads manage to convince 10% of the 5000 potential customers in market A to buy E-Company's product, then it is better than convincing 100% of the 30 potential customers in market B.
For the analysis, only those coders have been considered who have answered to the Job Role preference.
# drop all rows where coders haven't responded to preferred Job Role Interest Question
fcc_survey = fcc_survey.dropna(subset = ["JobRoleInterest"])
fig, ax = plt.subplots(figsize = (10, 2))
# group countries each having more and equal or less than 1.5% of coder population
sns.barplot(y = ["Above", "Equal or Below"],
x = [(fcc_survey["CountryLive"].value_counts(normalize = True) > 0.015).sum(),
(fcc_survey["CountryLive"].value_counts(normalize = True) <= 0.015).sum()],
palette = "tab20c")
ax.set_title("Countries with more than 1.5% respondents Each!", fontsize = "x-large")
ax.set_ylabel("% of Respondents")
ax.set_xticks([False])
for p in ax.patches:
w, h = p.get_width(), p.get_height()
ax.text(w * 0.5, p.get_y() + 0.4, int(w), fontsize = "large", color = "white", fontweight = "demibold",
ha = "center", va = "center")
sns.despine()
plt.tight_layout()
plt.show()
In order to have a summarized view, the countries have been grouped together:
Distribution of coders in the 8 Countries
The following distribution is only to have a granular view for the top 8 countries. The plot and data describe the percentage contribution of coder population for the 8 countries.
The reason behind the waffle chart was my curiosity to replace pie chart. In search for the same came across this stack overflow post and wanted to utilize this opportunity! Many thanks to Guangyang Li for this class definition. This needs installation of pywaffle (pip install pywaffle)
from pywaffle import Waffle
data = dict(fcc_survey["CountryLive"].value_counts(normalize = True).mul(100).astype(int)[:8])
fig = plt.figure(FigureClass=Waffle, rows = 5, values = data, icons = "child", icon_size = 18,
legend = {"loc":"right", "bbox_to_anchor": (0,0.45), "fontsize": 12},
colors=("#232066", "#983D3D", "#DCB732", "#228587", "#3bdc32", "#4e872d", "#09b4ed", "#e84d86"),
figsize = (10,4))
plt.title("Distribution of coders among Top 8 Countries", fontsize = "x-large")
plt.show()
print(data)
{'United States of America': 45, 'India': 7, 'United Kingdom': 4, 'Canada': 3, 'Poland': 1, 'Brazil': 1, 'Germany': 1, 'Australia': 1}
Even with top 8 countries, the dataset is right skewed with maximum survey respondents residing in the USA. The next three countries with substantial number of coders, are India, UK and Canada.
The search for potential markets can be narrowed down to these four countries - USA, India, UK & Canada. However, in order to factor in profitability from these markets, a deeper understanding of investment potential of coders is required.
Advertising within markets where most people are only willing to learn for free is extremely unlikely to be profitable for the E-Company. To understand the profit potential of the Four markets, the investment potential of coders needs to be analysed.
For the analysis two columns are most useful:
Calculate amount spent per month by coders
fcc_survey = fcc_survey[fcc_survey["CountryLive"].isin(["United States of America", "India", "United Kingdom", "Canada"])]
"""
"MonthsProgramming" has been assigned 1 in case the participant has responded with 0 as value
"""
fcc_survey.loc[fcc_survey["MonthsProgramming"] == 0, "MonthsProgramming"] = 1
# New column for amount of money spent by coder per month
fcc_survey["InvestPerMonth"] = fcc_survey["MoneyForLearning"]/ fcc_survey["MonthsProgramming"]
"""
Restrict survey data to rows with non-null values for InvestPerMonth
"""
fcc_survey = fcc_survey.dropna(subset = ["InvestPerMonth"])
Compare average amount spent per month for the 4 markets
pd.DataFrame(fcc_survey.groupby("CountryLive").InvestPerMonth.mean().sort_values())
InvestPerMonth | |
---|---|
CountryLive | |
United Kingdom | 45.534443 |
Canada | 113.510961 |
India | 135.100982 |
United States of America | 227.997996 |
The above results do highlight, that coders are willing to invest monetarily as well. What stands out, is the average amount spent by coders from India is relatively higher than countries like Canada and the UK. To certain extent, this indicates presence of outliers within the data.
Taking account of few socio-economical metrics like GDP per capita, it would be intuitive to expect people in the UK and Canada to spend more on learning than people in India.
Since the results are otherwise, either there is lack of enough representative data for the three countries or the data has some outliers, or the data could really be correct.
Identify the outliers - Violin and Boxplot Visualizations
One of the primary methods to detect outliers is using the InterQuartile Deviation method (IQD). This method is used by violin and boxplots as well, highlighting the outliers from the median dispersion of the data.
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (12, 5))
sns.boxplot(ax = ax, y = "CountryLive", data = fcc_survey, x = "InvestPerMonth")
ax.set_ylabel("") #"Markets", fontsize = "large")
ax.set_xlabel("Amount Invested Per Month", fontsize = "large")
ax.set_title("Dispersion of Investment per month for the 4 Markets", fontsize = "x-large")
plt.show()
USA shows two extremely high value for Investments made per month. Although possible, these values seem unlikely and hence are eliminated outright.
# filter dataframe with Invest Per Month values below 20000 USD
fcc_survey = fcc_survey[fcc_survey.InvestPerMonth < 20000]
Isolated boxplot visualization for Each Market - These visualizations are just for learning with fun!
fig, axes = plt.subplots(nrows = 1, ncols = 4, figsize = (18, 5.5))
markets = ["United States of America", "India", "Canada", "United Kingdom"]
for each in markets:
ax_pos = markets.index(each)
sns.violinplot(fcc_survey.loc[fcc_survey["CountryLive"] == each, "InvestPerMonth"], orient = "v",
ax = axes[ax_pos], palette = "tab20c")
axes[ax_pos].set_title(str(each), color = "Red", fontsize = "large")
axes[ax_pos].set_ylabel("")
# plt.tight_layout()
plt.suptitle("Dispersion of Investment per Month for the 4 Markets At Individual Scale", fontsize="xx-large")
plt.show()
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (12, 5))
sns.violinplot(ax = ax, x = "CountryLive", data = fcc_survey, y = "InvestPerMonth")
ax.set_xlabel("") #"Markets", fontsize = "large")
ax.set_ylabel("Amount Invested Per Month", fontsize = "large")
ax.set_title("Comparison of Investment per month for the 4 Markets", fontsize = "x-large")
# ax.set_ylim(-500, 4000)
sns.despine()
plt.show()
Boxplots highlight the outliers for individual market. The most intriguing observation here is, that even though UK has few outliers, they still do not exceed the lowest outliers present in India or Canada.
Examine outliers for India
This small analysis is to confirm the outliers are really outliers for the country India. The analysis is based on the assumption, that the coder may have participated in boot camps or tech events. Since these camps and events might require course or registration fees the same can be considered as investments. The threshold value for this analysis has been set to 2500 USD.
# dropna to avoid any columns which have nodata
fcc_survey.loc[(fcc_survey.InvestPerMonth >= 2500) & (fcc_survey.CountryLive == "India")].dropna(axis = 1)
Age | AttendedBootcamp | CityPopulation | CountryCitizen | CountryLive | EmploymentStatus | ExpectedEarning | Gender | HasDebt | HasFinancialDependents | HasHighSpdInternet | HasServedInMilitary | HoursLearning | IsEthnicMinority | IsReceiveDisabilitiesBenefits | IsSoftwareDev | JobApplyWhen | JobPref | JobRoleInterest | JobWherePref | LanguageAtHome | MaritalStatus | MoneyForLearning | MonthsProgramming | SchoolDegree | SchoolMajor | WebOrMobDev | InvestPerMonth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1728 | 24.0 | 0.0 | between 100,000 and 1 million | India | India | A stay-at-home parent or homemaker | 70000.0 | male | 0.0 | 0.0 | 1.0 | 0.0 | 30.0 | 0.0 | 0.0 | 0.0 | Within the next 6 months | work for a startup | [user experience designer, mobile developer, p... | in an office with other developers | Bengali | single, never married | 20000.0 | 4.0 | bachelor's degree | Computer Programming | web or mobile | 5000.000000 |
1755 | 20.0 | 0.0 | more than 1 million | India | India | Not working and not looking for work | 100000.0 | male | 0.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | I haven't decided | work for a multinational corporation | [information security, full stack web develope... | no preference | Hindi | single, never married | 50000.0 | 15.0 | bachelor's degree | Computer Science | web or mobile | 3333.333333 |
7989 | 28.0 | 0.0 | between 100,000 and 1 million | India | India | Employed for wages | 500000.0 | male | 1.0 | 1.0 | 1.0 | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | more than 12 months from now | work for a multinational corporation | [user experience designer, back end web develo... | in an office with other developers | Marathi | married or domestic partnership | 5000.0 | 1.0 | bachelor's degree | Aerospace and Aeronautical Engineering | web or mobile | 5000.000000 |
8126 | 22.0 | 0.0 | more than 1 million | India | India | Not working but looking for work | 80000.0 | male | 1.0 | 0.0 | 1.0 | 0.0 | 80.0 | 0.0 | 0.0 | 0.0 | I'm already applying | work for a startup | [back end web developer, full stack web develo... | in an office with other developers | Malayalam | single, never married | 5000.0 | 1.0 | bachelor's degree | Electrical and Electronics Engineering | web or mobile | 5000.000000 |
13398 | 19.0 | 0.0 | more than 1 million | India | India | Unable to work | 100000.0 | male | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 0.0 | 0.0 | 0.0 | I haven't decided | work for a multinational corporation | [mobile developer] | no preference | Hindi | single, never married | 20000.0 | 2.0 | bachelor's degree | Computer Science | web or mobile | 10000.000000 |
15587 | 27.0 | 0.0 | more than 1 million | India | India | Employed for wages | 65000.0 | male | 1.0 | 1.0 | 1.0 | 0.0 | 36.0 | 0.0 | 0.0 | 0.0 | I haven't decided | work for a startup | [full stack web developer, data scientist] | from home | Hindi | single, never married | 100000.0 | 24.0 | bachelor's degree | Communications | web or mobile | 4166.666667 |
There is a lack of information for the attended camps or events (if any) for these coders. Also, to consider they might have included Student Loan as investments, that information is also missing. Since the validity of the data can't be established for now, these values will be treated as outliers.
Examine outliers for Canada
fcc_survey[(fcc_survey.InvestPerMonth >= 2500) & (fcc_survey.CountryLive == "Canada")].dropna(axis = 1)
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | CityPopulation | CodeEventConferences | CodeEventFCC | CodeEventMeetup | CodeEventWorkshops | CommuteTime | CountryCitizen | CountryLive | EmploymentField | EmploymentStatus | ExpectedEarning | Gender | HasDebt | HasFinancialDependents | HasHighSpdInternet | HasHomeMortgage | HasServedInMilitary | HasStudentDebt | HomeMortgageOwe | HoursLearning | Income | IsEthnicMinority | IsReceiveDisabilitiesBenefits | IsSoftwareDev | IsUnderEmployed | JobApplyWhen | JobInterestBackEnd | JobInterestDataSci | JobInterestFrontEnd | JobInterestFullStack | JobInterestGameDev | JobInterestMobile | JobInterestProjMngr | JobInterestUX | JobPref | JobRoleInterest | JobWherePref | LanguageAtHome | MaritalStatus | MoneyForLearning | MonthsProgramming | PodcastNone | ResourceCodecademy | ResourceCodeWars | ResourceCoursera | ResourceFCC | ResourceSO | ResourceW3S | SchoolDegree | SchoolMajor | YouTubeFCC | YouTubeMIT | WebOrMobDev | InvestPerMonth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13659 | 24.0 | 1.0 | 0.0 | 0.0 | Bloc.io | 1.0 | more than 1 million | 1.0 | 1.0 | 1.0 | 1.0 | 30 to 44 minutes | Canada | Canada | finance | Employed for wages | 60000.0 | male | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 250000.0 | 10.0 | 140000.0 | 1.0 | 1.0 | 0.0 | 0.0 | I haven't decided | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | work for a multinational corporation | [mobile developer, full stack web developer, d... | from home | Yue (Cantonese) Chinese | single, never married | 10000.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | bachelor's degree | Finance | 1.0 | 1.0 | web or mobile | 5000.0 |
Taking 2500 USD as threshold, there is only one outlier for Canada data. The coder has responded in positive for boot camp training, but no information is available for any kind of student debt or loan. This could be case of lump-sum fees paid for a course/ boot camp instead of subscription based model.
Examine outliers for USA
fcc_survey[(fcc_survey.InvestPerMonth >= 10000) & (fcc_survey.CountryLive == "United States of America")].dropna(axis = 1)
Age | AttendedBootcamp | CityPopulation | CountryCitizen | CountryLive | EmploymentStatus | Gender | HasDebt | HasFinancialDependents | HasHighSpdInternet | HasServedInMilitary | HoursLearning | IsEthnicMinority | IsReceiveDisabilitiesBenefits | IsSoftwareDev | JobApplyWhen | JobPref | JobRelocateYesNo | JobRoleInterest | JobWherePref | LanguageAtHome | MaritalStatus | MoneyForLearning | MonthsProgramming | SchoolDegree | WebOrMobDev | InvestPerMonth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3930 | 31.0 | 0.0 | between 100,000 and 1 million | United States of America | United States of America | Not working and not looking for work | male | 1.0 | 0.0 | 1.0 | 0.0 | 50.0 | 1.0 | 0.0 | 0.0 | Within the next 6 months | work for a startup | 1.0 | [devops / sysadmin, front end web developer, f... | no preference | English | married or domestic partnership | 65000.0 | 6.0 | bachelor's degree | web or mobile | 10833.333333 |
6805 | 46.0 | 1.0 | between 100,000 and 1 million | United States of America | United States of America | Not working but looking for work | male | 1.0 | 0.0 | 1.0 | 0.0 | 45.0 | 1.0 | 0.0 | 0.0 | Within the next 6 months | work for a multinational corporation | 1.0 | [full stack web developer, game developer, pro... | no preference | English | married or domestic partnership | 15000.0 | 1.0 | bachelor's degree | web or mobile | 15000.000000 |
7198 | 32.0 | 0.0 | more than 1 million | United States of America | United States of America | Employed for wages | male | 1.0 | 0.0 | 1.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 | I'm already applying | work for a multinational corporation | 0.0 | [full stack web developer, back end web develo... | no preference | Spanish | single, never married | 70000.0 | 5.0 | professional degree (MBA, MD, JD, etc.) | web or mobile | 14000.000000 |
16650 | 29.0 | 0.0 | more than 1 million | United States of America | United States of America | Not working but looking for work | male | 1.0 | 1.0 | 1.0 | 0.0 | 40.0 | 1.0 | 1.0 | 0.0 | I'm already applying | work for a multinational corporation | 1.0 | [product manager, data engineer, full stack we... | in an office with other developers | English | married or domestic partnership | 200000.0 | 12.0 | associate's degree | web or mobile | 16666.666667 |
16997 | 27.0 | 0.0 | more than 1 million | United States of America | United States of America | Employed for wages | female | 1.0 | 1.0 | 1.0 | 0.0 | 12.0 | 1.0 | 0.0 | 0.0 | Within 7 to 12 months | work for a medium-sized company | 1.0 | [mobile developer, game developer, user experi... | in an office with other developers | English | single, never married | 12500.0 | 1.0 | some college credit, no degree | web or mobile | 12500.000000 |
17231 | 50.0 | 0.0 | less than 100,000 | Kenya | United States of America | Not working but looking for work | female | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | Within the next 6 months | work for a nonprofit | 0.0 | [front end web developer] | in an office with other developers | English | married or domestic partnership | 30000.0 | 2.0 | bachelor's degree | web or mobile | 15000.000000 |
Here again, there is not much information available for BootCamps or CodeEvents for these coders. Time to change the strategy!
Summary of statistics for each market With Extreme Outliers - Following table describes the data distribution in terms of percentiles.
# pd.set_option('display.float_format', lambda x: '%.3f' % x)
fcc_survey.groupby("CountryLive").InvestPerMonth.describe(percentiles = [0.25, 0.50, 0.75, 0.95, 0.99])
count | mean | std | min | 25% | 50% | 75% | 95% | 99% | max | |
---|---|---|---|---|---|---|---|---|---|---|
CountryLive | ||||||||||
Canada | 240.0 | 113.510961 | 441.014158 | 0.0 | 0.0 | 0.000000 | 25.000000 | 592.803030 | 1902.500000 | 5000.000000 |
India | 463.0 | 135.100982 | 692.960378 | 0.0 | 0.0 | 0.000000 | 13.250000 | 491.650000 | 3650.000000 | 10000.000000 |
United Kingdom | 279.0 | 45.534443 | 162.311836 | 0.0 | 0.0 | 0.000000 | 20.416667 | 200.000000 | 1044.000000 | 1400.000000 |
United States of America | 2931.0 | 183.800110 | 864.767219 | 0.0 | 0.0 | 3.333333 | 50.000000 | 833.333333 | 3308.333333 | 16666.666667 |
The 99th percentile for each of the market, clearly indicates the concentration of data points is well below, the maximum amount spent per month, for the 3 markets - USA, UK & Canada.
Examine Z-Score for each investment per month made by coders grouped by markets
Another method of identifying outlier is by Mean and Standard Deviation Method (SD) which measures the distance of each value from the mean of all of the sample values.
Although it is more susceptible to changes than IQD method, for the purpose of learning let's rule out outliers based on this method.
Time to answer the question - Sample or Population? For this project, each of the market has been taken as sample data of the respective markets. Why? Because the survey hasn't been answered by every possible learner/ coder from each of the country! These learners are the ones who actively participated in the survey and thus represent a sample from each country/market.
The surveyees may have reached out to their peers before or after the survey who in turn could be potential customers for E-Company there by making the other coders and themselves collectively a population.
Where does this understanding comes into play in this project - Delta Degrees of Freedom should be 0 or 1. For this project ddof is 1.
# import zscore module
from scipy.stats import zscore
# group each market/ country
country_group = fcc_survey.groupby("CountryLive")
# calculate zscore for each value for InvestPerMonth column with ddof = 1
fcc_survey["InvestZScore"] = country_group.InvestPerMonth.transform(lambda x: zscore(x, ddof=1))
In the standard deviation method, the default value for SD is 3. That is, if a data point in a sample variable, has a standard deviation equal to or more than 3, it can be considered an outlier. Taking this into account, following table filters out, all the data points for each of the markets which have SD equal to or more than 3.
It has already been established, that UK's outliers are still well within the lowest outlier data points of Canada and India, hence not included in the table.
fcc_survey.loc[(country_group.InvestZScore.apply(lambda x:x >= 3)) & (fcc_survey.CountryLive != "United Kingdom"),
["InvestPerMonth","CountryLive", "InvestZScore"]].sort_values(["CountryLive", "InvestZScore"],
ascending = False).reset_index(drop = True)
InvestPerMonth | CountryLive | InvestZScore | |
---|---|---|---|
0 | 16666.666667 | United States of America | 19.060466 |
1 | 15000.000000 | United States of America | 17.133166 |
2 | 15000.000000 | United States of America | 17.133166 |
3 | 14000.000000 | United States of America | 15.976785 |
4 | 12500.000000 | United States of America | 14.242214 |
5 | 10833.333333 | United States of America | 12.314913 |
6 | 9000.000000 | United States of America | 10.194882 |
7 | 8000.000000 | United States of America | 9.038502 |
8 | 8000.000000 | United States of America | 9.038502 |
9 | 6666.666667 | United States of America | 7.496661 |
10 | 6500.000000 | United States of America | 7.303931 |
11 | 5666.666667 | United States of America | 6.340280 |
12 | 5333.333333 | United States of America | 5.954820 |
13 | 5000.000000 | United States of America | 5.569360 |
14 | 5000.000000 | United States of America | 5.569360 |
15 | 5000.000000 | United States of America | 5.569360 |
16 | 5000.000000 | United States of America | 5.569360 |
17 | 5000.000000 | United States of America | 5.569360 |
18 | 4666.666667 | United States of America | 5.183900 |
19 | 4550.000000 | United States of America | 5.048989 |
20 | 4500.000000 | United States of America | 4.991170 |
21 | 4250.000000 | United States of America | 4.702075 |
22 | 4000.000000 | United States of America | 4.412979 |
23 | 4000.000000 | United States of America | 4.412979 |
24 | 4000.000000 | United States of America | 4.412979 |
25 | 4000.000000 | United States of America | 4.412979 |
26 | 3500.000000 | United States of America | 3.834789 |
27 | 3333.333333 | United States of America | 3.642059 |
28 | 3333.333333 | United States of America | 3.642059 |
29 | 3333.333333 | United States of America | 3.642059 |
30 | 3250.000000 | United States of America | 3.545694 |
31 | 3250.000000 | United States of America | 3.545694 |
32 | 3250.000000 | United States of America | 3.545694 |
33 | 3166.666667 | United States of America | 3.449329 |
34 | 3000.000000 | United States of America | 3.256599 |
35 | 3000.000000 | United States of America | 3.256599 |
36 | 3000.000000 | United States of America | 3.256599 |
37 | 3000.000000 | United States of America | 3.256599 |
38 | 2857.142857 | United States of America | 3.091402 |
39 | 10000.000000 | India | 14.235877 |
40 | 5000.000000 | India | 7.020458 |
41 | 5000.000000 | India | 7.020458 |
42 | 5000.000000 | India | 7.020458 |
43 | 4166.666667 | India | 5.817888 |
44 | 3333.333333 | India | 4.615318 |
45 | 5000.000000 | Canada | 11.080118 |
46 | 2337.166667 | Canada | 5.042141 |
47 | 2000.000000 | Canada | 4.277616 |
48 | 1750.000000 | Canada | 3.710740 |
49 | 1500.000000 | Canada | 3.143865 |
Focusing on India and Canada - based on the table above we can observe that the SD jumps after 5 (SD = 5). As a student/learner I can think of these options:
Taking the liberty of learning phase, SD has been set to 4.
Eliminate data points, away from the mean by 4 standard deviations
usa_df = fcc_survey.query("InvestZScore <= 4 & CountryLive == 'United States of America'")
canada_df = fcc_survey.query("InvestZScore <= 4 & CountryLive == 'Canada'")
india_df = fcc_survey.query("InvestZScore <= 4 & CountryLive == 'India'")
uk_df = fcc_survey.query("CountryLive == 'United Kingdom'")
fcc_survey_filtered = pd.concat([uk_df, usa_df, canada_df, india_df])
Summary of statistics for each market without extreme outliers
fcc_survey_filtered.groupby("CountryLive").InvestPerMonth.describe(percentiles = [0.25, 0.50, 0.75, 0.95, 0.99])
count | mean | std | min | 25% | 50% | 75% | 95% | 99% | max | |
---|---|---|---|---|---|---|---|---|---|---|
CountryLive | ||||||||||
Canada | 237.0 | 75.550481 | 241.474691 | 0.0 | 0.0 | 0.0 | 22.222222 | 433.333333 | 1333.333333 | 1750.0 |
India | 457.0 | 65.758763 | 233.452738 | 0.0 | 0.0 | 0.0 | 12.500000 | 340.666667 | 1110.000000 | 2000.0 |
United Kingdom | 279.0 | 45.534443 | 162.311836 | 0.0 | 0.0 | 0.0 | 20.416667 | 200.000000 | 1044.000000 | 1400.0 |
United States of America | 2905.0 | 119.306296 | 382.716521 | 0.0 | 0.0 | 3.0 | 45.000000 | 666.666667 | 2221.333333 | 3500.0 |
Visualize amount spent per month for each market without extreme outliers
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (12, 5))
sns.boxplot(ax = ax, y = "CountryLive", data = fcc_survey_filtered, x = "InvestPerMonth")
ax.set_ylabel("") #"Markets", fontsize = "large")
ax.set_xlabel("Amount Invested Per Month", fontsize = "large")
ax.set_title("Dispersion of Investment per month for the 4 Markets", fontsize = "x-large")
ax.set_xlim(-100, 4000)
sns.despine()
plt.show()
pd.DataFrame({"% of Surveyees": fcc_survey_filtered.CountryLive.value_counts(normalize = True).mul(100).astype(int)})
% of Surveyees | |
---|---|
United States of America | 74 |
India | 11 |
United Kingdom | 7 |
Canada | 6 |
First is USA, Which is the second one?
The E-Company follows a monthly subscription model, priced at 59 USD as per month. USA stands out as the clear choice for E-Company to invest in advertisement campaign.
Coming to the question of second market, following observations need to be considered:
there were more participants from UK (7%) than Canada(6%), however coders from UK are willing to spend less than the base price of subscription. - So UK as a market can be dropped from this contest.
between the remaining two markets, Canada and India, India has shown higher percentage of coders(11%) participating in the survey, as compared to Canada(6%).
the average amount to be spent by India is lesser than that of Canada, but it is still competent. also, it is higher than the base price of subscription model. - Both these factors make investing in India as economic as in Canada.
also the rapidly increasing market potential of various industries in India, support for higher innovation and digitalization by the government policies, make compelling reasons for upskill of current, or skill development for future workforce.
What can be done for Second one?
It would be prudent to decide between India and Canada based on so close competitive results. Few options that the E-Company can do:
Team member or Team Player?
The last question from the guided project is - If we had a marketing team in our company, would it be better to just send them our results and let them use their domain knowledge to take the best decision?
The entire project is to help the organization, E-Company to increase its business portfolio, which requires all the teams to work together and in conjunction. Assuming that this entire process should aid the marketing team in deciding the best markets, in addition to their domain knowledge, it won't be wrong to keep the marketing team informed of the findings of the project and options available on the basis.
However the final decision can very well be deliberated by the marketing team.
An interesting read for data-science and marketing is here