To reach the project goal (defined further below), we could commission surveys in several different markets to find out which would be the best choices for advertising. That is costly, so it makes sense to explore cheaper options first.
We can first search for existing data relevant to our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with a variety of interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in this GitHub repository.
Fiction: I'm going to write this up as a fictional assignment.
Description: I work for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science and game development. We want to promote our product and we'd like to invest some money in advertising. Our goal in this project is to find the two best markets to advertise our product in.
I was handed a wheelbarrow load of survey-based data by our senior marketing management team. The commission was: 'Here, analyze this. We have a limited amount of advertising dollars in our budget. Find out from the data the two best markets for us to advertise in and capture a high influx of new customers for e-learning training. Have it done before we get back from our Communication Training!'
Well, after I cleaned and massaged the data for hours, ending up with a small percentage of the original rows, I thought: couldn't I have been given this task after senior management's Communication Training? In light of that, this is roughly how I feel now.
# import relevant libraries and read in the raw data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import emoji
import warnings
# suppress warnings (np.warnings was an accident of older NumPy versions;
# the standard warnings module is the supported way).
warnings.filterwarnings('ignore')
code_survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')
print('Original File')
print('_____________', '\n')
print(code_survey.head(3), '\n', '\n')
# trim the data file down to only columns that appear to be relevant to the project objective.
cols = ['Age', 'AttendedBootcamp', 'BootcampFinish', 'BootcampLoanYesNo',
'BootcampName', 'BootcampRecommend', 'CountryLive', 'ExpectedEarning',
'FirstDevJob', 'Gender', 'HoursLearning', 'Income', 'JobInterestBackEnd',
'JobInterestDataEngr', 'JobInterestDataSci', 'JobInterestDevOps',
'JobInterestFrontEnd', 'JobInterestFullStack', 'JobInterestGameDev',
'JobInterestInfoSec', 'JobInterestMobile', 'JobInterestProjMngr',
'JobInterestQAEngr', 'JobInterestUX', 'JobPref', 'JobRoleInterest',
'LanguageAtHome', 'MoneyForLearning', 'MonthsProgramming',
'SchoolDegree', 'SchoolMajor']
code_survey2 = code_survey[cols]
print('Trimmed File')
print('_____________', '\n')
print(code_survey2.head(3))
[Output truncated: code_survey.head(3) across the original 136 columns, followed by code_survey2.head(3) across the trimmed 31 columns.]
The survey file read in has 136 columns and 18,175 rows (excluding the header row).
I initially executed print(code_survey.info(verbose=True)) so I could see all the columns and judge from their titles whether they would be relevant to the project objective.
I trimmed the original file from 136 columns down to 31. As I proceed through this project I will determine whether I'm missing any key columns. After trimming the file, I removed the info(verbose=True) call to reduce the size of subsequent output.
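As a sketch of that inspection step, one way to shortlist candidate columns is to filter on name prefixes rather than eyeballing the full info() listing. The column names below are from the survey file, but the toy frame and the filtering pattern are my own illustration:

```python
import pandas as pd

# toy frame standing in for the 136-column survey file (only a handful of columns)
df = pd.DataFrame(columns=['Age', 'JobInterestBackEnd', 'JobInterestDataSci',
                           'YouTubeFCC', 'CountryLive'])

# shortlist candidate columns by name prefix instead of scanning info(verbose=True)
job_cols = [c for c in df.columns if c.startswith('JobInterest')]
print(job_cols)
```

This doesn't replace reading the column titles, but it makes families of related columns (like the JobInterest* group) easy to pull out programmatically.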
There are 11 different job types associated with computer programming training in the survey. Let's see which job types most of the respondents were leaning towards through their training.
pd.reset_option('max_colwidth')
pd.set_option('display.max_colwidth', 50)
# observe number of missing values.
pd.set_option('display.max_rows', 10)
print(code_survey2['JobRoleInterest'].value_counts(sort=True, dropna=False))
pd.reset_option('max_rows')
print(len(code_survey2), '\n')
# remove missing values from column.
# JRI = Job Role Interest
JRI = code_survey2['JobRoleInterest'].dropna()
print(len(JRI), '\n')
print('There are 11183 missing values out of 18175 rows in \'JobRoleInterest\' column')
print(round(100 - (len(JRI)*100) / (len(code_survey2)), 2),
'% of the responders did not identify any specific job interests')
print('That leaves only 6,992 rows available to analyze.', '\n')
print(JRI.value_counts(sort=True, dropna=False), '\n')
pd.set_option('display.max_colwidth', 200)
print(code_survey2.iloc[2:4, 25], '\n')
print(code_survey2.iloc[2:4, 12:23])
pd.reset_option('max_colwidth')
[Output truncated: 'JobRoleInterest' value counts including NaN (NaN 11,183; Full-Stack Web Developer 823; Front-End Web Developer 450; Data Scientist 152; Back-End Web Developer 142; 3,214 distinct values); row counts 18175 and 6992; the printed summary sentences; the value counts after dropping NaN (3,213 distinct values); and rows 2-3 of 'JobRoleInterest' shown alongside the corresponding JobInterest* indicator columns.]
The survey file includes 11 columns, shown below, each relating to a specific job type that computer programming training could lead to. Respondents were given the opportunity to select not just one job type they were interested in, but any number of the listed choices, including all of them.
There were in fact some cases where respondents selected all 11. Why, I can't say for sure. Maybe they were still early in their training and unsure what career direction to pursue, or perhaps they had worked in all 11 categories at some point in their working careers.
Separate Job Interest Type Columns
Discovery: I noticed that the 'JobRoleInterest' column is simply a conglomeration of the 11 columns above according to each respondent's selection. Some respondents indicated they were interested in only one job type and other respondents indicated they were interested in several job types. We'll look at the distribution of respondent choices later in this project.
Conglomerated Job Interest Column
Proof of Discovery:
2 Front-End Web Developer, Back-End Web Developer, DevOps/SysAdmin, Mobile Developer, Full-Stack Web Developer
3 Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer
Name: JobRoleInterest, dtype: object
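The discovery can be checked on toy data: pandas' str.get_dummies splits a comma-separated answer column back into one indicator column per choice, which is exactly the relationship between 'JobRoleInterest' and the separate JobInterest* columns. This is a sketch with made-up answers, not the survey file:

```python
import pandas as pd

# toy 'JobRoleInterest'-style answers, comma-separated as in the survey
jri = pd.Series(['Front-End Web Developer, Back-End Web Developer',
                 'Full-Stack Web Developer'])

# one 0/1 indicator column per job type, mirroring the survey's
# separate JobInterest* columns (columns come out alphabetically sorted)
dummies = jri.str.get_dummies(sep=', ')
print(dummies.columns.tolist())
```

Row sums of the indicator frame then give the number of choices per respondent, the same quantity computed as 'Multi_Interests' below.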
In light of the discovery above, I believe it will be more efficient to sum the individual job-choice columns into one column and analyze that distribution in a Pareto chart, rather than trying to separate the messy lists in the conglomerated column.
The other reason I'm choosing the individual job type columns over the conglomerated column is that they provide roughly 900 more rows of data (7,920 rows with at least one choice vs. 6,992 non-null 'JobRoleInterest' entries), i.e. less data removed due to missing values.
# define the 11 job type columns.
cols = ['JobInterestBackEnd', 'JobInterestDataEngr', 'JobInterestDataSci',
'JobInterestDevOps', 'JobInterestFrontEnd', 'JobInterestFullStack',
'JobInterestGameDev', 'JobInterestInfoSec', 'JobInterestMobile',
'JobInterestProjMngr', 'JobInterestQAEngr']
# create a new column with the sum of the number of job choices made by each respondent.
code_survey2['Multi_Interests'] = code_survey2[cols].sum(axis=1)
code_survey2 = code_survey2[code_survey2.Multi_Interests != 0]
# create new column MI as string.
# I need to retain Multi-Interest as a float64 type.
code_survey2['MI'] = code_survey2['Multi_Interests'].astype(str)
print(code_survey2.info(), '\n')
print(len(code_survey2), '\n')
df12 = code_survey2['MI'].value_counts().reset_index()
df12.columns = ['MI', 'Frequency']
df12['Percent'] = (df12['Frequency'] /
df12['Frequency'].sum()) * 100
df12['Cum_Percent'] = 100*(df12.Frequency.cumsum() / df12.Frequency.sum())
# label each numeric choice count with a word for plotting
# (mapping by value is more robust than assigning by row position).
num_words = {'1.0': 'One', '2.0': 'Two', '3.0': 'Three', '4.0': 'Four',
             '5.0': 'Five', '6.0': 'Six', '7.0': 'Seven', '8.0': 'Eight',
             '9.0': 'Nine', '10.0': 'Ten', '11.0': 'Eleven'}
df12['Multi_Interests'] = df12['MI'].map(num_words)
print(df12)
[Output truncated: code_survey2.info() showing 7,920 remaining entries and 33 columns, followed by the df12 table - One 2551 (32.21%), Three 1335, Four 1259, Two 1015, Five 845, Six 456, Seven 237, Eight 112, Nine 54, Eleven 32, Ten 24; cumulative percent reaches 100% at Ten.]
The Pareto principle states that for many outcomes, roughly 80% of consequences come from 20% of the causes (the “vital few”). Other names for this principle are the 80/20 rule, the law of the vital few, or the principle of factor sparsity.
Management consultant Joseph M. Juran developed the concept in the context of quality control and improvement, naming it after Italian economist Vilfredo Pareto, who noted the 80/20 connection while at the University of Lausanne in 1896. In his first work, Cours d'économie politique, Pareto showed that approximately 80% of the land in Italy was owned by 20% of the population. The Pareto principle is only tangentially related to Pareto efficiency.
Mathematically, the 80/20 rule is roughly described by a power law distribution (also known as a Pareto distribution) for a particular set of parameters, and many natural phenomena have been shown to exhibit such a distribution. It is an adage of business management that '80% of sales come from 20% of clients'.
That's why the Pareto Chart is often used as a tool to help prioritize opportunities or identify what variables to start focusing on in any problem solving project. I will use it to help achieve this project goal.
# use appropriate plotting commands to generate a Pareto chart.
fig, ax = plt.subplots(figsize=(14,9))
ax.bar(df12['Multi_Interests'], df12['Frequency'], color='C0')
ax2 = ax.twinx()
ax2.plot(df12['Multi_Interests'], df12['Cum_Percent'], color='C1', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())
plt.title('Pareto Chart', fontsize=24)
ax.set_xlabel('Qty of Job Type Interests Chosen', fontsize=20, labelpad = 10)
plt.xticks(fontsize=30)
ax.set_ylabel('Frequency', fontsize=20, labelpad=15)
plt.yticks(fontsize=15)
sns.despine(bottom=False)
ax.grid(False)
ax.yaxis.set_label_position('left')
ax.tick_params(axis='x', colors='C0', labelrotation=90, labelsize=18)
ax.tick_params(axis='y', colors='C0', labelsize=15)
plt.axhline(32, color='red')
ax2.tick_params(axis='y', colors='C1', labelsize = 15)
plt.ylabel('Percent', fontsize=20)
plt.show()
The chart above tells me that just over 30% of the respondents identified one specific job type they were interested in; everyone else chose two or more. Does that mean about 70% weren't sure what they wanted and would rely on further programming training to eventually make the choice? Not necessarily. For example, choosing all three web developer job types ('Front-End', 'Back-End', and 'Full-Stack') may simply indicate a respondent didn't care which type of web development job they landed.
What I will do is look at the distribution of job type choices after splitting the multiple choices into individual ones, then look at the distribution for cases where only one choice was made, and after that draw some conclusions if possible.
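Had I worked from the conglomerated column instead, the splitting step could be sketched like this (toy answers, not the survey file; the survey separates selections with ', '):

```python
import pandas as pd

# toy multi-choice answers from a 'JobRoleInterest'-style column
jri = pd.Series(['Front-End Web Developer, Full-Stack Web Developer',
                 'Full-Stack Web Developer',
                 'Data Scientist, Full-Stack Web Developer'])

# split each answer into a list of choices, then explode to one row per choice
choices = jri.str.split(', ').explode()
print(choices.value_counts().to_dict())
```

Series.explode (pandas 0.25+) turns each list element into its own row, so value_counts then tallies individual job types rather than whole combinations.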
# define dataframe with only job type columns included.
df2 = code_survey2[cols]
# sum the number of respondents interested in each job type
# (build the 'Total' row directly; DataFrame.append is deprecated).
new1 = df2.sum().to_frame(name='Total').T
print(df2.columns, '\n')
print(new1, '\n')
# convert column names to something a little easier to decipher.
new1.columns = ['Back-End Web Developer', 'Data Engineer', 'Data Scientist',
'DevOps', 'Front-End Web Developer', 'Full-Stack Web Developer',
'Game Developer', 'Information Security', 'Mobile Developer',
'Project Manager', 'Quality Assurance Engineer']
# create dataframe that includes job type interest and choice frequency of each.
df = pd.melt(new1, var_name='Job_Interest', value_name='Total')
# add column with percent value for each job type preference.
df['Percent'] = (df['Total'] /
df['Total'].sum()) * 100
df = df.sort_values(by=['Total'], ascending=False)
# calculate cumulative percent for each job type preference.
df['Cum_Percent'] = 100*(df.Total.cumsum() / df.Total.sum())
print(df)
[Output truncated: the 11 JobInterest* column names, the 'Total' row of sums, and the melted table - Full-Stack Web Developer 4831 (19.95%), Front-End 4047, Back-End 3201, Mobile 2677 (cumulative 60.93%), Data Scientist 1940, Game Developer 1875, Information Security 1560, Data Engineer 1470, DevOps 1092, Project Manager 939, Quality Assurance Engineer 584.]
# use appropriate plotting commands to generate pareto chart.
fig, ax = plt.subplots(figsize=(14,9))
ax.bar(df['Job_Interest'], df['Total'], color='C0')
ax2 = ax.twinx()
ax2.plot(df['Job_Interest'], df['Cum_Percent'], color='C1', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())
plt.title('Pareto Chart', fontsize=24)
plt.xlabel(None)
plt.xticks(fontsize=30)
ax.set_ylabel('Frequency', fontsize=20, labelpad=15)
plt.yticks(fontsize=22)
sns.despine(bottom=False)
ax.grid(False)
ax.yaxis.set_label_position('left')
ax.tick_params(axis='x', colors='C0', labelrotation=90, labelsize=15)
ax.tick_params(axis='y', colors='C0', labelsize=15)
plt.axhline(61, color='red')
ax2.tick_params(axis='y', colors='C1', labelsize = 15)
plt.ylabel('Percent', fontsize=20)
plt.show()
This particular example doesn't quite follow the Pareto principle. There are only 11 variables here, so twenty percent of the job types is just two of them, and the top two account for only about 35% of all choices.
It's interesting that the top four job preferences line up with our primary current courses: web and mobile development. Together they account for about 60% of all the choices made by respondents.
That works in our favor: we may not need to invest capital in new course development to acquire new customers. More analysis is needed before drawing conclusions.
Let's look at the distribution of job type interests for cases where only one job type was selected by survey respondents.
# select only the single job type choices.
one_job_choice = code_survey2[code_survey2['Multi_Interests'] == 1.0]
one_job_choice = one_job_choice[cols]
# create new row to sum number of respondents interested in each job type from list.
# (build the 'Total2' row directly; DataFrame.append is deprecated)
new2 = one_job_choice.sum().to_frame(name='Total2').T
print(new2.columns)
print(new2, '\n')
# convert column names to something a little easier to decipher.
new2.columns = ['Back-End Web Developer', 'Data Engineer', 'Data Scientist',
'DevOps', 'Front-End Web Developer', 'Full-Stack Web Developer',
'Game Developer', 'Information Security', 'Mobile Developer',
'Project Manager', 'Quality Assurance Engineer']
# create dataframe and move row 'Total2' into column position.
df8 = pd.melt(new2, var_name="Job_Interest", value_name="Total2")
print(df8, '\n')
df8['Percent'] = (df8['Total2'] /
df8['Total2'].sum()) * 100
df8 = df8.sort_values(by=['Total2'], ascending=False)
df8['Cum_Percent'] = 100*(df8.Total2.cumsum() / df8.Total2.sum())
print(df8)
[Output truncated: the 11 column names, the 'Total2' row, and the single-choice table - Full-Stack Web Developer 978 (38.34%), Front-End 610 (cumulative 62.25%), Data Scientist 192, Back-End 168, Game Developer 144, Mobile Developer 143, Information Security 123, Data Engineer 66, Project Manager 63, DevOps 46, Quality Assurance Engineer 18.]
# use appropriate plotting commands to generate pareto chart.
fig, ax = plt.subplots(figsize=(14,9))
ax.bar(df8['Job_Interest'], df8['Total2'], color='C0')
ax2 = ax.twinx()
ax2.plot(df8['Job_Interest'], df8['Cum_Percent'], color='C1', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())
plt.title('Pareto Chart', fontsize=24)
plt.xlabel(None)
plt.xticks(fontsize=30)
ax.set_ylabel('Frequency', fontsize=20, labelpad=15)
plt.yticks(fontsize=22)
sns.despine(bottom=False)
ax.grid(False)
ax.yaxis.set_label_position('left')
ax.tick_params(axis='x', colors='C0', labelrotation=90, labelsize=15)
ax.tick_params(axis='y', colors='C0', labelsize=15)
plt.axhline(62.2, color='red')
ax2.tick_params(axis='y', colors='C1', labelsize = 15)
plt.ylabel('Percent', fontsize=20)
plt.show()
For respondents who selected only one job type, the results are quite close to the multi-choice distribution. Web development jobs hold three of the top four spots, while mobile development falls to sixth.
This chart follows the Pareto principle a little more closely: the first two choices account for just over 60% of the total. So we're still looking good with our current course catalog.
# create frequency table for countries and qty of respondents living in each.
present_home = code_survey2['CountryLive'].value_counts(ascending=False)
focus_home = code_survey2['CountryLive'].dropna()
print(focus_home.value_counts(ascending=False), '\n')
# convert frequency table into a dataframe.
new_df1 = focus_home.value_counts().to_frame().reset_index()
# PCR stands for Present Country Residence.
new_df1.columns = ['PCR', 'Frequency']
print(new_df1, '\n')
new_df1['Cum_Percent'] = 100*(new_df1.Frequency.cumsum() / new_df1.Frequency.sum())
# consolidate the countries beyond the 80% cumulative mark into 'Other'
# (use the computed sum rather than a hard-coded value; append is deprecated).
other = new_df1.iloc[19:137, 1].sum()
new_row3 = {'PCR': 'Other', 'Frequency': other, 'Cum_Percent': 100}
new_df1 = pd.concat([new_df1, pd.DataFrame([new_row3])], ignore_index=True)
print(new_df1, '\n')
new_df1.at[0,'PCR']='U.S.A.'
new_df1.at[2,'PCR']='U.K.'
new_df1.at[14,'PCR']='Holland Euro'
print(new_df1)
pareto1 = new_df1[new_df1.Frequency > 42]
print(pareto1)
[Output truncated: CountryLive value counts for 137 countries (U.S.A. 3023, India 516, U.K. 310, Canada 249, Poland 130, ... down to single respondents), the cumulative-percent table with the appended 'Other' row (1304), and the pareto1 subset: the top 19 countries reaching a cumulative 80.39% at Greece, plus 'Other'.]
I initially executed pd.set_option('display.max_rows', None) so I could see at which country the cumulative percent reaches 80. My plan was then to consolidate all the remaining countries into a group called 'Other' and plot a Pareto chart.
A total of 137 countries appear as respondents' current residences. The top 19 countries account for about 80 percent of all respondents. I consolidated the remaining 118 countries into one category labeled 'Other'.
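That consolidation step could be sketched as a small reusable helper. The function name, the threshold parameter, and the toy counts are my own illustration (and exactly where to cut at the 80% boundary is a judgment call):

```python
import pandas as pd

def consolidate_tail(counts, threshold=80.0):
    """Keep categories up to `threshold` cumulative percent; lump the rest into 'Other'."""
    cum = 100 * counts.cumsum() / counts.sum()
    head = counts[cum <= threshold]
    tail_total = counts[cum > threshold].sum()
    if tail_total > 0:
        head = pd.concat([head, pd.Series({'Other': tail_total})])
    return head

# toy country counts, not the survey data
counts = pd.Series({'USA': 60, 'India': 20, 'UK': 10, 'Canada': 6, 'Poland': 4})
print(consolidate_tail(counts))
```

A helper like this avoids hand-counting rows and hard-coding the 'Other' total each time the consolidation is repeated.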
fig, ax = plt.subplots(figsize=(14,9))
ax.bar(pareto1['PCR'], pareto1['Frequency'], color='C0')
ax2 = ax.twinx()
ax2.plot(pareto1['PCR'], pareto1['Cum_Percent'], color='C1', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())
plt.title('Pareto Chart', fontsize=24)
plt.xlabel(None)
plt.xticks()
ax.set_ylabel('Frequency', color='C0', fontsize=20)
ax2.set_ylabel('Percent', color='C1', fontsize=20)
plt.yticks(fontsize=22)
plt.axhline(61.5, color='red')  # cumulative percent at the top four countries
sns.despine(bottom=False)
ax.grid(False)
ax.set_xticklabels(pareto1['PCR'], rotation=90, ha='center', color='C0', minor=False, fontsize=15)
ax.yaxis.set_label_position('left')
ax.tick_params(axis='y', colors='C0', labelsize=15)
ax2.tick_params(axis='y', colors='C1', labelsize = 15)
plt.show()
In terms of respondents' countries of residence, the U.S.A. clearly dominates at 45% of the total across 137 countries. The top four countries (U.S.A., India, U.K., and Canada) account for over 60% of the total.
This case follows the Pareto principle more closely: 19 of the 137 countries (14%) account for 80% of the total.
Let's find out the top countries of respondent residence that tie in with our current courses: web and mobile development only.
one_job_choice = code_survey2[code_survey2['Multi_Interests'] == 1.0]
print(one_job_choice.info())
Focus_cols = ['CountryLive', 'JobInterestDataSci',
'JobInterestFrontEnd', 'JobInterestFullStack',
'JobInterestMobile']
new_focus = one_job_choice[Focus_cols]
print(new_focus, '\n')
print(new_focus['CountryLive'].value_counts(ascending=False, dropna=False), '\n')
print(len(new_focus))
focus_home = new_focus['CountryLive'].dropna()
print(focus_home.value_counts(ascending=False))
print(len(focus_home))
# convert frequency table into a dataframe.
new_df = focus_home.value_counts().to_frame().reset_index()
# PCR stands for Present Country Residence.
new_df.columns = ['PCR', 'Frequency']
print(new_df, '\n')
new_df['Cum_Percent'] = 100*(new_df.Frequency.cumsum() / new_df.Frequency.sum())
# consolidate the countries beyond the 80% cumulative mark into 'Other'
# (slice to the end so the last row isn't dropped; use the computed sum).
other = new_df.iloc[20:, 1].sum()
new_row2 = {'PCR': 'Other', 'Frequency': other, 'Cum_Percent': 100}
new_df = pd.concat([new_df, pd.DataFrame([new_row2])], ignore_index=True)
print(new_df, '\n')
new_df.at[0,'PCR']='U.S.A.'
new_df.at[2,'PCR']='U.K.'
new_df.at[14,'PCR']='Holland Euro'
print(new_df, '\n')
pareto2 = new_df[new_df.Frequency > 14]
print(pareto2)
[Output truncated: one_job_choice.info() showing 2,551 single-choice entries; the new_focus frame; CountryLive value counts (113 values including 456 NaN; 112 countries after dropping NaN: U.S.A. 872, India 191, U.K. 94, Canada 78, Poland 49, ...); the cumulative-percent table with the appended 'Other' row (431); and the pareto2 subset: the top 20 countries reaching a cumulative 80.24% at Ireland, plus 'Other'.]
I used pd.set_option('display.max_rows', 130) to print every country and its frequency so I could determine which countries past the cumulative 80% mark to consolidate under the label 'Other'. I then called pd.reset_option('display.max_rows') to restore the default row limit.
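The consolidation step can be sketched as follows. This is a minimal, hypothetical reconstruction: the small frequency table and the 80% cutoff logic are stand-ins for the real 112-country CountryLive counts, but the PCR / Frequency / Cum_Percent column names match the tables above.

```python
import pandas as pd

# Hypothetical frequency table standing in for the real CountryLive counts.
freq = pd.DataFrame({
    'PCR': ['U.S.A.', 'India', 'U.K.', 'Canada', 'Chile', 'Qatar'],
    'Frequency': [872, 191, 94, 78, 1, 1],
})

# Cumulative percent of all respondents, as in the Cum_Percent column above.
freq['Cum_Percent'] = freq['Frequency'].cumsum() / freq['Frequency'].sum() * 100

# Keep rows up to and including the first one at or past the 80% mark,
# then collapse everything after it into a single 'Other' row.
cutoff = (freq['Cum_Percent'] >= 80).idxmax()
head = freq.iloc[:cutoff + 1]
other = pd.DataFrame({'PCR': ['Other'],
                      'Frequency': [freq['Frequency'].iloc[cutoff + 1:].sum()]})
pareto = pd.concat([head, other], ignore_index=True)
```

Applied to the real 112-country table, a construction like this yields the 21-row frame used for the Pareto chart.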
import seaborn as sns  # seaborn is used below (despine) but was not imported earlier

fig, ax = plt.subplots(figsize=(14, 9))

# frequency bars on the left axis, cumulative percent line on the right axis
ax.bar(pareto2['PCR'], pareto2['Frequency'], color='C0')
ax2 = ax.twinx()
ax2.plot(pareto2['PCR'], pareto2['Cum_Percent'], color='C1', marker='D', ms=7)
ax2.yaxis.set_major_formatter(PercentFormatter())

plt.title('Pareto Chart', fontsize=24)
ax.set_ylabel('Frequency', fontsize=20, color='C0')

# reference line at 58.9% on the percent axis: the cumulative share of
# respondents reached by the top four countries
ax2.axhline(58.9, color='red', label='Mean')

sns.despine(bottom=False)
ax.grid(False)
ax.yaxis.set_label_position('left')
ax.tick_params(axis='x', colors='C0', labelrotation=90, labelsize=15)
ax.tick_params(axis='y', colors='C0', labelsize=15)
ax2.tick_params(axis='y', colors='C1', labelsize=15)
ax2.set_ylabel('Percent', fontsize=20, color='C1')
plt.show()
Regarding web and mobile development job preferences, the same four countries top the list of where respondents currently live. At this point in the analysis, these countries are the leading candidates for the top four marketing areas; I still need to narrow the list down to two.
Code academies (also called coding bootcamps) are expensive. Tuition ranges from \$5,000 to more than \$20,000 for 8 to 24 weeks of study. According to the academies and their fans, this is a great deal compared to university programs, which cost more and take longer to complete.
The 'MoneyForLearning' column describes, in American dollars, the amount of money participants spent from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at \$59 per month, so we're interested in how much money each student spends per month.
print(code_survey2['MonthsProgramming'].value_counts().sort_index())
print('\n')
print(code_survey2['AttendedBootcamp'].value_counts().sort_index(), '\n')

# determine monthly spending rates for individuals with one month of training.
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
zero_months = code_survey2[code_survey2['MonthsProgramming'] == 0.0].copy()
zero_months['MonthsProgramming'] = 1.0
print(zero_months.info(), '\n')
zero_months['MonthlySpending'] = zero_months['MoneyForLearning'] / zero_months['MonthsProgramming']
print(zero_months['MonthlySpending'].value_counts().sort_index(), '\n')

# determine monthly spending rates for boot camp attendees.
boot_camp = code_survey2[code_survey2['AttendedBootcamp'] == 1].copy()
print(boot_camp['MoneyForLearning'].value_counts().sort_index(), '\n')
boot_camp.loc[boot_camp['MonthsProgramming'] == 0.0, 'MonthsProgramming'] = 1.0
boot_camp['MonthlySpending'] = boot_camp['MoneyForLearning'] / boot_camp['MonthsProgramming']

# apply the same zero-to-one conversion to the full data set.
code_survey2.loc[code_survey2['MonthsProgramming'] == 0.0, 'MonthsProgramming'] = 1.0
print(code_survey2['MonthsProgramming'].value_counts().sort_index(), '\n')
code_survey2['MonthlySpending'] = code_survey2['MoneyForLearning'] / code_survey2['MonthsProgramming']
print(code_survey2['MonthlySpending'].value_counts(dropna=False), '\n')

# drop rows with zero or missing monthly spending, then rows with no country.
code_survey3 = code_survey2[code_survey2['MonthlySpending'] != 0]
code_survey4 = code_survey3.dropna(axis=0, subset=['MonthlySpending'])
print(code_survey4['CountryLive'].value_counts(ascending=False, dropna=False), '\n')
code_survey5 = code_survey4.dropna(axis=0, subset=['CountryLive'])
print(code_survey5['CountryLive'].value_counts(ascending=False, dropna=False), '\n')
MonthsProgramming value counts: 0.0 → 276, 1.0 → 848, 2.0 → 736, 3.0 → 706, 4.0 → 395, ... 744.0 → 3 (95 distinct values)
AttendedBootcamp value counts: 0.0 → 7341, 1.0 → 423

zero_months: 276 entries, 33 columns (full info() listing omitted)
MonthlySpending for the zero-month group: 0.0 for 215 respondents, then scattered values from 15.0 up to 9000.0 (two respondents) and 30000.0 (one respondent)
Boot camp MoneyForLearning: 0.0 for 41 attendees, ranging up to 100000.0 (76 distinct values)
Full data set MonthlySpending: 0.0 for 3513 respondents, NaN for 830, then 50.0 (145), 100.0 (137), 25.0 (134), ... (525 distinct values)

CountryLive counts after dropping zero and NaN MonthlySpending, then NaN CountryLive:
United States of America    1581
India                        196
United Kingdom               135
Canada                       112
Poland                        61
...
Name: CountryLive, Length: 109, dtype: int64
Insufficient Communication: Senior marketing management did not tell me whether they are prepared to introduce a new type of e-learning training priced at \$100, \$200, \$1,000, or more per month, especially "boot camp" style training.
I noticed that 423 of the survey respondents attended 'boot camp' style training. I will look at the distribution of their monthly spending to decide whether that data should be removed.
There are also respondents who reported their months of training as zero. I need to view that data and, after converting the value from 0 to 1 (to represent at least one month of training), decide whether to include or exclude it.
It's nice to have lots of data! But not if it includes questionable values with great potential to distort the outcome!
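One programmatic complement to the boxplots that follow is the standard 1.5 × IQR fence. This sketch is not part of the original analysis and runs on made-up spending values; it shows how extreme values could be flagged automatically instead of eyeballed.

```python
import pandas as pd

# Made-up spending values, including two implausibly large ones.
spending = pd.Series([0, 25, 50, 100, 150, 200, 30000, 200000])

# Standard boxplot fence: anything above Q3 + 1.5 * IQR is an outlier.
q1, q3 = spending.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = spending[spending > upper_fence]
```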
# create a dataframe containing total money spent for all individuals.
# (note: MoneyForLearning is total spending, not monthly spending)
df = pd.DataFrame(columns=['money_spent'])
df['money_spent'] = code_survey2['MoneyForLearning']
# use Seaborn to generate a box and whisker plot.
fig, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(data=df, width=0.35, linewidth=2.5, palette='Blues',
            showmeans=True, medianprops={'color': 'red'},
            meanprops={'marker': 'o', 'markerfacecolor': 'white',
                       'markeredgecolor': 'black', 'markersize': '10'})
plt.title('Boxplot of Total Money Spent Across All Survey Respondents', fontsize=22, pad=20)
plt.xlabel('All Survey Respondents', fontsize=20, labelpad=20)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.ylabel('Total Money Spent', fontsize=20, labelpad=20)
plt.show()
A couple of individuals reported spending over \$200,000 on their training. Really?! It may take them a long time to earn back what they've invested; hopefully they'll live another 100 years to break even.
Executive Decision Time: Since the senior marketing managers are not available for more details, I will have to make an executive decision on a reasonable maximum monthly spending rate above which to exclude data: \$5,000, \$500, \$100, ...?
I haven't yet generated a box and whisker plot showing monthly spending that is unreasonably LOW for our spectrum of available courses. More to come.
Regarding the 'MonthsProgramming' column, I can't be certain whether it means months of learning by training or months of learning by experience. I will assume it means months of training.
df = pd.DataFrame(columns = ['months_training'])
df['months_training'] = code_survey2['MonthsProgramming']
# use Seaborn coding to generate box and whisker plot.
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(data=df, width = 0.35, linewidth=2.5, palette='Blues',
showmeans=True, medianprops={'color':'red'}, meanprops={'marker':'o',
'markerfacecolor':'white', 'markeredgecolor':'black',
'markersize':'10'})
plt.title('Boxplot of Months of Training Across All Survey Respondents', fontsize=22, pad=20)
plt.xlabel('All Survey Respondents', fontsize=20, labelpad = 20)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.yticks(fontsize=15)
plt.ylabel('Months of Training', fontsize=20, labelpad = 20)
plt.show()
There are definitely some crazy numbers for months of training among the survey respondents. One person reported over 700 months (more than 58 years) of training so far. Really?! I don't think so.
Executive Decision Time: Again, I will need to make an executive decision on the maximum number of months to include in the final trimmed data file.
df = pd.DataFrame(columns = ['boot_camp'])
df['boot_camp'] = boot_camp['MonthlySpending']
# use Seaborn coding to generate box and whisker plot.
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(data=df, width = 0.35, linewidth=2.5, palette='Blues',
showmeans=True, medianprops={'color':'red'}, meanprops={'marker':'o',
'markerfacecolor':'white', 'markeredgecolor':'black',
'markersize':'10'})
plt.title('Boxplot of Monthly Spending by Boot Camp Attendees', fontsize=22, pad=20)
plt.xlabel('Boot Camp Attendees', fontsize=20, labelpad = 20)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.yticks(fontsize=15)
plt.ylabel('Monthly Spending', fontsize=20, labelpad = 20)
plt.show()
As I expected, the boot camp monthly spending rates are quite high relative to $59 per month.
Executive Decision Time: I'm going to assume that senior management is not interested in developing boot camp type training in the immediate future. I will exclude this group of data.
df = pd.DataFrame(columns = ['zero_months'])
df['zero_months'] = zero_months['MonthlySpending']
# use Seaborn coding to generate box and whisker plot.
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(data=df, width = 0.35, linewidth=2.5, palette='Blues',
showmeans=True, medianprops={'color':'red'}, meanprops={'marker':'o',
'markerfacecolor':'white', 'markeredgecolor':'black',
'markersize':'10'})
plt.title('Boxplot of Monthly Spending by One Month Students', fontsize=22, pad=20)
plt.xlabel('One Month Students', fontsize=20, labelpad = 20)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.yticks(fontsize=15)
plt.ylabel('Monthly Spending', fontsize=20, labelpad = 20)
plt.show()
In the project guidelines, it was suggested that I convert months-of-training values of 0 to 1 to represent at least one month of training. The primary reason is that calculating monthly spending would otherwise divide by zero.
So, having done that, we see suspect monthly spending rates for students with one month or less of training.
Here is my broader concern for these cases, and possibly even for students with fewer than 12 months of training: among the respondents there are surely individuals who paid in full up front, whether for a one-year subscription, a full course, or otherwise. Dividing that lump-sum payment by 1, 2, or anything less than 12 months produces an inflated monthly payment, and I don't think the data file tells us who made such up-front payments.
I'm beginning to feel that monthly spending on training should not be used in making a final decision on the top two marketing areas.
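For reference, the zero-to-one conversion and the monthly division can be written in one pass with pandas' clip. This is a minimal sketch on hypothetical values; the two series are stand-ins, not the actual survey columns.

```python
import pandas as pd

# Hypothetical months-of-programming values, including the problematic zeros.
months = pd.Series([0.0, 0.0, 1.0, 6.0, 24.0])
money = pd.Series([200.0, 0.0, 59.0, 354.0, 1416.0])

# Raise zeros to one month so the division below never divides by zero.
monthly = money / months.clip(lower=1.0)
```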
I will generate a box and whisker plot with an upper limit equal to our current monthly subscription rate: \$59 per month.
low_data = code_survey5[(code_survey5['MonthlySpending'] < 60.00)]
df = pd.DataFrame(columns = ['likely'])
df['likely'] = low_data['MonthlySpending']
# use Seaborn coding to generate box and whisker plot.
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(data=df, width = 0.35, linewidth=2.5, palette='Blues',
showmeans=True, medianprops={'color':'red'}, meanprops={'marker':'o',
'markerfacecolor':'white', 'markeredgecolor':'black',
'markersize':'10'})
plt.title('Boxplot of Monthly Spending by Survey Respondents', fontsize=22, pad=20)
plt.xlabel('Survey Respondents', fontsize=20, labelpad = 20)
plt.tick_params(axis='x', which='both', bottom=False, labelbottom=False)
plt.yticks(fontsize=15)
plt.ylabel('Monthly Spending', fontsize=20, labelpad = 20)
plt.show()
Of all the cases where monthly spending is less than \$59, at least 75% fall below about \$25. I will now decide which upper and lower monthly spending rates would be acceptable to marketing senior management.
Final Executive Decision! On behalf of marketing senior management, I'm going to set the maximum monthly spending rate at \$200 and the minimum at our current subscription rate of \$59 per month. This gets rid of the cases where months of programming was set to 1, which yield unreasonable monthly spending.
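The two limits can also be applied with Series.between instead of chained comparisons. A small sketch on made-up values; inclusive='neither' keeps both bounds strict, matching the \$59-to-\$200 decision above.

```python
import pandas as pd

# Made-up monthly spending values around the chosen limits.
spending = pd.Series([59.0, 60.0, 150.0, 200.0, 250.0])

# Strict bounds: keep values strictly between $59 and $200.
kept = spending[spending.between(59.0, 200.0, inclusive='neither')]
```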
Let's see what these limits look like for each of the top four countries.
# apply the upper and lower monthly spending limits.
likely_data2 = code_survey5[(code_survey5['MonthlySpending'] < 200.00)
                            & (code_survey5['MonthlySpending'] > 59.00)].copy()
# keep only the four leading candidate countries.
top_four = likely_data2[likely_data2['CountryLive'].isin(
    ['United States of America', 'India', 'United Kingdom', 'Canada'])]
print(top_four['CountryLive'].value_counts(), '\n')
avg_by_country = top_four.groupby('CountryLive')['MonthlySpending'].mean()
print(avg_by_country)
United States of America    285
India                        36
United Kingdom               28
Canada                       15
Name: CountryLive, dtype: int64

CountryLive
Canada                      100.619048
India                       119.141534
United Kingdom              100.364100
United States of America    105.768501
Name: MonthlySpending, dtype: float64
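Because the mean is sensitive to whatever outliers survive the trimming, reporting the median alongside it is a cheap sanity check. A sketch on a small hypothetical stand-in for top_four:

```python
import pandas as pd

# Hypothetical stand-in for the real top_four frame.
top_four = pd.DataFrame({
    'CountryLive': ['Canada', 'Canada', 'India', 'India', 'India'],
    'MonthlySpending': [60.0, 140.0, 70.0, 80.0, 190.0],
})

# Mean and median side by side; a large gap between them hints at skew.
summary = top_four.groupby('CountryLive')['MonthlySpending'].agg(['mean', 'median'])
```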
# use Seaborn coding to generate box and whisker plot.
fig, ax = plt.subplots(figsize=(10,8))
sns.boxplot(x = top_four['CountryLive'], y = top_four['MonthlySpending'], width = 0.35, linewidth=2.5, palette='Blues',
showmeans=True, medianprops={'color':'red'}, meanprops={'marker':'o',
'markerfacecolor':'white', 'markeredgecolor':'black',
'markersize':'10'})
plt.title('Boxplot of Monthly Spending by Survey Respondents', fontsize=22, pad=20)
ax.yaxis.set_label_position("left")
ax.tick_params(axis='x', labelrotation=90, labelsize=15)
ax.tick_params(axis='y', labelsize=15)
plt.xlabel('Country', fontsize=20, labelpad = 10)
plt.ylabel('Monthly Spending', fontsize=20, labelpad = 20)
plt.show()
After all the data cleaning for monthly spending, there's very little data remaining. It is what it is.
So, with what we have, there's very little difference in average monthly spending among the top four countries; not enough to affect the final decision on the top two regions to advertise in, at least from a monthly spending standpoint.
Since our courses are currently offered only in English, I think that eliminates India as a recommendation for now. If large time-zone differences are potentially a negative for our business interactions with clients, then I think that eliminates the U.K. as well.
That leaves the U.S.A. and Canada as the two top markets I will recommend advertising in.
When the marketing management team gets back from training, I will give a ten-minute presentation and explain that the survey analysis is inconclusive for determining the top two markets for advertising.
At the same time, I will say that choosing the U.S.A. and Canada for their advertising blitz carries a low risk of poor return on investment.
I will also express appreciation for their week-long communication training, in the hope that the next project assignment comes with far more detail, relieving me of making executive decisions on their behalf.