This project is being carried out on behalf of an an e-learning company that offers courses on programming. Most of their courses are on web and mobile development, but also cover many other domains, like data science, game development, etc. They want to promote their product and would like to invest some money in advertising.
Our goal in this project is to find out the two best markets to advertise their product in.
To identify which would be the best markets to advertise in I could :
Option 2 has been chosen as it is a lot less expensive.
We can try to search existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in this GitHub repository. https://github.com/freeCodeCamp/2017-new-coder-survey
import pandas as pd
pd.options.display.max_columns=200
new_coder_survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')
new_coder_survey.shape
C:\Users\gavin\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (17,62) have mixed types.Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
(18175, 136)
new_coder_survey.head()
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | CodeEventFCC | CodeEventGameJam | CodeEventGirlDev | CodeEventHackathons | CodeEventMeetup | CodeEventNodeSchool | CodeEventNone | CodeEventOther | CodeEventRailsBridge | CodeEventRailsGirls | CodeEventStartUpWknd | CodeEventWkdBootcamps | CodeEventWomenCode | CodeEventWorkshops | CommuteTime | CountryCitizen | CountryLive | EmploymentField | EmploymentFieldOther | EmploymentStatus | EmploymentStatusOther | ExpectedEarning | FinanciallySupporting | FirstDevJob | Gender | GenderOther | HasChildren | HasDebt | HasFinancialDependents | HasHighSpdInternet | HasHomeMortgage | HasServedInMilitary | HasStudentDebt | HomeMortgageOwe | HoursLearning | ID.x | ID.y | Income | IsEthnicMinority | IsReceiveDisabilitiesBenefits | IsSoftwareDev | IsUnderEmployed | JobApplyWhen | JobInterestBackEnd | JobInterestDataEngr | JobInterestDataSci | JobInterestDevOps | JobInterestFrontEnd | JobInterestFullStack | JobInterestGameDev | JobInterestInfoSec | JobInterestMobile | JobInterestOther | JobInterestProjMngr | JobInterestQAEngr | JobInterestUX | JobPref | JobRelocateYesNo | JobRoleInterest | JobWherePref | LanguageAtHome | MaritalStatus | MoneyForLearning | MonthsProgramming | NetworkID | Part1EndTime | Part1StartTime | Part2EndTime | Part2StartTime | PodcastChangeLog | PodcastCodeNewbie | PodcastCodePen | PodcastDevTea | PodcastDotNET | PodcastGiantRobots | PodcastJSAir | PodcastJSJabber | PodcastNone | PodcastOther | PodcastProgThrowdown | PodcastRubyRogues | PodcastSEDaily | PodcastSERadio | PodcastShopTalk | PodcastTalkPython | PodcastTheWebAhead | ResourceCodecademy | ResourceCodeWars | ResourceCoursera | ResourceCSS | ResourceEdX | ResourceEgghead | ResourceFCC | ResourceHackerRank | ResourceKA | ResourceLynda | ResourceMDN | ResourceOdinProj | ResourceOther | ResourcePluralSight | ResourceSkillcrush | ResourceSO | ResourceTreehouse | ResourceUdacity | ResourceUdemy | ResourceW3S | SchoolDegree | SchoolMajor | StudentDebtOwe | YouTubeCodeCourse | YouTubeCodingTrain | YouTubeCodingTut360 | YouTubeComputerphile | YouTubeDerekBanas | YouTubeDevTips | YouTubeEngineeredTruth | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 to 29 minutes | Canada | Canada | software development and IT | NaN | Employed for wages | NaN | NaN | NaN | NaN | female | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | 15.0 | 02d9465b21e8bd09374b0066fb2d5614 | eb78c1c3ac6cd9052aec557065070fbf | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | start your own business | NaN | NaN | NaN | English | married or domestic partnership | 150.0 | 6.0 | 6f1fbc6b2b | 2017-03-09 00:36:22 | 2017-03-09 00:32:59 | 2017-03-09 00:59:46 | 2017-03-09 00:36:26 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | United States of America | United States of America | NaN | NaN | Not working but looking for work | NaN | 35000.0 | NaN | NaN | male | NaN | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | 10.0 | 5bfef9ecb211ec4f518cfc1d2a6f3e0c | 21db37adb60cdcafadfa7dca1b13b6b1 | NaN | 0.0 | 0.0 | 0.0 | NaN | Within 7 to 12 months | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | work for a nonprofit | 1.0 | Full-Stack Web Developer | in an office with other developers | English | single, never married | 80.0 | 6.0 | f8f8be6910 | 2017-03-09 00:37:07 | 2017-03-09 00:33:26 | 2017-03-09 00:38:59 | 2017-03-09 00:37:10 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | 1.0 | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15 to 29 minutes | United States of America | United States of America | software development and IT | NaN | Employed for wages | NaN | 70000.0 | NaN | NaN | male | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN | 25.0 | 14f1863afa9c7de488050b82eb3edd96 | 21ba173828fbe9e27ccebaf4d5166a55 | 13000.0 | 1.0 | 0.0 | 0.0 | 0.0 | Within 7 to 12 months | 1.0 | NaN | NaN | 1.0 | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | work for a medium-sized company | 1.0 | Front-End Web Developer, Back-End Web Develo... | no preference | Spanish | single, never married | 1000.0 | 5.0 | 2ed189768e | 2017-03-09 00:37:58 | 2017-03-09 00:33:53 | 2017-03-09 00:40:14 | 2017-03-09 00:38:02 | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | Codenewbie | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | high school diploma or equivalent (GED) | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | I work from home | Brazil | Brazil | software development and IT | NaN | Employed for wages | NaN | 40000.0 | 0.0 | NaN | male | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 40000.0 | 14.0 | 91756eb4dc280062a541c25a3d44cfb0 | 3be37b558f02daae93a6da10f83f0c77 | 24000.0 | 0.0 | 0.0 | 0.0 | 1.0 | Within the next 6 months | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | work for a medium-sized company | NaN | Front-End Web Developer, Full-Stack Web Deve... | from home | Portuguese | married or domestic partnership | 0.0 | 5.0 | dbdc0664d1 | 2017-03-09 00:40:13 | 2017-03-09 00:37:45 | 2017-03-09 00:42:26 | 2017-03-09 00:40:18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | some college credit, no degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Portugal | Portugal | NaN | NaN | Not working but looking for work | NaN | 140000.0 | NaN | NaN | female | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | NaN | NaN | 10.0 | aa3f061a1949a90b27bef7411ecd193f | d7c56bbf2c7b62096be9db010e86d96d | NaN | 0.0 | 0.0 | 0.0 | NaN | Within 7 to 12 months | 1.0 | NaN | NaN | NaN | 1.0 | 1.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | work for a multinational corporation | 1.0 | Full-Stack Web Developer, Information Security... | in an office with other developers | Portuguese | single, never married | 0.0 | 24.0 | 11b0f2d8a9 | 2017-03-09 00:42:45 | 2017-03-09 00:39:44 | 2017-03-09 00:45:42 | 2017-03-09 00:42:50 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | bachelor's degree | Information Technology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
The 'JobRoleInterest' field is of particular interest. As the courses that are sold relate to Mobile / Web. We will also want to investigate the 'CountryLive' field as this indicates the market to advertise in. Finally the 'MoneyForLearning' field gives me an indication of the respondents budget.
Lets look at the 'JobRoleInterest' field.
new_coder_survey['JobRoleInterest'].head(10)
0 NaN 1 Full-Stack Web Developer 2 Front-End Web Developer, Back-End Web Develo... 3 Front-End Web Developer, Full-Stack Web Deve... 4 Full-Stack Web Developer, Information Security... 5 NaN 6 Full-Stack Web Developer 7 NaN 8 NaN 9 Full-Stack Web Developer, Quality Assurance ... Name: JobRoleInterest, dtype: object
We can see from the above table that there are null values. I will replace these to make analysis of this column easier.
new_coder_survey['JobRoleInterest'] = new_coder_survey['JobRoleInterest'].fillna('Empty')
new_coder_survey['JobRoleInterest'].value_counts(normalize=True)*100
Empty 61.529574 Full-Stack Web Developer 4.528198 Front-End Web Developer 2.475928 Data Scientist 0.836314 Back-End Web Developer 0.781293 ... Data Scientist, Mobile Developer, Information Security 0.005502 Front-End Web Developer, Back-End Web Developer, Game Developer, Full-Stack Web Developer, Data Scientist 0.005502 DevOps / SysAdmin, Front-End Web Developer, Full-Stack Web Developer, Game Developer, Mobile Developer, Back-End Web Developer, Data Engineer, User Experience Designer 0.005502 Front-End Web Developer, Mobile Developer, Game Developer, Full-Stack Web Developer, User Experience Designer, Back-End Web Developer 0.005502 Front-End Web Developer, Back-End Web Developer, Data Scientist, Full-Stack Web Developer, Data Engineer, Game Developer, Mobile Developer 0.005502 Name: JobRoleInterest, Length: 3214, dtype: float64
%matplotlib inline
import matplotlib.pyplot as plt
new_coder_survey['JobRoleInterest'].value_counts().head(5).plot.bar()
plt.title('Top 5 Job Role Interests')
plt.ylabel('Number of Responses')
Text(0, 0.5, 'Number of Responses')
From the table and chart above we can see that over 61% of respondents left the 'JobRoleInterest' field empty. Ideally I would want this field to have been completed by everyone. This does impact the conclusion that can be drawn from this survey. I will continue the analysis with the above caveat in mind.
I will proceed to investigate the relevance of the completed 'JobRoleInterest' by counting refernces to 'Web' and 'Mobile'.
webmobile = {'web':0,'mobile':0}
import re
for course in new_coder_survey['JobRoleInterest']:
if re.search('[Ww]eb', course):
webmobile['web'] += 1
elif re.search('[Mm]obile', course):
webmobile['mobile'] += 1
wm = pd.DataFrame.from_dict(data=webmobile, orient='index', columns=['total'])
wm
total | |
---|---|
web | 5781 |
mobile | 254 |
wm['total'].plot.bar()
plt.title('Interest in Web or Mobile')
plt.ylabel('Number of Responses')
Text(0, 0.5, 'Number of Responses')
From the chart above we can see that approximately 5781 people expressed an in interest in web with 254 interested in mobile.
Lets look at the 'CountryLive' field. I will exclude the 'JobRoleInterest' 'Empty' rows as we are only interested in rows that have a value.
new_coder_survey_new=new_coder_survey[new_coder_survey['JobRoleInterest'] != 'Empty']
new_coder_survey_new['CountryLive'].value_counts()
United States of America 3125 India 528 United Kingdom 315 Canada 260 Poland 131 ... Botswana 1 Cayman Islands 1 Guatemala 1 Turkmenistan 1 Anguilla 1 Name: CountryLive, Length: 137, dtype: int64
new_coder_survey_new['CountryLive'].value_counts(normalize=True)*100
United States of America 45.700497 India 7.721556 United Kingdom 4.606610 Canada 3.802281 Poland 1.915765 ... Botswana 0.014624 Cayman Islands 0.014624 Guatemala 0.014624 Turkmenistan 0.014624 Anguilla 0.014624 Name: CountryLive, Length: 137, dtype: float64
From looking at both absolute and relative frequency tables, it would suggest focusing on USA(45%, 3125) and India(7%, 528).
Next lets calculate the monthly spend of respondents in each country. To do this I will create a new column 'MoneySpentPerMonth' which derived is from dividing 'MoneyForLearning' by 'MonthsProgramming'.
new_coder_survey_new['MonthsProgramming'].head(10)
1 6.0 2 5.0 3 5.0 4 24.0 6 12.0 9 18.0 11 1.0 13 NaN 14 9.0 15 1.0 Name: MonthsProgramming, dtype: float64
new_coder_survey_new['MonthsProgramming'].min()
0.0
We can see from above that 'MonthsProgramming' has null values and a minimum value of 0. To avoid infinity in 'MoneySpentPerMonth' I will replace the null values and 0 with 1.
#Avoids infinity
new_coder_survey_new['MonthsProgramming']=new_coder_survey_new['MonthsProgramming'].apply(lambda x: x+1 if x==0 else x)
new_coder_survey_new['MonthsProgramming']=new_coder_survey_new['MonthsProgramming'].fillna(value=1)
C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy This is separate from the ipykernel package so we can avoid doing imports until
new_coder_survey_new['MonthsProgramming'].head(10)
1 6.0 2 5.0 3 5.0 4 24.0 6 12.0 9 18.0 11 1.0 13 1.0 14 9.0 15 1.0 Name: MonthsProgramming, dtype: float64
new_coder_survey_new['MonthsProgramming'].min()
1.0
Repeat process for 'MoneyForLearning'
new_coder_survey_new['MoneyForLearning'].isna().sum()
516
new_coder_survey_new['MoneyForLearning'].min()
0.0
#Avoids infinity
new_coder_survey_new['MoneyForLearning']=new_coder_survey_new['MoneyForLearning'].apply(lambda x: x+1 if x==0 else x)
new_coder_survey_new['MoneyForLearning']=new_coder_survey_new['MoneyForLearning'].fillna(value=1)
C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy This is separate from the ipykernel package so we can avoid doing imports until
new_coder_survey_new['MoneyForLearning'].isna().sum()
0
new_coder_survey_new['MoneyForLearning'].min()
1.0
I drop na rows on 'CountryLive' as I will be grouping by it.
new_coder_survey_new['CountryLive'].isna().sum()
154
new_coder_survey_new = new_coder_survey_new.dropna(subset=['CountryLive'])
new_coder_survey_new['CountryLive'].isna().sum()
0
Calculate Mean, Mode, Median for 'MoneySpentPerMonth'
#Create new column
new_coder_survey_new['MoneySpentPerMonth'] = new_coder_survey_new['MoneyForLearning'] / new_coder_survey_new['MonthsProgramming']
new_coder_survey_new['MoneySpentPerMonth'].isna().sum()
C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy """Entry point for launching an IPython kernel.
0
#Groupby mean average
mean_avg = new_coder_survey_new.groupby(by=['CountryLive']).mean()
mean_avg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
CountryLive United States of America 228.478565 India 119.271354 United Kingdom 40.572292 Canada 124.294036 Name: MoneySpentPerMonth, dtype: float64
#Groupby mode average
avg_mode =new_coder_survey_new.groupby(['CountryLive'])['MoneySpentPerMonth'].agg(lambda x: pd.Series.mode(x)[0]).to_frame()
avg_mode['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
CountryLive United States of America 1.0 India 1.0 United Kingdom 1.0 Canada 1.0 Name: MoneySpentPerMonth, dtype: float64
#Groupby median average
median_avg = new_coder_survey_new.groupby(by=['CountryLive']).median()
median_avg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
CountryLive United States of America 2.083333 India 0.500000 United Kingdom 1.000000 Canada 1.000000 Name: MoneySpentPerMonth, dtype: float64
The mean, mode and median figures are off as the majority of the data is incomplete. Also I had to be replace 0 values with '1'.
I will now boxplot these findings to see the impact of outliers.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
uiuc = new_coder_survey_new[new_coder_survey_new['CountryLive'].str.contains('United States of America|India|United Kingdom|Canada')]
sns.boxplot(y='MoneySpentPerMonth', x='CountryLive', data = uiuc)
plt.title('Boxplot to Identify Outliers by Country for "Money Spent Per Month"')
plt.xticks(range(4),['US','UK','India','Canada'])
plt.show()
From the looking at the chart above I will take any spend above 5000 to be an outlier.
remove_outliers2 = new_coder_survey_new[new_coder_survey_new['MoneySpentPerMonth'] < 5000]
ro2_mavg = remove_outliers2.groupby(['CountryLive']).mean()
ro2_mavg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
CountryLive United States of America 130.743830 India 72.471899 United Kingdom 40.572292 Canada 86.497866 Name: MoneySpentPerMonth, dtype: float64
Once the outliers equal to or above 5000 have been removed we can see 2nd place in Monthly Spend is Canada. But not by too much.
I would like compare the mean average monthly spend to the sample size of that country.
summary_df = pd.DataFrame()
summary_df['MeanMoneySpentPerMonth']= ro2_mavg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
summary_df['TotalPersonsPerCountry']= remove_outliers2['CountryLive'].value_counts().iloc[0:4]
summary_df
MeanMoneySpentPerMonth | TotalPersonsPerCountry | |
---|---|---|
CountryLive | ||
United States of America | 130.743830 | 3104 |
India | 72.471899 | 524 |
United Kingdom | 40.572292 | 315 |
Canada | 86.497866 | 258 |
From the table above we can see that the top two markets are USA and India. Canada has a higher mean monthly spend but from this sample its customer size is half that of india. The UK does not pass our monthly spend threshold of 59.
I would like to drill down further and include the mode monthly spend and its frequency for each country to see if this affects the result.
def counting_mode(array, country):
count_mode = 0
for a in array:
if a == avg_mode['MoneySpentPerMonth'][country]:
count_mode += 1
return count_mode
country_list=['United States of America','India','United Kingdom','Canada']
country_mode_frequency= []
for b in country_list:
data = remove_outliers2['MoneySpentPerMonth'][remove_outliers2['CountryLive'] == b]
country_mode_frequency.append(counting_mode(data,b))
country_mode_frequency
[445, 63, 54, 34]
summary_df['CountryMode'] = avg_mode['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
summary_df['CountryModeFrequency'] = country_mode_frequency
summary_df
MeanMoneySpentPerMonth | TotalPersonsPerCountry | CountryMode | CountryModeFrequency | |
---|---|---|---|---|
CountryLive | ||||
United States of America | 130.743830 | 3104 | 1.0 | 445 |
India | 72.471899 | 524 | 1.0 | 63 |
United Kingdom | 40.572292 | 315 | 1.0 | 54 |
Canada | 86.497866 | 258 | 1.0 | 34 |
From the table above the Mode monthly spend is 1 for all countries. The frequency of the mode is approximately 10% of the sample population for the top 3 countries. So the findings would stand.
For the sample the top 2 markets are USA and India. There is a strong case for focusing solely in the USA as their sample size is seven times that of India and has almost double the monthly average spend. If equal opportunity is part of the company mission statement or values then India and Canada should be included. UK does not hit the monthly spend requirment.
Before any actual spend on advertizing I would stronglty recommend further analysis to marketing for the following reasons: