Finding the Best Markets to Advertise In¶

This project is being carried out on behalf of an an e-learning company that offers courses on programming. Most of their courses are on web and mobile development, but also cover many other domains, like data science, game development, etc. They want to promote their product and would like to invest some money in advertising.

Our goal in this project is to find out the two best markets to advertise their product in.

The Approach¶

To identify which would be the best markets to advertise in I could :

Organize surveys in different markets.
Analyse an existing dataset.

Option 2 has been chosen as it is a lot less expensive.

We can try to search existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.

The survey data is publicly available in this GitHub repository. https://github.com/freeCodeCamp/2017-new-coder-survey

Reading in The File and Preliminary Analysis¶

In [1]:

import pandas as pd

pd.options.display.max_columns=200

new_coder_survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')

new_coder_survey.shape

C:\Users\gavin\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (17,62) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Out[1]:

(18175, 136)

In [2]:

new_coder_survey.head()

Out[2]:

	Age	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventConferences	CodeEventDjangoGirls	CodeEventFCC	CodeEventGameJam	CodeEventGirlDev	CodeEventHackathons	CodeEventMeetup	CodeEventNodeSchool	CodeEventNone	CodeEventOther	CodeEventRailsBridge	CodeEventRailsGirls	CodeEventStartUpWknd	CodeEventWkdBootcamps	CodeEventWomenCode	CodeEventWorkshops	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentFieldOther	EmploymentStatus	EmploymentStatusOther	ExpectedEarning	FinanciallySupporting	FirstDevJob	Gender	GenderOther	HasChildren	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasHomeMortgage	HasStudentDebt	HomeMortgageOwe	HoursLearning	ID.x	ID.y	Income	IsEthnicMinority	IsUnderEmployed	JobApplyWhen	JobInterestBackEnd	JobInterestDataEngr	JobInterestDataSci	JobInterestDevOps	JobInterestFrontEnd	JobInterestFullStack	JobInterestGameDev	JobInterestInfoSec	JobInterestMobile	JobInterestOther	JobInterestProjMngr	JobInterestQAEngr	JobInterestUX	JobPref	JobRelocateYesNo	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	NetworkID	Part1EndTime	Part1StartTime	Part2EndTime	Part2StartTime	PodcastChangeLog	PodcastCodeNewbie	PodcastCodePen	PodcastDevTea	PodcastDotNET	PodcastGiantRobots	PodcastJSAir	PodcastJSJabber	PodcastNone	PodcastOther	PodcastProgThrowdown	PodcastRubyRogues	PodcastSEDaily	PodcastSERadio	PodcastShopTalk	PodcastTalkPython	PodcastTheWebAhead	ResourceCodecademy	ResourceCodeWars	ResourceCoursera	ResourceCSS	ResourceEdX	ResourceEgghead	ResourceFCC	ResourceHackerRank	ResourceKA	ResourceLynda	ResourceMDN	ResourceOdinProj	ResourceOther	ResourcePluralSight	ResourceSkillcrush	ResourceSO	ResourceTreehouse	ResourceUdacity	ResourceUdemy	ResourceW3S	SchoolDegree	SchoolMajor	StudentDebtOwe	YouTubeCodeCourse	YouTubeCodingTrain	YouTubeCodingTut360	YouTubeComputerphile	YouTubeDerekBanas	YouTubeDevTips	YouTubeEngineeredTruth	YouTubeFCC	YouTubeFunFunFunction	YouTubeGoogleDev	YouTubeLearnCode	YouTubeLevelUpTuts	YouTubeMIT	YouTubeMozillaHacks	YouTubeOther	YouTubeSimplilearn	YouTubeTheNewBoston
0	27.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15 to 29 minutes	Canada	Canada	software development and IT	NaN	Employed for wages	NaN	NaN	NaN	NaN	female	NaN	NaN	1.0	0.0	1.0	0.0	0.0	NaN	15.0	02d9465b21e8bd09374b0066fb2d5614	eb78c1c3ac6cd9052aec557065070fbf	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	start your own business	NaN	NaN	NaN	English	married or domestic partnership	150.0	6.0	6f1fbc6b2b	2017-03-09 00:36:22	2017-03-09 00:32:59	2017-03-09 00:59:46	2017-03-09 00:36:26	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	34.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	United States of America	United States of America	NaN	NaN	Not working but looking for work	NaN	35000.0	NaN	NaN	male	NaN	NaN	1.0	0.0	1.0	0.0	1.0	NaN	10.0	5bfef9ecb211ec4f518cfc1d2a6f3e0c	21db37adb60cdcafadfa7dca1b13b6b1	NaN	0.0	NaN	Within 7 to 12 months	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a nonprofit	1.0	Full-Stack Web Developer	in an office with other developers	English	single, never married	80.0	6.0	f8f8be6910	2017-03-09 00:37:07	2017-03-09 00:33:26	2017-03-09 00:38:59	2017-03-09 00:37:10	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	1.0	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	21.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	NaN	NaN	NaN	1.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15 to 29 minutes	United States of America	United States of America	software development and IT	NaN	Employed for wages	NaN	70000.0	NaN	NaN	male	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	25.0	14f1863afa9c7de488050b82eb3edd96	21ba173828fbe9e27ccebaf4d5166a55	13000.0	1.0	0.0	Within 7 to 12 months	1.0	NaN	NaN	1.0	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	work for a medium-sized company	1.0	Front-End Web Developer, Back-End Web Develo...	no preference	Spanish	single, never married	1000.0	5.0	2ed189768e	2017-03-09 00:37:58	2017-03-09 00:33:53	2017-03-09 00:40:14	2017-03-09 00:38:02	1.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Codenewbie	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	NaN	high school diploma or equivalent (GED)	NaN	NaN	NaN	NaN	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN
3	26.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	I work from home	Brazil	Brazil	software development and IT	NaN	Employed for wages	NaN	40000.0	0.0	NaN	male	NaN	0.0	1.0	1.0	1.0	1.0	0.0	40000.0	14.0	91756eb4dc280062a541c25a3d44cfb0	3be37b558f02daae93a6da10f83f0c77	24000.0	0.0	1.0	Within the next 6 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a medium-sized company	NaN	Front-End Web Developer, Full-Stack Web Deve...	from home	Portuguese	married or domestic partnership	0.0	5.0	dbdc0664d1	2017-03-09 00:40:13	2017-03-09 00:37:45	2017-03-09 00:42:26	2017-03-09 00:40:18	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN
4	20.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Portugal	Portugal	NaN	NaN	Not working but looking for work	NaN	140000.0	NaN	NaN	female	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	10.0	aa3f061a1949a90b27bef7411ecd193f	d7c56bbf2c7b62096be9db010e86d96d	NaN	0.0	NaN	Within 7 to 12 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	work for a multinational corporation	1.0	Full-Stack Web Developer, Information Security...	in an office with other developers	Portuguese	single, never married	0.0	24.0	11b0f2d8a9	2017-03-09 00:42:45	2017-03-09 00:39:44	2017-03-09 00:45:42	2017-03-09 00:42:50	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	bachelor's degree	Information Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

The 'JobRoleInterest' field is of particular interest. As the courses that are sold relate to Mobile / Web. We will also want to investigate the 'CountryLive' field as this indicates the market to advertise in. Finally the 'MoneyForLearning' field gives me an indication of the respondents budget.

Determine whether Web or Mobile is in demand ?¶

Lets look at the 'JobRoleInterest' field.

In [3]:

new_coder_survey['JobRoleInterest'].head(10)

Out[3]:

0                                                  NaN
1                             Full-Stack Web Developer
2      Front-End Web Developer, Back-End Web Develo...
3      Front-End Web Developer, Full-Stack Web Deve...
4    Full-Stack Web Developer, Information Security...
5                                                  NaN
6                             Full-Stack Web Developer
7                                                  NaN
8                                                  NaN
9    Full-Stack Web Developer,   Quality Assurance ...
Name: JobRoleInterest, dtype: object

We can see from the above table that there are null values. I will replace these to make analysis of this column easier.

In [4]:

new_coder_survey['JobRoleInterest'] = new_coder_survey['JobRoleInterest'].fillna('Empty')

In [5]:

new_coder_survey['JobRoleInterest'].value_counts(normalize=True)*100

Out[5]:

Empty                                                                                                                                                                              61.529574
Full-Stack Web Developer                                                                                                                                                            4.528198
  Front-End Web Developer                                                                                                                                                           2.475928
  Data Scientist                                                                                                                                                                    0.836314
Back-End Web Developer                                                                                                                                                              0.781293
                                                                                                                                                                                     ...    
  Data Scientist,   Mobile Developer, Information Security                                                                                                                          0.005502
  Front-End Web Developer, Back-End Web Developer, Game Developer, Full-Stack Web Developer,   Data Scientist                                                                       0.005502
  DevOps / SysAdmin,   Front-End Web Developer, Full-Stack Web Developer, Game Developer,   Mobile Developer, Back-End Web Developer, Data Engineer,   User Experience Designer     0.005502
  Front-End Web Developer,   Mobile Developer, Game Developer, Full-Stack Web Developer,   User Experience Designer, Back-End Web Developer                                         0.005502
  Front-End Web Developer, Back-End Web Developer,   Data Scientist, Full-Stack Web Developer, Data Engineer, Game Developer,   Mobile Developer                                    0.005502
Name: JobRoleInterest, Length: 3214, dtype: float64

In [6]:

%matplotlib inline
import matplotlib.pyplot as plt

new_coder_survey['JobRoleInterest'].value_counts().head(5).plot.bar()
plt.title('Top 5 Job Role Interests')
plt.ylabel('Number of Responses')

Out[6]:

Text(0, 0.5, 'Number of Responses')

From the table and chart above we can see that over 61% of respondents left the 'JobRoleInterest' field empty. Ideally I would want this field to have been completed by everyone. This does impact the conclusion that can be drawn from this survey. I will continue the analysis with the above caveat in mind.

I will proceed to investigate the relevance of the completed 'JobRoleInterest' by counting refernces to 'Web' and 'Mobile'.

In [7]:

webmobile = {'web':0,'mobile':0}

import re

for course in new_coder_survey['JobRoleInterest']:
    if re.search('[Ww]eb', course):
        webmobile['web'] += 1
      
    elif re.search('[Mm]obile', course):
        webmobile['mobile'] += 1
   

wm = pd.DataFrame.from_dict(data=webmobile, orient='index', columns=['total'])
wm

Out[7]:

	total
web	5781
mobile	254

In [8]:

wm['total'].plot.bar()
plt.title('Interest in Web or Mobile')
plt.ylabel('Number of Responses')

Out[8]:

Text(0, 0.5, 'Number of Responses')

From the chart above we can see that approximately 5781 people expressed an in interest in web with 254 interested in mobile.

Determine the Location of the Respondents?¶

Lets look at the 'CountryLive' field. I will exclude the 'JobRoleInterest' 'Empty' rows as we are only interested in rows that have a value.

In [9]:

new_coder_survey_new=new_coder_survey[new_coder_survey['JobRoleInterest'] != 'Empty']

new_coder_survey_new['CountryLive'].value_counts()

Out[9]:

United States of America    3125
India                        528
United Kingdom               315
Canada                       260
Poland                       131
                            ... 
Botswana                       1
Cayman Islands                 1
Guatemala                      1
Turkmenistan                   1
Anguilla                       1
Name: CountryLive, Length: 137, dtype: int64

In [10]:

new_coder_survey_new['CountryLive'].value_counts(normalize=True)*100

Out[10]:

United States of America    45.700497
India                        7.721556
United Kingdom               4.606610
Canada                       3.802281
Poland                       1.915765
                              ...    
Botswana                     0.014624
Cayman Islands               0.014624
Guatemala                    0.014624
Turkmenistan                 0.014624
Anguilla                     0.014624
Name: CountryLive, Length: 137, dtype: float64

From looking at both absolute and relative frequency tables, it would suggest focusing on USA(45%, 3125) and India(7%, 528).

Determine the Value of Respondents¶

Next lets calculate the monthly spend of respondents in each country. To do this I will create a new column 'MoneySpentPerMonth' which derived is from dividing 'MoneyForLearning' by 'MonthsProgramming'.

In [11]:

new_coder_survey_new['MonthsProgramming'].head(10)

Out[11]:

1      6.0
2      5.0
3      5.0
4     24.0
6     12.0
9     18.0
11     1.0
13     NaN
14     9.0
15     1.0
Name: MonthsProgramming, dtype: float64

In [12]:

new_coder_survey_new['MonthsProgramming'].min()

Out[12]:

0.0

We can see from above that 'MonthsProgramming' has null values and a minimum value of 0. To avoid infinity in 'MoneySpentPerMonth' I will replace the null values and 0 with 1.

In [24]:

#Avoids infinity
new_coder_survey_new['MonthsProgramming']=new_coder_survey_new['MonthsProgramming'].apply(lambda x: x+1 if x==0 else x)
new_coder_survey_new['MonthsProgramming']=new_coder_survey_new['MonthsProgramming'].fillna(value=1)

C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

In [25]:

new_coder_survey_new['MonthsProgramming'].head(10)

Out[25]:

1      6.0
2      5.0
3      5.0
4     24.0
6     12.0
9     18.0
11     1.0
13     1.0
14     9.0
15     1.0
Name: MonthsProgramming, dtype: float64

In [26]:

new_coder_survey_new['MonthsProgramming'].min()

Out[26]:

1.0

Repeat process for 'MoneyForLearning'

In [27]:

new_coder_survey_new['MoneyForLearning'].isna().sum()

Out[27]:

In [28]:

new_coder_survey_new['MoneyForLearning'].min()

Out[28]:

0.0

In [29]:

#Avoids infinity
new_coder_survey_new['MoneyForLearning']=new_coder_survey_new['MoneyForLearning'].apply(lambda x: x+1 if x==0 else x)
new_coder_survey_new['MoneyForLearning']=new_coder_survey_new['MoneyForLearning'].fillna(value=1)

C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

In [30]:

new_coder_survey_new['MoneyForLearning'].isna().sum()

Out[30]:

In [31]:

new_coder_survey_new['MoneyForLearning'].min()

Out[31]:

1.0

I drop na rows on 'CountryLive' as I will be grouping by it.

In [32]:

new_coder_survey_new['CountryLive'].isna().sum()

Out[32]:

In [33]:

new_coder_survey_new = new_coder_survey_new.dropna(subset=['CountryLive'])
new_coder_survey_new['CountryLive'].isna().sum()

Out[33]:

Calculate Mean, Mode, Median for 'MoneySpentPerMonth'

In [34]:

#Create new column
new_coder_survey_new['MoneySpentPerMonth'] = new_coder_survey_new['MoneyForLearning'] / new_coder_survey_new['MonthsProgramming']
new_coder_survey_new['MoneySpentPerMonth'].isna().sum()

C:\Users\gavin\anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

Out[34]:

In [35]:

#Groupby mean average
mean_avg = new_coder_survey_new.groupby(by=['CountryLive']).mean()
mean_avg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]

Out[35]:

CountryLive
United States of America    228.478565
India                       119.271354
United Kingdom               40.572292
Canada                      124.294036
Name: MoneySpentPerMonth, dtype: float64

In [36]:

#Groupby mode average
avg_mode =new_coder_survey_new.groupby(['CountryLive'])['MoneySpentPerMonth'].agg(lambda x: pd.Series.mode(x)[0]).to_frame()
avg_mode['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]

Out[36]:

CountryLive
United States of America    1.0
India                       1.0
United Kingdom              1.0
Canada                      1.0
Name: MoneySpentPerMonth, dtype: float64

In [37]:

#Groupby median average
median_avg = new_coder_survey_new.groupby(by=['CountryLive']).median()
median_avg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]

Out[37]:

CountryLive
United States of America    2.083333
India                       0.500000
United Kingdom              1.000000
Canada                      1.000000
Name: MoneySpentPerMonth, dtype: float64

The mean, mode and median figures are off as the majority of the data is incomplete. Also I had to be replace 0 values with '1'.

Identifying and Removing Outliers¶

I will now boxplot these findings to see the impact of outliers.

In [41]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

uiuc = new_coder_survey_new[new_coder_survey_new['CountryLive'].str.contains('United States of America|India|United Kingdom|Canada')]

sns.boxplot(y='MoneySpentPerMonth', x='CountryLive', data = uiuc)
plt.title('Boxplot to Identify Outliers by Country for "Money Spent Per Month"')
plt.xticks(range(4),['US','UK','India','Canada'])


plt.show()

From the looking at the chart above I will take any spend above 5000 to be an outlier.

In [42]:

remove_outliers2 = new_coder_survey_new[new_coder_survey_new['MoneySpentPerMonth'] < 5000]
ro2_mavg = remove_outliers2.groupby(['CountryLive']).mean()
ro2_mavg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]

Out[42]:

CountryLive
United States of America    130.743830
India                        72.471899
United Kingdom               40.572292
Canada                       86.497866
Name: MoneySpentPerMonth, dtype: float64

Once the outliers equal to or above 5000 have been removed we can see 2nd place in Monthly Spend is Canada. But not by too much.

I would like compare the mean average monthly spend to the sample size of that country.

In [44]:

summary_df = pd.DataFrame()

summary_df['MeanMoneySpentPerMonth']= ro2_mavg['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
summary_df['TotalPersonsPerCountry']= remove_outliers2['CountryLive'].value_counts().iloc[0:4]
summary_df

Out[44]:

	MeanMoneySpentPerMonth	TotalPersonsPerCountry
CountryLive
United States of America	130.743830	3104
India	72.471899	524
United Kingdom	40.572292	315
Canada	86.497866	258

From the table above we can see that the top two markets are USA and India. Canada has a higher mean monthly spend but from this sample its customer size is half that of india. The UK does not pass our monthly spend threshold of 59.

I would like to drill down further and include the mode monthly spend and its frequency for each country to see if this affects the result.

In [45]:

def counting_mode(array, country):
    count_mode = 0
    for a in array:
        if a == avg_mode['MoneySpentPerMonth'][country]:
            count_mode += 1
    return count_mode

country_list=['United States of America','India','United Kingdom','Canada']

country_mode_frequency= []

for b in country_list:
    data = remove_outliers2['MoneySpentPerMonth'][remove_outliers2['CountryLive'] == b]
    country_mode_frequency.append(counting_mode(data,b))
    
country_mode_frequency
    

Out[45]:

[445, 63, 54, 34]

In [46]:

summary_df['CountryMode'] = avg_mode['MoneySpentPerMonth'][['United States of America','India','United Kingdom','Canada']]
summary_df['CountryModeFrequency'] = country_mode_frequency

summary_df

Out[46]:

	MeanMoneySpentPerMonth	TotalPersonsPerCountry	CountryMode	CountryModeFrequency
CountryLive
United States of America	130.743830	3104	1.0	445
India	72.471899	524	1.0	63
United Kingdom	40.572292	315	1.0	54
Canada	86.497866	258	1.0	34

From the table above the Mode monthly spend is 1 for all countries. The frequency of the mode is approximately 10% of the sample population for the top 3 countries. So the findings would stand.

Conclusion and Caveats:¶

For the sample the top 2 markets are USA and India. There is a strong case for focusing solely in the USA as their sample size is seven times that of India and has almost double the monthly average spend. If equal opportunity is part of the company mission statement or values then India and Canada should be included. UK does not hit the monthly spend requirment.

Before any actual spend on advertizing I would stronglty recommend further analysis to marketing for the following reasons:

The Data was not representative of our requirements. Out of 18000+ rows only approx 6000 related to web and mobile.
The Data was incomplete. Estimates had to be used.