As a programming e-learning company, we want to advertise our services to aspiring developers. In this project, we are interested in finding the best two countries to do so. Note that these two countries must be English-speaking, as all of our content is in English.
One way to figure out where to advertise would be to organize a survey. However, this could be costly. Luckily, the e-learning platform freeCodeCamp made available the results of its survey of new coders, the New Coder Survey. We will use this as a starting point.
#importing useful libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#reading the data into a pandas DataFrame
new_coders = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv')
/dataquest/system/env/python3/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (17,62) have mixed types. Specify dtype option on import or set low_memory=False.
Apparently, some columns have mixed data types, which is strange. We'll keep that in mind; for now, let's do some basic exploration of this dataset.
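If we wanted to silence this warning, read_csv offers two standard options. Here is a small self-contained sketch using an in-memory CSV (the column names are toy ones, not from the survey):

```python
import io

import pandas as pd

# A tiny in-memory CSV whose column 'b' mixes numbers and strings,
# mimicking CodeEventOther / JobInterestOther in the survey file.
csv_text = "a,b\n1,x\n2,3\n"

# Option 1: low_memory=False reads the whole file before inferring dtypes,
# so pandas no longer infers per-chunk and the DtypeWarning disappears.
df1 = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: pin the offending column's dtype explicitly.
df2 = pd.read_csv(io.StringIO(csv_text), dtype={'b': str})
print(df2['b'].tolist())  # both values come back as strings
```

Either option would work on the survey file; since the warning is harmless here, we simply leave the import as-is.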
new_coders.shape
(18175, 136)
new_coders.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18175 entries, 0 to 18174 Columns: 136 entries, Age to YouTubeTheNewBoston dtypes: float64(105), object(31) memory usage: 18.9+ MB
That's a lot of columns! We will focus our analysis on only a few of them at a time.
Before we start, let's look at the two columns with mixed types.
new_coders.columns[17]
'CodeEventOther'
new_coders['CodeEventOther'].value_counts(dropna = False).head(10)
NaN 17605 No 21 Ladies Learning Code 9 Bootcamp 8 General Assembly 8 School 8 Na 6 No One 5 Codebar 5 GDG 4 Name: CodeEventOther, dtype: int64
This column contains any coding event the respondent attended that was not in the survey's list. There are a large number of NaN values, as well as strings denoting a coding event. There are also values such as 'No', meaning the respondent attended no event.
Let's look at the other column.
new_coders.columns[62]
'JobInterestOther'
new_coders['JobInterestOther'].value_counts(dropna = False).head(10)
NaN 17909 Undecided 23 Software Engineer 22 Software Developer 9 Data Analyst 6 Machine Learning Engineer 5 Artificial Intelligence 5 Project Manager 4 Programmer 4 Researcher 3 Name: JobInterestOther, dtype: int64
This seems to be a similar issue. In both columns, the mixed types arise because pandas stores NaN as a float, and these floats are mixed with strings.
As long as we do not use any string-specific method on these two columns, this shouldn't be a problem. We will leave it be for now.
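To see concretely why NaN drags a column into mixed types, consider a toy Series (an illustration, not survey data):

```python
import numpy as np
import pandas as pd

# A column holding both strings and NaN gets dtype 'object', and the
# missing values remain floats under the hood (NaN is a float).
s = pd.Series(['Bootcamp', np.nan, 'School'])
print(s.dtype)       # object
print(type(s[1]))    # float -- NaN itself is a float
```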
Let's print the first five rows:
new_coders.head(5)
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 136 columns
Here we can see that there are a lot of missing values. As these should be dealt with on a column-by-column basis, we will not dive deeper into that issue for now.
Let's try to identify which columns we'll be interested in.
for col in new_coders.columns:
    print(col)
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName BootcampRecommend ChildrenNumber CityPopulation CodeEventConferences CodeEventDjangoGirls CodeEventFCC CodeEventGameJam CodeEventGirlDev CodeEventHackathons CodeEventMeetup CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps CodeEventWomenCode CodeEventWorkshops CommuteTime CountryCitizen CountryLive EmploymentField EmploymentFieldOther EmploymentStatus EmploymentStatusOther ExpectedEarning FinanciallySupporting FirstDevJob Gender GenderOther HasChildren HasDebt HasFinancialDependents HasHighSpdInternet HasHomeMortgage HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning ID.x ID.y Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev IsUnderEmployed JobApplyWhen JobInterestBackEnd JobInterestDataEngr JobInterestDataSci JobInterestDevOps JobInterestFrontEnd JobInterestFullStack JobInterestGameDev JobInterestInfoSec JobInterestMobile JobInterestOther JobInterestProjMngr JobInterestQAEngr JobInterestUX JobPref JobRelocateYesNo JobRoleInterest JobWherePref LanguageAtHome MaritalStatus MoneyForLearning MonthsProgramming NetworkID Part1EndTime Part1StartTime Part2EndTime Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk PodcastTalkPython PodcastTheWebAhead ResourceCodecademy ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX ResourceEgghead ResourceFCC ResourceHackerRank ResourceKA ResourceLynda ResourceMDN ResourceOdinProj ResourceOther ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse ResourceUdacity ResourceUdemy ResourceW3S SchoolDegree SchoolMajor StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain YouTubeCodingTut360 YouTubeComputerphile YouTubeDerekBanas YouTubeDevTips 
YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston
We will probably use only a handful of these columns; we'll select them as the analysis requires.
We first need to figure out whether the respondents could be interested in our services. Let us look at the JobRoleInterest column.
new_coders['JobRoleInterest'].value_counts(normalize = True,dropna=False)
NaN 0.615296 Full-Stack Web Developer 0.045282 Front-End Web Developer 0.024759 Data Scientist 0.008363 Back-End Web Developer 0.007813 Mobile Developer 0.006437 Game Developer 0.006272 Information Security 0.005062 Full-Stack Web Developer, Front-End Web Developer 0.003521 Front-End Web Developer, Full-Stack Web Developer 0.003081 Product Manager 0.003026 Data Engineer 0.002916 User Experience Designer 0.002861 User Experience Designer, Front-End Web Developer 0.002366 Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.002146 Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer 0.001981 DevOps / SysAdmin 0.001981 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.001981 Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.001706 Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer 0.001651 Full-Stack Web Developer, Mobile Developer 0.001596 Front-End Web Developer, User Experience Designer 0.001596 Back-End Web Developer, Full-Stack Web Developer 0.001486 Full-Stack Web Developer, Back-End Web Developer 0.001431 Back-End Web Developer, Front-End Web Developer 0.001100 Data Engineer, Data Scientist 0.001045 Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer 0.001045 Front-End Web Developer, Mobile Developer 0.000990 Full-Stack Web Developer, Data Scientist 0.000935 Mobile Developer, Game Developer 0.000880 ... 
Mobile Developer, Data Engineer, Game Developer, Information Security, User Experience Designer 0.000055 Front-End Web Developer, Full-Stack Web Developer, User Experience Designer, Quality Assurance Engineer, Back-End Web Developer 0.000055 Back-End Web Developer, Front-End Web Developer, Data Engineer, Full-Stack Web Developer, Mobile Developer 0.000055 Back-End Web Developer, Data Scientist, Front-End Web Developer, Full-Stack Web Developer, Data Engineer 0.000055 DevOps / SysAdmin, Full-Stack Web Developer, Back-End Web Developer 0.000055 Full-Stack Web Developer, Quality Assurance Engineer, Front-End Web Developer, Back-End Web Developer, Game Developer 0.000055 Back-End Web Developer, Quality Assurance Engineer, Full-Stack Web Developer 0.000055 Full-Stack Web Developer, Back-End Web Developer, Mobile Developer, Game Developer, Data Engineer 0.000055 Back-End Web Developer, Information Security, Game Developer, Full-Stack Web Developer 0.000055 DevOps / SysAdmin, Game Developer, Full-Stack Web Developer, Information Security, Back-End Web Developer 0.000055 Front-End Web Developer, Data Engineer, DevOps / SysAdmin, Back-End Web Developer, Full-Stack Web Developer, Mobile Developer, User Experience Designer, Information Security 0.000055 DevOps / SysAdmin, Information Security, Data Engineer, Game Developer, Mobile Developer, Back-End Web Developer, Data Scientist, Full-Stack Web Developer 0.000055 Quality Assurance Engineer, Front-End Web Developer, User Experience Designer, Game Developer 0.000055 Data Scientist, DevOps / SysAdmin, Mobile Developer, Front-End Web Developer, Quality Assurance Engineer, Information Security, Product Manager, User Experience Designer, Data Engineer 0.000055 Data Scientist, Information Security, Front-End Web Developer, User Experience Designer 0.000055 User Experience Designer, Information Security, Front-End Web Developer, Mobile Developer, Game Developer 0.000055 Full-Stack Web Developer, Front-End Web Developer, Back-End Web 
Developer, Product Manager, Data Engineer 0.000055 Front-End Web Developer, Game Developer, User Experience Designer, Back-End Web Developer, Full-Stack Web Developer 0.000055 Back-End Web Developer, Full-Stack Web Developer, Data Scientist, Front-End Web Developer 0.000055 Full-Stack Web Developer, Game Developer, Back-End Web Developer, User Experience Designer 0.000055 Full-Stack Web Developer, Back-End Web Developer, Mobile Developer, Front-End Web Developer, Data Scientist, User Experience Designer 0.000055 Data Engineer, Game Developer, Data Scientist, Quality Assurance Engineer 0.000055 Front-End Web Developer, Game Developer, Mobile Developer, Back-End Web Developer, Information Security, DevOps / SysAdmin, Data Scientist, Product Manager 0.000055 Back-End Web Developer, User Experience Designer, Data Scientist, Mobile Developer, Full-Stack Web Developer, Front-End Web Developer 0.000055 Full-Stack Web Developer, Back-End Web Developer, Game Developer, Front-End Web Developer, Mobile Developer, User Experience Designer 0.000055 User Experience Designer, Product Manager, Data Scientist, Front-End Web Developer 0.000055 User Experience Designer, Full-Stack Web Developer, Front-End Web Developer, Mobile Developer 0.000055 Full-Stack Web Developer, Front-End Web Developer, Data Scientist, User Experience Designer, Back-End Web Developer 0.000055 Front-End Web Developer, Product Manager, Data Scientist, Information Security, Mobile Developer, Data Engineer, Quality Assurance Engineer 0.000055 DevOps / SysAdmin, Front-End Web Developer, Mobile Developer, Full-Stack Web Developer 0.000055 Name: JobRoleInterest, Length: 3214, dtype: float64
We remark immediately that over 60% of respondents did not answer this question, and that many answers combine several roles.
Let's find out the number of possible values.
new_coders['JobRoleInterest'].unique().shape
(3214,)
That's a lot of unique values. What would be interesting for us to know at this stage is if enough respondents are interested in web and mobile development, as this is the main content our company offers.
new_coders['JobRoleInterest'].str.contains(r'[Ww]eb|[Mm]obile').value_counts()
True 6035 False 957 Name: JobRoleInterest, dtype: int64
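A note on the counts above: str.contains propagates NaN for missing answers, so True and False only cover respondents who actually answered. A toy illustration (the role strings are made up):

```python
import numpy as np
import pandas as pd

# str.contains returns NaN where the input is NaN, so non-respondents are
# excluded from the True/False tally rather than counted as False.
roles = pd.Series(['Full-Stack Web Developer', np.nan,
                   'Mobile Developer', 'Data Scientist'])
mask = roles.str.contains(r'[Ww]eb|[Mm]obile')
print(mask.tolist())  # [True, nan, True, False]
```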
Of the roughly 18,000 respondents, about a third are interested in web or mobile development. Let's make a pie chart to summarize our findings.
#Setting up labels, sizes and colors
labels = ['Did not Respond', 'Web or Mobile dev', 'Other']
sizes = [new_coders[new_coders['JobRoleInterest'].isnull()].shape[0]
,6035
,957]
colors = ['indianred','seagreen','slateblue']
#Drawing the pie
fig1, ax = plt.subplots()
ax.pie(sizes
, labels=labels
, autopct='%1.f%%'
, explode = [0,0.1,0]
, colors = colors)
ax.set_title('Job Role Interests\n', fontsize= 20)
#This makes sure the axes are of equal length
#And thus that our pie is circular
ax.axis('equal')
plt.show()
As this pie chart makes clear, about a third of the respondents are interested in web or mobile development; most of the remainder did not answer the question at all.
From this we conclude we have a sizable (roughly 6000) sample of new coders interested in web or mobile development.
As we are interested in which country to advertise in, we will now determine which country has the most aspiring developers.
As we want to work only with coders who could be interested in our material, we'll drop any row with a missing value in the 'JobRoleInterest' column.
coders_int = new_coders.dropna(subset = ['JobRoleInterest'])
We can look at relative and absolute frequencies for the CountryLive variable. It indicates which country the respondent currently lives in, and is thus better suited to our purpose than CountryCitizen.
#we use value_counts to generate frequencies
#with normalize = True or False
freq_table = pd.concat([coders_int['CountryLive'].value_counts()
,round(coders_int['CountryLive'].value_counts(normalize = True)*100,1)]
,axis = 1)
#renaming columns
freq_table.columns = ['Total Respondents','Percentage']
freq_table.head(10)
Total Respondents | Percentage | |
---|---|---|
United States of America | 3125 | 45.7 |
India | 528 | 7.7 |
United Kingdom | 315 | 4.6 |
Canada | 260 | 3.8 |
Poland | 131 | 1.9 |
Brazil | 129 | 1.9 |
Germany | 125 | 1.8 |
Australia | 112 | 1.6 |
Russia | 102 | 1.5 |
Ukraine | 89 | 1.3 |
From this table, we clearly see that the United States has the largest number of respondents by far. The second largest is India, with roughly six times fewer respondents.
After that, the differences are smaller: the third country is the U.K., with 315 respondents (compared with 528 for India).
At this stage, it is clear that the U.S. will be one of the countries we advertise in. However, India, the U.K., and Canada could all be legitimate second markets. The choice could depend on the ease of running an effective ad campaign, the competition within each country, and/or the potential spending of customers.
We currently have no data on the first and second points, but our database does contain some info on how much coders spend on learning.
Our dataset gives us access to two relevant variables: MoneyForLearning, the amount of money a respondent spent on learning, and MonthsProgramming, the number of months they have spent programming.
We are interested in how much a respondent would pay for our subscription (currently at $59 a month). Hence, we will divide MoneyForLearning by MonthsProgramming. We'll first replace any 0 in MonthsProgramming with a 1, as we do not want to divide by zero.
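A minimal sketch of this division on toy numbers (the values are made up):

```python
import pandas as pd

# Replace 0 months with 1 before dividing, so brand-new coders do not
# produce a division by zero (or an infinite monthly spending).
toy = pd.DataFrame({'MoneyForLearning': [120.0, 0.0, 300.0],
                    'MonthsProgramming': [12.0, 0.0, 3.0]})
toy['MonthsProgramming'] = toy['MonthsProgramming'].replace(0, 1)
toy['MonthlySpending'] = toy['MoneyForLearning'] / toy['MonthsProgramming']
print(toy['MonthlySpending'].tolist())  # [10.0, 0.0, 100.0]
```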
We'll also only consider the U.S., the U.K., India, and Canada. Let's remove every other row.
coders_4 = coders_int[coders_int['CountryLive'].isin(['India'
                                                      ,'United States of America'
                                                      ,'United Kingdom'
                                                      ,'Canada'])].copy()
#.copy() makes coders_4 an independent DataFrame rather than a view,
#which avoids SettingWithCopyWarning when we modify it below
coders_4.reset_index(inplace = True, drop = True)
coders_4.shape
(4228, 136)
This narrows our focus to 4,228 respondents, still a sizable sample. We can now create a column for monthly money spent, which we'll name MonthlySpending. First we need to replace any zero in the MonthsProgramming column with a 1.
coders_4['MonthsProgramming'].replace(to_replace = 0, value = 1, inplace = True)
#Creating the MonthlySpending column
coders_4['MonthlySpending'] = coders_4['MoneyForLearning']/coders_4['MonthsProgramming']
#Removing any missing value
coders_4.dropna(subset = ['MonthlySpending'], inplace = True)
We now have all the data needed, and we can generate a bar plot to illustrate our results.
#this will store the average monthly spending by country
spending_country = coders_4.groupby('CountryLive')['MonthlySpending'].mean()
#generating the plot
fig1, ax = plt.subplots()
colors = ['indianred','seagreen','slateblue','khaki']
#making the bar plot
ax.bar([0,1,2,3]
, spending_country
, align = 'center'
, width = 0.5
, tick_label = spending_country.index
, color = colors)
#rotating labels for readability
plt.xticks(rotation = 20)
#labels and title
ax.set_ylabel('Mean Monthly Spending USD')
ax.set_title('Mean Monthly Spending by Country\n')
<matplotlib.text.Text at 0x7ff0f74c9518>
The mean spending is highest in the U.S. Somewhat unexpectedly, the second highest is India. This can be surprising, as India has a lower GDP per capita than the other three countries.
As the mean is very sensitive to outliers, these might explain our results. To make our findings more robust, we should remove the outliers from our data. To spot them, we can use box plots.
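For reference, box plots flag outliers with the 1.5 × IQR rule. A quick sketch of that rule on made-up numbers (below, we will instead remove outliers by inspecting the plots directly):

```python
import pandas as pd

# Tukey's rule: anything beyond Q3 + 1.5 * IQR (or below Q1 - 1.5 * IQR)
# is flagged as an outlier.
spend = pd.Series([10, 20, 30, 40, 50, 80000])
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
print(spend[spend <= upper].tolist())  # the 80000 value is dropped
```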
sns.boxplot(x = 'CountryLive'
, y = 'MonthlySpending'
,data = coders_4)
plt.xticks(rotation = 20)
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
We can see on this graph that some extreme outliers for the U.S. stretch out the box plot. We can remove them from our dataset.
coders_4 = coders_4[coders_4['MonthlySpending']<20000]
We create the same box plots:
sns.boxplot(x = 'CountryLive'
, y = 'MonthlySpending'
,data = coders_4)
plt.xticks(rotation = 20)
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)
This is still very stretched. Let us remove any respondent who spends more than $6,000 a month.
coders_4 = coders_4[coders_4['MonthlySpending']<6000]
sns.boxplot(x = 'CountryLive'
, y = 'MonthlySpending'
,data = coders_4)
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0fc93d7f0>
Here, we clearly see outliers in India and Canada spending more than $3,000 a month. Let us remove them from the dataset.
coders_4 = coders_4[~((coders_4['MonthlySpending'] > 3000 )&\
((coders_4['CountryLive'] == 'India') |\
(coders_4['CountryLive'] == 'Canada')))]
The remaining high spenders can, after a look at their individual responses, at least partially be explained: some of them took weeks-long online courses, which can be priced in the $10,000 range. Thus, we will leave them in the dataset. We can recompute the mean now.
#this will store the average monthly spending by country
spending_country = coders_4.groupby('CountryLive')['MonthlySpending'].mean()
#generating the plot
fig1, ax = plt.subplots()
colors = ['indianred','seagreen','slateblue','khaki']
#making the bar plot
ax.bar([0,1,2,3]
, spending_country
, align = 'center'
, width = 0.5
, tick_label = spending_country.index
, color = colors)
#rotating labels for readability
plt.xticks(rotation = 20)
#labels and title
ax.set_ylabel('Mean Monthly Spending USD')
ax.set_title('Mean Monthly Spending by Country\n')
<matplotlib.text.Text at 0x7ff0f724ac18>
spending_country
CountryLive Canada 93.065400 India 65.758763 United Kingdom 45.534443 United States of America 142.654608 Name: MonthlySpending, dtype: float64
Based on these means, it would seem that the second country should be Canada. However, we should weigh other factors before making our decision.
In particular, we only sell monthly subscriptions, at 59 dollars a month. Thus, another interesting statistic to look at is the percentage of coders spending 59 dollars or more a month.
#This function takes a series and returns the proportion of values >= 59
def more_than_59_per(s):
    return (s >= 59).sum()/s.shape[0]
#We apply it to the MonthlySpending column, grouped by CountryLive
group = coders_4.groupby(by = 'CountryLive')
more_than_59_data = group['MonthlySpending'].agg(more_than_59_per)*100
more_than_59_data
CountryLive Canada 16.317992 India 15.098468 United Kingdom 15.412186 United States of America 22.294521 Name: MonthlySpending, dtype: float64
#generating the plot
fig1, ax = plt.subplots()
colors = ['indianred','seagreen','slateblue','khaki']
#making the bar plot
ax.bar([0,1,2,3]
, more_than_59_data
, align = 'center'
, width = 0.5
, tick_label = more_than_59_data.index
, color = colors)
#rotating labels for readability
plt.xticks(rotation = 20)
#labels and title
ax.set_ylabel('Percentage')
ax.set_title('Percentage Spending more than 59USD/Month\n')
<matplotlib.text.Text at 0x7ff0f71ddc18>
The U.S. percentage is higher than that of the other three countries, which have very close shares (roughly 15%) of coders spending over 59 dollars a month.
This is where country population may become relevant. Indeed, Canada has a population of 38 million, whereas India has a population of 1.3 billion, roughly 34 times more. If we can reach a similar proportion of the market in Canada and India, India is likely to be more profitable. We could even lower our price in India, to account for the lower mean monthly spending there.
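A back-of-envelope sketch of that argument, using the rough population figures above and the percentages from our table (with one big, explicitly flagged assumption: that the survey shares generalize to the whole population):

```python
# Rough populations and the shares of respondents spending over $59/month.
population = {'Canada': 38e6, 'India': 1.3e9}
share_over_59 = {'Canada': 0.163, 'India': 0.151}

# Hypothetical pool of potential subscribers in each country, assuming the
# survey proportions generalize (a strong assumption, for illustration only).
pool = {c: population[c] * share_over_59[c] for c in population}
print(pool['India'] / pool['Canada'])  # India's pool is roughly 30x larger
```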
At this point in our study, India seems more promising than Canada: its total population is much larger, so we could expect to reach a far greater number of coders there.
We could also advertise only in the U.S., or even in more than two countries. This would require more data, for example on the competition in each of these countries, and the growth of the market for online coding courses.
At this point, it would be advisable to transfer our findings to the marketing department and ask for their opinion. With more data from them, we could make a more informed decision.
Our dataset could also help us choose an advertising medium. For example, it records which YouTube channels and podcasts each respondent used while learning. An advertisement on a popular free resource could be very effective at bringing in new customers, so let's determine which are the most popular.
We will do so by computing the proportion of respondents that used each of them, starting with YouTube channels. Let us look at the names of the columns, as well as the values they store.
#This list will store the name of columns containing the string YouTube
youtube_cols = coders_4.columns[coders_4.columns.str.contains('YouTube')]
for col in youtube_cols:
    print(coders_4[col].value_counts(dropna= False))
NaN 3727 1.0 168 Name: YouTubeCodeCourse, dtype: int64 NaN 3693 1.0 202 Name: YouTubeCodingTrain, dtype: int64 NaN 3499 1.0 396 Name: YouTubeCodingTut360, dtype: int64 NaN 3552 1.0 343 Name: YouTubeComputerphile, dtype: int64 NaN 3455 1.0 440 Name: YouTubeDerekBanas, dtype: int64 NaN 3346 1.0 549 Name: YouTubeDevTips, dtype: int64 NaN 3578 1.0 317 Name: YouTubeEngineeredTruth, dtype: int64 NaN 2438 1.0 1457 Name: YouTubeFCC, dtype: int64 NaN 3651 1.0 244 Name: YouTubeFunFunFunction, dtype: int64 NaN 3266 1.0 629 Name: YouTubeGoogleDev, dtype: int64 NaN 3337 1.0 558 Name: YouTubeLearnCode, dtype: int64 NaN 3631 1.0 264 Name: YouTubeLevelUpTuts, dtype: int64 NaN 3092 1.0 803 Name: YouTubeMIT, dtype: int64 NaN 3810 1.0 85 Name: YouTubeMozillaHacks, dtype: int64 NaN 3655 None 19 none 15 The Net Ninja 7 Traversy Media 7 Simple Programmer 6 LearnWebCode 4 Sentdex 3 Khan Academy 2 Eli the Computer Guy 2 Wes Bos 2 net ninja 2 LambdaSchool 2 Stephen Mayeux 2 mindspace 2 Brackeys 2 TheNetNinja 2 Stanford 2 I haven't watched any youtube videos yet 1 siraj 1 simple programmer 1 none yet 1 CHRIS Sean 1 TechmakerTV 1 I have not yet watched coding-related YouTube videos 1 Random YouTube videos 1 HandMadeHero, Paul Programming 1 Chris Hawkes 1 sentdex 1 Slidenerd 1 ... sentdev 1 Code Ninja 1 None so far... planning to soon 1 random ones 1 Harvard CS50 1 Stanford Engineering course 1 I haven't really 1 idk 1 coding tutorials 360 1 Nptel 1 thenewboston 1 CodeWithNick 1 sentdex , sirajraval 1 Simple Programmer, Chris Hawkes 1 RailsCasts 1 TheRealCasadaro 1 Code Babes 1 Hack Reactor 1 CBT Nuggets 1 random vids I find while googling 1 There was a Sequelize one that was really good, but I can't remember what it was called. 
1 Rem Zolotykh https://www.youtube.com/channel/UCsvMopMspsGw89AWim0FMfw 1 don't watch youtube to look coding up 1 --- 1 Jim Campagno 1 Have not seen any of these youtube videos 1 none of these 1 A Cloud Guru 1 simpleprogrammer 1 FunFunFunction 1 Name: YouTubeOther, Length: 177, dtype: int64 NaN 3852 1.0 43 Name: YouTubeSimplilearn, dtype: int64 NaN 3250 1.0 645 Name: YouTubeTheNewBoston, dtype: int64
All the variables but YouTubeOther have two possible values: NaN, denoting that the respondent did not watch that YouTube channel, and 1.0, denoting that they did. In YouTubeOther, respondents were asked to enter any other coding channel they watched, so it holds many different string values. However, none of the channels listed there is very popular, so we'll simply discard this variable.
del coders_4['YouTubeOther']
#Updating the columns containing the string YouTube
youtube_cols = coders_4.columns[coders_4.columns.str.contains('YouTube')]
#We define a function to convert NaN and 1.0 to booleans
def to_bool(x):
    if pd.isnull(x):
        return False
    elif x == 1.0:
        return True
    else:
        return x

for col in youtube_cols:
    coders_4[col] = coders_4[col].apply(to_bool)
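As an aside, the same conversion can be done without apply: comparing with 1.0 maps NaN to False directly, since NaN == 1.0 evaluates to False. A toy check:

```python
import numpy as np
import pandas as pd

# Equality comparison yields False for NaN, matching what to_bool does
# on these two-valued columns.
col = pd.Series([1.0, np.nan, 1.0, np.nan])
as_bool = col == 1.0
print(as_bool.tolist())  # [True, False, True, False]
```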
We can now use value_counts to find out the proportion of respondents who watched a given channel.
#We use value_counts to get the share of True values in each column
#and a list comprehension to collect them
data = [coders_4[col].value_counts(normalize = True)[True]*100 for col in youtube_cols]
yt_pop = pd.Series(index = youtube_cols
,data = data).sort_values(ascending = False)
yt_pop.plot.bar(figsize = (10,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0f729d898>
A good next step would be to find out whether any of these channels would agree to advertise our company directly. We should contact them in the order shown on this graph.
We can repeat this process for podcasts. The code is exactly the same, so we will move faster this time.
podcast_cols = coders_4.columns[coders_4.columns.str.contains('Podcast')]
for col in podcast_cols:
    print(coders_4[col].value_counts(dropna= False))
NaN 3811 1.0 84 Name: PodcastChangeLog, dtype: int64 NaN 3339 1.0 556 Name: PodcastCodeNewbie, dtype: int64 NaN 3759 1.0 136 Name: PodcastCodePen, dtype: int64 NaN 3719 1.0 176 Name: PodcastDevTea, dtype: int64 NaN 3843 1.0 52 Name: PodcastDotNET, dtype: int64 NaN 3860 1.0 35 Name: PodcastGiantRobots, dtype: int64 NaN 3784 1.0 111 Name: PodcastJSAir, dtype: int64 NaN 3627 1.0 268 Name: PodcastJSJabber, dtype: int64 NaN 3624 1.0 271 Name: PodcastNone, dtype: int64 NaN 3689 Learn to Code with Me 9 Front End Happy Hour 7 Learn to Code With Me 6 Coding Blocks 5 Learn to code with me 4 na 4 Full Stack Radio 4 Coder Radio 3 no 3 coding blocks 3 The Versioning Show 2 Not So Standard Deviations 2 Toolsday 2 c++ 2 I havn't 2 Start Here 2 Coding Tutorials 360 2 Frontend Happy Hour 2 Arrested DevOps 2 Starthere.fm 2 Youtube 2 not listened to any 2 Learn to code with me 2 not yet 2 Late Nights with Trav and Los 2 Front end happy hour 2 Front-end Happy Hour 2 Adventures in Angular 2 Eat Sleep Code 1 ... 
Start Here, Especially Big Data, Hadooponomics, Data Crunch 1 MS Dev Show 1 Not yet 1 Codependant 1 Start here web development 1 Ruby on rails 1 Coding360 1 Breakingintostartups.com 1 various youtubers 1 TravLos podcast 1 Women Who Code 1 Learn To Code With Me 1 Mentoring Developers 1 Learntocodewith.me 1 We Are Lighthouse London 1 freeCodeCamp YouTube Channel 1 Start Here: Web Development 1 DataSkeptic 1 O'reilly show 1 React native 1 Reboot, Start Here FM 1 IDeveloper, merge conflict 1 Frontend Masters 1 Python podcast init 1 Google Cloud Platform Podcast 1 partiallyderivative.com 1 --- 1 JavaScript Jabber 1 LearnWebCode on Youtube 1 Developer on Fire 1 Name: PodcastOther, Length: 151, dtype: int64 NaN 3799 1.0 96 Name: PodcastProgThrowdown, dtype: int64 NaN 3812 1.0 83 Name: PodcastRubyRogues, dtype: int64 NaN 3727 1.0 168 Name: PodcastSEDaily, dtype: int64 NaN 3815 1.0 80 Name: PodcastSERadio, dtype: int64 NaN 3823 1.0 72 Name: PodcastShopTalk, dtype: int64 NaN 3735 1.0 160 Name: PodcastTalkPython, dtype: int64 NaN 3831 1.0 64 Name: PodcastTheWebAhead, dtype: int64
del coders_4['PodcastOther']
podcast_cols = coders_4.columns[coders_4.columns.str.contains('Podcast')]
for col in podcast_cols:
    coders_4[col] = coders_4[col].apply(to_bool)
data_pod = [coders_4[col].value_counts(normalize = True)[True]*100 for col in podcast_cols]
podcast_pop = pd.Series(index = podcast_cols
,data = data_pod).sort_values(ascending = False)
podcast_pop.plot.bar(figsize = (10,10))
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0f715c0f0>
Again, this gives us a roadmap for trying to place advertisements in these podcasts. Also note that podcasts are much less popular than YouTube channels, so we should prioritize YouTube advertising when possible.
One next step would be to segment this data by country. We could then target our advertisements more precisely.