Machine learning is an application of artificial intelligence that gives systems the ability to learn and improve from experience without being explicitly programmed to do so. The field has grown significantly in recent years.
Deep learning, a form of machine learning, is based on creating artificial neural networks that mimic the biological neural connections in our own brains. With its myriad applications across many industries, its popularity has risen sharply.
For this project we take on the role of a content writer for a data science magazine and ask: is this a fad, or is deep learning here to stay?
To examine the trend, we are going to pull data on all questions posted to the Data Science Stack Exchange (DSSE). Stack Exchange employs a reputation system for its questions and answers: each post is subject to upvotes and downvotes, and questions with more upvotes get more visibility.
The content posted on the DSSE is wholly dependent on what people in the data science community are posting. In this sense, our data and our conclusions reflect patterns in the community itself.
The data was pulled using the Stack Exchange Data Explorer (SEDE). The SEDE provides a built-in SQL query system for pulling data from the site.
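For reference, a query along the following lines reproduces the pull. This is an illustrative sketch rather than the exact query used: SEDE runs T-SQL against the site's Posts table, where PostTypeId = 1 denotes questions.
# Illustrative SEDE query (a sketch; the date filter shown is for the 2019 dataset)
sede_query = """
SELECT Id, PostTypeId, CreationDate, Score, ViewCount,
       Tags, AnswerCount, FavoriteCount
FROM Posts
WHERE PostTypeId = 1
  AND CreationDate >= '2019-01-01'
"""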
We have one dataset for questions posted in 2019 and also a dataset for all questions posted, which goes back to 2014. The datasets include the following columns:
Id
: An identification number for the post.

PostTypeId
: An identification number for the type of post.

CreationDate
: The date and time of creation of the post.

Score
: The post's score.

ViewCount
: How many times the post was viewed.

Tags
: What tags were used.

AnswerCount
: How many answers the question got (only applicable to question posts).

FavoriteCount
: How many times the question was favorited (only applicable to question posts).

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import datetime as dt
import numpy as np
# Read in the file
ques = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])
ques.head()
| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | <machine-learning><data-mining> | 0 | NaN |
1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | <machine-learning><regression><linear-regressi... | 0 | NaN |
2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | <python><time-series><forecast><forecasting> | 0 | NaN |
3 | 44427 | 2019-01-23 10:57:09 | 0 | 55 | <machine-learning><scikit-learn><pca> | 1 | NaN |
4 | 44428 | 2019-01-23 11:02:15 | 0 | 19 | <dataset><bigdata><data><speech-to-text> | 0 | NaN |
ques.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             8839 non-null   int64
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64
 3   ViewCount      8839 non-null   int64
 4   Tags           8839 non-null   object
 5   AnswerCount    8839 non-null   int64
 6   FavoriteCount  1407 non-null   float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB
ques.describe()
| | Id | Score | ViewCount | AnswerCount | FavoriteCount |
---|---|---|---|---|---|
count | 8839.000000 | 8839.000000 | 8839.000000 | 8839.000000 | 1407.000000 |
mean | 54724.172870 | 0.870687 | 171.548026 | 0.787985 | 1.184790 |
std | 6507.618509 | 1.410255 | 772.813626 | 0.851146 | 0.982766 |
min | 43363.000000 | -2.000000 | 2.000000 | 0.000000 | 0.000000 |
25% | 48917.500000 | 0.000000 | 22.000000 | 0.000000 | 1.000000 |
50% | 54833.000000 | 1.000000 | 40.000000 | 1.000000 | 1.000000 |
75% | 60674.500000 | 1.000000 | 98.000000 | 1.000000 | 1.000000 |
max | 65675.000000 | 45.000000 | 33203.000000 | 9.000000 | 16.000000 |
ques['FavoriteCount'].value_counts(dropna=False)
NaN     7432
1.0      953
2.0      205
0.0      175
3.0       43
4.0       12
5.0        8
6.0        4
7.0        4
11.0       1
8.0        1
16.0       1
Name: FavoriteCount, dtype: int64
Missing values appear only in the FavoriteCount column; the other columns have no missing values.
The FavoriteCount column has 7,432 missing values. These are probably posts that received zero favorite votes, so we can simply fill the NaN values with zero. Once the column has no missing values, we can cast it to int64. We already parsed CreationDate when reading the file, so it is in datetime format.
# Fill NaN values with zero & cast to int
ques['FavoriteCount'] = ques['FavoriteCount'].fillna(value=0).astype(int)
# Clean the Tag column
ques['Tags'] = ques['Tags'].str.replace('><', ',').str.replace('<', '').str.replace('>', '')
# Split 'Tags' column on ',' and cast to list
ques['Tags'] = ques['Tags'].str.split(pat=',')
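As a design note, the chained replacements plus the split above could be collapsed into a single regex extraction; an equivalent one-step sketch, left commented out since Tags is already converted:
# ques['Tags'] = ques['Tags'].str.findall(r'<(.+?)>')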
ques
| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | [machine-learning, data-mining] | 0 | 0 |
1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | [machine-learning, regression, linear-regressi... | 0 | 0 |
2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | [python, time-series, forecast, forecasting] | 0 | 0 |
3 | 44427 | 2019-01-23 10:57:09 | 0 | 55 | [machine-learning, scikit-learn, pca] | 1 | 0 |
4 | 44428 | 2019-01-23 11:02:15 | 0 | 19 | [dataset, bigdata, data, speech-to-text] | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
8834 | 55413 | 2019-07-10 09:08:31 | 1 | 39 | [pca, dimensionality-reduction, linear-algebra] | 1 | 1 |
8835 | 55414 | 2019-07-10 09:34:55 | 0 | 113 | [keras, weight-initialization] | 0 | 0 |
8836 | 55415 | 2019-07-10 09:45:37 | 1 | 212 | [python, visualization, seaborn] | 1 | 0 |
8837 | 55416 | 2019-07-10 09:59:56 | 0 | 22 | [time-series] | 0 | 0 |
8838 | 55419 | 2019-07-10 10:31:23 | 1 | 168 | [k-nn] | 1 | 0 |
8839 rows × 7 columns
Since tags identify what subject a question is about, we will examine them to see whether they point to a trend in deep learning.
We will use the following metrics to measure the popularity of each tag: how many times the tag was used (uses), how many total views questions with that tag received (views), and the average views per use (views_per_use).
To compute these, we will create two dictionaries and iterate over the dataframe as appropriate:
# Create dictionary and loop to calculate how often each tag was used.
tags_count_dict = {}
for tag_list in ques['Tags']:
    for tag in tag_list:
        if tag in tags_count_dict:
            tags_count_dict[tag] += 1
        else:
            tags_count_dict[tag] = 1
# Create dictionary and loop to calculate how many views each tag received.
tags_views_dict = {}
for index, tag_list in enumerate(ques['Tags']):
    for tag in tag_list:
        if tag in tags_views_dict:
            tags_views_dict[tag] += ques['ViewCount'][index]
        else:
            tags_views_dict[tag] = ques['ViewCount'][index]
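The same numbers can be computed more concisely with pandas itself. A minimal equivalent sketch, assuming a pandas version with DataFrame.explode (0.25+), left commented out since the dictionaries above already hold the results:
# exploded = ques[['Tags', 'ViewCount']].explode('Tags')   # one row per (question, tag) pair
# uses = exploded['Tags'].value_counts()                   # how often each tag was used
# views = exploded.groupby('Tags')['ViewCount'].sum()      # total views per tag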
# Transform the data and rename columns
frequencies = pd.DataFrame.from_dict(data=[tags_count_dict, tags_views_dict]).T
frequencies = frequencies.rename(columns={0:'uses', 1:'views'})
# Create column that holds the number of views per each use of that tag
frequencies['views_per_use'] = frequencies['views'] / frequencies['uses']
# Create subset dataframes to use in below graphs
frequencies10 = frequencies.sort_values('uses', ascending=False).head(10)
frequencies20 = frequencies.sort_values('uses', ascending=False).head(20)
frequencies20.head()
| | uses | views | views_per_use |
---|---|---|---|
machine-learning | 2693 | 388499 | 144.262532 |
python | 1814 | 537585 | 296.353363 |
deep-learning | 1220 | 233628 | 191.498361 |
neural-network | 1055 | 185367 | 175.703318 |
keras | 935 | 268608 | 287.281283 |
#Plot
fig, ax = plt.subplots(3,1,
figsize=(12,10),
sharex=True)
fig.subplots_adjust(hspace=0.1)
plt.rcParams['figure.dpi'] = 460
fig.suptitle('Tag Usage Analysis',
fontsize=18,
y=0.93)
ax1 = plt.subplot(3,1,1)
ax1 = sns.barplot(data=frequencies20,
x=frequencies20.index,
y='uses',
color='steelblue')
ax1.set_ylabel('Total Tag Uses', fontsize=14)
sns.despine(left=True)
sns.set(style='whitegrid')
ax1.get_xaxis().set_visible(False)
ax2 = plt.subplot(3,1,2)
ax2 = sns.barplot(data=frequencies20,
x=frequencies20.index,
y='views',
color='steelblue')
ax2.set_ylabel('Total Tag Views', fontsize=14)
ax2.tick_params(labelsize=14)
sns.despine(left=True)
sns.set(style='whitegrid')
ax2.get_xaxis().set_visible(False)
ax3 = plt.subplot(3,1,3)
ax3 = sns.barplot(data=frequencies20,
x=frequencies20.index,
y='views_per_use',
color='steelblue')
ax3.set_ylabel('Views Per Use', fontsize=14)
ax3.tick_params(labelsize=14)
plt.xticks(rotation = 45, ha='right')
sns.despine(left=True)
sns.set(style='whitegrid')
# Plot
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax = sns.scatterplot(data=frequencies10,
x="uses", y="views",
hue=frequencies10.index,
palette="deep",
s=300)
# Plot Aesthetics
fig.subplots_adjust(hspace=0.5)
ax.set_ylabel('Total Tag Views',
fontsize=22,
labelpad=30)
ax.set_xlabel('Total Tag Count',
fontsize=22,
labelpad=30)
ax.tick_params(labelsize=14)
ax.set_title('Tag Uses v. Tag Views',
fontsize=22,
y=1.05)
sns.despine(left=True)
plt.legend(loc='center left',
bbox_to_anchor=(1.1, 0.5),
ncol=1, handlelength=2,
handleheight=1,
prop=dict(size=18),
markerscale=2)
ax.annotate('\'nlp\' & \'cnn\' are overlapped',
xy=(550,72000),
xytext=(800,80000),
fontsize=16,
arrowprops=dict(facecolor='black', shrink=0.05))
Text(800, 80000, "'nlp' & 'cnn' are overlapped")
Word clouds (also known as tag clouds) are useful for getting a quick idea of how popular certain words are compared to others: the more times a word is mentioned, the larger it appears in the cloud. We will use the WordCloud library for this.
# # Create a word cloud based on the frequency dictionary
# wordcloud = WordCloud(background_color='white',
# max_words=50,
# max_font_size=40,
# min_font_size=5,
# scale=3,
# random_state=3).generate_from_frequencies(tags_count_dict)
# # Display the generated image
# fig = plt.figure(1, figsize=(20,20))
# plt.imshow(wordcloud, interpolation='bilinear')
# plt.axis("off")
Here we will examine how often one tag is used with another. To do this we will create a co-occurrence matrix that displays the pairwise frequency of the top 20 tags. Once we have the data, we will plot it as a heatmap so we can visualize the patterns.
# Import libraries
import itertools
from scipy.sparse import csr_matrix
# Function to create a co-occurrence matrix
def create_co_occurrence_matrix(subset, entire_set):
    print(f"allowed_words:\n{subset}")
    print(f"documents:\n{entire_set}")
    word_to_id = dict(zip(subset, range(len(subset))))
    documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in entire_set]
    row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
    data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
    max_word_id = max(itertools.chain(*documents_as_ids)) + 1
    docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), max_word_id))  # documents-by-tags incidence matrix; CSR allows efficient arithmetic
    words_cooc_matrix = docs_words_matrix.T * docs_words_matrix  # multiplying the incidence matrix by its transpose yields the co-occurrence matrix
    words_cooc_matrix.setdiag(0)  # zero the diagonal: we only care about distinct tag pairs
    print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
    return words_cooc_matrix, word_to_id
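Why the transpose product works: if X is the documents-by-tags incidence matrix, then entry (i, j) of XᵀX counts the documents that contain both tag i and tag j; the diagonal would count each tag with itself, which is why it is zeroed out. A toy run on two made-up documents illustrates this (uncomment to try):
# toy_cooc, toy_ids = create_co_occurrence_matrix(['a', 'b', 'c'], [['a', 'b'], ['a', 'c']])
# toy_cooc.todense()  # 'a' co-occurs once with 'b' and once with 'c'; 'b' and 'c' never meet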
# Assign tag variables to be loaded into function
subset = frequencies20.index
entire_set = ques['Tags']
# Create matrix
words_cooc_matrix, word_to_id = create_co_occurrence_matrix(subset, entire_set)
allowed_words:
Index(['machine-learning', 'python', 'deep-learning', 'neural-network',
       'keras', 'classification', 'tensorflow', 'scikit-learn', 'nlp', 'cnn',
       'time-series', 'lstm', 'pandas', 'regression', 'dataset', 'r',
       'predictive-modeling', 'clustering', 'statistics',
       'machine-learning-model'],
      dtype='object')
documents:
0                        [machine-learning, data-mining]
1       [machine-learning, regression, linear-regressi...
2           [python, time-series, forecast, forecasting]
3                  [machine-learning, scikit-learn, pca]
4               [dataset, bigdata, data, speech-to-text]
                              ...
8834     [pca, dimensionality-reduction, linear-algebra]
8835                      [keras, weight-initialization]
8836                    [python, visualization, seaborn]
8837                                       [time-series]
8838                                              [k-nn]
Name: Tags, Length: 8839, dtype: object
words_cooc_matrix:
[[  0 499 429 366 195 259 106 188 113 124 131  71  62 119  99  63 123  61  89 139]
 [499   0 160 137 280  98 167 235  71  62 105  61 244  59  53  24  35  45  35  37]
 [429 160   0 305 247  59 136  16  72 160  44 103   1  21  32   5  32   2  12  19]
 [366 137 305   0 235  65 108  24  24 118  33  69   1  42  20   9  13   8  11  10]
 [195 280 247 235   0  58 256  34  23 116  51 133   3  31  13  10  11   0   3  17]
 [259  98  59  65  58   0  20  47  35  20  25  20   3  34  28  10  27  12  19  21]
 [106 167 136 108 256  20   0  15  11  57   9  43   3   9   9   1   6   0   0   9]
 [188 235  16  24  34  47  15   0  12   0  12   2  37  37   9   1  12  24   6  18]
 [113  71  72  24  23  35  11  12   0   7   0  19   3   2  11   4   1   9   3   4]
 [124  62 160 118 116  20  57   0   7   0   8  24   1   6  11   2   6   0   1   4]
 [131 105  44  33  51  25   9  12   0   8   0  87  19  24   6  22  31  20  22   7]
 [ 71  61 103  69 133  20  43   2  19  24  87   0   7  11   7   3  13   3   1   5]
 [ 62 244   1   1   3   3   3  37   3   1  19   7   0   6  14   2   4   5   3   4]
 [119  59  21  42  31  34   9  37   2   6  24  11   6   0   6  10  28   2  16   8]
 [ 99  53  32  20  13  28   9   9  11  11   6   7  14   6   0   6   7   5  17  12]
 [ 63  24   5   9  10  10   1   1   4   2  22   3   2  10   6   0  13  16  16   7]
 [123  35  32  13  11  27   6  12   1   6  31  13   4  28   7  13   0   0  16  21]
 [ 61  45   2   8   0  12   0  24   9   0  20   3   5   2   5  16   0   0   3   3]
 [ 89  35  12  11   3  19   0   6   3   1  22   1   3  16  17  16  16   3   0   3]
 [139  37  19  10  17  21   9  18   4   4   7   5   4   8  12   7  21   3   3   0]]
# View data type of the matrix
words_cooc_matrix
<20x20 sparse matrix of type '<class 'numpy.uint32'>' with 386 stored elements in Compressed Sparse Column format>
# Convert the sparse matrix above to a dataframe
matrix = pd.DataFrame(words_cooc_matrix.toarray(), index=subset, columns=subset)
# Tweak the color scale so higher numbers stand out more
norm = plt.Normalize(0, 250)
# Create a mask so that the upper triangle of the matrix is removed (it mirrors the lower)
mask = np.zeros_like(matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Plot
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(matrix,
cmap='RdBu',
annot=True,
lw=2,
cbar=False,
annot_kws={"fontsize":12},
center=0,
norm=norm,
fmt='g',
mask=mask
)
# Plot Aesthetics
ax.tick_params(left=False, bottom=False)
ax.set_xticklabels(frequencies20.index,
fontsize=12,
rotation=45,
ha='right'
)
plt.title(
'Top 20 Tag Co-Occurrence',
fontsize=24,
fontweight=525
)
text = 'Interpretation: The \'pandas\' and \'python\' tags are \n used in the same question 244 times.'
ax.text(0.48, 0.90,
text,
transform=ax.transAxes,
fontsize=14,
verticalalignment='top')
Text(0.48, 0.9, "Interpretation: The 'pandas' and 'python' tags are \n used in the same question 244 times.")
Here we will use our multi-year dataset to see how deep learning interest has changed over time. We will look at monthly as well as yearly changes.
# Read in the multi-year data
all_q = pd.read_csv('all_questions.csv', parse_dates=['CreationDate'])
# Create new column that holds just the year and month of the question (e.g. '201902')
all_q['yearmonth'] = all_q['CreationDate'].dt.strftime('%Y%m')
all_q.head()
| | Id | CreationDate | Tags | yearmonth |
---|---|---|---|---|
0 | 45416 | 2019-02-12 00:36:29 | <python><keras><tensorflow><cnn><probability> | 201902 |
1 | 45418 | 2019-02-12 00:50:39 | <neural-network> | 201902 |
2 | 45422 | 2019-02-12 04:40:51 | <python><ibm-watson><chatbot> | 201902 |
3 | 45426 | 2019-02-12 04:51:49 | <keras> | 201902 |
4 | 45427 | 2019-02-12 05:08:24 | <r><predictive-modeling><machine-learning-mode... | 201902 |
# Aggregate by month and count the number of questions in that month
gp_all = all_q.groupby(by='yearmonth').count()
# Drop unnecessary columns
gp_all = gp_all.drop(columns=['CreationDate', 'Tags'])
# Cast 'yearmonth' to datetime format that includes the year and month
gp_all.index = pd.to_datetime(gp_all.index, format='%Y%m')
# Drop the partial months which are the first and last rows
gp_all.drop(gp_all.tail(1).index,inplace=True)
gp_all.drop(gp_all.head(1).index,inplace=True)
gp_all.head()
| yearmonth | Id |
---|---|
2014-06-01 | 99 |
2014-07-01 | 76 |
2014-08-01 | 65 |
2014-09-01 | 48 |
2014-10-01 | 71 |
# Create new dataframe that will be aggregated by year
gp_all_year = gp_all.reset_index()
# Extract just the year
gp_all_year['year'] = gp_all_year['yearmonth'].dt.year
# Aggregate by year and find total questions for that year
gp_all_year = gp_all_year.groupby('year').sum()
# Drop 2014 data since almost half of that year's data is missing
gp_all_year = gp_all_year.iloc[1:, :]
gp_all_year.head()
| year | Id |
---|---|
2015 | 1167 |
2016 | 2146 |
2017 | 2957 |
2018 | 5475 |
2019 | 8810 |
# Clean 'Tags' column and cast to list
all_q['Tags'] = all_q['Tags'].str.replace('><', ',').str.replace('<', '').str.replace('>', '')
all_q['Tags'] = all_q['Tags'].str.split(pat=',')
# Define function and make new column that indicates whether the question is related to deep learning
def has_dl(value):
    dl_tags = ['deep-learning', 'deep-network', 'neural', 'neural-network',
               'convolutional-neural-network', 'graph-neural-network',
               'neural-style-transfer', 'cnn', 'faster-rcnn', 'rnn',
               'keras', 'keras-rl', 'machine-learning', 'tensorflow']
    for tag in dl_tags:
        if tag in value:
            return True
    # Questions with no matching tag fall through and return None,
    # which shows up as NaN when we group on this column later.
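A quick check of the helper on made-up tag lists confirms the True/None behavior described above:
print(has_dl(['keras', 'python']))     # True
print(has_dl(['r', 'visualization']))  # None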
# Create new column and apply the function
all_q['dl_related'] = all_q['Tags'].apply(has_dl)
# Filter dataframe to include only questions related to deep learning
all_dl = all_q[all_q['dl_related'] == True].copy()
# Create new column that holds just the year and month of the question
all_dl['yearmonth'] = all_dl['CreationDate'].dt.strftime('%Y%m')
# Aggregate by month and count the number of questions in that month
gp = all_dl.groupby(by='yearmonth').count()
# Drop unnecessary columns
gp = gp.drop(columns=['Id', 'CreationDate', 'Tags'])
# Cast 'yearmonth' to datetime format that includes the year and month
gp.index = pd.to_datetime(gp.index, format='%Y%m')
# Drop the partial months which are the first and last rows
gp.drop(gp.tail(1).index,inplace=True)
gp.drop(gp.head(1).index,inplace=True)
gp.head()
| yearmonth | dl_related |
---|---|
2014-06-01 | 33 |
2014-07-01 | 27 |
2014-08-01 | 20 |
2014-09-01 | 19 |
2014-10-01 | 28 |
# Create new dataframe that will be aggregated by year
gp_year = gp.reset_index()
# Extract just the year
gp_year['year'] = gp_year['yearmonth'].dt.year
# Aggregate by year and find total questions for that year
gp_year = gp_year.groupby('year').sum()
# Drop 2014 data since almost half of that year's data is missing
gp_year = gp_year.iloc[1:, :]
gp_year.head()
| year | dl_related |
---|---|
2015 | 455 |
2016 | 942 |
2017 | 1554 |
2018 | 3138 |
2019 | 4780 |
# Create a new column on the gp_year dataframe that holds the total number of questions for that year
gp_year['all'] = gp_all_year['Id'].tolist()
# Cast it to int
gp_year['all'] = gp_year['all'].astype(int)
# Create column that holds the percentage of DL related questions
gp_year['dl_pct'] = round((gp_year['dl_related'] / gp_year['all']) * 100, 2)
# Create column that holds the rate of increase per year for DL related questions
gp_year['dl_rate'] = round(gp_year['dl_related'].pct_change()* 100, 2)
gp_year
| year | dl_related | all | dl_pct | dl_rate |
---|---|---|---|---|
2015 | 455 | 1167 | 38.99 | NaN |
2016 | 942 | 2146 | 43.90 | 107.03 |
2017 | 1554 | 2957 | 52.55 | 64.97 |
2018 | 3138 | 5475 | 57.32 | 101.93 |
2019 | 4780 | 8810 | 54.26 | 52.33 |
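As a quick sanity check on the table above, the growth rate follows directly from the yearly counts; for example, for 2016:
# 2016 rate of increase in deep learning questions
round((942 - 455) / 455 * 100, 2)  # 107.03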
# Plot ax1
fig, ax = plt.subplots(nrows=4, ncols=1, figsize=(10,20))
fig.subplots_adjust(hspace=0.4)
ax1 = plt.subplot(4,1,1)
ax1 = sns.barplot(data=gp_all_year,
x=gp_all_year.index,
y='Id',
color = 'red',
label='All Questions')
ax1 = sns.barplot(data=gp_year,
x=gp_year.index,
y='dl_related',
color = 'steelblue',
label='Deep Learning Questions')
plt.legend(loc='best')
# Plot Aesthetics
ax1.set_title('Number of Questions Per Year',
fontsize=16,
pad=35)
ax1.set_xlabel('')
ax1.set_ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(left=True)
sns.set(style='whitegrid')
# Plot ax2
ax2 = plt.subplot(4,1,2)
plt.plot_date(x=gp_all.index,
y=gp_all['Id'],
linestyle='solid',
marker='',
color='red',
label='All Questions')
plt.plot_date(x = gp.index,
y = gp['dl_related'],
linestyle='solid',
marker='',
label='Deep Learning Questions')
# Plot Aesthetics
plt.xticks(rotation = 45,
ha='right',
fontsize=12)
plt.yticks(fontsize=12)
ax2.set_title('Number of Questions Per Month',
fontsize=16,
pad=30)
sns.despine(left=True)
sns.set(style='whitegrid')
plt.legend(loc='best')
# Plot ax3
ax3 = plt.subplot(4,1,3)
ax3 = sns.barplot(data=gp_year,
x=gp_year.index,
y='dl_pct',
color = 'steelblue')
# Plot Aesthetics
ax3.set_title('Percentage of Deep Learning Questions',
fontsize=16,
pad=20)
ax3.set_ylabel('')
ax3.set_ylim([30,60])
ax3.set_xlabel('')
ax3.set_ylabel('Percentage')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(left=True)
sns.set(style='whitegrid')
# Plot ax4
ax4 = plt.subplot(4,1,4)
ax4 = sns.barplot(data=gp_year,
x=gp_year.index,
y='dl_rate',
color = 'steelblue')
# Plot Aesthetics
ax4.set_title('Deep Learning Questions Growth Rate',
fontsize=16,
pad=20)
ax4.set_ylabel('')
ax4.set_xlabel('')
ax4.set_ylabel('Rate of Growth (Percent)')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(left=True)
sns.set(style='whitegrid')
Another indicator of a topic's popularity could be the number of times its questions receive favorite votes. Let's take our 2019 dataset and compare the number of favorite votes received by deep learning related and non deep learning related posts. We will also compare this with the share of deep learning questions in the total number of questions.
# Create new column and apply our function that we made previously
ques['dl_related'] = ques['Tags'].apply(has_dl)
# Groupby and aggregate the columns differently: Favorite gets sum, Id gets count
quesgb = ques.groupby('dl_related', dropna=False).agg({'FavoriteCount':'sum', 'Id':'count'})
quesgb
| dl_related | FavoriteCount | Id |
---|---|---|
True | 961 | 4793 |
NaN | 706 | 4046 |
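Before plotting, we can read the shares straight off this table: deep learning questions collect about 57.6% of favorite votes while making up about 54.2% of questions.
# Favorite-vote share vs. question share for deep learning posts
print(round(961 / (961 + 706) * 100, 1))     # 57.6
print(round(4793 / (4793 + 4046) * 100, 1))  # 54.2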
# Plot
fig, (ax1, ax2) = plt.subplots(1,2)
colors = ['steelblue', 'red']
patches, texts, autotexts = ax1.pie(quesgb['FavoriteCount'],
explode=(0,0.03),
startangle=90,
autopct='%1.1f%%',
colors=colors,
textprops={'fontsize': 10})
patches, texts, autotexts = ax2.pie(quesgb['Id'],
explode=(0,0.03),
startangle=90,
autopct='%1.1f%%',
colors=colors,
textprops={'fontsize': 10})
# Plot Aesthetics
ax1.set_title('Total Favorite Votes', y=0.97)
ax2.set_title('Total Number of Questions', y=0.97)
fig.suptitle('2019 Data Science Questions', fontsize=16)
fig.legend(['Deep Learning Related', 'Not Deep Learning Related'],
prop={'size': 8},
loc='lower center')
plt.tight_layout()