Machine learning is an application of artificial intelligence that gives systems the ability to learn and improve from experience without being explicitly programmed to do so. The field has grown significantly in recent years.
Deep learning, a form of machine learning, is based on creating artificial neural networks that mimic the biological neural connections in our own brains. With its myriad applications across many industries, its popularity has risen sharply.
For this project we take on the role of a content writer for a data science magazine and ask: is this a fad, or is deep learning here to stay?
To examine the trend, we are going to pull data on all questions posted to the Data Science Stack Exchange (DSSE). Stack Exchange employs a reputation system for its questions and answers: each post is subject to upvotes and downvotes, and questions with more upvotes get more visibility.
The content posted on the DSSE is wholly dependent on what people in the data science community are posting. In this sense, our data and our conclusions reflect patterns in the community itself.
The data was pulled using the Stack Exchange Data Explorer (SEDE). The SEDE provides a built-in SQL query system for pulling data from the site.
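For reference, a query along the following lines reproduces the pull. This is an illustrative sketch rather than the exact query used: SEDE runs T-SQL against the site's Posts table, where PostTypeId = 1 denotes questions.
# Illustrative SEDE query (a sketch; the date filter shown is for the 2019 dataset)
sede_query = """
SELECT Id, PostTypeId, CreationDate, Score, ViewCount,
       Tags, AnswerCount, FavoriteCount
FROM Posts
WHERE PostTypeId = 1
  AND CreationDate >= '2019-01-01'
"""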
We have one dataset for questions posted in 2019 and also a dataset for all questions posted, which goes back to 2014. The datasets include the following columns:
Id
: An identification number for the post.

PostTypeId
: An identification number for the type of post.

CreationDate
: The date and time of creation of the post.

Score
: The post's score.

ViewCount
: How many times the post was viewed.

Tags
: What tags were used.

AnswerCount
: How many answers the question got (only applicable to question posts).

FavoriteCount
: How many times the question was favorited (only applicable to question posts).

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import datetime as dt
import numpy as np
# Read in the file
ques = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])
ques.head()
| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | <machine-learning><data-mining> | 0 | NaN |
1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | <machine-learning><regression><linear-regressi... | 0 | NaN |
2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | <python><time-series><forecast><forecasting> | 0 | NaN |
3 | 44427 | 2019-01-23 10:57:09 | 0 | 55 | <machine-learning><scikit-learn><pca> | 1 | NaN |
4 | 44428 | 2019-01-23 11:02:15 | 0 | 19 | <dataset><bigdata><data><speech-to-text> | 0 | NaN |
ques.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             8839 non-null   int64
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64
 3   ViewCount      8839 non-null   int64
 4   Tags           8839 non-null   object
 5   AnswerCount    8839 non-null   int64
 6   FavoriteCount  1407 non-null   float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB
ques.describe()
| | Id | Score | ViewCount | AnswerCount | FavoriteCount |
---|---|---|---|---|---|
count | 8839.000000 | 8839.000000 | 8839.000000 | 8839.000000 | 1407.000000 |
mean | 54724.172870 | 0.870687 | 171.548026 | 0.787985 | 1.184790 |
std | 6507.618509 | 1.410255 | 772.813626 | 0.851146 | 0.982766 |
min | 43363.000000 | -2.000000 | 2.000000 | 0.000000 | 0.000000 |
25% | 48917.500000 | 0.000000 | 22.000000 | 0.000000 | 1.000000 |
50% | 54833.000000 | 1.000000 | 40.000000 | 1.000000 | 1.000000 |
75% | 60674.500000 | 1.000000 | 98.000000 | 1.000000 | 1.000000 |
max | 65675.000000 | 45.000000 | 33203.000000 | 9.000000 | 16.000000 |
ques['FavoriteCount'].value_counts(dropna=False)
NaN     7432
1.0      953
2.0      205
0.0      175
3.0       43
4.0       12
5.0        8
6.0        4
7.0        4
11.0       1
8.0        1
16.0       1
Name: FavoriteCount, dtype: int64
Missing values appear only in the FavoriteCount column; the other columns have no missing values.
The FavoriteCount column has 7,432 missing values. These are probably posts that received zero favorite votes, so we can simply fill the NaN values with zero. Once the column has no missing values, we can cast it to int64. We already parsed CreationDate when reading the file, so it is in datetime format.
# Fill NaN values with zero & cast to int
ques['FavoriteCount'] = ques['FavoriteCount'].fillna(value=0).astype(int)
# Clean the Tag column
ques['Tags'] = ques['Tags'].str.replace('><', ',').str.replace('<', '').str.replace('>', '')
# Split 'Tags' column on ',' and cast to list
ques['Tags'] = ques['Tags'].str.split(pat=',')
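As a design note, the chained replacements plus the split above could be collapsed into a single regex extraction; an equivalent one-step sketch, left commented out since Tags is already converted:
# ques['Tags'] = ques['Tags'].str.findall(r'<(.+?)>')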
ques
| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | [machine-learning, data-mining] | 0 | 0 |
1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | [machine-learning, regression, linear-regressi... | 0 | 0 |
2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | [python, time-series, forecast, forecasting] | 0 | 0 |
3 | 44427 | 2019-01-23 10:57:09 | 0 | 55 | [machine-learning, scikit-learn, pca] | 1 | 0 |
4 | 44428 | 2019-01-23 11:02:15 | 0 | 19 | [dataset, bigdata, data, speech-to-text] | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
8834 | 55413 | 2019-07-10 09:08:31 | 1 | 39 | [pca, dimensionality-reduction, linear-algebra] | 1 | 1 |
8835 | 55414 | 2019-07-10 09:34:55 | 0 | 113 | [keras, weight-initialization] | 0 | 0 |
8836 | 55415 | 2019-07-10 09:45:37 | 1 | 212 | [python, visualization, seaborn] | 1 | 0 |
8837 | 55416 | 2019-07-10 09:59:56 | 0 | 22 | [time-series] | 0 | 0 |
8838 | 55419 | 2019-07-10 10:31:23 | 1 | 168 | [k-nn] | 1 | 0 |
8839 rows × 7 columns
Since tags identify what subject a question is about, we will examine them to see whether they point to a trend in deep learning.
We will use the following metrics to measure the popularity of each tag: how many times the tag was used (uses), how many total views questions with that tag received (views), and the average views per use (views_per_use).
To compute these, we will create two dictionaries and iterate over the dataframe as appropriate:
# Create dictionary and loop to calculate how often each tag was used.
tags_count_dict = {}
for tag_list in ques['Tags']:
    for tag in tag_list:
        if tag in tags_count_dict:
            tags_count_dict[tag] += 1
        else:
            tags_count_dict[tag] = 1
# Create dictionary and loop to calculate how many views each tag received.
tags_views_dict = {}
for index, tag_list in enumerate(ques['Tags']):
    for tag in tag_list:
        if tag in tags_views_dict:
            tags_views_dict[tag] += ques['ViewCount'][index]
        else:
            tags_views_dict[tag] = ques['ViewCount'][index]
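The same numbers can be computed more concisely with pandas itself. A minimal equivalent sketch, assuming a pandas version with DataFrame.explode (0.25+), left commented out since the dictionaries above already hold the results:
# exploded = ques[['Tags', 'ViewCount']].explode('Tags')   # one row per (question, tag) pair
# uses = exploded['Tags'].value_counts()                   # how often each tag was used
# views = exploded.groupby('Tags')['ViewCount'].sum()      # total views per tag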
# Transform the data and rename columns
frequencies = pd.DataFrame.from_dict(data=[tags_count_dict, tags_views_dict]).T
frequencies = frequencies.rename(columns={0:'uses', 1:'views'})
# Create column that holds the number of views per each use of that tag
frequencies['views_per_use'] = frequencies['views'] / frequencies['uses']
# Create subset dataframes to use in below graphs
frequencies10 = frequencies.sort_values('uses', ascending=False).head(10)
frequencies20 = frequencies.sort_values('uses', ascending=False).head(20)
frequencies20.head()
| | uses | views | views_per_use |
---|---|---|---|
machine-learning | 2693 | 388499 | 144.262532 |
python | 1814 | 537585 | 296.353363 |
deep-learning | 1220 | 233628 | 191.498361 |
neural-network | 1055 | 185367 | 175.703318 |
keras | 935 | 268608 | 287.281283 |
#Plot
fig, ax = plt.subplots(3,1,
figsize=(12,10),
sharex=True)
fig.subplots_adjust(hspace=0.1)
plt.rcParams['figure.dpi'] = 460
fig.suptitle('Tag Usage Analysis',
fontsize=18,
y=0.93)
ax1 = plt.subplot(3,1,1)
ax1 = sns.barplot(data=frequencies20,
x=frequencies20.index,
y='uses',
color='steelblue')
ax1.set_ylabel('Total Tag Uses', fontsize=14)
sns.despine(left=True)
sns.set(style='whitegrid')
ax1.get_xaxis().set_visible(False)
ax2 = plt.subplot(3,1,2)
ax2 = sns.barplot(data=frequencies20,
x=frequencies20.index,
y='views',
color='steelblue')
ax2.set_ylabel('Total Tag Views', fontsize=14)
ax2.tick_params(labelsize=14)
sns.despine(left=True)
sns.set(style='whitegrid')
ax2.get_xaxis().set_visible(False)
ax3 = plt.subplot(3,1,3)
ax3 = sns.barplot(data=frequencies20,
x=frequencies20.index,
y='views_per_use',
color='steelblue')
ax3.set_ylabel('Views Per Use', fontsize=14)
ax3.tick_params(labelsize=14)
plt.xticks(rotation = 45, ha='right')
sns.despine(left=True)
sns.set(style='whitegrid')
# Plot
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax = sns.scatterplot(data=frequencies10,
x="uses", y="views",
hue=frequencies10.index,
palette="deep",
s=300)
# Plot Aesthetics
fig.subplots_adjust(hspace=0.5)
ax.set_ylabel('Total Tag Views',
fontsize=22,
labelpad=30)
ax.set_xlabel('Total Tag Count',
fontsize=22,
labelpad=30)
ax.tick_params(labelsize=14)
ax.set_title('Tag Uses v. Tag Views',
fontsize=22,
y=1.05)
sns.despine(left=True)
plt.legend(loc='center left',
bbox_to_anchor=(1.1, 0.5),
ncol=1, handlelength=2,
handleheight=1,
prop=dict(size=18),
markerscale=2)
ax.annotate('\'nlp\' & \'cnn\' are overlapped',
xy=(550,72000),
xytext=(800,80000),
fontsize=16,
arrowprops=dict(facecolor='black', shrink=0.05))
Text(800, 80000, "'nlp' & 'cnn' are overlapped")
Word clouds (also known as tag clouds) are useful for getting a quick idea of how popular certain words are compared to others: the more times a word is mentioned, the larger it appears in the cloud. We will use the WordCloud library for this.
# # Create a word cloud based on the frequency dictionary
# wordcloud = WordCloud(background_color='white',
# max_words=50,
# max_font_size=40,
# min_font_size=5,
# scale=3,
# random_state=3).generate_from_frequencies(tags_count_dict)
# # Display the generated image
# fig = plt.figure(1, figsize=(20,20))
# plt.imshow(wordcloud, interpolation='bilinear')
# plt.axis("off")
Here we will examine how often one tag is used with another. To do this we will create a co-occurrence matrix that displays the pairwise frequency of the top 20 tags. Once we have the data, we will plot it as a heatmap so we can visualize the patterns.
# Import libraries
import itertools
from scipy.sparse import csr_matrix
# Function to create a co-occurrence matrix
def create_co_occurrence_matrix(subset, entire_set):
    print(f"allowed_words:\n{subset}")
    print(f"documents:\n{entire_set}")
    word_to_id = dict(zip(subset, range(len(subset))))
    documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in entire_set]
    row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
    data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
    max_word_id = max(itertools.chain(*documents_as_ids)) + 1
    docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), max_word_id))  # documents-by-tags incidence matrix; CSR allows efficient arithmetic
    words_cooc_matrix = docs_words_matrix.T * docs_words_matrix  # multiplying the incidence matrix by its transpose yields the co-occurrence matrix
    words_cooc_matrix.setdiag(0)  # zero the diagonal: we only care about distinct tag pairs
    print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
    return words_cooc_matrix, word_to_id
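Why the transpose product works: if X is the documents-by-tags incidence matrix, then entry (i, j) of XᵀX counts the documents that contain both tag i and tag j; the diagonal would count each tag with itself, which is why it is zeroed out. A toy run on two made-up documents illustrates this (uncomment to try):
# toy_cooc, toy_ids = create_co_occurrence_matrix(['a', 'b', 'c'], [['a', 'b'], ['a', 'c']])
# toy_cooc.todense()  # 'a' co-occurs once with 'b' and once with 'c'; 'b' and 'c' never meet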
# Assign tag variables to be loaded into function
subset = frequencies20.index
entire_set = ques['Tags']
# Create matrix
words_cooc_matrix, word_to_id = create_co_occurrence_matrix(subset, entire_set)
allowed_words:
Index(['machine-learning', 'python', 'deep-learning', 'neural-network',
       'keras', 'classification', 'tensorflow', 'scikit-learn', 'nlp', 'cnn',
       'time-series', 'lstm', 'pandas', 'regression', 'dataset', 'r',
       'predictive-modeling', 'clustering', 'statistics',
       'machine-learning-model'],
      dtype='object')
documents:
0                        [machine-learning, data-mining]
1       [machine-learning, regression, linear-regressi...
2           [python, time-series, forecast, forecasting]
3                  [machine-learning, scikit-learn, pca]
4               [dataset, bigdata, data, speech-to-text]
                              ...
8834     [pca, dimensionality-reduction, linear-algebra]
8835                      [keras, weight-initialization]
8836                    [python, visualization, seaborn]
8837                                       [time-series]
8838                                              [k-nn]
Name: Tags, Length: 8839, dtype: object
words_cooc_matrix:
[[  0 499 429 366 195 259 106 188 113 124 131  71  62 119  99  63 123  61  89 139]
 [499   0 160 137 280  98 167 235  71  62 105  61 244  59  53  24  35  45  35  37]
 [429 160   0 305 247  59 136  16  72 160  44 103   1  21  32   5  32   2  12  19]
 [366 137 305   0 235  65 108  24  24 118  33  69   1  42  20   9  13   8  11  10]
 [195 280 247 235   0  58 256  34  23 116  51 133   3  31  13  10  11   0   3  17]
 [259  98  59  65  58   0  20  47  35  20  25  20   3  34  28  10  27  12  19  21]
 [106 167 136 108 256  20   0  15  11  57   9  43   3   9   9   1   6   0   0   9]
 [188 235  16  24  34  47  15   0  12   0  12   2  37  37   9   1  12  24   6  18]
 [113  71  72  24  23  35  11  12   0   7   0  19   3   2  11   4   1   9   3   4]
 [124  62 160 118 116  20  57   0   7   0   8  24   1   6  11   2   6   0   1   4]
 [131 105  44  33  51  25   9  12   0   8   0  87  19  24   6  22  31  20  22   7]
 [ 71  61 103  69 133  20  43   2  19  24  87   0   7  11   7   3  13   3   1   5]
 [ 62 244   1   1   3   3   3  37   3   1  19   7   0   6  14   2   4   5   3   4]
 [119  59  21  42  31  34   9  37   2   6  24  11   6   0   6  10  28   2  16   8]
 [ 99  53  32  20  13  28   9   9  11  11   6   7  14   6   0   6   7   5  17  12]
 [ 63  24   5   9  10  10   1   1   4   2  22   3   2  10   6   0  13  16  16   7]
 [123  35  32  13  11  27   6  12   1   6  31  13   4  28   7  13   0   0  16  21]
 [ 61  45   2   8   0  12   0  24   9   0  20   3   5   2   5  16   0   0   3   3]
 [ 89  35  12  11   3  19   0   6   3   1  22   1   3  16  17  16  16   3   0   3]
 [139  37  19  10  17  21   9  18   4   4   7   5   4   8  12   7  21   3   3   0]]
# View data type of the matrix
words_cooc_matrix
<20x20 sparse matrix of type '<class 'numpy.uint32'>' with 386 stored elements in Compressed Sparse Column format>
# Convert the sparse matrix above to a dataframe
matrix = pd.DataFrame(words_cooc_matrix.toarray(), index=subset, columns=subset)
# Tweak the color scale so higher numbers stand out more
norm = plt.Normalize(0, 250)
# Create a mask so that the upper triangle of the matrix is removed (it mirrors the lower)
mask = np.zeros_like(matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Plot
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(matrix,
cmap='RdBu',
annot=True,
lw=2,
cbar=False,
annot_kws={"fontsize":12},
center=0,
norm=norm,
fmt='g',
mask=mask
)
# Plot Aesthetics
ax.tick_params(left=False, bottom=False)
ax.set_xticklabels(frequencies20.index,
fontsize=12,
rotation=45,
ha='right'
)
plt.title(
'Top 20 Tag Co-Occurrence',
fontsize=24,
fontweight=525
)
text = 'Interpretation: The \'pandas\' and \'python\' tags are \n used in the same question 244 times.'
ax.text(0.48, 0.90,
text,
transform=ax.transAxes,
fontsize=14,
verticalalignment='top')
Text(0.48, 0.9, "Interpretation: The 'pandas' and 'python' tags are \n used in the same question 244 times.")
Here we will use our multi-year dataset to see how deep learning interest has changed over time. We will look at monthly as well as yearly changes.
# Read in the multi-year data
all_q = pd.read_csv('all_questions.csv', parse_dates=['CreationDate'])
# Create new column that holds just the year and month of the question (e.g. '201902')
all_q['yearmonth'] = all_q['CreationDate'].dt.strftime('%Y%m')
all_q.head()
| | Id | CreationDate | Tags | yearmonth |
---|---|---|---|---|
0 | 45416 | 2019-02-12 00:36:29 | <python><keras><tensorflow><cnn><probability> | 201902 |
1 | 45418 | 2019-02-12 00:50:39 | <neural-network> | 201902 |
2 | 45422 | 2019-02-12 04:40:51 | <python><ibm-watson><chatbot> | 201902 |
3 | 45426 | 2019-02-12 04:51:49 | <keras> | 201902 |
4 | 45427 | 2019-02-12 05:08:24 | <r><predictive-modeling><machine-learning-mode... | 201902 |
# Aggregate by month and count the number of questions in that month
gp_all = all_q.groupby(by='yearmonth').count()
# Drop unnecessary columns
gp_all = gp_all.drop(columns=['CreationDate', 'Tags'])
# Cast 'yearmonth' to datetime format that includes the year and month
gp_all.index = pd.to_datetime(gp_all.index, format='%Y%m')
# Drop the partial months which are the first and last rows
gp_all.drop(gp_all.tail(1).index,inplace=True)
gp_all.drop(gp_all.head(1).index,inplace=True)
gp_all.head()
| yearmonth | Id |
---|---|
2014-06-01 | 99 |
2014-07-01 | 76 |
2014-08-01 | 65 |
2014-09-01 | 48 |
2014-10-01 | 71 |
# Create new dataframe that will be aggregated by year
gp_all_year = gp_all.reset_index()
# Extract just the year
gp_all_year['year'] = gp_all_year['yearmonth'].dt.year
# Aggregate by year and find total questions for that year
gp_all_year = gp_all_year.groupby('year').sum()
# Drop 2014 data since almost half of that year's data is missing
gp_all_year = gp_all_year.iloc[1:, :]
gp_all_year.head()
| year | Id |
---|---|
2015 | 1167 |
2016 | 2146 |
2017 | 2957 |
2018 | 5475 |
2019 | 8810 |
# Clean 'Tags' column and cast to list
all_q['Tags'] = all_q['Tags'].str.replace('><', ',').str.replace('<', '').str.replace('>', '')
all_q['Tags'] = all_q['Tags'].str.split(pat=',')
# Define function and make new column that indicates whether the question is related to deep learning
def has_dl(value):
    dl_tags = ['deep-learning', 'deep-network', 'neural', 'neural-network',
               'convolutional-neural-network', 'graph-neural-network',
               'neural-style-transfer', 'cnn', 'faster-rcnn', 'rnn',
               'keras', 'keras-rl', 'machine-learning', 'tensorflow']
    for tag in dl_tags:
        if tag in value:
            return True
    # Questions with no matching tag fall through and return None,
    # which shows up as NaN when we group on this column later.
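A quick check of the helper on made-up tag lists confirms the True/None behavior described above:
print(has_dl(['keras', 'python']))     # True
print(has_dl(['r', 'visualization']))  # None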
# Create new column and apply the function
all_q['dl_related'] = all_q['Tags'].apply(has_dl)
# Filter dataframe to include only questions related to deep learning
all_dl = all_q[all_q['dl_related'] == True].copy()
# Create new column that holds just the year and month of the question
all_dl['yearmonth'] = all_dl['CreationDate'].dt.strftime('%Y%m')
# Aggregate by month and count the number of questions in that month
gp = all_dl.groupby(by='yearmonth').count()
# Drop unnecessary columns
gp = gp.drop(columns=['Id', 'CreationDate', 'Tags'])
# Cast 'yearmonth' to datetime format that includes the year and month
gp.index = pd.to_datetime(gp.index, format='%Y%m')
# Drop the partial months which are the first and last rows
gp.drop(gp.tail(1).index,inplace=True)
gp.drop(gp.head(1).index,inplace=True)
gp.head()
| yearmonth | dl_related |
---|---|
2014-06-01 | 33 |
2014-07-01 | 27 |
2014-08-01 | 20 |
2014-09-01 | 19 |
2014-10-01 | 28 |
# Create new dataframe that will be aggregated by year
gp_year = gp.reset_index()
# Extract just the year
gp_year['year'] = gp_year['yearmonth'].dt.year
# Aggregate by year and find total questions for that year
gp_year = gp_year.groupby('year').sum()
# Drop 2014 data since almost half of that year's data is missing
gp_year = gp_year.iloc[1:, :]
gp_year.head()
| year | dl_related |
---|---|
2015 | 455 |
2016 | 942 |
2017 | 1554 |
2018 | 3138 |
2019 | 4780 |
# Create a new column on the gp_year dataframe that holds the total number of questions for that year
gp_year['all'] = gp_all_year['Id'].tolist()
# Cast it to int
gp_year['all'] = gp_year['all'].astype(int)
# Create column that holds the percentage of DL related questions
gp_year['dl_pct'] = round((gp_year['dl_related'] / gp_year['all']) * 100, 2)
# Create column that holds the rate of increase per year for DL related questions
gp_year['dl_rate'] = round(gp_year['dl_related'].pct_change()* 100, 2)
gp_year
| year | dl_related | all | dl_pct | dl_rate |
---|---|---|---|---|
2015 | 455 | 1167 | 38.99 | NaN |
2016 | 942 | 2146 | 43.90 | 107.03 |
2017 | 1554 | 2957 | 52.55 | 64.97 |
2018 | 3138 | 5475 | 57.32 | 101.93 |
2019 | 4780 | 8810 | 54.26 | 52.33 |
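As a quick sanity check on the table above, the growth rate follows directly from the yearly counts; for example, for 2016:
# 2016 rate of increase in deep learning questions
round((942 - 455) / 455 * 100, 2)  # 107.03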
# Plot ax1
fig, ax = plt.subplots(nrows=4, ncols=1, figsize=(10,20))
fig.subplots_adjust(hspace=0.4)
ax1 = plt.subplot(4,1,1)
ax1 = sns.barplot(data=gp_all_year,
x=gp_all_year.index,
y='Id',
color = 'red',
label='All Questions')
ax1 = sns.barplot(data=gp_year,
x=gp_year.index,
y='dl_related',
color = 'steelblue',
label='Deep Learning Questions')
plt.legend(loc='best')
# Plot Aesthetics
ax1.set_title('Number of Questions Per Year',
fontsize=16,
pad=35)
ax1.set_xlabel('')
ax1.set_ylabel('')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(left=True)
sns.set(style='whitegrid')
# Plot ax2
ax2 = plt.subplot(4,1,2)
plt.plot_date(x=gp_all.index,
y=gp_all['Id'],
linestyle='solid',
marker='',
color='red',
label='All Questions')
plt.plot_date(x = gp.index,
y = gp['dl_related'],
linestyle='solid',
marker='',
label='Deep Learning Questions')
# Plot Aesthetics
plt.xticks(rotation = 45,
ha='right',
fontsize=12)
plt.yticks(fontsize=12)
ax2.set_title('Number of Questions Per Month',
fontsize=16,
pad=30)
sns.despine(left=True)
sns.set(style='whitegrid')
plt.legend(loc='best')
# Plot ax3
ax3 = plt.subplot(4,1,3)
ax3 = sns.barplot(data=gp_year,
x=gp_year.index,
y='dl_pct',
color = 'steelblue')
# Plot Aesthetics
ax3.set_title('Percentage of Deep Learning Questions',
fontsize=16,
pad=20)
ax3.set_ylabel('')
ax3.set_ylim([30,60])
ax3.set_xlabel('')
ax3.set_ylabel('Percentage')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(left=True)
sns.set(style='whitegrid')
# Plot ax4
ax4 = plt.subplot(4,1,4)
ax4 = sns.barplot(data=gp_year,
x=gp_year.index,
y='dl_rate',
color = 'steelblue')
# Plot Aesthetics
ax4.set_title('Deep Learning Questions Growth Rate',
fontsize=16,
pad=20)
ax4.set_ylabel('')
ax4.set_xlabel('')
ax4.set_ylabel('Rate of Growth (Percent)')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.despine(left=True)
sns.set(style='whitegrid')
Another indicator of a topic's popularity could be the number of times its questions receive favorite votes. Let's take our 2019 dataset and compare the number of favorite votes received by deep learning related and non deep learning related posts. We will also compare this with the share of deep learning questions in the total number of questions.
# Create new column and apply our function that we made previously
ques['dl_related'] = ques['Tags'].apply(has_dl)
# Groupby and aggregate the columns differently: Favorite gets sum, Id gets count
quesgb = ques.groupby('dl_related', dropna=False).agg({'FavoriteCount':'sum', 'Id':'count'})
quesgb
| dl_related | FavoriteCount | Id |
---|---|---|
True | 961 | 4793 |
NaN | 706 | 4046 |
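Before plotting, we can read the shares straight off this table: deep learning questions collect about 57.6% of favorite votes while making up about 54.2% of questions.
# Favorite-vote share vs. question share for deep learning posts
print(round(961 / (961 + 706) * 100, 1))     # 57.6
print(round(4793 / (4793 + 4046) * 100, 1))  # 54.2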
# Plot
fig, (ax1, ax2) = plt.subplots(1,2)
colors = ['steelblue', 'red']
patches, texts, autotexts = ax1.pie(quesgb['FavoriteCount'],
explode=(0,0.03),
startangle=90,
autopct='%1.1f%%',
colors=colors,
textprops={'fontsize': 10})
patches, texts, autotexts = ax2.pie(quesgb['Id'],
explode=(0,0.03),
startangle=90,
autopct='%1.1f%%',
colors=colors,
textprops={'fontsize': 10})
# Plot Aesthetics
ax1.set_title('Total Favorite Votes', y=0.97)
ax2.set_title('Total Number of Questions', y=0.97)
fig.suptitle('2019 Data Science Questions', fontsize=16)
fig.legend(['Deep Learning Related', 'Not Deep Learning Related'],
prop={'size': 8},
loc='lower center')
plt.tight_layout()