We've been asked to help inform the content strategy for a website that creates data science content. We've been tasked with figuring out what people want to learn about in data science. In order to answer these questions we've looked at patterns of tag use on the Data Science Stack Exchange (DSSE).
The Data Science StackExchange was founded in 2014. Since then, people have posted approximately 26,000 questions and 18,000 answers to those questions. In the process, they've applied six hundred different tags in classifying the questions. Activity on the site has grown steadily since its inception to a current level of almost 2,500 new questions posted per quarter.
The site offeres us a perspective on what data science related topics people are interested in, and their data portal makes it easy to extract site data for further analysis.
We studied tags used on questions used on the Data Science StackExchange using extracts obtained from the data portal. We identified the most commonly used tags on questions and how they've changed over time. We also identified the tags most commonly associated with the most frequently viewed questions.
We found that there is significant overlap between the ten tags people use most often when posting questions and the ten tags most frequently found on the most frequently viewed questions. These tags group logically into two overlapping sets.
deep-learning
, keras
, tensorflow
, cnn
'python
,keras
, tensorflow
, scikit-kearn
, pandas
, dataframe
When we looked at the use of Deep Learning related tags on questions over time we found that they rose from making up less than 5% of the questions in the early days of the site to a high almost 30% by the end of 2018. Since then they've made up a steady ~25% of questions on the site as overall activity on the site has continued to increase.
In order to find topics of rising importance we looked past the top ten most used and viewed tags and identified the next 30 ranks tags in each category. We identifed the tags in common between both categories, which gave us a list of twenty tags. We then looked at the trends in use of these tags over time. This revealed two tags that first appeared around 2017 and have grown to appear on sigificant numbers of new questions.
lstm
: Long short-term memory is a generall class of deep learning model.pytorch
: is a deep learning and machine learning library for Python.Both fit into the areas of interest defined above, so they don't represent completely new topic areas, but they do suggest areas of focus that could yield high ROI if interest in them continues to grow.
Our trend analysis called attention to the declining popularity of certain topics:
R
, a language for statistical analysis, was once used on 20% of new questions, but is now found on less than 5%.data-mining
a term first popularized in the 1980s appeared on ~18% of questions when DSSE first started, but is now found on less than 5%.These declines were unusual. Most other popular tags have been fairly steady in their popularity.
Our analysis suggests that developing data science content related to Python and deep learning is likely to find an audience. Moreover lstm
and pytorch
seem likely to find a growing audience. The downside risks are low, since interest in datascience topics overall continues to grow.
For further insights we could perform a similar exploration of other StackExchange sites mentioned in the sources section, below.
A cursory examination shows that some of the popular tags we identified on DSSE, like neural networks
, machine learning
, neural networks
, tensor flow
are also common on Cross Validated (the statistics StackExchange) and Artificial Intelligence SE. Python
is a popular tag on StackOverflow, and is frequently combined with pandas
, tensorflow
and dataframe
, all popular tags on DSSE. Looking at tags on other StackExchange sites related to popular tags on DSSE could provide a route to identify other relevant content areas.
StackExchange is a network of question and answer sites. The flagship community, StackOverflow is a fixture of google search results for programming related topics as well as being a destination in its own-right.
In addition to StackOverflow there is the Data Sceince Stack Exchange (DSSE) and other potentially relevant sites.
These sites represent an attractive target for our research because StackExchange makes a great deal of data about the sites content and activity available through a SQL-based data portal.
In order to understand the available data its helpful to take a closer look at a typical StackExchange web site. Let's focus on DSSE, for obvious reasons.
Looking at the Data Science StackExchange, we see that the body of the homepage is filled with questions asked on the site that are considered "active" at the time the page loaded. This view can also be switched to show the "hot" questions of the past few days, week or month. Finally, one can look at questions that people have placed "bounties" on.
There is another view of quesitons that allows filtering by newest, active, bountied, unanswered, "frequent" and most voted for.
One can also drill into the site by way of tags. One can search for tags, or click through a grid of popular or newly created tags. There are 18 pages of tags. The number of questions and recent activity on each tag is called out on the listing.
One can also browse by lists of users, filterable based on their activity/popularity over various time periods.
There is also a special view for just unanswered questions, filterable by personalized tags, newest, number of votes and the lack of any attempted answers.
As mentioned abover, an extract of the database for the DSSE is queryable using MS-SQL (the same can be done for other StackExchange sites). There is a tutorial with some basic schema info and some help.
Exploring the database, I found that there are:
It looks like both "questions" and "answers" are Posts. Posts have a PostType of 1 and no ParentID. Answers have a PostType of 2 and a populated ParentID. DeletionDate appears to be unused. ClosedDate is populated for a subset.
The posts table includes information about votes and views, which should be helpful in finding popular questions and answers.
The PostTags table has information needed to compute popular tags
SELECT TOP 50 t.TagName, count(*)
FROM PostTags pt
LEFT JOIN Tags t ON t.Id = pt.TagId
GROUP BY t.TagName, t.Id
ORDER BY count(*) DESC
To start, I'll extraced metadata, including tags, for question and answer posts made in 2019.
SELECT
Id, PostTypeId, CreationDate, Score,
ViewCount, Tags, AnswerCount, FavoriteCount
FROM posts
WHERE PostTypeID IN (1,2)
AND YEAR(CreationDate) = 2019;
Results exported and stored alongside this notebook in 2019_DSSE_posts_metadata_extract.csv
# Import needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations
# Configure notebook environment
%matplotlib inline
%config InlineBackend.figure_format='retina'
#pd.options.display.max_rows = 200
pd.options.display.max_columns = 30
posts = pd.read_csv('2019_DSSE_posts_metadata_extract.csv')
posts.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 15334 entries, 0 to 15333 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 15334 non-null int64 1 PostTypeId 15334 non-null int64 2 CreationDate 15334 non-null object 3 Score 15334 non-null int64 4 ViewCount 7580 non-null float64 5 Tags 7580 non-null object 6 AnswerCount 7580 non-null float64 7 FavoriteCount 1526 non-null float64 dtypes: float64(3), int64(3), object(2) memory usage: 958.5+ KB
There are 15,334 total rows in the extract. All have scores. Only 7,580 have ViewCount and Tags, and AnswerCount populated. Only 1,526 have favorite count populated.
masked_posts = pd.concat([posts['PostTypeId'], posts.iloc[:,2:].notnull()],axis=1)
masked_posts.pivot_table(index='PostTypeId', values=masked_posts.columns[1:], aggfunc='sum', margins=True)
AnswerCount | CreationDate | FavoriteCount | Score | Tags | ViewCount | |
---|---|---|---|---|---|---|
PostTypeId | ||||||
1 | 7580 | 7580 | 1526 | 7580 | 7580 | 7580 |
2 | 0 | 7754 | 0 | 7754 | 0 | 0 |
All | 7580 | 15334 | 1526 | 15334 | 7580 | 7580 |
Depending on the analysis we do, it could be useful to fill in some of the unpopulated fields. Zeros seem approrpiate for the favorite counts, and potentially, answer counts.
# Replace NA with 0
posts[['ViewCount', 'AnswerCount', 'FavoriteCount']] = posts[['ViewCount', 'AnswerCount', 'FavoriteCount']].fillna(0)
posts.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 15334 entries, 0 to 15333 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 15334 non-null int64 1 PostTypeId 15334 non-null int64 2 CreationDate 15334 non-null object 3 Score 15334 non-null int64 4 ViewCount 15334 non-null float64 5 Tags 7580 non-null object 6 AnswerCount 15334 non-null float64 7 FavoriteCount 15334 non-null float64 dtypes: float64(3), int64(3), object(2) memory usage: 958.5+ KB
Fix column types
posts['CreationDate'] = pd.to_datetime(posts['CreationDate'])
posts[['ViewCount', 'AnswerCount', 'FavoriteCount']] = posts[['ViewCount', 'AnswerCount', 'FavoriteCount']].astype(int)
posts.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 15334 entries, 0 to 15333 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 15334 non-null int64 1 PostTypeId 15334 non-null int64 2 CreationDate 15334 non-null datetime64[ns] 3 Score 15334 non-null int64 4 ViewCount 15334 non-null int64 5 Tags 7580 non-null object 6 AnswerCount 15334 non-null int64 7 FavoriteCount 15334 non-null int64 dtypes: datetime64[ns](1), int64(6), object(1) memory usage: 958.5+ KB
posts['Tags'].value_counts()
<machine-learning> 100 <python><pandas> 56 <python> 46 <r> 36 <reinforcement-learning> 30 ... <classification><lstm><multiclass-classification><multilabel-classification> 1 <python><neural-network><keras><evaluation><hyperparameter-tuning> 1 <classification><regression><evaluation><overfitting> 1 <classification><nlp><text> 1 <word2vec><word-embeddings><bert> 1 Name: Tags, Length: 5650, dtype: int64
The tags field contain strings for all the tags on the post, delimited by angle brackets around each tag.
In order to make the tag data easier to work with I'll convert the strings into a list of individual tags.
posts['Tags'] = posts['Tags'].str.replace('><', ',').str.strip('<>').str.split(',')
posts['Tags'].head()
0 NaN 1 [machine-learning, gaussian, discriminant-anal... 2 [classification, unbalanced-classes] 3 NaN 4 NaN Name: Tags, dtype: object
#replace NaN with empty lists to make further processing easier
no_tags = posts['Tags'].isna()
posts.loc[no_tags,'Tags'] = posts.loc[no_tags,'Tags'].apply(lambda x: [])
#Extract individual tags and relevant metadata from each post
tag_pile = []
for i, p in posts.iterrows():
post_views = p['ViewCount']
for t in p['Tags']:
tag_pile.append([t, post_views, 1])
tag_pile = pd.DataFrame(tag_pile, columns=['tag', 'views', 'count'])
# Aggregate tags
tag_summary = tag_pile.pivot_table(index='tag', values=['count', 'views'], aggfunc=sum)
most_used = pd.DataFrame(tag_summary['count'].sort_values(ascending=False), columns=['count'])
most_used
count | |
---|---|
tag | |
machine-learning | 2344 |
python | 1589 |
deep-learning | 1013 |
neural-network | 915 |
keras | 803 |
... | ... |
isolation-forest | 1 |
kendalls-tau-coefficient | 1 |
knime | 1 |
lda-classifier | 1 |
.net | 1 |
501 rows × 1 columns
most_viewed = pd.DataFrame(tag_summary['views'].sort_values(ascending=False), columns=['views'])
most_viewed
views | |
---|---|
tag | |
python | 1355080 |
machine-learning | 872335 |
pandas | 608055 |
keras | 607205 |
deep-learning | 507577 |
... | ... |
state-of-the-art | 17 |
galago | 15 |
frequentist | 15 |
statsmodels | 10 |
multivariate-distribution | 8 |
501 rows × 1 columns
fig, [ax1, ax2] = plt.subplots(1,2, figsize=(12,5))
most_used.head(10).plot.bar(y='count', ax=ax1, title="Most Used")
most_viewed.head(10).plot.bar(y='views', ax=ax2, title="Most Viewed")
<AxesSubplot:title={'center':'Most Viewed'}, xlabel='tag'>
There is a great deal of overlap between the tags most used on questions posted to DSSE and the most common tags on viewed questions; eight of the tags are shared between the two lists.
If we look closely, we also see that many of these tags are related.
Python is the most viewed and second most used tag. Further, many of the other tags are python-related: "Pandas" is a Python data manipulation and analysis package, "dataframes" are the primary data structure in Pandas; "Keras," "Tensorflow" and "scikit-learn" are all machine learning tools that are used primarily or exclusively with Python.
The other category of commonality revolves around Deep Learning. "CNN" (covolutional neural networks) and "TensorFlow" are deep-learning technologies. "Keras" considers itself a deep-learning API. Moreover, deep-learning relies on neural-networks.
We might define some groups of related tags thusly:
python_related = ['python','keras','tensorflow', 'scikit-kearn', 'pandas', 'dataframe']
deep_learning_related = ['deep-learning', 'keras', 'tensorflow', 'cnn']
The groupings we identified above suggest that it makes sense to focus on Deep Learning and Python-related content. Let's see if a closer analysis of tag combinations provides any further inspiration.
tag_pairs =[]
for i, p in posts.iterrows():
tags = p['Tags']
# putting tags in a consitent order ensures that combinations
# are in a consistent order, thus ensuring that pairs are
# always represented the same way.
# ie always ('cat', 'fat'), never ('fat', 'cat')
tags.sort()
if "" in tags:
print(tags)
if len(tags) > 1:
comb = list(combinations(tags,2))
tag_pairs.extend(comb)
tag_pairs = pd.DataFrame(tag_pairs, columns=['tag1', 'tag2'])
tag_pairs.shape
(29141, 2)
tag_pairs.value_counts().head(20)
tag1 tag2 machine-learning python 443 deep-learning machine-learning 365 machine-learning neural-network 324 deep-learning neural-network 269 keras python 237 classification machine-learning 234 pandas python 231 keras tensorflow 226 deep-learning keras 215 keras neural-network 213 python scikit-learn 210 keras machine-learning 172 machine-learning scikit-learn 169 python tensorflow 135 cnn deep-learning 134 deep-learning python 125 neural-network python 124 machine-learning machine-learning-model 122 deep-learning tensorflow 113 machine-learning time-series 113 dtype: int64
tag_triads =[]
for i, p in posts.iterrows():
tags = p['Tags']
# putting tags in a consitent order ensures that combinations
# are in a consistent order, thus ensuring that pairs are
# always represented the same way.
# ie always ('cat', 'fat'), never ('fat', 'cat')
tags.sort()
if "" in tags:
print(tags)
if len(tags) > 2:
comb = list(combinations(tags,3))
tag_triads.extend(comb)
tag_triads = pd.DataFrame(tag_triads, columns=['tag1', 'tag2', 'tag3'])
tag_triads.value_counts().head(20)
tag1 tag2 tag3 deep-learning machine-learning neural-network 112 machine-learning python scikit-learn 84 deep-learning keras neural-network 66 keras python tensorflow 65 machine-learning neural-network 61 python 54 deep-learning keras machine-learning 54 keras neural-network tensorflow 52 deep-learning keras tensorflow 51 keras neural-network python 50 deep-learning keras python 48 cnn machine-learning neural-network 47 machine-learning pandas python 42 dataframe pandas python 40 keras machine-learning tensorflow 39 deep-learning machine-learning python 38 cnn deep-learning neural-network 37 machine-learning 36 keras 36 classification machine-learning python 34 dtype: int64
tag_triads.shape
(20052, 3)
Examining tag combinations was an interesting technical exercise, and it underscores the relevance of the Deep Learning and Python-related tag groupings we came up with earlier.
Data was obtained from https://data.stackexchange.com/datascience/query/new using the following query:
SELECT Id, CreationDate, Tags
FROM Posts
WHERE PostTypeID=1;
The results are stored alongside this noteboo in 20200904_all_DSSE_posts_metadata_extract.csv
all_questions = pd.read_csv('20200904_all_DSSE_posts_metadata_extract.csv')
all_questions['CreationDate'] = pd.to_datetime(all_questions['CreationDate'])
# convert tags text to a column lists
all_questions['Tags'] = all_questions['Tags'].str.replace('><', ',').str.strip('<>').str.split(',')
#replace NaN with empty lists to make further processing easier
no_tags = all_questions['Tags'].isna()
posts.loc[no_tags,'Tags'] = posts.loc[no_tags,'Tags'].apply(lambda x: [])
# create classification column to flag questions with deep-learning tag.
all_questions['deep-learning'] = all_questions['Tags'].apply(lambda x: True if "deep-learning" in x else False)
# identify questions bearing any of the deep learning-related tags identified earlier
def is_dl_related(tags):
dl_tags = set(deep_learning_related)
tags=set(tags)
result = True if dl_tags & tags else False
return result
all_questions['deep-learning-related'] = all_questions['Tags'].apply(is_dl_related)
all_questions.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25675 entries, 0 to 25674 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 25675 non-null int64 1 CreationDate 25675 non-null datetime64[ns] 2 Tags 25675 non-null object 3 deep-learning 25675 non-null bool 4 deep-learning-related 25675 non-null bool dtypes: bool(2), datetime64[ns](1), int64(1), object(1) memory usage: 652.0+ KB
dl_trend = all_questions.loc[all_questions['deep-learning'],['CreationDate','deep-learning']].resample('Q', on='CreationDate').count()
dl_trend[:-1].plot.line(y='deep-learning', figsize=(10,6))
<AxesSubplot:xlabel='CreationDate'>
# Create a column to use to count occurances during aggregation
all_questions['posts'] = True
# aggregate on a quarterly basis
trends = all_questions.resample('Q', on='CreationDate').sum()
# create calculated columns to show relationships
trends['deep-learning-ratio'] = trends['deep-learning-related'] / trends['posts']
trends['dlr_to_dl'] = trends['deep-learning-related'] / trends['deep-learning']
# chart, leave off latest, incomplete, quarter
trends[:-1].plot.line(y=['posts',
'deep-learning-related',
'deep-learning-ratio',
'dlr_to_dl'], subplots=True, figsize =(10,16))
array([<AxesSubplot:xlabel='CreationDate'>, <AxesSubplot:xlabel='CreationDate'>, <AxesSubplot:xlabel='CreationDate'>, <AxesSubplot:xlabel='CreationDate'>], dtype=object)
The quarterly volume of new questions on the Data Science Stack Exchange has grown in most quarters since its founding in 2014. The same has held true for the absolute volume of deep learning-related questions.
The relative proportions of Deep Learning-related learning questions grew between 2014 and the second quarter of 2018, after which it dropped somewhat. It has been holding roughly constant ever since.
Over time, people have become increacingly likely to omit the deep-learning
tag while asking questions bearing other deep learning-related tags.
If we use the DSSE as a proxy for the wider market for data science-related information, we see that deep learning is a popular area of interest. It's strongest growth as a topic-area may be behind it but there may be other areas of interest that are seeing more substantial growth.
Demand for deep-learning content is both large, relative to other data science content, and growing overall as the audience for data science related grows. However, there may be other data science topics which will find larger audiences in a few years time.
Let's look at the trends for other top DSSE tags to see what the future might hold.
Below we'll try to identify emerging data science topics so we can target content development that could provide a high ROI in the future.
We'll do that by looking at trends in use of the most viewed and used tags.
second_tier_tags = most_used.iloc[10:41].join(most_viewed.iloc[10:41], how='inner')
second_tier_tags
count | views | |
---|---|---|
tag | ||
time-series | 369 | 146200 |
lstm | 316 | 160160 |
regression | 303 | 105615 |
dataset | 288 | 92881 |
clustering | 243 | 75838 |
r | 227 | 77045 |
machine-learning-model | 196 | 50876 |
data-mining | 187 | 83384 |
feature-selection | 185 | 63572 |
image-classification | 178 | 76270 |
data-science-model | 166 | 52839 |
xgboost | 156 | 82513 |
linear-regression | 155 | 45177 |
loss-function | 147 | 49847 |
pytorch | 137 | 98987 |
data-cleaning | 134 | 68648 |
logistic-regression | 133 | 45893 |
cross-validation | 131 | 45136 |
decision-trees | 130 | 51921 |
multiclass-classification | 116 | 44666 |
Earlier we created a column indicating questions with deep learning-related tags so we could look at their prevalance over time. Now we want to look at trends for multiple tags. This will require a different approach. We'll start by breaking up the lists of tags on each question into separate rows.
tag_use = all_questions.iloc[:,:3].explode('Tags')
tag_use = tag_use.reset_index(drop=True)
tag_use
Id | CreationDate | Tags | |
---|---|---|---|
0 | 32397 | 2018-05-30 13:41:39 | predictive-modeling |
1 | 32397 | 2018-05-30 13:41:39 | time-series |
2 | 32398 | 2018-05-30 13:44:38 | python |
3 | 32398 | 2018-05-30 13:44:38 | deep-learning |
4 | 32398 | 2018-05-30 13:44:38 | time-series |
... | ... | ... | ... |
77055 | 80980 | 2020-08-29 20:52:02 | machine-learning |
77056 | 80980 | 2020-08-29 20:52:02 | neural-network |
77057 | 80980 | 2020-08-29 20:52:02 | deep-learning |
77058 | 80980 | 2020-08-29 20:52:02 | computer-vision |
77059 | 80980 | 2020-08-29 20:52:02 | faster-rcnn |
77060 rows × 3 columns
Next we perform an inner join to our tags of interest to eliminate irrelevant rows and discard unneeded columns as well.
second_tier_tag_use = second_tier_tags.merge(tag_use, how='inner', left_index= True, right_on='Tags')
second_tier_tag_use = second_tier_tag_use[['CreationDate','Tags']].reset_index(drop=True)
second_tier_tag_use
CreationDate | Tags | |
---|---|---|
0 | 2018-05-30 13:41:39 | time-series |
1 | 2018-05-30 13:44:38 | time-series |
2 | 2018-05-31 15:08:48 | time-series |
3 | 2018-06-01 19:52:45 | time-series |
4 | 2020-03-29 01:56:33 | time-series |
... | ... | ... |
13467 | 2018-05-29 19:39:32 | multiclass-classification |
13468 | 2017-09-12 22:44:47 | multiclass-classification |
13469 | 2020-08-27 10:58:06 | multiclass-classification |
13470 | 2020-08-27 14:17:29 | multiclass-classification |
13471 | 2020-08-27 20:32:53 | multiclass-classification |
13472 rows × 2 columns
# Add column to aggregate
second_tier_tag_use['Posts'] = 1
# Take daily data and aggregate/resample by quarter
second_tier_trends = second_tier_tag_use.groupby('Tags').resample('Q', on='CreationDate').sum()
# Create separate column for each tag
second_tier_trends = second_tier_trends.unstack('Tags').droplevel(0, axis=1)
# Add column of total posts per quarter
second_tier_trends['total_posts'] = trends.loc[second_tier_trends.index, 'posts']
second_tier_normalized = second_tier_trends.iloc[:,:20].div(second_tier_trends['total_posts']/100, axis=0)
goo = second_tier_normalized[:-1].plot.line(subplots=True, sharey=True, figsize=(10,50))
We see that some of these second tier tags are actually declining in relative importance. In particular, r
and data-mining
have declined significantly over the life of DSSE from highs of 15% to less than 5%. Many other have held pretty steady. A few though have seen recent growth in use.
lstm
first appeared ~2017 and has grown to almost 5%, putting it on par with as the most used second tier tags.pytorch
appeared about the same time and has risen to ~2.5%.data-science-model
and machine-learning-model
arose in 2018 and has already risen to ~2.5% of questions. Looking at questions with these tags, it doesn't appear that their meaning is specific or well defined. This makes them low value for our purposes.To keep things in perspective, below we look at trends for all top tags, rather than just the second tier.
top_tags = most_used.iloc[:41].join(most_viewed.iloc[:41], how='inner')
top_tag_use = top_tags.merge(tag_use, how='inner', left_index= True, right_on='Tags')
top_tag_use = top_tag_use[['CreationDate','Tags']].reset_index(drop=True)
top_tag_use['Posts'] = 1
top_tag_trends = top_tag_use.groupby('Tags').resample('Q', on='CreationDate').sum()
top_tag_trends = top_tag_trends.unstack('Tags').droplevel(0, axis=1)
top_tag_trends['total_posts'] = trends.loc[top_tag_trends.index, 'posts']
top_tags_normalized =top_tag_trends.iloc[:,:-1].div(top_tag_trends['total_posts']/100, axis=0)
goo = top_tags_normalized.plot.line(subplots=True, sharey=True, figsize=(8,70))