Introduction¶

We've been asked to help inform the content strategy for a website that creates data science content. We've been tasked with figuring out what people want to learn about in data science. In order to answer these questions we've looked at patterns of tag use on the Data Science Stack Exchange (DSSE).

Executive Summary¶

The Data Science StackExchange was founded in 2014. Since then, people have posted approximately 26,000 questions and 18,000 answers to those questions. In the process, they've applied six hundred different tags in classifying the questions. Activity on the site has grown steadily since its inception to a current level of almost 2,500 new questions posted per quarter.

The site offeres us a perspective on what data science related topics people are interested in, and their data portal makes it easy to extract site data for further analysis.

We studied tags used on questions used on the Data Science StackExchange using extracts obtained from the data portal. We identified the most commonly used tags on questions and how they've changed over time. We also identified the tags most commonly associated with the most frequently viewed questions.

Findings¶

We found that there is significant overlap between the ten tags people use most often when posting questions and the ten tags most frequently found on the most frequently viewed questions. These tags group logically into two overlapping sets.

Deep Learning related: deep-learning, keras, tensorflow, cnn
Python related: 'python,keras, tensorflow, scikit-kearn, pandas, dataframe

When we looked at the use of Deep Learning related tags on questions over time we found that they rose from making up less than 5% of the questions in the early days of the site to a high almost 30% by the end of 2018. Since then they've made up a steady ~25% of questions on the site as overall activity on the site has continued to increase.

In order to find topics of rising importance we looked past the top ten most used and viewed tags and identified the next 30 ranks tags in each category. We identifed the tags in common between both categories, which gave us a list of twenty tags. We then looked at the trends in use of these tags over time. This revealed two tags that first appeared around 2017 and have grown to appear on sigificant numbers of new questions.

lstm: Long short-term memory is a generall class of deep learning model.
pytorch: is a deep learning and machine learning library for Python.

Both fit into the areas of interest defined above, so they don't represent completely new topic areas, but they do suggest areas of focus that could yield high ROI if interest in them continues to grow.

Our trend analysis called attention to the declining popularity of certain topics:

R, a language for statistical analysis, was once used on 20% of new questions, but is now found on less than 5%.
data-mining a term first popularized in the 1980s appeared on ~18% of questions when DSSE first started, but is now found on less than 5%.

These declines were unusual. Most other popular tags have been fairly steady in their popularity.

Reccomendation¶

Our analysis suggests that developing data science content related to Python and deep learning is likely to find an audience. Moreover lstm and pytorch seem likely to find a growing audience. The downside risks are low, since interest in datascience topics overall continues to grow.

Future Directions¶

For further insights we could perform a similar exploration of other StackExchange sites mentioned in the sources section, below.

A cursory examination shows that some of the popular tags we identified on DSSE, like neural networks, machine learning, neural networks, tensor flow are also common on Cross Validated (the statistics StackExchange) and Artificial Intelligence SE. Python is a popular tag on StackOverflow, and is frequently combined with pandas, tensorflow and dataframe, all popular tags on DSSE. Looking at tags on other StackExchange sites related to popular tags on DSSE could provide a route to identify other relevant content areas.

Sources¶

StackExchange is a network of question and answer sites. The flagship community, StackOverflow is a fixture of google search results for programming related topics as well as being a destination in its own-right.

In addition to StackOverflow there is the Data Sceince Stack Exchange (DSSE) and other potentially relevant sites.

These sites represent an attractive target for our research because StackExchange makes a great deal of data about the sites content and activity available through a SQL-based data portal.

In order to understand the available data its helpful to take a closer look at a typical StackExchange web site. Let's focus on DSSE, for obvious reasons.

Looking at the Data Science StackExchange, we see that the body of the homepage is filled with questions asked on the site that are considered "active" at the time the page loaded. This view can also be switched to show the "hot" questions of the past few days, week or month. Finally, one can look at questions that people have placed "bounties" on.

There is another view of quesitons that allows filtering by newest, active, bountied, unanswered, "frequent" and most voted for.

One can also drill into the site by way of tags. One can search for tags, or click through a grid of popular or newly created tags. There are 18 pages of tags. The number of questions and recent activity on each tag is called out on the listing.

One can also browse by lists of users, filterable based on their activity/popularity over various time periods.

There is also a special view for just unanswered questions, filterable by personalized tags, newest, number of votes and the lack of any attempted answers.

As mentioned abover, an extract of the database for the DSSE is queryable using MS-SQL (the same can be done for other StackExchange sites). There is a tutorial with some basic schema info and some help.

Exploring the database, I found that there are:

603 tags
19 tag synonyms
54,656 posts
- 26173 posts with no ParentID. DSSE claims 25,782 posts, so my understanding of a "question" is probably incomplete.
77,075 "PostTags" mapping posts to tags.

It looks like both "questions" and "answers" are Posts. Posts have a PostType of 1 and no ParentID. Answers have a PostType of 2 and a populated ParentID. DeletionDate appears to be unused. ClosedDate is populated for a subset.

The posts table includes information about votes and views, which should be helpful in finding popular questions and answers.

The PostTags table has information needed to compute popular tags

SELECT TOP 50 t.TagName, count(*) 
FROM PostTags pt
LEFT JOIN Tags t ON t.Id = pt.TagId
GROUP BY t.TagName, t.Id
ORDER BY count(*) DESC

To start, I'll extraced metadata, including tags, for question and answer posts made in 2019.

SELECT 
  Id, PostTypeId, CreationDate, Score,
  ViewCount, Tags, AnswerCount, FavoriteCount
FROM posts
WHERE PostTypeID IN (1,2)
AND YEAR(CreationDate) = 2019;

Results exported and stored alongside this notebook in 2019_DSSE_posts_metadata_extract.csv

Preparation¶

In [1]:

# Import needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations

# Configure notebook environment
%matplotlib inline
%config InlineBackend.figure_format='retina'
#pd.options.display.max_rows = 200
pd.options.display.max_columns = 30

In [2]:

posts = pd.read_csv('2019_DSSE_posts_metadata_extract.csv')

In [3]:

posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15334 entries, 0 to 15333
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             15334 non-null  int64  
 1   PostTypeId     15334 non-null  int64  
 2   CreationDate   15334 non-null  object 
 3   Score          15334 non-null  int64  
 4   ViewCount      7580 non-null   float64
 5   Tags           7580 non-null   object 
 6   AnswerCount    7580 non-null   float64
 7   FavoriteCount  1526 non-null   float64
dtypes: float64(3), int64(3), object(2)
memory usage: 958.5+ KB

There are 15,334 total rows in the extract. All have scores. Only 7,580 have ViewCount and Tags, and AnswerCount populated. Only 1,526 have favorite count populated.

In [4]:

masked_posts = pd.concat([posts['PostTypeId'], posts.iloc[:,2:].notnull()],axis=1)
masked_posts.pivot_table(index='PostTypeId', values=masked_posts.columns[1:], aggfunc='sum', margins=True)

Out[4]:

	AnswerCount	CreationDate	FavoriteCount	Score	Tags	ViewCount
PostTypeId
1	7580	7580	1526	7580	7580	7580
2	0	7754	0	7754	0	0
All	7580	15334	1526	15334	7580	7580

All posts have creation dates and scores.
All questions (PostTypeID=1) have answer counts, and tags
Only some questions have favorite counts.

Depending on the analysis we do, it could be useful to fill in some of the unpopulated fields. Zeros seem approrpiate for the favorite counts, and potentially, answer counts.

Clean Data¶

In [5]:

# Replace NA with 0
posts[['ViewCount', 'AnswerCount', 'FavoriteCount']] = posts[['ViewCount', 'AnswerCount', 'FavoriteCount']].fillna(0)

In [6]:

posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15334 entries, 0 to 15333
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             15334 non-null  int64  
 1   PostTypeId     15334 non-null  int64  
 2   CreationDate   15334 non-null  object 
 3   Score          15334 non-null  int64  
 4   ViewCount      15334 non-null  float64
 5   Tags           7580 non-null   object 
 6   AnswerCount    15334 non-null  float64
 7   FavoriteCount  15334 non-null  float64
dtypes: float64(3), int64(3), object(2)
memory usage: 958.5+ KB

Fix column types

In [7]:

posts['CreationDate'] = pd.to_datetime(posts['CreationDate'])
posts[['ViewCount', 'AnswerCount', 'FavoriteCount']] = posts[['ViewCount', 'AnswerCount', 'FavoriteCount']].astype(int)
posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15334 entries, 0 to 15333
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             15334 non-null  int64         
 1   PostTypeId     15334 non-null  int64         
 2   CreationDate   15334 non-null  datetime64[ns]
 3   Score          15334 non-null  int64         
 4   ViewCount      15334 non-null  int64         
 5   Tags           7580 non-null   object        
 6   AnswerCount    15334 non-null  int64         
 7   FavoriteCount  15334 non-null  int64         
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 958.5+ KB

Extract Tags¶

In [8]:

posts['Tags'].value_counts()

Out[8]:

<machine-learning>                                                              100
<python><pandas>                                                                 56
<python>                                                                         46
<r>                                                                              36
<reinforcement-learning>                                                         30
                                                                               ... 
<classification><lstm><multiclass-classification><multilabel-classification>      1
<python><neural-network><keras><evaluation><hyperparameter-tuning>                1
<classification><regression><evaluation><overfitting>                             1
<classification><nlp><text>                                                       1
<word2vec><word-embeddings><bert>                                                 1
Name: Tags, Length: 5650, dtype: int64

The tags field contain strings for all the tags on the post, delimited by angle brackets around each tag.

In order to make the tag data easier to work with I'll convert the strings into a list of individual tags.

In [9]:

posts['Tags'] = posts['Tags'].str.replace('><', ',').str.strip('<>').str.split(',')
posts['Tags'].head()

Out[9]:

0                                                  NaN
1    [machine-learning, gaussian, discriminant-anal...
2                 [classification, unbalanced-classes]
3                                                  NaN
4                                                  NaN
Name: Tags, dtype: object

In [10]:

#replace NaN with empty lists to make further processing easier
no_tags = posts['Tags'].isna()
posts.loc[no_tags,'Tags'] = posts.loc[no_tags,'Tags'].apply(lambda x: [])

Analysis¶

Tag Analysis¶

In [11]:

#Extract individual tags and relevant metadata from each post
tag_pile = []
for i, p in posts.iterrows():
    post_views = p['ViewCount']
    for t in p['Tags']:
        tag_pile.append([t, post_views, 1])
tag_pile = pd.DataFrame(tag_pile, columns=['tag', 'views', 'count']) 

In [12]:

# Aggregate tags
tag_summary = tag_pile.pivot_table(index='tag', values=['count', 'views'], aggfunc=sum)
most_used = pd.DataFrame(tag_summary['count'].sort_values(ascending=False), columns=['count'])
most_used

Out[12]:

	count
tag
machine-learning	2344
python	1589
deep-learning	1013
neural-network	915
keras	803
...	...
isolation-forest	1
kendalls-tau-coefficient	1
knime	1
lda-classifier	1
.net	1

501 rows × 1 columns

In [13]:

most_viewed = pd.DataFrame(tag_summary['views'].sort_values(ascending=False), columns=['views'])
most_viewed

Out[13]:

	views
tag
python	1355080
machine-learning	872335
pandas	608055
keras	607205
deep-learning	507577
...	...
state-of-the-art	17
galago	15
frequentist	15
statsmodels	10
multivariate-distribution	8

501 rows × 1 columns

In [14]:

fig, [ax1, ax2] = plt.subplots(1,2, figsize=(12,5))
most_used.head(10).plot.bar(y='count', ax=ax1, title="Most Used")
most_viewed.head(10).plot.bar(y='views', ax=ax2, title="Most Viewed")

Out[14]:

<AxesSubplot:title={'center':'Most Viewed'}, xlabel='tag'>

There is a great deal of overlap between the tags most used on questions posted to DSSE and the most common tags on viewed questions; eight of the tags are shared between the two lists.

If we look closely, we also see that many of these tags are related.

Python is the most viewed and second most used tag. Further, many of the other tags are python-related: "Pandas" is a Python data manipulation and analysis package, "dataframes" are the primary data structure in Pandas; "Keras," "Tensorflow" and "scikit-learn" are all machine learning tools that are used primarily or exclusively with Python.

The other category of commonality revolves around Deep Learning. "CNN" (covolutional neural networks) and "TensorFlow" are deep-learning technologies. "Keras" considers itself a deep-learning API. Moreover, deep-learning relies on neural-networks.

We might define some groups of related tags thusly:

In [15]:

python_related = ['python','keras','tensorflow', 'scikit-kearn', 'pandas', 'dataframe']
deep_learning_related = ['deep-learning', 'keras', 'tensorflow', 'cnn']

Tag Combinations¶

The groupings we identified above suggest that it makes sense to focus on Deep Learning and Python-related content. Let's see if a closer analysis of tag combinations provides any further inspiration.

In [16]:

tag_pairs =[]
for i, p in posts.iterrows():
    tags = p['Tags']
    # putting tags in a consitent order ensures that combinations 
    # are in a consistent order, thus ensuring that pairs are 
    # always represented the same way.
    # ie always ('cat', 'fat'), never ('fat', 'cat')
    tags.sort()
    if "" in tags:
        print(tags)
    if len(tags) > 1:
        comb = list(combinations(tags,2)) 
        tag_pairs.extend(comb)
tag_pairs = pd.DataFrame(tag_pairs, columns=['tag1', 'tag2'])

In [17]:

tag_pairs.shape

Out[17]:

(29141, 2)

In [18]:

tag_pairs.value_counts().head(20)

Out[18]:

tag1              tag2                  
machine-learning  python                    443
deep-learning     machine-learning          365
machine-learning  neural-network            324
deep-learning     neural-network            269
keras             python                    237
classification    machine-learning          234
pandas            python                    231
keras             tensorflow                226
deep-learning     keras                     215
keras             neural-network            213
python            scikit-learn              210
keras             machine-learning          172
machine-learning  scikit-learn              169
python            tensorflow                135
cnn               deep-learning             134
deep-learning     python                    125
neural-network    python                    124
machine-learning  machine-learning-model    122
deep-learning     tensorflow                113
machine-learning  time-series               113
dtype: int64

In [19]:

tag_triads =[]
for i, p in posts.iterrows():
    tags = p['Tags']
    # putting tags in a consitent order ensures that combinations 
    # are in a consistent order, thus ensuring that pairs are 
    # always represented the same way.
    # ie always ('cat', 'fat'), never ('fat', 'cat')
    tags.sort()
    if "" in tags:
        print(tags)
    if len(tags) > 2:
        comb = list(combinations(tags,3)) 
        tag_triads.extend(comb)
tag_triads = pd.DataFrame(tag_triads, columns=['tag1', 'tag2', 'tag3'])

In [20]:

tag_triads.value_counts().head(20)

Out[20]:

tag1              tag2              tag3            
deep-learning     machine-learning  neural-network      112
machine-learning  python            scikit-learn         84
deep-learning     keras             neural-network       66
keras             python            tensorflow           65
                  machine-learning  neural-network       61
                                    python               54
deep-learning     keras             machine-learning     54
keras             neural-network    tensorflow           52
deep-learning     keras             tensorflow           51
keras             neural-network    python               50
deep-learning     keras             python               48
cnn               machine-learning  neural-network       47
machine-learning  pandas            python               42
dataframe         pandas            python               40
keras             machine-learning  tensorflow           39
deep-learning     machine-learning  python               38
cnn               deep-learning     neural-network       37
                                    machine-learning     36
                                    keras                36
classification    machine-learning  python               34
dtype: int64

In [21]:

tag_triads.shape

Out[21]:

(20052, 3)

Examining tag combinations was an interesting technical exercise, and it underscores the relevance of the Deep Learning and Python-related tag groupings we came up with earlier.

Question Trends¶

Data was obtained from https://data.stackexchange.com/datascience/query/new using the following query:

SELECT Id, CreationDate, Tags
FROM Posts
WHERE PostTypeID=1;

The results are stored alongside this noteboo in 20200904_all_DSSE_posts_metadata_extract.csv

In [22]:

all_questions = pd.read_csv('20200904_all_DSSE_posts_metadata_extract.csv')

In [23]:

all_questions['CreationDate'] = pd.to_datetime(all_questions['CreationDate'])

In [24]:

# convert tags text to a column lists
all_questions['Tags'] = all_questions['Tags'].str.replace('><', ',').str.strip('<>').str.split(',')

#replace NaN with empty lists to make further processing easier
no_tags = all_questions['Tags'].isna()
posts.loc[no_tags,'Tags'] = posts.loc[no_tags,'Tags'].apply(lambda x: [])

In [25]:

# create classification column to flag questions with deep-learning tag.
all_questions['deep-learning'] = all_questions['Tags'].apply(lambda x: True if "deep-learning" in x else False)

In [26]:

# identify questions bearing any of the deep learning-related tags identified earlier
def is_dl_related(tags):
    dl_tags = set(deep_learning_related)
    tags=set(tags)
    result = True if dl_tags & tags else False
    return result
all_questions['deep-learning-related'] = all_questions['Tags'].apply(is_dl_related)

In [27]:

all_questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25675 entries, 0 to 25674
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Id                     25675 non-null  int64         
 1   CreationDate           25675 non-null  datetime64[ns]
 2   Tags                   25675 non-null  object        
 3   deep-learning          25675 non-null  bool          
 4   deep-learning-related  25675 non-null  bool          
dtypes: bool(2), datetime64[ns](1), int64(1), object(1)
memory usage: 652.0+ KB

In [28]:

dl_trend = all_questions.loc[all_questions['deep-learning'],['CreationDate','deep-learning']].resample('Q', on='CreationDate').count()
dl_trend[:-1].plot.line(y='deep-learning', figsize=(10,6))

Out[28]:

<AxesSubplot:xlabel='CreationDate'>

In [29]:

# Create a column to use to count occurances during aggregation
all_questions['posts'] = True

In [30]:

# aggregate on a quarterly basis
trends = all_questions.resample('Q', on='CreationDate').sum()

# create calculated columns to show relationships
trends['deep-learning-ratio'] = trends['deep-learning-related'] / trends['posts']
trends['dlr_to_dl'] = trends['deep-learning-related'] / trends['deep-learning']

In [31]:

# chart, leave off latest, incomplete, quarter
trends[:-1].plot.line(y=['posts', 
                         'deep-learning-related', 
                         'deep-learning-ratio', 
                         'dlr_to_dl'], subplots=True, figsize =(10,16))

Out[31]:

array([<AxesSubplot:xlabel='CreationDate'>,
       <AxesSubplot:xlabel='CreationDate'>,
       <AxesSubplot:xlabel='CreationDate'>,
       <AxesSubplot:xlabel='CreationDate'>], dtype=object)

The quarterly volume of new questions on the Data Science Stack Exchange has grown in most quarters since its founding in 2014. The same has held true for the absolute volume of deep learning-related questions.

The relative proportions of Deep Learning-related learning questions grew between 2014 and the second quarter of 2018, after which it dropped somewhat. It has been holding roughly constant ever since.

Over time, people have become increacingly likely to omit the deep-learning tag while asking questions bearing other deep learning-related tags.

If we use the DSSE as a proxy for the wider market for data science-related information, we see that deep learning is a popular area of interest. It's strongest growth as a topic-area may be behind it but there may be other areas of interest that are seeing more substantial growth.

Demand for deep-learning content is both large, relative to other data science content, and growing overall as the audience for data science related grows. However, there may be other data science topics which will find larger audiences in a few years time.

Let's look at the trends for other top DSSE tags to see what the future might hold.

Rising Tags¶

Below we'll try to identify emerging data science topics so we can target content development that could provide a high ROI in the future.

We'll do that by looking at trends in use of the most viewed and used tags.

In [32]:

second_tier_tags = most_used.iloc[10:41].join(most_viewed.iloc[10:41], how='inner')
second_tier_tags

Out[32]:

	count	views
tag
time-series	369	146200
lstm	316	160160
regression	303	105615
dataset	288	92881
clustering	243	75838
r	227	77045
machine-learning-model	196	50876
data-mining	187	83384
feature-selection	185	63572
image-classification	178	76270
data-science-model	166	52839
xgboost	156	82513
linear-regression	155	45177
loss-function	147	49847
pytorch	137	98987
data-cleaning	134	68648
logistic-regression	133	45893
cross-validation	131	45136
decision-trees	130	51921
multiclass-classification	116	44666

Earlier we created a column indicating questions with deep learning-related tags so we could look at their prevalance over time. Now we want to look at trends for multiple tags. This will require a different approach. We'll start by breaking up the lists of tags on each question into separate rows.

In [33]:

tag_use = all_questions.iloc[:,:3].explode('Tags')
tag_use = tag_use.reset_index(drop=True)
tag_use

Out[33]:

	Id	CreationDate	Tags
0	32397	2018-05-30 13:41:39	predictive-modeling
1	32397	2018-05-30 13:41:39	time-series
2	32398	2018-05-30 13:44:38	python
3	32398	2018-05-30 13:44:38	deep-learning
4	32398	2018-05-30 13:44:38	time-series
...	...	...	...
77055	80980	2020-08-29 20:52:02	machine-learning
77056	80980	2020-08-29 20:52:02	neural-network
77057	80980	2020-08-29 20:52:02	deep-learning
77058	80980	2020-08-29 20:52:02	computer-vision
77059	80980	2020-08-29 20:52:02	faster-rcnn

77060 rows × 3 columns

Next we perform an inner join to our tags of interest to eliminate irrelevant rows and discard unneeded columns as well.

In [34]:

second_tier_tag_use = second_tier_tags.merge(tag_use, how='inner', left_index= True, right_on='Tags')

second_tier_tag_use = second_tier_tag_use[['CreationDate','Tags']].reset_index(drop=True)
second_tier_tag_use

Out[34]:

	CreationDate	Tags
0	2018-05-30 13:41:39	time-series
1	2018-05-30 13:44:38	time-series
2	2018-05-31 15:08:48	time-series
3	2018-06-01 19:52:45	time-series
4	2020-03-29 01:56:33	time-series
...	...	...
13467	2018-05-29 19:39:32	multiclass-classification
13468	2017-09-12 22:44:47	multiclass-classification
13469	2020-08-27 10:58:06	multiclass-classification
13470	2020-08-27 14:17:29	multiclass-classification
13471	2020-08-27 20:32:53	multiclass-classification

13472 rows × 2 columns

In [35]:

# Add column to aggregate
second_tier_tag_use['Posts'] = 1

# Take daily data and aggregate/resample by quarter
second_tier_trends = second_tier_tag_use.groupby('Tags').resample('Q', on='CreationDate').sum()

# Create separate column for each tag
second_tier_trends = second_tier_trends.unstack('Tags').droplevel(0, axis=1)

# Add column of total posts per quarter
second_tier_trends['total_posts'] = trends.loc[second_tier_trends.index, 'posts']

second_tier_normalized = second_tier_trends.iloc[:,:20].div(second_tier_trends['total_posts']/100, axis=0)
goo = second_tier_normalized[:-1].plot.line(subplots=True, sharey=True, figsize=(10,50))

We see that some of these second tier tags are actually declining in relative importance. In particular, r and data-mining have declined significantly over the life of DSSE from highs of 15% to less than 5%. Many other have held pretty steady. A few though have seen recent growth in use.

lstm first appeared ~2017 and has grown to almost 5%, putting it on par with as the most used second tier tags.
pytorch appeared about the same time and has risen to ~2.5%.
data-science-model and machine-learning-model arose in 2018 and has already risen to ~2.5% of questions. Looking at questions with these tags, it doesn't appear that their meaning is specific or well defined. This makes them low value for our purposes.

To keep things in perspective, below we look at trends for all top tags, rather than just the second tier.

In [36]:

top_tags = most_used.iloc[:41].join(most_viewed.iloc[:41], how='inner')
top_tag_use = top_tags.merge(tag_use, how='inner', left_index= True, right_on='Tags')
top_tag_use = top_tag_use[['CreationDate','Tags']].reset_index(drop=True)
top_tag_use['Posts'] = 1
top_tag_trends = top_tag_use.groupby('Tags').resample('Q', on='CreationDate').sum()
top_tag_trends = top_tag_trends.unstack('Tags').droplevel(0, axis=1)
top_tag_trends['total_posts'] = trends.loc[top_tag_trends.index, 'posts']
top_tags_normalized =top_tag_trends.iloc[:,:-1].div(top_tag_trends['total_posts']/100, axis=0)
goo = top_tags_normalized.plot.line(subplots=True, sharey=True, figsize=(8,70))

In [ ]: