In this project we'll be in a hypothetical situation where we work for a company that produces data science content. We're tasked with figuring out the best content to write about. We decide to investigate Stack Overflow and learn that it is part of Stack Exchange, a network of question-and-answer websites covering a broad range of topics. We find that there's a Stack Overflow equivalent for data science, called Data Science Stack Exchange, which happens to be in the bottom 10 of sites with respect to the number of questions that received answers.
We decide that that is a good place to start our research. Since we're relatively new to the Data Science Stack Exchange (DSSE), we need to do some initial exploration. For example, we'll look into the following questions:
When we click on the site's home link, we can immediately see on the left-hand side how the website is split up:
We see that the site is relatively simple. There is a section for questions, tags, users, and the unanswered questions. There is also a section for career/job opportunities. Since we want to figure out the best type of content to create, it would be best if we stuck to the questions and tags pages. If we look into the tags section, we can see the following:
This seems like an excellent resource for all things related to data science. We can search for and filter through the numerous tags to help us find a specific question/answer. Instead of web scraping the website, we will go with an easier alternative.
Stack Exchange provides a public database for each of its websites. Here is a link where we can query and explore the DSSE.
For example, the following query of the database gives us the following results:
SELECT *
FROM tags
ORDER BY Count DESC;
We can quickly see that the overwhelming majority of questions on the DSSE are related to `machine-learning`, with the next most popular tags being `python` and `deep-learning`. Since we started our journey on DataQuest by learning Python, perhaps that would be a good source of information for our research.
Another query gives us the following:
SELECT *
FROM PostTypes;
Here we can see that there are a total of eight different post types on the DSSE. The first two are self-explanatory; however, the last six are a little confusing. Let's keep looking.
If we look at the database schema, we can see that the "Posts" table actually has a lot of relevant information we will be interested in:
We can clearly see from the above that each post made on the DSSE contains the following relevant information:
- `Id` - a unique identifier for each post.
- `PostTypeId` - an identification number for the type of post.
- `CreationDate` - the time of creation of the post.
- `Score` - the score (or upvotes) of the post.
- `ViewCount` - how many times the post was viewed.
- `Tags` - the relevant tags that were used for the question.
- `AnswerCount` - how many answers the question received (only for question posts).
- `FavoriteCount` - how many times the question was favorited (only for question posts).

Since we have seen that there are eight different types of posts, let's go ahead and see how many of each there are by running the following query:
SELECT PostTypeId,
COUNT(*) as Num_posts
FROM Posts
GROUP BY PostTypeId;
We know from an earlier query that ID 1 = Question and ID 2 = Answer. Since these are the only two kinds of posts we're interested in, the others are inconsequential.
Since we want to stay recent (at the time of this writing, it is December 2021), we're going to run a query against the DSSE to find all the relevant information listed above from this year (2021):
SELECT Id,
PostTypeId,
CreationDate,
Score,
ViewCount,
Tags,
AnswerCount,
FavoriteCount
FROM Posts
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2021
ORDER BY CreationDate;
And here is what the first 10 rows look like:
Luckily for us, the DSSE allows us to run a query and then download the query results as a .csv file. Since we now have the information we are interested in, we can go ahead and download the .csv file and continue analyzing the data with Python.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
data = pd.read_csv('QueryResults.csv', parse_dates=["CreationDate"])
data.head()
Id | PostTypeId | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount | |
---|---|---|---|---|---|---|---|---|
0 | 87391 | 1 | 2021-01-01 03:10:42 | 1 | 28 | <decision-trees> | 1 | NaN |
1 | 87392 | 1 | 2021-01-01 07:28:07 | 0 | 14 | <machine-learning><python><deep-learning><imag... | 1 | NaN |
2 | 87393 | 1 | 2021-01-01 08:07:33 | 1 | 18 | <neural-network><deep-learning><inception> | 0 | NaN |
3 | 87395 | 1 | 2021-01-01 10:31:51 | 1 | 37 | <machine-learning><cloud><federated-learning> | 1 | 1.0 |
4 | 87404 | 1 | 2021-01-01 18:00:21 | 2 | 48 | <reinforcement-learning><openai-gym> | 1 | NaN |
We can see that our `FavoriteCount` column has some `NaN` values. Let's quickly check how many missing values we have in our dataset, along with any other issues we might find.
#preview our data types and null-value counts
data.info()
# print(data.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6315 entries, 0 to 6314
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             6315 non-null   int64
 1   PostTypeId     6315 non-null   int64
 2   CreationDate   6315 non-null   datetime64[ns]
 3   Score          6315 non-null   int64
 4   ViewCount      6315 non-null   int64
 5   Tags           6315 non-null   object
 6   AnswerCount    6315 non-null   int64
 7   FavoriteCount  505 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 394.8+ KB
Thankfully, the only column that has null values is our `FavoriteCount` column. Unfortunately, only 505 of the 6315 rows have a non-null value there. That's only ~8% of the data for that column! Since the column represents the number of times a user "favorited" the post, a missing value signifies that no users had the post in their favorites list. We can simply replace these values with zeroes.
#fill in missing values with a zero
data.fillna(0, inplace=True)
#change data type to int64
data["FavoriteCount"] = data["FavoriteCount"].astype('int64')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6315 entries, 0 to 6314
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             6315 non-null   int64
 1   PostTypeId     6315 non-null   int64
 2   CreationDate   6315 non-null   datetime64[ns]
 3   Score          6315 non-null   int64
 4   ViewCount      6315 non-null   int64
 5   Tags           6315 non-null   object
 6   AnswerCount    6315 non-null   int64
 7   FavoriteCount  6315 non-null   int64
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 394.8+ KB
Now we have no more missing values. We can still make this information a little cleaner though. Since we are only interested in the questions asked this year, we know that all of the PostTypeId values for each row are identical. There's no more use for that information, so we can safely remove that column from our dataset.
#remove the PostTypeId column
data.drop("PostTypeId", axis=1, inplace=True)
Next, we will have to analyze the `Tags` column. We can definitely clean this up a bit. Let's check the data type of the `Tags` column first, just to be safe.
#shows us the data type of each tag
data["Tags"].apply(lambda t: type(t)).value_counts()
<class 'str'>    6315
Name: Tags, dtype: int64
We can see that each value in the column is a string. On Stack Exchange, each post can have a maximum of five tags. We could have five separate columns for each tag, but let's leave the structure alone for now. Instead, let's clean up the strings to make them easier to read.
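For reference, here's a minimal sketch of what that five-column alternative might look like, using `str.split` with `expand=True` on a small stand-in Series (the real version would operate on `data["Tags"]`):

```python
import pandas as pd

# Stand-in for a couple of raw Tags values
raw = pd.Series(["<decision-trees>",
                 "<machine-learning><python><deep-learning>"])

# Strip the outer brackets, then split into one column per tag;
# rows with fewer tags are padded with None
wide = raw.str.replace("^<|>$", "", regex=True).str.split("><", expand=True)
wide.columns = ["Tag" + str(i + 1) for i in range(wide.shape[1])]
print(wide)
```

With the full dataset this would produce up to five `Tag` columns, but the single list column is easier to iterate over for counting, which is why we leave the structure alone.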
data["Tags"].head()
0                                     <decision-trees>
1    <machine-learning><python><deep-learning><imag...
2           <neural-network><deep-learning><inception>
3        <machine-learning><cloud><federated-learning>
4                 <reinforcement-learning><openai-gym>
Name: Tags, dtype: object
We can see from the above that the tags are wrapped in angle brackets (`<>`) and placed one right after the other with no spaces. We can use the `str.replace` and `str.split` methods to simplify them.
#removes the <> from the tags
data["Tags"] = data["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
data["Tags"].head()
0                                     [decision-trees]
1    [machine-learning, python, deep-learning, imag...
2           [neural-network, deep-learning, inception]
3        [machine-learning, cloud, federated-learning]
4                 [reinforcement-learning, openai-gym]
Name: Tags, dtype: object
Now that they're cleaned up a bit, let's count how many times each tag was used. We can create a dictionary where the key is the tag, and the value is the number of times it was used.
tag_count = {}
#iterate through each tag in the list, for each row
for row in data["Tags"]:
    for tag in row:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
#converts the dictionary into a dataframe
tag_count = pd.DataFrame.from_dict(tag_count, orient="index")
tag_count.rename(columns={0:"Count"}, inplace=True)
#sorts by "Count" and displays top 20
most_used = tag_count.sort_values("Count",ascending=False).head(20)
most_used
Count | |
---|---|
machine-learning | 1668 |
python | 1165 |
deep-learning | 917 |
neural-network | 580 |
keras | 520 |
nlp | 514 |
tensorflow | 508 |
classification | 495 |
time-series | 393 |
scikit-learn | 356 |
cnn | 262 |
dataset | 247 |
regression | 227 |
clustering | 213 |
pytorch | 207 |
pandas | 203 |
lstm | 202 |
convolutional-neural-network | 192 |
statistics | 189 |
machine-learning-model | 186 |
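As an aside, the same tally can be produced more idiomatically with `Series.explode` and `value_counts`; a quick sketch with a stand-in Series of tag lists (the real version would use `data["Tags"]`):

```python
import pandas as pd

# Stand-in data: each row holds the list of tags for one question
tags = pd.Series([["machine-learning", "python"],
                  ["python"],
                  ["machine-learning"]])

# explode() gives one row per tag; value_counts() tallies them
counts = tags.explode().value_counts()
print(counts)
```

The explicit dictionary loop above does the same job and is arguably easier to follow, so either approach works here.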
Let's try and visualize this information now, to get a better sense of scale.
#plot the most used tags
ax = most_used.plot.barh()
ax.invert_yaxis()
plt.show()
We can see that the top three tags used are related to machine-learning, python, and deep-learning. While these are definitely helpful, a tag like "python" can be a bit vague as it doesn't really tell us exactly what the question is about. Before we dive further into that issue however, let's repeat this process based on ViewCount.
tag_views = {}
for index, row in data.iterrows():
    for tag in row["Tags"]:
        if tag in tag_views:
            tag_views[tag] += row["ViewCount"]
        else:
            tag_views[tag] = row["ViewCount"]
#converts tag_views dictionary into dataframe
tag_views = pd.DataFrame.from_dict(tag_views, orient="index")
tag_views.rename(columns={0:"Views"}, inplace=True)
#sorts by "Views" and displays top 20
most_viewed = tag_views.sort_values("Views",ascending=False).head(20)
most_viewed
Views | |
---|---|
python | 123413 |
machine-learning | 93912 |
deep-learning | 59274 |
keras | 52803 |
tensorflow | 51930 |
pandas | 41321 |
scikit-learn | 39538 |
nlp | 33227 |
neural-network | 29442 |
numpy | 29251 |
classification | 26761 |
cnn | 24098 |
time-series | 21040 |
matplotlib | 20189 |
pytorch | 18358 |
convolutional-neural-network | 18097 |
dataset | 17345 |
bert | 16732 |
transformer | 16488 |
image-classification | 13765 |
#plot the most viewed tags
ax = most_viewed.plot.barh()
ax.invert_yaxis()
plt.show()
Many of the most used tags are also among the most viewed. Our top three are python, machine-learning, and deep-learning. Python and machine-learning are clearly the two most used and most viewed, by a fairly large margin. Let's now check to see if there are any tags that are in our most_used dataset but not in our most_viewed.
#merge most_used with most_viewed, on most_used
in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
in_used
Count | Views | |
---|---|---|
machine-learning | 1668 | 93912.0 |
python | 1165 | 123413.0 |
deep-learning | 917 | 59274.0 |
neural-network | 580 | 29442.0 |
keras | 520 | 52803.0 |
nlp | 514 | 33227.0 |
tensorflow | 508 | 51930.0 |
classification | 495 | 26761.0 |
time-series | 393 | 21040.0 |
scikit-learn | 356 | 39538.0 |
cnn | 262 | 24098.0 |
dataset | 247 | 17345.0 |
regression | 227 | NaN |
clustering | 213 | NaN |
pytorch | 207 | 18358.0 |
pandas | 203 | 41321.0 |
lstm | 202 | NaN |
convolutional-neural-network | 192 | 18097.0 |
statistics | 189 | NaN |
machine-learning-model | 186 | NaN |
Similarly, let's check to see if any tags are in the most_viewed, but not most_used.
#merge most_used with most_viewed, on most_viewed
in_viewed = pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)
in_viewed
Count | Views | |
---|---|---|
python | 1165.0 | 123413 |
machine-learning | 1668.0 | 93912 |
deep-learning | 917.0 | 59274 |
keras | 520.0 | 52803 |
tensorflow | 508.0 | 51930 |
pandas | 203.0 | 41321 |
scikit-learn | 356.0 | 39538 |
nlp | 514.0 | 33227 |
neural-network | 580.0 | 29442 |
numpy | NaN | 29251 |
classification | 495.0 | 26761 |
cnn | 262.0 | 24098 |
time-series | 393.0 | 21040 |
matplotlib | NaN | 20189 |
pytorch | 207.0 | 18358 |
convolutional-neural-network | 192.0 | 18097 |
dataset | 247.0 | 17345 |
bert | NaN | 16732 |
transformer | NaN | 16488 |
image-classification | NaN | 13765 |
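Rather than eyeballing the NaN rows in two separate merges, a single outer merge with `indicator=True` can label each tag directly; a sketch with tiny stand-in frames (the real version would merge `most_used` and `most_viewed`):

```python
import pandas as pd

# Tiny stand-ins for most_used and most_viewed, indexed by tag
used_demo = pd.DataFrame({"Count": [1165, 202]}, index=["python", "lstm"])
viewed_demo = pd.DataFrame({"Views": [123413, 29251]},
                           index=["python", "numpy"])

# indicator=True adds a _merge column: left_only, right_only, or both
merged = pd.merge(used_demo, viewed_demo, how="outer",
                  left_index=True, right_index=True, indicator=True)
print(merged)
```

Filtering on `merged["_merge"]` then yields the "used but not viewed" and "viewed but not used" lists in one step.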
The tags that are present in most_used but not present in most_viewed are:

- regression
- clustering
- lstm
- statistics
- machine-learning-model
The tags that are present in most_viewed but not present in most_used are:

- numpy
- matplotlib
- bert
- transformer
- image-classification
Also, it's important to note that some tags have a clear relationship between them, for example `python` and `pandas`, because `pandas` is a Python library. So for the context of our research, we could tackle two tags (`python` and `pandas`) by creating content for `pandas`.
There are some tags, however, that do not have any relationship between them, such as `python` and `r`, since they are two different languages.
data[data["Tags"].apply(lambda tags: "r" in tags and "pandas" in tags)]
Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount
---|---|---|---|---|---|---
Luckily for us, none of the data in our query results contains both of those tags together.
If we look at the results in the previous section, we can see that the tags common to both the most used and the most viewed lists are: `python`, `machine-learning`, `deep-learning`, `keras`, `tensorflow`, `pandas`, `scikit-learn`, `neural-network`, and `nlp`. But what are these? What do they mean, and are they related at all?
The DSSE has a tags page. Here is a brief summary of each of the tags:

- `python` - used for data science questions related to the programming language Python.
- `machine-learning` - a subfield of computer science that draws on elements from algorithmic analysis, computational statistics, mathematics, etc. Mainly concerned with the use of data to build models that have high predictive ability.
- `deep-learning` - a new area of machine learning research concerned with the technology used for learning hierarchical representations of data, mainly done with deep neural networks.
- `keras` - a popular open-source deep learning API for Python, built on top of TensorFlow and useful for fast implementation.
- `tensorflow` - an open-source library for machine learning and machine intelligence.
- `pandas` - a Python library for panel data manipulation and analysis.
- `scikit-learn` - a popular machine learning package for Python with simple and efficient tools for predictive data analysis.
- `neural-network` - artificial neural networks (ANN) are composed of 'neurons', programming constructs that mimic the properties of biological neurons.
- `nlp` - natural language processing is a field of computer science, AI, and linguistics concerned with the interactions between computers and human (natural) languages.

We can clearly see that the most used/viewed tags are all very closely related to machine learning and data analysis. Python seems to be the foundation upon which a lot of the above is built, so it's natural that `python` and `pandas` would be used hand in hand.
Before we make an official recommendation, it would be nice to solidify our findings with some more concrete proof. One thing that comes to mind is "is deep learning just a fad?" Ideally, the content we decide to create will be the most useful for as long as possible. Could interest in deep learning be slowing down? Let's find out!
We begin by going back to the Stack Exchange Data Explorer and performing a new query that fetches all questions ever asked on the DSSE, with both their dates and tags.
SELECT Id, CreationDate, Tags
FROM posts
WHERE PostTypeId = 1;
We then download the results as a new .csv file named "all_questions.csv".
#reads in our new data
all_questions = pd.read_csv('all_questions.csv', parse_dates=["CreationDate"])
all_questions.head()
Id | CreationDate | Tags | |
---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | <machine-learning><backpropagation> |
1 | 14685 | 2016-10-21 15:33:24 | <data-mining><dataset><data><sql> |
2 | 14686 | 2016-10-21 15:48:03 | <predictive-modeling><statistics><social-netwo... |
3 | 14688 | 2016-10-21 16:58:18 | <machine-learning><scikit-learn><tensorflow><k... |
4 | 14690 | 2016-10-21 17:29:49 | <machine-learning><python><data-cleaning><visu... |
#removes the <> from the tags
all_questions["Tags"] = all_questions["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
all_questions["Tags"].head()
0                  [machine-learning, backpropagation]
1                    [data-mining, dataset, data, sql]
2    [predictive-modeling, statistics, social-netwo...
3    [machine-learning, scikit-learn, tensorflow, k...
4    [machine-learning, python, data-cleaning, visu...
Name: Tags, dtype: object
all_questions.head()
Id | CreationDate | Tags | |
---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | [machine-learning, backpropagation] |
1 | 14685 | 2016-10-21 15:33:24 | [data-mining, dataset, data, sql] |
2 | 14686 | 2016-10-21 15:48:03 | [predictive-modeling, statistics, social-netwo... |
3 | 14688 | 2016-10-21 16:58:18 | [machine-learning, scikit-learn, tensorflow, k... |
4 | 14690 | 2016-10-21 17:29:49 | [machine-learning, python, data-cleaning, visu... |
Now that our new set of data is cleaned up a bit, we can move on. We're trying to figure out which questions are deep learning questions, so that we can track their interest and popularity over time. By navigating to the deep-learning tag info page on the DSSE, we can see the 'Related Tags' section at the bottom. It lists the following tags:
- machine-learning
- neural-network
- keras
- tensorflow
- python
- cnn
- nlp
- lstm
- classification
- convolutional-neural-network
The tags `cnn` and `convolutional-neural-network` refer to the same thing. The other tags all refer to neural networks, machine learning, Python, and everything else related to deep learning. However, a tag like `python` is so vague that we can't tell for sure whether the question is directly related to `deep-learning`, so we'll remove it from our list. For the sake of simplicity, I will consider any question that uses one of these tags (except `python`) to be a `deep-learning` question. There are, of course, more analytical ways of producing a list of tags related to `deep-learning`, but I feel that is beyond the scope of this project.
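For the curious, one such analytical approach might be a simple co-occurrence count: tally how often every other tag appears alongside `deep-learning`. A rough sketch with stand-in data (the real version would iterate over `all_questions["Tags"]`):

```python
from collections import Counter

# Stand-in data: each inner list is one question's tags
tag_lists = [["deep-learning", "keras"],
             ["deep-learning", "keras", "python"],
             ["python", "pandas"]]

co_occurs = Counter()
for tags in tag_lists:
    if "deep-learning" in tags:
        # count every companion tag on deep-learning questions
        co_occurs.update(t for t in tags if t != "deep-learning")

print(co_occurs.most_common())
```

Tags with the highest co-occurrence counts would be natural candidates for the related-tags list, but for this project the curated list from the tag info page is good enough.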
#list of tags related to deep-learning
dl_tag_list = ["deep-learning", "machine-learning", "neural-network", "keras", "tensorflow",
"cnn", "nlp", "lstm", "classification", "convolutional-neural-network"]
#check if all_questions["Tags"] contains one of the above
def is_deep_learning(tags):
    for tag in tags:
        if tag in dl_tag_list:
            return 1
    return 0
#run above function on our all_questions["Tags"] column
all_questions["DeepLearning"] = all_questions["Tags"].apply(is_deep_learning)
#verify
all_questions.head()
Id | CreationDate | Tags | DeepLearning | |
---|---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | [machine-learning, backpropagation] | 1 |
1 | 14685 | 2016-10-21 15:33:24 | [data-mining, dataset, data, sql] | 0 |
2 | 14686 | 2016-10-21 15:48:03 | [predictive-modeling, statistics, social-netwo... | 0 |
3 | 14688 | 2016-10-21 16:58:18 | [machine-learning, scikit-learn, tensorflow, k... | 1 |
4 | 14690 | 2016-10-21 17:29:49 | [machine-learning, python, data-cleaning, visu... | 1 |
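As a design note, the same check can be written with set intersection, which avoids the explicit loop; a sketch assuming the same tag list (stored as a set for fast membership tests):

```python
# Same deep-learning tag list as above, as a set
dl_tags = {"deep-learning", "machine-learning", "neural-network", "keras",
           "tensorflow", "cnn", "nlp", "lstm", "classification",
           "convolutional-neural-network"}

def is_deep_learning_set(tags):
    # a non-empty intersection means at least one deep-learning related tag
    return int(bool(dl_tags & set(tags)))

print(is_deep_learning_set(["machine-learning", "backpropagation"]))  # 1
print(is_deep_learning_set(["data-mining", "sql"]))  # 0
```

Both versions give identical results; the loop version short-circuits the same way, so this is purely a matter of style.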
Let's get some more information about our data set. In particular, let's see how many rows of data we have, as well as the earliest and most recent questions. This will help us to pick a timeframe to track the interest in deep learning.
all_questions.shape
(31475, 4)
earliest = all_questions["CreationDate"].min()
recent = all_questions["CreationDate"].max()
print("The earliest question was asked on:", earliest)
print("The most recent question was asked on:", recent)
The earliest question was asked on: 2014-05-13 23:58:30
The most recent question was asked on: 2021-11-28 02:57:13
With something as dynamic as data science, I think it would be best to track this on a quarterly basis. Since it is (at the time of writing) December 1, 2021, we don't have the full data for this year; we do have data for the first three quarters of 2021, however. We similarly don't have the data for the first half of 2014, so we'll have to remove the questions whose `CreationDate` falls in those ranges. Before we remove anything, though, let's create a way to identify which quarter a question falls into.
#extract values from CreationDate
def get_quarter(date):
    year = date.year
    month = date.month
    quarter = ((month - 1) // 3) + 1
    return "{y}_Q{q}".format(y=year, q=quarter)
#apply our function from above
all_questions["Quarter"] = all_questions["CreationDate"].apply(get_quarter)
all_questions.head()
Id | CreationDate | Tags | DeepLearning | Quarter | |
---|---|---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | [machine-learning, backpropagation] | 1 | 2016_Q4 |
1 | 14685 | 2016-10-21 15:33:24 | [data-mining, dataset, data, sql] | 0 | 2016_Q4 |
2 | 14686 | 2016-10-21 15:48:03 | [predictive-modeling, statistics, social-netwo... | 0 | 2016_Q4 |
3 | 14688 | 2016-10-21 16:58:18 | [machine-learning, scikit-learn, tensorflow, k... | 1 | 2016_Q4 |
4 | 14690 | 2016-10-21 17:29:49 | [machine-learning, python, data-cleaning, visu... | 1 | 2016_Q4 |
Great! Now that we have an identifier for the quarter each question was posted in, we can remove any questions from quarters for which we don't have complete data, namely 2021_Q4 and 2014_Q2.
#drop relevant rows from dataset
all_questions.drop(all_questions[(all_questions["Quarter"] == "2014_Q2") |
                                 (all_questions["Quarter"] == "2021_Q4")].index,
                   inplace=True)
Now that we have a column telling us which quarter a question was asked in, we can move ahead. All that's left to do is group by quarter, count the `deep-learning` questions and the total number of questions, and calculate the proportion of `deep-learning` questions.
#group by quarter and calculate sum and total # of questions
quarterly = all_questions.groupby("Quarter").agg({"DeepLearning": ['sum', 'size']})
quarterly.columns = ["DeepLearning_Questions", "Total_Questions"]
#calculate % of questions that are deep-learning questions
quarterly["DeepLearning_Rate"] = (quarterly["DeepLearning_Questions"] /
                                  quarterly["Total_Questions"])
quarterly.reset_index(inplace=True)
quarterly.sample(5)
Quarter | DeepLearning_Questions | Total_Questions | DeepLearning_Rate | |
---|---|---|---|---|
14 | 2018_Q1 | 790 | 1199 | 0.658882 |
4 | 2015_Q3 | 153 | 310 | 0.493548 |
19 | 2019_Q2 | 1125 | 1786 | 0.629899 |
3 | 2015_Q2 | 135 | 284 | 0.475352 |
18 | 2019_Q1 | 1106 | 1733 | 0.638200 |
Great! Now all we need to do is visualize our data to make it easier to digest for our readers.
#plot details
N = len(quarterly["Quarter"]) #how many bars to show
ind = np.arange(N) #spacing between bars

fig, ax = plt.subplots()
#bar plots share the left-hand axis
p1 = ax.bar(ind, quarterly["Total_Questions"])
p2 = ax.bar(ind, quarterly["DeepLearning_Questions"])
#the rate is a fraction, so it gets its own right-hand axis
ax2 = ax.twinx()
p3, = ax2.plot(ind, quarterly["DeepLearning_Rate"], marker="o", color="green")

#labels and legend
ax.set_title('Deep Learning vs. Other Data Science \n Questions by Quarter', fontsize=25)
ax.set_ylabel('Number of Questions', fontsize=20)
ax.set_xlabel('Quarters', fontsize=20)
ax.set_xticks(ind)
ax.set_xticklabels(quarterly["Quarter"], rotation=90)
ax.legend((p1[0], p2[0], p3),
          ("Total Questions", "Deep Learning Questions", "Deep Learning Rate"),
          loc="best")
plt.show()