In this project we'll be in a hypothetical situation where we work for a company that produces data science content. We're tasked with figuring out the best content to write about. We decide to investigate Stack Overflow and learn that it is part of Stack Exchange, a network of question-and-answer websites covering a broad range of topics. We find that there's a Stack Overflow equivalent for data science, called Data Science Stack Exchange, which happens to be in the bottom 10 of sites with respect to the number of questions that received answers.
We decide that that is a good place to start our research. Since we're relatively new to the Data Science Stack Exchange (DSSE), we need to do some initial exploration. For example, we'll look into the following questions:
When we click on the site's home link, we can immediately see on the left-hand side how the website is split up:
We see that the site is relatively simple. There is a section for questions, tags, users, and the unanswered questions. There is also a section for career/job opportunities. Since we want to figure out the best type of content to create, it would be best if we stuck to the questions and tags pages. If we look into the tags section, we can see the following:
This seems like an excellent resource for all things related to data science. We can search for and filter through the numerous tags to help us find a specific question/answer. Instead of web scraping the website, we will go with an easier alternative.
Stack Exchange provides a public database for each of its websites. Here is a link where we can query and explore the DSSE.
For example, the following query of the database gives us the following results:
SELECT *
FROM tags
ORDER BY Count DESC;
We can quickly see that the overwhelming majority of questions on the DSSE are related to `machine-learning`, with the next most popular tags being `python` and `deep-learning`. Since we started our journey on DataQuest by learning Python, perhaps that would be a good source of information for our research.
Another query gives us the following:
SELECT *
FROM PostTypes;
Here we can see that there are a total of eight different post types on the DSSE. The first two are self-explanatory; however, the last six are a little confusing. Let's keep looking.
If we look at the database schema, we can see that the "Posts" table actually has a lot of relevant information we will be interested in:
We can clearly see from the above that each post made on the DSSE contains the following relevant information:
- `Id` - a unique identifier for each post.
- `PostTypeId` - an identification number for the type of post.
- `CreationDate` - the time of creation of the post.
- `Score` - the score (or upvotes) of the post.
- `ViewCount` - how many times the post was viewed.
- `Tags` - the relevant tags that were used for the question.
- `AnswerCount` - how many answers the question received (only for question posts).
- `FavoriteCount` - how many times the question was favorited (only for question posts).

Since we have seen that there are eight different types of posts, let's go ahead and see how many of each there are by running the following query:
SELECT PostTypeId,
COUNT(*) as Num_posts
FROM Posts
GROUP BY PostTypeId;
We know from an earlier query that ID 1 = Question and ID 2 = Answer. Since these are the only two kinds of posts we're interested in, the others are inconsequential.
Since we want to stay recent (at the time of this writing, it is December 2021), we're going to run a query against the DSSE to find all the relevant information listed above from this year (2021):
SELECT Id,
PostTypeId,
CreationDate,
Score,
ViewCount,
Tags,
AnswerCount,
FavoriteCount
FROM Posts
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2021
ORDER BY CreationDate;
And here is what the first 10 rows look like:
Luckily for us, the DSSE allows us to run a query and then download the query results as a .csv file. Since we now have the information we are interested in, we can go ahead and download the .csv file and continue analyzing the data with Python.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
data = pd.read_csv('QueryResults.csv', parse_dates=["CreationDate"])
data.head()
Id | PostTypeId | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount | |
---|---|---|---|---|---|---|---|---|
0 | 87391 | 1 | 2021-01-01 03:10:42 | 1 | 28 | <decision-trees> | 1 | NaN |
1 | 87392 | 1 | 2021-01-01 07:28:07 | 0 | 14 | <machine-learning><python><deep-learning><imag... | 1 | NaN |
2 | 87393 | 1 | 2021-01-01 08:07:33 | 1 | 18 | <neural-network><deep-learning><inception> | 0 | NaN |
3 | 87395 | 1 | 2021-01-01 10:31:51 | 1 | 37 | <machine-learning><cloud><federated-learning> | 1 | 1.0 |
4 | 87404 | 1 | 2021-01-01 18:00:21 | 2 | 48 | <reinforcement-learning><openai-gym> | 1 | NaN |
We can see that our `FavoriteCount` column has some `NaN` values. Let's quickly check how many missing values we have in our dataset, along with any other issues we might find.
#preview our data types and null-value counts
data.info()
# print(data.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6315 entries, 0 to 6314
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             6315 non-null   int64
 1   PostTypeId     6315 non-null   int64
 2   CreationDate   6315 non-null   datetime64[ns]
 3   Score          6315 non-null   int64
 4   ViewCount      6315 non-null   int64
 5   Tags           6315 non-null   object
 6   AnswerCount    6315 non-null   int64
 7   FavoriteCount  505 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 394.8+ KB
Thankfully, the only column that has null values is our `FavoriteCount` column. Unfortunately, only 505 of the 6315 rows have a non-null value there. That's only ~8% of the data for that column! Since the column represents the number of times a user "favorited" the post, a missing value signifies that no users had the post in their favorites list. We can simply replace these values with zeroes.
#fill in missing values with a zero
data.fillna(0, inplace=True)
#change data type to int64
data["FavoriteCount"] = data["FavoriteCount"].astype('int64')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6315 entries, 0 to 6314
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             6315 non-null   int64
 1   PostTypeId     6315 non-null   int64
 2   CreationDate   6315 non-null   datetime64[ns]
 3   Score          6315 non-null   int64
 4   ViewCount      6315 non-null   int64
 5   Tags           6315 non-null   object
 6   AnswerCount    6315 non-null   int64
 7   FavoriteCount  6315 non-null   int64
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 394.8+ KB
Now we have no more missing values. We can still make this information a little cleaner though. Since we are only interested in the questions asked this year, we know that all of the PostTypeId values for each row are identical. There's no more use for that information, so we can safely remove that column from our dataset.
#remove the PostTypeId column
data.drop("PostTypeId", axis=1, inplace=True)
Next, we will have to analyze the `Tags` column. We can definitely clean this up a bit. Let's check the data type of the `Tags` column first, just to be safe.
#shows us the data type of each tag
data["Tags"].apply(lambda t: type(t)).value_counts()
<class 'str'>    6315
Name: Tags, dtype: int64
We can see that each value in the column is a string. On Stack Exchange, each post can have a maximum of five tags. We could have five separate columns for each tag, but let's leave the structure alone for now. Instead, let's clean up the strings to make them easier to read.
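For reference, here's a minimal sketch of what that five-column alternative might look like, using `str.split` with `expand=True` on a small stand-in Series (the real version would operate on `data["Tags"]`):

```python
import pandas as pd

# Stand-in for a couple of raw Tags values
raw = pd.Series(["<decision-trees>",
                 "<machine-learning><python><deep-learning>"])

# Strip the outer brackets, then split into one column per tag;
# rows with fewer tags are padded with None
wide = raw.str.replace("^<|>$", "", regex=True).str.split("><", expand=True)
wide.columns = ["Tag" + str(i + 1) for i in range(wide.shape[1])]
print(wide)
```

With the full dataset this would produce up to five `Tag` columns, but the single list column is easier to iterate over for counting, which is why we leave the structure alone.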
data["Tags"].head()
0                                     <decision-trees>
1    <machine-learning><python><deep-learning><imag...
2           <neural-network><deep-learning><inception>
3        <machine-learning><cloud><federated-learning>
4                 <reinforcement-learning><openai-gym>
Name: Tags, dtype: object
We can see from the above that the tags are wrapped in angle brackets (`<>`) and placed one right after the other with no spaces. We can use the `str.replace` and `str.split` methods to simplify them.
#removes the <> from the tags
data["Tags"] = data["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
data["Tags"].head()
0                                     [decision-trees]
1    [machine-learning, python, deep-learning, imag...
2           [neural-network, deep-learning, inception]
3        [machine-learning, cloud, federated-learning]
4                 [reinforcement-learning, openai-gym]
Name: Tags, dtype: object
Now that they're cleaned up a bit, let's count how many times each tag was used. We can create a dictionary where the key is the tag, and the value is the number of times it was used.
tag_count = {}
#iterate through each tag in the list, for each row
for row in data["Tags"]:
    for tag in row:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
#converts the dictionary into a dataframe
tag_count = pd.DataFrame.from_dict(tag_count, orient="index")
tag_count.rename(columns={0:"Count"}, inplace=True)
#sorts by "Count" and displays top 20
most_used = tag_count.sort_values("Count",ascending=False).head(20)
most_used
Count | |
---|---|
machine-learning | 1668 |
python | 1165 |
deep-learning | 917 |
neural-network | 580 |
keras | 520 |
nlp | 514 |
tensorflow | 508 |
classification | 495 |
time-series | 393 |
scikit-learn | 356 |
cnn | 262 |
dataset | 247 |
regression | 227 |
clustering | 213 |
pytorch | 207 |
pandas | 203 |
lstm | 202 |
convolutional-neural-network | 192 |
statistics | 189 |
machine-learning-model | 186 |
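As an aside, the same tally can be produced more idiomatically with `Series.explode` and `value_counts`; a quick sketch with a stand-in Series of tag lists (the real version would use `data["Tags"]`):

```python
import pandas as pd

# Stand-in data: each row holds the list of tags for one question
tags = pd.Series([["machine-learning", "python"],
                  ["python"],
                  ["machine-learning"]])

# explode() gives one row per tag; value_counts() tallies them
counts = tags.explode().value_counts()
print(counts)
```

The explicit dictionary loop above does the same job and is arguably easier to follow, so either approach works here.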
Let's try and visualize this information now, to get a better sense of scale.
#plot the most used tags
ax = most_used.plot.barh()
ax.invert_yaxis()
plt.show()
We can see that the top three tags used are related to machine-learning, python, and deep-learning. While these are definitely helpful, a tag like "python" can be a bit vague as it doesn't really tell us exactly what the question is about. Before we dive further into that issue however, let's repeat this process based on ViewCount.
tag_views = {}
for index, row in data.iterrows():
    for tag in row["Tags"]:
        if tag in tag_views:
            tag_views[tag] += row["ViewCount"]
        else:
            tag_views[tag] = row["ViewCount"]
#converts tag_views dictionary into dataframe
tag_views = pd.DataFrame.from_dict(tag_views, orient="index")
tag_views.rename(columns={0:"Views"}, inplace=True)
#sorts by "Views" and displays top 20
most_viewed = tag_views.sort_values("Views",ascending=False).head(20)
most_viewed
Views | |
---|---|
python | 123413 |
machine-learning | 93912 |
deep-learning | 59274 |
keras | 52803 |
tensorflow | 51930 |
pandas | 41321 |
scikit-learn | 39538 |
nlp | 33227 |
neural-network | 29442 |
numpy | 29251 |
classification | 26761 |
cnn | 24098 |
time-series | 21040 |
matplotlib | 20189 |
pytorch | 18358 |
convolutional-neural-network | 18097 |
dataset | 17345 |
bert | 16732 |
transformer | 16488 |
image-classification | 13765 |
#plot the most viewed tags
ax = most_viewed.plot.barh()
ax.invert_yaxis()
plt.show()
Many of the most used tags are also among the most viewed. Our top three are python, machine-learning, and deep-learning. Python and machine-learning are clearly the two most used and most viewed, by a fairly large margin. Let's now check to see if there are any tags that are in our most_used dataset but not in our most_viewed.
#merge most_used with most_viewed, on most_used
in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
in_used
Count | Views | |
---|---|---|
machine-learning | 1668 | 93912.0 |
python | 1165 | 123413.0 |
deep-learning | 917 | 59274.0 |
neural-network | 580 | 29442.0 |
keras | 520 | 52803.0 |
nlp | 514 | 33227.0 |
tensorflow | 508 | 51930.0 |
classification | 495 | 26761.0 |
time-series | 393 | 21040.0 |
scikit-learn | 356 | 39538.0 |
cnn | 262 | 24098.0 |
dataset | 247 | 17345.0 |
regression | 227 | NaN |
clustering | 213 | NaN |
pytorch | 207 | 18358.0 |
pandas | 203 | 41321.0 |
lstm | 202 | NaN |
convolutional-neural-network | 192 | 18097.0 |
statistics | 189 | NaN |
machine-learning-model | 186 | NaN |
Similarly, let's check to see if any tags are in the most_viewed, but not most_used.
#merge most_used with most_viewed, on most_viewed
in_viewed = pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)
in_viewed
Count | Views | |
---|---|---|
python | 1165.0 | 123413 |
machine-learning | 1668.0 | 93912 |
deep-learning | 917.0 | 59274 |
keras | 520.0 | 52803 |
tensorflow | 508.0 | 51930 |
pandas | 203.0 | 41321 |
scikit-learn | 356.0 | 39538 |
nlp | 514.0 | 33227 |
neural-network | 580.0 | 29442 |
numpy | NaN | 29251 |
classification | 495.0 | 26761 |
cnn | 262.0 | 24098 |
time-series | 393.0 | 21040 |
matplotlib | NaN | 20189 |
pytorch | 207.0 | 18358 |
convolutional-neural-network | 192.0 | 18097 |
dataset | 247.0 | 17345 |
bert | NaN | 16732 |
transformer | NaN | 16488 |
image-classification | NaN | 13765 |
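Rather than eyeballing the NaN rows in two separate merges, a single outer merge with `indicator=True` can label each tag directly; a sketch with tiny stand-in frames (the real version would merge `most_used` and `most_viewed`):

```python
import pandas as pd

# Tiny stand-ins for most_used and most_viewed, indexed by tag
used_demo = pd.DataFrame({"Count": [1165, 202]}, index=["python", "lstm"])
viewed_demo = pd.DataFrame({"Views": [123413, 29251]},
                           index=["python", "numpy"])

# indicator=True adds a _merge column: left_only, right_only, or both
merged = pd.merge(used_demo, viewed_demo, how="outer",
                  left_index=True, right_index=True, indicator=True)
print(merged)
```

Filtering on `merged["_merge"]` then yields the "used but not viewed" and "viewed but not used" lists in one step.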
The tags that are present in most_used but not present in most_viewed are:

- regression
- clustering
- lstm
- statistics
- machine-learning-model
The tags that are present in most_viewed but not present in most_used are:

- numpy
- matplotlib
- bert
- transformer
- image-classification
Also, it's important to note that some tags have a clear relationship between them, for example `python` and `pandas`, because `pandas` is a Python library. So for the context of our research, we could tackle two tags (`python` and `pandas`) by creating content for `pandas`.
There are some tags, however, that do not have any relationship between them, such as `python` and `r`, since they are two different languages.
data[data["Tags"].apply(lambda tags: "r" in tags and "pandas" in tags)]
Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount
---|---|---|---|---|---|---
Luckily for us, none of the data in our query results contains both of those tags together.
If we look at the results in the previous section, we can see that the tags common to both the most used and the most viewed lists are: `python`, `machine-learning`, `deep-learning`, `keras`, `tensorflow`, `pandas`, `scikit-learn`, `neural-network`, and `nlp`. But what are these? What do they mean, and are they related at all?
The DSSE has a tags page. Here is a brief summary of each of the tags:

- `python` - used for data science questions related to the programming language Python.
- `machine-learning` - a subfield of computer science that draws on elements from algorithmic analysis, computational statistics, mathematics, etc. Mainly concerned with the use of data to build models that have high predictive ability.
- `deep-learning` - a new area of machine learning research concerned with the technology used for learning hierarchical representations of data, mainly done with deep neural networks.
- `keras` - a popular open-source deep learning API for Python, built on top of TensorFlow and useful for fast implementation.
- `tensorflow` - an open-source library for machine learning and machine intelligence.
- `pandas` - a Python library for panel data manipulation and analysis.
- `scikit-learn` - a popular machine learning package for Python with simple and efficient tools for predictive data analysis.
- `neural-network` - artificial neural networks (ANN) are composed of 'neurons', programming constructs that mimic the properties of biological neurons.
- `nlp` - natural language processing is a field of computer science, AI, and linguistics concerned with the interactions between computers and human (natural) languages.

We can clearly see that the most used/viewed tags are all very closely related to machine learning and data analysis. Python seems to be the foundation upon which a lot of the above is built, so it's natural that `python` and `pandas` would be used hand in hand.
Before we make an official recommendation, it would be nice to solidify our findings with some more concrete proof. One thing that comes to mind is "is deep learning just a fad?" Ideally, the content we decide to create will be the most useful for as long as possible. Could interest in deep learning be slowing down? Let's find out!
We begin by going back to the Stack Exchange Data Explorer and performing a new query that fetches all questions ever asked on the DSSE, with both their dates and tags.
SELECT Id, CreationDate, Tags
FROM posts
WHERE PostTypeId = 1;
We then download the results as a new .csv file named "all_questions.csv".
#reads in our new data
all_questions = pd.read_csv('all_questions.csv', parse_dates=["CreationDate"])
all_questions.head()
Id | CreationDate | Tags | |
---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | <machine-learning><backpropagation> |
1 | 14685 | 2016-10-21 15:33:24 | <data-mining><dataset><data><sql> |
2 | 14686 | 2016-10-21 15:48:03 | <predictive-modeling><statistics><social-netwo... |
3 | 14688 | 2016-10-21 16:58:18 | <machine-learning><scikit-learn><tensorflow><k... |
4 | 14690 | 2016-10-21 17:29:49 | <machine-learning><python><data-cleaning><visu... |
#removes the <> from the tags
all_questions["Tags"] = all_questions["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
all_questions["Tags"].head()
0                  [machine-learning, backpropagation]
1                    [data-mining, dataset, data, sql]
2    [predictive-modeling, statistics, social-netwo...
3    [machine-learning, scikit-learn, tensorflow, k...
4    [machine-learning, python, data-cleaning, visu...
Name: Tags, dtype: object
all_questions.head()
Id | CreationDate | Tags | |
---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | [machine-learning, backpropagation] |
1 | 14685 | 2016-10-21 15:33:24 | [data-mining, dataset, data, sql] |
2 | 14686 | 2016-10-21 15:48:03 | [predictive-modeling, statistics, social-netwo... |
3 | 14688 | 2016-10-21 16:58:18 | [machine-learning, scikit-learn, tensorflow, k... |
4 | 14690 | 2016-10-21 17:29:49 | [machine-learning, python, data-cleaning, visu... |
Now that our new set of data is cleaned up a bit, we can move on. We're trying to figure out which questions are deep learning questions, so that we can track their interest and popularity over time. By navigating to the deep-learning tag info page on the DSSE, we can see the 'Related Tags' section at the bottom. It lists the following tags:
- machine-learning
- neural-network
- keras
- tensorflow
- python
- cnn
- nlp
- lstm
- classification
- convolutional-neural-network
The tags `cnn` and `convolutional-neural-network` refer to the same thing. The other tags all refer to neural networks, machine learning, Python, and everything else related to deep learning. However, a tag like `python` is so vague that we can't tell for sure whether the question is directly related to `deep-learning`, so we'll remove it from our list. For the sake of simplicity, I will consider any question that uses one of these tags (except `python`) to be a `deep-learning` question. There are, of course, more analytical ways of producing a list of tags related to `deep-learning`, but I feel that is beyond the scope of this project.
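For the curious, one such analytical approach might be a simple co-occurrence count: tally how often every other tag appears alongside `deep-learning`. A rough sketch with stand-in data (the real version would iterate over `all_questions["Tags"]`):

```python
from collections import Counter

# Stand-in data: each inner list is one question's tags
tag_lists = [["deep-learning", "keras"],
             ["deep-learning", "keras", "python"],
             ["python", "pandas"]]

co_occurs = Counter()
for tags in tag_lists:
    if "deep-learning" in tags:
        # count every companion tag on deep-learning questions
        co_occurs.update(t for t in tags if t != "deep-learning")

print(co_occurs.most_common())
```

Tags with the highest co-occurrence counts would be natural candidates for the related-tags list, but for this project the curated list from the tag info page is good enough.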
#list of tags related to deep-learning
dl_tag_list = ["deep-learning", "machine-learning", "neural-network", "keras", "tensorflow",
"cnn", "nlp", "lstm", "classification", "convolutional-neural-network"]
#check if all_questions["Tags"] contains one of the above
def is_deep_learning(tags):
    for tag in tags:
        if tag in dl_tag_list:
            return 1
    return 0
#run above function on our all_questions["Tags"] column
all_questions["DeepLearning"] = all_questions["Tags"].apply(is_deep_learning)
#verify
all_questions.head()
Id | CreationDate | Tags | DeepLearning | |
---|---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | [machine-learning, backpropagation] | 1 |
1 | 14685 | 2016-10-21 15:33:24 | [data-mining, dataset, data, sql] | 0 |
2 | 14686 | 2016-10-21 15:48:03 | [predictive-modeling, statistics, social-netwo... | 0 |
3 | 14688 | 2016-10-21 16:58:18 | [machine-learning, scikit-learn, tensorflow, k... | 1 |
4 | 14690 | 2016-10-21 17:29:49 | [machine-learning, python, data-cleaning, visu... | 1 |
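As a design note, the same check can be written with set intersection, which avoids the explicit loop; a sketch assuming the same tag list (stored as a set for fast membership tests):

```python
# Same deep-learning tag list as above, as a set
dl_tags = {"deep-learning", "machine-learning", "neural-network", "keras",
           "tensorflow", "cnn", "nlp", "lstm", "classification",
           "convolutional-neural-network"}

def is_deep_learning_set(tags):
    # a non-empty intersection means at least one deep-learning related tag
    return int(bool(dl_tags & set(tags)))

print(is_deep_learning_set(["machine-learning", "backpropagation"]))  # 1
print(is_deep_learning_set(["data-mining", "sql"]))  # 0
```

Both versions give identical results; the loop version short-circuits the same way, so this is purely a matter of style.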
Let's get some more information about our data set. In particular, let's see how many rows of data we have, as well as the earliest and most recent questions. This will help us to pick a timeframe to track the interest in deep learning.
all_questions.shape
(31475, 4)
earliest = all_questions["CreationDate"].min()
recent = all_questions["CreationDate"].max()
print("The earliest question was asked on:", earliest)
print("The most recent question was asked on:", recent)
The earliest question was asked on: 2014-05-13 23:58:30
The most recent question was asked on: 2021-11-28 02:57:13
With something as dynamic as data science, I think it would be best to track this on a quarterly basis. Since it is (at the time of writing) December 1, 2021, we don't have the full data for this year; we do have data for the first three quarters of 2021, however. We similarly don't have the data for the first half of 2014, so we'll have to remove the questions whose `CreationDate` falls in those ranges. Before we remove anything, though, let's create a way to identify which quarter a question falls into.
#extract values from CreationDate
def get_quarter(date):
    year = date.year
    month = date.month
    quarter = ((month - 1) // 3) + 1
    return "{y}_Q{q}".format(y=year, q=quarter)
#apply our function from above
all_questions["Quarter"] = all_questions["CreationDate"].apply(get_quarter)
all_questions.head()
Id | CreationDate | Tags | DeepLearning | Quarter | |
---|---|---|---|---|---|
0 | 14680 | 2016-10-21 10:19:08 | [machine-learning, backpropagation] | 1 | 2016_Q4 |
1 | 14685 | 2016-10-21 15:33:24 | [data-mining, dataset, data, sql] | 0 | 2016_Q4 |
2 | 14686 | 2016-10-21 15:48:03 | [predictive-modeling, statistics, social-netwo... | 0 | 2016_Q4 |
3 | 14688 | 2016-10-21 16:58:18 | [machine-learning, scikit-learn, tensorflow, k... | 1 | 2016_Q4 |
4 | 14690 | 2016-10-21 17:29:49 | [machine-learning, python, data-cleaning, visu... | 1 | 2016_Q4 |
Great! Now that we have an identifier for the quarter each question was posted in, we can remove any questions from quarters for which we don't have complete data, namely 2021_Q4 and 2014_Q2.
#drop relevant rows from dataset
all_questions.drop(all_questions[(all_questions["Quarter"] == "2014_Q2") |
                                 (all_questions["Quarter"] == "2021_Q4")].index,
                   inplace=True)
Now that we have a column telling us which quarter a question was asked in, we can move ahead. All that's left to do is group by quarter, count the `deep-learning` questions and the total number of questions, and calculate the proportion of `deep-learning` questions.
#group by quarter and calculate sum and total # of questions
quarterly = all_questions.groupby("Quarter").agg({"DeepLearning": ['sum', 'size']})
quarterly.columns = ["DeepLearning_Questions", "Total_Questions"]
#calculate % of questions that are deep-learning questions
quarterly["DeepLearning_Rate"] = (quarterly["DeepLearning_Questions"] /
                                  quarterly["Total_Questions"])
quarterly.reset_index(inplace=True)
quarterly.sample(5)
Quarter | DeepLearning_Questions | Total_Questions | DeepLearning_Rate | |
---|---|---|---|---|
14 | 2018_Q1 | 790 | 1199 | 0.658882 |
4 | 2015_Q3 | 153 | 310 | 0.493548 |
19 | 2019_Q2 | 1125 | 1786 | 0.629899 |
3 | 2015_Q2 | 135 | 284 | 0.475352 |
18 | 2019_Q1 | 1106 | 1733 | 0.638200 |
Great! Now all we need to do is visualize our data to make it easier to digest for our readers.
#plot details
N = len(quarterly["Quarter"]) #how many bars to show
ind = np.arange(N) #spacing between bars

fig, ax = plt.subplots()
#bar plots share the left-hand axis
p1 = ax.bar(ind, quarterly["Total_Questions"])
p2 = ax.bar(ind, quarterly["DeepLearning_Questions"])
#the rate is a fraction, so it gets its own right-hand axis
ax2 = ax.twinx()
p3, = ax2.plot(ind, quarterly["DeepLearning_Rate"], marker="o", color="green")

#labels and legend
ax.set_title('Deep Learning vs. Other Data Science \n Questions by Quarter', fontsize=25)
ax.set_ylabel('Number of Questions', fontsize=20)
ax.set_xlabel('Quarters', fontsize=20)
ax.set_xticks(ind)
ax.set_xticklabels(quarterly["Quarter"], rotation=90)
ax.legend((p1[0], p2[0], p3),
          ("Total Questions", "Deep Learning Questions", "Deep Learning Rate"),
          loc="best")
plt.show()