In this project, we will try to answer the question "What do people want to learn in data science?" and, in doing so, evaluate the available content for our company, which deals with data science education.
Scenario
We're working for a company that creates data science content, be it books, online articles, videos or interactive text-based platforms. We're tasked with figuring out what the best content to write about is. We realize that if we wanted to figure out what programming content to write, we could consult Stack Overflow (a question and answer website about programming) and see what kind of content is most popular. After investigating Stack Overflow in depth, we find out that it is part of a question and answer website network called Stack Exchange.
Stack Exchange hosts sites on a multitude of fields and subjects, including mathematics, physics, philosophy, and data science! Here's a sample of the most popular sites:
Stack Exchange employs a reputation award system for its questions and answers. Each post, whether a question or an answer, is subject to upvotes and downvotes. This ensures that good posts are easily identifiable. More details are available on this tour.
Data science being a multidisciplinary field, there are a few Stack Exchange websites that are relevant to our goal here:
And if we want to include Data Engineering, we can also consider:
At the time of writing, Data Science Stack Exchange (DSSE) is among the top 40 sites by number of users, with 85K users; however, it is also among the bottom 7 sites by answer rate, with 64% of its questions unanswered. Further statistics are available here. This makes it quite attractive for exploring data science content.
The site is organized into the following sections:
- Home: displays the top questions and contains a sidebar menu with links to the other sections of the site.
- Questions: features all questions, which can be filtered by No answers and No accepted answers. Currently, approximately 25,000 questions are featured, of which over 15,000 are unanswered or have no accepted answer (63%).
- Tags: key words which organize questions into topics and facilitate search.
- Users: features the data on users. Users with the highest reputation are featured before the others. Users can be searched and filtered by user name, reputation, New user, etc.
- Unanswered: features unanswered questions, which can be filtered by votes, tags and newest.
- Footer: has links to the Tour and Help sections, along with links to other Stack Exchange sites and features.
Questions and Answers
Data Science Stack Exchange (DSSE) is a question and answer site for data science professionals, machine learning specialists, and those interested in learning more about the field. The method of asking questions, getting answers, and getting promoted to higher levels is described here:
import requests
from bs4 import BeautifulSoup

# Request the tags page from Data Science Stack Exchange
response = requests.get("https://datascience.stackexchange.com/tags")
content = response.content

# Initialize the parser and pass in the content we grabbed earlier
parser = BeautifulSoup(content, 'html.parser')
tags = parser.select(".post-tag")

# Extract the text of the first 36 tags (one page's worth)
tags_scraping = [tag.text for tag in tags[:36]]
print(tags_scraping)
API
Obviously, scraping is a tedious process, and getting even the names of the tags involves a fair amount of code.
Next, we can try the Stack Exchange API.
First, we register the app to obtain an access token:
https://stackoverflow.com/oauth/login_success
# Request an access token
import requests

response = requests.post("https://stackoverflow.com/oauth/dialog?client_id=18238&redirect_uri=https://stackoverflow.com/oauth/login_success")
print(response.status_code)

# An access_token is only required for an increased quota, for write access and for access to private info
# The following access token is now expired
headers = {'access_token': '6QkGLWxpVNzS6XWzfOFXGw))', 'key': 'LCyb3nlOflFZqImiVbfZog((' }
response = requests.get("https://api.stackexchange.com/2.2/tags?page=1&pagesize=36&order=desc&sort=popular&site=datascience", headers=headers)
tags = response.json()
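As a quick sanity check, the API's common response wrapper also reports the remaining request quota; a minimal sketch using the documented wrapper fields:
# 'quota_remaining' and 'quota_max' are standard fields of the API's response wrapper
print(tags.get("quota_remaining"), "of", tags.get("quota_max"), "requests remaining")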
# Parse the JSON to extract the names of the tags on page 1
import pandas as pd

tags_list = tags['items']
tags_df = pd.DataFrame(tags_list)
tags_api = tags_df["name"]
print(tags_api)

# Check whether the tag lists obtained from scraping and from the API are equal
tags_api = list(tags_df["name"])
equal = tags_api == tags_scraping
print(equal) # They are equal
Stack Exchange Data Explorer (SEDE)
SEDE is an open source tool for running arbitrary queries against public data from the Stack Exchange network. Features include collaborative query editing for all graduated and public beta Stack Exchange sites.
The data is updated early every Sunday morning around 3:00 UTC.
Apart from web scraping and use of the API, SEDE provides access to the database and accepts T-SQL queries.
Following query run at SEDE gives the same results as obtained from web-scraping and API.
SELECT TOP 36 TagName, [Count]
FROM Tags
ORDER BY [Count] DESC;
The query results can be downloaded as a CSV file, and a permalink to the query can also be created (as done above). We will now compare the results of the query with those of web scraping and the API.
import pandas as pd
query = pd.read_csv("top_36_tags.csv")
tags_query = query["TagName"]
tags_query = list(tags_query)
print(tags_query)
# Check whether these results are the same as those obtained from web scraping and the API
equal = tags_api==tags_scraping==tags_query
print(equal) # All results are equal
# Determine top ten popular topics from tags
top_ten = query[:10]
print(top_ten)
Above, we can see that most questions are concentrated in roughly the top 10 tags. The same content (the top 36 tags) is displayed on the tags home page.
Now we will focus on the Posts table in the SEDE database to create a dataset for a single year (2019) and carry out a more granular analysis.
We will run a query against the SEDE DSSE database that extracts the following columns for all the questions in 2019:
- Id: an identification number for the post.
- PostTypeId: an identification number for the type of post.
- CreationDate: the date and time of creation of the post.
- Score: the post's score.
- ViewCount: how many times the post was viewed.
- Tags: what tags were used.
- AnswerCount: how many answers the question got (only applicable to question posts).
- FavoriteCount: how many times the question was favorited (only applicable to question posts).
Using the API in Tandem
We will use the API in tandem with the query in order to compare some of our results.
# Find the total number of questions asked on DSSE in 2019 using the API
response = requests.get("https://api.stackexchange.com/2.2/questions?pagesize=100&fromdate=1546300800&todate=1577750400&order=desc&sort=creation&site=datascience&filter=!5-dmZUPsa(aVhP2rEGYktwXr6kydXo7VO6LStu", headers=headers)
questions_api = response.json()
questions_list = questions_api['items']
questions_100 = pd.DataFrame(questions_list)
print(questions_100.info())
We can see from the above that the API is limited to returning 100 results per page, so getting all the data would require looping over many pages, consuming both quota and computational resources. For completeness, pulling everything through the API would look roughly like the sketch below.
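A minimal pagination sketch, using the documented page, pagesize, has_more and backoff fields of the Stack Exchange API (the custom filter string is omitted for brevity, and fetch_all_questions is a hypothetical helper, not part of the original analysis):
import time
import requests

def fetch_all_questions(site="datascience", fromdate=1546300800, todate=1577750400):
    """Page through /2.2/questions until the API reports has_more == False."""
    items, page = [], 1
    while True:
        response = requests.get(
            "https://api.stackexchange.com/2.2/questions",
            params={"page": page, "pagesize": 100,
                    "fromdate": fromdate, "todate": todate,
                    "order": "asc", "sort": "creation", "site": site})
        data = response.json()
        items.extend(data["items"])
        if not data.get("has_more"):
            return items
        # Respect the API's backoff hint to avoid being throttled
        time.sleep(data.get("backoff", 0))
        page += 1
We will, therefore, get the complete data from the database by running the following query: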
SELECT Id, CreationDate,
Score, ViewCount, Tags,
AnswerCount, FavoriteCount
FROM posts
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;
Note that, of the various post types, the PostTypeId for questions is 1.
We download the results from SEDE as 2019_questions.csv.
# Convert the results into dataframe
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])
print(questions.info())
We want to eventually focus on the Tags column, as well as other popularity metrics, so we will carry out the following data-cleaning steps:
- fill the missing values in FavoriteCount with 0, as a missing value indicates that the question was not favorited by anyone;
- convert FavoriteCount into int;
- convert the Tags string into a more readable format.
# Fill in missing values for the "FavoriteCount" column
questions["FavoriteCount"] = questions["FavoriteCount"].fillna(0)
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
print(questions.info())
# Convert the format of the tags string
print(questions["Tags"].sample(5))
questions["Tags"] = questions["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
print(questions["Tags"].sample(5))
We will focus on Tags to determine which tags are most used and which are most viewed:
print(questions.sample(5))
# Count how many times each tag was used
tag_no = {}
for tags in questions["Tags"]:
    for tag in tags:
        if tag not in tag_no:
            tag_no[tag] = 1
        else:
            tag_no[tag] += 1
tag_no = pd.DataFrame.from_dict(data=tag_no, orient="index")
tag_no.rename(columns={0: "No"}, inplace=True)
most_used = tag_no.sort_values(by='No', ascending=False, axis=0)
most_used.head(20)
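As an aside, the same counts can be obtained more idiomatically with pandas (a sketch assuming pandas >= 0.25, which introduced Series.explode):
# One tag per row, then count occurrences
most_used_alt = questions["Tags"].explode().value_counts()
print(most_used_alt.head(20))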
We can use a SEDE query to obtain the same results as above:
SELECT TagName, COUNT(TagName)
FROM Posts AS p
INNER JOIN PostTags AS pt ON pt.PostId=p.id
INNER JOIN Tags AS t ON t.id=pt.TagId
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019
GROUP BY TagName
ORDER BY COUNT(TagName) DESC;
tags_2019 = pd.read_csv("tags_2019.csv")
tags_2019.rename(columns={"Unnamed: 1": "No"}, inplace=True)
print(tags_2019.head(20))
# Results obtained from query and dataset are equal
equal = list(most_used["No"]) == list(tags_2019["No"])
print(equal)
# Determine view counts per tag in 2019
view_no = {}
for index, row in questions.iterrows():
    for tag in row['Tags']:
        if tag not in view_no:
            view_no[tag] = row['ViewCount']
        else:
            view_no[tag] += row['ViewCount']
view_no = pd.DataFrame.from_dict(data=view_no, orient="index")
view_no.rename(columns={0: "No"}, inplace=True)
most_viewed = view_no.sort_values(by='No', ascending=False, axis=0).head(20)
most_viewed
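The same aggregation can also be written with explode and groupby (a sketch, equivalent to the loop above):
# Sum ViewCount per tag after giving each tag its own row
most_viewed_alt = (questions.explode("Tags")
                            .groupby("Tags")["ViewCount"]
                            .sum()
                            .sort_values(ascending=False)
                            .head(20))
print(most_viewed_alt)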
We can use a SEDE query to obtain the same results as above:
SELECT TagName, SUM(ViewCount)
FROM Posts AS p
INNER JOIN PostTags AS pt ON pt.PostId=p.id
INNER JOIN Tags AS t ON t.id=pt.TagId
WHERE (PostTypeId = 1 AND YEAR(CreationDate) = 2019)
GROUP BY TagName
ORDER BY 2 DESC;
views_2019 = pd.read_csv("views_2019.csv")
views_2019.rename(columns={"Unnamed: 1": "No"}, inplace=True)
print(views_2019.head(20))
# Results obtained from query and dataset are equal
equal = list(most_viewed["No"]) == list(views_2019["No"].head(20))
print(equal)
import matplotlib.pyplot as plt
%matplotlib inline

# Log scaling on the x axis due to very large numbers
most_used.plot(kind="barh", figsize=(16, 8), logx=True)
most_viewed.plot(kind="barh", figsize=(16, 8), logx=True)

# Side-by-side comparison of the two rankings
fig, axes = plt.subplots(1, 2)
fig.set_size_inches((28, 12))
most_used.plot(kind="barh", ax=axes[0], subplots=True, logx=True)
most_viewed.plot(kind="barh", ax=axes[1], subplots=True, logx=True)
If we want to see which tags are in most_used but not in most_viewed, we can identify them by the missing values in the merged ViewCount column after a left merge. Similarly, if we want to know which tags are in the latter but not the former, we can use a right merge.
in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
print(in_used)
pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)
clustering, r, predictive-modeling, statistics and machine-learning-model are in most_used but not in most_viewed.
dataframe, numpy, csv, pytorch and seaborn are in most_viewed but not in most_used.
Simpler Approach Using Sets
A far simpler approach is to use the set methods intersection and difference:
set(most_used.index).intersection(set(most_viewed.index))
set(most_used.index).difference(set(most_viewed.index))
set(most_viewed.index).difference(set(most_used.index))
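As a side note, the two one-sided differences can be obtained in a single call with the symmetric difference (a minor sketch on top of the sets above):
# Tags that appear in exactly one of the two rankings
print(set(most_used.index).symmetric_difference(set(most_viewed.index)))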
As practitioners, we are aware that data science is a multidisciplinary field of broad, interlinked disciplines.
Our strategy will be:
- check the Tags column in the all_questions dataframe against each superset for classification as a subset (since general coding questions are dealt with at Stack Overflow, we will leave that part out);
- collect all tags ever used together with dataset and create the superset ss_data;
- collect all tags ever used together with statistics and create the superset ss_stat;
- collect all tags ever used together with machine-learning and deep-learning and create the superset ss_ai.
Run the following query at SEDE:
SELECT Tags, TagName
FROM Posts AS p
INNER JOIN PostTags AS pt ON pt.PostId=p.id
INNER JOIN Tags AS t ON t.id=pt.TagId
WHERE TagName = 'dataset';
# Create a set 'ss_data'
ss_data_df = pd.read_csv("ss_data.csv")
ss_data_df["Tags"] = ss_data_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><') # clean dataframe
print(ss_data_df.head(5), '\n')
ss_data_list = []
for index, row in ss_data_df.iterrows():
    for tag in row['Tags']:
        ss_data_list.append(tag)
print(ss_data_list[:10], '\n')
print(len(ss_data_list), '\n')
ss_data = set(ss_data_list)
print(len(ss_data))
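Since the same load-clean-flatten pattern repeats for each marker tag below, a small helper could remove the duplication (a sketch; load_superset is a hypothetical name, and the CSV filenames are those used in this project):
def load_superset(csv_path):
    """Load a SEDE export and return the set of all tags that co-occur with the marker tag."""
    df = pd.read_csv(csv_path)
    df["Tags"] = df["Tags"].str.replace(r'^<|>$', '', regex=True).str.split('><')
    return {tag for tags in df["Tags"] for tag in tags}

# e.g. ss_data = load_superset("ss_data.csv")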
Run the following query at SEDE:
SELECT Tags, TagName
FROM Posts AS p
INNER JOIN PostTags AS pt ON pt.PostId=p.id
INNER JOIN Tags AS t ON t.id=pt.TagId
WHERE TagName = 'statistics';
# Create a set 'ss_stat'
ss_stat_df = pd.read_csv("ss_stats.csv")
ss_stat_df["Tags"] = ss_stat_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><') # clean dataframe
print(ss_stat_df.head(5), '\n')
ss_stat_list = []
for index, row in ss_stat_df.iterrows():
    for tag in row['Tags']:
        ss_stat_list.append(tag)
print(ss_stat_list[:10], '\n')
print(len(ss_stat_list), '\n')
ss_stat = set(ss_stat_list)
print(len(ss_stat))
Run the following query at SEDE:
SELECT Tags, TagName
FROM Posts AS p
INNER JOIN PostTags AS pt ON pt.PostId=p.id
INNER JOIN Tags AS t ON t.id=pt.TagId
WHERE TagName = 'machine-learning';
# Create a set 'ss_ml'
ss_ml_df = pd.read_csv("ss_ml.csv")
ss_ml_df["Tags"] = ss_ml_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><') # clean dataframe
print(ss_ml_df.head(5), '\n')
ss_ml_list = []
for index, row in ss_ml_df.iterrows():
    for tag in row['Tags']:
        ss_ml_list.append(tag)
print(ss_ml_list[:10], '\n')
print(len(ss_ml_list), '\n')
ss_ml = set(ss_ml_list)
print(len(ss_ml))
A similar query at SEDE with TagName = 'deep-learning' gives us ss_dl.csv.
# Create a set 'ss_dl'
ss_dl_df = pd.read_csv("ss_dl.csv")
ss_dl_df["Tags"] = ss_dl_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><') # clean dataframe
print(ss_dl_df.head(5), '\n')
ss_dl_list = []
for index, row in ss_dl_df.iterrows():
    for tag in row['Tags']:
        ss_dl_list.append(tag)
print(ss_dl_list[:10], '\n')
print(len(ss_dl_list), '\n')
ss_dl = set(ss_dl_list)
print(len(ss_dl))
# Take the union of ss_ml and ss_dl to create one superset, ss_ai
ss_ai = ss_ml.union(ss_dl)
print(len(ss_ai))
We have now created three supersets, i.e. ss_data, ss_stat and ss_ai, containing the unique tags related to three broad disciplines: "Data Management (dm)", "Statistics (stat)" and "Artificial Intelligence (ai)".
We will now see how many of the questions ever asked on DSSE fall under these disciplines.
Run the following query at SEDE to get all_questions:
SELECT Id, CreationDate, Tags
FROM posts
WHERE PostTypeId = 1;
# Create dataframe aq
aq = pd.read_csv("all_questions.csv")
aq["Tags"] = aq["Tags"].str.replace('^<|>$', '', regex=True).str.split('><') # clean dataframe
print(aq.head(10),'\n')
Next, we create three new columns dm, stat and ai and populate them with 0:
# Initialize the three discipline columns with zeros
for col in ("dm", "stat", "ai"):
    aq[col] = 0
print(aq.head(5), '\n')
print(ss_data)
# Convert "Tags" column to set using apply method
aq["Tags"] = aq["Tags"].apply(set)
print(aq.head(5),'\n')
""" Identifies whether a set is a subset of a superset"""
def inset(x, ss):
if x.issubset(ss):
return 1
else:
return 0
aq["dm"] = aq["Tags"].apply(inset, ss=ss_data)
aq["stat"] = aq["Tags"].apply(inset, ss=ss_stat)
aq["ai"] = aq["Tags"].apply(inset, ss=ss_ai)
print(aq.head(20))
print(aq.info())
Let's recall that the supersets are collections of tags which have ever been used together with the marker tags machine-learning, deep-learning, statistics and dataset of the three main disciplines, i.e. "AI", "Statistics" and "Data Management", which we identified based on domain knowledge. The larger the superset, the more likely it is that any combination of tags, or even a single tag, will be identified as a subset of that particular superset.
As seen below, at present the membership criteria are quite lax and are solely contingent upon the size of the superset.
# Finding the size of membership
print ('ai members:', '\n', sum(aq["ai"]))
print ('dm members:', '\n', sum(aq["dm"]))
print ('stat members:', '\n', sum(aq["stat"]))
As seen above, out of approximately 22,000 questions, more than 21,000 are classified as belonging to AI disciplines such as machine-learning and deep-learning.
So, merely being a subset of a superset should not be enough. Since the supersets control which particular combinations of tags count as members, we must impose additional conditions for membership.
We will amend the function inset into indiscip to reflect the following additional and existing conditions:
- Length: the set of tags must contain at least 2 tags, as a serious questioner homes in from broad categories to a specific topic, or vice versa. It is not necessary that they use the key tags such as machine-learning, dataset or statistics, as people with different knowledge and skills are likely to have different conceptions of what constitutes a broad category.
- Unique membership: we cannot implement strictly unique membership, as there are a lot of interlinks and cross-overs; a subset can be identified as belonging to more than one discipline, and this remains true. But we need to place some constraints on membership based on the tag combination. For this, we will impose the following condition through the indiscip function: out of a given combination of tags for a question which is a subset of a particular superset, at least one tag in the combination should not be a member of one of the other two supersets. A worked example is given after this list.
- Size of superset: if we run the following query at SEDE, we get the total number of tags used on the site:
SELECT COUNT(TagName)
FROM Tags;
This returns 596. In order to qualify as a discipline, the size of a superset should be substantial enough; we fix the threshold arbitrarily at 25% (150) of the unique tag names.
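To illustrate the membership condition with small hypothetical supersets (the real supersets must additionally contain more than 150 tags):
# Hypothetical supersets, far smaller than the real ones
ss1 = {"dataset", "preprocessing", "pandas"}
ss2 = {"statistics", "pandas"}
ss3 = {"machine-learning", "pandas"}

x = {"dataset", "preprocessing"}
# x is a subset of ss1, has at least 2 tags, and 'dataset' belongs to neither ss2 nor ss3,
# so the question would count toward ss1's discipline
print(x.issubset(ss1) and len(x) >= 2 and (len(x - ss2) >= 1 or len(x - ss3) >= 1))  # True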
"""Identifies whether a set of 'tags' is a subset of a data science discipline"""
def indiscip(x, ss1, ss2, ss3):
if (x.issubset(ss1) and len(x) >=2 and len(ss1) > 150) and (len(x.difference(ss2))>=1 or len(x.difference(ss3))>=1):
return 1
else:
return 0
aq["dm"] = aq["Tags"].apply(indiscip, ss1=ss_data, ss2=ss_stat, ss3=ss_ai)
aq["stat"] = aq["Tags"].apply(indiscip, ss1=ss_stat, ss2=ss_data, ss3=ss_ai)
aq["ai"] = aq["Tags"].apply(indiscip, ss1=ss_ai, ss2=ss_data, ss3=ss_stat)
print ('ai questions:', '\n', sum(aq["ai"]))
print ('dm questions:', '\n', sum(aq["dm"]))
print ('stat questions:', '\n', sum(aq["stat"]))
# Plot results for AI
import seaborn as sns
sns.set(style="darkgrid")
ax1= sns.countplot(y="ai", data=aq, facecolor=(0, 0, 0, 0),
linewidth=5,
edgecolor=sns.color_palette("dark", 3))
# Plot results for Data Management
ax= sns.countplot(y="dm", data=aq, facecolor=(0, 0, 0, 0),
linewidth=5,
edgecolor=sns.color_palette("dark", 3))
# Plot results for Statistics
ax=sns.countplot(y="stat", data=aq, facecolor=(0, 0, 0, 0),
linewidth=5,
edgecolor=sns.color_palette("dark", 3))
From the above, it can be seen that AI, with its component machine-learning and subfield deep-learning, forms the major thrust of questions at DSSE. However, the other disciplines are not insignificant, and we will explore them further as a time series.
Recall that the ai questions already include the deep-learning questions, since ss_ai is the union of the machine-learning and deep-learning supersets.
We will now use the dataframe aq to:
- count how many questions per discipline, including deep-learning, are asked per time period;
- count how many deep-learning and other disciplines' questions there are relative to the total number of questions per time period.
The steps are:
- convert CreationDate to a string in yymm format;
- define a function get_qtr which uses string functions to extract quarters, producing strings in the format yyQn, where n is the quarter number;
- apply the function get_qtr to CreationDate to convert it to the yyQn format;
- group by yyQn and aggregate counts for all disciplines ai, dm and stat.
# change "CreationDate" to string
aq_ts = aq
aq_ts["qtr"] = aq_ts["CreationDate"].astype(str).str[2:7].str.replace('-','')
aq_ts.drop(labels="CreationDate", axis=1, inplace=True)
aq_ts=aq_ts.sort_values(by='qtr', ascending=True)
print(aq_ts.head())
# Define function get_qtr
def get_qtr(string):
    """Extracts the quarter from a yymm string, e.g. '1907' -> '19Q3'."""
    year = int(string[0:2]) * 100
    month = int(string) - year
    qtr = int((month - 1) / 3) + 1
    return '{y}Q{q}'.format(y=string[0:2], q=qtr)
# apply 'get_qtr' to 'aq_ts["qtr"]'
aq_ts["qtr"] = aq_ts["qtr"].apply(get_qtr)
print(aq_ts.tail())
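A quick sanity check of the helper on sample values (hypothetical inputs):
print(get_qtr("1901"))  # '19Q1': January falls in the first quarter
print(get_qtr("1912"))  # '19Q4': December falls in the fourth quarter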
import numpy as np

aq_ts_pt = aq_ts.pivot_table(values=["ai", "dm", "stat"], index="qtr", aggfunc=np.sum)
print(aq_ts_pt)
# Drop the current, incomplete quarter
aq_ts_pt.drop(labels="20Q1", inplace=True)
# Comparative time series curves for all disciplines
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
aq_ts_pt["ai"].plot(kind="line", c='orange', subplots=True, label="ai", legend=True)
aq_ts_pt["dm"].plot(kind="line", c='green', subplots=True, label="dm", legend=True)
aq_ts_pt["stat"].plot(kind="line", subplots=True, label="stat", legend=True)
# Plotting Time Series: "AI Growth"
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
plt.ylim(0, 800)
aq_ts_pt["ai"].plot(kind="bar", subplots=True, label="ai", legend=True, title = "AI Growth - Time Series")
aq_ts_pt["ai"].plot(kind="line", c='orange', subplots=True, label="dm", legend=True)
# Plotting Time Series: "DM Growth"
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
plt.ylim(0, 800)
aq_ts_pt["dm"].plot(kind="bar", color='orange', subplots=True, label="dm", legend=True, title = "DM Growth - Time Series")
aq_ts_pt["dm"].plot(kind="line", color='blue', subplots=True, label="dm", legend=True)
# Plotting Time Series: "Stat Growth"
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
plt.ylim(0, 800)
aq_ts_pt["stat"].plot(kind="bar", color='red', subplots=True, label="dm", legend=True, title = "Stat Growth - Time Series")
aq_ts_pt["stat"].plot(kind="line", color='green', subplots=True, label="dm", legend=True)
It is clear from the above that the AI subfields, including deep-learning, and to an extent the data-management subfields, have shown marked growth since the 4th quarter of 2018. Perhaps this is indicative of the interest in the field in general. A major discipline, statistics, has not registered much interest despite being a major requirement for data analysis; perhaps this is due to the STEM education this field requires.
Our content, for the present, should be focused on deep-learning and data-management.