#!/usr/bin/env python
# coding: utf-8

# # Project - Selecting Content for a Data Science Company
# 
# In this project, we will try to answer the question "What do people want to learn in data science?" and, in doing so, evaluate the available content for our company, which deals with data science education.
# 
# **Scenario**
# 
# We're working for a company that creates data science content, be it books, online articles, videos or interactive text-based platforms. We're tasked with figuring out the best content to write about. We realize that if we wanted to figure out what programming content to write, we could consult [Stack Overflow](https://stackoverflow.com/) (a question and answer website about programming) and see what kind of content is most popular. After investigating Stack Overflow in depth, we find out that it is part of a question and answer website network called [Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange).

# ### Stack Exchange
# 
# Stack Exchange hosts sites on a multitude of fields and subjects, including mathematics, physics, philosophy, and data science! Here's a sample of the most popular sites:
# 
# ![image.png](https://dq-content.s3.amazonaws.com/469/se_sites.png)
# 
# Stack Exchange employs a reputation award system for its questions and answers. Each post (question or answer) is subject to upvotes and downvotes. This ensures that good posts are easily identifiable. More details are available on this [tour](https://stackexchange.com/tour).
# 
# Since data science is a multidisciplinary field, a few Stack Exchange websites are relevant to our goal here:
# 
# - [Data Science](https://datascience.stackexchange.com/)
# - [Cross Validated](https://stats.stackexchange.com/) (a statistics site)
# - [Artificial Intelligence](https://ai.stackexchange.com/)
# - [Mathematics](https://math.stackexchange.com/)
# - [Stack Overflow](https://stackoverflow.com/)
# 
# And if we want to include Data Engineering, we can also consider:
# 
# - [Database Administrators](https://dba.stackexchange.com/)
# - [Unix & Linux](https://unix.stackexchange.com/)
# - [Software Engineering](https://softwareengineering.stackexchange.com/)
# 
# At the time of writing, Data Science Stack Exchange (DSSE) is among the top 40 sites by number of users, with 85K users; however, it is among the bottom 7 by answer rate, with 64% of its questions unanswered. Further statistics are available [here](https://stackexchange.com/sites?view=list#questionsperday). This makes it quite attractive for exploring data science content.

# ### Data Science Stack Exchange (DSSE)
# 
# The site is organized into the following sections:
# 
# - [Home](https://datascience.stackexchange.com/): Displays the `top questions` and contains a sidebar menu with links to other sections of the site.
# 
# - [Questions](https://datascience.stackexchange.com/questions): Features all questions, which can be filtered by `No answers` and `No accepted answers`. Currently, approximately 25,000 questions are featured, of which over 15,000 are unanswered or have no accepted answer (63%).
# 
# - [Tags](https://datascience.stackexchange.com/tags): Keywords which organize questions into topics and facilitate search.
# 
# - [Users](https://datascience.stackexchange.com/users): Features the data on users. Users with the highest `reputation` are featured before the others. Users can be searched and filtered by `user name`, `reputation`, `New user` etc.
# - [Unanswered](https://datascience.stackexchange.com/unanswered): Features unanswered questions, which can be filtered by votes, tags and newest.
# 
# The footer menu has links to the [Tour](https://datascience.stackexchange.com/tour) and [Help](https://datascience.stackexchange.com/help) sections, along with links to other Stack Exchange sites and features.

# **Questions and Answers**
# 
# Data Science Stack Exchange (DSSE) is a question and answer site for data science professionals, machine learning specialists, and those interested in learning more about the field. The method of asking questions, getting answers, and getting promoted to higher levels is described [here](https://datascience.stackexchange.com/tour):
# 
# - The site is all about getting answers. It's not a discussion forum. There's no chit-chat.
# - Good answers are voted up and rise to the top. The best answers show up first so that they are always easy to find.
# - The person who asked can mark one answer as "accepted". Accepting doesn't mean it's the best answer, it just means that it worked for the person who asked.
# - Focus on questions about an actual problem you have faced. Include details about what you have tried and exactly what you are trying to do.
# - Avoid questions that are primarily opinion-based, or that are likely to generate discussion rather than answers. Questions that need improvement may be closed until someone fixes them.
# - All questions are tagged with their subject areas. Each can have up to 5 tags. We can click any tag to see a list of questions with that tag, or go to the tag list to browse for topics of interest.
# - A user's reputation score goes up when others vote up their questions, answers and edits.
# - User privileges are granted based on reputation score.

# ### Getting Data
# 
# Stack Exchange sites have a number of convenient ways of getting data:
# 
# - **Scraping**: We can scrape a page for relevant information.
# 
# In this case, we will be retrieving the tags on the first page of the [tags link](https://datascience.stackexchange.com/tags).

# In[ ]:

import requests
from bs4 import BeautifulSoup

# Request data from "https://datascience.stackexchange.com/tags"
response = requests.get("https://datascience.stackexchange.com/tags")
content = response.content

# Initialize the parser and pass in the content we grabbed earlier
parser = BeautifulSoup(content, 'html.parser')

# Every tag on the page carries the "post-tag" CSS class
tags = parser.select(".post-tag")
tags_scraping = [tag.text for tag in tags[:36]]
print(tags_scraping)
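# Scraping a single page is manageable, but collecting every tag would mean walking
# the paginated listing ourselves. Below is a minimal sketch of what that would look
# like, assuming the site keeps its `?page=N` URL pattern and that a page with no
# `.post-tag` elements marks the end (this helper is illustrative and not used below).

# In[ ]:

def scrape_all_tags(base_url="https://datascience.stackexchange.com/tags"):
    """Collect tag names from every page of the tags listing (sketch)."""
    all_tags = []
    page = 1
    while True:
        page_response = requests.get(base_url, params={"page": page})
        page_parser = BeautifulSoup(page_response.content, 'html.parser')
        page_tags = [t.text for t in page_parser.select(".post-tag")]
        if not page_tags:  # ran past the last page
            break
        all_tags.extend(page_tags)
        page += 1
    return all_tags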
# **API**
# 
# Obviously, scraping is a tedious process: getting even just the names of the tags involves a fair amount of code.
# 
# Next, we can try the [Stack Exchange API](https://api.stackexchange.com/). First, we register the app for an access token:
# 
# - App: Py Lesson
# - client_id = 18238
# - redirect_uri = `https://stackoverflow.com/oauth/login_success`

# In[ ]:

# Request an access token
import requests

response = requests.post("https://stackoverflow.com/oauth/dialog?client_id=18238&redirect_uri=https://stackoverflow.com/oauth/login_success")
print(response.status_code)

# In[ ]:

# An access_token is only required for an increased quota, for write access and for
# access to private information. The API expects the token and key as query-string
# parameters.
## The following access token is now expired
params = {
    'access_token': '6QkGLWxpVNzS6XWzfOFXGw))',
    'key': 'LCyb3nlOflFZqImiVbfZog((',
}
response = requests.get(
    "https://api.stackexchange.com/2.2/tags?page=1&pagesize=36&order=desc&sort=popular&site=datascience",
    params=params)
tags = response.json()

# In[ ]:

# Parse the JSON to extract the names of the tags on page 1
import pandas as pd

tags_list = tags['items']
tags_df = pd.DataFrame(tags_list)
tags_api = tags_df["name"]
print(tags_api)

# In[ ]:

# Check whether the tag lists obtained from scraping and from the API are equal
tags_api = list(tags_df["name"])
equal = tags_api == tags_scraping
print(equal)
# They are equal

# **Stack Exchange Data Explorer [(SEDE)](https://data.stackexchange.com/help)**
# 
# SEDE is an open source tool for running arbitrary queries against public data from the Stack Exchange network. Features include collaborative query editing for all graduated and public beta Stack Exchange sites.
# 
# The data is updated early every Sunday morning, around 3:00 UTC.
# 
# So, apart from web scraping and use of the API, SEDE provides direct access to the database and accepts T-SQL queries.
# 
# The [following query](https://data.stackexchange.com/datascience/query/1259433/schema), run at SEDE, gives the same results as those obtained from web scraping and the API:
# 
# ```
# SELECT TOP 36
# TagName, Count
# FROM Tags
# ORDER BY Count DESC;
# ```
# 
# The query results can be downloaded as a CSV file, and a permalink can also be created to the query (as done above). We will now compare the results from the query with those of web scraping and the API.

# In[ ]:

import pandas as pd

query = pd.read_csv("top_36_tags.csv")
tags_query = list(query["TagName"])
print(tags_query)

# In[ ]:

# Check whether these results are the same as those obtained from web scraping and the API
equal = tags_api == tags_scraping == tags_query
print(equal)
# All results are equal

# In[ ]:

# Determine the ten most popular topics from the tags
top_ten = query[:10]
print(top_ten)

# ### Getting Posts Data from SEDE
# 
# Above, we can see that most questions are concentrated in roughly the top 10 tags. The same content (the top 36 tags) is displayed on the [tags home page](https://datascience.stackexchange.com/tags).
# 
# Now we will focus on the `Posts` table in the SEDE database to build a dataset for a single year (2019) and carry out a more granular analysis.
# 
# We will run a query against the SEDE DSSE database that extracts the following columns for all the questions asked in 2019:
# 
# - `Id`: An identification number for the post.
# - `PostTypeId`: An identification number for the type of post.
# - `CreationDate`: The date and time of creation of the post.
# - `Score`: The post's score.
# - `ViewCount`: How many times the post was viewed.
# - `Tags`: What tags were used.
# - `AnswerCount`: How many answers the question got (only applicable to question posts).
# - `FavoriteCount`: How many times the question was favorited (only applicable to question posts).

# **Using the API in Tandem**
# 
# We will use the API in tandem with the query in order to compare some of our results.

# In[ ]:

# Fetch the questions asked in 2019 using the API (the timestamps are Unix epochs
# for 2019-01-01 and 2019-12-31)
response = requests.get(
    "https://api.stackexchange.com/2.2/questions?pagesize=100&fromdate=1546300800&todate=1577750400&order=desc&sort=creation&site=datascience&filter=!5-dmZUPsa(aVhP2rEGYktwXr6kydXo7VO6LStu",
    params=params)
questions_api = response.json()
questions_list = questions_api['items']
questions_100 = pd.DataFrame(questions_list)
print(questions_100.info())

# As we can see above, the API is limited to returning 100 results per page, so getting all the data would mean running a loop over many pages (as sketched below) and consuming both computational resources and our request quota.
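# Such a loop is easy to sketch with the API's paging parameters: the response
# wrapper's `has_more` field says when to stop, and its `quota_remaining` field shows
# how much of the daily request quota is left. The sketch below is illustrative only
# (a production version should also honour the wrapper's `backoff` field); we will
# use SEDE instead.

# In[ ]:

def fetch_all_questions(fromdate=1546300800, todate=1577750400):
    """Page through /2.2/questions until the API reports no more results (sketch)."""
    items = []
    page = 1
    while True:
        page_response = requests.get(
            "https://api.stackexchange.com/2.2/questions",
            params={"page": page, "pagesize": 100,
                    "fromdate": fromdate, "todate": todate,
                    "order": "desc", "sort": "creation",
                    "site": "datascience"})
        wrapper = page_response.json()
        items.extend(wrapper["items"])
        if not wrapper.get("has_more"):  # no further pages
            break
        page += 1
    return items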
# **Finding the total number of questions from a query**
# 
# We will, therefore, get the complete data from the database by running the [following query](https://data.stackexchange.com/datascience/query/1259625/questions-in-2019):
# 
# ```
# SELECT Id, CreationDate,
# Score, ViewCount, Tags,
# AnswerCount, FavoriteCount
# FROM posts
# WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;
# ```
# 
# Note that, of the various post types, the **PostTypeId** for "questions" is **1**.
# 
# We download the results from SEDE as `2019_questions.csv`.

# In[ ]:

# Convert the results into a dataframe
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])
print(questions.info())

# ### Data Cleaning
# 
# We want to eventually focus on the `Tags` column as well as the popularity measures, so we will carry out the following data-cleaning steps:
# 
# - Fill the missing values in `FavoriteCount` with `0`, as a missing value indicates that the question was never favorited.
# - Convert `FavoriteCount` into `int`.
# - Convert the `Tags` string into a more readable format.

# In[ ]:

# Fill in the missing values for the "FavoriteCount" column and make it integer
questions["FavoriteCount"] = questions["FavoriteCount"].fillna(0).astype(int)
print(questions.info())

# In[ ]:

# Inspect the current format of the tags string
print(questions["Tags"].sample(5))

# In[ ]:

# Strip the outer angle brackets and split "<a><b>" into the list ["a", "b"]
questions["Tags"] = questions["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
print(questions["Tags"].sample(5))

# ### Most Used and Most Viewed Tags
# 
# We will focus on `Tags` to:
# 
# - Count how many times each tag was used.
# - Count how many times each tag was viewed.
# - Create visualizations for the top tags of each of the above results.

# In[ ]:

print(questions.sample(5))

# In[ ]:

# Count how many times each tag was used
tag_no = {}
for tags in questions["Tags"]:
    for tag in tags:
        if tag not in tag_no:
            tag_no[tag] = 1
        else:
            tag_no[tag] += 1

# In[ ]:

tag_no = pd.DataFrame.from_dict(data=tag_no, orient="index")
tag_no.rename(columns={0: "No"}, inplace=True)
most_used = tag_no.sort_values(by='No', ascending=False, axis=0)
most_used.head(20)
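# The loop above works, but pandas can produce the same counts directly: `explode`
# (available since pandas 0.25) turns each list of tags into its own row, and
# `value_counts` tallies them. A quick sketch of the equivalent approach:

# In[ ]:

# One row per tag, then tally; this should match the "No" column of most_used
most_used_alt = questions["Tags"].explode().value_counts()
print(most_used_alt.head(20))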
# ### Using a SEDE Query
# 
# We can use the [following SEDE query](https://data.stackexchange.com/datascience/query/1259649/questions-in-2019) to obtain the same results as above:
# 
# ```
# SELECT TagName, COUNT(TagName)
# FROM Posts AS p
# INNER JOIN PostTags AS pt ON pt.PostId=p.id
# INNER JOIN Tags AS t ON t.id=pt.TagId
# WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019
# GROUP BY TagName
# ORDER BY COUNT(TagName) DESC;
# ```

# In[ ]:

tags_2019 = pd.read_csv("tags_2019.csv")
tags_2019.rename(columns={"Unnamed: 1": "No"}, inplace=True)
print(tags_2019.head(20))

# In[ ]:

# Check that the results obtained from the query match those from the dataset
equal = list(most_used["No"]) == list(tags_2019["No"])
print(equal)

# In[ ]:

# Determine the view counts for the tags used in 2019
view_no = {}
for index, row in questions.iterrows():
    for tag in row['Tags']:
        if tag not in view_no:
            view_no[tag] = row['ViewCount']
        else:
            view_no[tag] += row['ViewCount']

# In[ ]:

view_no = pd.DataFrame.from_dict(data=view_no, orient="index")
view_no.rename(columns={0: "No"}, inplace=True)
most_viewed = view_no.sort_values(by='No', ascending=False, axis=0).head(20)
most_viewed

# ### Using a SEDE Query
# 
# We can use the [following SEDE query](https://data.stackexchange.com/datascience/query/1259696/viewcount-for-tags-created-in-2019-and-active-in-2019) to obtain the same results as above:
# 
# ```
# SELECT TagName, SUM(ViewCount)
# FROM Posts AS p
# INNER JOIN PostTags AS pt ON pt.PostId=p.id
# INNER JOIN Tags AS t ON t.id=pt.TagId
# WHERE (PostTypeId = 1 AND YEAR(CreationDate) = 2019)
# GROUP BY TagName
# ORDER BY 2 DESC;
# ```

# In[ ]:

views_2019 = pd.read_csv("views_2019.csv")
views_2019.rename(columns={"Unnamed: 1": "No"}, inplace=True)
print(views_2019.head(20))

# In[ ]:

# Check that the results obtained from the query match those from the dataset
equal = list(most_viewed["No"]) == list(views_2019["No"].head(20))
print(equal)

# In[ ]:

import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# Log scaling on the x axis due to the wide range of values;
# plot only the top 20 tags for readability (most_viewed already holds 20)
most_used.head(20).plot(kind="barh", figsize=(16, 8), logx=True)

# In[ ]:

most_viewed.plot(kind="barh", figsize=(16, 8), logx=True)

# In[ ]:

# Side-by-side comparison of the two rankings
fig, axes = plt.subplots(1, 2)
fig.set_size_inches((28, 12))
most_used.head(20).plot(kind="barh", ax=axes[0], subplots=True, logx=True)
most_viewed.plot(kind="barh", ax=axes[1], subplots=True, logx=True)

# ### Relations Between Tags
# 
# If we want to see which tags are in `most_used` but not in `most_viewed`, we can identify them by the rows with missing view counts after a left merge.
# 
# Similarly, if we want to know which tags are in the latter but not the former, we can use a right merge.

# In[ ]:

in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
print(in_used)

# In[ ]:

pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)

# - `clustering`, `r`, `predictive-modeling`, `statistics` and `machine-learning-model` are in `most_used` but not in `most_viewed`.
# 
# - `dataframe`, `numpy`, `csv`, `pytorch` and `seaborn` are in `most_viewed` but not in `most_used`.
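# pandas can also surface these differences in one step: an outer merge with
# `indicator=True` adds a `_merge` column labelling each tag as "left_only",
# "right_only" or "both". A small sketch of that variant, comparing the top 20 of
# each ranking:

# In[ ]:

# Tags labelled "left_only" appear only in most_used, "right_only" only in most_viewed
merged = pd.merge(most_used.head(20), most_viewed, how="outer",
                  left_index=True, right_index=True, indicator=True)
print(merged[merged["_merge"] != "both"])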
# **A Simpler Approach Using Sets**
# 
# A far simpler approach is to use the set methods `intersection` and `difference`.
# 
# - As is obvious below, it gives the same results as above.

# In[ ]:

set(most_used.index).intersection(set(most_viewed.index))

# In[ ]:

set(most_used.index).difference(set(most_viewed.index))

# In[ ]:

set(most_viewed.index).difference(set(most_used.index))

# ### Identifying Potential Data Science Content Using Domain Knowledge
# 
# As practitioners, we are aware that data science is a multidisciplinary field with broad, interlinked disciplines:
# 
# - Computer programming
# - Data management
# - Calculus, maths and algebra
# - Statistics and probability
# - Artificial intelligence
# 
# Our strategy will be to:
# 
# - Identify the most popular (most used) tags associated with each discipline using domain knowledge.
# - Run queries against the DSSE database to get the other tags most frequently used alongside those popular tags.
# - Create a superset of tags for each domain (discipline), made unique through set operations.
# - Classify the `Tags` column of the `all_questions` dataframe against each superset, marking a question as belonging to a discipline when its tags form a subset.
# 
# Since general coding questions are dealt with at [Stack Overflow](https://stackoverflow.com/), we will leave that part out.
# 
# - From data management, we will focus on the tag `dataset` and create the superset `ss_data`.
# - From the mathematical domains, we will focus on the tag `statistics` and create the superset `ss_stat`.
# - From the AI domains, we will focus on the tags `machine-learning` and `deep-learning` and create the superset `ss_ai`.

# ### Superset Data
# 
# Run the following query at SEDE:
# 
# ```
# SELECT Tags, TagName
# FROM Posts AS p
# INNER JOIN PostTags AS pt ON pt.PostId=p.id
# INNER JOIN Tags AS t ON t.id=pt.TagId
# WHERE TagName = 'dataset';
# ```

# In[ ]:

# Create the set 'ss_data'
ss_data_df = pd.read_csv("ss_data.csv")
ss_data_df["Tags"] = ss_data_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')  # clean dataframe
print(ss_data_df.head(5), '\n')
ss_data_list = []
for index, row in ss_data_df.iterrows():
    for tag in row['Tags']:
        ss_data_list.append(tag)
print(ss_data_list[:10], '\n')
print(len(ss_data_list), '\n')
ss_data = set(ss_data_list)
print(len(ss_data))

# ### Superset Statistics
# 
# Run the following query at SEDE:
# 
# ```
# SELECT Tags, TagName
# FROM Posts AS p
# INNER JOIN PostTags AS pt ON pt.PostId=p.id
# INNER JOIN Tags AS t ON t.id=pt.TagId
# WHERE TagName = 'statistics';
# ```

# In[ ]:

# Create the set 'ss_stat'
ss_stat_df = pd.read_csv("ss_stats.csv")
ss_stat_df["Tags"] = ss_stat_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')  # clean dataframe
print(ss_stat_df.head(5), '\n')
ss_stat_list = []
for index, row in ss_stat_df.iterrows():
    for tag in row['Tags']:
        ss_stat_list.append(tag)
print(ss_stat_list[:10], '\n')
print(len(ss_stat_list), '\n')
ss_stat = set(ss_stat_list)
print(len(ss_stat))

# ### Superset Machine-Learning
# 
# Run the following query at SEDE:
# 
# ```
# SELECT Tags, TagName
# FROM Posts AS p
# INNER JOIN PostTags AS pt ON pt.PostId=p.id
# INNER JOIN Tags AS t ON t.id=pt.TagId
# WHERE TagName = 'machine-learning';
# ```

# In[ ]:

# Create the set 'ss_ml'
ss_ml_df = pd.read_csv("ss_ml.csv")
ss_ml_df["Tags"] = ss_ml_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')  # clean dataframe
print(ss_ml_df.head(5), '\n')
ss_ml_list = []
for index, row in ss_ml_df.iterrows():
    for tag in row['Tags']:
        ss_ml_list.append(tag)
print(ss_ml_list[:10], '\n')
print(len(ss_ml_list), '\n')
ss_ml = set(ss_ml_list)
print(len(ss_ml))

# In[ ]:

# Create the set 'ss_dl' (from the same query run with TagName = 'deep-learning')
ss_dl_df = pd.read_csv("ss_dl.csv")
ss_dl_df["Tags"] = ss_dl_df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')  # clean dataframe
print(ss_dl_df.head(5), '\n')
ss_dl_list = []
for index, row in ss_dl_df.iterrows():
    for tag in row['Tags']:
        ss_dl_list.append(tag)
print(ss_dl_list[:10], '\n')
print(len(ss_dl_list), '\n')
ss_dl = set(ss_dl_list)
print(len(ss_dl))
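# The four cells above repeat the same read-clean-flatten steps. A small helper
# would capture the pattern; a sketch, assuming each CSV export has the same layout
# as the ones used above:

# In[ ]:

def build_superset(csv_path):
    """Read a SEDE export and return the set of all co-occurring tags (sketch)."""
    df = pd.read_csv(csv_path)
    df["Tags"] = df["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')
    return {tag for tags in df["Tags"] for tag in tags}

# e.g. ss_ml = build_superset("ss_ml.csv")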
# In[ ]:

# Union ss_ml and ss_dl to create one superset, ss_ai
ss_ai = ss_ml.union(ss_dl)
print(len(ss_ai))

# ### Classification of Questions into Data Science Disciplines
# 
# - We have now created 3 supersets, i.e. `ss_data`, `ss_stat` and `ss_ai`, containing the unique tags related to three broad disciplines: "Data Management (dm)", "Statistics (stat)" and "Artificial Intelligence (ai)".
# 
# - We will now see how many of the questions ever asked on DSSE fall under these disciplines.
# 
# - Run the following query at SEDE to get all the questions:
# 
# ```
# SELECT Id, CreationDate, Tags
# FROM posts
# WHERE PostTypeId = 1;
# ```

# In[ ]:

# Create the dataframe aq
aq = pd.read_csv("all_questions.csv")
aq["Tags"] = aq["Tags"].str.replace('^<|>$', '', regex=True).str.split('><')  # clean dataframe
print(aq.head(10), '\n')

# - We will create three additional columns in the aq dataframe, `dm`, `stat` and `ai`, and populate them with 0.
# - If a question's tag list is a subset of one of the supersets, the respective column will change to 1.

# In[ ]:

# Initialize the three discipline columns with zeros
aq["dm"] = 0
aq["stat"] = 0
aq["ai"] = 0
print(aq.head(5), '\n')

# In[ ]:

# Inspect the data-management superset
print(ss_data)

# In[ ]:

# Convert the "Tags" column to sets using the apply method
aq["Tags"] = aq["Tags"].apply(set)
print(aq.head(5), '\n')

# In[ ]:

def inset(x, ss):
    """Identify whether the set x is a subset of the superset ss."""
    if x.issubset(ss):
        return 1
    else:
        return 0

aq["dm"] = aq["Tags"].apply(inset, ss=ss_data)
aq["stat"] = aq["Tags"].apply(inset, ss=ss_stat)
aq["ai"] = aq["Tags"].apply(inset, ss=ss_ai)
print(aq.head(20))

# In[ ]:

print(aq.info())

# ### Set Membership
# 
# Let's recall that the supersets are the collections of all tags that have ever been used in combination with the marker tags `machine-learning`, `deep-learning`, `statistics` and `dataset` of the three main disciplines, i.e. "AI", "Statistics" and "Data Management", which we identified based on domain knowledge. The larger the superset, the more likely it is for any combination of tags, or even a single tag, to be identified as a subset of it.
# 
# As seen below, at present the membership criteria are quite lax and solely contingent upon the size of the superset.

# In[ ]:

# Find the size of each discipline's membership
print('ai members:', '\n', sum(aq["ai"]))
print('dm members:', '\n', sum(aq["dm"]))
print('stat members:', '\n', sum(aq["stat"]))
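# To see how lax the test is, consider a single-tag question: if that one tag ever
# co-occurred with `machine-learning` or `deep-learning`, it sits in `ss_ai`, and the
# question counts as AI regardless of what it is actually about. A toy illustration
# (the outcome depends on the actual superset contents):

# In[ ]:

# {'python'} passes inset() for ss_ai whenever 'python' has ever been used
# alongside 'machine-learning' or 'deep-learning', a very weak condition
print(inset({'python'}, ss_ai))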
# ### Refining the "Discipline" Membership Criteria
# 
# As seen above, out of approximately 22,000 questions, more than 21,000 are classified as belonging to the AI disciplines `machine-learning` and `deep-learning`. So mere membership of a superset should not be enough.
# 
# Since the supersets control which combinations of tags count as members, we must impose additional conditions for membership.
# 
# We will replace the function `inset` with `indiscip`, which reflects the following additional and existing conditions:
# 
# - **Length**: The set of tags must contain at least 2 tags, as a serious questioner homes in on a topic of relevance, moving from broad categories to a specific topic or vice versa. They will not necessarily use the key tags such as `machine-learning`, `dataset` or `statistics`, as people with different knowledge and skills are likely to have different conceptions of what a broad category is.
# 
# - **Unique membership**: We cannot implement unique membership, as there are many interlinks and crossovers: a set of tags can legitimately be identified as belonging to more than one discipline, and that remains allowed. But we do need to place some constraint on membership based on the tag combination. For this, we will impose the following condition through the `indiscip` function: **out of a given combination of tags for a question which is a subset of a particular superset, at least one tag should not be a member of one of the other two supersets.**
# 
# - **Size of superset**: If we run the following query at SEDE, we get the total number of tags used on the site (596 at the time of writing):
# 
# ```
# SELECT COUNT(TagName)
# FROM Tags
# ```
# 
# In order to qualify as a discipline, the size of the superset should be substantial. We fix the threshold, arbitrarily, at **25%** (150) of the unique tag names.

# In[ ]:

def indiscip(x, ss1, ss2, ss3):
    """Identify whether a set of tags marks a question as belonging to the
    data science discipline represented by superset ss1."""
    if (x.issubset(ss1) and len(x) >= 2 and len(ss1) > 150) \
            and (len(x.difference(ss2)) >= 1 or len(x.difference(ss3)) >= 1):
        return 1
    else:
        return 0

aq["dm"] = aq["Tags"].apply(indiscip, ss1=ss_data, ss2=ss_stat, ss3=ss_ai)
aq["stat"] = aq["Tags"].apply(indiscip, ss1=ss_stat, ss2=ss_data, ss3=ss_ai)
aq["ai"] = aq["Tags"].apply(indiscip, ss1=ss_ai, ss2=ss_data, ss3=ss_stat)

print('ai questions:', '\n', sum(aq["ai"]))
print('dm questions:', '\n', sum(aq["dm"]))
print('stat questions:', '\n', sum(aq["stat"]))
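# A quick check of the refined rule on toy inputs makes the conditions concrete
# (hypothetical tag sets; the second outcome depends on the actual superset contents):

# In[ ]:

# A single tag now always fails the length condition, even if it is in the superset
print(indiscip({'python'}, ss_ai, ss_data, ss_stat))
# A pair of tags passes only if it is a subset of ss_ai and at least one tag
# falls outside ss_data or outside ss_stat
print(indiscip({'machine-learning', 'neural-network'}, ss_ai, ss_data, ss_stat))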
# In[ ]:

# Plot the results for AI
import seaborn as sns

sns.set(style="darkgrid")
ax1 = sns.countplot(y="ai", data=aq, facecolor=(0, 0, 0, 0),
                    linewidth=5, edgecolor=sns.color_palette("dark", 3))

# In[ ]:

# Plot the results for Data Management
ax = sns.countplot(y="dm", data=aq, facecolor=(0, 0, 0, 0),
                   linewidth=5, edgecolor=sns.color_palette("dark", 3))

# In[ ]:

# Plot the results for Statistics
ax = sns.countplot(y="stat", data=aq, facecolor=(0, 0, 0, 0),
                   linewidth=5, edgecolor=sns.color_palette("dark", 3))

# ### Exploring the Time Series for All Disciplines
# 
# From the above it can be seen that **AI**, with its component `machine-learning` and its subfield `deep-learning`, forms the major thrust of questions at DSSE. However, the other disciplines are not insignificant, and we will explore them further as a time series.
# 
# Since `ss_ai` was built from the `machine-learning` and `deep-learning` tags, the `ai` column already covers the `deep-learning` questions. We will now use the dataframe `aq` to determine:
# 
# - How many questions per discipline, including `deep-learning`, are asked per time period.
# - The total number of questions per time period.
# - How many `deep-learning` and other disciplines' questions there are relative to the total number of questions per time period.

# ### Analysis Strategy
# 
# - We will change `CreationDate` to a string in `yymm` format.
# - Define a function `get_qtr` which uses string methods to extract the quarter as a string in `yyQn` format, where `n` is the quarter number.
# - `apply` the function `get_qtr` to the new `qtr` column.
# - `groupby` `yyQn` and aggregate the counts for all disciplines `ai`, `dm` and `stat`.
# - Plot the results.

# In[ ]:

# Change "CreationDate" to a string in yymm format
# (work on a copy so the original aq dataframe is left intact)
aq_ts = aq.copy()
aq_ts["qtr"] = aq_ts["CreationDate"].astype(str).str[2:7].str.replace('-', '')
aq_ts.drop(labels="CreationDate", axis=1, inplace=True)

# In[ ]:

aq_ts = aq_ts.sort_values(by='qtr', ascending=True)
print(aq_ts.head())

# In[ ]:

def get_qtr(string):
    """Extract the quarter, as yyQn, from a yymm string."""
    year = int(string[0:2]) * 100
    month = int(string) - year
    qtr = int((month - 1) / 3) + 1
    return '{y}Q{q}'.format(y=string[0:2], q=qtr)

# In[ ]:

# Apply 'get_qtr' to 'aq_ts["qtr"]'
aq_ts["qtr"] = aq_ts["qtr"].apply(get_qtr)

# In[ ]:

print(aq_ts.tail())

# In[ ]:

import numpy as np

aq_ts_pt = aq_ts.pivot_table(values=["ai", "dm", "stat"], index="qtr", aggfunc=np.sum)

# In[ ]:

print(aq_ts_pt)

# In[ ]:

# Drop the incomplete current quarter (2020 Q1)
aq_ts_pt.drop(labels="20Q1", inplace=True)

# In[ ]:

# Comparative time series curves for all disciplines
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
aq_ts_pt["ai"].plot(kind="line", c='orange', subplots=True, label="ai", legend=True)
aq_ts_pt["dm"].plot(kind="line", c='green', subplots=True, label="dm", legend=True)
aq_ts_pt["stat"].plot(kind="line", subplots=True, label="stat", legend=True)

# In[ ]:

# Plotting the time series: "AI Growth"
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
plt.ylim(0, 800)
aq_ts_pt["ai"].plot(kind="bar", subplots=True, label="ai", legend=True,
                    title="AI Growth - Time Series")
aq_ts_pt["ai"].plot(kind="line", c='orange', subplots=True, label="ai", legend=True)

# In[ ]:

# Plotting the time series: "DM Growth"
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
plt.ylim(0, 800)
aq_ts_pt["dm"].plot(kind="bar", color='orange', subplots=True, label="dm", legend=True,
                    title="DM Growth - Time Series")
aq_ts_pt["dm"].plot(kind="line", color='blue', subplots=True, label="dm", legend=True)

# In[ ]:

# Plotting the time series: "Stat Growth"
fig, axes = plt.subplots()
fig.set_size_inches((14, 6))
plt.ylim(0, 800)
aq_ts_pt["stat"].plot(kind="bar", color='red', subplots=True, label="stat", legend=True,
                      title="Stat Growth - Time Series")
aq_ts_pt["stat"].plot(kind="line", color='green', subplots=True, label="stat", legend=True)

# ### Recommendations
# 
# - It is clear from the above that the AI subfields, including `deep-learning`, and to an extent the `data-management` subfields, have shown marked growth since the 4th quarter of 2018. Perhaps this is indicative of the interest in the field in general. Statistics, a major underlying discipline, has not registered much interest despite being a major requirement for data analysis; perhaps this is due to the formal STEM education required in that field.
# 
# - Our content, for the present, should focus on `deep-learning` and `data-management`.