The goal of this work is to identify the most popular data science questions. Imagine working for a company that creates data science content: books, online articles, videos, or interactive text-based courses hosted on online education platforms such as Dataquest or O'Reilly Media.
Currently, the most popular question-and-answer resources are the network of websites run by Stack Exchange. Since data science is a multidisciplinary field, several Stack Exchange sites are relevant to the goal of this work:
Cross Validated — a statistics site
This work uses data obtained from the Stack Exchange Data Explorer (SEDE).
On the query page, SEDE lists the names and descriptions of the database tables. The database has 29 tables, and its general information schema can be viewed and downloaded from this link.
Extract the columns needed to characterize the data science questions asked in 2020 from the Posts table with the following SQL query, and save the result to the file "ds_questions_2020.csv":
SELECT
Id,
PostTypeId,
CreationDate,
Title,
OwnerUserId,
Tags,
Score,
ViewCount,
AnswerCount,
CommentCount,
FavoriteCount
FROM
Posts
WHERE
PostTypeId = 1 AND YEAR(CreationDate) = 2020;
Import the required modules, read "ds_questions_2020.csv" into the pandas dataframe ds_quests_2020, and explore it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
import datetime as dt
%matplotlib inline
ds_quests_2020 = pd.read_csv("ds_questions_2020.csv", parse_dates = ["CreationDate"])
ds_quests_2020.head()
Id | PostTypeId | CreationDate | Title | OwnerUserId | Tags | Score | ViewCount | AnswerCount | CommentCount | FavoriteCount | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67074 | 1 | 2020-01-26 16:36:27 | Duplicated features for gradient descent | 88944.0 | <gradient-descent> | 7 | 174 | 2 | 3 | 0.0 |
1 | 67076 | 1 | 2020-01-26 18:01:15 | how to predict the time series data with two t... | 86916.0 | <machine-learning><time-series><statistics> | 1 | 18 | 0 | 1 | NaN |
2 | 67079 | 1 | 2020-01-26 18:41:56 | Data prediction using scikit-learn and a list | 88948.0 | <machine-learning><python><scikit-learn> | 4 | 62 | 2 | 2 | NaN |
3 | 67080 | 1 | 2020-01-26 20:20:40 | Why does reducing polynomial regression to lin... | 88952.0 | <regression><linear-regression> | 2 | 32 | 1 | 1 | NaN |
4 | 67081 | 1 | 2020-01-26 21:04:04 | Combine RepeatedStratifiedKFold and crossval | 66508.0 | <python><cross-validation> | 1 | 139 | 1 | 0 | NaN |
ds_quests_2020.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7334 entries, 0 to 7333
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             7334 non-null   int64
 1   PostTypeId     7334 non-null   int64
 2   CreationDate   7334 non-null   datetime64[ns]
 3   Title          7334 non-null   object
 4   OwnerUserId    7301 non-null   float64
 5   Tags           7334 non-null   object
 6   Score          7334 non-null   int64
 7   ViewCount      7334 non-null   int64
 8   AnswerCount    7334 non-null   int64
 9   CommentCount   7334 non-null   int64
 10  FavoriteCount  962 non-null    float64
dtypes: datetime64[ns](1), float64(2), int64(6), object(2)
memory usage: 630.4+ KB
All non-text columns in the Posts table originally had integer type, so fix the data types of the FavoriteCount and OwnerUserId columns in the dataframe. Fill the NaN values in FavoriteCount with zero, drop the rows with missing OwnerUserId, then check the cleaned dataframe's info and look at the Tags column.
# Fill 'FavoriteCount' NaN values with zero
ds_quests_2020["FavoriteCount"] = ds_quests_2020["FavoriteCount"].fillna(0)
# Drop rows with NaN in 'OwnerUserId' and reset the index
ds_quests_2020.dropna(inplace = True)
ds_quests_2020.reset_index(inplace=True)
# Convert float type to integer for "OwnerUserId" and "FavoriteCount"
ds_quests_2020["OwnerUserId"] = ds_quests_2020["OwnerUserId"].astype("int64")
ds_quests_2020["FavoriteCount"] = ds_quests_2020["FavoriteCount"].astype("int64")
# Strip the "<" and ">" tag delimiters and split the tags into a list
ds_quests_2020["Tags"] = ds_quests_2020["Tags"].\
    str.replace("><",",").str.replace("<", "").\
    str.replace(">", "").str.split(",")
# View the cleaned dataset
ds_quests_2020.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7301 entries, 0 to 7300
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   index          7301 non-null   int64
 1   Id             7301 non-null   int64
 2   PostTypeId     7301 non-null   int64
 3   CreationDate   7301 non-null   datetime64[ns]
 4   Title          7301 non-null   object
 5   OwnerUserId    7301 non-null   int64
 6   Tags           7301 non-null   object
 7   Score          7301 non-null   int64
 8   ViewCount      7301 non-null   int64
 9   AnswerCount    7301 non-null   int64
 10  CommentCount   7301 non-null   int64
 11  FavoriteCount  7301 non-null   int64
dtypes: datetime64[ns](1), int64(9), object(2)
memory usage: 684.6+ KB
ds_quests_2020.head()
index | Id | PostTypeId | CreationDate | Title | OwnerUserId | Tags | Score | ViewCount | AnswerCount | CommentCount | FavoriteCount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 67074 | 1 | 2020-01-26 16:36:27 | Duplicated features for gradient descent | 88944 | [gradient-descent] | 7 | 174 | 2 | 3 | 0 |
1 | 1 | 67076 | 1 | 2020-01-26 18:01:15 | how to predict the time series data with two t... | 86916 | [machine-learning, time-series, statistics] | 1 | 18 | 0 | 1 | 0 |
2 | 2 | 67079 | 1 | 2020-01-26 18:41:56 | Data prediction using scikit-learn and a list | 88948 | [machine-learning, python, scikit-learn] | 4 | 62 | 2 | 2 | 0 |
3 | 3 | 67080 | 1 | 2020-01-26 20:20:40 | Why does reducing polynomial regression to lin... | 88952 | [regression, linear-regression] | 2 | 32 | 1 | 1 | 0 |
4 | 4 | 67081 | 1 | 2020-01-26 21:04:04 | Combine RepeatedStratifiedKFold and crossval | 66508 | [python, cross-validation] | 1 | 139 | 1 | 0 | 0 |
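The tag-string cleanup above can be sanity-checked on a couple of synthetic tag strings (hypothetical values, run through the same replace/split chain):

```python
import pandas as pd

# Hypothetical raw tag strings in the SEDE Posts format
raw = pd.Series(["<gradient-descent>",
                 "<machine-learning><time-series><statistics>"])

# Same chain as applied to ds_quests_2020["Tags"]:
# "><" becomes the list separator, then the outer "<" and ">" are stripped
parsed = raw.str.replace("><", ",").str.replace("<", "").\
             str.replace(">", "").str.split(",")
print(parsed[0])  # ['gradient-descent']
print(parsed[1])  # ['machine-learning', 'time-series', 'statistics']
```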
Look at the columns with explicitly numerical values and view their summary statistics using DataFrame.describe():
ds_quests_2020[['Score', 'ViewCount', 'AnswerCount', 'FavoriteCount']].describe()
Score | ViewCount | AnswerCount | FavoriteCount | |
---|---|---|---|---|
count | 7301.000000 | 7301.000000 | 7301.000000 | 7301.000000 |
mean | 0.964936 | 181.570196 | 0.820299 | 0.148062 |
std | 1.491208 | 684.247016 | 0.816292 | 0.508415 |
min | -4.000000 | 2.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 24.000000 | 0.000000 | 0.000000 |
50% | 1.000000 | 41.000000 | 1.000000 | 0.000000 |
75% | 1.000000 | 100.000000 | 1.000000 | 0.000000 |
max | 34.000000 | 19532.000000 | 10.000000 | 11.000000 |
Create a pivot dataframe of tags versus the other metrics in two steps: 1) build a pivot dictionary, 2) convert the dictionary to a dataframe and view it:
# Create dictionary for pivot table of tags and other value
quest_dict_2020_pvt = {}
"""
Conversion label table
tags -> tag_name
tags_count -> count
Score ->score
ViewCount - > view_count
AnswerCount -> answer_count
CommentCount -> comment_count
FavoriteCount -> favorite_count
"""
for i, row in enumerate(ds_quests_2020["Tags"]):
    for tag in row:
        if tag not in quest_dict_2020_pvt:
            quest_dict_2020_pvt[tag] = [0, 0, 0, 0, 0, 0]
        quest_dict_2020_pvt[tag][0] += 1
        quest_dict_2020_pvt[tag][1] += ds_quests_2020.loc[i, "Score"]
        quest_dict_2020_pvt[tag][2] += ds_quests_2020.loc[i, "ViewCount"]
        quest_dict_2020_pvt[tag][3] += ds_quests_2020.loc[i, "AnswerCount"]
        quest_dict_2020_pvt[tag][4] += ds_quests_2020.loc[i, "CommentCount"]
        quest_dict_2020_pvt[tag][5] += ds_quests_2020.loc[i, "FavoriteCount"]
# Convert the dictionary to a dataframe
quest_tags_2020_pvt_df = pd.DataFrame.from_dict(quest_dict_2020_pvt,
orient = "index")
#Reset index
quest_tags_2020_pvt_df.reset_index(inplace=True)
# Rename columns of dataframe
quest_tags_2020_pvt_df = quest_tags_2020_pvt_df.rename(
columns = {"index":"tag_name",
0: "count", 1:"score",
2: "view_count", 3:"answer_count",
4: "comment_count",
5:"favorite_count"})
# Sort by tags count descending
quest_tags_2020_pvt_df = quest_tags_2020_pvt_df.sort_values("count",
ascending=False)
quest_tags_2020_pvt_df.reset_index(inplace=True, drop = True)
# view first 15 rows
quest_tags_2020_pvt_df.head(15)
tag_name | count | score | view_count | answer_count | comment_count | favorite_count | |
---|---|---|---|---|---|---|---|
0 | machine-learning | 2151 | 2282 | 349910 | 1966 | 1833 | 383 |
1 | python | 1373 | 1155 | 372108 | 1192 | 1244 | 175 |
2 | deep-learning | 1035 | 914 | 154857 | 766 | 789 | 146 |
3 | neural-network | 845 | 808 | 122433 | 685 | 648 | 123 |
4 | keras | 670 | 557 | 217626 | 478 | 587 | 105 |
5 | classification | 625 | 671 | 85673 | 585 | 572 | 83 |
6 | tensorflow | 557 | 412 | 154612 | 378 | 362 | 55 |
7 | nlp | 496 | 537 | 96198 | 415 | 375 | 89 |
8 | scikit-learn | 495 | 495 | 134727 | 480 | 460 | 73 |
9 | time-series | 379 | 314 | 43265 | 229 | 255 | 52 |
10 | regression | 354 | 352 | 42460 | 315 | 422 | 61 |
11 | cnn | 349 | 272 | 58067 | 266 | 244 | 41 |
12 | dataset | 302 | 280 | 58780 | 255 | 227 | 47 |
13 | lstm | 299 | 247 | 54998 | 183 | 229 | 40 |
14 | pandas | 266 | 199 | 154169 | 287 | 180 | 30 |
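As a cross-check, the dictionary loop above can also be expressed with pandas explode and groupby; a minimal sketch on a toy frame that mimics the cleaned ds_quests_2020 (the toy values and the reduced column set are assumptions):

```python
import pandas as pd

# Toy frame mimicking the cleaned ds_quests_2020 (Tags already parsed to lists)
df = pd.DataFrame({
    "Tags": [["python", "pandas"], ["python"], ["pandas"]],
    "Score": [3, 1, 2],
    "ViewCount": [100, 50, 20],
})

# One row per (question, tag) pair, then aggregate per tag
exploded = df.explode("Tags")
pivot = exploded.groupby("Tags").agg(
    count=("Score", "size"),
    score=("Score", "sum"),
    view_count=("ViewCount", "sum"),
).reset_index().rename(columns={"Tags": "tag_name"})
pivot = pivot.sort_values("count", ascending=False).reset_index(drop=True)
print(pivot)
```

This avoids the per-row Python loop entirely; the aggregation is performed inside pandas.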
Extract the first 15 rows and plot the correlation matrix for the popular tag_name values.
# Create 'quest_tags_2020_pop_df'
quest_tags_2020_pop_df = quest_tags_2020_pvt_df.iloc[0:15, :].copy()
quest_tags_2020_pop_df.reset_index(inplace=True, drop = True)
# Plot correlation matrix
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(quest_tags_2020_pop_df.corr(), annot = True, cmap = "inferno",
fmt='.4g', square=True,);
ax.set_title('Popular tags value correlation', fontsize=20, fontweight = "bold")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontweight = "bold")
ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontweight = "bold")
ax.tick_params(bottom=False, left=False)
The plot above shows strong correlations, from 0.785 to 0.9938, between the metrics of the top 15 tag_name values. Next, look at the popular tag_name values and how the different metrics are distributed across them.
# Set output columns
plot_columns = ["count", "score", "view_count", "answer_count",
"comment_count", "favorite_count"]
fig = plt.figure(figsize=(13, 15))
for i, column in enumerate(plot_columns):
    ax = fig.add_subplot(3, 2, i+1)
    quest_tags_2020_pop_df = quest_tags_2020_pop_df.sort_values(column,
                                                                ascending=False)
    sns.barplot(x = column, y = "tag_name", data = quest_tags_2020_pop_df)
    quest_tags_2020_pop_df.reset_index(inplace=True, drop=True)
    ax.set_xlabel(column, fontsize=14)
plt.suptitle("Top 15 tags")
plt.tight_layout(pad=2)
This plot shows that the three hottest data science topics of 2020 on the Stack Exchange network were machine-learning, python, and deep-learning (except in the view_count plot).
Next, normalize quest_tags_2020_pop_df and plot it in normalized form.
# Extract numerical values to x
x = quest_tags_2020_pop_df.iloc[:,1:]
# Create a MinMaxScaler instance
min_max_scaler = preprocessing.MinMaxScaler()
# Normalize the values into the range 0-1.0
x_scaled = min_max_scaler.fit_transform(x)
# Create datafame for normalized data
quest_tags_2020_top15_norm = quest_tags_2020_pop_df.copy()
# Assign normalized values for columns 1 - 6
quest_tags_2020_top15_norm.iloc[:,1:] = x_scaled
quest_tags_2020_top15_norm
tag_name | count | score | view_count | answer_count | comment_count | favorite_count | |
---|---|---|---|---|---|---|---|
0 | machine-learning | 1.000000 | 1.000000 | 0.932662 | 1.000000 | 1.000000 | 1.000000 |
1 | python | 0.587268 | 0.458953 | 1.000000 | 0.565900 | 0.643678 | 0.410765 |
2 | deep-learning | 0.407958 | 0.343255 | 0.340961 | 0.326977 | 0.368421 | 0.328612 |
3 | neural-network | 0.307162 | 0.292367 | 0.242601 | 0.281548 | 0.283122 | 0.263456 |
4 | keras | 0.214324 | 0.171867 | 0.531373 | 0.165451 | 0.246219 | 0.212465 |
5 | nlp | 0.122016 | 0.162266 | 0.163016 | 0.130118 | 0.117967 | 0.167139 |
6 | classification | 0.190451 | 0.226596 | 0.131088 | 0.225463 | 0.237145 | 0.150142 |
7 | scikit-learn | 0.121485 | 0.142103 | 0.279896 | 0.166573 | 0.169389 | 0.121813 |
8 | regression | 0.046684 | 0.073452 | 0.000000 | 0.074033 | 0.146400 | 0.087819 |
9 | tensorflow | 0.154377 | 0.102256 | 0.340217 | 0.109366 | 0.110103 | 0.070822 |
10 | time-series | 0.059947 | 0.055209 | 0.002442 | 0.025799 | 0.045372 | 0.062323 |
11 | dataset | 0.019098 | 0.038886 | 0.049507 | 0.040381 | 0.028433 | 0.048159 |
12 | cnn | 0.044032 | 0.035046 | 0.047344 | 0.046551 | 0.038717 | 0.031161 |
13 | lstm | 0.017507 | 0.023044 | 0.038035 | 0.000000 | 0.029643 | 0.028329 |
14 | pandas | 0.000000 | 0.000000 | 0.338874 | 0.058329 | 0.000000 | 0.000000 |
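MinMaxScaler rescales each column via (x - min) / (max - min); a small sketch, on made-up numbers, verifying the transformation used above:

```python
import numpy as np
from sklearn import preprocessing

# Made-up 3x2 matrix, two columns with different ranges
x = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
scaler = preprocessing.MinMaxScaler()
x_scaled = scaler.fit_transform(x)

# Manual formula applied per column: (x - col_min) / (col_max - col_min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
print(np.allclose(x_scaled, manual))  # True
```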
fig = plt.figure(figsize=(12, 12))
for i, column in enumerate(plot_columns):
    sns.scatterplot(data = quest_tags_2020_top15_norm, x = column,
                    y = "tag_name", label = column,
                    s=280, alpha=0.7)
quest_tags_2020_top15_norm.reset_index(inplace=True, drop=True)
plt.xticks(fontsize=15, fontweight = "bold")
plt.yticks(fontsize = 15, fontweight = "bold")
plt.legend(loc=4, fontsize=15)
This plot shows the three leading question topics: machine-learning, python, and deep-learning.
The first step is to run this query:
SELECT Id, Title, CreationDate, DeletionDate, OwnerUserId, Tags, Score, ViewCount, AnswerCount, CommentCount, FavoriteCount FROM Posts WHERE PostTypeId = 1 AND CreationDate < '2021-04-01' ;
Save the result to the file 'all_quests_2021_04_01.csv' and repeat all the steps described above, but this time without a dictionary, using only pandas methods.
# Read CSV and parse date
all_quests = pd.read_csv("all_quests_2021_04_01.csv", parse_dates = ["CreationDate"])
# Remove redundant columns
all_quests = all_quests.drop(["DeletionDate", "Id"], axis = 1).copy()
all_quests.reset_index(inplace=True, drop=True)
# Fill NaN values in FavoriteCount with zero
all_quests["FavoriteCount"] = all_quests["FavoriteCount"].fillna(0)
# Drop NaN values
all_quests.dropna(inplace = True)
# Convert float to int
all_quests["OwnerUserId"] = all_quests["OwnerUserId"].astype("int64")
# Convert tags to list
all_quests["Tags"] = all_quests["Tags"].str.replace("><",",").\
                                        str.replace("<","").\
                                        str.replace(">","").\
                                        str.split(",")
# Combine year and quarter into a single new column
all_quests["year_quart"] = 100 * all_quests["CreationDate"].dt.year +\
                           all_quests["CreationDate"].dt.quarter
# Drop redundant columns
all_quests = all_quests.drop(["Title", "CreationDate",
"OwnerUserId"], axis = 1).copy()
# Rename columns
all_quests = all_quests.rename(columns = {"Tags":"tag_name",
"Score": "score_number",
"ViewCount":"view_number",
"AnswerCount":"answer_number",
"CommentCount":"comment_number",
"FavoriteCount":"favorit_number"})
# Insert tag_number column
all_quests.insert(1, "tag_number", 1)
# Convert list to string for further grouping
all_quests["tag_name"] = all_quests["tag_name"].str.join(" ")
# Make the grouped table (select the columns as a list to avoid the
# multiple-keys FutureWarning)
all_quests_gb = all_quests.groupby(["tag_name","year_quart"])[
    ["tag_number", "score_number",
     "view_number", "answer_number",
     "comment_number", "favorit_number"]].\
    agg("sum")
all_quests_gb.reset_index(inplace=True)
# Define general tag statistics from the site's start to 2021-04-01
# Split tag_name and create a separate dataframe for counting multi-tag questions
tag_stat = all_quests_gb["tag_name"].str.split(" ", expand = True)
tag_stat = tag_stat.rename(columns = {0: "tag_name", 1: "two_tags",
2: "three_tags", 3: "four_tags",
4: "five_tags"})
# Add column for single tags counting
tag_stat.insert(1, column = "one_tag", value = 0)
# Fill indicator values for further grouping
tag_stat["one_tag"] = tag_stat.apply(
    lambda row: 1 if row["two_tags"] is None else 0, axis = 1)
tag_stat_columns = ['two_tags', 'three_tags', 'four_tags', 'five_tags']
for column in tag_stat_columns:
    tag_stat[column] = tag_stat.apply(
        lambda row: 0 if row[column] is None else 1, axis = 1)
tag_stat_columns_gb = ['one_tag', 'two_tags', 'three_tags',
'four_tags', 'five_tags']
#Group table tag_stat
tag_stat_gb = tag_stat.groupby(["tag_name"])[tag_stat_columns_gb].agg("sum")
tag_stat_gb["total_tags"] = tag_stat_gb["one_tag"] +\
tag_stat_gb['two_tags'] +\
tag_stat_gb["three_tags"] +\
tag_stat_gb["four_tags"] +\
tag_stat_gb["five_tags"]
# Sort values for total tags
tag_stat_gb = tag_stat_gb.sort_values("total_tags", ascending = False)
tag_stat_gb.reset_index(inplace = True)
tag_stat_gb.head(10)
tag_name | one_tag | two_tags | three_tags | four_tags | five_tags | total_tags | |
---|---|---|---|---|---|---|---|
0 | machine-learning | 28 | 7618 | 6716 | 4479 | 2300 | 21141 |
1 | python | 24 | 3116 | 2642 | 1613 | 726 | 8121 |
2 | neural-network | 28 | 1582 | 1253 | 752 | 353 | 3968 |
3 | deep-learning | 20 | 1203 | 983 | 598 | 274 | 3078 |
4 | classification | 25 | 971 | 724 | 360 | 143 | 2223 |
5 | keras | 18 | 548 | 432 | 231 | 74 | 1303 |
6 | nlp | 27 | 576 | 401 | 206 | 77 | 1287 |
7 | r | 26 | 636 | 390 | 168 | 56 | 1276 |
8 | scikit-learn | 17 | 434 | 304 | 165 | 64 | 984 |
9 | time-series | 22 | 382 | 250 | 127 | 52 | 833 |
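The apply-based indicator columns above can also be built without apply, by measuring the length of each split tag list; a sketch on toy data (the gb frame and its values are assumptions):

```python
import pandas as pd

# Toy grouped frame: tag_name holds space-joined tag combinations
gb = pd.DataFrame({"tag_name": ["python",
                                "machine-learning python",
                                "machine-learning python scikit-learn"]})

# Number of tags per combination, fully vectorized
n_tags = gb["tag_name"].str.split(" ").str.len()

# Indicator columns analogous to one_tag/two_tags/... in one shot
for k, col in enumerate(["one_tag", "two_tags", "three_tags"], start=1):
    gb[col] = (n_tags == k).astype(int)
print(gb)
```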
The table above shows that in most cases data science questions carry two to four tags and relate to several disciplines at once; a typical question has tags such as machine-learning, python, scikit-learn.
The leaders, in descending order:
tag_stat_top_10 = tag_stat_gb.iloc[0:10, [0,6]]
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
sns.barplot(y="tag_name", x="total_tags",data = tag_stat_top_10);
Extract the rows tagged with machine-learning and prepare the data for plotting.
# Define required string
ml_tag = "machine-learning"
# Extract the rows whose tag_name contains the string
ml_tag_df = all_quests_gb[all_quests_gb["tag_name"].str.contains(ml_tag,
                                                                 regex=False)]
# Define redundant columns and drop them
ml_tag_drop_columns = ['score_number', 'view_number', 'answer_number',
'comment_number', 'favorit_number']
ml_tag_df = ml_tag_df.drop(ml_tag_drop_columns, axis = 1).copy()
# Extract the total tag counts for a later consistency check
ml_tag_check = ml_tag_df.groupby(["year_quart"])[["tag_number"]].agg("sum")
#ml_tag_df = ml_tag_df.drop('tag_number', axis = 1).copy()
# Define new columns for the tag-count dispersion
new_ml_tag_columns =['ml_one_tag', 'ml_two_tags', 'ml_three_tags', 'ml_four_tags',
'ml_five_tags', 'ml_total_tags']
# Add the new columns to the dataframe
for i, column in enumerate(new_ml_tag_columns):
    ml_tag_df.insert(i+3, column, 0)
def tags_number(df, row_name):
    """
    Count the words in each row of `row_name` and copy the value of the
    'tag_number' column into the matching column: one word fills
    "ml_one_tag", two words fill "ml_two_tags", and so on.
    As an experiment, instead of the lambda approach used earlier, values
    are assigned in the dataframe with df.at only.
    """
    for index, row in df.iterrows():
        tags_no = len(row[row_name].split())
        if tags_no == 1:
            df.at[index, "ml_one_tag"] = df.loc[index, "tag_number"]
        if tags_no == 2:
            df.at[index, "ml_two_tags"] = df.loc[index, "tag_number"]
        if tags_no == 3:
            df.at[index, "ml_three_tags"] = df.loc[index, "tag_number"]
        if tags_no == 4:
            df.at[index, "ml_four_tags"] = df.loc[index, "tag_number"]
        if tags_no == 5:
            df.at[index, "ml_five_tags"] = df.loc[index, "tag_number"]
    return df
# Fill the "xxx_tags" columns with the corresponding counts
ml_tag_df = tags_number(ml_tag_df, "tag_name")
# Grouping by year_quart
tag_group_lst = ['ml_one_tag', 'ml_two_tags', 'ml_three_tags', 'ml_four_tags',
'ml_five_tags', 'ml_total_tags']
ml_tag_df = ml_tag_df.groupby(["year_quart"])[tag_group_lst].agg("sum").copy()
# Calculate total ml tags
for column in tag_group_lst[:-1]:
ml_tag_df['ml_total_tags'] += ml_tag_df[column]
# Check final values
ml_tag_df['ml_total_tags'] == ml_tag_check["tag_number"]
year_quart
201402    True
201403    True
201404    True
201501    True
201502    True
201503    True
201504    True
201601    True
201602    True
201603    True
201604    True
201701    True
201702    True
201703    True
201704    True
201801    True
201802    True
201803    True
201804    True
201901    True
201902    True
201903    True
201904    True
202001    True
202002    True
202003    True
202004    True
202101    True
dtype: bool
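The row-by-row tags_number function can also be written vectorized with Series.where; a sketch on a toy frame whose columns mirror ml_tag_df (the values are assumptions):

```python
import pandas as pd

# Toy frame mimicking ml_tag_df before grouping
df = pd.DataFrame({"tag_name": ["machine-learning",
                                "machine-learning python",
                                "machine-learning python keras"],
                   "tag_number": [5, 7, 2]})

# Number of tags per row
n = df["tag_name"].str.split().str.len()
for k, col in enumerate(["ml_one_tag", "ml_two_tags", "ml_three_tags"], start=1):
    # Copy tag_number into the column matching the tag count, else 0
    df[col] = df["tag_number"].where(n == k, 0)
print(df)
```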
# Count total DS questions, grouped by year and quarter
all_quests_gb.reset_index(inplace=True)
all_qusts_total_df = all_quests_gb.groupby(["year_quart"])[["tag_number"]].\
agg("sum")
# Add new column with total DS question number
ml_tag_final_df = ml_tag_df.merge(all_qusts_total_df,
how = "inner",
left_on = "year_quart",
right_on = "year_quart")
# Rename the last column
ml_tag_final_df = ml_tag_final_df.rename(columns = {"tag_number":"total_tags_numbers"})
# Prepare final dataset for preview
ml_tag_final_df.reset_index(inplace = True)
ml_tag_final_df = ml_tag_final_df.sort_values("year_quart", ascending = False)
ml_tag_final_df.reset_index(inplace = True, drop = True)
ml_tag_final_df.head(10)
year_quart | ml_one_tag | ml_two_tags | ml_three_tags | ml_four_tags | ml_five_tags | ml_total_tags | total_tags_numbers | |
---|---|---|---|---|---|---|---|---|
0 | 202101 | 19 | 82 | 117 | 128 | 155 | 501 | 1848 |
1 | 202004 | 17 | 71 | 108 | 99 | 167 | 462 | 1528 |
2 | 202003 | 17 | 54 | 150 | 151 | 166 | 538 | 1750 |
3 | 202002 | 13 | 74 | 161 | 171 | 197 | 616 | 2248 |
4 | 202001 | 18 | 70 | 154 | 173 | 191 | 606 | 1775 |
5 | 201904 | 21 | 66 | 122 | 132 | 144 | 485 | 1489 |
6 | 201903 | 26 | 86 | 163 | 169 | 125 | 569 | 1731 |
7 | 201902 | 22 | 97 | 180 | 145 | 123 | 567 | 1777 |
8 | 201901 | 22 | 108 | 164 | 143 | 130 | 567 | 1715 |
9 | 201804 | 19 | 66 | 118 | 107 | 112 | 422 | 1261 |
ml_tag_final_df = ml_tag_final_df.sort_values("year_quart", ascending = True)
ml_tag_final_df.set_index('year_quart').plot(kind='barh', width = 0.95, figsize = (17, 17));
ml_tag_final_df.set_index('year_quart').plot(kind = "line", figsize = (17, 17));
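The machine learning share per quarter follows directly from dividing ml_total_tags by total_tags_numbers; a sketch using a few figures copied from the table above:

```python
import pandas as pd

# A few rows taken from the ml_tag_final_df head shown earlier
df = pd.DataFrame({"year_quart": [202101, 202004, 202003],
                   "ml_total_tags": [501, 462, 538],
                   "total_tags_numbers": [1848, 1528, 1750]})

# Share of machine learning questions, in percent
df["ml_share_pct"] = (100 * df["ml_total_tags"] / df["total_tags_numbers"]).round(1)
print(df["ml_share_pct"].tolist())
```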
Based on the tables and plots above, the share of machine learning questions is approximately 30-40% of all data science questions; the totals grew quickly from 2014 through 2020 and dropped slightly in 2021. Most machine learning questions are directly connected with several adjacent disciplines, such as keras and tensorflow. Creating content about pure machine learning alone is not profitable, because its share is very small; defining commercially successful content requires additional analysis of the relations with these adjacent disciplines.
Created on Apr 09, 2021
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment: Debian 10.9, Python 3.8.7
Required Python modules:
pandas
numpy
matplotlib.pyplot
seaborn
sklearn
datetime