Notebook

Stack Exchange contains questions and answers that can be up or downvoted. The sites datascience.stackexchange and Stack Overflow look useful for our data science goals.

The tables with the highest number of tag counts would be more promising towards finding the most popular content. The include tags for machine learning and building models. Also Posts, Tags, AnswerCount, CommentCount, FavoriteCount, UpVotes, and PostTags.

In [1]:

#SELECT Id,
 #      CreationDate, 
  #     Score, 
   #    ViewCount, 
    #   Tags, 
     #  AnswerCount, 
      # FavoriteCount
#FROM Posts
#WHERE PostTypeID = 1 AND YEAR(CreationDate) = 2019;

In [2]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:

questions = pd.read_csv("2019_questions.csv", parse_dates = ["CreationDate"])

In [4]:

questions.head()

Out[4]:

	Id	CreationDate	Score	ViewCount	Tags	AnswerCount	FavoriteCount
0	44419	2019-01-23 09:21:13	1	21	<machine-learning><data-mining>	0	NaN
1	44420	2019-01-23 09:34:01	0	25	<machine-learning><regression><linear-regressi...	0	NaN
2	44423	2019-01-23 09:58:41	2	1651	<python><time-series><forecast><forecasting>	0	NaN
3	44427	2019-01-23 10:57:09	0	55	<machine-learning><scikit-learn><pca>	1	NaN
4	44428	2019-01-23 11:02:15	0	19	<dataset><bigdata><data><speech-to-text>	0	NaN

In [5]:

questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB

FavoriteCount is missing values. In order to get them, we'd have to look at 7400 questions. This is not practical. Also, it should be an integer, not a float. The types of the remaining columns are reasonable. In Tags it would be helpful to remove the <> and study the most common tags.

In [6]:

questions.isnull().sum()

Out[6]:

Id                  0
CreationDate        0
Score               0
ViewCount           0
Tags                0
AnswerCount         0
FavoriteCount    7432
dtype: int64

The above confirms that FavoriteCount has 7432 missing values.

In [7]:

questions = questions.fillna(0)

In [8]:

questions.isnull().sum()

Out[8]:

Id               0
CreationDate     0
Score            0
ViewCount        0
Tags             0
AnswerCount      0
FavoriteCount    0
dtype: int64

We have just replaced the null ("NaN") missing FavoriteCount values with 0. Now we will change the data type for this column to an integer.

In [9]:

questions['FavoriteCount'] = questions['FavoriteCount'].astype(int)

In [10]:

questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    8839 non-null int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB

In [11]:

questions['Tags'].head(5)

Out[11]:

0                      <machine-learning><data-mining>
1    <machine-learning><regression><linear-regressi...
2         <python><time-series><forecast><forecasting>
3                <machine-learning><scikit-learn><pca>
4             <dataset><bigdata><data><speech-to-text>
Name: Tags, dtype: object

In [12]:

questions['Tags'] = questions['Tags'].str.replace("^<|>$", "")
questions['Tags'].head(5)

Out[12]:

0                        machine-learning><data-mining
1    machine-learning><regression><linear-regressio...
2           python><time-series><forecast><forecasting
3                  machine-learning><scikit-learn><pca
4               dataset><bigdata><data><speech-to-text
Name: Tags, dtype: object

In [13]:

questions['Tags'] = questions['Tags'].str.split("><")
questions['Tags'].head(5)

Out[13]:

0                      [machine-learning, data-mining]
1    [machine-learning, regression, linear-regressi...
2         [python, time-series, forecast, forecasting]
3                [machine-learning, scikit-learn, pca]
4             [dataset, bigdata, data, speech-to-text]
Name: Tags, dtype: object

In [14]:

questions.head(3)

Out[14]:

	Id	CreationDate	Score	ViewCount	Tags
0	44419	2019-01-23 09:21:13	1	21	[machine-learning, data-mining]
1	44420	2019-01-23 09:34:01	0	25	[machine-learning, regression, linear-regressi...
2	44423	2019-01-23 09:58:41	2	1651	[python, time-series, forecast, forecasting]

In [15]:

num_tags = {}

for tags in questions['Tags']:
    for tag in tags:
        if tag in num_tags:
            num_tags[tag] += 1
        else:
            num_tags[tag] = 1
print(num_tags)

{'data-transfer': 1, 'dataset': 340, 'serialisation': 3, 'data-science-model': 186, 'neural': 16, 'ann': 2, 'ranking': 22, 'gaussian-process': 12, 'json': 10, 'yolo': 21, 'data-leakage': 8, 'classifier': 18, 'bayesian': 40, 'tokenization': 6, 'summarunner-architecture': 1, 'scoring': 12, 'impala': 1, 'kernel': 27, 'kendalls-tau-coefficient': 1, 'speech-to-text': 8, 'binary': 26, 'pgm': 1, 'non-parametric': 3, 'logistic-regression': 154, 'bias': 19, 'methods': 4, 'annotation': 12, 'evaluation': 66, 'javascript': 8, 'programming': 7, 'convolution': 103, 'graphs': 47, 'performance': 27, 'keras': 935, 'sentiment-analysis': 37, 'lightgbm': 23, 'softmax': 24, 'nlg': 9, 'nosql': 3, 'ipython': 18, 'data-analysis': 71, 'management': 2, 'automatic-summarization': 10, 'representation': 9, 'linear-regression': 175, 'hierarchical-data-format': 7, 'statistics': 234, 'feature-selection': 209, 'mse': 8, 'categorical-data': 81, 'sematic-similarity': 2, 'torch': 4, 'loss-function': 161, 'ensemble': 7, 'pooling': 4, 'stacked-lstm': 7, 'data': 213, 'beginner': 27, 'code': 5, 'hinge-loss': 7, 'reference-request': 18, 'kitti-dataset': 1, 'multivariate-distribution': 1, 'multi-instance-learning': 2, 'efficiency': 2, 'mini-batch-gradient-descent': 10, 'distribution': 57, 'cross-validation': 139, 'marginal-effects': 1, 'learning': 10, 'books': 7, 'siamese': 1, 'parameter-estimation': 6, 'usecase': 2, 'pyspark': 40, 'categorical-encoding': 3, 'web-scrapping': 8, 'normalization': 74, 'heatmap': 9, 'orange': 64, 'cnn': 489, 'k-means': 81, 'label-flipping': 1, 'self-driving': 3, 'meta-learning': 3, 'smote': 27, 'r': 268, 'activity-recognition': 5, 'matplotlib': 77, 'search-engine': 4, 'anonymization': 3, 'classification': 685, 'matrix-factorisation': 24, 'project-planning': 6, 'text': 41, 'counts': 3, 'homework': 4, 'mutual-information': 5, 'nl2sql': 1, 'dialog-flow': 2, 'mongodb': 2, 'google-cloud': 1, 'infographics': 2, 'feature-construction': 16, 'lda-classifier': 1, 'similarity': 72, 'education': 3, 'multilabel-classification': 92, 'hurdle-model': 1, 'structured-data': 5, 'transfer-learning': 69, 'groupby': 2, 'collinearity': 6, 'libsvm': 1, 'vgg16': 21, 'data-product': 3, 'forecasting': 85, 'information-theory': 9, 'open-set': 2, 'sensors': 5, 'version-control': 1, 'anaconda': 20, 'objective-function': 4, 'churn': 15, 'manifold': 1, 'gan': 85, 'word-embeddings': 117, 'stanford-nlp': 9, 'search': 5, 'markov-process': 14, 'gpu': 42, 'proximal-svm': 1, 'statsmodels': 1, 'evolutionary-algorithms': 11, 'adaboost': 1, 'dqn': 36, 'data-mining': 217, 'features': 32, 'nn': 1, 'audio-recognition': 25, 'embeddings': 44, 'colab': 18, 'accuracy': 89, 'forecast': 34, 'databases': 29, 'automation': 4, 'mean-shift': 2, 'hardware': 12, 'cost-function': 25, 'theano': 4, 'career': 9, 'google-prediction-api': 2, 'gbm': 10, 'text-classification': 1, 'epochs': 11, 'grid-search': 35, 'time': 5, 'computer-vision': 121, 'probability-calibration': 11, 'image-segmentation': 3, 'tableau': 9, 'convergence': 17, 'mnist': 23, 'anova': 2, 'pipelines': 17, 'weka': 19, 'social-network-analysis': 11, 'keras-rl': 6, 'graphical-model': 3, 'haar-cascade': 1, 'rstudio': 15, 'imbalanced-learn': 21, 'linear-algebra': 24, 'refit-model': 1, 'score': 14, 'unbalanced-classes': 42, 'market-basket-analysis': 12, 'weighted-data': 14, 'nlp': 493, 'parameter': 5, 'openai-gpt': 2, 'python-3.x': 13, 'ndcg': 5, 'error-handling': 17, 'chatbot': 14, 'vae': 14, 'rnn': 149, 'gmm': 2, 'apache-spark': 35, 'pac-learning': 6, 'distance': 44, 'data-indexing-techniques': 1, 'image': 32, 'tsne': 15, 'parquet': 1, 'policy-gradients': 27, 'ab-test': 6, 'confusion-matrix': 27, 'rmsle': 1, 'perceptron': 26, 'geospatial': 27, 'software-development': 2, 'unseen-data': 1, 'image-preprocessing': 67, 'octave': 4, 'discriminant-analysis': 5, 'cause-effect-relations': 1, 'autoencoder': 106, 'scikit-learn': 540, 'linux': 5, 'monte-carlo': 15, 'pattern-recognition': 1, 'convnet': 111, 'rbm': 4, 'predictive-modeling': 265, 'c': 4, 'dplyr': 6, 'tools': 8, 'redshift': 1, 'smotenc': 4, 'backpropagation': 65, 'variance': 35, 'clustering': 257, 'association-rules': 19, 'methodology': 10, 'rmse': 1, 'jupyter': 41, 'twitter': 8, 'pandas': 354, 'finetuning': 7, 'least-squares-svm': 1, 'parallel': 8, 'time-series': 466, 'model-selection': 58, 'multi-output': 7, 'genetic-programming': 2, 'lbp': 2, 'text-mining': 113, 'coursera': 3, 'wolfram-language': 3, 'deep-network': 29, 'apache-hadoop': 13, 'gaussian': 20, 'lstm': 402, 'label-smoothing': 1, 'reinforcement-learning': 203, 'noisification': 1, 'sas': 6, 'markov': 4, 'ensemble-modeling': 30, 'momentum': 3, 'self-study': 8, 'optimization': 124, 'vector-space-models': 7, 'preprocessing': 120, 'estimators': 8, 'crawling': 3, 'data-imputation': 16, 'scalability': 4, 'descriptive-statistics': 21, 'matrix': 22, 'rbf': 5, 'feature-engineering': 163, 'research': 11, 'unsupervised-learning': 110, 'bioinformatics': 4, 'inception': 10, 'generative-models': 46, 'q-learning': 37, 'neural-style-transfer': 8, 'scraping': 5, 'pca': 85, 'movielens': 2, 'probabilistic-programming': 9, 'dynamic-programming': 3, 'hive': 2, 'markov-hidden-model': 13, 'dbscan': 18, 'cloud-computing': 9, 'history': 1, 'pathfinder': 1, 'activation-function': 44, 'sequence-to-sequence': 35, 'sagemaker': 8, 'learning-to-rank': 6, 'experiments': 3, 'clusters': 10, 'online-learning': 13, 'machine-learning': 2693, 'definitions': 4, 'data-augmentation': 24, 'competitions': 2, 'goss': 1, 'ngrams': 7, 'predict': 3, 'randomized-algorithms': 6, 'knime': 1, 'sequence': 25, 'ai': 25, 'hog': 1, 'feature-reduction': 4, 'terminology': 16, 'interpolation': 6, 'algorithms': 68, 'james-stein-encoder': 1, 'expectation-maximization': 5, 'regex': 8, 'orange3': 20, 'open-source': 1, 'sql': 29, 'word2vec': 88, 'finance': 17, 'ibm-watson': 1, 'siamese-networks': 4, 'gridsearchcv': 28, 'cs231n': 1, 'consumerweb': 1, 'pearsons-correlation-coefficient': 2, 'natural-language-process': 124, 'noise': 17, 'neural-network': 1055, 'decision-trees': 145, 'stemming': 2, 'class-imbalance': 73, 'training': 148, 'math': 37, 'svr': 5, 'state-of-the-art': 1, 'dirichlet': 4, 'pytorch': 175, 'relational-dbms': 7, 'indexing': 6, 'attention-mechanism': 26, 'metadata': 2, 'machine-learning-model': 224, 'fuzzy-classification': 3, 'overfitting': 69, 'text-filter': 2, 'labels': 28, 'ml': 7, 'pickle': 9, 'matlab': 62, 'aggregation': 12, 'data.table': 4, 'frequentist': 1, 'xboost': 1, 'cloud': 6, 'openai-gym': 17, 'outlier': 48, 'doc2vec': 3, 'opencv': 39, 'visualization': 126, 'ridge-regression': 7, 'sports': 3, 'domain-adaptation': 3, 'encoder': 1, 'text-generation': 17, 'google': 17, 'inceptionresnetv2': 6, 'data-wrangling': 15, 'julia': 2, 'implementation': 9, 'bert': 64, 'svm': 136, 'probability': 76, 'deepmind': 7, 'hyperparameter-tuning': 59, 'k-nn': 50, 'faster-rcnn': 38, 'huggingface': 2, 'mathematics': 17, 'image-recognition': 86, 'semi-supervised-learning': 18, 'vc-theory': 5, 'plotting': 32, 'h2o': 4, 'ensemble-learning': 11, 'bayesian-nonparametric': 2, 'wikipedia': 1, 'numpy': 117, 'nltk': 43, 'paperspace': 1, 'non-convex': 1, 'sparsity': 2, 'one-hot-encoding': 4, 'transformer': 45, 'nvidia': 7, 'object-recognition': 14, 'c++': 1, 'causalimpact': 2, 'hyperparameter': 42, 'one-shot-learning': 2, 'caffe': 7, 'theory': 11, 'historgram': 7, 'arima': 11, 'activation': 1, 'recurrent-neural-net': 91, 'question-answering': 4, 'naive-bayes-classifier': 42, 'regression': 347, 'categories': 2, 'game': 7, 'bayesian-networks': 12, 'aws-lambda': 2, 'fastai': 6, 'recommender-system': 103, 'learning-rate': 8, 'glm': 3, 'sequential-pattern-mining': 17, 'manhattan': 3, 'software-recommendation': 4, 'kaggle': 43, 'pip': 4, 'pytorch-geometric': 2, 'java': 14, 'data-stream-mining': 4, 'parsing': 3, 'helmert-coding': 1, 'ggplot2': 3, 'generalization': 12, 'active-learning': 4, 'marketing': 6, 'word': 2, 'xgboost': 165, 'corpus': 1, '.net': 1, '3d-reconstruction': 9, 'lasso': 8, 'machine-translation': 28, 'mlp': 34, 'correlation': 80, 'dump': 1, 'rdkit': 1, 'python': 1814, 'reshape': 9, 'feature-map': 2, 'data-cleaning': 157, 'survival-analysis': 10, 'data-formats': 9, 'density-estimation': 3, 'multitask-learning': 7, 'counter-inference': 1, 'named-entity-recognition': 36, 'alex-net': 5, 'simulation': 11, 'amazon-ml': 1, 'encoding': 54, 'numerical': 6, 'allennlp': 2, 'aws': 20, '3d-object-detection': 1, 'anomaly-detection': 92, 'regularization': 50, 'mcmc': 4, 'gensim': 36, 'genetic-algorithms': 16, 'supervised-learning': 82, 'predictor-importance': 9, 'bayes-error': 1, 'explainable-ai': 10, 'anomaly': 4, 'discounted-reward': 5, 'etl': 6, 'excel': 24, 'finite-precision': 2, 'topic-model': 31, 'networkx': 2, 'cosine-distance': 21, 'privacy': 6, 'library': 2, 'lda': 27, 'spyder': 1, 'map-reduce': 3, 'jaccard-coefficient': 4, 'feature-extraction': 87, 'batch-normalization': 29, 'auc': 3, 'weight-initialization': 12, 'spss': 2, 'exploitation': 1, 'distributed': 7, 'image-size': 6, 'dropout': 15, 'seaborn': 38, 'csv': 27, 'random-forest': 159, 'prediction': 128, 'dummy-variables': 19, 'gru': 1, 'scipy': 40, 'powerbi': 10, 'sampling': 38, 'actor-critic': 21, 'information-retrieval': 32, 'spearmans-rank-correlation': 1, 'boosting': 49, 'multiclass-classification': 131, 'processing': 5, 'fuzzy-logic': 13, 'similar-documents': 20, 'scala': 9, 'image-classification': 211, 'pruning': 3, 'deep-learning': 1220, 'apache-nifi': 1, 'missing-data': 43, 'tensorflow': 584, 'spacy': 20, 'automl': 2, 'azure-ml': 12, 'language-model': 25, 'object-detection': 109, 'bigdata': 95, 'difference': 5, 'tfidf': 31, 'tesseract': 3, 'community': 1, 'gradient-descent': 98, 'dimensionality-reduction': 69, 'dataframe': 81, 'ocr': 26, 'feature-scaling': 59, 'normal-equation': 1, 'notation': 4, 'metric': 60}

In [16]:

no_of_tags = pd.DataFrame.from_dict(num_tags, orient='index')
print(no_of_tags.head())

                      0
data-transfer         1
dataset             340
serialisation         3
data-science-model  186
neural               16

In [17]:

times_tag_used = no_of_tags.sort_values([0])#times_tag_used is no_of_tags, 
                                            #I have just renamed it
times_tag_used.rename(columns={0:"Times Tag Used"}, inplace=True)
print(times_tag_used)                        

                            Times Tag Used
data-transfer                            1
spyder                                   1
rmse                                     1
pattern-recognition                      1
cause-effect-relations                   1
exploitation                             1
unseen-data                              1
rmsle                                    1
parquet                                  1
data-indexing-techniques                 1
refit-model                              1
haar-cascade                             1
text-classification                      1
nn                                       1
adaboost                                 1
statsmodels                              1
proximal-svm                             1
gru                                      1
manifold                                 1
version-control                          1
libsvm                                   1
hurdle-model                             1
lda-classifier                           1
google-cloud                             1
nl2sql                                   1
spearmans-rank-correlation               1
label-flipping                           1
siamese                                  1
least-squares-svm                        1
marginal-effects                         1
...                                    ...
feature-engineering                    163
xgboost                                165
linear-regression                      175
pytorch                                175
data-science-model                     186
reinforcement-learning                 203
feature-selection                      209
image-classification                   211
data                                   213
data-mining                            217
machine-learning-model                 224
statistics                             234
clustering                             257
predictive-modeling                    265
r                                      268
dataset                                340
regression                             347
pandas                                 354
lstm                                   402
time-series                            466
cnn                                    489
nlp                                    493
scikit-learn                           540
tensorflow                             584
classification                         685
keras                                  935
neural-network                        1055
deep-learning                         1220
python                                1814
machine-learning                      2693

[526 rows x 1 columns]

In [18]:

top_times_tags_used = times_tag_used.sort_values(by="Times Tag Used").tail(20)
top_times_tags_used

Out[18]:

	Times Tag Used
machine-learning-model	224
statistics	234
clustering	257
predictive-modeling	265
r	268
dataset	340
regression	347
pandas	354
lstm	402
time-series	466
cnn	489
nlp	493
scikit-learn	540
tensorflow	584
classification	685
keras	935
neural-network	1055
deep-learning	1220
python	1814
machine-learning	2693

The above are the top most used tags. Let us see how this looks graphically.

In [19]:

import matplotlib.pyplot as plt
import numpy as np
%magic inline

top_times_tags_used.plot(kind='barh', figsize=(16,8))
plt.show()

In [20]:

num_views = {}
for index, row in questions.iterrows():
    for tag in row['Tags']:
        if tag in num_views:
            num_views[tag] += row["ViewCount"]
    else:
            num_views[tag] = row["ViewCount"]
print(num_views)

{'dataset': 422, 'serialisation': 464, 'data-science-model': 32, 'neural': 21, 'ann': 11, 'ranking': 41, 'gaussian-process': 51, 'json': 15, 'yolo': 101, 'data-leakage': 99, 'classifier': 22, 'bayesian': 81, 'tokenization': 29, 'summarunner-architecture': 45, 'scoring': 2110, 'kernel': 216, 'kendalls-tau-coefficient': 474, 'speech-to-text': 24, 'binary': 31, 'pgm': 19, 'non-parametric': 32, 'logistic-regression': 48, 'bias': 30, 'methods': 20, 'annotation': 14, 'evaluation': 218, 'javascript': 110, 'programming': 12, 'convolution': 87, 'graphs': 130, 'performance': 61, 'keras': 756, 'sentiment-analysis': 177, 'data-augmentation': 41, 'softmax': 72, 'nlg': 80, 'nosql': 33, 'ipython': 1907, 'data-analysis': 81, 'management': 19, 'automatic-summarization': 12, 'representation': 29, 'supervised-learning': 19, 'hierarchical-data-format': 30, 'statistics': 20, 'feature-selection': 136, 'mse': 13, 'categorical-data': 21, 'sematic-similarity': 36, 'tools': 275, 'pac-learning': 164, 'ensemble': 55, 'pooling': 43, 'stacked-lstm': 54, 'data': 1641, 'beginner': 70, 'code': 189, 'hinge-loss': 8, 'reference-request': 29, 'kitti-dataset': 7, 'multivariate-distribution': 4, 'multi-instance-learning': 63, 'efficiency': 19, 'mini-batch-gradient-descent': 55, 'distribution': 21, 'cross-validation': 134, 'marginal-effects': 12, 'learning': 80, 'books': 36, 'siamese': 65, 'parameter-estimation': 32, 'usecase': 26, 'parallel': 36, 'categorical-encoding': 42, 'web-scrapping': 87, 'normalization': 140, 'heatmap': 13, 'orange': 234, 'cnn': 217, 'k-means': 49, 'label-flipping': 480, 'self-driving': 23, 'meta-learning': 101, 'smote': 98, 'r': 130, 'activity-recognition': 14, 'matplotlib': 718, 'search-engine': 48, 'anonymization': 117, 'pandas': 933, 'classification': 29, 'matrix-factorisation': 8, 'project-planning': 29, 'text': 48, 'counts': 57, 'homework': 177, 'mutual-information': 11, 'nl2sql': 14, 'dialog-flow': 29, 'mongodb': 25, 'google-cloud': 19, 'infographics': 42, 'feature-construction': 18, 'lda-classifier': 68, 'similarity': 5, 'image-recognition': 33, 'multilabel-classification': 29, 'structured-data': 130, 'transfer-learning': 246, 'groupby': 15, 'collinearity': 37, 'libsvm': 54, 'education': 31, 'vgg16': 206, 'data-product': 186, 'forecasting': 70, 'information-theory': 15, 'open-set': 17, 'sensors': 8, 'networkx': 14, 'version-control': 94, 'clustering': 98, 'objective-function': 11, 'churn': 35, 'manifold': 80, 'reinforcement-learning': 119, 'markov': 105, 'stanford-nlp': 31, 'search': 55, 'markov-process': 49, 'gpu': 89, 'statsmodels': 9, 'evolutionary-algorithms': 253, 'adaboost': 23, 'dqn': 30, 'data-mining': 1506, 'features': 10, 'nn': 115, 'audio-recognition': 34, 'embeddings': 63, 'colab': 228, 'state-of-the-art': 17, 'forecast': 65, 'numpy': 612, 'automation': 15, 'mean-shift': 19, 'hardware': 41, 'cost-function': 61, 'theano': 22, 'career': 45, 'google-prediction-api': 10, 'sequence': 11, 'text-classification': 32, 'epochs': 16, 'grid-search': 2324, 'time': 31, 'computer-vision': 75, 'probability-calibration': 89, 'image-segmentation': 32, 'tableau': 75, 'convergence': 38, 'mnist': 108, 'anova': 49, 'lda': 21, 'weka': 64, 'social-network-analysis': 135, 'keras-rl': 230, 'graphical-model': 28, 'haar-cascade': 16, 'rstudio': 15, 'imbalanced-learn': 143, 'linear-algebra': 39, 'refit-model': 37, 'score': 69, 'unbalanced-classes': 42, 'market-basket-analysis': 11, 'weighted-data': 88, 'nlp': 970, 'parameter': 33, 'openai-gpt': 20, 'python-3.x': 41, 'ndcg': 140, 'error-handling': 134, 'chatbot': 17, 'vae': 104, 'rnn': 115, 'gmm': 35, 'apache-spark': 33, 'loss-function': 105, 'distance': 78, 'image': 117, 'tsne': 105, 'parquet': 17, 'policy-gradients': 38, 'ab-test': 54, 'confusion-matrix': 43, 'perceptron': 50, 'geospatial': 10, 'software-development': 562, 'unseen-data': 35, 'image-preprocessing': 319, 'weight-initialization': 113, 'octave': 165, 'discriminant-analysis': 86, 'cause-effect-relations': 42, 'autoencoder': 158, 'scikit-learn': 428, 'linux': 192, 'monte-carlo': 3210, 'pattern-recognition': 23, 'convnet': 857, 'rbm': 43, 'predictive-modeling': 1796, 'c': 21, 'dplyr': 22, 'self-study': 42, 'redshift': 19, 'smotenc': 55, 'backpropagation': 463, 'variance': 91, 'anaconda': 1113, 'association-rules': 169, 'methodology': 31, 'rmse': 17, 'jupyter': 2006, 'twitter': 34, 'openai-gym': 471, 'finetuning': 28, 'least-squares-svm': 110, 'image-size': 33, 'model-selection': 64, 'multi-output': 171, 'genetic-programming': 21, 'lbp': 4, 'text-mining': 59, 'coursera': 110, 'wolfram-language': 42, 'deep-network': 79, 'machine-learning-model': 50, 'gaussian': 40, 'lstm': 73, 'gan': 19, 'noisification': 9, 'sas': 18, 'word-embeddings': 97, 'random-forest': 17, 'momentum': 14, 'optimization': 58, 'vector-space-models': 13, 'preprocessing': 568, 'estimators': 19, 'crawling': 36, 'data-imputation': 8, 'scalability': 94, 'descriptive-statistics': 8, 'matrix': 147, 'rbf': 70, 'feature-engineering': 71, 'research': 242, 'unsupervised-learning': 35, 'bioinformatics': 29, 'inception': 10, 'data-indexing-techniques': 47, 'q-learning': 139, 'neural-style-transfer': 13, 'scraping': 23, 'pca': 102, 'movielens': 34, 'probabilistic-programming': 10, 'dynamic-programming': 758, 'hive': 31, 'markov-hidden-model': 56, 'dbscan': 189, 'history': 20, 'pathfinder': 266, 'activation-function': 177, 'sequence-to-sequence': 104, 'sagemaker': 41, 'learning-to-rank': 27, 'experiments': 41, 'clusters': 63, 'online-learning': 24, 'machine-learning': 6839, 'definitions': 64, 'lightgbm': 17, 'competitions': 15, 'goss': 32, 'ngrams': 10, 'predict': 42, 'randomized-algorithms': 1195, 'knime': 36, 'gbm': 12, 'ai': 106, 'hog': 18, 'feature-reduction': 189, 'terminology': 22, 'algorithms': 13, 'james-stein-encoder': 26, 'expectation-maximization': 39, 'causalimpact': 49, 'regex': 184, 'orange3': 35, 'sql': 70, 'word2vec': 852, 'finance': 39, 'cloud-computing': 36, 'siamese-networks': 38, 'gridsearchcv': 32, 'consumerweb': 24, 'pearsons-correlation-coefficient': 83, 'natural-language-process': 16, 'noise': 57, 'neural-network': 2764, 'decision-trees': 616, 'stemming': 25, 'class-imbalance': 44, 'training': 175, 'math': 114, 'svr': 167, 'accuracy': 9, 'dirichlet': 134, 'pytorch': 33, 'relational-dbms': 22, 'indexing': 3465, 'attention-mechanism': 37, 'metadata': 130, 'apache-hadoop': 12, 'fuzzy-classification': 5, 'overfitting': 55, 'text-filter': 17, 'labels': 459, 'ml': 21, 'pickle': 98, 'matlab': 31, 'aggregation': 20, 'data.table': 78, 'frequentist': 11, 'xboost': 71, 'cloud': 15, 'outlier': 25, 'doc2vec': 14, 'opencv': 206, 'visualization': 245, 'ridge-regression': 30, 'sports': 12, 'domain-adaptation': 48, 'encoder': 16, 'text-generation': 23, 'google': 18, 'inceptionresnetv2': 29, 'data-wrangling': 29, 'julia': 7, 'implementation': 54, 'bert': 476, 'svm': 1231, 'probability': 65, 'deepmind': 21, 'hyperparameter-tuning': 28, 'k-nn': 168, 'faster-rcnn': 48, 'huggingface': 37, 'mathematics': 58, 'interpolation': 25, 'semi-supervised-learning': 28, 'vc-theory': 85, 'plotting': 21, 'h2o': 10, 'ensemble-learning': 32, 'bayesian-nonparametric': 23, 'wikipedia': 23, 'databases': 43, 'nltk': 98, 'paperspace': 12, 'sparsity': 14, 'one-hot-encoding': 36, 'spacy': 190, 'nvidia': 94, 'object-recognition': 229, 'c++': 24, 'impala': 19, 'hyperparameter': 227, 'one-shot-learning': 54, 'caffe': 11, 'ensemble-modeling': 145, 'theory': 40, 'historgram': 59, 'arima': 24, 'activation': 15, 'recurrent-neural-net': 10, 'question-answering': 21, 'naive-bayes-classifier': 52, 'regression': 159, 'categories': 13, 'game': 22, 'bayesian-networks': 28, 'aws-lambda': 964, 'fastai': 18, 'recommender-system': 195, 'learning-rate': 5, 'glm': 11, 'sequential-pattern-mining': 85, 'manhattan': 43, 'software-recommendation': 18, 'kaggle': 426, 'pip': 30, 'pytorch-geometric': 51, 'java': 249, 'data-stream-mining': 14, 'parsing': 152, 'ggplot2': 140, 'generalization': 99, 'active-learning': 25, 'marketing': 21, 'word': 118, 'xgboost': 559, '.net': 438, '3d-reconstruction': 9, 'lasso': 38, 'machine-translation': 21, 'generative-models': 127, 'correlation': 16, 'python': 1646, 'predictor-importance': 34, 'feature-map': 22, 'data-cleaning': 280, 'survival-analysis': 31, 'data-formats': 99, 'density-estimation': 54, 'multitask-learning': 35, 'counter-inference': 58, 'named-entity-recognition': 68, 'alex-net': 15, 'simulation': 292, 'amazon-ml': 35, 'encoding': 18, 'numerical': 401, 'allennlp': 171, 'aws': 175, 'anomaly-detection': 36, 'regularization': 74, 'mcmc': 22, 'gensim': 35, 'genetic-algorithms': 30, 'linear-regression': 214, 'reshape': 54, 'bayes-error': 128, 'explainable-ai': 117, 'anomaly': 67, 'discounted-reward': 16, 'etl': 148, 'excel': 19, 'finite-precision': 11, 'topic-model': 28, 'torch': 5, 'cosine-distance': 129, 'privacy': 11, 'library': 10, 'pipelines': 19, 'spyder': 43, 'map-reduce': 45, 'jaccard-coefficient': 176, 'feature-extraction': 69, 'batch-normalization': 919, 'auc': 38, 'mlp': 24, 'spss': 17, 'exploitation': 12, 'distributed': 146, 'time-series': 22, 'dropout': 135, 'seaborn': 212, 'csv': 2103, 'pyspark': 988, 'prediction': 45, 'dummy-variables': 117, 'gru': 16, 'scipy': 12, 'powerbi': 16, 'sampling': 23, 'actor-critic': 173, 'information-retrieval': 37, 'boosting': 62, 'multiclass-classification': 589, 'processing': 7, 'fuzzy-logic': 316, 'similar-documents': 37, 'scala': 22, 'image-classification': 25, 'pruning': 7, 'deep-learning': 1138, 'apache-nifi': 119, 'missing-data': 9, 'tensorflow': 16, 'transformer': 26, 'automl': 12, 'azure-ml': 11, 'language-model': 22, 'object-detection': 358, 'bigdata': 74, 'difference': 103, 'tfidf': 91, 'tesseract': 34, 'community': 42, 'gradient-descent': 852, 'dimensionality-reduction': 604, 'dataframe': 549, 'ocr': 222, 'feature-scaling': 46, 'normal-equation': 103, 'notation': 76, 'metric': 159}

In [21]:

num_views = pd.DataFrame.from_dict(num_views, orient='index')
num_views.rename(columns={0: "Times Tags Viewed"}, inplace=True)
print(num_views)

                           Times Tags Viewed
dataset                                  422
serialisation                            464
data-science-model                        32
neural                                    21
ann                                       11
ranking                                   41
gaussian-process                          51
json                                      15
yolo                                     101
data-leakage                              99
classifier                                22
bayesian                                  81
tokenization                              29
summarunner-architecture                  45
scoring                                 2110
kernel                                   216
kendalls-tau-coefficient                 474
speech-to-text                            24
binary                                    31
pgm                                       19
non-parametric                            32
logistic-regression                       48
bias                                      30
methods                                   20
annotation                                14
evaluation                               218
javascript                               110
programming                               12
convolution                               87
graphs                                   130
...                                      ...
boosting                                  62
multiclass-classification                589
processing                                 7
fuzzy-logic                              316
similar-documents                         37
scala                                     22
image-classification                      25
pruning                                    7
deep-learning                           1138
apache-nifi                              119
missing-data                               9
tensorflow                                16
transformer                               26
automl                                    12
azure-ml                                  11
language-model                            22
object-detection                         358
bigdata                                   74
difference                               103
tfidf                                     91
tesseract                                 34
community                                 42
gradient-descent                         852
dimensionality-reduction                 604
dataframe                                549
ocr                                      222
feature-scaling                           46
normal-equation                          103
notation                                  76
metric                                   159

[511 rows x 1 columns]

In [22]:

top_views = num_views.sort_values(by="Times Tags Viewed").tail(20)
top_views

Out[22]:

	Times Tags Viewed
aws-lambda	964
nlp	970
pyspark	988
anaconda	1113
deep-learning	1138
randomized-algorithms	1195
svm	1231
data-mining	1506
data	1641
python	1646
predictive-modeling	1796
ipython	1907
jupyter	2006
csv	2103
scoring	2110
grid-search	2324
neural-network	2764
monte-carlo	3210
indexing	3465
machine-learning	6839

In [23]:

top_views.plot(kind='barh', figsize=(16,8))
plt.show()

Seeing the plots side by side we have the following.

In [24]:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,8))
top_times_tags_used.plot(kind='barh', ax=axes[0], subplots=True)
top_views.plot(kind='barh', ax=axes[1], subplots=True)
plt.show()

Among the top ten tags are: python, machine-learning, deep-learning, neural-network, keras, tensorflow, classification, and scikit-learn. Let's look at how these may be related. Deep learning is an area of machine learning.Neural networks are used in modeling often complex data in a way that mimics the brain. Keras is a neural network library written in Python. TensorFlow is an open source machine larning library. Scikit-Learn is a Python module used for machine learning, among other things. Classification is used in supervised learning, a part of the machine learning process. So we see the interconnectedness of these tags.

In [25]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [26]:

all_quests = pd.read_csv("all_questions.csv", parse_dates=["CreationDate"])
all_quests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21576 entries, 0 to 21575
Data columns (total 3 columns):
Id              21576 non-null int64
CreationDate    21576 non-null datetime64[ns]
Tags            21576 non-null object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 505.8+ KB

In [27]:

all_quests['Tags'] = all_quests['Tags'].str.replace("^<|>$", "").str.split("><")
print(all_quests['Tags'].head(5))

0        [python, keras, tensorflow, cnn, probability]
1                                     [neural-network]
2                        [python, ibm-watson, chatbot]
3                                              [keras]
4    [r, predictive-modeling, machine-learning-mode...
Name: Tags, dtype: object

In [28]:

quest_tags = {}

for tags in all_quests['Tags']:
    for tag in tags:
        if tag in quest_tags:
            quest_tags[tag] += 1
        else:
            quest_tags[tag] = 1
print(quest_tags)

{'data-transfer': 1, 'dataset': 893, 'serialisation': 3, 'data-science-model': 259, 'pyro': 1, 'neural': 17, 'gbm': 48, 'gaussian-process': 13, 'json': 23, 'yolo': 39, 'data-leakage': 13, 'classifier': 40, 'bayesian': 76, 'summarunner-architecture': 1, 'scoring': 31, 'regex': 22, 'neo4j': 10, 'kernel': 51, 'kendalls-tau-coefficient': 1, 'speech-to-text': 14, 'binary': 68, 'pgm': 6, 'generative-models': 85, 'non-parametric': 3, 'haar-cascade': 1, 'logistic-regression': 406, 'dynamic-programming': 5, 'education': 31, '3d-reconstruction': 9, 'gaussian': 51, 'javascript': 24, 'programming': 35, 'preprocessing': 272, 'graphs': 139, 'performance': 110, 'keras': 1750, 'sentiment-analysis': 118, 'data-augmentation': 48, 'softmax': 24, 'nlg': 10, 'nosql': 21, 'ipython': 45, 'data-analysis': 106, 'energy': 4, 'management': 7, 'automatic-summarization': 17, 'representation': 16, 'supervised-learning': 235, 'hierarchical-data-format': 20, 'statistics': 650, 'feature-selection': 551, 'apache-spark': 208, 'categorical-data': 237, 'sematic-similarity': 3, 'torch': 12, 'loss-function': 277, 'ensemble': 19, 'pooling': 4, 'stacked-lstm': 15, 'data': 476, 'metaheuristics': 1, 'beginner': 125, 'code': 5, 'hinge-loss': 8, 'reference-request': 65, 'kitti-dataset': 1, 'sequence-to-sequence': 58, 'efficiency': 27, 'mini-batch-gradient-descent': 28, 'distribution': 98, 'cross-validation': 334, 'marginal-effects': 1, 'learning': 24, 'books': 27, 'siamese': 1, 'parameter-estimation': 33, 'usecase': 7, 'parallel': 32, 'categorical-encoding': 4, 'genetic': 4, 'normalization': 158, 'similar-documents': 39, 'orange': 189, 'cnn': 815, 'k-means': 265, 'label-flipping': 1, 'self-driving': 5, 'meta-learning': 8, 'smote': 44, 'r': 1119, 'activity-recognition': 6, 'matplotlib': 108, 'search-engine': 12, 'anonymization': 8, 'pandas': 736, 'matrix-factorisation': 46, 'project-planning': 10, 'text': 97, 'counts': 7, 'mutual-information': 13, 'nl2sql': 1, 'dialog-flow': 2, 'mongodb': 13, 'google-cloud': 1, 'infographics': 7, 'feature-construction': 54, 'lda-classifier': 6, 'similarity': 195, 'image-recognition': 200, 'multilabel-classification': 197, 'hurdle-model': 1, 'structured-data': 5, 'transfer-learning': 99, 'groupby': 2, 'apache-pig': 7, 'collinearity': 10, 'libsvm': 12, 'vgg16': 28, 'data-product': 9, 'forecasting': 132, 'information-theory': 28, 'open-set': 2, 'sensors': 7, 'version-control': 5, 'clustering': 846, 'objective-function': 11, 'churn': 29, 'methods': 14, 'gan': 151, 'word-embeddings': 253, 'stanford-nlp': 40, 'lsi': 5, 'search': 40, 'markov-process': 50, 'neural-style-transfer': 15, 'proximal-svm': 1, 'statsmodels': 3, 'evolutionary-algorithms': 12, 'adaboost': 4, 'dqn': 63, 'data-mining': 947, 'features': 33, 'nn': 1, 'gridsearchcv': 35, 'embeddings': 73, 'colab': 23, 'accuracy': 186, 'databases': 81, 'ann': 3, 'automation': 6, 'mean-shift': 2, 'cost-function': 55, 'theano': 46, 'feature-engineering': 333, 'google-prediction-api': 7, 'hyperparameter': 101, 'text-classification': 4, 'epochs': 17, 'markov-hidden-model': 25, 'non-convex': 1, 'ai': 38, 'stata': 4, 'freebase': 3, 'time': 6, 'computer-vision': 313, 'probability-calibration': 18, 'image-segmentation': 8, 'linearly-separable': 1, 'tableau': 26, 'convergence': 27, 'mnist': 54, 'anova': 1, 'lda': 83, 'multivariate-distribution': 3, 'gradient-descent': 253, 'keras-rl': 10, 'julia': 9, 'gru': 5, 'rstudio': 39, 'imbalanced-learn': 21, 'rmse': 1, 'optimization': 290, 'refit-model': 1, 'score': 20, 'web-scrapping': 8, 'knowledge-base': 16, 'unbalanced-classes': 148, 'market-basket-analysis': 23, 'weighted-data': 39, 'nlp': 1170, 'ngboost': 1, 'parameter': 27, 'openai-gpt': 3, 'python-3.x': 20, 'ndcg': 6, 'error-handling': 36, 'chatbot': 15, 'vae': 16, 'rnn': 404, 'gmm': 2, 'mse': 8, 'pac-learning': 11, 'distance': 104, 'missing-data': 110, 'image': 36, 'tsne': 44, 'parquet': 1, 'policy-gradients': 41, 'homework': 11, 'confusion-matrix': 72, 'rmsle': 1, 'perceptron': 59, 'geospatial': 51, 'software-development': 14, 'unseen-data': 2, 'image-preprocessing': 71, 'octave': 14, 'discriminant-analysis': 12, 'cause-effect-relations': 1, 'autoencoder': 201, 'scikit-learn': 1307, 'linux': 9, 'monte-carlo': 21, 'pattern-recognition': 1, 'convnet': 426, 'rbm': 31, 'predictive-modeling': 817, 'c': 12, 'dplyr': 11, 'tokenization': 11, 'redshift': 5, 'smotenc': 4, 'backpropagation': 198, 'variance': 60, 'anaconda': 38, 'association-rules': 54, 'methodology': 26, 'bias': 35, 'jupyter': 85, 'twitter': 17, 'ranking': 66, 'finetuning': 8, 'time-series': 1005, 'model-selection': 141, 'multi-output': 8, 'genetic-programming': 3, 'dropout': 49, 'text-mining': 472, 'class-imbalance': 133, 'glorot-initialization': 2, 'wolfram-language': 3, 'deep-network': 41, 'machine-learning-model': 336, 'annotation': 16, 'lstm': 694, 'manifold': 6, 'label-smoothing': 1, 'reinforcement-learning': 413, 'noisification': 3, 'sas': 17, 'markov': 12, 'random-forest': 463, 'momentum': 5, 'self-study': 36, 'reductions': 2, 'online-learning': 45, 'vector-space-models': 17, 'convolution': 210, 'estimators': 17, 'crawling': 11, 'data-imputation': 56, 'scalability': 26, 'descriptive-statistics': 48, 'matrix': 30, 'rbf': 6, 'library': 12, 'research': 36, 'unsupervised-learning': 271, 'bioinformatics': 17, 'inception': 39, 'data-indexing-techniques': 5, 'q-learning': 95, 'kaggle': 75, 'gpu': 97, 'scraping': 33, 'pca': 193, 'movielens': 2, 'probabilistic-programming': 14, 'career': 49, 'consumerweb': 6, 'classification': 1899, 'dbscan': 41, 'spectral-clustering': 1, 'history': 4, 'pathfinder': 1, 'activation-function': 84, 'openai-gym': 23, 'game': 14, 'learning-to-rank': 12, 'experiments': 21, 'clusters': 39, 'ibm-watson': 2, 'machine-learning': 6969, 'definitions': 28, 'lightgbm': 25, 'competitions': 5, 'goss': 1, 'ngrams': 17, 'predict': 3, 'randomized-algorithms': 15, 'knime': 1, 'sequence': 68, 'grid-search': 54, 'least-squares-svm': 1, 'feature-reduction': 6, 'terminology': 58, 'anomaly': 8, 'algorithms': 308, 'james-stein-encoder': 1, 'heatmap': 11, 'expectation-maximization': 16, 'forecast': 93, 'demographic-data': 1, 'orange3': 23, 'sql': 73, 'bayesian-neural-network': 1, 'word2vec': 244, 'finance': 38, 'cloud-computing': 17, 'siamese-networks': 6, 'audio-recognition': 64, 'cs231n': 3, 'hog': 3, 'pearsons-correlation-coefficient': 3, 'natural-language-process': 182, 'lbp': 2, 'noise': 26, 'neural-network': 2939, 'apache-kafka': 3, 'tools': 56, 'decision-trees': 430, 'stemming': 2, 'coursera': 4, 'training': 321, 'spatial-transformer': 2, 'object-recognition': 51, 'svr': 9, 'state-of-the-art': 6, 'dirichlet': 6, 'natural-gradient-boosting': 1, 'pytorch': 239, 'relational-dbms': 11, 'indexing': 13, 'attention-mechanism': 32, 'metadata': 10, 'apache-hadoop': 110, 'fuzzy-classification': 3, 'overfitting': 147, 'text-filter': 6, 'labels': 65, 'ml': 8, 'extreme-learning-machine': 1, 'pickle': 9, 'matlab': 144, 'aggregation': 28, 'data.table': 13, 'frequentist': 1, 'xboost': 1, 'cloud': 9, 'hbase': 2, 'ggplot2': 21, 'doc2vec': 4, 'opencv': 52, 'visualization': 421, 'ridge-regression': 7, 'sports': 6, 'domain-adaptation': 9, 'encoder': 2, 'text-generation': 30, 'google': 31, 'inceptionresnetv2': 6, 'data-wrangling': 34, 'implementation': 9, 'bert': 68, 'svm': 389, 'probability': 184, 'deepmind': 8, 'hyperparameter-tuning': 99, 'k-nn': 88, 'f1score': 1, 'faster-rcnn': 46, 'huggingface': 2, 'mathematics': 19, 'interpolation': 15, 'semi-supervised-learning': 40, 'vc-theory': 11, 'plotting': 88, 'h2o': 4, 'ensemble-learning': 21, 'bayesian-nonparametric': 2, 'wikipedia': 1, 'numpy': 193, 'nltk': 105, 'paperspace': 1, 'featurization': 5, 'hive': 11, 'sparsity': 2, 'one-hot-encoding': 6, 'transformer': 45, 'nvidia': 13, 'math': 51, 'c++': 1, 'causalimpact': 2, 'caffe': 22, 'one-shot-learning': 3, 'ensemble-modeling': 106, 'theory': 28, 'historgram': 11, 'arima': 12, 'activation': 1, 'recurrent-neural-net': 151, 'question-answering': 6, 'naive-bayes-classifier': 120, 'regression': 869, 'categories': 2, 'sagemaker': 8, 'bayesian-networks': 40, 'aws-lambda': 2, 'fastai': 6, 'recommender-system': 310, 'learning-rate': 16, 'glm': 19, 'sequential-pattern-mining': 43, 'manhattan': 3, 'software-recommendation': 32, 'language-model': 49, 'pip': 9, 'pytorch-geometric': 2, 'data-stream-mining': 14, 'parsing': 23, 'helmert-coding': 1, 'outlier': 125, 'explainable-ai': 11, 'active-learning': 9, 'marketing': 23, 'word': 6, 'separable': 1, 'xgboost': 366, 'corpus': 1, '.net': 6, 'java': 58, 'lasso': 8, 'apache-mahout': 16, 'mlp': 53, 'correlation': 207, 'dump': 1, 'machine-translation': 51, 'python': 3937, 'reshape': 17, 'feature-map': 2, 'tflearn': 9, 'data-cleaning': 444, 'survival-analysis': 26, 'data-formats': 29, 'impala': 1, 'density-estimation': 3, 'multitask-learning': 19, 'counter-inference': 1, 'named-entity-recognition': 84, 'alex-net': 10, 'handwritten': 1, 'simulation': 22, 'amazon-ml': 10, 'rdkit': 1, 'encoding': 94, 'evaluation': 156, 'allennlp': 2, 'aws': 40, '3d-object-detection': 1, 'anomaly-detection': 205, 'regularization': 105, 'mcmc': 6, 'gensim': 78, 'genetic-algorithms': 38, 'linear-regression': 439, 'predictor-importance': 16, 'sparse': 1, 'bayes-error': 6, 'generalization': 19, 'object-detection': 155, 'discounted-reward': 5, 'rattle': 3, 'etl': 12, 'excel': 51, 'finite-precision': 3, 'topic-model': 92, 'networkx': 2, 'cosine-distance': 47, 'privacy': 9, 'pipelines': 23, 'spyder': 1, 'map-reduce': 25, 'numerical': 18, 'jaccard-coefficient': 8, 'feature-extraction': 271, 'batch-normalization': 43, 'auc': 6, 'weight-initialization': 19, 'spss': 5, 'exploitation': 1, 'distributed': 35, 'image-size': 5, 'graphical-model': 22, 'seaborn': 59, 'csv': 65, 'pyspark': 101, 'boosting': 68, 'pybrain': 3, 'dummy-variables': 20, 'scipy': 52, 'powerbi': 18, 'sampling': 110, 'actor-critic': 25, 'information-retrieval': 104, 'spearmans-rank-correlation': 1, 'hardware': 18, 'tranformation': 4, 'multiclass-classification': 294, 'processing': 17, 'fuzzy-logic': 26, 'weka': 55, 'scala': 41, 'image-classification': 461, 'pruning': 3, 'deep-learning': 2805, 'apache-nifi': 1, 'open-source': 16, 'social-network-analysis': 80, 'tensorflow': 1229, 'spacy': 21, 'automl': 4, 'azure-ml': 35, 'data-engineering': 2, 'linear-algebra': 42, 'bigdata': 414, 'difference': 7, 'tfidf': 53, 'multi-instance-learning': 2, 'tesseract': 3, 'community': 3, 'ab-test': 29, 'dimensionality-reduction': 178, 'dataframe': 157, 'prediction': 265, 'ocr': 47, 'feature-scaling': 132, 'normal-equation': 6, 'notation': 9, 'metric': 95}

In [29]:

quest_tags = pd.DataFrame.from_dict(num_tags, orient='index')
quest_tags.rename(columns={0: "Times Tags Used"}, inplace=True)
print(quest_tags)

                           Times Tags Used
data-transfer                            1
dataset                                340
serialisation                            3
data-science-model                     186
neural                                  16
ann                                      2
ranking                                 22
gaussian-process                        12
json                                    10
yolo                                    21
data-leakage                             8
classifier                              18
bayesian                                40
tokenization                             6
summarunner-architecture                 1
scoring                                 12
impala                                   1
kernel                                  27
kendalls-tau-coefficient                 1
speech-to-text                           8
binary                                  26
pgm                                      1
non-parametric                           3
logistic-regression                    154
bias                                    19
methods                                  4
annotation                              12
evaluation                              66
javascript                               8
programming                              7
...                                    ...
boosting                                49
multiclass-classification              131
processing                               5
fuzzy-logic                             13
similar-documents                       20
scala                                    9
image-classification                   211
pruning                                  3
deep-learning                         1220
apache-nifi                              1
missing-data                            43
tensorflow                             584
spacy                                   20
automl                                   2
azure-ml                                12
language-model                          25
object-detection                       109
bigdata                                 95
difference                               5
tfidf                                   31
tesseract                                3
community                                1
gradient-descent                        98
dimensionality-reduction                69
dataframe                               81
ocr                                     26
feature-scaling                         59
normal-equation                          1
notation                                 4
metric                                  60

[526 rows x 1 columns]

In [30]:

top_qtags_used = quest_tags.sort_values(by="Times Tags Used").tail(20)
top_qtags_used

Out[30]:

	Times Tags Used
machine-learning-model	224
statistics	234
clustering	257
predictive-modeling	265
r	268
dataset	340
regression	347
pandas	354
lstm	402
time-series	466
cnn	489
nlp	493
scikit-learn	540
tensorflow	584
classification	685
keras	935
neural-network	1055
deep-learning	1220
python	1814
machine-learning	2693

In [31]:

top_qtags_used.plot(kind='barh', figsize=(10,7))
plt.show()

Of the top tags used over time, deep learning tags should include: deep learning, machine learning, neural network,classification, TensorFlow, SciKit Learn,and cnn. These are the top tags used that are related to deep learning. If we look at the questions asked per year that have these tags, we can track interest in deep learning.

In [32]:

def categorize(tag):
    for tags in all_quests['Tags']:
        if "deep-learning" in tags:
            return "Deep Learning"
        elif "machine-learning" in tags:
            return "Deep Learning"
        elif "neural-network" in tags:
            return "Deep Learning"
        elif "classification" in tags:
            return "Deep Learning"
        elif "tensorflow" in tags:
            return "Deep Learning"
        elif "sciKit-learn" in tags:
            return "Deep Learning"
        elif "cnn" in tags:
            return "Deep Learning"
        else:
            return "None"
                            
year = all_quests["CreationDate"].dt.year
all_quests['Year'] = year
all_quests["category"] = all_quests["Tags"].apply(categorize)
tps = all_quests.pivot_table(index=all_quests['Year'], 
      columns=all_quests['category'], aggfunc='size')
print(tps)                            
                            

category  Deep Learning
Year                   
2014                562
2015               1167
2016               2146
2017               2957
2018               5475
2019               8810
2020                459

As we can see, the tags for deep learning increased yearly. The exception is 2020. Since we are still in 2020, we can assume that the data for 2020 is incomplete and we will exclude it.

In [33]:

tps = all_quests[all_quests["CreationDate"].dt.year < 2020]

In [34]:

tps = tps.pivot_table(index=all_quests['Year'], 
      columns=tps['category'], aggfunc='size')
print(tps)

category  Deep Learning
Year                   
2014                562
2015               1167
2016               2146
2017               2957
2018               5475
2019               8810

In [35]:

tps.plot(kind='bar', figsize=(10,7), title="Deep Learning Tags Per Year")
plt.grid(b=None)
plt.show()

Now we will count the total questions per year, and the percentage of deep learning questions there were per year.

In [36]:

yearly = tps.groupby('Year').agg({"Deep Learning": ['sum', 'size']})
yearly.columns = ["Deep Learning Questions", "Total Questions"]
yearly["Deep Learning Rate"] = yearly["Deep Learning Questions"]\
                               /yearly["Total Questions"]
yearly.reset_index(inplace=True)
print(yearly)

   Year  Deep Learning Questions  Total Questions  Deep Learning Rate
0  2014                      562                1               562.0
1  2015                     1167                1              1167.0
2  2016                     2146                1              2146.0
3  2017                     2957                1              2957.0
4  2018                     5475                1              5475.0
5  2019                     8810                1              8810.0

In [ ]: