Stack Exchange contains questions and answers that can be upvoted or downvoted. The sites datascience.stackexchange and Stack Overflow look useful for our data science goals.
The tables with the highest tag counts look the most promising for finding the most popular content. They include tags for machine learning and building models. Also relevant are Posts, Tags, AnswerCount, CommentCount, FavoriteCount, UpVotes, and PostTags.
#SELECT Id,
# CreationDate,
# Score,
# ViewCount,
# Tags,
# AnswerCount,
# FavoriteCount
#FROM Posts
#WHERE PostTypeID = 1 AND YEAR(CreationDate) = 2019;
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
questions = pd.read_csv("2019_questions.csv", parse_dates = ["CreationDate"])
questions.head()
| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | <machine-learning><data-mining> | 0 | NaN |
1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | <machine-learning><regression><linear-regressi... | 0 | NaN |
2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | <python><time-series><forecast><forecasting> | 0 | NaN |
3 | 44427 | 2019-01-23 10:57:09 | 0 | 55 | <machine-learning><scikit-learn><pca> | 1 | NaN |
4 | 44428 | 2019-01-23 11:02:15 | 0 | 19 | <dataset><bigdata><data><speech-to-text> | 0 | NaN |
questions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB
FavoriteCount has missing values: only 1407 of its 8839 rows are non-null. To recover the rest, we'd have to look at more than 7400 questions individually, which is not practical. The column should also be an integer, not a float. The types of the remaining columns are reasonable. In Tags it would be helpful to remove the <> delimiters and study the most common tags.
questions.isnull().sum()
Id                  0
CreationDate        0
Score               0
ViewCount           0
Tags                0
AnswerCount         0
FavoriteCount    7432
dtype: int64
The above confirms that FavoriteCount has 7432 missing values.
questions = questions.fillna(0)
questions.isnull().sum()
Id               0
CreationDate     0
Score            0
ViewCount        0
Tags             0
AnswerCount      0
FavoriteCount    0
dtype: int64
We have just replaced the missing ("NaN") FavoriteCount values with 0. Next we change the data type of this column to integer.
questions['FavoriteCount'] = questions['FavoriteCount'].astype(int)
questions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    8839 non-null int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB
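The fill-and-cast steps above can also be limited to the one column that needs them. The following is a minimal sketch on a made-up three-row frame (the name `toy` and its values are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Toy stand-in for the questions table; the values are made up.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "FavoriteCount": [float("nan"), 2.0, float("nan")],
})

# Fill only FavoriteCount with 0, then cast that column to an integer dtype.
toy["FavoriteCount"] = toy["FavoriteCount"].fillna(0).astype(int)

print(toy.dtypes)
```

Filling a single column avoids silently overwriting missing values that might appear in other columns later.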
questions['Tags'].head(5)
0    <machine-learning><data-mining>
1    <machine-learning><regression><linear-regressi...
2    <python><time-series><forecast><forecasting>
3    <machine-learning><scikit-learn><pca>
4    <dataset><bigdata><data><speech-to-text>
Name: Tags, dtype: object
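# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
questions['Tags'] = questions['Tags'].str.replace("^<|>$", "", regex=True)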
questions['Tags'].head(5)
0    machine-learning><data-mining
1    machine-learning><regression><linear-regressio...
2    python><time-series><forecast><forecasting
3    machine-learning><scikit-learn><pca
4    dataset><bigdata><data><speech-to-text
Name: Tags, dtype: object
questions['Tags'] = questions['Tags'].str.split("><")
questions['Tags'].head(5)
0    [machine-learning, data-mining]
1    [machine-learning, regression, linear-regressi...
2    [python, time-series, forecast, forecasting]
3    [machine-learning, scikit-learn, pca]
4    [dataset, bigdata, data, speech-to-text]
Name: Tags, dtype: object
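To see what the two string operations above do to a single value, here is a plain-Python sketch. The helper name `parse_tags` is hypothetical and is not used elsewhere in this notebook:

```python
import re


def parse_tags(raw):
    """Turn a raw tag string such as '<a><b>' into the list ['a', 'b']."""
    # Drop the leading '<' and the trailing '>' ...
    stripped = re.sub(r"^<|>$", "", raw)
    # ... then split on the '><' that separates neighbouring tags.
    return stripped.split("><")


print(parse_tags("<machine-learning><data-mining>"))
```

This mirrors the pandas steps one value at a time, which can make it easier to check the pattern before applying it to the whole column.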
questions.head(3)
| | Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
0 | 44419 | 2019-01-23 09:21:13 | 1 | 21 | [machine-learning, data-mining] | 0 | 0 |
1 | 44420 | 2019-01-23 09:34:01 | 0 | 25 | [machine-learning, regression, linear-regressi... | 0 | 0 |
2 | 44423 | 2019-01-23 09:58:41 | 2 | 1651 | [python, time-series, forecast, forecasting] | 0 | 0 |
num_tags = {}
for tags in questions['Tags']:
    for tag in tags:
        if tag in num_tags:
            num_tags[tag] += 1
        else:
            num_tags[tag] = 1
print(num_tags)
{'data-transfer': 1, 'dataset': 340, 'serialisation': 3, 'data-science-model': 186, 'neural': 16, 'ann': 2, 'ranking': 22, 'gaussian-process': 12, 'json': 10, 'yolo': 21, 'data-leakage': 8, 'classifier': 18, 'bayesian': 40, 'tokenization': 6, 'summarunner-architecture': 1, 'scoring': 12, 'impala': 1, 'kernel': 27, 'kendalls-tau-coefficient': 1, 'speech-to-text': 8, 'binary': 26, 'pgm': 1, 'non-parametric': 3, 'logistic-regression': 154, 'bias': 19, 'methods': 4, 'annotation': 12, 'evaluation': 66, 'javascript': 8, 'programming': 7, 'convolution': 103, 'graphs': 47, 'performance': 27, 'keras': 935, 'sentiment-analysis': 37, 'lightgbm': 23, 'softmax': 24, 'nlg': 9, 'nosql': 3, 'ipython': 18, 'data-analysis': 71, 'management': 2, 'automatic-summarization': 10, 'representation': 9, 'linear-regression': 175, 'hierarchical-data-format': 7, 'statistics': 234, 'feature-selection': 209, 'mse': 8, 'categorical-data': 81, 'sematic-similarity': 2, 'torch': 4, 'loss-function': 161, 'ensemble': 7, 'pooling': 4, 'stacked-lstm': 7, 'data': 213, 'beginner': 27, 'code': 5, 'hinge-loss': 7, 'reference-request': 18, 'kitti-dataset': 1, 'multivariate-distribution': 1, 'multi-instance-learning': 2, 'efficiency': 2, 'mini-batch-gradient-descent': 10, 'distribution': 57, 'cross-validation': 139, 'marginal-effects': 1, 'learning': 10, 'books': 7, 'siamese': 1, 'parameter-estimation': 6, 'usecase': 2, 'pyspark': 40, 'categorical-encoding': 3, 'web-scrapping': 8, 'normalization': 74, 'heatmap': 9, 'orange': 64, 'cnn': 489, 'k-means': 81, 'label-flipping': 1, 'self-driving': 3, 'meta-learning': 3, 'smote': 27, 'r': 268, 'activity-recognition': 5, 'matplotlib': 77, 'search-engine': 4, 'anonymization': 3, 'classification': 685, 'matrix-factorisation': 24, 'project-planning': 6, 'text': 41, 'counts': 3, 'homework': 4, 'mutual-information': 5, 'nl2sql': 1, 'dialog-flow': 2, 'mongodb': 2, 'google-cloud': 1, 'infographics': 2, 'feature-construction': 16, 'lda-classifier': 1, 'similarity': 72, 
'education': 3, 'multilabel-classification': 92, 'hurdle-model': 1, 'structured-data': 5, 'transfer-learning': 69, 'groupby': 2, 'collinearity': 6, 'libsvm': 1, 'vgg16': 21, 'data-product': 3, 'forecasting': 85, 'information-theory': 9, 'open-set': 2, 'sensors': 5, 'version-control': 1, 'anaconda': 20, 'objective-function': 4, 'churn': 15, 'manifold': 1, 'gan': 85, 'word-embeddings': 117, 'stanford-nlp': 9, 'search': 5, 'markov-process': 14, 'gpu': 42, 'proximal-svm': 1, 'statsmodels': 1, 'evolutionary-algorithms': 11, 'adaboost': 1, 'dqn': 36, 'data-mining': 217, 'features': 32, 'nn': 1, 'audio-recognition': 25, 'embeddings': 44, 'colab': 18, 'accuracy': 89, 'forecast': 34, 'databases': 29, 'automation': 4, 'mean-shift': 2, 'hardware': 12, 'cost-function': 25, 'theano': 4, 'career': 9, 'google-prediction-api': 2, 'gbm': 10, 'text-classification': 1, 'epochs': 11, 'grid-search': 35, 'time': 5, 'computer-vision': 121, 'probability-calibration': 11, 'image-segmentation': 3, 'tableau': 9, 'convergence': 17, 'mnist': 23, 'anova': 2, 'pipelines': 17, 'weka': 19, 'social-network-analysis': 11, 'keras-rl': 6, 'graphical-model': 3, 'haar-cascade': 1, 'rstudio': 15, 'imbalanced-learn': 21, 'linear-algebra': 24, 'refit-model': 1, 'score': 14, 'unbalanced-classes': 42, 'market-basket-analysis': 12, 'weighted-data': 14, 'nlp': 493, 'parameter': 5, 'openai-gpt': 2, 'python-3.x': 13, 'ndcg': 5, 'error-handling': 17, 'chatbot': 14, 'vae': 14, 'rnn': 149, 'gmm': 2, 'apache-spark': 35, 'pac-learning': 6, 'distance': 44, 'data-indexing-techniques': 1, 'image': 32, 'tsne': 15, 'parquet': 1, 'policy-gradients': 27, 'ab-test': 6, 'confusion-matrix': 27, 'rmsle': 1, 'perceptron': 26, 'geospatial': 27, 'software-development': 2, 'unseen-data': 1, 'image-preprocessing': 67, 'octave': 4, 'discriminant-analysis': 5, 'cause-effect-relations': 1, 'autoencoder': 106, 'scikit-learn': 540, 'linux': 5, 'monte-carlo': 15, 'pattern-recognition': 1, 'convnet': 111, 'rbm': 4, 'predictive-modeling': 
265, 'c': 4, 'dplyr': 6, 'tools': 8, 'redshift': 1, 'smotenc': 4, 'backpropagation': 65, 'variance': 35, 'clustering': 257, 'association-rules': 19, 'methodology': 10, 'rmse': 1, 'jupyter': 41, 'twitter': 8, 'pandas': 354, 'finetuning': 7, 'least-squares-svm': 1, 'parallel': 8, 'time-series': 466, 'model-selection': 58, 'multi-output': 7, 'genetic-programming': 2, 'lbp': 2, 'text-mining': 113, 'coursera': 3, 'wolfram-language': 3, 'deep-network': 29, 'apache-hadoop': 13, 'gaussian': 20, 'lstm': 402, 'label-smoothing': 1, 'reinforcement-learning': 203, 'noisification': 1, 'sas': 6, 'markov': 4, 'ensemble-modeling': 30, 'momentum': 3, 'self-study': 8, 'optimization': 124, 'vector-space-models': 7, 'preprocessing': 120, 'estimators': 8, 'crawling': 3, 'data-imputation': 16, 'scalability': 4, 'descriptive-statistics': 21, 'matrix': 22, 'rbf': 5, 'feature-engineering': 163, 'research': 11, 'unsupervised-learning': 110, 'bioinformatics': 4, 'inception': 10, 'generative-models': 46, 'q-learning': 37, 'neural-style-transfer': 8, 'scraping': 5, 'pca': 85, 'movielens': 2, 'probabilistic-programming': 9, 'dynamic-programming': 3, 'hive': 2, 'markov-hidden-model': 13, 'dbscan': 18, 'cloud-computing': 9, 'history': 1, 'pathfinder': 1, 'activation-function': 44, 'sequence-to-sequence': 35, 'sagemaker': 8, 'learning-to-rank': 6, 'experiments': 3, 'clusters': 10, 'online-learning': 13, 'machine-learning': 2693, 'definitions': 4, 'data-augmentation': 24, 'competitions': 2, 'goss': 1, 'ngrams': 7, 'predict': 3, 'randomized-algorithms': 6, 'knime': 1, 'sequence': 25, 'ai': 25, 'hog': 1, 'feature-reduction': 4, 'terminology': 16, 'interpolation': 6, 'algorithms': 68, 'james-stein-encoder': 1, 'expectation-maximization': 5, 'regex': 8, 'orange3': 20, 'open-source': 1, 'sql': 29, 'word2vec': 88, 'finance': 17, 'ibm-watson': 1, 'siamese-networks': 4, 'gridsearchcv': 28, 'cs231n': 1, 'consumerweb': 1, 'pearsons-correlation-coefficient': 2, 'natural-language-process': 124, 'noise': 17, 
'neural-network': 1055, 'decision-trees': 145, 'stemming': 2, 'class-imbalance': 73, 'training': 148, 'math': 37, 'svr': 5, 'state-of-the-art': 1, 'dirichlet': 4, 'pytorch': 175, 'relational-dbms': 7, 'indexing': 6, 'attention-mechanism': 26, 'metadata': 2, 'machine-learning-model': 224, 'fuzzy-classification': 3, 'overfitting': 69, 'text-filter': 2, 'labels': 28, 'ml': 7, 'pickle': 9, 'matlab': 62, 'aggregation': 12, 'data.table': 4, 'frequentist': 1, 'xboost': 1, 'cloud': 6, 'openai-gym': 17, 'outlier': 48, 'doc2vec': 3, 'opencv': 39, 'visualization': 126, 'ridge-regression': 7, 'sports': 3, 'domain-adaptation': 3, 'encoder': 1, 'text-generation': 17, 'google': 17, 'inceptionresnetv2': 6, 'data-wrangling': 15, 'julia': 2, 'implementation': 9, 'bert': 64, 'svm': 136, 'probability': 76, 'deepmind': 7, 'hyperparameter-tuning': 59, 'k-nn': 50, 'faster-rcnn': 38, 'huggingface': 2, 'mathematics': 17, 'image-recognition': 86, 'semi-supervised-learning': 18, 'vc-theory': 5, 'plotting': 32, 'h2o': 4, 'ensemble-learning': 11, 'bayesian-nonparametric': 2, 'wikipedia': 1, 'numpy': 117, 'nltk': 43, 'paperspace': 1, 'non-convex': 1, 'sparsity': 2, 'one-hot-encoding': 4, 'transformer': 45, 'nvidia': 7, 'object-recognition': 14, 'c++': 1, 'causalimpact': 2, 'hyperparameter': 42, 'one-shot-learning': 2, 'caffe': 7, 'theory': 11, 'historgram': 7, 'arima': 11, 'activation': 1, 'recurrent-neural-net': 91, 'question-answering': 4, 'naive-bayes-classifier': 42, 'regression': 347, 'categories': 2, 'game': 7, 'bayesian-networks': 12, 'aws-lambda': 2, 'fastai': 6, 'recommender-system': 103, 'learning-rate': 8, 'glm': 3, 'sequential-pattern-mining': 17, 'manhattan': 3, 'software-recommendation': 4, 'kaggle': 43, 'pip': 4, 'pytorch-geometric': 2, 'java': 14, 'data-stream-mining': 4, 'parsing': 3, 'helmert-coding': 1, 'ggplot2': 3, 'generalization': 12, 'active-learning': 4, 'marketing': 6, 'word': 2, 'xgboost': 165, 'corpus': 1, '.net': 1, '3d-reconstruction': 9, 'lasso': 8, 
'machine-translation': 28, 'mlp': 34, 'correlation': 80, 'dump': 1, 'rdkit': 1, 'python': 1814, 'reshape': 9, 'feature-map': 2, 'data-cleaning': 157, 'survival-analysis': 10, 'data-formats': 9, 'density-estimation': 3, 'multitask-learning': 7, 'counter-inference': 1, 'named-entity-recognition': 36, 'alex-net': 5, 'simulation': 11, 'amazon-ml': 1, 'encoding': 54, 'numerical': 6, 'allennlp': 2, 'aws': 20, '3d-object-detection': 1, 'anomaly-detection': 92, 'regularization': 50, 'mcmc': 4, 'gensim': 36, 'genetic-algorithms': 16, 'supervised-learning': 82, 'predictor-importance': 9, 'bayes-error': 1, 'explainable-ai': 10, 'anomaly': 4, 'discounted-reward': 5, 'etl': 6, 'excel': 24, 'finite-precision': 2, 'topic-model': 31, 'networkx': 2, 'cosine-distance': 21, 'privacy': 6, 'library': 2, 'lda': 27, 'spyder': 1, 'map-reduce': 3, 'jaccard-coefficient': 4, 'feature-extraction': 87, 'batch-normalization': 29, 'auc': 3, 'weight-initialization': 12, 'spss': 2, 'exploitation': 1, 'distributed': 7, 'image-size': 6, 'dropout': 15, 'seaborn': 38, 'csv': 27, 'random-forest': 159, 'prediction': 128, 'dummy-variables': 19, 'gru': 1, 'scipy': 40, 'powerbi': 10, 'sampling': 38, 'actor-critic': 21, 'information-retrieval': 32, 'spearmans-rank-correlation': 1, 'boosting': 49, 'multiclass-classification': 131, 'processing': 5, 'fuzzy-logic': 13, 'similar-documents': 20, 'scala': 9, 'image-classification': 211, 'pruning': 3, 'deep-learning': 1220, 'apache-nifi': 1, 'missing-data': 43, 'tensorflow': 584, 'spacy': 20, 'automl': 2, 'azure-ml': 12, 'language-model': 25, 'object-detection': 109, 'bigdata': 95, 'difference': 5, 'tfidf': 31, 'tesseract': 3, 'community': 1, 'gradient-descent': 98, 'dimensionality-reduction': 69, 'dataframe': 81, 'ocr': 26, 'feature-scaling': 59, 'normal-equation': 1, 'notation': 4, 'metric': 60}
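The counting loop above can also be written with `collections.Counter` from the standard library. This is a small sketch on made-up tag lists (`tag_lists` is an assumed stand-in for `questions['Tags']`):

```python
from collections import Counter

# Toy tag lists standing in for questions['Tags']; the sample is made up.
tag_lists = [
    ["machine-learning", "data-mining"],
    ["machine-learning", "python"],
    ["python"],
]

# Flatten the lists and count every tag in one pass.
tag_counts = Counter(tag for tags in tag_lists for tag in tags)

print(tag_counts.most_common(2))
```

A `Counter` is a dict subclass, so it can be passed to `pd.DataFrame.from_dict` in the same way as `num_tags`.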
no_of_tags = pd.DataFrame.from_dict(num_tags, orient='index')
print(no_of_tags.head())
                      0
data-transfer         1
dataset             340
serialisation         3
data-science-model  186
neural               16
# times_tag_used is no_of_tags sorted in ascending order; the column is renamed on the next line
times_tag_used = no_of_tags.sort_values(by=0)
times_tag_used.rename(columns={0:"Times Tag Used"}, inplace=True)
print(times_tag_used)
                            Times Tag Used
data-transfer               1
spyder                      1
rmse                        1
pattern-recognition         1
cause-effect-relations      1
exploitation                1
unseen-data                 1
rmsle                       1
parquet                     1
data-indexing-techniques    1
refit-model                 1
haar-cascade                1
text-classification         1
nn                          1
adaboost                    1
statsmodels                 1
proximal-svm                1
gru                         1
manifold                    1
version-control             1
libsvm                      1
hurdle-model                1
lda-classifier              1
google-cloud                1
nl2sql                      1
spearmans-rank-correlation  1
label-flipping              1
siamese                     1
least-squares-svm           1
marginal-effects            1
...                         ...
feature-engineering         163
xgboost                     165
linear-regression           175
pytorch                     175
data-science-model          186
reinforcement-learning      203
feature-selection           209
image-classification        211
data                        213
data-mining                 217
machine-learning-model      224
statistics                  234
clustering                  257
predictive-modeling         265
r                           268
dataset                     340
regression                  347
pandas                      354
lstm                        402
time-series                 466
cnn                         489
nlp                         493
scikit-learn                540
tensorflow                  584
classification              685
keras                       935
neural-network              1055
deep-learning               1220
python                      1814
machine-learning            2693
[526 rows x 1 columns]
top_times_tags_used = times_tag_used.sort_values(by="Times Tag Used").tail(20)
top_times_tags_used
| | Times Tag Used |
---|---|
machine-learning-model | 224 |
statistics | 234 |
clustering | 257 |
predictive-modeling | 265 |
r | 268 |
dataset | 340 |
regression | 347 |
pandas | 354 |
lstm | 402 |
time-series | 466 |
cnn | 489 |
nlp | 493 |
scikit-learn | 540 |
tensorflow | 584 |
classification | 685 |
keras | 935 |
neural-network | 1055 |
deep-learning | 1220 |
python | 1814 |
machine-learning | 2693 |
The above are the 20 most used tags. Let us see how this looks graphically.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
top_times_tags_used.plot(kind='barh', figsize=(16,8))
plt.show()
num_views = {}
for index, row in questions.iterrows():
    for tag in row['Tags']:
        if tag in num_views:
            num_views[tag] += row["ViewCount"]
        else:
            num_views[tag] = row["ViewCount"]
print(num_views)
{'dataset': 422, 'serialisation': 464, 'data-science-model': 32, 'neural': 21, 'ann': 11, 'ranking': 41, 'gaussian-process': 51, 'json': 15, 'yolo': 101, 'data-leakage': 99, 'classifier': 22, 'bayesian': 81, 'tokenization': 29, 'summarunner-architecture': 45, 'scoring': 2110, 'kernel': 216, 'kendalls-tau-coefficient': 474, 'speech-to-text': 24, 'binary': 31, 'pgm': 19, 'non-parametric': 32, 'logistic-regression': 48, 'bias': 30, 'methods': 20, 'annotation': 14, 'evaluation': 218, 'javascript': 110, 'programming': 12, 'convolution': 87, 'graphs': 130, 'performance': 61, 'keras': 756, 'sentiment-analysis': 177, 'data-augmentation': 41, 'softmax': 72, 'nlg': 80, 'nosql': 33, 'ipython': 1907, 'data-analysis': 81, 'management': 19, 'automatic-summarization': 12, 'representation': 29, 'supervised-learning': 19, 'hierarchical-data-format': 30, 'statistics': 20, 'feature-selection': 136, 'mse': 13, 'categorical-data': 21, 'sematic-similarity': 36, 'tools': 275, 'pac-learning': 164, 'ensemble': 55, 'pooling': 43, 'stacked-lstm': 54, 'data': 1641, 'beginner': 70, 'code': 189, 'hinge-loss': 8, 'reference-request': 29, 'kitti-dataset': 7, 'multivariate-distribution': 4, 'multi-instance-learning': 63, 'efficiency': 19, 'mini-batch-gradient-descent': 55, 'distribution': 21, 'cross-validation': 134, 'marginal-effects': 12, 'learning': 80, 'books': 36, 'siamese': 65, 'parameter-estimation': 32, 'usecase': 26, 'parallel': 36, 'categorical-encoding': 42, 'web-scrapping': 87, 'normalization': 140, 'heatmap': 13, 'orange': 234, 'cnn': 217, 'k-means': 49, 'label-flipping': 480, 'self-driving': 23, 'meta-learning': 101, 'smote': 98, 'r': 130, 'activity-recognition': 14, 'matplotlib': 718, 'search-engine': 48, 'anonymization': 117, 'pandas': 933, 'classification': 29, 'matrix-factorisation': 8, 'project-planning': 29, 'text': 48, 'counts': 57, 'homework': 177, 'mutual-information': 11, 'nl2sql': 14, 'dialog-flow': 29, 'mongodb': 25, 'google-cloud': 19, 'infographics': 42, 
'feature-construction': 18, 'lda-classifier': 68, 'similarity': 5, 'image-recognition': 33, 'multilabel-classification': 29, 'structured-data': 130, 'transfer-learning': 246, 'groupby': 15, 'collinearity': 37, 'libsvm': 54, 'education': 31, 'vgg16': 206, 'data-product': 186, 'forecasting': 70, 'information-theory': 15, 'open-set': 17, 'sensors': 8, 'networkx': 14, 'version-control': 94, 'clustering': 98, 'objective-function': 11, 'churn': 35, 'manifold': 80, 'reinforcement-learning': 119, 'markov': 105, 'stanford-nlp': 31, 'search': 55, 'markov-process': 49, 'gpu': 89, 'statsmodels': 9, 'evolutionary-algorithms': 253, 'adaboost': 23, 'dqn': 30, 'data-mining': 1506, 'features': 10, 'nn': 115, 'audio-recognition': 34, 'embeddings': 63, 'colab': 228, 'state-of-the-art': 17, 'forecast': 65, 'numpy': 612, 'automation': 15, 'mean-shift': 19, 'hardware': 41, 'cost-function': 61, 'theano': 22, 'career': 45, 'google-prediction-api': 10, 'sequence': 11, 'text-classification': 32, 'epochs': 16, 'grid-search': 2324, 'time': 31, 'computer-vision': 75, 'probability-calibration': 89, 'image-segmentation': 32, 'tableau': 75, 'convergence': 38, 'mnist': 108, 'anova': 49, 'lda': 21, 'weka': 64, 'social-network-analysis': 135, 'keras-rl': 230, 'graphical-model': 28, 'haar-cascade': 16, 'rstudio': 15, 'imbalanced-learn': 143, 'linear-algebra': 39, 'refit-model': 37, 'score': 69, 'unbalanced-classes': 42, 'market-basket-analysis': 11, 'weighted-data': 88, 'nlp': 970, 'parameter': 33, 'openai-gpt': 20, 'python-3.x': 41, 'ndcg': 140, 'error-handling': 134, 'chatbot': 17, 'vae': 104, 'rnn': 115, 'gmm': 35, 'apache-spark': 33, 'loss-function': 105, 'distance': 78, 'image': 117, 'tsne': 105, 'parquet': 17, 'policy-gradients': 38, 'ab-test': 54, 'confusion-matrix': 43, 'perceptron': 50, 'geospatial': 10, 'software-development': 562, 'unseen-data': 35, 'image-preprocessing': 319, 'weight-initialization': 113, 'octave': 165, 'discriminant-analysis': 86, 'cause-effect-relations': 42, 
'autoencoder': 158, 'scikit-learn': 428, 'linux': 192, 'monte-carlo': 3210, 'pattern-recognition': 23, 'convnet': 857, 'rbm': 43, 'predictive-modeling': 1796, 'c': 21, 'dplyr': 22, 'self-study': 42, 'redshift': 19, 'smotenc': 55, 'backpropagation': 463, 'variance': 91, 'anaconda': 1113, 'association-rules': 169, 'methodology': 31, 'rmse': 17, 'jupyter': 2006, 'twitter': 34, 'openai-gym': 471, 'finetuning': 28, 'least-squares-svm': 110, 'image-size': 33, 'model-selection': 64, 'multi-output': 171, 'genetic-programming': 21, 'lbp': 4, 'text-mining': 59, 'coursera': 110, 'wolfram-language': 42, 'deep-network': 79, 'machine-learning-model': 50, 'gaussian': 40, 'lstm': 73, 'gan': 19, 'noisification': 9, 'sas': 18, 'word-embeddings': 97, 'random-forest': 17, 'momentum': 14, 'optimization': 58, 'vector-space-models': 13, 'preprocessing': 568, 'estimators': 19, 'crawling': 36, 'data-imputation': 8, 'scalability': 94, 'descriptive-statistics': 8, 'matrix': 147, 'rbf': 70, 'feature-engineering': 71, 'research': 242, 'unsupervised-learning': 35, 'bioinformatics': 29, 'inception': 10, 'data-indexing-techniques': 47, 'q-learning': 139, 'neural-style-transfer': 13, 'scraping': 23, 'pca': 102, 'movielens': 34, 'probabilistic-programming': 10, 'dynamic-programming': 758, 'hive': 31, 'markov-hidden-model': 56, 'dbscan': 189, 'history': 20, 'pathfinder': 266, 'activation-function': 177, 'sequence-to-sequence': 104, 'sagemaker': 41, 'learning-to-rank': 27, 'experiments': 41, 'clusters': 63, 'online-learning': 24, 'machine-learning': 6839, 'definitions': 64, 'lightgbm': 17, 'competitions': 15, 'goss': 32, 'ngrams': 10, 'predict': 42, 'randomized-algorithms': 1195, 'knime': 36, 'gbm': 12, 'ai': 106, 'hog': 18, 'feature-reduction': 189, 'terminology': 22, 'algorithms': 13, 'james-stein-encoder': 26, 'expectation-maximization': 39, 'causalimpact': 49, 'regex': 184, 'orange3': 35, 'sql': 70, 'word2vec': 852, 'finance': 39, 'cloud-computing': 36, 'siamese-networks': 38, 'gridsearchcv': 32, 
'consumerweb': 24, 'pearsons-correlation-coefficient': 83, 'natural-language-process': 16, 'noise': 57, 'neural-network': 2764, 'decision-trees': 616, 'stemming': 25, 'class-imbalance': 44, 'training': 175, 'math': 114, 'svr': 167, 'accuracy': 9, 'dirichlet': 134, 'pytorch': 33, 'relational-dbms': 22, 'indexing': 3465, 'attention-mechanism': 37, 'metadata': 130, 'apache-hadoop': 12, 'fuzzy-classification': 5, 'overfitting': 55, 'text-filter': 17, 'labels': 459, 'ml': 21, 'pickle': 98, 'matlab': 31, 'aggregation': 20, 'data.table': 78, 'frequentist': 11, 'xboost': 71, 'cloud': 15, 'outlier': 25, 'doc2vec': 14, 'opencv': 206, 'visualization': 245, 'ridge-regression': 30, 'sports': 12, 'domain-adaptation': 48, 'encoder': 16, 'text-generation': 23, 'google': 18, 'inceptionresnetv2': 29, 'data-wrangling': 29, 'julia': 7, 'implementation': 54, 'bert': 476, 'svm': 1231, 'probability': 65, 'deepmind': 21, 'hyperparameter-tuning': 28, 'k-nn': 168, 'faster-rcnn': 48, 'huggingface': 37, 'mathematics': 58, 'interpolation': 25, 'semi-supervised-learning': 28, 'vc-theory': 85, 'plotting': 21, 'h2o': 10, 'ensemble-learning': 32, 'bayesian-nonparametric': 23, 'wikipedia': 23, 'databases': 43, 'nltk': 98, 'paperspace': 12, 'sparsity': 14, 'one-hot-encoding': 36, 'spacy': 190, 'nvidia': 94, 'object-recognition': 229, 'c++': 24, 'impala': 19, 'hyperparameter': 227, 'one-shot-learning': 54, 'caffe': 11, 'ensemble-modeling': 145, 'theory': 40, 'historgram': 59, 'arima': 24, 'activation': 15, 'recurrent-neural-net': 10, 'question-answering': 21, 'naive-bayes-classifier': 52, 'regression': 159, 'categories': 13, 'game': 22, 'bayesian-networks': 28, 'aws-lambda': 964, 'fastai': 18, 'recommender-system': 195, 'learning-rate': 5, 'glm': 11, 'sequential-pattern-mining': 85, 'manhattan': 43, 'software-recommendation': 18, 'kaggle': 426, 'pip': 30, 'pytorch-geometric': 51, 'java': 249, 'data-stream-mining': 14, 'parsing': 152, 'ggplot2': 140, 'generalization': 99, 'active-learning': 25, 
'marketing': 21, 'word': 118, 'xgboost': 559, '.net': 438, '3d-reconstruction': 9, 'lasso': 38, 'machine-translation': 21, 'generative-models': 127, 'correlation': 16, 'python': 1646, 'predictor-importance': 34, 'feature-map': 22, 'data-cleaning': 280, 'survival-analysis': 31, 'data-formats': 99, 'density-estimation': 54, 'multitask-learning': 35, 'counter-inference': 58, 'named-entity-recognition': 68, 'alex-net': 15, 'simulation': 292, 'amazon-ml': 35, 'encoding': 18, 'numerical': 401, 'allennlp': 171, 'aws': 175, 'anomaly-detection': 36, 'regularization': 74, 'mcmc': 22, 'gensim': 35, 'genetic-algorithms': 30, 'linear-regression': 214, 'reshape': 54, 'bayes-error': 128, 'explainable-ai': 117, 'anomaly': 67, 'discounted-reward': 16, 'etl': 148, 'excel': 19, 'finite-precision': 11, 'topic-model': 28, 'torch': 5, 'cosine-distance': 129, 'privacy': 11, 'library': 10, 'pipelines': 19, 'spyder': 43, 'map-reduce': 45, 'jaccard-coefficient': 176, 'feature-extraction': 69, 'batch-normalization': 919, 'auc': 38, 'mlp': 24, 'spss': 17, 'exploitation': 12, 'distributed': 146, 'time-series': 22, 'dropout': 135, 'seaborn': 212, 'csv': 2103, 'pyspark': 988, 'prediction': 45, 'dummy-variables': 117, 'gru': 16, 'scipy': 12, 'powerbi': 16, 'sampling': 23, 'actor-critic': 173, 'information-retrieval': 37, 'boosting': 62, 'multiclass-classification': 589, 'processing': 7, 'fuzzy-logic': 316, 'similar-documents': 37, 'scala': 22, 'image-classification': 25, 'pruning': 7, 'deep-learning': 1138, 'apache-nifi': 119, 'missing-data': 9, 'tensorflow': 16, 'transformer': 26, 'automl': 12, 'azure-ml': 11, 'language-model': 22, 'object-detection': 358, 'bigdata': 74, 'difference': 103, 'tfidf': 91, 'tesseract': 34, 'community': 42, 'gradient-descent': 852, 'dimensionality-reduction': 604, 'dataframe': 549, 'ocr': 222, 'feature-scaling': 46, 'normal-equation': 103, 'notation': 76, 'metric': 159}
num_views = pd.DataFrame.from_dict(num_views, orient='index')
num_views.rename(columns={0: "Times Tags Viewed"}, inplace=True)
print(num_views)
                            Times Tags Viewed
dataset                     422
serialisation               464
data-science-model          32
neural                      21
ann                         11
ranking                     41
gaussian-process            51
json                        15
yolo                        101
data-leakage                99
classifier                  22
bayesian                    81
tokenization                29
summarunner-architecture    45
scoring                     2110
kernel                      216
kendalls-tau-coefficient    474
speech-to-text              24
binary                      31
pgm                         19
non-parametric              32
logistic-regression         48
bias                        30
methods                     20
annotation                  14
evaluation                  218
javascript                  110
programming                 12
convolution                 87
graphs                      130
...                         ...
boosting                    62
multiclass-classification   589
processing                  7
fuzzy-logic                 316
similar-documents           37
scala                       22
image-classification        25
pruning                     7
deep-learning               1138
apache-nifi                 119
missing-data                9
tensorflow                  16
transformer                 26
automl                      12
azure-ml                    11
language-model              22
object-detection            358
bigdata                     74
difference                  103
tfidf                       91
tesseract                   34
community                   42
gradient-descent            852
dimensionality-reduction    604
dataframe                   549
ocr                         222
feature-scaling             46
normal-equation             103
notation                    76
metric                      159
[511 rows x 1 columns]
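The per-tag view totals above are built row by row with `iterrows`. A vectorized alternative, sketched here on a made-up frame (`toy` and its values are assumptions), uses `DataFrame.explode` to give each tag its own row and then sums `ViewCount` per tag:

```python
import pandas as pd

# Toy stand-in for the questions table; the values are made up.
toy = pd.DataFrame({
    "Tags": [["python", "pandas"], ["python"], ["keras"]],
    "ViewCount": [10, 5, 7],
})

# One row per (question, tag) pair, then total views for each tag.
views_per_tag = (
    toy.explode("Tags")
       .groupby("Tags")["ViewCount"]
       .sum()
       .sort_values()
)

print(views_per_tag)
```

On a table of a few thousand rows either approach is fast enough; the vectorized form mainly reads more compactly.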
top_views = num_views.sort_values(by="Times Tags Viewed").tail(20)
top_views
| | Times Tags Viewed |
---|---|
aws-lambda | 964 |
nlp | 970 |
pyspark | 988 |
anaconda | 1113 |
deep-learning | 1138 |
randomized-algorithms | 1195 |
svm | 1231 |
data-mining | 1506 |
data | 1641 |
python | 1646 |
predictive-modeling | 1796 |
ipython | 1907 |
jupyter | 2006 |
csv | 2103 |
scoring | 2110 |
grid-search | 2324 |
neural-network | 2764 |
monte-carlo | 3210 |
indexing | 3465 |
machine-learning | 6839 |
top_views.plot(kind='barh', figsize=(16,8))
plt.show()
Seeing the plots side by side we have the following.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,8))
top_times_tags_used.plot(kind='barh', ax=axes[0], subplots=True)
top_views.plot(kind='barh', ax=axes[1], subplots=True)
plt.show()
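The two rankings can also be compared without reading the bar charts, by checking which tags appear in both. This is a rough sketch on made-up frames (`most_used`, `most_viewed`, and their numbers are assumptions that only mimic the shape of `top_times_tags_used` and `top_views`):

```python
import pandas as pd

# Toy stand-ins for the two top-20 tables; the numbers are made up.
most_used = pd.DataFrame(
    {"Times Tag Used": [5, 9, 12]},
    index=["keras", "python", "machine-learning"],
)
most_viewed = pd.DataFrame(
    {"Times Tags Viewed": [40, 70, 90]},
    index=["csv", "python", "machine-learning"],
)

# Tags present in both rankings, and tags that are only in the "used" ranking.
in_both = most_used.index.intersection(most_viewed.index)
only_used = most_used.index.difference(most_viewed.index)

print(list(in_both))
print(list(only_used))
```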
Among the top ten tags are python, machine-learning, deep-learning, neural-network, keras, tensorflow, classification, and scikit-learn. Let's look at how these may be related. Deep learning is an area of machine learning. Neural networks are often used to model complex data in a way that mimics the brain. Keras is a neural network library written in Python. TensorFlow is an open source machine learning library. Scikit-learn is a Python module used for machine learning, among other things. Classification is used in supervised learning, a part of the machine learning process. So we see the interconnectedness of these tags.
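One way to check this kind of relationship in the data itself is to count how often two tags appear on the same question. The sketch below does that on made-up tag lists (`tag_lists` is an assumed stand-in for the parsed Tags column); it is an illustration, not a step taken in this analysis:

```python
from collections import Counter
from itertools import combinations

# Toy tag lists standing in for the parsed Tags column; the sample is made up.
tag_lists = [
    ["python", "keras", "deep-learning"],
    ["python", "scikit-learn"],
    ["keras", "deep-learning"],
]

pair_counts = Counter()
for tags in tag_lists:
    # Sorting makes ('a', 'b') and ('b', 'a') count as the same pair.
    pair_counts.update(combinations(sorted(set(tags)), 2))

print(pair_counts.most_common(3))
```

Pairs with high counts, such as keras with deep-learning in this toy sample, are the ones that tend to be asked about together.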
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
all_quests = pd.read_csv("all_questions.csv", parse_dates=["CreationDate"])
all_quests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21576 entries, 0 to 21575
Data columns (total 3 columns):
Id              21576 non-null int64
CreationDate    21576 non-null datetime64[ns]
Tags            21576 non-null object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 505.8+ KB
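# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
all_quests['Tags'] = all_quests['Tags'].str.replace("^<|>$", "", regex=True).str.split("><")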
print(all_quests['Tags'].head(5))
0    [python, keras, tensorflow, cnn, probability]
1    [neural-network]
2    [python, ibm-watson, chatbot]
3    [keras]
4    [r, predictive-modeling, machine-learning-mode...
Name: Tags, dtype: object
quest_tags = {}
for tags in all_quests['Tags']:
    for tag in tags:
        if tag in quest_tags:
            quest_tags[tag] += 1
        else:
            quest_tags[tag] = 1
print(quest_tags)
{'data-transfer': 1, 'dataset': 893, 'serialisation': 3, 'data-science-model': 259, 'pyro': 1, 'neural': 17, 'gbm': 48, 'gaussian-process': 13, 'json': 23, 'yolo': 39, 'data-leakage': 13, 'classifier': 40, 'bayesian': 76, 'summarunner-architecture': 1, 'scoring': 31, 'regex': 22, 'neo4j': 10, 'kernel': 51, 'kendalls-tau-coefficient': 1, 'speech-to-text': 14, 'binary': 68, 'pgm': 6, 'generative-models': 85, 'non-parametric': 3, 'haar-cascade': 1, 'logistic-regression': 406, 'dynamic-programming': 5, 'education': 31, '3d-reconstruction': 9, 'gaussian': 51, 'javascript': 24, 'programming': 35, 'preprocessing': 272, 'graphs': 139, 'performance': 110, 'keras': 1750, 'sentiment-analysis': 118, 'data-augmentation': 48, 'softmax': 24, 'nlg': 10, 'nosql': 21, 'ipython': 45, 'data-analysis': 106, 'energy': 4, 'management': 7, 'automatic-summarization': 17, 'representation': 16, 'supervised-learning': 235, 'hierarchical-data-format': 20, 'statistics': 650, 'feature-selection': 551, 'apache-spark': 208, 'categorical-data': 237, 'sematic-similarity': 3, 'torch': 12, 'loss-function': 277, 'ensemble': 19, 'pooling': 4, 'stacked-lstm': 15, 'data': 476, 'metaheuristics': 1, 'beginner': 125, 'code': 5, 'hinge-loss': 8, 'reference-request': 65, 'kitti-dataset': 1, 'sequence-to-sequence': 58, 'efficiency': 27, 'mini-batch-gradient-descent': 28, 'distribution': 98, 'cross-validation': 334, 'marginal-effects': 1, 'learning': 24, 'books': 27, 'siamese': 1, 'parameter-estimation': 33, 'usecase': 7, 'parallel': 32, 'categorical-encoding': 4, 'genetic': 4, 'normalization': 158, 'similar-documents': 39, 'orange': 189, 'cnn': 815, 'k-means': 265, 'label-flipping': 1, 'self-driving': 5, 'meta-learning': 8, 'smote': 44, 'r': 1119, 'activity-recognition': 6, 'matplotlib': 108, 'search-engine': 12, 'anonymization': 8, 'pandas': 736, 'matrix-factorisation': 46, 'project-planning': 10, 'text': 97, 'counts': 7, 'mutual-information': 13, 'nl2sql': 1, 'dialog-flow': 2, 'mongodb': 13, 'google-cloud': 
1, 'infographics': 7, 'feature-construction': 54, 'lda-classifier': 6, 'similarity': 195, 'image-recognition': 200, 'multilabel-classification': 197, 'hurdle-model': 1, 'structured-data': 5, 'transfer-learning': 99, 'groupby': 2, 'apache-pig': 7, 'collinearity': 10, 'libsvm': 12, 'vgg16': 28, 'data-product': 9, 'forecasting': 132, 'information-theory': 28, 'open-set': 2, 'sensors': 7, 'version-control': 5, 'clustering': 846, 'objective-function': 11, 'churn': 29, 'methods': 14, 'gan': 151, 'word-embeddings': 253, 'stanford-nlp': 40, 'lsi': 5, 'search': 40, 'markov-process': 50, 'neural-style-transfer': 15, 'proximal-svm': 1, 'statsmodels': 3, 'evolutionary-algorithms': 12, 'adaboost': 4, 'dqn': 63, 'data-mining': 947, 'features': 33, 'nn': 1, 'gridsearchcv': 35, 'embeddings': 73, 'colab': 23, 'accuracy': 186, 'databases': 81, 'ann': 3, 'automation': 6, 'mean-shift': 2, 'cost-function': 55, 'theano': 46, 'feature-engineering': 333, 'google-prediction-api': 7, 'hyperparameter': 101, 'text-classification': 4, 'epochs': 17, 'markov-hidden-model': 25, 'non-convex': 1, 'ai': 38, 'stata': 4, 'freebase': 3, 'time': 6, 'computer-vision': 313, 'probability-calibration': 18, 'image-segmentation': 8, 'linearly-separable': 1, 'tableau': 26, 'convergence': 27, 'mnist': 54, 'anova': 1, 'lda': 83, 'multivariate-distribution': 3, 'gradient-descent': 253, 'keras-rl': 10, 'julia': 9, 'gru': 5, 'rstudio': 39, 'imbalanced-learn': 21, 'rmse': 1, 'optimization': 290, 'refit-model': 1, 'score': 20, 'web-scrapping': 8, 'knowledge-base': 16, 'unbalanced-classes': 148, 'market-basket-analysis': 23, 'weighted-data': 39, 'nlp': 1170, 'ngboost': 1, 'parameter': 27, 'openai-gpt': 3, 'python-3.x': 20, 'ndcg': 6, 'error-handling': 36, 'chatbot': 15, 'vae': 16, 'rnn': 404, 'gmm': 2, 'mse': 8, 'pac-learning': 11, 'distance': 104, 'missing-data': 110, 'image': 36, 'tsne': 44, 'parquet': 1, 'policy-gradients': 41, 'homework': 11, 'confusion-matrix': 72, 'rmsle': 1, 'perceptron': 59, 'geospatial': 51, 
'software-development': 14, 'unseen-data': 2, 'image-preprocessing': 71, 'octave': 14, 'discriminant-analysis': 12, 'cause-effect-relations': 1, 'autoencoder': 201, 'scikit-learn': 1307, 'linux': 9, 'monte-carlo': 21, 'pattern-recognition': 1, 'convnet': 426, 'rbm': 31, 'predictive-modeling': 817, 'c': 12, 'dplyr': 11, 'tokenization': 11, 'redshift': 5, 'smotenc': 4, 'backpropagation': 198, 'variance': 60, 'anaconda': 38, 'association-rules': 54, 'methodology': 26, 'bias': 35, 'jupyter': 85, 'twitter': 17, 'ranking': 66, 'finetuning': 8, 'time-series': 1005, 'model-selection': 141, 'multi-output': 8, 'genetic-programming': 3, 'dropout': 49, 'text-mining': 472, 'class-imbalance': 133, 'glorot-initialization': 2, 'wolfram-language': 3, 'deep-network': 41, 'machine-learning-model': 336, 'annotation': 16, 'lstm': 694, 'manifold': 6, 'label-smoothing': 1, 'reinforcement-learning': 413, 'noisification': 3, 'sas': 17, 'markov': 12, 'random-forest': 463, 'momentum': 5, 'self-study': 36, 'reductions': 2, 'online-learning': 45, 'vector-space-models': 17, 'convolution': 210, 'estimators': 17, 'crawling': 11, 'data-imputation': 56, 'scalability': 26, 'descriptive-statistics': 48, 'matrix': 30, 'rbf': 6, 'library': 12, 'research': 36, 'unsupervised-learning': 271, 'bioinformatics': 17, 'inception': 39, 'data-indexing-techniques': 5, 'q-learning': 95, 'kaggle': 75, 'gpu': 97, 'scraping': 33, 'pca': 193, 'movielens': 2, 'probabilistic-programming': 14, 'career': 49, 'consumerweb': 6, 'classification': 1899, 'dbscan': 41, 'spectral-clustering': 1, 'history': 4, 'pathfinder': 1, 'activation-function': 84, 'openai-gym': 23, 'game': 14, 'learning-to-rank': 12, 'experiments': 21, 'clusters': 39, 'ibm-watson': 2, 'machine-learning': 6969, 'definitions': 28, 'lightgbm': 25, 'competitions': 5, 'goss': 1, 'ngrams': 17, 'predict': 3, 'randomized-algorithms': 15, 'knime': 1, 'sequence': 68, 'grid-search': 54, 'least-squares-svm': 1, 'feature-reduction': 6, 'terminology': 58, 'anomaly': 8, 
'algorithms': 308, 'james-stein-encoder': 1, 'heatmap': 11, 'expectation-maximization': 16, 'forecast': 93, 'demographic-data': 1, 'orange3': 23, 'sql': 73, 'bayesian-neural-network': 1, 'word2vec': 244, 'finance': 38, 'cloud-computing': 17, 'siamese-networks': 6, 'audio-recognition': 64, 'cs231n': 3, 'hog': 3, 'pearsons-correlation-coefficient': 3, 'natural-language-process': 182, 'lbp': 2, 'noise': 26, 'neural-network': 2939, 'apache-kafka': 3, 'tools': 56, 'decision-trees': 430, 'stemming': 2, 'coursera': 4, 'training': 321, 'spatial-transformer': 2, 'object-recognition': 51, 'svr': 9, 'state-of-the-art': 6, 'dirichlet': 6, 'natural-gradient-boosting': 1, 'pytorch': 239, 'relational-dbms': 11, 'indexing': 13, 'attention-mechanism': 32, 'metadata': 10, 'apache-hadoop': 110, 'fuzzy-classification': 3, 'overfitting': 147, 'text-filter': 6, 'labels': 65, 'ml': 8, 'extreme-learning-machine': 1, 'pickle': 9, 'matlab': 144, 'aggregation': 28, 'data.table': 13, 'frequentist': 1, 'xboost': 1, 'cloud': 9, 'hbase': 2, 'ggplot2': 21, 'doc2vec': 4, 'opencv': 52, 'visualization': 421, 'ridge-regression': 7, 'sports': 6, 'domain-adaptation': 9, 'encoder': 2, 'text-generation': 30, 'google': 31, 'inceptionresnetv2': 6, 'data-wrangling': 34, 'implementation': 9, 'bert': 68, 'svm': 389, 'probability': 184, 'deepmind': 8, 'hyperparameter-tuning': 99, 'k-nn': 88, 'f1score': 1, 'faster-rcnn': 46, 'huggingface': 2, 'mathematics': 19, 'interpolation': 15, 'semi-supervised-learning': 40, 'vc-theory': 11, 'plotting': 88, 'h2o': 4, 'ensemble-learning': 21, 'bayesian-nonparametric': 2, 'wikipedia': 1, 'numpy': 193, 'nltk': 105, 'paperspace': 1, 'featurization': 5, 'hive': 11, 'sparsity': 2, 'one-hot-encoding': 6, 'transformer': 45, 'nvidia': 13, 'math': 51, 'c++': 1, 'causalimpact': 2, 'caffe': 22, 'one-shot-learning': 3, 'ensemble-modeling': 106, 'theory': 28, 'historgram': 11, 'arima': 12, 'activation': 1, 'recurrent-neural-net': 151, 'question-answering': 6, 'naive-bayes-classifier': 
120, 'regression': 869, 'categories': 2, 'sagemaker': 8, 'bayesian-networks': 40, 'aws-lambda': 2, 'fastai': 6, 'recommender-system': 310, 'learning-rate': 16, 'glm': 19, 'sequential-pattern-mining': 43, 'manhattan': 3, 'software-recommendation': 32, 'language-model': 49, 'pip': 9, 'pytorch-geometric': 2, 'data-stream-mining': 14, 'parsing': 23, 'helmert-coding': 1, 'outlier': 125, 'explainable-ai': 11, 'active-learning': 9, 'marketing': 23, 'word': 6, 'separable': 1, 'xgboost': 366, 'corpus': 1, '.net': 6, 'java': 58, 'lasso': 8, 'apache-mahout': 16, 'mlp': 53, 'correlation': 207, 'dump': 1, 'machine-translation': 51, 'python': 3937, 'reshape': 17, 'feature-map': 2, 'tflearn': 9, 'data-cleaning': 444, 'survival-analysis': 26, 'data-formats': 29, 'impala': 1, 'density-estimation': 3, 'multitask-learning': 19, 'counter-inference': 1, 'named-entity-recognition': 84, 'alex-net': 10, 'handwritten': 1, 'simulation': 22, 'amazon-ml': 10, 'rdkit': 1, 'encoding': 94, 'evaluation': 156, 'allennlp': 2, 'aws': 40, '3d-object-detection': 1, 'anomaly-detection': 205, 'regularization': 105, 'mcmc': 6, 'gensim': 78, 'genetic-algorithms': 38, 'linear-regression': 439, 'predictor-importance': 16, 'sparse': 1, 'bayes-error': 6, 'generalization': 19, 'object-detection': 155, 'discounted-reward': 5, 'rattle': 3, 'etl': 12, 'excel': 51, 'finite-precision': 3, 'topic-model': 92, 'networkx': 2, 'cosine-distance': 47, 'privacy': 9, 'pipelines': 23, 'spyder': 1, 'map-reduce': 25, 'numerical': 18, 'jaccard-coefficient': 8, 'feature-extraction': 271, 'batch-normalization': 43, 'auc': 6, 'weight-initialization': 19, 'spss': 5, 'exploitation': 1, 'distributed': 35, 'image-size': 5, 'graphical-model': 22, 'seaborn': 59, 'csv': 65, 'pyspark': 101, 'boosting': 68, 'pybrain': 3, 'dummy-variables': 20, 'scipy': 52, 'powerbi': 18, 'sampling': 110, 'actor-critic': 25, 'information-retrieval': 104, 'spearmans-rank-correlation': 1, 'hardware': 18, 'tranformation': 4, 'multiclass-classification': 294, 
'processing': 17, 'fuzzy-logic': 26, 'weka': 55, 'scala': 41, 'image-classification': 461, 'pruning': 3, 'deep-learning': 2805, 'apache-nifi': 1, 'open-source': 16, 'social-network-analysis': 80, 'tensorflow': 1229, 'spacy': 21, 'automl': 4, 'azure-ml': 35, 'data-engineering': 2, 'linear-algebra': 42, 'bigdata': 414, 'difference': 7, 'tfidf': 53, 'multi-instance-learning': 2, 'tesseract': 3, 'community': 3, 'ab-test': 29, 'dimensionality-reduction': 178, 'dataframe': 157, 'prediction': 265, 'ocr': 47, 'feature-scaling': 132, 'normal-equation': 6, 'notation': 9, 'metric': 95}
quest_tags = pd.DataFrame.from_dict(num_tags, orient='index')
quest_tags.rename(columns={0: "Times Tags Used"}, inplace=True)
print(quest_tags)
                           Times Tags Used
data-transfer                            1
dataset                                340
serialisation                            3
data-science-model                     186
neural                                  16
ann                                      2
ranking                                 22
gaussian-process                        12
json                                    10
yolo                                    21
data-leakage                             8
classifier                              18
bayesian                                40
tokenization                             6
summarunner-architecture                 1
scoring                                 12
impala                                   1
kernel                                  27
kendalls-tau-coefficient                 1
speech-to-text                           8
binary                                  26
pgm                                      1
non-parametric                           3
logistic-regression                    154
bias                                    19
methods                                  4
annotation                              12
evaluation                              66
javascript                               8
programming                              7
...                                    ...
boosting                                49
multiclass-classification              131
processing                               5
fuzzy-logic                             13
similar-documents                       20
scala                                    9
image-classification                   211
pruning                                  3
deep-learning                         1220
apache-nifi                              1
missing-data                            43
tensorflow                             584
spacy                                   20
automl                                   2
azure-ml                                12
language-model                          25
object-detection                       109
bigdata                                 95
difference                               5
tfidf                                   31
tesseract                                3
community                                1
gradient-descent                        98
dimensionality-reduction                69
dataframe                               81
ocr                                     26
feature-scaling                         59
normal-equation                          1
notation                                 4
metric                                  60

[526 rows x 1 columns]
top_qtags_used = quest_tags.sort_values(by="Times Tags Used").tail(20)
top_qtags_used
Times Tags Used | |
---|---|
machine-learning-model | 224 |
statistics | 234 |
clustering | 257 |
predictive-modeling | 265 |
r | 268 |
dataset | 340 |
regression | 347 |
pandas | 354 |
lstm | 402 |
time-series | 466 |
cnn | 489 |
nlp | 493 |
scikit-learn | 540 |
tensorflow | 584 |
classification | 685 |
keras | 935 |
neural-network | 1055 |
deep-learning | 1220 |
python | 1814 |
machine-learning | 2693 |
top_qtags_used.plot(kind='barh', figsize=(10,7))
plt.show()
Among the most-used tags, the ones related to deep learning are deep-learning, machine-learning, neural-network, classification, tensorflow, scikit-learn, and cnn. If we count the questions asked each year that carry at least one of these tags, we can track interest in deep learning over time.
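One way to flag a question is to parse its tag string into individual tags and check them against the list exactly. Parsing first matters when `Tags` is still a raw string such as `<machine-learning-model>`, because a plain substring test for `machine-learning` would match it too. The sketch below runs on made-up rows and assumes the raw `<tag><tag>` format shown in the CSV; `parse_tags`, `is_deep_learning`, and `DEEP_LEARNING_TAGS` are hypothetical names used only for illustration.

```python
import re
import pandas as pd

# Tags the text above treats as deep learning related.
DEEP_LEARNING_TAGS = {"deep-learning", "machine-learning", "neural-network",
                      "classification", "tensorflow", "scikit-learn", "cnn"}

def parse_tags(raw):
    """Turn '<python><keras>' into ['python', 'keras']."""
    return re.findall(r"<([^<>]+)>", raw)

def is_deep_learning(raw):
    """True if any tag of the question is in DEEP_LEARNING_TAGS (exact match)."""
    return not DEEP_LEARNING_TAGS.isdisjoint(parse_tags(raw))

# Made-up rows in the raw tag format of the CSV (not real data).
toy = pd.DataFrame({
    "CreationDate": pd.to_datetime(
        ["2018-03-01", "2018-07-15", "2019-01-23", "2019-05-02"]),
    "Tags": ["<python><pandas>", "<cnn><keras>",
             "<machine-learning-model>", "<deep-learning><lstm>"],
})
toy["Year"] = toy["CreationDate"].dt.year
toy["deep_learning"] = toy["Tags"].apply(is_deep_learning)

# Number of flagged questions per year.
per_year = toy.groupby("Year")["deep_learning"].sum()
print(per_year)
```

Here the `<machine-learning-model>` row is not flagged, because none of its tags is exactly in the set.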
def categorize(tags):
    # Label a single question from its own tags (the value passed in by .apply).
    deep_learning_tags = ["deep-learning", "machine-learning", "neural-network",
                          "classification", "tensorflow", "scikit-learn", "cnn"]
    for tag in deep_learning_tags:
        if tag in tags:
            return "Deep Learning"
    return "None"
year = all_quests["CreationDate"].dt.year
all_quests['Year'] = year
all_quests["category"] = all_quests["Tags"].apply(categorize)
tps = all_quests.pivot_table(index=all_quests['Year'],
columns=all_quests['category'], aggfunc='size')
print(tps)
category  Deep Learning
Year
2014                562
2015               1167
2016               2146
2017               2957
2018               5475
2019               8810
2020                459
As we can see, the number of questions with deep learning tags increased every year except 2020. Since 2020 is still in progress, its data is incomplete, so we will exclude it.
# Keep only complete years, then count questions per year and category.
before_2020 = all_quests[all_quests["CreationDate"].dt.year < 2020]
tps = before_2020.pivot_table(index=before_2020['Year'],
                              columns=before_2020['category'], aggfunc='size')
print(tps)
category  Deep Learning
Year
2014                562
2015               1167
2016               2146
2017               2957
2018               5475
2019               8810
tps.plot(kind='bar', figsize=(10,7), title="Deep Learning Tags Per Year")
plt.grid(False)
plt.show()
Now we will count the total questions per year and the percentage of them that were deep learning questions.
yearly = tps.groupby('Year').agg({"Deep Learning": ['sum', 'size']})
yearly.columns = ["Deep Learning Questions", "Total Questions"]
yearly["Deep Learning Rate"] = yearly["Deep Learning Questions"]\
/yearly["Total Questions"]
yearly.reset_index(inplace=True)
print(yearly)
   Year  Deep Learning Questions  Total Questions  Deep Learning Rate
0  2014                      562                1               562.0
1  2015                     1167                1              1167.0
2  2016                     2146                1              2146.0
3  2017                     2957                1              2957.0
4  2018                     5475                1              5475.0
5  2019                     8810                1              8810.0
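Because `tps` already holds one row per year, grouping it by Year gives a size of 1 for every year, so "Total Questions" is 1 and the rate simply repeats the raw count. A share between 0 and 1 needs one row per question, with a 0/1 flag that can be summed (deep learning questions) and counted (all questions). The sketch below shows that arithmetic on made-up rows; the `toy` frame and the `is_deep_learning` column are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Made-up rows standing in for the per-question table (one row per question).
toy = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "category": ["Deep Learning", "None", "Deep Learning",
                 "Deep Learning", "Deep Learning", "None", "Deep Learning"],
})

# 1 if the question is flagged as deep learning, else 0.
toy["is_deep_learning"] = (toy["category"] == "Deep Learning").astype(int)

# sum -> flagged questions per year, size -> all questions per year.
yearly = toy.groupby("Year")["is_deep_learning"].agg(["sum", "size"])
yearly.columns = ["Deep Learning Questions", "Total Questions"]
yearly["Deep Learning Rate"] = (
    yearly["Deep Learning Questions"] / yearly["Total Questions"]
)
yearly = yearly.reset_index()
print(yearly)
```

On these toy rows, 2019 has 3 flagged questions out of 4, giving a rate of 0.75.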