Guided Project - Polular data science questions¶

01. Data Cleaning¶

In [42]:

import pandas as pd

df = pd.read_csv('post.csv')
df = df[['Id', 'CreationDate', 'Score', 'ViewCount', 'Tags', 'AnswerCount', 'FavoriteCount']]
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
Id               50000 non-null int64
CreationDate     50000 non-null object
Score            50000 non-null int64
ViewCount        23685 non-null float64
Tags             23685 non-null object
AnswerCount      23685 non-null float64
FavoriteCount    6518 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 2.7+ MB

Out[42]:

	Id	CreationDate	Score	ViewCount	Tags	AnswerCount	FavoriteCount
0	5	2014-05-13 23:58:30	9	716.0	<machine-learning>	1.0	1.0
1	7	2014-05-14 00:11:06	4	443.0	<education><open-source>	3.0	1.0
2	9	2014-05-14 00:36:31	5	NaN	NaN	NaN	NaN
3	10	2014-05-14 00:53:43	13	NaN	NaN	NaN	NaN
4	14	2014-05-14 01:25:59	22	1722.0	<data-mining><definitions>	4.0	6.0

In [43]:

df.fillna(0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
Id               50000 non-null int64
CreationDate     50000 non-null object
Score            50000 non-null int64
ViewCount        50000 non-null float64
Tags             50000 non-null object
AnswerCount      50000 non-null float64
FavoriteCount    50000 non-null float64
dtypes: float64(3), int64(2), object(2)
memory usage: 2.7+ MB

In [44]:

df['Tags'] = df['Tags'].astype(str)

In [45]:

import datetime as dt
df['CreationDate'] = pd.to_datetime(df['CreationDate'])

In [46]:

df["Tags"] = df["Tags"].str.replace("^<|>$", "").str.split("><")
df['Tags']

Out[46]:

0                                       [machine-learning]
1                                 [education, open-source]
2                                                      [0]
3                                                      [0]
4                               [data-mining, definitions]
5                                              [databases]
6                      [machine-learning, bigdata, libsvm]
7                                                      [0]
8                                                      [0]
9          [bigdata, scalability, efficiency, performance]
10                                [nosql, relational-dbms]
11                                                     [0]
12       [data-mining, clustering, octave, k-means, cat...
13                                                     [0]
14                                                     [0]
15                                                     [0]
16                                                     [0]
17                                                     [0]
18                                                     [0]
19                                                     [0]
20                                                     [0]
21                               [data-mining, clustering]
22                                                     [0]
23                                            [algorithms]
24                                                     [0]
25               [nosql, tools, processing, apache-hadoop]
26                                                     [0]
27                                            [bigdata, r]
28                                                     [0]
29                                                     [0]
                               ...                        
49970           [machine-learning, data-cleaning, outlier]
49971                    [feature-extraction, convolution]
49972                                                  [0]
49973                                                  [0]
49974                                                  [0]
49975                       [machine-learning, regression]
49976                 [neural-network, image-segmentation]
49977                                                  [0]
49978                                                  [r]
49979    [machine-learning, neural-network, autoencoder...
49980                                                  [0]
49981                                                  [0]
49982                [python, correlation, graphs, matrix]
49983    [machine-learning, predictive-modeling, cross-...
49984    [python, predictive-modeling, unsupervised-lea...
49985                                                  [0]
49986                                                  [0]
49987                                                  [0]
49988    [machine-learning, classification, scikit-lear...
49989    [machine-learning, classification, logistic-re...
49990       [machine-learning, python, dataset, csv, json]
49991                                                  [0]
49992    [machine-learning, nlp, lstm, machine-learning...
49993                                                  [0]
49994                                                [cnn]
49995                                                  [0]
49996                                                  [0]
49997                                                  [0]
49998                                                  [0]
49999                              [machine-learning, sap]
Name: Tags, Length: 50000, dtype: object

Tag popularity

how many times each tag was used
how many times a question with that tag was viewed

In [47]:

tag_used_dict = {}
for tags in df['Tags']:
    for i in tags:
        if i in tag_used_dict:
            tag_used_dict[i] += 1
        else:
            tag_used_dict[i] = 0

In [48]:

tag_view_dict = {}
for index, row in df.iterrows():
    for i in row['Tags']:
        if i in tag_view_dict:
            tag_view_dict[i] += row['ViewCount']
        else:
            tag_view_dict[i] = 0

In [49]:

tag_used_dict

Out[49]:

{'machine-learning': 7564,
 'education': 27,
 'open-source': 13,
 '0': 26314,
 'data-mining': 978,
 'definitions': 30,
 'databases': 83,
 'bigdata': 420,
 'libsvm': 12,
 'scalability': 25,
 'efficiency': 30,
 'performance': 118,
 'nosql': 22,
 'relational-dbms': 10,
 'clustering': 920,
 'octave': 14,
 'k-means': 292,
 'categorical-data': 261,
 'algorithms': 316,
 'tools': 52,
 'processing': 16,
 'apache-hadoop': 102,
 'r': 1170,
 'data-cleaning': 484,
 'predictive-modeling': 859,
 'statistics': 706,
 'data-stream-mining': 14,
 'neo4j': 10,
 'parallel': 34,
 'distributed': 34,
 'google': 32,
 'search': 48,
 'recommender-system': 318,
 'information-retrieval': 108,
 'similarity': 200,
 'dimensionality-reduction': 187,
 'python': 4371,
 'nlp': 1317,
 'topic-model': 96,
 'lda': 89,
 'language-model': 55,
 'feature-selection': 604,
 'feature-extraction': 269,
 'deep-learning': 3106,
 'scikit-learn': 1445,
 'lstm': 794,
 'neural-network': 3163,
 'audio-recognition': 73,
 'classification': 2055,
 'orange': 192,
 'ensemble-learning': 23,
 'backpropagation': 222,
 'cnn': 932,
 'gradient-descent': 295,
 'feature-scaling': 154,
 'labels': 74,
 'keras': 1964,
 'multilabel-classification': 220,
 'computer-vision': 361,
 'time-series': 1110,
 'rnn': 444,
 'k-nn': 106,
 'apache-spark': 209,
 'scala': 36,
 'text-mining': 486,
 'autoencoder': 231,
 'beginner': 125,
 'theory': 31,
 'hyperparameter': 113,
 'hyperparameter-tuning': 123,
 'grid-search': 62,
 'tensorflow': 1411,
 'unsupervised-learning': 313,
 'anomaly-detection': 229,
 'outlier': 138,
 'dataset': 982,
 'binary': 72,
 'experiments': 24,
 'hardware': 19,
 'multitask-learning': 21,
 'cross-validation': 393,
 'training': 362,
 'accuracy': 213,
 'overfitting': 169,
 'word-embeddings': 288,
 'gpu': 112,
 'logistic-regression': 443,
 'optimization': 313,
 'theano': 45,
 'svm': 406,
 'pandas': 841,
 'association-rules': 59,
 'gan': 170,
 'prediction': 299,
 'model-selection': 151,
 'regression': 971,
 'sql': 80,
 'version-control': 5,
 'linux': 11,
 'convnet': 438,
 'correlation': 240,
 'market-basket-analysis': 22,
 'visualization': 461,
 'software-development': 12,
 'random-forest': 515,
 'pca': 221,
 'reinforcement-learning': 465,
 'q-learning': 101,
 'image-classification': 506,
 'evaluation': 172,
 'confusion-matrix': 90,
 'text': 108,
 'perceptron': 62,
 'linear-regression': 491,
 'forecasting': 156,
 'feature-engineering': 368,
 'distance': 102,
 'decision-trees': 491,
 'score': 20,
 'word2vec': 241,
 'convolution': 230,
 'generative-models': 104,
 'hierarchical-data-format': 20,
 'machine-translation': 60,
 'churn': 26,
 'encoding': 97,
 'ggplot2': 23,
 'gensim': 73,
 'preprocessing': 297,
 'pyspark': 106,
 'data': 530,
 'missing-data': 114,
 'plotting': 98,
 'multiclass-classification': 316,
 'knowledge-base': 18,
 'unbalanced-classes': 155,
 'sas': 19,
 'reference-request': 61,
 'fuzzy-logic': 25,
 'image-recognition': 202,
 'apache-mahout': 12,
 'feature-construction': 52,
 'sequential-pattern-mining': 42,
 'sentiment-analysis': 129,
 'class-imbalance': 168,
 'smote': 54,
 'survival-analysis': 28,
 'game': 13,
 'seaborn': 69,
 'machine-learning-model': 393,
 'ocr': 53,
 'gbm': 47,
 'graphs': 159,
 'matlab': 140,
 'jupyter': 98,
 'recurrent-neural-net': 174,
 'excel': 58,
 'sampling': 113,
 'search-engine': 13,
 'learning-to-rank': 15,
 'kernel': 66,
 'bayesian': 81,
 'dropout': 55,
 'batch-normalization': 49,
 'mlp': 65,
 'natural-language-process': 208,
 'sequence': 74,
 'dataframe': 193,
 'loss-function': 325,
 'numpy': 227,
 'reshape': 22,
 'etl': 10,
 'data-augmentation': 58,
 'scipy': 57,
 'non-parametric': 7,
 'bayesian-nonparametric': 1,
 'error-handling': 34,
 'transfer-learning': 112,
 'embeddings': 87,
 'features': 37,
 'image-preprocessing': 86,
 'colab': 32,
 'pickle': 11,
 'ipython': 45,
 'spatial-transformer': 1,
 'distribution': 107,
 'research': 34,
 'cost-function': 57,
 'faster-rcnn': 49,
 'keras-rl': 9,
 'gridsearchcv': 42,
 'objective-function': 12,
 'clusters': 38,
 'math': 56,
 'anaconda': 39,
 'regex': 24,
 'one-hot-encoding': 24,
 'matplotlib': 136,
 'nltk': 114,
 'orange3': 26,
 'ngrams': 16,
 'active-learning': 12,
 'object-detection': 186,
 'pytorch': 284,
 'cosine-distance': 48,
 'bert': 109,
 'matrix': 43,
 'weighted-data': 41,
 'geospatial': 60,
 'annotation': 22,
 'powerbi': 18,
 'social-network-analysis': 82,
 'twitter': 18,
 'lightgbm': 33,
 'data-science-model': 325,
 'management': 7,
 'dummy-variables': 22,
 'xgboost': 430,
 'online-learning': 44,
 'pgm': 5,
 'boosting': 71,
 'probability': 207,
 'mathematics': 33,
 '3d-reconstruction': 10,
 'normalization': 185,
 'self-driving': 4,
 'methodology': 22,
 'smotenc': 5,
 'regularization': 117,
 'explainable-ai': 23,
 'image': 39,
 'similar-documents': 42,
 'metric': 113,
 'anonymization': 6,
 'naive-bayes-classifier': 133,
 'dynamic-programming': 4,
 'ab-test': 33,
 'activation-function': 103,
 'dqn': 68,
 'deep-network': 47,
 'career': 48,
 'data-analysis': 127,
 'forecast': 85,
 'categorical-encoding': 21,
 'neural': 17,
 'transformer': 73,
 'pipelines': 43,
 'python-3.x': 82,
 'bioinformatics': 18,
 'esl': 3,
 'structured-data': 6,
 'aws': 44,
 'sagemaker': 11,
 'azure-ml': 37,
 '3d-object-detection': 2,
 'opencv': 63,
 'yolo': 52,
 'linear-algebra': 49,
 'markov-process': 53,
 'markov-hidden-model': 29,
 'multivariate-distribution': 12,
 'weka': 58,
 'interpolation': 17,
 'mnist': 60,
 'shap': 8,
 'apache-pig': 6,
 'scraping': 36,
 'ensemble-modeling': 112,
 'hive': 10,
 'programming': 29,
 'graphical-model': 23,
 'convergence': 24,
 'genetic-algorithms': 41,
 'text-generation': 39,
 'gaussian': 55,
 'rbm': 29,
 'semi-supervised-learning': 38,
 'books': 25,
 'mutual-information': 12,
 'named-entity-recognition': 93,
 'rstudio': 43,
 'data-formats': 29,
 'data-imputation': 64,
 'supervised-learning': 250,
 'tsne': 44,
 'ann': 5,
 'vae': 16,
 'text-filter': 5,
 'text-classification': 49,
 'dplyr': 13,
 'arima': 19,
 'statsmodels': 12,
 'question-answering': 7,
 'dbscan': 45,
 'time': 14,
 'ranking': 79,
 'data.table': 13,
 'image-segmentation': 28,
 'cloud-computing': 18,
 'csv': 68,
 'data-wrangling': 38,
 'self-study': 35,
 'ai': 43,
 'vgg16': 30,
 'speech-to-text': 21,
 'tfidf': 68,
 'software-recommendation': 31,
 'consumerweb': 4,
 'classifier': 41,
 'julia': 9,
 'community': 2,
 'kaggle': 86,
 'terminology': 62,
 'data-product': 7,
 'inception': 39,
 'numerical': 17,
 'cs231n': 2,
 'bayes-error': 5,
 'variance': 65,
 'bias': 43,
 'sequence-to-sequence': 70,
 'generalization': 24,
 'pooling': 10,
 'weight-initialization': 25,
 'word': 6,
 'estimators': 14,
 'google-prediction-api': 9,
 'probability-calibration': 21,
 'finetuning': 10,
 'cloud': 8,
 'automatic-summarization': 22,
 'finance': 42,
 'vc-theory': 12,
 'vector-space-models': 20,
 'aggregation': 26,
 'discriminant-analysis': 16,
 'nlg': 9,
 'proximal-svm': 0,
 'least-squares-svm': 1,
 'markov': 9,
 'attention-mechanism': 48,
 'windows': 7,
 'multi-output': 17,
 'lasso': 11,
 'stanford-nlp': 42,
 'mini-batch-gradient-descent': 37,
 'learning-rate': 18,
 'noise': 24,
 'scoring': 32,
 'caffe': 19,
 'gaussian-process': 17,
 'jaccard-coefficient': 11,
 'stacked-lstm': 16,
 'tableau': 27,
 'parsing': 23,
 'noisification': 2,
 'linear-programming': 1,
 'google-cloud': 7,
 'ensemble': 20,
 'bagging': 2,
 'f1score': 15,
 'implementation': 30,
 'counts': 7,
 'lsi': 5,
 'randomized-algorithms': 18,
 'c': 11,
 'structural-equation-modelling': 2,
 'exploratory-factor-analysis': 1,
 'automl': 5,
 'openai-gpt': 5,
 'policy-gradients': 47,
 'deepmind': 8,
 'parameter-estimation': 30,
 'historgram': 14,
 'openai-gym': 31,
 'ridge-regression': 12,
 'expectation-maximization': 14,
 'gru': 9,
 'map-reduce': 24,
 'mongodb': 14,
 'indexing': 15,
 'data-indexing-techniques': 4,
 '.net': 6,
 'metadata': 9,
 'javascript': 25,
 'usecase': 5,
 'parameter': 28,
 'reductions': 1,
 'learning': 25,
 'pac-learning': 10,
 'serialisation': 2,
 'library': 11,
 'project-planning': 10,
 'predictor-importance': 24,
 'actor-critic': 23,
 'monte-carlo': 17,
 'chatbot': 12,
 'encoder': 3,
 'bootstraping': 2,
 'parametric': 0,
 'sports': 5,
 'matrix-factorisation': 44,
 'json': 27,
 'wolfram-language': 2,
 'descriptive-statistics': 56,
 'sensors': 7,
 'domain-adaptation': 7,
 'representation': 19,
 'collinearity': 11,
 'hinge-loss': 7,
 'one-shot-learning': 4,
 'heatmap': 14,
 'web-scraping': 12,
 'huggingface': 5,
 'data-leakage': 18,
 'ndcg': 6,
 'infographics': 6,
 'object-recognition': 50,
 'spacy': 31,
 'sparse': 5,
 'methods': 13,
 'glm': 22,
 'bayesian-networks': 41,
 'plotly': 2,
 'adaboost': 5,
 'spyder': 4,
 'mse': 17,
 'hog': 3,
 'evolutionary-algorithms': 14,
 'information-theory': 26,
 'simulation': 20,
 'finite-precision': 4,
 'tflearn': 8,
 'alex-net': 9,
 'discounted-reward': 6,
 'derivation': 5,
 'categories': 2,
 'multi-instance-learning': 2,
 'privacy': 7,
 'marketing': 24,
 'softmax': 27,
 'pearsons-correlation-coefficient': 6,
 'pip': 9,
 'siamese-networks': 8,
 'crawling': 10,
 'freebase': 2,
 'networkx': 3,
 'interpretation': 5,
 'tokenization': 14,
 'confidence': 1,
 'torch': 10,
 'java': 55,
 'state-of-the-art': 6,
 'inceptionresnetv2': 7,
 'mcmc': 5,
 'neural-style-transfer': 14,
 'natural-gradient-boosting': 1,
 'ngboost': 0,
 'handwritten': 0,
 'auc': 18,
 'ner': 2,
 'imbalanced-learn': 30,
 'data-engineering': 7,
 'genetic-programming': 3,
 'epochs': 21,
 'image-size': 4,
 'feature-reduction': 4,
 'amazon-ml': 8,
 'pytorch-geometric': 4,
 'corpus': 5,
 'groundtruth': 0,
 'labelling': 7,
 'pruning': 6,
 'stata': 4,
 'meta-learning': 9,
 'manifold': 5,
 'competitions': 6,
 'svr': 10,
 'feature-map': 1,
 'coursera': 5,
 'ml': 11,
 'rdkit': 1,
 'predict': 8,
 'cart': 3,
 'chi-square-test': 2,
 'skmultilearn': 0,
 'gradient': 6,
 'nvidia': 12,
 'momentum': 3,
 'bokeh': 2,
 'information-extraction': 2,
 'probabilistic-programming': 13,
 'featurization': 3,
 'apache-kafka': 2,
 'tranformation': 3,
 'graphviz': 1,
 'elastic-net': 2,
 'mlflow': 1,
 'movielens': 1,
 'difference': 8,
 'summarunner-architecture': 0,
 'h2o': 8,
 'rbf': 7,
 'pybrain': 1,
 'rmse': 5,
 'gmm': 2,
 'fuzzy-classification': 3,
 'notation': 10,
 'causalimpact': 6,
 'genetic': 4,
 'doc2vec': 6,
 'automation': 5,
 'knime': 2,
 'lda-classifier': 5,
 'mean-shift': 2,
 'normal-equation': 5,
 'anova': 4,
 'groupby': 5,
 'code': 6,
 'pattern-recognition': 8,
 'fastai': 4,
 'homework': 11,
 'unseen-data': 2,
 'tesseract': 6,
 'skorch': 1,
 'density-estimation': 6,
 'glorot-initialization': 1,
 'arrow': 0,
 'activity-recognition': 5,
 'semantic-similarity': 7,
 'anomaly': 5,
 'hbase': 1,
 'energy': 3,
 'functional-api': 3,
 'frequentist': 1,
 'dashboards': 0,
 'redshift': 4,
 'kerastuner': 0,
 'manhattan': 2,
 'spectral-clustering': 3,
 'exploitation': 0,
 'history': 2,
 'multi-agent': 0,
 'allennlp': 2,
 'mxnet': 1,
 'crisp-dm': 1,
 'goss': 0,
 'dirichlet': 6,
 'sap': 1,
 'hana': 0,
 'sparsity': 3,
 'helmert-coding': 0,
 'james-stein-encoder': 0,
 'gluon': 0,
 'isolation-forest': 2,
 'gnn': 1,
 'graph-neural-network': 3,
 'spearmans-rank-correlation': 2,
 'kendalls-tau-coefficient': 2,
 'ibm-watson': 0,
 'spss': 3,
 'tpu': 1,
 'knowledge-graph': 2,
 'early-stopping': 2,
 'lime': 1,
 'c++': 0,
 'hashingvectorizer': 1,
 'naive-bayes-algorithim': 1,
 'nl2sql': 0,
 'cuda': 1,
 'aws-lambda': 1,
 'estimation': 0,
 'estimation-updating': 0,
 'duplicate-records': 0,
 'fasttext': 2,
 'chainer': 0,
 'imputation': 0,
 'rattle': 3,
 'parquet': 0,
 'metaheuristics': 1,
 'latex': 0,
 'wikipedia': 1,
 'kitti-dataset': 0,
 'separable': 0,
 'linearly-separable': 1,
 'stemming': 1,
 'field-aware-factorization-machines': 0,
 'hashing-trick': 0,
 'or-tools': 0,
 'pyro': 0,
 'bayesian-neural-network': 0,
 'infere': 0}

In [50]:

tag_view_dict

Out[50]:

{'machine-learning': 10233797.0,
 'education': 43029.0,
 'open-source': 81347.0,
 '0': 0.0,
 'data-mining': 1463551.0,
 'definitions': 28023.0,
 'databases': 105292.0,
 'bigdata': 593751.0,
 'libsvm': 9895.0,
 'scalability': 14157.0,
 'efficiency': 89635.0,
 'performance': 114993.0,
 'nosql': 22004.0,
 'relational-dbms': 4300.0,
 'clustering': 1092229.0,
 'octave': 13350.0,
 'k-means': 487571.0,
 'categorical-data': 421168.0,
 'algorithms': 439566.0,
 'tools': 156580.0,
 'processing': 5858.0,
 'apache-hadoop': 137152.0,
 'r': 2315572.0,
 'data-cleaning': 988851.0,
 'predictive-modeling': 766715.0,
 'statistics': 1056087.0,
 'data-stream-mining': 8002.0,
 'neo4j': 7306.0,
 'parallel': 43284.0,
 'distributed': 52358.0,
 'google': 13256.0,
 'search': 27721.0,
 'recommender-system': 225510.0,
 'information-retrieval': 79010.0,
 'similarity': 324351.0,
 'dimensionality-reduction': 228175.0,
 'python': 10487181.0,
 'nlp': 1435590.0,
 'topic-model': 131439.0,
 'lda': 100240.0,
 'language-model': 83249.0,
 'feature-selection': 751725.0,
 'feature-extraction': 367439.0,
 'deep-learning': 4537935.0,
 'scikit-learn': 3568042.0,
 'lstm': 705631.0,
 'neural-network': 4924901.0,
 'audio-recognition': 42677.0,
 'classification': 2086082.0,
 'orange': 92527.0,
 'ensemble-learning': 3608.0,
 'backpropagation': 316182.0,
 'cnn': 608790.0,
 'gradient-descent': 321500.0,
 'feature-scaling': 188331.0,
 'labels': 63404.0,
 'keras': 3017772.0,
 'multilabel-classification': 244579.0,
 'computer-vision': 431154.0,
 'time-series': 1022405.0,
 'rnn': 620676.0,
 'k-nn': 54086.0,
 'apache-spark': 685246.0,
 'scala': 123389.0,
 'text-mining': 632733.0,
 'autoencoder': 186492.0,
 'beginner': 260580.0,
 'theory': 6077.0,
 'hyperparameter': 231746.0,
 'hyperparameter-tuning': 84919.0,
 'grid-search': 44879.0,
 'tensorflow': 1910096.0,
 'unsupervised-learning': 290008.0,
 'anomaly-detection': 291220.0,
 'outlier': 113803.0,
 'dataset': 1244310.0,
 'binary': 104812.0,
 'experiments': 6722.0,
 'hardware': 9697.0,
 'multitask-learning': 14669.0,
 'cross-validation': 673873.0,
 'training': 385621.0,
 'accuracy': 310805.0,
 'overfitting': 227033.0,
 'word-embeddings': 303576.0,
 'gpu': 282302.0,
 'logistic-regression': 549747.0,
 'optimization': 358528.0,
 'theano': 88461.0,
 'svm': 725166.0,
 'pandas': 4324555.0,
 'association-rules': 46180.0,
 'gan': 107530.0,
 'prediction': 232252.0,
 'model-selection': 98234.0,
 'regression': 868703.0,
 'sql': 130857.0,
 'version-control': 28776.0,
 'linux': 8419.0,
 'convnet': 938390.0,
 'correlation': 434468.0,
 'market-basket-analysis': 11745.0,
 'visualization': 918054.0,
 'software-development': 3682.0,
 'random-forest': 1185875.0,
 'pca': 115567.0,
 'reinforcement-learning': 316535.0,
 'q-learning': 96764.0,
 'image-classification': 509591.0,
 'evaluation': 300990.0,
 'confusion-matrix': 189129.0,
 'text': 132457.0,
 'perceptron': 45807.0,
 'linear-regression': 688328.0,
 'forecasting': 55163.0,
 'feature-engineering': 351467.0,
 'distance': 94608.0,
 'decision-trees': 1077741.0,
 'score': 5367.0,
 'word2vec': 363681.0,
 'convolution': 410239.0,
 'generative-models': 54448.0,
 'hierarchical-data-format': 15790.0,
 'machine-translation': 32814.0,
 'churn': 6240.0,
 'encoding': 180602.0,
 'ggplot2': 24078.0,
 'gensim': 141601.0,
 'preprocessing': 299454.0,
 'pyspark': 487917.0,
 'data': 466215.0,
 'missing-data': 186628.0,
 'plotting': 147052.0,
 'multiclass-classification': 439745.0,
 'knowledge-base': 29058.0,
 'unbalanced-classes': 267485.0,
 'sas': 6586.0,
 'reference-request': 60268.0,
 'fuzzy-logic': 13942.0,
 'image-recognition': 158694.0,
 'apache-mahout': 3302.0,
 'feature-construction': 42131.0,
 'sequential-pattern-mining': 22461.0,
 'sentiment-analysis': 124704.0,
 'class-imbalance': 128745.0,
 'smote': 51259.0,
 'survival-analysis': 13982.0,
 'game': 1151.0,
 'seaborn': 187131.0,
 'machine-learning-model': 117919.0,
 'ocr': 23203.0,
 'gbm': 170397.0,
 'graphs': 209803.0,
 'matlab': 56215.0,
 'jupyter': 249937.0,
 'recurrent-neural-net': 91676.0,
 'excel': 91715.0,
 'sampling': 260137.0,
 'search-engine': 2162.0,
 'learning-to-rank': 2763.0,
 'kernel': 27692.0,
 'bayesian': 16426.0,
 'dropout': 72917.0,
 'batch-normalization': 43813.0,
 'mlp': 52089.0,
 'natural-language-process': 66724.0,
 'sequence': 42191.0,
 'dataframe': 1394951.0,
 'loss-function': 326040.0,
 'numpy': 484487.0,
 'reshape': 20356.0,
 'etl': 4157.0,
 'data-augmentation': 29904.0,
 'scipy': 28007.0,
 'non-parametric': 368.0,
 'bayesian-nonparametric': 31.0,
 'error-handling': 34364.0,
 'transfer-learning': 47949.0,
 'embeddings': 42444.0,
 'features': 4575.0,
 'image-preprocessing': 17654.0,
 'colab': 125549.0,
 'pickle': 3379.0,
 'ipython': 188711.0,
 'spatial-transformer': 470.0,
 'distribution': 40709.0,
 'research': 17193.0,
 'cost-function': 80453.0,
 'faster-rcnn': 17443.0,
 'keras-rl': 4209.0,
 'gridsearchcv': 29925.0,
 'objective-function': 575.0,
 'clusters': 57016.0,
 'math': 10397.0,
 'anaconda': 352448.0,
 'regex': 27582.0,
 'one-hot-encoding': 1028.0,
 'matplotlib': 157232.0,
 'nltk': 172643.0,
 'orange3': 3999.0,
 'ngrams': 7709.0,
 'active-learning': 1340.0,
 'object-detection': 59642.0,
 'pytorch': 247712.0,
 'cosine-distance': 68177.0,
 'bert': 68468.0,
 'matrix': 53376.0,
 'weighted-data': 90470.0,
 'geospatial': 128796.0,
 'annotation': 7772.0,
 'powerbi': 4364.0,
 'social-network-analysis': 66406.0,
 'twitter': 10988.0,
 'lightgbm': 10057.0,
 'data-science-model': 91792.0,
 'management': 493.0,
 'dummy-variables': 6854.0,
 'xgboost': 927565.0,
 'online-learning': 30349.0,
 'pgm': 19487.0,
 'boosting': 35842.0,
 'probability': 78734.0,
 'mathematics': 12058.0,
 '3d-reconstruction': 1680.0,
 'normalization': 244842.0,
 'self-driving': 1436.0,
 'methodology': 3641.0,
 'smotenc': 1982.0,
 'regularization': 73281.0,
 'explainable-ai': 1952.0,
 'image': 10755.0,
 'similar-documents': 30256.0,
 'metric': 66633.0,
 'anonymization': 7220.0,
 'naive-bayes-classifier': 110757.0,
 'dynamic-programming': 178.0,
 'ab-test': 18837.0,
 'activation-function': 152705.0,
 'dqn': 23273.0,
 'deep-network': 18538.0,
 'career': 71815.0,
 'data-analysis': 34738.0,
 'forecast': 80135.0,
 'categorical-encoding': 709.0,
 'neural': 8975.0,
 'transformer': 32787.0,
 'pipelines': 6418.0,
 'python-3.x': 8057.0,
 'bioinformatics': 75596.0,
 'esl': 3919.0,
 'structured-data': 493.0,
 'aws': 37148.0,
 'sagemaker': 913.0,
 'azure-ml': 22833.0,
 '3d-object-detection': 60.0,
 'opencv': 49895.0,
 'yolo': 28030.0,
 'linear-algebra': 16304.0,
 'markov-process': 30793.0,
 'markov-hidden-model': 9818.0,
 'multivariate-distribution': 264.0,
 'weka': 31168.0,
 'interpolation': 2670.0,
 'mnist': 29131.0,
 'shap': 588.0,
 'apache-pig': 5338.0,
 'scraping': 114185.0,
 'ensemble-modeling': 128568.0,
 'hive': 12211.0,
 'programming': 139699.0,
 'graphical-model': 5576.0,
 'convergence': 32945.0,
 'genetic-algorithms': 37552.0,
 'text-generation': 8761.0,
 'gaussian': 36348.0,
 'rbm': 22641.0,
 'semi-supervised-learning': 21203.0,
 'books': 16291.0,
 'mutual-information': 8538.0,
 'named-entity-recognition': 73297.0,
 'rstudio': 111090.0,
 'data-formats': 78926.0,
 'data-imputation': 44132.0,
 'supervised-learning': 143596.0,
 'tsne': 41107.0,
 'ann': 143.0,
 'vae': 8365.0,
 'text-filter': 2187.0,
 'text-classification': 1399.0,
 'dplyr': 8868.0,
 'arima': 3716.0,
 'statsmodels': 451.0,
 'question-answering': 541.0,
 'dbscan': 54675.0,
 'time': 6687.0,
 'ranking': 43413.0,
 'data.table': 42431.0,
 'image-segmentation': 2166.0,
 'cloud-computing': 75063.0,
 'csv': 369071.0,
 'data-wrangling': 97218.0,
 'self-study': 33760.0,
 'ai': 6780.0,
 'vgg16': 17397.0,
 'speech-to-text': 6106.0,
 'tfidf': 64694.0,
 'software-recommendation': 194395.0,
 'consumerweb': 2172.0,
 'classifier': 62041.0,
 'julia': 4728.0,
 'community': 530.0,
 'kaggle': 134966.0,
 'terminology': 93752.0,
 'data-product': 1576.0,
 'inception': 78464.0,
 'numerical': 21971.0,
 'cs231n': 111.0,
 'bayes-error': 1958.0,
 'variance': 41649.0,
 'bias': 11415.0,
 'sequence-to-sequence': 43458.0,
 'generalization': 4495.0,
 'pooling': 1023.0,
 'weight-initialization': 9867.0,
 'word': 9820.0,
 'estimators': 23546.0,
 'google-prediction-api': 1813.0,
 'probability-calibration': 10376.0,
 'finetuning': 1454.0,
 'cloud': 1849.0,
 'automatic-summarization': 13668.0,
 'finance': 13971.0,
 'vc-theory': 31950.0,
 'vector-space-models': 8017.0,
 'aggregation': 92041.0,
 'discriminant-analysis': 9258.0,
 'nlg': 1891.0,
 'proximal-svm': 0,
 'least-squares-svm': 19.0,
 'markov': 19415.0,
 'attention-mechanism': 38051.0,
 'windows': 13396.0,
 'multi-output': 838.0,
 'lasso': 501.0,
 'stanford-nlp': 43173.0,
 'mini-batch-gradient-descent': 22271.0,
 'learning-rate': 9569.0,
 'noise': 7765.0,
 'scoring': 13492.0,
 'caffe': 13337.0,
 'gaussian-process': 875.0,
 'jaccard-coefficient': 8346.0,
 'stacked-lstm': 16685.0,
 'tableau': 29383.0,
 'parsing': 23769.0,
 'noisification': 4408.0,
 'linear-programming': 86.0,
 'google-cloud': 161.0,
 'ensemble': 3233.0,
 'bagging': 45.0,
 'f1score': 871.0,
 'implementation': 4404.0,
 'counts': 1929.0,
 'lsi': 361.0,
 'randomized-algorithms': 21057.0,
 'c': 75230.0,
 'structural-equation-modelling': 45.0,
 'exploratory-factor-analysis': 13.0,
 'automl': 564.0,
 'openai-gpt': 284.0,
 'policy-gradients': 6680.0,
 'deepmind': 1667.0,
 'parameter-estimation': 58701.0,
 'historgram': 25227.0,
 'openai-gym': 8029.0,
 'ridge-regression': 1468.0,
 'expectation-maximization': 3264.0,
 'gru': 100158.0,
 'map-reduce': 19704.0,
 'mongodb': 16777.0,
 'indexing': 46371.0,
 'data-indexing-techniques': 345.0,
 '.net': 3246.0,
 'metadata': 4903.0,
 'javascript': 46916.0,
 'usecase': 7866.0,
 'parameter': 149730.0,
 'reductions': 770.0,
 'learning': 5374.0,
 'pac-learning': 31536.0,
 'serialisation': 2104.0,
 'library': 81554.0,
 'project-planning': 576.0,
 'predictor-importance': 8966.0,
 'actor-critic': 3586.0,
 'monte-carlo': 7979.0,
 'chatbot': 1694.0,
 'encoder': 104.0,
 'bootstraping': 23.0,
 'parametric': 0,
 'sports': 5994.0,
 'matrix-factorisation': 8615.0,
 'json': 100308.0,
 'wolfram-language': 257.0,
 'descriptive-statistics': 16180.0,
 'sensors': 245.0,
 'domain-adaptation': 1640.0,
 'representation': 11852.0,
 'collinearity': 3308.0,
 'hinge-loss': 2088.0,
 'one-shot-learning': 539.0,
 'heatmap': 1447.0,
 'web-scraping': 452.0,
 'huggingface': 90.0,
 'data-leakage': 1793.0,
 'ndcg': 1710.0,
 'infographics': 3218.0,
 'object-recognition': 81914.0,
 'spacy': 7384.0,
 'sparse': 86.0,
 'methods': 13234.0,
 'glm': 21802.0,
 'bayesian-networks': 26572.0,
 'plotly': 325.0,
 'adaboost': 1197.0,
 'spyder': 865.0,
 'mse': 634.0,
 'hog': 323.0,
 'evolutionary-algorithms': 1073.0,
 'information-theory': 15462.0,
 'simulation': 6868.0,
 'finite-precision': 223.0,
 'tflearn': 15975.0,
 'alex-net': 6748.0,
 'discounted-reward': 586.0,
 'derivation': 77.0,
 'categories': 38.0,
 'multi-instance-learning': 96.0,
 'privacy': 3854.0,
 'marketing': 9701.0,
 'softmax': 8315.0,
 'pearsons-correlation-coefficient': 6547.0,
 'pip': 1451.0,
 'siamese-networks': 501.0,
 'crawling': 46394.0,
 'freebase': 4283.0,
 'networkx': 1142.0,
 'interpretation': 505.0,
 'tokenization': 17482.0,
 'confidence': 25.0,
 'torch': 6802.0,
 'java': 47603.0,
 'state-of-the-art': 8875.0,
 'inceptionresnetv2': 1480.0,
 'mcmc': 900.0,
 'neural-style-transfer': 5210.0,
 'natural-gradient-boosting': 10.0,
 'ngboost': 0,
 'handwritten': 0,
 'auc': 1018.0,
 'ner': 54.0,
 'imbalanced-learn': 7340.0,
 'data-engineering': 903.0,
 'genetic-programming': 312.0,
 'epochs': 25524.0,
 'image-size': 230.0,
 'feature-reduction': 102.0,
 'amazon-ml': 3879.0,
 'pytorch-geometric': 783.0,
 'corpus': 116.0,
 'groundtruth': 0,
 'labelling': 168.0,
 'pruning': 418.0,
 'stata': 3180.0,
 'meta-learning': 957.0,
 'manifold': 2778.0,
 'competitions': 4316.0,
 'svr': 2093.0,
 'feature-map': 52.0,
 'coursera': 238.0,
 'ml': 1960.0,
 'rdkit': 27.0,
 'predict': 571.0,
 'cart': 65.0,
 'chi-square-test': 30.0,
 'skmultilearn': 0,
 'gradient': 102.0,
 'nvidia': 26102.0,
 'momentum': 2331.0,
 'bokeh': 367.0,
 'information-extraction': 16.0,
 'probabilistic-programming': 2859.0,
 'featurization': 24129.0,
 'apache-kafka': 2467.0,
 'tranformation': 759.0,
 'graphviz': 22.0,
 'elastic-net': 145.0,
 'mlflow': 28.0,
 'movielens': 1540.0,
 'difference': 16583.0,
 'summarunner-architecture': 0,
 'h2o': 162.0,
 'rbf': 1103.0,
 'pybrain': 77.0,
 'rmse': 301.0,
 'gmm': 86.0,
 'fuzzy-classification': 417.0,
 'notation': 1944.0,
 'causalimpact': 153.0,
 'genetic': 9427.0,
 'doc2vec': 290.0,
 'automation': 258.0,
 'knime': 54.0,
 'lda-classifier': 4367.0,
 'mean-shift': 321.0,
 'normal-equation': 2394.0,
 'anova': 80.0,
 'groupby': 170.0,
 'code': 404.0,
 'pattern-recognition': 230.0,
 'fastai': 394.0,
 'homework': 52129.0,
 'unseen-data': 79.0,
 'tesseract': 981.0,
 'skorch': 13.0,
 'density-estimation': 1608.0,
 'glorot-initialization': 4064.0,
 'arrow': 0,
 'activity-recognition': 209.0,
 'semantic-similarity': 190.0,
 'anomaly': 5151.0,
 'hbase': 2155.0,
 'energy': 309.0,
 'functional-api': 146.0,
 'frequentist': 14.0,
 'dashboards': 0,
 'redshift': 1536.0,
 'kerastuner': 0,
 'manhattan': 255.0,
 'spectral-clustering': 86.0,
 'exploitation': 0,
 'history': 632.0,
 'multi-agent': 0,
 'allennlp': 59.0,
 'mxnet': 23.0,
 'crisp-dm': 6.0,
 'goss': 0,
 'dirichlet': 2082.0,
 'sap': 20.0,
 'hana': 0,
 'sparsity': 83.0,
 'helmert-coding': 0,
 'james-stein-encoder': 0,
 'gluon': 0,
 'isolation-forest': 107.0,
 'gnn': 13.0,
 'graph-neural-network': 52.0,
 'spearmans-rank-correlation': 28.0,
 'kendalls-tau-coefficient': 26.0,
 'ibm-watson': 0,
 'spss': 6716.0,
 'tpu': 17.0,
 'knowledge-graph': 24.0,
 'early-stopping': 334.0,
 'lime': 18.0,
 'c++': 0,
 'hashingvectorizer': 24.0,
 'naive-bayes-algorithim': 14.0,
 'nl2sql': 0,
 'cuda': 84.0,
 'aws-lambda': 1686.0,
 'estimation': 0,
 'estimation-updating': 0,
 'duplicate-records': 0,
 'fasttext': 32.0,
 'chainer': 0,
 'imputation': 0,
 'rattle': 948.0,
 'parquet': 0,
 'metaheuristics': 89.0,
 'latex': 0,
 'wikipedia': 33.0,
 'kitti-dataset': 0,
 'separable': 0,
 'linearly-separable': 22.0,
 'stemming': 49.0,
 'field-aware-factorization-machines': 0,
 'hashing-trick': 0,
 'or-tools': 0,
 'pyro': 0,
 'bayesian-neural-network': 0,
 'infere': 0}

In [33]:

tag_used = pd.DataFrame.from_dict(tag_used_dict, orient='index')
tag_used.rename(columns={0:'used'}, inplace=True)
tag_viewed = pd.DataFrame.from_dict(tag_view_dict, orient='index')
tag_viewed.rename(columns={0:'viewed'}, inplace=True)

In [51]:

tag = tag_used.merge(tag_viewed, left_index=True, right_index=True)
tag

Out[51]:

	used	viewed
machine-learning	7564	151280.0
education	27	540.0
open-source	13	260.0
0	26314	526280.0
data-mining	978	19560.0
definitions	30	600.0
databases	83	1660.0
bigdata	420	8400.0
libsvm	12	240.0
scalability	25	500.0
efficiency	30	600.0
performance	118	2360.0
nosql	22	440.0
relational-dbms	10	200.0
clustering	920	18400.0
octave	14	280.0
k-means	292	5840.0
categorical-data	261	5220.0
algorithms	316	6320.0
tools	52	1040.0
processing	16	320.0
apache-hadoop	102	2040.0
r	1170	23400.0
data-cleaning	484	9680.0
predictive-modeling	859	17180.0
statistics	706	14120.0
data-stream-mining	14	280.0
neo4j	10	200.0
parallel	34	680.0
distributed	34	680.0
...	...	...
knowledge-graph	2	40.0
early-stopping	2	40.0
lime	1	20.0
c++	0	0.0
hashingvectorizer	1	20.0
naive-bayes-algorithim	1	20.0
nl2sql	0	0.0
cuda	1	20.0
aws-lambda	1	20.0
estimation	0	0.0
estimation-updating	0	0.0
duplicate-records	0	0.0
fasttext	2	40.0
chainer	0	0.0
imputation	0	0.0
rattle	3	60.0
parquet	0	0.0
metaheuristics	1	20.0
latex	0	0.0
wikipedia	1	20.0
kitti-dataset	0	0.0
separable	0	0.0
linearly-separable	1	20.0
stemming	1	20.0
field-aware-factorization-machines	0	0.0
hashing-trick	0	0.0
or-tools	0	0.0
pyro	0	0.0
bayesian-neural-network	0	0.0
infere	0	0.0

592 rows × 2 columns

In [55]:

tag = tag.drop(['0'], axis=0)

In [56]:

used_20 = tag['used'].sort_values().tail(20)
viewed_20 = tag['viewed'].sort_values().tail(20)

In [57]:

%matplotlib inline
used_20.plot(kind='barh', title='top_10_used_tag')

Out[57]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a5957c5ba8>

In [58]:

viewed_20.plot(kind='barh', title='top_10_viewed_tag')

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a5957fd780>

In [59]:

tag.info()

<class 'pandas.core.frame.DataFrame'>
Index: 591 entries, machine-learning to infere
Data columns (total 2 columns):
used      591 non-null int64
viewed    591 non-null float64
dtypes: float64(1), int64(1)
memory usage: 13.9+ KB

02. Tag relationship¶

In [60]:

final ={}
for i in tag.index:
    related_tag={}
    for index, row in df.iterrows():
        if i in row['Tags']:
            for k in row['Tags']:
                if k in related_tag:
                    related_tag[k] += 1
                else:
                    related_tag[k] = 1
    final[i] = related_tag

In [61]:

final_relate = pd.DataFrame.from_dict(final, orient='index')

In [62]:

order = final_relate.index
final_relate = fianl_relate.reindex(order, axis=1)
final_relate

Out[62]:

	.net	3d-object-detection	3d-reconstruction	ab-test	accuracy	activation-function	active-learning	activity-recognition	actor-critic	adaboost	...	weighted-data	weka	wikipedia	windows	wolfram-language	word	word-embeddings	word2vec	xgboost	yolo
.net	7.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3d-object-detection	NaN	3.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3d-reconstruction	NaN	NaN	11.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
ab-test	NaN	NaN	NaN	34.0	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
accuracy	NaN	NaN	NaN	NaN	214.0	NaN	NaN	NaN	NaN	NaN	...	NaN	1.0	NaN	NaN	NaN	NaN	NaN	1.0	5.0	NaN
activation-function	NaN	NaN	NaN	NaN	NaN	104.0	NaN	NaN	NaN	NaN	...	1.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN
active-learning	NaN	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
activity-recognition	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
actor-critic	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	24.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
adaboost	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN
aggregation	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN
ai	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
alex-net	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
algorithms	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	2.0	NaN	1.0	NaN	NaN	1.0	NaN	3.0	NaN
allennlp	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
amazon-ml	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
anaconda	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	2.0	NaN	NaN	NaN	NaN	1.0	1.0
ann	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
annotation	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
anomaly	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
anomaly-detection	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN
anonymization	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
anova	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
apache-hadoop	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
apache-kafka	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
apache-mahout	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
apache-pig	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
apache-spark	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	1.0	NaN	NaN	1.0	NaN	NaN	NaN	1.0	1.0	NaN
arima	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.0	NaN
arrow	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
tpu	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
training	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	1.0	NaN	...	1.0	NaN	NaN	NaN	NaN	NaN	2.0	NaN	4.0	2.0
tranformation	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
transfer-learning	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	2.0	1.0	NaN	1.0
transformer	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN
tsne	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	1.0	NaN	NaN	NaN	NaN	2.0	4.0	NaN	NaN
twitter	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
unbalanced-classes	NaN	NaN	NaN	NaN	2.0	NaN	NaN	NaN	NaN	NaN	...	6.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8.0	NaN
unseen-data	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
unsupervised-learning	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	1.0	1.0	NaN	NaN	NaN	NaN	2.0	2.0	NaN	NaN
usecase	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
vae	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
variance	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN
vc-theory	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
vector-space-models	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	5.0	3.0	NaN	NaN
version-control	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
vgg16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
visualization	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	...	NaN	3.0	NaN	NaN	2.0	NaN	1.0	2.0	1.0	NaN
web-scraping	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
weight-initialization	NaN	NaN	NaN	NaN	NaN	2.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
weighted-data	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	...	42.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN
weka	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	...	NaN	59.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
wikipedia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	2.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
windows	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	8.0	NaN	NaN	NaN	NaN	1.0	1.0
wolfram-language	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	3.0	NaN	NaN	NaN	NaN	NaN
word	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	7.0	2.0	1.0	NaN	NaN
word-embeddings	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	2.0	289.0	106.0	1.0	NaN
word2vec	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	1.0	106.0	242.0	NaN	NaN
xgboost	NaN	NaN	NaN	NaN	5.0	NaN	NaN	NaN	NaN	1.0	...	4.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	431.0	NaN
yolo	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	53.0

591 rows × 591 columns

In [63]:

final_relate['python'].sort_values(ascending=False).head(10)

Out[63]:

python              4372.0
machine-learning    1243.0
scikit-learn         672.0
pandas               592.0
keras                541.0
tensorflow           372.0
deep-learning        363.0
neural-network       357.0
classification       255.0
nlp                  229.0
Name: python, dtype: float64

The 'final_relate' table represetns the relation between tags. The value in the dataframe is the number of times that show up together. For example, 'python' was tagged with 'machine-learning' 499 times, with 'keras' 280 times, with 'pandas' 244 times, with 'scikit-learn' 235 times.

With that being said, let's dive deep into top 8 most used tags.

['python', 'machine-learning', 'deep-learning', 'neural-network, 'keras', 'tensorflow', 'classification', 'scikit-learn']

In [64]:

final_relate['python'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[64]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a5958304a8>

In [65]:

final_relate['machine-learning'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[65]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a595c89e48>

In [66]:

final_relate['deep-learning'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[66]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a59698b128>

In [67]:

final_relate['neural-network'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[67]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a59a7728d0>

In [68]:

final_relate['keras'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[68]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a59a318c18>

In [69]:

final_relate['tensorflow'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[69]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a5964fa780>

In [70]:

final_relate['classification'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[70]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a5961bf550>

In [71]:

final_relate['scikit-learn'].sort_values(ascending=False).head(10)[::-1].plot(kind='barh')

Out[71]:

<matplotlib.axes._subplots.AxesSubplot at 0x2a59a4d4198>