Data Science Articles

Author: Khuyen Tran

In [1]:
import datapane as dp 
import pandas as pd
import numpy as np

# Load data from dp.Blob
medium = dp.Blob.get(name='medium', owner='khuyentran1401').download_df()
In [2]:
medium.head(10)
Out[2]:
Title Subtitle Image Author Publication Year Month Day Tag Reading_Time Claps Comment url Author_url
0 Apply and Lambda usage in pandas Learn these to master Pandas 1 Rahul Agarwal Towards Data Science 2019 7 1 data_science 6 1.5K 0 https://towardsdatascience.com/apply-and-lambd... https://towardsdatascience.com/@rahul_agarwal?...
1 Jupyter is the new Excel (but not for your boss) nan 1 Dan Lester Towards Data Science 2019 7 1 data_science 10 1.5K 0 https://towardsdatascience.com/jupyter-is-the-... https://towardsdatascience.com/@dan_19973?sour...
2 Fuzzy matching at scale From 3.7 hours to 0.2 seconds. How to perform ... 1 Josh Taylor Towards Data Science 2019 7 1 data_science 7 547 0 https://towardsdatascience.com/fuzzy-matching-... https://towardsdatascience.com/@thejoshtaylor?...
3 Artificial Intelligence in Video Games An overview of how video game A.I. has develop... 1 Laura E Shummon Maass Towards Data Science 2019 7 1 data_science 14 265 0 https://towardsdatascience.com/artificial-inte... https://towardsdatascience.com/@laurashummonma...
4 Affinity Propagation Algorithm Explained Affinity Propagation was first published in 20... 1 Cory Maklin Towards Data Science 2019 7 1 data_science 6 92 0 https://towardsdatascience.com/unsupervised-ma... https://towardsdatascience.com/@corymaklin?sou...
5 Deploying Models to Flask A walk-through on how to deploy machine learni... 1 Jeremy Chow Towards Data Science 2019 7 1 data_science 8 859 0 https://towardsdatascience.com/deploying-model... https://towardsdatascience.com/@jeremyrchow?so...
6 AI, Machine Learning, Deep Learning Explained ... Supervised ML, Unsupervised ML, Reinforcement 1 Jun Wu Towards Data Science 2019 7 1 data_science 7 406 0 https://towardsdatascience.com/ai-machine-lear... https://towardsdatascience.com/@junwu_46652?so...
7 Tweepy for beginners Using Twitters API to build your own data set 1 Richard Chadwick Towards Data Science 2019 7 1 data_science 7 260 0 https://towardsdatascience.com/tweepy-for-begi... https://towardsdatascience.com/@richchad?sourc...
8 BIRCH Clustering Algorithm Example In Python Existing data clustering methods do not adequa... 1 Cory Maklin Towards Data Science 2019 7 1 data_science 6 100 0 https://towardsdatascience.com/machine-learnin... https://towardsdatascience.com/@corymaklin?sou...
9 Zomato, Bangalore Data Analysis What and where to eat in Bangalorea data scien... 1 Shubhankar Rawat Towards Data Science 2019 7 1 data_science 15 190 0 https://towardsdatascience.com/zomato-bangalor... https://towardsdatascience.com/@shubhankarrawa...
In [3]:
medium = medium.replace('nan', np.nan)
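The effect of the replacement above can be checked on a toy frame. This is a minimal sketch with made-up values, assuming (as the `head` output suggests) that missing subtitles are stored as the literal string 'nan' rather than as real missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one subtitle is the literal string 'nan'
toy = pd.DataFrame({'Title': ['A', 'B'],
                    'Subtitle': ['Learn pandas', 'nan']})

# The string 'nan' is an ordinary value, so pandas does not count it as missing
missing_before = toy['Subtitle'].isna().sum()

# Exact-match replacement of the string with a real missing value
toy = toy.replace('nan', np.nan)
missing_after = toy['Subtitle'].isna().sum()

print(missing_before, missing_after)
```

After the replacement, methods such as `info()`, `isna()` and `dropna()` treat those cells as missing.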
In [4]:
# Drop duplicate rows (same title, subtitle, author, date, and tag)
medium = medium.drop_duplicates(subset=['Title', 'Subtitle', 'Author', 'Year',
                                        'Month', 'Day', 'Tag'])
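A small sketch of how `drop_duplicates` behaves with `subset`, using a hypothetical toy frame with fewer columns than the real data: rows count as duplicates when they agree on the listed columns, even if other columns (here `Claps`) differ, and the first occurrence is kept.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 differ only in 'Claps'
toy = pd.DataFrame({
    'Title':  ['Fuzzy matching', 'Fuzzy matching', 'Fuzzy matching'],
    'Author': ['Josh', 'Josh', 'Josh'],
    'Tag':    ['data_science', 'data_science', 'python'],
    'Claps':  ['547', '550', '547'],
})

# Only the columns in `subset` are compared; the first matching row is kept
deduped = toy.drop_duplicates(subset=['Title', 'Author', 'Tag'])
print(deduped)
```

The surviving rows keep their original index labels, which is why the index of a deduplicated frame can run past its row count.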
In [5]:
medium.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 147392 entries, 0 to 148139
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype   
---  ------        --------------   -----   
 0   Title         138215 non-null  object  
 1   Subtitle      88691 non-null   object  
 2   Image         147392 non-null  uint8   
 3   Author        147294 non-null  object  
 4   Publication   71402 non-null   category
 5   Year          147392 non-null  uint16  
 6   Month         147392 non-null  uint8   
 7   Day           147392 non-null  uint8   
 8   Tag           147392 non-null  category
 9   Reading_Time  147392 non-null  uint8   
 10  Claps         147392 non-null  category
 11  Comment       147392 non-null  uint8   
 12  url           147392 non-null  object  
 13  Author_url    147294 non-null  object  
dtypes: category(3), object(5), uint16(1), uint8(5)
memory usage: 8.8+ MB

Visualize Tags

In [6]:
import plotly.express as px

# Save the charts to build an interactive report later
charts = []

# Count the articles per tag once and reuse the result for both axes
tag_counts = medium.Tag.value_counts()

tag_plot = px.bar(x=tag_counts.index,
                  y=tag_counts.values,
                  labels={'x': 'Tags',
                          'y': 'Number of Articles'},
                  title='Number of articles in each data science-related topic')
tag_plot
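The counts behind the bar chart can be inspected on their own. The sketch below uses a hypothetical toy `Tag` column and reshapes the result of `value_counts()` into a two-column frame; the column names `Tags` and `Number of Articles` are chosen here to match the chart's axis labels.

```python
import pandas as pd

# Toy data (hypothetical tags)
toy = pd.DataFrame({'Tag': ['python', 'data_science', 'python',
                            'ai', 'python', 'ai']})

# value_counts() sorts tags from most to least frequent
tag_counts = (toy['Tag'].value_counts()
              .rename_axis('Tags')
              .reset_index(name='Number of Articles'))
print(tag_counts)
```

A frame in this shape could also be passed to Plotly Express directly, for example `px.bar(tag_counts, x='Tags', y='Number of Articles')`, so the axis labels come from the column names.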