In this notebook, we will analyze my Medium article stats. The functions for scraping and formatting the data were developed in the Development
notebook, and here we will focus on looking at the data quantitatively and visually.
To apply this analysis to your own Medium data, save your stats page as stats.html in the data/ directory. You can also save the responses page to do a similar analysis.
# Might need to run this on a Mac for multiprocessing to work properly
# see https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
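Alternatively, the same variable can be set from inside Python; this should have the same effect, provided it runs before any worker pool is created.
# Equivalent workaround from inside the notebook: must run before the
# multiprocessing pool is created, so the forked workers inherit it
import os
os.environ['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'] = 'YES'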
For any of the figures, I recommend opening them in plotly and touching them up. plotly is an incredible library, and I highly recommend it as a replacement for whatever plotting library you are currently using.
Thanks to a few functions already developed, you can get all of the statistics for your articles in under 10 seconds.
from retrieval import process_in_parallel, get_table_rows
table_rows = get_table_rows(fname='stats.html')
Found 121 entries in table.
Each of these entries is a separate article. To get the information about each article, we use the next function. This scrapes both the article metadata and the article itself (using requests and BeautifulSoup).
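Under the hood, the scraping for each article looks roughly like this. This is a simplified sketch, not the actual implementation in retrieval.py (which parses many more fields), and the urls list here is a hypothetical stand-in for the links extracted from the table rows.
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool

def scrape_article(url):
    """Fetch one article page and pull a couple of simple fields (illustrative)."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.text if soup.title else None
    # Rough word count from the visible text of the page
    word_count = len(soup.get_text().split())
    return {'url': url, 'title': title, 'word_count': word_count}

urls = ['https://medium.com/@user/an-article']  # hypothetical; taken from table_rows in practice
# Scraping is I/O-bound, so a pool of workers gives a large speedup
with Pool(processes=25) as pool:
    results = pool.map(scrape_article, urls)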
df = process_in_parallel(table_rows=table_rows, processes=25)
df.head()
Processed 121 articles in 6.28 seconds.
claps | days_since_publication | fans | num_responses | publication | published_date | read_ratio | read_time | reads | started_date | ... | type | views | word_count | claps_per_word | editing_days | <tag>Education | <tag>Data Science | <tag>Towards Data Science | <tag>Machine Learning | <tag>Python | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
116 | 2 | 569.141963 | 2 | 0 | None | 2017-06-10 14:25:00 | 41.61 | 7 | 67 | 2017-06-10 14:24:00 | ... | published | 161 | 1859 | 0.001076 | 0 | 0 | 0 | 0 | 0 | 0 |
114 | 18 | 561.824008 | 3 | 0 | None | 2017-06-17 22:02:00 | 33.12 | 14 | 52 | 2017-06-17 22:02:00 | ... | published | 157 | 3891 | 0.004626 | 0 | 0 | 0 | 0 | 0 | 0 |
117 | 50 | 549.204130 | 19 | 0 | None | 2017-06-30 12:55:00 | 20.29 | 42 | 213 | 2017-06-30 12:00:00 | ... | published | 1050 | 12025 | 0.004158 | 0 | 0 | 0 | 0 | 1 | 1 |
111 | 0 | 548.361527 | 0 | 0 | None | 2017-07-01 09:08:00 | 36.54 | 9 | 19 | 2017-06-30 18:21:00 | ... | published | 52 | 2533 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0 |
109 | 0 | 544.373876 | 0 | 0 | None | 2017-07-05 08:51:00 | 8.93 | 14 | 5 | 2017-07-03 20:18:00 | ... | published | 56 | 3892 | 0.000000 | 1 | 0 | 0 | 0 | 0 | 0 |
5 rows × 24 columns
With this comprehensive data, we can do any sort of analysis we want. There's a lot here, and I'm sure you'll find other interesting angles to explore.
# Data science imports
import pandas as pd
import numpy as np
%load_ext autoreload
%autoreload 2
# Options for pandas
pd.options.display.max_columns = 25
# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
We can start off by looking at correlations. We'll limit this to the published articles for now.
corrs = df[df['type'] == 'published'].corr()
corrs.round(2)
claps | days_since_publication | fans | num_responses | read_ratio | read_time | reads | title_word_count | views | word_count | claps_per_word | editing_days | <tag>Education | <tag>Data Science | <tag>Towards Data Science | <tag>Machine Learning | <tag>Python | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
claps | 1.00 | -0.13 | 0.99 | 0.89 | -0.05 | -0.11 | 0.75 | 0.09 | 0.73 | -0.12 | 0.76 | -0.01 | 0.27 | 0.36 | 0.53 | 0.18 | 0.26 |
days_since_publication | -0.13 | 1.00 | -0.14 | -0.08 | 0.03 | 0.36 | 0.05 | -0.31 | 0.04 | 0.33 | -0.08 | -0.11 | -0.76 | -0.43 | -0.39 | -0.11 | 0.04 |
fans | 0.99 | -0.14 | 1.00 | 0.87 | -0.07 | -0.11 | 0.76 | 0.10 | 0.75 | -0.12 | 0.73 | -0.00 | 0.27 | 0.37 | 0.54 | 0.20 | 0.26 |
num_responses | 0.89 | -0.08 | 0.87 | 1.00 | 0.05 | -0.14 | 0.76 | 0.03 | 0.69 | -0.15 | 0.80 | -0.05 | 0.19 | 0.33 | 0.50 | 0.09 | 0.27 |
read_ratio | -0.05 | 0.03 | -0.07 | 0.05 | 1.00 | -0.60 | -0.02 | 0.01 | -0.20 | -0.53 | 0.27 | 0.10 | 0.09 | -0.02 | -0.12 | -0.34 | -0.27 |
read_time | -0.11 | 0.36 | -0.11 | -0.14 | -0.60 | 1.00 | -0.08 | -0.13 | 0.03 | 0.96 | -0.24 | -0.06 | -0.42 | -0.22 | -0.15 | 0.19 | 0.26 |
reads | 0.75 | 0.05 | 0.76 | 0.76 | -0.02 | -0.08 | 1.00 | 0.01 | 0.93 | -0.11 | 0.53 | -0.08 | -0.01 | 0.36 | 0.32 | 0.22 | 0.37 |
title_word_count | 0.09 | -0.31 | 0.10 | 0.03 | 0.01 | -0.13 | 0.01 | 1.00 | 0.01 | -0.14 | 0.09 | -0.02 | 0.33 | 0.13 | 0.32 | 0.27 | 0.24 |
views | 0.73 | 0.04 | 0.75 | 0.69 | -0.20 | 0.03 | 0.93 | 0.01 | 1.00 | -0.01 | 0.36 | -0.06 | -0.03 | 0.33 | 0.31 | 0.31 | 0.41 |
word_count | -0.12 | 0.33 | -0.12 | -0.15 | -0.53 | 0.96 | -0.11 | -0.14 | -0.01 | 1.00 | -0.23 | 0.00 | -0.38 | -0.21 | -0.14 | 0.16 | 0.17 |
claps_per_word | 0.76 | -0.08 | 0.73 | 0.80 | 0.27 | -0.24 | 0.53 | 0.09 | 0.36 | -0.23 | 1.00 | -0.06 | 0.24 | 0.27 | 0.35 | -0.03 | 0.18 |
editing_days | -0.01 | -0.11 | -0.00 | -0.05 | 0.10 | -0.06 | -0.08 | -0.02 | -0.06 | 0.00 | -0.06 | 1.00 | 0.20 | -0.00 | 0.12 | 0.05 | -0.05 |
<tag>Education | 0.27 | -0.76 | 0.27 | 0.19 | 0.09 | -0.42 | -0.01 | 0.33 | -0.03 | -0.38 | 0.24 | 0.20 | 1.00 | 0.38 | 0.45 | 0.12 | -0.06 |
<tag>Data Science | 0.36 | -0.43 | 0.37 | 0.33 | -0.02 | -0.22 | 0.36 | 0.13 | 0.33 | -0.21 | 0.27 | -0.00 | 0.38 | 1.00 | 0.34 | 0.27 | 0.05 |
<tag>Towards Data Science | 0.53 | -0.39 | 0.54 | 0.50 | -0.12 | -0.15 | 0.32 | 0.32 | 0.31 | -0.14 | 0.35 | 0.12 | 0.45 | 0.34 | 1.00 | 0.21 | 0.19 |
<tag>Machine Learning | 0.18 | -0.11 | 0.20 | 0.09 | -0.34 | 0.19 | 0.22 | 0.27 | 0.31 | 0.16 | -0.03 | 0.05 | 0.12 | 0.27 | 0.21 | 1.00 | 0.30 |
<tag>Python | 0.26 | 0.04 | 0.26 | 0.27 | -0.27 | 0.26 | 0.37 | 0.24 | 0.41 | 0.17 | 0.18 | -0.05 | -0.06 | 0.05 | 0.19 | 0.30 | 1.00 |
If we are looking at maximizing claps, what do we want to focus on?
corrs['claps'].sort_values(ascending=False)
claps                        1.000000
fans                         0.992251
num_responses                0.893159
claps_per_word               0.762564
reads                        0.749108
views                        0.725450
<tag>Towards Data Science    0.533000
<tag>Data Science            0.363490
<tag>Education               0.267466
<tag>Python                  0.255534
<tag>Machine Learning        0.182377
title_word_count             0.088137
editing_days                -0.013607
read_ratio                  -0.052946
read_time                   -0.112306
word_count                  -0.116210
days_since_publication      -0.126835
Name: claps, dtype: float64
Okay, so most of these quantities are only measured after the article is released, so they can't be controlled ahead of time. However, the tag Towards Data Science seems to help quite a bit! It also looks like the read time is negatively correlated with the number of claps.
Using the plotly Python library, we can very rapidly create interactive, great-looking charts. Here are the available colorscales if you want to try others:
colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu',
'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet',
'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']
figure = ff.create_annotated_heatmap(z=corrs.round(2).values,
                                     x=list(corrs.columns),
                                     y=list(corrs.index),
                                     colorscale='Portland',
                                     annotation_text=corrs.round(2).values)
iplot(figure)
Correlations by themselves don't tell us that much, and most of these are pretty obvious: of course claps and fans are highly correlated. Sometimes correlations alone are useful, but not really in this case.
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'type']],
index = 'type', colormap='Jet', title='Scatterplot Matrix by Type',
diag='histogram', width=800, height=800)
iplot(figure)
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'publication']],
index = 'publication', title='Scatterplot Matrix by Publication',
diag='histogram', width=800, height=800)
iplot(figure)
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'views',
'num_responses', 'publication']],
index = 'publication',
diag='histogram',
size=8, width=1000, height=1000,
title='Scatterplot Matrix by Publication')
iplot(figure)
figure = ff.create_scatterplotmatrix(df[['read_time', 'views', 'read_ratio', 'publication']],
index = 'publication',
diag='histogram',
size=8, width=1000, height=1000,
title='Scatterplot Matrix by Publication')
iplot(figure)
from visuals import make_hist
figure = make_hist(df, x='views', category='publication')
iplot(figure)
figure = make_hist(df, x='word_count', category='type')
iplot(figure)
figure = make_hist(df, x='claps')
iplot(figure)
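make_hist is a thin wrapper around plotly. A minimal sketch of the idea, assuming the real version in visuals.py handles colors and layout more carefully:
import plotly.graph_objs as go

def make_hist_sketch(df, x, category=None):
    """Minimal sketch of a (possibly grouped) histogram figure."""
    if category is not None:
        # One semi-transparent trace per category value, overlaid
        data = [go.Histogram(x=df.loc[df[category] == val, x],
                             name=str(val), opacity=0.6)
                for val in df[category].unique()]
    else:
        data = [go.Histogram(x=df[x])]
    layout = go.Layout(barmode='overlay', xaxis=dict(title=x),
                       yaxis=dict(title='count'))
    return go.Figure(data=data, layout=layout)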
from visuals import make_cum_plot
figure = make_cum_plot(df, y='views')
iplot(figure)
figure = make_cum_plot(df, y='word_count')
iplot(figure)
figure = make_cum_plot(df, y='views', category='publication')
iplot(figure)
figure = make_cum_plot(df, y=['word_count', 'views'])
iplot(figure)
figure = make_cum_plot(df, y=['views', 'reads'])
iplot(figure)
The neat part about plotly is that we can easily add more elements to our plots. For example, to add a range selector and a range slider, we just pass an extra parameter to the function.
figure = make_cum_plot(df, 'word_count', ranges=True)
iplot(figure)
figure = make_cum_plot(df, 'read_time', ranges=True)
iplot(figure)
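In raw plotly terms, ranges=True presumably just attaches a range selector and a range slider to the x-axis of the layout, something like:
# What `ranges=True` likely adds: preset zoom buttons plus a
# draggable slider under the date axis
figure['layout']['xaxis'] = dict(
    rangeselector=dict(buttons=[
        dict(count=1, label='1m', step='month', stepmode='backward'),
        dict(count=6, label='6m', step='month', stepmode='backward'),
        dict(step='all')]),
    rangeslider=dict(visible=True),
    type='date')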
from visuals import make_scatter_plot
figure = make_scatter_plot(df, x='read_time', y='read_ratio')
iplot(figure)
figure = make_scatter_plot(df, x='read_time', y='read_ratio', category='type')
iplot(figure)
figure = make_scatter_plot(df, x='read_time', y='views', ylog=True,
category='type')
iplot(figure)
figure = make_scatter_plot(df, x='read_time', y='views', ylog=True,
scale='read_ratio', sizeref=0.2)
iplot(figure)
# Bin the read ratio into 10-percentage-point buckets for categorical coloring
df['binned_ratio'] = pd.cut(df['read_ratio'], list(range(0, 100, 10))).astype('str')
# Bin claps into logarithmic buckets (1, 10, 100, ..., 100000)
df['binned_claps'] = pd.cut(df['claps'], list(np.insert(np.logspace(start=0, stop=5, num=6), 0, -1).astype(int))).astype(str)
figure = make_scatter_plot(df, x='word_count', y='fans',
scale='claps', sizeref=5)
iplot(figure)
figure = make_scatter_plot(df, x='word_count', y='reads', xlog=True,
scale='claps', sizeref=3)
iplot(figure)
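The scale and sizeref arguments map a third column onto the marker size. In raw plotly this corresponds to the marker settings below (a sketch, under the assumption that make_scatter_plot builds something similar):
import plotly.graph_objs as go

# Encode a third column (claps) as bubble area on a log-x scatter
trace = go.Scatter(x=df['word_count'], y=df['reads'], mode='markers',
                   text=df['claps'],
                   marker=dict(size=df['claps'], sizemode='area',
                               sizeref=3, sizemin=4))
layout = go.Layout(xaxis=dict(title='word_count', type='log'),
                   yaxis=dict(title='reads'))
iplot(go.Figure(data=[trace], layout=layout))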
For the linear regressions, we'll focus on articles that were published in Towards Data Science. This makes the relationships clearer because the other articles are a mixed bag. We'll start off using a single variable - univariate - and focusing on linear relationships.
tds = df[df['publication'] == 'Towards Data Science'].copy()
figure = make_scatter_plot(tds, 'word_count', 'views')
iplot(figure)
Let's do a regression of the number of words versus the views for articles published in Towards Data Science. We are using statsmodels.api.OLS without a constant term, which sets the intercept to 0. I made this choice because the number of views can never be negative (sometimes we do need an intercept, so I left this as a parameter).
import statsmodels.api as sm
lin_reg = sm.OLS(tds['views'], tds['word_count']).fit()
lin_reg.summary()
Dep. Variable: | views | R-squared: | 0.502 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.495 |
Method: | Least Squares | F-statistic: | 78.50 |
Date: | Mon, 31 Dec 2018 | Prob (F-statistic): | 2.02e-13 |
Time: | 17:49:30 | Log-Likelihood: | -935.54 |
No. Observations: | 79 | AIC: | 1873. |
Df Residuals: | 78 | BIC: | 1875. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
word_count | 13.3585 | 1.508 | 8.860 | 0.000 | 10.357 | 16.360 |
Omnibus: | 17.663 | Durbin-Watson: | 1.804 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 20.864 |
Skew: | 1.166 | Prob(JB): | 2.95e-05 |
Kurtosis: | 3.949 | Cond. No. | 1.00 |
This tells us that for every extra word, I get 13 more views! If we look at the plot, there is one outlying data point beyond 5000 words. What happens if I stick to articles under 5000 words published on Towards Data Science?
tds_clean = tds[tds['word_count'] < 5000].copy()
lin_reg = sm.OLS(tds_clean['views'], tds_clean['word_count']).fit()
lin_reg.summary()
Dep. Variable: | views | R-squared: | 0.522 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.516 |
Method: | Least Squares | F-statistic: | 84.10 |
Date: | Mon, 31 Dec 2018 | Prob (F-statistic): | 5.62e-14 |
Time: | 17:49:31 | Log-Likelihood: | -922.54 |
No. Observations: | 78 | AIC: | 1847. |
Df Residuals: | 77 | BIC: | 1849. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
word_count | 14.0089 | 1.528 | 9.171 | 0.000 | 10.967 | 17.051 |
Omnibus: | 18.017 | Durbin-Watson: | 1.588 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 21.482 |
Skew: | 1.204 | Prob(JB): | 2.16e-05 |
Kurtosis: | 3.902 | Cond. No. | 1.00 |
Now we see that for every extra word, I get 14 more views! However, it looks like I want to keep my articles under 5000 words (about a 25 minute reading time).
If we want to fit a model with an intercept, we can use scipy.stats.linregress.
figure = make_scatter_plot(tds_clean, 'read_time', 'read_ratio')
iplot(figure)
from scipy import stats
stats.linregress(tds_clean['read_time'], tds_clean['read_ratio'])
LinregressResult(slope=-2.3226617522329582, intercept=53.29509659584714, rvalue=-0.7752588331903641, pvalue=7.99685522357628e-17, stderr=0.21707239892501062)
This time, we see that for every additional minute of reading time, the percentage of people who read the article declines by 2.3%. For an article with a 0 minute reading time, 53% of people will read it!
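For reference, the same intercept fit can also be done in statsmodels by adding an explicit constant column with sm.add_constant; this should recover the same slope and intercept as linregress above.
# Intercept fit in statsmodels: prepend a constant column to the design matrix
X = sm.add_constant(tds_clean['read_time'])
lin_reg_intercept = sm.OLS(tds_clean['read_ratio'], X).fit()
lin_reg_intercept.params  # const ~ 53.3, read_time ~ -2.32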
Let's take a look at a few different fits.
from visuals import make_linear_regression
figure, summary = make_linear_regression(tds_clean, x='word_count', y='views', intercept_0=True)
iplot(figure)
summary
Dep. Variable: | views | R-squared: | 0.522 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.516 |
Method: | Least Squares | F-statistic: | 84.10 |
Date: | Mon, 31 Dec 2018 | Prob (F-statistic): | 5.62e-14 |
Time: | 17:49:36 | Log-Likelihood: | -922.54 |
No. Observations: | 78 | AIC: | 1847. |
Df Residuals: | 77 | BIC: | 1849. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
word_count | 14.0089 | 1.528 | 9.171 | 0.000 | 10.967 | 17.051 |
Omnibus: | 18.017 | Durbin-Watson: | 1.937 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 21.482 |
Skew: | 1.204 | Prob(JB): | 2.16e-05 |
Kurtosis: | 3.902 | Cond. No. | 1.00 |
# Copy read_ratio under a clearer name for the regression
tds_clean['read_pct'] = list(tds_clean['read_ratio'])
figure, summary = make_linear_regression(tds_clean, x='read_time', y='read_pct', intercept_0=False)
iplot(figure)
summary
param | value | |
---|---|---|
0 | pvalue | 7.996855e-17 |
1 | rvalue | -7.752588e-01 |
2 | slope | -2.322662e+00 |
3 | intercept | 5.329510e+01 |
figure, summary = make_linear_regression(tds_clean, x='title_word_count', y='fans', intercept_0=True)
iplot(figure)
summary
Dep. Variable: | fans | R-squared: | 0.403 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.396 |
Method: | Least Squares | F-statistic: | 52.05 |
Date: | Mon, 31 Dec 2018 | Prob (F-statistic): | 3.25e-10 |
Time: | 17:49:37 | Log-Likelihood: | -603.60 |
No. Observations: | 78 | AIC: | 1209. |
Df Residuals: | 77 | BIC: | 1212. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
title_word_count | 51.7363 | 7.171 | 7.215 | 0.000 | 37.457 | 66.015 |
Omnibus: | 22.055 | Durbin-Watson: | 2.394 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 29.671 |
Skew: | 1.267 | Prob(JB): | 3.61e-07 |
Kurtosis: | 4.645 | Cond. No. | 1.00 |
This clearly is not the best fit!
Next, we'll let the degree of the fit increase above 1. Overfitting (especially with limited data) is definitely going to be the outcome, but we'll let this serve as a lesson about having too many parameters in your model!
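make_poly_fits presumably relies on numpy's polynomial fitting. A single fit of the kind it produces looks roughly like this (a sketch, assuming np.polyfit under the hood):
# One polynomial fit plus its RMSE, as make_poly_fits computes for each degree
degree = 3
coeffs = np.polyfit(tds_clean['word_count'], tds_clean['reads'], deg=degree)
poly = np.poly1d(coeffs)
predictions = poly(tds_clean['word_count'])
rmse = np.sqrt(np.mean((predictions - tds_clean['reads']) ** 2))
print(f'Degree {degree} RMSE: {rmse:.0f}')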
from visuals import make_poly_fits
figure, fit_stats = make_poly_fits(tds_clean, x='word_count', y='reads', degree=6)
fit_stats
fit | rmse | params | |
---|---|---|---|
0 | fit degree = 1 | 76472.581041 | [1.170106685608046, 6119.608859797098] |
1 | fit degree = 2 | 75872.518369 | [-0.0009811189543149205, 5.808790215202399, 15... |
2 | fit degree = 3 | 75821.238490 | [-2.4384184596157315e-07, 0.000735343641687801... |
3 | fit degree = 4 | 72017.556993 | [1.684312966169666e-09, -1.5717944375583822e-0... |
4 | fit degree = 5 | 67571.317247 | [1.6691066484561153e-12, -1.7949974951244516e-... |
5 | fit degree = 6 | 67090.118088 | [-5.282716312236527e-16, 9.034780377254753e-12... |
iplot(figure)
tds_clean['log_views'] = np.log10(tds_clean['views'])
figure, fit_stats = make_poly_fits(tds_clean, x='word_count', y='log_views', degree=15)
fit_stats
iplot(figure)
figure, fit_stats = make_poly_fits(tds_clean, x='title_word_count', y='fans', degree=10)
iplot(figure)
Next, we'll consider more independent variables in our model. For this, we need to break out the exceptional Scikit-Learn library. We'll use linear_model.LinearRegression, which supports multiple independent variables.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
x = ['read_time', 'editing_days', 'title_word_count']
x.extend(c for c in df.columns if '<tag>' in c)
x
['read_time', 'editing_days', 'title_word_count', '<tag>Education', '<tag>Data Science', '<tag>Towards Data Science', '<tag>Machine Learning', '<tag>Python']
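(The <tag> columns themselves are one-hot indicators built during scraping. If you had to recreate them from a list-of-tags column, it would look roughly like this, where tags is a hypothetical column name, not one in the real dataframe:)
# Hypothetical reconstruction of the <tag> indicator columns
# ('tags' is an assumed column of tag lists, not in the real dataframe)
for tag in ['Education', 'Data Science', 'Towards Data Science',
            'Machine Learning', 'Python']:
    df[f'<tag>{tag}'] = df['tags'].apply(lambda tag_list: int(tag in tag_list))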
lin_model = LinearRegression()
lin_model.fit(tds[x], tds['reads'])
/usr/local/lib/python3.6/site-packages/sklearn/linear_model/base.py:509: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
lin_model = LinearRegression()
lin_model.fit(tds[x], tds['reads'])
slopes, intercept = lin_model.coef_, lin_model.intercept_
fit = lin_model.predict(tds[x])
r2 = lin_model.score(tds[x], tds['reads'])
rmse = np.sqrt(mean_squared_error(y_true=tds['reads'], y_pred=fit))
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
for p, s in zip(x, slopes):
print(f'Independent Variable: {p.replace("_", " ").title():25} Slope: {s:.2f}')
print(f'Intercept: {intercept:.2f}')
print(f'\nCoefficient of Determination: {r2:.2f}')
print(f'RMSE: {rmse:.2f}')
Independent Variable: Read Time                 Slope: -104.48
Independent Variable: Editing Days              Slope: -273.31
Independent Variable: Title Word Count          Slope: -513.13
Independent Variable: <Tag>Education            Slope: -6884.35
Independent Variable: <Tag>Data Science         Slope: 1751.44
Independent Variable: <Tag>Towards Data Science Slope: 3874.60
Independent Variable: <Tag>Machine Learning     Slope: 1027.65
Independent Variable: <Tag>Python               Slope: 7062.91
Intercept: 13383.85

Coefficient of Determination: 0.37
RMSE: 6934.72
We can see that some variables contribute positively to the number of reads, while others decrease the number of reads! Evidently, I should decrease the reading time, not use the tag education, and use the tags Towards Data Science and Python.
figure, summary = make_linear_regression(tds, x=x, y='reads', intercept_0=False)
iplot(figure)
summary
name | value | |
---|---|---|
0 | r2 | 0.366392 |
1 | rmse | 6934.723627 |
2 | intercept | 13383.854227 |
3 | read_time | -104.478607 |
4 | editing_days | -273.309195 |
5 | title_word_count | -513.130674 |
6 | <tag>Education | -6884.350279 |
7 | <tag>Data Science | 1751.443700 |
8 | <tag>Towards Data Science | 3874.601135 |
9 | <tag>Machine Learning | 1027.652389 |
10 | <tag>Python | 7062.905683 |
figure, summary = make_linear_regression(tds, x=x, y='fans', intercept_0=False)
iplot(figure)
summary
name | value | |
---|---|---|
0 | r2 | 0.239216 |
1 | rmse | 432.978179 |
2 | intercept | 226.705249 |
3 | read_time | -0.079675 |
4 | editing_days | -10.860046 |
5 | title_word_count | -27.733488 |
6 | <tag>Education | 48.315045 |
7 | <tag>Data Science | 134.080643 |
8 | <tag>Towards Data Science | 409.148274 |
9 | <tag>Machine Learning | 50.524024 |
10 | <tag>Python | 248.587705 |
The most fun part of this is extrapolating wildly into the future! Using the past stats, we can make estimates for the future based on the number of days since publication.
from visuals import make_extrapolation
figure, future_df = make_extrapolation(tds, y='reads', years=1.5, degree=3)
iplot(figure)
figure, future_df = make_extrapolation(df, y='word_count', years=2.5, degree=3)
iplot(figure)
figure, future_df = make_extrapolation(df, 'read_time', years=1, degree=3)
iplot(figure)
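make_extrapolation presumably fits a polynomial to the cumulative totals against elapsed days and then evaluates it on future dates. A rough sketch of the idea, assuming started_date was parsed as a datetime:
# Sketch of the extrapolation: fit cumulative reads against elapsed days,
# then evaluate the polynomial on future days
ordered = tds.sort_values('started_date')
days = (ordered['started_date'] - ordered['started_date'].min()).dt.days.values
cum_reads = ordered['reads'].cumsum().values

coeffs = np.polyfit(days, cum_reads, deg=3)
poly = np.poly1d(coeffs)

future_days = np.arange(days.max(), days.max() + int(1.5 * 365))
predicted_reads = poly(future_days)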
Well, that's about all I have! There is a lot of additional analysis that could be done here, and going forward I'll keep developing these functions and trying to extract more information. Feel free to use these functions on your own articles, and of course contribute as needed! Developing this library has been enjoyable, and I look forward to expanding it, so any suggestions are welcome and appreciated.