This notebook shows the historical count and a future estimate of the number of *.ipynb files on GitHub. The daily count comes from executing the query extension:ipynb nbformat_minor against GitHub search and recording the total number of *.ipynb file hits.
import warnings
warnings.simplefilter('ignore', FutureWarning)
%matplotlib inline
import datetime
import fbprophet
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
mpl.style.use('ggplot')
figsize = (14,7)
now = datetime.datetime.utcnow()
print(f'This notebook was last rendered at {now} UTC')
First, let's load the historical data into a DataFrame indexed by date.
hits_df = pd.read_csv('ipynb_counts.csv', index_col=0, header=0, parse_dates=True)
hits_df.reset_index(inplace=True)
hits_df.drop_duplicates(subset='date', inplace=True)
hits_df.set_index('date', inplace=True)
hits_df.sort_index(ascending=True, inplace=True)
hits_df.tail(3)
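For reference, here is a quick look at the span the data covers (a small check using the date index parsed above):
data_start, data_end = hits_df.index.min(), hits_df.index.max()
print(f'Data spans {data_start.date()} to {data_end.date()}')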
There might be missing counts for days that we failed to sample. We build up the expected date range and insert NaNs for dates we missed.
til_today = pd.date_range(hits_df.index[0], hits_df.index[-1])
hits_df = hits_df.reindex(til_today)
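As a sanity check (a minimal sketch against the reindexed frame), we can count how many expected days have no recorded sample:
n_missing = hits_df.hits.isna().sum()
print(f'{n_missing} of {len(hits_df)} expected days have no recorded count')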
Now we plot the known notebook counts.
fig, ax = plt.subplots(figsize=figsize)
ax.set_title(f'GitHub search hits for {len(hits_df)} days')
ax.plot(hits_df.hits, 'ko', markersize=1, label='hits')
ax.legend(loc='upper left')
ax.set_xlabel('Date')
ax.set_ylabel('# of ipynb files');
Next, let's look at various measurements of change.
The total change in the number of *.ipynb file hits between the first day we have data and today is:
total_delta_nbs = hits_df.iloc[-1] - hits_df.iloc[0]
total_delta_nbs
The mean daily change for the entire duration is:
avg_delta_nbs = total_delta_nbs / len(hits_df)
avg_delta_nbs
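Scaled to a 365-day year (a simple derived figure, not a forecast), that mean daily change amounts to roughly:
avg_delta_nbs * 365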
The change in hit count between any two consecutive days for which we have data looks like the following:
daily_deltas = (hits_df.hits - hits_df.hits.shift())
fig, ax = plt.subplots(figsize=figsize)
ax.plot(daily_deltas, 'ko', markersize=2)
ax.set_xlabel('Date')
ax.set_ylabel(r'$\Delta$ # of ipynb files')
ax.set_title('Day-to-Day Change');
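The spread of these day-to-day changes can also be summarized numerically (a quick sketch using pandas describe):
daily_deltas.describe()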
The large jumps in the data come from GitHub reporting drastically different counts from one day to the next. Perhaps GitHub was rebuilding a search index when we queried, or had a search broker out of sync with the others.
Let's drop outliers, defined here as day-to-day changes more than 1.5 standard deviations away from a centered 90-day rolling mean.
daily_delta_rolling = daily_deltas.rolling(window=90, min_periods=0, center=True)
outliers = abs(daily_deltas - daily_delta_rolling.mean()) > 1.5*daily_delta_rolling.std()
outliers.value_counts()
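To see which samples get flagged (a short sketch that overlays the flagged points on the raw counts from above):
fig, ax = plt.subplots(figsize=figsize)
ax.plot(hits_df.hits, 'ko', markersize=1, label='hits')
ax.plot(hits_df.hits[outliers], 'rx', markersize=6, label='flagged outliers')
ax.legend(loc='upper left')
ax.set_xlabel('Date')
ax.set_ylabel('# of ipynb files')
ax.set_title('Samples Flagged as Outliers');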
cleaned_hits_df = hits_df.copy()
cleaned_hits_df[outliers] = np.nan
cleaned_daily_deltas = (cleaned_hits_df.hits - cleaned_hits_df.hits.shift())
fig, ax = plt.subplots(figsize=figsize)
ax.plot(cleaned_daily_deltas, 'ko', markersize=2)
ax.set_xlabel('Date')
ax.set_ylabel(r'$\Delta$ # of ipynb files')
ax.set_title('Day-to-Day Change Sans Outliers');
Now let's do a simple linear interpolation for missing values and then look at the rolling mean of change.
filled_df = cleaned_hits_df.interpolate(method='time')
smoothed_daily_deltas = (filled_df.hits - filled_df.hits.shift()).rolling(window=30, min_periods=0, center=False).mean()
fig, ax = plt.subplots(figsize=figsize)
ax.plot(smoothed_daily_deltas, 'r-')
ax.set_xlabel('Date')
ax.set_ylabel(r'$\Delta$ # of ipynb files')
ax.set_title('30-Day Rolling Mean of Day-to-Day Change');
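As a quick check (a small sketch using the frames defined above), we can count how many days, missing samples plus dropped outliers, the interpolation filled in:
n_filled = cleaned_hits_df.hits.isna().sum()
print(f'{n_filled} days were filled by interpolation')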
Now let's use fbprophet to forecast growth for the upcoming year. We'll forecast from the search hit data with outliers removed but not interpolated; Prophet handles the missing values on its own.
periods = 365
def forecast(df):
    m = fbprophet.Prophet(
        interval_width=0.95,           # uncertainty interval
        changepoint_prior_scale=0.01,  # allow less flexibility in trend
        changepoint_range=0.9,         # consider changepoints in more recent data
    )
    df = df.reset_index().rename(columns={'index': 'ds', 'hits': 'y'})
    m.fit(df)
    future = m.make_future_dataframe(periods=periods)
    return m, m.predict(future)
model, forecast_df = forecast(cleaned_hits_df)
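Before plotting, it helps to peek at the key columns Prophet produces, the point forecast yhat and its uncertainty bounds:
forecast_df[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(3)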
def plot_forecast(m, fc, changepoints=False):
    fig, ax = plt.subplots(figsize=figsize)
    m.plot(fc, ax=ax)
    if changepoints:
        fbprophet.plot.add_changepoints_to_plot(fig.gca(), m, fc)
    ax.set_xlabel('Date')
    ax.set_ylabel('# ipynb files')
    ax.minorticks_on()
    ax.legend(loc='upper left')
    ax.set_title(f'GitHub search hits predicted until {fc.iloc[-1].ds.date()} (95% confidence interval)')
plot_forecast(model, forecast_df, changepoints=False)
Now we can plot the components of the model. The weekly component appears to track the work week, while the yearly component appears to follow a traditional academic calendar in the northern hemisphere.
_ = fbprophet.plot.plot_components(model, forecast_df, figsize=figsize)
We'll use Prophet's cross validation function to measure the root mean square error for forecasts overlapping with past data.
cv_df = fbprophet.diagnostics.cross_validation(model, horizon='365 days', initial='730 days', period='90 days')
fig, ax = plt.subplots(figsize=figsize)
fbprophet.plot.plot_cross_validation_metric(cv_df, metric='rmse', ax=ax)
ax.set_title('Root Mean Square Error')
ax.minorticks_on();
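The same cross-validation results can also be summarized numerically with Prophet's performance_metrics helper (a short sketch using its default rolling window):
pm_df = fbprophet.diagnostics.performance_metrics(cv_df, metrics=['rmse'])
pm_df.tail(3)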
Finally, it's nice to celebrate million-notebook milestones. We can use our model to predict when they're going to occur.
combined_df = pd.concat([cleaned_hits_df.reset_index(drop=True).rename(columns={'hits': 'y'}), forecast_df], axis=1)
rows = []
cols = {'y': 'actual', 'yhat_upper': 'optimistic', 'yhat': 'predicted', 'yhat_lower': 'conservative'}
for i in range(1, 11):
    milestone = i * 1e6
    row = {'milestone': milestone}
    for col in cols:
        gt_df = combined_df[combined_df[col] > milestone]
        if len(gt_df):
            row[col] = gt_df.iloc[0].ds
    rows.append(row)
pd.DataFrame(rows, columns=['milestone']+list(cols.keys())).rename(columns=cols)