from IPython.display import Markdown
Markdown(filename='README.md')
ipynb files, stayed there for a day or so, and then began climbing again from that new origin.
*.ipynb files on GitHub.
%matplotlib inline
import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
mpl.style.use('ggplot')
figsize = (14,7)
now = datetime.datetime.utcnow()
print(f'This notebook was last rendered at {now} UTC')
This notebook was last rendered at 2024-03-28 05:08:45.361769 UTC
First, let's load the historical data into a DataFrame indexed by date.
hits_df = pd.read_csv('ipynb_counts.csv', index_col=0, header=0, parse_dates=True)
hits_df.reset_index(inplace=True)
hits_df.drop_duplicates(subset='date', inplace=True)
hits_df.set_index('date', inplace=True)
hits_df.sort_index(ascending=True, inplace=True)
hits_df.tail(3)
date | hits |
---|---|
2024-03-26 | 1084 |
2024-03-27 | 960 |
2024-03-28 | 976 |
There might be missing counts for days that we failed to sample. We build up the expected date range and insert NaNs for dates we missed.
til_today = pd.date_range(hits_df.index[0], hits_df.index[-1])
hits_df = hits_df.reindex(til_today)
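To make the effect of the reindexing step concrete, here is a minimal sketch on a made-up three-row frame. The dates and counts are illustrative assumptions, not values from `ipynb_counts.csv`, and the names `sample`, `full_range`, and `filled` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with 2024-01-03 missing; values are made up for illustration.
sample = pd.DataFrame(
    {'hits': [10, 12, 15]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04']),
)

# Build the full daily range, then reindex: dates absent from the
# original index become rows whose 'hits' value is NaN.
full_range = pd.date_range(sample.index[0], sample.index[-1])
filled = sample.reindex(full_range)
print(filled)
```

Because the missing day becomes NaN rather than being dropped, later plots and differences stay aligned to the calendar.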
Now we plot the known notebook counts.
fig, ax = plt.subplots(figsize=figsize)
ax.set_title(f'GitHub search hits for {len(hits_df)} days')
ax.plot(hits_df.hits, 'ko', markersize=1, label='hits')
ax.legend(loc='upper left')
ax.set_xlabel('Date')
ax.set_ylabel('# of ipynb files');
Growth appears exponential until December 2020, at which point the count dropped suddenly and resumed growth from a new origin.
The total change in the number of *.ipynb hits between the first day for which we have data and today is:
total_delta_nbs = hits_df.iloc[-1] - hits_df.iloc[0]
total_delta_nbs
hits    -64872.0
dtype: float64
The mean daily change for the entire duration is:
avg_delta_nbs = total_delta_nbs / len(hits_df)
avg_delta_nbs
hits    -18.743716
dtype: float64
The change in hit count between any two consecutive days for which we have data looks like the following:
daily_deltas = (hits_df.hits - hits_df.hits.shift())
fig, ax = plt.subplots(figsize=figsize)
ax.plot(daily_deltas, 'ko', markersize=2)
ax.set_xlabel('Date')
ax.set_ylabel(r'$\Delta$ # of ipynb files')  # raw string avoids the invalid '\D' escape warning
ax.set_title('Day-to-Day Change');
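As a small illustration of how the day-to-day change behaves around gaps, the sketch below uses a made-up series (the name `counts` and its values are assumptions for illustration only). It also checks that subtracting the shifted series matches pandas' built-in `Series.diff()`.

```python
import numpy as np
import pandas as pd

# Hypothetical counts with one missing day; values are made up for illustration.
counts = pd.Series(
    [100.0, 104.0, np.nan, 110.0],
    index=pd.date_range('2024-01-01', periods=4),
)

# Same computation as the cell above: subtract the previous day's value.
deltas = counts - counts.shift()

# Series.diff() is the built-in equivalent of the shift-and-subtract form.
assert deltas.equals(counts.diff())
print(deltas)
```

Note that a single missing day leaves two NaN deltas: one for the missing day itself and one for the day after it, since neither has a valid pair of consecutive values.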
The large jumps in the data are from GitHub reporting drastically different counts from one day to the next.
Let's drop outliers, defined as values more than two standard deviations away from a centered 90-day rolling mean.
daily_delta_rolling = daily_deltas.rolling(window=90, min_periods=0, center=True)
outliers = abs(daily_deltas - daily_delta_rolling.mean()) > 2*daily_delta_rolling.std()
outliers.value_counts()
hits
False    3371
True       90
Name: count, dtype: int64
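The rule above can be tried on synthetic data to see how it reacts to a single spike. This is only a sketch under stated assumptions: the noise parameters, the injected value of 500, and the names `rng`, `deltas`, `rolling`, and `flagged` are all made up for illustration, while the window parameters match the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic day-to-day changes: small noise plus one injected spike (all made up).
rng = np.random.default_rng(0)
deltas = pd.Series(
    rng.normal(loc=5, scale=1, size=200),
    index=pd.date_range('2024-01-01', periods=200),
)
deltas.iloc[100] = 500  # hypothetical reporting glitch

# Centered 90-day window, same parameters as the cell above.
rolling = deltas.rolling(window=90, min_periods=0, center=True)

# Flag values more than two rolling standard deviations from the rolling mean.
flagged = (deltas - rolling.mean()).abs() > 2 * rolling.std()
print(flagged.iloc[100])
```

The spike is flagged, but it also inflates the rolling standard deviation of every window that contains it, so smaller anomalies near a large jump are less likely to be caught by this rule.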
cleaned_hits_df = hits_df.copy()
cleaned_hits_df[outliers] = np.nan  # np.NaN alias was removed in NumPy 2.0
cleaned_daily_deltas = (cleaned_hits_df.hits - cleaned_hits_df.hits.shift())
fig, ax = plt.subplots(figsize=figsize)
ax.plot(cleaned_daily_deltas, 'ko', markersize=2)
ax.set_xlabel('Date')
ax.set_ylabel(r'$\Delta$ # of ipynb files')  # raw string avoids the invalid '\D' escape warning
ax.set_title('Day-to-Day Change Sans Outliers');