Jason asked about a plateau in contributions around 2016, which showed up in one of my slides. This notebook digs a little deeper into that to see if we can understand why.
First, load our dataset and plotting utilities
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%run utils.py
Load data processed from the raw event sources in post-process-data
df = pd.read_feather("data/processed.feather")
df.head()
| | type | repo_name | actor_id | actor_login | org | date |
|---|---|---|---|---|---|---|
| 0 | PushEvent | ipython/ipython | 72162001 | (no author)@5d925edf-2efb-0310-891d-d227fdbe9301 | ipython | 2003-03-02 04:28:20+00:00 |
| 1 | PushEvent | ipython/ipython | 57394 | fperez | ipython | 2003-03-02 04:28:20+00:00 |
| 2 | PushEvent | ipython/ipython | 57394 | fperez | ipython | 2003-03-02 04:28:21+00:00 |
| 3 | PushEvent | ipython/ipython | 72162001 | (no author)@5d925edf-2efb-0310-891d-d227fdbe9301 | ipython | 2003-03-02 04:28:21+00:00 |
| 4 | PushEvent | ipython/ipython | 72162001 | (no author)@5d925edf-2efb-0310-891d-d227fdbe9301 | ipython | 2003-03-02 05:13:27+00:00 |
If we look at annual total contributions by org, there is a plateau around 2016, which Jason asked about in the chat.
plot_events(df, groupby="org", timeseries=True, metric="count", freq="Y")
plt.title("Annual contributions by org")
[Figure: Annual contributions by org]
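The plot_events helper comes from utils.py, which isn't shown in this post. Purely as an illustrative sketch (assuming only the columns visible in df.head() above; the real helper may differ), a function with this signature might look roughly like:
# Hypothetical sketch of plot_events -- the real implementation lives in utils.py.
# Relies on the pandas import (pd) from the setup cell above.
def plot_events(df, groupby="org", timeseries=True, metric="count", freq="Y"):
    """Plot events per `freq` period, one line per `groupby` value."""
    # timeseries=False presumably produces a non-time view; ignored in this sketch
    grouped = df.groupby([pd.Grouper(key="date", freq=freq), groupby])
    if metric == "nunique":
        # unique contributors per period
        values = grouped.actor_id.nunique()
    else:
        # total events per period
        values = grouped.actor_id.count()
    table = values.unstack(groupby).fillna(0)
    ax = table.plot()
    ax.set_ylabel(metric)
    return ax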
If we switch to unique contributors per year, that metric is still rising (if slowing a bit), with a drop in 2020. But 2020 has been an exceptional year, and we only have data for its first nine months anyway.
plot_events(df, groupby="org", timeseries=True, metric="nunique", freq="Y")
plt.title("Annual unique contributors by org")
[Figure: Annual unique contributors by org]
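As a quick sanity check on that last point (a sketch, not part of the original notebook), we can confirm how far the event data actually extends:
# confirm the dataset stops partway through 2020
print(df.date.min(), "to", df.date.max())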
For some of us, working on Jupyter is our job, which means we have a lot more time to discuss Issues, open Pull Requests, and push changes. Others might only have the occasional night or weekend free, while still others might only have their one Issue to report.
How does the community distribute into these groups?
from itertools import chain
bins = list(chain(range(10), np.logspace(1, 4, 100)))  # integer edges 0-9, then log-spaced edges from 10 to 10,000
counts, _ = np.histogram(df.actor_id.value_counts(), bins=bins)
counts[:5]
fig, ax = plt.subplots()
hist = df.actor_id.value_counts().plot(
    kind="hist",
    bins=bins,
    ax=ax,
    cumulative=-1,  # reverse cumulative: count contributors with *at least* this many contributions
)
ax.set_yscale("log")
ax.set_xscale("log")
ax.set_ylabel("Number of contributors")
ax.set_xlabel("Number of contributions")
ax.set_title(f"Number of contributors with at least X contributions ({df.actor_id.nunique()//1000}k total)")
ax.set_xlim(1, 10000)
ax.set_ylim(1, None)
print(f"{df.actor_id.nunique()} total unique contributors")
25037 total unique contributors
Rescaling to a fraction shows that ~90% of contributors have contributed to the conversation fewer than 10 times.
from matplotlib import ticker as tick
def adjust_y_axis(x, pos):
    # convert a raw contributor count to a percentage of all contributors
    return f"{100 * x / (len(df.actor_id.unique())):.2g}%"
ax.yaxis.set_major_formatter(tick.FuncFormatter(adjust_y_axis))
display(ax.figure)
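As a direct cross-check of that ~90% figure (a quick sketch, not in the original notebook):
per_actor = df.actor_id.value_counts()
print(f"{(per_actor < 10).mean():.0%} of contributors have fewer than 10 contributions")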
We can try to break down the annual contributions into groups: the top 5 contributors, the top 10 excluding the top 5, the top 50 excluding the top 10, and so forth.
This lets us see which groups drive the activity plateau. Is it driven by a plateau in audience, or are the most-active folks less active?
These exclusions allow us to do cumulative sums / stacked charts later on, since the cumulative sum across a given row gives totals with no double-counted groups (sanity-checked after the table below).
by_year = df.groupby(df.date.dt.year)
top_contributors_by_year = by_year.actor_id.value_counts()
top_contributors_by_year
date  actor_id
2003  57394       187
      72162005     29
      72162001      6
      72162013      5
      72162017      5
                  ...
2020  72165242      1
      72165243      1
      72165244      1
      72165245      1
      72165246      1
Name: actor_id, Length: 32687, dtype: int64
levels = [5, 10, 50, 100]
columns = ["year"] + [f"top {n}" for n in levels] + ["rest"]
rows = []
years = sorted(df.date.dt.year.unique())
for year, idx in by_year.groups.items():
    events = df.loc[idx]
    row = [year]
    tops = top_contributors_by_year[year].index
    n_prev = None
    for n in levels:
        # events from the top-n contributors, excluding the previous (smaller) group
        mask = events.actor_id.isin(tops[:n])
        if n_prev:
            mask = mask & ~events.actor_id.isin(tops[:n_prev])
        n_prev = n
        row.append(events[mask].actor_id.count())
    # everyone outside the largest group ("rest")
    row.append(events[~events.actor_id.isin(tops[:n])].actor_id.count())
    rows.append(row)
counts_by_year = pd.DataFrame(rows, columns=columns)
counts_by_year = counts_by_year.set_index("year")
counts_by_year
year | top 5 | top 10 | top 50 | top 100 | rest
---|---|---|---|---|---
2003 | 232 | 11 | 10 | 0 | 0 |
2004 | 480 | 35 | 17 | 0 | 0 |
2005 | 682 | 214 | 90 | 0 | 0 |
2006 | 1558 | 383 | 266 | 3 | 0 |
2007 | 1532 | 296 | 331 | 29 | 0 |
2008 | 1363 | 492 | 494 | 57 | 0 |
2009 | 857 | 241 | 255 | 21 | 0 |
2010 | 1820 | 331 | 585 | 128 | 70 |
2011 | 6983 | 1050 | 1372 | 407 | 486 |
2012 | 9004 | 2120 | 2684 | 785 | 1544 |
2013 | 13363 | 2607 | 3037 | 994 | 2914 |
2014 | 17695 | 4149 | 4366 | 1206 | 4456 |
2015 | 21152 | 7760 | 8996 | 2012 | 7332 |
2016 | 29334 | 13201 | 18353 | 3495 | 12053 |
2017 | 26754 | 12663 | 15693 | 3167 | 15203 |
2018 | 20423 | 11937 | 15741 | 3431 | 19000 |
2019 | 16965 | 8722 | 20798 | 4398 | 20841 |
2020 | 12713 | 7967 | 18560 | 3301 | 15298 |
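As a quick sanity check on the no-double-counting claim above (a sketch, not in the original notebook), the row sums of these exclusive groups should match the total events per year:
totals_per_year = df.groupby(df.date.dt.year).size()
print((counts_by_year.sum(axis=1).values == totals_per_year.values).all())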
The same table as percentages (cumulative this time)
total = counts_by_year.cumsum(axis=1)["rest"]
fractions = counts_by_year.cumsum(axis=1)
(fractions.div(total, axis=0)).style.format("{:,.0%}")
year | top 5 | top 10 | top 50 | top 100 | rest
---|---|---|---|---|---
2003 | 92% | 96% | 100% | 100% | 100% |
2004 | 90% | 97% | 100% | 100% | 100% |
2005 | 69% | 91% | 100% | 100% | 100% |
2006 | 70% | 88% | 100% | 100% | 100% |
2007 | 70% | 84% | 99% | 100% | 100% |
2008 | 57% | 77% | 98% | 100% | 100% |
2009 | 62% | 80% | 98% | 100% | 100% |
2010 | 62% | 73% | 93% | 98% | 100% |
2011 | 68% | 78% | 91% | 95% | 100% |
2012 | 56% | 69% | 86% | 90% | 100% |
2013 | 58% | 70% | 83% | 87% | 100% |
2014 | 56% | 69% | 82% | 86% | 100% |
2015 | 45% | 61% | 80% | 84% | 100% |
2016 | 38% | 56% | 80% | 84% | 100% |
2017 | 36% | 54% | 75% | 79% | 100% |
2018 | 29% | 46% | 68% | 73% | 100% |
2019 | 24% | 36% | 65% | 71% | 100% |
2020 | 22% | 36% | 68% | 74% | 100% |
We can plot each group individually to see how their contributions have grown over time.
Remember, "top 10" is really 6-10, "top 50" is really 11-50, etc.
counts_by_year.plot()
plt.xlim(2003,2020)
plt.title("Contributions from each slice")
# plt.xticks(counts_by_year.index[::2]);
[Figure: Contributions from each slice]
We see a huge drop in activity from the top 5 contributors
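To put numbers on that drop (a quick sketch, not in the original notebook):
# the "top 5" column peaks in 2016 and falls steadily afterwards
counts_by_year["top 5"].loc[2014:2020]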
The same data as a stacked area chart, showing cumulative sums, so we can see how each group changes.
It's a little harder to see individual slopes in this one due to the stacking, so plot it twice: once with the top 5 at the bottom, once with the rest at the bottom.
counts_by_year.plot.area()
plt.xlim(2003, 2020)
[Figure: stacked contributions by group, top 5 at the bottom]
# reverse column order so that 'rest' is at the bottom
counts_by_year[counts_by_year.columns[::-1]].plot.area()
plt.xlim(2003, 2020)
[Figure: stacked contributions by group, rest at the bottom]
Key data points:
Not counting 2020, the contribution plateau can be ascribed almost entirely to a drop in activity from the top 5-10 contributors. "Outside" contributions, i.e. folks who contribute less than the top 100 in any given year, continue to rise linearly through 2019 (if not exponentially, as they did up to 2016).
I think we can conclude that the plateau in total activity comes down primarily to two factors: a deliberate (and healthy) spreading of work away from the very top contributors, and maintainer capacity that isn't growing as fast as the community.
Of course, the relative contribution of the top 5 should drop, or at least flatten, as we develop the community into having more members with a 'maintainer' role. And speaking as one of those five in 2016, the absolute value should also drop: we used to have only a few folks who answered ~every GitHub issue, and that was never going to be sustainable. This has spread out, as it should. However, since we know the wider activity is rising, I think we can probably say that we aren't doing enough to expand maintainer capacity, and can do more community and/or funding work to grow our "maintainer" base to keep up with the community.
Note: some of these drops may not reflect a drop in actual activity, but rather a drop in measured activity. This isn't a complete dataset, and activity in other places, such as the Discourse forum, online meetings, conferences, etc., is not counted.
The data here is a very rough proxy for actual activity and contributions; to first order it is really just GitHub comment frequency, which dwarfs every other source of events.