Example: Number of Github users

This is a fun example. Suppose we wish to predict how many sign-ups there are on Github.com. Officially, Github does not release an up-to-date count, and at last offical annoucment (January 2013) the count was 3 million. What if we wish to measure it today? We could extrapolate future numbers from previous annoucements, but this uses little data and we could potentially be off by hundreds of thousands, and you are essentially just curve fitting complicated models.

Instead, what we are going to use is user id numbers from real-time feeds on Github. The script github_events.py will pull the most recent 300 events from the Github Public Timeline feed (we'll be accessing data using their API). From this, we pull out the user ids associated with each event. We run the script below and display some output:

In [1]:
%run github_events.py


print "Some User ids from the latest events (push, star, fork etc.) on Github."
print ids[:10]
print
print "Number of unique ids found: ", ids.shape[0]
print "Largest user id: ", ids.max()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\Python27\lib\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
    169             else:
    170                 filename = fname
--> 171             exec compile(scripttext, filename, 'exec') in glob, loc
    172     else:
    173         def execfile(fname, *where):

C:\Users\Cameron\Dropbox\My Work\Probabilistic-Programming-and-Bayesian-Methods-for-Hackers\Chapter2_MorePyMC\github_events.py in <module>()
     22     data = loads(r.text)
     23     for event in data:
---> 24         ids[k] = ( event["actor"]["id"] )
     25         k+=1
     26 

KeyError: 'id'
Some User ids from the latest events (push, star, fork etc.) on Github.
[1524995 1978503 1926860 1524995 3707208  374604   37715  770655  502701
 4349707]

Number of unique ids found:  300
Largest user id:  2085773151
In [2]:
figsize(12.5,3)
plt.hist( ids, bins = 45, alpha = 0.9)
plt.title("Histogram of %d Github User ids"%ids.shape[0] );
plt.xlabel("User id")
plt.ylabel("Frequency");
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-e83805e8eaea> in <module>()
----> 1 figsize(12.5,3)
      2 plt.hist( ids, bins = 45, alpha = 0.9)
      3 plt.title("Histogram of %d Github User ids"%ids.shape[0] );
      4 plt.xlabel("User id")
      5 plt.ylabel("Frequency");

NameError: name 'figsize' is not defined

There are some users with multiple events, but we are only interested in unique user ids, hence why we have less than 300 ids. Above I printed the largest user id. Why is this important? If Github assigns user ids serially, which is a fair assumption, then we know that there are certainly more users than that number. Remember, we are only looking at less than 300 individuals out of a much larger population, so it is unlikely that we would have sampled the most recent sign-up.

At best, we can only estimate the total number of sign-ups. Let's get more familar with this problem. Consider a fictional website that we wish to estimate the number of users:

  1. Suppose we sampled only two individuals in a similar manner to above: the ids are 3 and 10 respectively. Would it be likely that the website has millions of users? Not very. Alternatively, it is more likely the website has less than 100 users.

  2. On the other hand, if the ids were 3 and 34 989, we might be more willing to guess there could possibly thousands, or millions of user sign-ups. We are not very confident in an estimate, due to a lack of data.

  3. If we sample thousands of users, and the maximum user id is still 34 989, then is seems likely that the total number of sign ups is near 35 000. Hence our inference should be more confident.

We make the following assumption:

Assumption: Every user is equally likely to perform an event. Clearly, looking at the above histogram, this assumption is violated. The participation on Github is skewed towards early adopters, likely as these early-adopting individuals have a) more invested in Github, and b) saw the value earlier in Github, therefore are more interested in it. The distribution is also skewed towards new sign ups, who likely signed up just to push a project.

To create a Bayesian model of this is easy. Based on the above assumption, all user_ids sampled are from a DiscreteUniform model, with lower bound 1 and an unknown upperbound. We don't have a strong belief about what the upper-bound might be, but we do know it will be larger than ids.max().

Working with such large numbers can cause numerical problem, hence we will scale everything by a million. Thus, instead of a DiscreteUniform, we will used a Uniform:

In [3]:
FACTOR = 1000000.

import pymc as pm

upper_bound = pm.Uniform( "n_sign_ups", ids.max()/FACTOR, (ids.max())/FACTOR + 1)
obs = pm.Uniform("obs", 0, upper_bound, value = ids/FACTOR, observed = True )

#code to be examplained in Chp. 3.
mcmc = pm.MCMC([upper_bound, obs] )
mcmc.sample( 100000, 45000)
---------------------------------------------------------------------------
ZeroProbability                           Traceback (most recent call last)
<ipython-input-3-4a6014825d95> in <module>()
      4 
      5 upper_bound = mc.Uniform( "n_sign_ups", ids.max()/FACTOR, (ids.max())/FACTOR + 1)
----> 6 obs = mc.Uniform("obs", 0, upper_bound, value = ids/FACTOR, observed = True )
      7 
      8 #code to be examplained in Chp. 3.

c:\Python27\lib\site-packages\pymc\distributions.pyc in __init__(self, *args, **kwds)
    268                 random = debug_wrapper(random)
    269             else:
--> 270                 Stochastic.__init__(self, logp=logp, random=random, logp_partial_gradients = logp_partial_gradients, dtype=dtype, **arg_dict_out)
    271 
    272     new_class.__name__ = name

c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in __init__(self, logp, doc, name, parents, random, trace, value, dtype, rseed, observed, cache_depth, plot, verbose, isdata, check_logp, logp_partial_gradients)
    714         if check_logp:
    715             # Check initial value
--> 716             if not isinstance(self.logp, float):
    717                 raise ValueError("Stochastic " + self.__name__ + "'s initial log-probability is %s, should be a float." %self.logp.__repr__())
    718 

c:\Python27\lib\site-packages\pymc\PyMCObjects.pyc in get_logp(self)
    846                 raise ZeroProbability(self.errmsg + "\nValue: %s\nParents' values:%s" % (self._value, self._parents.value))
    847             else:
--> 848                 raise ZeroProbability(self.errmsg)
    849 
    850         return logp

ZeroProbability: Stochastic obs's value is outside its support,
 or it forbids its parents' current values.
In [4]:
from scipy.stats.mstats import mquantiles

samples =  mcmc.trace("n_sign_ups")[:]

hist(samples, bins = 100, 
             label = "Uniform prior",
             normed=True, alpha = 0.8, 
             histtype="stepfilled", color = "#7A68A6"  );

quantiles_mean = np.append( mquantiles( samples, [0.05, 0.5, 0.95]), samples.mean() )
print "Quantiles: ", quantiles_mean[:3]
print "Mean: ", quantiles_mean[-1]
plt.vlines( quantiles_mean, 0, 33, 
            linewidth=2, linestyles = ["--", "--", "--", "-"],
            )
plt.title("Posterior distribution of total number of Github users" )
plt.xlabel("number of users (in millions)")
plt.legend()
plt.xlim( ids.max()/FACTOR - 0.01, ids.max()/FACTOR + 0.12 );
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-2ca0106ec6de> in <module>()
      1 from scipy.stats.mstats import mquantiles
      2 
----> 3 samples =  mcmc.trace("n_sign_ups")[:]
      4 
      5 hist(samples, bins = 100, 

NameError: name 'mcmc' is not defined

Above we have plotted the posterior distribution. Note that there is no posterior probability assigned to the number of users being less than ids.max(). That is good, as it would be an impossible situation.

The three dashed vertical bars, from left to right, are the 5%, 50% and 95% quantitle lines. That is, 5% of the probability is before the first line, 50% before the second and 95% before the third. The 50% quantitle is also know as the median and is a better measure of centrality than the mean for heavily skewed distributions like this one. The solid line is the posterior distribution's mean.

So what can we say? Using the data above, there is a 95% chance that there are less than 4.4 million users, and is probably around 4.36 million users. I was wondering how accurate this figure was. At the time of this writing, it seems a bit high considering only five months prior the number was at 3 million:

I thought perhaps the user_id parameter was being used liberally to users/bots/changed names etc, so I contacted Github Support about it:

So we may be overestimating by including organizations, which perhaps should not be counted as users. TODO: estimate the number of organizations. Any takers?