Performance implications of frozen distributions in SciPy

This notebook is a quick demonstration of the surprising performance implications of two different interfaces for sampling from probability distributions in SciPy.

In [ ]:
import scipy.stats
import numpy as np

One option for sampling from a distribution (in this case, a Poisson distribution with a λ of 5) is to create a frozen distribution object, like this:

In [ ]:
distribution = scipy.stats.poisson(5)
distribution.random_state = 0x00c0ffee
distribution.rvs(size=12)

Another option is to use the scipy.stats.poisson object directly, specifying the distribution parameters and the random state in each call, like this:

In [ ]:
rs = np.random.RandomState(1234)

scipy.stats.poisson.rvs(5, size=12, random_state=rs)

A RandomState is really a pseudorandom number generator, and we can see that each sampling call advances its internal state:

In [ ]:
rs = np.random.RandomState(1234)

print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
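
Because assigning an integer to a frozen distribution's random_state seeds a fresh RandomState under the hood (at least in the SciPy versions I've used), the two interfaces should produce identical samples when given the same seed; here's a quick sanity check:

In [ ]:
# Quick sanity check: the frozen-distribution draws should match the per-call
# draws when both interfaces start from the same seed.
frozen = scipy.stats.poisson(5)
frozen.random_state = 0x00c0ffee

per_call = scipy.stats.poisson.rvs(
    5, size=12, random_state=np.random.RandomState(0x00c0ffee))

print(np.array_equal(frozen.rvs(size=12), per_call))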

These two techniques produce identical results; the question is whether they also perform identically, so let's collect some timings.

In [ ]:
def experiment_one(agents, steps):
    # each "agent" gets its own frozen Poisson distribution, seeded independently
    def mkpoisson(l, seed):
        p = scipy.stats.poisson(l)
        p.random_state = seed
        return p

    seeds = np.random.randint(1 << 32, size=agents)
    streams = [mkpoisson(12, seed) for seed in seeds]
    for p in streams:
        p.rvs(size=steps)

def experiment_two(agents, steps):
    # each "agent" gets its own RandomState, and the distribution parameters
    # are passed to scipy.stats.poisson.rvs on every call
    seeds = np.random.randint(1 << 32, size=agents)
    states = [np.random.RandomState(seed) for seed in seeds]
    for rs in states:
        scipy.stats.poisson.rvs(12, size=steps, random_state=rs)
In [ ]:
%%time

experiment_one(10000,1000)
In [ ]:
%%time

experiment_two(10000,1000)

If you're running these experiments in an environment similar to mine, you'll see radically different timings: experiment_two will be substantially faster than experiment_one. (I've observed a 2x difference on Binder and a 4-5x difference locally.)
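
Before profiling, one rough way to get a feel for where the time might go is to time each interface's per-agent setup step in isolation (exact numbers will depend on your SciPy and NumPy versions):

In [ ]:
import timeit

# per-agent setup cost in experiment_one: freezing a Poisson distribution
print(timeit.timeit("scipy.stats.poisson(12)",
                    setup="import scipy.stats", number=10000))

# per-agent setup cost in experiment_two: constructing a RandomState
print(timeit.timeit("np.random.RandomState(1234)",
                    setup="import numpy as np", number=10000))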

To see why, let's profile both functions:

In [ ]:
from cProfile import run
import pstats
from pstats import SortKey
In [ ]:
run("experiment_one(10000,1000)", sort=SortKey.TIME)
In [ ]:
run("experiment_two(10000,1000)", sort=SortKey.TIME)