Notebook

The cell below loads the visual style of the notebook when run.¶

In [0]:

from IPython.core.display import HTML
css_file = '../../styles/styles.css'
HTML(open(css_file, "r").read())

Statistics in Python¶

Learning Objectives

Explore the connection between the Gaussian distribution and error bars

Investigate the origin of the uncertainty transformation formula, and when it breaks down

Tackle statistical problems with monte-carlo techniques

Show that photon counting follows the Poisson distribution

In the lectures we will introduce some basic statistics, and show how the concepts of error bars and the Gaussian distribution are linked. In this Python practical, we investigate these concepts further, by using Python's ability to generate random numbers.

numpy.random¶

The module numpy.random contains many functions to draw a set of random numbers from many different distributions. For example, we can draw 1000 numbers at random from a Gaussian distribution using the normal function:

In [0]:

%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

import numpy.random as random

# draw 1000 random numbers from a Gaussian distribution
# rand_numbers is a numpy array with 1000 values in it.
rand_numbers = random.normal(size=1000)

# plot their histogram, normalised so area under histogram=1
# I'm going to do this a lot, so make a function to do it!
def plot_rv(samples, nbins=20):
    fig,ax = plt.subplots()
    ax.hist(samples, density=True, bins=nbins)
    ax.set_xlabel('x')
    ax.set_ylabel('PDF(x)')
    plt.show()

# call the function
plot_rv(rand_numbers)

Gaussian random numbers and error bars¶

One of the way this can be useful is to create simulated observations of a quantity. In the lectures we explain that when scientists write $x = 10 \pm 2$ is it usually shorthand for the fact that our knowledge about $x$ is represented by a Gaussian distribution with mean 10 and standard deviation 2.

Therefore, if we were to repeatedly measure $x$, we would expect our results to be distributed according to a Gaussian distribution with a mean 10 and standard deviation 2. The key point is that this means we can simulate many measurements of $x$ by drawing random numbers from the corresponding Gaussian (normal) distribution:

In [0]:

# note how we use the loc and scale arguments
# to specify mean and std. dev. of Gaussian
# xvals contains 1000 samples from a Gaussian of mean 10 and std. dev 2
xvals = random.normal(loc=10, scale=2, size=1000)

# xvals is a numpy array of simulated measurements of x
print(type(xvals))
print(len(xvals))

plot_rv(xvals)

The array xvals contains many samples of $x$, just as if we'd measured $x$ over and over again. Drawing these kind of samples is an incredibly useful way of simulating statistical problems.

One definition of the probability of event $A$ is that the probability, $P(A)$ is just the fraction of trials in which $A$ occurs. Therefore, using arrays of samples, we can calculate the probability that some statement is true, by looking at the fraction of our samples that satisfy that statement.

As an example, for the measurement above $x = 10 \pm 2$ what is the probability that the true value of x is at least 14? We can use what we learnt about fancy slicing in numpy to answer the question. The expression xvals >= 14 will return an array which contains True or False for each entry in xvals. We can count the number of True entries using np.count_nonzero. Therefore, the fraction of random samples which are at least 14 is given by:

In [0]:

number_of_samples_where_x_gt_14 = np.count_nonzero( xvals >= 14)
number_of_samples = len(xvals)

frac_greater_than_14 = number_of_samples_where_x_gt_14 / number_of_samples
print('Probability that x >= 14 is {:.2f} percent'.format(100*frac_greater_than_14))

Properties of Gaussian distribution

We will see in the stats lecture that there is a roughly $\sim30$% chance that a Gaussian random variable will lie more than 1$\sigma$ away from the mean. Confirm that this is true by seeing what fraction of the random numbers in xvals have values less than 8 or more than 12.

What fraction of our samples lie more than 2$\sigma$ from the mean?

In [0]:

# YOUR CODE HERE

Combinations of random variables: error propagation

Let us test the equation of error propagation that we derive in the lectures. Suppose we measure

two quantities: $x = 100 \pm 3$ and $y = 80 \pm 4$. According to the equation of error propagation, the sum $z = x+y$ should be normally distributed with mean $\bar{z} = 180$ and standard deviation $\sigma_z = \sqrt{9+16} = 5$.

Test this by drawing 1000 random samples for $x$, 1000 random samples for $y$ and adding them together. Calculate the mean and standard deviation of the result, and plot the histogram to see if it looks like a Gaussian

In [0]:

# YOUR CODE HERE

Using simulation for difficult cases of error propagation.¶

The same technique as above can be a very useful way of handling difficult error propagation calculations. For example, the radial velocity of a star with an orbiting planet is given by:

$$ v^{3} \approx \frac{2 \pi G}{P} \frac{m_{p}^{3} \sin^{3} i}{m_{s}^{2}},$$

where $P$ is the orbital period, $m_{p}$ is the mass of the planet, $m_{s}$ is the mass of the star and $i$ is the inclination of the planet's orbit to our line of sight. If $m_s = (2.00 \pm 0.05) \times 10^{30}$ kg, $m_p = (2.00 \pm 0.05) \times 10^{27}$ kg, $P = 5.000 \pm 0.001$ years and $i = 75 \pm 5$ degrees, calculate the radial velocity and it's error.

This would be hard to calculate using standard error formulae - the $\sin^3$ term alone is a bit of a nightmare. However, we can readily simulate it...

In [0]:

# set up some useful variables
year_in_secs = 86400 * 365
Nsims = 100000
G = 6.67e-11

# simulate many observations of all the input variables
P  = np.random.normal(loc=5*year_in_secs, scale=0.001*year_in_secs, size=Nsims)
mp = np.random.normal(loc=2.0e27, scale=0.05e27, size=Nsims)
ms = np.random.normal(loc=2.0e30, scale=0.05e30, size=Nsims)
i  = np.random.normal(loc=75, scale=5, size=Nsims)

# convert i to radians
i  = np.radians(i)

# now we have arrays with 100000 samples of input measurements
# calculate array of resulting v**3 values
v3 = 2*np.pi*G * mp**3 * np.sin(i)**3 / P / ms**2

# cube root
v = v3**(1./3.)

# calculate mean and std dev
print("Mean velocity = {:.1f} +/- {:.1f} m/s".format( v.mean(), v.std() ))

The breakdown of the error propagation formula¶

We can also use this technique to explore situations where the error propagation formula we derived breaks down. In the lecture we derive the formula for the error on $y=f(x)$ using a Taylor expansion of $f(x)$ around the mean of $x$, $\mu_x$.

One of the ways this can break down is if the error bar is not small - this means approximating $y=f(x)$ with a Taylor expansion is not a good approximation (look at Figure 53 in the notes to see this graphically)

Breakdown of error propagation 1: large error bars

The $V$-band magnitude of a star is $V = 22 \pm 1$. Using the method above, draw many samples of the stars's magnitude and convert them to fluxes in mJy. Plot the histogram of the fluxes.

(You can assume that a star with $V=0$ has a flux of 3631 mJy)

In [0]:

# YOUR CODE HERE

If you did the example above correctly, you should find the resulting distribution of fluxes is not a Gaussian. This makes it very hard to meaningfully express the flux of the star in the form $F = \bar{F} \pm \sigma_F$. It is the large error on the magnitude that is the cause of the problem - you should be very cautious about applying the error propagation formula when the uncertainties are large!

Simulation as a general approach¶

The examples above are specific examples of using random numbers and simulation as an approach to statistical questions. The approach can be generalised to tackle many different problems. The basic idea is to write some code that simulates the process of interest, run it many times and look at the results of your simulation.

It can be applied to probability more generally, using the following trick:

Suppose the probability of event $A$ is $P(A) = q$.

If you want to see if $A$ occurred in each trial, choose a random number $x$ that is uniformly distributed between 0 and 1. If $x <= q$, then event $A$ occured.

An example: Boys vs. Girls¶

Suppose there's an island on which boys and girls are equally likely to be born, but families stop having children after their first boy. Do you expect equal numbers of boys and girls to be born? Let's simulate it. First we write a function that simulates a single family:

In [0]:

def family():
    ngirls = 0
    nboys = 0
    # keep breeding until we have a boy!
    while(nboys == 0):

        # pick a random (uniform) number between 0 and 1
        rand_num = random.uniform()

        #50/50 boy or girl, so boy if our random number is > 0.5
        if rand_num > 0.5:
            nboys  = nboys + 1
        #otherwise it's a girl
        else:
            ngirls = ngirls + 1

    # once a boy is born, this family is done having kids
    return nboys, ngirls

# we can call this to simulate a family picked at random
nb, ng= family()
print("Random family had {} boys and {} girls".format(nb, ng))

Obviously, by construction each family will always have one boy, but the numbers of girls born to each family will differ. Let's simulate many families:

In [0]:

nboys, ngirls = 0, 0
for i in range(100000):
    nb, ng = family()
    nboys = nboys + nb
    ngirls = ngirls + ng

total_children = nboys+ngirls
print("{:.1f} percent of all children are boys".format(100*nboys/total_children))
print("{:.1f} percent of all children are girls".format(100*ngirls/total_children))

Perhaps surprisingly, our results show that there will be as many boys as girls! This actually makes sense if you think hard about it. By construction, all families will have one boy. Each individial family might have 0, 1, 2, 3... girls but the average number of girls turns out to be the same as the average number of boys (50/50 boy/girl chance) so the same number of boys and girls are born.

Homework #6

Counting Photons

In the homework this week, we are going to use the approaches we learnt above to show that the number of photons detected in an exposure follow the Poisson distribution, and that this distribution tends towards a Gaussian when the expected number of photons becomes large.

We start with a simple way to simulate the number of photons detected per second. Let us say that, on average, a star emits $N$ photons/sec. However, each photon is emitted at random.

Consider a very small time interval $\tau$. If we have $N$ photons/sec, we expect to see one photon in the time interval $\tau$ with probability $N\tau$. For example, if $\tau = 1/1000^{\rm th}$ of a second and N=10 photons/sec, we expect to detect one photon with probability $10/1000 = 0.01$. We might see two photons in this time interval, but that's very unlikely, so we'll ignore it for now.

Q1: Simulating a very short time span (2 points)

Write a function to simulate a single short timespan. Your function should accept the average number of photons/sec and the length of the timespan as arguments. It should return either 1 or 0 depending on whether a photon was emitted or not.

Remember that the function you write should be able to pass the tests in the cell beneath it.

In [0]:

def sim_timespan(N, tau):
    """Simulate a short timespan and see if a photon is emitted

    The timespan should be short, so that we can neglect the possibility that two photons are emitted.

    Args:

        N (float): average number of photons/sec emitted
        tau (float): length of timespan in seconds

    Returns:

        emitted (int) : 0 or 1 depending on whether a photon is emitted
    """
    # YOUR CODE HERE

In [0]:

from nose.tools import assert_almost_equal, assert_equal, assert_in
# check we get either 0 or 1
assert_in(sim_timespan(10, 1/400), (0, 1))

# run 10000 sims
many_sims = np.array([sim_timespan(10, 1/400) for i in range(10000)])
# check that we get a photon roughly 2.5% of the time in this simulation
assert_almost_equal(np.fabs(many_sims.mean() - 0.025), 0, places=1)

Q2: Simulating a whole second (3 points)

Now you have a function that simulates a single short timespan, you can simulate a whole second.

The way to do this is to imagine that one whole second exposure can be considered the sum of $1/\tau$

shorter exposures, so we can run many simulations of a short timespan, and add all the detected photons together.

In the function below, break the second into 500 short timespans, and for each timespan use the function you wrote above to see if a photon is emitted or not. Add all the photons that are emitted to get the total number of photons that are emitted in a second.

Next, and for full marks, you should change your function so that it doesn't call the function above 500 times, which is slow. Instead, work out how to perform this calculation by generating 500 random numbers - this will be much faster to run.

Remember that the function you write should be able to pass the tests in the cell beneath it.

In [0]:

def detected_photons(N):
    """Calculate the number of photons emitted in a given second interval

    Calculates the number of photons emitted in an interval of one second, chosen at random.

    Args:
        N (float): the average number of photons emitted per second
    Returns:
        Np (int): the actual number of photons emitted in this simulated second
    Raises:
        AssertionError: if N <= 1
    """
    assert N>=1, "N cannot be 0 or negative"
    # YOUR CODE HERE

In [0]:

from nose.tools import assert_less
# run many sims
many_sims = np.array([detected_photons(20) for i in range(10000)])
assert_less(np.fabs(many_sims.mean()-20), 0.1)

Q3: 3 photons per second (2 points)

Use the function you have written to simulate 10 000 exposures of 1 second, when the average number of photons per second is 3. Plot a histogram of the number of photons detected in each exposure.

In [0]:

# YOUR CODE HERE

Q4: 30 photons per second (2 points)

Use the function you have written to simulate 10 000 exposures of 1 second, when the average number of photons per second is 30. Plot a histogram of the number of photons detected in each exposure.

For comparison, on the same plot, draw the PDF for a Gaussian of mean $30$ and standard deviation $\sqrt{30}$

In [0]:

# YOUR CODE HERE

Q5: 30 photons per second (1 points)

Using the results of the simulation above. If the average number of photons per second is 30, what is the probability that we detect less than 20 photons in a 1 second exposure?

Store your answer in a variable called prob_lt_20.

In [0]:

# YOUR CODE HERE

In [0]:

from nose.tools import assert_almost_equal
assert_almost_equal(np.fabs(prob_lt_20-0.02), 0, places=1)

Extra Credit (4 points)

Extra credit questions allow you to make up for marks dropped in this and other homeworks. You can't score more than 100% overall, but if you get 4 extra credit points this week, and lose 4 points next week, you'd still be on course for 100% marks. I don't expect you to answer extra credit questions, unless you want to.

For extra credit this week write a simulation to solve the following classic stats problem (fair warning - this one is tough).

You are a contestant on a game show...

There are 3 doors, behind which are two goats and a car.

You pick a door (call it door A). You’re hoping for the car of course.
The game show host examines the other doors (B & C) and always opens one of them with a goat (both doors might have goats; he’ll randomly pick one to open).
You are given a choice to stick with your original door, or switch to the other door.

Should you stick, or switch?

You'll need to write a function that simulates a single game, and then call this function many times whilst sticking, and again whilst switching to see how your success rate depends on the strategy.

Hint: you'll want to use the randint function in the numpy.random module to choose a door

In [0]:

# YOUR CODE HERE