Elements of Data Science

by Allen Downey

MIT License


This notebook introduces "statistical inference", which is the process of using a sample to make inferences about a population. This figure from Wikipedia illustrates the idea:

On the left, the "population" is the group we are interested in. Sometimes it is a group of people, but in general it can be any kind of group.

If we can observe the entire group, we might not need statistical inference. If not, sometimes we can observe a "sample" (or subset) of the population and use those observations to make claims about the population.

If you have studied statistics before, you might have encountered some of these ideas before: hypothesis testing, p-values, estimation, standard error, and confidence intervals.

In this notebook we'll approach these topics using computation and simulation, as opposed to mathematical analysis. I hope this approach makes the ideas clearer.

If you have not seen these ideas before, don't worry. That might even be better.

We'll look at three examples:

  • Testing whether a coin is "fair".

  • Testing whether first babies are more like to be born early (or late).

  • Estimating the average height of men in the U.S., and quantifying the precision of the estimate.

The Euro problem

In Information Theory, Inference, and Learning Algorithms, David MacKay writes, "A statistical statement appeared in The Guardian on Friday January 4, 2002:

When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. ‘It looks very suspicious to me’, said Barry Blight, a statistics lecturer at the London School of Economics. ‘If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%’.*

But do these data give evidence that the coin is biased rather than fair?"

Before we answer MacKay's question, let's unpack what Dr. Blight said:

"If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%".

To see where that comes from, I'll simulate the result of spinning an "unbiased" coin, meaning that the probability of heads is 50%.

Here's an example with 10 spins:

In [1]:
import numpy as np

spins = np.random.random(10) < 0.5

np.random.random returns numbers between 0 and 1, uniformly distributed. So the probability of being less than 0.5 is 50%.

The sum of the array is the number of True elements, that is, the number of heads:

In [2]:

We can wrap that in a function that simulates n spins with probability p.

In [3]:
def spin(n, p):
    return np.sum(np.random.random(n) < p)

Here's an example with the actual sample size (250) and hypothetical probability (50%).

In [4]:
heads, tails = 140, 110
sample_size = heads + tails
In [5]:
hypo_prob = 0.5
spin(sample_size, hypo_prob)

Since we are generating random numbers, we expect to see different values if we run the experiment more than once.

Here's a loop that runs spin 10 times.

In [6]:
n = 250
p = 0.5

for i in range(10):
    print(spin(n, p))

As expected, the results vary from one run to the next.

Now let's run the simulated experiment 10000 times and store the results in a NumPy array.

In [7]:
outcomes = np.empty(1000)

for i in range(len(outcomes)):
    outcomes[i] = spin(n, p)

np.empty creates an empty array with the given length. Each time through the loop, we run spin and assign the result to an element of outcomes.

The result is an array of 10000 integers, each representing the number of heads in a simulated experiment.

The mean of outcomes is about 125:

In [8]:

Which makes sense. On average, the expected number of heads is the product of the hypothetical probability and the sample size:

In [9]:
expected = hypo_prob * sample_size

Now let's see how much the values in outcomes differ from the expected value:

In [10]:
diffs = outcomes - expected

diffs is an array that contains the deviation of each experiment from the expected value, 125.

Here's the mean of the absolute deviations:

In [11]:

So a typical experiment deviates from the mean by about 6.

To see the whole distribution of deviations, we can plot a histogram. The following function uses Matplotlib to plot a histogram and adjust some of the settings.

In [12]:
import matplotlib.pyplot as plt

def plot_hist(values):
    xs, ys, patches = plt.hist(values,
    return patches[0]

Here's what the distribution of deviations looks like:

In [13]:

plt.title('Sampling distribution (n=250)')
plt.xlabel('Deviation from expected number of heads');

This is the "sampling distribution" of deviations. It shows how much variation we should expect between experiments with this sample size (n = 250).


Getting get back to the Euro example, Dr. Bright reported:

"If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%".

The article doesn't say so explicitly, but this is a "p-value". To understand what that means, let's count how many times, in 10000 attempts, the outcome is "as extreme as" the observed outcome, 140 heads.

The observed deviation is the difference between the observed and expected number of heads:

In [14]:
observed_diff = heads - expected

Let's see how many times the simulated diffs exceed the observed deviation:

In [15]:
np.mean(diffs >= observed_diff)

It's around 3%. But Dr. Blight said 7%. Where did that come from?

So far, we only counted the cases where the outcome is more heads than expected. We might also want to count the cases where the outcome is fewer than expected.

Here's the probability of falling below the expected number by 15 or more.

In [16]:
np.mean(diffs <= -observed_diff)

To get the total probability of a result "as extreme as that", we can use the absolute value of the simulated differences:

In [17]:
np.mean(abs(diffs) >= observed_diff)

So that's consistent with what Dr. Blight reported.

To show what that looks like graphically, I'll use the following function, which fills in the histogram between low and high.

In [18]:
def fill_hist(low, high, patch):
    fill = plt.axvspan(low, high, 

The following plot shows the sampling distribution of diffs with two regions shaded. These regions represent the probability that an unbiased coin yields a deviation from the expected as extreme as 15.

In [19]:
patch = plot_hist(diffs)

# fill the right tail of the hist
low = observed_diff
high = diffs.max()
fill_hist(low, high, patch)

# fill the left tail of the hist
low = diffs.min()
high = -observed_diff
fill_hist(low, high, patch)

plt.title('Sampling distribution (n=250)')
plt.xlabel('Deviation from expected number of heads')

These results show that there is a non-negligible chance of getting a result as extreme as 140 heads, even if the coin is actually fair.

So even if the results are "suspicious" they don't provide compelling evidence that the coin is biased.

Exercise: There are a few ways to make "crooked" dice, that is, dice that are more likely to land on one side than the others. Suppose you run a casino and you suspect that a patron is using a die that comes up 3 more often than it should.

You confiscate the die, roll it 300 times, and 63 times it comes up 3. Does that support your suspicions?

  • To answer this question, use spin to simulate the experiment, assuming that the die is fair.

  • Use a for loop to run spin 1000 times and store the results in a NumPy array.

  • What is the mean of the results from the simulated experiments?

  • What is the expected number of 3s if the die is fair?

  • Use plot_hist to plot the results. The histogram you plot approximates the sampling distribution.

In [20]:
# Solution goes here
In [21]:
# Solution goes here
In [22]:
# Solution goes here
In [23]:
# Solution goes here
In [24]:
# Solution goes here

Exercise: Continuing the previous exercise, compute the probability of seeing a deviation from the expected value that is "as extreme" as the observed difference.

For this context, what do you think is the best definition of "as extreme"?

Plot the histogram of the random deviations again, and use fill_hist to fill the region of the histogram that corresponds to the p-value you computed.

In [25]:
# Solution goes here
In [26]:
# Solution goes here
In [27]:
# Solution goes here

Are first babies more likely to be late?

The examples so far have been based on coins and dice, which are relatively simple. In this section we'll look at an example that's based on real-world data.

Here's the motivation for it: When my wife and I were expecting our first baby, we heard that first babies are more likely to be late. We also hear that first babies are more likely to be early. Neither claim was supported by evidence.

Fortunately, I am a data scientist! Also fortunately, the CDC runs the National Survey of Family Growth (NSFG), which "gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health."

I got the data from their web page,, and wrote some code to get it into a Pandas Dataframe:

In [28]:
import os

if not os.path.exists('nsfg.hdf5'):
In [29]:
import pandas as pd

nsfg = pd.read_hdf('nsfg.hdf5')

The nsfg DataFrame contains 9358 rows, one for each recorded pregnancy, and 11 columns, one of each of the variables I selected.

Here are the first few lines.

In [30]:

The variables we need are birthord, which indicates birth order, and prglength, which is pregnancy length in weeks.

I'll make two boolean Series, one for first babies and one for others.

In [31]:
firsts = (nsfg.birthord == 1)
others = (nsfg.birthord > 1)

np.sum(firsts), np.sum(others)

We can use the boolean Series to select pregnancy lengths for the two groups and compute their means:

In [32]:
prglngth = nsfg['prglngth']

mean_first = prglngth[firsts].mean() 
mean_other = prglngth[others].mean()

mean_first, mean_other

Here's the difference in means, in weeks.

In [33]:
diff = mean_first - mean_other

Here it is converted to days:

In [34]:
diff * 7

It looks like first babies are born 1.4 days later than other babies, on average.

Hypothesis testing

The apparent difference between these groups is based on a random sample that is much smaller than the actual population. So we can't be sure that the difference we see in the sample reflects a real difference in the population. There are two other possibilities we should keep in mind:

  • Systematic errors: The sample might be more likely to include some pregancies, and less likely to include others, in a way that causes an apparent difference in the sample, even if there is no such difference in the population.

  • Sampling errors: Even if every pregnancy is equally likely to appear in the sample, it is still possible to see a difference in the sample that is not in the population, just because of random variation.

We can never rule out the possibility of systematic errors, but we can test whether an apparent effect could be explained by random sampling.

Here's how:

  1. First we choose a "test statistic" that measures the size of the effect; the test statistic in this example is the difference in mean pregnancy length.

  2. Next we define a model of the population under the assumption that there is actually no difference between the groups. This assumption is called the "null hypothesis".

  3. Then we use the model to compute the distribution of the test statistic under the null hypothesis.

We have already done step 1, but to make it easier to repeat, I'll wrap it in a function.

In [35]:
def test_stat(group1, group2):
    """Difference in means.
    group1: sequence of values
    group2: sequence of values
    returns: float difference in means
    diff = np.mean(group1) - np.mean(group2)
    return diff

test_stat takes two sequences and computes the difference in their means.

Here's how we use it to compute the actual difference in the sample.

In [36]:
group1 = prglngth[firsts]
group2 = prglngth[others]

actual_diff = test_stat(group1, group2)

Now we have to define the null hypothesis, which is a model of the world where there is no difference in pregnancy length between first babies and others.

One way to do that is to put the two groups together and then divide them up again at random. That way the distribution of pregnancy lengths is the same for both groups.

I'll use concatenate to pool the groups.

In [37]:
len(group1), len(group2)
In [38]:
pool = np.concatenate([group1, group2])

I'll use shuffle to reorder them.

In [39]:

Then I'll use split to make two simulated groups, the same size as the originals.

In [40]:
n = len(group1)
sim_group1, sim_group2 = np.split(pool, [n])
In [41]:
len(sim_group1), len(sim_group2)

Now we can compute the test statistic for the simulated data.

In [42]:
test_stat(sim_group1, sim_group2)

In the simulated data, the distribution of pregnancy lengths is the same for both groups, so the difference is usually close to 0.

But because it is based on a random shuffle of the groups, we get a different value each time we run it.

To see what the whole distribution looks like, we can run the simulation many times and store the results.

In [43]:
diffs = np.empty(1000)

for i in range(len(diffs)):
    sim_group1, sim_group2 = np.split(pool, [n])
    diffs[i] = test_stat(sim_group1, sim_group2)

The result is the "sampling distribution of the test statistic under the null hypothesis".

The mean of this distribution should close to zero, because it is based on the assumption that there is actually no difference between the groups.

In [44]:

And here's what the whole distribution looks like.

In [45]:

plt.xlabel('Difference in mean (weeks)')
plt.title('Distribution of test statistic under null hypothesis');

If there were actually no difference between the groups, we would expect to see a difference as big as 0.15 weeks by chance, at least occasionally. But a difference as big as 0.2 would be rare.

To quantify that surprise, we can estimate the probability that the test statistic, under the null hypothesis, exceeds the observed differences in the means.

The result is a "p-value".

In [46]:
p_value = np.mean(diffs >= actual_diff)

In this example the result is 0, which is to say that in 1000 simulations of the null hypothesis, we never saw a difference as big as 0.2.

Interpreting p-values

To interpret this result, remember that we started with three possible explanations for the observed difference between the groups:

  1. The observed difference might be "real"; that is, there might be an actual difference in pregnancy length between first babies and others.

  2. There might be no real difference between the groups, and the observed difference might be because of a systematic error in the sampling process or the data collection process. For example, maybe reported pregnancy lengths are less accurate for first time mothers.

  3. There might be no real difference between the groups, and the observed difference might be due to random variation in the sampling process.

By computing a p-value, we have established that it would be rare to see a difference as big as 0.2 due to sampling alone. So we can conclude that the third explanation is unlikely.

That makes it more likely that the difference is real, but we still can't rule out the second possibility.

Exercise: The test statistic we chose is the difference in means between the two groups.

But suppose we would like to know whether first babies are more unpredictable than other babies. In that case the test statistic we choose might be the standard deviation of pregnancy length, which is one way to quantify unpredictability.

As an exercise:

  1. Write a version of test_stat that computes the difference in standard deviation between the groups.

  2. Write a loop that estimates the distribution of this test statistic under the null hypothesis.

  3. Compute a p-value.

In [47]:
# Solution goes here
In [48]:
# Solution goes here
In [49]:
# Solution goes here
In [50]:
# Solution goes here
In [51]:
# Solution goes here
In [52]:
# Solution goes here


Suppose we want to estimate the average height of men in the U.S.

We can use data from the BRFSS:

"The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services."

In [53]:
# Get the data file

import os

if not os.path.exists('brfss.hdf5'):
In [54]:
import pandas as pd

brfss = pd.read_hdf('brfss.hdf5', 'brfss')

We can use SEX to select male respondents.

In [55]:
male = (brfss.SEX == 1)

Then we select height data.

In [56]:
heights = brfss['HTM4']
data = heights[male]

We can use isnan to check for NaN values:

In [57]:
np.mean(np.isnan(data)) * 100

About 4% of the values are missing.

Here are the mean and standard deviation, ignoring missing data.

In [58]:
print('Mean male height in cm =', np.nanmean(data))
print('Std male height in cm =', np.nanstd(data))

Quantifying precision

At this point we have an estimate of the average adult male height. We'd like to know how accurate this estimate is, and how precise. In the context of estimation, these words have a technical distinction:

Given a set of data points from repeated measurements of the same quantity, the set can be said to be precise if the values are close to each other, while the set can be said to be accurate if their average is close to the true value of the quantity being measured.

Usually accuracy is what we really care about, but it's hard to measure accuracy unless you know the true value. And if you know the true value, you don't have to estimate it.

Quantifying precision is not as useful, but it is much easier. Here's one way to do it:

  1. Use the data you have to make a model of the population.

  2. Use the model to simulate the data collection process.

  3. Use the simulated data to compute an estimate.

By repeating these steps, we can quantify the variability of the estimate due to random sampling.

To model the population, I'll use resampling; that is, I will treat the observed measurements as if they were taken from the entire population, and I will draw random samples from them.

We can use np.random.choice to resample the data:

In [59]:
size = len(data)
sim_data = np.random.choice(data, size, replace=True)

With replace=True, we sample with replacement, which means that some measurements might be chosen more than once, and some might not be chosen at all.

(If we sample without replacement, the resampled data is always identical to the original, so that's no good.)

Now we can use nanmean to compute the mean of the simulated data, ignoring missing values.

In [60]:

If we repeat this process 1000 times, we can see how much the results vary.

In [61]:
outcomes = np.empty(1000)
size = len(data)

for i in range(len(outcomes)):
    sim_data = np.random.choice(data, size, replace=True)
    outcomes[i] = np.nanmean(sim_data)

The result is the "sampling distribution", which shows how much the results of the experiment would vary if we ran it many times. Here's what it looks like:

In [62]:
plt.title('Sampling distribution of the mean')
plt.xlabel('Mean adult male height, U.S.');

The width of this distribution shows how much the results vary from one experiment to the next.

We can quantify this variability by computing the standard deviation of the sampling distribution, which is called "standard error".

In [63]:
std_err = np.std(outcomes)

We can also summarize the sampling distribution with a "confidence interval", which is a range that contains a specified fraction, like 90%, of the values in sampling_dist_mean.

The central 90% confidence interval is between the 5th and 95th percentiles of the sampling distribution.

In [64]:
ci_90 = np.percentile(outcomes, [5, 95])

The following function plots a histogram and shades the 90% confidence interval.

In [65]:
def plot_sampling_dist(outcomes):
    patch = plot_hist(outcomes)
    low, high = np.percentile(outcomes, [5, 95])
    fill_hist(low, high, patch)
    print('Mean = ', np.mean(outcomes))
    print('Std error = ', np.std(outcomes))
    print('90% CI = ', (low, high))

Here's what it looks like for the sampling distribution of mean adult height:

In [66]:
plt.xlabel('Mean adult male height, U.S. (%)');

For an experiment like this, we can compute the standard error analytically.

In [67]:
size = len(data)
analytic_std_err = np.std(data) / np.sqrt(size)

The result is close to what we observed computationally.

In [68]:
analytic_std_err, std_err

This result indicates that our estimate of the mean is precise; that is, if we ran this experiment many times, the results would fall in a narrow range.

But this range reflects only variability due to random sampling. If there are systematic errors in the sampling process, or in the measurement process, the result would not be accurate.

Computing a standard error or confidence interval can be useful, but it only quantifies variability due to random sampling, not other sources of error.

Exercise: One nice thing about using resampling is that it is easy to compute the sampling distribution for other statistics.

For example, suppose we want to estimate the coefficient of variation (standard deviation as a fraction of the mean) for adult male height. Here's how we can compute it.

In [69]:
cv = np.nanstd(data) / np.nanmean(data)

So the standard deviation is about 5% of the mean.

Write a loop that uses resampling to estimate the sampling distribution of cv; store the results in an array named outcomes.

Then use plot_sampling_dist to plot the sampling distribution of the coefficient of variation.

What is the standard error of the estimated coefficient of variation?

In [70]:
# Solution goes here
In [71]:
# Solution goes here


This notebook presents computational methods for computing p-values, standard errors, and confidence intervals. The two processes are similar, but they answer different questions.

The following diagram outlines the hypothesis testing process:

Again, the key steps are

  1. Choose a test statistic that quantifies the observed effect.

  2. Define a model of the null hypothesis and use it to generate simulated data.

  3. Compute the distribution of the test statistic under the null hypothesis.

  4. Compute a p-value, which is probability, under the null hypothesis, of seeing an effect as extreme as what you saw.

The following figure shows the similar process for computing standard errors and confidence intervals.

The essential steps are:

  1. Choose a sample statistic that quantifies the thing you want to estimate.

  2. Use the data to make a model of the population, assuming that the estimate is accurate.

  3. Use the model to simulate the sampling process and generate simulated data.

  4. Compute the sampling distribution of the estimate.

  5. Use the sampling distribution to compute the standard error, confidence interval, or both.

Finally, remember that both processes only account for variability due to random sampling. They don't tell us anything about systematic errors in the sampling process, measurement error, or other sources of error.

In [ ]: