# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import math
from scipy import stats
import numpy as np
np.random.seed(0)
Whenever we analyze a random sample of a population, our goal is to draw robust conclusions about the whole population, despite the fact that we can measure only part of it. We must guard against being misled by the inevitable random variations in a sample. It is possible for a sample to have very different characteristics from the population itself, but such samples are unlikely. It is much more typical that a random sample, especially a large simple random sample, is representative of the population. More importantly, the chance that a sample resembles the population is something that we know in advance, based on how we selected our random sample. Statistical inference is the practice of using this information to make precise statements about a population based on a random sample.
The Bay Area Bike Share service published a dataset describing every bicycle rental from September 2014 to August 2015 in their system. This set of all trips is a population, and fortunately we have the data for each and every trip rather than just a sample. We'll focus only on the free trips, which are trips that last less than 1800 seconds (half an hour).
free_trips = Table.read_table("trip.csv").where('Duration', are.below(1800)).select([3, 6, 1])
free_trips
Start Station | End Station | Duration |
---|---|---|
Harry Bridges Plaza (Ferry Building) | San Francisco Caltrain (Townsend at 4th) | 765 |
San Antonio Shopping Center | Mountain View City Hall | 1036 |
Post at Kearny | 2nd at South Park | 307 |
San Jose City Hall | San Salvador at 1st | 409 |
Embarcadero at Folsom | Embarcadero at Sansome | 789 |
Yerba Buena Center of the Arts (3rd @ Howard) | San Francisco Caltrain (Townsend at 4th) | 293 |
Embarcadero at Folsom | Embarcadero at Sansome | 896 |
Embarcadero at Sansome | Steuart at Market | 255 |
Beale at Market | Temporary Transbay Terminal (Howard at Beale) | 126 |
Post at Kearny | South Van Ness at Market | 932 |
... (338333 rows omitted)
A quantity measured for a whole population is called a parameter. For example, the average duration of a free trip between September 2014 and August 2015 can be computed exactly from these data: 550.00 seconds.
np.average(free_trips.column('Duration'))
550.00143345658103
Although we have information for the full population, we will investigate what we can learn from a simple random sample. Below, we sample 100 trips without replacement from `free_trips` not just once, but 40 different times. All 4000 trips are stored in a table called `samples`.
n = 100
samples = Table(['Sample #', 'Duration'])
for i in np.arange(40):
    for trip in free_trips.sample(n).rows:
        samples.append([i, trip.item('Duration')])
samples
Sample # | Duration |
---|---|
0 | 557 |
0 | 491 |
0 | 877 |
0 | 279 |
0 | 339 |
0 | 542 |
0 | 715 |
0 | 255 |
0 | 1543 |
0 | 194 |
... (3990 rows omitted)
A quantity measured for a sample is called a statistic. For example, the average duration of a free trip in a sample is a statistic, and we find that there is some variation across samples.
for i in np.arange(3):
    sample_average = np.average(samples.where('Sample #', i).column('Duration'))
    print("Average duration in sample", i, "is", sample_average)
Average duration in sample 0 is 547.29
Average duration in sample 1 is 500.6
Average duration in sample 2 is 660.16
Often, the goal of analysis is to estimate parameters from statistics. Proper statistical inference involves more than just picking an estimate: we should also say something about how confident we should be in that estimate. A single guess is rarely sufficient to make a robust conclusion about a population. We also need to quantify in some way how precise we believe that guess to be.
Already we can see from the example above that the average duration of a sample is around 550, the parameter. Our goal is to quantify this notion of "around", and one way to do so is to state a range of values.
We can learn a great deal about the behavior of a statistic by computing it for many samples. In the scatter diagram below, the order in which the statistic was computed appears on the horizontal axis, and the value of the average duration for a sample of n
trips appears on the vertical axis.
samples.group(0, np.average).scatter(0, 1, s=50)
The sample number doesn't tell us anything beyond when the sample was selected, and we can see from this unstructured cloud of points that one sample does not affect the next in any consistent way. The sample averages are indeed all around 550. Some are above, and some are below.
samples.group(0, np.average).hist(1, bins=np.arange(480, 641, 10))
There are 40 averages for 40 samples. If we find an interval that contains the middle 38 out of 40, it will contain 95% of the samples. One such interval is bounded by the second smallest and the second largest sample averages among the 40 samples.
averages = np.sort(samples.group(0, np.average).column(1))
lower = averages.item(1)
upper = averages.item(38)
print('Empirically, 95% of sample averages fell within the interval', lower, 'to', upper)
Empirically, 95% of sample averages fell within the interval 502.38 to 620.04
This statement gives us our first notion of a margin of error. If this statement were to hold up not just for our 40 samples but for all samples, then a reasonable strategy for estimating the true population average would be to draw a 100-trip sample, compute its sample average, and guess that the true average lies within a similarly sized interval around that sample average. We would be right at least 95% of the time.
The problem with this approach is that finding such an interval required drawing many different samples, and we would like to be able to say something similar after having collected only one 100-trip sample.
By contrast, the individual durations in the samples themselves are not closely clustered around 550. Instead, they span the whole range from 0 to 1800. The durations for the first three 100-item samples are visualized below. All have somewhat similar shapes because they were drawn from the same population.
for i in np.arange(3):
    samples.where('Sample #', i).hist(1, bins=np.arange(0, 1801, 200))
We have now observed two different ways in which the variation of a random sample can be observed: the statistics computed from those samples vary in magnitude, and the sampled quantities themselves vary in distribution. That is, there is variability across samples, which we see by comparing sample averages, and there is variability within each sample, which we see in the duration histograms above.
The 80th percentile of a collection of values is the smallest value in the collection that is at least as large as 80% of the values in the collection. Likewise, the 40th percentile is at least as large as 40% of the values.
For example, let's consider the sizes of the five largest continents: Africa, Antarctica, Asia, North America, and South America, rounded to the nearest million square miles.
sizes = [12, 17, 6, 9, 7]
The 20th percentile is the smallest value that is at least as large as 20% or one fifth of the elements.
percentile(20, sizes)
6
The 60th percentile is at least as large as 60% or three fifths of the elements.
percentile(60, sizes)
9
To find the element corresponding to a percentile $p$, sort the elements and take the $k$th element, where $k=\frac{p}{100} \times n$ where $n$ is the size of the collection. In general, any percentile from 0 to 100 can be computed for a collection of values, and it is always an element of the collection. If $k$ is not an integer, round it up to find the percentile element. For example, the 90th percentile of a five element set is the fifth element, because $\frac{90}{100} \times 5$ is 4.5, which rounds up to 5.
percentile(90, sizes)
17
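This rule can be sketched directly with NumPy. The `pctile` helper below is hypothetical, written only to illustrate the "sort, then round $k$ up" rule; it is not the datascience library's `percentile` function.

```python
import math
import numpy as np

def pctile(p, values):
    """Return the p-th percentile of `values`: sort, then take the
    k-th element, where k = ceil(p/100 * n).  The result is always
    an element of the collection."""
    v = np.sort(np.asarray(values))
    k = math.ceil(p * len(v) / 100)  # round k up when it is not an integer
    return v[max(k, 1) - 1]          # k is 1-indexed

sizes = [12, 17, 6, 9, 7]
pctile(20, sizes)  # 6
pctile(60, sizes)  # 9
pctile(90, sizes)  # 17 (k = 4.5 rounds up to 5, the largest element)
```

Computing `p * len(v) / 100` before dividing avoids floating-point surprises when the product should be an exact integer.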
When the `percentile` function is called with a single argument, it returns a function that computes percentiles of collections.
tenth = percentile(10)
tenth(sizes)
6
tenth(np.arange(50))
4
tenth(np.arange(5000))
499
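This single-argument behavior can be mimicked with a small closure. The `make_percentile` function below is again a hypothetical sketch of the idea, not the library's implementation:

```python
import math
import numpy as np

def make_percentile(p):
    """Return a function that computes the p-th percentile of any
    collection, using the same 'round k up' rule as above."""
    def compute(values):
        v = np.sort(np.asarray(values))
        k = math.ceil(p * len(v) / 100)
        return v[max(k, 1) - 1]
    return compute

tenth = make_percentile(10)
tenth([12, 17, 6, 9, 7])  # 6
tenth(np.arange(50))      # 4
```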
Inference begins with a population: a collection that you can describe but often cannot measure directly. One thing we can often do with a population is to draw a simple random sample. By measuring a statistic of the sample, we may hope to estimate a parameter of the population. In this process, the population is not random, nor are its parameters. The sample is random, and so is its statistic. The whole process of estimation therefore gives a random result. It starts from a population (fixed), draws a sample (random), computes a statistic (random), and then estimates the parameter from the statistic (random). By calling the process random, we mean that the final estimate could have been different because the sample could have been different.
A confidence interval is one notion of a margin of error around an estimate. The margin of error acknowledges that the estimation process could have been misleading because of the variation in the sample. It gives a range that might contain the original parameter, along with a percentage of confidence. For example, a 95% confidence interval is one in which we would expect the interval to contain the true parameter for 95% of random samples. For any particular sample, any particular estimate, and any particular interval, the parameter is either contained or not; we never know for sure. The purpose of the confidence interval is to quantify how often our estimation process gives us a range containing the truth.
There are many different ways to compute a confidence interval. The challenge in computing a confidence interval from a single sample is that we don't know the population, and so we don't know how its samples will vary. One way to make progress despite this limited information is to assume that the variability within the sample is similar to the variability within the population.
Even though we happen to have the whole population of `free_trips`, let's imagine for a moment that we have only a single 100-trip sample from this population. That sample has 100 durations. Our goal now is to learn what we can about the population using just this sample.
sample = free_trips.sample(n)
sample
Start Station | End Station | Duration |
---|---|---|
Grant Avenue at Columbus Avenue | San Francisco Caltrain 2 (330 Townsend) | 877 |
Embarcadero at Folsom | San Francisco Caltrain (Townsend at 4th) | 597 |
Beale at Market | Beale at Market | 67 |
2nd at Folsom | Washington at Kearny | 714 |
Clay at Battery | Embarcadero at Sansome | 510 |
South Van Ness at Market | Powell Street BART | 494 |
San Jose Diridon Caltrain Station | MLK Library | 530 |
Mechanics Plaza (Market at Battery) | Clay at Battery | 286 |
South Van Ness at Market | Civic Center BART (7th at Market) | 172 |
Broadway St at Battery St | 5th at Howard | 551 |
... (90 rows omitted)
average = np.average(sample.column('Duration'))
average
518.55999999999995
A technique called bootstrap resampling uses the variability of the sample as a proxy for the variability of the population. Instead of drawing many samples from the population, we will draw many samples from the sample. Since the original sample and each resample have the same size, we must draw with replacement in order to see any variability at all.
resamples = Table(['Resample #', 'Duration'])
for i in np.arange(40):
    resample = sample.sample(n, with_replacement=True)
    for trip in resample.rows:
        resamples.append([i, trip.item('Duration')])
resamples
Resample # | Duration |
---|---|
0 | 329 |
0 | 252 |
0 | 252 |
0 | 685 |
0 | 434 |
0 | 396 |
0 | 434 |
0 | 585 |
0 | 394 |
0 | 202 |
... (3990 rows omitted)
Using these `resamples`, we can investigate the variability of the corresponding sample averages.
resamples.group(0, np.average).scatter(0, 1, s=50)
These resample averages are not necessarily clustered around the true average of 550. Instead, they cluster around the average of the single sample from which they were drawn.
resamples.group(0, np.average).hist(1, bins=np.arange(480, 641, 10))
Although the center of this distribution is not the same as the true population average, its variability is similar.
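This proxy claim can be checked numerically. The sketch below uses NumPy alone with a synthetic uniform population of durations (an assumption made only so the example is self-contained): the spread of averages across fresh samples from the population is compared with the spread of averages across resamples of one fixed sample.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.integers(60, 1800, size=100_000)  # synthetic durations
n = 100

# Spread of averages across 1000 fresh samples from the population.
fresh_avgs = np.array([rng.choice(population, n).mean() for _ in range(1000)])

# Spread of averages across 1000 resamples of one fixed sample,
# drawn with replacement at the original sample size.
sample = rng.choice(population, n)
resample_avgs = np.array([rng.choice(sample, n).mean() for _ in range(1000)])

# The two standard deviations are similar, even though the resample
# averages center on the sample average rather than the population average.
print(fresh_avgs.std(), resample_avgs.std())
```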
Bootstrap resampling estimates a confidence interval from these resamples. To find a confidence interval, we must estimate how much the sample statistic deviates from the population parameter. The bootstrap assumes that a resampled statistic deviates from the sample statistic in approximately the same way that the sample statistic deviates from the parameter.
To carry out the technique, we first compute the deviations from the sample statistic (in this case, the sample average) for many resamples. The number of resamples is arbitrary and requires no new data collection; 1000 or more is typical.
deviations = Table(['Resample #', 'Deviation'])
for i in np.arange(1000):
    resample = sample.sample(n, with_replacement=True)
    dev = average - np.average(resample.column('Duration'))
    deviations.append([i, dev])
deviations
deviations
Resample # | Deviation |
---|---|
0 | 30.95 |
1 | 41.16 |
2 | 3.37 |
3 | 2.54 |
4 | -22.87 |
5 | 27.63 |
6 | 49.93 |
7 | 11.91 |
8 | 26.32 |
9 | -5.97 |
... (990 rows omitted)
A deviation of zero indicates that the resampled statistic is the same as the sampled statistic. That's evidence that the sample statistic is perhaps the same as the population parameter. A positive deviation is evidence that the sample statistic may be greater than the parameter. As we can see, for averages, the distribution of deviations is roughly symmetric.
deviations.hist(1, bins=np.arange(-100, 101, 10))
A 95% confidence interval is computed from the middle 95% of this distribution of deviations. The upper and lower deviations are given by the 97.5th and 2.5th percentiles respectively, because the span between them contains 95% of the distribution.
lower = average - percentile(97.5, deviations.column(1))
lower
461.93000000000001
upper = average - percentile(2.5, deviations.column(1))
upper
576.29999999999995
Now, we have not only an estimate, but an interval around it that expresses our uncertainty about the value of the parameter.
print('A 95% confidence interval around', average, 'is', lower, 'to', upper)
A 95% confidence interval around 518.56 is 461.93 to 576.3
And behold, the interval happens to contain the true parameter 550. Normally, after generating a confidence interval, there is no way to check that it is correct, because to do so requires access to the entire population. In this case, we started with the whole population in a single table, and so we are able to check.
The following function encapsulates the entire process of producing a confidence interval for the population average using a single sample. Each time it is run, even for the same sample, the interval will be slightly different because of the random variation of resampling. However, the upper and lower bounds generated from 1000 resamples are typically quite similar across multiple calls for this problem.
def average_ci(sample, label):
    deviations = Table(['Resample #', 'Deviation'])
    n = sample.num_rows
    average = np.average(sample.column(label))
    for i in np.arange(1000):
        resample = sample.sample(n, with_replacement=True)
        dev = average - np.average(resample.column(label))
        deviations.append([i, dev])
    return (average - percentile(97.5, deviations.column(1)),
            average - percentile(2.5, deviations.column(1)))
average_ci(sample, 'Duration')
(457.43999999999994, 572.63999999999987)
average_ci(sample, 'Duration')
(458.7399999999999, 575.50999999999988)
However, if we draw an entirely new sample, then the confidence interval will change.
average_ci(free_trips.sample(n), 'Duration')
(530.18000000000006, 661.21000000000004)
The interval is about the same width, and that's typical of multiple samples of the same size drawn from the same population. However, if a much larger sample is drawn, then the confidence interval that results is typically going to be much smaller. We're not losing confidence in this case; we're just using more evidence to perform inference, and so we have less uncertainty about the full population.
average_ci(free_trips.sample(16 * n), 'Duration')
(527.18937499999993, 555.11437499999988)
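The same deviation method can be written with NumPy alone, which may clarify the arithmetic. This is a sketch under assumptions, not the datascience implementation: it takes a plain numeric array, and it uses `np.percentile`, which interpolates between elements rather than always returning one — a negligible difference with 1000 deviations. The `durations` array is synthetic, standing in for a real sample's column.

```python
import numpy as np

def bootstrap_average_ci(values, reps=1000, seed=None):
    """95% bootstrap confidence interval for the mean of `values`,
    using the deviation method described above."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    average = values.mean()
    # Each row is one resample, drawn with replacement at the original size.
    resamples = rng.choice(values, size=(reps, len(values)))
    devs = average - resamples.mean(axis=1)  # deviations from the sample average
    return (average - np.percentile(devs, 97.5),
            average - np.percentile(devs, 2.5))

durations = np.random.default_rng(1).integers(60, 1800, size=100)  # synthetic sample
lower, upper = bootstrap_average_ci(durations, seed=0)
```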
When given only a sample, it is not possible to check whether the parameter was estimated correctly. However, in the case of the Bay Area Bike Share, we have not only the sample, but many samples and the whole population. Therefore, we can check how often the confidence intervals we compute in this way do in fact contain the true parameter.
Before measuring the accuracy of our technique, we can visualize the result of resampling each sample. The following scatter diagram has 6 vertical lines for 6 samples, and each line consists of 100 points that represent the statistics from resamples for that sample.
samples = Table(['Sample #', 'Resample Average', 'Sample Average'])
for i in np.arange(6):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    for j in np.arange(100):
        resample = sample.sample(n, with_replacement=True)
        resample_average = np.average(resample.column('Duration'))
        samples.append([i, resample_average, average])
samples.scatter(0, s=80)
We can see from the plot above that for almost every sample, there are some resampled averages above 550 and some below. When we use these resampled averages to compute confidence intervals, many of those intervals contain 550 as a result.
Below, we draw 100 more samples and show the upper and lower bounds of the confidence intervals computed from each one.
intervals = Table(['Sample #', 'Lower', 'Estimate', 'Upper'])
for i in np.arange(100):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    lower, upper = average_ci(sample, 'Duration')
    intervals.append([i, lower, average, upper])
We can compute precisely which intervals contain the truth, and we should find that about 95 of them do.
correct = np.logical_and(intervals.column('Lower') <= 550, intervals.column('Upper') >= 550)
np.count_nonzero(correct)
97
Each confidence interval is generated from only a single sample of 100 trips. As a result, the width of the interval varies quite a bit as well. Compared to the interval width we computed originally (which required multiple samples), we see that some of these intervals are much more narrow and some are much wider.
Table().with_column('Width', intervals.column('Upper')-intervals.column('Lower')).hist(0)
Given a single sample, it is impossible to know whether you have correctly estimated the parameter or not. However, you can infer a precise notion of confidence by following a process based on random sampling.