This notebook will allow you to practice some of the concepts from ThinkStats2 Chapter 9.
First, we'll start with the question that Allen poses at the beginning of the chapter: "Suppose we toss a coin 250 times and we see 140 heads. Is this strong evidence that the coin is biased?"
As Allen says, classical hypothesis testing is similar to a proof by contradiction. First, we assume that the thing we are trying to show is false (that the coin is biased). Second, we show that this leads to an observed event being excedingly improbable (seeing 140 heads out of 250 tosses). Finally, we can conclude that our assumption (that the coin is not biased) is unlikely to be true.
Write a function to simulate n random coin flips of a fair coin (p(heads) = 0.5). Your function should return the number of heads that occur in those n coin clips.
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
counter = 0
for i in range(n):
choices = choice(["heads", "tails"])
if choices == "heads":
counter += 1
return counter
flipData = []
for i in range(1000):
flipData.append(simulate_fair_coin_flips(250))
print flipData
[142, 115, 120, 127, 140, 129, 122, 119, 119, 134, 141, 132, 120, 109, 128, 129, 138, 126, 114, 106, 136, 132, 124, 112, 124, 138, 124, 126, 125, 117, 116, 134, 114, 117, 113, 132, 117, 129, 114, 124, 117, 116, 125, 119, 140, 129, 145, 140, 132, 128, 132, 128, 120, 138, 127, 131, 131, 113, 120, 122, 117, 118, 122, 112, 123, 116, 131, 125, 124, 121, 139, 125, 124, 110, 119, 123, 125, 124, 135, 122, 139, 122, 131, 120, 131, 117, 125, 121, 120, 126, 128, 126, 114, 116, 135, 121, 123, 126, 136, 122, 123, 117, 130, 120, 135, 132, 118, 136, 129, 120, 119, 130, 119, 132, 111, 134, 123, 128, 110, 134, 124, 111, 124, 140, 130, 118, 132, 127, 112, 137, 110, 132, 123, 119, 113, 119, 126, 124, 116, 119, 131, 130, 122, 133, 112, 127, 125, 111, 123, 131, 121, 127, 138, 126, 122, 122, 108, 124, 134, 114, 127, 119, 140, 128, 122, 136, 112, 136, 116, 129, 122, 129, 135, 123, 130, 122, 123, 112, 141, 123, 131, 117, 130, 119, 134, 132, 120, 124, 119, 131, 124, 138, 131, 124, 113, 121, 122, 114, 134, 126, 137, 126, 121, 129, 132, 111, 126, 144, 111, 140, 123, 117, 131, 124, 109, 129, 127, 121, 123, 126, 126, 124, 126, 127, 121, 139, 133, 130, 125, 128, 127, 125, 117, 129, 127, 118, 134, 118, 127, 130, 113, 124, 129, 123, 125, 129, 129, 116, 127, 138, 119, 123, 122, 122, 105, 122, 132, 126, 125, 128, 109, 125, 125, 117, 132, 117, 124, 132, 122, 130, 122, 122, 133, 133, 121, 127, 128, 121, 120, 144, 113, 123, 136, 125, 120, 120, 134, 120, 116, 133, 113, 127, 140, 131, 116, 115, 120, 118, 124, 125, 124, 116, 129, 113, 123, 134, 130, 127, 116, 109, 128, 108, 118, 124, 131, 124, 112, 115, 130, 121, 133, 134, 117, 119, 131, 117, 125, 122, 121, 135, 126, 119, 119, 119, 118, 118, 118, 114, 122, 124, 118, 136, 122, 131, 131, 113, 134, 122, 130, 127, 125, 120, 108, 125, 134, 133, 131, 128, 113, 126, 117, 119, 124, 128, 116, 113, 137, 117, 133, 123, 138, 119, 123, 126, 114, 116, 109, 135, 137, 124, 119, 124, 130, 113, 129, 137, 121, 117, 111, 114, 134, 132, 121, 126, 112, 111, 132, 128, 116, 121, 114, 110, 135, 133, 109, 119, 132, 121, 122, 142, 132, 105, 121, 129, 128, 132, 124, 108, 125, 127, 119, 135, 122, 129, 135, 128, 127, 115, 135, 139, 120, 150, 125, 129, 121, 132, 126, 112, 126, 128, 115, 109, 116, 125, 134, 133, 121, 121, 121, 125, 125, 127, 122, 126, 106, 124, 116, 140, 123, 128, 127, 119, 137, 129, 113, 126, 129, 125, 127, 120, 117, 115, 120, 127, 123, 130, 121, 131, 138, 129, 113, 125, 117, 129, 134, 120, 124, 137, 127, 128, 119, 133, 131, 129, 113, 130, 137, 121, 128, 124, 121, 132, 129, 128, 125, 125, 124, 130, 108, 115, 132, 129, 126, 121, 138, 131, 122, 123, 124, 125, 144, 127, 119, 125, 117, 136, 129, 121, 124, 128, 119, 128, 117, 137, 128, 130, 141, 115, 125, 144, 119, 115, 132, 135, 113, 119, 119, 126, 112, 128, 129, 123, 121, 124, 118, 122, 109, 129, 133, 125, 117, 124, 123, 126, 113, 125, 132, 122, 122, 124, 135, 139, 119, 130, 130, 132, 133, 135, 118, 120, 109, 119, 115, 113, 122, 120, 122, 138, 129, 124, 121, 131, 125, 117, 119, 131, 121, 131, 117, 132, 117, 114, 121, 127, 121, 116, 126, 149, 118, 130, 119, 123, 113, 121, 129, 121, 126, 122, 132, 110, 121, 132, 145, 118, 107, 134, 143, 114, 106, 114, 125, 125, 129, 129, 124, 109, 133, 135, 120, 123, 117, 131, 114, 115, 112, 140, 127, 132, 117, 120, 125, 115, 122, 118, 129, 113, 124, 129, 118, 118, 126, 125, 123, 122, 128, 132, 116, 127, 131, 130, 134, 120, 115, 129, 126, 109, 113, 104, 118, 130, 133, 124, 124, 114, 117, 130, 135, 124, 126, 126, 124, 129, 113, 140, 125, 129, 123, 113, 126, 118, 128, 128, 124, 128, 128, 120, 127, 128, 120, 130, 126, 118, 119, 131, 126, 121, 139, 128, 127, 138, 122, 127, 133, 124, 126, 128, 131, 119, 126, 138, 122, 125, 122, 125, 123, 122, 132, 127, 124, 129, 127, 118, 126, 117, 118, 114, 119, 130, 129, 121, 125, 133, 116, 137, 121, 126, 128, 114, 110, 138, 121, 126, 131, 132, 120, 119, 107, 109, 128, 126, 126, 144, 125, 124, 116, 121, 130, 127, 123, 119, 121, 114, 127, 111, 126, 121, 127, 117, 113, 130, 140, 120, 116, 118, 124, 136, 115, 119, 124, 123, 124, 132, 129, 121, 114, 138, 124, 115, 133, 125, 116, 127, 119, 127, 117, 132, 123, 124, 132, 136, 129, 123, 122, 112, 130, 141, 122, 116, 131, 134, 128, 132, 130, 128, 112, 129, 129, 120, 140, 126, 125, 134, 123, 125, 125, 132, 134, 127, 131, 111, 124, 131, 126, 118, 127, 126, 132, 130, 126, 120, 123, 123, 124, 118, 126, 135, 121, 117, 122, 131, 132, 122, 116, 151, 112, 129, 115, 139, 131, 115, 118, 124, 124, 139, 138, 128, 127, 122, 124, 125, 123, 130, 133, 120, 126, 127, 121, 129, 112, 118, 133, 108, 128, 123, 116, 137, 130, 117, 115, 132, 131, 109, 121, 113, 124, 128, 129, 105, 136, 124, 122, 129, 125, 127, 133, 121, 112, 115, 123, 130, 129, 116, 126, 120, 137, 130, 123, 130, 107, 122, 122, 113, 119, 129, 139, 138, 131, 115, 124, 112, 116, 132, 115, 124, 134, 120, 133, 126, 114, 122, 137, 123, 139, 121, 127, 133, 131, 120, 127, 129, 109, 131, 133, 107, 126, 106, 115, 114, 129, 121, 117, 126, 123, 118, 130, 124, 140, 113, 120, 129, 131, 111, 128, 135, 119, 129, 118, 123, 122, 124]
from random import choice
import random
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
heads = 0
tails = 0
for i in range(n):
coinFlip = random.randint(0, 1)
if (coinFlip == 0):
heads +=1;
elif (coinFlip ==1):
tails +=1;
return (heads, tails)
print simulate_fair_coin_flips(250)
(136, 114)
from random import randint
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
return reduce(lambda heads, _: heads + randint(0, 1), xrange(n))
print simulate_fair_coin_flips(250)
137
from random import choice
import itertools
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
heads = 0
for _ in itertools.repeat(None, n):
heads += choice([0,1])
return heads
print simulate_fair_coin_flips(250)
121
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
return sum([choice([0, 1]) for i in range(n)])
print simulate_fair_coin_flips(250)
126
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
heads = 0
for i in xrange(n):
if choice(['heads', 'tails'])=='heads':
heads += 1
return heads
print simulate_fair_coin_flips(250)
133
from random import choice
import random
import thinkstats2
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
coin_flips = [random.choice('HT') for _ in range(n)]
hist = thinkstats2.Hist(coin_flips)
return hist['H']
print simulate_fair_coin_flips(250)
116
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
heads=0
for i in range(n):
heads += choice([0,1])
return heads
print simulate_fair_coin_flips(250)
145
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
side = [0,1]
h = 0;
t = 0;
for i in range(n):
if choice(side) == 0:
h += 1
return h
print simulate_fair_coin_flips(250)
133
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
return sum(choice((0,1)) for _ in xrange(n))
print simulate_fair_coin_flips(250)
125
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
heads = sum([choice([0,1]) for i in range(0,n)])
return heads
print simulate_fair_coin_flips(250)
132
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
toss = [choice([0,1]) for i in xrange(n)]
return sum(toss)
print simulate_fair_coin_flips(250)
127
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
sample = [choice('HT') for _ in range(n)]
return sample.count('H')
print simulate_fair_coin_flips(250)
122
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
count = 0
for i in range(0,n):
if choice([0,1]) == 1:
count += 1
return count
print simulate_fair_coin_flips(250)
138
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
count = 0
for _ in range(n):
if choice('HT') == 'H':
count += 1
return count
print simulate_fair_coin_flips(250)
137
from random import choice
import thinkstats2
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
sample = [choice('HT') for _ in range(n)]
hist =thinkstats2.Hist(sample)
return hist['H']
print simulate_fair_coin_flips(250)
122
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
countHeads = 0
#The options for whether or not the coin is heads
isHeads = [0, 1]
for i in range(n):
countHeads += choice(isHeads)
return countHeads
print simulate_fair_coin_flips(250)
131
from random import choice
choice([1,2,3])
1
from random import choice
import numpy as np
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
return sum(np.random.randint(2, size=n))
print simulate_fair_coin_flips(250)
105
from random import choice
import thinkstats2
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
sample = [choice('HT') for _ in range(n)]
hist = thinkstats2.Hist(sample)
return hist['H']
print simulate_fair_coin_flips(250)
132
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
headcount = 0
for i in range(n):
if choice([0,1]) == 0:
headcount+=1
return headcount
print simulate_fair_coin_flips(250)
123
from random import choice
def simulate_fair_coin_flips(n):
""" Return the number of heads that occur in n flips of a
fair coin p(heads) = 0.5 """
return sum([choice((0,1)) for i in range(n)])
print simulate_fair_coin_flips(250)
130
Next, repeat your simulation of 240 coin flips 1000 times. Create and display a CDF of the number of times heads appears based on 1000 random trials.
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
cdf = thinkstats2.Cdf(flipData, label='flipdata')
thinkplot.Cdf(cdf)
# your implementation here (imports included for convenience)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
# your implementation here (imports included for convenience)
headsAppears = [];
for i in range(1000):
coinFlipResults = simulate_fair_coin_flips(250)
heads = coinFlipResults[0]
headsAppears.append(heads)
headsCdf = thinkstats2.Cdf(headsAppears)
%matplotlib inline
import numpy as np
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
iters = 1000
flips = 250
heads = [simulate_fair_coin_flips(flips) for _ in range(iters)]
cdf = thinkstats2.Cdf(heads, label=('heads per %d flips' % flips))
thinkplot.Cdf(cdf)
thinkplot.show()
<matplotlib.figure.Figure at 0x7fcbbd448fd0>
# Simulated for 250 coin flips instead of 240
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
%matplotlib inline
headlist = []
for _ in itertools.repeat(None, 1000):
headlist.append(simulate_fair_coin_flips(250))
headcdf = thinkstats2.Cdf(headlist)
thinkplot.Cdf(headcdf)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
flips_1000 = [simulate_fair_coin_flips(250) for i in range(1000)]
cdf = thinkstats2.Cdf(flips_1000)
thinkplot.Cdf(cdf)
thinkplot.Show()
<matplotlib.figure.Figure at 0x7fae273bd1d0>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
heads_res = []
for i in xrange(1000):
heads_res.append(simulate_fair_coin_flips(250))
cdf = thinkstats2.Cdf(heads_res)
thinkplot.Cdf(cdf)
/home/jsutker/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
# your implementation here (imports included for convenience)
coin_flips = [simulate_fair_coin_flips(250) for i in range(1000)]
cdf = thinkstats2.Cdf(coin_flips)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
flips = [simulate_fair_coin_flips(250) for _ in range(1000)]
cdf = thinkstats2.Cdf(flips, label="flips")
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
holder = []
for i in range(1000):
holder += [simulate_fair_coin_flips(250)]
cdf_holder = thinkstats2.Cdf(holder)
thinkplot.Cdf(cdf_holder)
thinkplot.Show(title='CDF of Coin Flip Head Times',
xlabel='Number of Time Heads',
ylabel='CDF')
/home/tj/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.') /home/tj/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots. warnings.warn("No labelled objects found. "
<matplotlib.figure.Figure at 0x7f3130be7750>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
results = [simulate_fair_coin_flips(250) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
title='Coin Flips',
xlabel='Number of Heads',
ylabel='CDF'
)
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
head_cts = []
for i in range(1000):
head_cts.append(simulate_fair_coin_flips(250))
cdf = thinkstats2.Cdf(head_cts, label='Head Counts')
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='heads', ylabel='CDF')
<matplotlib.figure.Figure at 0x7f52519a0dd0>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
flips = [simulate_fair_coin_flips(250) for i in xrange(1000)]
cdf_flips = thinkstats2.Cdf(flips)
thinkplot.Cdf(cdf_flips)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
sample = []
for n in range(1000):
sample.append(simulate_fair_coin_flips(250))
sample = thinkstats2.Cdf(sample)
thinkplot.Cdf(sample)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
total = []
for i in range(0,1000):
total.append(simulate_fair_coin_flips(240))
cdf = thinkstats2.Cdf(total)
thinkplot.Cdf(cdf)
# your implementation here (imports included for convenience)
{'xscale': 'linear', 'yscale': 'linear'}
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
# your implementation here (imports included for convenience)
res = []
for _ in range(1000):
res.append(simulate_fair_coin_flips(240))
cdf = thinkstats2.Cdf(res)
thinkplot.Cdf(cdf)
thinkplot.show(xlabel='No of Heads in 240 coin flips', ylabel='CDF')
<matplotlib.figure.Figure at 0x10fde6110>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
# your implementation here (imports included for convenience)
headsCounts= [simulate_fair_coin_flips(240) for i in range(1000)]
cdf = thinkstats2.Cdf(headsCounts)
thinkplot.Cdf(cdf)
thinkplot.Config(title ='Number of times a fair coin toss results in heads')
thinkplot.Show(xlabel = 'Coin toss resulting in heads', ylabel ='CDF')
<matplotlib.figure.Figure at 0x7f5505153690>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
flipResults = []
for i in range(1000):
# The instructions say 240, but everything else says 250, so I'm going with 250
flipResults.append(simulate_fair_coin_flips(250))
flipCdf = thinkstats2.Cdf(flipResults, label = 'Coin Flips')
thinkplot.Cdf(flipCdf)
thinkplot.Show(xlabel = 'Probability of Heads', ylabel='CDF', title='CDF of Coin Flips')
<matplotlib.figure.Figure at 0x11231dd90>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
trials = [simulate_fair_coin_flips(250) for i in range(1000)]
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
resultsList = []
for i in range(10000):
resultsList.append(simulate_fair_coin_flips(250))
cdf = thinkstats2.Cdf(resultsList)
thinkplot.Cdf(cdf)
thinkplot.Show()
# your implementation here (imports included for convenience)
<matplotlib.figure.Figure at 0x7f23c4a9ddd0>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
heads = [simulate_fair_coin_flips(240) for i in range(1000)]
cdf = thinkstats2.Cdf(heads)
thinkplot.Cdf(cdf)
thinkplot.Config(title ='Number of occurences of heads')
thinkplot.Show(xlabel = 'Heads coin toss', ylabel ='CDF')
<matplotlib.figure.Figure at 0x10950ff10>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
# your implementation here (imports included for convenience)
hcs = []
for i in range(1000):
hcs.append(simulate_fair_coin_flips(250))
cdf = thinkstats2.Cdf(hcs)
thinkplot.Cdf(cdf, label='heads count')
thinkplot.Show(loc='lower right')
<matplotlib.figure.Figure at 0x7f9a1d53db50>
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
def coin_flips_trials(n, m):
head_num_trials = []
for i in range(n):
head_num_trials.append(simulate_fair_coin_flips(m))
hist = thinkstats2.Hist(head_num_trials)
thinkplot.Hist(hist)
return head_num_trials
trials = coin_flips_trials(1000, 250)
The p-value is simply the probability that we would have seen a result as extreme (or greater) as 140 heads out of 250 flips under the hypothesis that the coin is fair (the null hypothesis). Using the CDF you created in the previous cell, compute the p-value. If you want to test your learning a bit more: compute the p-value without using the CDF explicitly (instead use the results of the 1000 random trials directly).
Hint: you should use the PercentileRank function of CDF to compute the p-value, however, there is one important gotcha. The PercentileRank function returns the percentage of data that is equal to or less than the input value. When computing the p-value we want the percentage of the data that is equal to or greater than the observed value.
percentile = cdf.PercentileRank(140)
#print percentile
print 100 - percentile
1.7
pvalue_of_equal_to_or_less_than_1tailed = headsCdf.PercentileRank(140)
print "Pvalue for percentage of data that is equal to or less than for a one tailed test" , pvalue_of_equal_to_or_less_than
pvalue_of_equal_to_or_greater_than_1tailed = 100 - pvalue_of_equal_to_or_less_than
print "Pvalue for percentage of data that is equal to or greater than for a one tailed test", pvalue_of_equal_to_or_greater_than
Pvalue for percentage of data that is equal to or less than for a one tailed test 97.1 Pvalue for percentage of data that is equal to or greater than for a one tailed test 2.9
observed = 140
pvalue = 100 - cdf.PercentileRank(observed - 1)
num_above = sum(h >= observed for h in heads)
pvalue_calculated = 100 * num_above / float(len(heads))
print 'pvalue using PercentileRank: %f' % pvalue
print 'pvalue calculated: %f' % pvalue_calculated
pvalue using PercentileRank: 2.700000 pvalue calculated: 2.700000
import numpy as np
import scipy.stats as stats
p1 = 1 - 0.01*(stats.percentileofscore(np.array(headlist), 140))
print 'p value is', p1, '(one tailed)'
p value is 0.025 (one tailed)
rank = cdf.PercentileRank(140)
p_value = 100 - rank
print "p-value for 140 heads out of 250 coins: ", p_value
p-value for 140 heads out of 250 coins: 3.2
print "p-value:", str(100-cdf.PercentileRank(140))+"%"
p-value: 3.3%
print "P-value of data that 140/250 flips are heads"
print str(100 - cdf.PercentileRank(140)) + "%"
P-value that 140/250 flips are heads 2.8%
pvalue = 100 - cdf.PercentileRank(140)
print pvalue
2.7
p_val = 100 - cdf_holder.PercentileRank(139)
print p_val, '%'
3.4 %
1 - cdf[139]
0.040000000000000036
pvalue = 100 - cdf.PercentileRank(139)
print "Pvalue: ", pvalue, "%"
Pvalue: 3.6 %
p_value = 100 - cdf_flips.PercentileRank(139)
print 'P-value calculated with CDF: ', p_value, '%'
flips.sort()
ord_index = flips.index(140)
vals_above = 1 - ord_index/1000.0
print 'P-value calculated without CDF: ', 100*vals_above, '%'
P-value calculated with CDF: 3.1 % P-value calculated without CDF: 3.1 %
p_val = 100 - sample.PercentileRank(139)
print p_val
4.0
p_value = (100 - cdf.PercentileRank(139))/100
print 'P-value: ', p_value
P-value: 0.005
print "p-value:", float(format(100 - cdf.PercentileRank(139), '.2f'))
p-value: 0.4
percRank = cdf.PercentileRank(139)
pVal = 1 - percRank/100
print "P-value using CDF: ",pVal
count = sum(1.0 for x in headsCounts if x>= 140.0)
print "P-value using the results of 1000 random trials directly: ",count/1000
P-value using CDF: 0.004 P-value using the results of 1000 random trials directly: 0.004
percentileRank = flipCdf.PercentileRank(140)
pValue = 1 - float(percentileRank)/100
print "The p-value is ", pValue
The p-value is 0.018
print "Percent of data that is equal to or greater than value:"
print str(100 - cdf.PercentileRank(140)) + "%"
Percent of data that is equal to or greater than value: 2.5%
sum(i >= 140 for i in resultsList)/float(len(resultsList))
0.0286
percRank = cdf.PercentileRank(139)
pVal = 1 - (percRank / 100)
print "P value with CDF: ", pVal
count = sum(1.0 for x in heads if x >= 140.0)
print "P-value with the results of 1000 random trials: ", count / 1000
P value with CDF: 0.011 P-value with the results of 1000 random trials: 0.011
print "p-value:", 1 - cdf.PercentileRank(139)/100
# 139 because we want to include 140 in our counts
p-value: 0.033
trialsCDF = thinkstats2.Cdf(trials)
print 100.0 - trialsCDF.PercentileRank(140)
2.0
The p-value we computed above is called a one-tailed test in that we only counted simulations of the null-hypothesis that had 140 or more heads (Allen uses the terminology of one versus two-sided tests, see ThinkStats2 9.4). A two-tailed test would count simulations with 140 or more tails as well (which is what Allen shows in the book). Whether to use a one-tailed or a two-tailed test mostly has to do with your prior expectations regarding the hypothesis you are testing. For instance, if you had a reason to suspect that the coin would be biased towards heads (but not tails) you would use a one-tailed test. If you had no reason to assume a priori that the coin was biased towards heads or tails, you should use a two-tailed test.
Modify your coin flip simulation code to return the number of heads or tails, whichever is larger, out of n flips.
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
counterh = 0
countert = 0
for i in range(n):
choices = choice(["heads", "tails"])
if choices == "heads":
counterh += 1
else:
countert += 1
if counterh >= countert:
#print "heads"
return counterh
else:
#print "tails"
return countert
print simulate_fair_coin_flips_two_sided(250)
131
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = 0
tails = 0
for i in range(n):
coinFlip = random.randint(0, 1)
if (coinFlip == 0):
heads +=1;
elif (coinFlip ==1):
tails +=1;
# print heads, tails
if (heads > tails):
return heads
elif (tails > heads):
return tails
else:
return heads
print simulate_fair_coin_flips_two_sided(250)
128
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = simulate_fair_coin_flips(n)
tails = n - heads
return max(heads, tails)
print simulate_fair_coin_flips_two_sided(250)
141
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = 0
for _ in itertools.repeat(None, n):
heads += choice([0,1])
return max(heads, n-heads)
pass
print simulate_fair_coin_flips_two_sided(250)
137
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
count = sum([choice([0, 1]) for i in range(n)])
return count if (count>125) else (250-count)
print simulate_fair_coin_flips_two_sided(250)
126
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = 0
for i in xrange(n):
if choice(['heads', 'tails'])=='heads':
heads += 1
if heads>(n/2):
return heads
else:
return n-heads
print simulate_fair_coin_flips_two_sided(250)
131
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
coin_flips = [random.choice('HT') for _ in range(n)]
hist = thinkstats2.Hist(coin_flips)
if(hist['H'] >= hist['T']):
return hist['H']
else:
return hist['T']
print simulate_fair_coin_flips_two_sided(250)
128
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
ht=0
for i in range(n):
ht += choice([0,1])
if (n-ht)>ht:
return n-ht
else:
return ht
print simulate_fair_coin_flips_two_sided(250)
131
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
side = [0,1]
h = 0;
t = 0;
for i in range(n):
if choice(side) == 0:
h += 1
else:
t += 1
if t > h:
return t
else:
return h
print simulate_fair_coin_flips_two_sided(250)
136
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = sum(choice((0,1)) for _ in xrange(n))
return max(heads, n-heads)
print simulate_fair_coin_flips_two_sided(250)
127
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
result = sum([choice([0,1]) for i in range(n)])
if result >= n/2.0:
return result
else:
return n - result
print simulate_fair_coin_flips_two_sided(250)
130
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = simulate_fair_coin_flips(n)
return max(heads, n-heads)
print simulate_fair_coin_flips_two_sided(250)
125
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
toss = [choice([0,1]) for i in xrange(n)]
heads = sum(toss)
if heads >= n/2.0:
return heads
else:
return n-heads
print simulate_fair_coin_flips_two_sided(250)
130
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
sample = [choice('HT') for _ in range(n)]
if sample.count('H') > sample.count('T'):
return sample.count('H')
else:
return sample.count('T')
print simulate_fair_coin_flips_two_sided(250)
133
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
count = 0
for i in range(0,n):
if choice([0,1]) == 1:
count += 1
if count > n/2:
return count
return n-count
print simulate_fair_coin_flips_two_sided(250)
133
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
count_heads = 0
count_tails = 0
for _ in range(n):
if choice('HT') == 'H':
count_heads += 1
else:
count_tails += 1
return max(count_heads, count_tails)
print simulate_fair_coin_flips_two_sided(250)
125
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
sample = [choice('HT') for _ in range(n)]
hist =thinkstats2.Hist(sample)
return hist['H'], hist['T']
print simulate_fair_coin_flips_two_sided(250)
(127, 123)
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
outcomesDict ={
'heads': 0,
'tails': 0
}
coinOptions = ['heads', 'tails']
for i in range(n):
outcomesDict[choice(coinOptions)] += 1
if (outcomesDict['heads'] > outcomesDict['tails']):
return outcomesDict['heads']
else:
return outcomesDict['tails']
print simulate_fair_coin_flips_two_sided(250)
139
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
count = simulate_fair_coin_flips(n)
return max(count, abs(n - count))
print simulate_fair_coin_flips_two_sided(250)
125
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads= sum(np.random.randint(2, size=n))
return max(heads, n-heads)
print simulate_fair_coin_flips_two_sided(250)
136
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
sample = [choice('HT') for _ in range(n)]
hist = thinkstats2.Hist(sample)
return hist['H'], hist['T']
print simulate_fair_coin_flips_two_sided(250)
(124, 126)
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
res = {"heads":0, "tails":0}
for i in range(n):
if choice([0,1]) == 0:
res["heads"]+=1
else:
res["tails"]+=1
return max(res.values())
print simulate_fair_coin_flips_two_sided(250)
132
def simulate_fair_coin_flips_two_sided(n):
""" Return the number of heads or tails, whichever is larger,
that occur in n flips of a fair coin p(heads) = 0.5 """
heads = sum([choice((0,1)) for i in range(n)])
return heads if heads > (n-heads) else (n-heads)
print simulate_fair_coin_flips_two_sided(250)
130
Using the function simulate_fair_coin_flips_two_sided
, create and display a CDF of the number of times the most common outcome, heads or tails, appears based on 1000 random trials.
flipDataTwo = []
for i in range(1000):
flipDataTwo.append(simulate_fair_coin_flips_two_sided(250))
cdf2 = thinkstats2.Cdf(flipDataTwo, label='flipdata')
thinkplot.Cdf(cdf2)
{'xscale': 'linear', 'yscale': 'linear'}
headsOrTailsAppears = [];
for i in range(1000):
coinFlipResult = simulate_fair_coin_flips_two_sided(250)
headsOrTailsAppears.append(coinFlipResult)
mostFrequentResultCdf = thinkstats2.Cdf(headsOrTailsAppears)
iters = 1000
flips = 250
totals = [simulate_fair_coin_flips_two_sided(flips) for _ in range(iters)]
two_side_cdf = thinkstats2.Cdf(totals)
thinkplot.Cdf(two_side_cdf)
thinkplot.show()
<matplotlib.figure.Figure at 0x7f7c6c2fc990>
htlist = []
for _ in itertools.repeat(None, 1000):
htlist.append(simulate_fair_coin_flips_two_sided(250))
htcdf = thinkstats2.Cdf(htlist)
thinkplot.Cdf(htcdf)
{'xscale': 'linear', 'yscale': 'linear'}
two_flips_1000 = [simulate_fair_coin_flips_two_sided(250) for i in range(1000)]
two_cdf = thinkstats2.Cdf(two_flips_1000)
thinkplot.Cdf(two_cdf)
thinkplot.Show()
/home/yuzhong/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots. warnings.warn("No labelled objects found. "
<matplotlib.figure.Figure at 0x7fae273ad4d0>
top_res = []
for i in xrange(1000):
top_res.append(simulate_fair_coin_flips_two_sided(250))
cdf = thinkstats2.Cdf(top_res)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
coin_flips = [simulate_fair_coin_flips_two_sided(250) for i in range(1000)]
cdf = thinkstats2.Cdf(coin_flips)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
flips = [simulate_fair_coin_flips_two_sided(250) for _ in range(1000)]
cdf1 = thinkstats2.Cdf(flips, label="flips")
thinkplot.Cdf(cdf1)
{'xscale': 'linear', 'yscale': 'linear'}
holder = []
for i in range(1000):
holder += [simulate_fair_coin_flips_two_sided(250)]
cdf_holder = thinkstats2.Cdf(holder)
thinkplot.Cdf(cdf_holder)
thinkplot.Show(title='CDF of Coin Flip Two Sided Times',
xlabel='Number Greater when Flipped',
ylabel='CDF')
<matplotlib.figure.Figure at 0x7f31223387d0>
results = [simulate_fair_coin_flips_two_sided(250) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
title='Coin Flips',
xlabel='Number of Heads',
ylabel='CDF'
)
toss_cts = []
for i in range(1000):
toss_cts.append(simulate_fair_coin_flips_two_sided(250))
cdf = thinkstats2.Cdf(toss_cts, label='Most Common Outcome Counts')
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='heads', ylabel='CDF')
<matplotlib.figure.Figure at 0x7f5251a419d0>
results = [simulate_fair_coin_flips_two_sided(250) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
title='Coin Flips',
xlabel='Number of Most Common Outcome',
ylabel='CDF'
)
ts_flips = [simulate_fair_coin_flips_two_sided(250) for i in xrange(1000)]
cdf_ts_flips = thinkstats2.Cdf(ts_flips)
thinkplot.Cdf(cdf_ts_flips)
{'xscale': 'linear', 'yscale': 'linear'}
sample = []
for n in range(1000):
sample.append(simulate_fair_coin_flips_two_sided(250))
sample = thinkstats2.Cdf(sample)
thinkplot.Cdf(sample)
{'xscale': 'linear', 'yscale': 'linear'}
total = []
for i in range(0,1000):
total.append(simulate_fair_coin_flips_two_sided(240))
cdf = thinkstats2.Cdf(total)
thinkplot.Cdf(cdf)
# your implementation here (imports included for convenience)
{'xscale': 'linear', 'yscale': 'linear'}
res = []
for _ in range(1000):
res.append(simulate_fair_coin_flips_two_sided(250))
cdf = thinkstats2.Cdf(res)
thinkplot.Cdf(cdf)
thinkplot.show(xlabel='No of Most Common Outcome in 250 coin flips', ylabel='CDF')
<matplotlib.figure.Figure at 0x1117b4390>
I'm not sure if you're asking me to pick either heads or tails, whichever appears in greater number, and then make that CDF or if you want me to compare the CDF of both. So I'm going to do both things
flipResults = []
for i in range(1000):
# The instructions say 240, but everything else says 250, so I'm going with 250
flipResults.append(simulate_fair_coin_flips_two_sided(250))
flipCdf = thinkstats2.Cdf(flipResults, label = 'Coin Flips')
thinkplot.Cdf(flipCdf)
thinkplot.Show(xlabel = 'Probability of Heads', ylabel='CDF', title='CDF of Two-Sided Coin Flips')
<matplotlib.figure.Figure at 0x1120abb10>
trials = [simulate_fair_coin_flips_two_sided(250) for i in range(1000)]
cdf = thinkstats2.Cdf(trials)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
resultsList = []
for i in range(10000):
resultsList.append(simulate_fair_coin_flips_two_sided(250))
cdf = thinkstats2.Cdf(resultsList)
thinkplot.Cdf(cdf)
thinkplot.Show()
<matplotlib.figure.Figure at 0x7f23c4949610>
results = [simulate_fair_coin_flips_two_sided(250) for _ in range(1000)]
head, tail = zip(*twoSidedResults)
cdfTwoHeads = thinkstats2.Cdf(head, label = 'heads')
cdfTwoTails = thinkstats2.Cdf(tail, label = 'tails')
thinkplot.PrePlot(2)
thinkplot.Cdfs([cdfTwoHeads, cdfTwoTails])
thinkplot.Config(title ='Number of occurences in a coin toss')
thinkplot.Show(xlabel = 'Number of each possible toss', ylabel ='CDF')
<matplotlib.figure.Figure at 0x108f0e710>
counts = []
for i in range(1000):
counts.append(simulate_fair_coin_flips_two_sided(250))
cdf = thinkstats2.Cdf(counts)
thinkplot.Cdf(cdf, label='counts')
thinkplot.Show(loc='lower right')
<matplotlib.figure.Figure at 0x7f9a0ee1b050>
def coin_flips_trials_two_sided(n, m):
head_num_trials = []
for i in range(n):
head_num_trials.append(simulate_fair_coin_flips_two_sided(m))
hist = thinkstats2.Hist(head_num_trials)
thinkplot.Hist(hist)
return head_num_trials
trials_two_sided = coin_flips_trials(1000, 250)
Use the CDF to compute a two-tailed (or two-sided) p-value for the observed data (140 heads out of 250 flips).
percentile = cdf2.PercentileRank(140)
print "lower"
print percentile
print "higher"
print 100 - percentile
lower 94.9 higher 5.1
pvalue_of_equal_to_or_less_than_2tailed = mostFrequentResultCdf.PercentileRank(140)
print "Percentile rank for percentage of data that is equal to or less than for a two tailed test", pvalue_of_equal_to_or_less_than
pvalue_of_equal_to_or_greater_than_2tailed = 100 - pvalue_of_equal_to_or_less_than
print "Pvalue for percentage of data that is equal to or greater than for a two tailed test", pvalue_of_equal_to_or_greater_than
Percentile rank for percentage of data that is equal to or less than for a two tailed test 97.1 Pvalue for percentage of data that is equal to or greater than for a two tailed test 2.9
pvalue_two_side = 100 - two_side_cdf.PercentileRank(observed - 1)
print 'pvalue two sided: %.2f' % pvalue_two_side
pvalue two sided: 5.40
p2 = 1 - 0.01*(stats.percentileofscore(np.array(htlist), 140))
print 'p value is', p2 ,'(two tailed)'
p value is 0.052 (two tailed)
two_rank = two_cdf.PercentileRank(140)
two_p_value = 100 - two_rank
print "two sided p-value for 140 heads out of 250 coins: ", two_p_value
two sided p-value for 140 heads out of 250 coins: 5.4
print "p-value:", str(100-cdf.PercentileRank(140))+"%"
p-value: 4.4%
print "P-value of data that 140/250 flips are heads"
print str(100 - cdf.PercentileRank(140)) + "%"
P-value of data that 140/250 flips are heads 4.1%
pvalue = 100 - cdf1.PercentileRank(140)
print pvalue
5.4
p_val = 100 - cdf_holder.PercentileRank(139)
print p_val, '%'
7.5 %
1 - cdf[139]
0.062000000000000055
pvalue = 100 - cdf.PercentileRank(139)
print "Pvalue: ", pvalue, "%"
Pvalue: 6.2 %
ts_p_value = 100 - cdf_ts_flips.PercentileRank(139)
print 'Two-sided p_value: ', ts_p_value, '%'
Two-sided p_value: 6.3 %
p_val = 100 - sample.PercentileRank(139)
print p_val
6.7
p_value = (100 - cdf.PercentileRank(139))/100
print 'P-value: ', p_value
P-value: 0.009
print "p-value:", float(format(100 - cdf.PercentileRank(139), '.2f'))
p-value: 5.8
percRank = cdfTwoHeads.PercentileRank(139)
pVal = 1 - percRank/100
print "P-value using CDF: ",pVal
count = sum(1.0 for x in head if x>= 140.0)
print "P-value using the results of 1000 random trials directly: ",count/1000
P-value using CDF: 0.029 P-value using the results of 1000 random trials directly: 0.029
percentileRank = flipCdf.PercentileRank(140)
pValue = 1 - float(percentileRank)/100
print "The p-value is ", pValue
The p-value is 0.038
print "Two-sided p-value:"
print str(100 - cdf.PercentileRank(140)) + "%"
Two-sided p-value: 4.7%
1-cdf[139]
0.0
print "p-value:", 1 - cdf.PercentileRank(139)/100
# 139 because we want to include 140 in our counts
p-value: 0.071
trials_two_sidedCDF = thinkstats2.Cdf(trials)
print str(100.0 - trials_two_sidedCDF.PercentileRank(140)) + "%"
2.0%
This approach (via simulations of the null-hypothesis) to computing p-values has its limitations. For instance, suppose you observed 180 heads in 250 flips. If you used your CDF from above to answer this question, what would go wrong? What would you need to do in order to get a sensible estimate of this p-value?
What went wrong when I tried to get the p-value of 180 was that it was higher than or equal to all other entries. I would likely need to run many more trials in order to widen the birth of possibilities my model can account for. As it stands, some outcomes are so unlikely that they are never reached with just 1000 trials.
pvalue = 100 - cdf.PercentileRank(179)
print "Pvalue: ", pvalue, "%"
Pvalue: 0.0 %
There were no coin flips that resulted in 180 heads. (Odds are that) We'd have to run orders of magnitude more trials until we actually generated a trial that resulted in 180 heads.
In the two-tailed approach, the coin could be biased toward either heads or tails. The data used to make the CDF could have been all cases where the coin came up tails more often, so comparing an observation of 180 heads to this would just be plain wrong. It seems like the one-tailed approach would be better for this case where you're specifically testing if the coin is baised in one direction.
Well, the data would be an outlier in the set, thus leading to a strange representation. In order to get a better sense of the p-value, it may be better to do a single-sided p-value. Further, it may be better to avoid using a p-value altogether, given that for certain values it does not necessarily lend useful information about the coin.
for element in headsOrTailsAppears:
if element > 179:
print element
Using the same cdf, if we compute the percentile rank of 180, we get 100% which means that our calculation of the p-value would be 0. The reason is because in our 1000 trials, we don't have any trials where the most common outcome exceeds 155. You would need to increase the number of trials in order to get a percentile rank of 180 that might not be 100% because you need trials where you actually flipped heads 180 times out of the 250.
p3 = 1 - 0.01*(stats.percentileofscore(np.array(htlist), 180))
print 'p value is', p3 ,'(two tailed)'
p value is 0.0 (two tailed)
The CDF from above doesn't actually include 180 as a viable option, so it would theoretically have a percentile rank over 100% and a negative p-value. I would need to get either a cdf that represents 180 or a conditional that cuts off at 100% to get a p-value representative of 0 to show the many heads are plausibly explained by more than chance.
The odds of getting 180 heads out of 250 flips are far lower than 1 in 1000, so the p-value of this would be calculated as exactly 0 percent. In order to get a more sensible estimate of the p-value, the simulation would have to be run way more than 1000 times, or a continuous mathematical model, like a Gaussian, can be used.
It would likely say the odds were 0 which is wrong (well for all intents and purposes it's fine). You would need to increase the number of iterations sufficiently so that it is likely at least some of them had 180 or more heads. This is problematic since that would require in excess of ~10^13 trials.
180/250 is too unlikely to occur in 1000 tests, we can see from the cdf we did above that 180 will be a 0%. To get a sensible estimate the size would need to dramatically increase.
You would get a p-value of 0.0, because within our 1000 trials, we never saw 180 heads, so the CDF would calculate its p-value as there is nothing at or above 180, so the p-value is 0. A p-value of 0 intuitive suggests that this result is impossible. You'd need more trials to actually see its real probability
In this two-tailed approach, the coin could theoretically be biased in either direction. The flips used to create it could be all heads, and you might assume that a tails value of 180 is normal, when in reality it might not be normal at all.
180 heads in 250 flips does not occur in the random trials. Many many more trials would have to be run and even then fitting a curve to the data to account for values that did not show up in the trials would be the best fit. I know seaborn does a KDE fit to histograms, so that would probably be one way of getting a sensible estimate of the value.
Write a function that takes as input a data frame and computes the absolute value of the difference in mean age between men and women.
import numpy as np
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
man = []
woman = []
# print data.Age
# newData = pd.Series([data["Age"], data["Sex"]])
# print newData["Age"]
# data.loc[data["Sex"] == "male", "Sex"] = data["Age"]
# data.loc[data["Sex"] == "female", "Sex"] = data["Age"]
# print data["Sex"]
# print data.iterrows()
for index, row in data.iterrows():
if row.Sex == "male":
man.append(row.Age)
else:
woman.append(row.Age)
# print man
# for x in range(len(data.Age)):
# # print x
# if data.Sex[x] == "male":
# man.append(data.Age[x])
# else:
# woman.append(data.Age[x])
return abs(np.mean(man)-np.mean(woman))
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
men = data[data.Sex == 'male']
women = data[data.Sex == 'female']
return abs(men.Age.mean() - women.Age.mean())
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
men = data[data.Sex == 'male']
women = data[data.Sex == 'female']
mean_men_age = men.Age.mean()
mean_women_age = women.Age.mean()
return abs(mean_men_age - mean_women_age)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
agem = []
agef = []
for i in data.index:
if data.Sex[i] == 'male':
agem.append(data.Age[i])
elif data.Sex[i] == 'female':
agef.append(data.Age[i])
else:
print 'unknown Sex'
continue
diff = abs(np.mean(agem)-np.mean(agef))
return diff
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
gender = data.groupby('Sex')
a, b = gender.Age.mean()
return abs(a-b)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
return abs(data[data["Sex"]=='male']["Age"].mean()-data[data["Sex"]=='female']["Age"].mean())
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
m_age = data[data.Sex == "male"].Age.mean()
f_age = data[data.Sex == "female"].Age.mean()
return abs(m_age - f_age)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
import numpy as np
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
men = []
women = []
for index, row in data.iterrows():
if row.Sex == "male":
men.append(row.Age)
else:
women.append(row.Age)
return abs(np.mean(women)-np.mean(men))
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
gender_groupby = data.groupby('Sex')
a, b = gender_groupby.Age.mean()
return abs(a-b)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
return abs(data[data.Sex == 'male'].Age.mean() - data[data.Sex == 'female'].Age.mean())
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
import math
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
male_age= data[data.Sex == 'male'].Age.mean()
female_age= data[data.Sex == 'female'].Age.mean()
print "male age av: ", male_age
print "female age av: ", female_age
return abs(male_age-female_age)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
male age av: 30.7266445916 female age av: 27.9157088123 observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
grouped = data.groupby('Sex')
female, male = grouped.Age.mean()
return abs(female-male)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
womean = data.Age[data['Sex'] == 'female'].mean()
mean = data.Age[data['Sex'] == 'male'].mean()
return abs(mean-womean)
observed_age_diff = compute_age_diff(data_titanic)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
men = data[data.Sex == "male"]
women = data[data.Sex == "female"]
return men.Age.mean() - women.Age.mean()
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
men_only = data[data.Sex == 'male']
women_only = data[data.Sex == 'female']
return abs(men_only.Age.mean() - women_only.Age.mean())
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
fem = data[data.Sex == 'female']
mal = data[data.Sex == 'male']
ageDiff = abs(fem.Age.mean()-mal.Age.mean())
return ageDiff
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
return (
abs(data[data.Sex == 'male']['Age'].mean() -
data[data.Sex == 'female']['Age'].mean()))
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
return abs(data[data.Sex == "male"].Age.mean() - data[data.Sex == "female"].Age.mean())
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
return np.mean(data[data['Sex'] == 'male'].Age) - np.mean(data[data['Sex'] == 'female'].Age)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
male = data[data.Sex == 'male']
female = data[data.Sex == 'female']
ageDiff = abs(female.Age.mean() - male.Age.mean())
return ageDiff
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
men = data.Age[data.Sex=="male"]
women = data.Age[data.Sex=="female"]
return abs(women.mean() - men.mean())
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
def compute_age_diff(data):
""" Compute the absolute value of the difference in mean age
between men and women on the titanic """
male = data['Age'][data['Sex'] == 'male'].mean()
female = data['Age'][data['Sex'] == 'female'].mean()
return abs(male-female)
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff
observed age difference 2.81093577935
Write a function called shuffle_ages
that returns a copy of the original data frame but where the Ages have been randomly permuted.
Hint: there are lots of ways to do this, but numpy.random.permutation
seems to be an especially succint choice. Make sure to try this function out on a small, hand-made Pandas series to get the idea of how it works.
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
data2 = data
data2["Age"] = permutation(data["Age"])
# print data["random"]
return data2
compute_age_diff(shuffle_ages(data))
1.0720918017812302
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
permutedAges = permutation(data.Age)
data_new = data
# print data_new.head()
data_new["Age"] = permutedAges
return data_new
compute_age_diff(shuffle_ages(data))
0.15471898708482357
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
df = data.copy()
df.Age = np.random.permutation(df.Age)
return df
compute_age_diff(shuffle_ages(data))
2.2638128948770664
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
age = []
for index in data.index:
age.append(data.Age[index])
permuted_age = np.random.permutation(age).tolist()
data.Age = permuted_age
return data
compute_age_diff(shuffle_ages(data))
2.0298046230747779
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
data_copy = data.copy()
data_copy.Age = permutation(data_copy.Age)
return data_copy
compute_age_diff(shuffle_ages(data))
0.40177133287660993
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
res = data.copy()
res["Age"] = permutation(res["Age"])
return res
compute_age_diff(shuffle_ages(data))
0.018900053284614415
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
data2 = data
data2.Age = permutation(data.Age)
return data2
compute_age_diff(shuffle_ages(data))
1.2728859962954502
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
data2 = data.copy()
shuffle_age = permutation(data.Age.tolist())
data2.Age = shuffle_age
return data2
compute_age_diff(shuffle_ages(data))
0.62823095074978852
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
shuffled = data.copy()
shuffled_ages = permutation(shuffled.Age)
shuffled.Age = shuffled_ages
return shuffled
compute_age_diff(shuffle_ages(data))
0.08569409555707708
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
new_data = data.copy()
new_data.Age = permutation(new_data.Age.values)
return new_data
compute_age_diff(shuffle_ages(data))
0.31173098881023265
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
shuffle = data.copy()
new_ages = permutation(data.Age.tolist())
shuffle.Age = new_ages
return shuffle
compute_age_diff(shuffle_ages(data))
0.65757656491842553
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
newframe = data.copy()
newframe.Age = permutation(newframe.Age)
return newframe
random_age_diff = compute_age_diff(shuffle_ages(data_titanic))
print random_age_diff
0.371210829464
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
newdata = data.copy()
newage = permutation(data.Age.tolist())
newdata.Age = newage
return newdata
compute_age_diff(shuffle_ages(data))
0.9292677171348096
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
res = data
res.Age = permutation(res.Age).astype(int)
return res
compute_age_diff(shuffle_ages(data))
2.649336479663038
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
shuffled_data = data.copy()
shuffled_data.Age = permutation(shuffled_data.Age)
return shuffled_data
compute_age_diff(shuffle_ages(data))
1.9901857349470973
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
shuffledAges = data.copy()
shuffledAges['Age'] = permutation(shuffledAges['Age'])
return shuffledAges
compute_age_diff(shuffle_ages(data))
0.12766105909517478
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
df = data.copy()
df = df.apply(permutation)
return df
compute_age_diff(shuffle_ages(data))
0.6840909898251759
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
dataCopy = data.copy(deep=True)
dataCopy.Age = (np.random.permutation(data.Age))
return dataCopy
compute_age_diff(shuffle_ages(data))
0.2272429017279407
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
new_data = data.copy()
new_data.Age = permutation(data.Age.values)
return new_data
compute_age_diff(shuffle_ages(data))
0.70081525462434513
from numpy.random import permutation
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
ages_shuff = np.random.permutation(data.Age)
data_copy = data
data_copy.Age = ages_shuff
return data_copy
compute_age_diff(shuffle_ages(data))
0.5950168734617236
from numpy.random import permutation
import numpy as np
def shuffle_ages(data):
""" Return a new dataframe (don't modify the original) where
the values in the Age column have been randomly permuted. """
d = data
ages = data['Age'].values
np.random.shuffle(ages)
d['Age'] = ages
return d
compute_age_diff(shuffle_ages(data))
0.47985105681154749
Using 1000 random simulations, compute the p-value for the hypothesis that the mean ages of men and women were different (you may wish to use Cdf as in the previous section).
randAge = []
for i in range(1000):
randAge.append(compute_age_diff(shuffle_ages(data)))
cdf3 = thinkstats2.Cdf(randAge, label='Random_Age')
thinkplot.Cdf(cdf3)
print cdf3.PercentileRank(1)
#Just in case this makes no sense, I did this because in the event that randomly changing around ages doesn't make
#things come out to be 0 on average, then there must be some bias on either the male or female side.
63.7
diffAges = [];
for i in range(1000):
meandiff = compute_age_diff(shuffle_ages(data))
diffAges.append(meandiff)
meanDiffAgesCdf = thinkstats2.Cdf(diffAges)
def shuffled_age_diff(data):
return compute_age_diff(shuffle_ages(data))
iters = 1000
age_diffs = [shuffled_age_diff(data) for _ in range(iters)]
age_cdf = thinkstats2.Cdf(age_diffs)
x = observed_age_diff
y = age_cdf.PercentileRank(x) / 100
thinkplot.Cdf(age_cdf)
plt.axvline(x, 0, y, color='red')
plt.axhline(y, 0, x/4.0, color='red')
plt.plot(x, y, '.', color='green', markersize=30)
plt.xlim((0, 4.0))
thinkplot.show()
<matplotlib.figure.Figure at 0x7fcbbd9d5a50>
agelist = []
for _ in itertools.repeat(None, 1000):
agelist.append(compute_age_diff(shuffle_ages(data)))
#plotting cdf
agecdf = thinkstats2.Cdf(agelist)
thinkplot.Cdf(agecdf)
#Computing p-value for pbserved data 2 years of difference
page = 1 - 0.01*(stats.percentileofscore(np.array(agelist), 2))
print 'p value is', page
p value is 0.074
age_diffs = [compute_age_diff(shuffle_ages(data)) for i in range(1000)]
age_diff_cdf = thinkstats2.Cdf(age_diffs)
thinkplot.Cdf(age_diff_cdf)
thinkplot.Show()
<matplotlib.figure.Figure at 0x7fae27647b50>
diff_arr = []
for i in xrange(1000):
diff_arr.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(diff_arr)
print "p-value:", str(100-cdf.PercentileRank(compute_age_diff(data)))+"%"
p-value: 0.8%
trials = [compute_age_diff(shuffle_ages(data)) for _ in range(1000)]
cdf = thinkstats2.Cdf(trials)
thinkplot.Cdf(cdf)
print str(100 - cdf.PercentileRank(2.81093577935)) + "%"
0.9%
ages = [compute_age_diff(shuffle_ages(data)) for _ in range(1000)]
cdf2 = thinkstats2.Cdf(ages, label="age diff")
thinkplot.Cdf(cdf2)
{'xscale': 'linear', 'yscale': 'linear'}
holder = []
for i in range(1000):
holder += [compute_age_diff(shuffle_ages(data))]
cdf_holder = thinkstats2.Cdf(holder)
thinkplot.Cdf(cdf_holder)
thinkplot.Show(title='CDF of Age difference Average for gender',
xlabel='Absolute Age Difference',
ylabel='CDF')
print 'pval:', 100 - cdf_holder.PercentileRank(observed_age_diff), '%'
pval: 1.5 %
<matplotlib.figure.Figure at 0x7f312299fa50>
results = [compute_age_diff(shuffle_ages(data)) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
title='Mean Ages',
xlabel='Mean Age Difference Between "Males" and "Females"',
ylabel='CDF'
)
1-cdf[observed_age_diff]
0.014000000000000012
mean_cts = []
for i in range(1000):
mean_cts.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(mean_cts, label='Abs of Age CDF')
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='age dif', ylabel='CDF')
pvalue = 100 - cdf.PercentileRank(compute_age_diff(data))
print "Pvalue: ", pvalue, "%"
Pvalue: 0.8 %
<matplotlib.figure.Figure at 0x7f525160ddd0>
results = [compute_age_diff(shuffle_ages(data)) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
title='Mean Ages',
xlabel='Mean Age Difference Between "Males" and "Females"',
ylabel='CDF'
)
# This isn't __quite__ right since it doesn't include the
# values that are exactly the same as the observed difference,
# but the impact of this is pretty negligible since this is
# more continuous than the coin flips example
1 - cdf[observed_age_diff]
0.009000000000000008
simulation = [compute_age_diff(shuffle_ages(data)) for i in xrange(1000)]
cdf_sim = thinkstats2.Cdf(simulation)
thinkplot.Cdf(cdf_sim)
print 'P-value of male/female mean age diff: ', 100 - cdf_sim.PercentileRank(observed_age_diff), '%'
P-value of male/female mean age diff: 1.2 %
sample_raw = []
sample_perm = []
for n in range(1000):
sample_raw.append(compute_age_diff(data_titanic))
sample_perm.append(compute_age_diff(shuffle_ages(data_titanic)))
sample_raw = thinkstats2.Cdf(sample_raw)
sample_perm = thinkstats2.Cdf(sample_perm)
thinkplot.Cdfs([sample_raw, sample_perm])
p_val = 100 - sample_perm.PercentileRank(compute_age_diff(data_titanic))
print p_val
1.3
total = []
for i in range(0,1000):
total.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(total)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
res = []
for _ in range(1000):
res.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(res)
print "p-value:", float(format(100 - cdf.PercentileRank(observed_age_diff), '.2f'))
p-value: 0.6
import thinkstats2
ageDiffs= [compute_age_diff(shuffle_ages(data)) for i in range(1000)]
ageCdf = thinkstats2.Cdf(ageDiffs)
thinkplot.Cdf(ageCdf)
#If I'm looking for mean ages that are different, I want anything that's not 0
percentileRank = ageCdf.PercentileRank(observed_age_diff)
p_Val = 1 - percentileRank/100
print "P-value using CDF: ",p_Val*100, "%"
P-value using CDF: 1.1 %
ageDiffs = []
for i in range(1000):
ageDiffs.append(compute_age_diff(shuffle_ages(data)))
ageCdf = thinkstats2.Cdf(ageDiffs)
print "p-value of observed age difference: ", (1 - ageCdf[compute_age_diff(data)])
p-value of observed age difference: 0.013
trials = [compute_age_diff(shuffle_ages(data)) for _ in range(1000)]
cdf = thinkstats2.Cdf(trials)
thinkplot.Cdf(cdf)
{'xscale': 'linear', 'yscale': 'linear'}
resultsList = []
for i in range(1000):
resultsList.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(resultsList)
1-cdf[observed_age_diff]
0.0050000000000000044
# not sure whether 1 - cdf[value] vs the implementation at the start (using percentile rank)
# is the right option
diffs = []
for i in range(1000):
diffs.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(diffs)
thinkplot.Cdf(cdf, label='age diffs')
thinkplot.Show(loc='lower right')
bigger = 0
for x in diffs:
if x >= observed_age_diff:
bigger +=1
print "p-value:", bigger/1000.0
p-value: 0.016
<matplotlib.figure.Figure at 0x7f9a0e9938d0>
def age_test(n, data):
age_trials = []
for i in range(n):
age_trials.append(compute_age_diff(shuffle_ages(data)))
hist = thinkstats2.Hist(age_trials)
thinkplot.Hist(hist)
return age_trials
titanic_age_trials = age_test(1000, data)
Ignoring passengers with missing ages:
Disclaimer: (1) is a bit of a trick question (sorry!), but I included it to encourage being precise about the definition of the null hypothesis and eactly which population it refers to.
The average was different for all the people, though we can't tell for certain if it is males or females because there is no connection after the age data has been shuffled.
What I can draw from this is that there is a particular concentration of ages that is higher or lower for males or females. We know that for a value of 1 year in difference, that is a larger gap than 63% of occupants saw, and thereby 37% less than others, but not whether it corresponds to being male or female. We would likely want to leave the ages corresponding with the sex in order to determine that.
males = data[data.Sex == 'male']
print "average male age", males.Age.mean()
females = data[data.Sex == 'female']
print "average female age", females.Age.mean()
print "diff", abs(males.Age.mean() - females.Age.mean())
average male age 29.4970640177 average female age 30.0498084291 diff 0.552744411459
males = data[data.Sex == 'male']
females = data[data.Sex == 'female']
avg_male_age = males.Age.mean()
avg_female_age = females.Age.mean()
print 'Average male age: %d' % avg_male_age
print 'Average female age: %d' % avg_female_age
Average male age: 30 Average female age: 27
The p value is significant (>0.05) when the observed age is 2 years. We can reject the null hypothesis and say that the average male and female age are different.
When the observed age is 1 year, the p value is insignificant (<0.01) and we cannot reject the null hypothesis and say that the average age of male and female passengers are different.
The average age was different and the p value of 2.0% we calculated indicates that the 3 year difference we calculated was probably not due to random chance b/c it is so small and most of the values will actually have a differences as we can see from the graph also.
Answers.
Yes, the average age of male verses female were different as proven when finding observed_age_diff, whereas in the null hypothesis has age difference zero.
The p-value is below 5% which implies that it is 'statistically significant'. Here, the chances of the mean age differing that much by chance given the null hypothesis is true is very very low.
1 The average age is definitily different.
2 This difference is statistically significant. A random distribution of ages of Titanic passengers is very unlikely to have a observed difference in means for males and females.
Yes, and it appears that the difference may be significant given that the p-value is approximately 0.013. Of course, there are people who did not report their age, and there is a whole testing data set that is not included - thus, it is difficult to actually form a true conclusion.
The average age of male versus female passengers on the Titanic was different. Since the p-value we calculated was around 1%, we know that the difference in average age between genders is statistically significant because we tested our null hypothesis by shuffling the ages and the percentage of the shuffled data that is equal to or greater than our observed value is under 1%, which means that this effect is not due to chance. So it follows that the average ages of men vs women on the titanic was actually different, not just due to chance.
yes the average age is different (medain age difference is about 2.8)
The mean difference in age is plausibly caused by more than chance. A random sample of one age group cannot definitely be male or female. The permutation results also suggest that it is unlikely that a random sample of titanic passengers will exhibt the same age difference.
Yes, the average of male versus female passengers on the titanic was different. (The difference in mean ages was about 2.8 years.) This isn't different by that many years in the grand scheme of things, though.
I think that this p-value means that, if the distributions of the ages of male and female passengers weren't different, it would be very unlikely for us to get this particular distribution of ages, which means that there is a (statistically sigificant?) difference between ages of men and women on the titanic.
The average age of male versus female passengers was different, and the low p-value indicates that the age difference of about 3 years was probably not due to random chance.