ThinkStats 9.1 - 9.3 Companion¶

This notebook will allow you to practice some of the concepts from ThinkStats2 Chapter 9.

Companion to 9.1 - 9.2¶

First, we'll start with the question that Allen poses at the beginning of the chapter: "Suppose we toss a coin 250 times and we see 140 heads. Is this strong evidence that the coin is biased?"

As Allen says, classical hypothesis testing is similar to a proof by contradiction. First, we assume that the thing we are trying to show is false (that the coin is biased). Second, we show that this leads to an observed event being excedingly improbable (seeing 140 heads out of 250 tosses). Finally, we can conclude that our assumption (that the coin is not biased) is unlikely to be true.

Write a function to simulate n random coin flips of a fair coin (p(heads) = 0.5). Your function should return the number of heads that occur in those n coin clips.

In [10]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    counter = 0
    for i in range(n):
        choices = choice(["heads", "tails"])
        
        if choices == "heads":
            counter += 1
    return counter

flipData = []
for i in range(1000):
    flipData.append(simulate_fair_coin_flips(250))
print flipData

[142, 115, 120, 127, 140, 129, 122, 119, 119, 134, 141, 132, 120, 109, 128, 129, 138, 126, 114, 106, 136, 132, 124, 112, 124, 138, 124, 126, 125, 117, 116, 134, 114, 117, 113, 132, 117, 129, 114, 124, 117, 116, 125, 119, 140, 129, 145, 140, 132, 128, 132, 128, 120, 138, 127, 131, 131, 113, 120, 122, 117, 118, 122, 112, 123, 116, 131, 125, 124, 121, 139, 125, 124, 110, 119, 123, 125, 124, 135, 122, 139, 122, 131, 120, 131, 117, 125, 121, 120, 126, 128, 126, 114, 116, 135, 121, 123, 126, 136, 122, 123, 117, 130, 120, 135, 132, 118, 136, 129, 120, 119, 130, 119, 132, 111, 134, 123, 128, 110, 134, 124, 111, 124, 140, 130, 118, 132, 127, 112, 137, 110, 132, 123, 119, 113, 119, 126, 124, 116, 119, 131, 130, 122, 133, 112, 127, 125, 111, 123, 131, 121, 127, 138, 126, 122, 122, 108, 124, 134, 114, 127, 119, 140, 128, 122, 136, 112, 136, 116, 129, 122, 129, 135, 123, 130, 122, 123, 112, 141, 123, 131, 117, 130, 119, 134, 132, 120, 124, 119, 131, 124, 138, 131, 124, 113, 121, 122, 114, 134, 126, 137, 126, 121, 129, 132, 111, 126, 144, 111, 140, 123, 117, 131, 124, 109, 129, 127, 121, 123, 126, 126, 124, 126, 127, 121, 139, 133, 130, 125, 128, 127, 125, 117, 129, 127, 118, 134, 118, 127, 130, 113, 124, 129, 123, 125, 129, 129, 116, 127, 138, 119, 123, 122, 122, 105, 122, 132, 126, 125, 128, 109, 125, 125, 117, 132, 117, 124, 132, 122, 130, 122, 122, 133, 133, 121, 127, 128, 121, 120, 144, 113, 123, 136, 125, 120, 120, 134, 120, 116, 133, 113, 127, 140, 131, 116, 115, 120, 118, 124, 125, 124, 116, 129, 113, 123, 134, 130, 127, 116, 109, 128, 108, 118, 124, 131, 124, 112, 115, 130, 121, 133, 134, 117, 119, 131, 117, 125, 122, 121, 135, 126, 119, 119, 119, 118, 118, 118, 114, 122, 124, 118, 136, 122, 131, 131, 113, 134, 122, 130, 127, 125, 120, 108, 125, 134, 133, 131, 128, 113, 126, 117, 119, 124, 128, 116, 113, 137, 117, 133, 123, 138, 119, 123, 126, 114, 116, 109, 135, 137, 124, 119, 124, 130, 113, 129, 137, 121, 117, 111, 114, 134, 132, 121, 126, 112, 111, 132, 128, 116, 121, 114, 110, 135, 133, 109, 119, 132, 121, 122, 142, 132, 105, 121, 129, 128, 132, 124, 108, 125, 127, 119, 135, 122, 129, 135, 128, 127, 115, 135, 139, 120, 150, 125, 129, 121, 132, 126, 112, 126, 128, 115, 109, 116, 125, 134, 133, 121, 121, 121, 125, 125, 127, 122, 126, 106, 124, 116, 140, 123, 128, 127, 119, 137, 129, 113, 126, 129, 125, 127, 120, 117, 115, 120, 127, 123, 130, 121, 131, 138, 129, 113, 125, 117, 129, 134, 120, 124, 137, 127, 128, 119, 133, 131, 129, 113, 130, 137, 121, 128, 124, 121, 132, 129, 128, 125, 125, 124, 130, 108, 115, 132, 129, 126, 121, 138, 131, 122, 123, 124, 125, 144, 127, 119, 125, 117, 136, 129, 121, 124, 128, 119, 128, 117, 137, 128, 130, 141, 115, 125, 144, 119, 115, 132, 135, 113, 119, 119, 126, 112, 128, 129, 123, 121, 124, 118, 122, 109, 129, 133, 125, 117, 124, 123, 126, 113, 125, 132, 122, 122, 124, 135, 139, 119, 130, 130, 132, 133, 135, 118, 120, 109, 119, 115, 113, 122, 120, 122, 138, 129, 124, 121, 131, 125, 117, 119, 131, 121, 131, 117, 132, 117, 114, 121, 127, 121, 116, 126, 149, 118, 130, 119, 123, 113, 121, 129, 121, 126, 122, 132, 110, 121, 132, 145, 118, 107, 134, 143, 114, 106, 114, 125, 125, 129, 129, 124, 109, 133, 135, 120, 123, 117, 131, 114, 115, 112, 140, 127, 132, 117, 120, 125, 115, 122, 118, 129, 113, 124, 129, 118, 118, 126, 125, 123, 122, 128, 132, 116, 127, 131, 130, 134, 120, 115, 129, 126, 109, 113, 104, 118, 130, 133, 124, 124, 114, 117, 130, 135, 124, 126, 126, 124, 129, 113, 140, 125, 129, 123, 113, 126, 118, 128, 128, 124, 128, 128, 120, 127, 128, 120, 130, 126, 118, 119, 131, 126, 121, 139, 128, 127, 138, 122, 127, 133, 124, 126, 128, 131, 119, 126, 138, 122, 125, 122, 125, 123, 122, 132, 127, 124, 129, 127, 118, 126, 117, 118, 114, 119, 130, 129, 121, 125, 133, 116, 137, 121, 126, 128, 114, 110, 138, 121, 126, 131, 132, 120, 119, 107, 109, 128, 126, 126, 144, 125, 124, 116, 121, 130, 127, 123, 119, 121, 114, 127, 111, 126, 121, 127, 117, 113, 130, 140, 120, 116, 118, 124, 136, 115, 119, 124, 123, 124, 132, 129, 121, 114, 138, 124, 115, 133, 125, 116, 127, 119, 127, 117, 132, 123, 124, 132, 136, 129, 123, 122, 112, 130, 141, 122, 116, 131, 134, 128, 132, 130, 128, 112, 129, 129, 120, 140, 126, 125, 134, 123, 125, 125, 132, 134, 127, 131, 111, 124, 131, 126, 118, 127, 126, 132, 130, 126, 120, 123, 123, 124, 118, 126, 135, 121, 117, 122, 131, 132, 122, 116, 151, 112, 129, 115, 139, 131, 115, 118, 124, 124, 139, 138, 128, 127, 122, 124, 125, 123, 130, 133, 120, 126, 127, 121, 129, 112, 118, 133, 108, 128, 123, 116, 137, 130, 117, 115, 132, 131, 109, 121, 113, 124, 128, 129, 105, 136, 124, 122, 129, 125, 127, 133, 121, 112, 115, 123, 130, 129, 116, 126, 120, 137, 130, 123, 130, 107, 122, 122, 113, 119, 129, 139, 138, 131, 115, 124, 112, 116, 132, 115, 124, 134, 120, 133, 126, 114, 122, 137, 123, 139, 121, 127, 133, 131, 120, 127, 129, 109, 131, 133, 107, 126, 106, 115, 114, 129, 121, 117, 126, 123, 118, 130, 124, 140, 113, 120, 129, 131, 111, 128, 135, 119, 129, 118, 123, 122, 124]

In [9]:

from random import choice
import random

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    heads = 0
    tails = 0
    for i in range(n):
        coinFlip = random.randint(0, 1)
        if (coinFlip == 0): 
            heads +=1; 
        elif (coinFlip ==1): 
            tails +=1; 
    return (heads, tails)

print simulate_fair_coin_flips(250)

(136, 114)

In [16]:

from random import randint

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    return reduce(lambda heads, _: heads + randint(0, 1), xrange(n))

print simulate_fair_coin_flips(250)

In [40]:

from random import choice
import itertools

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    heads = 0
    for _ in itertools.repeat(None, n):
        heads += choice([0,1])
    return heads

print simulate_fair_coin_flips(250)

In [12]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    return sum([choice([0, 1]) for i in range(n)])

print simulate_fair_coin_flips(250)

In [1]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    heads = 0
    for i in xrange(n):
        if choice(['heads', 'tails'])=='heads':
            heads += 1
    return heads

print simulate_fair_coin_flips(250)

In [15]:

from random import choice
import random
import thinkstats2

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    coin_flips = [random.choice('HT') for _ in range(n)]
    hist = thinkstats2.Hist(coin_flips)
    return hist['H']

print simulate_fair_coin_flips(250)

In [13]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    heads=0
    
    for i in range(n):
        heads += choice([0,1])
    
    return heads

print simulate_fair_coin_flips(250)

In [1]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    side = [0,1]
    h = 0;
    t = 0;
    for i in range(n):
        if choice(side) == 0:
            h += 1
    return h
        
print simulate_fair_coin_flips(250)

In [5]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    return sum(choice((0,1)) for _ in xrange(n))

print simulate_fair_coin_flips(250)

In [98]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    
    heads = sum([choice([0,1]) for i in range(0,n)])
            
    return heads

print simulate_fair_coin_flips(250)

In [14]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    
    toss = [choice([0,1]) for i in xrange(n)]
    return sum(toss)

print simulate_fair_coin_flips(250)

In [3]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    sample = [choice('HT') for _ in range(n)]
    return sample.count('H')
    

print simulate_fair_coin_flips(250)

In [4]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    count = 0
    for i in range(0,n):
        if choice([0,1]) == 1:
            count += 1
    return count

print simulate_fair_coin_flips(250)

In [2]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    count = 0
    for _ in range(n):
        if choice('HT') == 'H':
            count += 1
    return count

print simulate_fair_coin_flips(250)

In [18]:

from random import choice
import thinkstats2

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    sample = [choice('HT') for _ in range(n)]
    hist =thinkstats2.Hist(sample)
    return hist['H']

print simulate_fair_coin_flips(250)

In [26]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    countHeads = 0
    #The options for whether or not the coin is heads
    isHeads = [0, 1]
    for i in range(n):
        countHeads += choice(isHeads)
        
    return countHeads
print simulate_fair_coin_flips(250)

In [18]:

from random import choice
choice([1,2,3])

Out[18]:

In [28]:

from random import choice
import numpy as np

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    return sum(np.random.randint(2, size=n))

print simulate_fair_coin_flips(250)

In [45]:

from random import choice
import thinkstats2

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    sample = [choice('HT') for _ in range(n)]
    hist = thinkstats2.Hist(sample)
    return hist['H']

print simulate_fair_coin_flips(250)

In [2]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    headcount = 0
    for i in range(n):
        if choice([0,1]) == 0:
            headcount+=1
    return headcount

print simulate_fair_coin_flips(250)

In [65]:

from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    return sum([choice((0,1)) for i in range(n)])

print simulate_fair_coin_flips(250)

Next, repeat your simulation of 240 coin flips 1000 times. Create and display a CDF of the number of times heads appears based on 1000 random trials.

In [11]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt


cdf = thinkstats2.Cdf(flipData, label='flipdata')
thinkplot.Cdf(cdf)
# your implementation here (imports included for convenience)

Out[11]:

{'xscale': 'linear', 'yscale': 'linear'}

In [24]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

# your implementation here (imports included for convenience)
headsAppears = [];
for i in range(1000): 
    coinFlipResults = simulate_fair_coin_flips(250)
    heads = coinFlipResults[0]
    headsAppears.append(heads)
headsCdf = thinkstats2.Cdf(headsAppears)

In [41]:

%matplotlib inline
import numpy as np
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

iters = 1000
flips = 250
heads = [simulate_fair_coin_flips(flips) for _ in range(iters)]

cdf = thinkstats2.Cdf(heads, label=('heads per %d flips' % flips))
thinkplot.Cdf(cdf)
thinkplot.show()

<matplotlib.figure.Figure at 0x7fcbbd448fd0>

In [86]:

# Simulated for 250 coin flips instead of 240
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt
%matplotlib inline

headlist = []
for _ in itertools.repeat(None, 1000):
    headlist.append(simulate_fair_coin_flips(250))

headcdf = thinkstats2.Cdf(headlist)
thinkplot.Cdf(headcdf)

Out[86]:

{'xscale': 'linear', 'yscale': 'linear'}

In [180]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

flips_1000 = [simulate_fair_coin_flips(250) for i in range(1000)]
cdf = thinkstats2.Cdf(flips_1000)
thinkplot.Cdf(cdf)
thinkplot.Show()

<matplotlib.figure.Figure at 0x7fae273bd1d0>

In [2]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

heads_res = []
for i in xrange(1000):
    heads_res.append(simulate_fair_coin_flips(250))
cdf = thinkstats2.Cdf(heads_res)
thinkplot.Cdf(cdf)

/home/jsutker/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Out[2]:

{'xscale': 'linear', 'yscale': 'linear'}

In [35]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

# your implementation here (imports included for convenience)
coin_flips = [simulate_fair_coin_flips(250) for i in range(1000)]
cdf = thinkstats2.Cdf(coin_flips)
thinkplot.Cdf(cdf)

Out[35]:

{'xscale': 'linear', 'yscale': 'linear'}

In [24]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

flips = [simulate_fair_coin_flips(250) for _ in range(1000)]

cdf = thinkstats2.Cdf(flips, label="flips")
thinkplot.Cdf(cdf)

Out[24]:

{'xscale': 'linear', 'yscale': 'linear'}

In [2]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

holder = []
for i in range(1000):
    holder += [simulate_fair_coin_flips(250)]
    
cdf_holder = thinkstats2.Cdf(holder)
thinkplot.Cdf(cdf_holder)

thinkplot.Show(title='CDF of Coin Flip Head Times',
              xlabel='Number of Time Heads',
              ylabel='CDF')

/home/tj/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/home/tj/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "

<matplotlib.figure.Figure at 0x7f3130be7750>

In [8]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

results = [simulate_fair_coin_flips(250) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)

thinkplot.Cdf(cdf)
thinkplot.Config(
    title='Coin Flips',
    xlabel='Number of Heads',
    ylabel='CDF'
)

In [99]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

head_cts = []

for i in range(1000):
    head_cts.append(simulate_fair_coin_flips(250))

cdf = thinkstats2.Cdf(head_cts, label='Head Counts')
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='heads', ylabel='CDF')

<matplotlib.figure.Figure at 0x7f52519a0dd0>

In [15]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

flips = [simulate_fair_coin_flips(250) for i in xrange(1000)]
cdf_flips = thinkstats2.Cdf(flips)
thinkplot.Cdf(cdf_flips)

Out[15]:

{'xscale': 'linear', 'yscale': 'linear'}

In [21]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

sample = []
for n in range(1000):
    sample.append(simulate_fair_coin_flips(250))

sample = thinkstats2.Cdf(sample)
thinkplot.Cdf(sample)

Out[21]:

{'xscale': 'linear', 'yscale': 'linear'}

In [5]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

total = []
for i in range(0,1000):
    total.append(simulate_fair_coin_flips(240))
cdf = thinkstats2.Cdf(total)
thinkplot.Cdf(cdf)
# your implementation here (imports included for convenience)

Out[5]:

{'xscale': 'linear', 'yscale': 'linear'}

In [17]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

# your implementation here (imports included for convenience)
res = []
for _ in range(1000):
    res.append(simulate_fair_coin_flips(240))

cdf = thinkstats2.Cdf(res)
thinkplot.Cdf(cdf)
thinkplot.show(xlabel='No of Heads in 240 coin flips', ylabel='CDF')

<matplotlib.figure.Figure at 0x10fde6110>

In [38]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

# your implementation here (imports included for convenience)
headsCounts= [simulate_fair_coin_flips(240) for i in range(1000)]
cdf = thinkstats2.Cdf(headsCounts)
thinkplot.Cdf(cdf)
thinkplot.Config(title ='Number of times a fair coin toss results in heads')
thinkplot.Show(xlabel = 'Coin toss resulting in heads', ylabel ='CDF')

<matplotlib.figure.Figure at 0x7f5505153690>

In [27]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

flipResults = []
for i in range(1000):
#     The instructions say 240, but everything else says 250, so I'm going with 250
    flipResults.append(simulate_fair_coin_flips(250))

flipCdf = thinkstats2.Cdf(flipResults, label = 'Coin Flips')

thinkplot.Cdf(flipCdf)
thinkplot.Show(xlabel = 'Probability of Heads', ylabel='CDF', title='CDF of Coin Flips')

<matplotlib.figure.Figure at 0x11231dd90>

In [50]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

trials = [simulate_fair_coin_flips(250) for i in range(1000)]

In [98]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

resultsList = []

for i in range(10000):
    resultsList.append(simulate_fair_coin_flips(250))
    

    
cdf = thinkstats2.Cdf(resultsList)
thinkplot.Cdf(cdf)
thinkplot.Show()

# your implementation here (imports included for convenience)

<matplotlib.figure.Figure at 0x7f23c4a9ddd0>

In [46]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

heads = [simulate_fair_coin_flips(240) for i in range(1000)]
cdf = thinkstats2.Cdf(heads)

thinkplot.Cdf(cdf)
thinkplot.Config(title ='Number of occurences of heads')
thinkplot.Show(xlabel = 'Heads coin toss', ylabel ='CDF')

<matplotlib.figure.Figure at 0x10950ff10>

In [3]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

# your implementation here (imports included for convenience)
hcs = []
for i in range(1000):
    hcs.append(simulate_fair_coin_flips(250))

cdf = thinkstats2.Cdf(hcs)
thinkplot.Cdf(cdf, label='heads count')
thinkplot.Show(loc='lower right')

<matplotlib.figure.Figure at 0x7f9a1d53db50>

In [66]:

%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

def coin_flips_trials(n, m):
    head_num_trials = []
    for i in range(n):
        head_num_trials.append(simulate_fair_coin_flips(m))
    hist = thinkstats2.Hist(head_num_trials)
    thinkplot.Hist(hist)
    return head_num_trials

trials = coin_flips_trials(1000, 250)

The p-value is simply the probability that we would have seen a result as extreme (or greater) as 140 heads out of 250 flips under the hypothesis that the coin is fair (the null hypothesis). Using the CDF you created in the previous cell, compute the p-value. If you want to test your learning a bit more: compute the p-value without using the CDF explicitly (instead use the results of the 1000 random trials directly).

Hint: you should use the PercentileRank function of CDF to compute the p-value, however, there is one important gotcha. The PercentileRank function returns the percentage of data that is equal to or less than the input value. When computing the p-value we want the percentage of the data that is equal to or greater than the observed value.

In [13]:

percentile = cdf.PercentileRank(140)
#print percentile
print 100 - percentile

1.7

In [60]:

pvalue_of_equal_to_or_less_than_1tailed = headsCdf.PercentileRank(140)
print "Pvalue for percentage of data that is equal to or less than for a one tailed test" , pvalue_of_equal_to_or_less_than
pvalue_of_equal_to_or_greater_than_1tailed = 100 -  pvalue_of_equal_to_or_less_than
print "Pvalue for percentage of data that is equal to or greater than for a one tailed test",  pvalue_of_equal_to_or_greater_than

Pvalue for percentage of data that is equal to or less than for a one tailed test 97.1
Pvalue for percentage of data that is equal to or greater than for a one tailed test 2.9

In [88]:

observed = 140

pvalue = 100 - cdf.PercentileRank(observed - 1)

num_above = sum(h >= observed for h in heads)
pvalue_calculated = 100 * num_above / float(len(heads))

print 'pvalue using PercentileRank: %f' % pvalue
print 'pvalue calculated:           %f' % pvalue_calculated

pvalue using PercentileRank: 2.700000
pvalue calculated:           2.700000

In [87]:

import numpy as np
import scipy.stats as stats
p1 = 1 - 0.01*(stats.percentileofscore(np.array(headlist), 140))
print 'p value is', p1, '(one tailed)'

p value is 0.025 (one tailed)

In [150]:

rank = cdf.PercentileRank(140)
p_value = 100 - rank
print "p-value for 140 heads out of 250 coins: ", p_value

p-value for 140 heads out of 250 coins:  3.2

In [3]:

print "p-value:", str(100-cdf.PercentileRank(140))+"%"

p-value: 3.3%

In [30]:

print "P-value of data that 140/250 flips are heads"
print str(100 - cdf.PercentileRank(140)) + "%"

P-value that 140/250 flips are heads
2.8%

In [36]:

pvalue = 100 - cdf.PercentileRank(140)
print pvalue

2.7

In [3]:

p_val = 100 - cdf_holder.PercentileRank(139)
print p_val, '%'

3.4 %

In [9]:

1 - cdf[139]

Out[9]:

0.040000000000000036

In [100]:

pvalue = 100 - cdf.PercentileRank(139)

print "Pvalue: ", pvalue, "%"

Pvalue:  3.6 %

In [16]:

p_value = 100 - cdf_flips.PercentileRank(139)
print 'P-value calculated with CDF: ', p_value, '%'

flips.sort()
ord_index = flips.index(140)
vals_above = 1 - ord_index/1000.0

print 'P-value calculated without CDF: ', 100*vals_above, '%'

P-value calculated with CDF:  3.1 %
P-value calculated without CDF:  3.1 %

In [22]:

p_val = 100 - sample.PercentileRank(139)
print p_val

4.0

In [9]:

p_value = (100 - cdf.PercentileRank(139))/100
print 'P-value: ', p_value

P-value:  0.005

In [18]:

print "p-value:", float(format(100 - cdf.PercentileRank(139), '.2f'))

p-value: 0.4

In [39]:

percRank = cdf.PercentileRank(139)
pVal = 1 - percRank/100
print "P-value using CDF: ",pVal

count = sum(1.0 for x in headsCounts if x>= 140.0)
print "P-value using the results of 1000 random trials directly: ",count/1000

P-value using CDF:  0.004
P-value using the results of 1000 random trials directly:  0.004

In [28]:

percentileRank = flipCdf.PercentileRank(140)
pValue = 1 - float(percentileRank)/100
print "The p-value is ", pValue

The p-value is  0.018

In [52]:

print "Percent of data that is equal to or greater than value:"
print str(100 - cdf.PercentileRank(140)) + "%"

Percent of data that is equal to or greater than value:
2.5%

In [99]:

sum(i >= 140 for i in resultsList)/float(len(resultsList))

Out[99]:

0.0286

In [47]:

percRank = cdf.PercentileRank(139)
pVal = 1 - (percRank / 100)
print "P value with CDF: ", pVal

count = sum(1.0 for x in heads if x >= 140.0)
print "P-value with the results of 1000 random trials: ", count / 1000

P value with CDF:  0.011
P-value with the results of 1000 random trials:  0.011

In [12]:

print "p-value:", 1 - cdf.PercentileRank(139)/100
# 139 because we want to include 140 in our counts

p-value: 0.033

In [67]:

trialsCDF = thinkstats2.Cdf(trials)
print 100.0 - trialsCDF.PercentileRank(140)

2.0

The p-value we computed above is called a one-tailed test in that we only counted simulations of the null-hypothesis that had 140 or more heads (Allen uses the terminology of one versus two-sided tests, see ThinkStats2 9.4). A two-tailed test would count simulations with 140 or more tails as well (which is what Allen shows in the book). Whether to use a one-tailed or a two-tailed test mostly has to do with your prior expectations regarding the hypothesis you are testing. For instance, if you had a reason to suspect that the coin would be biased towards heads (but not tails) you would use a one-tailed test. If you had no reason to assume a priori that the coin was biased towards heads or tails, you should use a two-tailed test.

Modify your coin flip simulation code to return the number of heads or tails, whichever is larger, out of n flips.

In [18]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    counterh = 0
    countert = 0
    for i in range(n):
        choices = choice(["heads", "tails"])
        
        if choices == "heads":
            counterh += 1
        else:
            countert += 1
    if counterh >= countert:
        #print "heads"
        return counterh
    else:
        #print "tails"
        return countert

print simulate_fair_coin_flips_two_sided(250)

In [61]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = 0
    tails = 0
    for i in range(n):
        coinFlip = random.randint(0, 1)
        if (coinFlip == 0): 
            heads +=1; 
        elif (coinFlip ==1): 
            tails +=1; 
#     print heads, tails
    if (heads > tails): 
        return heads
    elif (tails > heads): 
        return tails
    else: 
        return heads
    
print simulate_fair_coin_flips_two_sided(250)

In [73]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = simulate_fair_coin_flips(n)
    tails = n - heads
    return max(heads, tails)

print simulate_fair_coin_flips_two_sided(250)

In [98]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = 0
    for _ in itertools.repeat(None, n):
        heads += choice([0,1])
    return max(heads, n-heads)

    pass

print simulate_fair_coin_flips_two_sided(250)

In [133]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    count = sum([choice([0, 1]) for i in range(n)])
    return count if (count>125) else (250-count)

print simulate_fair_coin_flips_two_sided(250)

In [4]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = 0
    for i in xrange(n):
        if choice(['heads', 'tails'])=='heads':
            heads += 1
    if heads>(n/2):
        return heads
    else:
        return n-heads

print simulate_fair_coin_flips_two_sided(250)

In [49]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    coin_flips = [random.choice('HT') for _ in range(n)]
    hist = thinkstats2.Hist(coin_flips)
    if(hist['H'] >= hist['T']):
        return hist['H']
    else:
        return hist['T']

print simulate_fair_coin_flips_two_sided(250)

In [39]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    ht=0
    
    for i in range(n):
        ht += choice([0,1])
    
    if (n-ht)>ht:
        return n-ht
    else:
        return ht

print simulate_fair_coin_flips_two_sided(250)

In [4]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    side = [0,1]
    h = 0;
    t = 0;
    for i in range(n):
        if choice(side) == 0:
            h += 1
        else:
            t += 1
    if t > h:
        return t
    else:
        return h

print simulate_fair_coin_flips_two_sided(250)

In [20]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = sum(choice((0,1)) for _ in xrange(n))
    return max(heads, n-heads)

print simulate_fair_coin_flips_two_sided(250)

In [106]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    
    result = sum([choice([0,1]) for i in range(n)])
    
    if result >= n/2.0:
        return result
    else:
        return n - result
    

print simulate_fair_coin_flips_two_sided(250)

In [4]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = simulate_fair_coin_flips(n)
    return max(heads, n-heads)

print simulate_fair_coin_flips_two_sided(250)

In [17]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
      
    toss = [choice([0,1]) for i in xrange(n)]
    heads = sum(toss)
    if heads >= n/2.0:
        return heads
    else:
        return n-heads


print simulate_fair_coin_flips_two_sided(250)

In [23]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    sample = [choice('HT') for _ in range(n)]
    if sample.count('H') > sample.count('T'):
        return sample.count('H')
    else:
        return sample.count('T')

print simulate_fair_coin_flips_two_sided(250)

In [10]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    count = 0
    for i in range(0,n):
        if choice([0,1]) == 1:
            count += 1
    if count > n/2:
        return count
    return n-count

print simulate_fair_coin_flips_two_sided(250)

In [20]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    count_heads = 0
    count_tails = 0
    for _ in range(n):
        if choice('HT') == 'H':
            count_heads += 1
        else:
            count_tails += 1
    return max(count_heads, count_tails)

print simulate_fair_coin_flips_two_sided(250)

In [44]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    sample = [choice('HT') for _ in range(n)]
    hist =thinkstats2.Hist(sample)
    return hist['H'], hist['T']

print simulate_fair_coin_flips_two_sided(250)

(127, 123)

In [29]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    outcomesDict ={
        'heads': 0,
        'tails': 0
    }
    coinOptions = ['heads', 'tails']
    
    for i in range(n):
        outcomesDict[choice(coinOptions)] += 1
        
    if (outcomesDict['heads'] > outcomesDict['tails']):
        return outcomesDict['heads']
    else:
        return outcomesDict['tails']
        

print simulate_fair_coin_flips_two_sided(250)

In [53]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    count = simulate_fair_coin_flips(n)
    return max(count, abs(n - count))

print simulate_fair_coin_flips_two_sided(250)

In [91]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads= sum(np.random.randint(2, size=n))
    return max(heads, n-heads)

print simulate_fair_coin_flips_two_sided(250)

In [48]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    sample = [choice('HT') for _ in range(n)]
    hist = thinkstats2.Hist(sample)
    return hist['H'], hist['T']

print simulate_fair_coin_flips_two_sided(250)

(124, 126)

In [23]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    res = {"heads":0, "tails":0}
    for i in range(n):
        if choice([0,1]) == 0:
            res["heads"]+=1
        else:
            res["tails"]+=1
    return max(res.values())

print simulate_fair_coin_flips_two_sided(250)

In [68]:

def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    heads = sum([choice((0,1)) for i in range(n)])
    return heads if heads > (n-heads) else (n-heads)

print simulate_fair_coin_flips_two_sided(250)

Using the function simulate_fair_coin_flips_two_sided, create and display a CDF of the number of times the most common outcome, heads or tails, appears based on 1000 random trials.

In [20]:

flipDataTwo = []
for i in range(1000):
    flipDataTwo.append(simulate_fair_coin_flips_two_sided(250))

cdf2 = thinkstats2.Cdf(flipDataTwo, label='flipdata')
thinkplot.Cdf(cdf2)

Out[20]:

{'xscale': 'linear', 'yscale': 'linear'}

In [72]:

headsOrTailsAppears = [];
for i in range(1000): 
    coinFlipResult = simulate_fair_coin_flips_two_sided(250)
    headsOrTailsAppears.append(coinFlipResult)
mostFrequentResultCdf = thinkstats2.Cdf(headsOrTailsAppears)

In [101]:

iters = 1000
flips = 250
totals = [simulate_fair_coin_flips_two_sided(flips) for _ in range(iters)]

two_side_cdf = thinkstats2.Cdf(totals)
thinkplot.Cdf(two_side_cdf)
thinkplot.show()

<matplotlib.figure.Figure at 0x7f7c6c2fc990>

In [99]:

htlist = []
for _ in itertools.repeat(None, 1000):
    htlist.append(simulate_fair_coin_flips_two_sided(250))

htcdf = thinkstats2.Cdf(htlist)
thinkplot.Cdf(htcdf)

Out[99]:

{'xscale': 'linear', 'yscale': 'linear'}

In [179]:

two_flips_1000 = [simulate_fair_coin_flips_two_sided(250) for i in range(1000)]
two_cdf = thinkstats2.Cdf(two_flips_1000)
thinkplot.Cdf(two_cdf)
thinkplot.Show()

/home/yuzhong/anaconda2/lib/python2.7/site-packages/matplotlib/axes/_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
  warnings.warn("No labelled objects found. "

<matplotlib.figure.Figure at 0x7fae273ad4d0>

In [5]:

top_res = []
for i in xrange(1000):
    top_res.append(simulate_fair_coin_flips_two_sided(250))
cdf = thinkstats2.Cdf(top_res)
thinkplot.Cdf(cdf)

Out[5]:

{'xscale': 'linear', 'yscale': 'linear'}

In [50]:

coin_flips = [simulate_fair_coin_flips_two_sided(250) for i in range(1000)]
cdf = thinkstats2.Cdf(coin_flips)
thinkplot.Cdf(cdf)

Out[50]:

{'xscale': 'linear', 'yscale': 'linear'}

In [40]:

flips = [simulate_fair_coin_flips_two_sided(250) for _ in range(1000)]

cdf1 = thinkstats2.Cdf(flips, label="flips")
thinkplot.Cdf(cdf1)

Out[40]:

{'xscale': 'linear', 'yscale': 'linear'}

In [5]:

holder = []
for i in range(1000):
    holder += [simulate_fair_coin_flips_two_sided(250)]
    
cdf_holder = thinkstats2.Cdf(holder)
thinkplot.Cdf(cdf_holder)

thinkplot.Show(title='CDF of Coin Flip Two Sided Times',
              xlabel='Number Greater when Flipped',
              ylabel='CDF') 

<matplotlib.figure.Figure at 0x7f31223387d0>

In [21]:

results = [simulate_fair_coin_flips_two_sided(250) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)

thinkplot.Cdf(cdf)
thinkplot.Config(
    title='Coin Flips',
    xlabel='Number of Heads',
    ylabel='CDF'
)

In [112]:

toss_cts = []

for i in range(1000):
    toss_cts.append(simulate_fair_coin_flips_two_sided(250))

cdf = thinkstats2.Cdf(toss_cts, label='Most Common Outcome Counts')
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='heads', ylabel='CDF')

<matplotlib.figure.Figure at 0x7f5251a419d0>

In [5]:

results = [simulate_fair_coin_flips_two_sided(250) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
    title='Coin Flips',
    xlabel='Number of Most Common Outcome',
    ylabel='CDF'
)

In [18]:

ts_flips = [simulate_fair_coin_flips_two_sided(250) for i in xrange(1000)]
cdf_ts_flips = thinkstats2.Cdf(ts_flips)
thinkplot.Cdf(cdf_ts_flips)

Out[18]:

{'xscale': 'linear', 'yscale': 'linear'}

In [24]:

sample = []
for n in range(1000):
    sample.append(simulate_fair_coin_flips_two_sided(250))

sample = thinkstats2.Cdf(sample)
thinkplot.Cdf(sample)

Out[24]:

{'xscale': 'linear', 'yscale': 'linear'}

In [11]:

total = []
for i in range(0,1000):
    total.append(simulate_fair_coin_flips_two_sided(240))
cdf = thinkstats2.Cdf(total)
thinkplot.Cdf(cdf)
# your implementation here (imports included for convenience)

Out[11]:

{'xscale': 'linear', 'yscale': 'linear'}

In [23]:

res = []
for _ in range(1000):
    res.append(simulate_fair_coin_flips_two_sided(250))

cdf = thinkstats2.Cdf(res)
thinkplot.Cdf(cdf)
thinkplot.show(xlabel='No of Most Common Outcome in 250 coin flips', ylabel='CDF')

<matplotlib.figure.Figure at 0x1117b4390>

I'm not sure if you're asking me to pick either heads or tails, whichever appears in greater number, and then make that CDF or if you want me to compare the CDF of both. So I'm going to do both things

In [30]:

flipResults = []
for i in range(1000):
#     The instructions say 240, but everything else says 250, so I'm going with 250
    flipResults.append(simulate_fair_coin_flips_two_sided(250))

flipCdf = thinkstats2.Cdf(flipResults, label = 'Coin Flips')

thinkplot.Cdf(flipCdf)
thinkplot.Show(xlabel = 'Probability of Heads', ylabel='CDF', title='CDF of Two-Sided Coin Flips')

<matplotlib.figure.Figure at 0x1120abb10>

In [54]:

trials = [simulate_fair_coin_flips_two_sided(250) for i in range(1000)]
cdf = thinkstats2.Cdf(trials)
thinkplot.Cdf(cdf)

Out[54]:

{'xscale': 'linear', 'yscale': 'linear'}

In [95]:

resultsList = []

for i in range(10000):
    resultsList.append(simulate_fair_coin_flips_two_sided(250))
    

    
cdf = thinkstats2.Cdf(resultsList)
thinkplot.Cdf(cdf)
thinkplot.Show()

<matplotlib.figure.Figure at 0x7f23c4949610>

In [49]:

results = [simulate_fair_coin_flips_two_sided(250) for _ in range(1000)]
head, tail = zip(*twoSidedResults)

cdfTwoHeads = thinkstats2.Cdf(head, label = 'heads')
cdfTwoTails = thinkstats2.Cdf(tail, label = 'tails')

thinkplot.PrePlot(2)
thinkplot.Cdfs([cdfTwoHeads, cdfTwoTails])
thinkplot.Config(title ='Number of occurences in a coin toss')
thinkplot.Show(xlabel = 'Number of each possible toss', ylabel ='CDF')

<matplotlib.figure.Figure at 0x108f0e710>

In [24]:

counts = []
for i in range(1000):
    counts.append(simulate_fair_coin_flips_two_sided(250))

cdf = thinkstats2.Cdf(counts)
thinkplot.Cdf(cdf, label='counts')
thinkplot.Show(loc='lower right')

<matplotlib.figure.Figure at 0x7f9a0ee1b050>

In [69]:

def coin_flips_trials_two_sided(n, m):
    head_num_trials = []
    for i in range(n):
        head_num_trials.append(simulate_fair_coin_flips_two_sided(m))
    hist = thinkstats2.Hist(head_num_trials)
    thinkplot.Hist(hist)
    return head_num_trials

trials_two_sided = coin_flips_trials(1000, 250)

Use the CDF to compute a two-tailed (or two-sided) p-value for the observed data (140 heads out of 250 flips).

In [24]:

percentile = cdf2.PercentileRank(140)
print "lower"
print percentile
print "higher"
print 100 - percentile

lower
94.9
higher
5.1

In [70]:

pvalue_of_equal_to_or_less_than_2tailed = mostFrequentResultCdf.PercentileRank(140)
print "Percentile rank for percentage of data that is equal to or less than for a two tailed test", pvalue_of_equal_to_or_less_than
pvalue_of_equal_to_or_greater_than_2tailed = 100 -  pvalue_of_equal_to_or_less_than
print "Pvalue for percentage of data that is equal to or greater than for a two tailed test",  pvalue_of_equal_to_or_greater_than

Percentile rank for percentage of data that is equal to or less than for a two tailed test 97.1
Pvalue for percentage of data that is equal to or greater than for a two tailed test 2.9

In [105]:

pvalue_two_side = 100 - two_side_cdf.PercentileRank(observed - 1)
print 'pvalue two sided: %.2f' % pvalue_two_side

pvalue two sided: 5.40

In [100]:

p2 = 1 - 0.01*(stats.percentileofscore(np.array(htlist), 140))
print 'p value is', p2 ,'(two tailed)'

p value is 0.052 (two tailed)

In [177]:

two_rank = two_cdf.PercentileRank(140)
two_p_value = 100 - two_rank
print "two sided p-value for 140 heads out of 250 coins: ", two_p_value

two sided p-value for 140 heads out of 250 coins:  5.4

In [6]:

print "p-value:", str(100-cdf.PercentileRank(140))+"%"

p-value: 4.4%

In [51]:

print "P-value of data that 140/250 flips are heads"
print str(100 - cdf.PercentileRank(140)) + "%"

P-value of data that 140/250 flips are heads
4.1%

In [41]:

pvalue = 100 - cdf1.PercentileRank(140)
print pvalue

5.4

In [6]:

p_val = 100 - cdf_holder.PercentileRank(139)
print p_val, '%'

7.5 %

In [22]:

1 - cdf[139]

Out[22]:

0.062000000000000055

In [115]:

pvalue = 100 - cdf.PercentileRank(139)

print "Pvalue: ", pvalue, "%"

Pvalue:  6.2 %

In [19]:

ts_p_value = 100 - cdf_ts_flips.PercentileRank(139)
print 'Two-sided p_value: ', ts_p_value, '%'

Two-sided p_value:  6.3 %

In [25]:

p_val = 100 - sample.PercentileRank(139)
print p_val

6.7

In [12]:

p_value = (100 - cdf.PercentileRank(139))/100
print 'P-value: ', p_value

P-value:  0.009

In [24]:

print "p-value:", float(format(100 - cdf.PercentileRank(139), '.2f'))

p-value: 5.8

In [64]:

percRank = cdfTwoHeads.PercentileRank(139)
pVal = 1 - percRank/100
print "P-value using CDF: ",pVal

count = sum(1.0 for x in head if x>= 140.0)
print "P-value using the results of 1000 random trials directly: ",count/1000

P-value using CDF:  0.029
P-value using the results of 1000 random trials directly:  0.029

In [31]:

percentileRank = flipCdf.PercentileRank(140)
pValue = 1 - float(percentileRank)/100
print "The p-value is ", pValue

The p-value is  0.038

In [55]:

print "Two-sided p-value:"
print str(100 - cdf.PercentileRank(140)) + "%"

Two-sided p-value:
4.7%

In [101]:

1-cdf[139]

Out[101]:

0.0

In [25]:

print "p-value:", 1 - cdf.PercentileRank(139)/100
# 139 because we want to include 140 in our counts

p-value: 0.071

In [70]:

trials_two_sidedCDF = thinkstats2.Cdf(trials)
print str(100.0 - trials_two_sidedCDF.PercentileRank(140)) + "%"

2.0%

This approach (via simulations of the null-hypothesis) to computing p-values has its limitations. For instance, suppose you observed 180 heads in 250 flips. If you used your CDF from above to answer this question, what would go wrong? What would you need to do in order to get a sensible estimate of this p-value?

Response¶

What went wrong when I tried to get the p-value of 180 was that it was higher than or equal to all other entries. I would likely need to run many more trials in order to widen the birth of possibilities my model can account for. As it stands, some outcomes are so unlikely that they are never reached with just 1000 trials.

In [119]:

pvalue = 100 - cdf.PercentileRank(179)

print "Pvalue: ", pvalue, "%"

Pvalue:  0.0 %

There were no coin flips that resulted in 180 heads. (Odds are that) We'd have to run orders of magnitude more trials until we actually generated a trial that resulted in 180 heads.

In the two-tailed approach, the coin could be biased toward either heads or tails. The data used to make the CDF could have been all cases where the coin came up tails more often, so comparing an observation of 180 heads to this would just be plain wrong. It seems like the one-tailed approach would be better for this case where you're specifically testing if the coin is baised in one direction.

Well, the data would be an outlier in the set, thus leading to a strange representation. In order to get a better sense of the p-value, it may be better to do a single-sided p-value. Further, it may be better to avoid using a p-value altogether, given that for certain values it does not necessarily lend useful information about the coin.

In [76]:

for element in headsOrTailsAppears: 
    if element > 179: 
        print element

Using the same cdf, if we compute the percentile rank of 180, we get 100% which means that our calculation of the p-value would be 0. The reason is because in our 1000 trials, we don't have any trials where the most common outcome exceeds 155. You would need to increase the number of trials in order to get a percentile rank of 180 that might not be 100% because you need trials where you actually flipped heads 180 times out of the 250.

In [257]:

p3 = 1 - 0.01*(stats.percentileofscore(np.array(htlist), 180))
print 'p value is', p3 ,'(two tailed)'

p value is 0.0 (two tailed)

The CDF from above doesn't actually include 180 as a viable option, so it would theoretically have a percentile rank over 100% and a negative p-value. I would need to get either a cdf that represents 180 or a conditional that cuts off at 100% to get a p-value representative of 0 to show the many heads are plausibly explained by more than chance.

The odds of getting 180 heads out of 250 flips are far lower than 1 in 1000, so the p-value of this would be calculated as exactly 0 percent. In order to get a more sensible estimate of the p-value, the simulation would have to be run way more than 1000 times, or a continuous mathematical model, like a Gaussian, can be used.

It would likely say the odds were 0 which is wrong (well for all intents and purposes it's fine). You would need to increase the number of iterations sufficiently so that it is likely at least some of them had 180 or more heads. This is problematic since that would require in excess of ~10^13 trials.

180/250 is too unlikely to occur in 1000 tests, we can see from the cdf we did above that 180 will be a 0%. To get a sensible estimate the size would need to dramatically increase.

You would get a p-value of 0.0, because within our 1000 trials, we never saw 180 heads, so the CDF would calculate its p-value as there is nothing at or above 180, so the p-value is 0. A p-value of 0 intuitive suggests that this result is impossible. You'd need more trials to actually see its real probability

In this two-tailed approach, the coin could theoretically be biased in either direction. The flips used to create it could be all heads, and you might assume that a tails value of 180 is normal, when in reality it might not be normal at all.

180 heads in 250 flips does not occur in the random trials. Many many more trials would have to be run and even then fitting a curve to the data to account for values that did not show up in the trials would be the best fit. I know seaborn does a KDE fit to histograms, so that would probably be one way of getting a sensible estimate of the value.

Write a function that takes as input a data frame and computes the absolute value of the difference in mean age between men and women.

In [102]:

import numpy as np
def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    man = []
    woman = []
#     print data.Age
#     newData = pd.Series([data["Age"], data["Sex"]])
#     print newData["Age"]
    
#     data.loc[data["Sex"] == "male", "Sex"] = data["Age"]
#     data.loc[data["Sex"] == "female", "Sex"] = data["Age"]
#     print data["Sex"]
#     print data.iterrows()
    
    for index, row in data.iterrows():
        if row.Sex == "male":
            man.append(row.Age)
        else:
            woman.append(row.Age)

#     print man
#     for x in range(len(data.Age)):
# #         print x
        
#         if data.Sex[x] == "male":
#             man.append(data.Age[x])
#         else:
#             woman.append(data.Age[x])
    return abs(np.mean(man)-np.mean(woman))

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [87]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    men = data[data.Sex == 'male']
    women = data[data.Sex == 'female']
    return abs(men.Age.mean() - women.Age.mean())

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [22]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    men = data[data.Sex == 'male']
    women = data[data.Sex == 'female']
    
    mean_men_age = men.Age.mean()
    mean_women_age = women.Age.mean()
    return abs(mean_men_age - mean_women_age)

    
observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [240]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    agem = []
    agef = []
    for i in data.index:
        if data.Sex[i] == 'male':
            agem.append(data.Age[i])
        elif data.Sex[i] == 'female':
            agef.append(data.Age[i])
        else:
            print 'unknown Sex'
            continue
    diff = abs(np.mean(agem)-np.mean(agef))
    
    return diff

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [184]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    gender = data.groupby('Sex')
    a, b = gender.Age.mean()
    return abs(a-b)

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [8]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    return abs(data[data["Sex"]=='male']["Age"].mean()-data[data["Sex"]=='female']["Age"].mean())

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [57]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    m_age = data[data.Sex == "male"].Age.mean() 
    f_age = data[data.Sex == "female"].Age.mean()
    return abs(m_age - f_age)

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [20]:

import numpy as np

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    men = []
    women = []
    
    for index, row in data.iterrows():
        if row.Sex == "male":
            men.append(row.Age)
        else:
            women.append(row.Age)
            
    return abs(np.mean(women)-np.mean(men))

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [8]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    gender_groupby = data.groupby('Sex')
    a, b = gender_groupby.Age.mean()
    return abs(a-b)

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [30]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    return abs(data[data.Sex == 'male'].Age.mean() - data[data.Sex == 'female'].Age.mean())

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [151]:

import math

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    
    male_age=  data[data.Sex == 'male'].Age.mean()
    female_age=  data[data.Sex == 'female'].Age.mean()
    print "male age av: ", male_age
    print "female age av: ", female_age
    
    return abs(male_age-female_age)

    

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

male age av:  30.7266445916
female age av:  27.9157088123
observed age difference 2.81093577935

In [21]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    
    grouped = data.groupby('Sex')
    female, male = grouped.Age.mean()
    return abs(female-male)

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [134]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    womean = data.Age[data['Sex'] == 'female'].mean()
    mean = data.Age[data['Sex'] == 'male'].mean()
    return abs(mean-womean)

observed_age_diff = compute_age_diff(data_titanic)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [30]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    men = data[data.Sex == "male"]
    women = data[data.Sex == "female"]
    return men.Age.mean() - women.Age.mean()

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [26]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    men_only = data[data.Sex == 'male']
    women_only = data[data.Sex == 'female']
    return abs(men_only.Age.mean() - women_only.Age.mean())

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [104]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    fem = data[data.Sex == 'female']
    mal = data[data.Sex == 'male']
    ageDiff = abs(fem.Age.mean()-mal.Age.mean())
    return ageDiff

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [37]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    return (
        abs(data[data.Sex == 'male']['Age'].mean() - 
            data[data.Sex == 'female']['Age'].mean()))

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [161]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    return abs(data[data.Sex == "male"].Age.mean() - data[data.Sex == "female"].Age.mean())

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [108]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    return np.mean(data[data['Sex'] == 'male'].Age) - np.mean(data[data['Sex'] == 'female'].Age)

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [51]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    male = data[data.Sex == 'male']
    female = data[data.Sex == 'female']
    ageDiff = abs(female.Age.mean() - male.Age.mean())
    return ageDiff

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [31]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    men = data.Age[data.Sex=="male"]
    women = data.Age[data.Sex=="female"]
    return abs(women.mean() - men.mean())

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

In [72]:

def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the titanic """
    male = data['Age'][data['Sex'] == 'male'].mean()
    female = data['Age'][data['Sex'] == 'female'].mean()
    return abs(male-female)

observed_age_diff = compute_age_diff(data)
print "observed age difference", observed_age_diff

observed age difference 2.81093577935

Write a function called shuffle_ages that returns a copy of the original data frame but where the Ages have been randomly permuted.

Hint: there are lots of ways to do this, but numpy.random.permutation seems to be an especially succint choice. Make sure to try this function out on a small, hand-made Pandas series to get the idea of how it works.

In [105]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    
    data2 = data
    data2["Age"] = permutation(data["Age"])
#     print data["random"]
    return data2

compute_age_diff(shuffle_ages(data))

Out[105]:

1.0720918017812302

In [112]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    permutedAges = permutation(data.Age)
    data_new = data
#     print data_new.head()
    data_new["Age"] = permutedAges
    return data_new

compute_age_diff(shuffle_ages(data))

Out[112]:

0.15471898708482357

In [28]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    df = data.copy()
    df.Age = np.random.permutation(df.Age)
    return df

compute_age_diff(shuffle_ages(data))

Out[28]:

2.2638128948770664

In [245]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    age = []
    for index in data.index:
        age.append(data.Age[index])
    permuted_age = np.random.permutation(age).tolist()
    data.Age = permuted_age
    return data

compute_age_diff(shuffle_ages(data))

Out[245]:

2.0298046230747779

In [191]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    data_copy = data.copy()
    data_copy.Age = permutation(data_copy.Age)
    return data_copy

compute_age_diff(shuffle_ages(data))

Out[191]:

0.40177133287660993

In [9]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    res = data.copy()
    res["Age"] = permutation(res["Age"])
    return res

compute_age_diff(shuffle_ages(data))

Out[9]:

0.018900053284614415

In [28]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    data2 = data
    data2.Age = permutation(data.Age)
    return data2

compute_age_diff(shuffle_ages(data))

Out[28]:

1.2728859962954502

In [9]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    data2 = data.copy()
    shuffle_age = permutation(data.Age.tolist())
    data2.Age = shuffle_age
    return data2
    
compute_age_diff(shuffle_ages(data))

Out[9]:

0.62823095074978852

In [144]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    
    shuffled = data.copy()
    shuffled_ages = permutation(shuffled.Age)
    shuffled.Age = shuffled_ages
    
    return shuffled

compute_age_diff(shuffle_ages(data))

Out[144]:

0.08569409555707708

In [9]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    new_data = data.copy()
    new_data.Age = permutation(new_data.Age.values)
    return new_data

compute_age_diff(shuffle_ages(data))

Out[9]:

0.31173098881023265

In [22]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    
    shuffle = data.copy()
    new_ages = permutation(data.Age.tolist())
    shuffle.Age = new_ages
    
    return shuffle
    
compute_age_diff(shuffle_ages(data))

Out[22]:

0.65757656491842553

In [135]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    newframe = data.copy()
    newframe.Age = permutation(newframe.Age)
    return newframe

random_age_diff = compute_age_diff(shuffle_ages(data_titanic))

print random_age_diff

0.371210829464

In [35]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    
    newdata = data.copy()
    newage = permutation(data.Age.tolist())
    newdata.Age = newage
    return newdata
    
compute_age_diff(shuffle_ages(data))

Out[35]:

0.9292677171348096

In [27]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    res = data
    res.Age = permutation(res.Age).astype(int)
    return res

compute_age_diff(shuffle_ages(data))

Out[27]:

2.649336479663038

In [103]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    shuffled_data = data.copy()
    shuffled_data.Age = permutation(shuffled_data.Age)
    return shuffled_data
    

compute_age_diff(shuffle_ages(data))

Out[103]:

1.9901857349470973

In [68]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    shuffledAges = data.copy()
    shuffledAges['Age'] = permutation(shuffledAges['Age'])
    return shuffledAges

compute_age_diff(shuffle_ages(data))

Out[68]:

0.12766105909517478

In [173]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    df = data.copy()
    df = df.apply(permutation)
    return df

compute_age_diff(shuffle_ages(data))

Out[173]:

0.6840909898251759

In [169]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    dataCopy = data.copy(deep=True)
    dataCopy.Age = (np.random.permutation(data.Age))
    return dataCopy

compute_age_diff(shuffle_ages(data))

Out[169]:

0.2272429017279407

In [52]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    new_data = data.copy()
    new_data.Age = permutation(data.Age.values)
    return new_data

compute_age_diff(shuffle_ages(data))

Out[52]:

0.70081525462434513

In [45]:

from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    ages_shuff = np.random.permutation(data.Age)
    data_copy = data
    data_copy.Age = ages_shuff
    return data_copy

compute_age_diff(shuffle_ages(data))

Out[45]:

0.5950168734617236

In [73]:

from numpy.random import permutation
import numpy as np

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    d = data
    ages = data['Age'].values
    np.random.shuffle(ages)
    d['Age'] = ages
    return d

compute_age_diff(shuffle_ages(data))

Out[73]:

0.47985105681154749

Using 1000 random simulations, compute the p-value for the hypothesis that the mean ages of men and women were different (you may wish to use Cdf as in the previous section).

In [107]:

randAge = []
for i in range(1000):
    randAge.append(compute_age_diff(shuffle_ages(data)))

cdf3 = thinkstats2.Cdf(randAge, label='Random_Age')
thinkplot.Cdf(cdf3)

print cdf3.PercentileRank(1)
#Just in case this makes no sense, I did this because in the event that randomly changing around ages doesn't make
#things come out to be 0 on average, then there must be some bias on either the male or female side.

63.7

In [113]:

diffAges = [];
for i in range(1000): 
    meandiff = compute_age_diff(shuffle_ages(data))
    diffAges.append(meandiff)
meanDiffAgesCdf = thinkstats2.Cdf(diffAges)

In [75]:

def shuffled_age_diff(data):
    return compute_age_diff(shuffle_ages(data))
    
iters = 1000    
age_diffs = [shuffled_age_diff(data) for _ in range(iters)]

age_cdf = thinkstats2.Cdf(age_diffs)

x = observed_age_diff
y = age_cdf.PercentileRank(x) / 100

thinkplot.Cdf(age_cdf)
plt.axvline(x, 0, y, color='red')
plt.axhline(y, 0, x/4.0, color='red')
plt.plot(x, y, '.', color='green', markersize=30)
plt.xlim((0, 4.0))
thinkplot.show()

<matplotlib.figure.Figure at 0x7fcbbd9d5a50>

In [248]:

agelist = []
for _ in itertools.repeat(None, 1000):
    agelist.append(compute_age_diff(shuffle_ages(data)))

#plotting cdf
agecdf = thinkstats2.Cdf(agelist)
thinkplot.Cdf(agecdf)

#Computing p-value for pbserved data 2 years of difference
page = 1 - 0.01*(stats.percentileofscore(np.array(agelist), 2))
print 'p value is', page

p value is 0.074

In [196]:

age_diffs = [compute_age_diff(shuffle_ages(data)) for i in range(1000)]
age_diff_cdf = thinkstats2.Cdf(age_diffs)
thinkplot.Cdf(age_diff_cdf)
thinkplot.Show()

<matplotlib.figure.Figure at 0x7fae27647b50>

In [10]:

diff_arr = []
for i in xrange(1000):
    diff_arr.append(compute_age_diff(shuffle_ages(data)))
cdf = thinkstats2.Cdf(diff_arr)
print "p-value:", str(100-cdf.PercentileRank(compute_age_diff(data)))+"%"

p-value: 0.8%

In [64]:

trials = [compute_age_diff(shuffle_ages(data)) for _ in range(1000)]
cdf = thinkstats2.Cdf(trials)
thinkplot.Cdf(cdf)

print str(100 - cdf.PercentileRank(2.81093577935)) + "%"

0.9%

In [44]:

ages = [compute_age_diff(shuffle_ages(data)) for _ in range(1000)]

cdf2 = thinkstats2.Cdf(ages, label="age diff")
thinkplot.Cdf(cdf2)

Out[44]:

{'xscale': 'linear', 'yscale': 'linear'}

In [10]:

holder = []
for i in range(1000):
    holder += [compute_age_diff(shuffle_ages(data))]
    
cdf_holder = thinkstats2.Cdf(holder)
thinkplot.Cdf(cdf_holder)

thinkplot.Show(title='CDF of Age difference Average for gender',
              xlabel='Absolute Age Difference',
              ylabel='CDF') 

print 'pval:', 100 - cdf_holder.PercentileRank(observed_age_diff), '%'

pval: 1.5 %

<matplotlib.figure.Figure at 0x7f312299fa50>

In [46]:

results = [compute_age_diff(shuffle_ages(data)) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)

thinkplot.Cdf(cdf)

thinkplot.Config(
    title='Mean Ages',
    xlabel='Mean Age Difference Between "Males" and "Females"',
    ylabel='CDF'
)

1-cdf[observed_age_diff]

Out[46]:

0.014000000000000012

In [150]:

mean_cts = []

for i in range(1000):
    mean_cts.append(compute_age_diff(shuffle_ages(data)))

cdf = thinkstats2.Cdf(mean_cts, label='Abs of Age CDF')
thinkplot.Cdf(cdf)
thinkplot.Show(xlabel='age dif', ylabel='CDF')

pvalue = 100 - cdf.PercentileRank(compute_age_diff(data))

print "Pvalue: ", pvalue, "%"

Pvalue:  0.8 %

<matplotlib.figure.Figure at 0x7f525160ddd0>

In [10]:

results = [compute_age_diff(shuffle_ages(data)) for _ in xrange(1000)]
cdf = thinkstats2.Cdf(results)
thinkplot.Cdf(cdf)
thinkplot.Config(
    title='Mean Ages',
    xlabel='Mean Age Difference Between "Males" and "Females"',
    ylabel='CDF'
)
# This isn't __quite__ right since it doesn't include the
# values that are exactly the same as the observed difference,
# but the impact of this is pretty negligible since this is
# more continuous than the coin flips example
1 - cdf[observed_age_diff]

Out[10]:

0.009000000000000008

In [23]:

simulation = [compute_age_diff(shuffle_ages(data)) for i in xrange(1000)]

cdf_sim = thinkstats2.Cdf(simulation)
thinkplot.Cdf(cdf_sim)

print 'P-value of male/female mean age diff: ', 100 - cdf_sim.PercentileRank(observed_age_diff), '%'

P-value of male/female mean age diff:  1.2 %

In [136]:

sample_raw = []
sample_perm = []
for n in range(1000):
    sample_raw.append(compute_age_diff(data_titanic))
    sample_perm.append(compute_age_diff(shuffle_ages(data_titanic)))

sample_raw = thinkstats2.Cdf(sample_raw)
sample_perm = thinkstats2.Cdf(sample_perm)
thinkplot.Cdfs([sample_raw, sample_perm])


p_val = 100 - sample_perm.PercentileRank(compute_age_diff(data_titanic))
print p_val

1.3

In [37]:

total = []
for i in range(0,1000):
    total.append(compute_age_diff(shuffle_ages(data)))
                 
cdf = thinkstats2.Cdf(total)
thinkplot.Cdf(cdf)

Out[37]:

{'xscale': 'linear', 'yscale': 'linear'}

In [30]:

res = []
for _ in range(1000):
    res.append(compute_age_diff(shuffle_ages(data)))

cdf = thinkstats2.Cdf(res)
print "p-value:", float(format(100 - cdf.PercentileRank(observed_age_diff), '.2f'))

p-value: 0.6

In [115]:

import thinkstats2

ageDiffs= [compute_age_diff(shuffle_ages(data)) for i in range(1000)]
ageCdf = thinkstats2.Cdf(ageDiffs)
thinkplot.Cdf(ageCdf)
#If I'm looking for mean ages that are different, I want anything that's not 0
percentileRank =  ageCdf.PercentileRank(observed_age_diff)
p_Val = 1 - percentileRank/100
print "P-value using CDF: ",p_Val*100, "%"

P-value using CDF:  1.1 %

In [64]:

ageDiffs = []
for i in range(1000):
    ageDiffs.append(compute_age_diff(shuffle_ages(data)))
    
ageCdf = thinkstats2.Cdf(ageDiffs)

print "p-value of observed age difference: ", (1 - ageCdf[compute_age_diff(data)])

p-value of observed age difference:  0.013

In [176]:

trials = [compute_age_diff(shuffle_ages(data)) for _ in range(1000)]
cdf = thinkstats2.Cdf(trials)
thinkplot.Cdf(cdf)

Out[176]:

{'xscale': 'linear', 'yscale': 'linear'}

In [174]:

resultsList = []
for i in range(1000):
    resultsList.append(compute_age_diff(shuffle_ages(data)))
    
cdf = thinkstats2.Cdf(resultsList)
1-cdf[observed_age_diff]

Out[174]:

0.0050000000000000044

In [53]:

# not sure whether 1 - cdf[value] vs the implementation at the start (using percentile rank) 
# is the right option

In [51]:

diffs = []
for i in range(1000):
    diffs.append(compute_age_diff(shuffle_ages(data)))

cdf = thinkstats2.Cdf(diffs)
thinkplot.Cdf(cdf, label='age diffs')
thinkplot.Show(loc='lower right')

bigger = 0
for x in diffs:
    if x >= observed_age_diff:
        bigger +=1
print "p-value:", bigger/1000.0

p-value: 0.016

<matplotlib.figure.Figure at 0x7f9a0e9938d0>

In [78]:

def age_test(n, data):
    age_trials = []
    for i in range(n):
        age_trials.append(compute_age_diff(shuffle_ages(data)))
    hist = thinkstats2.Hist(age_trials)
    thinkplot.Hist(hist)
    return age_trials

titanic_age_trials = age_test(1000, data)

Ignoring passengers with missing ages:

Was the average age of male versus female passengers on the titanic different?
What additional (if any) conclusions can you draw based on the p-value you just computed? In other words, what does this p-value mean?

Disclaimer: (1) is a bit of a trick question (sorry!), but I included it to encourage being precise about the definition of the null hypothesis and eactly which population it refers to.

1¶

The average was different for all the people, though we can't tell for certain if it is males or females because there is no connection after the age data has been shuffled.

2¶

What I can draw from this is that there is a particular concentration of ages that is higher or lower for males or females. We know that for a value of 1 year in difference, that is a larger gap than 63% of occupants saw, and thereby 37% less than others, but not whether it corresponds to being male or female. We would likely want to leave the ages corresponding with the sex in order to determine that.

In [123]:

males = data[data.Sex == 'male']
print "average male age", males.Age.mean()

females = data[data.Sex == 'female']
print "average female age", females.Age.mean()

print "diff", abs(males.Age.mean() - females.Age.mean())

average male age 29.4970640177
average female age 30.0498084291
diff 0.552744411459

In [36]:

males = data[data.Sex == 'male']
females = data[data.Sex == 'female']

avg_male_age = males.Age.mean()
avg_female_age = females.Age.mean()

print 'Average male age:   %d' % avg_male_age
print 'Average female age: %d' % avg_female_age

Average male age:   30
Average female age: 27

The p value is significant (>0.05) when the observed age is 2 years. We can reject the null hypothesis and say that the average male and female age are different.

When the observed age is 1 year, the p value is insignificant (<0.01) and we cannot reject the null hypothesis and say that the average age of male and female passengers are different.

Yes, since the p-value of the average age difference is about 1%
It means that the difference in average ages was most likely not chance and that there's a reason behind it

The average age was different and the p value of 2.0% we calculated indicates that the 3 year difference we calculated was probably not due to random chance b/c it is so small and most of the values will actually have a differences as we can see from the graph also.

Yes, it was different. But not by much, and this is a pretty small sample size so that is to be expected.
It is more likely to have a larger age difference?

Answers.

Yes, the average age of male verses female were different as proven when finding observed_age_diff, whereas in the null hypothesis has age difference zero.
The p-value is below 5% which implies that it is 'statistically significant'. Here, the chances of the mean age differing that much by chance given the null hypothesis is true is very very low.

Yes, the average age for males was about 2.8 years older than the average age for females
The p-value is statistically significant since it is clearly below 5%. I think this means that it was not random or chance that the average age was older for males.

1 The average age is definitily different.

2 This difference is statistically significant. A random distribution of ages of Titanic passengers is very unlikely to have a observed difference in means for males and females.

Maybe. The mean age of male passengers in the training set was different from the mean age of female passengers in the training set.
The p-value seems to be rather low, so it is unlikely that the difference in mean age is just by chance.

Yes, and it appears that the difference may be significant given that the p-value is approximately 0.013. Of course, there are people who did not report their age, and there is a whole testing data set that is not included - thus, it is difficult to actually form a true conclusion.

The average age of male versus female passengers on the Titanic was different. Since the p-value we calculated was around 1%, we know that the difference in average age between genders is statistically significant because we tested our null hypothesis by shuffling the ages and the percentage of the shuffled data that is equal to or greater than our observed value is under 1%, which means that this effect is not due to chance. So it follows that the average ages of men vs women on the titanic was actually different, not just due to chance.

yes the average age is different (medain age difference is about 2.8)
The mean difference in age is plausibly caused by more than chance. A random sample of one age group cannot definitely be male or female. The permutation results also suggest that it is unlikely that a random sample of titanic passengers will exhibt the same age difference.

Yes, the average of male versus female passengers on the titanic was different. (The difference in mean ages was about 2.8 years.) This isn't different by that many years in the grand scheme of things, though.
I think that this p-value means that, if the distributions of the ages of male and female passengers weren't different, it would be very unlikely for us to get this particular distribution of ages, which means that there is a (statistically sigificant?) difference between ages of men and women on the titanic.

The average age of male versus female passengers was different, and the low p-value indicates that the age difference of about 3 years was probably not due to random chance.

Yes
It is unlikley that the age difference is due just to random chance

Well, we observed that the average age of male and female passengers was different by about 2 years. We were testing whether this was expected in a normal distribution of passenger ages or not.
We can conclude that, if the ages were the same between male and female passengers, the difference we observed in the data would have a 0.016 chance of occuring.

The average age of of males was different than the female passengers by about 2.8 years.
However, the calculated p-value indicates that the difference in average age of males and females on the Titanic was not significant. 27.4% is a very high p-value indicating that a difference of 2.8 years is not a terribly unexpected difference in means.