```python
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Construct a 52-card deck
from itertools import product
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['♠︎', '♥︎', '♦︎', '♣︎']
cards = product(ranks, suits)
deck = Table(['rank', 'suit']).with_rows(cards)
```
It is often the case when programming that you will wish to repeat the same operation multiple times, perhaps with slightly different behavior each time. You could copy-paste the code 10 times, but that's tedious and prone to typos, and if you wanted to do it a thousand times (or a million times), forget it.
A better solution is to use a `for` statement to loop over the contents of a sequence. A `for` statement begins with the word `for`, followed by a name for the item in the sequence, followed by the word `in`, and ending with an expression that evaluates to a sequence. The indented body of the `for` statement is executed once for each item in that sequence.
```python
for i in np.arange(5):
    print(i)
```
```
0
1
2
3
4
```
A typical use of a `for` statement is to build up a table by repeating a random computation many times and storing each result as a new row. The `append` method of a table takes a sequence and adds a new row. It's different from `with_row` because a new table is not created; instead, the original table is extended. The cell below draws 100 cards, but keeps only the aces.
```python
aces = Table(['Rank', 'Suit'])
for i in np.arange(100):
    card = deck.row(np.random.randint(deck.num_rows))
    if card.item(0) == 'A':
        aces.append(card)
aces
```
Rank | Suit |
---|---|
A | ♠︎ |
A | ♣︎ |
A | ♥︎ |
A | ♦︎ |
A | ♠︎ |
A | ♦︎ |
This pattern can be used to track the results of repeated experiments. For example, perhaps we want to learn about the empirical properties of some randomly drawn poker hands. Below, we track whether the hand contains four-of-a-kind or five cards of the same suit (called a flush).
```python
hands = Table(['Four-of-a-kind', 'Flush'])
for i in np.arange(10000):
    hand = deck.sample(5)
    four_of_a_kind = max(hand.group('rank').column('count')) == 4
    flush = max(hand.group('suit').column('count')) == 5
    hands.append([four_of_a_kind, flush])
hands
```
Four-of-a-kind | Flush |
---|---|
False | False |
False | False |
False | False |
False | False |
False | False |
False | False |
False | False |
False | False |
False | False |
False | False |
... (9990 rows omitted)
A `for` statement can also iterate over a sequence of labels. We can use this feature to summarize the results of our experiment. These are rare hands indeed!
```python
for label in hands.labels:
    success = np.count_nonzero(hands.column(label))
    print('A', label, 'was drawn', success, 'of', hands.num_rows, 'times')
```
```
A Four-of-a-kind was drawn 2 of 10000 times
A Flush was drawn 22 of 10000 times
```
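As a sanity check (not part of the original experiment), we can compare these simulated counts to the exact probabilities, which are easy to compute with Python's standard library. Note that our `flush` check counts any five cards of one suit, including straight flushes, so the exact calculation below does the same.

```python
from math import comb

total_hands = comb(52, 5)                 # number of 5-card hands: 2,598,960

# Four-of-a-kind: choose the rank (13 ways), then the fifth card (48 ways)
p_four = 13 * 48 / total_hands            # about 0.00024

# Flush (including straight flushes): choose the suit, then 5 of its 13 cards
p_flush = 4 * comb(13, 5) / total_hands   # about 0.00198

print(p_four, p_flush)
```

The simulated counts of 2 and 22 out of 10,000 are in line with these probabilities.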
Next, we'll look at a technique that was designed several decades ago to help conduct surveys of sensitive subjects. Researchers wanted to ask participants a few questions: Have you ever had an affair? Do you secretly think you are gay? Have you ever shoplifted? Have you ever sung a Justin Bieber song in the shower? They figured that some people might not respond honestly, because of the social stigma associated with answering "yes". So, they came up with a clever way to estimate the fraction of the population who are in the "yes" camp, without violating anyone's privacy.
Here's the idea. We'll instruct the respondent to roll a fair 6-sided die, secretly, where no one else can see it. If the die comes up 1, 2, 3, or 4, then the respondent is supposed to answer honestly. If it comes up 5 or 6, the respondent is supposed to answer the opposite of what their true answer would be. But they shouldn't reveal what came up on their die.
Notice how clever this is. Even if the person says "yes", that doesn't necessarily mean their true answer is "yes" -- they might very well have just rolled a 5 or 6. So the responses to the survey don't reveal any one individual's true answer. Yet, in aggregate, the responses give enough information that we can get a pretty good estimate of the fraction of people whose true answer is "yes".
Let's try a simulation, so we can see how this works. We'll write some code to perform this operation. First, a function to simulate rolling one die:
```python
def roll_once():
    return np.random.randint(1, 7)
```
Now we'll use this to write a function to simulate how someone is supposed to respond to the survey. The argument to the function is their true answer (`True` or `False`); the function returns what they're supposed to tell the interviewer.
```python
def respond(true_answer):
    if roll_once() >= 5:
        return not true_answer
    else:
        return true_answer
```
We can try it. Assume our true answer is 'no'; let's see what happens this time:
```python
respond(False)
```
```
False
```
Of course, if you were to run it many times, you might get a different result each time. Below, we build a table of the responses from many respondents when the true answer is always `False`.
```python
responses = Table(['Truth', 'Response'])
for i in np.arange(1000):
    responses.append([False, respond(False)])
responses
```
Truth | Response |
---|---|
False | False |
False | False |
False | False |
False | False |
False | False |
False | False |
False | True |
False | False |
False | True |
False | False |
... (990 rows omitted)
Let's build a bar chart and look at how many `True` and `False` responses we get.
```python
responses.group('Response').barh('Response')
```
```python
responses.where('Response', False).num_rows
```
```
656
```
```python
responses.where('Response', True).num_rows
```
```
344
```
Exercise for you: If `N` out of 1000 responses are `True`, approximately what fraction of the population has truly sung a Justin Bieber song in the shower?
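Here is one way to work through the reasoning, applied to the 344 `True` responses simulated above (a sketch; the variable names are ours, not from the text). With a fair die, about 2/3 of respondents answer honestly and 1/3 flip their answer.

```python
# Counts from the simulation above, where every true answer was False.
n_true_responses = 344
n_total = 1000

# If p is the true fraction of "yes", the expected fraction of True
# responses is (2/3)*p + (1/3)*(1 - p) = 1/3 + p/3.
# Solving for p gives p ≈ 3*(N/1000) - 1.
p_estimate = 3 * n_true_responses / n_total - 1
print(p_estimate)  # ≈ 0.03, close to the true value of 0
```

The estimate is close to zero, as it should be, since every respondent's true answer was `False`.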
This method is called "randomized response". It is one way to poll people about sensitive subjects, while still protecting their privacy. You can see how it is a nice example of randomness at work.
It turns out that randomized response has beautiful generalizations. For instance, your Chrome web browser uses it to anonymously report feedback to Google, in a way that won't violate your privacy. That's all we'll say about it for this semester, but if you take an upper-division course, maybe you'll get to see some generalizations of this beautiful technique.
The steps in the randomized response survey can be visualized using a tree diagram. The diagram partitions all the survey respondents according to their true answer and the answer that they eventually give. It also displays the proportions of respondents whose true answers are 1 ("True") and 0 ("False"), as well as the chances that determine the answers they give. We use $p$ to denote the proportion whose true answer is 1.
The respondents who answer 1 split into two groups. The first group consists of the respondents whose true and given answers are both 1. If the number of respondents is large, the proportion in this group is likely to be about $\frac{2}{3}$ of $p$. The second group consists of the respondents whose true answer is 0 and given answer is 1. The proportion in this group is likely to be about $\frac{1}{3}$ of $1-p$.
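These two proportions can be checked with a quick simulation (a sketch; the value $p = 0.3$ and all variable names here are arbitrary choices of ours):

```python
import numpy as np

p = 0.3       # suppose 30% of the population's true answer is 1
n = 100000    # a large number of respondents

truths = np.random.random_sample(n) < p    # True with probability p
dice = np.random.randint(1, 7, n)          # one die roll per respondent
answers = np.where(dice >= 5, ~truths, truths)

# True answer 1 and given answer 1: should be about (2/3)*p = 0.2
print(np.count_nonzero(truths & answers) / n)
# True answer 0 but given answer 1: should be about (1/3)*(1-p) ≈ 0.233
print(np.count_nonzero(~truths & answers) / n)
```

With 100,000 simulated respondents, both printed proportions land very close to the tree diagram's predictions.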
We can observe $p^*$, the proportion of 1's among the given answers. Thus $$ p^* ~\approx ~ \frac{2}{3} \times p ~+~ \frac{1}{3} \times (1-p) $$
This allows us to solve for an approximate value of p: $$ p ~ \approx ~ 3p^* - 1 $$
In this way we can use the observed proportion of 1's to "work backwards" and get an estimate of $p$, the proportion in which we are interested.
Technical note. It is worth noting the conditions under which this estimate is valid. The calculation of the proportions in the two groups whose given answer is 1 relies on each of the groups being large enough so that the Law of Averages allows us to make estimates about how their dice are going to land. This means that it is not only the total number of respondents that has to be large – the number of respondents whose true answer is 1 has to be large, as does the number whose true answer is 0. For this to happen, p must be neither close to 0 nor close to 1. If the characteristic of interest is either extremely rare or extremely common in the population, the method of randomized response described in this example might not work well.
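To see the problem concretely, here is a sketch (not from the text; the helper function is ours) that simulates surveys of 1,000 respondents for a rare trait, $p = 0.01$. The estimate $3p^* - 1$ frequently comes out negative, which is impossible for a true proportion.

```python
import numpy as np

def estimate_once(p, n):
    """Simulate one survey of n respondents; return the estimate 3*p_star - 1."""
    truths = np.random.random_sample(n) < p
    flips = np.random.randint(1, 7, n) >= 5   # die shows 5 or 6
    answers = np.where(flips, ~truths, truths)
    return 3 * np.count_nonzero(answers) / n - 1

estimates = np.array([estimate_once(0.01, 1000) for _ in range(1000)])

# A substantial fraction of the 1000 estimates are negative.
print(np.count_nonzero(estimates < 0) / 1000)
```

The noise introduced by the die rolls swamps the tiny signal from the 1% of respondents whose true answer is 1, which is why the method needs $p$ to be away from 0 and 1.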
Let's try out this method on some real data. The chance of drawing a poker hand with no aces is

$$\frac{48}{52} \times \frac{47}{51} \times \frac{46}{50} \times \frac{45}{49} \times \frac{44}{48}$$

```python
np.prod(np.arange(48, 43, -1) / np.arange(52, 47, -1))
```
```
0.65884199833779666
```
It is quite embarrassing indeed to draw a hand with no aces. The table below contains one column for the truth of whether a hand has no aces and another for the randomized response.
```python
ace_responses = Table(['Truth', 'Response'])
for i in np.arange(10000):
    hand = deck.sample(5)
    no_aces = hand.where('rank', 'A').num_rows == 0
    ace_responses.append([no_aces, respond(no_aces)])
ace_responses
```
Truth | Response |
---|---|
False | True |
True | True |
False | True |
True | False |
True | False |
True | True |
False | False |
True | False |
False | False |
True | True |
... (9990 rows omitted)
Using our derived formula, we can estimate what fraction of hands have no aces.
```python
3 * np.count_nonzero(ace_responses.column('Response')) / 10000 - 1
```
```
0.6644000000000001
```