An Algorithm for Human Deceit¶

The material for this notebook is provided via this free online textbook :

I am writing this notebook to document my learning, and hopefully to help others learn machine learning. You can think of it as my personal lecture notes. I would love suggestions / corrections / feedback for these notebooks.

Email me: [email protected]

In [4]:
social()

Out[4]:

Introduction¶

Social data has an addition layer of interest. This is because we cannot guarantee people are honest with their answers, which adds a further layer of complication to inference. For example, simply asking individuals "Have you ever cheated on a test?" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is less than your observed rate (assume individuals lie only about not cheating. To demonstrate a solution to this dishonesty problem with Bayesian modeling, we need to introduce the binomial Distribution.

The Binomial Distribution¶

The binomial distribution is one of the most popular distributions due to its simplicity and usefulness. The binomial distribution has 2 parameters: $N$, a positive integer representing $N$ trials or number of instances of potential events, and $p$, the probability of an event occurring in a single trial. Like the Poisson distribution, it is a discrete distribution, but unlike the Poisson, it only weighs integers from 0 to $N$. The mass distribution looks like:

P(X=k) = \begin{align} N\\ K \end{align} \ p^k(1-p)^{N-k}

If $X$ is a binomial random variable with parameters $p$ and $N$, denoted by $X \sim Bin(N,p)$, then $X$ is the number of events that occurred in the $N$ trials, where 0 <= $X$ <=$N$, and $p$ is the probability of a single event. The larger $p$ is, the more events are likely to occur. The expected value of a binomial is equal to $Np$. Below we plot the probability mass distribution for varying parameters.

In [1]:
import scipy.stats as stats
import numpy as np

%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

sns.set(context = 'notebook', style = 'whitegrid')

In [2]:
binomial = stats.binom

parameters = [(10, 0.4), (10, 0.9)]
colors = ["#9b59b6", "#3498db"]

for i in range(2):
N, p = parameters[i]
_x = np.arange(N + 1)
plt.bar(_x - 0.5, binomial.pmf(_x, N, p), color=colors[i],
alpha=0.6,
label='$N$: {}, $p$: {}'.format(N, p))
plt.legend(loc='upper left')
plt.xlim(0, 10.5)
plt.xlabel('$k$')
plt.ylabel('$P(X=k)$')
plt.title('Probability mass distribution of binomial random variables')

Out[2]:
<matplotlib.text.Text at 0x10b8168d0>

The special case when $N$ = 1 corresponds to the Bernoulli distribution. There is another connection between Bernoulli and Binomial random variables as well. If we have $X_1, X_2, ..., X_N$ Bernoulli random variables with the same $p$, then $Z=X_1 + X_2+...+X_N \ \sim \ Binomial(N,p)$

The expected value of a Bernoulli random variable is $p$. This can be seen by noting the more general Binomial random variable has expected value $Np$ and setting $N$ = 1.

Example: Cheating Among Students¶

We will use the binomial distribution to determine the frequency of students cheating during an exam. If we let $N$ be the total number of students who took the exam and assuming each student is interviewed post-exam (answering without consequence), we will receive integer $X$ 'Yes I did cheat" answers. We then find the posterior distribution of $p$, given $N$, some specified prior on $p$, and observed data $X$.

This is an unrealistic model. No student, even with a free-pass against punishment, would admit to cheating. What we need is a better algorithm to ask students if they had cheater. The proposed model is noted below.

In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up Heads. Otherwise, if the coin comes up Tails, the student (secretly) flips the coin again, and answers "Yes, I did cheat' if the coin flips Heads and 'No, I did not cheat' if the coin flips Tails. This way, the interviewer does not know if a 'Yes' was the result of a guilty plea, or a Heads on the second coin toss. Thus privacy is preserved and the researchers receive honest answers.

The interviewers are still receiving false data, since some of the "Yes" answers are just randomness rather than confessions. Yet, now as a researcher we know about half the students are telling the truth. We can use PyMC to dig through this noisy model, and find a posterior distribution for the true frequency of liars.

SUppose 100 students are being surveyed for cheating, and we wish to find $p$, the proportion of cheaters. There are a few ways to do this in PyMC, we demonstrate the most explicit way, and then show a simplified version. Both methods arrive at the same inference.

In our data generation model, we sample $p$, the true proportion of cheaters, from a prior. Since we have no clue about the value of $p$, we will assign it a Uniform(0, 1) prior.

In [3]:
import pymc as pm

N = 100
p = pm.Uniform('freq_cheating', 0, 1)


Again, we assign Bernoulli random variables to the 100 students: 1 implies they cheater and 0 implies they did not.

In [4]:
true_answers = pm.Bernoulli('truths', p, size=N)


If we carry out the algorithm, the next step that occurs is the first coin-flip each student makes. This can be modeled again by sampling 100 Bernoulli random variables with $p$ = 0.5: denote 1 as Heads and 0 as Tails.

In [5]:
first_coin_flip = pm.Bernoulli('first_flip', 0.5, size=N)
first_coin_flip.value

Out[5]:
array([False, False, False,  True,  True,  True,  True, False, False,
False, False, False, False,  True, False, False, False, False,
True,  True, False,  True,  True,  True,  True, False,  True,
False, False,  True, False,  True,  True,  True, False,  True,
False, False, False,  True, False, False,  True, False, False,
True,  True, False,  True, False, False,  True, False, False,
True,  True,  True, False, False, False, False,  True,  True,
True,  True, False,  True, False, False, False,  True,  True,
False,  True, False,  True, False,  True,  True, False, False,
False, False, False,  True,  True, False,  True,  True,  True,
False,  True,  True, False, False,  True, False, False, False, False], dtype=bool)

Although not everyone flips a second time, we can still model the possible realization of second coin-flips:

In [6]:
second_coin_flips = pm.Bernoulli('second_flips', 0.5, size=N)


Using these variables, we can return a possible realization of the observed proportion of "Yes" responses. We do this using a PyMC deterministic variable:

In [7]:
@pm.deterministic
fc=first_coin_flip,
sc=second_coin_flips):
observed = fc * t_a + (1 - fc) * sc
return observed.sum() / float(N)


The line fc*t_a + (1 - fc)*sc contains the heart of the coin flip algorithm described above. Elements in this array are 1 if and only if (1) the first toss is Heads and the student cheater, or (2) the first toss is Tails, and the second is Heads, and are else 0. Finally, the last line sums this vector and divides by float(N), produces a proportion.

In [8]:
observed_proportion.value

Out[8]:
0.34999999999999998

Next we need a dataset. After performing our coin-flipped interviews, the researchers received 35 "Yes" responses. To put this into perspective, if there truly were no cheaters, we should expect to see on average 1/4 of all responses being a 'Yes' (half chance of having the first coin land Tails, and another half chance of having the second coin land Heads. So about 25% or 25 responses in a cheat-free world in this scenario. On the other hand if all students cheated, we should expect to see approximately 3/4 of all responses be 'Yes'.

Suppose that the results of the study observe a Binomial random variable, with $N$ = 100 and $p$ = observed_proportion with value = 35:

In [9]:
X = 35

observations = pm.Binomial('obs', N, observed_proportion, observed=True,
value=X)


Below we add all the variables of interest to a model container and run our algorithm over the model.

In [10]:
model = pm.Model([p, true_answers, first_coin_flip,
second_coin_flips, observed_proportion, observations])

# Explained in later notebook
mcmc = pm.MCMC(model)
mcmc.sample(40000, 15000)

 [-----------------100%-----------------] 40000 of 40000 complete in 11.1 sec
In [11]:
p_trace = mcmc.trace('freq_cheating')[:]

plt.hist(p_trace, histtype='stepfilled', normed=True, alpha=0.8, bins=30,
label='posterior distribution', color='#348ABD')
plt.vlines([0.05, 0.35], [0, 0], [5, 5], alpha=0.5)
plt.xlim(0, 1)
plt.legend()

Out[11]:
<matplotlib.legend.Legend at 0x10b656650>

With regards to the plot, we are still pretty uncertain about the true frequency of cheaters, but we have narrowed it down to a range between 0.05 to 0.35. This is still pretty good, since as a priori we had no idea how many students might have cheated (hence the uniform distribution of the prior). On the other hand, it is also pretty bad since there is a 30% range window where the true value lives.

Have we learned anything? Yes, since it is implausible, that there are no cheaters since the posterior assigns very low probability to $p$ = 0. Since we started from a uniform prior, which treats all values of $p$ as equally plausible, yet the data ruled out $p$ = 0 as a possibility. Therefore we can be confident there were some cheaters.

A Simpler PyMC Model¶

Given a value for $p$ (from which a god-like position we know), we can find the probability a student will answer "Yes":

P("Yes")=P(Heads on first coin)P(cheater)+P(Tails on first coin)P(Heads on second coin)

$$= \frac{1}{2}p + \frac{1}{2} \frac{1}{2}$$

$$= \frac{p}{2} + \frac{1}{4}$$

Thus, knowing $p$ allows us to know the probability a student will response "Yes". In PyMC, we can create a deterministic function to evaluate the probability of responding "Yes", given $p$:

In [12]:
p = pm.Uniform('freq_cheating', 0, 1)

@pm.deterministic
def p_skewed(p=p):
return 0.5 * p + 0.25


If we know the probability of respondents saying "Yes", which is p_skewed, and we have $N$ = 100 students, the number of 'Yes' responses is a binomial random variable with parameters $N$ and p_skewed.

This is where we include our observed 35 'Yes' responses. In the declaration of the pm.Binomial, we include a value = 35 and observed = True.

In [13]:
yes_responses =pm.Binomial('number_cheats', 100, p_skewed,
value=35, observed=True)


Below we add all the variables if interest to a model object and run our MCMC algorithm.

In [14]:
model = pm.Model([yes_responses, p_skewed, p])

mcmc = pm.MCMC(model)
mcmc.sample(25000, 2500)

 [-----------------100%-----------------] 25000 of 25000 complete in 1.1 sec
In [15]:
p_trace = mcmc.trace('freq_cheating')[:]

plt.hist(p_trace, histtype='stepfilled', normed=True, alpha=0.8, bins=30,
label='posterior distribution', color='#348ABD')
plt.vlines([0.05, 0.35], [0,0], [5, 5], alpha=0.5)
plt.xlim(0, 1)
plt.legend();


You can imagine that as we tweek the number of observed cheaters and the number of students in the sample, we could become more confident with our results. Here is an example if we had 500 students and 300 reported cheaters, this works out to 60% of students reported a "Yes", compared to 35% in the original example.

In [16]:
yes_responses =pm.Binomial('number_cheats', 500, p_skewed,
value=300, observed=True)

model = pm.Model([yes_responses, p_skewed, p])

mcmc = pm.MCMC(model)
mcmc.sample(25000, 2500)

 [-----------------100%-----------------] 25000 of 25000 complete in 1.3 sec
In [17]:
p_trace = mcmc.trace('freq_cheating')[:]

plt.hist(p_trace, histtype='stepfilled', normed=True, alpha=0.8, bins=30,
label='posterior distribution', color='#348ABD')
plt.vlines([p_trace.mean()+2*p_trace.std(),
p_trace.mean() -2* p_trace.std()],
[0,0], [10, 10], alpha=0.5)
plt.xlim(0, 1)
plt.ylim(0, 13)
plt.legend();


Here we see a much tighter distribution, in-fact, the range of the distribution is only about 9%, and we can be 95% certain that the amount of students cheating lies between ~53 and 69 percent.

In [18]:
print p_trace.mean()+2*p_trace.std()
print p_trace.mean()-2*p_trace.std()
(p_trace.mean()+2*p_trace.std()) - (p_trace.mean() -2* p_trace.std())

0.699838931488
0.534834725277

Out[18]:
0.16500420621173983

More PyMC Tricks¶

Lighter deterministic variables with Lambda class

Sometimes writing a deterministic function using the @pm.deterministic decorator can seem like a chore, especially for a small function. Built in lambda functions can handle this with elegance and simplicity.

beta = pm.Normal('coefficients', 0, size=(N, 1))
x = np.random.randn((N, 1))
linear_combination = pm.Lambda(lambda x=x,
beta=beta:np.dot(x.T, beta))



Arrays of PyMC variables

There is no reason why we cannot store multiple heterogeneous PyMC variables in a Numpy array. Just remember to set the dtype of the array to object upon initialization.

In [19]:
N = 10
x = np.empty(N, dtype=object)
for i in range(0, N):
x[i] = pm.Exponential('x_%i' % i, (i + 1) ** 2)


Extra Example: Challenger Space Shuttle Disaster¶

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend. The data are shown below.

In [20]:
import pandas as pd

np.set_printoptions(precision=3, suppress=True)

In [21]:
# Other import method, something is up with the data formating: fix later
np.set_printoptions(precision=3, suppress=True)
usecols=[1, 2], missing_values="NA",
delimiter=",")

challenger_data = challenger_data[~np.isnan(challenger_data[:, 1])]

In [22]:
# Drop NA values
data = data[data['incident'].notnull()]
print data

          date  temperature             incident
0   04/12/1981           66                    0
1   11/12/1981           70                    1
2      3/22/82           69                    0
4   01/11/1982           68                    0
5   04/04/1983           67                    0
6      6/18/83           72                    0
7      8/30/83           73                    0
8     11/28/83           70                    0
9   02/03/1984           57                    1
10  04/06/1984           63                    1
11     8/30/84           70                    1
12  10/05/1984           78                    0
13  11/08/1984           67                    0
14     1/24/85           53                    1
15  04/12/1985           67                    0
16     4/29/85           75                    0
17     6/17/85           70                    0
18     7/29/85           81                    0
19     8/27/85           76                    0
20  10/03/1985           79                    0
21    10/30/85           75                    1
22    11/26/85           76                    0
23  01/12/1986           58                    1
24     1/28/86           31  Challenger Accident

In [23]:
# Plot the data and remove the actual incident
plt.scatter(data['temperature'][:-1], data['incident'][:-1], s= 75, color='k',
alpha=0.5)
plt.yticks([0, 1])
plt.ylabel('Incident?')
plt.xlabel('Outside Temperature (F)')

Out[23]:
<matplotlib.text.Text at 0x110139a50>

It looks clear that the probability of damage incidents occurring increases as the outside temperature decreases. We are interested in modeling the probability here because it doesn't look like there is a clear cut off point between temperature and damage incidents of the O-ring. We want to answer the question "At temperature $t$, what is the probability of an incident?".

We need a function of temperature, call it $p(t)$, this is bound between 0 and 1, and changes from 1 to 0 as we increase temperature. There are many functions that do this, but the most popular choice is a logistic function.

$$p(t) = \frac{1}{1 + e^{\beta t}}$$

In this model, $\beta$ is the variable we are uncertain about. Below is the function plotted for $\beta$ = 1, 3, and -5.

In [24]:
def logistic(x, beta):
return 1.0 / (1.0 + np.exp(beta * x))

x = np.linspace(-4, 4, 100)
plt.plot(x, logistic(x, 1), label = r"$\beta = 1$")
plt.plot(x, logistic(x, 3), label = r"$\beta = 3$")
plt.plot(x, logistic(x, -5), label = r"$\beta = -5$")
plt.legend();


But something is missing. In the plot of the logistic function, the probability changes only near zero, yet in our data above the probability changes around 65 to 70. We need to add a bias term to our logistic function:

$$p(t) = \frac{1}{1 + e^{\beta t + \alpha}}$$

Some plots below with different $\alpha$.

In [25]:
def logistic(x, beta, alpha=0):
return 1.0 / (1.0 + np.exp(np.dot(beta, x) + alpha))

x = np.linspace(-4, 4, 100)

plt.plot(x, logistic(x, 1), label=r"$\beta = 1$", ls="--", lw=1)
plt.plot(x, logistic(x, 3), label=r"$\beta = 3$", ls="--", lw=1)
plt.plot(x, logistic(x, -5), label=r"$\beta = -5$", ls="--", lw=1)

plt.plot(x, logistic(x, 1, 1), label=r"$\beta = 1, \alpha = 1$",
color="#348ABD")
plt.plot(x, logistic(x, 3, -2), label=r"$\beta = 3, \alpha = -2$",
color="#A60628")
plt.plot(x, logistic(x, -5, 7), label=r"$\beta = -5, \alpha = 7$",
color="#7A68A6")

plt.legend(loc="lower left");


Adding a constant term results in shifting the curve left or right.

Let's start modeling this in PyMC. The $\beta$, $\alpha$ parameters have no reason to be positive, bounded, or relatively large, so they are best modeled by a Normal random variable. Recall that a normal curve has two parameters, the mean $\mu$, and precision, $\tau$.

Note: variance is the reciprical of $\tau$ and is better suited to Bayesian analysis.

Below we continue out modeling of the Challenger space craft:

In [27]:
# temp = data['temperature'][:-1].values
# D = data['incident'][:-1].values

temperature = challenger_data[:, 0]
D = challenger_data[:, 1]  # defect or not?

beta = pm.Normal("beta", 0, 0.001, value=0)
alpha = pm.Normal("alpha", 0, 0.001, value=0)

@pm.deterministic
def p(t = temperature, alpha = alpha, beta = beta):
return 1.0 / (1.0 + np.exp(beta * t + alpha))


We have our probabilities, but how to we connect them to our observed data? A Bernoulli random variable with parameter $p$, denoted Ber(p), is a random variable that takes value 1 with probability $p$, and 0 else. Thus our model looks like:

• Defect Incident, $D_i \sim Ber(p(t_i)), \ \ i = 1, ..., N$

where $p(t)$ is out logistic function and $t_i$ are the temperatures we have observations about. Notice in the above code we set values of beta and alpha to 0. The reason is that if beta and alpha are very large, they make p = 1 or 0. Unfortunately, pm.Beroulli does not like probabilities of exactly 0 or 1, though they are mathematically well-defined probabilities. So by setting the coefficient values to 0, we set the variable p to be a reasonable starting value. This has no effect on our results, nor does it mean we are including any additional information in our prior. It is simply a computational caveat in PyMC.

In [28]:
print p.value

[ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5
0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]

In [29]:
# connect the probabilities in p with our observations through
# a Bernoulli random variable

obs = pm.Bernoulli('bernoulli_obs', p, value = D, observed = True)
model = pm.Model([obs, beta, alpha])

map_ = pm.MAP(model)
map_.fit()
mcmc = pm.MCMC(model)
mcmc.sample(120000, 100000, 2)

 [-----------------100%-----------------] 120000 of 120000 complete in 8.9 sec

We have trained our model on the observed data, now we can sample values from the posterior. Let's look at the posterior distributions for $\alpha$ and $\beta$:

In [30]:
alpha_samples = mcmc.trace('alpha')[:, None]  # best to make them 1d
beta_samples = mcmc.trace('beta')[:, None]

# histogram of the samples:
plt.subplot(211)
plt.title(r"Posterior distributions of the variables $\alpha, \beta$")
plt.hist(beta_samples, histtype='stepfilled', bins=35, alpha=0.85,
label=r"posterior of $\beta$", color="#7A68A6", normed=True)
plt.legend()

plt.subplot(212)
plt.hist(alpha_samples, histtype='stepfilled', bins=35, alpha=0.85,
label=r"posterior of $\alpha$", color="#A60628", normed=True)
plt.legend();


All samples of $\beta$ are greater than 0. If instead the posterior was centered around 0, we might suspect that $\beta$ = 0, implying that temperature has no effect on the probability of defect.

Similarly, all $\alpha$ posterior values are negative and far away from 0, implying that it is correct to believe $\alpha$ is significantly less than 0.

Regarding the spread of the data, we are very uncertain what the true parameters might be (this is to be expected with the small sample size).

Next let's take a look at the expected probability for a specific value of the temperature. That is, we average over all samples from the posterior to get a likely value for $p(t_i)$.

In [33]:
t = np.linspace(temperature.min() - 5, temperature.max() + 5, 50)[:, None]
p_t = logistic(t.T, beta_samples, alpha_samples)

mean_prob_t = p_t.mean(axis=0)

In [36]:
from IPython.core.pylabtools import figsize
# figsize(18, 5)

plt.plot(t, mean_prob_t, lw=3, label="average posterior \nprobability \
of defect")
plt.plot(t, p_t[0, :], ls="--", label="realization from posterior")
plt.plot(t, p_t[-2, :], ls="--", label="realization from posterior")
plt.scatter(temperature, D, color="k", s=50, alpha=0.5)
plt.title("Posterior expected value of probability of defect; \
plus realizations")
plt.legend(loc="lower left")
plt.ylim(-0.1, 1.1)
plt.xlim(t.min(), t.max())
plt.ylabel("probability")
plt.xlabel("temperature");


Above we plotted two possible realizations of what the actually underlying system might be. Both are equally likely as any other draw. The blue like is what occurs when we average all the 20000 dotted lines together.

An interesting question to ask is for what temperatures are most uncertain above the defect-probability? Below we plot the expected value and the associated 95% intervals for each temperature.

In [40]:
from scipy.stats.mstats import mquantiles

# vectorized bottom and top 2.5% quantiles for 'confidence interval'
qs = mquantiles(p_t, [0.025, 0.975], axis = 0)
plt.fill_between(t[:,0], *qs, alpha=0.7, color='#7A68A6')

plt.plot(t[:, 0], qs[0], label="95% CI", color="#7A68A6", alpha=0.7)

plt.plot(t, mean_prob_t, lw=1, ls="--", color="k",
label="average posterior \nprobability of defect")

plt.xlim(t.min(), t.max())
plt.ylim(-0.02, 1.02)
plt.legend(loc="lower left")
plt.scatter(temperature, D, color="k", s=50, alpha=0.5)
plt.xlabel("temp, $t$")

plt.ylabel("probability estimate")
plt.title("Posterior probability estimates given temp. $t$");


The 95% confidence interval in purple represents the interval, for each temperature, that contains 95% of the distribution. For example, at 70 degrees, we can be 95% certain that the probability of defect lies between 0.13 and 0.42.

This data gives us a good idea how to proceed. Ideally we should test more )-rings around the 60-65 degree temperature to get a better estimate of probabilities in this 'transition' range. It is always good to report your confidence intervals as well as just the reported estimates.

What about the day of the Challenger disaster¶

On the day of the Challenger disaster, the outside temperature was 31 degrees. What is the posterior distribution of a defect occuring, given this temperature? It is possible to see this in the plot below.

In [41]:
prob_31 = logistic(31, beta_samples, alpha_samples)

plt.xlim(0.995, 1)
plt.hist(prob_31, bins=10000, normed=True, histtype='stepfilled')
plt.title('Posterior distribution of a probability of a defect @ 31 degrees')
plt.xlabel('probability of a defect occurring in O-ring');


Turns out we can predict that it was almost guaranteed that a defect would occur at this temperature.

Is this model appropriate?¶

You might be wondering, is the logistic function for $p(t)$, and the specific priors we chose actually the best method? There are many ways to measure goodness of fit. One could simply apply a cutoff to the predicted probabilities in the fitted logistic function. A probability > 50% would indicate a 1, and <= 50% indicates a 0. Then we can simply test the proportion of predictions that match the observed data.

In [1]:
from IPython.core.display import HTML

def css_styling():

return HTML(styles)
css_styling()

Out[1]:
In [2]:
def social():
code = """