#!/usr/bin/env python
# coding: utf-8

# # Distributions
#
# Think Bayes, Second Edition
#
# Copyright 2020 Allen B. Downey
#
# License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# In[1]:


# If we're running on Colab, install empiricaldist
# https://pypi.org/project/empiricaldist/

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    get_ipython().system('pip install empiricaldist')


# In[2]:


# Get utils.py

import os

if not os.path.exists('utils.py'):
    get_ipython().system('wget https://github.com/AllenDowney/ThinkBayes2/raw/master/code/soln/utils.py')


# ## Introduction

# ## Distributions
#
# In statistics a **distribution** is a set of values and their
# corresponding probabilities.
#
# For example, if you toss a coin, there are two possible outcomes with
# approximately equal probabilities.
#
# If you roll a six-sided die, the set of possible values is the numbers 1
# to 6, and the probability associated with each value is 1/6.
#
# To represent distributions, we'll use a library called `empiricaldist`.
# An "empirical" distribution is based on data, as opposed to a
# theoretical distribution.
#
# This library provides a class called `Pmf`, which represents a
# **probability mass function**.
#
# `empiricaldist` is available from the Python Package Index (PyPI). You
# can [download it here](https://pypi.org/project/empiricaldist/) or
# install it with `pip`. For more details, see the Preface.
#
# To use `Pmf` you can import it like this:

# In[3]:


from empiricaldist import Pmf


# The following example makes a `Pmf` that represents the outcome of a
# coin toss.

# In[4]:


coin = Pmf()
coin['heads'] = 1/2
coin['tails'] = 1/2
coin


# The two outcomes have the same probability, $0.5$.
#
# This example makes a `Pmf` that represents the distribution of outcomes
# of a six-sided die:

# In[5]:


die = Pmf()
for x in [1,2,3,4,5,6]:
    die[x] = 1

die


# `Pmf()` creates an empty `Pmf` with no values. The `for` loop adds the
# values $1$ through $6$, each with "probability" $1$.
#
# In this `Pmf`, the probabilities don't add up to 1, so they are not
# really probabilities. We can use `normalize` to make them add up to 1.

# In[6]:


die.normalize()


# The return value from `normalize` is the sum of the probabilities before normalizing.
#
# Now we can see that the total is 1 (at least within floating-point error).

# In[7]:


die


# In[8]:


die.sum()


# Another way to make a `Pmf` is to provide a sequence of values.
#
# In this example, every value appears once, so they all have the same probability.

# In[9]:


die = Pmf.from_seq([1,2,3,4,5,6])
die


# More generally, values can appear more than once, as in this example:

# In[10]:


letters = Pmf.from_seq(list('Mississippi'))
letters


# The `Pmf` class inherits from a Pandas `Series`, so anything you can do
# with a `Series`, you can also do with a `Pmf`.
#
# For example, you can use the bracket operator to look up a value and get
# the corresponding probability.

# In[11]:


letters['s']


# In the word "Mississippi", about 36% of the letters are "s".
#
# However, if you ask for the probability of a value that's not in the
# distribution, you get a `KeyError`.

# In[12]:


try:
    letters['t']
except KeyError as e:
    print('KeyError')
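# Because `Pmf` inherits from a Pandas `Series`, one way to avoid the
# `KeyError` (a side note of mine, not a cell from the book) is the
# `Series.get` method, which takes a default value to return instead of
# raising an error:

# In[ ]:


# returns the default, 0, because 't' is not in the distribution
letters.get('t', 0)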
# You can also call a `Pmf` as if it were a function, with a value in parentheses.

# In[13]:


letters('s')


# If the value is in the distribution, the result is the same.
# But if the value is not in the distribution, the result is $0$, not an error.

# In[14]:


letters('t')


# With parentheses, you can also provide a sequence of values and get a sequence of probabilities.

# In[15]:


die([1,4,7])


# As these examples show, the values in a `Pmf` can be integers or strings. In general, they can be any type that can be stored in the index of a Pandas `Series`.
#
# If you are familiar with Pandas, that will help you work with `Pmf` objects. But I will explain what you need to know as we go along.

# ## The cookie problem
#
# In this section I'll use a `Pmf` to solve the cookie problem from Section XX.
# Here's the statement of the problem again:
#
# > Suppose there are two bowls of cookies.
# > Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies.
# > Bowl 2 contains 20 of each.
# >
# > Now suppose you choose one of the bowls at random and, without
# > looking, select a cookie at random. The cookie is vanilla. What is
# > the probability that it came from Bowl 1?
#
# Here's a `Pmf` that represents the two hypotheses and their prior probabilities:

# In[16]:


prior = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
prior


# This distribution, which contains the prior probability for each hypothesis, is called (wait for it) the **prior distribution**.
#
# To update the distribution based on new data (the vanilla cookie),
# we multiply the priors by the likelihoods. The likelihood
# of drawing a vanilla cookie from Bowl 1 is `3/4`. The likelihood
# for Bowl 2 is `1/2`.

# In[17]:


likelihood_vanilla = [0.75, 0.5]
posterior = prior * likelihood_vanilla
posterior


# The result is the unnormalized posterior distribution.
# We can use `normalize` to compute the posterior probabilities:

# In[18]:


posterior.normalize()


# The return value from `normalize` is the total probability of the data, which is $5/8$.
#
# `posterior`, which contains the posterior probability for each hypothesis, is called (wait for it) the **posterior distribution**.

# In[19]:


posterior


# From the posterior distribution we can select the posterior probability for Bowl 1:

# In[20]:


posterior('Bowl 1')


# And the answer is 0.6.
#
# One benefit of using `Pmf` objects is that it is easy to do successive updates with more data.
# For example, suppose you put the first cookie back (so the contents of the bowls don't change) and draw again from the same bowl.
# If the second cookie is also vanilla, we can do a second update like this:

# In[21]:


posterior *= likelihood_vanilla
posterior.normalize()
posterior


# Now the posterior probability for Bowl 1 is almost 70%.
# But suppose we do the same thing again and get a chocolate cookie.
# Here's the update.

# In[22]:


likelihood_chocolate = [0.25, 0.5]
posterior *= likelihood_chocolate
posterior.normalize()
posterior


# Now the posterior probability for Bowl 1 is about 53%.
# After two vanilla cookies and one chocolate, the posterior probabilities are close to 50/50.
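# Because each update is just "multiply by the likelihoods, then
# normalize", the whole sequence can be written as a loop. This is a
# minimal sketch of my own (the `likelihoods` dictionary is not defined
# in the book); it reproduces the result above:

# In[ ]:


# map each flavor to the likelihoods for Bowl 1 and Bowl 2
likelihoods = {
    'vanilla': [0.75, 0.5],
    'chocolate': [0.25, 0.5],
}

pmf = Pmf.from_seq(['Bowl 1', 'Bowl 2'])
for flavor in ['vanilla', 'vanilla', 'chocolate']:
    pmf *= likelihoods[flavor]
    pmf.normalize()

pmf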
# ## 101 Bowls
#
# Next let's solve a cookie problem with 101 bowls:
#
# * Bowl 0 contains no vanilla cookies,
#
# * Bowl 1 contains 1% vanilla cookies,
#
# * Bowl 2 contains 2% vanilla cookies,
#
# and so on, up to
#
# * Bowl 99 contains 99% vanilla cookies, and
#
# * Bowl 100 contains all vanilla cookies.
#
# As in the previous version, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99% chocolate, and so on.
#
# Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl $x$, for each value of $x$?
#
# To solve this problem, I'll use `np.arange` to represent 101 hypotheses, numbered from 0 to 100.

# In[23]:


import numpy as np

hypos = np.arange(101)


# The result is a NumPy array, which we can use to make the prior distribution:

# In[24]:


prior = Pmf(1, hypos)
prior.normalize()


# As this example shows, we can initialize a `Pmf` with two parameters.
# The first parameter is the prior probability; the second parameter is a sequence of values.
#
# In this example, the probabilities are all the same, so we only have to provide one of them; it gets "broadcast" across the hypotheses.
#
# Since all hypotheses have the same prior probability, this distribution is **uniform**.
#
# The likelihood of the data is the fraction of vanilla cookies in each bowl, which we can calculate using `hypos`:

# In[25]:


likelihood_vanilla = hypos/100


# Now we can compute the posterior distribution in the usual way:

# In[26]:


posterior1 = prior * likelihood_vanilla
posterior1.normalize()


# The following figure shows the prior distribution and the posterior distribution after one vanilla cookie.

# In[27]:


from utils import decorate, savefig

def decorate_bowls(title):
    decorate(xlabel='Bowl #',
             ylabel='PMF',
             title=title)


# In[28]:


prior.plot(label='prior', color='gray')
posterior1.plot(label='posterior')
decorate_bowls('Posterior after one vanilla cookie')


# The posterior probability of Bowl 0 is 0 because it contains no vanilla cookies.
# The posterior probability of Bowl 100 is the highest because it contains the most vanilla cookies.
# In between, the shape of the posterior distribution is a line because the likelihoods are proportional to the bowl numbers.
#
# Now suppose we put the cookie back, draw again from the same bowl, and get another vanilla cookie.
# Here's the update after the second cookie:

# In[29]:


posterior2 = posterior1 * likelihood_vanilla
posterior2.normalize()


# And here's what the posterior distribution looks like.

# In[30]:


posterior2.plot(label='posterior')
decorate_bowls('Posterior after two vanilla cookies')


# The shape has changed: because the likelihood function is a line, multiplying by it twice makes the posterior after two cookies a parabola.
# At this point the high-numbered bowls are the most likely because they contain the most vanilla cookies, and the low-numbered bowls have been all but eliminated.
#
# Now suppose we draw again and get a chocolate cookie.
# Here's the update:

# In[31]:


likelihood_chocolate = 1 - hypos/100

posterior3 = posterior2 * likelihood_chocolate
posterior3.normalize()


# And here's the posterior distribution.

# In[32]:


posterior3.plot(label='posterior')
decorate_bowls('Posterior after 2 vanilla, 1 chocolate')


# Now Bowl 100 has been eliminated because it contains no chocolate cookies.
# But the high-numbered bowls are still more likely than the low-numbered bowls, because we have seen more vanilla cookies than chocolate.
#
# In fact, the peak of the posterior distribution is at Bowl 67, which corresponds to the fraction of vanilla cookies in the data we've observed, $2/3$.
#
# The quantity with the highest posterior probability is called the **MAP**, which stands for "maximum a posteriori probability", where "a posteriori" is unnecessary Latin for "posterior".
#
# To compute the MAP, we can use the `Series` method `idxmax`:

# In[33]:


posterior3.idxmax()


# Or `Pmf` provides a more memorable name for the same thing:

# In[34]:


posterior3.max_prob()
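# As a quick check (an addition of mine, not a cell from the book), the
# MAP agrees with the observed fraction of vanilla cookies, $2/3$, scaled
# to the 0-100 bowl numbering:

# In[ ]:


# 2 of the 3 cookies were vanilla; the nearest bowl number is 67
round(2 / 3 * 100)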
# As you might suspect, this example isn't really about bowls; it's about estimating proportions.
# Imagine that you have one bowl of cookies.
# You don't know what fraction of cookies are vanilla, but you think it is equally likely to be any fraction from 0 to 1.
# If you draw three cookies and two are vanilla, what proportion of cookies in the bowl do you think are vanilla?
# The posterior distribution we just computed is the answer to that question.
#
# We'll come back to estimating proportions in the next chapter.
# But first let's use a `Pmf` to solve the dice problem.

# Here's a figure for the book.

# In[35]:


import matplotlib.pyplot as plt

plt.figure(figsize=(4, 6))

plt.subplot(311)
prior.plot(label='prior', color='gray')
posterior1.plot(label='1 vanilla', color='C0')
plt.ylabel('PMF')
plt.title('101 Bowls')
plt.legend()

plt.subplot(312)
posterior2.plot(label='2 vanilla', color='C1')
plt.ylabel('PMF')
plt.legend()

plt.subplot(313)
posterior3.plot(label='2 vanilla, 1 chocolate', color='C2')
decorate_bowls('')

savefig('fig02-01')


# ## The dice problem
#
# In the previous chapter we solved the dice problem using a Bayes table.
# Here's the statement of the problem again:
#
# > Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
# > I choose one of the dice at random, roll it, and report that the outcome is a 1.
# > What is the probability that I chose the 6-sided die?
#
# Let's solve it again using a `Pmf`.
# I'll use integers to represent the hypotheses:

# In[36]:


hypos = [6, 8, 12]


# And I can make the prior distribution like this:

# In[37]:


prior = Pmf(1/3, hypos)
prior


# As in the previous example, the prior probability gets broadcast across the hypotheses.
#
# Now we can compute the likelihood of the data:

# In[38]:


likelihood1 = 1/6, 1/8, 1/12


# And use it to compute the posterior distribution.

# In[39]:


posterior = prior * likelihood1
posterior.normalize()
posterior


# The posterior probability for the 6-sided die is $4/9$.
#
# Now suppose I roll the same die again and get a $7$.
# Here are the likelihoods:

# In[40]:


likelihood2 = 0, 1/8, 1/12


# The likelihood for the 6-sided die is $0$ because it is not possible to get a 7 on a 6-sided die.
# The other two likelihoods are the same as in the previous update.
#
# Now we can do the update in the usual way:

# In[41]:


posterior *= likelihood2
posterior.normalize()
posterior


# After rolling a 1 and a 7, the posterior probability of the 8-sided die is about 69%.
#
# One note about the `Pmf` class: if you multiply a `Pmf` by a sequence, the result is a `Pmf`:

# In[42]:


type(prior * likelihood1)


# If you do it the other way around, the result is a `Series`:

# In[43]:


type(likelihood1 * prior)


# We usually want the posterior distribution to be a `Pmf`, so I usually put the prior first. But even if we do it the other way, we can use `Pmf` to convert the result to a `Pmf`:

# In[44]:


Pmf(likelihood1 * prior)


# ## Updating dice
#
# The following function is a more general version of the update in the previous section:

# In[45]:


def update_dice(pmf, data):
    """Update a pmf based on new data.

    pmf: Pmf of possible dice and their probabilities
    data: integer outcome
    """
    hypos = pmf.qs
    likelihood = 1 / hypos
    impossible = (data > hypos)
    likelihood[impossible] = 0
    pmf *= likelihood
    pmf.normalize()


# The first parameter is a `Pmf` that represents the possible dice and their probabilities.
# The second parameter is the outcome of rolling a die.
#
# The first line selects `qs` from the `Pmf`, which is the index of the `Series`; in this example, it represents the hypotheses.
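# For example (a quick illustration of mine, not a cell from the book),
# for the prior we defined above, `qs` holds the three die sizes:

# In[ ]:


prior.qs   # the quantities are the die sizes: 6, 8, and 12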
# Since the hypotheses are integers, we can use them to compute the likelihoods.
# In general, if there are `n` sides on the die, the probability of any possible outcome is `1/n`.
#
# However, we have to check for impossible outcomes!
# If the outcome exceeds the hypothetical number of sides on the die, the probability of that outcome is $0$.
#
# `impossible` is a Boolean array that is `True` for each impossible die.
# I use it as an index into `likelihood` to set the corresponding probabilities to $0$.
#
# Finally, I multiply `pmf` by the likelihoods and normalize.
#
# Here's how we can use this function to compute the updates in the previous section.
# I start with a fresh copy of the prior distribution:

# In[46]:


pmf = prior.copy()
pmf


# And use `update_dice` to do the updates.

# In[47]:


update_dice(pmf, 1)
update_dice(pmf, 7)
pmf


# The result is the same.

# ## Summary
#
# This chapter introduces the `empiricaldist` module and its `Pmf` class, which we use to represent a set of hypotheses and their probabilities.
#
# We used a `Pmf` to solve the cookie problem and the dice problem, which we saw in the previous chapter.
# With a `Pmf` it is easy to perform sequential updates as we see multiple pieces of data.
#
# We also solved a more general version of the cookie problem, with 101 bowls rather than two.
# Then we computed the MAP, which is the quantity with the highest posterior probability.
#
# In the next chapter we'll come back to the problem of estimating proportions.
#
# But first you might want to work on the exercises.

# ## Exercises

# **Exercise:** Suppose I have a box with a 6-sided die, an 8-sided die, and a 12-sided die.
# I choose one of the dice at random, roll it four times, and get 1, 3, 5, and 7.
# What is the probability that I chose the 8-sided die?

# In[48]:


# Solution

pmf = prior.copy()
for data in [1, 3, 5, 7]:
    update_dice(pmf, data)

pmf


# **Exercise:** In the previous version of the dice problem, the prior probabilities are the same because the box contains one of each die.
# But suppose the box contains 1 die that is 4-sided, 2 dice that are 6-sided, 3 dice that are 8-sided, 4 dice that are 12-sided, and 5 dice that are 20-sided.
# I choose a die, roll it, and get a 7.
# What is the probability that I chose an 8-sided die?

# In[49]:


# Solution

ps = [1,2,3,4,5]
qs = [4,6,8,12,20]
pmf = Pmf(ps, qs)
update_dice(pmf, 7)
pmf


# **Exercise:** Suppose I have two sock drawers.
# One contains equal numbers of black and white socks.
# The other contains equal numbers of red, green, and blue socks.
# Suppose I choose a drawer at random, choose two socks at random, and I tell you that I got a matching pair.
# What is the probability that the socks are white?
#
# For simplicity, let's assume that there are so many socks in both drawers that removing one sock makes a negligible change to the proportions.

# In[50]:


# Solution

# In the BlackWhite drawer, the probability of getting a match is 1/2
# In the RedGreenBlue drawer, the probability of a match is 1/3

hypos = ['BlackWhite', 'RedGreenBlue']
prior = Pmf(1/2, hypos)
likelihood = 1/2, 1/3
posterior = prior * likelihood
posterior.normalize()
posterior


# In[51]:


# Solution

# If I drew from the BlackWhite drawer, the probability the
# socks are white is 1/2

posterior['BlackWhite'] / 2
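# As a sanity check (an addition of mine, not a cell from the book), we
# can get the same answer, 0.3, by enumerating the joint probabilities of
# the three ways to draw a matching pair:

# In[ ]:


p_white = 1/2 * (1/2 * 1/2)         # BlackWhite drawer, both socks white
p_black = 1/2 * (1/2 * 1/2)         # BlackWhite drawer, both socks black
p_same_rgb = 1/2 * 3 * (1/3 * 1/3)  # RedGreenBlue drawer, both socks the same color

p_white / (p_white + p_black + p_same_rgb)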
# **Exercise:** Here's a problem from [Bayesian Data Analysis](http://www.stat.columbia.edu/~gelman/book/):
#
# > Elvis Presley had a twin brother (who died at birth). What is the probability that Elvis was an identical twin?
#
# Hint: In 1935, about 2/3 of twins were fraternal and 1/3 were identical.

# In[52]:


# Solution

# The trick to this question is to notice that Elvis's twin was a brother.
# If they were identical twins, it is certain they would be the same sex.
# If they were fraternal twins, the likelihood is only 50%.

# Here's a solution using a Bayes table

import pandas as pd

table = pd.DataFrame(index=['identical', 'fraternal'])
table['prior'] = 1/3, 2/3
table['likelihood'] = 1, 1/2
table['unnorm'] = table['prior'] * table['likelihood']
prob_data = table['unnorm'].sum()
table['posterior'] = table['unnorm'] / prob_data
table


# In[54]:


# Solution

# Here's a solution using a Pmf

hypos = ['identical', 'fraternal']
prior = Pmf([1/3, 2/3], hypos)
prior


# In[55]:


# Solution

likelihood = 1, 1/2
posterior = prior * likelihood
posterior.normalize()
posterior
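# As a direct check (an addition of mine, not a cell from the book),
# Bayes's theorem with prior probabilities 1/3 and 2/3 and likelihoods
# 1 and 1/2 gives the same posterior probability that Elvis was an
# identical twin:

# In[ ]:


(1/3 * 1) / (1/3 * 1 + 2/3 * 1/2)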