#!/usr/bin/env python
# coding: utf-8

# # Probability Mass Functions
# 
# This notebook is part of [Bite Size Bayes](https://allendowney.github.io/BiteSizeBayes/), an introduction to probability and Bayesian statistics using Python.
# 
# Copyright 2020 Allen B. Downey
# 
# License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# The following cell downloads `utils.py`, which contains some utility functions we'll need.

# In[1]:

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/utils.py')

# If everything we need is installed, the following cell should run with no error messages.

# In[3]:

import pandas as pd
import numpy as np

# ## Review
# 
# [In the previous notebook](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/05_test.ipynb) we used a Bayes table to interpret medical tests.
# 
# In this notebook we'll solve an expanded version of the cookie problem with 101 bowls. It might seem like a silly problem, but it's not: the solution demonstrates a Bayesian way to estimate a proportion, and it applies to lots of real problems that don't involve cookies.
# 
# Then I'll introduce an alternative to the Bayes table, a probability mass function (PMF), which is a useful way to represent and do computations with distributions.
# 
# Here's the function from the previous notebook that we'll use to make Bayes tables:

# In[1]:

import pandas as pd

def make_bayes_table(hypos, prior, likelihood):
    """Make a Bayes table.
    
    hypos: sequence of hypotheses
    prior: prior probabilities
    likelihood: sequence of likelihoods
    
    returns: DataFrame
    """
    table = pd.DataFrame(index=hypos)
    table['prior'] = prior
    table['likelihood'] = likelihood
    table['unnorm'] = table['prior'] * table['likelihood']
    prob_data = table['unnorm'].sum()
    table['posterior'] = table['unnorm'] / prob_data
    return table

# ## 101 Bowls
# 
# In [Notebook 4](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/04_dice.ipynb), we saw that the Bayes table works with more than two hypotheses. As an example, we solved a cookie problem with five bowls.
# 
# Now we'll take it even further and solve a cookie problem with 101 bowls:
# 
# * Bowl 0 contains no vanilla cookies,
# 
# * Bowl 1 contains 1% vanilla cookies,
# 
# * Bowl 2 contains 2% vanilla cookies,
# 
# and so on, up to
# 
# * Bowl 99 contains 99% vanilla cookies, and
# 
# * Bowl 100 contains all vanilla cookies.
# 
# As in the previous problems, there are only two kinds of cookies, vanilla and chocolate. So Bowl 0 is all chocolate cookies, Bowl 1 is 99% chocolate, and so on.
# 
# Suppose we choose a bowl at random, choose a cookie at random, and it turns out to be vanilla. What is the probability that the cookie came from Bowl $x$, for each value of $x$?
# 
# To solve this problem, I'll use `np.arange` to represent 101 hypotheses, numbered from 0 to 100.

# In[2]:

import numpy as np

xs = np.arange(101)

# The prior probability for each bowl is $1/101$. I could create a sequence with 101 identical values, but if all of the priors are equal, we only have to provide one value:

# In[3]:

prior = 1/101

# Because of the way I numbered the bowls, the probability of a vanilla cookie from Bowl $x$ is $x/100$. So we can compute the likelihoods like this:

# In[4]:

likelihood = xs/100
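# Before handing these to the Bayes table, we can check the arithmetic by hand. This is a minimal sanity check, not part of the solution; the names `unnorm` and `posterior_check` are made up for this sketch.

# In[ ]:

unnorm = prior * likelihood              # the scalar prior broadcasts across all 101 bowls
posterior_check = unnorm / unnorm.sum()  # the constant prior cancels when we normalize
posterior_check.sum()                    # should be 1.0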
# And that's all we need; the Bayes table does the rest:

# In[5]:

table = make_bayes_table(xs, prior, likelihood)

# Here's a feature we have not seen before: we can give the index of the Bayes table a name, which will appear when we display the table.

# In[6]:

table.index.name = 'Bowl'

# Here are the first few rows:

# In[7]:

table.head()

# Because Bowl 0 contains no vanilla cookies, its likelihood is 0, so its posterior probability is 0. That is, the cookie cannot have come from Bowl 0.
# 
# Here are the last few rows of the table.

# In[8]:

table.tail()

# The posterior probabilities are substantially higher for the high-numbered bowls.
# 
# There is a pattern here that will be clearer if we plot the results.

# In[9]:

import matplotlib.pyplot as plt

def plot_table(table):
    """Plot results from the 101 Bowls problem.
    
    table: DataFrame representing a Bayes table
    """
    table['prior'].plot()
    table['posterior'].plot()
    plt.xlabel('Bowl #')
    plt.ylabel('Probability')
    plt.legend()

# In[10]:

plot_table(table)
plt.title('One cookie');

# The prior probabilities are uniform; that is, they are the same for every bowl.
# 
# The posterior probabilities increase linearly; Bowl 0 is the least likely (actually impossible), and Bowl 100 is the most likely.

# ## Two cookies
# 
# Suppose we put the first cookie back, stir the bowl thoroughly, draw another cookie from the same bowl, and it turns out to be another vanilla cookie.
# 
# Now what is the probability that we are drawing from Bowl $x$?
# 
# To answer this question, we can use the posterior probabilities from the previous problem as prior probabilities for a new Bayes table, and then update with the new data.

# In[11]:

prior2 = table['posterior']
likelihood2 = likelihood

table2 = make_bayes_table(xs, prior2, likelihood2)
plot_table(table2)
plt.title('Two cookies');

# The blue line shows the posterior after one cookie, which is the prior before the second cookie.
# 
# The orange line shows the posterior after two cookies, which curves upward. Having seen two vanilla cookies, the high-numbered bowls are more likely; the low-numbered bowls are less likely.
# 
# I bet you can guess what's coming next.

# ## Three cookies
# 
# Suppose we put the cookie back, stir, draw another cookie from the same bowl, and get a chocolate cookie.
# 
# What do you think the posterior distribution looks like after these three cookies?
# 
# Hint: what's the probability that the chocolate cookie came from Bowl 100?
# 
# We'll use the posterior after two cookies as the prior for the third cookie:

# In[12]:

prior3 = table2['posterior']

# Now, what about the likelihoods? Remember that the probability of a vanilla cookie from Bowl $x$ is $x/100$. So the probability of a chocolate cookie is $(1 - x/100)$, which we can compute like this:

# In[13]:

likelihood3 = 1 - xs/100

# That's it. Everything else is the same.

# In[14]:

table3 = make_bayes_table(xs, prior3, likelihood3)

# And here are the results:

# In[15]:

plot_table(table3)
plt.title('Three cookies');

# The blue line is the posterior after two cookies; the orange line is the posterior after three cookies.
# 
# Because Bowl 100 contains no chocolate cookies, the posterior probability for Bowl 100 is 0.
# 
# The posterior distribution has a peak near two-thirds. We can use `idxmax` to find it:

# In[16]:

table3['posterior'].idxmax()

# The peak in the posterior distribution is at 67%.
# 
# This value has a name; it is the **MAP**, which stands for "maximum a posteriori probability" ("a posteriori" is Latin for "from what comes after").
# 
# In this example, the MAP is close to the proportion of vanilla cookies in the dataset: 2/3.
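# Because the normalizing constant cancels at each step, doing the three updates one at a time gives the same posterior as a single update whose likelihood is the product of the three per-cookie likelihoods. Here's a sketch of that check; the names `likelihood_combined` and `table_combined` are made up for this example.

# In[ ]:

likelihood_combined = (xs/100)**2 * (1 - xs/100)    # vanilla, vanilla, chocolate
table_combined = make_bayes_table(xs, 1/101, likelihood_combined)
np.allclose(table_combined['posterior'], table3['posterior'])   # should be True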
# **Exercise:** Let's do a version of the dice problem where we roll the die more than once. Here's the statement of the problem again:
# 
# > Suppose you have a 4-sided die, a 6-sided die, an 8-sided die, and a 12-sided die. You choose one at random, roll it and get a 1. What is the probability that the die you rolled is 4-sided? What are the posterior probabilities for the other dice?
# 
# And here's a solution using a Bayes table:

# In[17]:

hypos = ['H4', 'H6', 'H8', 'H12']
prior = 1/4
likelihood = 1/4, 1/6, 1/8, 1/12

table = make_bayes_table(hypos, prior, likelihood)
table

# Now suppose you roll the same die again and get a 6. What are the posterior probabilities after the second roll?
# 
# Use `idxmax` to find the MAP.

# In[18]:

# Solution goes here

# In[19]:

# Solution goes here

# ## Probability Mass Functions
# 
# When we do more than one update, we don't always want to keep the whole Bayes table. In this section we'll replace the Bayes table with a more compact representation, a probability mass function, or PMF.
# 
# A PMF is a set of possible outcomes and their corresponding probabilities. There are many ways to represent a PMF; in this notebook I'll use a Pandas Series.
# 
# Here's a function that takes a sequence of outcomes, `xs`, and a sequence of probabilities, `ps`, and returns a Pandas Series that represents a PMF.

# In[20]:

def make_pmf(xs, ps, **options):
    """Make a Series that represents a PMF.
    
    xs: sequence of values
    ps: sequence of probabilities
    options: keyword arguments passed to Series constructor
    
    returns: Pandas Series
    """
    pmf = pd.Series(ps, index=xs, **options)
    return pmf

# And here's a PMF that represents the prior from the 101 Bowls problem.

# In[21]:

xs = np.arange(101)
prior = 1/101

pmf = make_pmf(xs, prior)
pmf.head()

# Now that we have a prior, we need to compute likelihoods.
# 
# Here are the likelihoods for a vanilla cookie:

# In[22]:

likelihood_vanilla = xs / 100

# And for a chocolate cookie:

# In[23]:

likelihood_chocolate = 1 - xs / 100

# To compute posterior probabilities, I'll use the following function, which takes a PMF and a sequence of likelihoods, and updates the PMF:

# In[24]:

def bayes_update(pmf, likelihood):
    """Do a Bayesian update.
    
    pmf: Series that represents the prior
    likelihood: sequence of likelihoods
    """
    pmf *= likelihood
    pmf /= pmf.sum()

# The steps here are the same as in the Bayes table:
# 
# 1. Multiply the prior by the likelihoods.
# 
# 2. Add up the products to get the total probability of the data.
# 
# 3. Divide through to normalize the posteriors.

# Now we can do the update for a vanilla cookie.

# In[25]:

bayes_update(pmf, likelihood_vanilla)

# Here's what the PMF looks like after the update.

# In[26]:

pmf.plot()

plt.xlabel('Bowl #')
plt.ylabel('Probability')
plt.title('One cookie');

# That's consistent with what we got with the Bayes table.
# 
# The advantage of using a PMF is that it is easier to do multiple updates.
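# One design note: `bayes_update` uses `*=` and `/=`, so it modifies the Series in place and the prior is overwritten by the posterior. If you want to keep the prior around for comparison, copy it first. This is a minimal sketch; the names `pmf_before` and `pmf_after` are made up for this example.

# In[ ]:

pmf_before = make_pmf(xs, 1/101)
pmf_after = pmf_before.copy()            # update the copy, keep the prior intact
bayes_update(pmf_after, likelihood_vanilla)
pmf_before.equals(pmf_after)             # False: only the copy was modified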
# The following cell starts again with the uniform prior and does updates with two vanilla cookies and one chocolate cookie:

# In[27]:

data = 'VVC'

pmf = make_pmf(xs, prior)

for cookie in data:
    if cookie == 'V':
        bayes_update(pmf, likelihood_vanilla)
    else:
        bayes_update(pmf, likelihood_chocolate)

# Here's what the results look like:

# In[28]:

pmf.plot()

plt.xlabel('Bowl #')
plt.ylabel('Probability')
plt.title('Three cookies');

# Again, that's consistent with what we got with the Bayes table.
# 
# In the next section, I'll use a PMF and `bayes_update` to solve a dice problem.

# ## The dice problem
# 
# As an exercise, let's do one more version of the dice problem:
# 
# > Suppose you have a 4-sided, 6-sided, 8-sided, 12-sided, and **20-sided die**. You choose one at random, roll it and **get a 7**. What is the probability that the die you rolled is 4-sided? What are the posterior probabilities for the other dice?
# 
# Notice that in this version I've added a 20-sided die and the outcome is 7, not 1.
# 
# Here's a PMF that represents the prior:

# In[29]:

sides = np.array([4, 6, 8, 12, 20])
prior = 1/5

pmf = make_pmf(sides, prior)
pmf

# In this version, the hypotheses are integers rather than strings, so we can compute the likelihoods like this:

# In[30]:

likelihood = 1 / sides

# But the outcome is 7, so any die with fewer than 7 sides has likelihood 0.
# 
# We can adjust `likelihood` by making a Boolean array:

# In[31]:

too_low = (sides < 7)

# And using it to set the corresponding elements of `likelihood` to 0.

# In[32]:

likelihood[too_low] = 0
likelihood

# Now we can do the update and display the results.

# In[33]:

bayes_update(pmf, likelihood)
pmf

# The 4-sided and 6-sided dice have been eliminated. Of the remaining dice, the 8-sided die is the most likely.

# **Exercise:** Suppose you have the same set of 5 dice. You choose a die, roll it six times, and get 6, 7, 2, 5, 1, and 2 again. Use `idxmax` to find the MAP. What is the posterior probability of the MAP?

# In[34]:

# Solution goes here

# In[35]:

# Solution goes here

# In[36]:

# Solution goes here

# In[37]:

# Solution goes here

# ## Summary
# 
# In this notebook, we extended the cookie problem with more bowls and the dice problem with more dice.
# 
# I defined the MAP, which is the value in a posterior distribution with the highest posterior probability.
# 
# Although the cookie problem is not particularly realistic or useful, the method we used to solve it applies to many problems in the real world where we want to estimate a proportion.
# 
# [In the next notebook](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/07_euro.ipynb) we'll use the same method on a related problem, and take another step toward doing Bayesian statistics.

# In[ ]:
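# As a parting recap, the whole proportion-estimation method fits in a few lines: start from a uniform prior, update once per observation, and read off the MAP. This condensed sketch reuses the cookie likelihoods from above; the name `pmf_recap` is made up for this example.

# In[ ]:

pmf_recap = make_pmf(xs, 1/101)

for cookie in 'VVC':                     # two vanilla cookies, then one chocolate
    if cookie == 'V':
        bayes_update(pmf_recap, likelihood_vanilla)
    else:
        bayes_update(pmf_recap, likelihood_chocolate)

pmf_recap.idxmax()                       # MAP: 67, matching the Bayes-table result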