#!/usr/bin/env python
# coding: utf-8

# # Simple Bayesian Modeling
#
# Here we'll apply what we learned in the previous section, and use Python to fit some Bayesian models.
# We'll start with the classic model fitting problem: **Fitting a line to data**.
#
# We'll begin with some standard Python imports:

# In[ ]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import numpy as np
import seaborn; seaborn.set()  # for plot formatting


# ## The Data
#
# For the sake of this demonstration, let's start by creating some data that we will fit with a straight line:

# In[ ]:


def make_data(intercept, slope, N=20, dy=5, rseed=42):
    rand = np.random.RandomState(rseed)
    x = 100 * rand.rand(N)
    y = intercept + slope * x
    y += dy * rand.randn(N)
    return x, y, dy * np.ones_like(x)

theta_true = [25, 0.5]
x, y, dy = make_data(*theta_true)

plt.errorbar(x, y, dy, fmt='o');


# ## The Model
#
# Next we need to specify a model. We're fitting a straight line to data, so we'll need a slope and an intercept; i.e.
#
# $$
# y_M(x) = mx + b
# $$
#
# where our parameter vector might be
#
# $$
# \theta = [b, m]
# $$
#
# But this is only half the picture: what we mean by a "model" in a Bayesian sense is not only this expected value $y_M(x;\theta)$, but a **probability distribution** for our data.
# That is, we need an expression to compute the likelihood $P(D\mid\theta)$ for our data as a function of the parameters $\theta$.
# Here we are given data with simple errorbars, which imply that the probability for any *single* data point is a normal distribution about the true value. That is,
#
# $$
# y_i \sim \mathcal{N}(y_M(x_i;\theta), \varepsilon_i)
# $$
#
# or, in other words,
#
# $$
# P(x_i,y_i\mid\theta) = \frac{1}{\sqrt{2\pi\varepsilon_i^2}} \exp\left(\frac{-\left[y_i - y_M(x_i;\theta)\right]^2}{2\varepsilon_i^2}\right)
# $$
#
# where $\varepsilon_i$ are the (known) measurement errors indicated by the errorbars.
#
# Assuming all the points are independent, we can find the full likelihood by multiplying the individual likelihoods together:
#
# $$
# P(D\mid\theta) = \prod_{i=1}^N P(x_i,y_i\mid\theta)
# $$
#
# For convenience (and also for numerical accuracy) this is often expressed in terms of the log-likelihood:
#
# $$
# \log P(D\mid\theta) = -\frac{1}{2}\sum_{i=1}^N\left(\log(2\pi\varepsilon_i^2) + \frac{\left[y_i - y_M(x_i;\theta)\right]^2}{\varepsilon_i^2}\right)
# $$

# ### Quick Exercise
#
# 1. Write a Python function which computes the log-likelihood given a parameter vector $\theta$, an array of errors $\varepsilon$, and arrays of $x$ and $y$ values.
#
# 2. Use tools in [``scipy.optimize``](http://docs.scipy.org/doc/scipy/reference/optimize.html) to maximize this likelihood (i.e. minimize the negative log-likelihood). How close is this result to the input ``theta_true`` above?

# In[ ]:


# Write your code here


# In[ ]:




# In[ ]:




# In[ ]:




# See the [Solutions-02](Solutions-02.ipynb#Quick-Exercise) notebook for one possible approach.
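# For reference, here is one minimal sketch of what this might look like (not necessarily the approach taken in Solutions-02; the function name and the optimizer's starting guess are arbitrary illustrative choices). It reuses the ``x``, ``y``, ``dy`` arrays defined above:

# In[ ]:


def log_likelihood(theta, x, y, dy):
    """Gaussian log-likelihood for the line y = theta[0] + theta[1] * x."""
    y_model = theta[0] + theta[1] * x
    return -0.5 * np.sum(np.log(2 * np.pi * dy ** 2) +
                         (y - y_model) ** 2 / dy ** 2)


from scipy import optimize

# Maximize the likelihood by minimizing the negative log-likelihood
result = optimize.minimize(lambda theta: -log_likelihood(theta, x, y, dy),
                           x0=[0, 0])
print("maximum likelihood estimate:", result.x)
print("true parameters:            ", theta_true)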
# ## The Prior
#
# We have computed the likelihood; now we need to think about the prior $P(\theta\mid I)$.
#
# This is where Bayesianism gets a bit controversial... what can we actually say about the slope and intercept before we fit our data?
#
# There are a couple of approaches to choosing priors that you'll come across in practice:

# ### 0. Conjugate Priors
#
# In the days before computational power caught up with the aspirations of Bayesians, it was important to choose priors which would make the posterior analytically computable.
# A [conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior) is a prior which, due to its mathematical relation to the likelihood, makes the result analytically computable.
# The problem is that the form of a conjugate prior is very rarely defensible on any grounds other than computational convenience, so this is **almost never a good choice**.
# You'll still occasionally hear people attack Bayesian approaches because of the use of conjugate priors: these arguments are decades out of date, and you should treat them accordingly.

# ### 1. Empirical Priors
#
# Empirical priors are priors which are actually posteriors from previous studies of the same phenomenon. For example, it's common in supernova cosmology studies to use the WMAP results as a prior: that is, we plug in a *real result* and use our new data to improve on it. This situation is where Bayesian approaches really shine.
#
# For our linear fit, you might imagine that our $x, y$ data is a more accurate version of a previous experiment, where we've found that the intercept is $\theta_0 = 50 \pm 30$ and the slope is $\theta_1 = 1.0 \pm 0.5$.
# In this case, we'd encode this prior knowledge in the prior distribution itself.
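# As a minimal sketch, that prior knowledge could be encoded as an unnormalized Gaussian log-prior like the following (the function name is arbitrary, and treating the two constraints as independent Gaussians is an assumption made for illustration):

# In[ ]:


def log_empirical_prior(theta):
    """Unnormalized Gaussian log-prior: intercept = 50 +/- 30, slope = 1.0 +/- 0.5."""
    return (-0.5 * (theta[0] - 50) ** 2 / 30 ** 2
            - 0.5 * (theta[1] - 1.0) ** 2 / 0.5 ** 2)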
# ### 2. Flat Priors
#
# If you don't have an empirical prior, you might be tempted to simply use a *flat prior*, i.e. a prior that is constant between two reasonable limits (e.g. equal probability for all slopes from -1000 to +1000).
#
# The problem is that flat priors are not always non-informative! For example, a flat prior on the slope will effectively give a higher weight to larger slopes.
# We can see this straightforwardly by plotting regularly-spaced slopes between 0 and 20:

# In[ ]:


xx = np.linspace(-1, 1)
for slope in np.linspace(0, 20, 100):
    plt.plot(xx, slope * xx, '-k', linewidth=1)
plt.axis([-1, 1, -1, 1])
plt.gca().set_aspect('equal');


# The density of the lines is a proxy for the probability of those slopes under a flat prior.
# This is an important point to realize: **flat priors are not necessarily minimally informative**.

# ### 3. Non-informative Priors
#
# What we *really* want in cases where no empirical prior is available is a **non-informative prior**. Among other things, such a prior should not depend on the units of the data.
# Perhaps the most principled approach to choosing non-informative priors is the *principle of maximum entropy* advocated by Jaynes ([book](http://omega.albany.edu:8008/JaynesBook.html)).
#
# Similar in spirit is the commonly-used [Jeffreys Prior](https://en.wikipedia.org/wiki/Jeffreys_prior), which in many cases of interest amounts to a "scale invariant" prior: a flat prior on the logarithm of the parameter.
#
# In the case of the linear slope, we often want a prior which does not artificially over-weight large slopes: there are a couple of possible approaches to this (see http://arxiv.org/abs/1411.5018 for some discussion). For our situation, we might use a flat prior on the angle the line makes with the x-axis, which gives
#
# $$
# P(m) \propto (1 + m^2)^{-3/2}
# $$
#
# For lack of a better term, I like to call this a "symmetric prior" on the slope (because it's the same whether we're fitting $y = mx + b$ or $x = m^\prime y + b^\prime$).

# ### Implementation
#
# Let's define two Python functions to compute the options for our prior: we'll use both a (log) flat prior and a (log) symmetric prior.
# In general, we need not worry about the normalization of the prior or the likelihood, which makes our lives easier:

# In[ ]:


def log_flat_prior(theta):
    if np.all(np.abs(theta) < 1000):
        return 0  # log(1)
    else:
        return -np.inf  # log(0)

def log_symmetric_prior(theta):
    if np.abs(theta[0]) < 1000:
        return -1.5 * np.log(1 + theta[1] ** 2)
    else:
        return -np.inf  # log(0)


# With these defined, we now have what we need to compute the log posterior as a function of the model parameters.
# You might be tempted to maximize this posterior in the same way that we did with the likelihood above, but this is not a Bayesian result! The Bayesian result is a (possibly marginalized) posterior probability for our parameters.
# The mode of a probability distribution is perhaps slightly informative, but it is in no way a Bayesian result.
#
# In the breakout, we will take a few minutes now to plot the posterior probability as a function of the slope and intercept.

# ## Breakout
#
# 1. Using matplotlib, plot the posterior probability distribution for the slope and intercept, once for each prior. I would suggest using ``plt.contourf()`` or ``plt.pcolor()``. How different are the distributions?
#
# 2. Modify the dataset: how do the results change if you have very few data points or very large errors?
#
# 3. If you finish this quickly, try adding 1-sigma and 2-sigma contours to your plot, keeping in mind that the probabilities are not normalized! You can add them to your plot with ``plt.contour()``.

# In[ ]:


# Write your code here


# In[ ]:




# In[ ]:




# In[ ]:




# In[ ]:




# In[ ]:




# See the [Solutions-02](Solutions-02.ipynb#Breakout) notebook for one possible approach to answering these questions. But give this a good try before simply clicking! It's by working through this yourself that you will really learn the material.
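# If you get stuck, one possible starting point (not the Solutions-02 code) is sketched below. The grid ranges, the grid resolution, and the choice of the flat prior are arbitrary illustrative choices; swapping in ``log_symmetric_prior`` lets you compare the two priors.

# In[ ]:


def log_posterior(theta, x, y, dy, log_prior=log_flat_prior):
    """Unnormalized log-posterior: log-prior plus Gaussian log-likelihood."""
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    y_model = theta[0] + theta[1] * x
    return lp - 0.5 * np.sum(np.log(2 * np.pi * dy ** 2) +
                             (y - y_model) ** 2 / dy ** 2)

# Evaluate the (unnormalized) posterior on a grid of intercept and slope values
b_grid = np.linspace(-10, 60, 100)
m_grid = np.linspace(0.0, 1.0, 100)
logP = np.array([[log_posterior([b, m], x, y, dy) for b in b_grid]
                 for m in m_grid])

plt.contourf(b_grid, m_grid, np.exp(logP - logP.max()), 20, cmap='Blues')
plt.xlabel('intercept')
plt.ylabel('slope')
plt.colorbar(label='relative posterior');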