#!/usr/bin/env python
# coding: utf-8

# # Fitting Models to Data
# 
# In the first chapter, we learned that while Bayes' formula is simple and its derivation is straightforward, using it to estimate model parameters is hampered by the fact that its calculation typically involves multidimensional integration that is rarely computable in closed form. Most useful Bayesian models therefore require computational methods in order to obtain reasonable estimates. Now that we have some Python at our disposal, we will put it to use by stepping through a selection of numerical methods for fitting Bayesian models.

# As a motivating example, we will use some real data to build and estimate a simple parametric model. Specifically, we will take a set of measurements from subjects in a medical research study and fit a normal distribution to the weights of the patients.

# In[1]:

import numpy as np
import pandas as pd

patient_data = pd.read_csv('data/patients.csv')
patient_data.head()

# The weights are slightly skewed to the right, so we will model the logarithm of the values.

# In[2]:

get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sb
import scipy.stats as stats

log_weight = np.log(patient_data.weight.values)
sb.distplot(log_weight, kde=False, fit=stats.norm)

# To specify our model, we need to select a likelihood that describes the data as a function of the model parameters, and a set of priors for those parameters. We have already selected a normal distribution to describe the data; that will serve as the likelihood.
# 
# $$ f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{\tau}{2} (x-\mu)^2 \right) $$
# 
# This leaves us with the task of choosing priors for the mean $\mu$ and precision $\tau$. Since the mean is a continuous variable defined on the real line, it is natural to use another normal distribution for its prior. Though we have quite a lot of prior information regarding the mean weight of adult humans, for the time being we are going to ignore it and specify a normal prior with a mean of zero (on the log scale) and a standard deviation of 10, which corresponds to a precision of 0.01. This prior is extremely vague relative to our actual state of knowledge.

# In[3]:

import matplotlib.pyplot as plt

mu_prior = lambda x, mu=0, tau=0.01: np.sqrt(0.5*tau/np.pi) * np.exp(-0.5*tau*(x-mu)**2)
mu_x = np.linspace(-50, 50)
plt.plot(mu_x, mu_prior(mu_x))

# For the precision, which we wish to constrain to the positive part of the real line, we will select an exponential distribution as the prior. This is just a special case of the gamma distribution, with the $\alpha$ parameter fixed to one.
# 
# $$f(x \mid \beta) = \beta \exp(-\beta x)$$
# 
# For our purposes, we will choose a prior that spreads considerable probability over a very broad range of values, so that there is little chance that the true value falls in the extreme tail of the distribution. Let's choose $\beta=0.01$.
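# As a quick sanity check (an illustrative cell, not part of the original analysis), we can look at a few quantiles of this prior. With $\beta=0.01$, the median precision is around 69 and the 99th percentile is around 460, so the prior comfortably covers any plausible precision for log-weights.

# In[ ]:

# Quantiles of the Exponential(0.01) prior; scipy parameterizes the
# exponential by its scale, which is 1/beta.
stats.expon(scale=1/0.01).ppf([0.5, 0.95, 0.99])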
# In[4]:

from scipy.stats import expon

tau_prior = lambda x, beta=0.01: beta * np.exp(-beta*x)
tau_x = np.linspace(0, 100)
plt.plot(tau_x, tau_prior(tau_x))

# With likelihood and priors in hand, we can now formulate an unnormalized posterior distribution:
# 
# $$ \begin{align}
# Pr(\mu, \tau | y) &\propto Pr(\mu)Pr(\tau)Pr(y|\mu,\tau) \\
# &= \sqrt{\frac{0.01}{2\pi}} \exp\left(-\frac{0.01}{2} \mu^2 \right) \times 0.01 \exp(-0.01 \tau) \times \prod_i \sqrt{\frac{\tau}{2\pi}} \exp\left(-\frac{\tau}{2} (y_i-\mu)^2 \right)
# \end{align}$$
# 
# The corresponding implementation in Python is below. The augmented assignment operators (`+=` and `-=`) allow us to break the formula into readable pieces, which makes it easier to code without making errors. Note that it is implemented on the log scale, which results in a more numerically stable function.

# In[60]:

def calc_post(params, y):
    mu, tau = params
    post = np.log(tau_prior(tau))
    post += np.log(mu_prior(mu))
    post += 0.5*len(y)*(np.log(tau) - np.log(2*np.pi))
    post -= (tau/2. * (y-mu)**2).sum()
    return post

calc_post((4, 0.1), log_weight)

# The simplest way to characterize the posterior distribution with this function is to evaluate it over a grid. Using these calculated values, we can visualize the posterior distribution with either a heatmap or a contour plot.

# In[25]:

# Create grid
mu_x = np.linspace(3.97, 4.03)
tau_x = np.linspace(20, 26)

# Calculate posterior on grid
z = np.array([[calc_post((mu, tau), log_weight) for mu in mu_x] for tau in tau_x])

# Plot posterior
x, y = np.meshgrid(mu_x, tau_x)
plt.pcolor(x, y, z, cmap='Reds')
cplot = plt.contour(x, y, z-z.max(), [-0.1, -0.5, -1, -2, -3, -4], cmap=plt.cm.Reds)
plt.ylabel(r'$\tau$');plt.xlabel(r'$\mu$');

# If we wish to obtain estimates of the parameters, the most expedient route is to locate the largest value in the matrix of log-probability values stored in `z` and pull out the corresponding values in `x` and `y`, which are the values of $\mu$ and $\tau$, respectively, that produced this maximum.

# In[38]:

max_index = z.ravel().argmax()
max_y = y.ravel()[max_index]
max_x = x.ravel()[max_index]
print('The largest value is at index {0}, which corresponds to mu={1} and tau={2}'.format(max_index, max_x.round(2), max_y.round(2)))

# Of course, whether or not we are happy with these values depends on the precision with which we wish to estimate the parameters, which here is governed by the coarseness of the grid over which the log-posterior was evaluated. This approach is also very inefficient, requiring several thousand function evaluations in order to obtain just two values.
# 
# Instead, we can directly maximize the joint posterior distribution with respect to the unknown parameters, using a numerical optimization procedure. Several such procedures are implemented in SciPy, as we touched upon in the last chapter. Specifically, we want to find:
# 
# $$\begin{aligned}\hat{\theta}_{MAP} &= \text{argmax}_{\theta}\left(Pr(y|\theta)Pr(\theta)\right) \\
# &= \text{argmax}_{\theta}\left(Pr(\theta)\prod_i Pr(y_i|\theta)\right) \\
# &= \text{argmax}_{\theta}\left(\log[Pr(\theta)] + \sum_i \log[Pr(y_i|\theta)]\right)
# \end{aligned}$$
# 
# This is the *maximum a posteriori* (MAP) estimate: the mode of the posterior distribution.
# 
# Let's pick one of the more efficient algorithms, *Powell's method*, as implemented in `fmin_powell`.
# As the function name implies, this is a minimization procedure, so we need to wrap our log-posterior function in another function that simply returns the negative of its value.

# In[62]:

from scipy import optimize

calc_post_min = lambda *args: -calc_post(*args)
optimize.fmin_powell(calc_post_min, (4, 10), args=(log_weight,))

# There are two major limitations of MAP estimation. First, it can only handle variables whose values are continuous (Python dtype `float`), so models that use distributions like the binomial, multinomial, or Poisson for unobserved stochastic variables cannot be estimated. Second, MAP yields only point estimates of the variables, and cannot provide estimates of the associated uncertainty, such as a credible interval. Thus, for MAP to be useful for most analyses, it must be paired with resampling methods like the bootstrap so that uncertainty intervals can be approximated.

# ## Normal approximation
# 
# A more robust approach to approximating the posterior of our model is the *Laplace approximation*, which fits a multivariate normal distribution centered at the mode of the joint posterior. This approach takes advantage of the fact that we can use a Taylor series to approximate any function in the neighborhood of an arbitrary point; we will use the mode *m*:
# 
# $$
# f(x) \approx f(m) + f^{\prime}(m)(x-m) + \frac{1}{2!}f^{\prime\prime}(m)(x-m)^2 + \frac{1}{3!}f^{\prime\prime\prime}(m)(x-m)^3 + \ldots $$
# 
# If we truncate this expansion at the second order, we are left with:
# 
# $$
# L(x) = f(m) + f^{\prime}(m)(x-m) + \frac{1}{2!}f^{\prime\prime}(m)(x-m)^2$$
# 
# We can simplify this by noting that the first derivative of a function at its mode is zero:
# 
# $$
# L(x) = f(m) + \frac{1}{2}f^{\prime\prime}(m)(x-m)^2$$
# 
# Note that what we are left with has the same form as the logarithm of a normal density:
# 
# $$\log[N(x|\mu,\tau)] \propto \frac{1}{2}\log(\tau) - \frac{\tau}{2}(x - \mu)^2$$
# 
# This implies that we can approximate the posterior with a normal distribution. We can calculate $\mu = m$ via optimization, as we did for the MAP estimate, and the precision $\tau = -f^{\prime\prime}(m)$ by estimating the curvature at the mode. Note that for a multivariate model, this approximation of $\tau$ is just the information matrix (*i.e.* the negative of the Hessian of the joint log-probability, evaluated at the mode).
# 
# To obtain estimates using the normal approximation, we will use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm that is provided by SciPy. In addition to returning an estimate of the mode, with `full_output=True` it returns an estimate of the inverse Hessian of the objective, which serves as the variance-covariance matrix we need to parameterize the approximating multivariate normal distribution.

# In[65]:

opt = optimize.fmin_bfgs(calc_post_min, (4, 20), args=(log_weight,), full_output=True)
mu, Sigma = opt[0], opt[3]
mu, Sigma

# Thus, our approximated mode is $\mu=4.0$, $\tau=22.8$. We can plug these values, along with the variance-covariance matrix, into a function that returns the log kernel of a multivariate normal distribution, and use this to plot the approximate posterior:

# In[74]:

det = np.linalg.det
inv = np.linalg.inv

def lmvn(value, mu, Sigma):
    # Log kernel of the multivariate normal: -0.5*(log|Sigma| + (x-mu)' Sigma^-1 (x-mu))
    delta = np.array(value) - mu
    return -0.5 * (np.log(det(Sigma)) + np.dot(delta.T, np.dot(inv(Sigma), delta)))
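# As a quick check (an illustrative cell, not part of the original notebook), the log kernel should be largest at the mode itself and decrease as we move away from it:

# In[ ]:

# Evaluate the kernel at the mode and at a point one posterior standard
# deviation away in the mu direction; the first value should be larger.
lmvn(mu, mu, Sigma), lmvn(mu + np.array([np.sqrt(Sigma[0, 0]), 0]), mu, Sigma)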
# In[93]:

z = np.array([[lmvn((t1, t2), mu, Sigma) for t1 in mu_x] for t2 in tau_x])
x, y = np.meshgrid(mu_x, tau_x)
cplot = plt.contour(x, y, z - z.max(), cmap=plt.cm.RdBu)
plt.ylabel(r'$\tau$');plt.xlabel(r'$\mu$');

# Along with this, we can estimate 95% probability intervals for the estimated mode:

# In[76]:

from scipy.stats.distributions import norm

se = np.sqrt(np.diag(Sigma))
mu[0] + norm.ppf(0.025)*se[0], mu[0] + norm.ppf(0.975)*se[0]

# In[77]:

mu[1] + norm.ppf(0.025)*se[1], mu[1] + norm.ppf(0.975)*se[1]

# Of course, this approximation is only reasonable for posteriors that are not strongly skewed, bimodal, or leptokurtic (heavy-tailed).

# ## Numerical Integration Approaches
# 
# Bayesian analysis often requires integration over multiple dimensions that is intractable via either analytic methods or standard methods of numerical integration. However, it is often possible to compute these integrals by simulating (drawing samples) from posterior distributions. For example, consider the expected value of a random vector $\mathbf{x}$:
# 
# $$E[{\bf x}] = \int {\bf x} f({\bf x}) d{\bf x}, \qquad {\bf x} = \{x_1,...,x_k\}$$
# 
# where $k$ (the dimension of the vector $\mathbf{x}$) is perhaps very large. If we can produce a reasonable number of random vectors $\{{\bf x_i}\}$, we can use these values to approximate the unknown integral. This process is known as *Monte Carlo integration*. In general, MC integration allows integrals against probability density functions:
# 
# $$I = \int h(\mathbf{x}) f(\mathbf{x}) \mathbf{dx}$$
# 
# to be estimated by finite sums:
# 
# $$\hat{I} = \frac{1}{n}\sum_{i=1}^n h(\mathbf{x}_i),$$
# 
# where $\mathbf{x}_i$ is a sample from $f$. This estimate is valid and useful because:
# 
# - By the strong law of large numbers:
# 
# $$\hat{I} \rightarrow I \quad \mbox{with probability 1}$$
# 
# - Simulation error can be measured and controlled:
# 
# $$Var(\hat{I}) = \frac{1}{n(n-1)}\sum_{i=1}^n (h(\mathbf{x}_i)-\hat{I})^2$$

# ### How is this relevant to Bayesian analysis?
# 
# When we observe data $y$ that we hypothesize as being obtained from a sampling model $f(y|\theta)$, where $\theta$ is a vector of (unknown) model parameters, a Bayesian places a *prior* distribution $p(\theta)$ on the parameters to describe the uncertainty in their true values. Bayesian inference is then obtained by calculating the *posterior* distribution, which is proportional to the product of these quantities:
# 
# $$p(\theta | y) \propto f(y|\theta) p(\theta)$$
# 
# Unfortunately, for most problems of interest, the normalizing constant cannot be calculated because it involves multi-dimensional integration over $\theta$.
# 
# Returning to our integral for MC sampling, if we replace $f(\mathbf{x})$ with a posterior, $p(\theta|y)$, and make $h(\theta)$ an interesting function of the unknown parameters, the resulting expectation is the posterior expectation of $h(\theta)$:
# 
# $$E[h(\theta)|y] = \int h(\theta) p(\theta|y) d\theta \approx \frac{1}{n}\sum_{i=1}^n h(\theta^{(i)})$$
# 
# where the $\theta^{(i)}$ are samples drawn from the posterior.
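# To make the Monte Carlo estimator concrete, here is a small illustration (not part of the original notebook): we approximate $E[h(x)]$ for $h(x) = \exp(x)$ and $x \sim N(0,1)$, whose exact value is $e^{1/2} \approx 1.649$, and compute the simulation standard error using the variance formula above.

# In[ ]:

# Monte Carlo estimate of E[exp(x)] for x ~ N(0, 1), with its simulation
# standard error; the exact value is exp(0.5) ~= 1.6487.
n_sim = 100000
x_sim = np.random.normal(size=n_sim)
h_sim = np.exp(x_sim)
I_hat = h_sim.mean()
se_I = np.sqrt(((h_sim - I_hat)**2).sum() / (n_sim*(n_sim - 1)))
I_hat, se_I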
# We also require integrals to obtain marginal estimates from a joint model. If $\theta$ is of length $K$, then inference about any particular parameter is obtained by:
# 
# $$p(\theta_i|y) \propto \int p(\theta|y) d\theta_{-i}$$
# 
# where the $-i$ subscript indicates all elements except the $i^{th}$.

# ## Rejection Sampling
# 
# Though Monte Carlo integration allows us to estimate integrals that are unassailable by analytic and standard numerical methods, it relies on the ability to draw samples from the posterior distribution. For known parametric forms, this is not a problem; probability integral transforms or bivariate techniques (e.g., the Box-Muller method) may be used to obtain samples from uniform pseudo-random variates generated by a computer. Often, however, we cannot readily generate random values from non-standard posteriors. In such instances, we can use rejection sampling to generate samples.
# 
# Consider a function $f(x)$ that can be evaluated for any value on the support of $x$, $S_x = [A,B]$, but may not be integrable or easily sampled from. If we can calculate the maximum value of $f(x)$, we can then define a rectangle that is guaranteed to contain all possible values $(x,f(x))$. It is then trivial to generate points over the box and count the values that fall under the curve:
# 
# $$\lim_{n \to \infty} \frac{\mbox{Points under curve}}{\mbox{Points generated}} \times \mbox{box area} = \int_A^B f(x) dx$$

# ### Example: triangular distribution

# In[78]:

def rtriangle(low, high, mode):
    alpha = -1
    # Propose uniformly on [low, high] and accept with probability
    # proportional to the triangular density at the proposed point.
    while np.random.random() > alpha:
        u = np.random.uniform(low, high)
        if u < mode:
            alpha = (u - low) / (mode - low)
        else:
            alpha = (high - u) / (high - mode)
    return u

# In[79]:

_ = plt.hist([rtriangle(0, 7, 2) for t in range(10000)], bins=100)

# This approach is useful, for example, in estimating the normalizing constant for posterior distributions.
# 
# If $f(x)$ has unbounded support (i.e. infinite tails), such as a Gaussian distribution, a bounding box is no longer appropriate. We must instead specify a majorizing (or enveloping) function $g(x)$ and a constant $c$ such that:
# 
# $$cg(x) \ge f(x) \qquad\forall x \in (-\infty,\infty)$$
# 
# Having done this, we can sample $x_i$ from $g(x)$ and accept or reject each of these values based upon $f(x_i)$. Specifically, for each draw $x_i$ we also draw a uniform random variate $u_i$ and accept $x_i$ if $u_i < f(x_i)/(c g(x_i))$. This procedure is repeated until a sufficient number of samples is obtained, and it is made more efficient by choosing an enveloping distribution that is "close" to the target distribution, thus maximizing the number of accepted points. A minimal one-dimensional sketch of this accept-reject scheme is shown below.
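# As a minimal sketch (not part of the original notebook), here is the envelope accept-reject scheme in one dimension: we draw from a standard normal target $f$ using a Cauchy envelope $g$, with $c = \sqrt{2\pi/e} \approx 1.52$, which is enough to guarantee $cg(x) \ge f(x)$ everywhere.

# In[ ]:

def reject_normal_via_cauchy(n):
    # Accept-reject sampler for a standard normal target with a Cauchy
    # envelope; c bounds the ratio f(x)/g(x) over the whole real line.
    c = np.sqrt(2*np.pi/np.e)
    samples = []
    while len(samples) < n:
        x_prop = np.random.standard_cauchy()
        f_val = stats.norm.pdf(x_prop)
        g_val = stats.cauchy.pdf(x_prop)
        if np.random.random() < f_val/(c*g_val):
            samples.append(x_prop)
    return np.array(samples)

_ = plt.hist(reject_normal_via_cauchy(5000), bins=50)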
# To apply rejection sampling to our normal model of the patient weights, we first need to find a majorizing function $g(x)$ from which we can easily draw samples. We have seen in the previous section that the multivariate normal might serve as a suitable candidate, if multiplied by an appropriately large value of $c$. However, the thinness of the normal tails makes it difficult to use as a majorizing function. Instead, a multivariate Student's T distribution offers heavier tails for a suitably small value of the degrees of freedom $\nu$:
# 
# $$f(\mathbf{x}| \nu,\mu,\Sigma) = \frac{\Gamma\left[(\nu+p)/2\right]}{\Gamma(\nu/2)\nu^{p/2}\pi^{p/2}\left|{\boldsymbol\Sigma}\right|^{1/2}\left[1+\frac{1}{\nu}({\mathbf x}-{\boldsymbol\mu})^T{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu})\right]^{(\nu+p)/2}}$$

# We can draw samples from a multivariate-T density by combining multivariate normal and $\chi^2$ random variates:
# 
# > ### Generating multivariate-T samples
# 
# > If $X$ is distributed multivariate normal $\text{MVN}(\mathbf{0},\Sigma)$ and $S$ is a $\chi^2$ random variable with $\nu$ degrees of freedom, then a multivariate Student's-T random variable $T = (T_1,\ldots,T_p)$ can be generated by $T_i = \frac{\sqrt{\nu} X_i}{\sqrt{S}} + \mu_i$, where $\mu = (\mu_1,\ldots,\mu_p)$ is a mean vector.

# This is implemented in Python by:

# In[80]:

chi2 = np.random.chisquare
mvn = np.random.multivariate_normal

# Scale each multivariate normal draw by sqrt(nu/S), where S is a chi-squared
# variate shared across the components of that draw, then shift by the mean.
rmvt = lambda nu, S, mu=0, size=1: (np.sqrt(nu) * (mvn(np.zeros(len(S)), S, size).T / np.sqrt(chi2(nu, size)))).T + mu

# Finally, we need an implementation of the multivariate T probability density function, which is as follows:

# In[81]:

from scipy.special import gammaln

def mvt(x, nu, S, mu=0):
    # Multivariate Student's-T density evaluated at a single point x
    d = len(S)
    X = np.atleast_2d(x) - mu
    Q = X.dot(np.linalg.inv(S)).dot(X.T).sum()
    log_det = np.log(np.linalg.det(S))
    log_pdf = gammaln((nu + d)/2.) - 0.5 * (d*np.log(np.pi*nu) + log_det) - gammaln(nu/2.)
    log_pdf -= 0.5*(nu + d)*np.log(1 + Q/nu)
    return np.exp(log_pdf)

# The next step is to find the constant $c$ that ensures:
# 
# $$cg(\theta) \ge f(\theta|y) \qquad\forall \theta \in (-\infty,\infty)$$
# 
# Equivalently, on the log scale, we want to ensure:
# 
# $$\log[f(\theta|y)] - \log[g(\theta)] \le c'$$
# 
# where $c' = \log(c)$.

# In[84]:

def calc_diff(theta, y, nu, S, mu):
    return calc_post(theta, y) - np.log(mvt(theta, nu, S, mu))

calc_diff_min = lambda *args: -calc_diff(*args)

# We can calculate an appropriate value of $c'$ by maximizing `calc_diff` with the same optimization approach used above (minimizing its negative). Because we minimize the negative of the difference, the value stored in `c` below is actually $-c'$, which is why it is added to the log-ratio in the acceptance probability.

# In[85]:

opt = optimize.fmin_bfgs(calc_diff_min, (4, 20), args=(log_weight, 4, 2*Sigma, mu), full_output=True)

# In[86]:

c = opt[1]
c

# Now we can execute a rejection sampling algorithm:

# In[88]:

def reject(post, nu, S, mu, n, data, c):
    # Draw samples from g(theta)
    theta = rmvt(nu, S, mu, size=n)
    # Calculate log-probability under g(theta)
    gvals = np.array([np.log(mvt(t, nu, S, mu)) for t in theta])
    # Calculate log-probability under f(theta)
    fvals = np.array([post(t, data) for t in theta])
    # Calculate acceptance probability f(theta) / (c * g(theta))
    p = np.exp(fvals - gvals + c)
    return theta[np.random.random(n) < p]

# In[89]:

nsamples = 1000
sample = reject(calc_post, 4, 2*Sigma, mu, nsamples, log_weight, c)

# In[96]:

z = np.array([[calc_post((mu, tau), log_weight) for mu in mu_x] for tau in tau_x])
cplot = plt.contour(x, y, z - z.max(), [-0.5, -1, -2, -4, -8], cmap=plt.cm.RdBu)
plt.clabel(cplot, inline=1, fontsize=10, fmt='%1.1f')
plt.ylabel(r'$\tau$');plt.xlabel(r'$\mu$')
plt.scatter(*sample.T)

# Notice that the efficiency of rejection sampling is not very high for this problem.

# In[97]:

float(len(sample))/nsamples
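# Since the accepted draws are samples from the posterior, we can also summarize them directly. This is an illustrative cell, not part of the original notebook:

# In[ ]:

# Posterior means and standard deviations of (mu, tau) estimated from the
# accepted rejection samples.
sample.mean(axis=0), sample.std(axis=0)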
# ## Importance Sampling
# 
# Rejection sampling is usually subject to declining performance as the dimension of the parameter space increases. Further improvement is gained by using optimized algorithms such as *importance sampling*, which, as the name implies, samples more frequently from important areas of the distribution. For example, consider calculating the expected value of an unknown parameter $\theta$:
# 
# $$E[\theta | y] = \frac{\int \theta f(y|\theta) p(\theta) d\theta}{\int f(y|\theta) p(\theta) d\theta} = \frac{\int \theta p(\theta | y) d\theta}{\int p(\theta|y) d\theta}$$
# 
# In most cases, the two integrals above present severe obstacles to obtaining a result. If the posterior $p(\theta|y)$ is a density from which it is easy to sample, we could approximate these integrals using Monte Carlo simulation, as we saw above, but too often it is not. Instead, assume that we can draw from some other probability density $q(\theta)$ that is some approximation of $p$. We could then write:
# 
# $$E[\theta | y] = \frac{\int \theta \frac{p(\theta|y)}{q(\theta)} q(\theta) d\theta}{\int \frac{p(\theta|y)}{q(\theta)} q(\theta) d\theta}$$
# 
# Expressed this way, $w(\theta) = p(\theta|y) / q(\theta)$ can be regarded as *weights* for the $M$ values of $\theta$ sampled from $q$ that we can use to correct the sample so that it approximates $E[\theta | y]$. Specifically, the **importance sampling estimate** is:
# 
# $$\hat{\theta}_{is} = \frac{\sum_{i=1}^{M} \theta^{(i)} w(\theta^{(i)})}{\sum_{i=1}^{M} w(\theta^{(i)})}$$
# 
# where $\theta^{(i)}$ is the $i^{th}$ sample simulated from $q(\theta)$. The standard error for the importance sampling estimate is:
# 
# $$\text{SE}_{is} = \frac{\sqrt{\sum_{i=1}^{M} [(\theta^{(i)} - \hat{\theta}_{is}) w(\theta^{(i)})]^2}}{\sum_{i=1}^{M} w(\theta^{(i)})}$$
# 
# The efficiency of importance sampling is governed by the choice of the importance sampling distribution $q$.

# Let's consider again the problem of fitting a normal distribution to the patient weight dataset. Here, we will use a multivariate T density as the simulation distribution $q$.
# 
# Here are 1000 sampled values to use for approximating the posterior:

# In[98]:

theta = rmvt(3, Sigma, mu, size=1000)

# We can obtain the log-probability of these values under the posterior density:

# In[99]:

f_theta = np.array([calc_post(t, log_weight) for t in theta])

# and the log of their density under the T distribution:

# In[100]:

q_theta = np.array([np.log(mvt(t, 3, Sigma, mu)) for t in theta])

# This allows us to calculate the importance weights:

# In[101]:

w = np.exp(f_theta - q_theta - max(f_theta - q_theta))

# Notice that we have subtracted the maximum of the log-weight differences before exponentiating. This does not affect the estimates, since the weights are normalized by their sum below, but it keeps the exponentiation numerically stable.
# 
# Now, we can obtain estimates of the parameters:

# In[102]:

theta_si = [(w*t).sum()/w.sum() for t in theta.T]
theta_si

# Finally, the standard errors of the estimates:

# In[103]:

se = [np.sqrt((((theta.T[i] - theta_si[i])*w)**2).sum())/w.sum() for i in (0, 1)]
se

# ---

# In[15]:

from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()