#!/usr/bin/env python
# coding: utf-8

# # Introduction to Gaussian Processes
#
# Anthony Gitter (w/ minor edits from Colin Dewey)
#
# March 26, 2019
#
# This notebook introduces Gaussian process regression. It uses code from:
# - http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
# - http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_kernels.ipynb
# - http://seaborn.pydata.org/tutorial/distributions.html
# - http://gpss.cc/gpss13/assets/lab1.pdf
#
# To run it yourself, install the [GPy](https://github.com/SheffieldML/GPy) package with `pip install GPy` and all other packages imported below.
#
# References:
# - [Ebden 2008](https://arxiv.org/pdf/1505.02965.pdf)
# - [Do and Li 2008](http://cs229.stanford.edu/section/cs229-gaussian_processes.pdf)
# - [Rasmussen and Williams 2006](http://www.gaussianprocess.org/gpml/chapters/RW2.pdf)

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')

import GPy
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

sns.set(style='white', context='notebook', font_scale=1.4)

# ## Bivariate Gaussian distribution
# Suppose $r_i$ and $r_j$ represent DNase read density at positions $i$ and $j$ on the same chromosome. For illustrative purposes, we will treat them as Gaussian random variables, even though a Gaussian is not the appropriate distribution for count-based data, where we count the number of reads aligned to a position.
#
# If $i$ and $j$ are very far apart, we might expect $r_i$ and $r_j$ to be independent.

# In[2]:

# mean density of r_i and r_j
mean = [10, 5]
# covariance matrix
cov = [(5, 0), (0, 1)]
print('Covariance matrix:')
for row in cov:
    print('\t'.join(map(str, row)))

# draw 1000 samples
samples = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(samples, columns=['$r_i$', '$r_j$'])

# plot the samples
sns.jointplot(x='$r_i$', y='$r_j$', data=df);

# If $i$ and $j$ are very close together, we might expect $r_i$ and $r_j$ to have high covariance and similar means.

# In[3]:

# mean density of r_i and r_j
mean = [5, 5]
# covariance matrix
cov = [(3, 2.75), (2.75, 3.5)]
print('Covariance matrix:')
for row in cov:
    print('\t'.join(map(str, row)))

# draw 1000 samples
samples = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(samples, columns=['$r_i$', '$r_j$'])

# plot the samples
sns.jointplot(x='$r_i$', y='$r_j$', data=df);

# At intermediate distances, perhaps there is still some relationship between $r_i$ and $r_j$, but it is not as strong.

# In[4]:

# mean density of r_i and r_j
mean = [5, 4]
# covariance matrix
cov = [(3, 1), (1, 1)]
print('Covariance matrix:')
for row in cov:
    print('\t'.join(map(str, row)))

# draw 1000 samples
samples = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(samples, columns=['$r_i$', '$r_j$'])

# plot the samples
sns.jointplot(x='$r_i$', y='$r_j$', data=df);
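# As a quick check (a minimal sketch using the `samples` array drawn in the previous cell), the empirical mean and covariance of the 1000 draws should be close to the values we specified. Note that `np.cov` treats rows as variables by default, so we pass the transposed sample matrix.

# In[ ]:

# empirical mean and covariance of the draws from the intermediate-distance example
print('Empirical mean:', samples.mean(axis=0))
print('Empirical covariance matrix:')
print(np.cov(samples.T))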
# ## Multivariate Gaussian distribution
# For the read densities at three positions $i$, $j$, $k$, we can use a multivariate Gaussian distribution over $r_i$, $r_j$, $r_k$. From this distribution we now draw triplets [$r_i$, $r_j$, $r_k$]. Instead of plotting the 3d samples, we'll plot all possible 1d and 2d projections.
#
# Negative entries in the covariance do not make sense in the read density case but help illustrate the multivariate Gaussian.

# In[6]:

# mean density of r_i, r_j, r_k
mean = [5, 4, 3]
# covariance matrix
cov = [(3, 1.5, 0.5), (1.5, 1, -0.3), (0.5, -0.3, 2)]
print('Covariance matrix:')
for row in cov:
    print('\t'.join(map(str, row)))

# draw 1000 samples
samples = np.random.multivariate_normal(mean, cov, 1000)
df = pd.DataFrame(samples, columns=['$r_i$', '$r_j$', '$r_k$'])

# plot the samples
sns.pairplot(data=df);

# ## Kernels
# The examples above manually set all of the entries in the covariance matrix. A kernel function instead provides a way to compute the covariance of any two variables, and it can also be thought of as a similarity function. Kernels are widely used in other areas of machine learning as well.
#
# In the DNase read density case, we want the kernel to capture the same intuition we explored above: when $i$ and $j$ are close, the read densities $r_i$ and $r_j$ are similar. The kernel function provides a mathematical way to define _close_ and _similar_.
#
# We can use the squared-exponential kernel, also known as the Radial Basis Function (RBF) kernel:
#
# $$k(i,j) = \sigma^2 \exp \left[ \dfrac{-(i-j)^2}{2 l^2} \right]$$
#
# Notice that this similarity function depends only on the distance between $i$ and $j$: $$d = |i - j|$$
#
# It has two parameters:
# - $l$ is the **lengthscale** that defines what we mean by _close_
# - $\sigma^2$ is the **variance** that defines what we mean by _similar_
#
# We can use GPy to see what the similarity function looks like for different $l$ and $\sigma^2$.

# In[7]:

# see http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_kernels.ipynb
# define three RBF kernels
kernel1 = GPy.kern.RBF(input_dim=1, variance=1, lengthscale=1)
kernel2 = GPy.kern.RBF(input_dim=1, variance=2, lengthscale=1)
kernel3 = GPy.kern.RBF(input_dim=1, variance=0.5, lengthscale=5)

kernel1.plot(ax=plt.gca(), label='$l$ = 1, $\sigma^2$ = 1', color='purple');
kernel2.plot(ax=plt.gca(), label='$l$ = 1, $\sigma^2$ = 2', color='blue');
kernel3.plot(ax=plt.gca(), label='$l$ = 5, $\sigma^2$ = 0.5', color='black');
plt.xlabel('distance: $|i-j|$');
plt.ylabel('similarity: $k(i,j)$');
plt.legend()
plt.title('Example squared exponential kernel functions');
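# To connect the plots back to the formula, we can evaluate the squared-exponential kernel by hand for a few distances and compare it to what GPy computes for `kernel1` ($l = 1$, $\sigma^2 = 1$). This is a minimal check; the distances are arbitrary.

# In[ ]:

# evaluate k(i,j) = sigma^2 * exp(-(i-j)^2 / (2*l^2)) directly
sigma2 = 1.0
l = 1.0
for d in [0.0, 0.5, 1.0, 2.0, 5.0]:
    k_manual = sigma2 * np.exp(-d**2 / (2 * l**2))
    # GPy kernels expect 2d arrays of shape (n, input_dim)
    k_gpy = kernel1.K(np.array([[0.0]]), np.array([[d]]))[0, 0]
    print('d = {:.1f}\tmanual k = {:.4f}\tGPy k = {:.4f}'.format(d, k_manual, k_gpy))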
# ## Gaussian processes
# Gaussian distributions are distributions over one or more variables. Gaussian processes are distributions over *functions*. In the DNase case, a Gaussian process can define a distribution over functions that map genomic positions $i$ to read densities $r_i$.
#
# A fundamental idea in Gaussian processes is that the kernel function defines the covariance between any pair of positions $i$ and $j$. Given a kernel function, we can then sample *functions* instead of *points* from the distribution. Technically speaking, when we sample a function below we are sampling a vector of many $r_i$ for many positions $i$.

# In[8]:

# see http://gpss.cc/gpss13/assets/lab1.pdf
X = np.arange(1., 1000.)  # 999 points, evenly spaced over [1, 1000)
X = X[:, None]  # reshape X to make it n*D
mu = 10*np.ones(len(X))  # vector of the means, which are all 10
kernel = GPy.kern.RBF(input_dim=1, variance=5, lengthscale=150)  # define the kernel
C = kernel.K(X, X)  # covariance matrix

# sample functions from the prior distribution over functions with mean mu and covariance C
# for the 999 positions i, we sample a vector of the 999 read densities r_i
samples = 5
Z = np.random.multivariate_normal(mu, C, samples)

for i in range(samples):
    plt.plot(X, Z[i, :])
plt.ylabel('$r_i$');
plt.xlabel('$i$');
plt.title('Samples drawn from the prior GP model of the read density');

# One way to view the Gaussian process is as a distribution over functions. Once we sample a function, that function is fully deterministic (unless we also add a noise term for error, which we do below) and we can obtain $r_i$ for any $i$. Another interpretation is that the Gaussian process generalizes the multivariate Gaussian distribution to infinite dimensions. We can sample $N$ points, 999 in the examples above, and obtain a multivariate Gaussian over that many variables. All we need is the kernel function to define the covariance, so we do not need an explicit $N \times N$ covariance matrix in advance.
#
# If we sample more functions and look at the relationship between $r_i$ and $r_j$ for positions $i$ and $j$, we can see the correspondence between sampling functions from the Gaussian process and sampling from the bivariate Gaussians above. For $i = 400$ and $j = 410$, the read densities are very similar. For $i = 400$ and $j = 600$, they are much less similar.

# In[9]:

samples = 100
Z = np.random.multivariate_normal(mu, C, samples)

for i in range(samples):
    plt.plot(X, Z[i, :])
plt.ylabel('$r_i$');
plt.xlabel('$i$');
plt.title('Samples drawn from the prior GP model of the read density');

# plot the samples at positions 400 and 410
df = pd.DataFrame.from_dict({'$r_{400}$': Z[:, 399], '$r_{410}$': Z[:, 409]})
sns.jointplot(x='$r_{400}$', y='$r_{410}$', data=df);

# plot the samples at positions 400 and 600
df = pd.DataFrame.from_dict({'$r_{400}$': Z[:, 399], '$r_{600}$': Z[:, 599]})
sns.jointplot(x='$r_{400}$', y='$r_{600}$', data=df);
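# The kernel quantifies what the joint plots show: the covariance the GP assigns to a pair of positions depends only on their distance. As a quick check (a minimal sketch, not needed for the regression below), we can ask the RBF kernel defined above for the covariance it assigns to positions 400 and 410 versus 400 and 600.

# In[ ]:

# covariance implied by the kernel for a nearby pair and a distant pair of positions
print('k(400, 410) =', kernel.K(np.array([[400.]]), np.array([[410.]]))[0, 0])
print('k(400, 600) =', kernel.K(np.array([[400.]]), np.array([[600.]]))[0, 0])
# with lengthscale 150 and variance 5, the covariance is near 5 at distance 10
# and drops to less than half of that at distance 200, matching the joint plots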
# ## Gaussian process regression
# Gaussian processes can also be used for Bayesian regression. Instead of fitting a single curve through a set of points, we obtain a posterior distribution over functions. Above, we sampled functions from a prior distribution given the kernel parameters. In regression, we can sample from the posterior distribution given our data and the kernel parameters. Often, we want to know the mean and variance of the posterior at all positions $i$.

# In[10]:

# see http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
samples = 25

# our data in the form of (i, ri) points
i = np.random.uniform(1, 1000, (samples, 1))
ri = 10 + np.abs(15*np.sin(i/150)) + 2.5*np.random.randn(samples, 1)

# define a kernel, the same kernel we saw above
kernel = GPy.kern.RBF(input_dim=1, variance=5, lengthscale=150)

# obtain the posterior distribution
# we also have a noise variance parameter for the sampling error
# this is distinct from the variance parameter in the kernel
model = GPy.models.GPRegression(i, ri, kernel, noise_var=17)
display(model)

# display the posterior
fig = model.plot();
plt.ylabel('$r_i$');
plt.xlabel('$i$');
plt.title('GP model of the read density after observing {} data points'.format(samples));
plt.axis([1, 1000, 0, 30]);

# The kernel parameters and the noise variance were set arbitrarily above. We can optimize them by finding the values that maximize the likelihood of our observed data.

# In[11]:

model.optimize(messages=True)
display(model)

# display the posterior with the updated parameters
fig = model.plot();
plt.ylabel('$r_i$');
plt.xlabel('$i$');
plt.title('Optimized GP model after observing {} data points'.format(samples));
plt.axis([1, 1000, 0, 30]);

# This is the core model used to represent DNase read density in PIQ. A Gaussian process models the observed background read density at each position in the genome. Then the background density is adapted to reflect the influence of TF binding, where each TF (unique PWM) imparts its own "shape" on the local read density.

# ## Other kernels
# One reason Gaussian processes are so powerful is that different kernel functions enable us to impose different types of structure on the prior distribution. Moreover, it is possible to combine kernels to obtain new valid kernels. This is not relevant for PIQ, but it is an important component of Gaussian process modeling.

# In[13]:

kernel1 = GPy.kern.PeriodicExponential(input_dim=1, variance=10, lengthscale=1)
kernel2 = GPy.kern.Poly(input_dim=1, variance=1, scale=0.25)
kernel3 = GPy.kern.Add([kernel1, kernel2])

kernel1.plot(ax=plt.gca(), label='periodic', color='purple');
kernel2.plot(ax=plt.gca(), label='polynomial', color='blue');
kernel3.plot(ax=plt.gca(), label='periodic + polynomial', color='black');
plt.legend()
plt.title('Combining kernel functions');
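# To see how a combined kernel shapes the prior, we can repeat the earlier prior-sampling exercise with the periodic + polynomial kernel. This is a minimal sketch: the zero mean, the input range, and the small jitter added to the diagonal (for numerical stability when sampling) are arbitrary choices rather than part of the kernels defined above.

# In[ ]:

# sample a few functions from a GP prior that uses the combined kernel
X2 = np.linspace(0, 10, 200)[:, None]  # a modest input range keeps the polynomial term manageable
mu2 = np.zeros(len(X2))  # zero-mean prior
C2 = kernel3.K(X2, X2) + 1e-6*np.eye(len(X2))  # covariance matrix with a small diagonal jitter

for _ in range(3):
    plt.plot(X2, np.random.multivariate_normal(mu2, C2))
plt.xlabel('$i$');
plt.ylabel('sampled function value');
plt.title('Samples from a GP prior with the periodic + polynomial kernel');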