Abstract: Classical machine learning and statistical approaches to learning, such as neural networks and linear regression, assume a parametric form for functions. Gaussian process models are an alternative approach that assumes a probabilistic prior over functions. This brings benefits, in that uncertainty of function estimation is sustained throughout inference, and some challenges: algorithms for fitting Gaussian processes tend to be more complex than parametric models. In this session I will introduce Gaussian processes and explain why sustaining uncertainty is important.
$$ \newcommand{\Amatrix}{\mathbf{A}} \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} \newcommand{\Kuui}{\Kuu^{-1}} \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} \newcommand{\aMatrix}{\mathbf{A}} \newcommand{\aScalar}{a} \newcommand{\aVector}{\mathbf{a}} \newcommand{\acceleration}{a} \newcommand{\bMatrix}{\mathbf{B}} \newcommand{\bScalar}{b} \newcommand{\bVector}{\mathbf{b}} \newcommand{\basisFunc}{\phi} \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} \newcommand{\basisFunction}{\phi} \newcommand{\basisLocation}{\mu} \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} \newcommand{\basisScalar}{\basisFunction} \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} \newcommand{\activationFunction}{\phi} \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} \newcommand{\activationScalar}{\basisFunction} \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} \newcommand{\bigO}{\mathcal{O}} \newcommand{\binomProb}{\pi} \newcommand{\cMatrix}{\mathbf{C}} \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} \newcommand{\cdataMatrix}{\hat{\dataMatrix}} \newcommand{\cdataScalar}{\hat{\dataScalar}} \newcommand{\cdataVector}{\hat{\dataVector}} \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} \newcommand{\centeredKernelScalar}{b} \newcommand{\centeredKernelVector}{\centeredKernelScalar} \newcommand{\centeringMatrix}{\mathbf{H}} \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} \newcommand{\coregionalizationMatrix}{\mathbf{B}} \newcommand{\coregionalizationScalar}{b} \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} \newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} \newcommand{\covarianceScalar}{c} \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} \newcommand{\croupierScalar}{s} \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} \newcommand{\dataDim}{p} \newcommand{\dataIndex}{i} \newcommand{\dataIndexTwo}{j} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\dataScalar}{y} \newcommand{\dataSet}{\mathcal{D}} \newcommand{\dataStd}{\sigma} \newcommand{\dataVector}{\mathbf{ \dataScalar}} \newcommand{\decayRate}{d} \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} \newcommand{\degreeScalar}{d} \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} % Already defined by latex %\newcommand{\det}[1]{\left|#1\right|} \newcommand{\diag}[1]{\text{diag}\left(#1\right)} \newcommand{\diagonalMatrix}{\mathbf{D}} \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} 
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} \newcommand{\displacement}{x} \newcommand{\displacementVector}{\textbf{\displacement}} \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} \newcommand{\distanceScalar}{d} \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} \newcommand{\eigenvaltwo}{\ell} \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} \newcommand{\eigenvaltwoVector}{\mathbf{l}} \newcommand{\eigenvalue}{\lambda} \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} \newcommand{\eigenvectorMatrix}{\mathbf{U}} \newcommand{\eigenvectorScalar}{u} \newcommand{\eigenvectwo}{\mathbf{v}} \newcommand{\eigenvectwoMatrix}{\mathbf{V}} \newcommand{\eigenvectwoScalar}{v} \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} \newcommand{\errorFunction}{E} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} \newcommand{\expectedDistanceMatrix}{\mathcal{D}} \newcommand{\eye}{\mathbf{I}} \newcommand{\fantasyDim}{r} \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} \newcommand{\fantasyScalar}{z} \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} \newcommand{\featureStd}{\varsigma} \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\given}{|} \newcommand{\half}{\frac{1}{2}} \newcommand{\heaviside}{H} \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} \newcommand{\hiddenScalar}{h} \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} \newcommand{\identityMatrix}{\eye} \newcommand{\inducingInputScalar}{z} \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} \newcommand{\inducingInputMatrix}{\mathbf{Z}} \newcommand{\inducingScalar}{u} \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} \newcommand{\inducingMatrix}{\mathbf{U}} \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} \newcommand{\inputDim}{q} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputSpace}{\mathcal{X}} \newcommand{\inputVals}{\inputVector} \newcommand{\inputVector}{\mathbf{ \inputScalar}} \newcommand{\iterNum}{k} \newcommand{\kernel}{\kernelScalar} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\kernelScalar}{k} \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} \newcommand{\lagrangeMultiplier}{\lambda} \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} \newcommand{\lagrangian}{L} \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} \newcommand{\laplacianFactorScalar}{m} \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} \newcommand{\laplacianMatrix}{\mathbf{L}} \newcommand{\laplacianScalar}{\ell} \newcommand{\laplacianVector}{\mathbf{ \ell}} \newcommand{\latentDim}{q} \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} 
\newcommand{\latentDistanceScalar}{\delta} \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} \newcommand{\latentForce}{f} \newcommand{\latentFunction}{u} \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} \newcommand{\latentIndex}{j} \newcommand{\latentScalar}{z} \newcommand{\latentVector}{\mathbf{ \latentScalar}} \newcommand{\latentMatrix}{\mathbf{Z}} \newcommand{\learnRate}{\eta} \newcommand{\lengthScale}{\ell} \newcommand{\rbfWidth}{\ell} \newcommand{\likelihoodBound}{\mathcal{L}} \newcommand{\likelihoodFunction}{L} \newcommand{\locationScalar}{\mu} \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} \newcommand{\locationMatrix}{\mathbf{M}} \newcommand{\variance}[1]{\text{var}\left( #1 \right)} \newcommand{\mappingFunction}{f} \newcommand{\mappingFunctionMatrix}{\mathbf{F}} \newcommand{\mappingFunctionTwo}{g} \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} \newcommand{\scaleScalar}{s} \newcommand{\mappingScalar}{w} \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} \newcommand{\mappingMatrix}{\mathbf{W}} \newcommand{\mappingScalarTwo}{v} \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} \newcommand{\mappingMatrixTwo}{\mathbf{V}} \newcommand{\maxIters}{K} \newcommand{\meanMatrix}{\mathbf{M}} \newcommand{\meanScalar}{\mu} \newcommand{\meanTwoMatrix}{\mathbf{M}} \newcommand{\meanTwoScalar}{m} \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} \newcommand{\mrnaConcentration}{m} \newcommand{\naturalFrequency}{\omega} \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} \newcommand{\neilurl}{http://inverseprobability.com/} \newcommand{\noiseMatrix}{\boldsymbol{ E}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} \newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} \newcommand{\numActive}{m} \newcommand{\numBasisFunc}{m} \newcommand{\numComponents}{m} \newcommand{\numComps}{K} \newcommand{\numData}{n} \newcommand{\numFeatures}{K} \newcommand{\numHidden}{h} \newcommand{\numInducing}{m} \newcommand{\numLayers}{\ell} \newcommand{\numNeighbors}{K} \newcommand{\numSequences}{s} \newcommand{\numSuccess}{s} \newcommand{\numTasks}{m} \newcommand{\numTime}{T} \newcommand{\numTrials}{S} \newcommand{\outputIndex}{j} \newcommand{\paramVector}{\boldsymbol{ \theta}} \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} \newcommand{\parameterScalar}{\theta} \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} \newcommand{\precisionScalar}{j} \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} \newcommand{\precisionMatrix}{\mathbf{J}} \newcommand{\pseudotargetScalar}{\widetilde{y}} \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} \newcommand{\rank}[1]{\text{rank}\left(#1\right)} \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} \newcommand{\responsibility}{r} \newcommand{\rotationScalar}{r} \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} \newcommand{\rotationMatrix}{\mathbf{R}} 
\newcommand{\sampleCovScalar}{s} \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} \newcommand{\sampleCovMatrix}{\mathbf{s}} \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} \newcommand{\sign}[1]{\text{sign}\left(#1\right)} \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} \newcommand{\singularvalue}{\ell} \newcommand{\singularvalueMatrix}{\mathbf{L}} \newcommand{\singularvalueVector}{\mathbf{l}} \newcommand{\sorth}{\mathbf{u}} \newcommand{\spar}{\lambda} \newcommand{\trace}[1]{\text{tr}\left(#1\right)} \newcommand{\BasalRate}{B} \newcommand{\DampingCoefficient}{C} \newcommand{\DecayRate}{D} \newcommand{\Displacement}{X} \newcommand{\LatentForce}{F} \newcommand{\Mass}{M} \newcommand{\Sensitivity}{S} \newcommand{\basalRate}{b} \newcommand{\dampingCoefficient}{c} \newcommand{\mass}{m} \newcommand{\sensitivity}{s} \newcommand{\springScalar}{\kappa} \newcommand{\springVector}{\boldsymbol{ \kappa}} \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} \newcommand{\tfConcentration}{p} \newcommand{\tfDecayRate}{\delta} \newcommand{\tfMrnaConcentration}{f} \newcommand{\tfVector}{\mathbf{ \tfConcentration}} \newcommand{\velocity}{v} \newcommand{\sufficientStatsScalar}{g} \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} \newcommand{\sufficientStatsMatrix}{\mathbf{G}} \newcommand{\switchScalar}{s} \newcommand{\switchVector}{\mathbf{ \switchScalar}} \newcommand{\switchMatrix}{\mathbf{S}} \newcommand{\tr}[1]{\text{tr}\left(#1\right)} \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} \newcommand{\vScalar}{v} \newcommand{\vVector}{\mathbf{v}} \newcommand{\vMatrix}{\mathbf{V}} \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} % Already defined by latex %\newcommand{\vec}{#1:} \newcommand{\vecb}[1]{\left(#1\right):} \newcommand{\weightScalar}{w} \newcommand{\weightVector}{\mathbf{ \weightScalar}} \newcommand{\weightMatrix}{\mathbf{W}} \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} \newcommand{\weightedAdjacencyScalar}{a} \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} \newcommand{\onesVector}{\mathbf{1}} \newcommand{\zerosVector}{\mathbf{0}} $$[\small{[edit]}]{style="text-align:right"} ###
[@Rasmussen:book06]{style="text-align:right"}
$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$
- To combine data with a model we need:
  - a prediction function $\mappingFunction(\cdot)$ that encodes our beliefs about the regularities of the universe
  - an objective function $\errorFunction(\cdot)$ that defines the cost of misprediction.
### Neural Networks and Prediction Functions
$\mappingFunction(\cdot)$ is a scalar function with vector inputs,
$\activationVector(\cdot)$ is a vector function with vector inputs.
The dimensionality of the vector function is known as the number of hidden units, or the number of neurons.
The elements of $\activationVector(\cdot)$ are the activation functions of the neural network.
The elements of $\mappingMatrix_{1}$ are the parameters of the activation functions.
In statistics, activation functions are known as basis functions.
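Putting these pieces together, the prediction function takes the form (written out in full later in this section)
$$ \mappingFunction(\inputVector) = \left.\mappingVector^{(2)}\right.^\top \activationVector(\mappingMatrix_1, \inputVector). $$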
In statistics we would think of this as a linear model: the predictions are not linear, but the model is linear in the parameters,
and $\mappingMatrix_{1}$ would be treated as fixed (static) parameters.
In machine learning we optimize $\mappingMatrix_{1}$ as well as $\mappingMatrix_{2}$ (which would normally be denoted in statistics by $\boldsymbol{\beta}$).
Revisit that decision: follow the path of @Neal:bayesian94 and @MacKay:bayesian92.
Consider the probabilistic approach.
### Probabilistic Modelling
. . .
$\mappingMatrix$ contains $\mappingMatrix_1$ and $\mappingMatrix_2$.
$p(\mappingMatrix | \dataVector, \inputMatrix)$ is the posterior density.
$p(\dataScalar|\inputVector, \mappingMatrix)$ is the likelihood of a single data point.
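These quantities are related through Bayes' rule,
$$ p(\mappingMatrix | \dataVector, \inputMatrix) = \frac{p(\dataVector|\inputMatrix, \mappingMatrix)\,p(\mappingMatrix)}{p(\dataVector|\inputMatrix)}, $$
so the posterior combines the likelihood with a prior over the parameters.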
. . .
Normally we assume independence of the observations given the parameters: $$ p(\dataVector|\inputMatrix, \mappingMatrix) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \mappingMatrix).$$
We can also consider priors over the inputs (treating them as latent variables): $$ p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \mappingMatrix) p(\mappingMatrix | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \mappingMatrix \text{d} \inputMatrix \text{d}\inputMatrix_* $$
This gives unsupervised learning.
Data: $\dataVector$
Model: $p(\dataVector, \dataVector^*)$
Prediction: $p(\dataVector^*| \dataVector)$
### Graphical Models
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[3, 1],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
pgm.add_node(daft.Node("y_1", r"$y_1$", 0.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_2", r"$y_2$", 1.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_3", r"$y_3$", 2.5, 0.5, fixed=False))
pgm.add_edge("y_1", "y_2")
pgm.add_edge("y_2", "y_3")
pgm.render().figure.savefig("../slides/diagrams/ml/markov.svg", transparent=True)
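The chain drawn by this snippet encodes the Markov factorization
$$ p(y_1, y_2, y_3) = p(y_1)\,p(y_2|y_1)\,p(y_3|y_2). $$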
Predict Perioperative Risk of Clostridium Difficile Infection Following Colon Surgery [@Steele:predictive12]
Easy to write in probabilities
But underlying this is a wealth of computational challenges.
High dimensional integrals typically require approximation.
In statistics, the focus is more on the linear model implied by $$ \mappingFunction(\inputVector) = \left.\mappingVector^{(2)}\right.^\top \activationVector(\mappingMatrix_1, \inputVector) $$
Hold $\mappingMatrix_1$ fixed for a given analysis.
Gaussian prior for $\mappingVector^{(2)}$, $$ \mappingVector^{(2)} \sim \gaussianSamp{\zerosVector}{\covarianceMatrix}. $$ $$ \dataScalar_i = \mappingFunction(\inputVector_i) + \noiseScalar_i, $$ where $$ \noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2} $$
### Linear Gaussian Models
### Recall Univariate Gaussian Properties

1. Sum of Gaussian-distributed variables is also Gaussian-distributed.

   $$\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData\dataStd_i^2}$$

2. Scaling a Gaussian leads to a Gaussian.

   $$\mappingScalar\dataScalar\sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \dataStd^2}$$

[If]{style="text-align:left"} $$\inputVector \sim \gaussianSamp{\meanVector}{\covarianceMatrix}$$

[And]{style="text-align:left"} $$\dataVector= \mappingMatrix\inputVector$$

[Then]{style="text-align:left"} $$\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top}$$
### Multivariate Gaussian Properties
If $$ \dataVector = \mappingMatrix \inputVector + \noiseVector, $$
Assume $$ \begin{align} \inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\ \noiseVector & \sim \gaussianSamp{\zerosVector}{\covarianceMatrixTwo} \end{align} $$
Then $$ \dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}. $$ If $\covarianceMatrixTwo=\dataStd^2\eye$, this is Probabilistic Principal Component Analysis [@Tipping:probpca99], because we integrated out the inputs (or latent variables they would be called in that case).
Define the design matrix $$ \activationMatrix = \begin{bmatrix} \activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\ \activationScalar_{2, 1} & \activationScalar_{2, 2} & \dots & \activationScalar_{2, \numHidden} \\ \vdots & \vdots & \ddots & \vdots \\ \activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden} \end{bmatrix}. $$
$$\dataVector = \activationMatrix\mappingVector + \noiseVector$$

$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$$

Instead of making i.i.d. assumptions about the density of each data point $\dataScalar_i$,
we make a joint Gaussian assumption over our data.
The covariance matrix is now a function of both the parameters of the activation functions, $\mappingMatrix_1$, and the input variables, $\inputMatrix$.
This arises from integrating out $\mappingVector^{(2)}$.
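Applying the multivariate Gaussian property above to $\dataVector = \activationMatrix\mappingVector + \noiseVector$ with $\mappingVector \sim \gaussianSamp{\zerosVector}{\covarianceMatrix}$ gives
$$ \dataVector \sim \gaussianSamp{\zerosVector}{\activationMatrix\covarianceMatrix\activationMatrix^\top + \dataStd^2\eye}, $$
so the covariance of the data depends on $\mappingMatrix_1$ and $\inputMatrix$ only through the design matrix $\activationMatrix$.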
### Non-degenerate Gaussian Processes
If $$ \begin{align*} \mappingVector^{(1)} & \sim p(\cdot)\\ \phi_i & = \activationScalar\left(\mappingVector^{(1)}, \inputVector_i\right), \end{align*} $$ has finite variance,
then, taking the number of hidden units to infinity, the resulting function is also drawn from a Gaussian process.
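As a rough numerical sketch of this limit (not from the original notebooks; it assumes $\tanh$ activations and output weights scaled by $1/\sqrt{m}$ so the output variance stays finite):

```python
import numpy as np

np.random.seed(0)
x = np.array([0.5, -0.3])                        # a fixed input point
num_samples = 2000                               # prior draws of the network
for num_hidden in [1, 10, 1000]:
    # draw hidden-layer weights and activations for each prior sample
    W1 = np.random.randn(num_samples, num_hidden, x.shape[0])
    phi = np.tanh(W1 @ x)                        # shape (num_samples, num_hidden)
    # output weights scaled by 1/sqrt(num_hidden) keep the variance finite
    w2 = np.random.randn(num_samples, num_hidden) / np.sqrt(num_hidden)
    f = np.sum(w2 * phi, axis=1)                 # prior draws of f(x)
    print(num_hidden, float(f.mean()), float(f.var()))
```

As `num_hidden` grows, the histogram of `f` looks increasingly Gaussian, which is the central-limit argument behind Neal's result.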
Chapter 2 of Neal's thesis [@Neal:bayesian94]
Rest of Neal's thesis. [@Neal:bayesian94]
David MacKay's PhD thesis [@MacKay:bayesian92]
from mlai import Kernel
from mlai import eq_cov
kernel = Kernel(function=eq_cov,
name='Exponentiated Quadratic',
shortname='eq',
lengthscale=0.25)
import numpy as np
np.random.seed(10)
import teaching_plots as plot
plot.rejection_samples(kernel=kernel,
diagrams='../slides/diagrams/gp')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.png',
directory='../slides/diagrams/gp',
sample=IntSlider(1,1,5,1))
import numpy as np
np.random.seed(4949)
### Multi-variate Gaussians
import teaching_plots as plot
from mlai import Kernel, exponentiated_quadratic
kernel=Kernel(function=exponentiated_quadratic, lengthscale=0.5)
plot.two_point_sample(kernel.K, diagrams='../slides/diagrams/gp')
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(0, 0, 8, 1))
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(9, 9, 12, 1))
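These two-point sample figures illustrate conditioning within a bivariate Gaussian. For a zero-mean pair $\mappingFunction_1, \mappingFunction_2$ with covariance $\kernelMatrix = \begin{bmatrix} \kernelScalar_{1,1} & \kernelScalar_{1,2} \\ \kernelScalar_{2,1} & \kernelScalar_{2,2}\end{bmatrix}$, observing $\mappingFunction_1$ gives
$$ p(\mappingFunction_2 | \mappingFunction_1) = \gaussianDist{\mappingFunction_2}{\frac{\kernelScalar_{1,2}}{\kernelScalar_{1,1}}\mappingFunction_1}{\kernelScalar_{2,2} - \frac{\kernelScalar_{1,2}^2}{\kernelScalar_{1,1}}}. $$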
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(13, 13, 17, 1))
Posterior mean $$\mappingFunction_D(\inputVector_*) = \kernelVector(\inputVector_*, \inputMatrix) \kernelMatrix^{-1} \dataVector$$
Posterior covariance $$\mathbf{C}_* = \kernelMatrix_{*,*} - \kernelMatrix_{*,\mappingFunctionVector} \kernelMatrix^{-1} \kernelMatrix_{\mappingFunctionVector, *}$$
The posterior mean can be written as a function parameterized by $\boldsymbol{\alpha} = \kernelMatrix^{-1}\dataVector$: $$\mappingFunction_D(\inputVector_*) = \kernelVector(\inputVector_*, \inputMatrix) \boldsymbol{\alpha}$$
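A minimal numpy sketch of these prediction equations (illustrative names and a toy exponentiated quadratic kernel, not the `mlai` implementation; a small jitter/noise term is added to the diagonal for numerical stability):

```python
import numpy as np

def eq_cov(x, x_prime, variance=1.0, lengthscale=0.2):
    # exponentiated quadratic covariance between two scalar inputs
    return variance * np.exp(-0.5 * ((x - x_prime) / lengthscale) ** 2)

def kernel_matrix(X, X2, **kwargs):
    return np.array([[eq_cov(xi, xj, **kwargs) for xj in X2] for xi in X])

def gp_posterior(X, y, X_star, noise=1e-6, **kwargs):
    K = kernel_matrix(X, X, **kwargs) + noise * np.eye(len(X))
    K_star = kernel_matrix(X_star, X, **kwargs)   # k(x_*, X)
    alpha = np.linalg.solve(K, y)                 # alpha = K^{-1} y
    mean = K_star @ alpha                         # posterior mean f_D(x_*)
    v = np.linalg.solve(K, K_star.T)
    cov = kernel_matrix(X_star, X_star, **kwargs) - K_star @ v  # posterior covariance C_*
    return mean, cov
```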
from mlai import Kernel
from mlai import eq_cov
kernel = Kernel(function=eq_cov,
name='Exponentiated Quadratic',
shortname='eq',
formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\ltwoNorm{\inputVector-\inputVector^\prime}^2}{2\lengthScale^2}\right)',
lengthscale=0.2)
import teaching_plots as plot
plot.covariance_func(kernel=kernel, diagrams='../slides/diagrams/kern/')
\includesvgclass{../slides/diagrams/kern/eq_covariance.svg}

- Gold medal times for Olympic Marathon since 1896.
- Marathons before 1924 didn't have a standardised distance.
- Present results using pace per km.
- In 1904 the Marathon was badly organised, leading to very slow times.

Image from Wikimedia Commons http://bit.ly/16kMKHQ
import matplotlib.pyplot as plt
import teaching_plots as plot
import mlai
xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/datasets/olympic-marathon.svg',
transparent=True,
frameon=True)
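The plotting code below assumes a fitted model `m_full`; the fitting step is not shown in this extract. A minimal sketch of how such a model might be constructed with GPy (assuming `x` and `yhat` hold the data prepared above) is:

```python
import GPy

# a minimal sketch, not the original notebook code: GP regression with an
# exponentiated quadratic (RBF) covariance on the normalized marathon data
kern = GPy.kern.RBF(input_dim=1, lengthscale=20.0)  # illustrative starting length scale
m_full = GPy.models.GPRegression(x, yhat, kern)
_ = m_full.optimize()  # maximize the log marginal likelihood
```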
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/olympic-marathon-gp.svg',
transparent=True, frameon=True)
Can we determine covariance parameters from the data?
The parameters are inside the covariance function (matrix). \normalsize $$\kernelScalar_{i, j} = \kernelScalar(\inputVals_i, \inputVals_j; \parameterVector)$$
$$\kernelMatrix = \rotationMatrix \eigenvalueMatrix^2 \rotationMatrix^\top$$

$\eigenvalueMatrix$ represents distance on the axes. $\rotationMatrix$ gives the rotation.
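The objective used to determine these parameters (computed in the code further below as `err`, split into `err_log_det` and `err_fit`) is the negative log marginal likelihood,
$$ E(\parameterVector) = \frac{1}{2}\log \det{\kernelMatrix} + \frac{1}{2}\dataVector^\top\kernelMatrix^{-1}\dataVector + \text{const}, $$
where the first term penalizes complexity in the covariance and the second measures the quality of the data fit.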
import matplotlib.pyplot as plt
import numpy as np
import os
import teaching_plots as plot
import mlai

diagrams = '../slides/diagrams/gp'  # output directory, as defined later in this section
fill_color = [1., 1., 0.]
black_color = [0., 0., 0.]
blue_color = [0., 0., 1.]
magenta_color = [1., 0., 1.]
fig, ax = plt.subplots(figsize=plot.big_figsize)
ax.set_axis_off()
cax = fig.add_axes([0., 0., 1., 1.])
cax.set_axis_off()
counter = 1
lambda1 = 0.5
lambda2 = 0.3
cax.set_xlim([0., 1.])
cax.set_ylim([0., 1.])
# Matrix label axes
tax2 = fig.add_axes([0, 0.47, 0.1, 0.1])
tax2.set_xlim([0, 1.])
tax2.set_ylim([0, 1.])
tax2.set_axis_off()
label_eigenvalue = tax2.text(0.5, 0.5, "\Large $\eigenvalueMatrix=$")
ax = fig.add_axes([0.5, 0.25, 0.5, 0.5])
ax.set_xlim([-0.25, 0.6])
ax.set_ylim([-0.25, 0.6])
from matplotlib.patches import Polygon
pat_hand = ax.add_patch(Polygon(np.column_stack(([0, 0, lambda1, lambda1],
[0, lambda2, lambda2, 0])),
facecolor=fill_color,
edgecolor=black_color,
visible=False))
data = pat_hand.get_path().vertices
rotation_matrix = np.asarray([[np.sqrt(2)/2, -np.sqrt(2)/2],
[np.sqrt(2)/2, np.sqrt(2)/2]])
new = np.dot(rotation_matrix,data.T)
pat_hand = ax.add_patch(Polygon(np.column_stack(([0, 0, lambda1, lambda1],
[0, lambda2, lambda2, 0])),
facecolor=fill_color,
edgecolor=black_color,
visible=False))
pat_hand_rot = ax.add_patch(Polygon(new.T,
facecolor=fill_color,
edgecolor=black_color))
pat_hand_rot.set(visible=False)
# 3D box
pat_hand3 = [ax.add_patch(Polygon(np.column_stack(([0, -0.2*lambda1, 0.8*lambda1, lambda1],
[0, -0.2*lambda2, -0.2*lambda2, 0])),
facecolor=fill_color,
edgecolor=black_color))]
pat_hand3.append(ax.add_patch(Polygon(np.column_stack(([0, -0.2*lambda1, -0.2*lambda1, 0],
[0, -0.2*lambda2, 0.8*lambda2, lambda2])),
facecolor=fill_color,
edgecolor=black_color)))
pat_hand3.append(ax.add_patch(Polygon(np.column_stack(([-0.2*lambda1, 0, lambda1, 0.8*lambda1],
[0.8*lambda2, lambda2, lambda2, 0.8*lambda2])),
facecolor=fill_color,
edgecolor=black_color)))
pat_hand3.append(ax.add_patch(Polygon(np.column_stack(([lambda1, 0.8*lambda1, 0.8*lambda1, lambda1],
[lambda2, 0.8*lambda2, -0.2*lambda2, 0])),
facecolor=fill_color,
edgecolor=black_color)))
for hand in pat_hand3:
hand.set(visible=False)
ax.set_aspect('equal')
ax.set_axis_off()
xlim = ax.get_xlim()
ylim = ax.get_ylim()
xspan = xlim[1] - xlim[0]
yspan = ylim[1] - ylim[0]
#ar_one = ax.arrow([0, lambda1], [0, 0])
ar_one = ax.arrow(x=0, y=0, dx=lambda1, dy=0)
#ar_two = ax.arrow([0, 0], [0, lambda2])
ar_two = ax.arrow(x=0, y=0, dx=0, dy=lambda2)
#ar_three = ax.arrow([0, -0.2*lambda1], [0, -0.2*lambda2])
ar_three = ax.arrow(x=0, y=0, dx=-0.2*lambda1, dy=-0.2*lambda2)
ar_one_text = ax.text(0.5*lambda1, -0.05*yspan,
'$\eigenvalue_1$',
horizontalalignment='center')
ar_two_text = ax.text(-0.05*xspan, 0.5*lambda2,
'$\eigenvalue_2$',
horizontalalignment='center')
ar_three_text = ax.text(-0.05*xspan-0.1*lambda1, -0.1*lambda2+0.05*yspan,
'$\eigenvalue_3$',
horizontalalignment='center')
ar_one.set(linewidth=3, visible=False, color=blue_color)
ar_one_text.set(visible=False)
ar_two.set(linewidth=3, visible=False, color=blue_color)
ar_two_text.set(visible=False)
ar_three.set(linewidth=3, visible=False, color=blue_color)
ar_three_text.set(visible=False)
matrix_ax = fig.add_axes([0.2, 0.35, 0.3, 0.3])
matrix_ax.set_aspect('equal')
matrix_ax.set_axis_off()
eigenvals = [['$\eigenvalue_1$', '$0$'],['$0$', '$\eigenvalue_2$']]
plot.matrix(eigenvals,
matrix_ax,
bracket_style='square',
type='entries',
bracket_color=black_color)
# First arrow
matrix_ax.cla()
plot.matrix(eigenvals,
matrix_ax,
bracket_style='square',
type='entries',
highlight=True,
highlight_row=[0, 0],
highlight_col=':',
highlight_color=magenta_color,
bracket_color=black_color)
ar_one.set(visible=True)
ar_one_text.set(visible=True)
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
# Second arrow
matrix_ax.cla()
plot.matrix(eigenvals,
matrix_ax,
bracket_style='square',
type='entries',
highlight=True,
highlight_row=[1,1],
highlight_col=':',
highlight_color=magenta_color,
bracket_color=black_color)
ar_two.set(visible=True)
ar_two_text.set(visible=True)
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
matrix_ax.cla()
plot.matrix(eigenvals, matrix_ax,
bracket_style='square',
type='entries',
bracket_color=black_color)
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
tax = fig.add_axes([0.1, 0.1, 0.8, 0.1])
tax.set_axis_off()
tax.set_xlim([0, 1])
tax.set_ylim([0, 1])
det_text = tax.text(0.5, 0.5,
r'\Large $\det{\eigenvalueMatrix} = \eigenvalue_1 \eigenvalue_2$',
horizontalalignment='center')
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
plt.sca(ax)
pat_hand.set(visible=True)
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
det_text_plot = ax.text(0.5*lambda1,
0.5*lambda2,
r'\Large $\det{\eigenvalueMatrix}$',
horizontalalignment='center')
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
eigenvals2 = [[r'$\eigenvalue_1$', '$0$', '$0$'],
              ['$0$', r'$\eigenvalue_2$', '$0$'],
              ['$0$', '$0$', r'$\eigenvalue_3$']]
plt.sca(matrix_ax)
matrix_ax.cla()
plot.matrix(eigenvals2, matrix_ax,
bracket_style='square',
type='entries',
highlight=True,
highlight_row=[2,2],
highlight_col=':',
highlight_color=magenta_color)
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
ar_three.set(visible=True)
ar_three_text.set(visible=True)
for hand in pat_hand3:
    hand.set(visible=True)
det_text.set(text=r'\Large $\det{\eigenvalueMatrix} = \eigenvalue_1 \eigenvalue_2\eigenvalue_3$')
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
matrix_ax.cla()
plot.matrix(eigenvals,
matrix_ax,
bracket_style='square',
type='entries',
bracket_color=black_color)
ar_three.set(visible=False)
ar_three_text.set(visible=False)
for hand in pat_hand3:
    hand.set(visible=False)
det_text.set(text=r'\Large $\det{\eigenvalueMatrix} = \eigenvalue_1 \eigenvalue_2$')
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
det_text.set(text=r'\Large $\det{\rotationMatrix\eigenvalueMatrix} = \eigenvalue_1 \eigenvalue_2$')
label_eigenvalue.set(text=r'\Large $\rotationMatrix\eigenvalueMatrix=$')
# rotate_object is assumed to be a helper (e.g. from teaching_plots) that applies
# the given rotation matrix to the coordinates of a matplotlib artist.
rotate_object(rotation_matrix, ar_one)
rotate_object(rotation_matrix, ar_one_text)
rotate_object(rotation_matrix, ar_two)
rotate_object(rotation_matrix, ar_two_text)
rotate_object(rotation_matrix, det_text_plot)
pat_hand_rot.set(visible=True)
pat_hand.set(visible=False)
W = [['$\mappingScalar_{1, 1}$', '$\mappingScalar_{1, 2}$'],[ '$\mappingScalar_{2, 1}$', '$\mappingScalar_{2, 2}$']]
plot.matrix(W,
matrix_ax,
bracket_style='square',
type='entries',
bracket_color=black_color)
file_name = 'gp-optimise-determinant{counter:>3}'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name), transparent=True)
counter += 1
import matplotlib.pyplot as plt
import numpy as np
import os
import GPy
import teaching_plots as plot
import mlai
import gp_tutorial
np.random.seed(125)
diagrams = '../slides/diagrams/gp'
black_color=[0., 0., 0.]
red_color=[1., 0., 0.]
blue_color=[0., 0., 1.]
magenta_color=[1., 0., 1.]
fontsize=18
y_lim = [-2.2, 2.2]
y_ticks = [-2, -1, 0, 1, 2]
x_lim = [-2, 2]
x_ticks = [-2, -1, 0, 1, 2]
err_y_lim = [-12, 20]
linewidth=3
markersize=15
markertype='.'
x = np.linspace(-1, 1, 6)[:, np.newaxis]
xtest = np.linspace(x_lim[0], x_lim[1], 200)[:, np.newaxis]
# True data
true_kern = GPy.kern.RBF(1) + GPy.kern.White(1)
true_kern.rbf.lengthscale = 1.0
true_kern.white.variance = 0.01
K = true_kern.K(x)
y = np.random.multivariate_normal(np.zeros((6,)), K, 1).T
# Fitted model
kern = GPy.kern.RBF(1) + GPy.kern.White(1)
kern.rbf.lengthscale = 1.0
kern.white.variance = 0.01
lengthscales = np.asarray([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 100])
fig1, ax1 = plt.subplots(figsize=plot.one_figsize)
fig2, ax2 = plt.subplots(figsize=plot.one_figsize)
line = ax2.semilogx(np.NaN, np.NaN, 'x-',
color=black_color)
ax2.set_ylim(err_y_lim)
ax2.set_xlim([0.025, 32])
ax2.grid(True)
ax2.set_xticks([0.01, 0.1, 1, 10, 100])
ax2.set_xticklabels(['$10^{-2}$', '$10^{-1}$', '$10^0$', '$10^1$', '$10^2$'])
err = np.zeros_like(lengthscales)
err_log_det = np.zeros_like(lengthscales)
err_fit = np.zeros_like(lengthscales)
counter = 0
for i, ls in enumerate(lengthscales):
kern.rbf.lengthscale=ls
K = kern.K(x)
invK, L, Li, log_det_K = GPy.util.linalg.pdinv(K)
err[i] = 0.5*(log_det_K + np.dot(np.dot(y.T,invK),y))
err_log_det[i] = 0.5*log_det_K
err_fit[i] = 0.5*np.dot(np.dot(y.T,invK), y)
Kx = kern.K(x, xtest)
ypred_mean = np.dot(np.dot(Kx.T, invK), y)
ypred_var = kern.Kdiag(xtest) - np.sum((np.dot(Kx.T,invK))*Kx.T, 1)
ypred_sd = np.sqrt(ypred_var)
ax1.clear()
_ = gp_tutorial.gpplot(xtest.flatten(),
ypred_mean.flatten(),
ypred_mean.flatten()-2*ypred_sd.flatten(),
ypred_mean.flatten()+2*ypred_sd.flatten(),
ax=ax1)
x_lim = ax1.get_xlim()
ax1.set_ylabel('$f(x)$', fontsize=fontsize)
ax1.set_xlabel('$x$', fontsize=fontsize)
p = ax1.plot(x, y, markertype, color=black_color, markersize=markersize, linewidth=linewidth)
ax1.set_ylim(y_lim)
ax1.set_xlim(x_lim)
ax1.set_xticks(x_ticks)
#ax.set(box=False)
ax1.plot([x_lim[0], x_lim[0]], y_lim, color=black_color)
ax1.plot(x_lim, [y_lim[0], y_lim[0]], color=black_color)
file_name = 'gp-optimise{counter:0>3}.svg'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name),
figure=fig1,
transparent=True)
counter += 1
ax2.clear()
t = ax2.semilogx(lengthscales[0:i+1], err[0:i+1], 'x-',
color=magenta_color,
markersize=markersize,
linewidth=linewidth)
t2 = ax2.semilogx(lengthscales[0:i+1], err_log_det[0:i+1], 'x-',
color=blue_color,
markersize=markersize,
linewidth=linewidth)
t3 = ax2.semilogx(lengthscales[0:i+1], err_fit[0:i+1], 'x-',
color=red_color,
markersize=markersize,
linewidth=linewidth)
ax2.set_ylim(err_y_lim)
ax2.set_xlim([0.025, 32])
ax2.set_xticks([0.01, 0.1, 1, 10, 100])
ax2.set_xticklabels(['$10^{-2}$', '$10^{-1}$', '$10^0$', '$10^1$', '$10^2$'])
ax2.grid(True)
ax2.set_ylabel('negative log likelihood', fontsize=fontsize)
ax2.set_xlabel('length scale, $\ell$', fontsize=fontsize)
file_name = 'gp-optimise{counter:0>3}.svg'.format(counter=counter)
mlai.write_figure(os.path.join(diagrams, file_name),
figure=fig2,
transparent=True)
counter += 1
#ax.set_box(False)
xlim = ax2.get_xlim()
ax2.plot([xlim[0], xlim[0]], err_y_lim, color=black_color)
ax2.plot(xlim, [err_y_lim[0], err_y_lim[0]], color=black_color)
\includesvg{../slides/diagrams/gp/gp-optimise001}
\includesvg{../slides/diagrams/gp/gp-optimise003}
\includesvg{../slides/diagrams/gp/gp-optimise005}
\includesvg{../slides/diagrams/gp/gp-optimise007}
\includesvg{../slides/diagrams/gp/gp-optimise009}
\includesvg{../slides/diagrams/gp/gp-optimise011}
\includesvg{../slides/diagrams/gp/gp-optimise013}
\includesvg{../slides/diagrams/gp/gp-optimise015}
\includesvg{../slides/diagrams/gp/gp-optimise017}
\includesvg{../slides/diagrams/gp/gp-optimise019}
\includesvg{../slides/diagrams/gp/gp-optimise021}
import matplotlib.pyplot as plt
import teaching_plots as plot
import mlai
xlim = (-20,260)
ylim = (5, 7.5)
yhat = (y-offset)/scale
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('time/min', fontsize=20)
ax.set_ylabel('expression', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/datasets/della-gatta-gene.svg',
transparent=True,
frameon=True)
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/della-gatta-gene-gp.svg',
transparent=True, frameon=True)
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full2, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full2.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/della-gatta-gene-gp2.svg',
transparent=True, frameon=True)
import teaching_plots as plot
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full3, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full3.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
filename='../slides/diagrams/gp/della-gatta-gene-gp3.svg',
transparent=True, frameon=True)
import teaching_plots as plot
plot.multiple_optima(diagrams='../slides/diagrams/gp')
[@Andrade:consistent14; @Mubangizi:malaria14]
from mlai import Kernel
from mlai import linear_cov
from mlai import eq_cov
from mlai import add_cov
kernel = Kernel(function=add_cov,
name='Additive',
shortname='add',
formula='\kernelScalar_f(\inputVector, \inputVector^\prime) = \kernelScalar_g(\inputVector, \inputVector^\prime) + \kernelScalar_h(\inputVector, \inputVector^\prime)',
kerns=[linear_cov, eq_cov],
kern_args=[{'variance': 25}, {'lengthscale' : 0.2}])
import teaching_plots as plot
plot.covariance_func(kernel=kernel, diagrams='../slides/diagrams/kern/')
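The additive form follows directly from independence of the component processes: if $\mappingFunction = g + h$ with $g$ and $h$ independent and zero mean, then
$$ \kernelScalar_\mappingFunction(\inputVector, \inputVector^\prime) = \expectation{\left(g(\inputVector)+h(\inputVector)\right)\left(g(\inputVector^\prime)+h(\inputVector^\prime)\right)} = \kernelScalar_g(\inputVector, \inputVector^\prime) + \kernelScalar_h(\inputVector, \inputVector^\prime), $$
because the cross terms vanish in expectation.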
\includesvgclass{../slides/diagrams/kern/add_covariance.svg}

### Analysis of US Birth Rates
import teaching_plots as plot
import mlai
import numpy as np
from mlai import radial, basis_cov  # assumed available in mlai, like eq_cov above
basis = mlai.Basis(function=radial,
number=3,
data_limits=[-0.5, 0.5],
width=0.125)
kernel = mlai.Kernel(function=basis_cov,
name='Basis',
shortname='basis',
formula=r'\kernel(\inputVector, \inputVector^\prime) = \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)',
basis=basis)
plot.covariance_func(kernel, diagrams='../slides/diagrams/kern/')
\includesvgclass{../slides/diagrams/kern/basis_covariance.svg}
import teaching_plots as plot
import mlai
import numpy as np
from mlai import brownian_cov  # assumed available in mlai, like eq_cov above
t=np.linspace(0, 2, 200)[:, np.newaxis]
kernel = mlai.Kernel(function=brownian_cov,
name='Brownian',
formula=r'\kernelScalar(t, t^\prime)=\alpha \min(t, t^\prime)',
shortname='brownian')
plot.covariance_func(kernel, t, diagrams='../slides/diagrams/kern/')
\includesvgclass{../slides/diagrams/kern/brownian_covariance.svg}
import teaching_plots as plot
import mlai
import numpy as np
from mlai import mlp_cov  # assumed available in mlai, like eq_cov above
kernel = mlai.Kernel(function=mlp_cov,
name='Multilayer Perceptron',
shortname='mlp',
formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)',
w=5, b=0.5)
plot.covariance_func(kernel, diagrams='../slides/diagrams/kern/')
\includesvgclass{../slides/diagrams/kern/mlp_covariance.svg}

\includesvgclass{../slides/diagrams/logo/gpss-logo.svg}
- Work by Eric Meissner and Zhenwen Dai.
- Probabilistic programming.
- Available on [Github](https://github.com/amzn/mxfusion)
![MXFusion logo](../slides/diagrams/mxfusion-logo.png)
### Acknowledgments

Stefanos Eleftheriadis, John Bronskill, Hugh Salimbeni, Rich Turner, Zhenwen Dai, Javier Gonzalez, Andreas Damianou, Mark Pullin, Michael Smith, James Hensman, John Quinn, Martin Mubangizi.

Thanks!
### References {#references .unnumbered}