#!/usr/bin/env python # coding: utf-8 # # Deep Gaussian Processes # ### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield # ### 2019-01-11 # # **Abstract**: Classical machine learning and statistical approaches to learning, such # as neural networks and linear regression, assume a parametric form for # functions. Gaussian process models are an alternative approach that # assumes a probabilistic prior over functions. This brings benefits, in # that uncertainty of function estimation is sustained throughout # inference, and some challenges: algorithms for fitting Gaussian # processes tend to be more complex than parametric models. In these # sessions I will introduce Gaussian processes and explain why sustaining # uncertainty is important. We’ll then look at some extensions of Gaussian # process models, in particular composition of Gaussian processes, or deep # Gaussian processes. # # $$ # \newcommand{\Amatrix}{\mathbf{A}} # \newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)} # \newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}} # \newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}} # \newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}} # \newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}} # \newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}} # \newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}} # \newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}} # \newcommand{\Kuui}{\Kuu^{-1}} # \newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}} # \newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}} # \newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}} # \newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}} # \newcommand{\aMatrix}{\mathbf{A}} # \newcommand{\aScalar}{a} # \newcommand{\aVector}{\mathbf{a}} # \newcommand{\acceleration}{a} # \newcommand{\bMatrix}{\mathbf{B}} # \newcommand{\bScalar}{b} # \newcommand{\bVector}{\mathbf{b}} # \newcommand{\basisFunc}{\phi} # \newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}} # \newcommand{\basisFunction}{\phi} # \newcommand{\basisLocation}{\mu} # \newcommand{\basisMatrix}{\boldsymbol{ \Phi}} # \newcommand{\basisScalar}{\basisFunction} # \newcommand{\basisVector}{\boldsymbol{ \basisFunction}} # \newcommand{\activationFunction}{\phi} # \newcommand{\activationMatrix}{\boldsymbol{ \Phi}} # \newcommand{\activationScalar}{\basisFunction} # \newcommand{\activationVector}{\boldsymbol{ \basisFunction}} # \newcommand{\bigO}{\mathcal{O}} # \newcommand{\binomProb}{\pi} # \newcommand{\cMatrix}{\mathbf{C}} # \newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}} # \newcommand{\cdataMatrix}{\hat{\dataMatrix}} # \newcommand{\cdataScalar}{\hat{\dataScalar}} # \newcommand{\cdataVector}{\hat{\dataVector}} # \newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}} # \newcommand{\centeredKernelScalar}{b} # \newcommand{\centeredKernelVector}{\centeredKernelScalar} # \newcommand{\centeringMatrix}{\mathbf{H}} # \newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)} # \newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}} # \newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}} # \newcommand{\coregionalizationMatrix}{\mathbf{B}} # \newcommand{\coregionalizationScalar}{b} # \newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}} # \newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)} # 
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)} # \newcommand{\covarianceScalar}{c} # \newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}} # \newcommand{\covarianceMatrix}{\mathbf{C}} # \newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}} # \newcommand{\croupierScalar}{s} # \newcommand{\croupierVector}{\mathbf{ \croupierScalar}} # \newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}} # \newcommand{\dataDim}{p} # \newcommand{\dataIndex}{i} # \newcommand{\dataIndexTwo}{j} # \newcommand{\dataMatrix}{\mathbf{Y}} # \newcommand{\dataScalar}{y} # \newcommand{\dataSet}{\mathcal{D}} # \newcommand{\dataStd}{\sigma} # \newcommand{\dataVector}{\mathbf{ \dataScalar}} # \newcommand{\decayRate}{d} # \newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}} # \newcommand{\degreeScalar}{d} # \newcommand{\degreeVector}{\mathbf{ \degreeScalar}} # % Already defined by latex # %\newcommand{\det}[1]{\left|#1\right|} # \newcommand{\diag}[1]{\text{diag}\left(#1\right)} # \newcommand{\diagonalMatrix}{\mathbf{D}} # \newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}} # \newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}} # \newcommand{\displacement}{x} # \newcommand{\displacementVector}{\textbf{\displacement}} # \newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}} # \newcommand{\distanceScalar}{d} # \newcommand{\distanceVector}{\mathbf{ \distanceScalar}} # \newcommand{\eigenvaltwo}{\ell} # \newcommand{\eigenvaltwoMatrix}{\mathbf{L}} # \newcommand{\eigenvaltwoVector}{\mathbf{l}} # \newcommand{\eigenvalue}{\lambda} # \newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}} # \newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}} # \newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}} # \newcommand{\eigenvectorMatrix}{\mathbf{U}} # \newcommand{\eigenvectorScalar}{u} # \newcommand{\eigenvectwo}{\mathbf{v}} # \newcommand{\eigenvectwoMatrix}{\mathbf{V}} # \newcommand{\eigenvectwoScalar}{v} # \newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)} # \newcommand{\errorFunction}{E} # \newcommand{\expDist}[2]{\left<#1\right>_{#2}} # \newcommand{\expSamp}[1]{\left<#1\right>} # \newcommand{\expectation}[1]{\left\langle #1 \right\rangle } # \newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}} # \newcommand{\expectedDistanceMatrix}{\mathcal{D}} # \newcommand{\eye}{\mathbf{I}} # \newcommand{\fantasyDim}{r} # \newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}} # \newcommand{\fantasyScalar}{z} # \newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}} # \newcommand{\featureStd}{\varsigma} # \newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)} # \newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)} # \newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)} # \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} # \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} # \newcommand{\given}{|} # \newcommand{\half}{\frac{1}{2}} # \newcommand{\heaviside}{H} # \newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}} # \newcommand{\hiddenScalar}{h} # \newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}} # \newcommand{\identityMatrix}{\eye} # \newcommand{\inducingInputScalar}{z} # \newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}} # \newcommand{\inducingInputMatrix}{\mathbf{Z}} # \newcommand{\inducingScalar}{u} # \newcommand{\inducingVector}{\mathbf{ \inducingScalar}} # \newcommand{\inducingMatrix}{\mathbf{U}} # \newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2} # 
\newcommand{\inputDim}{q} # \newcommand{\inputMatrix}{\mathbf{X}} # \newcommand{\inputScalar}{x} # \newcommand{\inputSpace}{\mathcal{X}} # \newcommand{\inputVals}{\inputVector} # \newcommand{\inputVector}{\mathbf{ \inputScalar}} # \newcommand{\iterNum}{k} # \newcommand{\kernel}{\kernelScalar} # \newcommand{\kernelMatrix}{\mathbf{K}} # \newcommand{\kernelScalar}{k} # \newcommand{\kernelVector}{\mathbf{ \kernelScalar}} # \newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}} # \newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}} # \newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}} # \newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}} # \newcommand{\lagrangeMultiplier}{\lambda} # \newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}} # \newcommand{\lagrangian}{L} # \newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}} # \newcommand{\laplacianFactorScalar}{m} # \newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}} # \newcommand{\laplacianMatrix}{\mathbf{L}} # \newcommand{\laplacianScalar}{\ell} # \newcommand{\laplacianVector}{\mathbf{ \ell}} # \newcommand{\latentDim}{q} # \newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}} # \newcommand{\latentDistanceScalar}{\delta} # \newcommand{\latentDistanceVector}{\boldsymbol{ \delta}} # \newcommand{\latentForce}{f} # \newcommand{\latentFunction}{u} # \newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}} # \newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}} # \newcommand{\latentIndex}{j} # \newcommand{\latentScalar}{z} # \newcommand{\latentVector}{\mathbf{ \latentScalar}} # \newcommand{\latentMatrix}{\mathbf{Z}} # \newcommand{\learnRate}{\eta} # \newcommand{\lengthScale}{\ell} # \newcommand{\rbfWidth}{\ell} # \newcommand{\likelihoodBound}{\mathcal{L}} # \newcommand{\likelihoodFunction}{L} # \newcommand{\locationScalar}{\mu} # \newcommand{\locationVector}{\boldsymbol{ \locationScalar}} # \newcommand{\locationMatrix}{\mathbf{M}} # \newcommand{\variance}[1]{\text{var}\left( #1 \right)} # \newcommand{\mappingFunction}{f} # \newcommand{\mappingFunctionMatrix}{\mathbf{F}} # \newcommand{\mappingFunctionTwo}{g} # \newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}} # \newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}} # \newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}} # \newcommand{\scaleScalar}{s} # \newcommand{\mappingScalar}{w} # \newcommand{\mappingVector}{\mathbf{ \mappingScalar}} # \newcommand{\mappingMatrix}{\mathbf{W}} # \newcommand{\mappingScalarTwo}{v} # \newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}} # \newcommand{\mappingMatrixTwo}{\mathbf{V}} # \newcommand{\maxIters}{K} # \newcommand{\meanMatrix}{\mathbf{M}} # \newcommand{\meanScalar}{\mu} # \newcommand{\meanTwoMatrix}{\mathbf{M}} # \newcommand{\meanTwoScalar}{m} # \newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}} # \newcommand{\meanVector}{\boldsymbol{ \meanScalar}} # \newcommand{\mrnaConcentration}{m} # \newcommand{\naturalFrequency}{\omega} # \newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)} # \newcommand{\neilurl}{http://inverseprobability.com/} # \newcommand{\noiseMatrix}{\boldsymbol{ E}} # \newcommand{\noiseScalar}{\epsilon} # \newcommand{\noiseVector}{\boldsymbol{ \epsilon}} # \newcommand{\norm}[1]{\left\Vert #1 \right\Vert} # \newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}} # \newcommand{\normalizedLaplacianScalar}{\hat{\ell}} # 
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}} # \newcommand{\numActive}{m} # \newcommand{\numBasisFunc}{m} # \newcommand{\numComponents}{m} # \newcommand{\numComps}{K} # \newcommand{\numData}{n} # \newcommand{\numFeatures}{K} # \newcommand{\numHidden}{h} # \newcommand{\numInducing}{m} # \newcommand{\numLayers}{\ell} # \newcommand{\numNeighbors}{K} # \newcommand{\numSequences}{s} # \newcommand{\numSuccess}{s} # \newcommand{\numTasks}{m} # \newcommand{\numTime}{T} # \newcommand{\numTrials}{S} # \newcommand{\outputIndex}{j} # \newcommand{\paramVector}{\boldsymbol{ \theta}} # \newcommand{\parameterMatrix}{\boldsymbol{ \Theta}} # \newcommand{\parameterScalar}{\theta} # \newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}} # \newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}} # \newcommand{\precisionScalar}{j} # \newcommand{\precisionVector}{\mathbf{ \precisionScalar}} # \newcommand{\precisionMatrix}{\mathbf{J}} # \newcommand{\pseudotargetScalar}{\widetilde{y}} # \newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}} # \newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}} # \newcommand{\rank}[1]{\text{rank}\left(#1\right)} # \newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)} # \newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)} # \newcommand{\responsibility}{r} # \newcommand{\rotationScalar}{r} # \newcommand{\rotationVector}{\mathbf{ \rotationScalar}} # \newcommand{\rotationMatrix}{\mathbf{R}} # \newcommand{\sampleCovScalar}{s} # \newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}} # \newcommand{\sampleCovMatrix}{\mathbf{s}} # \newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle} # \newcommand{\sign}[1]{\text{sign}\left(#1\right)} # \newcommand{\sigmoid}[1]{\sigma\left(#1\right)} # \newcommand{\singularvalue}{\ell} # \newcommand{\singularvalueMatrix}{\mathbf{L}} # \newcommand{\singularvalueVector}{\mathbf{l}} # \newcommand{\sorth}{\mathbf{u}} # \newcommand{\spar}{\lambda} # \newcommand{\trace}[1]{\text{tr}\left(#1\right)} # \newcommand{\BasalRate}{B} # \newcommand{\DampingCoefficient}{C} # \newcommand{\DecayRate}{D} # \newcommand{\Displacement}{X} # \newcommand{\LatentForce}{F} # \newcommand{\Mass}{M} # \newcommand{\Sensitivity}{S} # \newcommand{\basalRate}{b} # \newcommand{\dampingCoefficient}{c} # \newcommand{\mass}{m} # \newcommand{\sensitivity}{s} # \newcommand{\springScalar}{\kappa} # \newcommand{\springVector}{\boldsymbol{ \kappa}} # \newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}} # \newcommand{\tfConcentration}{p} # \newcommand{\tfDecayRate}{\delta} # \newcommand{\tfMrnaConcentration}{f} # \newcommand{\tfVector}{\mathbf{ \tfConcentration}} # \newcommand{\velocity}{v} # \newcommand{\sufficientStatsScalar}{g} # \newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}} # \newcommand{\sufficientStatsMatrix}{\mathbf{G}} # \newcommand{\switchScalar}{s} # \newcommand{\switchVector}{\mathbf{ \switchScalar}} # \newcommand{\switchMatrix}{\mathbf{S}} # \newcommand{\tr}[1]{\text{tr}\left(#1\right)} # \newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1} # \newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2} # \newcommand{\onenorm}[1]{\left\vert#1\right\vert_1} # \newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert} # \newcommand{\vScalar}{v} # \newcommand{\vVector}{\mathbf{v}} # \newcommand{\vMatrix}{\mathbf{V}} # \newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)} # % Already defined by latex # %\newcommand{\vec}{#1:} # \newcommand{\vecb}[1]{\left(#1\right):} # \newcommand{\weightScalar}{w} # 
\newcommand{\weightVector}{\mathbf{ \weightScalar}} # \newcommand{\weightMatrix}{\mathbf{W}} # \newcommand{\weightedAdjacencyMatrix}{\mathbf{A}} # \newcommand{\weightedAdjacencyScalar}{a} # \newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}} # \newcommand{\onesVector}{\mathbf{1}} # \newcommand{\zerosVector}{\mathbf{0}} # $$ # # # # # # # # #
# # # #
# #
# The cosmic microwave background is, to a very high degree of # precision, a Gaussian process. The parameters of its covariance function # are given by fundamental parameters of the universe, such as the amount # of dark matter and mass. #
#
# #
# What we observe today is some non-linear function of the cosmic # microwave background. #
#
# Image credit: Kai Arulkumaran #
# Inference in a Gaussian process has computational complexity of
# $\bigO(\numData^3)$ and storage demands of $\bigO(\numData^2)$. This is
# too large for many modern data sets.
#
# Low rank approximations allow us to work with Gaussian processes with
# computational complexity of $\bigO(\numData\numInducing^2)$ and storage
# demands of $\bigO(\numData\numInducing)$, where $\numInducing$ is a user
# chosen parameter.
#
# In machine learning, low rank approximations date back to
# @Smola:sparsegp00 and @Williams:nystrom00, who considered the Nyström
# approximation, and to @Csato:sparse02; @Csato:thesis02, who considered low
# rank approximations in the context of on-line learning. Selection of
# active points for the approximation was considered by @Seeger:fast03, and
# @Snelson:pseudo05 first proposed that the active set could be optimized
# directly. Those approaches were reviewed by @Quinonero:unifying05 under
# a unifying likelihood approximation perspective. General rules for
# deriving the maximum likelihood for these sparse approximations were
# given in @Lawrence:larger07.
#
# Modern variational interpretations of these low rank approaches were
# first explored in @Titsias:variational09. A more modern summary which
# considers each of these approximations as an $\alpha$-divergence is
# given by @Thang:unifying17.
#
# ### Variational Compression
#
# Inducing variables are a compression of the real observations. The basic
# idea is: can I create a new data set that summarizes all the information
# in the original data set? If this new data set is smaller, I've compressed
# the information in the original data set.
#
# Inducing variables can be thought of as pseudo-data; indeed, in
# @Snelson:pseudo05 they were referred to as *pseudo-points*.
#
# The only requirement for inducing variables is that they are jointly
# distributed as a Gaussian process with the original data. This means
# that they can be from the space of $\mappingFunctionVector$ or from a space
# that is related through a linear operator (see e.g. @Alvarez:efficient10).
# For example we could choose to store the gradient of the function at
# particular points or a value from the frequency spectrum of the function
# [@Lazaro:spectrum10].
#
# ### Variational Compression II
#
# Inducing variables don't only allow for the compression of the
# non-parametric information into a reduced data set, but they also allow
# for computational scaling of the algorithms through, for example,
# stochastic variational approaches [@Hensman:bigdata13] or parallelization
# [@Gal:Distributed14; @Dai:gpu14; @Seeger:auto17].
#
# We’ve seen how we go from parametric to non-parametric. The limit
# implies an infinite dimensional $\mappingVector$. Gaussian processes are
# generally non-parametric: combine data with covariance function to get
# model. This representation *cannot* be summarized by a parameter vector
# of a fixed size.
#
# Parametric models have a representation that does not respond to
# increasing training set size.
# Bayesian posterior distributions over
# parameters contain the information about the training data. For example,
# if we use Bayes’ rule from training data, $$
# p\left(\mappingVector|\dataVector, \inputMatrix\right),
# $$ to make predictions on test data $$
# p\left(\dataScalar_*|\inputMatrix_*, \dataVector, \inputMatrix\right) = \int
# p\left(\dataScalar_*|\mappingVector,\inputMatrix_*\right)p\left(\mappingVector|\dataVector,
# \inputMatrix\right)\text{d}\mappingVector
# $$ then $\mappingVector$ becomes a bottleneck for information about the
# training set to pass to the test set. The solution is to increase
# $\numBasisFunc$ so that the bottleneck is so large that it no longer
# presents a problem. How big is big enough for $\numBasisFunc$?
# Non-parametrics says $\numBasisFunc \rightarrow \infty$.
#
# It is now no longer possible to manipulate the model through the standard
# parametric form. However, it *is* possible to express *parametric* models
# as GPs: $$
# \kernelScalar\left(\inputVector_i,\inputVector_j\right)=\basisFunction_:\left(\inputVector_i\right)^\top\basisFunction_:\left(\inputVector_j\right).
# $$ These are known as degenerate covariance matrices. Their rank is at
# most $\numBasisFunc$; non-parametric models have full rank covariance
# matrices. Most well known is the “linear kernel”, $$
# \kernelScalar(\inputVector_i, \inputVector_j) = \inputVector_i^\top\inputVector_j.
# $$ For non-parametrics, prediction at a new point,
# $\mappingFunctionVector_*$, is made by conditioning on
# $\mappingFunctionVector$ in the joint distribution. In GPs this involves
# combining the training data with the covariance function and the mean
# function. The parametric model is a special case in which the conditional
# prediction can be summarized in a *fixed* number of parameters. The
# complexity of a parametric model remains fixed regardless of the size of
# our training data set. For a non-parametric model the required number of
# parameters grows with the size of the training data.
#
# ### Augment Variable Space
#
# In inducing variable approximations, we augment the variable space with
# a set of inducing points, $\inducingVector$. These inducing points are
# jointly Gaussian distributed with the points from our function,
# $\mappingFunctionVector$. So we have a joint Gaussian process with
# covariance, $$
# \begin{bmatrix}
# \mappingFunctionVector\\
# \inducingVector
# \end{bmatrix} \sim \gaussianSamp{\zerosVector}{\kernelMatrix}
# $$ where the kernel matrix itself can be decomposed into $$
# \kernelMatrix =
# \begin{bmatrix}
# \Kff & \Kfu \\
# \Kuf & \Kuu
# \end{bmatrix}
# $$
#
# This defines a joint density between the original function points,
# $\mappingFunctionVector$, and our inducing points, $\inducingVector$.
# This can be decomposed through the product rule to give $$
# p(\mappingFunctionVector, \inducingVector) = p(\mappingFunctionVector| \inducingVector) p(\inducingVector)
# $$ The Gaussian process is (typically) given by a noise corrupted form
# of $\mappingFunctionVector$, i.e., $$
# \dataScalar(\inputVector) = \mappingFunction(\inputVector) + \noiseScalar,
# $$ which can be written probabilistically as, $$
# p(\dataVector) = \int p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector) \text{d}\mappingFunctionVector,
# $$ where for the independent case we have
# $p(\dataVector | \mappingFunctionVector) = \prod_{i=1}^\numData p(\dataScalar_i|\mappingFunction_i)$.
#
# Inducing variables are like auxiliary variables in Monte Carlo
# algorithms.
# We introduce the inducing variables by augmenting this
# integral with an additional integral over $\inducingVector$, $$
# p(\dataVector) = \int p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector|\inducingVector) p(\inducingVector) \text{d}\inducingVector \text{d}\mappingFunctionVector.
# $$ Now, conceptually speaking, we are going to integrate out
# $\mappingFunctionVector$, initially leaving $\inducingVector$ in
# place. This gives, $$
# p(\dataVector) = \int p(\dataVector|\inducingVector) p(\inducingVector) \text{d}\inducingVector.
# $$
#
# Note the similarity between this form and our standard *parametric*
# form. If we had defined our model through standard basis functions we
# would have, $$
# \dataScalar(\inputVector) = \weightVector^\top\basisVector(\inputVector) + \noiseScalar
# $$ and the resulting probabilistic representation would be $$
# p(\dataVector) = \int p(\dataVector|\weightVector) p(\weightVector) \text{d} \weightVector
# $$ allowing us to predict $$
# p(\dataVector^*|\dataVector) = \int p(\dataVector^*|\weightVector) p(\weightVector|\dataVector) \text{d} \weightVector
# $$
#
# The new prediction algorithm involves $$
# p(\dataVector^*|\dataVector) = \int p(\dataVector^*|\inducingVector) p(\inducingVector|\dataVector) \text{d} \inducingVector
# $$ but *importantly* the length of $\inducingVector$ is not fixed at
# *design* time in the way the number of parameters was. We can vary the
# number of inducing variables we use to give us the model capacity we
# require.
#
# Unfortunately, computation of $p(\dataVector|\inducingVector)$ turns out
# to be intractable. As a result, we need to turn to approximations to
# make progress.
#
# ### Variational Bound on $p(\dataVector |\inducingVector)$
#
# The conditional density of the data given the inducing points can be
# *lower* bounded variationally $$
# \begin{aligned}
# \log p(\dataVector|\inducingVector) & = \log \int p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector|\inducingVector) \text{d}\mappingFunctionVector\\ & = \int q(\mappingFunctionVector) \log \frac{p(\dataVector|\mappingFunctionVector) p(\mappingFunctionVector|\inducingVector)}{q(\mappingFunctionVector)}\text{d}\mappingFunctionVector + \KL{q(\mappingFunctionVector)}{p(\mappingFunctionVector|\dataVector, \inducingVector)}.
# \end{aligned}
# $$
#
# The key innovation from @Titsias:variational09 was to then make a
# particular choice for $q(\mappingFunctionVector)$. If we set
# $q(\mappingFunctionVector)=p(\mappingFunctionVector|\inducingVector)$,
# $$
# \log p(\dataVector|\inducingVector) \geq \int p(\mappingFunctionVector|\inducingVector) \log p(\dataVector|\mappingFunctionVector)\text{d}\mappingFunctionVector.
# $$ $$
# p(\dataVector|\inducingVector) \geq \exp \int p(\mappingFunctionVector|\inducingVector) \log p(\dataVector|\mappingFunctionVector)\text{d}\mappingFunctionVector.
# $$
#
# ### Optimal Compression in Inducing Variables
#
# Maximizing the lower bound minimizes the Kullback-Leibler divergence (or
# *information gain*) between our approximating density,
# $p(\mappingFunctionVector|\inducingVector)$, and the true posterior
# density, $p(\mappingFunctionVector|\dataVector, \inducingVector)$.
#
# $$
# \KL{p(\mappingFunctionVector|\inducingVector)}{p(\mappingFunctionVector|\dataVector, \inducingVector)} = \int p(\mappingFunctionVector|\inducingVector) \log \frac{p(\mappingFunctionVector|\inducingVector)}{p(\mappingFunctionVector|\dataVector, \inducingVector)}\text{d}\mappingFunctionVector
# $$
#
# This divergence is minimized when the information stored about $\dataVector$
# is already stored in $\inducingVector$. In other words, maximizing the
# bound seeks an *optimal compression* from the *information gain*
# perspective.
#
# For the case where $\inducingVector = \mappingFunctionVector$ the bound
# is exact ($\mappingFunctionVector$ $d$-separates $\dataVector$ from
# $\inducingVector$).
#
# ### Choice of Inducing Variables
#
# The quality of the resulting bound is determined by the choice of the
# inducing variables. You are free to choose whichever heuristics you like
# for the inducing variables, as long as they are drawn jointly from a
# valid Gaussian process, i.e. such that $$
# \begin{bmatrix}
# \mappingFunctionVector\\
# \inducingVector
# \end{bmatrix} \sim \gaussianSamp{\zerosVector}{\kernelMatrix}
# $$ where the kernel matrix itself can be decomposed into $$
# \kernelMatrix =
# \begin{bmatrix}
# \Kff & \Kfu \\
# \Kuf & \Kuu
# \end{bmatrix}
# $$ Choosing the inducing variables amounts to specifying $\Kfu$ and
# $\Kuu$ such that $\kernelMatrix$ remains positive definite. The typical
# choice is to choose $\inducingVector$ in the same domain as
# $\mappingFunctionVector$, associating each inducing output,
# $\inducingScalar_i$, with a corresponding input location,
# $\inducingInputVector_i$. However, more imaginative choices are absolutely
# possible. In particular, if $\inducingVector$ is related to
# $\mappingFunctionVector$ through a linear operator (see e.g.
# @Alvarez:efficient10), then valid $\Kuu$ and $\Kuf$ can be constructed.
# For example we could choose to store the gradient of the function at
# particular points or a value from the frequency spectrum of the function
# [@Lazaro:spectrum10].
#
# ### Variational Compression II
#
# Inducing variables don't only allow for the compression of the
# non-parametric information into a reduced data set, but they also allow
# for computational scaling of the algorithms through, for example,
# stochastic variational
# approaches [@Hoffman:stochastic12; @Hensman:bigdata13] or parallelization
# [@Gal:Distributed14; @Dai:gpu14; @Seeger:auto17].
#
# ### A Simple Regression Problem
#
# Here we set up a simple one dimensional regression problem. The input
# locations, $\inputMatrix$, are in two separate clusters. The response
# variable, $\dataVector$, is sampled from a Gaussian process with an
# exponentiated quadratic covariance.

# In[ ]:

import numpy as np
import GPy

# In[ ]:

np.random.seed(101)

# In[ ]:

N = 50
noise_var = 0.01
X = np.zeros((N, 1))
X[:25, :] = np.linspace(0,3,25)[:,None] # First cluster of inputs/covariates
X[25:, :] = np.linspace(7,10,25)[:,None] # Second cluster of inputs/covariates

# Sample response variables from a Gaussian process with exponentiated quadratic covariance.
k = GPy.kern.RBF(1)
y = np.random.multivariate_normal(np.zeros(N),k.K(X)+np.eye(N)*np.sqrt(noise_var)).reshape(-1,1)

# First we perform a full Gaussian process regression on the data. We
# create a GP model, `m_full`, and fit it to the data, plotting the
# resulting fit.
# In[ ]: m_full = GPy.models.GPRegression(X,y) _ = m_full.optimize(messages=True) # Optimize parameters of covariance function # In[ ]: import matplotlib.pyplot as plt import mlai import teaching_plots as plot from gp_tutorial import gpplot # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m_full, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2) xlim = ax.get_xlim() ylim = ax.get_ylim() mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/sparse-demo-full-gp.svg', transparent=True, frameon=True) # #
# Full Gaussian process fitted to the data set. #
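# Before fitting GPy's sparse model below, it can be helpful to see the low
# rank structure that such approximations exploit directly. The following
# cell is a minimal sketch (not part of the original notebook) of the
# Nyström-style approximation $\Kff \approx \Kfu\Kuui\Kuf$; it reuses the
# inputs `X` and kernel `k` defined above, and `Z_sketch` is an arbitrary,
# hypothetical choice of inducing input locations.

# In[ ]:

import numpy as np

Z_sketch = np.linspace(0, 10, 6)[:, None]             # six inducing input locations
Kff = k.K(X)                                          # full n x n covariance
Kuu = k.K(Z_sketch) + 1e-6*np.eye(Z_sketch.shape[0])  # m x m covariance, with jitter
Kfu = k.K(X, Z_sketch)                                # n x m cross covariance
Kff_approx = Kfu @ np.linalg.solve(Kuu, Kfu.T)        # rank m approximation to Kff
print('largest absolute error of the rank 6 approximation:', np.abs(Kff - Kff_approx).max())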
# Now we set up the inducing variables, $\mathbf{u}$. Each inducing # variable has its own associated input index, $\mathbf{Z}$, which lives # in the same space as $\inputMatrix$. Here we are using the true # covariance function parameters to generate the fit. # In[ ]: kern = GPy.kern.RBF(1) Z = np.hstack( (np.linspace(2.5,4.,3), np.linspace(7,8.5,3)))[:,None] m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z) m.noise_var = noise_var m.inducing_inputs.constrain_fixed() display(m) # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/sparse-demo-constrained-inducing-6-unlearned-gp.svg', transparent=True, frameon=True) # #
# Sparse Gaussian process fitted with six inducing variables, no # optimization of parameters or inducing variables. #
# In[ ]: _ = m.optimize(messages=True) display(m) # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/sparse-demo-constrained-inducing-6-learned-gp.svg', transparent=True, frameon=True) # #
# Gaussian process fitted with inducing variables fixed and parameters # optimized #
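# As a quick check at this stage (this cell is not in the original notebook),
# we can inspect where the still-fixed inducing inputs sit and compare the
# sparse model's objective with the full GP's log likelihood. Note that for
# the sparse model the value returned by `log_likelihood` is the variational
# lower bound rather than an exact marginal likelihood.

# In[ ]:

print(m.inducing_inputs)
print(m.log_likelihood(), m_full.log_likelihood())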
# In[ ]: m.randomize() m.inducing_inputs.unconstrain() _ = m.optimize(messages=True) # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2,xlim=xlim, ylim=ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/sparse-demo-unconstrained-inducing-6-gp.svg', transparent=True, frameon=True) # #
# Gaussian process fitted with location of inducing variables and # parameters both optimized #
# Now we will vary the number of inducing points used to form the # approximation. # In[ ]: m.num_inducing=8 m.randomize() M = 8 m.set_Z(np.random.rand(M,1)*12) _ = m.optimize(messages=True) # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/sparse-demo-sparse-inducing-8-gp.svg', transparent=True, frameon=True) #
# Comparison of the full Gaussian process fit with a sparse Gaussian
# process using eight inducing variables. Both inducing variables and
# parameters are optimized.
#
# And we can compare the probability of the result to the full model.

# In[ ]:

print(m.log_likelihood(), m_full.log_likelihood())

# - Let’s be explicit about storing the approximate posterior of
# $\inducingVector$, $q(\inducingVector)$.
# - Now we have
# $$p(\dataVector^*|\dataVector) = \int p(\dataVector^*| \inducingVector) q(\inducingVector | \dataVector) \text{d} \inducingVector$$
#
# - Inducing variables look a lot like regular parameters.
# - *But*: their dimensionality does not need to be set at design time.
# - They can be modified arbitrarily at run time without affecting the
# model likelihood.
# - They only affect the quality of compression and the lower bound.
#
# - Exploit the resulting factorization ...
# $$p(\dataVector^*|\dataVector) = \int p(\dataVector^*| \inducingVector) q(\inducingVector | \dataVector) \text{d} \inducingVector$$
# - The distribution now *factorizes*:
# $$p(\dataVector^*|\dataVector) = \int \prod_{i=1}^{\numData^*}p(\dataScalar^*_i| \inducingVector) q(\inducingVector | \dataVector) \text{d} \inducingVector$$
# - This factorization can be exploited for stochastic variational
# inference [@Hoffman:stochastic12].
#
#
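# Because the data term in the bound is a sum over individual data points, it
# can be estimated without bias from a random subset. The toy sketch below
# (not part of the original notebook) uses made-up per-point values standing
# in for the per-data-point expectations; the only point being illustrated is
# that a rescaled minibatch sum is an unbiased estimate of the full sum, which
# is what stochastic variational inference exploits.

# In[ ]:

import numpy as np

n, batch_size = 1000, 50
per_point_term = np.random.randn(n)                    # hypothetical per-point contributions
full_sum = per_point_term.sum()
batch = np.random.choice(n, size=batch_size, replace=False)
estimate = (n/batch_size)*per_point_term[batch].sum()  # unbiased minibatch estimate
print(full_sum, estimate)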
# Modern data availability #
#
# # # #
# #
# Proxy for index of deprivation? #
#
# # # #
# #
# Actually index of deprivation is a proxy for this ... #
#
# # # #
# @Hensman:bigdata13
#
#
# # # #
# #
# #
# # # # # #
# @Hensman:bigdata13
#
#
# # # #
# #
# #
# - *A Unifying Framework for Gaussian Process Pseudo-Point
# Approximations using Power Expectation Propagation*
# @Thang:unifying17
#
# - *Deep Gaussian Processes and Variational Propagation of Uncertainty*
# @Damianou:thesis2015
#
# Even in the early days of Gaussian processes in machine learning, it was
# understood that we were throwing something fundamental away. This is
# perhaps captured best by David MacKay in his 1997 NeurIPS tutorial on
# Gaussian processes, where he asked "Have we thrown out the baby with the
# bathwater?". The quote below is from his summarization paper.
#
# > According to the hype of 1987, neural networks were meant to be
# > intelligent models which discovered features and patterns in data.
# > Gaussian processes in contrast are simply smoothing devices. How can
# > Gaussian processes possibly replace neural networks? What is going on?
# >
# > @MacKay:gpintroduction98

# In[ ]:

import teaching_plots as plot

#
# A deep neural network. Input nodes are shown at the bottom. Each
# hidden layer is the result of applying an affine transformation to the
# previous layer and passing it through an activation function.
#
# Mathematically, each layer of a neural network is given through
# computing the activation function, $\basisFunction(\cdot)$, contingent
# on the previous layer, or the inputs. In this way the activation
# functions are composed to generate more complex interactions than would
# be possible with any single layer. $$
# \begin{align}
# \hiddenVector_{1} &= \basisFunction\left(\mappingMatrix_1 \inputVector\right)\\
# \hiddenVector_{2} &= \basisFunction\left(\mappingMatrix_2\hiddenVector_{1}\right)\\
# \hiddenVector_{3} &= \basisFunction\left(\mappingMatrix_3 \hiddenVector_{2}\right)\\
# \dataVector &= \mappingVector_4 ^\top\hiddenVector_{3}
# \end{align}
# $$
#
# ### Overfitting
#
# One potential problem is that as the number of nodes in two adjacent
# layers increases, the number of parameters in the affine transformation
# between layers, $\mappingMatrix$, increases. If there are $k_{i-1}$
# nodes in one layer, and $k_i$ nodes in the following, then that matrix
# contains $k_i k_{i-1}$ parameters. When we have layer widths in the
# 1000s, that leads to millions of parameters.
#
# One proposed solution is known as *dropout*, where only a sub-set of the
# neural network is trained at each iteration. An alternative solution
# would be to reparameterize $\mappingMatrix$ with its *singular value
# decomposition*. $$
# \mappingMatrix = \eigenvectorMatrix\eigenvalueMatrix\eigenvectwoMatrix^\top
# $$ or $$
# \mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top
# $$ where if $\mappingMatrix \in \Re^{k_1\times k_2}$ then
# $\eigenvectorMatrix\in \Re^{k_1\times q}$ and
# $\eigenvectwoMatrix \in \Re^{k_2\times q}$, i.e. we have a low rank
# matrix factorization for the weights.

# In[ ]:

import teaching_plots as plot

#
# Pictorial representation of the low rank form of the matrix # $\mappingMatrix$ #
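# To get a feel for the saving this low rank form gives, here is an
# illustrative parameter count (not part of the original notebook); the layer
# widths and rank are hypothetical values chosen only for the comparison.

# In[ ]:

k1, k2, q = 1000, 1000, 10
dense_params = k1*k2              # a dense W in R^{k1 x k2}
low_rank_params = q*(k1 + k2)     # U in R^{k1 x q} and V in R^{k2 x q}
print(dense_params, low_rank_params)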
# In[ ]:

import teaching_plots as plot

# Including the low rank decomposition of $\mappingMatrix$ in the neural
# network, we obtain a new mathematical form. Effectively, we are adding
# additional *latent* layers, $\latentVector$, in between each of the
# existing hidden layers. In a neural network these are sometimes known as
# *bottleneck* layers. The network can now be written mathematically as $$
# \begin{align}
# \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
# \hiddenVector_{1} &= \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
# \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \hiddenVector_{1}\\
# \hiddenVector_{2} &= \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
# \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \hiddenVector_{2}\\
# \hiddenVector_{3} &= \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)\\
# \dataVector &= \mappingVector_4^\top\hiddenVector_{3}.
# \end{align}
# $$
#
# $$
# \begin{align}
# \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
# \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
# \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
# \dataVector &= \mappingVector_4 ^\top \latentVector_{3}
# \end{align}
# $$
#
# Now we replace each of these neural network layers with a Gaussian
# process. This is equivalent to taking the limit as the width of each
# layer goes to infinity, while appropriately scaling down the outputs.
#
# $$
# \begin{align}
# \latentVector_{1} &= \mappingFunctionVector_1\left(\inputVector\right)\\
# \latentVector_{2} &= \mappingFunctionVector_2\left(\latentVector_{1}\right)\\
# \latentVector_{3} &= \mappingFunctionVector_3\left(\latentVector_{2}\right)\\
# \dataVector &= \mappingFunctionVector_4\left(\latentVector_{3}\right)
# \end{align}
# $$
#
# # # #
# #
# The DeepFace architecture [@Taigman:deepface14], visualized through # colors to represent the functional mappings at each layer. There are 120 # million parameters in the model. #
# The DeepFace architecture [@Taigman:deepface14] consists of layers that # deal with *translation* and *rotational* invariances. These layers are # followed by three locally-connected layers and two fully-connected # layers. Color illustrates feature maps produced at each layer. The net # includes more than 120 million parameters, where more than 95% come from # the local and fully connected layers. # #
# # # #
# #
# Deep learning models are compositions of simple functions. We can
# think of a pinball machine as an analogy. Each layer of pins corresponds
# to one of the layers of functions in the model. Input data is
# represented by the location of the ball from left to right when it is
# dropped in from the top. Output class comes from the position of the
# ball as it leaves the pins at the bottom.
#
# We can think of what these models are doing as being similar to early # pin ball machines. In a neural network, we input a number (or numbers), # whereas in pinball, we input a ball. The location of the ball on the # left-right axis can be thought of as the number. As the ball falls # through the machine, each layer of pins can be thought of as a different # layer of neurons. Each layer acts to move the ball from left to right. # # In a pinball machine, when the ball gets to the bottom it might fall # into a hole defining a score, in a neural network, that is equivalent to # the decision: a classification of the input object. # # An image has more than one number associated with it, so it's like # playing pinball in a *hyper-space*. # In[ ]: import pods from ipywidgets import IntSlider # In[ ]: pods.notebook.display_plots('pinball{sample:0>3}.svg', '../slides/diagrams', sample=IntSlider(1, 1, 2, 1)) #
# # # #
# #
# At initialization, the pins, which represent the parameters of the # function, aren't in the right place to bring the balls to the correct # decisions. #
#
# # # #
# #
# After learning the pins are now in the right place to bring the balls # to the correct decisions. #
# Learning involves moving all the pins to be in the right position, so
# that the ball falls in the right place. But moving all these pins in
# hyperspace can be difficult. In a hyperspace you have to put a lot of
# data through the machine to explore the positions of all the pins.
# Adversarial learning reflects the fact that a ball can be moved a small
# distance and lead to a very different result.
#
# Probabilistic methods explore more of the space by considering a range
# of possible paths for the ball through the machine.
#
# Mathematically, a deep Gaussian process can be seen as a composite
# *multivariate* function, $$
# \mathbf{g}(\inputVector)=\mappingFunctionVector_5(\mappingFunctionVector_4(\mappingFunctionVector_3(\mappingFunctionVector_2(\mappingFunctionVector_1(\inputVector))))).
# $$ Or, if we view it from the probabilistic perspective, we can see that
# a deep Gaussian process is specifying a factorization of the joint
# density; the standard deep model takes the form of a Markov chain.

# In[ ]:

from matplotlib import rc

rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'],'size':30})
rc("text", usetex=True)

# $$
# p(\dataVector|\inputVector)= p(\dataVector|\mappingFunctionVector_5)p(\mappingFunctionVector_5|\mappingFunctionVector_4)p(\mappingFunctionVector_4|\mappingFunctionVector_3)p(\mappingFunctionVector_3|\mappingFunctionVector_2)p(\mappingFunctionVector_2|\mappingFunctionVector_1)p(\mappingFunctionVector_1|\inputVector)
# $$
#
#
# Probabilistically the deep Gaussian process can be represented as a # Markov chain. #
# In[ ]:

from matplotlib import rc

rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'], 'size':15})
rc("text", usetex=True)

# ### Why Deep?
#
# If the result of composing many functions together is simply another
# function, then why do we bother? The key point is that we can change the
# class of functions we are modeling by composing in this manner. A
# Gaussian process is specifying a prior over functions, and one with a
# number of elegant properties. For example, the derivative process (if it
# exists) of a Gaussian process is also Gaussian distributed. That makes
# it easy to assimilate, for example, derivative observations. But that
# also might raise some alarm bells. It implies that the *marginal
# derivative distribution* is also Gaussian distributed. If that's the
# case, then it means that functions which occasionally exhibit very large
# derivatives are hard to model with a Gaussian process, for example a
# function with jumps in it.
#
# A one-off discontinuity is easy to model with a Gaussian process, or
# even multiple discontinuities. They can be introduced in the mean
# function, or independence can be forced between two covariance functions
# that apply in different areas of the input space. But in these cases we
# will need to specify the number of discontinuities and where they occur.
# In other words we need to *parameterise* the discontinuities. If we do
# not know the number of discontinuities and don't wish to specify where
# they occur, i.e. if we want a non-parametric representation of
# discontinuities, then the standard Gaussian process doesn't help.
#
# ### Stochastic Process Composition
#
# The deep Gaussian process leads to *non-Gaussian* models, and
# non-Gaussian characteristics in the covariance function. In effect, what
# we are proposing is that we change the properties of the functions we
# are considering by *composing stochastic processes*. This is an
# approach to creating new stochastic processes from well known processes.
#
# Additionally, we are not constrained to the formalism of the chain. For
# example, we can easily add single nodes emerging from some point in the
# depth of the chain. This allows us to combine the benefits of the
# graphical modelling formalism with a powerful framework for relating one
# set of variables to another, that of Gaussian processes.
#
# ### Difficulty for Probabilistic Approaches
#
# The challenge for composition of probabilistic models is that you need
# to propagate probability densities through non-linear mappings. This
# allows you to create broader classes of probability density.
# Unfortunately it renders the resulting densities *intractable*.
#
# The argument in the deep learning revolution is that deep architectures
# allow us to develop an abstraction of the feature set through model
# composition. Composing Gaussian processes is analytically intractable.
# To form deep Gaussian processes we use a variational approach to stack
# the models.

# In[ ]:

import pods

# In[ ]:

pods.notebook.display_plots('stack-gp-sample-Linear-{sample:0>1}.svg', directory='../../slides/diagrams/deepgp', sample=(0,4))

# ### Stacked PCA
#
# Composition of linear functions just leads to a new linear # function. #
# Stacking a series of linear functions simply leads to a new linear
# function. The use of multiple linear functions merely changes the
# covariance of the resulting Gaussian. If $$
# \latentMatrix \sim \gaussianSamp{\zerosVector}{\eye}
# $$ and the $i$th hidden layer is a multivariate linear transformation
# defined by $\weightMatrix_i$, $$
# \dataMatrix = \latentMatrix\weightMatrix_1 \weightMatrix_2 \dots \weightMatrix_\numLayers
# $$ then the rules of multivariate Gaussians tell us that $$
# \dataMatrix \sim \gaussianSamp{\zerosVector}{\weightMatrix_\numLayers \dots \weightMatrix_1 \weightMatrix^\top_1 \dots \weightMatrix^\top_\numLayers}.
# $$ So the model can be replaced by one where we set
# $\vMatrix = \weightMatrix_\numLayers \dots \weightMatrix_2 \weightMatrix_1$.
# So is such a model trivial? The answer is that it depends. There are two
# cases in which such a model remains interesting. Firstly, if we make
# intermediate observations stemming from the chain. So, for example, if
# we decide that, $$
# \latentMatrix_i = \weightMatrix_i \latentMatrix_{i-1}
# $$ and set
# $\latentMatrix_{0} = \inputMatrix \sim \gaussianSamp{\zerosVector}{\eye}$,
# then the matrices $\weightMatrix$ inter-relate a series of jointly
# Gaussian observations in an interesting way, stacking the full data
# matrix to give $$
# \latentMatrix = \begin{bmatrix}
# \latentMatrix_0 \\
# \latentMatrix_1 \\
# \vdots \\
# \latentMatrix_\numLayers
# \end{bmatrix}
# $$ we can obtain
# $$\latentMatrix \sim \gaussianSamp{\zerosVector}{\begin{bmatrix}
# \eye & \weightMatrix^\top_1 & \weightMatrix_1^\top\weightMatrix_2^\top & \dots & \vMatrix^\top \\
# \weightMatrix_1 & \weightMatrix_1 \weightMatrix_1^\top & \weightMatrix_1 \weightMatrix_1^\top \weightMatrix_2^\top & \dots & \weightMatrix_1 \vMatrix^\top \\
# \weightMatrix_2 \weightMatrix_1 & \weightMatrix_2 \weightMatrix_1 \weightMatrix_1^\top & \weightMatrix_2 \weightMatrix_1 \weightMatrix_1^\top \weightMatrix_2^\top & \dots & \weightMatrix_2 \weightMatrix_1 \vMatrix^\top \\
# \vdots & \vdots & \vdots & \ddots & \vdots \\
# \vMatrix & \vMatrix \weightMatrix_1^\top & \vMatrix \weightMatrix_1^\top \weightMatrix_2^\top& \dots & \vMatrix\vMatrix^\top
# \end{bmatrix}}$$ which is a highly structured Gaussian covariance with
# hierarchical dependencies between the variables $\latentMatrix_i$.
#
# ### Stacked GP

# In[ ]:

pods.notebook.display_plots('stack-gp-sample-RBF-{sample:0>1}.svg', directory='../../slides/diagrams/deepgp', sample=(0,4))

#
# Stacking Gaussian process models leads to non-linear mappings at each
# stage. Here we are mapping from two dimensions to two dimensions in each
# layer.
#
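# The stacked samples shown above come from pre-computed figures. The cell
# below is a rough one dimensional sketch (not part of the original notebook)
# of the same idea: draw a sample from one GP on a grid, then use that
# sample's values as the inputs to a second GP. This is noise free ancestral
# sampling through the composition, not the variational deep GP fitted later.

# In[ ]:

import numpy as np
import GPy

np.random.seed(0)
x_grid = np.linspace(-1, 1, 200)[:, None]
kern = GPy.kern.RBF(1, lengthscale=0.2)

K1 = kern.K(x_grid) + 1e-8*np.eye(200)
f1 = np.random.multivariate_normal(np.zeros(200), K1)    # first layer sample

K2 = kern.K(f1[:, None]) + 1e-8*np.eye(200)               # second layer sees f1 as its input
f2 = np.random.multivariate_normal(np.zeros(200), K2)     # composite sample f2(f1(x))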
# Note that once the box has folded over on itself, it cannot be unfolded.
# So a feature that is generated near the top of the model cannot be
# removed further down the model.
#
# This folding over effect happens in low dimensions. In higher dimensions
# it is less common.
#
# Observation of this effect at a talk in Cambridge was one of the things
# that caused David Duvenaud (and collaborators) to consider the behavior
# of deeper Gaussian process models [@Duvenaud:pathologies14].
#
# Such folding over in the latent spaces necessarily forces the density to
# be non-Gaussian. Indeed, since folding-over is avoided as we increase
# the dimensionality of the latent spaces, such processes become more
# Gaussian. If we take the limit of the latent space dimensionality as it
# tends to infinity, the entire deep Gaussian process returns to a
# standard Gaussian process, with a covariance function given as a deep
# kernel (such as those described by @Cho:deep09).
#
# Further analysis of these deep networks has been conducted by
# @Dunlop:deep2017, who use analysis of the deep network's stationary
# density (treating it as a Markov chain across layers) to explore the
# nature of the implied process prior for a deep GP.
#
# Both of these works, however, make constraining assumptions on the form
# of the Gaussian process prior at each layer (e.g. same covariance at
# each layer). In practice, the form of this covariance can be learnt and
# the densities described by the deep GP are more general than those
# mentioned in either of these papers.

# In[ ]:

from IPython.lib.display import YouTubeVideo
YouTubeVideo('XhIvygQYFFQ')

# David Duvenaud also created a YouTube video to help visualize what
# happens as you drop through the layers of a deep GP.
#
# ### GPy: A Gaussian Process Framework in Python
#
# # # #
#
# GPy is a BSD licensed software code base for implementing Gaussian
# process models in Python. This allows GPs to be combined with a wide
# variety of software libraries.
#
# The software itself is available on
# [GitHub](https://github.com/SheffieldML/GPy) and the team welcomes
# contributions.
#
# The aim for GPy is to be a probabilistic-style programming language,
# i.e. you specify the model rather than the algorithm. As well as a large
# range of covariance functions the software allows for non-Gaussian
# likelihoods, multivariate outputs, dimensionality reduction and
# approximations for larger data sets.
#
# The GPy library can be installed via pip:

# In[ ]:

get_ipython().run_line_magic('pip', 'install GPy')

# This notebook depends on PyDeepGP. These libraries can be installed via
# pip:

# In[ ]:

get_ipython().run_line_magic('pip', 'install git+https://github.com/SheffieldML/PyDeepGP.git')

# ### Olympic Marathon Data
#
#
# - Gold medal times for Olympic Marathon since 1896.
# - Marathons before 1924 didn’t have a standardised distance.
# - Present results using pace per km.
# - In 1904 the Marathon was badly organised, leading to very slow times.
#
#
# # # #
# # Image from Wikimedia Commons #
# The first thing we will do is load a standard data set for regression # modelling. The data consists of the pace of Olympic Gold Medal Marathon # winners for the Olympics from 1896 to present. First we load in the data # and plot. # In[ ]: import numpy as np import pods # In[ ]: data = pods.datasets.olympic_marathon_men() x = data['X'] y = data['Y'] offset = y.mean() scale = np.sqrt(y.var()) # In[ ]: import matplotlib.pyplot as plt import teaching_plots as plot import mlai # In[ ]: xlim = (1875,2030) ylim = (2.5, 6.5) yhat = (y-offset)/scale fig, ax = plt.subplots(figsize=plot.big_wide_figsize) _ = ax.plot(x, y, 'r.',markersize=10) ax.set_xlabel('year', fontsize=20) ax.set_ylabel('pace min/km', fontsize=20) ax.set_xlim(xlim) ax.set_ylim(ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True) #
# # # #
#
# Things to notice about the data include the outlier in 1904: in that
# year the Olympics was in St Louis, USA. Organizational problems and
# challenges with dust kicked up by the cars following the race meant that
# participants got lost, and only a very few participants completed the
# course.
#
# More recent years see more consistently quick marathons.
#
# Data is fine for answering very specific questions, like "Who won the
# Olympic Marathon in 2012?", because we have that answer stored. However,
# we are not given the answers to many other questions. For example, Alan
# Turing was a formidable marathon runner: in 1946 he ran a time of 2 hours
# 46 minutes (just under four minutes per kilometer, faster than I and
# most of the other [Endcliffe Park
# Run](http://www.parkrun.org.uk/sheffieldhallam/) runners can do 5 km).
# What is the probability he would have won an Olympics if one had been
# held in 1946?
#
#
# # # #
#
# Alan Turing, who in 1946 was only 11 minutes slower than the winner of
# the 1948 games. Would he have won a hypothetical games held in 1946?
# Source: [Alan Turing Internet
# Scrapbook](http://www.turing.org.uk/scrapbook/run.html)
#
# Our first objective will be to perform a Gaussian process fit to the
# data. We'll do this using the [GPy
# software](https://github.com/SheffieldML/GPy).

# In[ ]:

import GPy

# In[ ]:

m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function

# The first command sets up the model, then `m_full.optimize()` optimizes
# the parameters of the covariance function and the noise level of the
# model. Once the fit is complete, we'll try creating some test points,
# and computing the output of the GP model in terms of the mean and
# standard deviation of the posterior functions between 1870 and 2030. We
# plot the mean function and the standard deviation at 200 locations. We
# can obtain the predictions using `y_mean, y_var = m_full.predict(xt)`.

# In[ ]:

xt = np.linspace(1870,2030,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)

# Now we plot the results using the helper function in `teaching_plots`.

# In[ ]:

import teaching_plots as plot

# In[ ]:

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/olympic-marathon-gp.svg', transparent=True, frameon=True)

# ### Fit Quality
#
# In the fit we see that the error bars (coming mainly from the noise
# variance) are quite large. This is likely due to the outlier point in
# 1904; ignoring that point, we can see that a tighter fit is obtained. To
# see this, we make a version of the model, `m_clean`, where that point is
# removed.

# In[ ]:

x_clean=np.vstack((x[0:2, :], x[3:, :]))
y_clean=np.vstack((y[0:2, :], y[3:, :]))
m_clean = GPy.models.GPRegression(x_clean,y_clean)
_ = m_clean.optimize()

# ### Deep GP Fit
#
# Let's see if a deep Gaussian process can help here. We will construct a
# deep Gaussian process with one hidden layer (i.e. one Gaussian process
# feeding into another).
#
# Build a Deep GP with an additional hidden layer (one dimensional) to fit
# the model.

# In[ ]:

import GPy
import deepgp

# In[ ]:

hidden = 1
m = deepgp.DeepGP([y.shape[1],hidden,x.shape[1]],Y=yhat, X=x, inits=['PCA','PCA'], kernels=[GPy.kern.RBF(hidden,ARD=True), GPy.kern.RBF(x.shape[1],ARD=True)], # the kernels for each layer
                  num_inducing=50, back_constraint=False)

# Deep Gaussian process models can also require some thought in
# initialization. Here we choose to start by setting the noise variance to
# be one percent of the data variance.
#
# Optimization requires moving variational parameters in the hidden layer
# representing the mean and variance of the expected values in that layer.
# Since all those values can be scaled up, with the only effect being a
# downscaling of the output of the first GP and a downscaling of the
# input length scale to the second GP, it makes sense to first of all fix
# the scales of the covariance function in each of the GPs.
#
# Sometimes, deep Gaussian processes can find a local minimum which
# involves increasing the noise level of one or more of the GPs. This
# often occurs because it allows a minimum in the KL divergence term in
# the lower bound on the likelihood. To avoid this minimum we habitually
# train with the likelihood variance (the noise on the output of the GP)
# fixed to some lower value for some iterations.
#
# Let's create a helper function to initialize the models we use in the
# notebook.
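# Before loading the notebook's own `initialize` helper in the next cell, the
# sketch below shows the kind of initialization described above. It is an
# assumption about the strategy (fix each layer's noise variance to roughly
# one percent of the data variance for the early iterations), not the actual
# contents of the helper.

# In[ ]:

# A rough sketch of the initialization strategy described above; the loaded
# initialize helper, not this cell, is what the rest of the notebook relies on.
for layer in m.layers:
    layer.likelihood.variance = 0.01*yhat.var()
    layer.likelihood.variance.fix()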
# In[ ]: import deepgp # In[ ]: get_ipython().run_line_magic('load', '-s initialize deepgp_tutorial.py') # In[ ]: # Bind the new method to the Deep GP object. deepgp.DeepGP.initialize=initialize # In[ ]: # Call the initalization m.initialize() # Now optimize the model. The first stage of optimization is working on # variational parameters and lengthscales only. # In[ ]: m.optimize(messages=False,max_iters=100) # Now we remove the constraints on the scale of the covariance functions # associated with each GP and optimize again. # In[ ]: for layer in m.layers: pass #layer.kern.variance.constrain_positive(warning=False) m.obslayer.kern.variance.constrain_positive(warning=False) m.optimize(messages=False,max_iters=100) # Finally, we allow the noise variance to change and optimize for a large # number of iterations. # In[ ]: for layer in m.layers: layer.likelihood.variance.constrain_positive(warning=False) m.optimize(messages=True,max_iters=10000) # For our optimization process we define a new function. # In[ ]: get_ipython().run_line_magic('load', '-s staged_optimize deepgp_tutorial.py') # In[ ]: # Bind the new method to the Deep GP object. deepgp.DeepGP.staged_optimize=staged_optimize # In[ ]: m.staged_optimize(messages=(True,True,True)) # ### Plot the prediction # # The prediction of the deep GP can be extracted in a similar way to the # normal GP. Although, in this case, it is an approximation to the true # distribution, because the true distribution is not Gaussian. # In[ ]: import matplotlib.pyplot as plt # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2) ax.set_xlim(xlim) ax.set_ylim(ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp.svg', transparent=True, frameon=True) # ### Olympic Marathon Data Deep GP # # # In[ ]: get_ipython().run_line_magic('load', '-s posterior_sample deepgp_tutorial.py') # In[ ]: deepgp.DeepGP.posterior_sample = posterior_sample # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, xlabel='year', ylabel='pace min/km', portion = 0.225) ax.set_xlim(xlim) ax.set_ylim(ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp-samples.svg', transparent=True, frameon=True) # ### Olympic Marathon Data Deep GP # # # # ### Fitted GP for each layer # # Now we explore the GPs the model has used to fit each layer. First of # all, we look at the hidden layer. # In[ ]: get_ipython().run_line_magic('load', '-s visualize deepgp_tutorial.py') # In[ ]: # Bind the new method to the Deep GP object. deepgp.DeepGP.visualize=visualize # In[ ]: m.visualize(scale=scale, offset=offset, xlabel='year', ylabel='pace min/km',xlim=xlim, ylim=ylim, dataset='olympic-marathon', diagrams='../slides/diagrams/deepgp') # In[ ]: import pods # In[ ]: pods.notebook.display_plots('olympic-marathon-deep-gp-layer-{sample:0>1}.svg', '../slides/diagrams/deepgp', sample=(0,1)) # # # # In[ ]: get_ipython().run_line_magic('load', '-s visualize_pinball deepgp_tutorial.py') # In[ ]: # Bind the new method to the Deep GP object. 
deepgp.DeepGP.visualize_pinball=visualize_pinball

# In[ ]:

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, scale=scale, offset=offset, points=30, portion=0.1,
                    xlabel='year', ylabel='pace min/km', vertical=True)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp-pinball.svg',
                  transparent=True, frameon=True)

# ### Olympic Marathon Pinball Plot
#
# The pinball plot shows the flow of any input ball through the deep Gaussian process. In a pinball plot a series of vertical parallel lines would indicate a purely linear function. For the Olympic marathon data we can see that the first layer begins to shift the inputs towards the right. Note that it also does so with some uncertainty (indicated by the shaded backgrounds). The second layer has less uncertainty, but bunches the inputs more strongly to the right. This input layer of uncertainty, followed by a layer that pushes inputs to the right, is what gives the heteroscedastic noise.
#
# ### Della Gatta Gene Data
#
# - Given expression levels in the form of a time series from @DellaGatta:direct08.

# In[ ]:

import numpy as np
import pods

# In[ ]:

data = pods.datasets.della_gatta_TRP63_gene_expression(data_set='della_gatta',gene_number=937)
x = data['X']
y = data['Y']
offset = y.mean()
scale = np.sqrt(y.var())

# In[ ]:

import matplotlib.pyplot as plt
import teaching_plots as plot
import mlai

# In[ ]:

xlim = (-20,260)
ylim = (5, 7.5)
yhat = (y-offset)/scale

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('time/min', fontsize=20)
ax.set_ylabel('expression', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/della-gatta-gene.svg',
                  transparent=True, frameon=True)
#
# - We want to detect whether a gene is expressed or not, so we fit a GP to each gene @Kalaitzis:simple11.
#
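# As a rough sketch of the idea in @Kalaitzis:simple11 (the function below is illustrative only; the paper's actual test statistic and covariance choices may differ), a gene can be ranked by comparing the log marginal likelihood of a GP with a smooth (RBF) covariance against that of a constant-plus-noise model for its time course.

# In[ ]:

import GPy

def expression_score(x, y):
    """Difference in log marginal likelihood between a smooth GP fit and a
    constant (bias) plus noise fit to a single gene's time course."""
    m_smooth = GPy.models.GPRegression(x, y, GPy.kern.RBF(1))
    m_smooth.optimize()
    m_flat = GPy.models.GPRegression(x, y, GPy.kern.Bias(1))
    m_flat.optimize()
    return m_smooth.log_likelihood() - m_flat.log_likelihood()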
# Our first objective will be to perform a Gaussian process fit to the data; we'll do this using the [GPy software](https://github.com/SheffieldML/GPy).

# In[ ]:

import GPy

# In[ ]:

m_full = GPy.models.GPRegression(x,yhat)
m_full.kern.lengthscale=50
_ = m_full.optimize() # Optimize parameters of covariance function

# We initialize the length scale parameter (which here actually represents a *time scale* of the covariance function) to a reasonable value. The default would be 1, but here we set it to 50 minutes, given that the points are spread across zero to 250 minutes.

# In[ ]:

xt = np.linspace(-20,260,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)

# Now we plot the results using the helper function in `teaching_plots`.

# In[ ]:

import teaching_plots as plot

# In[ ]:

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/della-gatta-gene-gp.svg',
                  transparent=True, frameon=True)

# Now we try a model initialized with a longer length scale.

# In[ ]:

m_full2 = GPy.models.GPRegression(x,yhat)
m_full2.kern.lengthscale=2000
_ = m_full2.optimize() # Optimize parameters of covariance function

# In[ ]:

import teaching_plots as plot

# In[ ]:

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full2, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full2.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/della-gatta-gene-gp2.svg',
                  transparent=True, frameon=True)

# Now we try a model initialized with a lower noise.
# In[ ]: m_full3 = GPy.models.GPRegression(x,yhat) m_full3.kern.lengthscale=20 m_full3.likelihood.variance=0.001 _ = m_full3.optimize() # Optimize parameters of covariance function # In[ ]: import teaching_plots as plot # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m_full3, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2) ax.set_xlim(xlim) ax.set_ylim(ylim) ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full3.log_likelihood()), fontsize=20) mlai.write_figure(figure=fig, filename='../slides/diagrams/gp/della-gatta-gene-gp3.svg', transparent=True, frameon=True) # # # # # # In[ ]: layers = [y.shape[1], 1,x.shape[1]] inits = ['PCA']*(len(layers)-1) kernels = [] for i in layers[1:]: kernels += [GPy.kern.RBF(i)] m = deepgp.DeepGP(layers,Y=yhat, X=x, inits=inits, kernels=kernels, # the kernels for each layer num_inducing=20, back_constraint=False) # In[ ]: m.initialize() m.staged_optimize() # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(filename='../slides/diagrams/deepgp/della-gatta-gene-deep-gp.svg', transparent=True, frameon=True) # ### TP53 Gene Data Deep GP # # # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, portion = 0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/della-gatta-gene-deep-gp-samples.svg', transparent=True, frameon=True) # ### TP53 Gene Data Deep GP # # # In[ ]: m.visualize(offset=offset, scale=scale, xlim=xlim, ylim=ylim, dataset='della-gatta-gene', diagrams='../slides/diagrams/deepgp') # ### TP53 Gene Data Latent 1 # # # # ### TP53 Gene Data Latent 2 # # # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) m.visualize_pinball(offset=offset, ax=ax, scale=scale, xlim=xlim, ylim=ylim, portion=0.1, points=50) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/della-gatta-gene-deep-gp-pinball.svg', transparent=True, frameon=True, ax=ax) # ### TP53 Gene Pinball Plot # # # # ### Step Function # # Next we consider a simple step function data set. 
# In[223]: num_low=25 num_high=25 gap = -.1 noise=0.0001 x = np.vstack((np.linspace(-1, -gap/2.0, num_low)[:, np.newaxis], np.linspace(gap/2.0, 1, num_high)[:, np.newaxis])) y = np.vstack((np.zeros((num_low, 1)), np.ones((num_high,1)))) scale = np.sqrt(y.var()) offset = y.mean() yhat = (y-offset)/scale # In[229]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) _ = ax.plot(x, y, 'r.',markersize=10) _ = ax.set_xlabel('$x$', fontsize=20) _ = ax.set_ylabel('$y$', fontsize=20) xlim = (-2, 2) ylim = (-0.6, 1.6) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/step-function.svg', transparent=True, frameon=True) # ### Step Function Data # # # In[227]: m_full = GPy.models.GPRegression(x,yhat) _ = m_full.optimize() # Optimize parameters of covariance function # ### Step Function Data GP # # # In[228]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m_full, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig,filename='../slides/diagrams/gp/step-function-gp.svg', transparent=True, frameon=True) # In[ ]: layers = [y.shape[1], 1, 1, 1,x.shape[1]] inits = ['PCA']*(len(layers)-1) kernels = [] for i in layers[1:]: kernels += [GPy.kern.RBF(i)] m = deepgp.DeepGP(layers,Y=yhat, X=x, inits=inits, kernels=kernels, # the kernels for each layer num_inducing=20, back_constraint=False) # In[ ]: m.initialize() m.staged_optimize() # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(filename='../slides/diagrams/deepgp/step-function-deep-gp.svg', transparent=True, frameon=True) # ### Step Function Data Deep GP # # # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, portion = 0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/step-function-deep-gp-samples.svg', transparent=True, frameon=True) # ### Step Function Data Deep GP # # # In[ ]: m.visualize(offset=offset, scale=scale, xlim=xlim, ylim=ylim, dataset='step-function', diagrams='../slides/diagrams/deepgp') # ### Step Function Data Latent 1 # # # # ### Step Function Data Latent 2 # # # # ### Step Function Data Latent 3 # # # # ### Step Function Data Latent 4 # # # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) m.visualize_pinball(offset=offset, ax=ax, scale=scale, xlim=xlim, ylim=ylim, portion=0.1, points=50) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/step-function-deep-gp-pinball.svg', transparent=True, frameon=True, ax=ax) # ### Step Function Pinball Plot # # # In[ ]: import pods # In[ ]: data = pods.datasets.mcycle() x = data['X'] y = data['Y'] scale=np.sqrt(y.var()) offset=y.mean() yhat = (y - offset)/scale # In[ ]: fig, ax = plt.subplots(figsize=plot.big_wide_figsize) _ = ax.plot(x, y, 'r.',markersize=10) _ = ax.set_xlabel('time', fontsize=20) _ = ax.set_ylabel('acceleration', fontsize=20) xlim = (-20, 80) ylim = (-175, 125) ax.set_xlim(xlim) ax.set_ylim(ylim) mlai.write_figure(filename='../slides/diagrams/datasets/motorcycle-helmet.svg', transparent=True, frameon=True) # ### Motorcycle Helmet Data # # # In[ ]: m_full = GPy.models.GPRegression(x,yhat) _ = m_full.optimize() # Optimize parameters of covariance function # ### Motorcycle Helmet Data GP # # # In[ ]: import deepgp # In[ ]: layers = 
[y.shape[1], 1, x.shape[1]] inits = ['PCA']*(len(layers)-1) kernels = [] for i in layers[1:]: kernels += [GPy.kern.RBF(i)] m = deepgp.DeepGP(layers,Y=yhat, X=x, inits=inits, kernels=kernels, # the kernels for each layer num_inducing=20, back_constraint=False) m.initialize() # In[ ]: m.staged_optimize(iters=(1000,1000,10000), messages=(True, True, True)) # In[ ]: import teaching_plots as plot import mlai # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='acceleration/$g$', fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(filename='../slides/diagrams/deepgp/motorcycle-helmet-deep-gp.svg', transparent=True, frameon=True) # ### Motorcycle Helmet Data Deep GP # # # In[ ]: import teaching_plots as plot import mlai # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, xlabel='time', ylabel='acceleration/$g$', portion = 0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-samples.svg', transparent=True, frameon=True) # ### Motorcycle Helmet Data Deep GP # # # In[ ]: m.visualize(xlim=xlim, ylim=ylim, scale=scale,offset=offset, xlabel="time", ylabel="acceleration/$g$", portion=0.5, dataset='motorcycle-helmet', diagrams='../slides/diagrams/deepgp') # ### Motorcycle Helmet Data Latent 1 # # # # ### Motorcycle Helmet Data Latent 2 # # # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) m.visualize_pinball(ax=ax, xlabel='time', ylabel='acceleration/g', points=50, scale=scale, offset=offset, portion=0.1) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-pinball.svg', transparent=True, frameon=True) # ### Motorcycle Helmet Pinball Plot # # # # ### Robot Wireless Data # # The robot wireless data is taken from an experiment run by Brian Ferris # at University of Washington. It consists of the measurements of WiFi # access point signal strengths as Brian walked in a loop. # In[ ]: data=pods.datasets.robot_wireless() x = np.linspace(0,1,215)[:, np.newaxis] y = data['Y'] offset = y.mean() scale = np.sqrt(y.var()) yhat = (y-offset)/scale # The ground truth is recorded in the data, the actual loop is given in # the plot below. # In[ ]: fig, ax = plt.subplots(figsize=plot.big_figsize) plt.plot(data['X'][:, 1], data['X'][:, 2], 'r.', markersize=5) ax.set_xlabel('x position', fontsize=20) ax.set_ylabel('y position', fontsize=20) mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/robot-wireless-ground-truth.svg', transparent=True, frameon=True) # ### Robot Wireless Ground Truth # # # # We will ignore this ground truth in making our predictions, but see if # the model can recover something similar in one of the latent layers. # In[ ]: output_dim=1 xlim = (-0.3, 1.3) fig, ax = plt.subplots(figsize=plot.big_wide_figsize) _ = ax.plot(x.flatten(), y[:, output_dim], 'r.', markersize=5) ax.set_xlabel('time', fontsize=20) ax.set_ylabel('signal strength', fontsize=20) xlim = (-0.2, 1.2) ylim = (-0.6, 2.0) ax.set_xlim(xlim) ax.set_ylim(ylim) mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/robot-wireless-dim-' + str(output_dim) + '.svg', transparent=True, frameon=True) # ### Robot WiFi Data # # # # Perform a Gaussian process fit on the data using GPy. 
# In[ ]: m_full = GPy.models.GPRegression(x,yhat) _ = m_full.optimize() # Optimize parameters of covariance function # In[ ]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m_full, output_dim=output_dim, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(filename='../slides/diagrams/gp/robot-wireless-gp-dim-' + str(output_dim)+ '.svg', transparent=True, frameon=True) # ### Robot WiFi Data GP # # # In[ ]: layers = [y.shape[1], 10, 5, 2, 2, x.shape[1]] inits = ['PCA']*(len(layers)-1) kernels = [] for i in layers[1:]: kernels += [GPy.kern.RBF(i, ARD=True)] # In[ ]: m = deepgp.DeepGP(layers,Y=y, X=x, inits=inits, kernels=kernels, num_inducing=50, back_constraint=False) m.initialize() # In[ ]: m.staged_optimize(messages=(True,True,True)) # In[217]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_output(m, output_dim=output_dim, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/robot-wireless-deep-gp-dim-' + str(output_dim)+ '.svg', transparent=True, frameon=True) # ### Robot WiFi Data Deep GP # # # In[219]: fig, ax=plt.subplots(figsize=plot.big_wide_figsize) plot.model_sample(m, output_dim=output_dim, scale=scale, offset=offset, samps=10, ax=ax, xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5) ax.set_ylim(ylim) ax.set_xlim(xlim) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/robot-wireless-deep-gp-samples-dim-' + str(output_dim)+ '.svg', transparent=True, frameon=True) # ### Robot WiFi Data Deep GP # # # # ### Robot WiFi Data Latent Space # # # In[222]: fig, ax = plt.subplots(figsize=plot.big_figsize) ax.plot(m.layers[-2].latent_space.mean[:, 0], m.layers[-2].latent_space.mean[:, 1], 'r.-', markersize=5) ax.set_xlabel('latent dimension 1', fontsize=20) ax.set_ylabel('latent dimension 2', fontsize=20) mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/robot-wireless-latent-space.svg', transparent=True, frameon=True) # ### Robot WiFi Data Latent Space # # # # ### Motion Capture # # - ‘High five’ data. # - Model learns structure between two interacting subjects. # # ### Shared LVM # # # # # # [Thanks to: Zhenwen Dai and Neil D. # Lawrence]{style="text-align:right"} # # We now look at the deep Gaussian processes' capacity to perform # unsupervised learning. # # We will look at a sub-sample of the MNIST digit data set. # # First load in the MNIST data set from scikit learn. This can take a # little while because it's large to download. # In[ ]: from sklearn.datasets import fetch_mldata # In[ ]: mnist = fetch_mldata('MNIST original') # Sub-sample the dataset to make the training faster. # In[ ]: import numpy as np # In[ ]: np.random.seed(0) digits = [0,1,2,3,4] N_per_digit = 100 Y = [] labels = [] for d in digits: imgs = mnist['data'][mnist['target']==d] Y.append(imgs[np.random.permutation(imgs.shape[0])][:N_per_digit]) labels.append(np.ones(N_per_digit)*d) Y = np.vstack(Y).astype(np.float64) labels = np.hstack(labels) Y /= 255. # ### Fit a Deep GP # # We're going to fit a Deep Gaussian process model to the MNIST data with # two hidden layers. Each of the two Gaussian processes (one from the # first hidden layer to the second, one from the second hidden layer to # the data) has an exponentiated quadratic covariance. 
# In[ ]:

import deepgp
import GPy

# In[ ]:

num_latent = 2
num_hidden_2 = 5
m = deepgp.DeepGP([Y.shape[1],num_hidden_2,num_latent], Y,
                  kernels=[GPy.kern.RBF(num_hidden_2,ARD=True),
                           GPy.kern.RBF(num_latent,ARD=False)],
                  num_inducing=50, back_constraint=False,
                  encoder_dims=[[200],[200]])

# ### Initialization
#
# Just like deep neural networks, there are some tricks to initializing these models. The tricks we use here include some early training of the model with the model parameters constrained. This gives the variational inducing parameters some scope to tighten the bound for the case where the noise variance is small and the variances of the Gaussian processes are around 1.

# In[ ]:

m.obslayer.likelihood.variance[:] = Y.var()*0.01
for layer in m.layers:
    layer.kern.variance.fix(warning=False)
    layer.likelihood.variance.fix(warning=False)

# We now optimize for a hundred iterations with the constrained model.

# In[ ]:

m.optimize(messages=False,max_iters=100)

# Now we remove the fixed constraint on the kernel variance parameters, but keep the noise output constrained, and run for a further 100 iterations.

# In[ ]:

for layer in m.layers:
    layer.kern.variance.constrain_positive(warning=False)
m.optimize(messages=False,max_iters=100)

# Finally we unconstrain the layer likelihoods and allow the full model to be trained for a further 10,000 iterations.

# In[ ]:

for layer in m.layers:
    layer.likelihood.variance.constrain_positive(warning=False)
m.optimize(messages=True,max_iters=10000)

# ### Visualize the latent space of the top layer
#
# Now that the model is trained, let's plot the mean of the posterior distributions in the top latent layer of the model.

# In[ ]:

import matplotlib.pyplot as plt
from matplotlib import rc
import teaching_plots as plot
import mlai

# In[ ]:

rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'],'size':20})
fig, ax = plt.subplots(figsize=plot.big_figsize)
for d in digits:
    ax.plot(m.layer_1.X.mean[labels==d,0],m.layer_1.X.mean[labels==d,1],'.',label=str(d))
_ = plt.legend()
mlai.write_figure(figure=fig, filename="../slides/diagrams/deepgp/usps-digits-latent.svg", transparent=True)

# ### Visualize the latent space of the intermediate layer
#
# We can also visualize dimensions of the intermediate layer. First, the lengthscale of those dimensions is given by

# In[ ]:

m.obslayer.kern.lengthscale

# In[ ]:

import matplotlib.pyplot as plt
import mlai

# In[ ]:

fig, ax = plt.subplots(figsize=plot.big_figsize)
for i in range(5):
    for j in range(i):
        dims=[i, j]
        ax.cla()
        for d in digits:
            ax.plot(m.obslayer.X.mean[labels==d,dims[0]],
                    m.obslayer.X.mean[labels==d,dims[1]],
                    '.', label=str(d))
        plt.legend()
        plt.xlabel('dimension ' + str(dims[0]))
        plt.ylabel('dimension ' + str(dims[1]))
        mlai.write_figure(figure=fig, filename="../slides/diagrams/deepgp/usps-digits-hidden-" + str(dims[0]) + '-' + str(dims[1]) + '.svg', transparent=True)

# ### Generate From Model
#
# Now we can take a look at a sample from the model, by drawing a Gaussian random sample in the latent space and propagating it through the model.
# In[ ]: rows = 10 cols = 20 t=np.linspace(-1, 1, rows*cols)[:, None] kern = GPy.kern.RBF(1,lengthscale=0.05) cov = kern.K(t, t) x = np.random.multivariate_normal(np.zeros(rows*cols), cov, num_latent).T # In[ ]: import matplotlib.pyplot as plt import mlai # In[ ]: yt = m.predict(x) fig, axs = plt.subplots(rows,cols,figsize=(10,6)) for i in range(rows): for j in range(cols): #v = np.random.normal(loc=yt[0][i*cols+j, :], scale=np.sqrt(yt[1][i*cols+j, :])) v = yt[0][i*cols+j, :] axs[i,j].imshow(v.reshape(28,28), cmap='gray', interpolation='none', aspect='equal') axs[i,j].set_axis_off() mlai.write_figure(figure=fig, filename="../slides/diagrams/deepgp/digit-samples-deep-gp.svg", transparent=True) # # # # # - *Gaussian process based nonlinear latent structure discovery in # multivariate spike train data* @Anqi:gpspike2017 # - *Doubly Stochastic Variational Inference for Deep Gaussian # Processes* @Salimbeni:doubly2017 # - *Deep Multi-task Gaussian Processes for Survival Analysis with # Competing Risks* @Alaa:deep2017 # - *Counterfactual Gaussian Processes for Reliable Decision-making and # What-if Reasoning* @Schulam:counterfactual17 # # - *Deep Survival Analysis* @Ranganath-survival16 # - *Recurrent Gaussian Processes* @Mattos:recurrent15 # - *Gaussian Process Based Approaches for Survival Analysis* # @Saul:thesis2016 # #
# [Emukit playground](https://amzn.github.io/emukit-playground/)
#
# Emukit playground is a tutorial for understanding the simulation/emulation relationship.
#
# [Emukit playground: Bayesian optimization](https://amzn.github.io/emukit-playground/#!/learn/bayesian_optimization)
#
# Tutorial on Bayesian optimization of the number of taxis deployed, from Emukit playground.
#
# ### Uncertainty Quantification
#
# > Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.
#
# We will illustrate different concepts of [Uncertainty Quantification](https://en.wikipedia.org/wiki/Uncertainty_quantification) (UQ) and the role that Gaussian processes play in this field. Based on a simple simulator of a car moving between a valley and a mountain, we are going to illustrate the following concepts:
#
# - **Systems emulation**. Many real world decisions are based on simulations that can be computationally very demanding. We will show how simulators can be replaced by *emulators*: Gaussian process models fitted on a few simulations that can be used to replace the *simulator*. Emulators are cheap to compute, fast to run, and always provide ways to quantify the uncertainty about how precise they are compared to the original simulator (a minimal code sketch of this idea appears just after the figure below).
#
# - **Emulators in optimization problems**. We will show how emulators can be used to optimize black-box functions that are expensive to evaluate. This field is also called Bayesian Optimization and has gained increasing relevance in machine learning as emulators can be used to optimize computer simulations (and machine learning algorithms) quite efficiently.
#
# - **Multi-fidelity emulation methods**. In many scenarios we have simulators of different quality for the same measure of interest. In these cases the goal is to merge all sources of information under the same model so the final emulator is cheaper and more accurate than an emulator fitted only using data from the most accurate and expensive simulator.
#
# ### Mountain Car Simulator
#
# To illustrate the above-mentioned concepts we use the [mountain car simulator](https://github.com/openai/gym/wiki/MountainCarContinuous-v0). This simulator is widely used in machine learning to test reinforcement learning algorithms. The goal is to define a control policy on a car whose objective is to climb a mountain. Graphically, the problem looks as follows:
#
#
# The mountain car simulation from the Open AI gym.
#
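# To make the idea of systems emulation above concrete before describing the mountain car problem itself, here is a minimal sketch: run an expensive simulator at a handful of design points, fit a GP to those runs, and then query the cheap GP surrogate (with uncertainty) instead of the simulator. The `expensive_simulator` function below is a toy stand-in for illustration, not part of the mountain car example.

# In[ ]:

import numpy as np
import GPy

def expensive_simulator(theta):
    """Toy stand-in for a computationally demanding simulation."""
    return np.sin(3*theta) + 0.1*theta**2

# Run the simulator at a small number of design points ...
theta_design = np.linspace(-1, 1, 8)[:, np.newaxis]
f_design = expensive_simulator(theta_design)

# ... and fit a GP emulator that can then be queried cheaply, with uncertainty.
emulator = GPy.models.GPRegression(theta_design, f_design)
emulator.optimize()
emulator_mean, emulator_var = emulator.predict(np.linspace(-1, 1, 200)[:, np.newaxis])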
# The goal is to define a sequence of actions (push the car right or left with a certain intensity) to make the car reach the flag after a number $T$ of time steps.
#
# At each time step $t$, the car is characterized by a vector $\inputVector_{t} = (p_t,v_t)$ of states which are respectively the position and velocity of the car at time $t$. For a sequence of states (an episode), the dynamics of the car are given by
#
# $$\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})$$
#
# where $\textbf{u}_{t}$ is the value of an action force, which in this example corresponds to pushing the car to the left (negative value) or to the right (positive value). The actions across a full episode are represented in a policy $\textbf{u}_{t} = \pi(\inputVector_{t},\theta)$ that acts according to the current state of the car and some parameters $\theta$. In the following examples we will assume that the policy is linear, which allows us to write $\pi(\inputVector_{t},\theta)$ as
#
# $$\pi(\inputVector,\theta)= \theta_0 + \theta_p p + \theta_v v.$$
#
# For $t=1,\dots,T$, given some initial state $\inputVector_{0}$ and some values of each $\textbf{u}_{t}$, we can **simulate** the full dynamics of the car for a full episode using [Gym](https://gym.openai.com/envs/). The values of $\textbf{u}_{t}$ are fully determined by the parameters of the linear controller.
#
# After each episode of length $T$ is complete, a reward function $R_{T}(\theta)$ is computed. In the mountain car example the reward is computed as 100 for reaching the target of the hill on the right hand side, minus the squared sum of actions (a real negative to push to the left and a real positive to push to the right) from start to goal. Note that our reward depends on $\theta$ as we make it dependent on the parameters of the linear controller.
#
# ### Emulate the Mountain Car

# In[ ]:

import gym

# In[ ]:

env = gym.make('MountainCarContinuous-v0')

# Our goal in this section is to find the parameters $\theta$ of the linear controller such that
#
# $$\theta^* = \arg\max_{\theta} R_T(\theta).$$
#
# In this section, we directly use Bayesian optimization to solve this problem. We will use [GPyOpt](https://sheffieldml.github.io/GPyOpt/) so we first define the objective function:

# In[ ]:

import mountain_car as mc
import GPyOpt

# In[ ]:

obj_func = lambda x: mc.run_simulation(env, x)[0]
objective = GPyOpt.core.task.SingleObjective(obj_func)

# For each set of parameter values of the linear controller we can run an episode of the simulator (that we fix to have a horizon of $T=500$) to generate the reward. Using the parameters of the controller as inputs and the rewards as outputs, we can build a Gaussian process emulator of the reward.
#
# We start by defining the input space, which is three-dimensional:

# In[ ]:

## --- We define the input space of the emulator
space= [{'name':'position_parameter', 'type':'continuous', 'domain':(-1.2, +1)},
        {'name':'velocity_parameter', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
        {'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]

design_space = GPyOpt.Design_space(space=space)

# Now we initialize a Gaussian process emulator.

# In[ ]:

model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)

# In Bayesian optimization an acquisition function is used to balance exploration and exploitation when choosing new locations to evaluate close to the optimum of the objective.
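# For reference, the expected improvement acquisition that we use in the next cells has a simple closed form under the emulator's Gaussian posterior. Written for a minimization convention (an assumption about how GPyOpt treats the sign of the objective internally), with $y_{\text{best}}$ the best value observed so far and $\mu(\inputVector)$, $\sigma(\inputVector)$ the posterior mean and standard deviation of the emulator,
#
# $$\alpha_{\text{EI}}(\inputVector) = \expectation{\max\left(y_{\text{best}} - \mappingFunction(\inputVector), 0\right)} = \left(y_{\text{best}} - \mu(\inputVector)\right)\Phi\left(z\right) + \sigma(\inputVector)\phi\left(z\right), \qquad z = \frac{y_{\text{best}} - \mu(\inputVector)}{\sigma(\inputVector)},$$
#
# where $\Phi$ and $\phi$ are the standard Gaussian cumulative distribution function and density.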
# In this notebook we select the expected improvement (EI). For further details have a look at the review paper of [Shahriari et al (2015)](http://www.cs.ox.ac.uk/people/nando.defreitas/publications/BayesOptLoop.pdf).

# In[ ]:

aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)

acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition) # Collect points sequentially, no parallelization.

# To initialize the model we start by sampling some initial points (25) for the linear controller at random.

# In[ ]:

from GPyOpt.experiment_design.random_design import RandomDesign

# In[ ]:

n_initial_points = 25
random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(n_initial_points)

# Before we start any optimization, let's have a look at the behavior of the car with the first of these initial points that we have selected randomly.

# In[ ]:

import numpy as np

# In[ ]:

random_controller = initial_design[0,:]
_, _, _, frames = mc.run_simulation(env, np.atleast_2d(random_controller), render=True)
anim=mc.animate_frames(frames, 'Random linear controller')

# In[ ]:

from IPython.core.display import HTML

# In[ ]:

HTML(anim.to_jshtml())

# As we can see, the random linear controller does not manage to push the car to the top of the mountain. Now, let's optimize the regret using Bayesian optimization and the emulator for the reward. We try 50 new parameters chosen by the EI.

# In[ ]:

max_iter = 50
bo = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective, acquisition, evaluator, initial_design)
bo.run_optimization(max_iter = max_iter)

# Now we visualize the result for the best controller that we have found with Bayesian optimization.

# In[ ]:

_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller after 50 iterations of Bayesian optimization')

# In[ ]:

HTML(anim.to_jshtml())

# The car can now make it to the top of the mountain! Emulating the reward function and using the EI helped us to find a linear controller that solves the problem.
#
# ### Data Efficient Emulation
#
# In the previous section we solved the mountain car problem by directly emulating the reward, but no consideration of the dynamics $\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})$ of the system was made. Note that we had to run 75 episodes of 500 steps each to solve the problem, which required calling the simulator $500\times 75 = 37500$ times. In this section we will show how it is possible to reduce this number by building an emulator for $\mappingFunction$ that can later be used to directly optimize the control.
#
# The inputs of the model for the dynamics are the velocity, the position and the value of the control, so we create this space accordingly.

# In[ ]:

import gym

# In[ ]:

env = gym.make('MountainCarContinuous-v0')

# In[ ]:

import GPyOpt

# In[ ]:

space_dynamics = [{'name':'position', 'type':'continuous', 'domain':[-1.2, +0.6]},
                  {'name':'velocity', 'type':'continuous', 'domain':[-0.07, +0.07]},
                  {'name':'action', 'type':'continuous', 'domain':[-1, +1]}]
design_space_dynamics = GPyOpt.Design_space(space=space_dynamics)

# The outputs are the velocity and the position. Indeed, our model will capture the change in position and velocity over time.
# That is, we will model the next velocity and position, $v_{t+1}$ and $p_{t+1}$, with Gaussian processes whose prior means are the current values $v_{t}$ and $p_{t}$ respectively; equivalently, we model the increments
#
# $$\Delta v_{t+1} = v_{t+1} - v_{t}$$
#
# $$\Delta p_{t+1} = p_{t+1} - p_{t}$$
#
# with zero-mean Gaussian processes. As a covariance function, we use a Matern52. We therefore need two models to capture the full dynamics of the system.

# In[ ]:

position_model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)
velocity_model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)

# Next, we sample some input parameters and use the simulator to compute the outputs. Note that in this case we are not running full episodes; we are just using the simulator to compute $\inputVector_{t+1}$ given $\inputVector_{t}$ and $\textbf{u}_{t}$.

# In[ ]:

import numpy as np
from GPyOpt.experiment_design.random_design import RandomDesign
import mountain_car as mc

# In[ ]:

### --- Random locations of the inputs
n_initial_points = 500
random_design_dynamics = RandomDesign(design_space_dynamics)
initial_design_dynamics = random_design_dynamics.get_samples(n_initial_points)

# In[ ]:

### --- Simulation of the (normalized) outputs
y = np.zeros((initial_design_dynamics.shape[0], 2))
for i in range(initial_design_dynamics.shape[0]):
    y[i, :] = mc.simulation(initial_design_dynamics[i, :])

# Normalize the data from the simulation
y_normalisation = np.std(y, axis=0)
y_normalised = y/y_normalisation

# In general we might use much smarter strategies to design our emulation of the simulator. For example, we could use the variance of the predictive distributions of the models to collect points using uncertainty sampling, which will give us a better coverage of the space. For simplicity, we move ahead with the 500 randomly selected points.
#
# Now that we have a data set, we can update the emulators for the location and the velocity.

# In[ ]:

position_model.updateModel(initial_design_dynamics, y[:, [0]], None, None)
velocity_model.updateModel(initial_design_dynamics, y[:, [1]], None, None)

# We can now have a look at how the emulator and the simulator match. First, we show a contour plot of the car acceleration for each pair of car position and velocity. You can use the bar below to play with the values of the controller to compare the emulator and the simulator.

# In[ ]:

# Note: in newer Jupyter installations `interact` is provided by the `ipywidgets` package.
from IPython.html.widgets import interact

# We can see how the emulator is doing a fairly good job of approximating the simulator. On the edges, however, it struggles to capture the dynamics of the simulator.
#
# Given some input parameters of the linear controller, how do the dynamics of the emulator and simulator match? In the following figure we show the position and velocity of the car for the 500 time steps of an episode in which the parameters of the linear controller have been fixed beforehand. The value of the input control is also shown.

# In[ ]:

controller_gains = np.atleast_2d([0, .6, 1])  # change the values of the linear controller to observe the trajectories.

# We now make explicit use of the emulator, using it to replace the simulator and optimize the linear controller. Note that in this optimization, we don't need to query the simulator anymore as we can reproduce the full dynamics of an episode using the emulator. For illustrative purposes, in this example we fix the initial location of the car.
#
# We define the objective reward function in terms of the emulator.
# In[ ]:

### --- Optimize control parameters with emulator
car_initial_location = np.asarray([-0.58912799, 0])

### --- Reward objective function using the emulator
obj_func_emulator = lambda x: mc.run_emulation([position_model, velocity_model], x, car_initial_location)[0]
objective_emulator = GPyOpt.core.task.SingleObjective(obj_func_emulator)

# And as before, we use Bayesian optimization to find the best possible linear controller.

# In[ ]:

### --- Elements of the optimization that will use the emulator of the dynamics
model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True)

# The design space is the three continuous variables that make up the linear controller.

# In[ ]:

space= [{'name':'linear_1', 'type':'continuous', 'domain':(-1/1.2, +1)},
        {'name':'linear_2', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)},
        {'name':'constant', 'type':'continuous', 'domain':(-1, +1)}]

design_space = GPyOpt.Design_space(space=space)
aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space)

random_design = RandomDesign(design_space)
initial_design = random_design.get_samples(25)

# We set the acquisition function to be expected improvement using `GPyOpt`.

# In[ ]:

acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer)
evaluator = GPyOpt.core.evaluators.Sequential(acquisition)

# In[ ]:

bo_emulator = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective_emulator, acquisition, evaluator, initial_design)
bo_emulator.run_optimization(max_iter=50)

# In[ ]:

_, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo_emulator.x_opt), render=True)
anim=mc.animate_frames(frames, 'Best controller using the emulator of the dynamics')

# In[ ]:

from IPython.core.display import HTML

# In[ ]:

HTML(anim.to_jshtml())

# And the problem is again solved, but in this case we have replaced the simulator of the car dynamics by a Gaussian process emulator that we learned by calling the simulator only 500 times. Compared to the 37500 calls that we needed when applying Bayesian optimization directly on the simulator, this is a great gain.
#
# In some scenarios we have simulators of the same environment that have different fidelities, that is, they reflect the dynamics of the real world with different levels of accuracy. Running simulations at the different fidelities also has a different cost: high fidelity simulations are more expensive than cheaper, low fidelity ones. If we have access to these simulators we can combine high and low fidelity simulations under the same model.
#
# So let's assume that we have two simulators of the mountain car dynamics, one of high fidelity (the one we have used) and another one of low fidelity. The traditional approach to this form of multi-fidelity emulation is to assume that
#
# $$\mappingFunction_i\left(\inputVector\right) = \rho\mappingFunction_{i-1}\left(\inputVector\right) + \delta_i\left(\inputVector \right)$$
#
# where $\mappingFunction_{i-1}\left(\inputVector\right)$ is a low fidelity simulation of the problem of interest and $\mappingFunction_i\left(\inputVector\right)$ is a higher fidelity simulation. The function $\delta_i\left(\inputVector \right)$ represents the difference between the lower and higher fidelity simulation, which is considered additive.
# The additive form of this model means that if $\mappingFunction_{0}\left(\inputVector\right)$ and $\left\{\delta_i\left(\inputVector \right)\right\}_{i=1}^m$ are all Gaussian processes, then the process over all fidelities of simulation will be a joint Gaussian process.
#
# But with deep Gaussian processes we can consider the form
#
# $$\mappingFunction_i\left(\inputVector\right) = \mappingFunctionTwo_{i}\left(\mappingFunction_{i-1}\left(\inputVector\right)\right) + \delta_i\left(\inputVector \right),$$
#
# where the low fidelity representation is nonlinearly transformed by $\mappingFunctionTwo(\cdot)$ before use in the process. This is the approach taken in @Perdikaris:multifidelity17. But once we accept that these models can be composed, a highly flexible framework can emerge. A key point is that the data enters the model at different levels, and represents different aspects. For example, here the levels correspond to the two fidelities of the mountain car simulator.
#
# We start by sampling both of them at 250 random input locations.

# In[ ]:

import gym

# In[ ]:

env = gym.make('MountainCarContinuous-v0')

# In[ ]:

import GPyOpt

# In[ ]:

### --- Collect points from low and high fidelity simulator --- ###
space = GPyOpt.Design_space([
        {'name':'position', 'type':'continuous', 'domain':(-1.2, +1)},
        {'name':'velocity', 'type':'continuous', 'domain':(-0.07, +0.07)},
        {'name':'action', 'type':'continuous', 'domain':(-1, +1)}])

n_points = 250
random_design = GPyOpt.experiment_design.RandomDesign(space)
x_random = random_design.get_samples(n_points)

# Next, we evaluate the high and low fidelity simulators at those locations.

# In[ ]:

import numpy as np
import mountain_car as mc

# In[ ]:

d_position_hf = np.zeros((n_points, 1))
d_velocity_hf = np.zeros((n_points, 1))
d_position_lf = np.zeros((n_points, 1))
d_velocity_lf = np.zeros((n_points, 1))

# --- Collect high fidelity points
for i in range(0, n_points):
    d_position_hf[i], d_velocity_hf[i] = mc.simulation(x_random[i, :])

# --- Collect low fidelity points
for i in range(0, n_points):
    d_position_lf[i], d_velocity_lf[i] = mc.low_cost_simulation(x_random[i, :])

# It is time to build the multi-fidelity model for both the position and the velocity (a crude illustrative sketch of one way to combine the two fidelities is given at the end of this section).
#
# As we did in the previous section, we use the emulator to optimize the simulator. In this case we use the high fidelity output of the emulator.

# In[ ]:

### --- Optimize controller parameters
obj_func = lambda x: mc.run_simulation(env, x)[0]
obj_func_emulator = lambda x: mc.run_emulation([position_model, velocity_model], x, car_initial_location)[0]
objective_multifidelity = GPyOpt.core.task.SingleObjective(obj_func)

# And we optimize using Bayesian optimization.
# In[ ]: from GPyOpt.experiment_design.random_design import RandomDesign # In[ ]: model = GPyOpt.models.GPModel(optimize_restarts=5, verbose=False, exact_feval=True, ARD=True) space= [{'name':'linear_1', 'type':'continuous', 'domain':(-1/1.2, +1)}, {'name':'linear_2', 'type':'continuous', 'domain':(-1/0.07, +1/0.07)}, {'name':'constant', 'type':'continuous', 'domain':(-1, +1)}] design_space = GPyOpt.Design_space(space=space) aquisition_optimizer = GPyOpt.optimization.AcquisitionOptimizer(design_space) n_initial_points = 25 random_design = RandomDesign(design_space) initial_design = random_design.get_samples(n_initial_points) acquisition = GPyOpt.acquisitions.AcquisitionEI(model, design_space, optimizer=aquisition_optimizer) evaluator = GPyOpt.core.evaluators.Sequential(acquisition) # In[ ]: bo_multifidelity = GPyOpt.methods.ModularBayesianOptimization(model, design_space, objective_multifidelity, acquisition, evaluator, initial_design) bo_multifidelity.run_optimization(max_iter=50) # In[ ]: _, _, _, frames = mc.run_simulation(env, np.atleast_2d(bo_multifidelity.x_opt), render=True) anim=mc.animate_frames(frames, 'Best controller with multi-fidelity emulator') # In[ ]: from IPython.core.display import HTML # In[ ]: HTML(anim.to_jshtml()) # # And problem solved! We see how the problem is also solved with 250 # observations of the high fidelity simulator and 250 of the low fidelity # simulator. # # - *Multi-fidelity emulation*: build surrogate models when data is # obtained from multiple information sources that have different # fidelity and/or cost; # - *Bayesian optimisation*: optimise physical experiments and tune # parameters of machine learning algorithms; # - *Experimental design/Active learning*: design the most informative # experiments and perform active learning with machine learning # models; # - *Sensitivity analysis*: analyse the influence of inputs on the # outputs of a given system; # - *Bayesian quadrature*: efficiently compute the integrals of # functions that are expensive to evaluate. # #
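# Returning to the multi-fidelity construction described above, here is a crude two-stage sketch of the additive form for the position output only, using the samples `x_random`, `d_position_lf` and `d_position_hf` collected earlier. It fixes $\rho$ rather than inferring it, and fits the low fidelity GP and the discrepancy GP separately rather than jointly, so it illustrates the idea rather than providing a full multi-fidelity model (packages such as Emukit provide proper implementations).

# In[ ]:

import numpy as np
import GPy

rho = 1.0  # assumed scaling between fidelities; a full treatment would estimate this

# GP over the low fidelity response.
lf_model = GPy.models.GPRegression(x_random, d_position_lf, GPy.kern.RBF(3, ARD=True))
lf_model.optimize()

# GP over the discrepancy between the high fidelity data and the scaled low fidelity prediction.
discrepancy = d_position_hf - rho*lf_model.predict(x_random)[0]
delta_model = GPy.models.GPRegression(x_random, discrepancy, GPy.kern.RBF(3, ARD=True))
delta_model.optimize()

def predict_high_fidelity(x):
    """Approximate high fidelity prediction: rho * low fidelity mean + discrepancy mean."""
    return rho*lf_model.predict(x)[0] + delta_model.predict(x)[0]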
# ### MxFusion
#
# - Work by Eric Meissner and Zhenwen Dai.
# - Probabilistic programming.
# - Available on [Github](https://github.com/amzn/mxfusion)
#
# ### MxFusion
#
# ### Why another framework?
#
# ### Key Requirements
#
# Specialized inference methods + models, without requiring users to reimplement or understand them every time. Leverage expert knowledge. Efficient inference, flexible framework. Existing frameworks tended to do one or the other: flexible, or efficient.
#
# ### What does it look like?
#
# **Modelling**
#
# **Inference**

# In[ ]:

# Imports assumed from the MXFusion documentation (check against the installed version):
# from mxfusion import Model, Variable
# from mxfusion.components.variables import PositiveTransformation
# from mxfusion.components.distributions import Normal

m = Model()
m.mu = Variable()
m.s = Variable(transformation=PositiveTransformation())
m.Y = Normal.define_variable(mean=m.mu, variance=m.s)

# - Variable
# - Distribution
# - Function
#
# - `log_pdf`
# - `draw_samples`
#
# - Variational Inference
# - MCMC Sampling (*soon*)
#
# Built on MXNet Gluon (imperative code, not a static graph).

# In[ ]:

# Import assumed from the MXFusion documentation:
# from mxfusion.inference import GradBasedInference, MAP

infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.Y]))
infr.run(Y=data)

# - Model + Inference together form building blocks.
# - Just doing modular modeling with universal inference doesn't really scale; we need specialized inference methods for specialized modelling objects like non-parametrics.
#
# ### Long term Aim
#
# - Simulate/Emulate the components of the system.
# - Validate with the real world using multi-fidelity.
# - Interpret the system using e.g. sensitivity analysis.
# - Perform end-to-end learning to optimize.
# - Maintain interpretability.
#
# Stefanos Eleftheriadis, John Bronskill, Hugh Salimbeni, Rich Turner, Zhenwen Dai, Javier Gonzalez, Andreas Damianou, Mark Pullin, Eric Meissner.
#
# - twitter: @lawrennd
# - blog: [http://inverseprobability.com](http://inverseprobability.com/blog.html)
#
# ### References {#references .unnumbered}