Abstract: Machine learning solutions, in particular those based on deep learning methods, form an underpinning of the current revolution in “artificial intelligence” that has dominated popular press headlines and is having a significant influence on the wider tech agenda. In some ways these deep learning methods are radically new: they raise questions about how we think of model regularization, because regularization arises implicitly through the optimization. Yet in others they remain rigidly traditional, and unsuited for an emerging world of unstructured, streaming data. In this paper we relate these new methods to traditional approaches and speculate on new directions that might take us beyond modeling structured data.
Machine learning and Statistics are academic cousins, founded on the same mathematical principles, but often with different objectives in mind. But the differences can be as informative as the overlaps.
Efron (2020) rightly alludes to the fundamental differences between classical approaches and the new wave of predictive models that has arisen from machine learning over the last decades. These cultures were also beautifully described by Breiman (2001a).
In the discussion of Professor Efron’s paper, Friedman et al. (2020) highlight the continuum between the classical approaches and the emphasis on prediction. Indeed, the prediction culture does not sit entirely in the machine learning domain; an excellent example of a prediction-focused approach is Leo Breiman’s bagging of models (Breiman, 1996), although it is notable that Breiman, a statistician, chose to publish this paper in a machine learning journal.
From a personal perspective, a strand of work that is highly inspirational in prediction also comes from a statistician. The prequential formalism (Dawid, 1984, 1982) also emerges from statistics. It provides some hope that a predictive approach can be reconciled with attribution in the manner highlighted also by Friedman et al. (2020). The prequential approach is predictive but allows us to falsify poorly calibrated models (Lawrence, 2010). So while it doesn’t give us truth, it does give us falsehood, in line with Popper’s vision of the philosophy of science (Popper, 1963).
There is a quote that Mark Twain apocryphally credited to Disraeli.
There are three kinds of lies: lies, damned lies and statistics
It stems from the late 19th century, from a time after Laplace, Gauss, Legendre and Galton had made their forays into regression, but before Fisher, Pearson and others had begun to formalize the process of estimation. Today, the academic discipline of statistics is so widely understood to be underpinned by mathematical rigor that we typically drop the full title of mathematical statistics, but this history can be informative when looking at the new challenges we face.
The new phenomenon that underpins the challenges that Professor Efron outlines has been called “big data”: the vast quantities of data that are accumulating in the course of normal human activities. Where people have seen data, they have also seen the potential to draw inferences, but a lack of rigor in some of the approaches means that today we can see a new phenomenon.
There are three kinds of lies: lies, damned lies and big data
The challenge we face is to repeat the work of Fisher, Pearson, Gosset, Neyman and others, and give the new wave of approaches a rigorous mathematical foundation.
Following the revolution in mathematical statistics, data became a carefully curated commodity. It was actively collected in response to a scientific hypothesis. Popper suggests (Popper, 1963) that the answer to which comes first, the hypothesis or the data, is the same as the chicken and the egg. The answer is that they co-evolve.
In the last century, the expense of data collection dominated the cost of science. As a result, the classical approach to statistical testing is to formulate the scientific question first. The study is then launched once an appropriate null hypothesis has been described and requisite sample size established.
What Tukey described as confirmatory data analysis (Tukey, 1977) is a mainstay of statistics. While the philosophy of statistical hypothesis testing has been the subject of longstanding debates, there is no controversy around the notion that in order to remove confounders you must have a well-designed experiment, and randomization for statistical data collection is the foundation of confirmatory work. Randomized trials are deployed today more than ever before, in particular due to their widespread use in computer interface design. Without our knowledge, we are daily assigned to social experiments that place us in treatment and control groups to determine what dose of different interface ideas will keep us more tightly engaged with our machines. These A/B tests are social experiments that involve randomization across many millions of users and dictate our modern user experience (see e.g. Kohavi and Longbotham (2017)).
Such experiments are still carefully designed to remain valid, but the modern data environment is not only about larger experimental data, but perhaps more so about what I term “happenstance data”. Data that was not collected with a particular purpose in mind, but which is simply being recorded in the normal course of events due to increasing interconnection between portable digital devices and decreasing cost of storage.
Happenstance data are the breadcrumbs of our trail through the forest of life. They may have been recorded for one purpose, but later we wish to consume them for a different purpose. For example, within the recent Covid-19 pandemic, the Royal Society DELVE initiative (The DELVE Initiative, 2020) was able to draw on transaction data to give near-real-time assessments of the effect of the pandemic and governmental response on GDP[1] (see also Carvalho et al. (2020)). The data wasn’t recorded with pandemic responses in mind, but it can be used to help inform interventions. Other data sets of use include mobility data from mobile telecoms companies (see e.g. Oliver et al. (2020)).
Historically, data was expensive. It was carefully collected according to a design. Statistical surveys can still be expensive, but today there is a strong temptation to do them on the cheap, to use happenstance data to achieve what had been done in the past only through rigorous data-fieldwork, but care needs to be taken (Wang et al., 2015). As Professor Efron points out, early attempts to achieve this, such as the Google flu predictor, have been somewhat naive (Ginsberg et al., 2009; Halevy et al., 2009).[2] As these methodologies are gaining traction in the social sciences (Salganik, 2018) and the field of Computational Social Science (Alvarez, 2016) emerges, we can expect more innovation and more ideas that may help us bridge the fundamentally different characters of qualitative and quantitative research. For the moment, one particularly promising approach is to use measures derived from happenstance data (such as searches for flu) as proxy indicators for statistics that are rigorously surveilled. Within the Royal Society’s DELVE initiative, examples of this approach include the work of Peter Diggle to visualize the progression of the Covid-19 disease. Across the UK the “Zoe App” has been used for self-reporting of Covid-19 symptoms (Menni et al., 2020), and by interconnecting this data with Office for National Statistics surveys (Office for National Statistics, 2020), Peter has been able to calibrate the Zoe map of Covid-19 prevalence, allowing nowcasting of the disease that was validated by the production of ONS surveys. These enriched surveys can already be done without innovation to our underlying mathematical frameworks.
Classical statistical methodologies remain the gold-standard by which these new methodologies should be judged. The situation reminds me somewhat of the challenges Xerox faced with the advent of the computer revolution. With great prescience, Xerox realized that the advent of the computer meant that information was going to be shared more often via electronic screens. As a company whose main revenue stream was coming from photocopying documents, the notion of the paperless office represented something of a threat to Xerox. They responded by funding a research center, known as Xerox PARC. They developed many of the innovations that underpin the modern information revolution: the Xerox Alto (the first graphical user interface), the laser printer, ethernet. All of these inventions were commercial successes, but created a need for more paper, not less. The computers produced more information, and much of it was still shared on paper. Per capita paper consumption continued to rise in the US until it peaked at around the turn of the millennium (Andrés et al., 2014). A similar story will now apply with the advent of predictive models and data science. The increasing use of predictive methodologies does not obviate the need for confirmatory data analysis; it makes it more important than ever before.
Not only is there an ongoing role for the classical methodologies we have at our disposal, there is likely to be increasing demand for them in the near future. But what about new mathematical theories? How can we come to a formalism for the new approaches of mathematical data science, just as early 20th century statisticians were able to reformulate statistics on a rigorous mathematical footing?
[1] Challenges with the availability of payments data within the UK meant that the researchers were able to get a good assessment of the Spanish and French economies, but struggled to assess their main target, the United Kingdom.
[2] Although despite conventional wisdom it appears that election polls haven’t got worse over recent years, see Jennings and Wlezien (2018).
Machine learning practitioners focus on out-of-sample predictive capability as their main objective. This is the ability of a model to generalize its learnings.
Professor Efron’s paper does an excellent job of summarizing the range of predictive models that now lie at our disposal, but of particular interest are deep neural networks. This is because they go beyond the traditional notions of what generalization is, or rather what it has been, to practitioners on both the statistical and machine learning sides of the data sciences.
The new wave of predictive modelling techniques owes a great deal to the tradition of regression. But their success in generalizing to out-of-sample examples owes little to our traditional understanding of the theory of generalization. These models are highly overparameterized. As such, the traditional view would be that they should ‘overfit’ the data. But in reality, these very large models generalize well. Is it because they can’t overfit?
When it comes to the mismatch between our expectations about generalization and the reality of deep models, perhaps the paper that most clearly demonstrated something was amiss was Zhang et al. (2017), who trained a large neural network via stochastic gradient descent to label an image data set. Within Professor Efron’s categorization of regression models, such a model is a very complex regression model with a particular link function and highly structured adaptive basis functions, which are by tradition called neurons. Despite the structuring of these basis functions (known as convolutional layers), their adaptive nature means that the model contains many millions of parameters. Traditional approaches to generalization suggest that the model should overfit, and Zhang et al. (2017) proved that such models can do just that. The data they used to fit the model, the training set, was modified. They flipped labels randomly, removing any information in the data. After training, the resulting model was able to classify the training data with 100% accuracy. The experiment clearly demonstrates that all our gravest doubts about overparameterized models are true. If this model has the capacity to fit data which is obviously nonsense, then it is clearly not regularized. Our classical theories suggest that such models should not generalize well on previously unseen data, or test data, and yet the empirical experience is that they do generalize well. So, what’s going on?
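As a minimal illustration of this phenomenon, and emphatically not a reproduction of the original experiment (which used large convolutional networks on image benchmarks), one can randomly permute the labels of a small image data set and check that a heavily overparameterized network still memorizes them. Here is a sketch using scikit-learn’s multi-layer perceptron, with the architecture and data set chosen purely for convenience.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

# Small image data set (8x8 digit images).
X, y = load_digits(return_X_y=True)

# Destroy any relationship between the images and their labels.
rng = np.random.default_rng(0)
y_random = rng.permutation(y)

# A network that is heavily overparameterized relative to the data size,
# with explicit regularization switched off (alpha=0).
model = MLPClassifier(hidden_layer_sizes=(512, 512), alpha=0.0,
                      max_iter=5000, random_state=0)
model.fit(X, y_random)  # this may take a minute or two

# The model can memorize the nonsense labels, typically approaching
# 100% training accuracy, even though they carry no information.
print("training accuracy on random labels:", model.score(X, y_random))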
During a period of deploying machine learning models at Amazon, I was introduced to a set of leadership principles, fourteen different ideas to help individual Amazonians structure their thinking. One of them was called “Dive Deep”, and a key trigger for a “Dive Deep” was when anecdote and data are in conflict. If there were to be a set of Academic leadership principles, then clearly “Dive Deep” should be triggered when empirical evidence and theory are in conflict. The purpose of the principle within Amazon was to ensure people don’t depend overly on anecdotes or data when making their decisions, but to develop deeper understandings of their business. In academia, we are equally guilty of relying too much on empirical studies or theory without ensuring they are reconciled. The theoreticians’ disbelief of what the experimenter tells them is encapsulated in Kahneman’s idea of “theory induced blindness” (Kahneman, 2011). Fortunately, the evidence for good generalization in these mammoth models is now large enough that the theory-blinders are falling away, and a serious look is being taken at how and why these models can generalize well.
An in-depth technical understanding that applies to all these cases is not yet available. But some key ideas are. Firstly, if the neural network model is over-capacity, and can fit nonsense data in the manner demonstrated by Zhang et al. (2017), then that immediately implies that the good generalization is arising from how the model is fitted to the data. When the number of parameters is so large, the parameters are very badly determined. In machine learning, the concept of version space (Mitchell, 1977) is the subset of all the hypotheses that are consistent with the training examples. For a neural network, the version space is where the neural network parameters (or weights) give predictions for the training data with 100% accuracy. A traditional statistical perspective would eschew this regime, convinced that the implication is that overfitting must have occurred. But the empirical evidence from the deep learning community is that these regimes produce classification algorithms with excellent generalization properties. The resolution to this dilemma is where in the version space the algorithm comes to rest.
An excellent characterization of generalization is normally given by the bias-variance dilemma. The bias-variance decomposition for regression models separates the generalization error into two components (Geman, Bienenstock, and René Doursat, 1992).
The bias-variance decomposition considers the expected test error for different variations of the training data sampled from $\Pr(\mathbf{x}, y)$, $$ \mathbb{E}\left[ \left(y- f^*(\mathbf{x})\right)^2 \right]. $$ This can be decomposed into two parts, $$ \mathbb{E}\left[ \left(y- f^*(\mathbf{x})\right)^2 \right] = \text{bias}\left[f^*(\mathbf{x})\right]^2 + \text{variance}\left[f^*(\mathbf{x})\right] +\sigma^2, $$ where the bias is given by $$ \text{bias}\left[f^*(\mathbf{x})\right] = \mathbb{E}\left[f^*(\mathbf{x})\right] - f(\mathbf{x}) $$ and it summarizes error that arises from the model’s inability to represent the underlying complexity of the data. For example, if we were to model the marathon pace of the winning runner from the Olympics by computing the average pace across time, then that model would exhibit bias error because the reality of Olympic marathon pace is that it is changing (typically getting faster).
The variance term is given by $$ \text{variance}\left[f^*(\mathbf{x})\right] = \mathbb{E}\left[\left(f^*(\mathbf{x}) - \mathbb{E}\left[f^*(\mathbf{x})\right]\right)^2\right]. $$ The variance term is often described as arising from a model that is too complex, but we have to be careful with this idea. Is the model really too complex relative to the real world that generates the data? The real world is a complex place, and it is rare that we are constructing mathematical models that are more complex than the world around us. Rather, the ‘too complex’ refers to our ability to estimate the parameters of the model given the data we have. Slight variations in the training set cause changes in prediction.
Models that exhibit high variance are sometimes said to ‘overfit’ the data whereas models that exhibit high bias are sometimes described as ‘underfitting’ the data.
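A minimal numerical sketch of the decomposition may help fix ideas (the sinusoidal ‘true’ function, the polynomial fits and the noise level are all illustrative choices): repeatedly sample training sets, fit a model of fixed complexity, and estimate the bias and variance terms by averaging the fitted predictions over the repeats. The classification demonstration that follows makes the same point visually.

import numpy as np

def bias_variance_estimate(degree, n_train=20, n_repeats=500, noise=0.1):
    """Monte Carlo estimate of bias^2 and variance for polynomial regression."""
    rng = np.random.default_rng(0)
    f = np.sin                                 # assumed 'true' function
    x_test = np.linspace(0, 2*np.pi, 50)
    preds = np.zeros((n_repeats, x_test.size))
    for r in range(n_repeats):
        x = rng.uniform(0, 2*np.pi, n_train)   # a fresh training set each repeat
        y = f(x) + noise*rng.standard_normal(n_train)
        coeffs = np.polyfit(x, y, deg=degree)
        preds[r] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - f(x_test))**2)   # error of the average prediction
    variance = np.mean(preds.var(axis=0))         # spread of predictions across training sets
    return bias2, variance

for degree in [1, 3, 9]:
    print(degree, bias_variance_estimate(degree))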
Helper function for sampling data from two different classes.
import numpy as np
def create_data(per_cluster=30):
"""Create a randomly sampled data set
:param per_cluster: number of points in each cluster
"""
X = []
y = []
scale = 3
prec = 1/(scale*scale)
pos_mean = [[-1, 0],[0,0.5],[1,0]]
pos_cov = [[prec, 0.], [0., prec]]
neg_mean = [[0, -0.5],[0,-0.5],[0,-0.5]]
neg_cov = [[prec, 0.], [0., prec]]
for mean in pos_mean:
        X.append(np.random.multivariate_normal(mean=mean, cov=pos_cov, size=per_cluster))
        y.append(np.ones((per_cluster, 1)))
    for mean in neg_mean:
        X.append(np.random.multivariate_normal(mean=mean, cov=neg_cov, size=per_cluster))
        y.append(np.zeros((per_cluster, 1)))
return np.vstack(X), np.vstack(y).flatten()
Helper function for plotting the decision boundary of the SVM.
def plot_contours(ax, cl, xx, yy, **params):
"""Plot the decision boundaries for a classifier.
:param ax: matplotlib axes object
:param cl: a classifier
:param xx: meshgrid ndarray
:param yy: meshgrid ndarray
:param params: dictionary of params to pass to contourf, optional
"""
Z = cl.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundary and regions
out = ax.contour(xx, yy, Z,
levels=[-1., 0., 1],
colors='black',
linestyles=['dashed', 'solid', 'dashed'])
out = ax.contourf(xx, yy, Z,
levels=[Z.min(), 0, Z.max()],
colors=[[0.5, 1.0, 0.5], [1.0, 0.5, 0.5]])
return out
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/mlai.py','mlai.py')
import mlai
import os
def decision_boundary_plot(models, X, y, axs, filename, titles, xlim, ylim):
"""Plot a decision boundary on the given axes
:param axs: the axes to plot on.
:param models: the SVM models to plot
:param titles: the titles for each axis
:param X: input training data
:param y: target training data"""
for ax in axs.flatten():
ax.clear()
X0, X1 = X[:, 0], X[:, 1]
if xlim is None:
xlim = [X0.min()-1, X0.max()+1]
if ylim is None:
ylim = [X1.min()-1, X1.max()+1]
xx, yy = np.meshgrid(np.arange(xlim[0], xlim[1], 0.02),
np.arange(ylim[0], ylim[1], 0.02))
for cl, title, ax in zip(models, titles, axs.flatten()):
plot_contours(ax, cl, xx, yy,
cmap=plt.cm.coolwarm, alpha=0.8)
ax.plot(X0[y==1], X1[y==1], 'r.', markersize=10)
ax.plot(X0[y==0], X1[y==0], 'g.', markersize=10)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_xticks(())
ax.set_yticks(())
ax.set_title(title)
mlai.write_figure(os.path.join(filename),
figure=fig,
transparent=True)
return xlim, ylim
import matplotlib
font = {'family' : 'sans',
'weight' : 'bold',
'size' : 22}
matplotlib.rc('font', **font)
import matplotlib.pyplot as plt
from sklearn import svm
# Create an instance of SVM and fit the data.
C = 100.0 # SVM regularization parameter
gammas = [0.001, 0.01, 0.1, 1]
per_class=30
num_samps = 20
# Set up a 1x4 grid for plotting.
fig, ax = plt.subplots(1, 4, figsize=(10,3))
xlim=None
ylim=None
for samp in range(num_samps):
X, y=create_data(per_class)
models = []
titles = []
for gamma in gammas:
models.append(svm.SVC(kernel='rbf', gamma=gamma, C=C))
        titles.append(r'$\gamma={}$'.format(gamma))
models = (cl.fit(X, y) for cl in models)
xlim, ylim = decision_boundary_plot(models, X, y,
axs=ax,
filename='./ml/bias-variance{samp:0>3}.svg'.format(samp=samp),
titles=titles,
xlim=xlim,
ylim=ylim)
%pip install --upgrade git+https://github.com/sods/ods
import pods
from ipywidgets import IntSlider
pods.notebook.display_plots('bias-variance{samp:0>3}.svg',
directory='./ml',
samp=IntSlider(0,0,10,1))
Figure: In each figure the simpler model is on the left, and the more complex model is on the right. Each fit is done to a different version of the data set. The simpler model is more consistent in its errors (bias error), whereas the more complex model is varying in its errors (variance error).
One of Breiman’s ideas for improving predictive performance is known as bagging (Breiman, 1996). The idea is to train a number of models on the data such that they overfit (high variance). Then average the predictions of these models. The models are trained on different bootstrap samples (Efron, 1979) and their predictions are aggregated giving us the acronym, Bagging. By combining decision trees with bagging, we recover random forests (Breiman, 2001b).
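A minimal sketch of the idea, using scikit-learn decision trees as the deliberately high-variance base model (the base model and the number of bootstrap resamples are illustrative choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=100, seed=0):
    """Average the predictions of unpruned trees fitted to bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # bootstrap resample (Efron, 1979)
        tree = DecisionTreeRegressor()     # grown to full depth: low bias, high variance
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)          # aggregation averages away much of the variance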
Bias and variance can also be estimated through Professor Efron’s bootstrap (Efron, 1979), and the traditional view has been that there’s a form of Goldilocks effect, where the best predictions are given by the model that is ‘just right’ for the amount of data available. Not too simple, not too complex. The idea is that bias decreases with increasing model complexity and variance increases with increasing model complexity. Typically, plots begin with the Mummy bear on the left (too much bias), end with the Daddy bear on the right (too much variance) and show a dip in the middle where the Baby bear (just right) finds themselves.
The Daddy bear is typically positioned at the point where the model is able to exactly interpolate the data. For a generalized linear model (McCullagh and Nelder, 1989), this is the point at which the number of parameters is equal to the number of data[1]. But the modern empirical finding is that when we move beyond Daddy bear, into the dark forest of the massively overparameterized model we can achieve good generalization.
As Zhang et al. (2017) starkly illustrated with their random labels experiment, within the dark forest there are some terrible places, big bad wolves of overfitting that will gobble up your model. But as empirical evidence shows there is also a safe and hospitable Grandma’s house where these highly overparameterized models are safely consumed. Fundamentally, it must be about the route you take through the forest, and the precautions you take to ensure the wolf doesn’t see where you’re going and beat you to the door.
There are two implications of this empirical result. Firstly, that there is a great deal of new theory that needs to be developed. Secondly, that theory is now obliged to conflate two aspects to modelling that we generally like to keep separate: the model and the algorithm.
Classical statistical theory around predictive generalization focusses specifically on the class of models that is being used for data fitting. Historically, whether that theory follows a Fisher-aligned estimation approach (see e.g. Vapnik (1998)) or model-based Bayesian approach (see e.g. Ghahramani (2015)), neither is fully equipped to deal with these new circumstances because, to continue our rather tortured analogy, these theories provide us with a characterization of the destination of the algorithm, and seek to ensure that we reach that destination. Modern machine learning requires theories of the journey and what our route through the forest should be.
Crucially, the destination is always associated with 100% accuracy on the training set. An objective that is always achievable for the overparameterized model.
Intuitively, it seems that a highly overparameterized model places Grandma’s house on the edge of the dark forest, making it easily and quickly accessible to the algorithm. The larger the model, the more exposed Grandma’s house becomes. Perhaps this is due to some form of blessing of dimensionality that brings Grandma’s house closer to the edge of the forest in a high dimensional setting. Really, we should think of Grandma’s house as a low dimensional manifold of destinations that are safe. A path through the forest where the wolf of overfitting doesn’t venture. In the GLM case, we know already that when the number of parameters matches the number of data there is precisely one location in parameter space where accuracy on the training data is 100%. Our previous misunderstanding of generalization stemmed from the fact that (seemingly) it is highly unlikely that this single point is a good place to be from the perspective of generalization. Additionally, it is often difficult to find. Finding the precise polynomial coefficients in a least squares regression to exactly fit the basis to a small data set such as the Olympic marathon data requires careful consideration of the numerical properties and an orthogonalization of the design matrix (Lawson and Hanson, 1995).
It seems that with a highly overparameterized model, these locations become easier to find and they provide good generalization properties. In machine learning this is known as the “double descent phenomenon” (see e.g. Belkin et al. (2019)).
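The phenomenon can be sketched without any neural network at all. In the toy below (random ReLU features on synthetic data, all choices illustrative), the minimum-norm least squares solution, computed with the pseudoinverse, typically generalizes worst near the interpolation threshold, where the number of basis functions equals the number of data points, and improves again as the model becomes more heavily overparameterized.

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 200
x_train = rng.uniform(-1, 1, (n_train, 1))
x_test = rng.uniform(-1, 1, (n_test, 1))
f = lambda x: np.sin(4*x)                          # assumed 'true' function
y_train = f(x_train)[:, 0] + 0.1*rng.standard_normal(n_train)
y_test = f(x_test)[:, 0]

def random_relu_features(x, n_features, seed=1):
    """A fixed set of random ReLU basis functions."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((1, n_features))
    b = rng.standard_normal(n_features)
    return np.maximum(x @ W + b, 0.0)

for n_features in [10, 20, 40, 80, 320, 1280]:     # 40 is the interpolation threshold
    Phi_train = random_relu_features(x_train, n_features)
    Phi_test = random_relu_features(x_test, n_features)
    w = np.linalg.pinv(Phi_train) @ y_train         # minimum-norm least squares solution
    print(n_features, np.mean((Phi_test @ w - y_test)**2))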
As Professor Efron points out, modern machine learning models are often fitted using many millions of data points. The most extreme example of late is known as GPT-3. This neural network model, known as a Transformer, has in its largest form 175 billion parameters. The model was trained on a data set containing 499 billion tokens (about 2 terabytes of text). Estimates suggest that the model cost around $4.5 million to train (see e.g. Li (2020)).
[1] Assuming we are ignoring parameters in the link function and the distribution function.
The OpenAI model represents just a recent example from a wave of deep neural network models that has proved highly performant across a range of challenges that were previously seen as being beyond our statistical modelling capabilities.
They stem from the courage of a group of researchers who saw that methods were improving with increasing data and chose to collect and label data sets of ever increasing size, in particular the ImageNet team led by Fei-Fei Li (Russakovsky et al., 2015), who collected a large data set of images for object detection (currently 14 million images). To make these neural network methods work on such large data sets, new implementations were required. By deploying neural network training algorithms on graphics processing units (GPUs), breakthrough results were achieved on these large data sets (Krizhevsky et al., n.d.). Similar capabilities have since been shown in the domains of face identification (Taigman et al., 2014), speech recognition (Hinton et al., 2012), translation (Sutskever et al., 2014) and language modelling (Devlin et al., 2019; Radford et al., 2019).
Impressive though these performances are, they are reliant on massive data and enormous costs of training. Yet they can be seen through the lens of regression, as outlined by Professor Efron in his paper. They map from inputs to outputs. For language modelling, extensive use of auto-regression allows for sequences to be generated.
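To be concrete about what auto-regression means here: the joint distribution of a token sequence is factorized with the product rule, $$ p(y_1, \ldots, y_T) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}), $$ each conditional is treated as a regression from the preceding tokens to the next one, and generation proceeds by sampling one token at a time conditioned on the tokens produced so far.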
The challenges of big data emerged due to companies being able to rapidly interconnect multiple sources of information. This leads to challenges in storage, distributed processing and modeling. Google’s search engine was at the forefront of this revolution in data processing. Google researchers were among the first to notice that some tasks can be easily resolved with fairly simple models and very large data sets (Halevy et al., 2009). What we are now learning is that many tasks can be solved with complex models and even bigger data sets.
While GPT-3 does an impressive job on language generation, it can do so because of the vast quantities of language data we have made available to it. What happens if we take a more complex system, for which such vast data is not available, or at least not available in the homogeneous form in which language data can be found? Let’s take human health.
Consider taking a holistic view of health and the many ways in which we can become unhealthy, through genomic and environmental effects. Firstly, let’s remember that we don’t have a full understanding even of all the operations in a single eukaryotic cell. Indeed, we don’t even understand all the mechanisms by which transcription and translation occur in bacterial and archaeal cells. That is despite the wealth of gene expression and protein data about these cells. Even if we were to pull all this information together, would it be enough to develop that understanding?
There are large quantities of data, but given the complexity of the underlying systems in these domains, even the terabytes of data we have today may not be enough to determine the parameters of such complex models.
Machine learning involves taking data and combining it with a model in order to make a prediction. The data consist of measurements recorded about the world around us. A model consists of our assumptions about how the data is likely to interrelate; typical assumptions include smoothness. Our assumptions reflect some underlying belief about the regularities of the universe that we expect to hold across a range of data sets. $$ \text{data} + \text{model} \stackrel{\text{algorithm}}{\rightarrow} \text{prediction} $$ The data and the model are combined in computation through an algorithm. The etymology of the word data indicates that it is given (it comes from the Latin dare). In some cases, for example in an approach known as active learning, we have some choice in how the data is acquired. But our main control is over the model and the algorithm.
This is true for both statisticians and machine learning scientists. Although there is a difference in the core motivating philosophy. The mathematical statisticians were motivated by a desire to remove subjectivity from the analysis, reducing the problem to rigorous statistical proof. The statistician is nervous of the inductive biases humans exhibit when drawing conclusions from data. Machine learning scientists, on the other hand, sit closer to the artificial intelligence community. Traditionally, they are inspired by human inductive biases to develop algorithms that allow computers to emulate human performance on tasks. In the past I’ve summarized the situation as
Statisticians want to turn humans into computers, machine learners want to turn computers into humans. Neither is possible so we meet somewhere in the middle.
Across both statistics and machine learning, the traditional view was that the modeling assumptions are the key to making good predictions. Those assumptions might include smoothness assumptions, or linearity assumptions. In some domains we might also wish to incorporate some of our mechanistic understanding of the data (see e.g. Álvarez et al. (2013)). The paradigm of model-based machine learning (Winn et al., n.d.) builds on the idea that the aim of machine learning is to describe one’s views about the world as accurately as possible within a model. The domain expert becomes the model-designer. The process of algorithm design is then automated to as great an extent as possible. This idea originates in the ground-breaking work of the MRC Biostatistics Unit on BUGS that dates to 1997 (see e.g. Lunn et al. (2009)). It is no surprise that this notion has gained most traction in the Bayesian community, because the probabilistic philosophy promises the separation of modeling and inference. As long as the probabilistic model we build is complex enough to capture the true generating process, we can separate the process of model building and probabilistic inference. Inference becomes turning the handle on the machine. Unfortunately, the handle turning in Bayesian inference involves high dimensional integrals, and much of the work in this domain has focused on developing new methods of inference based around either sampling (see e.g. Carpenter et al. (2017)) or deterministic approximations (see e.g. Tran et al. (2016)).
There are two principal challenges for model-based machine learning. The first is the model design challenge, and the second is the algorithm design challenge. The basic philosophy of the model-based approach is to make it as easy as possible for experts to express their ideas in a modeling language (typically probabilistic) and then automate as much as possible the algorithmic process of fitting that model to data (typically probabilistic inference).
The challenge of combining that model with the data, the algorithm design challenge, is then the process of probabilistic inference.
The model is a mathematical abstraction of the regularities of the universe that we believe underlie the data as collected. If the model is well-chosen, we will be able to interpolate the data and predict likely values of future data points. If it is chosen badly our predictions will be overconfident and wrong.
Deep learning methods conflate two aspects that we used to try to keep distinct. The mathematical model encodes our assumptions about the data. The algorithm is a set of computational instructions that combine our modeling assumptions with data to make predictions.
Much of the technical focus in machine learning is on algorithms. In this document I want to retain a strong separation between the model and the algorithm. The model is a mathematical abstraction of the world that encapsulates our assumptions about the data. Normally it will depend on one or more parameters which are adaptable. The algorithm provides a procedure for adapting the model to different contexts, often through the provision of a set of data that is used for training the model.
Despite the different role of model and algorithm, the two concepts are often conflated. This sometimes leads to a confused discussion. I was recently asked “Is it correct to remove the mean from the data before you do principal component analysis?” This question is about an algorithmic procedure, but the correct answer depends on what modelling assumption you are seeking to make when you are constructing your principal component analysis. Principal component analysis was originally proposed as a model for data by Hotelling (1933). It is a latent variable model that was directly inspired by work in the social sciences on factor analysis. However, much of our discussion of PCA today focusses on PCA as an algorithm. The algorithm for fitting the PCA model is to seek the eigenvectors of the covariance matrix, and people often refer to this algorithm as principal component analysis. However, that algorithm also finds the linear directions of maximum variance in the data. Seeking directions of maximum variance in the data was not the objective of Hotelling, but it is related to a challenge posed by Pearson (1901) who sought a variant of regression that predicted symmetrically regardless of which variable was considered to be the covariate and which variable the response. Coincidentally the algorithm for this model is also the eigenvector decomposition of the covariance matrix. However, the underlying model is different. The difference becomes clear when you begin to seek non-linear variants of principal component analysis. Depending on your interpretation (finding directions of maximum variance in the data or a latent variable model) the corresponding algorithm differs. For the Pearson model a valid non-linearization is kernel PCA (Schölkopf et al., 1998), but for the Hotelling model this generalization doesn’t make sense. A valid generalization of the Hotelling model is provided by the Gaussian process latent variable model (Lawrence, 2005). This confusion is often unhelpful, so for the moment we will leave algorithmic considerations to one side and focus only on the model.
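To make the model/algorithm distinction concrete, here is a minimal sketch of the shared algorithm, the eigendecomposition of the (centred or uncentred) second moment matrix; whether the mean is removed is exactly the kind of modelling choice, rather than algorithmic detail, that the question above is really about.

import numpy as np

def pca_eig(Y, n_components=2, center=True):
    """Principal directions via eigendecomposition of the sample covariance.
    :param center: whether to remove the mean first -- a modelling choice,
        not an algorithmic one.
    """
    if center:
        Y = Y - Y.mean(axis=0)
    C = (Y.T @ Y) / Y.shape[0]             # sample covariance (second moment if uncentred)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh because C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
    W = eigvecs[:, order[:n_components]]
    return Y @ W, W                        # projected data and principal directions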
In the first half of the 20th Century, the Bayesian approach was termed inverse probability. Fisher disliked the subjectivist nature of the approach (Aldrich, 2008) and introduced the term Bayesian in 1950 to distinguish these probabilities from the likelihood function (Fisher, 1950). The word Bayesian has connotations of a strand of religious thinking or a cult.[1] Whether Fisher was fully aware of this when he coined the term we cannot know, but there is a faith-based tenet at the heart of Bayesian ideas.
Many critics of the Bayesian approach ask where the Bayesian prior comes from. But one might just as well ask, where does the likelihood come from? Or where does the link function come from? All of these are subjective modeling questions. The prior is not the problem. The challenge is providing objective guarantees for our subjective model choices. The classical Bayesian approach provides guarantees, but only for the $\mathcal{M}$-closed domain (Bernardo and Smith, 1994), where the true model is one of the models under consideration. This is the critical belief at the heart of the Church of Bayes: the doctrine of model correctness.
The beauty of the Bayesian approach is that you don’t have to have much imagination. You work as hard as possible with your models of probability distributions to represent the problem as best you understand it, then you work as hard as possible to approximate the posterior distributions of interest and estimate the marginal likelihoods and any corresponding Bayes’s factors. If we have faith in the doctrine of model correctness, then we can pursue our modeling aims without the shadows of doubt to disturb us.
Bayesian approaches have a lot in common with more traditional regularization techniques. Ridge regression imposes L2 shrinkage on the model’s weights and has an interpretation as the maximum a posteriori estimate of the Bayesian solution. For linear models the mathematics of the two approaches is strikingly similar, and they both assume stages of careful design of model/regularizer followed by either Bayesian inference or optimization. The situation with the new wave of overparameterized models is strikingly different.
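To make the correspondence concrete, a sketch for the linear model with design matrix $\mathbf{X}$, response vector $\mathbf{ y}$, Gaussian noise of variance $\sigma^2$ and an independent zero-mean Gaussian prior of variance $\alpha$ on the weights: the ridge estimate coincides with the maximum a posteriori estimate, $$ \hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\{ \left\|\mathbf{ y}- \mathbf{X}\mathbf{w}\right\|^2 + \lambda \left\|\mathbf{w}\right\|^2 \right\} = \arg\max_{\mathbf{w}} \; p(\mathbf{ y}|\mathbf{X}, \mathbf{w})\, p(\mathbf{w}), $$ with the shrinkage parameter determined by the ratio of the two variances, $\lambda = \sigma^2/\alpha$.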
[1] This was pointed out to me by Bernhard Schölkopf, who by recollection credited the observation to the philosopher David Corfield.
The new wave of overparameterized machine learning methods are not making this conceptual separation between the model and the algorithm. Deep learning approaches are simple algorithmically, but highly complex mathematically. By this I mean that it is relatively simple to write code that creates algorithms for optimizing the neural network models, particularly with the wide availability of automatic differentiation software. But analyzing what these algorithms are doing mathematically is much harder. Fortunately, insight can already be gained by considering overparameterized models in the linear paradigm.
These analyses consider ‘gradient flow’ algorithms. A gradient flow is an idealization of the optimization where the usual stochastic gradient descent optimization (Robbins and Monro, 1951) is replaced with differential equations representing the idealized trajectory of a batch gradient descent approach. Under these analyses, highly overparameterized linear models can be shown to converge to the minimum L2 norm solution (see e.g. Soudry et al. (2018)). It seems that stacking layers of networks also has interesting effects, because even when linear models are stacked, analysis of the learning algorithm indicates a tendency towards low rank linear solutions for the parameters (Arora, Cohen, Golowich, et al., 2019). These regularization approaches are known as implicit regularization, because the regularization is implicit in the optimizer rather than explicitly declared by the model-designer. These theoretical insights have also proven useful in practice. Arora, Cohen, Hu, et al. (2019) show state-of-the-art performance by exploiting implicit regularization in the domain of matrix factorization.
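A minimal sketch of the flavour of these results, in the simplest possible setting (an idealized toy, not the analyses cited above): for an underdetermined linear least squares problem, gradient descent initialized at zero converges to the minimum L2-norm interpolating solution, the one the pseudoinverse gives, even though no regularizer appears in the objective.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 200                              # far more parameters than data
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Gradient descent on the unregularized squared error, starting from zero.
w = np.zeros(p)
lr = 1e-3
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y          # explicit minimum-norm interpolant

print(np.max(np.abs(X @ w - y)))            # both solutions interpolate the data...
print(np.max(np.abs(w - w_min_norm)))       # ...and gradient descent has found the min-norm one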
Studying the implicit regularization of these modern overparameterized machine learning methods is likely to prove a fruitful avenue for theoretical investigation that will deliver deeper understanding of how we can design traditional statistical models. But the conflation of model and algorithm can make it hard to unpick what design choices are influencing which aspects of the model.
I’ve motivated much of this paper through the lens of happenstance data, but the training sets that are used with these deep learning models have striking similarities to classical data acquisition. The deep learning revolution is really a revolution of ambition in the scale of data which we are collecting. The data set which drove this paradigm shift in data collection was ImageNet (see e.g. Russakovsky et al. (2015)). It was only with the millions of labeled images available in the data that Fei-Fei Li’s team assembled that these highly overparameterized models were able to significantly differentiate themselves from traditional approaches (Krizhevsky et al., n.d.). Collection of such massive data is statistical design on a previously unprecedented scale requiring massive human effort for labeling. That doesn’t sit well with the notion of happenstance data, which by its nature is accumulating fast and is only lightly curated if at all. So, while acknowledging the importance and benefits of implicit regularization we will revert to the conceptual separation between model and algorithm as we consider what approaches might be useful for this new data landscape.
The modern Bayesian has a suite of different software available to make it easier for her to design models and perform inference in the resulting design. Given this extra time, it might depress her to reflect on the fact that the entire premise of the approach is mistaken unless you are in the $\mathcal{M}$-closed domain. So, when do we fall outside this important domain? According to Box (1976), all the time.
All models are wrong, but some are useful
Box’s important quote has become worn by overuse (like a favorite sweater). Indeed, I most often see it quoted at the beginning of a talk in a way that confuses correlation with causality. Presentations proceed in the following way. (1) Here is my model. (2) It is wrong. (3) Here is George Box’s quote. (4) My model is wrong, but it might be useful. Sometimes I feel at stage (4) a confusion about the arrow of causality occurs; it feels to me that people are almost saying “Because my model is wrong it might be useful.”
Perhaps we should focus more on another quote from the same paper[1]
the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.
Box (1976)
Let’s have a think about where the tigers might be in the domain of big data. To consider this, let’s first see what we can write down about our data that isn’t implicitly wrong. If we are interested in multivariate data, we could first write down our data as a design matrix $$ \text{data} = \mathbf{Y} \in \Re^{n\times p}. $$ Here we are assuming we have $n$ data points and $p$ features. As soon as we write down our data in this form it invites particular assumptions about the data that may have been valid in the 1930s, when there was more focus on survey data. Experimental designs stipulated a table of data with a specific set of hypotheses in mind. The data naturally sat in a matrix.
The first thing that I was taught about probabilistic modeling was i.i.d. noise. And as soon as we see a matrix of data, I believe it is second nature for most of us to start writing down factorization assumptions. In particular, an independence assumption across the $n$ data points, giving the likelihood function $$ p(\mathbf{Y}|\boldsymbol{ \theta}) = \prod_{i=1}^{n} p(\mathbf{ y}_{i, :}|\boldsymbol{ \theta}). $$ This factorization gives both practical and theoretical advantages. It allows us to show that by optimizing the likelihood function, we are minimizing a sample-based approximation to a Kullback-Leibler divergence between the true data generating density, $\Pr(\mathbf{ y})$, and our model, $p(\mathbf{ y}|\boldsymbol{ \theta})$ (see e.g. Wasserman (2003)).
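To sketch the reasoning behind that claim briefly: the average log-likelihood under the factorization is a sample-based (Monte Carlo) approximation to an expectation under the true density, and since the entropy of $\Pr(\mathbf{ y})$ does not depend on $\boldsymbol{ \theta}$, maximizing that expectation is the same as minimizing the Kullback-Leibler divergence, $$ \frac{1}{n}\sum_{i=1}^{n} \log p(\mathbf{ y}_{i, :}|\boldsymbol{ \theta}) \approx \mathbb{E}_{\Pr(\mathbf{ y})}\left[\log p(\mathbf{ y}|\boldsymbol{ \theta})\right] = -\text{KL}\left(\Pr(\mathbf{ y})\,\|\,p(\mathbf{ y}|\boldsymbol{ \theta})\right) - \text{H}\left[\Pr(\mathbf{ y})\right]. $$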
From a pragmatist’s perspective, the assumption allows us to make predictions about new data points given a parameter vector that is derived from the training data. If the uncertainty in the system is truly independent between different data observations, then the information about the data is entirely stored in our model parameters, $\boldsymbol{ \theta}$.
Much of classical statistics focusses on encouraging the data to conform to this independence assumption, for example through randomizing the design to distribute the influence of confounders and through selection of appropriate covariates, or on methodologies that analyze the model fit to verify the validity of these assumptions, for example residual analysis.
From a Bayesian perspective, this assumption is only an assumption about the nature of the residuals. It is not a model of the process of interest. The philosophical separation of aleatory uncertainty and epistemic uncertainty occurs here. Probability is being used only to deal with the aleatory uncertainty.
In the world of happenstance data, there is insufficient influence from the model-designer on the way that the data is collected to rely on randomization. We find ourselves needing to explicitly model relationships between confounders. This requires us to be more imaginative about our probabilistic models than pure independence assumptions.
An additional challenge arising from this independence assumption is the domain where the number of data features, $p$, is larger than the number of data points, $n \ll p$, the so-called ‘large $p$, small $n$’ domain. Classical methodologies fail in this domain because the parameters are poorly determined.
[1] This quote was first highlighted to me by Richard D. Wilkinson.
Despite the advantages of the classical statistical paradigm, I believe it has lured us into a trap. The trap is that $p$ exists. That there is a particular dimensionality associated with the features, or covariates, we associate with a typical response variable. As I think about happenstance data, I’ve increasingly come to believe that $p$ doesn’t really exist. It is a convenience for us. Or, at least if it does exist, it is not fixed in dimension, it varies, just like the number of data.
In programming language parlance, we treat $p$ as a ‘static variable’. One which stays the same size for the lifetime of the model. We are prepared to accept that $n$ will change, we will be expected to make out of sample predictions, but we don’t accept that we might need to make ‘out of response-variable’ predictions or ‘out-of-covariate’ predictions.
So, below, I intend to conflate $n$ and $p$. Rather than considering a design matrix, $\mathbf{Y}$, I’d like us to think about a vector, $\mathbf{ y}$, where each element of the vector arises from a particular data-point but it could contain one of a number of features from that data-point. The data moves from being a table, to a vector of data snippets. Each element of the vector potentially coming from a different subject and possibly encoding a different aspect of that subject.
Now that we’ve decided we’ll consider the data in a vector (if you do have a design matrix, just stack the columns of the design matrix one on top of another to form the vector), let’s also drop the independence assumption. Independence assumptions are very useful, and we will return to them later. But for the moment let’s assume it is not the first thing we should write down.
To motivate the movement from matrix to vector, we’ll consider the type of challenge that might arise in today’s era of happenstance data. Mathematical statisticians worked with tables of data that they’d carefully collected, often with the specific purpose of answering a particular question. They decided at experiment design time what was to be measured and how many data points, $n$, would be collected. We continue to do this today; we even use statistical power calculations to decide how many subjects we need in our data sample.
Our motivation will be personalized health: what we can learn about patients’ state of health and when should we suggest interventions?
In the big data era, we are not only interested in what data we might collect for answering a specific question (although this data remains important) but we are also interested in existing data that might be assimilated to improve our understanding of an individual’s health. When we imagine systems that can monitor our individual health status, we should not be restricted to the type of data that might be stored in a doctor’s office or a hospital database. We can even argue that hospital data focusses on sickness rather than health, giving us an incomplete picture.
In the modern world of happenstance data, we might like to build models that incorporate, for example, an individual’s exercise regime (for example through websites such as Strava). We could also include information about an individual’s dietary habits extracted from loyalty card information like the Nectar card. If we were monitoring potential degradation in health then we may also be interested in an individual’s social network activity (Twitter, Facebook, SnapChat). Even musical tastes may feed into our overall picture of the patient’s well-being through music services like Spotify.
For a full perspective on a patient’s health, this data needs to be combined with more traditional sources like phenotype and genotype information. For example, high resolution scans of the genome provide a detailed characterization of genotype. Large-scale gene-expression measurements give a high resolution understanding of phenotype at the cellular level. Imaging domains can contain X-ray scans, MRIs or pathologists’ samples. Doctors’ notes, both free-form notes and those that encode diagnosis through emerging EHR/EMR coding standards such as the WHO’s International Classification of Diseases (ICD-11, see World Health Organization (2020)). And also the results of clinical tests, such as cholesterol levels. To provide a full picture of health status all this information needs to be assimilated.
In a traditional model, we might encode each piece of information as another element on a feature vector: in other words, each data snippet mentioned contributes to increasing $p$. However, for most patients, most of the information above will be missing. We obtain only snippets of information about music tastes and social media habits alongside the occasional clinical measurement. Missing data is often discussed, but not at the scale we are considering here. In a classical analysis we might consider 30% missingness to be a big number. In this new scenario of data snippets almost all the data is missing almost all the time. A typical amount of missingness might be 99.9%. The data snippet domain is one which we might refer to as massively missing data. A situation where a missing value becomes the norm and a data observation is the exception.
Alongside the patchy nature of these data snippets, another challenge would be how they arrive. This happenstance data is constantly evolving. In computer systems terminology it is referred to as streaming data. The table form of the design matrix is a consequence of active surveillance of the population. When acquiring data passively it updates haphazardly.
Our model may need to update because patient 2,342 has just had the results of a blood test logged, or because patient 28,344,219 has just been for a run or because patient 12,012,345 just listened to a Leonard Cohen track or because patient 12,182 just gave birth.
For applications like the personalized health monitoring system described above, we need a model that will give well calibrated predictions from the first day of it being brought online, and throughout its operational life. If the model is complex enough to represent the full spectrum of possible human ailments, then when the model is first brought on stream, it is unlikely to have sufficient data to determine the parameters. In the standard modeling framework, we are faced with the bias-variance dilemma (Geman, Bienenstock, and Rene Doursat, 1992). If the model is complex enough to represent the underlying data structure, the parameters will be badly determined for small, or badly designed data sets, and the model will exhibit a large error due to variance. We are still learning how the deep learning frameworks provide a route out of this dilemma. A part of the story is their overparameterization. But what is the formalism by which we can incorporate more information about each individual patient within these highly parameterized models?
A major challenge in the domain we’ve described is to build a model that is complex enough to represent the diversity of human health outcomes. For streaming data this necessarily means that some (or most) of those parameters will be badly determined. This is reminiscent of the overparameterized deep learning models.
When parameters are badly determined, we expect high variance. So, small fluctuations in the data set should lead to larger fluctuations in prediction. One approach to this problem is to build models in which the parameters are well determined. For the independence across data points case, this involves having many observations (large $n$) relative to the number of parameters (which often scales with $p$). This motivates the issues of the large $p$ small $n$ domain, where the conditions are reversed. Of course, from a modelling perspective this issue is trivially solved by assuming independence across the $p$ data dimensions and allowing the parameters to scale with the number of data $n$. This is a characteristic exhibited, for example by the Gaussian process latent variable model (Lawrence, 2005) which in standard form assumes independence across $p$ for high dimensional data and associates each data point with a latent variable that is treated as a parameter. In (Lawrence, 2012) I argued that the successful class of spectral approaches to dimensionality reduction (e.g. LLE Roweis and Saul (2000) and maximum variance unfolding Weinberger et al. (n.d.), which are widely applied in the large $p$ small $n$ domain, also have a probabilistic interpretation where the underlying likelihood factorizes across data dimensions. For these models, dimensionality becomes a blessing rather than a curse. It’s a neat trick, but it still doesn’t solve the fundamental problem. What happens when $n\approx p$? Neither factorization leads to well determined parameters.
I’m going to argue that the separation of the data into features and data points is a distraction for happenstance data. It stems from mathematical statistics when the intention was to make a strong scientific claim based on numbers taken from a table of data. A table naturally lends itself towards a matrix form. In these data a statistical design normally involved measuring a fixed number of features for a perhaps variable number of items. The objective is to find a sufficient number of items so that you can make strong claims about which features are important. For example, does smoking correlate with lung cancer? This explains the desire to write down the data as a matrix $\mathbf{Y}$. I think this view of data, while historically important and still of great utility today, is restrictive when considering modern big data problems.
The modern data analysis challenge is very different. We receive streaming data of varying provenance. If each number we receive is given by an observation $y_i$, where $y_i$ could be in the natural numbers, the real numbers, binary or in any processable form, then $y_{17}$ might be the price of a return rail fare from Sheffield to Oxford on 6th February 2014, whilst $y_{29}$ might be the number of people on the 8:20 train that day, but $y_{72,394}$ could be the temperature of the Atlantic ocean on 23rd August 2056 at a point on the Arctic circle midway between Greenland and Norway. When we see data in this form, we realize that most of the time we are missing most of the data. This leads to the idea of massive missing data. Contrast this situation with that traditionally faced in missing data where a table of values, $\mathbf{Y}$, might have 10%-50% of the measurements missing, perhaps due to problems in data collection. I’d argue that if we are to model complex processes (such as the brain, or the cell, or human health) then almost all the data is missing.
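One way to make the shift from table to indexed snippets concrete (an illustrative data structure, not a proposal for a particular format): rather than an $n\times p$ array that is overwhelmingly filled with missing values, each observation is held as a (subject, feature, value) triple, and both the set of subjects and the set of features can grow without bound.

import numpy as np

# The same information as a tiny design matrix with overwhelming missingness...
Y = np.full((5, 4), np.nan)
Y[0, 2] = 5.4      # e.g. patient 0, a cholesterol measurement
Y[3, 1] = 72.0     # e.g. patient 3, a resting heart rate

# ...held instead as a list of indexed data snippets.
snippets = [(i, j, Y[i, j])
            for i in range(Y.shape[0])
            for j in range(Y.shape[1])
            if not np.isnan(Y[i, j])]
print(snippets)    # [(0, 2, 5.4), (3, 1, 72.0)]

# A new kind of measurement for a new subject is just another snippet;
# neither the number of subjects nor the number of features is fixed in advance.
snippets.append((1000000, 'minutes_of_exercise_today', 34.0))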
%pip install daft
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 1],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
pgm.add_node(daft.Node("y", r"$\mathbf{y}$", 0.5, 0.5, fixed=False, observed=True))
pgm.render().figure.savefig("./ml/y-only-graph.svg", transparent=True)
Figure: The most general graphical model. It makes no assumptions about conditional probability relationships between variables in the vector $\mathbf{ y}$. It represents the unconstrained probability distribution $p(\mathbf{ y})$.
Figure gives a graphical representation of a model that’s not wrong, just not useful. I like graphical representations of probabilistic models and this is my favorite graph. It is the simplest graph, but also the most general model. It says that all the data in our vector $\mathbf{ y}$ is governed by an unspecified probability distribution $p(\mathbf{ y})$.
Graphical models normally express the conditional independence relationships in the data; with this graph we are not a priori considering any such relationships. This is the most general model (it includes all factorized models as special cases). It is not wrong, but since it doesn’t suggest what the next steps are or give us any handles on the problem, it is also not useful.
This is the nature of modern streaming data, what has been called big data, although in the UK it looks like that term will gain a more diffuse meaning now that the government has associated a putative 189 billion pounds of funding with it. The characteristic of massive missing data is particularly obvious when we look at clinical domains. EMIS, a Yorkshire-based provider of software to General Practitioners, has 39 million patient records. When we consider clinical measurements, we need to build models that take into account not only all current clinical tests, but all tests that will be invented in the future. This leads to the idea of massive missing data. The classical statistical table of data is merely the special case where someone has filled in a block of information.
To deal with massively missing data we need to think about the Kolmogorov consistency of a process. Let me introduce Kolmogorov consistency by way of an example I heard from Tony O’Hagan, but one that he credits originally to Michael Goldstein. Imagine you are on jury duty. You are asked to adjudicate on the guilt or innocence of Lord Safebury, and you are going to base your judgement on a model that is weighing all the evidence. You are just about to pronounce your decision when a maid comes running in and shouts “He didn’t do it! He didn’t do it!”. The maid wasn’t on the witness list and isn’t accounted for in your model. How does this affect your inference? The pragmatist’s answer might be: “not at all, because the maid wasn’t in the model.” But in the interests of justice we might want to include this information in our inference process. If, as a result of the maid’s entry, we now think it is less likely that Lord Safebury committed the crime, then necessarily every time that the (unannounced) maid doesn’t enter the room we have to assume that it is more likely that Safebury committed the crime (to ensure that the conditional probability of guilt given the maid’s evidence normalizes). But we didn’t know about the maid, so how can we account for this? Further, how can we account for all possible other surprise evidence, from the unannounced butlers, gardeners, chauffeurs and footmen? Kolmogorov consistency says that the net effect of marginalizing over all these potential bits of new information is null. It is a particular property of the model. Making it (only slightly) more formal, we can consider Kolmogorov consistency as a marginalization property of the model. We take the $n$ dimensional vector, $\mathbf{ y}$, to be an (indexed) vector of all our instantiated observations of the world that we have at the current time. Then we take the $n^*$ dimensional vector, $\mathbf{ y}^*$, to be the observations of the world that we are yet to see. From the sum rule of probability we have $$ p(\mathbf{ y}|n^*) = \int p(\mathbf{ y}, \mathbf{ y}^*) \text{d}\mathbf{ y}^*, $$ where the dependence of the marginal distribution for $\mathbf{ y}$ on $n^*$ arises from the fact that we are forming an $n^*$ dimensional integral over $\mathbf{ y}^*$. If our distribution is Kolmogorov consistent, then we know that the distribution over $\mathbf{ y}$ is independent of the value of $n^*$; in other words, $p(\mathbf{ y}|n^*)=p(\mathbf{ y})$. Kolmogorov consistency says that the form of $p(\mathbf{ y})$ remains the same regardless of the number of observations of the world that are yet to come.
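As a minimal sketch of this marginalization property, the snippet below uses a made-up two-variable discrete distribution over guilt and whether the maid appears; the probabilities are purely illustrative and are not part of the argument above.

import numpy as np

# Illustrative joint distribution over guilt and whether the maid bursts in.
# Rows: guilty / not guilty; columns: maid appears / maid does not appear.
p_joint = np.array([[0.05, 0.55],
                    [0.15, 0.25]])

# The marginal over guilt is obtained by summing out the maid variable ...
p_guilt = p_joint.sum(axis=1)          # [0.60, 0.40]

# ... and it is the same whether or not we ever modelled the maid explicitly.
# Conditioning on her appearance shifts the probability of guilt down,
p_guilt_given_maid = p_joint[:, 0] / p_joint[:, 0].sum()
# so, to keep the marginal fixed, conditioning on her absence must shift it up.
p_guilt_given_no_maid = p_joint[:, 1] / p_joint[:, 1].sum()

print(p_guilt, p_guilt_given_maid, p_guilt_given_no_maid)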
We can achieve Kolmogorov consistency almost trivially in a parametric model if we assume that the data are independent given the parameters. The property of being closed under marginalization is then trivially satisfied through the independence, which allows us to marginalize all future data, leaving a joint distribution that isn’t dependent on $n^*$ because each future data point can be marginalized independently. But, as we’ve already argued, this involves an assumption that is often flawed in practice. It is unlikely that, in a complex model, we will be able to determine the parameter vector well enough, given limited data, for us to truly believe that all the information about the training data that is required for predicting the test data could be passed through a fixed length parameter vector. This is what the independence assumption implies. If we consider that the model will also be acquiring new data at run time, then there is the question of how to update the parameter vector in a consistent manner, accounting for new information, e.g. new clinical results in the case of personalized health.
Conversely, a general assumption about independence across features would lead to models which don’t exhibit Kolmogorov consistency. In these models the dimensionality of the test data, $\mathbf{ y}^*$, denoted by $n^*$, would have to be fixed and each $y^*_i$ would require marginalization. So, the nature of the test data would need to be known at model design time.
In practice Bayesian methods suggest placing a prior over $\boldsymbol{\theta}$ and using the posterior, $p(\boldsymbol{\theta}|\mathbf{ y})$, for making predictions. If we have a model that obeys Kolmogorov consistency and is sophisticated enough to represent the behavior of a very complex dataset, it may well require a large number of parameters, just like those deep learning models. One way of seeing the requirement for a large number of parameters is to look at how we are storing information from the training data to pass to the test data. The sum of all our knowledge about the training data is stored in the conditional distribution of the parameters given the training data, the Bayesian posterior distribution, $p(\boldsymbol{ \theta}| \mathbf{ y})$.
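As a small illustration of this point (not a model from this paper), the sketch below uses a conjugate Gaussian model for a single unknown mean: everything the training data can tell us about future data is compressed into the two numbers describing the posterior.

import numpy as np

# Sketch: all information the training data carries forward is compressed
# into the posterior over the parameter (here a single unknown mean).
np.random.seed(0)
y = np.random.normal(loc=1.5, scale=1.0, size=20)   # illustrative training data
noise_var, prior_var, prior_mean = 1.0, 10.0, 0.0

# Conjugate update: the posterior p(theta | y) is Gaussian.
post_var = 1.0 / (1.0 / prior_var + len(y) / noise_var)
post_mean = post_var * (prior_mean / prior_var + y.sum() / noise_var)

# Predictions for unseen data use only (post_mean, post_var): a fixed-bandwidth
# summary of the training data, however much data we have seen.
pred_mean, pred_var = post_mean, post_var + noise_var
print(post_mean, post_var, pred_mean, pred_var)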
A key design-time problem is the parametric bottleneck. If we choose the number of parameters at design time, but the system turns out to be more complicated than we expected, we need to design a new model to deal with this complexity. The communication between the training data and the test data is like an information channel. This training-to-test (TT) channel has a bandwidth that is restricted by our choice of the dimensionality of $\boldsymbol{\theta}$ at design time. This seems foolish. It is the bonds of this constraint that deep learning models are breaking by being so over-parameterized.
To deal with this challenge in a probabilistic model, we should allow for a communication channel that is very large, or potentially infinite. In mathematical terms this implies we should look at nonparametric approaches.
If highly overparameterized models are the solution for deep learning, then extending to nonparametrics could be the solution for retaining the necessary sense of indeterminacy that is required to cope with very complex systems when we have seen relatively little data.
We have argued that we want models that are unconstrained, at design time, by a fixed bandwidth for the communication between the training data, $\mathbf{ y}$, and the test data, $\mathbf{ y}^*$, and that the answer is to be nonparametric. By nonparametric we mean classes of models for which the conditional distribution, $p(\mathbf{ y}^*|\mathbf{ y})$, is not decomposable into the expectation of $p(\mathbf{ y}^*|\boldsymbol{ \theta})$ under the posterior distribution of the parameters, $p(\boldsymbol{ \theta}|\mathbf{ y})$, for any fixed length parameter vector $\boldsymbol{ \theta}$. We don’t want to impose such a strong constraint on our model at design time. Our model may be required to be operational for many years and the true complexity of the system being modeled may not even be well understood at design time. We must turn to paradigms that allow us to be adaptable at run time. Nonparametrics provides just such a paradigm, because the effective parameter vector increases in size as we observe more data. This seems ideal, but it also presents a problem.
Human beings, despite our large, interconnected brains, only have finite storage. It is estimated that we have between 100 and 1000 trillion synapses in our brains. Similarly for digital computers: even the GPT-3 model is restricted to 175 billion parameters. So, we need to assume that we can only store a finite number of things about the data $\mathbf{ y}$. This seems to push us back towards parametric models. Here, though, we choose to go a different way. We choose to introduce a set of auxiliary variables, $\mathbf{ u}$, which are $m$ in length. Rather than representing the nonparametric density directly, we choose to focus on storing information about $\mathbf{ u}$. By storing information about these variables, rather than storing all the data $\mathbf{ y}$, we hope to get around this problem. In order for us to be nonparametric about our predictions for $\mathbf{ y}^*$ we must condition on all the data, $\mathbf{ y}$. We can no longer store an intermediate distribution to represent our sum knowledge, $p(\boldsymbol{ \theta}|\mathbf{ y})$. Such an intermediate distribution is a finite dimensional object, and nonparametrics implies that we cannot store all the information in a finite dimensional distribution. This presents a problem for real systems in practice. We are now faced with a compromise: how can we have a distribution which is flexible enough to respond at run time to unforeseen complexity in the training data, yet simultaneously doesn’t require unbounded storage to retain all the information in the training data? We will now introduce a perspective on variational inference that will allow us to retain the advantages of both worlds. We will construct a parametric approximation to the true nonparametric conditional distribution. But, importantly, whilst this parametric approximation will suffer from the constraints on the bandwidth of the TT channel that drove us to nonparametric models in the first place, we will be able to change the number of parameters at run time, not simply at design time.
Being closed under marginalization is a remarkable property of our old friend, the multivariate Gaussian distribution (old friends often have remarkable properties that we take for granted; I think this is particularly true for the multivariate Gaussian). In particular, if we consider a joint distribution $p(\mathbf{ y}, \mathbf{ y}^*)$, then the covariance matrix of the marginal distribution for the subset of variables, $\mathbf{ y}$, is unaffected by the length of $\mathbf{ y}^*$. Taking this to its logical conclusion, if the length of the data, $\mathbf{ y}$, is $n=2$, then the covariance of $\mathbf{ y}$, as defined by $\mathbf{K}$, is only a $2\times 2$ matrix, and it can only depend on the indices of the two data points in $\mathbf{ y}$. Since this covariance matrix must remain the same for any two values regardless of the lengths of $\mathbf{ y}$ and $\mathbf{ y}^*$, the value of the elements of this covariance must depend only on the two indices associated with $\mathbf{ y}$.
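A small numpy check of this property follows, assuming (purely for illustration) an exponentiated quadratic covariance over made-up indices: the $2\times 2$ block for the observed data does not change when we extend the covariance to include many future indices.

import numpy as np

def eq_cov(t, tprime, lengthscale=1.0):
    # exponentiated quadratic covariance between two scalar indices
    return np.exp(-0.5 * (t - tprime) ** 2 / lengthscale ** 2)

def cov_matrix(times):
    return np.array([[eq_cov(t, tp) for tp in times] for t in times])

# Covariance over just the two observed indices ...
K_small = cov_matrix([0.0, 1.0])
# ... and over the same two indices plus many 'future' indices.
K_big = cov_matrix([0.0, 1.0] + list(np.linspace(2, 10, 50)))

# The 2x2 block for the observed data is identical in both cases: the marginal
# over y is unaffected by how many unseen variables we append.
print(np.allclose(K_small, K_big[:2, :2]))  # True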
Since the covariance matrix is specified pairwise, this implies that it must depend only on the indices of the two observations, $y_i$ and $y_j$, for which the covariance is being computed. In general, we can also think of the set of possible indices as being infinite: the index could be a spatial or temporal location, so that each $y_i$ is now indexed across the real line, and the dimensionality of $\mathbf{ y}^*$ becomes irrelevant. Prediction consists of conditioning the joint density on the data we have observed, $\mathbf{ y}$. So, for any new value of $\mathbf{ y}^*$, given its index, we compute $p(\mathbf{ y}^* | \mathbf{ y})$.
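The sketch below carries out that conditioning with numpy for a handful of invented indices and observations, again assuming an exponentiated quadratic covariance; it applies the standard Gaussian conditioning formula rather than any particular library.

import numpy as np

def eq_cov(t, tprime, lengthscale=1.0):
    return np.exp(-0.5 * (t - tprime) ** 2 / lengthscale ** 2)

def cross_cov(times_a, times_b):
    return np.array([[eq_cov(a, b) for b in times_b] for a in times_a])

# Illustrative indices and observations (made up for the sketch).
t_train = np.array([0.0, 1.0, 2.5])
y_train = np.array([0.3, -0.1, 0.8])
t_test = np.array([1.5, 3.0])

# Blocks of the joint covariance over (y, y*).
K = cross_cov(t_train, t_train) + 1e-6 * np.eye(len(t_train))  # jitter
K_star = cross_cov(t_test, t_train)
K_starstar = cross_cov(t_test, t_test)

# Gaussian conditioning: p(y* | y) has mean K_* K^{-1} y and
# covariance K_** - K_* K^{-1} K_*^T (zero prior mean assumed).
mean_star = K_star @ np.linalg.solve(K, y_train)
cov_star = K_starstar - K_star @ np.linalg.solve(K, K_star.T)
print(mean_star, np.diag(cov_star))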
We will start by introducing a set of variables, $\mathbf{ u}$, that are finite dimensional. These variables will eventually be used to communicate information between the training and test data, i.e. across the TT channel, $$ p(\mathbf{ y}^*|\mathbf{ y}) \approx \int p(\mathbf{ y}^*|\mathbf{ u}) q(\mathbf{ u}|\mathbf{ y}) \text{d}\mathbf{ u}, $$ where we have introduced a distribution over $\mathbf{ u}$, $q(\mathbf{ u}|\mathbf{ y})$, which is not necessarily the true posterior distribution; indeed, we typically derive it through a variational approximation (see e.g. Titsias (n.d.)).
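Read as a Monte Carlo approximation, the integral above averages the conditional prediction over draws from $q(\mathbf{ u}|\mathbf{ y})$. The sketch below illustrates this with one-dimensional stand-in distributions chosen purely for illustration; in a sparse Gaussian process the conditional mean would instead come from the cross covariance with the inducing variables.

import numpy as np

# Monte Carlo reading of the approximation above: draw inducing variables from
# q(u | y) and average the conditional predictions p(y* | u).
rng = np.random.default_rng(0)

q_mean, q_var = 0.4, 0.05          # assumed Gaussian q(u | y), with scalar u
samples_u = rng.normal(q_mean, np.sqrt(q_var), size=5000)

def predictive_mean_given_u(u):
    # stand-in for the mean of p(y* | u), chosen for illustration only
    return 0.8 * u

# The approximate predictive mean is the average over samples of u.
pred_mean = predictive_mean_given_u(samples_u).mean()
print(pred_mean)   # close to 0.8 * 0.4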
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 2],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
pgm.add_node(daft.Node("y", r"$\dataVector$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("u", r"$\inducingVector$", 0.5, 1.5, fixed=False))
pgm.add_edge("u", "y", directed=False)
pgm.render().figure.savefig("./ml/u-to-y.svg", transparent=True)
Figure: Augmenting the variable space with a set of latent inducing vectors. The graphical model represents the factorization of the distribution into the form $\int p(\mathbf{ y}|\mathbf{ u})p(\mathbf{ u})\text{d}\mathbf{ u}$.
In Figure we have augmented our simple graphical model with a vector $\mathbf{ u}$, which we refer to as the inducing variables. Note that the model is still totally general because $p(\mathbf{ y}, \mathbf{ u})$ is an augmented variable model and the original $p(\mathbf{ y})$ is easily recovered by simple marginalization of $\mathbf{ u}$. So we haven’t yet made any assumptions about our data; our model is still correct, but useless.
The model we’ve introduced now looks somewhat like the parametric model we argued against in the previous section, $$ p(\mathbf{ y}^* | \mathbf{ y})=\int p(\mathbf{ y}^*|\boldsymbol{ \theta})p(\boldsymbol{ \theta}|\mathbf{ y})\text{d}\boldsymbol{ \theta}. $$ What’s going on here? Is there going to be some kind of parametric/nonparametric three-card trick where, with sleight of hand, we are trying to introduce a parametric model? Well clearly not, because I’ve just given the game away. But I believe there are some important differences to the traditional approach for parameterizing a model. Philosophically, our variables $\mathbf{ u}$ are variables that augment the model. We have not yet made any assumptions by introducing them. Normally, parameterization of the model instantiates assumptions, but this is not happening here. In particular, note that we have not assumed that the training data factorize given the inducing variables. Secondly, we have not specified the dimensionality of $\mathbf{ u}$ (i.e. the size of the TT channel) at design time. We are going to allow it to change at run time. We will do this by ensuring that the inducing variables also obey Kolmogorov consistency. In particular, we build a joint density over the observations and the inducing variables and marginalize the pool we have not instantiated, $$ p(\mathbf{ y}, \mathbf{ u}|n^*, m^*) = \int p(\mathbf{ y}, \mathbf{ y}^*, \mathbf{ u}, \mathbf{ u}^*) \text{d}\mathbf{ y}^* \text{d}\mathbf{ u}^*, $$ where $\mathbf{ u}$ are the inducing variables we might choose to instantiate at any given time (of dimensionality $m$) and $\mathbf{ u}^*$ is the $m^*$ dimensional pool of future inducing variables we have not yet chosen to instantiate (where $m^*$ could be infinite). Our new Kolmogorov consistency condition requires that $$ p(\mathbf{ y}, \mathbf{ u}|m^*,n^*) = p(\mathbf{ y}, \mathbf{ u}). $$ The dimensionality of $\mathbf{ u}$ doesn’t need to be predetermined at design time because we allow for the presence of a (potentially infinite) number of inducing variables $\mathbf{ u}^*$ that we may wish to later instantiate to improve the quality of our model. In other words, it is very similar to the parametric approach, but now we have access to a future pool of additional parameters, $\mathbf{ u}^*$, that we can call upon to increase the bandwidth of the TT channel as appropriate. In parametric modelling, calling upon such parameters has a significant effect on the likelihood of the model, but here these variables are auxiliary variables that will not affect the likelihood of the model. They merely affect our ability to approximate the true bandwidth of the TT channel. The quality of this approximation can be varied at run time. It is not necessary to specify it at design time. This gives us the flexibility we need in terms of modeling, whilst keeping computational complexity and memory demands manageable and appropriate to the task at hand.
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 2],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
pgm.add_node(daft.Node("y", r"$\dataVector$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("u", r"$\inducingVector$", 0.5, 1.5, fixed=False))
pgm.add_node(daft.Node("ystar", r"$\dataVector^*$", 1.5, 0.5, fixed=False))
pgm.add_node(daft.Node("ustar", r"$\inducingVector^*$", 1.5, 1.5, fixed=False))
pgm.add_edge("u", "y", directed=False)
pgm.add_edge("ustar", "y", directed=False)
pgm.add_edge("u", "ustar", directed=False)
pgm.add_edge("ystar", "y", directed=False)
pgm.add_edge("ustar", "ystar", directed=False)
pgm.add_edge("u", "ystar", directed=False)
pgm.render().figure.savefig("./ml/u-to-y-ustar-to-y.svg", transparent=True)
Figure: We can also augment the graphical model with data that is only seen at ‘run time’, or ‘test data’. In this case we use the superscript $*$ to indicate the test data. This graph represents the interaction between data we’ve seen, $\mathbf{ y}$, and data we’ve yet to see, $\mathbf{ y}^*$, as well as the augmented variables $\mathbf{ u}$ and $\mathbf{ u}^*$, $p(\mathbf{ y}) = \int p(\mathbf{ y}, \mathbf{ y}^*, \mathbf{ u}, \mathbf{ u}^*) \text{d}\mathbf{ y}^* \text{d}\mathbf{ u}\text{d}\mathbf{ u}^*$. As the fully connected graph implies, we are making no assumptions about the data.
In Figure we add in the test data and the inducing variables we have not yet chosen to instantiate. Here we see that we still haven’t defined any structure in the graph, and therefore we have not yet made any assumptions about our data. Not shown in the graph is the additional assumption that whilst $\mathbf{ y}$ has $n$ dimensions and $\mathbf{ u}$ has $m$ dimensions, $\mathbf{ y}^*$ and $\mathbf{ u}^*$ are potentially infinite dimensional.
To focus our model further, we assume that we observe observations, $\mathbf{ y}$, that are derived from some underlying fundamental variables, $\mathbf{ f}$, through simple factorized likelihoods. The idea of the fundamental variables is that they are sufficient to describe the world around us, but we might not be able to observe them directly. In particular, we might observe relatively simple corruptions of the fundamental variables, such as independent addition of noise, or thresholding. We might observe something relative about two fundamental variables. For example, if we took $f_{12,345}$ to be the height of Tom Cruise and $f_{23,789}$ to be the height of Penelope Cruz, then we might take for an observation a binary value indicating the relative heights, so $y_{72,394} = f_{12,345} < f_{23,789}$. The fundamental variable is an artificial construct, but it can prove to be a useful one. In particular, we’d like to assume that the relationship between our observations, $\mathbf{ y}$, and the fundamental variables, $\mathbf{ f}$, might factorize in some way. In this framework we think of that relationship, given by $p(\mathbf{ y}|\mathbf{ f})$, as the likelihood. We can ensure that assuming the likelihood factorizes does not at all reduce the generality of our model, by forcing the distribution over the fundamentals, $p(\mathbf{ f})$, to also be Kolmogorov consistent. This ensures that in the case where the likelihood is fully factorized over $n$ the model is still general, because we can allow the factors of the likelihood to be Dirac delta functions implying that $y_i = f_i$. Since we haven’t yet specified any forms for the probability distributions this is allowed and therefore the formulation is still totally general. $$ p(\mathbf{ y}|n^*) = \int p(\mathbf{ y}|\mathbf{ f}) p(\mathbf{ f}, \mathbf{ f}^*)\text{d}\mathbf{ f}\text{d}\mathbf{ f}^* $$ and since we enforce Kolmogorov consistency, we have $$ p(\mathbf{ y}|n^*) = p(\mathbf{ y}). $$
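A small sketch of such corruptions, with invented values for a handful of fundamental variables, might look as follows; the indices and values carry no meaning beyond illustration.

import numpy as np

# Fundamental variables f and observations y derived from them through
# simple, factorized corruptions.
rng = np.random.default_rng(2)
f = rng.normal(size=5)                       # a few fundamental variables

y_noisy = f + rng.normal(scale=0.1, size=5)  # independent additive noise
y_thresh = (f > 0.0).astype(int)             # thresholding

# A relative observation about two fundamentals, e.g. which of two heights is larger.
y_compare = int(f[0] < f[1])
print(y_noisy, y_thresh, y_compare)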
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 3],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
pgm.add_node(daft.Node("y", r"$\dataVector$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("f", r"$\mappingFunctionVector$", 0.5, 1.5, fixed=False))
pgm.add_node(daft.Node("u", r"$\inducingVector$", 0.5, 2.5, fixed=False))
pgm.add_node(daft.Node("ustar", r"$\inducingVector^*$", 1.5, 2.5, fixed=False))
pgm.add_edge("u", "f", directed=False)
pgm.add_edge("f", "y")
pgm.add_edge("ustar", "f", directed=False)
pgm.add_edge("u", "ustar", directed=False)
pgm.render().figure.savefig("./ml/u-to-f-to-y-ustar-to-f.svg", transparent=True)
Figure: We introduce the fundamental variable $\mathbf{ f}$ which sits between $\mathbf{ u}$ and $\mathbf{ y}$.
Now we assume some form of factorization for our data observations, $\mathbf{ y}$, given the fundamental variables, $\mathbf{ f}$, so that we have $$ p(\mathbf{ y}|\mathbf{ f}) = \prod_{i} p(\mathbf{ y}^i| \mathbf{ f}^i), $$ so that we have subsets of the data, $\mathbf{ y}^i$, which are dependent on subsets of the fundamental variables, $\mathbf{ f}^i$. For simplicity of notation we will assume a factorization across the entire data set, so each observation, $y_i$, has a single underlying fundamental variable, $f_i$, although more complex factorizations are also possible and can be considered within the analysis. $$ p(\mathbf{ y}|\mathbf{ f}) = \prod_{i=1}^np(y_i|f_i) $$
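As a short illustration, assuming (purely for the example) that each factor is Gaussian with a common noise variance, the log likelihood under this factorization is just a sum of per-data-point terms.

import numpy as np
from scipy.stats import norm

# With a fully factorized likelihood, log p(y | f) = sum_i log p(y_i | f_i).
f = np.array([0.2, -1.0, 0.7])     # fundamental variables (invented)
y = np.array([0.25, -0.9, 0.4])    # observations (invented)
noise_std = 0.3                    # assumed common noise standard deviation

log_lik = norm.logpdf(y, loc=f, scale=noise_std).sum()
print(log_lik)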
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 3],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
reduce_alpha={"alpha": 0.3}
pgm.add_node(daft.Node("y", r"$\dataScalar_i$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("f", r"$\mappingFunction_i$", 0.5, 1.5, fixed=False))
pgm.add_node(daft.Node("u", r"$\inducingVector$", 0.5, 2.5, fixed=False))
pgm.add_node(daft.Node("ustar", r"$\inducingVector^*$", 1.5, 1.5, fixed=False, plot_params=reduce_alpha))
pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r"$i=1\dots N$", fontsize=18)
pgm.add_edge("u", "f", directed=False)
pgm.add_edge("f", "y")
pgm.add_edge("ustar", "f", directed=False, plot_params=reduce_alpha)
pgm.add_edge("u", "ustar", directed=False, plot_params=reduce_alpha)
pgm.render().figure.savefig("./ml/u-to-f_i-to-y_i-ustar-to-f.svg", transparent=True)
Figure: The relationship between $\mathbf{ f}$ and $\mathbf{ y}$ is assumed to be factorized, which we indicate here using plate notation, $p(\mathbf{ y}) = \int \prod_{i=1}^np(y_i|f_i) p(\mathbf{ f}| \mathbf{ u}, \mathbf{ u}^*) p(\mathbf{ u}, \mathbf{ u}^*)\text{d}\mathbf{ u}\text{d}\mathbf{ u}^*$.
We now decompose, without loss of generality, our joint distribution over inducing variables and fundamentals into the following parts $$ p(\mathbf{ u}, \mathbf{ f}) = p(\mathbf{ f}|\mathbf{ u})p(\mathbf{ u}), $$ where we assume that we have marginalized $\mathbf{ f}^*$ and $\mathbf{ u}^*$.
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 3],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
reduce_alpha={"alpha": 0.3}
pgm.add_node(daft.Node("y", r"$\dataScalar_i$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("f", r"$\mappingFunction_i$", 0.5, 1.5, fixed=False))
pgm.add_node(daft.Node("u", r"$\inducingVector$", 0.5, 2.5, fixed=False))
pgm.add_node(daft.Node("ustar", r"$\inducingVector^*$", 1.5, 1.5, fixed=False, plot_params=reduce_alpha))
pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r"$i=1\dots N$", fontsize=18)
pgm.add_edge("f", "y")
pgm.add_edge("u", "f")
pgm.add_edge("ustar", "f", plot_params=reduce_alpha)
pgm.render().figure.savefig("./ml/u-to-f_i-to-y_i.svg", transparent=True)
Figure: The model with future inducing points marginalized $p(\mathbf{ y}) = \int \prod_{i=1}^np(y_i|f_i) p(\mathbf{ f}| \mathbf{ u}) p(\mathbf{ u})\text{d}\mathbf{ u}$.
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 3],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
reduce_alpha={"alpha": 0.3}
pgm.add_node(daft.Node("y", r"$\dataScalar_i$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("f", r"$\mappingFunction_i$", 0.5, 1.5, fixed=False))
pgm.add_node(daft.Node("u", r"$\inducingVector$", 0.5, 2.5, fixed=True))
pgm.add_node(daft.Node("ustar", r"$\inducingVector^*$", 1.5, 1.5, fixed=True, plot_params=reduce_alpha))
pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r"$i=1\dots N$", fontsize=18)
pgm.add_edge("f", "y")
pgm.add_edge("u", "f")
pgm.add_edge("ustar", "f", plot_params=reduce_alpha)
pgm.render().figure.savefig("./ml/given-u-to-f_i-to-y_i.svg", transparent=True)
Figure: The model conditioned on the inducing variables $p(\mathbf{ y}|\mathbf{ u}, \mathbf{ u}^*) = \int\prod_{i=1}^np(y_i|f_i) p(\mathbf{ f}|\mathbf{ u}, \mathbf{ u}^*)\text{d}\mathbf{ f}$.
import daft
from matplotlib import rc
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)
pgm = daft.PGM(shape=[2, 3],
origin=[0, 0],
grid_unit=5,
node_unit=1.9,
observed_style='shaded',
line_width=3)
reduce_alpha={"alpha": 0.3}
pgm.add_node(daft.Node("y", r"$\dataScalar_i$", 0.5, 0.5, fixed=False, observed=True))
pgm.add_node(daft.Node("f", r"$\mappingFunction_i$", 0.5, 1.5, fixed=False))
pgm.add_node(daft.Node("theta", r"$\parameterVector$", 0.5, 2.5, fixed=True))
pgm.add_plate([0.125, 0.125, 0.75, 1.75], label=r"$i=1\dots N$", fontsize=18)
pgm.add_edge("f", "y")
pgm.add_edge("theta", "f")
pgm.render().figure.savefig("./ml/given-theta-to-f_i-to-y_i.svg", transparent=True)
Figure: The model as a classical parametric model with independence across data points indexed by $i$ that is conditional on parameters $\boldsymbol{ \theta}$, $p(\mathbf{ y}|\boldsymbol{ \theta}) = \int\prod_{i=1}^np(y_i|f_i) p(\mathbf{ f}|\boldsymbol{ \theta})\text{d}\mathbf{ f}$. The model is graphically the same as the nonparametric variant but here the dimension of $\boldsymbol{ \theta}$ has to be fixed for Kolmogorov consistency, in the inducing vector variant the dimension of $\mathbf{ u}$ can vary.
In Figure we visualise the graphical model of a classical parametric form. This model is very general; the deep neural network models for supervised learning tasks can be seen as variants of this model with very large dimensions for the parameter vector $\boldsymbol{ \theta}$.
So far, we haven’t made any assumptions about the data in our model, other than a factorization assumption between the fundamental variables and the observations, $\mathbf{ y}$. Even this assumption does not affect the generality of the model decomposition, because in the worst case the likelihood $p(\mathbf{ y}|\mathbf{ f})$ could be a Dirac $\delta$ function, implying $\mathbf{ y}=\mathbf{ f}$ and allowing us to include complex interrelations between the elements of $\mathbf{ y}$ directly in $p(\mathbf{ f})$. We have specified that $p(\mathbf{ f}, \mathbf{ u})$ should be Kolmogorov consistent with $\mathbf{ f}^*$ and $\mathbf{ u}^*$ marginalized, and we have argued that nonparametric models are important in practice to ensure that all the information in our training data can be passed to the test data.
For a model to be useful, we need to specify relationships between our data variables. Of course, this is the point at which a model also typically becomes wrong. But at least, if our model isn’t correct, it should be a useful abstraction of the system.
A flexible class of models that fulfils the constraints of being non-parametric and Kolmogorov consistent is Gaussian processes. A Gaussian process prior for our fundamental variables, $\mathbf{ f}$, assumes that they are jointly Gaussian distributed. Each data point, $f_i$, is jointly distributed with each other data point, $f_j$, as a multivariate Gaussian. The covariance of this Gaussian is a function of the indices of the two data, in this case $i$ and $j$. But these indices are not just restricted to discrete values. The index can be a continuous value such as time, $t$, or spatial location, $\mathbf{ x}$. The words index and indicate have a common etymology. This is appropriate because the index indicates the provenance of the data. In effect we have multivariate indices to account for the full provenance, so that our observations of the world are given as a function of, for example, the when, the where and the what. “When” is given by time, “where” is given by spatial location and “what” is given by a (potentially discrete) index indicating the further provenance of the data. To define a joint Gaussian density, we need to define the mean of the density and the covariance. Both this mean and the covariance also need to be indexed by the when, the where and the what.
To define our model, we need to describe the relationship between the fundamental variables, $\mathbf{ f}$, and the inducing variables, $\mathbf{ u}$. This needs to be done in such a way that the inducing variables are also Kolmogorov consistent. A straightforward way of achieving this is through a joint Gaussian process model over the inducing variables and the fundamental variables, so in other words we define a Gaussian process prior over $$ \begin{bmatrix} \mathbf{ f}\\ \mathbf{ u} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{m},\mathbf{K}\right), $$ where the covariance matrix has a block form, $$ \mathbf{K}= \begin{bmatrix} \mathbf{K}_{\mathbf{ f}\mathbf{ f}} & \mathbf{K}_{\mathbf{ f}\mathbf{ u}} \\ \mathbf{K}_{\mathbf{ u}\mathbf{ f}} & \mathbf{K}_{\mathbf{ u}\mathbf{ u}}\end{bmatrix}, $$ in which $\mathbf{K}_{\mathbf{ f}\mathbf{ f}}$ gives the covariance matrix of the fundamentals vector, $\mathbf{K}_{\mathbf{ u}\mathbf{ u}}$ gives the covariance matrix of the inducing variables and $\mathbf{K}_{\mathbf{ u}\mathbf{ f}} = \mathbf{K}_{\mathbf{ f}\mathbf{ u}}^\top$ gives the cross covariance between the inducing variables, $\mathbf{ u}$, and the fundamental variables, $\mathbf{ f}$.
The elements of $\mathbf{K}_{\mathbf{ f}\mathbf{ f}}$ will be computed through a covariance function (or kernel) given by $k_f(\mathbf{ x}, \mathbf{ x}^\prime)$ where $\mathbf{ x}$ is a vector representing the provenance of the data, which as we discussed earlier could involve a spatial location, a time, or something about the nature of the data. In a Gaussian process most of the modelling decisions take place in the construction of $k_f(\cdot)$.
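A minimal numpy sketch of this construction follows; the index locations for $\mathbf{ f}$ and $\mathbf{ u}$, the kernel parameters and the inducing values are all made up for illustration, and the covariance chosen is the exponentiated quadratic.

import numpy as np

def k_f(x, xprime, lengthscale=1.0, variance=1.0):
    # exponentiated quadratic covariance; the main modelling choices live here
    sqdist = np.sum((x[:, None, :] - xprime[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Provenance (index) vectors for the fundamentals and the inducing variables.
X = np.linspace(0, 10, 20)[:, None]     # indices of f
Z = np.array([[2.0], [5.0], [8.0]])     # indices of u (can be grown at run time)

Kff = k_f(X, X)
Kfu = k_f(X, Z)
Kuu = k_f(Z, Z) + 1e-6 * np.eye(len(Z))  # jitter for numerical stability

# Conditional of the fundamentals given the inducing variables (zero mean assumed):
# p(f | u) = N(Kfu Kuu^{-1} u, Kff - Kfu Kuu^{-1} Kuf).
u = np.array([0.5, -0.2, 0.1])
mean_f_given_u = Kfu @ np.linalg.solve(Kuu, u)
cov_f_given_u = Kff - Kfu @ np.linalg.solve(Kuu, Kfu.T)
print(mean_f_given_u.shape, cov_f_given_u.shape)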
The mean of the process is given by a vector $\mathbf{m}$ which is derived from a mean function $m(\mathbf{ x})$. There are many occasions when it is useful to include a mean function, but normally the mean function will have a parametric form, $m(\mathbf{ x};\boldsymbol{ \theta})$, and be subject (in itself) to the same constraints that a standard parametric model has. Indeed, if we choose to model a function as a parametric form plus Gaussian noise, we can recast such a model as a simple Gaussian process with a covariance function $k_f(\mathbf{ x}_i,\mathbf{ x}_j) = \sigma^2 \delta_{i, j}$, where $\delta_{i, j}$ is the Kronecker delta function, and a mean function that is given by the standard parametric form. In this case we see that the covariance function is mopping up the residuals that are not captured by the mean function. We might genuinely be interested in the form of a parametric mean function, as we often are in statistics, where the mean function may include a set of covariates and potential effects, often denoted by $$ m(\mathbf{ x}) = \boldsymbol{\beta}^\top \mathbf{ x}, $$ where the provenance of the data, $\mathbf{ x}$, is known as the covariates, and the variable associated with $\mathbf{ y}$ is typically known as the response variable. In this case the particular influence of each of the covariates is being encoded in the vector $\boldsymbol{\beta}$. To a statistician, the relative values of the elements of this vector are often important in making a judgement about the influence of the covariates. For example, in disease modelling the mean function might be used in a generalized linear model through a link function to represent a rate or risk of disease (e.g. Saul et al. (n.d.)). The covariates should co-vary (or move together) with the response variable. Appropriate covariates for malaria incidence rate might include known influencers of the disease: if we are dealing with malaria then we might expect disease rates to be influenced by altitude, average temperature, average rainfall, local distribution of prophylactic measures (such as nets), etc. The covariance of the Gaussian process then has the role of taking care of the residual variance in the data: the data that is not explained by the mean function, i.e. the variance that cannot be explained by the parametric model. In a disease mapping model, it makes sense to assume that these residuals may not be independent. An underestimate of disease at one spatial location may imply an underestimate of disease rates at a nearby location. The mismatch between the observed disease rate and that predicted by modeling the relationship with the covariates through the mean function is then captured by the covariance function.
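The recasting described above is easy to check numerically. The sketch below, with invented covariates and coefficients, confirms that a linear model with independent Gaussian noise assigns the same log likelihood as a Gaussian process with mean $\boldsymbol{\beta}^\top \mathbf{ x}$ and covariance $\sigma^2 \delta_{i,j}$.

import numpy as np
from scipy.stats import multivariate_normal, norm

# Invented covariates, coefficients and noise level for the check.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))              # covariates
beta = np.array([1.0, -0.5])
sigma2 = 0.25
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=10)

# (a) log likelihood of the parametric model with independent noise ...
loglik_param = norm.logpdf(y, loc=X @ beta, scale=np.sqrt(sigma2)).sum()

# (b) ... equals the log density under a GP with mean X beta and white covariance.
loglik_gp = multivariate_normal.logpdf(y, mean=X @ beta, cov=sigma2 * np.eye(10))
print(np.allclose(loglik_param, loglik_gp))   # True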
The machine learner’s focus on prediction means that within that community the mean function is more often removed, with all the predictive power being incorporated within the Gaussian process covariance.
import GPy
import pods
data = pods.datasets.mauna_loa()
kern = GPy.kern.Linear(1) + GPy.kern.RBF(1) + GPy.kern.Bias(1)
model = GPy.models.GPRegression(data['X'], data['Y'], kern)
#model.optimize()
# model.plot()  # plot the fit to the Mauna Loa CO2 data (requires GPy's plotting dependencies)
So we could interpret Gaussian process models as approaches to dealing with the residuals that are not captured by a parametric mean function.
In conclusion, for a nonparametric framework, our model for $\mathbf{ f}$ is predominantly in the covariance function $\mathbf{K}_{\mathbf{ f}\mathbf{ f}}$. This is our data model. We are assuming the inducing variables are drawn from a joint Gaussian process with $\mathbf{ f}$. The cross covariance between $\mathbf{ u}$ and $\mathbf{ f}$ is given by $\mathbf{K}_{\mathbf{ f}\mathbf{ u}}$. This gives the relationship between the function and the inducing variables. There are a range of ways in which the inducing variables can interrelate with the fundamental variables, $\mathbf{ f}$.
For this illustrative example, we’ll consider a simple regression problem.
Here we set up a simple one-dimensional regression problem. The input locations, $\mathbf{X}$, are in two separate clusters. The response variable, $\mathbf{ y}$, is sampled from a Gaussian process with an exponentiated quadratic covariance.
%pip install gpy
import numpy as np
import GPy
from scipy import optimize
np.random.seed(101)
N = 50
noise_var = 0.01
X = np.zeros((50, 1))
X[:25, :] = np.linspace(0,3,25)[:,np.newaxis] # First cluster of inputs/covariates
X[25:, :] = np.linspace(7,10,25)[:,np.newaxis] # Second cluster of inputs/covariates
xlim = (-2,12)
ylim = (-4, 0)
# Sample response variables from a Gaussian process with exponentiated quadratic covariance.
k = GPy.kern.RBF(1)
y = np.random.multivariate_normal(np.zeros(N),k.K(X)+np.eye(N)*np.sqrt(noise_var)).reshape(-1,1)
scale = np.sqrt(np.var(y))
offset = np.mean(y)
First, we perform a full Gaussian process regression on the data. We create a GP model, m_full, and fit it to the data, plotting the resulting fit.
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/gp_tutorial.py','gp_tutorial.py')
import matplotlib.pyplot as plt
from gp_tutorial import ax_default, meanplot, gpplot
def plot_model_output(model, output_dim=0, scale=1.0, offset=0.0, ax=None, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2):
    if ax is None:
        fig, ax = plt.subplots(figsize=plot.big_figsize)
    ax.plot(model.X.flatten(), model.Y[:, output_dim]*scale + offset, 'r.', markersize=10)
    ax.set_xlabel(xlabel, fontsize=fontsize)
    ax.set_ylabel(ylabel, fontsize=fontsize)
    xt = plot.pred_range(model.X, portion=portion)
    yt_mean, yt_var = model.predict(xt)
    yt_mean = yt_mean*scale + offset
    yt_var *= scale*scale
    yt_sd = np.sqrt(yt_var)
    if yt_sd.shape[1] > 1:
        yt_sd = yt_sd[:, output_dim]
    _ = gpplot(xt.flatten(),
               yt_mean[:, output_dim],
               yt_mean[:, output_dim] - 2*yt_sd.flatten(),
               yt_mean[:, output_dim] + 2*yt_sd.flatten(),
               ax=ax, fillcol='#040404', edgecol='#101010')
m_full = GPy.models.GPRegression(X,y)
m_full.optimize() # Optimize parameters of covariance function
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/teaching_plots.py','teaching_plots.py')
urllib.request.urlretrieve('https://raw.githubusercontent.com/lawrennd/talks/gh-pages/mlai.py','mlai.py')
import teaching_plots as plot
import mlai
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
# no inducing inputs to plot for the full GP
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='./gp/sparse-demo-full-gp.svg',
transparent=True, frameon=True)
Figure: A full Gaussian process fit to the simulated data set.
Now we set up the inducing variables, $\mathbf{ u}$. Each inducing variable has its own associated input index, $\mathbf{Z}$, which lives in the same space as $\mathbf{X}$. Here we are using the true covariance function parameters to generate the fit.
kern = GPy.kern.RBF(1)
Z = np.hstack(
(np.linspace(2.5,4.,3),
np.linspace(7,8.5,3)))[:,None]
m = GPy.models.SparseGPRegression(X,y,kernel=kern,Z=Z)
m.noise_var = noise_var
m.inducing_inputs.constrain_fixed()
#m.tie_params('.*variance')
#m.ensure_default_constraints()
print(m) # why is it not printing noise variance correctly?
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='./gp/sparse-demo-constrained-inducing-6-unlearned-gp.svg',
transparent=True, frameon=True)
Figure: Sparse Gaussian process with six inducing variables fixed at their initial locations and covariance parameters not yet optimized.
m.optimize()
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='./gp/sparse-demo-constrained-inducing-6-learned-gp.svg',
transparent=True, frameon=True)
Figure: Sparse Gaussian process with six constrained inducing variables and parameters learned.
print(m)
m.randomize()
m.inducing_inputs.unconstrain()
m.optimize()
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='./gp/sparse-demo-unconstrained-inducing-6-gp.svg',
transparent=True, frameon=True)
Figure: Sparse Gaussian process with six unconstrained inducing variables, initialized randomly and then optimized.
Now we will vary the number of inducing points used to form the approximation.
m.Z.values
m.num_inducing=8
m.randomize()
M = 8
m.set_Z(np.random.rand(M,1)*12)
m.optimize()
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, scale=scale, offset=offset, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
ax.plot(m.Z, np.ones(m.Z.shape)*ylim[0], 'k^', markersize=30)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
filename='./gp/sparse-demo-sparse-inducing-8-gp.svg',
transparent=True, frameon=True)
Figure: Sparse Gaussian process with eight inducing variables, initialized randomly and then optimized.
print(m.log_likelihood(), m_full.log_likelihood())
The use of inducing variables in Gaussian process models to make inference efficient is now commonplace. By exploiting the parametric form given in Figure, Hensman et al. (n.d.) were able to adapt the stochastic variational inference approach of Hoffman et al. (2012) to the nonparametric formalism. This promising direction may allow us to bridge from a rigorous probabilistic formalism for predictive modeling, as enabled by nonparametric methods, to the very rich modeling frameworks provided by deep learning. In particular, work on composition of Gaussian processes by Damianou and Lawrence (2013) has been extended to incorporate variational inference formalisms (see e.g. Hensman and Lawrence (2014); Dai et al. (n.d.); Salimbeni and Deisenroth (2017)). The scale at which these models can operate means that they are now being deployed in some of the domains where deep neural networks have traditionally dominated (Dutordoir et al., 2020).
These methods have not yet been fully verified on the domain which has motivated much of the thinking in this paper, that of happenstance data. But the hope is that the rigorous probabilistic underpinnings, combined with the flexibility of these methods, will allow these challenges to be tackled.
Modern machine learning methods for prediction are based on highly overparameterized models that have performed well empirically in tasks that were previously considered challenging or impossible, such as machine translation, object detection in images and natural language generation. These models raise new questions for our thinking about how models generalize their predictions. In particular, they conflate the conceptual separation between model and algorithm, and our best understanding is that they regularize themselves implicitly through their optimization algorithms.
Despite the range of questions these models raise for our classical view of generalization, in another sense these models are very traditional. They operate on carefully curated tables of data. These deep learning models operate on (very large) design matrices.
We’ve argued that the new frontiers for the data sciences lie in the domain of what we term “happenstance data”. The data that hasn’t been explicitly collected with a purpose in mind, but is laid down through the rhythms of our modern lives. We’ve claimed that the traditional view of data as sitting in a table is restrictive for this new domain, and outlined how we might model such data through nonparametrics.
Finally, we highlighted work where these ideas are beginning to be formulated and flexible non-parametric probabilistic models are being deployed on large scale data. The next horizon for these models is to move beyond traditional data formats, in particular tabular data, on to the domain of massively missing data, where mere snippets of data are available, but the interactions between those snippets are of sufficient complexity to require the rich modeling formalisms inspired by the modern range of deep learning methodologies.
I’ve benefited over the years from conversations with a number of colleagues; among those I can identify as having influenced the thinking in this paper are Tony O’Hagan, John Kent, David J. C. MacKay, Richard Wilkinson, Darren Wilkinson, Bernhard Schölkopf and Zoubin Ghahramani. Naturally, the responsibility for the sensible bits is theirs; the errors are all mine.
Aldrich, J., 2008. R. A. Fisher on Bayes and Bayes’ theorem. Bayesian Anal. 3, 161–170. https://doi.org/10.1214/08-BA306
Alvarez, R.M. (Ed.), 2016. Computational social science. Cambridge University Press.
Andrés, L., Zentner, A., Zentner, J., 2014. Measuring the effect of internet adoption on paper consumption. The World Bank.
Arora, S., Cohen, N., Golowich, N., Hu, W., 2019. A convergence analysis of gradient descent for deep linear neural networks, in: International Conference on Learning Representations.
Arora, S., Cohen, N., Hu, W., Luo, Y., 2019. Implicit regularization in deep matrix factorization, in: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32. Curran Associates, Inc., pp. 7413–7424.
Álvarez, M.A., Luengo, D., Lawrence, N.D., 2013. Linear latent force models using Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 2693–2705. https://doi.org/10.1109/TPAMI.2013.86
Belkin, M., Hsu, D., Ma, S., Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 116, 15849–15854.
Bernardo, J.M., Smith, A.F.M., 1994. Bayesian theory. wiley.
Box, G.E.P., 1976. Science and statistics. Journal of the American Statistical Association 71.
Breiman, L., 2001a. Statistical modeling: The two cultures. Statistical Science 16, 199–231.
Breiman, L., 2001b. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123–140. https://doi.org/10.1007/BF00058655
Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., Riddell, A., 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76. https://doi.org/10.18637/jss.v076.i01
Carvalho, V.M., Hansen, S., Ortiz, Á., García, J.R., Rodrigo, T., Mora, S.R., Ruiz, J., 2020. Tracking the covid-19 crisis with high-resolution transaction data (No. DP14642). Center for Economic Policy Research.
Dai, Z., Damianou, A., Gonzalez, J., Lawrence, N.D., n.d. Variationally auto-encoded deep Gaussian processes, in:.
Damianou, A., Lawrence, N.D., 2013. Deep Gaussian processes, in:. pp. 207–215.
Dawid, A.P., 1984. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society, A 147, 278–292.
Dawid, A.P., 1982. The well-calibrated Bayesian. Journal of the American Statistical Association 77, 605–613.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
Dutordoir, V., Wilk, M. van der, Artemev, A., Hensman, J., 2020. Bayesian image classification with deep convolutional gaussian processes, in: Chiappa, S., Calandra, R. (Eds.), Proceedings of Machine Learning Research. PMLR, Online, pp. 1529–1539.
Efron, B., 2020. Prediction, estimation, and attribution. Journal of the American Statistical Association 115, 636–655. https://doi.org/10.1080/01621459.2020.1762613
Efron, B., 1979. Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1–26.
Fisher, R.A., 1950. Contributions to mathematical statistics. Chapman; Hall.
Friedman, J., Hastie, T., Tibshirani, R., 2020. Discussion of “Prediction, estimation, and attribution” by Bradley Efron. Journal of the American Statistical Association 115, 665–666. https://doi.org/10.1080/01621459.2020.1762617
Geman, S., Bienenstock, E., Doursat, R., 1992. Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58. https://doi.org/10.1162/neco.1992.4.1.1
Ghahramani, Z., 2015. Probabilistic machine learning and artificial intelligence. Nature 452–459.
Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., Brilliant, L., 2009. Detecting influenza epidemics using search engine query data. Nature 1012–1014.
Halevy, A.Y., Norvig, P., Pereira, F., 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems 24, 8–12. https://doi.org/10.1109/MIS.2009.36
Hensman, J., Fusi, N., Lawrence, N.D., n.d. Gaussian processes for big data, in:.
Hensman, J., Lawrence, N.D., 2014. Nested variational compression in deep Gaussian processes. University of Sheffield.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B., 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 82–97. https://doi.org/10.1109/MSP.2012.2205597
Hoffman, M., Blei, D.M., Wang, C., Paisley, J., 2012. Stochastic variational inference, arXiv preprint arXiv:1206.7051.
Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24, 417–441.
Jennings, W., Wlezien, C., 2018. Election polling errors across time and space. Nature Human Behaviour 2, 276–283. https://doi.org/10.1038/s41562-018-0315-6
Kahneman, D., 2011. Thinking fast and slow.
Kohavi, R., Longbotham, R., 2017. Online controlled experiments and a/b testing, in: Sammut, C., Webb, G.I. (Eds.), Encyclopedia of Machine Learning and Data Mining. Springer US, Boston, MA, pp. 922–929. https://doi.org/10.1007/978-1-4899-7687-1_891
Krizhevsky, A., Sutskever, I., Hinton, G.E., n.d. ImageNet classification with deep convolutional neural networks, in:. pp. 1097–1105.
Lawrence, N.D., 2012. A unifying probabilistic perspective for spectral dimensionality reduction: Insights and new models. Journal of Machine Learning Research 13.
Lawrence, N.D., 2010. Introduction to learning and inference in computational systems biology, in:.
Lawrence, N.D., 2005. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research 6, 1783–1816.
Lawson, C.L., Hanson, R.J., 1995. Solving least squares problems. SIAM. https://doi.org/10.1137/1.9781611971217
Li, C., 2020. OpenAI’s gpt-3 language model: A technical overview.
McCullagh, P., Nelder, J.A., 1989. Generalized linear models, 2nd ed. Chapman; Hall.
Menni, C., Valdes, A.M., Freidin, M.B., Sudre, C.H., Nguyen, L.H., Drew, D.A., Ganesh, S., Varsavsky, T., Cardoso, M.J., Moustafa, J.S.E.-S., Visconti, A., Hysi, P., Bowyer, R.C.E., Mangino, M., Falchi, M., Wolf, J., Ourselin, S., Chan, A.T., Steves, C.J., Spector, T.D., 2020. Real-time tracking of self-reported symptoms to predict potential covid-19. Nature Medicine 1037–1040.
Mitchell, T.M., 1977. Version spaces: A candidate elimination approach to rule-learning, in: Proceedings of the Fifth International Joint Conference on Artificial Intelligence. pp. 305–310.
Office for National Statistics, 2020. Coronavirus (covid-19) infection survey pilot: England and wales, 14 august 2020.
Oliver, N., Lepri, B., Sterly, H., Lambiotte, R., Deletaille, S., De Nadai, M., Letouzé, E., Salah, A.A., Benjamins, R., Cattuto, C., Colizza, V., Cordes, N. de, Fraiberger, S.P., Koebe, T., Lehmann, S., Murillo, J., Pentland, A., Pham, P.N., Pivetta, F., Saramäki, J., Scarpino, S.V., Tizzoni, M., Verhulst, S., Vinck, P., 2020. Mobile phone data for informing public health actions across the covid-19 pandemic life cycle. Science Advances 6. https://doi.org/10.1126/sciadv.abc0764
Pearson, K., 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, Sixth Series 2, 559–572.
Popper, K.R., 1963. Conjectures and refutations: The growth of scientific knowledge. Routledge, London.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language models are unsupervised multitask learners, in:.
Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326. https://doi.org/10.1126/science.290.5500.2323
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y
Salganik, M.J., 2018. Bit by bit: Social research in the digital age. Princeton University Press.
Salimbeni, H., Deisenroth, M., 2017. Doubly stochastic variational inference for deep Gaussian processes, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 4591–4602.
Saul, A.D., Hensman, J., Vehtari, A., Lawrence, N.D., n.d. Chained Gaussian processes, in:. pp. 1431–1440.
Schölkopf, B., Smola, A., Müller, K.-R., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319. https://doi.org/10.1162/089976698300017467
Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N., 2018. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19, 1–57.
Sutskever, I., Vinyals, O., Le, Q.V., 2014. Sequence to sequence learning with neural networks, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 3104–3112.
Taigman, Y., Yang, M., Ranzato, M., Wolf, L., 2014. DeepFace: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2014.220
The DELVE Initiative, 2020. Economic aspects of the covid-19 crisis in the uk (No. 5). DELVE.
Titsias, M.K., n.d. Variational learning of inducing variables in sparse Gaussian processes, in:. pp. 567–574.
Tran, D., Kucukelbir, A., Dieng, A.B., Rudolph, M., Liang, D., Blei, D.M., 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
Tukey, J.W., 1977. Exploratory data analysis. Addison-Wesley.
Vapnik, V.N., 1998. Statistical learning theory. wiley, New York.
Wang, W., Rothschild, D., Goel, S., Gelman, A., 2015. Forecasting elections with non-representative polls. International Journal of Forecasting.
Wasserman, L.A., 2003. All of statistics. springer, New York.
Weinberger, K.Q., Sha, F., Saul, L.K., n.d. Learning a kernel matrix for nonlinear dimensionality reduction, in:. pp. 839–846.
Winn, J., Bishop, C.M., Diethe, T., Guiver, J., Zaykov, Y., 2019. Model-based machine learning. http://www.mbmlbook.com/
World Health Organization, 2020. International statistical classification of diseases and related health problems (11th edition).
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2017. Understanding deep learning requires rethinking generalization, in: International Conference on Learning Representations. https://openreview.net/forum?id=Sy8gdB9xx