In this notebook we use the Python library GPy for Gaussian processes and apply it to a regression problem on the Airfoil Self-Noise dataset from the UCI Machine Learning Repository.
import pandas as pd
import numpy as np
import GPy
from matplotlib import pyplot as plt
%matplotlib inline
# Download the dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat
# View the first rows
!head airfoil_self_noise.dat
As explained on the repository web page, the input columns are:

1. Frequency, in Hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.

The output column is:

6. Scaled sound pressure level, in decibels.
# Load the data into a pandas dataframe
data = pd.read_csv('airfoil_self_noise.dat',
                   sep='\t',
                   names=['frequency',
                          'angle_of_attack',
                          'chord_length',
                          'free-stream_velocity',
                          'suction_side_displacement_thickness',
                          'sound_pressure'])
data.head()
| | frequency | angle_of_attack | chord_length | free-stream_velocity | suction_side_displacement_thickness | sound_pressure |
|---|---|---|---|---|---|---|
| 0 | 800 | 0.0 | 0.3048 | 71.3 | 0.002663 | 126.201 |
| 1 | 1000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.201 |
| 2 | 1250 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.951 |
| 3 | 1600 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.591 |
| 4 | 2000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.461 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503 entries, 0 to 1502
Data columns (total 6 columns):
frequency                               1503 non-null int64
angle_of_attack                         1503 non-null float64
chord_length                            1503 non-null float64
free-stream_velocity                    1503 non-null float64
suction_side_displacement_thickness    1503 non-null float64
sound_pressure                          1503 non-null float64
dtypes: float64(5), int64(1)
memory usage: 70.5 KB
By a Gaussian process we mean a collection of random variables $f(\mathbf{x})$, indexed by $\mathbf{x} \in \mathbb{R}^d$, such that any finite number of them has a jointly Gaussian distribution. Here we assume the process has zero mean, so to specify it, it suffices to define its covariance function $$ k(\mathbf{x}_p, \mathbf{x}_q) = \mathbb{E} [ f(\mathbf{x}_p) \, f(\mathbf{x}_q) ] , \quad \mathbf{x}_p,\mathbf{x}_q \in \mathbb{R}^d. $$ We will use the squared exponential (or RBF) kernel $$ k(\mathbf{x}_p,\mathbf{x}_q) = \sigma_f^2 \exp \Big( - \frac{\| \mathbf{x}_p - \mathbf{x}_q \|^2}{2\ell^2} \Big) , $$ where $\ell$ and $\sigma_f^2$ are the length-scale and signal variance parameters, respectively.
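Purely as an illustration of the formula above (GPy provides this kernel as GPy.kern.RBF, used below), a minimal NumPy sketch could look as follows; the helper name rbf_kernel and its arguments are hypothetical.
# Minimal sketch of the squared exponential kernel (illustration only)
def rbf_kernel(Xp, Xq, lengthscale=1.0, signal_var=1.0):
    # pairwise squared Euclidean distances between the rows of Xp and Xq
    sq_dists = (np.sum(Xp**2, axis=1)[:, None]
                + np.sum(Xq**2, axis=1)[None, :]
                - 2.0 * Xp @ Xq.T)
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)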
Given a training set $X \subset \mathbb{R}^d$ of $n$ input points, we model the observation $y$ at a point $\mathbf{x} \in X$ by $$ y = f(\mathbf{x}) + \varepsilon , $$ where $\varepsilon$ is i.i.d. Gaussian noise with variance $\sigma_n^2$.
Denoting by $K$ the $n\times n$ matrix with entries $k(\mathbf{x}_p, \mathbf{x}_q)$, the log marginal likelihood becomes $$ \log p( \, \mathbf{y} \, | \, X, \ell, \sigma_f^2, \sigma_n^2 \, ) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} -\frac{1}{2} \log \, | \, K + \sigma_n^2 I \, | \, - \frac{n}{2} \log 2\pi $$
We will optimize the parameters $\ell$, $\sigma_f^2$ and $\sigma_n^2$ by maximizing the likelihood above.
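For reference, the objective above can be written out directly with NumPy; this is only a sketch of the quantity that GPy optimizes for us, reusing the hypothetical rbf_kernel helper from the previous sketch.
# Sketch of the log marginal likelihood of a zero-mean GP with RBF kernel and Gaussian noise
def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    n = X.shape[0]
    K = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(n)
    # Cholesky factor of K + sigma_n^2 I gives a stable solve and log-determinant
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -1/2 y^T (K + sigma_n^2 I)^{-1} y - 1/2 log|K + sigma_n^2 I| - n/2 log(2 pi)
    return (-0.5 * np.sum(y * alpha)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))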
# Split the data into training and testing sets (with shuffling)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns='sound_pressure'),
                                                    data[['sound_pressure']],
                                                    shuffle=True,
                                                    test_size=0.3)
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
To make the model easier to visualize, we first project the data onto a one-dimensional feature space.
# Use PCA for dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# Choose the squared exponential kernel with observation noise. Here variance = \sigma_f^2, lengthscale = \ell.
kernel_1d = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
# Create a regression model with normalized targets. Here noise_var = \sigma_n^2.
model_1d = GPy.models.GPRegression(X_train_pca, y_train.values, kernel_1d, normalizer=True, noise_var=1.)
from IPython.display import display
display(model_1d)
Model: GP regression
Objective: 1411.6691693843263
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 1.0 | +ve | |
| rbf.lengthscale | 1.0 | +ve | |
| Gaussian_noise.variance | 1.0 | +ve | |
model_1d.plot(figsize=(15,5))
(Plot: training data with the GP mean and confidence band over the one-dimensional PCA feature, before optimization.)
We then optimize the parameters $\ell$, $\sigma_f^2$ and $\sigma_n^2$ by maximizing the log marginal likelihood.
model_1d.optimize()
display(model_1d)
Model: GP regression
Objective: 1400.8348691355702
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 0.995235331845662 | +ve | |
| rbf.lengthscale | 1.2855994720729444 | +ve | |
| Gaussian_noise.variance | 0.8188377781590664 | +ve | |
model_1d.plot(figsize=(15,5))
(Plot: training data with the GP mean and confidence band after optimization.)
We turn to the test set for model evaluation.
model_1d.plot(figsize=(15,5))
plt.scatter(X_test_pca, y_test, c='r', marker='x', label='Test data')
plt.legend()
(Plot: the fitted 1-D GP with the test data overlaid as red crosses.)
# Predict
y_pred, y_pred_var = model_1d.predict(X_test_pca)
# Create a dataframe for prediction results
results_1d = pd.DataFrame(index=y_test.index)
results_1d['true'] = y_test.values
results_1d['pred'] = np.round(y_pred,3)
results_1d['pred_var'] = np.round(y_pred_var,3)
# Include lower and upper endpoints of the prediction interval
results_1d['pred_lower'] = results_1d['pred'] - 2.0 * np.round(np.sqrt(results_1d['pred_var']),3)
results_1d['pred_upper'] = results_1d['pred'] + 2.0 * np.round(np.sqrt(results_1d['pred_var']),3)
# 95% of our predictions should be within this interval
within_interval = (results_1d['pred_lower'] <= results_1d['true']) & (results_1d['true'] <= results_1d['pred_upper'])
results_1d['within_2sigma'] = np.where(within_interval, True, False)
results_1d.head()
| | true | pred | pred_var | pred_lower | pred_upper | within_2sigma |
|---|---|---|---|---|---|---|
| 1436 | 117.054 | 123.860 | 39.744 | 111.252 | 136.468 | True |
| 572 | 132.757 | 127.422 | 39.623 | 114.832 | 140.012 | True |
| 126 | 112.169 | 117.339 | 39.933 | 104.701 | 129.977 | True |
| 998 | 132.897 | 127.302 | 39.597 | 114.716 | 139.888 | True |
| 80 | 121.851 | 124.130 | 39.607 | 111.544 | 136.716 | True |
Our regression model gives probabilistic predictions, and we would like to know whether the variances of these predictive distributions are correct on average. We therefore compare the average predictive variance on the test set with the mean squared error: if the model is well calibrated, the two should be roughly equal, and about 95% of the test observations should fall within the $2\sigma$ prediction interval.
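The reasoning behind this check is that, for a well-calibrated model, the observation at a test point is distributed as $y \sim \mathcal{N}(\hat{y}, \sigma_{\mathrm{pred}}^2)$ around the predicted mean $\hat{y}$, so $$ \mathbb{E}\big[ (y - \hat{y})^2 \big] = \sigma_{\mathrm{pred}}^2 , $$ and averaging over the test set, the mean squared error should be close to the average predictive variance.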
from sklearn.metrics import mean_squared_error
# Model evaluation
avg_pred_var = results_1d['pred_var'].mean()
print('Average prediction variance on test set: {}'.format(np.round(avg_pred_var,2)))
mse = mean_squared_error(results_1d['true'], results_1d['pred'])
print('Mean squared error on test set: {}'.format(np.round(mse, 2)))
twosigmapercent = 100*results_1d['within_2sigma'].mean()
print('Percentage of test instances within 2 sigma prediction interval: {}%'.format(np.round(twosigmapercent,2)))
Average prediction variance on test set: 39.7
Mean squared error on test set: 36.0
Percentage of test instances within 2 sigma prediction interval: 95.57%
We now use all five features for the GP model.
# Choose the squared exponential kernel with observation noise. Here variance = \sigma_f^2, lengthscale = \ell.
kernel = GPy.kern.RBF(input_dim=5, variance=1., lengthscale=1.)
# Create a regression model with normalized targets. Here noise_var = \sigma_n^2.
model = GPy.models.GPRegression(X_train_scaled, y_train.values, kernel, normalizer=True, noise_var=1.)
display(model)
Model: GP regression
Objective: 1193.601528964441
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 1.0 | +ve | |
| rbf.lengthscale | 1.0 | +ve | |
| Gaussian_noise.variance | 1.0 | +ve | |
# Optimize the parameters for length-scale, and signal and noise variance
model.optimize()
display(model)
Model: GP regression
Objective: 650.2917834088931
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 3.441817996849852 | +ve | |
| rbf.lengthscale | 1.082201409663005 | +ve | |
| Gaussian_noise.variance | 0.1006648147910919 | +ve | |
Evaluate on the test set.
# Predict
y_pred, y_pred_var = model.predict(X_test_scaled)
# Create a dataframe for prediction results
results = pd.DataFrame(index=y_test.index)
results['true'] = y_test.values
results['pred'] = np.round(y_pred,3)
results['pred_var'] = np.round(y_pred_var,3)
# Include lower and upper endpoints of the prediction interval
results['pred_lower'] = results['pred'] - 2.0 * np.round(np.sqrt(results['pred_var']),3)
results['pred_upper'] = results['pred'] + 2.0 * np.round(np.sqrt(results['pred_var']),3)
# 95% of our predictions should be within this interval
within_interval = (results['pred_lower'] <= results['true']) & (results['true'] <= results['pred_upper'])
results['within_2sigma'] = np.where(within_interval, True, False)
results.head()
| | true | pred | pred_var | pred_lower | pred_upper | within_2sigma |
|---|---|---|---|---|---|---|
| 1436 | 117.054 | 118.263 | 6.911 | 113.005 | 123.521 | True |
| 572 | 132.757 | 129.433 | 5.319 | 124.821 | 134.045 | True |
| 126 | 112.169 | 109.815 | 13.204 | 102.547 | 117.083 | True |
| 998 | 132.897 | 133.532 | 6.170 | 128.564 | 138.500 | True |
| 80 | 121.851 | 125.138 | 5.366 | 120.506 | 129.770 | True |
# Model evaluation
avg_pred_var = results['pred_var'].mean()
print('Average prediction variance on test set: {}'.format(np.round(avg_pred_var,2)))
mse = mean_squared_error(results['true'], results['pred'])
print('Mean squared error on test set: {}'.format(np.round(mse, 2)))
twosigmapercent = 100*results['within_2sigma'].mean()
print('Percentage of test instances within 2 sigma prediction interval: {}%'.format(np.round(twosigmapercent,2)))
Average prediction variance on test set: 6.93
Mean squared error on test set: 6.0
Percentage of test instances within 2 sigma prediction interval: 94.9%