In this notebook we use the Python library GPy for Gaussian processes and apply it to a regression problem on the Airfoil Self-Noise dataset from the UCI Machine Learning Repository.
import pandas as pd
import numpy as np
import GPy
from matplotlib import pyplot as plt
%matplotlib inline
# Download the dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat
# View the first rows
!head airfoil_self_noise.dat
As explained on the repository web page, the input columns are:

1. Frequency, in Hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.

The output column is:

6. Scaled sound pressure level, in decibels.
# Load the data into a pandas dataframe
data = pd.read_csv('airfoil_self_noise.dat',
                   sep='\t',
                   names=['frequency',
                          'angle_of_attack',
                          'chord_length',
                          'free-stream_velocity',
                          'suction_side_displacement_thickness',
                          'sound_pressure'])
data.head()
| | frequency | angle_of_attack | chord_length | free-stream_velocity | suction_side_displacement_thickness | sound_pressure |
|---|---|---|---|---|---|---|
| 0 | 800 | 0.0 | 0.3048 | 71.3 | 0.002663 | 126.201 |
| 1 | 1000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.201 |
| 2 | 1250 | 0.0 | 0.3048 | 71.3 | 0.002663 | 125.951 |
| 3 | 1600 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.591 |
| 4 | 2000 | 0.0 | 0.3048 | 71.3 | 0.002663 | 127.461 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503 entries, 0 to 1502
Data columns (total 6 columns):
frequency                               1503 non-null int64
angle_of_attack                         1503 non-null float64
chord_length                            1503 non-null float64
free-stream_velocity                    1503 non-null float64
suction_side_displacement_thickness    1503 non-null float64
sound_pressure                          1503 non-null float64
dtypes: float64(5), int64(1)
memory usage: 70.5 KB
By a Gaussian process we mean a collection of random variables $f(\mathbf{x})$, indexed by $\mathbf{x} \in \mathbb{R}^d$, such that any finite number of them has a jointly Gaussian distribution. Here we assume the process has zero mean, so to specify it, it suffices to define its covariance function $$ k(\mathbf{x}_p, \mathbf{x}_q) = \mathbb{E} [ f(\mathbf{x}_p) \, f(\mathbf{x}_q) ] , \quad \mathbf{x}_p,\mathbf{x}_q \in \mathbb{R}^d. $$ We will use the squared exponential (or RBF) kernel $$ k(\mathbf{x}_p,\mathbf{x}_q) = \sigma_f^2 \exp \Big( - \frac{\| \mathbf{x}_p - \mathbf{x}_q \|^2}{2\ell^2} \Big) , $$ where $\ell$ and $\sigma_f^2$ are the length-scale and signal variance parameters, respectively.
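Purely as an illustration of the formula above (GPy provides this kernel as GPy.kern.RBF, used below), a minimal NumPy sketch could look as follows; the helper name rbf_kernel and its arguments are hypothetical.
# Minimal sketch of the squared exponential kernel (illustration only)
def rbf_kernel(Xp, Xq, lengthscale=1.0, signal_var=1.0):
    # pairwise squared Euclidean distances between the rows of Xp and Xq
    sq_dists = (np.sum(Xp**2, axis=1)[:, None]
                + np.sum(Xq**2, axis=1)[None, :]
                - 2.0 * Xp @ Xq.T)
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)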
Given a training set $X \subset \mathbb{R}^d$ of $n$ input points, we model the observation $y$ at a point $\mathbf{x} \in X$ by $$ y = f(\mathbf{x}) + \varepsilon , $$ where $\varepsilon$ is i.i.d. Gaussian noise with variance $\sigma_n^2$.
Denoting by $K$ the $n\times n$ matrix with entries $k(\mathbf{x}_p, \mathbf{x}_q)$, the log marginal likelihood becomes $$ \log p( \, \mathbf{y} \, | \, X, \ell, \sigma_f^2, \sigma_n^2 \, ) = -\frac{1}{2} \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} -\frac{1}{2} \log \, | \, K + \sigma_n^2 I \, | \, - \frac{n}{2} \log 2\pi $$
We will optimize the parameters $\ell$, $\sigma_f^2$ and $\sigma_n^2$ by maximizing the likelihood above.
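For reference, the objective above can be written out directly with NumPy; this is only a sketch of the quantity that GPy optimizes for us, reusing the hypothetical rbf_kernel helper from the previous sketch.
# Sketch of the log marginal likelihood of a zero-mean GP with RBF kernel and Gaussian noise
def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    n = X.shape[0]
    K = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(n)
    # Cholesky factor of K + sigma_n^2 I gives a stable solve and log-determinant
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -1/2 y^T (K + sigma_n^2 I)^{-1} y - 1/2 log|K + sigma_n^2 I| - n/2 log(2 pi)
    return (-0.5 * np.sum(y * alpha)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))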
# Split the data into training and testing sets (with shuffling)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns='sound_pressure'),
                                                    data[['sound_pressure']],
                                                    shuffle=True,
                                                    test_size=0.3)
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
To make the model easier to visualize, we first project the data onto a one-dimensional feature space.
# Use PCA for dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# Choose the squared exponential kernel with observation noise. Here variance = \sigma_f^2, lengthscale = \ell.
kernel_1d = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
# Create a regression model with normalized targets. Here noise_var = \sigma_n^2.
model_1d = GPy.models.GPRegression(X_train_pca, y_train.values, kernel_1d, normalizer=True, noise_var=1.)
from IPython.display import display
display(model_1d)
Model: GP regression
Objective: 1411.6691693843263
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 1.0 | +ve | |
| rbf.lengthscale | 1.0 | +ve | |
| Gaussian_noise.variance | 1.0 | +ve | |
model_1d.plot(figsize=(15,5))
(Plot: training data with the GP mean and confidence band over the one-dimensional PCA feature, before optimization.)
We then optimize the parameters $\ell$, $\sigma_f^2$ and $\sigma_n^2$ by maximizing the log marginal likelihood.
model_1d.optimize()
display(model_1d)
Model: GP regression
Objective: 1400.8348691355702
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 0.995235331845662 | +ve | |
| rbf.lengthscale | 1.2855994720729444 | +ve | |
| Gaussian_noise.variance | 0.8188377781590664 | +ve | |
model_1d.plot(figsize=(15,5))
(Plot: training data with the GP mean and confidence band after optimization.)
We turn to the test set for model evaluation.
model_1d.plot(figsize=(15,5))
plt.scatter(X_test_pca, y_test, c='r', marker='x', label='Test data')
plt.legend()
(Plot: the fitted 1-D GP with the test data overlaid as red crosses.)
# Predict
y_pred, y_pred_var = model_1d.predict(X_test_pca)
# Create a dataframe for prediction results
results_1d = pd.DataFrame(index=y_test.index)
results_1d['true'] = y_test.values
results_1d['pred'] = np.round(y_pred,3)
results_1d['pred_var'] = np.round(y_pred_var,3)
# Include lower and upper endpoints of the prediction interval
results_1d['pred_lower'] = results_1d['pred'] - 2.0 * np.round(np.sqrt(results_1d['pred_var']),3)
results_1d['pred_upper'] = results_1d['pred'] + 2.0 * np.round(np.sqrt(results_1d['pred_var']),3)
# 95% of our predictions should be within this interval
within_interval = (results_1d['pred_lower'] <= results_1d['true']) & (results_1d['true'] <= results_1d['pred_upper'])
results_1d['within_2sigma'] = np.where(within_interval, True, False)
results_1d.head()
| | true | pred | pred_var | pred_lower | pred_upper | within_2sigma |
|---|---|---|---|---|---|---|
| 1436 | 117.054 | 123.860 | 39.744 | 111.252 | 136.468 | True |
| 572 | 132.757 | 127.422 | 39.623 | 114.832 | 140.012 | True |
| 126 | 112.169 | 117.339 | 39.933 | 104.701 | 129.977 | True |
| 998 | 132.897 | 127.302 | 39.597 | 114.716 | 139.888 | True |
| 80 | 121.851 | 124.130 | 39.607 | 111.544 | 136.716 | True |
Our regression model gives probabilistic predictions, and we would like to know whether the variances of these predictive distributions are correct on average. We therefore compare the average predictive variance on the test set with the mean squared error: if the model is well calibrated, the two should be roughly equal, and about 95% of the test observations should fall within the $2\sigma$ prediction interval.
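The reasoning behind this check is that, for a well-calibrated model, the observation at a test point is distributed as $y \sim \mathcal{N}(\hat{y}, \sigma_{\mathrm{pred}}^2)$ around the predicted mean $\hat{y}$, so $$ \mathbb{E}\big[ (y - \hat{y})^2 \big] = \sigma_{\mathrm{pred}}^2 , $$ and averaging over the test set, the mean squared error should be close to the average predictive variance.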
from sklearn.metrics import mean_squared_error
# Model evaluation
avg_pred_var = results_1d['pred_var'].mean()
print('Average prediction variance on test set: {}'.format(np.round(avg_pred_var,2)))
mse = mean_squared_error(results_1d['true'], results_1d['pred'])
print('Mean squared error on test set: {}'.format(np.round(mse, 2)))
twosigmapercent = 100*results_1d['within_2sigma'].mean()
print('Percentage of test instances within 2 sigma prediction interval: {}%'.format(np.round(twosigmapercent,2)))
Average prediction variance on test set: 39.7
Mean squared error on test set: 36.0
Percentage of test instances within 2 sigma prediction interval: 95.57%
We now use all five features for the GP model.
# Choose the squared exponential kernel with observation noise. Here variance = \sigma_f^2, lengthscale = \ell.
kernel = GPy.kern.RBF(input_dim=5, variance=1., lengthscale=1.)
# Create a regression model with normalized targets. Here noise_var = \sigma_n^2.
model = GPy.models.GPRegression(X_train_scaled, y_train.values, kernel, normalizer=True, noise_var=1.)
display(model)
Model: GP regression
Objective: 1193.601528964441
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 1.0 | +ve | |
| rbf.lengthscale | 1.0 | +ve | |
| Gaussian_noise.variance | 1.0 | +ve | |
# Optimize the parameters for length-scale, and signal and noise variance
model.optimize()
display(model)
Model: GP regression
Objective: 650.2917834088931
Number of Parameters: 3
Number of Optimization Parameters: 3
Updates: True
| GP_regression. | value | constraints | priors |
|---|---|---|---|
| rbf.variance | 3.441817996849852 | +ve | |
| rbf.lengthscale | 1.082201409663005 | +ve | |
| Gaussian_noise.variance | 0.1006648147910919 | +ve | |
Evaluate on the test set.
# Predict
y_pred, y_pred_var = model.predict(X_test_scaled)
# Create a dataframe for prediction results
results = pd.DataFrame(index=y_test.index)
results['true'] = y_test.values
results['pred'] = np.round(y_pred,3)
results['pred_var'] = np.round(y_pred_var,3)
# Include lower and upper endpoints of the prediction interval
results['pred_lower'] = results['pred'] - 2.0 * np.round(np.sqrt(results['pred_var']),3)
results['pred_upper'] = results['pred'] + 2.0 * np.round(np.sqrt(results['pred_var']),3)
# 95% of our predictions should be within this interval
within_interval = (results['pred_lower'] <= results['true']) & (results['true'] <= results['pred_upper'])
results['within_2sigma'] = np.where(within_interval, True, False)
results.head()
| | true | pred | pred_var | pred_lower | pred_upper | within_2sigma |
|---|---|---|---|---|---|---|
| 1436 | 117.054 | 118.263 | 6.911 | 113.005 | 123.521 | True |
| 572 | 132.757 | 129.433 | 5.319 | 124.821 | 134.045 | True |
| 126 | 112.169 | 109.815 | 13.204 | 102.547 | 117.083 | True |
| 998 | 132.897 | 133.532 | 6.170 | 128.564 | 138.500 | True |
| 80 | 121.851 | 125.138 | 5.366 | 120.506 | 129.770 | True |
# Model evaluation
avg_pred_var = results['pred_var'].mean()
print('Average prediction variance on test set: {}'.format(np.round(avg_pred_var,2)))
mse = mean_squared_error(results['true'], results['pred'])
print('Mean squared error on test set: {}'.format(np.round(mse, 2)))
twosigmapercent = 100*results['within_2sigma'].mean()
print('Percentage of test instances within 2 sigma prediction interval: {}%'.format(np.round(twosigmapercent,2)))
Average prediction variance on test set: 6.93
Mean squared error on test set: 6.0
Percentage of test instances within 2 sigma prediction interval: 94.9%