This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.

Author: Eric Marsden [email protected]

In this notebook, we illustrate NumPy features for building linear regression models. Consult the accompanying lecture slides on risk-engineering.org for background on linear regression analysis and some notes on when the technique is useful.

In [1]:

```
import numpy
import pandas
import matplotlib.pyplot as plt
import scipy.stats
import seaborn as sns
%config InlineBackend.figure_formats=['png']
```

We will analyze data from a combined cycle power plant to attempt to build a predictive model for output power. The data comes from the UCI machine learning repository.

http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The dataset contains 9568 data points collected from a combined cycle power plant over 6 years, when power plant was under full load. A combined cycle power plant is composed of gas turbines, steam turbines and heat recovery steam generators. Electricity is generated by gas & steam turbines, which are combined in one cycle. Three ambient variables affect the performance of the gas turbine, and exhaust vacuum affects the performance of the steam turbine.

Data consists of hourly averages taken from various sensors located around the plant that record the ambient variables every second.

Letâ€™s load it into Python and examine it using the `pandas`

library. For convenience, we have unzipped the dataset and made it web accessible.

In [2]:

```
data = pandas.read_csv("https://risk-engineering.org/static/data/CCPP.csv")
data.head()
```

Out[2]:

In [3]:

```
data.describe()
```

Out[3]:

We can obtain a first impression of the dependency between variables by examining a multidimensional scatterplot.

In [4]:

```
from pandas.plotting import scatter_matrix
scatter_matrix(data, diagonal='kde');
```

In this matrix, the diagonal contains a plot of the distribution of each variable. We observe that:

there is an approximately linear relationship between PE and the negative of AT

there is an approximately linear relationship between PE and negative of V

We can also generate a 3D plot of the observations, which can sometimes help to interpret the data more easily. Here we plot PE as a function of AT and V.

In [5]:

```
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 8))
ax = Axes3D(fig, azim=-115, elev=15)
ax.scatter(data['AT'], data['V'], data['PE'])
ax.set_xlabel('AT')
ax.set_ylabel('V')
ax.set_zlabel('PE')
ax.set_facecolor("white")
```

We will created a fitted linear model using the formula API of the `statsmodels`

library. We use all the observations except for PE as predictor variables in the multiple linear regression.

In [6]:

```
import statsmodels.formula.api as smf
lm = smf.ols(formula='PE ~ AT + V + AP + RH', data=data).fit()
lm.params
```

Out[6]:

This means that the best formula to estimate output power as a function of AT, V, AP and RH is

$$PE = 451 . 067793 âˆ’ 1 . 974731 AT âˆ’ 0 . 234992 V + 0 . 065540 AP âˆ’ 0 . 157598 RH$$

For any particular observation (values for the predictor variables), we can use the linear model to estimate the output variable PE.

In [7]:

```
lm.predict(pandas.DataFrame({'AT': [9.48], 'V': [44.71], 'AP': [1019.12], 'RH': [66.43]}))
```

Out[7]:

The output power for this combination of inputs is 478 MW (note that we report the result with the same number of significant figures as that in our inputs).

We check the residuals of each predictor variable for any pattern that might indicate that a linear model is not appropriate for this dataset.

In [8]:

```
residuals = lm.predict(data) - data.PE
plt.plot(data.AT, residuals, '.', alpha=0.5)
plt.xlabel("AT")
plt.ylabel("Residual");
```

In [9]:

```
plt.plot(data.V, residuals, '.', alpha=0.5)
plt.xlabel("V")
plt.ylabel("Residual");
```

In [10]:

```
plt.plot(data.AP, residuals, '.', alpha=0.5)
plt.xlabel("AP")
plt.ylabel("Residual");
```

In [11]:

```
plt.plot(data.RH, residuals, '.', alpha=0.5)
plt.xlabel("RH")
plt.ylabel("Residual");
```

Indeed, except for a minor quadratic shape to the residuals of variable AT (which we will ignore here), the residuals look random, without any systematic feature apparent that might indicate that our linear model is not appropriate for this data.

We also check that the variance of the residuals is normally distributed by plotting a histogram or a QQ plot of the residuals, as shown below.

In [12]:

```
plt.hist(residuals, bins=40, alpha=0.5)
plt.title("Histogram of the residuals", weight='bold');
```

In [13]:

```
scipy.stats.probplot(residuals, dist=scipy.stats.norm, plot=plt.figure().add_subplot(111));
```

In [14]:

```
lm.summary()
```

Out[14]:

The $R^2$ for this model is 0.929, which means that it explains roughly 93% of the variance of the output power. Interpretation of this number is dependent on the application (for instance, models used to understand health effects tend to have lower $R^2$ values than those used for physical models), but is quite satisfactory for this application.