This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.

Author: Eric Marsden [email protected]


In this notebook, we illustrate NumPy features for building simple univariate linear regression models. Consult the accompanying lecture slides for background on linear regression analysis and some notes on when the technique is useful.

Linear regression analysis of smoking data

In [1]:
import numpy
import pandas
import matplotlib.pyplot as plt
import scipy.stats
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_formats=['svg']

The British Doctors’ Study followed the health of a large number of physicians in the UK over the period 1951–2001. It provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other smoking-related illnesses.

The study provided data on annual mortality for a variety of diseases at four levels of cigarette smoking:

  • never smoked
  • 1-14 per day
  • 15-24 per day
  • more than 25 per day

The data is as follows:

In [2]:
df = pandas.DataFrame({'cigarettes' : [0,10,20,30], 'CVD' : [572,802,892,1025], 'lung' : [14, 105, 208, 355]})
df.head()
Out[2]:
CVD cigarettes lung
0 572 0 14
1 802 10 105
2 892 20 208
3 1025 30 355

Note: CVD is an acronym for cardio-vascular disease.

Let’s plot the data to see whether a linear model is a plausible representation for the health impact of smoking.

In [3]:
plt.plot(df.cigarettes, df.CVD, "o")
plt.margins(0.1)
plt.title(u"Deaths for different smoking intensities", weight='bold')
plt.xlabel(u"Cigarettes smoked per day")
plt.ylabel(u"CVD deaths");

It is quite tempting from this graph to conclude that cardiovascular disease deaths increase linearly with cigarette consumption. We can fit a linear model to this data, using the statsmodels library (an alternative possibility is to use the scikit-learn library, which has more functionality related to machine learning). We will use the formula interface to ordinary least squares regression, available in statsmodels.formula.api.

For simple linear regression (with a single predictor variable), formulas are written outcome ~ observation.

In [4]:
import statsmodels.formula.api as smf

lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(df.cigarettes, df.CVD, "o")
plt.plot(X, Y, color="darkgreen")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Deaths from cardiovascular disease")
plt.margins(0.1)

In the regression model, $\beta_0$ is the intercept of the regression line and $\beta_1$ is its slope.

We can make a similar plot for lung cancer deaths.

In [5]:
lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# params[0] is the intercept (beta0)
# params[1] is the slope (beta1)
Y = lm.params[0] + lm.params[1] * X
plt.plot(df.cigarettes, df.lung, "o")
plt.plot(X, Y, color="darkgreen")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")
plt.margins(0.1)

We can use the linear model for prediction, to estimate the likelihood of death from lung cancer or from cardiovascular disease for a level of smoking for which we have no direct data. For example, the expected lung cancer mortality risk for a group of people who smike 15 cigarettes per day is

In [6]:
# use the fitted model for prediction
lm.predict({'cigarettes': 15}) / 100000.0
Out[6]:
array([ 0.001705])

Note that we have divived by 100000, because the data given is the number of deaths per 100000 people.

We can examine some information on the “goodness of fit”, or the quality of the linear model:

In [7]:
lm.summary()
/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/stats/stattools.py:72: UserWarning: omni_normtest is not valid with less than 8 observations; 4 samples were given.
  "samples were given." % int(n))
Out[7]:
OLS Regression Results
Dep. Variable: lung R-squared: 0.987
Model: OLS Adj. R-squared: 0.980
Method: Least Squares F-statistic: 151.8
Date: Tue, 21 Mar 2017 Prob (F-statistic): 0.00652
Time: 13:13:58 Log-Likelihood: -16.359
No. Observations: 4 AIC: 36.72
Df Residuals: 2 BIC: 35.49
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.6000 17.097 0.094 0.934 -71.964 75.164
cigarettes 11.2600 0.914 12.321 0.007 7.328 15.192
Omnibus: nan Durbin-Watson: 2.086
Prob(Omnibus): nan Jarque-Bera (JB): 0.534
Skew: -0.143 Prob(JB): 0.766
Kurtosis: 1.233 Cond. No. 31.4

In particular, the $R^2$ value of 0.987 is very high, indicating a good level of fit to our dataset. However, given the small size of our dataset (only 4 observations, even if each observation is based on a large population), the 95% confidence interval for our model parameters $\beta_0$ and $\beta_1$ is quite large.

The Seaborn package provides convenient functions for making plots of linear regression models. In particular, the regplot function generates a regression plot that includes 95% confidence intervals for the model parameters.

In [8]:
sns.regplot(df.cigarettes, df.CVD);