#!/usr/bin/env python # coding: utf-8 # # # This notebook is an element of the [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/). # # Author: Eric Marsden # # --- # # In this notebook, we illustrate NumPy features for building simple univariate linear regression models. Consult the [accompanying course material](https://risk-engineering.org/linear-regression-analysis/) for background on linear regression analysis, some notes on when the technique is useful, and to download this content as a Jupyter/Python notebook. # # Linear regression analysis of smoking data # In[1]: import numpy import pandas import scipy.stats import matplotlib.pyplot as plt plt.style.use("bmh") get_ipython().run_line_magic('config', 'InlineBackend.figure_formats=["svg"]') import warnings warnings.simplefilter(action="ignore", category=FutureWarning) # The *British Doctors’ Study* followed the health of a large number of physicians in the UK over the period 1951–2001. It provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other smoking-related illnesses. # # The study provided data on annual mortality for a variety of diseases at four levels of cigarette smoking: # # - never smoked # - 1-14 per day # - 15-24 per day # - more than 25 per day # # The data is as follows: # In[2]: df = pandas.DataFrame({"cigarettes" : [0,10,20,30], "CVD" : [572,802,892,1025], "lung" : [14, 105, 208, 355]}) df.head() # Note: CVD is an acronym for cardio-vascular disease. # # Let’s plot the data to see whether a linear model is a plausible representation for the health impact of smoking. # In[3]: plt.plot(df.cigarettes, df.CVD, "o") plt.margins(0.1) plt.title("Deaths for different smoking intensities", weight="bold") plt.xlabel("Cigarettes smoked per day") plt.ylabel("CVD deaths"); # It is quite tempting from this graph to conclude that cardiovascular disease deaths increase linearly with cigarette consumption. We can fit a linear model to this data, using the `statsmodels` library (an alternative possibility is to use the `scikit-learn` library, which has more functionality related to machine learning). We will use the formula interface to ordinary least squares regression, available in `statsmodels.formula.api`. # # For simple linear regression (with a single predictor variable), formulas are written `outcome ~ observation`. # In[4]: import statsmodels.formula.api as smf lm = smf.ols("CVD ~ cigarettes", data=df).fit() xmin = df.cigarettes.min() xmax = df.cigarettes.max() X = numpy.linspace(xmin, xmax, 100) # params[0] is the intercept (beta0) # params[1] is the slope (beta1) Y = lm.params[0] + lm.params[1] * X plt.plot(df.cigarettes, df.CVD, "o") plt.plot(X, Y, color="darkgreen") plt.xlabel("Cigarettes smoked per day") plt.ylabel("Deaths from cardiovascular disease") plt.margins(0.1) # In the regression model, $\beta_0$ is the **intercept** of the regression line and $\beta_1$ is its **slope**. # # We can make a similar plot for lung cancer deaths. # In[5]: lm = smf.ols("lung ~ cigarettes", data=df).fit() xmin = df.cigarettes.min() xmax = df.cigarettes.max() X = numpy.linspace(xmin, xmax, 100) # params[0] is the intercept (beta0) # params[1] is the slope (beta1) Y = lm.params[0] + lm.params[1] * X plt.plot(df.cigarettes, df.lung, "o") plt.plot(X, Y, color="darkgreen") plt.xlabel("Cigarettes smoked per day") plt.ylabel("Lung cancer deaths") plt.margins(0.1) # We can use the linear model for **prediction**, to estimate the likelihood of death from lung cancer or from cardiovascular disease for a level of smoking for which we have no direct data. For example, the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day is # In[6]: # use the fitted model for prediction lm.predict({"cigarettes": [15]}) / 100000.0 # Note that we have divided by 100000, because the data given is the number of deaths per 100000 people. # We can examine some information on the “goodness of fit”, or the quality of the linear model: # In[7]: lm.summary() # In particular, the $R^2$ value of 0.987 is very high, indicating a good level of fit to our dataset. However, given the small size of our dataset (only 4 observations), the 95% confidence interval for our model parameters $\beta_0$ and $\beta_1$ is quite large. In fact, our 4 data points are based on a large number of observations, so our confidence in this data is high, but unfortunately we don't have access to the original data (where each observation corresponds to one person). # The [Seaborn package](https://seaborn.pydata.org/) provides convenient functions for making plots of linear regression models. In particular, the `regplot` function generates a regression plot that includes 95% confidence intervals for the model parameters. # In[8]: import seaborn as sns sns.regplot(df, x="cigarettes", y="CVD"); # In[ ]: