#!/usr/bin/env python
# coding: utf-8
#
#
# This notebook is an element of the [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/).
#
# Author: Eric Marsden
#
# ---
#
# In this notebook, we illustrate NumPy features for building simple univariate linear regression models. Consult the [accompanying course material](https://risk-engineering.org/linear-regression-analysis/) for background on linear regression analysis, some notes on when the technique is useful, and to download this content as a Jupyter/Python notebook.
# # Linear regression analysis of smoking data
# In[1]:
import numpy
import pandas
import scipy.stats
import matplotlib.pyplot as plt
plt.style.use("bmh")
get_ipython().run_line_magic('config', 'InlineBackend.figure_formats=["svg"]')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
# The *British Doctors’ Study* followed the health of a large number of physicians in the UK over the period 1951–2001. It provided conclusive evidence of linkage between smoking and lung cancer, myocardial infarction, respiratory disease and other smoking-related illnesses.
#
# The study provided data on annual mortality for a variety of diseases at four levels of cigarette smoking:
#
# - never smoked
# - 1-14 per day
# - 15-24 per day
# - 25 or more per day
#
# The data is as follows:
# In[2]:
df = pandas.DataFrame({"cigarettes" : [0,10,20,30], "CVD" : [572,802,892,1025], "lung" : [14, 105, 208, 355]})
df.head()
# Note: CVD is an acronym for cardio-vascular disease.
#
# Let’s plot the data to see whether a linear model is a plausible representation for the health impact of smoking.
# In[3]:
plt.plot(df.cigarettes, df.CVD, "o")
plt.margins(0.1)
plt.title("Deaths for different smoking intensities", weight="bold")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("CVD deaths");
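# As a quick numerical check of the apparent linearity (a supplementary step, not part of
# the original study data presentation), we can compute the Pearson correlation coefficient
# between the two columns with `scipy.stats.pearsonr`; a value close to 1 indicates a strong
# linear association. The data is restated here so the snippet is self-contained.

```python
import scipy.stats

cigarettes = [0, 10, 20, 30]
cvd = [572, 802, 892, 1025]
# pearsonr returns the correlation coefficient and a two-sided p-value
r, p = scipy.stats.pearsonr(cigarettes, cvd)
print(f"Pearson r = {r:.3f}")   # close to 1: strong linear association
```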
# It is quite tempting from this graph to conclude that cardiovascular disease deaths increase linearly with cigarette consumption. We can fit a linear model to this data, using the `statsmodels` library (an alternative possibility is to use the `scikit-learn` library, which has more functionality related to machine learning). We will use the formula interface to ordinary least squares regression, available in `statsmodels.formula.api`.
#
# For simple linear regression (with a single predictor variable), formulas are written `outcome ~ predictor`.
# In[4]:
import statsmodels.formula.api as smf
lm = smf.ols("CVD ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# the fitted coefficients are indexed by name in lm.params:
# "Intercept" is beta0, "cigarettes" is the slope beta1
Y = lm.params["Intercept"] + lm.params["cigarettes"] * X
plt.plot(df.cigarettes, df.CVD, "o")
plt.plot(X, Y, color="darkgreen")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Deaths from cardiovascular disease")
plt.margins(0.1)
# In the regression model, $\beta_0$ is the **intercept** of the regression line and $\beta_1$ is its **slope**.
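# As a sanity check on the fitted coefficients (an addition to the notebook, not the method
# used above), the same ordinary-least-squares line can be obtained with
# `scipy.stats.linregress`, which returns the intercept and slope directly. The data is
# restated so the snippet runs on its own.

```python
import scipy.stats

cigarettes = [0, 10, 20, 30]
cvd = [572, 802, 892, 1025]
# linregress fits y = intercept + slope * x by ordinary least squares
res = scipy.stats.linregress(cigarettes, cvd)
print(f"intercept (beta0) = {res.intercept:.2f}")
print(f"slope     (beta1) = {res.slope:.2f}")
```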
#
# We can make a similar plot for lung cancer deaths.
# In[5]:
lm = smf.ols("lung ~ cigarettes", data=df).fit()
xmin = df.cigarettes.min()
xmax = df.cigarettes.max()
X = numpy.linspace(xmin, xmax, 100)
# the fitted coefficients are indexed by name in lm.params:
# "Intercept" is beta0, "cigarettes" is the slope beta1
Y = lm.params["Intercept"] + lm.params["cigarettes"] * X
plt.plot(df.cigarettes, df.lung, "o")
plt.plot(X, Y, color="darkgreen")
plt.xlabel("Cigarettes smoked per day")
plt.ylabel("Lung cancer deaths")
plt.margins(0.1)
# We can use the linear model for **prediction**, to estimate the likelihood of death from lung cancer or from cardiovascular disease for a level of smoking for which we have no direct data. For example, the expected lung cancer mortality risk for a group of people who smoke 15 cigarettes per day is
# In[6]:
# use the fitted model for prediction
lm.predict({"cigarettes": [15]}) / 100000.0
# Note that we have divided by 100000, because the data given is the number of deaths per 100000 people.
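# To make the prediction arithmetic explicit (a supplementary illustration that refits the
# lung cancer model with `scipy.stats.linregress` rather than reusing the `lm` object above),
# the predicted value is simply beta0 + beta1 * 15, divided by 100000 to convert a count
# per 100000 people into a risk.

```python
import scipy.stats

cigarettes = [0, 10, 20, 30]
lung = [14, 105, 208, 355]
res = scipy.stats.linregress(cigarettes, lung)
# expected lung cancer deaths per 100000 people at 15 cigarettes/day
deaths = res.intercept + res.slope * 15
risk = deaths / 100000.0
print(f"predicted deaths per 100000: {deaths:.1f}")
print(f"mortality risk: {risk:.5f}")
```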
# We can examine some information on the “goodness of fit”, or the quality of the linear model:
# In[7]:
lm.summary()
# In particular, the $R^2$ value of 0.987 is very high, indicating a good fit to our dataset. However, given the small size of our dataset (only 4 observations), the 95% confidence intervals for the model parameters $\beta_0$ and $\beta_1$ are quite wide. In fact, our 4 data points summarize a large number of individual observations, so our confidence in this data is high, but unfortunately we don't have access to the original data (where each observation corresponds to one person).
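# The confidence intervals mentioned above can be read directly from the fitted model with
# its `conf_int()` method, and the $R^2$ value with the `rsquared` attribute (shown here on
# a self-contained refit of the lung cancer model, as an addition to the notebook):

```python
import pandas
import statsmodels.formula.api as smf

df = pandas.DataFrame({"cigarettes": [0, 10, 20, 30],
                       "lung": [14, 105, 208, 355]})
lm = smf.ols("lung ~ cigarettes", data=df).fit()
print(f"R^2 = {lm.rsquared:.3f}")
# 95% confidence intervals for beta0 (Intercept) and beta1 (cigarettes),
# one row per parameter, with lower and upper bounds as columns
print(lm.conf_int(alpha=0.05))
```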
# The [Seaborn package](https://seaborn.pydata.org/) provides convenient functions for making plots of linear regression models. In particular, the `regplot` function generates a regression plot that includes a translucent band showing the 95% confidence interval for the regression estimate.
# In[8]:
import seaborn as sns
sns.regplot(data=df, x="cigarettes", y="CVD");