Relative risk regression

This notebook shows how the log function can be used as the link in a binomial GLM (logistic regression) so that the fitted coefficients can be interpreted in terms of relative risks rather than in terms of odds ratios. This type of GLM is usually called a "log binomial model", or sometimes "relative risk regression".

Keywords: GLM, log binomial model, relative risk, odds ratio, link function

Here are the import statements:

In [1]:
import pandas as pd
import numpy as np
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import family
from statsmodels.genmod.families import links

We demonstrate relative risk regression using data from a study of women who gave birth to low birthweight babies. The low birthweight status is coded as a binary value (birthweight less than 2.5 kg), so we will use logistic regression to relate the birthweight status to risk factors.

More information about the dataset can be found at the following link:

http://vincentarelbundock.github.io/Rdatasets/doc/MASS/birthwt.html

After reading the data set, we recode the race categories from integers to short string labels.

In [2]:
data = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/birthwt.csv")
data["race_str"] = data["race"].astype(str)
data["race_str"].replace("1", "white", inplace=True)
data["race_str"].replace("2", "black", inplace=True)
data["race_str"].replace("3", "other", inplace=True)

We start with a standard logistic regression using the default logit link. We include main effects of all the risk factors as covariates.

In [6]:
mod = GLM.from_formula("low ~ age + lwt + race_str + smoke + ht + ui + ftv", data, family=family.Binomial())
rslt = mod.fit()
print rslt.summary()
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      180
Model Family:                Binomial   Df Model:                            8
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -101.92
Date:                Sun, 08 Jun 2014   Deviance:                       203.83
Time:                        08:32:29   Pearson chi2:                     183.
No. Iterations:                     6                                         
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept             1.7446      1.245      1.401      0.161        -0.696     4.185
race_str[T.other]    -0.3707      0.536     -0.692      0.489        -1.421     0.680
race_str[T.white]    -1.2898      0.528     -2.444      0.015        -2.324    -0.256
age                  -0.0205      0.036     -0.570      0.568        -0.091     0.050
lwt                  -0.0165      0.007     -2.409      0.016        -0.030    -0.003
smoke                 1.0416      0.395      2.634      0.008         0.266     1.817
ht                    1.8851      0.695      2.713      0.007         0.523     3.247
ui                    0.9041      0.449      2.015      0.044         0.025     1.783
ftv                   0.0591      0.172      0.344      0.731        -0.278     0.396
=====================================================================================

Here we have fit a regression model using logistic regression with the logit link function. The coefficient for a given covariate can be interpreted as the log odds ratio corresponding to a one unit change in the covariate, while holding all the other covariates fixed.

For example, suppose that p is the probability that a smoking mother gives birth to a low birthweight baby (for fixed values of the other variables, race, age, etc.). The odds for this event is p/(1-p).

Now suppose that q is the probability that a non-smoking mother gives birth to a low birthweight baby. The probability for this event is q, and the odds is 1/(1-q).

The odds ratio is the ratio of these two odds values: (p/(1-p)) / (q/(1-q)) = p(1-q) / (q(1-p)).

In the fitted model above, the log odds ratio for smoking is 1.0416, so the odds ratio is np.exp(1.0416) which is roughly 2.8. Thus a woman who smokes has 2.8 times greater odds for giving birth to a low birthweight baby compared to a woman who does not smoke. Note that smoking is a binary variable, coded as 0/1, so a one unit chance in this variable contrasts a non-smoker to a smoker.

Next we fit the logistic regression model again, using the log link function instead of the default logit link function.

In [7]:
mod = GLM.from_formula("low ~ age + lwt + race_str + smoke + ht + ui + ftv", data, family=family.Binomial(links.log))
rslt = mod.fit()
print rslt.summary()
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      180
Model Family:                Binomial   Df Model:                            8
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                    nan
Date:                Sun, 08 Jun 2014   Deviance:                          nan
Time:                        08:32:36   Pearson chi2:                 5.05e+10
No. Iterations:                     2                                         
=====================================================================================
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            -0.5037      0.057     -8.763      0.000        -0.616    -0.391
race_str[T.other]    -0.4928   1.02e-05  -4.83e+04      0.000        -0.493    -0.493
race_str[T.white]    -1.0055   1.39e-05  -7.24e+04      0.000        -1.006    -1.005
age                  -0.0165   8.51e-07  -1.93e+04      0.000        -0.016    -0.016
lwt                   0.0013   1.91e-07   6673.218      0.000         0.001     0.001
smoke                 0.7109      0.057     12.368      0.000         0.598     0.824
ht                    0.7708      0.057     13.410      0.000         0.658     0.883
ui                    0.3347   8.49e-06   3.94e+04      0.000         0.335     0.335
ftv                   0.0692      0.019      3.612      0.000         0.032     0.107
=====================================================================================

The coefficient of smoking is now 0.7109. If we exponentiate this, we get roughly 2. This is the estimated "relative risk", or "risk ratio" comparing a woman who smokes to one who does not (holding the effects of all other variables fixed). The relative risk is the ratio of the probability of a smoking woman having a low-birthweight baby to the probabiity of a non-smoking woman having a low-birthweight baby, which is p/q in the notation used above.

So a smoking woman is roughly 2 times more likely to have a low-birthweight baby compared to a non-smoking woman. But her odds of having a low-birthweight baby differ by a factor of 2.8.

This is the usual situation -- changes in the odds are greater than changes in the relative risk. This doesn't mean that one of these approaches is a better way to capture the relationship between a risk factor and an outcome. It just reflects the fact that they are different measures.

One issue with the log binomial model is that if the linear predictor is positive, the fitted probability will be greater than 1. This is because the fitted probability is the exponentiated value of the linear predictor (the exponential function being the inverse of the log link function). If only a small fraction of the fitted probabilites exceed 1, this is not a big problem. In this data set, around 5% of the fitted probabilities exceed 1.

In [8]:
lin_pred = rslt.predict(linear=True)
plt.hist(lin_pred)
plt.xlabel("Linear predictor")
print np.mean(lin_pred > 0)
0.0529100529101