NHANES - linear regression analysis

This notebook demonstrates a linear regression analysis using data from the National Health and Nutrition Examination Survey (NHANES).

The main NHANES web site is maintained by the CDC's National Center for Health Statistics.

To run this notebook, you will need to download the data files BMX_G.XPT, BPX_G.XPT, and DEMO_G.XPT from the NHANES data page. You can choose any survey wave for which these data files are available.

To read the data files, you will also need the xport.py module. Either install the xport package, or just unzip the archive, find the xport.py file, and place it into your working directory.

We begin with the import statements:

In [1]:
%matplotlib inline

import xport
import numpy as np
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.predict_functional import predict_functional


The following function streamlines the process of reading the xport format files.

In [2]:
def read_xport(fname):
    """
    Return a DataFrame containing the data in the xport file named fname.
    """
    # The reader comes from the xport.py module noted above.
    with xport.XportReader(fname) as reader:
        data = [row for row in reader]
    data = pd.DataFrame(data)
    data = data.set_index("SEQN")
    return data


The following cell reads three NHANES data files and merges them to form a single data set. The merge takes place using the SEQN variable as the index, where SEQN is the common identifier for subjects across all NHANES data sets.

In [3]:
fnames = ["BMX_G.XPT", "BPX_G.XPT", "DEMO_G.XPT"]
datasets = [read_xport(x) for x in fnames]
nhanes = pd.concat(datasets, axis=1)
print(nhanes.shape)

(9756, 86)
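Because each DataFrame is indexed by SEQN, `pd.concat` with `axis=1` aligns rows by subject rather than by position. A toy illustration (with made-up subject IDs, not NHANES data):

```python
import pandas as pd

# Two small frames sharing an index, standing in for two NHANES files.
a = pd.DataFrame({"BMXBMI": [21.0, 25.0]}, index=[101, 102])
b = pd.DataFrame({"BPXSY1": [118.0, 124.0, 110.0]}, index=[101, 102, 103])
a.index.name = b.index.name = "SEQN"

# The default outer join keeps every subject that appears in any file;
# subject 103 has no BMI record, so BMXBMI is NaN in that row.
merged = pd.concat([a, b], axis=1)
print(merged)
```

This is why the merged NHANES frame can contain missing values even when each source file is complete on its own subjects.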


These are the variables that we have to work with:

In [4]:
nhanes.columns

Out[4]:
Index([u'BMDBMIC', u'BMDSADCM', u'BMDSTATS', u'BMIARMC', u'BMIARML',
u'BMIHEAD', u'BMIHT', u'BMILEG', u'BMIRECUM', u'BMIWAIST', u'BMIWT',
u'BMXARMC', u'BMXARML', u'BMXBMI', u'BMXHEAD', u'BMXHT', u'BMXLEG',
u'BMXWAIST', u'BMXWT', u'BPAARM', u'BPACSZ', u'BPAEN1', u'BPAEN2',
u'BPAEN3', u'BPAEN4', u'BPQ150A', u'BPQ150B', u'BPQ150C', u'BPQ150D',
u'BPXCHR', u'BPXDI1', u'BPXDI2', u'BPXDI3', u'BPXDI4', u'BPXML1',
u'BPXPLS', u'BPXPTY', u'BPXPULS', u'BPXSY1', u'BPXSY2', u'BPXSY3',
u'BPXSY4', u'PEASCCT1', u'PEASCST1', u'PEASCTM1', u'AIALANGA',
u'DMDBORN4', u'DMDCITZN', u'DMDEDUC2', u'DMDEDUC3', u'DMDMARTL',
u'FIAPROXY', u'INDFMIN2', u'INDFMPIR', u'INDHHIN2', u'MIAINTRP',
u'MIALANG', u'MIAPROXY', u'RIAGENDR', u'RIDAGEMN', u'RIDAGEYR',
u'RIDEXAGM', u'RIDEXAGY', u'RIDEXMON', u'RIDEXPRG', u'RIDRETH1',
u'RIDRETH3', u'RIDSTATR', u'SDDSRVYR', u'SDMVPSU', u'SDMVSTRA',
u'SIAINTRP', u'SIALANG', u'SIAPROXY', u'WTINT2YR', u'WTMEC2YR'],
dtype='object')

Initial data checking

Linear regression analysis aims to recover the conditional mean of one variable given the values of one or more predictor variables. Here we will use systolic blood pressure (BPXSY1) as the outcome variable (also called the dependent or response variable), and we will explore several variables that may predict the variation in blood pressure.

It's a good idea to start by checking the distributions of some of the variables in the model. Immediately we see that there are a number of zeros in the blood pressure column. We will need to exclude these cases.

In [5]:
nhanes.BPXSY1.plot(kind='hist')
plt.xlabel("BPXSY1")
plt.ylabel("Frequency")

Out[5]:
<matplotlib.text.Text at 0x7f54474af790>

This is how we select the cases with positive blood pressure:

In [6]:
ii = nhanes["BPXSY1"] > 0
nhanes = nhanes.loc[ii, :]


After excluding the zeros, the distribution of blood pressure is somewhat skewed to the right:

In [7]:
nhanes.BPXSY1.plot(kind='hist')
plt.xlabel("BPXSY1", size=15)
plt.ylabel("Frequency", size=15)

Out[7]:
<matplotlib.text.Text at 0x7f54475aa890>

It's not generally necessary to transform the dependent variable in a linear regression analysis to have a symmetric distribution. But if we wish to do so, a generalized log transform (fit by trial and error) gives us approximate symmetry in this case.

In [8]:
nhanes["BPXSY1_tr"] = np.log(nhanes["BPXSY1"] - 30) # Generalized log transform
nhanes.BPXSY1_tr.plot(kind='hist')
plt.xlabel("Transformed BPXSY1", size=15)
plt.ylabel("Frequency", size=15)

Out[8]:
<matplotlib.text.Text at 0x7f544755ce90>
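The shift of 30 used above was found by trial and error. One way to make that search concrete is to score each candidate shift by the sample skewness of the transformed values; the sketch below does this on simulated right-skewed data (the variable `bp` is a stand-in, not the NHANES column):

```python
import numpy as np
from scipy.stats import skew

# Simulated stand-in for BPXSY1: positive, right-skewed values above 90.
rng = np.random.RandomState(0)
bp = 90 + 20 * rng.gamma(2.0, 1.0, size=5000)

# Try several shifts c and keep the one whose log-transformed data
# has skewness closest to zero (i.e. is most nearly symmetric).
best = min((abs(skew(np.log(bp - c))), c) for c in [0, 10, 20, 30, 40, 50])
print("best shift:", best[1])
```

The same loop applied to the real BPXSY1 column would recover a shift near the one chosen by eye.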

Since the units of the transformed scale aren't very meaningful, we will convert to Z-scores.

In [9]:
tr_mn = nhanes["BPXSY1_tr"].mean()
tr_sd = nhanes["BPXSY1_tr"].std()
nhanes["BPXSY1_trz"] = (nhanes.BPXSY1_tr - tr_mn) / tr_sd

nhanes.BPXSY1_trz.plot(kind='hist')
plt.xlabel("Transformed BPXSY1 Z-score", size=15)
plt.ylabel("Frequency", size=15)

Out[9]:
<matplotlib.text.Text at 0x7f5443acf710>

The categorical variables are coded with numerical labels, which are hard to remember. So next we replace the numeric codes with string labels.

In [10]:
nhanes["RIAGENDR"] = nhanes["RIAGENDR"].replace({1: "Male", 2: "Female"})
nhanes["RIDRETH1"] = nhanes["RIDRETH1"].replace({1: "MXA", 2: "OHS", 3: "NHW", 4: "NHB", 5: "OTH"})
nhanes["DMDEDUC2"] = nhanes["DMDEDUC2"].replace({0: np.nan, 1: "<9gr", 2: "9-11gr", 3: "HS", 4: "Some college", 5: "College", 7: np.nan, 9: np.nan})


Fitting a basic regression model

Here is a basic linear regression model, relating the original (untransformed) systolic blood pressure to age, gender, and BMI.

In [11]:
fml = "BPXSY1 ~ RIDAGEYR + RIAGENDR + BMXBMI"
model1 = sm.OLS.from_formula(fml, nhanes)
result1 = model1.fit()
print(result1.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 BPXSY1   R-squared:                       0.362
Method:                 Least Squares   F-statistic:                     1278.
Date:                Thu, 11 Jun 2015   Prob (F-statistic):               0.00
Time:                        16:19:23   Log-Likelihood:                -27870.
No. Observations:                6756   AIC:                         5.575e+04
Df Residuals:                    6752   BIC:                         5.578e+04
Df Model:                           3
Covariance Type:            nonrobust
====================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
Intercept           92.1711      0.695    132.708      0.000        90.810    93.533
RIAGENDR[T.Male]     4.0275      0.365     11.036      0.000         3.312     4.743
RIDAGEYR             0.4703      0.009     54.593      0.000         0.453     0.487
BMXBMI               0.2572      0.025     10.381      0.000         0.209     0.306
==============================================================================
Omnibus:                     1178.814   Durbin-Watson:                   2.017
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3765.017
Skew:                           0.888   Prob(JB):                         0.00
Kurtosis:                       6.197   Cond. No.                         198.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


Here is the same model with gender coded using males rather than females as the reference category.

In [12]:
fml = "BPXSY1 ~ RIDAGEYR + C(RIAGENDR, Treatment(reference='Male')) + BMXBMI"
model2 = sm.OLS.from_formula(fml, nhanes)
result2 = model2.fit()
print(result2.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 BPXSY1   R-squared:                       0.362
Method:                 Least Squares   F-statistic:                     1278.
Date:                Thu, 11 Jun 2015   Prob (F-statistic):               0.00
Time:                        16:19:23   Log-Likelihood:                -27870.
No. Observations:                6756   AIC:                         5.575e+04
Df Residuals:                    6752   BIC:                         5.578e+04
Df Model:                           3
Covariance Type:            nonrobust
======================================================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------------
Intercept                                             96.1985      0.677    142.081      0.000        94.871    97.526
C(RIAGENDR, Treatment(reference='Male'))[T.Female]    -4.0275      0.365    -11.036      0.000        -4.743    -3.312
RIDAGEYR                                               0.4703      0.009     54.593      0.000         0.453     0.487
BMXBMI                                                 0.2572      0.025     10.381      0.000         0.209     0.306
==============================================================================
Omnibus:                     1178.814   Durbin-Watson:                   2.017
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3765.017
Skew:                           0.888   Prob(JB):                         0.00
Kurtosis:                       6.197   Cond. No.                         192.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


A scatterplot of the residuals on the fitted values is a very common way to identify certain unusual features of a fitted regression model. Here we see that the variance increases moderately with the mean. This is shown by the plot, and also by the numerical summaries printed below. We might gain a bit of power by using generalized least squares (GLS), but won't do that here.

In [13]:
plt.plot(result1.fittedvalues, result1.resid, 'o', alpha=0.3)
plt.xlabel("Fitted values", size=15)
plt.ylabel("Residuals", size=15)

print(np.std(result2.resid[result1.fittedvalues < 120]))
print(np.std(result2.resid[result1.fittedvalues > 120]))

10.8430724579
18.6171968457


Fitting a more complex regression model

Here is a more complex model, adding ethnicity and an age by gender interaction.

In [14]:
fml = "BPXSY1 ~ RIDAGEYR + RIAGENDR + RIDRETH1 + BMXBMI + RIDAGEYR*RIAGENDR"
model3 = sm.OLS.from_formula(fml, nhanes)
result3 = model3.fit()
print(result3.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 BPXSY1   R-squared:                       0.378
Method:                 Least Squares   F-statistic:                     512.2
Date:                Thu, 11 Jun 2015   Prob (F-statistic):               0.00
Time:                        16:19:23   Log-Likelihood:                -27786.
No. Observations:                6756   AIC:                         5.559e+04
Df Residuals:                    6747   BIC:                         5.565e+04
Df Model:                           8
Covariance Type:            nonrobust
=============================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                    89.6112      0.889    100.796      0.000        87.868    91.354
RIAGENDR[T.Male]              8.6541      0.727     11.899      0.000         7.228    10.080
RIDRETH1[T.NHB]               4.1122      0.625      6.580      0.000         2.887     5.337
RIDRETH1[T.NHW]              -0.3654      0.615     -0.594      0.553        -1.572     0.841
RIDRETH1[T.OHS]              -0.3095      0.763     -0.406      0.685        -1.805     1.186
RIDRETH1[T.OTH]              -0.3171      0.682     -0.465      0.642        -1.654     1.020
RIDAGEYR                      0.5355      0.012     44.744      0.000         0.512     0.559
RIDAGEYR:RIAGENDR[T.Male]    -0.1184      0.016     -7.267      0.000        -0.150    -0.086
BMXBMI                        0.2216      0.025      8.918      0.000         0.173     0.270
==============================================================================
Omnibus:                     1083.940   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3361.496
Skew:                           0.828   Prob(JB):                         0.00
Kurtosis:                       6.033   Cond. No.                         400.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


Ethnicity is coded in terms of four "indicator variables", each contrasting a given ethnic group relative to the Mexican American group (MXA). We see that only the NHB (non-Hispanic black) subgroup has a strongly different systolic blood pressure effect compared to the Mexican American subgroup. Below we will show how to obtain different subgroup comparisons.
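The indicator coding can be inspected directly with patsy, the library statsmodels uses to interpret formulas. A toy example:

```python
import pandas as pd
from patsy import dmatrix

# One row per ethnicity level, just to expose the design matrix columns.
toy = pd.DataFrame({"RIDRETH1": ["MXA", "NHB", "NHW", "OHS", "OTH"]})
dm = dmatrix("RIDRETH1", toy)
print(dm.design_info.column_names)
# The first level in sorted order (MXA) is absorbed into the intercept;
# each remaining level gets a 0/1 indicator contrasting it with MXA.
```

Changing the reference category, as done in the next cell, simply changes which level is absorbed into the intercept.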

The age by gender interaction indicates that systolic blood pressure increases faster with age in females than in males (although females have lower blood pressure through most of the range of the data).

Next we fit the same regression model using a different reference category (NHW) for the ethnicity variable. Note that this is exactly the same model as fit in the cell above, so goodness of fit measures like the R-squared will not change. By using the NHW (non-Hispanic white) subgroup as the reference category, the effects of the other ethnic categories are interpreted as comparisons relative to that group.

In [15]:
fml = "BPXSY1 ~ RIDAGEYR + RIAGENDR + C(RIDRETH1, Treatment(reference='NHW')) + BMXBMI + RIDAGEYR*RIAGENDR"
model4 = sm.OLS.from_formula(fml, nhanes)
result4 = model4.fit()
print(result4.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 BPXSY1   R-squared:                       0.378
Method:                 Least Squares   F-statistic:                     512.2
Date:                Thu, 11 Jun 2015   Prob (F-statistic):               0.00
Time:                        16:19:24   Log-Likelihood:                -27786.
No. Observations:                6756   AIC:                         5.559e+04
Df Residuals:                    6747   BIC:                         5.565e+04
Df Model:                           8
Covariance Type:            nonrobust
==================================================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------------------------------
Intercept                                         89.2458      0.817    109.263      0.000        87.645    90.847
RIAGENDR[T.Male]                                   8.6541      0.727     11.899      0.000         7.228    10.080
C(RIDRETH1, Treatment(reference='NHW'))[T.MXA]     0.3654      0.615      0.594      0.553        -0.841     1.572
C(RIDRETH1, Treatment(reference='NHW'))[T.NHB]     4.4777      0.469      9.542      0.000         3.558     5.398
C(RIDRETH1, Treatment(reference='NHW'))[T.OHS]     0.0559      0.640      0.087      0.930        -1.200     1.311
C(RIDRETH1, Treatment(reference='NHW'))[T.OTH]     0.0483      0.544      0.089      0.929        -1.018     1.114
RIDAGEYR                                           0.5355      0.012     44.744      0.000         0.512     0.559
RIDAGEYR:RIAGENDR[T.Male]                         -0.1184      0.016     -7.267      0.000        -0.150    -0.086
BMXBMI                                             0.2216      0.025      8.918      0.000         0.173     0.270
==============================================================================
Omnibus:                     1083.940   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3361.496
Skew:                           0.828   Prob(JB):                         0.00
Kurtosis:                       6.033   Cond. No.                         312.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


The NHB ethnic group has a higher mean BMI than the NHW ethnic group (see the next cell), but this does not explain all of the increased blood pressure observed when comparing these two groups, because the effect for NHB ethnicity remains strong even when BMI is included in the model.

In [16]:
nhanes.groupby("RIDRETH1").agg({"BMXBMI": np.mean})

Out[16]:
BMXBMI
RIDRETH1
MXA 25.845588
NHB 27.412716
NHW 26.427285
OHS 26.456215
OTH 23.376623

Next we plot the fitted distributions of female subjects in two ethnic subgroups, with all quantitative variables in the model held fixed at their mean values. This illustrates that a regression effect that is very statistically significant may be a weak predictor at the level of single individuals.

In [17]:
nhb = nhanes.mean(0)
nhb = pd.DataFrame(nhb).T # overcomes Statsmodels bug
nhb["RIDRETH1"] = "NHB"
nhb["RIAGENDR"] = "Female"

nhw = nhb.copy()
nhw["RIDRETH1"] = "NHW"

# The fitted means of the two ethnic groups.
nhb_lp = result4.predict(exog=nhb)
nhw_lp = result4.predict(exog=nhw)

# The fitted standard deviation (common to all ethnic groups)
sd = np.sqrt(result4.scale)

from scipy.stats.distributions import norm

grid = np.linspace(70, 170, 100)
nhw_dens = norm.pdf(grid, loc=nhw_lp, scale=sd)
nhb_dens = norm.pdf(grid, loc=nhb_lp, scale=sd)

plt.clf()
ax = plt.axes([0.1, 0.1, 0.72, 0.8])
plt.plot(grid, nhb_dens, '-', lw=3, label="NHB", alpha=0.5)
plt.plot(grid, nhw_dens, '-', lw=3, label="NHW", alpha=0.5)
ha, lb = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "center right")
leg.draw_frame(False)
plt.xlabel("Systolic blood pressure", size=15)
_ = plt.ylabel("Density", size=15)


Turning now to the age by gender interaction, it is helpful to plot the fitted blood pressure against age for females and for males. The interaction allows the slopes and intercepts of these two lines to differ by gender, as is illustrated clearly in the plot. These fitted values estimate the mean blood pressure for people with a given gender, at a given age, with BMI held fixed at the mean value, and the ethnicity variable fixed at MXA (Mexican American).

In [18]:
values = {"RIDRETH1" : "MXA", "RIAGENDR" : "Female"}
summaries = {"BMXBMI" : np.mean}
pr_f, cb_f, xv_f = predict_functional(result4, "RIDAGEYR", values=values, summaries=summaries)

values["RIAGENDR"] = "Male"
pr_m, cb_m, xv_m = predict_functional(result4, "RIDAGEYR", values=values, summaries=summaries)

plt.clf()
ax = plt.axes([0.1, 0.1, 0.8, 0.8])
plt.plot(xv_f, pr_f, '-', lw=3, label="Female", alpha=0.6)
plt.plot(xv_m, pr_m, '-', lw=3, label="Male", alpha=0.6)
ha, lb = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "upper center", ncol=2)
leg.draw_frame(False)
plt.xlabel("Age", size=15)
plt.ylabel("Systolic BP", size=15)

/.../predict_functional.py:169: UserWarning: 'DMDBORN4', 'INDHHIN2', ... 'BMDSTATS' in data frame but not in summaries or values.

Out[18]:
<matplotlib.text.Text at 0x7f54439b5850>

Capturing nonlinear effects with splines

The effects of age and BMI are unlikely to be linear. We can capture nonlinear effects by using splines. The analysis in the cell below uses splines to capture a nonlinear effect of age. The effect is allowed to differ by gender. For models with interactions, it is usually more relevant to focus on the fitted values, or on contrasts involving the fitted values, rather than on the coefficients. The coefficients can be tricky to interpret correctly.

In [20]:
fml = "BPXSY1 ~ bs(RIDAGEYR, df=4) + RIAGENDR + RIDRETH1 + BMXBMI + bs(RIDAGEYR, df=4)*RIAGENDR"
model5 = sm.OLS.from_formula(fml, nhanes)
result5 = model5.fit()

values = {"RIDRETH1" : "MXA", "RIAGENDR" : "Female"}
summaries = {"BMXBMI" : np.mean}
pr_f, cb_f, xv_f = predict_functional(result5, "RIDAGEYR", values=values, summaries=summaries)
values["RIAGENDR"] = "Male"
pr_m, cb_m, xv_m = predict_functional(result5, "RIDAGEYR", values=values, summaries=summaries)

plt.clf()
ax = plt.axes([0.1, 0.1, 0.8, 0.8])
plt.plot(xv_f, pr_f, '-', lw=3, label="Female", alpha=0.6)
plt.plot(xv_m, pr_m, '-', lw=3, label="Male", alpha=0.6)
ha, lb = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, lb, "upper center", ncol=2)
leg.draw_frame(False)
plt.xlabel("Age", size=15)
plt.ylabel("Systolic BP", size=15)

Out[20]:
<matplotlib.text.Text at 0x7f5441fc9e90>
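To see what `bs(RIDAGEYR, df=4)` contributes to the design matrix, the basis can be expanded directly with patsy. A toy example on an arbitrary age grid:

```python
import numpy as np
from patsy import dmatrix

# A small age grid standing in for RIDAGEYR.
age = np.linspace(10, 70, 7)
dm = dmatrix("bs(x, df=4)", {"x": age})
print(dm.design_info.column_names)
# One intercept plus four B-spline basis columns; the fitted age effect
# is a linear combination of these four smooth functions of age.
```

This is why the spline model above has several age coefficients per gender: each basis column gets its own coefficient, and the interaction doubles the set.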

Interactions among categorical variables

When we have an interaction between two categorical variables, every combination of levels in the two variables has its own fitted mean.

In [21]:
fml = "BPXSY1 ~ RIDAGEYR + RIAGENDR + RIDRETH1 + BMXBMI + DMDEDUC2*RIDRETH1"
model6 = sm.OLS.from_formula(fml, nhanes)
result6 = model6.fit()

# Take two rows to work around bug #1881 in Statsmodels
df = nhanes.iloc[0:2, :].copy()
df["BMXBMI"] = nhanes.BMXBMI.mean()
df["RIAGENDR"] = "Female"
df["RIDAGEYR"] = nhanes.RIDAGEYR.mean()

eth_levels = [x for x in nhanes.RIDRETH1.unique() if pd.notnull(x)]
educ_levels = [x for x in nhanes.DMDEDUC2.unique() if pd.notnull(x)]

table = pd.DataFrame(index=eth_levels, columns=educ_levels, dtype=np.float64)

for eth in eth_levels:
    for educ in educ_levels:
        df["RIDRETH1"] = eth
        df["DMDEDUC2"] = educ
        mn = result6.predict(exog=df)
        table.loc[eth, educ] = mn[0]

# Reorder the columns in a meaningful way before printing
table = table[["<9gr", "9-11gr", "HS", "Some college", "College"]]
print(table)

           <9gr      9-11gr          HS  Some college     College
NHW  118.032587  117.378822  116.507332    116.428064  113.350757
OTH  115.981212  116.596849  118.201154    118.251624  112.956681
NHB  121.957790  123.312303  121.848706    121.787215  117.015957
MXA  119.876328  113.982438  115.262352    114.887819  111.688089
OHS  116.657597  118.372319  115.285362    113.403973  114.944428


Exercises

• Use regression analysis to assess whether weight and/or height may be associated with blood pressure even after controlling for BMI (note that BMI is a function of weight and height).

• Is there any evidence of a BMI x gender interaction, a BMI x age interaction, or a BMI x gender x age interaction? If so, how would these be interpreted?

• There is a second systolic blood pressure measurement in the dataset. If you average the two measures and use the average as an outcome in the analyses above, do the results change in any important ways?