Scatterplot smoothing

Key ideas: Smoothing, NHANES data

This notebook demonstrates the scatterplot smoothing capabilities of Statsmodels using data from the National Health and Nutrition Examination Study (NHANES).

The function that implements smoothing is called lowess and is located in statsmodels.nonparametric.smoothers_lowess. First we import lowess and some other modules that we will need. Note that the xport module is needed to load the SAS xport format files that we will be using (the NHANES data are distributed in this format). This is a nonstandard library so you may need to install it before continuing.

In [1]:
import xport
import numpy as np
import pandas as pd
import StringIO
import urllib2
from statsmodels.nonparametric.smoothers_lowess import lowess
import matplotlib.pyplot as plt

The following function downloads an XPORT format data file from a URL and returns it as a Pandas data frame. The raw XPORT data is first stored as a list of dictionaries, with each dictionary mapping variable names to data values for one subject. The variable named "SEQN" is the unique subject id, so we pop that out and use it as the index in our data frame.

In [2]:
def read_xport_from_url(urlname):
    
    url = urllib2.urlopen(urlname)
    zdata = url.read()
    sio = StringIO.StringIO(zdata)
    with xport.XportReader(sio) as reader:
        Z = [row for row in reader]
            
    Ix = [z.pop("SEQN") for z in Z]
    data = pd.DataFrame(Z, index=Ix)
        
    return data

Next we use this function to obtain the demographic and body measurement data. These are (mostly) different variables measured on (mostly) the same subjects.

In [3]:
urlname = "ftp://ftp.cdc.gov/pub/Health_Statistics/nchs/nhanes/2011-2012/DEMO_G.XPT"
demo = read_xport_from_url(urlname)

urlname = "ftp://ftp.cdc.gov/pub/Health_Statistics/nchs/nhanes/2011-2012/BMX_G.XPT"
bmx = read_xport_from_url(urlname)

We want to merge these files, but there are some variable name clashes in the two data sets, so we augment each variable name with the file name.

In [4]:
demo.columns = ["DEMO:" + x for x in demo.columns]
bmx.columns = ["BMX:" + x for x in bmx.columns]

Now we can do the merge:

In [5]:
data = pd.concat((demo, bmx), axis=1)

First we'll make a scatterplot showing BMI as a function of age, restricting to adults (since the BMI changes very rapidly for children). Note that this is a cross-sectional data set, so the changes with respect to age could be partially attributable to birth cohort effects.

In [12]:
ii = data["DEMO:RIDAGEYR"] >= 18
L = lowess(data.loc[ii, "BMX:BMXBMI"], data.loc[ii, "DEMO:RIDAGEYR"])
plt.plot(L[:,0], L[:,1], '-', lw=4, color='orange')
plt.xlabel("Age", size=14)
plt.ylabel("BMI", size=14)
plt.xlim(15, 85)
plt.gca().set_xticks(range(20, 81, 10))
Out[12]:
[<matplotlib.axis.XTick at 0x51ec250>,
 <matplotlib.axis.XTick at 0x4e1aa10>,
 <matplotlib.axis.XTick at 0x4e221d0>,
 <matplotlib.axis.XTick at 0x4e2ff50>,
 <matplotlib.axis.XTick at 0x4e2fd50>,
 <matplotlib.axis.XTick at 0x51db450>,
 <matplotlib.axis.XTick at 0x51dbcd0>]

lowess has a parameter called "frac" that determines the smoothness of the fitted curve. It defaults to 0.66, but you can reduce it to get a less smooth curve (less bias, more variance), and you can increase it to get a smoother curve (more bias, less variance). Setting frac=0.1 is clearly too small:

In [7]:
ii = data["DEMO:RIDAGEYR"] >= 18
L = lowess(data["BMX:BMXBMI"][ii], data["DEMO:RIDAGEYR"][ii], frac=0.1)
plt.plot(L[:,0], L[:,1], '-')
plt.xlabel("Age")
plt.ylabel("BMI")
Out[7]:
<matplotlib.text.Text at 0x4193890>

It is also useful to know how much data are available in different parts of the domain. Note that the ages are truncated at 80, so if a person is over 80 years old, the age is reported as 80.

In [21]:
plt.clf()
plt.hist(np.asarray(data["DEMO:RIDAGEYR"]), color='orange', alpha=0.7)
plt.xlabel("Age", size=14)
plt.ylabel("Frequency", size=14)
Out[21]:
<matplotlib.text.Text at 0x541db90>

Next we do the smoothing analysis again, now startifying by gender.

In [28]:
i1 = (data["DEMO:RIDAGEYR"] >= 18) & (data["DEMO:RIAGENDR"] == 1)
i2 = (data["DEMO:RIDAGEYR"] >= 18) & (data["DEMO:RIAGENDR"] == 2)
L1 = lowess(data["BMX:BMXBMI"][i1], data["DEMO:RIDAGEYR"][i1])
L2 = lowess(data["BMX:BMXBMI"][i2], data["DEMO:RIDAGEYR"][i2])
plt.plot(L1[:,0], L1[:,1], '-', label="Male", lw=4, alpha=0.7, color='orange')
plt.plot(L2[:,0], L2[:,1], '-', label="Female", lw=4, alpha=0.7, color='purple')
leg = plt.legend(loc="upper left")
leg.draw_frame(False)
plt.xlabel("Age", size=14)
plt.ylabel("BMI", size=14)
plt.xlim(10, 90)
plt.ylim(24, 30)
Out[28]:
(24, 30)