Regularized regression, clustering, and dimension reduction in sklearn

sklearn (scikit-learn) is a machine learning library for Python. Here we will demonstrate a few of its features using data from the National Health and Nutrition Examination Survey (NHANES).

First we import some libraries:

In [54]:
import sklearn
import pandas as pd
import patsy
import matplotlib.pyplot as plt
import numpy as np

Raw data for various waves of NHANES data can be obtained from the CDC web site for NHANES: http://www.cdc.gov/nchs/nhanes.htm.

The files are in SAS XPORT format, and can be read as follows:

In [55]:
demog = pd.read_sas("DEMO_F.XPT")
bpx = pd.read_sas("BPX_F.XPT")
bmx = pd.read_sas("BMX_F.XPT")

Next we merge the three files using the unique subject identifier, SEQN.

In [56]:
data = pd.merge(demog, bpx, on="SEQN")
data = pd.merge(data, bmx, on="SEQN")
print(data.columns)
Index(['SEQN', 'SDDSRVYR', 'RIDSTATR', 'RIDEXMON', 'RIAGENDR', 'RIDAGEYR',
       'RIDAGEMN', 'RIDAGEEX', 'RIDRETH1', 'DMQMILIT', 'DMDBORN2', 'DMDCITZN',
       'DMDYRSUS', 'DMDEDUC3', 'DMDEDUC2', 'DMDSCHOL', 'DMDMARTL', 'DMDHHSIZ',
       'DMDFMSIZ', 'INDHHIN2', 'INDFMIN2', 'INDFMPIR', 'RIDEXPRG', 'DMDHRGND',
       'DMDHRAGE', 'DMDHRBR2', 'DMDHREDU', 'DMDHRMAR', 'DMDHSEDU', 'SIALANG',
       'SIAPROXY', 'SIAINTRP', 'FIALANG', 'FIAPROXY', 'FIAINTRP', 'MIALANG',
       'MIAPROXY', 'MIAINTRP', 'AIALANG', 'WTINT2YR', 'WTMEC2YR', 'SDMVPSU',
       'SDMVSTRA', 'PEASCST1', 'PEASCTM1', 'PEASCCT1', 'BPXCHR', 'BPQ150A',
       'BPQ150B', 'BPQ150C', 'BPQ150D', 'BPAARM', 'BPACSZ', 'BPXPLS',
       'BPXPULS', 'BPXPTY', 'BPXML1', 'BPXSY1', 'BPXDI1', 'BPAEN1', 'BPXSY2',
       'BPXDI2', 'BPAEN2', 'BPXSY3', 'BPXDI3', 'BPAEN3', 'BPXSY4', 'BPXDI4',
       'BPAEN4', 'BMDSTATS', 'BMXWT', 'BMIWT', 'BMXRECUM', 'BMIRECUM',
       'BMXHEAD', 'BMIHEAD', 'BMXHT', 'BMIHT', 'BMXBMI', 'BMXLEG', 'BMILEG',
       'BMXARML', 'BMIARML', 'BMXARMC', 'BMIARMC', 'BMXWAIST', 'BMIWAIST',
       'BMXTRI', 'BMITRI', 'BMXSUB', 'BMISUB'],
      dtype='object')

Next we form a smaller data set containing some variables of possible interest. We then tabulate the fraction of missing values for each of these variables.

In [57]:
df = data[["RIDAGEYR", "BPXSY1", "RIAGENDR", "BMXBMI", "BMXWT", "BMXRECUM", "BMILEG", "BMXARML", "BMXHT", "BMXWAIST", "BMXTRI", "DMDMARTL"]]
df.isnull().mean()
Out[57]:
RIDAGEYR    0.000000
BPXSY1      0.265678
RIAGENDR    0.000000
BMXBMI      0.082025
BMXWT       0.008875
BMXRECUM    0.882376
BMILEG      0.964986
BMXARML     0.053155
BMXHT       0.081245
BMXWAIST    0.124842
BMXTRI      0.087292
DMDMARTL    0.409051
dtype: float64

Since we will need complete cases below, we drop the two variables with high rates of missing values, then drop all cases with any missing values on the remaining variables.

In [58]:
df = df.drop(["BMILEG", "BMXRECUM"], axis=1)
df = df.dropna()
df.shape
Out[58]:
(5058, 10)

We need an indicator variable for gender.

In [59]:
df["FEMALE"] = 1*(df["RIAGENDR"] == 2)

Regularized regression

Regularized regression is well developed in sklearn. Here we will demonstrate two regularized regression methods: ridge regression and the lasso.

sklearn does not handle model formulas, but we can use patsy to build the design matrices from a formula, then pass the resulting arrays to sklearn.

In [60]:
y, x = patsy.dmatrices("BPXSY1 ~ 0 + RIDAGEYR + FEMALE + BMXBMI + BMXWT", data=df, return_type='dataframe')

When working with regularized regression models, the predictors and the outcome should usually be standardized before fitting, so that the penalty treats all coefficients on a common scale.

In [61]:
xnames = x.columns
x = np.asarray(x)
y = np.asarray(y)
y = (y - y.mean()) / y.std()
x = (x - x.mean(0)) / x.std(0)
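
The same standardization can also be done with sklearn's preprocessing utilities; an equivalent minimal sketch:

from sklearn.preprocessing import scale

# scale() centers each column and scales it to unit standard deviation,
# matching the manual standardization above.
x = scale(x)
y = scale(y)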

Ridge regression

Now we are ready to fit a sequence of ridge regression models over a grid of regularization parameters. Ridge regression minimizes the residual sum of squares plus alpha times the sum of squared coefficients, so larger values of alpha shrink the estimates more strongly toward zero.

In [62]:
from sklearn import linear_model
clf = linear_model.Ridge(fit_intercept=False)
alphas = np.linspace(0, 40, 100)

coefs = []
for a in alphas:
    clf.set_params(alpha=a*len(y))
    clf.fit(x, y)
    coefs.append(clf.coef_.ravel())

coefs = np.asarray(coefs) 

It is common to visualize the results by plotting the coefficients against the regularization parameter.

In [63]:
ax = plt.gca()
for j,c in enumerate(coefs.T):
    plt.plot(alphas, c, label=xnames[j])
ha,lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="upper right")
plt.xlabel('alpha', size=16)
plt.ylabel('Coefficients', size=16)
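The coefficient paths show how the estimates shrink as alpha grows, but they do not by themselves select a value of alpha. A common approach is cross-validation; below is a minimal sketch using sklearn's RidgeCV (the alpha grid and the names cv_alphas, rcv are arbitrary illustrations):

from sklearn.linear_model import RidgeCV

# Candidate penalty weights, using the same n-scaling as above.
cv_alphas = np.linspace(0.01, 40, 50) * len(y)

# RidgeCV selects among the candidate alphas by cross-validation.
rcv = RidgeCV(alphas=cv_alphas, fit_intercept=False)
rcv.fit(x, y.ravel())
print(rcv.alpha_ / len(y))  # selected alpha, rescaled back to the plot's scale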
Next we repeat the ridge analysis with a larger set of predictors.

In [64]:
y, x = patsy.dmatrices("BPXSY1 ~ 0 + RIDAGEYR + FEMALE + BMXBMI + BMXWT + BMXHT + BMXWAIST + BMXTRI", data=df, return_type='dataframe')
In [65]:
xnames = x.columns
x = np.asarray(x)
y = np.asarray(y)
y = (y - y.mean()) / y.std()
x = (x - x.mean(0)) / x.std(0)
In [66]:
clf = linear_model.Ridge(fit_intercept=False)

alphas = np.linspace(0, 4, 100)
coefs = []
for a in alphas:
    clf.set_params(alpha=a*len(y))
    clf.fit(x, y)
    coefs.append(clf.coef_.ravel())

coefs = np.asarray(coefs) 
In [67]:
ax = plt.gca()
for j,c in enumerate(coefs.T):
    plt.plot(alphas, c, label=xnames[j])
ha,lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="upper right")
plt.xlabel('alpha', size=16)
plt.ylabel('Coefficients', size=16)
plt.xlim(0, 4)
Lasso regression

The lasso replaces the ridge penalty with an L1 penalty on the coefficients, which can shrink some coefficients exactly to zero. We again fit models over a grid of regularization parameters.

In [68]:
clf = linear_model.Lasso(fit_intercept=False)

alphas = np.linspace(0.000001, 0.0001, 100)
coefs = []
for a in alphas:
    clf.set_params(alpha=a*len(y))
    clf.fit(x, y)
    coefs.append(clf.coef_.ravel())

coefs = np.asarray(coefs)

As before, we plot the coefficient paths against the regularization parameter.

In [69]:
ax = plt.gca()
for j,c in enumerate(coefs.T):
    plt.plot(alphas, c, label=xnames[j])
ha,lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="upper right")
plt.xlabel('alpha', size=16)
plt.ylabel('Coefficients', size=16)
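As with ridge regression, the lasso penalty weight is usually chosen by cross-validation rather than by inspecting the path. A minimal sketch using LassoCV (the 5-fold split is an arbitrary choice):

from sklearn.linear_model import LassoCV

# LassoCV fits the lasso path and picks alpha by k-fold cross-validation.
lcv = LassoCV(fit_intercept=False, cv=5)
lcv.fit(x, y.ravel())
print(lcv.alpha_)
print(pd.Series(lcv.coef_, index=xnames))  # coefficients at the selected alpha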

Principal Components Analysis

To illustrate PCA, we look at five related body measures, restricting to subjects who are 30 to 40 years old.

In [70]:
from sklearn.decomposition import PCA

dfx = df.loc[(df.RIDAGEYR >= 30) & (df.RIDAGEYR <= 40)]
x = dfx[["BMXBMI", "BMXWT", "BMXHT", "BMXWAIST", "BMXTRI"]]

pca = PCA(n_components=2)
pca.fit(x)
print(pca.explained_variance_ratio_)
scores = pca.transform(x)
ixf = np.flatnonzero(dfx.FEMALE == 1)
ixm = np.flatnonzero(dfx.FEMALE == 0)
plt.plot(scores[ixf, 0], scores[ixf, 1], 'o', color='orange', alpha=0.2)
plt.plot(scores[ixm, 0], scores[ixm, 1], 'o', color='purple', alpha=0.2)
plt.xlabel("PC 1", size=16)
plt.ylabel("PC 2", size=16)
pca.components_
[ 0.77675288  0.16851934]
Out[70]:
array([[-0.21293995, -0.77690989, -0.16892843, -0.55175561, -0.13452333],
       [ 0.1965342 , -0.18023592, -0.74613922,  0.27341554,  0.54535266]])
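
Since we ran PCA on the raw data, the loadings above are dominated by the variables with the largest variances (weight and waist circumference). A common alternative is to standardize the columns first; a sketch of that variant (xs, pca_s are illustrative names):

from sklearn.preprocessing import StandardScaler

# Standardize each body measure, then redo the PCA on the scaled data.
xs = StandardScaler().fit_transform(x)
pca_s = PCA(n_components=2)
scores_s = pca_s.fit_transform(xs)
print(pca_s.explained_variance_ratio_)
print(pca_s.components_)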

As a second example, we look at three systolic blood pressure measurements, which are replicate readings taken during the same assessment visit. These are quite strongly correlated.

In [71]:
dfx = data[["BPXSY1", "BPXSY2", "BPXSY3"]].dropna()
np.corrcoef(dfx.T)
Out[71]:
array([[ 1.        ,  0.95390956,  0.94169424],
       [ 0.95390956,  1.        ,  0.95685065],
       [ 0.94169424,  0.95685065,  1.        ]])
In [90]:
pca = PCA(n_components=3)
pca.fit(dfx)
print(pca.explained_variance_ratio_)
print(pca.components_)
scores = pca.transform(dfx)
[ 0.96720398  0.01964809  0.01314793]
[[ 0.59301361  0.5775971   0.56099594]
 [-0.7527203   0.15027296  0.64096036]
 [ 0.28591432 -0.80237126  0.52388297]]
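
The first component explains about 97% of the variance and loads nearly equally on the three measurements, so its scores should essentially track each subject's average SBP. A quick check of this (the correlation should be close to 1 in absolute value, since the sign of a component is arbitrary):

# Compare the PC1 scores to the within-subject mean of the three readings.
mean_sbp = dfx.mean(axis=1)
print(np.corrcoef(scores[:, 0], mean_sbp)[0, 1])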
To interpret the components, for each PC we plot the three SBP values for the five subjects with the lowest scores (blue) and the five with the highest scores (red), followed by a histogram of the scores.

In [91]:
for k in range(3):
    # Order the subjects by their score on component k.
    ii = np.argsort(scores[:, k])
    plt.figure()
    plt.gca().set_xticks([0, 1, 2])
    plt.xlabel("SBP measurement number", size=15)
    plt.ylabel("SBP", size=16)
    plt.title("PC " + str(k+1))
    # Plot the five subjects with the lowest scores (blue) and the five
    # with the highest scores (red) on this component.
    for j in range(5):
        plt.plot(dfx.iloc[ii[j], :].values, color='blue')
        plt.plot(dfx.iloc[ii[-j-1], :].values, color='red')
    # Histogram of the scores on component k.
    plt.figure()
    plt.hist(scores[:, k])
        

K-means

Finally, we apply K-means clustering to the same five body measures used above, again restricting to subjects aged 30 to 40.

In [113]:
from sklearn.cluster import KMeans

dfx = df.loc[(df.RIDAGEYR >= 30) & (df.RIDAGEYR <= 40)]
x = dfx[["BMXBMI", "BMXWT", "BMXHT", "BMXWAIST", "BMXTRI"]]

km = KMeans(n_clusters=4)
km.fit(x)
clcent = pd.DataFrame(km.cluster_centers_, columns=x.columns)
print(clcent)
print(pd.Series(km.labels_).value_counts())
      BMXBMI       BMXWT       BMXHT    BMXWAIST     BMXTRI
0  31.972864   83.820874  162.050485  102.259709  27.322816
1  23.316282   62.361218  163.728205   82.585897  17.253526
2  37.383378  115.087162  175.995946  119.452027  26.077027
3  27.155331   85.557977  177.581712   95.147471  13.760700
1    311
3    258
0    206
2    148
dtype: int64
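
Like PCA, K-means is based on Euclidean distances, so the clusters above are driven mainly by the high-variance variables such as weight and waist circumference. A sketch of the same analysis on standardized features, which lets the five measures contribute more evenly:

from sklearn.preprocessing import StandardScaler

# Rescale the features so each contributes comparably to the distances.
xs = StandardScaler().fit_transform(x)
km_s = KMeans(n_clusters=4)
labels_s = km_s.fit_predict(xs)
print(pd.Series(labels_s).value_counts())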