Risk factors for low birthweight¶

This notebook illustrates a logistic regression analysis using Python Statsmodels. The data are indicators of low birth weight in a sample of newborns. Each record in the data set corresponds to one infant. The primary variable of interest is whether the baby was born with low birthweight, defined to be a birth weight less than 2500 grams. There are additional variables in the data set that may be used as predictors of a baby being born with low birth weight.

A description of the data is here. The data can be downloaded here.

These are the standard import statements:

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt


Next we read in the data set and check its size.

In [2]:
data = pd.read_csv("lbw.csv")
print(data.shape)

(189, 11)


The head method displays the first few rows of the data set, so we know what we are working with:

In [3]:
print(data.head())

   Unnamed: 0  low  smoke  race  age  lwt  ptl  ht  ui  ftv   bwt
0           1    0      0     2   19  182    0   0   1    0  2523
1           2    0      0     3   33  155    0   0   0    3  2551
2           3    0      1     1   20  105    0   0   0    1  2557
3           4    0      1     1   21  108    0   0   1    2  2594
4           5    0      1     1   18  107    0   0   1    0  2600


It's optional, but the first column is not needed so we can drop it.

In [4]:
del data["Unnamed: 0"]


It's a good idea to check the data types to make sure there are no surprises.

In [5]:
print(data.dtypes)

low      int64
smoke    int64
race     int64
age      int64
lwt      int64
ptl      int64
ht       int64
ui       int64
ftv      int64
bwt      int64
dtype: object


We also can check to see if any data are missing.

In [6]:
pd.isnull(data).sum()

Out[6]:
low      0
smoke    0
race     0
age      0
lwt      0
ptl      0
ht       0
ui       0
ftv      0
bwt      0
dtype: int64
In [7]:
data["race"] = data["race"].replace({1: "white", 2: "black", 3: "other"})


We don't have a lot of information about this data set, but we can see that the frequency of low birth weight is over 30%, whereas in the general population it is less than 10%. Thus the data set is not representative of the general population. It may be from a case/control study, or from a study of a high risk population.

In [8]:
data.low.mean()

Out[8]:
0.31216931216931215

Now we can fit a logistic regression model containing additive effects for all covariates. The GLM function fits many types of generalized linear models (GLMs). Choosing the Binomial family makes it a logistic regression.

In [9]:
model1 = sm.GLM.from_formula("low ~ age + smoke + race + lwt + ptl + ht + ui + ftv", family=sm.families.Binomial(), data=data)
result1 = model1.fit()
print(result1.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      179
Model Family:                Binomial   Df Model:                            9
Method:                          IRLS   Log-Likelihood:                -100.65
Date:                Tue, 06 Jan 2015   Deviance:                       201.31
Time:                        12:26:36   Pearson chi2:                     183.
No. Iterations:                     7
=================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept         1.7495      1.255      1.394      0.163        -0.710     4.209
race[T.other]    -0.3908      0.538     -0.727      0.467        -1.444     0.663
race[T.white]    -1.2717      0.527     -2.412      0.016        -2.305    -0.238
age              -0.0296      0.037     -0.799      0.424        -0.102     0.043
smoke             0.9388      0.402      2.335      0.020         0.151     1.727
lwt              -0.0154      0.007     -2.225      0.026        -0.029    -0.002
ptl               0.5434      0.345      1.573      0.116        -0.134     1.220
ht                1.8623      0.697      2.670      0.008         0.495     3.229
ui                0.7676      0.459      1.671      0.095        -0.133     1.668
ftv               0.0655      0.172      0.380      0.704        -0.272     0.403
=================================================================================


There are various ways to select a more parsimonious model from the large model we fit above. Here we will do a manual "backward elimination", dropping the variable with the smallest Z-score and refitting the model. We repeat this process until a stopping point is reached (discussed further below).

In [10]:
# drop ftv
model2 = sm.GLM.from_formula("low ~ age + smoke + race + lwt + ptl + ht + ui", family=sm.families.Binomial(), data=data)
result2 = model2.fit()
print(result2.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      180
Model Family:                Binomial   Df Model:                            8
Method:                          IRLS   Log-Likelihood:                -100.72
Date:                Tue, 06 Jan 2015   Deviance:                       201.45
Time:                        12:26:36   Pearson chi2:                     182.
No. Iterations:                     7
=================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept         1.7239      1.261      1.367      0.172        -0.748     4.196
race[T.other]    -0.4006      0.539     -0.743      0.457        -1.457     0.655
race[T.white]    -1.2626      0.526     -2.399      0.016        -2.294    -0.231
age              -0.0271      0.036     -0.743      0.457        -0.099     0.044
smoke             0.9233      0.401      2.304      0.021         0.138     1.709
lwt              -0.0152      0.007     -2.188      0.029        -0.029    -0.002
ptl               0.5418      0.346      1.565      0.118        -0.137     1.220
ht                1.8325      0.692      2.650      0.008         0.477     3.188
ui                0.7585      0.459      1.651      0.099        -0.142     1.659
=================================================================================

In [11]:
# drop age
model3 = sm.GLM.from_formula("low ~ smoke + race + lwt + ptl + ht + ui", family=sm.families.Binomial(), data=data)
result3 = model3.fit()
print(result3.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      181
Model Family:                Binomial   Df Model:                            7
Method:                          IRLS   Log-Likelihood:                -101.00
Date:                Tue, 06 Jan 2015   Deviance:                       202.01
Time:                        12:26:36   Pearson chi2:                     184.
No. Iterations:                     7
=================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept         1.2348      1.072      1.152      0.250        -0.867     3.337
race[T.other]    -0.4277      0.539     -0.794      0.427        -1.484     0.629
race[T.white]    -1.3252      0.522     -2.538      0.011        -2.349    -0.302
smoke             0.9387      0.399      2.355      0.019         0.157     1.720
lwt              -0.0159      0.007     -2.316      0.021        -0.029    -0.002
ptl               0.5033      0.341      1.475      0.140        -0.166     1.172
ht                1.8539      0.695      2.668      0.008         0.492     3.216
ui                0.7857      0.456      1.721      0.085        -0.109     1.680
=================================================================================

In [12]:
# combine other and black race
data["white_race"] = (data.race == "white")
model4 = sm.GLM.from_formula("low ~ smoke + white_race + lwt + ptl + ht + ui", family=sm.families.Binomial(), data=data)
result4 = model4.fit()
print(result4.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      182
Model Family:                Binomial   Df Model:                            6
Method:                          IRLS   Log-Likelihood:                -101.32
Date:                Tue, 06 Jan 2015   Deviance:                       202.64
Time:                        12:26:36   Pearson chi2:                     183.
No. Iterations:                     7
======================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.7250      0.848      0.855      0.393        -0.938     2.387
white_race[T.True]    -1.0483      0.390     -2.685      0.007        -1.813    -0.283
smoke                  0.9991      0.393      2.544      0.011         0.229     1.769
lwt                   -0.0143      0.006     -2.202      0.028        -0.027    -0.002
ptl                    0.4942      0.341      1.447      0.148        -0.175     1.164
ht                     1.8360      0.693      2.650      0.008         0.478     3.194
ui                     0.7712      0.459      1.680      0.093        -0.129     1.671
======================================================================================

In [13]:
# drop ptl
model5 = sm.GLM.from_formula("low ~ smoke + white_race + lwt + ht + ui", family=sm.families.Binomial(), data=data)
result5 = model5.fit()
print(result5.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      183
Model Family:                Binomial   Df Model:                            5
Method:                          IRLS   Log-Likelihood:                -102.39
Date:                Tue, 06 Jan 2015   Deviance:                       204.79
Time:                        12:26:36   Pearson chi2:                     182.
No. Iterations:                     7
======================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.9020      0.838      1.077      0.282        -0.740     2.544
white_race[T.True]    -1.0649      0.388     -2.744      0.006        -1.825    -0.304
smoke                  1.0910      0.387      2.822      0.005         0.333     1.849
lwt                   -0.0153      0.006     -2.365      0.018        -0.028    -0.003
ht                     1.8523      0.688      2.692      0.007         0.504     3.201
ui                     0.8926      0.449      1.986      0.047         0.012     1.774
======================================================================================

In [14]:
# drop ui
model6 = sm.GLM.from_formula("low ~ smoke + white_race + lwt + ht", family=sm.families.Binomial(), data=data)
result6 = model6.fit()
print(result6.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      184
Model Family:                Binomial   Df Model:                            4
Method:                          IRLS   Log-Likelihood:                -104.34
Date:                Tue, 06 Jan 2015   Deviance:                       208.68
Time:                        12:26:36   Pearson chi2:                     185.
No. Iterations:                     6
======================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              1.2285      0.825      1.489      0.136        -0.388     2.845
white_race[T.True]    -1.0600      0.383     -2.769      0.006        -1.810    -0.310
smoke                  1.1174      0.382      2.928      0.003         0.370     1.865
lwt                   -0.0167      0.006     -2.573      0.010        -0.029    -0.004
ht                     1.7392      0.689      2.523      0.012         0.388     3.090
======================================================================================

In [15]:
# drop ht
model7 = sm.GLM.from_formula("low ~ smoke + white_race + lwt", family=sm.families.Binomial(), data=data)
result7 = model7.fit()
print(result7.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      185
Model Family:                Binomial   Df Model:                            3
Method:                          IRLS   Log-Likelihood:                -107.70
Date:                Tue, 06 Jan 2015   Deviance:                       215.40
Time:                        12:26:36   Pearson chi2:                     182.
No. Iterations:                     6
======================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept              0.8002      0.785      1.019      0.308        -0.739     2.340
white_race[T.True]    -1.0775      0.373     -2.889      0.004        -1.809    -0.346
smoke                  1.1030      0.372      2.962      0.003         0.373     1.833
lwt                   -0.0122      0.006     -2.026      0.043        -0.024    -0.000
======================================================================================


To help select a model, we can calculate the AIC and BIC for each model.

In [16]:
[x.aic for x in (result1, result2, result3, result4, result5, result6, result7)]

Out[16]:
[221.30512017621612,
219.4479911325021,
218.00796701966146,
216.63551147210853,
216.78608548110299,
218.67864946446963,
223.39828881016609]
In [17]:
[x.bic for x in (result1, result2, result3, result4, result5, result6, result7)]

Out[17]:
[-736.96759551945979,
-742.06647157823352,
-746.74824270613385,
-751.36244526874646,
-754.45361827481156,
-755.80280130650453,
-754.32490897586786]

In terms of AIC, model4 is the best. If our goal is to have all Z-scores greater than 2, we would go with model6, which is the same model selected by BIC. Each of these model selection statistics has strengths and weaknesses. There is no automatic way to decide which model is "best".

Non-multiplicative effect of maternal weight

Now suppose that our main interest is in the maternal age and weight effects (lwt). These are quantitative variables, so they may not be accurately represented by the models presented above, which only have linear effects on the log odds ratio (multiplicative effects on the odds ratio). We can further explore the effect of maternal weight using splines.

In [18]:
model8 = sm.GLM.from_formula("low ~ smoke + white_race + bs(lwt, df=4) + ht + ui", family=sm.families.Binomial(), data=data)
result8 = model8.fit()
print(result8.summary())

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                    low   No. Observations:                  189
Model:                            GLM   Df Residuals:                      180
Model Family:                Binomial   Df Model:                            8
Method:                          IRLS   Log-Likelihood:                -101.19
Date:                Tue, 06 Jan 2015   Deviance:                       202.38
Time:                        12:26:37   Pearson chi2:                     179.
No. Iterations:                     8
======================================================================================
coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept             -0.7202      1.458     -0.494      0.621        -3.579     2.138
white_race[T.True]    -1.0740      0.390     -2.754      0.006        -1.838    -0.310
smoke                  1.0940      0.393      2.787      0.005         0.325     1.863
bs(lwt, df=4)[0]       0.6913      1.976      0.350      0.726        -3.181     4.563
bs(lwt, df=4)[1]      -2.3650      1.702     -1.389      0.165        -5.701     0.971
bs(lwt, df=4)[2]       1.8906      3.436      0.550      0.582        -4.843     8.625
bs(lwt, df=4)[3]      -6.5241      4.640     -1.406      0.160       -15.618     2.570
ht                     1.9604      0.737      2.658      0.008         0.515     3.406
ui                     0.8946      0.456      1.962      0.050         0.001     1.788
======================================================================================


Next we plot the log odds for maternal weight relative to the median maternal weight. The effect that is graphed is for non-smoking white women with no hypertension (ht=0) and no uterine irritability (ui=0).

In [19]:
df = data.iloc[0:9,:].copy()
df["smoke"] = 0
df["white_race"] = 1

# tolist is required due to a numpy bug, now fixed
lwt = np.percentile(np.asarray(data.lwt), np.arange(10, 91, 10).tolist())

df["lwt"] = lwt
df["ht"] = 0
df["ui"] = 0

# Logit probabilities
lpr = result8.predict(exog=df, linear=True)

# Log odds ratios relative to median maternal weight
lor = lpr - lpr[4]

plt.grid(True)
plt.plot(lwt, lor, '-o')
plt.xlim(98, 172)
plt.xlabel("Maternal weight (lbs)", size=15)
plt.ylabel("Log odds relative to median", size=15)

Out[19]:
<matplotlib.text.Text at 0x7feba323f690>

We see that while low maternal weight is a risk factor, there is no advantage to being overweight as compared to median weight.

In [20]:
OR = np.exp(lor)

plt.clf()
plt.grid(True)
plt.plot(lwt, OR, '-o')
plt.xlim(98, 172)
plt.xlabel("Maternal weight (lbs)", size=15)
plt.ylabel("Odds ratio", size=15)

Out[20]:
<matplotlib.text.Text at 0x7feba101e4d0>

Effect of maternal age

We can now revisit the effect of maternal age, which was very weak as a main effect (i.e. as a linear effect on the log scale). We will model age using splines to capture possible nonlinear effects. We will also control for the other factors that were found above to have effects.

In [21]:
model9 = sm.GLM.from_formula("low ~ smoke + bs(age, df=3) + white_race + lwt + ht + ui", family=sm.families.Binomial(), data=data)
result9 = model9.fit()

df = data.iloc[0:9,:].copy()
df["smoke"] = 0
df["white_race"] = 1

# tolist is required due to a numpy bug, now fixed
age = np.percentile(np.asarray(data.age), np.arange(10, 91, 10).tolist())

df["lwt"] = data.lwt.mean()
df["age"] = age
df["ht"] = 0
df["ui"] = 0

# Logit probabilities
lpr = result9.predict(exog=df, linear=True)

import patsy
dexog = patsy.dmatrix(model9.data.orig_exog.design_info.builder, df)

vcov = result9.cov_params()
va = [np.dot(x, np.dot(vcov, x)) for x in dexog]
va = np.asarray(va)
sd = np.sqrt(va)

plt.grid(True)
plt.plot(age, lpr, '-o')
plt.fill_between(age, lpr-2*sd, lpr+2*sd, color='grey', alpha=0.6)
plt.xlabel("Maternal age", size=15)
plt.ylabel("Logit probability", size=15)

Out[21]:
<matplotlib.text.Text at 0x7feba0f084d0>

There still is no evidence of a maternal age effect.