Statsmodels Regression Graphics

This notebook demonstrates regression graphics in Statsmodels.

The techniques demonstrated here are also implemented in the "car" package in R (which implements many more techniques besides):

http://cran.r-project.org/web/packages/car/index.html

See also the book "An R Companion to Applied Regression" by Fox and Weisberg.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import add_lowess

Occupational prestige data

Next we load a dataset containing perceptions of occupational prestige for 102 occupations in 1960s Canada. The key variables are:

  • prestige: a quantitative rating of an occupation's prestige (taken from a survey conducted in the 1960s)

  • education: average education (in years) of people holding the occupation

  • income: average income (in 1971 dollars) of people holding the occupation

  • women: percentage of people holding the occupation who are women

The data are available from the R car package.

In [2]:
prestige = pd.read_csv("prestige.csv")

We start with the following basic model, with the variables on their original scale.

In [3]:
model1 = sm.GLM.from_formula("prestige ~ income + education + women", family=sm.families.Gaussian(), data=prestige)
result1 = model1.fit()

Here is the CERES plot for income, which suggests that income might be log-transformed.

In [4]:
fig = result1.plot_ceres_residuals("income")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)

Here is the CERES plot for education, which suggests a quadratic transformation:

In [5]:
fig = result1.plot_ceres_residuals("education")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)

Here is the CERES plot for women. There is a weak U-shaped effect, but arguably there is no need for a transformation.

In [6]:
fig = result1.plot_ceres_residuals("women")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)

Mussels data

The data are measurements on a sample of horse mussels.

  • H: Shell height in mm
  • L: Shell length in mm
  • M: Muscle mass in grams
  • S: Shell mass in grams
  • W: Shell width in mm

The data are available from the R dr package.

In [7]:
data = pd.read_csv("mussels.csv")

First we fit a model expressing muscle mass as a function of shell length, shell width, and shell mass.

In [8]:
model2 = sm.GLM.from_formula("M ~ L + W + S", family=sm.families.Gaussian(), data=data)
result2 = model2.fit()

The next three cells show the CERES residual plot for each predictor. There is no obvious need to transform any of the covariates, but there is fairly strong heteroscedasticity.

In [9]:
fig = result2.plot_ceres_residuals("L")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)

In [10]:
fig = result2.plot_ceres_residuals("W")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)

In [11]:
fig = result2.plot_ceres_residuals("S")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)