## Statsmodels Regression Graphics¶

This notebook demonstrates regression graphics in Statsmodels.

The techniques demonstrated here are also implemented in the "CAR" package in R (which also implements many more techniques):

http://cran.r-project.org/web/packages/car/index.html

See also the book "An R Companion to Applied Regression" by Fox and Weisberg.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm


Occupational prestige data

Next we load a dataset containing perceptions of occupational prestige for 102 occupations in 1960's Canada. The key variables are:

• prestige: a quantitative rating of an occupation's prestige (taken from a survey conducted in the 1960's)

• education: average education (in years) of people holding the occupation

• income: average income (in 1971 dollars) of people holding the occupation

• women: percentage of people holding the occupation who are women

The data are available from the R CAR package.

In [2]:
prestige = pd.read_csv("prestige.csv")


We start with the following basic model, with the variables on their original scale.

In [3]:
model1 = sm.GLM.from_formula("prestige ~ income + education + women", family=sm.families.Gaussian(), data=prestige)
result1 = model1.fit()


Here is the CERES plot for income, which suggests that income might be log transformed.

In [4]:
fig = result1.plot_ceres_residuals("income")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)


Here is the CERES plot for education, which suggests a quadratic transformation:

In [5]:
fig = result1.plot_ceres_residuals("education")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)


Here is the CERES plot for women. There is a weak U-shaped effect, but arguably there is no need for a transformation.

In [6]:
fig = result1.plot_ceres_residuals("women")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)


Mussels data

The data are measurements on a sample of Horse mussels.

• H: Shell height in mm
• L: Shell length in mm
• M: Muscle mass in grams
• S: Shell mass in grams
• W: Shell width in mm

The data are available from the R DR package.

In [7]:
data = pd.read_csv("mussels.csv")


First we fit a model expressing the muscle mass as a function of the shell length and width, and the shell mass.

In [8]:
model2 = sm.GLM.from_formula("M ~ L + W + S", family=sm.families.Gaussian(), data=data)
result2 = model2.fit()


The next three cells show the CERES residuals plot for each predictor. There is no obvious need for transformations of any of the covariates. There is fairly strong heteroscedasticity.

In [9]:
fig = result2.plot_ceres_residuals("L")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)

fig = result2.plot_ceres_residuals("W")

fig = result2.plot_ceres_residuals("S")