This notebook demonstrates regression graphics in Statsmodels.

The techniques demonstrated here are also implemented in the *car* package in R (which implements many additional techniques):

http://cran.r-project.org/web/packages/car/index.html

See also the book "An R Companion to Applied Regression" by Fox and Weisberg.

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import add_lowess
```

**Occupational prestige data**

Next we load a dataset containing perceptions of occupational prestige for 102 occupations in 1960s Canada. The key variables are:

- *prestige*: a quantitative rating of an occupation's prestige (taken from a survey conducted in the 1960s)
- *education*: average education (in years) of people holding the occupation
- *income*: average income (in 1971 dollars) of people holding the occupation
- *women*: percentage of people holding the occupation who are women

The data are available from the R *car* package.

In [2]:

```
prestige = pd.read_csv("prestige.csv")
```

We start with the following basic model, with the variables on their original scale.

In [3]:

```
# Fit a linear model (a Gaussian GLM is equivalent to OLS) using a formula
model1 = sm.GLM.from_formula("prestige ~ income + education + women", family=sm.families.Gaussian(), data=prestige)
result1 = model1.fit()
```

Here is the CERES (conditional expectation partial residuals) plot for *income*. A CERES plot shows the estimated relationship between a focus predictor and the response after adjusting for the other predictors; curvature in the smoothed fit suggests transforming the focus predictor. The curvature here suggests that income might be log transformed.

In [4]:

```
fig = result1.plot_ceres_residuals("income")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)  # lighten the scatter points
_ = add_lowess(ax, frac=0.4)      # overlay a lowess fit
```
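
If we follow this suggestion, we can refit the model with income on the log scale. This is a minimal sketch (the names `model1_log` and `result1_log` are illustrative, not part of the original notebook); `np.log` can be used directly inside the formula:

```
# Refit with income on the log scale, as suggested by the CERES plot
model1_log = sm.GLM.from_formula("prestige ~ np.log(income) + education + women", family=sm.families.Gaussian(), data=prestige)
result1_log = model1_log.fit()
```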

Here is the CERES plot for *education*, which suggests a quadratic transformation:

In [5]:

```
fig = result1.plot_ceres_residuals("education")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)
```
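
Following this suggestion, a quadratic term for education can be added directly in the formula. Again a minimal sketch with illustrative names:

```
# Refit with a quadratic term for education, as suggested by the CERES plot
model1_quad = sm.GLM.from_formula("prestige ~ income + education + I(education**2) + women", family=sm.families.Gaussian(), data=prestige)
result1_quad = model1_quad.fit()
```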

Here is the CERES plot for *women*. There is a weak U-shaped pattern, but arguably no transformation is needed.

In [6]:

```
fig = result1.plot_ceres_residuals("women")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)
```

**Mussels data**

The data are measurements on a sample of horse mussels. The variables are:

- *H*: shell height in mm
- *L*: shell length in mm
- *M*: muscle mass in grams
- *S*: shell mass in grams
- *W*: shell width in mm

The data are available from the R *dr* package.

In [7]:

```
data = pd.read_csv("mussels.csv")
```

First we fit a model expressing muscle mass as a function of shell length, shell width, and shell mass.

In [8]:

```
# Fit a linear model for muscle mass (Gaussian GLM, equivalent to OLS)
model2 = sm.GLM.from_formula("M ~ L + W + S", family=sm.families.Gaussian(), data=data)
result2 = model2.fit()
```

The next three cells show the CERES residual plot for each predictor. None of the plots suggests an obvious need to transform any of the covariates, but there is fairly strong heteroscedasticity (a sketch after the three plots shows one way to visualize it).

In [9]:

```
fig = result2.plot_ceres_residuals("L")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)
```

In [10]:

```
fig = result2.plot_ceres_residuals("W")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)
```

In [11]:

```
fig = result2.plot_ceres_residuals("S")
ax = fig.get_axes()[0]
ax.get_lines()[0].set_alpha(0.5)
_ = add_lowess(ax, frac=0.4)
```
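
To visualize the heteroscedasticity noted above, one simple check (a sketch, not part of the CERES machinery) is to plot the absolute residuals against the fitted values; a rising trend indicates that the residual variance grows with the mean:

```
# Plot absolute residuals against fitted values to assess heteroscedasticity
fitted = result2.fittedvalues
resid = result2.resid_response
plt.figure()
plt.plot(fitted, np.abs(resid), 'o', alpha=0.5)
plt.xlabel("Fitted values")
_ = plt.ylabel("Absolute residuals")
```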