# The Generalised Linear Model and ANOVA¶

The University of Tübingen

In [21]:
import numpy as np
import pandas as pd
import patsy
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import matplotlib.pyplot as plt

%matplotlib inline

sns.set_style("white")


# Introduction¶

The analysis of variance (ANOVA) is typically used to "learn the relative importance of different sources of variation in a dataset" (see Gelman & Hill, Chapter 22). Recall that when we think about regression models, we ask questions like:

What is our prediction for distance flown given paper and design?

The ANOVA for the same model fit focuses on a slightly different question:

How much of the variance in distance flown is associated with changes in paper and design?

ANOVA can be thought of as breaking up the total variance in the response into (a) how much variance is associated with each independent variable in our design and (b) the remaining error or residual variance.

In [23]:
# for this demonstration we will use the *planes* example because it's balanced.

# read in the data, rename some values:

# column names to lowercase:
planes.columns = [c.lower() for c in planes.columns]

# turn "paper" variable into an object, not int64:
planes['paper'] = planes['paper'].astype(object)

planes.replace(to_replace={'paper': {1: '80',
2: '50'},
'angle': {1: 'horz',
2: '45deg '},
'plane': {1: 'sleek',
2: 'simple'}},
inplace=True)

# convert distance into meters for ease of SSR:
planes['distance'] = planes['distance'] / 1000

# rename "plane" to "design":
planes = planes.rename(columns={'plane': 'design'})

# planes is a balanced dataset (same number of cases in each cell):
planes.groupby(['paper', 'design']).size()

Out[23]:
paper  design
50     simple    4
sleek     4
80     simple    4
sleek     4
dtype: int64

# ANOVA as model comparison between nested models¶

Recall for the planes data we are interested in the effects of design and paper on distance. This is a $2 \times 2$ factorial design. We can think about all the possible models here as a series of nested models. That is, each simpler model is a subset of the most complex model (the model with all main effects and interactions). We can visualise this in a diagram (adapted from [1]):

In the diagram, the arrows indicate a differencing of the sum of squared residuals (SSR). The SSR for the upper model (with more parameters, and thus a lower SSR) is subtracted from the SSR of the lower model. This difference score gives the change in SS associated with the new predictor.

Notice that there are actually two comparisons here that speak to the "main effects" of paper (in red) and design (in blue). These correspond to different hypotheses. Let's consider the test for paper. The first comparison (red text, right side of figure) tests whether paper has an effect ignoring design. This is usually called the main effect of paper. The second comparison (red text, left side of figure) tests whether adding paper changes the residuals after discounting the variance due to design. This corresponds to the SS for the simple effects of paper (in an interaction model, we also add the SS from the interaction term -- see [2], p.433).

The planes dataset we use here is an example of a balanced design. Balanced designs have an equal number of observations in each cell (four planes were thrown in each combination of paper and design). It turns out that in balanced designs, the main effects and the sum of simple effects give the same answer. That's because they're just different ways to divide up the total sum of squares. When we have unbalanced data however, it matters which comparison you do (see Types of Sums of Squares, below).

# Orthogonal contrasts¶

When we do ANOVA, we are breaking up the total variance from the grand mean into subgroups of variance. To do this in a way that makes sense, we need to code categorical variables in a way that is independent of the other columns. These are called orthogonal contrasts.

What is an orthogonal contrast? Intuitively, in a design matrix of orthogonal contrasts, the overall effect of one column is independent of the other columns. Mathematically, two columns of a design matrix are orthogonal if their product sums to zero.

Throughout this course so far, recall that we have coded categorical variables using treatment or dummy coding. In this case, the Intercept term is the mean of one of the subgroups in the design (the reference level) and all other coefficients give deflections from this reference level. Are treatment contrasts orthogonal?

In [24]:
X = patsy.dmatrix('~ design * paper', planes)
X

Out[24]:
DesignMatrix with shape (16, 4)
Intercept  design[T.sleek]  paper[T.80]  design[T.sleek]:paper[T.80]
1                1            1                            1
1                1            1                            1
1                0            1                            0
1                0            1                            0
1                1            1                            1
1                1            1                            1
1                0            1                            0
1                0            1                            0
1                1            0                            0
1                1            0                            0
1                0            0                            0
1                0            0                            0
1                1            0                            0
1                1            0                            0
1                0            0                            0
1                0            0                            0
Terms:
'Intercept' (column 0)
'design' (column 1)
'paper' (column 2)
'design:paper' (column 3)
In [25]:
# example for coefficients of design and paper:
np.sum(X[:, 1] * X[:, 2])

Out[25]:
array(4.0)

So in the treatment contrasts regime, we're not using orthogonal contrasts (the sum of the product of the columns is not zero). Therefore ANOVAs done with these design matrices become hard or impossible to interpret.

The fix for this is easy: we can tell Patsy to code our categorical variables using orthogonal contrasts. The most common of these is Sum coding, in which the values for each category sum to one. This means that the coefficients in the regression represent deviations from the overall mean across the category levels. There are a number of other contrasts available in Patsy, which can test different hypotheses about the variables. See here, here and here for further reading.

What does the design matrix look like for sum coding?

In [26]:
X = patsy.dmatrix('~ C(design, Sum) * C(paper, Sum)', planes)
print(X)

[[ 1. -1. -1.  1.]
[ 1. -1. -1.  1.]
[ 1.  1. -1. -1.]
[ 1.  1. -1. -1.]
[ 1. -1. -1.  1.]
[ 1. -1. -1.  1.]
[ 1.  1. -1. -1.]
[ 1.  1. -1. -1.]
[ 1. -1.  1. -1.]
[ 1. -1.  1. -1.]
[ 1.  1.  1.  1.]
[ 1.  1.  1.  1.]
[ 1. -1.  1. -1.]
[ 1. -1.  1. -1.]
[ 1.  1.  1.  1.]
[ 1.  1.  1.  1.]]


Instead of our levels being either 0 or 1, they're now either -1 or 1. Now, the sum over our contrast columns is zero (note the intercept is not a contrast, so does not sum to zero), and their cross products are all zero:

In [27]:
print('sum of columns = ' + str(X.sum(axis=0)))
print('sum of products of design and paper = ' + str((X[:, 1] * X[:, 2]).sum()))
print('sum of products of design and interaction = ' + str((X[:, 1] * X[:, 3]).sum()))

sum of columns = [ 16.   0.   0.   0.]
sum of products of design and paper = 0.0
sum of products of design and interaction = 0.0


Using these orthogonal contrasts, let's fit the model and look at the ANOVA table:

In [28]:
fit_full = smf.ols('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
print(fit_full.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               distance   R-squared:                       0.727
Method:                 Least Squares   F-statistic:                     10.66
Date:                Tue, 24 Feb 2015   Prob (F-statistic):            0.00106
Time:                        20:51:51   Log-Likelihood:                -18.584
No. Observations:                  16   AIC:                             45.17
Df Residuals:                      12   BIC:                             48.26
Df Model:                           3
Covariance Type:            nonrobust
================================================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------
Intercept                                        3.6853      0.223     16.514      0.000         3.199     4.171
C(design, Sum)[S.simple]                        -0.1552      0.223     -0.696      0.500        -0.641     0.331
C(paper, Sum)[S.50]                              0.3277      0.223      1.469      0.168        -0.158     0.814
C(design, Sum)[S.simple]:C(paper, Sum)[S.50]    -1.2090      0.223     -5.418      0.000        -1.695    -0.723
==============================================================================
Omnibus:                        0.751   Durbin-Watson:                   2.909
Prob(Omnibus):                  0.687   Jarque-Bera (JB):                0.745
Skew:                           0.359   Prob(JB):                        0.689
Kurtosis:                       2.224   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


What do the coefficients mean? The intercept coefficient is now the grand mean of distance (recall that in the Treatment contrast regime it was the mean distance when the design was "simple" and the paper weight was "50"):

In [29]:
np.round(planes['distance'].mean(), decimals=4)

Out[29]:
3.6852

The coefficient values for the contrasts tell us the average deviations from this grand mean for each category:

In [30]:
# example for design:
print(fit_full.params[0] + fit_full.params[1])  # model prediction for simple
print(fit_full.params[0] + -1*fit_full.params[1])  # model prediction for sleek
print('\n\n Actual Means:')
print(planes.groupby('design').distance.mean())  # actual means

3.53
3.8405

Actual Means:
design
simple    3.5300
sleek     3.8405
Name: distance, dtype: float64


## ANOVA table for this model¶

In [31]:
anova_table = sms.anova.anova_lm(fit_full) # type I SS by default; same answer in this case
print(anova_table)

                              df     sum_sq    mean_sq          F    PR(>F)
C(design, Sum)                 1   0.385641   0.385641   0.484016  0.499861
C(paper, Sum)                  1   1.718721   1.718721   2.157158  0.167628
C(design, Sum):C(paper, Sum)   1  23.386896  23.386896  29.352777  0.000156
Residual                      12   9.561029   0.796752        NaN       NaN


## Relating the ANOVA table to nested model comparisons¶

Recall that above we talked about ANOVA as a comparison of nested models. I'll reproduce the diagram here:

We can produce all the sums of squares from the ANOVA table by computing the relevant change in residual (error) sum of squares for each model.

### Fit all models in the diagram¶

In [32]:
# fit all the models in the tree
null = smf.ols('distance ~ 1', planes).fit()
design = smf.ols('distance ~ C(design, Sum)', planes).fit()
paper = smf.ols('distance ~ C(paper, Sum)', planes).fit()
additive = smf.ols('distance ~ C(design, Sum) + C(paper, Sum)', planes).fit()
full = smf.ols('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()


### The null model represents the total sum of squared residuals¶

In [33]:
print(null.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               distance   R-squared:                       0.000
Method:                 Least Squares   F-statistic:                       nan
Date:                Tue, 24 Feb 2015   Prob (F-statistic):                nan
Time:                        20:51:51   Log-Likelihood:                -28.977
No. Observations:                  16   AIC:                             59.95
Df Residuals:                      15   BIC:                             60.73
Df Model:                           0
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      3.6853      0.382      9.643      0.000         2.871     4.500
==============================================================================
Omnibus:                        0.701   Durbin-Watson:                   1.723
Prob(Omnibus):                  0.704   Jarque-Bera (JB):                0.714
Skew:                           0.351   Prob(JB):                        0.700
Kurtosis:                       2.239   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


This model explains 0% of the variance in the data. Because it's predicting every data point to be the mean, the residual variance is equal to the total variance:

In [34]:
null.mse_total  # mean squared error total

Out[34]:
2.336819133333333
In [35]:
null.mse_resid

Out[35]:
2.336819133333333
In [36]:
# this is the same as the variance in distance:
planes['distance'].var()

Out[36]:
2.3368191333333317
In [37]:
# let's calculate the sum of squared residuals manually:
res = planes['distance'] - null.predict() # absolute residuals (difference between data and prediction)
res = res ** 2  # squared residuals
res.sum()

Out[37]:
35.052286999999993
In [38]:
# the ssr is also contained in the model object:
null.ssr

Out[38]:
35.052286999999993

### Main effect of design¶

Recall that the SS for each model term can be calculated by comparing the change in the residual sum of squares for nested models. We do that here:

In [39]:
## Main effect of design, ignoring paper:
print(null.ssr - design.ssr)

## sum of simple effects of design, eliminating paper:

## main effect from ANOVA table:
print(anova_table['sum_sq'][0])

0.385641
0.385641
0.385641


### Main effect of paper¶

In [40]:
## Main effect of paper, ignoring design:
print(null.ssr - paper.ssr)

## sum of simple effects of paper, eliminating design:

## main effect from ANOVA table:
print(anova_table['sum_sq'][1])

1.718721
1.718721
1.718721


### Interaction effect¶

In [41]:
# interaction term:

## interaction effect from ANOVA table:
print(anova_table['sum_sq'][2])

23.386896
23.386896


# Partitioning of residuals, variance explained¶

As we add terms to the model, the residual sum of squares gets lower. It turns out that for any model, its $R^2$ (the coefficient of determination) is given by the proportion of the variance explained, relative to the null model:

In [42]:
# e.g. for full model:
print(full.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               distance   R-squared:                       0.727
Method:                 Least Squares   F-statistic:                     10.66
Date:                Tue, 24 Feb 2015   Prob (F-statistic):            0.00106
Time:                        20:51:51   Log-Likelihood:                -18.584
No. Observations:                  16   AIC:                             45.17
Df Residuals:                      12   BIC:                             48.26
Df Model:                           3
Covariance Type:            nonrobust
================================================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------
Intercept                                        3.6853      0.223     16.514      0.000         3.199     4.171
C(design, Sum)[S.simple]                        -0.1552      0.223     -0.696      0.500        -0.641     0.331
C(paper, Sum)[S.50]                              0.3277      0.223      1.469      0.168        -0.158     0.814
C(design, Sum)[S.simple]:C(paper, Sum)[S.50]    -1.2090      0.223     -5.418      0.000        -1.695    -0.723
==============================================================================
Omnibus:                        0.751   Durbin-Watson:                   2.909
Prob(Omnibus):                  0.687   Jarque-Bera (JB):                0.745
Skew:                           0.359   Prob(JB):                        0.689
Kurtosis:                       2.224   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [43]:
diff = null.ssr - full.ssr
# express as a proportion of the total (null model) residual:
prop = diff / null.ssr
print(prop)

0.727235230044

In [44]:
# same as R-squared:
print(full.rsquared)

0.727235230044


# F-tests and t-tests¶

As you may know from your basic statistics courses, the sum of squares for each term is divided by how many parameters were estimated for that term (the degrees of freedom added) to give the mean squared column, and then the ratio between the change and the amount of remaining error gives the F-statistic. This can then be tested against the F-distribution (with model, error degrees of freedom) for significance.

In [45]:
print(anova_table)

                              df     sum_sq    mean_sq          F    PR(>F)
C(design, Sum)                 1   0.385641   0.385641   0.484016  0.499861
C(paper, Sum)                  1   1.718721   1.718721   2.157158  0.167628
C(design, Sum):C(paper, Sum)   1  23.386896  23.386896  29.352777  0.000156
Residual                      12   9.561029   0.796752        NaN       NaN


Consider the regression table for the full model above. How are the t-statistics related to the F-values for the model terms?

In the $2 \times 2$ design here, the t-tests are equivalent to the F-tests for the factors... their values are the square roots of the F-values (i.e. $F = t^2$)

In [46]:
full.tvalues[1:] **2

Out[46]:
C(design, Sum)[S.simple]                         0.484016
C(paper, Sum)[S.50]                              2.157158
C(design, Sum)[S.simple]:C(paper, Sum)[S.50]    29.352777
dtype: float64
In [47]:
print(anova_table['F'])

C(design, Sum)                   0.484016
C(paper, Sum)                    2.157158
C(design, Sum):C(paper, Sum)    29.352777
Residual                              NaN
Name: F, dtype: float64


# One way to visualise variance partitioning¶

Recall that when we do ANOVA, we're partitioning the total variance into the variance associated with each factor, and that this is equivalent to nested model comparison. We can visualise this as a pie chart:

In [48]:
# plot the explained variances as a pie chart:
labels = 'design', 'paper', 'interaction', 'residual'
sizes = anova_table['sum_sq']
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
plt.pie(sizes, labels=labels,
autopct='%1.0f%%',
colors=colors,
startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')
plt.show()


Notice how the residual error variance is 27%. This is $100 - R^2$.

## Aside: Types of Sums of Squares¶

In unbalanced designs, the "Types of Sums of Squares" become relevant. The different Types of Sums of Squares (Type I, II, III...) basically correspond to different ways to compare models up and down the graph diagram shown above. That is, they correspond to different nested model comparisons -- and with unbalanced data, it matters which comparisons you do.

A complete discussion of the different sums of squares is beyond what we want to cover in this course. For some starter reading, see here, here.

The main thing we want to emphasise is: ANOVA is about model comparison. Think about what hypotheses make sense to test for your dataset.

# ANOVA as model comparison: summary¶

We've provided here a brief overview of one way to think about ANOVA: it's comparing the incremental change in the sum of squared residuals as you add terms to the model. In this sense, ANOVA is about model comparison, using the change in residuals to decide whether a term is important.

F-ratios are not the only way we could compare the models. ANOVAs are just a historically popular way to do model comparison -- mainly because they are computable by hand. One advantage of thinking about ANOVAs as model comparisons is that once you understand the idea, it's easier to think how to use other types of comparisons.

For example, if we wanted to emphasise parsimonious models (i.e. keeping the number of parameters low), we could consider using an information criterion like the AIC or BIC to compare the models. The structure of the comparisons and their hypotheses are the same, however. For a comprehensive treatment of this type of model comparison, see [3].

Another possibility is using regularisation -- we fit the model but penalise the fit for having many coefficients far from zero. For example, Lasso regression (also called L1 norm regression) penalises the sum of the coefficients from growing too large. This has the effect of setting many coefficients to zero -- i.e. removing those terms from the model. This is equivalent to using a model lower down the nesting hierarchy drawn above.

# ANOVA in GLMs¶

We've demonstrated ANOVA for the linear model. The concepts inherent in ANOVA can be extended to Generalised Linear Models with non-Gaussian errors and non-identity link functions. To do this, we need to consider a generalisation of the squared residuals, called the deviance.

The deviance is the difference in log likelihood (i.e. the likelihood of the data under the model) between a saturated model (that contains a parameter for every data point, and thus predicts the data perfectly) and the model in question. Intuitively, think of deviance as the "amount by which our model predictions deviate from a perfect prediction". In the Gaussian + identity model, the deviance reduces to the sum of squared residuals:

In [49]:
# anova table from interaction model:
print(anova_table)

                              df     sum_sq    mean_sq          F    PR(>F)
C(design, Sum)                 1   0.385641   0.385641   0.484016  0.499861
C(paper, Sum)                  1   1.718721   1.718721   2.157158  0.167628
C(design, Sum):C(paper, Sum)   1  23.386896  23.386896  29.352777  0.000156
Residual                      12   9.561029   0.796752        NaN       NaN

In [50]:
# summary table from GLM fit:
fit_glm = smf.glm('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
print(fit_glm.deviance)

9.561029


For GLMs with non-Gaussian errors and non-identity link functions, ANOVA becomes "Analysis of Deviance", but the conceptual principles remain the same. We perform nested model comparison, comparing how the deviance is reduced by adding terms to the model.

Unfortunately Statsmodels has not implemented analysis of deviance as a function yet (as of version 0.6.1), so it's not as straightforward to do as in the linear model cases outlined above. However, analysis of deviance can be easily performed in R or MATLAB, or by hand in Statsmodels using model comparisons as before. Remember to take degrees of freedom into account.

# References¶

[1] Venables, W. N. (1998). Exegeses on linear models. In S-Plus User’s Conference, Washington DC. Retrieved from here

[2] Howell, D. C. (2002). Statistical Methods for Psychology (5th Edition).

[3] Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. New York: Springer.

In [1]:
from IPython.core.display import HTML

def css_styling():