In [21]:

```
import numpy as np
import pandas as pd
import patsy
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style("white")
```

The analysis of variance (ANOVA) is typically used to "*learn the relative importance of different sources of variation in a dataset*" (see Gelman & Hill, Chapter 22). Recall that when we think about regression models, we ask questions like:

**What is our prediction for distance flown given paper and design?**

The ANOVA for the same model fit focuses on a slightly different question:

**How much of the variance in distance flown is associated with changes in paper and design?**

ANOVA can be thought of as *breaking up* the total variance in the response into (a) how much variance is associated with each independent variable in our design and (b) the remaining *error* or *residual* variance.
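To make this concrete, here is a tiny sketch with made-up numbers (not the planes data) showing that the total sum of squares about the grand mean splits exactly into a between-groups part plus a within-groups (residual) part:

```python
import numpy as np

# illustrative data: two groups of three observations
y = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
groups = np.array([0, 0, 0, 1, 1, 1])

grand = y.mean()
ss_total = ((y - grand) ** 2).sum()

group_means = np.array([y[groups == g].mean() for g in (0, 1)])
# between-groups SS: squared deviation of each group mean, weighted by group size
ss_between = sum(((group_means[g] - grand) ** 2) * (groups == g).sum()
                 for g in (0, 1))
# within-groups (residual) SS: squared deviations about each group's own mean
ss_within = sum(((y[groups == g] - group_means[g]) ** 2).sum()
                for g in (0, 1))

print(ss_total, ss_between + ss_within)  # both 28.0: the partition is exact
```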

In [23]:

```
# for this demonstration we will use the *planes* example because it's balanced.
# read in the data, rename some values:
planes = pd.read_csv('../Datasets/planes.txt', sep='\t')
# column names to lowercase:
planes.columns = [c.lower() for c in planes.columns]
# turn "paper" variable into an object, not int64:
planes['paper'] = planes['paper'].astype(object)
planes.replace(to_replace={'paper': {1: '80',
                                     2: '50'},
                           'angle': {1: 'horz',
                                     2: '45deg'},
                           'plane': {1: 'sleek',
                                     2: 'simple'}},
               inplace=True)
# convert distance into meters for ease of SSR:
planes['distance'] = planes['distance'] / 1000
# rename "plane" to "design":
planes = planes.rename(columns={'plane': 'design'})
# planes is a balanced dataset (same number of cases in each cell):
planes.groupby(['paper', 'design']).size()
```

Out[23]:

Recall for the planes data we are interested in the effects of `design` and `paper` on distance. This is a $2 \times 2$ factorial design. We can think about all the possible models here as a series of nested models. That is, each simpler model is a subset of the most complex model (the model with all main effects and interactions). We can visualise this in a diagram (adapted from [1]):

In the diagram, the arrows indicate a differencing of the sum of squared residuals (SSR). The SSR for the upper model (with more parameters, and thus a lower SSR) is subtracted from the SSR of the lower model. This difference score gives the change in SS associated with the new predictor.

Notice that there are actually two comparisons here that speak to the "main effects" of `paper` (in red) and `design` (in blue). These correspond to different hypotheses. Let's consider the test for `paper`. The first comparison (red text, right side of figure) tests whether paper has an effect *ignoring* design. This is usually called the **main effect** of paper. The second comparison (red text, left side of figure) tests whether adding paper changes the residuals *after discounting the variance due to design*. This corresponds to the SS for the **simple effects** of paper (in an interaction model, we also add the SS from the interaction term -- see [2], p. 433).

The `planes` dataset we use here is an example of a *balanced design*. Balanced designs have an equal number of observations in each cell (four planes were thrown in each combination of `paper` and `design`). It turns out that in balanced designs, the main effects and the sum of simple effects give the same answer. That's because they're just different ways to divide up the total sum of squares. When we have unbalanced data, however, it matters which comparison you do (see Types of Sums of Squares, below).

When we do ANOVA, we are breaking up the total variance from the grand mean into subgroups of variance. To do this in a way that makes sense, we need to code categorical variables in a way that is independent of the other columns. These are called *orthogonal contrasts*.

What is an orthogonal contrast? Intuitively, in a design matrix of orthogonal contrasts, the overall effect of one column is independent of the other columns. Mathematically, two columns of a design matrix are orthogonal if their elementwise product sums to zero (equivalently, their dot product is zero).
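As a quick check of this definition, here are illustrative contrast vectors for two two-level factors (made-up columns, not the planes design matrix):

```python
import numpy as np

# sum-coded contrast columns for two two-level factors:
a = np.array([1, 1, -1, -1])   # factor A contrast
b = np.array([1, -1, 1, -1])   # factor B contrast
print(np.dot(a, b))  # 0 -> orthogonal

# treatment (dummy) coding of the same factors is NOT orthogonal:
a_t = np.array([1, 1, 0, 0])
b_t = np.array([1, 0, 1, 0])
print(np.dot(a_t, b_t))  # 1 -> not orthogonal
```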

Throughout this course so far, recall that we have coded categorical variables using *treatment* or *dummy* coding. In this case, the Intercept term is the mean of one of the subgroups in the design (the *reference* level) and all other coefficients give deflections from this reference level. Are treatment contrasts orthogonal?

In [24]:

```
X = patsy.dmatrix('~ design * paper', planes)
X
```

Out[24]:

In [25]:

```
# example for coefficients of design and paper:
np.sum(X[:, 1] * X[:, 2])
```

Out[25]:

So in the *treatment* contrasts regime, we're not using orthogonal contrasts (the sum of the product of the columns is not zero). Therefore ANOVAs done with these design matrices become hard or impossible to interpret.

The fix for this is easy: we can tell Patsy to code our categorical variables using orthogonal contrasts. The most common of these is Sum coding, in which the values for each category sum to zero. This means that the coefficients in the regression represent *deviations* from the overall mean across the category levels. There are a number of other contrasts available in Patsy, which can test different hypotheses about the variables. See here, here and here for further reading.

What does the design matrix look like for sum coding?

In [26]:

```
X = patsy.dmatrix('~ C(design, Sum) * C(paper, Sum)', planes)
print(X)
```

Instead of our levels being either 0 or 1, they're now either -1 or 1. Now, the sum over our contrast columns is zero (note the intercept is not a contrast, so does not sum to zero), and their cross products are all zero:

In [27]:

```
print('sum of columns = ' + str(X.sum(axis=0)))
print('sum of products of design and paper = ' + str((X[:, 1] * X[:, 2]).sum()))
print('sum of products of design and interaction = ' + str((X[:, 1] * X[:, 3]).sum()))
```

Using these orthogonal contrasts, let's fit the model and look at the ANOVA table:

In [28]:

```
fit_full = smf.ols('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
print(fit_full.summary())
```

What do the coefficients mean? The intercept coefficient is now the grand mean of distance (recall that in the Treatment contrast regime it was the mean distance when the design was "simple" and the paper weight was "50"):

In [29]:

```
np.round(planes['distance'].mean(), decimals=4)
```

Out[29]:

The coefficient values for the contrasts tell us the average deviations from this grand mean for each category:

In [30]:

```
# example for design:
print(fit_full.params[0] + fit_full.params[1]) # model prediction for simple
print(fit_full.params[0] + -1*fit_full.params[1]) # model prediction for sleek
print('\n\n Actual Means:')
print(planes.groupby('design').distance.mean()) # actual means
```

In [31]:

```
anova_table = sms.anova.anova_lm(fit_full) # type I SS by default; same answer in this case
print(anova_table)
```

Recall that above we talked about ANOVA as a comparison of nested models. I'll reproduce the diagram here:

We can produce all the sums of squares from the ANOVA table by computing the relevant change in residual (error) sum of squares for each model.

In [32]:

```
# fit all the models in the tree
null = smf.ols('distance ~ 1', planes).fit()
design = smf.ols('distance ~ C(design, Sum)', planes).fit()
paper = smf.ols('distance ~ C(paper, Sum)', planes).fit()
additive = smf.ols('distance ~ C(design, Sum) + C(paper, Sum)', planes).fit()
full = smf.ols('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
```

In [33]:

```
print(null.summary())
```

This model explains 0% of the variance in the data. Because it's predicting every data point to be the mean, the residual variance is equal to the total variance:

In [34]:

```
null.mse_total # mean squared error total
```

Out[34]:

In [35]:

```
null.mse_resid
```

Out[35]:

In [36]:

```
# this is the same as the variance in distance:
planes['distance'].var()
```

Out[36]:

In [37]:

```
# let's calculate the sum of squared residuals manually:
res = planes['distance'] - null.predict() # absolute residuals (difference between data and prediction)
res = res ** 2 # squared residuals
res.sum()
```

Out[37]:

In [38]:

```
# the ssr is also contained in the model object:
null.ssr
```

Out[38]:

Recall that the SS for each model term can be calculated by comparing the change in the residual sum of squares for nested models. We do that here:

In [39]:

```
## Main effect of design, ignoring paper:
print(null.ssr - design.ssr)
## sum of simple effects of design, eliminating paper:
print(paper.ssr - additive.ssr)
## main effect from ANOVA table:
print(anova_table['sum_sq'][0])
```

In [40]:

```
## Main effect of paper, ignoring design:
print(null.ssr - paper.ssr)
## sum of simple effects of paper, eliminating design:
print(design.ssr - additive.ssr)
## main effect from ANOVA table:
print(anova_table['sum_sq'][1])
```

In [41]:

```
# interaction term:
print(additive.ssr - full.ssr)
## interaction effect from ANOVA table:
print(anova_table['sum_sq'][2])
```

As we add terms to the model, the residual sum of squares gets lower. It turns out that for any model, its $R^2$ (the coefficient of determination) is given by the proportion of the variance explained, relative to the null model:

In [42]:

```
# e.g. for full model:
print(full.summary())
```

In [43]:

```
diff = null.ssr - full.ssr
# express as a proportion of the total (null model) residual:
prop = diff / null.ssr
print(prop)
```

In [44]:

```
# same as R-squared:
print(full.rsquared)
```

As you may know from your basic statistics courses, the sum of squares for each term is divided by its degrees of freedom (the number of parameters estimated for that term) to give the mean square column. The ratio of a term's mean square to the residual (error) mean square gives the F-statistic, which can then be tested against the F-distribution (with `model, error` degrees of freedom) for significance.
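To make the arithmetic explicit, here is the F-ratio computed by hand from hypothetical sums of squares (the numbers below are made up, not taken from the planes table):

```python
from scipy import stats

# hypothetical values: adding a 1-df term reduces the SSR by 5.0,
# and the full model leaves SSR = 12.0 on 12 residual df
ss_term, df_term = 5.0, 1
ss_resid, df_resid = 12.0, 12

ms_term = ss_term / df_term      # mean square for the term
ms_resid = ss_resid / df_resid   # residual mean square (mean squared error)
F = ms_term / ms_resid           # F = 5.0
p = stats.f.sf(F, df_term, df_resid)  # upper tail of F(df_term, df_resid)
print(F, p)
```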

In [45]:

```
print(anova_table)
```

Consider the regression table for the full model above. How are the t-statistics related to the F-values for the model terms?

In the $2 \times 2$ design here, the t-tests are equivalent to the F-tests for the factors: the t-values are the square roots of the F-values (i.e. $F = t^2$).

In [46]:

```
full.tvalues[1:] ** 2
```

Out[46]:

In [47]:

```
print(anova_table['F'])
```

Recall that when we do ANOVA, we're partitioning the total variance into the variance associated with each factor, and that this is equivalent to nested model comparison. We can visualise this as a pie chart:

In [48]:

```
# plot the explained variances as a pie chart:
labels = 'design', 'paper', 'interaction', 'residual'
sizes = anova_table['sum_sq']
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
plt.pie(sizes, labels=labels,
autopct='%1.0f%%',
colors=colors,
startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')
plt.show()
```

Notice how the residual error variance is 27%. This is $(1 - R^2) \times 100\%$.

In unbalanced designs, the "Types of Sums of Squares" become relevant. The different Types of Sums of Squares (Type I, II, III...) basically correspond to different ways to compare models up and down the graph diagram shown above. That is, they correspond to different nested model comparisons -- and with unbalanced data, it matters which comparisons you do.

A complete discussion of the different sums of squares is beyond what we want to cover in this course. For some starter reading, see here, here.

The main thing we want to emphasise is: **ANOVA is about model comparison**. Think about what hypotheses make sense to test for your dataset.

We've provided here a brief overview of one way to think about ANOVA: it's comparing the incremental change in the sum of squared residuals as you add terms to the model. In this sense, ANOVA is about **model comparison**, using the change in residuals to decide whether a term is important.

F-ratios are not the only way we could compare the models. ANOVAs are just a historically popular way to do model comparison -- mainly because they are computable by hand. One advantage of thinking about ANOVAs as model comparisons is that once you understand the idea, it's easier to think how to use other types of comparisons.

For example, if we wanted to emphasise parsimonious models (i.e. keeping the number of parameters low), we could consider using an information criterion like the AIC or BIC to compare the models. The structure of the comparisons and their hypotheses are the same, however. For a comprehensive treatment of this type of model comparison, see [3].
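As a sketch of what that looks like in practice, we can fit two nested models and compare their AIC values; the model with the lower AIC is preferred. The data here are synthetic, standing in for the planes example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data with a genuine group effect of 2 units:
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': np.tile(['a', 'b'], 10)})
df['y'] = rng.normal(size=20) + 2 * (df['x'] == 'b')

null = smf.ols('y ~ 1', df).fit()
with_x = smf.ols('y ~ x', df).fit()

# lower AIC -> preferred model; here the model with x should win
print(null.aic, with_x.aic)
```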

Another possibility is using *regularisation* -- we fit the model but penalise the fit for having many coefficients far from zero. For example, *Lasso regression* (also called L1-penalised regression) penalises the sum of the absolute values of the coefficients, stopping it from growing too large. This has the effect of setting many coefficients to exactly zero -- i.e. removing those terms from the model. This is equivalent to using a model lower down the nesting hierarchy drawn above.
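A minimal sketch of why the L1 penalty zeroes out coefficients: in the special case of an orthonormal design matrix, the lasso solution is simply soft-thresholding of the OLS coefficients. This illustrates the mechanism only; it is not a general lasso solver, and the coefficient values are made up:

```python
import numpy as np

def soft_threshold(beta, lam):
    """Shrink each coefficient towards zero by lam; clip at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

beta_ols = np.array([2.5, 0.3, -1.2, -0.1])
beta_lasso = soft_threshold(beta_ols, lam=0.5)
print(beta_lasso)
# small coefficients (0.3, -0.1) are set exactly to zero,
# dropping those terms from the model
```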

We've demonstrated ANOVA for the linear model. The concepts inherent in ANOVA can be extended to Generalised Linear Models with non-Gaussian errors and non-identity link functions. To do this, we need to consider a generalisation of the squared residuals, called the *deviance*.

The *deviance* is twice the difference in log likelihood (i.e. the log of the likelihood of the data under the model) between a *saturated* model (one that contains a parameter for every data point, and thus predicts the data perfectly) and the model in question. Intuitively, think of *deviance* as the "amount by which our model predictions *deviate* from a perfect prediction". In the Gaussian + identity model, the deviance reduces to the sum of squared residuals:

In [49]:

```
# anova table from interaction model:
print(anova_table)
```

In [50]:

```
# summary table from GLM fit:
fit_glm = smf.glm('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
print(fit_glm.deviance)
```

For GLMs with non-Gaussian errors and non-identity link functions, ANOVA becomes "Analysis of Deviance", but the conceptual principles remain the same. We perform nested model comparison, comparing how the deviance is reduced by adding terms to the model.

Unfortunately Statsmodels has not implemented analysis of deviance as a function yet (as of version `0.6.1`), so it's not as straightforward to do as in the linear model cases outlined above. However, analysis of deviance can easily be performed in R or MATLAB, or by hand in Statsmodels using model comparisons as before. Remember to take degrees of freedom into account.

[1] Venables, W. N. (1998). Exegeses on linear models. In S-Plus User’s Conference, Washington DC. Retrieved from here

[2] Howell, D. C. (2002). Statistical Methods for Psychology (5th Edition).

[3] Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. New York: Springer.
