# The Generalised Linear Model and ANOVA

The University of Tübingen

In [21]:
import numpy as np
import pandas as pd
import patsy
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.stats as sms
import matplotlib.pyplot as plt

%matplotlib inline

sns.set_style("white")


# Introduction

The analysis of variance (ANOVA) is typically used to "learn the relative importance of different sources of variation in a dataset" (see Gelman & Hill, Chapter 22). Recall that when we think about regression models, we ask questions like:

What is our prediction for distance flown given paper and design?

The ANOVA for the same model fit focuses on a slightly different question:

How much of the variance in distance flown is associated with changes in paper and design?

ANOVA can be thought of as breaking up the total variance in the response into (a) how much variance is associated with each independent variable in our design and (b) the remaining error or residual variance.
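As a toy numeric illustration of this partition (hypothetical numbers: two groups of three observations), the between-group and within-group pieces add up exactly to the total sum of squares:

```python
import numpy as np

# two hypothetical groups of measurements
g1 = np.array([1.0, 2.0, 3.0])
g2 = np.array([4.0, 5.0, 6.0])
y = np.concatenate([g1, g2])
grand = y.mean()

ss_total = ((y - grand) ** 2).sum()
# between-group SS: deviations of the group means from the grand mean
ss_between = len(g1) * (g1.mean() - grand) ** 2 + len(g2) * (g2.mean() - grand) ** 2
# within-group (residual) SS: deviations of points from their own group mean
ss_within = ((g1 - g1.mean()) ** 2).sum() + ((g2 - g2.mean()) ** 2).sum()

print(ss_total, ss_between + ss_within)  # 17.5 17.5 -- the partition is exact
```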

In [23]:
# for this demonstration we will use the *planes* example because it's balanced.

# read in the data, rename some values:
planes = pd.read_csv('../Datasets/planes.txt', sep='\t')

# column names to lowercase:
planes.columns = [c.lower() for c in planes.columns]

# turn "paper" variable into an object, not int64:
planes['paper'] = planes['paper'].astype(object)

planes.replace(to_replace={'paper': {1: '80',
                                     2: '50'},
                           'angle': {1: 'horz',
                                     2: '45deg'},
                           'plane': {1: 'sleek',
                                     2: 'simple'}},
               inplace=True)

# convert distance to a smaller scale (divide by 1000) so the sums of squares are easier to read:
planes['distance'] = planes['distance'] / 1000

# rename "plane" to "design":
planes = planes.rename(columns={'plane': 'design'})

# planes is a balanced dataset (same number of cases in each cell):
planes.groupby(['paper', 'design']).size()

Out[23]:
paper  design
50     simple    4
       sleek     4
80     simple    4
       sleek     4
dtype: int64

# ANOVA as model comparison between nested models

Recall for the planes data we are interested in the effects of design and paper on distance. This is a $2 \times 2$ factorial design. We can think about all the possible models here as a series of nested models. That is, each simpler model is a subset of the most complex model (the model with all main effects and interactions). We can visualise this in a diagram (adapted from [1]):

In the diagram, the arrows indicate a differencing of the sum of squared residuals (SSR). The SSR for the upper model (with more parameters, and thus a lower SSR) is subtracted from the SSR of the lower model. This difference score gives the change in SS associated with the new predictor.

Notice that there are actually two comparisons here that speak to the "main effects" of paper (in red) and design (in blue). These correspond to different hypotheses. Let's consider the test for paper. The first comparison (red text, right side of figure) tests whether paper has an effect ignoring design. This is usually called the main effect of paper. The second comparison (red text, left side of figure) tests whether adding paper changes the residuals after discounting the variance due to design. This corresponds to the SS for the simple effects of paper (in an interaction model, we also add the SS from the interaction term -- see [2], p.433).

The planes dataset we use here is an example of a balanced design. Balanced designs have an equal number of observations in each cell (four planes were thrown in each combination of paper and design). It turns out that in balanced designs, the main effects and the sum of simple effects give the same answer. That's because they're just different ways to divide up the total sum of squares. When we have unbalanced data however, it matters which comparison you do (see Types of Sums of Squares, below).

# Orthogonal contrasts

When we do ANOVA, we are breaking up the total variance around the grand mean into subgroups of variance. For this partition to be clean, we need to code categorical variables so that each column of the design matrix is independent of the other columns. Codings with this property are called orthogonal contrasts.

What is an orthogonal contrast? Intuitively, in a design matrix of orthogonal contrasts, the overall effect of one column is independent of the other columns. Mathematically, two columns of a design matrix are orthogonal if their elementwise product sums to zero (equivalently, if their dot product is zero).
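A quick check with some hypothetical columns: a pair of sum-coded contrast columns is orthogonal (dot product zero), while a pair of dummy-coded columns is not:

```python
import numpy as np

# contrast columns for two two-level factors (hypothetical):
c1 = np.array([1, 1, -1, -1])
c2 = np.array([1, -1, 1, -1])
print(np.dot(c1, c2))  # 0: orthogonal

# dummy-coded (treatment) columns for the same design:
d1 = np.array([0, 0, 1, 1])
d2 = np.array([0, 1, 0, 1])
print(np.dot(d1, d2))  # 1: not orthogonal
```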

Throughout this course so far, recall that we have coded categorical variables using treatment or dummy coding. In this case, the Intercept term is the mean of one of the subgroups in the design (the reference level) and all other coefficients give deflections from this reference level. Are treatment contrasts orthogonal?

In [24]:
X = patsy.dmatrix('~ design * paper', planes)
X

Out[24]:
DesignMatrix with shape (16, 4)
Intercept  design[T.sleek]  paper[T.80]  design[T.sleek]:paper[T.80]
1                1            1                            1
1                1            1                            1
1                0            1                            0
1                0            1                            0
1                1            1                            1
1                1            1                            1
1                0            1                            0
1                0            1                            0
1                1            0                            0
1                1            0                            0
1                0            0                            0
1                0            0                            0
1                1            0                            0
1                1            0                            0
1                0            0                            0
1                0            0                            0
Terms:
'Intercept' (column 0)
'design' (column 1)
'paper' (column 2)
'design:paper' (column 3)
In [25]:
# example for coefficients of design and paper:
np.sum(X[:, 1] * X[:, 2])

Out[25]:
array(4.0)

So under treatment coding we are not using orthogonal contrasts (the sum of the product of the columns is not zero). The variance attributed to one term therefore depends on the other terms, making ANOVAs based on these design matrices hard or impossible to interpret.

The fix for this is easy: we can tell Patsy to code our categorical variables using orthogonal contrasts. The most common of these is Sum coding, in which the values for each category sum to zero. This means that the coefficients in the regression represent deviations from the overall mean across the category levels. There are a number of other contrasts available in Patsy, which can test different hypotheses about the variables. See here, here and here for further reading.

What does the design matrix look like for sum coding?

In [26]:
X = patsy.dmatrix('~ C(design, Sum) * C(paper, Sum)', planes)
print(X)

[[ 1. -1. -1.  1.]
 [ 1. -1. -1.  1.]
 [ 1.  1. -1. -1.]
 [ 1.  1. -1. -1.]
 [ 1. -1. -1.  1.]
 [ 1. -1. -1.  1.]
 [ 1.  1. -1. -1.]
 [ 1.  1. -1. -1.]
 [ 1. -1.  1. -1.]
 [ 1. -1.  1. -1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1. -1.  1. -1.]
 [ 1. -1.  1. -1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]


Instead of our levels being either 0 or 1, they're now either -1 or 1. Now, the sum over our contrast columns is zero (note the intercept is not a contrast, so does not sum to zero), and their cross products are all zero:

In [27]:
print('sum of columns = ' + str(X.sum(axis=0)))
print('sum of products of design and paper = ' + str((X[:, 1] * X[:, 2]).sum()))
print('sum of products of design and interaction = ' + str((X[:, 1] * X[:, 3]).sum()))

sum of columns = [ 16.   0.   0.   0.]
sum of products of design and paper = 0.0
sum of products of design and interaction = 0.0


Using these orthogonal contrasts, let's fit the model and look at the ANOVA table:

In [28]:
fit_full = smf.ols('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
print(fit_full.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               distance   R-squared:                       0.727
Model:                            OLS   Adj. R-squared:                  0.659
Method:                 Least Squares   F-statistic:                     10.66
Date:                Tue, 24 Feb 2015   Prob (F-statistic):            0.00106
Time:                        20:51:51   Log-Likelihood:                -18.584
No. Observations:                  16   AIC:                             45.17
Df Residuals:                      12   BIC:                             48.26
Df Model:                           3
Covariance Type:            nonrobust
================================================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------
Intercept                                        3.6853      0.223     16.514      0.000         3.199     4.171
C(design, Sum)[S.simple]                        -0.1552      0.223     -0.696      0.500        -0.641     0.331
C(paper, Sum)[S.50]                              0.3277      0.223      1.469      0.168        -0.158     0.814
C(design, Sum)[S.simple]:C(paper, Sum)[S.50]    -1.2090      0.223     -5.418      0.000        -1.695    -0.723
==============================================================================
Omnibus:                        0.751   Durbin-Watson:                   2.909
Prob(Omnibus):                  0.687   Jarque-Bera (JB):                0.745
Skew:                           0.359   Prob(JB):                        0.689
Kurtosis:                       2.224   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


What do the coefficients mean? The intercept coefficient is now the grand mean of distance (recall that in the Treatment contrast regime it was the mean distance when the design was "simple" and the paper weight was "50"):

In [29]:
np.round(planes['distance'].mean(), decimals=4)

Out[29]:
3.6852

The coefficient values for the contrasts tell us the average deviations from this grand mean for each category:

In [30]:
# example for design:
print(fit_full.params.iloc[0] + fit_full.params.iloc[1])  # model prediction for simple
print(fit_full.params.iloc[0] - fit_full.params.iloc[1])  # model prediction for sleek
print('\n\n Actual Means:')
print(planes.groupby('design').distance.mean())  # actual means

3.53
3.8405

Actual Means:
design
simple    3.5300
sleek     3.8405
Name: distance, dtype: float64


## ANOVA table for this model

In [31]:
anova_table = sms.anova.anova_lm(fit_full) # type I SS by default; same answer in this case
print(anova_table)

                              df     sum_sq    mean_sq          F    PR(>F)
C(design, Sum)                 1   0.385641   0.385641   0.484016  0.499861
C(paper, Sum)                  1   1.718721   1.718721   2.157158  0.167628
C(design, Sum):C(paper, Sum)   1  23.386896  23.386896  29.352777  0.000156
Residual                      12   9.561029   0.796752        NaN       NaN


## Relating the ANOVA table to nested model comparisons

Recall that above we talked about ANOVA as a comparison of nested models. I'll reproduce the diagram here:

We can produce all the sums of squares from the ANOVA table by computing the relevant change in residual (error) sum of squares for each model.

### Fit all models in the diagram

In [32]:
# fit all the models in the tree
null = smf.ols('distance ~ 1', planes).fit()
design = smf.ols('distance ~ C(design, Sum)', planes).fit()
paper = smf.ols('distance ~ C(paper, Sum)', planes).fit()
additive = smf.ols('distance ~ C(design, Sum) + C(paper, Sum)', planes).fit()
full = smf.ols('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()


### The null model represents the total sum of squared residuals

In [33]:
print(null.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               distance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                       nan
Date:                Tue, 24 Feb 2015   Prob (F-statistic):                nan
Time:                        20:51:51   Log-Likelihood:                -28.977
No. Observations:                  16   AIC:                             59.95
Df Residuals:                      15   BIC:                             60.73
Df Model:                           0
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      3.6853      0.382      9.643      0.000         2.871     4.500
==============================================================================
Omnibus:                        0.701   Durbin-Watson:                   1.723
Prob(Omnibus):                  0.704   Jarque-Bera (JB):                0.714
Skew:                           0.351   Prob(JB):                        0.700
Kurtosis:                       2.239   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


This model explains 0% of the variance in the data. Because it's predicting every data point to be the mean, the residual variance is equal to the total variance:

In [34]:
null.mse_total  # mean squared error total

Out[34]:
2.336819133333333
In [35]:
null.mse_resid

Out[35]:
2.336819133333333
In [36]:
# this is the same as the variance in distance:
planes['distance'].var()

Out[36]:
2.3368191333333317
In [37]:
# let's calculate the sum of squared residuals manually:
res = planes['distance'] - null.predict()  # raw residuals (data minus prediction)
res = res ** 2  # squared residuals
res.sum()

Out[37]:
35.052286999999993
In [38]:
# the ssr is also contained in the model object:
null.ssr

Out[38]:
35.052286999999993

### Main effect of design

Recall that the SS for each model term can be calculated by comparing the change in the residual sum of squares for nested models. We do that here:

In [39]:
## Main effect of design, ignoring paper:
print(null.ssr - design.ssr)

## sum of simple effects of design, eliminating paper:
print(paper.ssr - additive.ssr)

## main effect from ANOVA table:
print(anova_table['sum_sq'].iloc[0])

0.385641
0.385641
0.385641


### Main effect of paper

In [40]:
## Main effect of paper, ignoring design:
print(null.ssr - paper.ssr)

## sum of simple effects of paper, eliminating design:
print(design.ssr - additive.ssr)

## main effect from ANOVA table:
print(anova_table['sum_sq'].iloc[1])

1.718721
1.718721
1.718721


### Interaction effect

In [41]:
## interaction term (change in SSR from adding the interaction to the additive model):
print(additive.ssr - full.ssr)

## interaction effect from ANOVA table:
print(anova_table['sum_sq'].iloc[2])

23.386896
23.386896


# Partitioning of residuals, variance explained

As we add terms to the model, the residual sum of squares gets lower. It turns out that for any model, its $R^2$ (the coefficient of determination) is given by the proportion of the variance explained, relative to the null model:

In [42]:
# e.g. for full model:
print(full.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:               distance   R-squared:                       0.727
Model:                            OLS   Adj. R-squared:                  0.659
Method:                 Least Squares   F-statistic:                     10.66
Date:                Tue, 24 Feb 2015   Prob (F-statistic):            0.00106
Time:                        20:51:51   Log-Likelihood:                -18.584
No. Observations:                  16   AIC:                             45.17
Df Residuals:                      12   BIC:                             48.26
Df Model:                           3
Covariance Type:            nonrobust
================================================================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------------------------
Intercept                                        3.6853      0.223     16.514      0.000         3.199     4.171
C(design, Sum)[S.simple]                        -0.1552      0.223     -0.696      0.500        -0.641     0.331
C(paper, Sum)[S.50]                              0.3277      0.223      1.469      0.168        -0.158     0.814
C(design, Sum)[S.simple]:C(paper, Sum)[S.50]    -1.2090      0.223     -5.418      0.000        -1.695    -0.723
==============================================================================
Omnibus:                        0.751   Durbin-Watson:                   2.909
Prob(Omnibus):                  0.687   Jarque-Bera (JB):                0.745
Skew:                           0.359   Prob(JB):                        0.689
Kurtosis:                       2.224   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [43]:
diff = null.ssr - full.ssr
# express as a proportion of the total (null model) residual:
prop = diff / null.ssr
print(prop)

0.727235230044

In [44]:
# same as R-squared:
print(full.rsquared)

0.727235230044


# F-tests and t-tests

As you may know from your basic statistics courses, the sum of squares for each term is divided by that term's degrees of freedom (the number of parameters it adds) to give the mean square column, and the ratio of a term's mean square to the residual mean square gives the F-statistic. This is then compared against the F-distribution (with the term's and residual degrees of freedom) to test for significance.

In [45]:
print(anova_table)

                              df     sum_sq    mean_sq          F    PR(>F)
C(design, Sum)                 1   0.385641   0.385641   0.484016  0.499861
C(paper, Sum)                  1   1.718721   1.718721   2.157158  0.167628
C(design, Sum):C(paper, Sum)   1  23.386896  23.386896  29.352777  0.000156
Residual                      12   9.561029   0.796752        NaN       NaN
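As a sanity check, we can reproduce one F-value by hand from the sums of squares in the table above (here, the interaction term):

```python
# values copied from the ANOVA table above:
ss_interaction, df_interaction = 23.386896, 1
ss_resid, df_resid = 9.561029, 12

ms_interaction = ss_interaction / df_interaction  # mean square for the term
ms_resid = ss_resid / df_resid                    # mean square error
print(round(ms_interaction / ms_resid, 4))        # 29.3528, matching the table
```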


Consider the regression table for the full model above. How are the t-statistics related to the F-values for the model terms?

In the $2 \times 2$ design here, the t-tests are equivalent to the F-tests for the factors: each t-value is the square root of the corresponding F-value (i.e. $F = t^2$).

In [46]:
full.tvalues[1:] ** 2

Out[46]:
C(design, Sum)[S.simple]                         0.484016
C(paper, Sum)[S.50]                              2.157158
C(design, Sum)[S.simple]:C(paper, Sum)[S.50]    29.352777
dtype: float64
In [47]:
print(anova_table['F'])

C(design, Sum)                   0.484016
C(paper, Sum)                    2.157158
C(design, Sum):C(paper, Sum)    29.352777
Residual                              NaN
Name: F, dtype: float64


# One way to visualise variance partitioning

Recall that when we do ANOVA, we're partitioning the total variance into the variance associated with each factor, and that this is equivalent to nested model comparison. We can visualise this as a pie chart:

In [48]:
# plot the explained variances as a pie chart:
labels = 'design', 'paper', 'interaction', 'residual'
sizes = anova_table['sum_sq']
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral']
plt.pie(sizes, labels=labels,
        autopct='%1.0f%%',
        colors=colors,
        startangle=90)
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')
plt.show()


Notice that the residual error variance is 27%. This is $1 - R^2$, expressed as a percentage.
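We can check that figure directly from the sums of squares in the ANOVA table (values copied by hand):

```python
# sums of squares copied from the ANOVA table above:
ss = {'design': 0.385641, 'paper': 1.718721,
      'interaction': 23.386896, 'residual': 9.561029}
ss_total = sum(ss.values())
print(round(ss['residual'] / ss_total, 4))  # 0.2728, i.e. 1 - R^2
```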

## Aside: Types of Sums of Squares

In unbalanced designs, the "Types of Sums of Squares" become relevant. The different Types (Type I, II, III, ...) correspond to different ways of comparing models up and down the nesting diagram shown above. That is, they correspond to different nested model comparisons -- and with unbalanced data, it matters which comparisons you do.
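To see the order dependence concretely, here is a sketch using a tiny, deliberately unbalanced dataset (hypothetical numbers): with sequential (Type I) sums of squares, the SS attributed to a factor changes depending on whether it is entered first or second in the formula:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# a tiny unbalanced two-factor dataset (hypothetical numbers):
df = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y'],
                   'b': ['p', 'q', 'p', 'q', 'q'],
                   'y': [1.0, 2.0, 1.5, 3.0, 2.5]})

fit_ab = smf.ols('y ~ C(a, Sum) + C(b, Sum)', df).fit()  # a entered first
fit_ba = smf.ols('y ~ C(b, Sum) + C(a, Sum)', df).fit()  # a entered second

# Type I (sequential) SS for factor a differs between the two orderings:
print(anova_lm(fit_ab).loc['C(a, Sum)', 'sum_sq'])
print(anova_lm(fit_ba).loc['C(a, Sum)', 'sum_sq'])
```

Entered first, factor a gets SS(a | intercept); entered second, it gets SS(a | b) -- in a balanced design these coincide, but here they do not.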

A complete discussion of the different sums of squares is beyond what we want to cover in this course. For some starter reading, see here, here.

The main thing we want to emphasise is: ANOVA is about model comparison. Think about what hypotheses make sense to test for your dataset.

# ANOVA as model comparison: summary

We've provided here a brief overview of one way to think about ANOVA: it's comparing the incremental change in the sum of squared residuals as you add terms to the model. In this sense, ANOVA is about model comparison, using the change in residuals to decide whether a term is important.

F-ratios are not the only way we could compare the models. ANOVAs are just a historically popular way to do model comparison -- mainly because they are computable by hand. One advantage of thinking about ANOVAs as model comparisons is that once you understand the idea, it's easier to think how to use other types of comparisons.

For example, if we wanted to emphasise parsimonious models (i.e. keeping the number of parameters low), we could consider using an information criterion like the AIC or BIC to compare the models. The structure of the comparisons and their hypotheses are the same, however. For a comprehensive treatment of this type of model comparison, see [3].

Another possibility is using regularisation -- we fit the model but penalise the fit for having many coefficients far from zero. For example, Lasso regression (also called L1 norm regression) penalises the sum of the coefficients from growing too large. This has the effect of setting many coefficients to zero -- i.e. removing those terms from the model. This is equivalent to using a model lower down the nesting hierarchy drawn above.
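A minimal sketch of that idea, using Statsmodels' elastic-net fitter with a pure L1 penalty on simulated data (the penalty weight `alpha=0.3` is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(1)
df = pd.DataFrame(rng.randn(100, 3), columns=['x1', 'x2', 'x3'])
df['y'] = 2.0 * df['x1'] + 0.5 * rng.randn(100)  # only x1 truly matters here

# pure L1 (Lasso) penalty: L1_wt=1.0 selects the Lasso corner of the elastic net
fit = smf.ols('y ~ x1 + x2 + x3', df).fit_regularized(alpha=0.3, L1_wt=1.0)
print(fit.params)  # x2 and x3 are shrunk to (or very near) zero
```

The zeroed coefficients are effectively dropped terms -- the penalised fit lands on a model lower down the nesting hierarchy.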

# ANOVA in GLMs

We've demonstrated ANOVA for the linear model. The concepts inherent in ANOVA can be extended to Generalised Linear Models with non-Gaussian errors and non-identity link functions. To do this, we need to consider a generalisation of the squared residuals, called the deviance.

The deviance is the difference in log likelihood (i.e. the likelihood of the data under the model) between a saturated model (that contains a parameter for every data point, and thus predicts the data perfectly) and the model in question. Intuitively, think of deviance as the "amount by which our model predictions deviate from a perfect prediction". In the Gaussian + identity model, the deviance reduces to the sum of squared residuals:

In [49]:
# anova table from interaction model:
print(anova_table)

                              df     sum_sq    mean_sq          F    PR(>F)
C(design, Sum)                 1   0.385641   0.385641   0.484016  0.499861
C(paper, Sum)                  1   1.718721   1.718721   2.157158  0.167628
C(design, Sum):C(paper, Sum)   1  23.386896  23.386896  29.352777  0.000156
Residual                      12   9.561029   0.796752        NaN       NaN

In [50]:
# deviance of the equivalent GLM fit (Gaussian family with identity link by default):
fit_glm = smf.glm('distance ~ C(design, Sum) * C(paper, Sum)', planes).fit()
print(fit_glm.deviance)

9.561029


For GLMs with non-Gaussian errors and non-identity link functions, ANOVA becomes "Analysis of Deviance", but the conceptual principles remain the same. We perform nested model comparison, comparing how the deviance is reduced by adding terms to the model.

Unfortunately Statsmodels has not implemented analysis of deviance as a function yet (as of version 0.6.1), so it's not as straightforward to do as in the linear model cases outlined above. However, analysis of deviance can be easily performed in R or MATLAB, or by hand in Statsmodels using model comparisons as before. Remember to take degrees of freedom into account.

# References

[1] Venables, W. N. (1998). Exegeses on linear models. In S-Plus User’s Conference, Washington DC. Retrieved from here

[2] Howell, D. C. (2002). Statistical Methods for Psychology (5th Edition).

[3] Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: a practical information-theoretic approach. New York: Springer.

In [1]:
from IPython.core.display import HTML

def css_styling():
    styles = open("../custom_style.css", "r").read()
    return HTML(styles)

css_styling()

Out[1]: