ANOVA (Analysis Of Variance) lets us test whether several samples share the same mean. It's similar to the independent two-sample t-test, but isn't restricted to just two samples. We will look at one-way and two-way ANOVA, and by the end of this guide you should be confident using both methods in your own work.
One-way ANOVA can be used when our data is categorised by one variable. For example, take the following dataframe of the force applied to a driver's head during a crash in three different sizes of car:
import numpy as np
import pandas as pd
cars = pd.DataFrame({"Compact": [543, 555, 502, 534, 611, 622], "Medium": [566,520,580,498,511,560], "Large": [600,530,498,460,478,560]})
print(cars)
   Compact  Large  Medium
0      543    600     566
1      555    530     520
2      502    498     580
3      534    460     498
4      611    478     511
5      622    560     560
A simple preliminary technique is to plot the boxplots of the different sets using matplotlib:
import matplotlib.pyplot as plt
plt.boxplot([cars["Compact"], cars["Medium"], cars["Large"]])
plt.show()
It certainly looks like there is some kind of downward trend as the car gets larger, but we need a way of statistically testing whether the means are different. ANOVA lets us do this, and works on the assumption that every data point is drawn independently from a normal distribution, and that these distributions all have the same variance. Each group can of course have a different mean (otherwise there would be nothing to test), but the variances must be equal. In ANOVA, the null hypothesis is that all the group means are equal, and the alternative hypothesis is that at least one of them differs.
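One way to sanity-check the equal-variance assumption before running ANOVA is Levene's test from scipy.stats. A quick sketch using the car data from above (the exact statistic and p-value you see may differ slightly between SciPy versions):

```python
import scipy.stats as stats

# Car-crash force data from the dataframe above
compact = [543, 555, 502, 534, 611, 622]
medium = [566, 520, 580, 498, 511, 560]
large = [600, 530, 498, 460, 478, 560]

# Levene's test: the null hypothesis is that all groups have equal variance
stat, p = stats.levene(compact, medium, large)
print(stat, p)

# A large p-value means no evidence against equal variances,
# so the equal-variance assumption looks reasonable here
```

A small p-value here would be a warning sign that ANOVA's equal-variance assumption is violated.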
Scipy has a really nice and easy way to do this: the f_oneway() function:
import scipy.stats as stats
print(stats.f_oneway(cars["Compact"], cars["Medium"], cars["Large"]))
F_onewayResult(statistic=1.1973440463010667, pvalue=0.3292756379801583)
Here a p-value of 0.33 gives us no evidence against the null hypothesis that the means are equal. If we want more detailed output, we can use the statsmodels library to get more control over what's happening. For this we need to reshape our dataframe a little. The melt function in pandas lets us do this really easily:
cars2 = pd.melt(cars, var_name="Type", value_name="Force")
print(cars2)
       Type  Force
0   Compact    543
1   Compact    555
2   Compact    502
3   Compact    534
4   Compact    611
5   Compact    622
6     Large    600
7     Large    530
8     Large    498
9     Large    460
10    Large    478
11    Large    560
12   Medium    566
13   Medium    520
14   Medium    580
15   Medium    498
16   Medium    511
17   Medium    560
Now we can use the statsmodels library to build a model and then perform the analysis:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Fit a linear model using the formula "Force ~ C(Type)"; C() marks Type as categorical
model = ols('Force ~ C(Type)', data=cars2).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # Produce the ANOVA table for the fitted model
print(anova_table)
                sum_sq    df         F    PR(>F)
C(Type)    4854.777778   2.0  1.197344  0.329276
Residual  30409.666667  15.0       NaN       NaN
Here we can see the test statistic (1.197) and our p-value again (0.329), as well as the sum-of-squares breakdown. This approach may seem like unnecessary extra work for one-way ANOVA, but it becomes essential in the next section.
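To see where the F statistic in the table comes from, we can reproduce it by hand: each sum of squares is divided by its degrees of freedom to give a mean square, and F is the ratio of the two mean squares. A quick sketch using the numbers from the table above:

```python
# Values taken from the ANOVA table above
ss_between, df_between = 4854.777778, 2    # C(Type) row
ss_within, df_within = 30409.666667, 15    # Residual row

# Mean squares: sum of squares divided by degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean square
F = ms_between / ms_within
print(F)  # about 1.1973, matching the value statsmodels reported
```

Large F values mean the variation between group means is large relative to the variation within groups, which is what pushes the p-value down.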
Two-way ANOVA is similar to one-way ANOVA in that we're testing categorically split data for equal means. However, this time our data is sorted into two different types of category, and what we're testing for is slightly different. The assumptions here are the same as for one-way ANOVA, but the groups must also all be the same size. We are also testing three null hypotheses at once:
- the means of all groups defined by the first category are equal;
- the means of all groups defined by the second category are equal;
- there is no interaction between the two categories.
Here we are looking at the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. As this dataset is too large to copy out, we're going to import it. You can download the data file from the same folder this notebook is in.
teeth = pd.read_csv("ToothGrowth.csv")
print(teeth.head())
    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5
The data is split into three columns: our dependent variable, the length of the odontoblasts ("len"); the form of the supplement given ("supp": OJ is orange juice, VC is ascorbic acid); and the dose of vitamin C given ("dose"). Statsmodels lets us perform a two-way ANOVA test using a similar method to above:
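Since two-way ANOVA as described above expects equally sized groups, it's worth checking that the design is balanced before fitting. A sketch of such a check, using a small hypothetical stand-in frame (the dose-1.0 values are made up for illustration; the same groupby call works on the teeth dataframe):

```python
import pandas as pd

# Small hypothetical stand-in for the ToothGrowth data; run the same
# check on the real `teeth` dataframe loaded from ToothGrowth.csv
toy = pd.DataFrame({
    "len":  [4.2, 11.5, 7.3, 5.8, 16.5, 20.0, 15.2, 17.3],
    "supp": ["VC", "VC", "OJ", "OJ", "VC", "VC", "OJ", "OJ"],
    "dose": [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0],
})

# Count observations in every (supp, dose) cell
group_sizes = toy.groupby(["supp", "dose"]).size()
print(group_sizes)

# The design is balanced if every cell has the same count
print(group_sizes.nunique() == 1)  # True here
```

If the counts differ between cells, the classical sum-of-squares decomposition no longer cleanly separates the two factors.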
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('len ~ C(supp) + C(dose) + C(supp):C(dose)', data=teeth).fit()  # Two-way formula: C(1) + C(2) + C(1):C(2), where ":" gives the interaction term
anova_table = sm.stats.anova_lm(model, typ=2) #Performs Analysis on this model
print(anova_table)
                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN
The first row gives us a p-value of 2.3e-04, which is strong evidence that the means are not equal when categorised by supplement (i.e. orange juice gives different results than ascorbic acid).
The second row gives us a p-value of 4.05e-18, which again is strong evidence that the means are not equal when categorised by dose (i.e. different doses give different growth).
The third row gives us a p-value of 0.0219, which is evidence of an interaction between the two factors: the effect of the supplement type on tooth growth depends on the dose given.
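One way to make an interaction concrete is a table of cell means: if mean length changes by different amounts across doses for the two supplements, the factors interact. A sketch using a small hypothetical frame (the numbers are made up for illustration; with the real data you would run the same groupby on teeth):

```python
import pandas as pd

# Hypothetical toy data shaped like ToothGrowth; swap in the real
# `teeth` dataframe to inspect the actual interaction
toy = pd.DataFrame({
    "len":  [10, 12, 20, 24, 15, 14, 22, 21],
    "supp": ["OJ", "OJ", "OJ", "OJ", "VC", "VC", "VC", "VC"],
    "dose": [0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0],
})

# Mean length for every (supp, dose) cell, with doses as columns
cell_means = toy.groupby(["supp", "dose"])["len"].mean().unstack()
print(cell_means)

# If the two rows change by different amounts from dose 0.5 to 1.0,
# the effect of dose depends on the supplement: an interaction
```

Here the OJ row rises by 11 while the VC row rises by 7, so in this toy example the dose effect differs between supplements.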
Here we have some data detailing anger scores categorised by gender and by whether or not the subject is an athlete. You can download the data file from the same folder this notebook is in. The "AngerOut" score is a measure of how much a person verbally or physically expresses their anger.
angry = pd.read_csv("angry_moods.csv")
print(angry.head())
   Gender   Sports  AngerOut
0  female  athlete        18
1  female  athlete        14
2  female  athlete        13
3  female  athlete        17
4    male  athlete        16
Can you perform a two-way ANOVA on this data set and conclude whether:
- mean anger scores differ by gender;
- mean anger scores differ between athletes and non-athletes;
- there is an interaction between the two factors?