ANOVA (Analysis Of Variance) lets us test whether several samples share the same mean. It's similar to the independent two-sample t-test, but isn't restricted to just two samples. We will look at one-way and two-way ANOVA, and by the end of this guide you should be confident using both methods in your own work.
One-way ANOVA can be used when our data is categorised by one variable. For example, take the following dataframe of the force applied to a driver's head during a crash in three different sizes of car:
import numpy as np
import pandas as pd
cars = pd.DataFrame({"Compact": [543, 555, 502, 534, 611, 622], "Medium": [566,520,580,498,511,560], "Large": [600,530,498,460,478,560]})
print(cars)
   Compact  Large  Medium
0      543    600     566
1      555    530     520
2      502    498     580
3      534    460     498
4      611    478     511
5      622    560     560
A simple preliminary technique is to plot the boxplots of the different sets using matplotlib:
import matplotlib.pyplot as plt
plt.boxplot([cars["Compact"], cars["Medium"], cars["Large"]])
plt.show()
It certainly looks like there is some kind of downward trend as the car gets larger, but we need a way of statistically testing whether the means are different. ANOVA lets us do this, and works on the assumption that every data point is drawn independently from a normal distribution, and that these distributions all have the same variance. Each group can of course have a different mean (otherwise there would be nothing to test), but the variances must be equal. In ANOVA, the null hypothesis is that all the group means are equal, and the alternative hypothesis is that at least one of them differs.
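One way to sanity-check the equal-variance assumption before running ANOVA is Levene's test from scipy.stats. A quick sketch using the car data from above (the exact statistic and p-value you see may differ slightly between SciPy versions):

```python
import scipy.stats as stats

# Car-crash force data from the dataframe above
compact = [543, 555, 502, 534, 611, 622]
medium = [566, 520, 580, 498, 511, 560]
large = [600, 530, 498, 460, 478, 560]

# Levene's test: the null hypothesis is that all groups have equal variance
stat, p = stats.levene(compact, medium, large)
print(stat, p)

# A large p-value means no evidence against equal variances,
# so the equal-variance assumption looks reasonable here
```

A small p-value here would be a warning sign that ANOVA's equal-variance assumption is violated.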
Scipy has a really nice and easy way to do this: the f_oneway() function:
import scipy.stats as stats
print(stats.f_oneway(cars["Compact"], cars["Medium"], cars["Large"]))
F_onewayResult(statistic=1.1973440463010667, pvalue=0.3292756379801583)
Here a p-value of 0.33 gives us no evidence against the null hypothesis that the means are equal. If we want more detailed output, we can use the statsmodels library to get more control over what's happening. For this we need to reshape our dataframe a little. The melt function in pandas lets us do this really easily:
cars2 = pd.melt(cars, var_name="Type", value_name="Force")
print(cars2)
       Type  Force
0   Compact    543
1   Compact    555
2   Compact    502
3   Compact    534
4   Compact    611
5   Compact    622
6     Large    600
7     Large    530
8     Large    498
9     Large    460
10    Large    478
11    Large    560
12   Medium    566
13   Medium    520
14   Medium    580
15   Medium    498
16   Medium    511
17   Medium    560
Now we can use the statsmodels library to build a model and then perform the analysis:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Fit a linear model using the formula "Force ~ C(Type)"; C() marks Type as categorical
model = ols('Force ~ C(Type)', data=cars2).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # Produce the ANOVA table for the fitted model
print(anova_table)
                sum_sq    df         F    PR(>F)
C(Type)    4854.777778   2.0  1.197344  0.329276
Residual  30409.666667  15.0       NaN       NaN
Here we can see the test statistic (1.197) and our p-value again (0.329), as well as the sum-of-squares breakdown. This approach may seem like unnecessary extra work for one-way ANOVA, but it becomes essential in the next section.
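To see where the F statistic in the table comes from, we can reproduce it by hand: each sum of squares is divided by its degrees of freedom to give a mean square, and F is the ratio of the two mean squares. A quick sketch using the numbers from the table above:

```python
# Values taken from the ANOVA table above
ss_between, df_between = 4854.777778, 2    # C(Type) row
ss_within, df_within = 30409.666667, 15    # Residual row

# Mean squares: sum of squares divided by degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean square
F = ms_between / ms_within
print(F)  # about 1.1973, matching the value statsmodels reported
```

Large F values mean the variation between group means is large relative to the variation within groups, which is what pushes the p-value down.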
Two-way ANOVA is similar to one-way ANOVA in that we're testing categorically split data for equal means. However, this time our data is sorted into two different types of category, and what we're testing for is slightly different. The assumptions here are the same as for one-way ANOVA, but the groups must also all be the same size. We are also testing three null hypotheses at once:
- the means of all groups defined by the first category are equal;
- the means of all groups defined by the second category are equal;
- there is no interaction between the two categories.
Here we are looking at the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. As this dataset is too large to copy out, we're going to import it. You can download the data file from the same folder this notebook is in.
teeth = pd.read_csv("ToothGrowth.csv")
print(teeth.head())
    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5
The data is split into three columns: our dependent variable, the length of the odontoblasts ("len"); the form of the supplement given ("supp": OJ is orange juice, VC is ascorbic acid); and the dose of vitamin C given ("dose"). Statsmodels lets us perform a two-way ANOVA test using a similar method to above:
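Since two-way ANOVA as described above expects equally sized groups, it's worth checking that the design is balanced before fitting. A sketch of such a check, using a small hypothetical stand-in frame (the dose-1.0 values are made up for illustration; the same groupby call works on the teeth dataframe):

```python
import pandas as pd

# Small hypothetical stand-in for the ToothGrowth data; run the same
# check on the real `teeth` dataframe loaded from ToothGrowth.csv
toy = pd.DataFrame({
    "len":  [4.2, 11.5, 7.3, 5.8, 16.5, 20.0, 15.2, 17.3],
    "supp": ["VC", "VC", "OJ", "OJ", "VC", "VC", "OJ", "OJ"],
    "dose": [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0],
})

# Count observations in every (supp, dose) cell
group_sizes = toy.groupby(["supp", "dose"]).size()
print(group_sizes)

# The design is balanced if every cell has the same count
print(group_sizes.nunique() == 1)  # True here
```

If the counts differ between cells, the classical sum-of-squares decomposition no longer cleanly separates the two factors.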
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('len ~ C(supp) + C(dose) + C(supp):C(dose)', data=teeth).fit()  # Two-way formula: C(1) + C(2) + C(1):C(2), where ":" gives the interaction term
anova_table = sm.stats.anova_lm(model, typ=2) #Performs Analysis on this model
print(anova_table)
                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN
The first row gives us a p-value of 2.3e-04, which is strong evidence that the means are not equal when categorised by supplement (i.e. orange juice gives different results than ascorbic acid).
The second row gives us a p-value of 4.05e-18, which again is strong evidence that the means are not equal when categorised by dose (i.e. different doses give different growth).
The third row gives us a p-value of 0.0219, which is evidence of an interaction between the two factors: the effect of the supplement type on tooth growth depends on the dose given.
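One way to make an interaction concrete is a table of cell means: if mean length changes by different amounts across doses for the two supplements, the factors interact. A sketch using a small hypothetical frame (the numbers are made up for illustration; with the real data you would run the same groupby on teeth):

```python
import pandas as pd

# Hypothetical toy data shaped like ToothGrowth; swap in the real
# `teeth` dataframe to inspect the actual interaction
toy = pd.DataFrame({
    "len":  [10, 12, 20, 24, 15, 14, 22, 21],
    "supp": ["OJ", "OJ", "OJ", "OJ", "VC", "VC", "VC", "VC"],
    "dose": [0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0],
})

# Mean length for every (supp, dose) cell, with doses as columns
cell_means = toy.groupby(["supp", "dose"])["len"].mean().unstack()
print(cell_means)

# If the two rows change by different amounts from dose 0.5 to 1.0,
# the effect of dose depends on the supplement: an interaction
```

Here the OJ row rises by 11 while the VC row rises by 7, so in this toy example the dose effect differs between supplements.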
Here we have some data detailing anger scores categorised by gender and by whether or not the subject is an athlete. You can download the data file from the same folder this notebook is in. The "AngerOut" score is a measure of how much a person verbally or physically expresses their anger.
angry = pd.read_csv("angry_moods.csv")
print(angry.head())
   Gender   Sports  AngerOut
0  female  athlete        18
1  female  athlete        14
2  female  athlete        13
3  female  athlete        17
4    male  athlete        16
Can you perform a two-way ANOVA on this data set and conclude whether:
- mean anger scores differ by gender;
- mean anger scores differ between athletes and non-athletes;
- there is an interaction between the two factors?