#!/usr/bin/env python
# coding: utf-8

# # t-test for comparison of accuracies of two algorithms

# **(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**

# **Version:** 1.1, January 2024

# **License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

# **Prerequisites:**

# In[ ]:

get_ipython().system('pip install -U scipy')

# This is a tutorial on the evaluation of machine learning algorithms and classifiers using simple significance tests.

# This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

# ## Using the t-test on two distributions

# The task is to compare two distributions of accuracy counts over some experimental results. Imagine that we test two algorithms $a$ and $b$ on the same training and test sets of data. We will apply the t-test as provided in the $stats$ module of $scipy$, which we need to import first:

# In[36]:

from scipy import stats

# Imagine that these are our results from two independent algorithms trained and tested on the same pairs of training and test data sets:

# In[37]:

a = [23, 43, 12, 10]
b = [23, 42, 13, 10]

# Since the data sets are the same in both experiments, we can treat the two algorithms as two measurements on the same population (of data). The t-test measures whether the average scores differ significantly. We apply the t-test for two related samples of scores as provided in the $stats$ module:

# In[38]:

stats.ttest_rel(a, b)

# The returned result contains a $pvalue$ (p-value) of 1.0 in this case. The Null Hypothesis of the paired t-test is that the two sets of scores have the same mean, that is, that the two algorithms perform equally well on average. The p-value is the probability of observing a difference in means at least as extreme as the one in our data, given that the Null Hypothesis is true. A p-value of 1.0 means the observed mean difference is exactly zero (here the per-pair differences average out), which is precisely what the Null Hypothesis predicts, so the data give us no grounds at all to reject it.

# Imagine now that the experimental results are:

# In[39]:

a = [23, 43, 12, 10]
b = [4, 15, 3, 9]

# Applying the t-test again will give us a different result (the order of the arguments only flips the sign of the t statistic; the two-sided p-value stays the same):

# In[41]:

stats.ttest_rel(a, b)

# In this case the p-value is approximately 0.09: if the two algorithms really performed equally well on average, we would observe a mean difference at least this large in about 9% of repeated experiments. Remember, the Null Hypothesis is that the mean scores of the two algorithms do not differ. With a significance threshold of 10% we would reject the Null Hypothesis and call the difference significant; with the conventional 5% threshold we would not.

# In[42]:

a = [74, 89, 88, 78]
b = [24, 2, 3, 9]

# In[43]:

stats.ttest_rel(a, b)

# Here the two sets of scores differ substantially, and the p-value falls well below 1%, so we would reject the Null Hypothesis at any of the common thresholds and conclude that the accuracies of the two algorithms differ significantly.
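# ## Checking the computation by hand

# To make the numbers transparent, the next cell recomputes the paired t-test directly from its definition. This is a minimal sketch: the helper $paired\_t\_statistic$ is defined here purely for illustration and is not part of $scipy$. For the per-pair differences $d_i = a_i - b_i$ it computes $t = \bar{d} / (s_d / \sqrt{n})$, where $\bar{d}$ is the mean difference and $s_d$ its sample standard deviation, and then derives the two-sided p-value from the t distribution with $n - 1$ degrees of freedom.

# In[ ]:

from math import sqrt

from scipy import stats


def paired_t_statistic(x, y):
    """Paired t-statistic t = mean(d) / (sd(d) / sqrt(n)) for the
    per-pair differences d = x - y, using the sample standard
    deviation (n - 1 in the denominator). Illustrative helper,
    not part of scipy."""
    n = len(x)
    diffs = [xi - yi for xi, yi in zip(x, y)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / sqrt(var_d / n)


a = [23, 43, 12, 10]
b = [4, 15, 3, 9]

t_manual = paired_t_statistic(a, b)
# Two-sided p-value: P(|T| >= |t|) under a t distribution with
# n - 1 degrees of freedom; stats.t.sf is the survival function.
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(a) - 1)

print(t_manual, p_manual)
print(stats.ttest_rel(a, b))  # should report the same t and p-value

# Both lines should print a t statistic of about 2.42 and a p-value of about 0.09, matching the result we interpreted above.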
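# ## Turning the p-value into a decision

# In practice we usually wrap the test and the decision into a single step. The following cell is a small illustrative helper, again not part of $scipy$: it runs $stats.ttest\_rel$ and compares the returned p-value against a chosen significance threshold $\alpha$. With the scores from the second example above, the decision flips between $\alpha = 0.05$ and $\alpha = 0.10$, exactly as discussed.

# In[ ]:

from scipy import stats


def compare_algorithms(scores_a, scores_b, alpha=0.05):
    """Run a paired t-test on two matched sets of scores and report
    whether the difference in mean scores is significant at level
    alpha. Illustrative helper, not part of scipy."""
    result = stats.ttest_rel(scores_a, scores_b)
    decision = "reject" if result.pvalue < alpha else "fail to reject"
    return result.statistic, result.pvalue, decision


for alpha in (0.05, 0.10):
    t, p, decision = compare_algorithms([23, 43, 12, 10], [4, 15, 3, 9], alpha)
    print(f"alpha={alpha:.2f}: t={t:.3f}, p={p:.3f} -> {decision} the Null Hypothesis")

# (C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))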