#!/usr/bin/env python
# coding: utf-8

# # t-test for comparison of accuracies of two algorithms

# **(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**

# **Version:** 1.1, January 2024

# **License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

# **Prerequisites:**

# In[ ]:

get_ipython().system('pip install -U scipy')

# This is a tutorial on the evaluation of machine learning algorithms and classifiers using simple significance tests.

# This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

# ## Using the t-test on two distributions

# The task is to compare two distributions of accuracy counts over some experimental results. Imagine that we test two algorithms $a$ and $b$ on the same training and test sets of data. We will apply the t-test as provided in the $stats$ module of $scipy$, which we need to import first:

# In[36]:

from scipy import stats

# Imagine that these are our results from two independent algorithms trained and tested on the same pairs of training and test data sets:

# In[37]:

a = [23, 43, 12, 10]
b = [23, 42, 13, 10]

# Since the data sets are the same in both experiments, we can treat the two algorithms as two measurements on the same population (of data). The t-test measures whether the average scores differ significantly. We apply the t-test for two related samples of scores as provided in the $stats$ module:

# In[38]:

stats.ttest_rel(a, b)

# The returned result contains a $pvalue$ (p-value) of 1.0 in this case. The Null Hypothesis of the paired t-test is that the two sets of scores have the same mean, that is, that the two algorithms perform equally well on average. The p-value is the probability of observing a difference in means at least as extreme as the one in our data, given that the Null Hypothesis is true. A p-value of 1.0 means the observed mean difference is exactly zero (here the per-pair differences average out), which is precisely what the Null Hypothesis predicts, so the data give us no grounds at all to reject it.

# Imagine now that the experimental results are:

# In[39]:

a = [23, 43, 12, 10]
b = [4, 15, 3, 9]

# Applying the t-test again will give us a different result (the order of the arguments only flips the sign of the t statistic; the two-sided p-value stays the same):

# In[41]:

stats.ttest_rel(a, b)

# In this case the p-value is approximately 0.09: if the two algorithms really performed equally well on average, we would observe a mean difference at least this large in about 9% of repeated experiments. Remember, the Null Hypothesis is that the mean scores of the two algorithms do not differ. With a significance threshold of 10% we would reject the Null Hypothesis and call the difference significant; with the conventional 5% threshold we would not.

# In[42]:

a = [74, 89, 88, 78]
b = [24, 2, 3, 9]

# In[43]:

stats.ttest_rel(a, b)

# Here the two sets of scores differ substantially, and the p-value falls well below 1%, so we would reject the Null Hypothesis at any of the common thresholds and conclude that the accuracies of the two algorithms differ significantly.
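# ## Checking the computation by hand

# To make the numbers transparent, the next cell recomputes the paired t-test directly from its definition. This is a minimal sketch: the helper $paired\_t\_statistic$ is defined here purely for illustration and is not part of $scipy$. For the per-pair differences $d_i = a_i - b_i$ it computes $t = \bar{d} / (s_d / \sqrt{n})$, where $\bar{d}$ is the mean difference and $s_d$ its sample standard deviation, and then derives the two-sided p-value from the t distribution with $n - 1$ degrees of freedom.

# In[ ]:

from math import sqrt

from scipy import stats


def paired_t_statistic(x, y):
    """Paired t-statistic t = mean(d) / (sd(d) / sqrt(n)) for the
    per-pair differences d = x - y, using the sample standard
    deviation (n - 1 in the denominator). Illustrative helper,
    not part of scipy."""
    n = len(x)
    diffs = [xi - yi for xi, yi in zip(x, y)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / sqrt(var_d / n)


a = [23, 43, 12, 10]
b = [4, 15, 3, 9]

t_manual = paired_t_statistic(a, b)
# Two-sided p-value: P(|T| >= |t|) under a t distribution with
# n - 1 degrees of freedom; stats.t.sf is the survival function.
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(a) - 1)

print(t_manual, p_manual)
print(stats.ttest_rel(a, b))  # should report the same t and p-value

# Both lines should print a t statistic of about 2.42 and a p-value of about 0.09, matching the result we interpreted above.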
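# ## Turning the p-value into a decision

# In practice we usually wrap the test and the decision into a single step. The following cell is a small illustrative helper, again not part of $scipy$: it runs $stats.ttest\_rel$ and compares the returned p-value against a chosen significance threshold $\alpha$. With the scores from the second example above, the decision flips between $\alpha = 0.05$ and $\alpha = 0.10$, exactly as discussed.

# In[ ]:

from scipy import stats


def compare_algorithms(scores_a, scores_b, alpha=0.05):
    """Run a paired t-test on two matched sets of scores and report
    whether the difference in mean scores is significant at level
    alpha. Illustrative helper, not part of scipy."""
    result = stats.ttest_rel(scores_a, scores_b)
    decision = "reject" if result.pvalue < alpha else "fail to reject"
    return result.statistic, result.pvalue, decision


for alpha in (0.05, 0.10):
    t, p, decision = compare_algorithms([23, 43, 12, 10], [4, 15, 3, 9], alpha)
    print(f"alpha={alpha:.2f}: t={t:.3f}, p={p:.3f} -> {decision} the Null Hypothesis")

# (C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))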