- Understand the meaning of a null hypothesis
- Be able to construct a null hypothesis
- Be able to calculate a p-value, understand its meaning, and test it against significance level
- Visualize the intervals used to compute p-values for $z$ and $t$ tests

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, erf
import scipy.optimize
```

**Hypothesis**: Going to college increases your starting salary. How do you prove this?

It's not really possible to prove this directly. We can, however, disprove the opposite hypothesis. We construct the opposite hypothesis to what we're interested in, called the *null hypothesis*. The null hypothesis is an assumption of no-difference and/or no-correlation.

In our example, this seems simple at first: *Going to college has no effect on your salary*.

But, maybe the null hypothesis is: *For People who qualified and were accepted to college, attending college has no effect on their salary*.

Or it might be *People who can afford, are smart enough, and motivated enough to go to college but did not, have the same salary as those that did.*

Let's assume we know well enough what our null hypothesis is. Hypothesis testing is the ability to use statistics to disprove the null hypothesis

**Null Hypothesis**: The simplest explanation, typically meaning no correlation or that all data is from the same population. Because a simpler explanation is preferred to a more complex one, the null hypothesis should be our default belief. We construct our null hypothesis as sort-of the opposite of what we want to study. For example, if we want to know if a sample is significantly different than our population, our null hypothesis is that it *is not* significantly different and then we aim to disprove the null hypothesis. Hypothesis testing is about showing significance by disproving or rejecting the null hypothesis.

We construct a null hypothesis and take it to be true. For example, we believe college has no influence on income. This allows us to build a simple probability model. For colelge then, we might take income to be normally distributed. Then we see how likely our *observed* data is according to that null hypothesis model. For example, we check to see if the sample mean of people who graduated from college matches our null hypothesis mean.

I open up a cash4gold store and people bring me their jewlery. I know the probability distribution for gold melting is normal with mean 1060 $^{\circ}$C and my measurements have a standard deviation of 3$^{\circ} $C. I melt a sample at 1045 $^{\circ} $ C and want to know if I should be suspicious.

**Null Hypothesis**: The sample is gold

Let's see if I can disporve this. If the sample is gold, what is the probability of that measurement? Zero, because it's a single point. Let's ask though how big of an interval would we need to include that measurement.

In [2]:

```
from scipy import stats as ss
Z = abs(1045 - 1060.) / 3
interval_p = ss.norm.cdf(Z) - ss.norm.cdf(-Z)
print(interval_p)
```

In [3]:

```
import numpy as np
import matplotlib.pyplot as plt
sample = 1045
mu = 1060
w = mu - sample
x = np.linspace(mu - w - 2, w + mu + 2, 1000)
y = ss.norm.pdf(x, loc=mu, scale=3)
plt.plot(x,y)
plt.fill_between(x, y, where= np.abs(x - mu) < w, color='lightblue')
plt.text(mu,np.max(y) / 3, 'Area={}'.format(np.round(interval_p,6)), fontdict={'size':14}, horizontalalignment='center')
plt.axvline(mu - w, linestyle='--', color='orange')
plt.axvline(mu + w, linestyle='--', color='blue')
plt.xticks([mu - w, mu + w], [sample, mu + w])
plt.yticks([])
plt.ylabel(r'$p(x)$')
plt.xlabel(r'$x$')
plt.show()
```

We would expect to see values outside or at the boundary of this interval 0.00000001% of the time.

Would we call it significant if it was 0.1%? what about 1%? It turns out the convention is 5%. This is called our $\alpha$ or significance level.

We saw in the cash4gold example how to compare a single sample with a known population. What about when we don't know the standard deviation of the population?

I open up a cash4gold store and people bring me their jewlery. I know the probability distribution for gold melting is normal with mean 1060 $^{\circ}$C. I do not know the standard deviation. If I get a sample that melts at 1045$^\circ{}$C, should I be suspicious?

I don't know. We have no way of estimating the standard deviation with one sample, so we can't say anything. If we have at least 3 samples though, we can compute the sample standard deviation

This leads us to the beautiful part about hypothesis testing: since we're assuming the null hypothesis, that the samples are gold, we can use what we learned about normal distributions to estimate the population standard deviation from our samples.

The general approach for hypothesis tests is as follows:

- Start with a null hypothesis, $H_0$, that corresponds to a probability distribution/mass function
- Compute the probability of an interval which
*inculdes*values as extreme as your sample data. Note this may be single- or double-sided depending on if "extreme" means both above and below or only above/below the mean. This is your $p$-value. - Reject the null hypothesis if the $p$-value is below your significance level ($\alpha$), which is generally 0.05

I open up a cash4gold store and people bring me their jewlery. I know the probability distribution for gold melting is normal with mean 1060 $^{\circ}$C. I do not know the standard deviation. Someone brings in 6 samples and they melt at 1035, 1050, 1020, 1055, and 1046 $^{\circ}$C. Should I reject the null hpyothesis, that these are gold?

- Assuming the null hypothesis, compute our uncertainty in the true mean confidence interval. This is our probability distribution
- We happen to have the true mean, so then we see how big the confidence interval has to be to include it. This is us constructing our probability interval
- Take the area of that interval, our $p$-value, and compare with the significance level

In [4]:

```
import numpy as np
import matplotlib.pyplot as plt
samples = np.array([1035., 1050., 1020., 1055., 1046.])
sigma = np.sqrt(np.var(samples, ddof=1))
sample_mean = np.mean(samples)
mu = 1060
w = mu - sample_mean
x = np.linspace(mu - w - 2, w + mu + 2, 1000)
y1 = ss.norm.pdf(x, loc=mu, scale=3)
y2 = ss.t.pdf(x, loc=mu, scale=sigma / np.sqrt(len(samples)), df=len(samples) - 1)
plt.plot(x,y1, label='single sample')
plt.plot(x,y2, label='multiple samples')
plt.fill_between(x, y1, where= np.abs(x - mu) < w, color='lightblue')
plt.axvline(mu - w, linestyle='--', color='orange')
plt.axvline(mu + w, linestyle='--', color='gray')
plt.xticks([mu - w, mu + w], [sample_mean, mu + w])
plt.yticks([])
plt.ylabel(r'$p(x)$')
plt.xlabel(r'$x$')
plt.legend(loc='best')
plt.show()
```

In [5]:

```
samples = np.array([1035., 1050., 1020., 1055., 1046.])
sigma = np.sqrt(np.var(samples, ddof=1))
sample_mean = np.mean(samples)
true_mean = 1060.
print(sigma, sample_mean, true_mean)
T = (sample_mean - true_mean) / (sigma / np.sqrt(len(samples)))
print(T)
```

Now we have a $T$ and we know $P(T)$. However, just like the $zM$ test, we can't compute $P(T)$ since that's a single point and we're using a continuous distribution. So instead, we build an interval and see how big it must be to catch that $T$.

Specifically, we want to find $\int_{-T}^T p(t)\,dt$

In [6]:

```
print('T = ', T)
p = ss.t.cdf(T, len(samples) - 1) # The integral from -infinity to T
print(p, 'Is the single sided p-value')
p = 1 - (ss.t.cdf(-T, len(samples) - 1) - ss.t.cdf(T, len(samples) - 1)) # 1- The integral from -T to T
print(p, 'Is the double sided p-value')
print('notice, just using 2 * the single-sided value gives the same answer')
```

In [7]:

```
T = (true_mean - sample_mean) / (sigma / sqrt(len(samples)))
print (T)
p = (scipy.stats.t.cdf(T, len(samples) - 1))
print ('CDF gives: ', p)
print ('Recognize that includes from -infty up to a positive T, so we need to find the complementary area')
print (1 - p, 'Is the single sided p-value')
print ((1 - p) * 2, 'Is the double sided p-value')
```

**Data Type:** Measurements and Ranks

**Compares:** A single sample vs a normally distributed population

**Null Hypothesis:** The sample came from the population

**Conditions:** Standard deviation of population is known

**Related Test 1:** Student's $t$-test, which is used when the standard deviation is not known

**Python:** Integrate an interval with a Z-statistic and `erf`

or `scipt.stats.norm.cdf`

**Data Type:** Measurements and Ranks

**Compares:** A set of samples vs a normally distributed population

**Null Hypothesis:** The sample came from the population

**Conditions:** Standard deviation of population is not known

**Related Test 1:** $zM$ test, which is used when standard deviation is known

**Python:** Integrate an interval with a T-stastic and `scipt.stats.t.cdf`