#!/usr/bin/env python
# coding: utf-8

# Hypothesis Tests (Parametric)
# ====
#
# ## Unit 10, Lecture 1
#
# *Numerical Methods and Statistics*
#
# ----
#
# ### Reading
#
# Langley: Pages 137-189, 199-211, 230-245
#
# ----
#
# #### Prof. Andrew White, April 3 2018

# Goals:
# ---
#
# 1. Understand the meaning of a null hypothesis
# 2. Be able to construct a null hypothesis
# 3. Be able to calculate a p-value, understand its meaning, and test it against a significance level
# 4. Visualize the intervals used to compute p-values for $z$ and $t$ tests

# In[1]:

import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, erf
import scipy.optimize

# Hypothesis Testing
# ====

# **Hypothesis**: Going to college increases your starting salary. How do you prove this?

# It's not really possible to prove this directly. We can, however, disprove the opposite hypothesis. We construct the opposite of the hypothesis we're interested in, called the *null hypothesis*. The null hypothesis is an assumption of no difference and/or no correlation.

# In our example, this seems simple at first: *Going to college has no effect on your salary*.

# But maybe the null hypothesis is: *For people who qualified and were accepted to college, attending college has no effect on their salary*.
#
# Or it might be: *People who can afford college, are smart enough, and are motivated enough to go but did not, have the same salary as those that did.*

# Let's assume we know well enough what our null hypothesis is. Hypothesis testing is the use of statistics to disprove the null hypothesis.

# **Null Hypothesis**: The simplest explanation, typically meaning no correlation or that all data is from the same population. Because a simpler explanation is preferred to a more complex one, the null hypothesis should be our default belief. We construct our null hypothesis as, roughly, the opposite of what we want to study.
# For example, if we want to know if a sample is significantly different from our population, our null hypothesis is that it *is not* significantly different, and then we aim to disprove the null hypothesis. Hypothesis testing is about showing significance by disproving or rejecting the null hypothesis.

# Hypothesis Testing
# ----
#
# We construct a null hypothesis and take it to be true. For example, we believe college has no influence on income. This allows us to build a simple probability model. For the college example then, we might take income to be normally distributed. Then we see how likely our *observed* data is according to that null hypothesis model. For example, we check to see if the sample mean of people who graduated from college matches our null hypothesis mean.

# Hypothesis Test Example
# ----
#
# I open up a cash4gold store and people bring me their jewelry. I know the probability distribution for gold melting is normal with mean 1060 $^{\circ}$C and my measurements have a standard deviation of 3 $^{\circ}$C. I melt a sample at 1045 $^{\circ}$C and want to know if I should be suspicious.

# **Null Hypothesis**: The sample is gold

# Let's see if I can disprove this. If the sample is gold, what is the probability of that measurement? Zero, because it's a single point in a continuous distribution. Instead, let's ask how big an interval we would need to include that measurement.

# In[2]:

from scipy import stats as ss

Z = abs(1045 - 1060.) / 3
interval_p = ss.norm.cdf(Z) - ss.norm.cdf(-Z)
print(interval_p)

# In[3]:

import numpy as np
import matplotlib.pyplot as plt

sample = 1045
mu = 1060
w = mu - sample
x = np.linspace(mu - w - 2, w + mu + 2, 1000)
y = ss.norm.pdf(x, loc=mu, scale=3)
plt.plot(x, y)
plt.fill_between(x, y, where=np.abs(x - mu) < w, color='lightblue')
plt.text(mu, np.max(y) / 3, 'Area={}'.format(np.round(interval_p, 6)),
         fontdict={'size': 14}, horizontalalignment='center')
plt.axvline(mu - w, linestyle='--', color='orange')
plt.axvline(mu + w, linestyle='--', color='blue')
plt.xticks([mu - w, mu + w], [sample, mu + w])
plt.yticks([])
plt.ylabel(r'$p(x)$')
plt.xlabel(r'$x$')
plt.show()

# We would expect to see values outside or at the boundary of this interval only about 0.00006% of the time (a probability of about $6 \times 10^{-7}$).

# What is significant?
# ----
#
# Would we call it significant if it was 0.1%? What about 1%? It turns out the convention is 5%. This is called our $\alpha$ or significance level.

# We saw in the cash4gold example how to compare a single sample with a known population. What about when we don't know the standard deviation of the population?

# I open up a cash4gold store and people bring me their jewelry. I know the probability distribution for gold melting is normal with mean 1060 $^{\circ}$C. I do not know the standard deviation. If I get a sample that melts at 1045 $^{\circ}$C, should I be suspicious?

# I don't know. We have no way of estimating the standard deviation with one sample, so we can't say anything. If we have at least 3 samples though, we can compute the sample standard deviation.

# This leads us to the beautiful part about hypothesis testing: since we're assuming the null hypothesis, that the samples are gold, we can use what we learned about normal distributions to estimate the population standard deviation from our samples.
#
# ----

# General Strategy
# ----
#
# The general approach for hypothesis tests is as follows:
#
# 1. Start with a null hypothesis, $H_0$, that corresponds to a probability distribution/mass function
# 2. Compute the probability of an interval which *includes* values as extreme as your sample data. Note this may be single- or double-sided depending on whether "extreme" means both above and below the mean, or only above/below it. This is your $p$-value.
# 3. Reject the null hypothesis if the $p$-value is below your significance level ($\alpha$), which is generally 0.05

# Student's $t$-Test
# ====

# I open up a cash4gold store and people bring me their jewelry. I know the probability distribution for gold melting is normal with mean 1060 $^{\circ}$C. I do not know the standard deviation. Someone brings in five samples and they melt at 1035, 1050, 1020, 1055, and 1046 $^{\circ}$C. Should I reject the null hypothesis, that these are gold?

# #### Logical Steps:
#
# 1. Assuming the null hypothesis, compute the uncertainty in the true mean (a confidence interval). This is our probability distribution
# 2. We happen to have the true mean, so then we see how big the confidence interval has to be to include it. This is us constructing our probability interval
# 3. Take the area of that interval, our $p$-value, and compare with the significance level

# In[4]:

import numpy as np
import matplotlib.pyplot as plt

samples = np.array([1035., 1050., 1020., 1055., 1046.])
sigma = np.sqrt(np.var(samples, ddof=1))
sample_mean = np.mean(samples)
mu = 1060
w = mu - sample_mean
x = np.linspace(mu - w - 2, w + mu + 2, 1000)
y1 = ss.norm.pdf(x, loc=mu, scale=3)
y2 = ss.t.pdf(x, loc=mu, scale=sigma / np.sqrt(len(samples)), df=len(samples) - 1)
plt.plot(x, y1, label='single sample')
plt.plot(x, y2, label='multiple samples')
plt.fill_between(x, y1, where=np.abs(x - mu) < w, color='lightblue')
plt.axvline(mu - w, linestyle='--', color='orange')
plt.axvline(mu + w, linestyle='--', color='gray')
plt.xticks([mu - w, mu + w], [sample_mean, mu + w])
plt.yticks([])
plt.ylabel(r'$p(x)$')
plt.xlabel(r'$x$')
plt.legend(loc='best')
plt.show()

# In[5]:

samples = np.array([1035., 1050., 1020., 1055., 1046.])
sigma = np.sqrt(np.var(samples, ddof=1))
sample_mean = np.mean(samples)
true_mean = 1060.
print(sigma, sample_mean, true_mean)

T = (sample_mean - true_mean) / (sigma / np.sqrt(len(samples)))
print(T)

# Now we have a $T$ and we know $P(T)$. However, just like the $zM$ test, we can't compute $P(T)$ since that's a single point and we're using a continuous distribution. So instead, we build an interval and see how big it must be to catch that $T$.
#
# Specifically, we want the area outside the central interval: $p = 1 - \int_{-|T|}^{|T|} p(t)\,dt$

# In[6]:

print('T = ', T)
p = ss.t.cdf(T, len(samples) - 1)  # the integral from -infinity to T
print(p, 'Is the single sided p-value')
p = 1 - (ss.t.cdf(-T, len(samples) - 1) - ss.t.cdf(T, len(samples) - 1))  # 1 minus the central area between T and -T
print(p, 'Is the double sided p-value')
print('notice, just using 2 * the single-sided value gives the same answer')

# What if we accidentally reverse our $T$-value?
# ---

# In[7]:

T = (true_mean - sample_mean) / (sigma / sqrt(len(samples)))
print(T)
p = ss.t.cdf(T, len(samples) - 1)
print('CDF gives: ', p)
print('Recognize that includes from -infty up to a positive T, so we need to find the complementary area')
print(1 - p, 'Is the single sided p-value')
print((1 - p) * 2, 'Is the double sided p-value')

# Summary of Methods for Comparing a Single Measurement with a Normal Population
# ====

# $zM$ Test
# ====
#
# **Data Type:** Measurements and Ranks
#
# **Compares:** A single sample vs a normally distributed population
#
# **Null Hypothesis:** The sample came from the population
#
# **Conditions:** Standard deviation of population is known
#
# **Related Test 1:** Student's $t$-test, which is used when the standard deviation is not known
#
# **Python:** Integrate an interval with a $Z$-statistic and `erf` or `scipy.stats.norm.cdf`

# Student's $t$-test
# ====
#
# **Data Type:** Measurements and Ranks
#
# **Compares:** A set of samples vs a normally distributed population
#
# **Null Hypothesis:** The samples came from the population
#
# **Conditions:** Standard deviation of population is not known
#
# **Related Test 1:** $zM$ test, which is used when the standard deviation is known
#
# **Python:** Integrate an interval with a $T$-statistic and `scipy.stats.t.cdf`
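# As a sanity check (not part of the original lecture), scipy also ships a built-in one-sample $t$-test, `scipy.stats.ttest_1samp`, which computes the same double-sided $p$-value we constructed by hand from `ss.t.cdf`. A minimal sketch comparing the two on the melting-point data:

```python
import numpy as np
from scipy import stats as ss

samples = np.array([1035., 1050., 1020., 1055., 1046.])
true_mean = 1060.

# manual calculation, exactly as in the cells above
sigma = np.sqrt(np.var(samples, ddof=1))
T = (np.mean(samples) - true_mean) / (sigma / np.sqrt(len(samples)))
p_manual = 2 * ss.t.cdf(T, df=len(samples) - 1)  # double-sided, T is negative here

# built-in one-sample t-test: returns the T-statistic and the double-sided p-value
result = ss.ttest_1samp(samples, true_mean)

print('manual:  T = {:.4f}, p = {:.4f}'.format(T, p_manual))
print('builtin: T = {:.4f}, p = {:.4f}'.format(result.statistic, result.pvalue))
# both p-values fall below alpha = 0.05, so we reject the null hypothesis
```

# The built-in test always reports the double-sided $p$-value; if you want a single-sided test, you still need the `ss.t.cdf` construction above.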