**Review of classical hypothesis testing**

One of the most basic and useful statistical procedures is the comparison of two population means based on samples of data from the two populations. Formally, imagine that we have populations with means $\mu_1$ and $\mu_2$, respectively, and we are interested in assessing whether $\mu_1 = \mu_2$.

Let $x_1, \ldots, x_m$ and $y_1, \ldots, y_n$ denote the (*iid*) data from the two populations, respectively, and let $\bar{x}$ and $\bar{y}$ be their sample means. A natural starting point is the difference of sample means $\bar{x} - \bar{y}$ which is an estimate of $\mu_1 - \mu_2$.

Since the data can be on any scale, the value of $\bar{x} - \bar{y}$ does not directly tell us much about the statistical evidence for or against the *null hypothesis* $\mu_1 = \mu_2$. We need to standardize this quantity somehow.

The variance of $\bar{x} - \bar{y}$ is $\sigma_1^2 / n_1 + \sigma_2^2 / n_2$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the two populations and $n_1$ and $n_2$ are the two sample sizes. Thus we can form the *Z-score* $(\bar{x} - \bar{y}) / \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$.

In practice, we do not know $\sigma_1^2$ and $\sigma_2^2$ (they are *nuisance parameters*), so the Z-score that we actually use is $(\bar{x} - \bar{y}) / \sqrt{\hat{\sigma}_1^2 / n_1 + \hat{\sigma}_2^2 / n_2}$.

A Z-score should behave like a standard normal random variable when the null hypothesis is true. Thus any observed Z-score values that fall far enough in the tails of the standard normal distribution don't fit the null hypothesis very well, and therefore might actually come from a population in which the null hypothesis is not true.

**Large scale hypothesis testing**

Now suppose we are testing many hypotheses at once. We can form a Z-score for each hypothesis, but since there are so many hypotheses we will always see some large Z-scores even if the null hypotheses are all (approximately) true. We need to recalibrate our expectations.

In [1]:

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm, t as t_dist
import statsmodels.api as sm
```

In the next cell we simulate data for `n_test`

independent two-sample comparisons in which the null hypothesis is true for every comparison. The data for the `i^th`

comparison is in row `i`

of the array `x1`

for the first group and row `i`

of the array `x2`

for the second group. We allow the two groups to have different sample sizes and different levels of variation.

In [2]:

```
n_test = 1000
n_obs_1 = 15
n_obs_2 = 25
scale_1 = 2
scale_2 = 1
# scale is SD
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
```

In the next cell we calculate the difference in sample means for each comparison and make a histogram of them. We also overlay the theoretical distribution of the difference of sample means on the histogram.

In [3]:

```
x1_means = x1.mean(1)
x2_means = x2.mean(1)
mean_diff = x1_means - x2_means
plt.hist(mean_diff, normed=True, bins=15, color='lightblue')
plt.grid(True)
x = np.linspace(-2, 2, 100)
y = norm.pdf(x, scale=np.sqrt(scale_1**2 / float(n_obs_1) + scale_2**2 / float(n_obs_2)))
plt.plot(x, y, '-', color='orange', lw=5, alpha=0.7)
plt.xlabel("Difference of means", size=15)
plt.ylabel("Frequency", size=15)
```

Out[3]:

The most important quantity in large scale inference is the Z-score, which places the mean difference on a standardized scale. To perform this standardization we need the standard deviation of the data, which we don't know but can estimate.

In [4]:

```
va1 = x1.var(1)
va2 = x2.var(1)
se = np.sqrt(va1 / n_obs_1 + va2 / n_obs_2)
z_scores = mean_diff / se
# Create a Z-score based on the exact variance
tse = np.sqrt(scale_1**2 / float(n_obs_1) + scale_2**2 / float(n_obs_2))
z_scores_t = mean_diff / tse
plt.clf()
plt.grid(True)
plt.plot(z_scores, z_scores_t, 'o', alpha=0.5)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("Z-score using estimated variance", size=15)
plt.ylabel("Z-score using exact variance", size=15)
```

Out[4]:

Histograms are not very sensitive to small changes in the shape of a distribution. Quantile-quantile (QQ) plots are better at that. To compare two samples with the same number of observations using a QQ plot we simply sort each sample and make plot one against the other. If we use the estimated scale to form the Z-score, the values are slightly "overdispersed".

In [5]:

```
plt.plot(np.sort(z_scores), np.sort(z_scores_t))
plt.plot([-4, 4], [-4, 4], color='grey')
plt.grid(True)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("Z-scores using estimated scale", size=15)
plt.ylabel("Z-scores using exact scale", size=15)
```

Out[5]:

We can also use a QQ plot to compare the observed Z-scores against the theoretical standard normal population.

In [6]:

```
sm.qqplot(z_scores)
plt.plot([-3, 3], [-3, 3], '-', color='grey')
plt.grid(True)
```

Note that it is not easy to see the overdispersion in a histogram.

In [7]:

```
plt.hist(z_scores, normed=True, bins=15)
x = np.linspace(-4, 4, 100)
y = norm.pdf(x, scale=1)
plt.plot(x, y, '-', color='orange', lw=5)
plt.xlabel("Difference of means", size=15)
plt.ylabel("Frequency", size=15)
```

Out[7]:

If we want to try to make the estimates Z-scores behave more like Z-scores, we can approximate the Z-score distribution with a t-distribution and map the t-distribution to a standard normal distribution. This involves a somewhat complicated expression for the approximate degrees of freedom for the t-distribution called the "Satterthwaite" approximation. Sme details are given here:

http://en.wikipedia.org/wiki/Welch%27s_t_test

Note that estimate degrees of freedom is just a bit larger than the smaller of the two sample sizes.

In [8]:

```
df = (va1 / n_obs_1 + va2 / n_obs_2)**2 / (va1**2 / (n_obs_1**2 * (n_obs_1 - 1)) + va2**2 / (n_obs_2**2 * (n_obs_2 - 1)))
_ = plt.hist(df, bins=20, color='lightblue')
plt.xlabel("Approximate degrees of freedom", size=15)
plt.ylabel("Frequency", size=15)
print(np.mean(df))
```

Here is how we do the adjustment. It is an improvement, but not perfect.

In [9]:

```
t_quantiles = t_dist.cdf(z_scores, df)
z_scores_v = norm.ppf(t_quantiles)
print(z_scores.std())
print(z_scores_v.std())
```

The usual rule for hypothesis testing is to reject the null hypothesis if the p-value is smaller than 0.05. This is exactly equivalent to rejecting the null hypothesis when the absolute value of the Z-score is greater than 2.

When testing many hypothesis at once, the *false positive rate* is the proprotion of true null hypotheses that are incorrectly rejected.

We can calculate the false positive rate in our simulated data:

In [10]:

```
print(np.sum(np.abs(z_scores) > 2))
print(np.mean(np.abs(z_scores) > 2))
```

The *family-wise error rate* (FWER) is the probability of rejecting at least one true null hypothesis. To assess the FWER, we need to be able to easily repeat the entire analysis done above, so we re-implement it as a function:

In [11]:

```
def mtest(x1, x2):
"""
Perform many t-tests in parallel.
Parameters
----------
x1 : array-like
Data for sample 1, rows are variables to test, columns are replicates.
x2 : array-like
Data for sample 2, rows are variables to test, columns are replicates.
Returns a vector of Z-scores.
Notes
-----
`x1` and `x2` must have the same number of rows.
"""
x1_means = x1.mean(1)
x2_means = x2.mean(1)
mean_diff = x1_means - x2_means
va1 = x1.var(1)
va2 = x2.var(1)
se = np.sqrt(va1 / n_obs_1 + va2 / n_obs_2)
z_scores = mean_diff / se
df = (va1 / n_obs_1 + va2 / n_obs_2)**2 / (va1**2 / (n_obs_1**2 * (n_obs_1 - 1)) + va2**2 / (n_obs_2**2 * (n_obs_2 - 1)))
t_quantiles = t_dist.cdf(z_scores, df)
z_scores = norm.ppf(t_quantiles)
return z_scores
```

Now we can imagine that we repeated our study 500 times. The FWER is the proportion of the repeated studies in which we have at least one false positive.

In [12]:

```
maxz = []
for k in range(500):
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
maxz.append(np.abs(z_scores).max())
plt.hist(maxz, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)
```

Out[12]:

The "Bonferroni approach" to multiple testing involves changing the decision threshold from $|Z| > 2$ (corresponding to $p < 0.05$) to a stricter threshold, based on $p < 0.05 / q$, where $q$ is the number of tests. For our situation, the threshold becomes:

In [13]:

```
qtr = -norm.ppf(0.025 / x1.shape[0])
print(qtr)
```

The observed FWER is calculated in the next cell. For small sample sizes it may still be bigger than the target value. Try running the notebook with different sample sizes to see how this performs.

In [14]:

```
print(np.mean(maxz > qtr))
```

Up to this point we have only worked with data in which all the null hypotheses are true, i.e. the means of the two groups are equal for every tested variable. Next we can see what happens when we introduce a small number of non-null variables. First we consider a case with large effect size.

In [15]:

```
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 3 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
plt.hist(z_scores, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)
```

Out[15]:

As usual, a QQ plot is more sensitive than a histogram:

In [16]:

```
_ = sm.qqplot(z_scores)
plt.grid(True)
```

Sometimes the effects are very weak:

In [17]:

```
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 1 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
plt.hist(z_scores, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)
_ = sm.qqplot(z_scores)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
```

The family-wise error rate doesn't match the research goals very well in an exploratory analysis. It deems the study to be a failure if even a single false claim is made. In some situations, it makes more sense to think in terms of the proportion of false claims that are made. Specifically, the *false discovery rate* is the proportion of all rejected hypotheses that are actually null. More informally, it is the proportion of the time that you say that something interesting is present, when in fact nothing interesting is present.

This is how we get the FDR values for our collection of Z-scores:

In [18]:

```
from statsmodels.stats.multitest import multipletests
pvalues = 2*norm.cdf(-np.abs(z_scores))
_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
```

We can plot the results to see how the estimated FDR values for null and non-null variables compare.

In [19]:

```
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
```

Out[19]:

If we increase the effect size we will have more power:

In [20]:

```
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 2 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
pvalues = 2*norm.cdf(-np.abs(z_scores))
_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
```

Out[20]:

If we increase the sample size but leave the effect size unchanged we will have more power:

In [21]:

```
n_obs_1 = 50
n_obs_2 = 50
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 1 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
pvalues = 2*norm.cdf(-np.abs(z_scores))
_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
```

Out[21]:

Recently people have begun to raise questions about whether the Z-scores used in making statistical comparisons are really "Z-scores" -- that is, do they really behave like standard normal values when the null hypothesis is true?

Theoretically, the Z-scores should be fine in many situations, but there are some particular things that could cause problems:

Dependence between observations

Outliers

Heterogeneous or mixed populations

Small sample size coupled with strong non-Gaussianity

If we are only performing one hypothesis test, we have no choice but to trust that our one Z-score really is a Z-score. But what if we are conducting many tests at once? Often, it is reasonable to posit that most of the null hypotheses are true. Can we use this "assumption" as a basis for checking whether the Z-scores really behave as such?

The idea that most of the null hypotheses are true is often called *sparsity*. The set of all observed Z-scores is hard to analyze because it is a mix of Z-scores coming from null and non-null variables. But under sparsity, the Z-scores close to zero should mostly come from nul variables. Thus we can check the distribution of null Z-scores without assuming that it is standard normal.

In the next cell, we simulate a set of Z-scores such that 1000 of the values are exactly null (i.e. standard normal), and 100 come from a non-null (shifted normal) distribution. You can easily see the non-null Z-scores in the tails, but in the center of the range the behavior is very consistent with the theoretical distribution. Note in particular that a line fit through the center of the distribution would have slope equal to 1.

In [22]:

```
z = np.concatenate([np.random.normal(size=1000), np.random.normal(size=100, scale=2)])
print(np.var(z))
_ = sm.qqplot(z)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
```

Next, we construct a similar set of Z-scores, except that even the null Z-scores are "mis-calibrated". They follow a normal distribution with standard deviation 1.1 rather than a standard normal distribution. We can see the discrepancy in the center of the plot. If we fit a regression line through the center of the plot, the slope is close to the standard deviation of the null Z-scores.

In [23]:

```
z = np.concatenate([1.1 * np.random.normal(size=1000), np.random.normal(size=100, scale=2)])
print(np.var(z))
_ = sm.qqplot(z)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
z.sort()
grid = np.linspace(1/1000., 1 - 1/1000., len(z))
snz = norm.ppf(grid)
result = sm.OLS(z[450:650], snz[450:650]).fit()
print(result.summary())
```

Using this and related techniques, it is possible to identify and adjust for certain situations that cause your Z-scores to "misbehave".