**Review of classical hypothesis testing**

One of the most basic and useful statistical procedures is the comparison of two population means based on samples of data from the two populations. Formally, imagine that we have populations with means $\mu_1$ and $\mu_2$, respectively, and we are interested in assessing whether $\mu_1 = \mu_2$.

Let $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ denote the (*iid*) data from the two populations, respectively, and let $\bar{x}$ and $\bar{y}$ be their sample means. A natural starting point is the difference of sample means $\bar{x} - \bar{y}$, which is an estimate of $\mu_1 - \mu_2$.

Since the data can be on any scale, the value of $\bar{x} - \bar{y}$ does not directly tell us much about the statistical evidence for or against the *null hypothesis* $\mu_1 = \mu_2$. We need to standardize this quantity somehow.

The variance of $\bar{x} - \bar{y}$ is $\sigma_1^2 / n_1 + \sigma_2^2 / n_2$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the two populations and $n_1$ and $n_2$ are the two sample sizes. Thus we can form the *Z-score* $(\bar{x} - \bar{y}) / \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$.

In practice, we do not know $\sigma_1^2$ and $\sigma_2^2$ (they are *nuisance parameters*), so the Z-score that we actually use is $(\bar{x} - \bar{y}) / \sqrt{\hat{\sigma}_1^2 / n_1 + \hat{\sigma}_2^2 / n_2}$.

A Z-score should behave like a standard normal random variable when the null hypothesis is true. Thus any observed Z-score values that fall far enough in the tails of the standard normal distribution don't fit the null hypothesis very well, and therefore might actually come from a population in which the null hypothesis is not true.
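For example, here is a quick sketch of how an observed Z-score maps to a two-sided p-value under the standard normal reference distribution:

```python
from scipy.stats import norm

z = 2.5  # an observed Z-score
# Two-sided p-value: the probability that a standard normal value
# falls at least this far from zero in either direction.
p = 2 * norm.cdf(-abs(z))
print(p)  # about 0.012, so this Z-score fits the null hypothesis poorly
```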

**Large scale hypothesis testing**

Now suppose we are testing many hypotheses at once. We can form a Z-score for each hypothesis, but since there are so many hypotheses we will always see some large Z-scores even if the null hypotheses are all (approximately) true. We need to recalibrate our expectations.
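To sketch the scale of the issue: even when every null hypothesis is true, a back-of-the-envelope calculation (assuming exact standard normal Z-scores) shows that many large Z-scores are expected by chance alone:

```python
from scipy.stats import norm

n_tests = 1000
# Probability that a single standard normal value exceeds 2 in absolute value
p_exceed = 2 * norm.cdf(-2)
print(p_exceed)            # about 0.046
print(n_tests * p_exceed)  # about 45 exceedances expected among 1000 null tests
```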

In [1]:

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm, t as t_dist
import statsmodels.api as sm
```

We begin by simulating data for `n_test` independent two-sample comparisons in which the null hypothesis is true for every comparison. The data for the `i`th comparison are in row `i` of the array `x1` for the first group and row `i` of the array `x2` for the second group. We allow the two groups to have different sample sizes and different levels of variation.

In [2]:

```
n_test = 1000
n_obs_1 = 15
n_obs_2 = 25
scale_1 = 2
scale_2 = 1
# scale is SD
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
```

In [3]:

```
x1_means = x1.mean(1)
x2_means = x2.mean(1)
mean_diff = x1_means - x2_means
plt.hist(mean_diff, density=True, bins=15, color='lightblue')
plt.grid(True)
x = np.linspace(-2, 2, 100)
y = norm.pdf(x, scale=np.sqrt(scale_1**2 / n_obs_1 + scale_2**2 / n_obs_2))
plt.plot(x, y, '-', color='orange', lw=5, alpha=0.7)
plt.xlabel("Difference of means", size=15)
plt.ylabel("Density", size=15)
```

Out[3]:

In [4]:

```
va1 = x1.var(1)
va2 = x2.var(1)
se = np.sqrt(va1 / n_obs_1 + va2 / n_obs_2)
z_scores = mean_diff / se
# Create a Z-score based on the exact variance
tse = np.sqrt(scale_1**2 / n_obs_1 + scale_2**2 / n_obs_2)
z_scores_t = mean_diff / tse
plt.clf()
plt.grid(True)
plt.plot(z_scores, z_scores_t, 'o', alpha=0.5)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("Z-score using estimated variance", size=15)
plt.ylabel("Z-score using exact variance", size=15)
```

Out[4]:

In [5]:

```
plt.plot(np.sort(z_scores), np.sort(z_scores_t))
plt.plot([-4, 4], [-4, 4], color='grey')
plt.grid(True)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("Z-scores using estimated scale", size=15)
plt.ylabel("Z-scores using exact scale", size=15)
```

Out[5]:

In [6]:

```
sm.qqplot(z_scores)
plt.plot([-3, 3], [-3, 3], '-', color='grey')
plt.grid(True)
```

Note that it is not easy to see the overdispersion in a histogram.

In [7]:

```
plt.hist(z_scores, density=True, bins=15)
x = np.linspace(-4, 4, 100)
y = norm.pdf(x, scale=1)
plt.plot(x, y, '-', color='orange', lw=5)
plt.xlabel("Z-score", size=15)
plt.ylabel("Density", size=15)
```

Out[7]:

If we want to make the estimated Z-scores behave more like true Z-scores, we can approximate their distribution with a t-distribution and then map the t-distribution to a standard normal distribution. This involves a somewhat complicated expression for the approximate degrees of freedom of the t-distribution, called the "Satterthwaite" approximation. Some details are given here:

http://en.wikipedia.org/wiki/Welch%27s_t_test
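The approximate degrees of freedom used in the code below is the Welch-Satterthwaite expression

$\nu \approx \dfrac{\left(\hat{\sigma}_1^2 / n_1 + \hat{\sigma}_2^2 / n_2\right)^2}{\dfrac{\hat{\sigma}_1^4}{n_1^2 (n_1 - 1)} + \dfrac{\hat{\sigma}_2^4}{n_2^2 (n_2 - 1)}}$

where $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ are the sample variances.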

Note that the estimated degrees of freedom are typically just a bit larger than the smaller of the two sample sizes.

In [8]:

```
df = (va1 / n_obs_1 + va2 / n_obs_2)**2 / (va1**2 / (n_obs_1**2 * (n_obs_1 - 1)) + va2**2 / (n_obs_2**2 * (n_obs_2 - 1)))
_ = plt.hist(df, bins=20, color='lightblue')
plt.xlabel("Approximate degrees of freedom", size=15)
plt.ylabel("Frequency", size=15)
print(np.mean(df))
```

Here is how we do the adjustment. It is an improvement, but not perfect.

In [9]:

```
t_quantiles = t_dist.cdf(z_scores, df)
z_scores_v = norm.ppf(t_quantiles)
print(z_scores.std())
print(z_scores_v.std())
```

The usual rule for hypothesis testing is to reject the null hypothesis if the p-value is smaller than 0.05. For a two-sided test, this is equivalent to rejecting the null hypothesis when the absolute value of the Z-score is greater than 1.96, which is commonly rounded to 2.
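We can confirm the threshold numerically: the exact two-sided 0.05 cutoff is the 0.975 quantile of the standard normal distribution, which the rule of thumb rounds to 2.

```python
from scipy.stats import norm

# Critical value: the Z-score with 2.5% of the standard normal mass in each tail
z_crit = norm.ppf(0.975)
print(z_crit)  # about 1.96
```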

When testing many hypotheses at once, the *false positive rate* is the proportion of true null hypotheses that are incorrectly rejected.

We can calculate the false positive rate in our simulated data:

In [10]:

```
print(np.sum(np.abs(z_scores) > 2))
print(np.mean(np.abs(z_scores) > 2))
```

The *family-wise error rate* (FWER) is the probability of rejecting at least one true null hypothesis. To assess the FWER, we need to be able to easily repeat the entire analysis done above, so we re-implement it as a function:

In [11]:

```
def mtest(x1, x2):
    """
    Perform many t-tests in parallel.

    Parameters
    ----------
    x1 : array-like
        Data for sample 1; rows are variables to test, columns are replicates.
    x2 : array-like
        Data for sample 2; rows are variables to test, columns are replicates.

    Returns
    -------
    A vector of Z-scores.

    Notes
    -----
    `x1` and `x2` must have the same number of rows.
    """
    n_obs_1 = x1.shape[1]
    n_obs_2 = x2.shape[1]
    x1_means = x1.mean(1)
    x2_means = x2.mean(1)
    mean_diff = x1_means - x2_means
    va1 = x1.var(1)
    va2 = x2.var(1)
    se = np.sqrt(va1 / n_obs_1 + va2 / n_obs_2)
    z_scores = mean_diff / se
    # Satterthwaite approximation to the degrees of freedom
    df = (va1 / n_obs_1 + va2 / n_obs_2)**2 / (va1**2 / (n_obs_1**2 * (n_obs_1 - 1)) + va2**2 / (n_obs_2**2 * (n_obs_2 - 1)))
    # Map through the t-distribution to standardize the scores
    t_quantiles = t_dist.cdf(z_scores, df)
    z_scores = norm.ppf(t_quantiles)
    return z_scores
```

In [12]:

```
maxz = []
for k in range(500):
    x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
    x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
    z_scores = mtest(x1, x2)
    maxz.append(np.abs(z_scores).max())
maxz = np.asarray(maxz)
plt.hist(maxz, color='lightblue', bins=30)
plt.grid(True)
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)
```

Out[12]:

In [13]:

```
# Bonferroni threshold for a two-sided test at level 0.05 across all variables
qtr = -norm.ppf(0.025 / x1.shape[0])
print(qtr)
```

In [14]:

```
print(np.mean(np.asarray(maxz) > qtr))  # estimated FWER
```

In [15]:

```
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 3 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
plt.hist(z_scores, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Z-score", size=15)
plt.ylabel("Frequency", size=15)
```

Out[15]:

As usual, a QQ plot is more sensitive than a histogram:

In [16]:

```
_ = sm.qqplot(z_scores)
plt.grid(True)
```

Sometimes the effects are very weak:

In [17]:

```
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 1 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
plt.hist(z_scores, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Z-score", size=15)
plt.ylabel("Frequency", size=15)
_ = sm.qqplot(z_scores)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
```

The family-wise error rate doesn't match the research goals very well in an exploratory analysis. It deems the study to be a failure if even a single false claim is made. In some situations, it makes more sense to think in terms of the proportion of false claims that are made. Specifically, the *false discovery rate* is the proportion of all rejected hypotheses that are actually null. More informally, it is the proportion of the time that you say that something interesting is present, when in fact nothing interesting is present.

This is how we get the FDR values for our collection of Z-scores:

In [18]:

```
from statsmodels.stats.multitest import multipletests
pvalues = 2*norm.cdf(-np.abs(z_scores))
_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
```
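For intuition, the Benjamini-Hochberg adjustment that `multipletests` applies with `method="fdr_bh"` can be sketched by hand. This is a minimal reimplementation for illustration only; the `bh_adjust` helper below is ours, not part of statsmodels:

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (a minimal sketch)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale the sorted p-values by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity by taking running minima from the largest rank down
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.minimum(adj, 1)
    out = np.empty(n)
    out[order] = adj
    return out
```

The results should agree with the second return value of `multipletests(pvalues, method="fdr_bh")`.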

We can plot the results to see how the estimated FDR values for null and non-null variables compare.

In [19]:

```
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
```

Out[19]:

If we increase the effect size we will have more power:

In [20]:

```
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 2 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
pvalues = 2*norm.cdf(-np.abs(z_scores))
_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
```

Out[20]:

If we increase the sample size but leave the effect size unchanged we will have more power:

In [21]:

```
n_obs_1 = 50
n_obs_2 = 50
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 1 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
pvalues = 2*norm.cdf(-np.abs(z_scores))
_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
```

Out[21]:
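Both observations can be sketched with an approximate power calculation for a two-sided two-sample Z-test. This is a simplified formula that treats the variances as known; the `ztest_power` helper is ours, not a library function:

```python
import numpy as np
from scipy.stats import norm

def ztest_power(delta, sd1, sd2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample Z-test with known variances."""
    se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)
    z_crit = norm.ppf(1 - alpha / 2)
    # Probability that |Z| exceeds the critical value when the true mean shift is delta
    return norm.cdf(delta / se - z_crit) + norm.cdf(-delta / se - z_crit)

# Larger effects and larger samples both increase power
print(ztest_power(1, 2, 1, 15, 25))  # weak effect, small samples
print(ztest_power(2, 2, 1, 15, 25))  # stronger effect, same samples
print(ztest_power(1, 2, 1, 50, 50))  # same weak effect, larger samples
```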

Recently people have begun to raise questions about whether the Z-scores used in making statistical comparisons are really "Z-scores" -- that is, do they really behave like standard normal values when the null hypothesis is true?

Theoretically, the Z-scores should be fine in many situations, but there are some particular things that could cause problems:

- Dependence between observations
- Outliers
- Heterogeneous or mixed populations
- Small sample size coupled with strong non-Gaussianity

If we are only performing one hypothesis test, we have no choice but to trust that our one Z-score really is a Z-score. But what if we are conducting many tests at once? Often, it is reasonable to posit that most of the null hypotheses are true. Can we use this "assumption" as a basis for checking whether the Z-scores really behave as such?

The idea that most of the null hypotheses are true is often called *sparsity*. The set of all observed Z-scores is hard to analyze because it is a mix of Z-scores coming from null and non-null variables. But under sparsity, the Z-scores close to zero should mostly come from null variables. Thus we can check the distribution of the null Z-scores without assuming that it is standard normal.

In [22]:

```
z = np.concatenate([np.random.normal(size=1000), np.random.normal(size=100, scale=2)])
print(np.var(z))
_ = sm.qqplot(z)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
```

In [23]:

```
z = np.concatenate([1.1 * np.random.normal(size=1000), np.random.normal(size=100, scale=2)])
print(np.var(z))
_ = sm.qqplot(z)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
z.sort()
n = len(z)
grid = np.linspace(0.5 / n, 1 - 0.5 / n, n)
snz = norm.ppf(grid)
# Regress the central order statistics on the corresponding standard normal
# quantiles to estimate the standard deviation of the null Z-scores.
result = sm.OLS(z[450:650], snz[450:650]).fit()
print(result.summary())
```