Large scale testing

Review of classical hypothesis testing

One of the most basic and useful statistical procedures is the comparison of two population means based on samples of data from the two populations. Formally, imagine that we have two populations with means $\mu_1$ and $\mu_2$, respectively, and we are interested in assessing whether $\mu_1 = \mu_2$.

Let $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$ denote the (iid) data from the two populations, respectively, and let $\bar{x}$ and $\bar{y}$ be their sample means. A natural starting point is the difference of sample means $\bar{x} - \bar{y}$, which is an estimate of $\mu_1 - \mu_2$.

Since the data can be on any scale, the value of $\bar{x} - \bar{y}$ does not directly tell us much about the statistical evidence for or against the null hypothesis $\mu_1 = \mu_2$. We need to standardize this quantity somehow.

The variance of $\bar{x} - \bar{y}$ is $\sigma_1^2 / n_1 + \sigma_2^2 / n_2$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the two populations and $n_1$ and $n_2$ are the two sample sizes. Thus we can form the Z-score $(\bar{x} - \bar{y}) / \sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$.

In practice, we do not know $\sigma_1^2$ and $\sigma_2^2$ (they are nuisance parameters), so the Z-score that we actually use is $(\bar{x} - \bar{y}) / \sqrt{\hat{\sigma}_1^2 / n_1 + \hat{\sigma}_2^2 / n_2}$, where $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ are the sample variances.

A Z-score should behave like a standard normal random variable when the null hypothesis is true. Thus any observed Z-score values that fall far enough in the tails of the standard normal distribution don't fit the null hypothesis very well, and therefore might actually come from a population in which the null hypothesis is not true.
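
To make the preceding formulas concrete, here is a minimal sketch of a single two-sample comparison on simulated data (the sample sizes and scales below are purely illustrative):

import numpy as np
from scipy.stats.distributions import norm

# simulate one two-sample comparison in which the null hypothesis is true
x = np.random.normal(loc=0, scale=2, size=15)
y = np.random.normal(loc=0, scale=1, size=25)

# Z-score based on the estimated variances
se = np.sqrt(x.var() / len(x) + y.var() / len(y))
z = (x.mean() - y.mean()) / se

# two-sided p-value under the standard normal reference distribution
p = 2 * norm.cdf(-np.abs(z))
print(z, p)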

Large scale hypothesis testing

Now suppose we are testing many hypotheses at once. We can form a Z-score for each hypothesis, but since there are so many hypotheses we will always see some large Z-scores even if the null hypotheses are all (approximately) true. We need to recalibrate our expectations.

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm, t as t_dist
import statsmodels.api as sm

In the next cell we simulate data for n_test independent two-sample comparisons in which the null hypothesis is true for every comparison. The data for the i^th comparison is in row i of the array x1 for the first group and row i of the array x2 for the second group. We allow the two groups to have different sample sizes and different levels of variation.

In [2]:
n_test = 1000
n_obs_1 = 15
n_obs_2 = 25
scale_1 = 2
scale_2 = 1

# scale is SD
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)

In the next cell we calculate the difference in sample means for each comparison and make a histogram of them. We also overlay the theoretical distribution of the difference of sample means on the histogram.

In [3]:
x1_means = x1.mean(1)
x2_means = x2.mean(1)
mean_diff = x1_means - x2_means

plt.hist(mean_diff, density=True, bins=15, color='lightblue')
plt.grid(True)

x = np.linspace(-2, 2, 100)
y = norm.pdf(x, scale=np.sqrt(scale_1**2 / float(n_obs_1) + scale_2**2 / float(n_obs_2)))
plt.plot(x, y, '-', color='orange', lw=5, alpha=0.7)
plt.xlabel("Difference of means", size=15)
plt.ylabel("Frequency", size=15)
Out[3]:
<matplotlib.text.Text at 0x7f54a3f0ee10>

The most important quantity in large scale inference is the Z-score, which places the mean difference on a standardized scale. To perform this standardization we need the standard error of the mean difference, which depends on the population variances; we do not know these, but we can estimate them from the data.

In [4]:
va1 = x1.var(1)
va2 = x2.var(1)
se = np.sqrt(va1 / n_obs_1 + va2 / n_obs_2)

z_scores = mean_diff / se

# Create a Z-score based on the exact variance
tse = np.sqrt(scale_1**2 / float(n_obs_1) + scale_2**2 / float(n_obs_2))
z_scores_t = mean_diff / tse

plt.clf()
plt.grid(True)
plt.plot(z_scores, z_scores_t, 'o', alpha=0.5)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("Z-score using estimated variance", size=15)
plt.ylabel("Z-score using exact variance", size=15)
Out[4]:
<matplotlib.text.Text at 0x7f54a25f9610>

Histograms are not very sensitive to small changes in the shape of a distribution; quantile-quantile (QQ) plots are better at detecting them. To compare two samples with the same number of observations using a QQ plot, we simply sort each sample and plot one against the other. If we use the estimated scale to form the Z-scores, the values are slightly "overdispersed".

In [5]:
plt.plot(np.sort(z_scores), np.sort(z_scores_t))
plt.plot([-4, 4], [-4, 4], color='grey')
plt.grid(True)
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel("Z-scores using estimated scale", size=15)
plt.ylabel("Z-scores using exact scale", size=15)
Out[5]:
<matplotlib.text.Text at 0x7f54a25ebe50>

We can also use a QQ plot to compare the observed Z-scores against the theoretical standard normal population.

In [6]:
sm.qqplot(z_scores)
plt.plot([-3, 3], [-3, 3], '-', color='grey')
plt.grid(True)

Note that it is not easy to see the overdispersion in a histogram.

In [7]:
plt.hist(z_scores, density=True, bins=15)

x = np.linspace(-4, 4, 100)
y = norm.pdf(x, scale=1)
plt.plot(x, y, '-', color='orange', lw=5)
plt.xlabel("Difference of means", size=15)
plt.ylabel("Frequency", size=15)
Out[7]:
<matplotlib.text.Text at 0x7f54a0471790>

If we want the estimated Z-scores to behave more like true Z-scores, we can approximate their distribution with a t-distribution and then map the t-distribution to the standard normal distribution. This involves a somewhat complicated expression for the approximate degrees of freedom of the t-distribution, known as the Satterthwaite (or Welch-Satterthwaite) approximation. Some details are given here:

http://en.wikipedia.org/wiki/Welch%27s_t_test
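
For reference, the approximate degrees of freedom computed in the next cell are

$$\hat{\nu} = \frac{\left(\hat{\sigma}_1^2/n_1 + \hat{\sigma}_2^2/n_2\right)^2}{\dfrac{\hat{\sigma}_1^4}{n_1^2(n_1 - 1)} + \dfrac{\hat{\sigma}_2^4}{n_2^2(n_2 - 1)}},$$

where $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ are the estimated variances and $n_1$ and $n_2$ are the sample sizes.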

Note that in this simulation the estimated degrees of freedom are just a bit larger than the smaller of the two sample sizes.

In [8]:
df = (va1 / n_obs_1 + va2 / n_obs_2)**2 / (va1**2 / (n_obs_1**2 * (n_obs_1 - 1)) + va2**2 / (n_obs_2**2 * (n_obs_2 - 1)))

_ = plt.hist(df, bins=20, color='lightblue')
plt.xlabel("Approximate degrees of freedom", size=15)
plt.ylabel("Frequency", size=15)
print(np.mean(df))
18.9842078957

Here is how we do the adjustment. It is an improvement, but not perfect.

In [9]:
t_quantiles = t_dist.cdf(z_scores, df)
z_scores_v = norm.ppf(t_quantiles)
print(z_scores.std())
print(z_scores_v.std())
1.08306206883
1.02452050548

False positives and controlling the family-wise error rate

The usual rule for hypothesis testing is to reject the null hypothesis if the p-value is smaller than 0.05. For a two-sided test, this is approximately equivalent to rejecting the null hypothesis when the absolute value of the Z-score is greater than 2 (the exact threshold is about 1.96).
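
For reference, both thresholds can be checked directly using the norm object imported above:

# two-sided critical value for p < 0.05, and the p-value corresponding to |Z| = 2
print(norm.ppf(1 - 0.05 / 2))   # about 1.96
print(2 * norm.cdf(-2))         # about 0.046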

When testing many hypotheses at once, the false positive rate is the proportion of true null hypotheses that are incorrectly rejected.

We can calculate the false positive rate in our simulated data:

In [10]:
print(np.sum(np.abs(z_scores) > 2))
print(np.mean(np.abs(z_scores) > 2))
67
0.067

The family-wise error rate (FWER) is the probability of rejecting at least one true null hypothesis. To assess the FWER, we need to be able to easily repeat the entire analysis done above, so we re-implement it as a function:

In [11]:
def mtest(x1, x2):
    """
    Perform many t-tests in parallel.
    
    Parameters
    ----------
    x1 : array-like
        Data for sample 1, rows are variables to test, columns are replicates.
    x2 : array-like
        Data for sample 2, rows are variables to test, columns are replicates.
    
    Returns a vector of Z-scores.
    
    Notes
    -----
    `x1` and `x2` must have the same number of rows.
    """
    
    x1_means = x1.mean(1)
    x2_means = x2.mean(1)
    mean_diff = x1_means - x2_means

    # use the sample sizes from the data rather than relying on global variables
    n_obs_1 = x1.shape[1]
    n_obs_2 = x2.shape[1]

    va1 = x1.var(1)
    va2 = x2.var(1)
    se = np.sqrt(va1 / n_obs_1 + va2 / n_obs_2)

    z_scores = mean_diff / se
    
    df = (va1 / n_obs_1 + va2 / n_obs_2)**2 / (va1**2 / (n_obs_1**2 * (n_obs_1 - 1)) + va2**2 / (n_obs_2**2 * (n_obs_2 - 1)))

    t_quantiles = t_dist.cdf(z_scores, df)
    z_scores = norm.ppf(t_quantiles)
    return z_scores

Now we can imagine that we repeated our study 500 times. The FWER is the proportion of the repeated studies in which we have at least one false positive.

In [12]:
maxz = []
for k in range(500):
    x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
    x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
    z_scores = mtest(x1, x2)
    maxz.append(np.abs(z_scores).max())
    
plt.hist(maxz, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)
Out[12]:
<matplotlib.text.Text at 0x7f54a048f210>

The "Bonferroni approach" to multiple testing involves changing the decision threshold from $|Z| > 2$ (corresponding to $p < 0.05$) to a stricter threshold, based on $p < 0.05 / q$, where $q$ is the number of tests. For our situation, the threshold becomes:

In [13]:
qtr = -norm.ppf(0.025 / x1.shape[0])
print(qtr)
4.05562698112

The observed FWER is calculated in the next cell. For small sample sizes it may still be bigger than the target value. Try running the notebook with different sample sizes to see how this performs.

In [14]:
print(np.mean(np.asarray(maxz) > qtr))
0.102

Up to this point we have only worked with data in which all the null hypotheses are true, i.e. the means of the two groups are equal for every tested variable. Next we can see what happens when we introduce a small number of non-null variables. First we consider a case with large effect size.

In [15]:
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 3 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
    
plt.hist(z_scores, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)
Out[15]:
<matplotlib.text.Text at 0x7f54a0591750>

As usual, a QQ plot is more sensitive than a histogram:

In [16]:
_ = sm.qqplot(z_scores)
plt.grid(True)

Sometimes the effects are very weak:

In [17]:
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 1 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)
    
plt.hist(z_scores, color='lightblue', bins=30)
plt.grid()
plt.xlabel("Maximum Z-score", size=15)
plt.ylabel("Frequency", size=15)

_ = sm.qqplot(z_scores)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)

False Discovery Rates

The family-wise error rate doesn't match the research goals very well in an exploratory analysis, since it deems the study a failure if even a single false claim is made. In some situations, it makes more sense to think in terms of the proportion of false claims that are made. Specifically, the false discovery rate (FDR) is the expected proportion of rejected hypotheses that are actually null. More informally, it is the proportion of the time that you claim something interesting is present when in fact nothing interesting is present.

This is how we get the FDR values for our collection of Z-scores:

In [18]:
from statsmodels.stats.multitest import multipletests

pvalues = 2*norm.cdf(-np.abs(z_scores))

_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")
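
The fdr_bh method implements the Benjamini-Hochberg step-up procedure. As a minimal by-hand sketch of what the adjusted values returned above represent (not the library's internal code), they can be computed as follows:

def bh_adjust(pvalues):
    # Benjamini-Hochberg adjusted p-values: sort, scale by n / rank,
    # enforce monotonicity from the largest p-value downward, and unsort.
    pvalues = np.asarray(pvalues)
    n = len(pvalues)
    order = np.argsort(pvalues)
    scaled = pvalues[order] * n / np.arange(1, n + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.clip(adjusted, 0, 1)
    out = np.empty(n)
    out[order] = adjusted
    return out

# should agree (up to floating point) with the fdr values from multipletests
print(np.allclose(bh_adjust(pvalues), fdr))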

We can plot the estimated FDR values to see how the non-null variables (the first 50, plotted in orange) compare with the null variables (plotted in purple).

In [19]:
plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
Out[19]:
<matplotlib.text.Text at 0x7f549fe29a90>

If we increase the effect size we will have more power:

In [20]:
x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 2 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)

pvalues = 2*norm.cdf(-np.abs(z_scores))

_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")

plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
Out[20]:
<matplotlib.text.Text at 0x7f549fe79c90>

If we increase the sample size but leave the effect size unchanged we will have more power:

In [21]:
n_obs_1 = 50
n_obs_2 = 50

x1 = np.random.normal(size=(n_test, n_obs_1), scale=scale_1)
x1[0:50, :] += 1 # These are our non-null variables
x2 = np.random.normal(size=(n_test, n_obs_2), scale=scale_2)
z_scores = mtest(x1, x2)

pvalues = 2*norm.cdf(-np.abs(z_scores))

_, fdr, _, _ = multipletests(pvalues, method="fdr_bh")

plt.plot(range(50), fdr[0:50], color='orange')
plt.plot(range(50, len(fdr)), fdr[50:], color='purple')
plt.xlabel("Variable", size=15)
plt.ylabel("FDR", size=15)
Out[21]:
<matplotlib.text.Text at 0x7f549fe09850>

Are the Z-scores really Z-scores?

Recently people have begun to raise questions about whether the Z-scores used in making statistical comparisons are really "Z-scores" -- that is, do they really behave like standard normal values when the null hypothesis is true?

Theoretically, the Z-scores should be fine in many situations, but there are some particular things that could cause problems:

  • Dependence between observations (illustrated in the sketch after this list)

  • Outliers

  • Heterogeneous or mixed populations

  • Small sample size coupled with strong non-Gaussianity
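
As a quick illustration of the first item in this list, positively correlated observations within a sample inflate the variance of the sample mean, so the nominal Z-scores are overdispersed even when the null hypothesis is true. Here is a simulation sketch using an illustrative within-sample correlation of 0.3:

# one-sample Z-scores for equicorrelated data with true mean zero
rho = 0.3
n_obs = 20
u = np.random.normal(size=(n_test, 1))         # component shared within each row
e = np.random.normal(size=(n_test, n_obs))     # independent component
xc = np.sqrt(rho) * u + np.sqrt(1 - rho) * e   # within-row correlation is rho
zc = xc.mean(1) / np.sqrt(xc.var(1) / n_obs)   # naive Z-scores that ignore the dependence
print(zc.std())                                # substantially larger than 1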

If we are only performing one hypothesis test, we have no choice but to trust that our one Z-score really is a Z-score. But what if we are conducting many tests at once? Often, it is reasonable to posit that most of the null hypotheses are true. Can we use this "assumption" as a basis for checking whether the Z-scores really behave as such?

The idea that most of the null hypotheses are true is often called sparsity. The set of all observed Z-scores is hard to analyze because it is a mix of Z-scores coming from null and non-null variables. But under sparsity, the Z-scores close to zero should mostly come from null variables. Thus we can check the distribution of the null Z-scores without assuming that it is standard normal.

In the next cell, we simulate a set of Z-scores such that 1000 of the values are exactly null (i.e. standard normal), and 100 come from a non-null distribution (a normal distribution with standard deviation 2). You can easily see the non-null Z-scores in the tails, but in the center of the range the behavior is very consistent with the theoretical distribution. Note in particular that a line fit through the center of the distribution would have slope close to 1.

In [22]:
z = np.concatenate([np.random.normal(size=1000), np.random.normal(size=100, scale=2)])
print(np.var(z))

_ = sm.qqplot(z)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)
1.21264458311

Next, we construct a similar set of Z-scores, except that even the null Z-scores are "mis-calibrated". They follow a normal distribution with standard deviation 1.1 rather than a standard normal distribution. We can see the discrepancy in the center of the plot. If we fit a regression line through the center of the plot, the slope is close to the standard deviation of the null Z-scores.

In [23]:
z = np.concatenate([1.1 * np.random.normal(size=1000), np.random.normal(size=100, scale=2)])
print(np.var(z))

_ = sm.qqplot(z)
plt.plot([-4, 4], [-4, 4], '-', color='grey')
plt.grid(True)

z.sort()
grid = np.linspace(1/1000., 1 - 1/1000., len(z))
snz = norm.ppf(grid)

result = sm.OLS(z[450:650], snz[450:650]).fit()
print(result.summary())
1.40890110432
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                 3.360e+04
Date:                Wed, 10 Jun 2015   Prob (F-statistic):          7.40e-224
Time:                        17:35:30   Log-Likelihood:                 607.17
No. Observations:                 200   AIC:                            -1212.
Df Residuals:                     199   BIC:                            -1209.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.1431      0.006    183.291      0.000         1.131     1.155
==============================================================================
Omnibus:                        1.086   Durbin-Watson:                   0.049
Prob(Omnibus):                  0.581   Jarque-Bera (JB):                0.755
Skew:                          -0.007   Prob(JB):                        0.686
Kurtosis:                       3.301   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Using this and related techniques, it is possible to identify and adjust for certain situations that cause your Z-scores to "misbehave".
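
As a concrete illustration, here is a minimal sketch of one simple adjustment (one of several possibilities, continuing from the cell above): rescale the Z-scores by the slope estimated from the center of the QQ plot, so that the bulk of the presumably null Z-scores is approximately standard normal, and then compute p-values from the rescaled values.

# rescale by the estimated null standard deviation (the OLS slope fit above)
null_scale = result.params[0]    # about 1.14 in this simulation
z_adjusted = z / null_scale

# p-values based on the rescaled Z-scores
pvalues_adjusted = 2 * norm.cdf(-np.abs(z_adjusted))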