Data snooping

We've talked about the need for having our hypotheses determined before looking at the summary table. Here is a simple example to illustrate the danger of looking at summaries of the data before deciding which hypothesis to test.

Below, I will create complete noise datasets but try to find "the best model". There is nothing wrong with finding the best model, what is wrong is trusting the results of the summary table after having chosen this as the best model.

In [ ]:
X = matrix(rnorm(50000), 50, 1000)
X = scale(X)
Y = rnorm(50)
Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
print(summary(lm(Y ~ X[,1])))
qqnorm(Z)

The collection of 1000 $T$ statistics looks pretty close to a normal (with 50 degrees of freedom). This is not surprising.

Now, let's look at the largest $T$

In [ ]:
largest_T = order(abs(Z))[1000]
print(summary(lm(Y ~ X[,largest_T])))

The $T$ statistic is much larger in absolute value than it should be. Let's repeat this experiment many times.

In [ ]:
largest_T_sim = function(printit=FALSE) {
    X = matrix(rnorm(500), 50, 10)
    X = scale(X)
    Y = rnorm(50)
    Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
    largest_T = order(abs(Z))[10]
    if (printit) {
        print(summary(lm(Y ~ X[,largest_T])))
    }
    return(summary(lm(Y ~ X[,largest_T]))$coef[2,4])
}
largest_T_sim(printit=TRUE)

We can do this many times and store the $p$-values. What will their distribution look like?

In [ ]:
Pval = c()
for (i in 1:200) {
    Pval = c(Pval, largest_T_sim())
}
plot(ecdf(Pval))

Type I error

How likely are we to conclude there is an effect if we use these $p$-values?

In [ ]:
ecdf(Pval)(0.05)