*For a real-world example that showcases the pitfalls of improper cross validation, see this blog post.*

You have 20 datapoints, each of which has 1,000,000 attributes. Each observation also has an associated $y$ value, and you are interested in whether a linear combination of a few attributes can be used to predict $y$. That is, you are looking for a model

$$ y_i \sim \sum_j w_j x_{ij} $$

where most of the 1 million $w_j$ values are 0.

Since there are so many more attributes than datapoints, the chance that a few attributes correlate with $y$ by pure coincidence is fairly high.
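To get a feel for just how likely, here's a quick sketch (scaled down to 50,000 attributes so it runs fast; the attribute count and seed are illustrative): even when $y$ is pure noise with no relationship to any column, the best-correlated column looks impressive.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 observations, 50,000 random attributes, and a y that is pure noise
n_obs, n_attr = 20, 50_000
x = rng.uniform(0, 3, (n_obs, n_attr))
y = rng.normal(size=n_obs)

# Pearson correlation of every column with y, in one vectorized pass
xc = x - x.mean(axis=0)
yc = y - y.mean()
corr = (xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

# the strongest spurious correlation is large, even though y is random
print("max |correlation|:", np.abs(corr).max())
```

With only 20 observations, a correlation this strong would look convincing in isolation; only the sheer number of columns explains it.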

You kind of remember that cross-validation helps you detect over-fitting, but you're fuzzy on the details. Your plan is:

- Determine a few attributes of X that correlate well with y
- Use cross-validation to measure how well a linear fit to these attributes predicts y

In [1]:

```
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
```

Let's make the dataset, and compute the y's with a "hidden" model that we are trying to recover

In [15]:

```
def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result

def make_x(nobs):
    return np.random.uniform(0, 3, (nobs, 10 ** 6))

x = make_x(20)
y = hidden_model(x)
print(x.shape)
```

Find the 2 attributes in X that best correlate with y

In [17]:

```
selector = SelectKBest(f_regression, k=2).fit(x, y)
best_features = np.where(selector.get_support())[0]
print(best_features)
```

We know we are already in trouble -- we've selected 2 columns which correlate with y by chance, but neither of which is column 5 or 10 (the only 2 columns that *actually* have anything to do with y). We can look at the correlations between these columns and y, and confirm they are pretty good (again, just a coincidence):

In [18]:

```
for b in best_features:
    plt.plot(x[:, b], y, 'o')
    plt.title("Column %i" % b)
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.show()
```

A linear regression on the full data looks good. The "score" here is the $R^2$ score -- scores close to 1 imply a good fit.
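As a reminder of what that score measures: $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal hand-rolled version (equivalent to what `clf.score` computes; the function name and toy data are made up for illustration):

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot:
    # 1 means a perfect fit, 0 means no better than predicting the mean
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score_manual(y_true, y_true))           # perfect predictions -> 1.0
print(r2_score_manual(y_true, np.full(4, 2.5)))  # predicting the mean -> 0.0
```

Note that on held-out data $R^2$ can even go negative, when the model predicts worse than a constant mean would.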

In [25]:

```
xt = x[:, best_features]
clf = LinearRegression().fit(xt, y)
print("Score is ", clf.score(xt, y))
```

In [26]:

```
yp = clf.predict(xt)
plt.plot(yp, y, 'o')
plt.plot(y, y, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
```

Out[26]:

We're worried about overfitting, and remember that cross-validation is supposed to detect this. Let's look at the average $R^2$ score under 5-fold cross-validation. It's not as good, but still not bad...

In [27]:

```
cross_val_score(clf, xt, y, cv=5).mean()
```

Out[27]:

And even if we make some plots of the predicted and actual data at each cross-validation iteration, the model seems to predict the "independent" data pretty well...

In [29]:

```
for train, test in KFold(n_splits=10).split(xt):
    xtrain, xtest, ytrain, ytest = xt[train], xt[test], y[train], y[test]
    clf.fit(xtrain, ytrain)
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
```

Out[29]:

**But** -- what if we generated some more data?

In [9]:

```
x2 = make_x(100)
y2 = hidden_model(x2)
x2 = x2[:, best_features]
y2p = clf.predict(x2)
plt.plot(y2p, y2, 'o')
plt.plot(y2, y2, 'r-')
plt.xlabel("Predicted")
plt.ylabel("Observed")
```

Out[9]:

Yikes -- there is no correlation at all! Cross-validation did **not** detect the overfitting, because we used the entire data to select "good" features.

For cross-validation to detect this kind of overfitting, we can't let *any* information about the full dataset leak into the held-out folds. Thus, we must re-select good features within each cross-validation iteration, using only that fold's training data.

In [31]:

```
scores = []
for train, test in KFold(n_splits=5).split(x):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]

    # select features using only the training portion of this fold
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]
    xtest = xtest[:, b.get_support()]

    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))

    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')

plt.xlabel("Predicted")
plt.ylabel("Observed")
print("CV Score is ", np.mean(scores))
```

Now cross-validation properly detects overfitting, by reporting a low average $R^2$ score and a plot that looks like noise. Of course, it doesn't help us actually *discover* the fact that columns 5 and 10 determine Y (this task is probably hopeless without more data) -- it just lets us know when our fitting approach isn't generalizing to new data.
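A cleaner way to guarantee this in scikit-learn is to wrap the feature selection and the regression in a `Pipeline`: `cross_val_score` then refits the whole pipeline from scratch inside each fold, so the selection step never sees held-out data. A sketch of the same experiment (scaled down to 100,000 attributes and a fixed seed so it runs quickly; those numbers are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# scaled-down version of the setup above: 20 observations, 100,000 attributes,
# y determined by columns 5 and 10 plus a little noise
x = rng.uniform(0, 3, (20, 100_000))
y = x[:, 5] + x[:, 10] + rng.normal(0, .005, 20)

# feature selection happens inside each fold, on training data only
model = make_pipeline(SelectKBest(f_regression, k=2), LinearRegression())
score = cross_val_score(model, x, y, cv=5).mean()
print("leak-free CV score:", score)
```

The pipeline makes the leak structurally impossible, rather than relying on you to remember to re-run the selection by hand in every fold.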