Evaluation methods in NLP

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"


This notebook is an overview of experimental methods for NLU. My primary goal is to help you with the experiments you'll be doing for your projects. It is a companion to the evaluation metrics notebook, which I suggest studying first.

The teaching team will be paying special attention to how you conduct your evaluations, so this notebook should create common ground around what our values are.

This notebook is far from comprehensive. I hope it covers the most common tools, techniques, and challenges in the field. Beyond that, I'm hoping the examples here suggest a perspective on experiments and evaluations that generalizes to other topics and techniques.

Your projects

  1. We will never evaluate a project based on how "good" the results are.

    1. Publication venues do this, because they have additional constraints on space that lead them to favor positive evidence for new developments over negative results.
    2. In CS224u, we are not subject to this constraint, so we can do the right and good thing of valuing positive results, negative results, and everything in between.
  2. We will evaluate your project on:

    1. The appropriateness of the metrics
    2. The strength of the methods
    3. The extent to which the paper is open and clear-sighted about the limits of its findings.


In [2]:
%matplotlib inline
from collections import defaultdict
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import utils
In [3]:
# Set all the random seeds for reproducibility. Only the
# system and torch seeds are relevant for this notebook.


Data organization


Many publicly available datasets are released with a train/dev/test structure. We're all on the honor system to do test-set runs only when development is complete.

Splits like this basically presuppose a fairly large dataset.

If there is no dev set as part of the distribution, then you might create one to simulate what a test run will be like, though you have to weigh this against the reduction in train-set size.

Having a fixed test set ensures that all systems are assessed against the same gold data. This is generally good, but it is problematic where the test set turns out to have unusual properties that distort progress on the task.

No fixed splits

Many datasets are released without predefined splits. This poses challenges for assessment, especially comparative assessment: for robust comparisons with prior work, you really have to rerun the models using your assessment regime on your splits. For example, if you're doing 5-fold cross-validation, then all the systems should be trained and assessed using exactly the same folds, to control for variation in how difficult the splits are.

If the dataset is large enough, you might create a train/test or train/dev/test split right at the start of your project and use it for all your experiments. This means putting the test portion in a locked box until the very end, when you assess all the relevant systems against it. For large datasets, this will certainly simplify your experimental set-up, for reasons that will become clear when we discuss hyperparameter optimization below.

For small datasets, carving out dev and test sets might leave you with too little data. The most problematic symptom of this is that performance is highly variable because there isn't enough data to optimize reliably. In such situations, you might give up on having fixed splits, opting instead for some form of cross-validation, which allows you to average over multiple runs.


In cross-validation, we take a set of examples $X$ and partition them into two or more train/test splits, and then we average over the results in some way.

Random splits

When creating random train/test splits, we shuffle the examples and split them, with a pre-specified percentage $t$ used for training and another pre-specified percentage (usually $1-t$) used for testing.

In general, we want these splits to be stratified in the sense that the train and test splits have approximately the same distribution over the classes.

The good and the bad of random splits

A nice thing about random splits is that you can create as many as you want without having this impact the ratio of training to testing examples.

This can also be a liability, though, as there's no guarantee that every example will be used the same number of times for training and testing. In principle, one might even evaluate on the same split more than once (though this will be fantastically unlikely for large datasets).

Random splits in scikit-learn

In scikit-learn, the function train_test_split will do random splits. It is a wrapper around ShuffleSplit or StratifiedShuffleSplit, depending on how the keyword argument stratify is used. A potential gotcha for classification problems: train_test_split does not stratify its splits by default, whereas stratified splits are desired in most situations.
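To make the stratification gotcha concrete, here is a minimal sketch. The imbalanced dataset is synthetic, and the sizes and seeds are illustrative assumptions rather than anything prescribed by the course. Passing the labels as `stratify` should leave the train and test portions with nearly the same rate of positive examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=42)

# stratify=y requests splits that preserve the label distribution;
# without it, train_test_split shuffles with no regard to the labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

train_rate = np.mean(y_train)
test_rate = np.mean(y_test)
print(f"Positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Dropping the `stratify` argument and rerunning with a few different seeds is a quick way to see how far the two rates can drift apart on a small, skewed dataset.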


In K-fold cross-validation, one divides the data into $k$ folds of equal size and then conducts $k$ experiments. In each, fold $i$ is used for assessment, and all the other folds are merged together for training:

$$ \begin{array}{c c c c} \textbf{Splits} & \textbf{Experiment 1} & \textbf{Experiment 2} & \textbf{Experiment 3} \\ \begin{array}{|c|} \hline \textrm{fold } 1 \\\hline \textrm{fold } 2 \\\hline \textrm{fold } 3 \\\hline \end{array} & \begin{array}{|c c|} \hline \textbf{Test} & \textrm{fold } 1 \\\hline \textbf{Train} & \textrm{fold } 2 \\ & \textrm{fold } 3 \\\hline \end{array} & \begin{array}{|c c|} \hline \textbf{Test} & \textrm{fold } 2 \\\hline \textbf{Train} & \textrm{fold } 1 \\ & \textrm{fold } 3 \\\hline \end{array} & \begin{array}{|c c|} \hline \textbf{Test} & \textrm{fold } 3 \\\hline \textbf{Train} & \textrm{fold } 1 \\ & \textrm{fold } 2 \\\hline \end{array} \end{array} $$

The good and the bad of k-folds

  • With k-folds, every example appears in a train set exactly $k-1$ times and in a test set exactly once. We noted above that random splits do not guarantee this.

  • A major drawback of k-folds is that the size of $k$ determines the size of the train/test splits. With 3-fold cross-validation, one trains on 67% of the data and tests on 33%. With 10-fold cross-validation, one trains on 90% and tests on 10%. These are likely to be very different experimental scenarios. This is a consideration one should have in mind when comparing models using statistical tests that depend on repeated runs.

K-folds in scikit-learn

  • In scikit-learn, KFold and StratifiedKFold are the primary classes for creating k-folds from a dataset. As with random splits, the stratified option is recommended for most classification problems, as one generally wants to train and assess with the same label distribution.

  • The methods cross_validate and cross_val_score are convenience methods that let you pass in a model (estimator), a dataset (X and y), and some cross-validation parameters, and they handle the repeated assessments. These are great. Two tips:

    • I strongly recommend passing in a KFold or StratifiedKFold instance as the value of cv to ensure that you get the split behavior that you desire.
    • Check that scoring has the value that you desire. For example, if you are going to report F1-scores, it's a mistake to leave scoring=None, as this will default to whatever your model reports with its score method, which is probably accuracy.
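The two tips above can be sketched as follows. The dataset is synthetic and the choice of LogisticRegression is only a stand-in for whatever estimator you are assessing; the point is that both `cv` and `scoring` are set explicitly rather than left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data, used here only so that the example is self-contained.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# An explicit splitter fixes the fold behavior, including the shuffling seed,
# so the same folds can be reused for every system being compared.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring is named explicitly so that the metric computed is the one reported.
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1_macro")

fold_scores = results["test_score"]
print(f"Macro F1: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Reusing the same `cv` instance (with its fixed `random_state`) for each model is one simple way to keep the folds identical across systems.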


K-folds has a number of variants and special cases. Two that frequently arise in NLU:

  1. LeaveOneOut is the special case where the number of folds equals the number of examples. This is especially useful for very small datasets.

  2. LeavePGroupsOut creates folds based on criteria that you define. This is useful in situations where the datasets have important structure that the splits need to respect – e.g., you want to assess against a graph sub-network that is never seen during training.
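Here is a small sketch of both variants on a toy dataset. The group labels are hypothetical; in practice they would encode whatever structure the splits need to respect (for example, which sub-network each example comes from).

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePGroupsOut

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
# Hypothetical group labels: two examples from each of three groups.
groups = np.array(["a", "a", "b", "b", "c", "c"])

# One split per example: each example is the test set exactly once.
loo_splits = list(LeaveOneOut().split(X))

# One split per held-out group: a whole group is withheld from training.
lpgo_splits = list(LeavePGroupsOut(n_groups=1).split(X, y, groups=groups))

for train_idx, test_idx in lpgo_splits:
    # No group appears on both sides of a split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

print(len(loo_splits), len(lpgo_splits))
```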


Evaluation numbers in NLP (and throughout AI) can never be understood properly in isolation:

  • If your system gets 0.95 F1, that might seem great in absolute terms, but your readers will suspect the task is too easy and want to know what simple models achieve.

  • If your system gets 0.60 F1, you might despair, but it could turn out that humans achieve only 0.80, indicating that you got traction on a very challenging but basically coherent problem.

Baselines are crucial for strong experiments

Defining baselines should not be an afterthought, but rather central to how you define your overall hypotheses. Baselines are essential to building a persuasive case, and they can also be used to illuminate specific aspects of the problem and specific virtues of your proposed system.

Random baselines

Random baselines are almost always useful to include. scikit-learn has classes DummyClassifier and DummyRegressor that make it easy to include these baselines in your workflow. Each of them has a keyword argument strategy that allows you to specify a range of different styles of random guessing.
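As a minimal sketch of how these baselines might slot into a workflow, the following scores three `strategy` values with macro F1 on a synthetic, imbalanced dataset. The data, the split, and the choice of metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with an 80/20 class imbalance.
X, y = make_classification(
    n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

baseline_scores = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    preds = dummy.predict(X_test)
    # zero_division=0 scores a never-predicted class as 0 without a warning.
    baseline_scores[strategy] = f1_score(
        y_test, preds, average="macro", zero_division=0)

print(baseline_scores)
```

Note that always guessing the majority class can look respectable on accuracy while its macro F1 stays below 0.5, which is one reason to report baselines on the same metric as your real systems.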

Task-specific baselines

It is worth considering whether your problem suggests a baseline that will reveal something about the problem or the ways it is modeled. Two recent examples from NLU:

  1. As discussed briefly in the NLI models notebook, Leonid Keselman observed in his 2016 NLU course project that one can do much better than chance on SNLI by processing only the hypothesis, ignoring the premise entirely. The exact interpretation of this is complex (we'll explore this a bit in our NLI bake-off), but it's certainly relevant for understanding how much a system has actually learned about reasoning from a premise to a conclusion.

  2. Schwartz et al. (2017) develop a system for choosing between a coherent and incoherent ending for a story. Their best system achieves 75% accuracy by processing the story and the ending, but they achieve 72% using only stylistic features of the ending, ignoring the preceding story entirely. This puts the 75% – and the extent to which the system understands story completion – in a new light.

Hyperparameter optimization

In machine learning, the parameters of a model are those whose values are learned as part of optimizing the model itself.

The hyperparameters of a model are any settings that are set by a process that is outside of this optimization process. The boundary between a true setting of the model and a broader design choice will likely be blurry conceptually. For example:

  • The regularization term for a classifier is a clear hyperparameter – it appears in the model's objective function.
  • What about the method one uses for normalizing the feature values? This is probably not a setting of the model per se, but rather a choice point in your experimental framework.

For the purposes of this discussion, we'll construe hyperparameters very broadly.


Hyperparameter optimization is one of the most important parts of machine learning, and a crucial part of building a persuasive argument. To see why, it's helpful to imagine that you're in an ongoing debate with a very skeptical referee:

  1. You ran experiments with models A, B, and C. For each, you used the default hyperparameters as given by the implementations you're using. You found that C performed the best, and so you reported that in your paper.
  2. Your reviewer doesn't have visibility into your process, and maybe doesn't fully trust you. Did you try any other values for the hyperparameters without reporting that? If not, would you have done that if C hadn't outperformed the others? There is no way for the reviewer (or perhaps anyone) to answer these questions.
  3. So, from the reviewer's perspective, all we learned from your experiments is that there is some set of hyperparameters on which C wins this competition. But, strictly speaking, this conveys no new information; we knew before you did your experiments that we could find settings that would deliver this and all other outcomes. (They might not be sensible settings, but remember you're dealing with a hard-bitten, unwavering skeptic.)

Our best response to this situation is to allow these models to explore a wide range of hyperparameters, choose the best ones according to performance on training or development data, and then report how they do with those settings at test time. This gives every model its best chance to succeed.

If you do this, the strongest argument that your skeptical reviewer can muster is that you didn't pick the right space of hyperparameters to explore for one or more of the models. Alas, there is no satisfying the skeptic, but we can at least feel happy that the outcome of these experiments will have a lot more scientific value than the ones described above with fixed hyperparameters.

The ideal hyperparameter optimization setting

When evaluating a model, the ideal regime for hyperparameter optimization is as follows:

  1. For each hyperparameter, identify a large set of values for it.
  2. Create a list of all the combinations of all the hyperparameter values. This will be the cross-product of all the values for all the hyperparameters identified at step 1.
  3. For each of the settings, cross-validate it on the available training data.
  4. Choose the settings that did best in step 3, train on all the training data using those settings, and then evaluate that model on the test set.

This is very demanding. First, the number of settings grows quickly with the number of hyperparameters and values. If hyperparameter $h_{1}$ has $5$ values and hyperparameter $h_{2}$ has $10$, then the number of settings is $5 \cdot 10 = 50$. If we add a third hyperparameter $h_{3}$ with just $2$ values, then the number jumps to $100$. Second, if you're doing 5-fold cross-validation, then each model is trained 5 times. You're thus committed to training $500$ models.

And it could get worse. Suppose you don't have a fixed train/test split, and you're instead reporting, say, the result of 10 random train/test splits. Strictly speaking, the optimal hyperparameters could be different for different splits. Thus, for each split, the above cross-validation should be conducted. Now you're committed to training $5,000$ systems!
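The arithmetic above can be checked mechanically. This sketch uses scikit-learn's ParameterGrid to enumerate the cross-product; the hyperparameter names and their values are placeholders standing in for $h_{1}$, $h_{2}$, and $h_{3}$.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder hyperparameters with 5, 10, and 2 candidate values.
grid = ParameterGrid({
    "h1": list(range(5)),
    "h2": list(range(10)),
    "h3": [True, False]})

n_settings = len(grid)   # size of the cross-product: 5 * 10 * 2
n_folds = 5              # 5-fold cross-validation per setting
n_splits = 10            # 10 random train/test splits, each tuned separately

fits_fixed_split = n_settings * n_folds
fits_random_splits = fits_fixed_split * n_splits
print(n_settings, fits_fixed_split, fits_random_splits)
```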

Practical considerations, and some compromises

The above is untenable as a set of laws for the scientific community. If we adopted it, then complex models trained on large datasets would end up disfavored, and only the very wealthy would be able to participate. Here are some pragmatic steps you can take to alleviate this problem, in descending order of attractiveness. (That is, the lower you go on this list, the more likely the skeptic is to complain!)

  1. Bergstra and Bengio (2012) argue that randomly sampling from the space of hyperparameters delivers results like the full "grid search" described above with relatively few samples. Hyperparameter optimization algorithms like those implemented in Hyperopt and scikit-optimize allow guided sampling from the full space. All these methods control the exponential growth in settings that comes from any serious look at one's hyperparameters.

  2. In large deep learning systems, the hyperparameter search could be done on the basis of just a few iterations. The systems likely won't have converged, but it's a solid working assumption that early performance is highly predictive of final performance. You might even be able to justify this with learning curves over these initial iterations.

  3. Not all hyperparameters will contribute equally to outcomes. Via heuristic exploration, it is typically possible to identify the less informative ones and set them by hand. As long as this is justified in the paper, it shouldn't rile the skeptic too much.

  4. Where repeated train/test splits are being run, one might find optimal hyperparameters via a single split and use them for all the subsequent splits. This is justified if the splits are very similar.

  5. In the worst case, one might have to adopt hyperparameters that were optimal for other experiments that have been published. The skeptic will complain that these findings don't translate to your new data sets. That's true, but it could be the only option. For example, how would one compare against Rajkomar et al. (2018) who report that "the performance of all above neural networks were [sic] tuned automatically using Google Vizier [35] with a total of >201,000 GPU hours"?

Hyperparameter optimization tools

  • scikit-learn's model_selection package has classes GridSearchCV and RandomizedSearchCV. These are very easy to use. (We used GridSearchCV in our sentiment unit.)

  • scikit-optimize offers a variety of methods for guided search through the grid of hyperparameters. This post assesses these methods against grid search and fully randomized search, and it also provides starter code for using these implementations with sklearn-style classifiers.
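For a sense of what randomized search looks like in practice, here is a minimal sketch with RandomizedSearchCV. The model, the single hyperparameter `C`, its log-uniform range, and the budget of 10 samples are all illustrative assumptions. The important pattern is that selection happens by cross-validation on the training data, and the test set is touched only once at the end.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    RandomizedSearchCV, StratifiedKFold, train_test_split)

X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    # Sample the regularization strength on a log scale.
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42)

# Hyperparameter selection uses cross-validation on the training data only.
search.fit(X_train, y_train)

# With refit=True (the default), the best setting is retrained on all of
# X_train, and score() applies the same macro-F1 metric to the test set.
test_score = search.score(X_test, y_test)
print(search.best_params_, round(test_score, 3))
```

Swapping in GridSearchCV with an explicit `param_grid` follows the same pattern, at the cost of evaluating every combination.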

Classifier comparison

Suppose you've assessed two classifier models. Their performance is probably different to some degree. What can be done to establish whether these models are different in any meaningful sense?

Practical differences

One very simple step is to count how many examples the models actually differ on.

  • If the test set has 1,000 examples, then a difference of 1% in accuracy or F1 will correspond to roughly 10 examples. We'll likely have intuitions about whether that difference has any practical import.

  • If the test set has 1M examples, then 1% will correspond to 10,000 examples, which seems sure to matter. Unless other considerations (e.g., cost, understandability) favor the less accurate model, the choice seems clear.
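Counting the disagreements takes only a few lines. The helper name `count_disagreements` and the tiny label vectors below are made up for illustration.

```python
import numpy as np

def count_disagreements(y_true, preds_a, preds_b):
    """Count the examples on which two models' predictions differ, and
    which model is right on those examples."""
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))
    differ = preds_a != preds_b
    return {
        "differ": int(differ.sum()),
        "only_a_correct": int((differ & (preds_a == y_true)).sum()),
        "only_b_correct": int((differ & (preds_b == y_true)).sum())}

# Made-up gold labels and predictions for two systems.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
preds_a = [1, 0, 1, 0, 0, 1, 1, 0]
preds_b = [1, 0, 0, 0, 0, 0, 1, 1]

counts = count_disagreements(y_true, preds_a, preds_b)
print(counts)
```

Looking at the actual examples behind these counts is often more informative than the counts themselves.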

Confidence intervals

If you can afford to run the model multiple times, then reporting confidence intervals based on the resulting scores could suffice to build an argument about whether the models are meaningfully different.

The following will calculate a simple 95% confidence interval for a vector of scores vals:

In [4]:
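def get_ci(vals):
    # A t-based 95% interval for the mean of vals; if all the scores
    # are identical, the interval collapses to that single value.
    if len(set(vals)) == 1:
        return (vals[0], vals[0])
    loc = np.mean(vals)
    # Standard error from the sample standard deviation (ddof=1),
    # to match the len(vals)-1 degrees of freedom used below.
    scale = np.std(vals, ddof=1) / np.sqrt(len(vals))
    return stats.t.interval(0.95, len(vals)-1, loc=loc, scale=scale)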

It's very likely that these confidence intervals will look very large relative to the variation that you actually observe. You probably can afford to do no more than 10–20 runs. Even if your model is performing very predictably over these runs (which it will, assuming your method for creating the splits is sound), the above intervals will be large in this situation. This might justify bootstrapping the confidence intervals. I recommend scikits-bootstrap for this.
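For readers who want to see what bootstrapping involves, here is a plain NumPy sketch of a percentile bootstrap for the mean score. It is not the scikits-bootstrap API; the function name `bootstrap_ci`, its defaults, and the scores are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(vals, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of vals."""
    vals = np.asarray(vals, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the scores with replacement, one row per bootstrap sample.
    idx = rng.integers(0, len(vals), size=(n_resamples, len(vals)))
    means = vals[idx].mean(axis=1)
    lower, upper = np.percentile(
        means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Made-up scores from 10 runs of a single system.
scores = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.82]
lower, upper = bootstrap_ci(scores)
print(f"95% bootstrap CI for the mean: ({lower:.3f}, {upper:.3f})")
```

With only 10 to 20 scores, any interval estimate is rough, so it is worth reporting the number of runs alongside the interval.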

Important: when evaluating multiple systems via repeated train/test splits or cross-validation, all the systems have to be run on the same splits. This is the only way to ensure that all the systems face the same challenges.

Wilcoxon signed-rank test

NLPers always choose tables over plots for some reason, and confidence intervals are hard to display in tables. This might mean that you want to calculate a p-value.

Where you can afford to run the models at least 10 times with different splits (and preferably more like 20), Demšar (2006) recommends the Wilcoxon signed-rank test. This is implemented in scipy as scipy.stats.wilcoxon. This test relies only on the signs and the rank-ordered absolute values of the per-split score differences, and it does not assume that the scores are normally distributed.

Take care not to confuse this with scipy.stats.ranksums, which does the Wilcoxon rank-sums test. This is also known as the Mann–Whitney U test, though SciPy distinguishes this as a separate test (scipy.stats.mannwhitneyu). In any case, the heart of this is that the signed-rank variant is more appropriate for classifier assessments, where we are always comparing systems trained and assessed on the same underlying pool of data.

As with all tests of this form, we should be aware of what it can tell us and what it can't:

  • The test says nothing about the practical importance of any differences observed.

  • Small p-values do not reliably indicate large effect sizes. (A small p-value will more strongly reflect the number of samples you have.)

  • Large p-values simply mean that the available evidence doesn't support a conclusion that the systems are different, not that there is no difference in fact. And even that limited conclusion is only relative to this particular, quite conservative test.

All this is to say that these values should not be asked to stand on their own, but rather presented as part of a larger, evidence-driven argument.
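A minimal sketch of the test in use follows. The two score vectors are made up; they stand in for macro-F1 scores of two systems evaluated on the same 12 splits, paired by position.

```python
import numpy as np
from scipy import stats

# Made-up scores for systems A and B; entry i of each vector comes from
# the same train/test split, which is what makes the samples paired.
scores_a = np.array([0.812, 0.797, 0.830, 0.805, 0.821, 0.789,
                     0.842, 0.801, 0.815, 0.826, 0.808, 0.819])
scores_b = np.array([0.801, 0.790, 0.815, 0.807, 0.803, 0.775,
                     0.820, 0.795, 0.798, 0.802, 0.804, 0.806])

# Two-sided signed-rank test on the per-split differences.
stat, p_value = stats.wilcoxon(scores_a, scores_b)

n_a_wins = int(np.sum(scores_a > scores_b))
print(f"A beats B on {n_a_wins} of {len(scores_a)} splits; p = {p_value:.4f}")
```

Reporting the win count and the mean difference next to the p-value helps keep the practical size of the effect in view.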

McNemar's test

McNemar's test operates directly on the vectors of predictions for the two models being compared. As such, it doesn't require repeated runs, which is good where optimization is expensive.

The basis for the test is a contingency table with the following form, for two models A and B:

$$\begin{array}{|c | c |} \hline \textrm{number of examples} & \textrm{number of examples} \\ \textrm{where A and B are correct} & \textrm{where A is correct, B incorrect} \\\hline \textrm{number of examples} & \textrm{number of examples} \\ \textrm{where A is incorrect, B correct} & \textrm{where both A and B are incorrect} \\\hline \end{array}$$

Following Dietterich (1998), let the above be abbreviated to

$$\begin{array}{|c | c |} \hline n_{11} & n_{10} \\\hline n_{01} & n_{00} \\ \hline \end{array}$$

The null hypothesis tested is that the two models have the same error rate, i.e., that $n_{01} = n_{10}$. The test statistic is

$$ \frac{ \left(|n_{01} - n_{10}| - 1\right)^{2} }{ n_{01} + n_{10} }$$

which has an approximately chi-squared distribution with 1 degree of freedom.

An implementation is available in this repository: utils.mcnemar.
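To show how the statistic above maps onto code, here is an independent sketch. It is not the utils.mcnemar implementation: the function name `mcnemar_sketch` is hypothetical, and returning a p-value of 1 when the models never disagree is a convention chosen for this example.

```python
import numpy as np
from scipy import stats

def mcnemar_sketch(y_true, preds_a, preds_b):
    """Return the continuity-corrected McNemar statistic and its p-value."""
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))
    a_correct = preds_a == y_true
    b_correct = preds_b == y_true
    n10 = int(np.sum(a_correct & ~b_correct))   # A correct, B incorrect
    n01 = int(np.sum(~a_correct & b_correct))   # A incorrect, B correct
    if n01 + n10 == 0:
        # The models never disagree, so there is no evidence of a difference.
        return 0.0, 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = stats.chi2.sf(stat, df=1)
    return stat, p_value

# Made-up predictions on 100 examples whose gold label is 0:
# B alone errs on 15 examples and A alone errs on 5.
y_true = np.zeros(100, dtype=int)
preds_a = np.zeros(100, dtype=int)
preds_b = np.zeros(100, dtype=int)
preds_b[:15] = 1
preds_a[15:20] = 1

stat, p_value = mcnemar_sketch(y_true, preds_a, preds_b)
print(f"statistic = {stat:.2f}, p = {p_value:.4f}")
```

Only the off-diagonal cells enter the statistic, so the 80 examples on which both models agree have no effect on the result.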

Assessing models without convergence

When working with linear models, convergence issues rarely arise. Typically, the implementation has a fixed number of iterations it performs, or a threshold on the error, and the model stops when it reaches one of these points. We mostly don't reflect on this because of the speed and stability of these models.

With neural networks, convergence takes center stage. The models rarely converge, or they converge at different rates between runs, and their performance on the test data is often heavily dependent on these differences. Sometimes a model with a low final error turns out to be great, and sometimes it turns out to be worse than one that finished with a higher error. Who knows?!

Incremental dev set testing

The key to addressing this uncertainty is to regularly collect information about dev set performance as part of training. For example, at every 100th iteration, one could make predictions on the dev set and store that vector of predictions, or just whatever assessment metric one is using. These assessments can provide direct information about how the model is doing on the actual task we care about, which will be a better indicator than the errors.

All the PyTorch models for this course accept keyword arguments X_dev and dev_iter. If these are specified, then the model is tested on X_dev every dev_iter iterations and the resulting predictions are stored in the attribute dev_predictions. Here's an example:

First, an artificial classification dataset with a train/dev/test structure:

In [5]:
X, y = make_classification(
    class_sep=0.5, n_samples=5000, n_features=200, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y)

X_train, X_dev, y_train, y_dev = train_test_split(X_train, y_train)

Second, a shallow neural classifier trained with the requisite keyword arguments provided to fit:

In [6]:
dev_iter = 5 # Test increments.

model = TorchShallowNeuralClassifier(max_iter=100, hidden_dim=10)

_ = model.fit(X_train, y_train, X_dev=X_dev, dev_iter=dev_iter)
Finished epoch 100 of 100; error is 0.05354957003146416

Third, we can calculate our chosen evaluation metric for each of the incremental predictions:

In [7]:
dev_preds = sorted(model.dev_predictions.items())

scores = [utils.safe_macro_f1(y_dev, p) for i, p in dev_preds]

scores = pd.Series(scores)

scores.index *= dev_iter

Finally, we have a neat plot that tells us a lot about how training affects the model's performance:

In [8]:
ax = scores.plot()
_ = ax.set_ylabel("Macro F1")

It's a different picture than we get from the error term:

In [9]:
err_ax = pd.Series(model.errors).plot()
_ = err_ax.set_ylabel("Error")

Early stopping

The above plot of dev-set performance suggests a simple strategy of early stopping: identify the iteration $i$ at which dev-set performance peaked and train our models for exactly $i$ iterations when doing our final test-set run. This value $i$ can be set differently for different models; selecting this point could even be done automatically during hyperparameter optimization.

If it is important to test the same model that is being used to create the dev-set performance curve, then one needs to store all the model parameters for the currently best model and then "rewind" to that stage once one decides that further training isn't helping. This is arguably the safest thing to do, since it keeps the actual parameters that maximized dev-set performance; see below on the impact of random initializations.

For more on early stopping schemes, see Prechelt 1997.
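One way to turn the idea above into a rule is sketched below. The function name `early_stopping_point` and the `patience` criterion (stop after a fixed number of dev checks without improvement) are assumptions made for illustration, not something the course models implement in this form. The history is a made-up list of (iteration, dev score) pairs like those one could build from incremental dev-set predictions.

```python
def early_stopping_point(history, patience=3):
    """Scan (iteration, dev_score) pairs in order and return a tuple
    (best_iteration, best_score, stop_iteration). Training would stop
    once `patience` consecutive checks fail to beat the best score so
    far; stop_iteration is None if that never happens."""
    best_iter, best_score = None, float("-inf")
    checks_since_best = 0
    stop_iter = None
    for iteration, score in history:
        if score > best_score:
            # New best: this is the point one would "rewind" to.
            best_iter, best_score = iteration, score
            checks_since_best = 0
        else:
            checks_since_best += 1
            if checks_since_best >= patience:
                stop_iter = iteration
                break
    return best_iter, best_score, stop_iter

# Made-up dev-set scores recorded every 5 iterations.
history = [(5, 0.61), (10, 0.70), (15, 0.74), (20, 0.73),
           (25, 0.74), (30, 0.72), (35, 0.71)]

best_iter, best_score, stop_iter = early_stopping_point(history)
print(best_iter, best_score, stop_iter)
```

In a real training loop, the parameters would be saved each time a new best score is recorded so that they can be restored after stopping.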

Learning curves with confidence intervals

I frankly think the best response to all this is to accept that incremental performance plots like the above are how we should be assessing our models. This exposes all of the variation that we actually observe.

In addition, in deep learning, we're often dealing with classes of models that are in principle capable of learning anything. The real question is implicitly how efficiently they can learn given the available data and other resources. Learning curves bring this out very clearly.

We can improve the curves by adding confidence intervals to them derived from repeated runs. Here's a plot from a paper I recently wrote with Nick Dingwall (Dingwall and Potts 2018):

I think this shows very clearly that, once all is said and done, the Mittens model (red) learns faster than the others, but is indistinguishable from the Clinical text GloVe model (blue) after enough training time. Furthermore, it's clear that the other two models are never going to catch up in the current experimental setting. A lot of this information would be lost if, for example, we decided to stop training when dev set performance reached its peak and report only a single F1 score per class.
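The bands in such a plot can be computed from a matrix of scores with one row per run and one column per checkpoint. The following sketch uses synthetic scores as a stand-in for real training runs; the curve shape, noise level, and number of runs are all invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for dev-set scores: 10 runs x 20 checkpoints.
checkpoints = np.arange(5, 105, 5)
underlying_curve = 0.8 - 0.4 * np.exp(-checkpoints / 25)
runs = underlying_curve + rng.normal(
    scale=0.02, size=(10, len(checkpoints)))

# Mean and t-based 95% interval at each checkpoint, across runs.
mean = runs.mean(axis=0)
sem = stats.sem(runs, axis=0)   # standard error, computed with ddof=1
half_width = sem * stats.t.ppf(0.975, df=runs.shape[0] - 1)
lower, upper = mean - half_width, mean + half_width

# With matplotlib, the band could then be drawn with
# ax.plot(checkpoints, mean) and ax.fill_between(checkpoints, lower, upper).
print(np.round(mean[:3], 3), np.round(half_width[:3], 3))
```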

The role of random parameter initialization

Most deep learning models have their parameters initialized randomly, perhaps according to some heuristics related to the number of parameters (Glorot and Bengio 2010) or their internal structure (Saxe et al. 2014). This is meaningful largely because of the non-convex optimization problems that these models define, but it can impact simpler models that have multiple optimal solutions that still differ at test time.

There is growing awareness that these random choices have serious consequences. For instance, Reimers and Gurevych (2017) report that different initializations for neural sequence models can lead to statistically significant differences in results, and they show that a number of recent systems are indistinguishable in terms of raw performance once this source of variation is taken into account.

This shouldn't surprise practitioners, who have long struggled with the question of what to do when a system experiences a catastrophic failure as a result of unlucky initialization. (I think the answer is to report this failure rate.)

The code snippet below lets you experience this phenomenon for yourself. The XOR logic operator, which is true just in case its two arguments have different values, is famously not learnable by a linear classifier but within reach of a neural network with a single hidden layer and a non-linear activation function (Rumelhart et al. 1986). (The snippet labels an example 1 when its two arguments match, which is the complement of XOR and an equivalent learning problem.) But how consistently do such models actually learn XOR? No matter what settings you choose, you rarely if ever see perfect performance across multiple runs.

In [10]:
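def xor_eval(n_trials=10):
    # Each example pairs two binary inputs with a label that is 1 when
    # the inputs match and 0 otherwise.
    xor = [
        ([1.,1.], 1),
        ([1.,0.], 0),
        ([0.,1.], 0),
        ([0.,0.], 1)]
    X, y = zip(*xor)
    results = defaultdict(int)
    for trial in range(n_trials):
        # max_iter=500 matches the training log below; all other
        # settings are left at their defaults.
        model = TorchShallowNeuralClassifier(max_iter=500)
        model.fit(X, y)
        preds = tuple(model.predict(X))
        # A trial counts as correct only if all four examples are right.
        result = 'correct' if preds == y else 'incorrect'
        results[result] += 1
    return results

xor_eval(n_trials=10)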

Finished epoch 500 of 500; error is 0.351218581199646337
defaultdict(int, {'correct': 8, 'incorrect': 2})

For better or worse, the only response we have to this situation is to report scores for multiple complete runs of a model with different randomly chosen initializations. Confidence intervals and statistical tests can be used to summarize the variation observed. If the evaluation regime already involves comparing the results of multiple train/test splits, then ensuring a new random initialization for each of those would seem sufficient.

Arguably, these observations are incompatible with evaluation regimes involving only a single train/test split, as in McNemar's test. However, as discussed above, we have to be realistic. If multiple runs aren't feasible, then a more heuristic argument will be needed to try to convince skeptics that the differences observed are larger than we would expect from just different random initializations.

Closing remarks

We can summarize most of the above with a few key ideas:

  1. Your evaluation should be based around a few systems that are related in ways that illuminate your hypotheses and help to convey what the best models are learning.

  2. Every model you assess should be given its best chance to shine (but we need to be realistic about how many experiments this entails!).

  3. The test set should play no role whatsoever in optimization or model selection. The best way to ensure this is to have the test set locked away until the final batch of experiments that will be reported in the paper, but this separation is simulated adequately by careful cross-validation set-ups.

  4. Strive to base your model comparisons on multiple runs on the same splits. This is especially important for deep learning, where a single model can perform in very different ways on the same data, depending on the vagaries of optimization.