Evaluation metrics in NLP

In [3]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2018 term"


  1. Different evaluation metrics encode different values and have different biases and other weaknesses. Thus, you should choose your metrics carefully, and motivate those choices when writing up and presenting your work.

  2. This notebook reviews some of the most prominent evaluation metrics in NLP, seeking not only to define them, but also to articulate what values they encode and what their weaknesses are.

  3. In your own work, you shouldn't feel confined to these metrics. Per item 1 above, you should feel that you have the freedom to motivate new metrics and specific uses of existing metrics, depending on what your goals are.

  4. If you're working on an established problem, then you'll feel pressure from readers (and referees) to use the metrics that have already been used for the problem. This might be a compelling pressure. However, you should always feel free to argue against those cultural norms and motivate new ones. Areas can stagnate due to poor metrics, so we must be vigilant!

This notebook discusses prominent metrics in NLP evaluations. I've had to be selective to keep the notebook from growing too long and complex. I think the measures and considerations here are fairly representative of the issues that arise in NLP evaluation.

The scikit-learn model evaluation usage guide is excellent as a source of implementations, definitions, and references for a wide range of metrics for classification, regression, ranking, and clustering.

This notebook is the first in a two-part series on evaluation. Part 2 is on evaluation methods.


In [2]:
%matplotlib inline
from nltk.metrics.distance import edit_distance
from nltk.translate import bleu_score
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Classifier metrics

Confusion matrix

  1. A confusion matrix gives a complete comparison of how the observed/gold labels compare to the labels predicted by a classifier.

  2. For classifiers that predict real values (scores, probabilities), it is important to remember that a threshold was imposed to create these categorical predictions.

  3. The position of this threshold can have a large impact on the overall assessment that uses the confusion matrix as an input. The default is to choose the class with the highest probability. This is so deeply ingrained that it is often not even mentioned. However, it might be inappropriate:

    1. We might care about the full distribution.
    2. Where the important class is very small relative to the others, any significant amount of positive probability for it might be important.
  4. Metrics like average precision explore this threshold as part of their evaluation procedure.
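To make the role of the threshold concrete, here is a minimal sketch with made-up probabilities for a small positive class. The data and the helper names `labels_at` and `confusion_counts` are illustrative assumptions, not part of the notebook. Lowering the threshold from 0.5 to 0.2 changes which examples count as positive predictions, and so it changes every cell of the confusion matrix:

```python
# Hypothetical predicted probabilities for the positive class.
probs = [0.05, 0.30, 0.45, 0.55, 0.90]
gold = [0, 1, 1, 0, 1]

def labels_at(probs, threshold):
    """Turn scores into 0/1 labels using `threshold`."""
    return [1 if p >= threshold else 0 for p in probs]

def confusion_counts(gold, pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    return tp, fp, fn, tn

default_counts = confusion_counts(gold, labels_at(probs, 0.5))
lenient_counts = confusion_counts(gold, labels_at(probs, 0.2))
print(default_counts)  # (1, 1, 2, 1)
print(lenient_counts)  # (3, 1, 0, 1)
```

With the lower threshold, the two missed positives are recovered at no extra cost in false positives for this toy data; on real data the trade-off will usually be less favorable.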

This function creates the toy confusion matrices that we will use for illustrative examples:

In [3]:
def illustrative_confusion_matrix(data):
    classes = ['pos', 'neg', 'neutral']
    ex = pd.DataFrame(
        data,
        columns=classes,
        index=classes)
    ex.index.name = "observed"
    return ex


Accuracy

Accuracy is the sum of the correct predictions divided by the sum of all predictions:

In [4]:
def accuracy(cm):
    return cm.values.diagonal().sum() / cm.values.sum()

Here's an illustrative confusion matrix:

In [5]:
ex1 = illustrative_confusion_matrix([
    [15,  10,  100],
    [10,  15,   10],
    [10, 100, 1000]])

         pos  neg  neutral
pos       15   10      100
neg       10   15       10
neutral   10  100     1000
In [6]:


[0, 1], with 0 the worst and 1 the best.

Value encoded

Accuracy seems to directly encode a core value we have for classifiers – how often they are correct. In addition, the accuracy of a classifier on a test set will be negatively correlated with the negative log (logistic, cross-entropy) loss, which is a common loss for classifiers. In this sense, these classifiers are optimizing for accuracy.


  • Accuracy does not give per-class metrics for multi-class problems.

  • Accuracy fails to control for size imbalances in the classes. For instance, consider the variant of the above in which the classifier guessed only neutral:

In [7]:
ex2 = illustrative_confusion_matrix([
    [0, 0,  125],
    [0, 0,   35], 
    [0, 0, 1110]])

         pos  neg  neutral
pos        0    0      125
neg        0    0       35
neutral    0    0     1110

Intuitively, this is a worse classifier than the one that produced ex1. Whereas ex1 does well at pos and neg despite their small size, this classifier doesn't even try to get them right – it always predicts neutral. However, its accuracy is higher!
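The comparison can be checked by hand. The following sketch recomputes accuracy for both matrices in plain Python so that it runs on its own; the helper name `accuracy_from_rows` is an assumption made for this example:

```python
# Confusion matrices from the text: rows are observed, columns are predicted.
ex1 = [[15, 10, 100], [10, 15, 10], [10, 100, 1000]]
ex2 = [[0, 0, 125], [0, 0, 35], [0, 0, 1110]]

def accuracy_from_rows(cm):
    """Correct predictions (the diagonal) over all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

acc1 = accuracy_from_rows(ex1)  # 1030 / 1270
acc2 = accuracy_from_rows(ex2)  # 1110 / 1270
print(round(acc1, 3), round(acc2, 3))  # 0.811 0.874
```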

In [8]:
  • Accuracy is closely related to the negative log (logistic, cross-entropy) loss metric, as described above.

  • The "one-hot" vectors that constitute the labels for most classifiers can be seen as a special case of full probability distributions over the labels. If one's labels are probabilistic in this sense, and one's classifier predicts such distributions as well, then KL divergence is the natural generalization of the log-loss and can thus be seen as a counterpart to accuracy in this setting.
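As a small check on the last point, the sketch below computes KL divergence for a one-hot gold label and for a probabilistic gold label. The distributions are made up for illustration. With the one-hot label, the KL divergence reduces to the negative log probability of the true class, which is the per-example log-loss:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_loss_single(true_index, q):
    """Negative log probability assigned to the true class."""
    return -math.log(q[true_index])

predicted = [0.7, 0.2, 0.1]   # hypothetical classifier output
one_hot = [1.0, 0.0, 0.0]     # gold label is class 0
soft_gold = [0.6, 0.3, 0.1]   # hypothetical probabilistic gold label

kl_one_hot = kl_divergence(one_hot, predicted)
kl_soft = kl_divergence(soft_gold, predicted)
print(kl_one_hot, log_loss_single(0, predicted))  # the same value, up to rounding
print(kl_soft)
```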


Precision

Precision is the sum of the correct predictions divided by the sum of all guesses. This is a per-class notion; in our confusion matrices, it's the diagonal values divided by the column sums:

In [9]:
def precision(cm):
    return cm.values.diagonal() / cm.sum(axis=0)
In [10]:
pos        0.428571
neg        0.120000
neutral    0.900901
dtype: float64

For our problematic all neutral classifier above, precision is strictly speaking undefined for pos and neg:

In [11]:
pos             NaN
neg             NaN
neutral    0.874016
dtype: float64

It's common to see these NaN values mapped to 0.
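One way to make that mapping explicit is to guard the division. The sketch below is a plain-Python variant of the calculation, written so that it runs on its own; the `undefined` parameter is an assumption of this example and not part of the `precision` function above:

```python
def per_class_precision(cm, undefined=0.0):
    """Diagonal over column sums. `undefined` is returned for any class
    that is never predicted (column sum of 0)."""
    n = len(cm)
    result = []
    for j in range(n):
        col_sum = sum(cm[i][j] for i in range(n))
        result.append(cm[j][j] / col_sum if col_sum else undefined)
    return result

# The all-neutral classifier from the text (rows observed, columns predicted).
ex2 = [[0, 0, 125], [0, 0, 35], [0, 0, 1110]]
precisions = per_class_precision(ex2)
print(precisions)
```

Mapping to 0 treats "never guessed" as "always wrong", which is a modeling choice; it matters a great deal once these values are averaged across classes.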


[0, 1], with 0 the worst and 1 the best.

Value encoded

Precision encodes a conservative value in penalizing incorrect guesses.


Precision's dangerous edge case is that one can achieve very high precision for a category by rarely guessing it. Consider, for example, the following classifier's flawless predictions for pos and neg. These predictions are at the expense of neutral, but that is such a big class that it hardly matters to the precision for that class either.

In [12]:
ex3 = illustrative_confusion_matrix([
    [1, 0,  124], 
    [0, 1,   24], 
    [0, 0, 1110]])

         pos  neg  neutral
pos        1    0      124
neg        0    1       24
neutral    0    0     1110
In [13]:
pos        1.000000
neg        1.000000
neutral    0.882353
dtype: float64

These numbers mask the fact that this is a very poor classifier!


Recall

Recall is the sum of the correct predictions divided by the sum of all true instances. This is a per-class notion; in our confusion matrices, it's the diagonal values divided by the row sums. Recall is sometimes called the "true positive rate".

In [14]:
def recall(cm):
    return cm.values.diagonal() / cm.sum(axis=1)
In [15]:
pos        0.120000
neg        0.428571
neutral    0.900901
dtype: float64

Recall trades off against precision. For instance, consider again ex3, in which the classifier was very conservative with pos and neg:

In [16]:
pos        1.000000
neg        1.000000
neutral    0.882353
dtype: float64

In contrast, recall is very low here because the classifier guessed neutral for so many of these classes' true instances:

In [17]:
pos        0.008
neg        0.040
neutral    1.000
dtype: float64


[0, 1], with 0 the worst and 1 the best.

Value encoded

Recall encodes a permissive value in penalizing only missed true cases.


Recall's dangerous edge case is that one can achieve very high recall for a category by always guessing it. This could mean a lot of incorrect guesses, but recall sees only the correct ones. You can see this in ex3 above. The model did make some incorrect neutral predictions, but it missed none, so it achieved perfect recall for that category.

F scores

F scores combine precision and recall via their harmonic mean, with a value $\beta$ that can be used to emphasize one or the other. Like precision and recall, this is a per-category notion.

In [18]:
def f_score(cm, beta):
    p = precision(cm)
    r = recall(cm)
    return (beta**2 + 1) * ((p * r) / ((beta**2 * p) + r))

With beta=1, this is the F1 score, which gives equal weight to precision and recall:

In [19]:
def f1_score(cm):
    return f_score(cm, beta=1.0)
In [20]:
pos        0.187500
neg        0.187500
neutral    0.900901
dtype: float64
In [21]:
pos             NaN
neg             NaN
neutral    0.932773
dtype: float64
In [22]:
pos        0.015873
neg        0.076923
neutral    0.937500
dtype: float64
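To see how beta shifts the emphasis, here is a scalar sketch that uses the precision and recall reported above for pos in ex3 (1.0 and 0.008). The function `f_beta` is a stand-alone restatement of the formula, and returning 0 when both inputs are 0 is a convention assumed for this example:

```python
def f_beta(p, r, beta):
    """Weighted harmonic mean of precision `p` and recall `r`."""
    if p == 0 and r == 0:
        return 0.0  # convention assumed here for the undefined case
    return (beta**2 + 1) * (p * r) / ((beta**2 * p) + r)

# Precision and recall for `pos` in ex3, as reported above.
p, r = 1.0, 0.008
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 4))
```

Values of beta below 1 pull the score toward precision (high here), and values above 1 pull it toward recall (very low here), so the score falls as beta grows.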


[0, 1], with 0 the worst and 1 the best, and guaranteed to be between precision and recall.

Value encoded

The F$_{\beta}$ score for a class $K$ is an attempt to summarize how well the classifier's $K$ predictions align with the true instances of $K$. Alignment brings in both missed cases and incorrect predictions. Intuitively, precision and recall keep each other in check in the calculation. This idea runs through almost all robust classification metrics.


  • For a given category $K$, the F$_{\beta}$ score for $K$ ignores all the values that are off the row and column for $K$, which might be the majority of the data. This means that the individual scores for a category can be very misleading about the overall performance of the system.

  • There is no normalization for the size of the dataset within $K$ or outside of it.

  • We get a score per class. This can be a virtue, but it can also be an obstacle, especially if one is doing automatic hyperparameter selection and cross-validation, as those processes require a single score.

Macro-averaged F scores

The macro-averaged F$_{\beta}$ score (macro F$_{\beta}$) is the mean of the F$_{\beta}$ score for each category:

In [23]:
def macro_f_score(cm, beta):
    return f_score(cm, beta).mean(skipna=False)
In [25]:
macro_f_score(ex1, beta=1)
In [26]:
macro_f_score(ex2, beta=1)
In [27]:
macro_f_score(ex3, beta=1)


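[0, 1], with 0 the worst and 1 the best. Each per-class score lies between that class's precision and recall, but the macro average itself is not guaranteed to lie between macro-averaged precision and macro-averaged recall.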

Value encoded

Macro F$_{\beta}$ scores inherit the values of F$_{\beta}$ scores, and they additionally say that we care about all the classes equally regardless of their size.


In NLP, we typically care about modeling all of the classes well, so macro-F$_{\beta}$ scores often seem appropriate. However, this is also the source of their primary weaknesses:

  • If a model is doing really well on a small class $K$, its high macro F$_{\beta}$ score might mask the fact that it mostly makes incorrect predictions outside of $K$. So macro F$_{\beta}$ scoring will make this kind of classifier look better than it is.

  • Conversely, if a model does well on a very large class, its overall performance might be high even if it stumbles on some small classes. So macro F$_{\beta}$ scoring will make this kind of classifier look worse than it is, as measured by sheer number of good predictions.

Weighted F scores

Weighted F$_{\beta}$ scores average the per-category F$_{\beta}$ scores, but it's a weighted average based on the size of the classes in the observed/gold data:

In [28]:
def weighted_f_score(cm, beta):
    scores = f_score(cm, beta=beta).values
    weights = cm.sum(axis=1)
    return np.average(scores, weights=weights)
In [29]:
weighted_f_score(ex3, beta=1.0)


[0, 1], with 0 the worst and 1 the best, but without a guarantee that it will be between precision and recall.

Value encoded

Weighted F$_{\beta}$ scores inherit the values of F$_{\beta}$ scores, and they additionally say that we want to weight the summary by the number of actual (observed/gold) examples in each class. This will probably correspond well with how the classifier will perform, on a per-example basis, on data with the same class distribution as the training data.


Large classes will dominate these calculations. Just like macro-averaging, this can make a classifier look artificially good or bad, depending on where its errors tend to occur.
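To see how much the class sizes matter, the following sketch recomputes both the macro and the weighted F1 for ex3 in plain Python. It is a self-contained restatement rather than the notebook's own functions, and it assumes every class is predicted at least once so that no division by zero arises:

```python
def per_class_f1(cm):
    """F1 per class; rows are observed labels, columns are predicted labels.
    Assumes every class has a non-zero row sum and column sum."""
    n = len(cm)
    scores = []
    for k in range(n):
        col_sum = sum(cm[i][k] for i in range(n))
        row_sum = sum(cm[k])
        p = cm[k][k] / col_sum
        r = cm[k][k] / row_sum
        scores.append(2 * p * r / (p + r))
    return scores

ex3 = [[1, 0, 124], [0, 1, 24], [0, 0, 1110]]
scores = per_class_f1(ex3)
sizes = [sum(row) for row in ex3]  # gold class sizes: 125, 25, 1110

macro = sum(scores) / len(scores)
weighted = sum(s * w for s, w in zip(scores, sizes)) / sum(sizes)
print(round(macro, 3), round(weighted, 3))
```

The weighted score is far higher than the macro score because neutral, the one class this classifier handles well, accounts for most of the gold examples.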

Micro-averaged F scores

Micro-averaged F$_{\beta}$ scores (micro F$_{\beta}$ scores) add up the 2 $\times$ 2 confusion matrices for each category versus the rest, and then they calculate the F$_{\beta}$ scores, with the convention being that the positive class's F$_{\beta}$ score is reported.

For F1, this value is identical to both precision and recall on that 2 $\times$ 2 matrix.

This function creates the 2 $\times$ 2 matrix for a category cat in a confusion matrix cm:

In [30]:
def cat_versus_rest(cm, cat):
    yes = cm.loc[cat, cat]
    yes_no = cm[cat].sum() - yes
    no_yes = cm.loc[cat].sum() - yes
    no = cm.values.sum() - yes - yes_no - no_yes
    return pd.DataFrame(
        [[yes,    yes_no], 
         [no_yes,    no]],
        columns=['yes', 'no'], 
        index=['yes', 'no'])
In [31]:
cat_versus_rest(ex1, 'pos')
     yes    no
yes   15    20
no   110  1125
In [32]:
cat_versus_rest(ex1, 'neg')
     yes    no
yes   15   110
no    20  1125
In [33]:
cat_versus_rest(ex1, 'neutral')
      yes   no
yes  1000  110
no    110   50

For the micro F$_{\beta}$ score, we just add up these per-category confusion matrices and calculate the F$_{\beta}$ score:

In [34]:
def micro_f_score(cm, beta):
    c = sum([cat_versus_rest(cm, cat) for cat in cm.index])
    return f_score(c, beta=beta).loc['yes']
In [35]:
micro_f_score(ex1, beta=1)

For two-class problems, this has an intuitive interpretation in which precision and recall are defined in terms of correct and incorrect guesses ignoring the class.
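Here is a sketch of why precision and recall coincide on the summed matrix. Every off-diagonal cell of the original confusion matrix is counted once as a false positive (for the predicted class) and once as a false negative (for the observed class), so the two totals match. With one label per example, the shared value also equals accuracy. The helper `micro_counts` is a plain-Python assumption of this example:

```python
def micro_counts(cm):
    """Sum the one-vs-rest counts over all classes.
    Rows are observed labels, columns are predicted labels."""
    n = len(cm)
    tp = sum(cm[k][k] for k in range(n))
    # Predicted k but observed something else (column minus diagonal):
    fp = sum(sum(cm[i][k] for i in range(n)) - cm[k][k] for k in range(n))
    # Observed k but predicted something else (row minus diagonal):
    fn = sum(sum(cm[k]) - cm[k][k] for k in range(n))
    return tp, fp, fn

ex1 = [[15, 10, 100], [10, 15, 10], [10, 100, 1000]]
tp, fp, fn = micro_counts(ex1)
micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
acc = tp / sum(sum(row) for row in ex1)
print(tp, fp, fn)  # 1030 240 240
```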


[0, 1], with 0 the worst and 1 the best, and guaranteed to be between precision and recall.

Value encoded

Micro F$_{\beta}$ scores inherit the values of weighted F$_{\beta}$ scores. (The resulting scores tend to be very similar.)


The weaknesses too are the same as those of weighted F$_{\beta}$ scores, with the additional drawback that we actually get two potentially very different values, for the positive and negative classes, and we have to choose one to meet our goal of having a single summary number. (See the 'yes' in the final line of micro_f_score.)

Precision–recall curves

I noted above that confusion matrices hide a threshold for turning probabilities/scores into predicted labels. With precision–recall curves, we finally address this.

A precision–recall curve is a method for summarizing the relationship between precision and recall for a binary classifier.

The basis for this calculation is not the confusion matrix, but rather the raw scores or probabilities returned by the classifier. Normally, we use 0.5 as the threshold for saying that a prediction is positive. However, each distinct real value in the set of predictions is a potential threshold. The precision–recall curve explores this space.

Here's a basic implementation; the sklearn version is more flexible and so recommended for real experimental frameworks.

In [36]:
def precision_recall_curve(y, probs):
    """`y` is a list of labels, and `probs` is a list of predicted
    probabilities or predicted scores -- likely a column of the 
    output of `predict_proba` using an `sklearn` classifier.
    """
    thresholds = sorted(set(probs))
    data = []
    for t in thresholds:
        # Use `t` to create labels:
        pred = [1 if p >= t else 0 for p in probs]
        # Precision/recall analysis as usual, focused on
        # the positive class:
        cm = pd.DataFrame(metrics.confusion_matrix(y, pred))
        prec = precision(cm)[1]
        rec = recall(cm)[1]
        data.append((t, prec, rec))
    # For intuitive graphs, always include this end-point:
    data.append((None, 1, 0))
    return pd.DataFrame(
        data, columns=['threshold', 'precision', 'recall'])        

To see what precision–recall curves look like, let's use sklearn's make_classification function to create an artificial classification problem:

In [37]:
X, y = datasets.make_classification(
    class_sep=0.6, n_samples=1000, n_features=6)

Making class_sep bigger will make the task easier, and changing n_features (and/or its related values n_informative, n_redundant, and n_repeated) will change how much information is available for predicting the labels.

With this dataset, we create a random train/test split:

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

Fit a classifier as usual:

In [39]:
mod = LogisticRegression()
_ = mod.fit(X_train, y_train)

And use the classifier to make predictions; the second step here keeps just the probabilities for the positive class:

In [40]:
predictions = mod.predict_proba(X_test)

_, pos_predictions = zip(*predictions)
In [41]:
prc = precision_recall_curve(y_test, pos_predictions)
In [42]:
ax = prc.plot(x='recall', y='precision')
ax.set_xlim([0, 1])
_ = ax.set_ylim([0, 1.1])

Value encoded

With precision–recall curves, we get a generalized perspective on F1 scores (and we could weight precision and recall differently to achieve the effects of beta for F scores more generally). These curves can be used, not only to assess a system, but also to identify an optimal decision boundary given external goals.
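As a sketch of the second use, the function below picks the threshold with the best F$_{\beta}$ score from a list of (threshold, precision, recall) rows. The function name and the curve values are hypothetical and chosen only for illustration; a curve from `precision_recall_curve` above would first need its final no-threshold end-point dropped:

```python
def best_threshold(curve, beta=1.0):
    """Pick the threshold with the highest F-beta score.
    `curve` is a list of (threshold, precision, recall) tuples;
    rows where precision and recall are both 0 are skipped."""
    def f_beta(p, r):
        return (beta**2 + 1) * p * r / (beta**2 * p + r)
    candidates = [(f_beta(p, r), t) for t, p, r in curve if p + r > 0]
    score, threshold = max(candidates, key=lambda pair: pair[0])
    return threshold, score

# Hypothetical curve values, for illustration only.
curve = [(0.2, 0.50, 1.00), (0.4, 0.60, 0.90),
         (0.6, 0.80, 0.60), (0.8, 0.95, 0.20)]
balanced = best_threshold(curve, beta=1.0)
precision_heavy = best_threshold(curve, beta=0.5)
print(balanced[0], precision_heavy[0])  # 0.4 0.6
```

Weighting precision more heavily (beta=0.5) moves the preferred threshold up, which is the kind of external goal the curve lets one act on.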


  • Most implementations are limited to binary problems. The basic concepts are defined for multi-class problems, but it's very difficult to understand the resulting hyperplanes.

  • There is no single statistic that does justice to the full curve, so this metric isn't useful on its own for guiding development and optimization. Indeed, opening up the decision threshold in this way really creates another hyperparameter that one has to worry about!

  • The Receiver operating characteristic (ROC) curve is very similar to the precision–recall curve. However, instead of balancing precision and recall, it balances recall against the false positive rate. Precision–recall curves are more appropriate than ROC curves in situations where the two classes are very different in size. Since this is common in NLP, precision–recall curves are more common.

  • Average precision, covered next, is a way of summarizing these curves with a single number.
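The point about class imbalance can be illustrated with a few made-up counts. In the sketch below (all numbers are hypothetical), the negative class is so large that the false positive rate stays tiny even though most positive predictions are wrong, which precision exposes immediately:

```python
# Hypothetical counts: 100 positives and 99,900 negatives.
tp, fn = 50, 50
fp, tn = 500, 99400

recall = tp / (tp + fn)                # true positive rate: 0.5
false_positive_rate = fp / (fp + tn)   # about 0.005
precision = tp / (tp + fp)             # about 0.091

print(recall, round(false_positive_rate, 4), round(precision, 4))
```

A point like this looks strong in ROC space (recall 0.5 at a false positive rate near 0) but weak in precision–recall space.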

Average precision

Average precision is a method for summarizing the precision–recall curve. It does this by calculating the average precision weighted by the change in recall from step to step along the curve. Here is the calculation in terms of the data structures returned by precision_recall_curve above, in which (as in sklearn) the largest recall value is first:

$$\textbf{average-precision}(r, p) = \sum_{i=1}^{n} (r_{i} - r_{i+1})p_{i}$$

where $n$ is the number of thresholds, taken in increasing order, and the precision and recall vectors $p$ and $r$ are of length $n+1$. (We insert a final pair of values $p=1$ and $r=0$ in the precision–recall curve calculation, with no threshold for that point.)

In [43]:
def average_precision(p, r):
    total = 0.0
    for i in range(len(p)-1):
        total += (r[i] - r[i+1]) * p[i]
    return total
In [44]:
average_precision(prc['precision'].values, prc['recall'].values)
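For a case small enough to verify by hand, here is the same sum on a three-example toy problem. The labels and scores are made up, and the precision and recall values are worked out in the comments rather than computed from a classifier:

```python
def average_precision(p, r):
    """Same sum as in the text: precision weighted by each drop in recall."""
    return sum((r[i] - r[i + 1]) * p[i] for i in range(len(p) - 1))

# Toy problem: gold labels [1, 0, 1] with scores [0.2, 0.5, 0.9].
# Thresholds in increasing order, then the (precision=1, recall=0) end-point.
#   t=0.2 predicts [1, 1, 1] -> precision 2/3, recall 1
#   t=0.5 predicts [0, 1, 1] -> precision 1/2, recall 1/2
#   t=0.9 predicts [0, 0, 1] -> precision 1,   recall 1/2
p = [2 / 3, 1 / 2, 1.0, 1.0]
r = [1.0, 0.5, 0.5, 0.0]

ap = average_precision(p, r)
print(round(ap, 4))  # 0.8333
```

Only the steps where recall changes contribute: (1 - 0.5) * 2/3 from the first step and (0.5 - 0) * 1 from the last.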


[0, 1], with 0 the worst and 1 the best.

Value encoded

This measure is very similar to the F1 score, in that it is seeking to balance precision and recall. Whereas the F1 score does this with the harmonic mean, average precision does it by making precision a function of recall.


  • An important weakness of this metric is cultural: it is often hard to tell whether a paper is reporting average precision or some interpolated variant thereof. The interpolated versions are meaningfully different and will tend to inflate scores. In any case, they are not comparable to the calculation defined above and implemented in sklearn as sklearn.metrics.average_precision_score.

  • Unlike for precision–recall curves, we aren't strictly speaking limited to binary classification here. Since we aren't trying to visualize anything, we can do these calculations for multi-class problems. However, then we have to decide on how the precision and recall values will be combined for each step: macro-averaged, weighted, or micro-averaged, just as with F$_{\beta}$ scores. This introduces another meaningful design choice.

  • There are interpolated versions of this score, and some tasks/communities have even settled on specific versions as their standard metrics. All such measures should be approached with skepticism, since all of them can inflate scores artificially in specific cases.

  • This blog post is an excellent discussion of the issues with linear interpolation. It proposes a step-wise interpolation procedure that is much less problematic. I believe the blog post and subsequent PR to sklearn led the sklearn developers to drop support for all interpolation mechanisms for this metric!

  • Average precision as defined above is a discrete approximation of the area under the precision–recall curve. This is a separate measure often referred to as "AUC". In calculating AUC for a precision–recall curve, some kind of interpolation will be done, and this will generally produce exaggerated scores for the same reasons that interpolated average precision does.


Regression metrics

Mean squared error

The mean squared error is a summary of the distance between predicted and actual values:

In [52]:
def mean_squared_error(y_true, y_pred):
    diffs = (y_true - y_pred)**2
    return np.mean(diffs)

The raw distances y_true - y_pred are often called the residuals.


[0, $\infty$), with 0 the best.

Value encoded

This measure seeks to summarize the errors made by a regression model. The smaller it is, the closer the model's predictions are to the truth. In this sense, it is intuitively like a counterpart to accuracy for classifiers.


These values are highly dependent on the scale of the output variables, making them very hard to interpret in isolation. One really needs a clear baseline, and scale-independent ways of comparing scores are also needed.
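The scale dependence is easy to demonstrate. In this sketch (made-up values, plain Python so that it runs on its own), the same predictions are expressed in meters and then in centimeters, and the mean squared error grows by a factor of 100 squared:

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets in meters, then the same data in centimeters.
y_true_m = [1.0, 2.0, 3.0]
y_pred_m = [1.1, 1.9, 3.2]
y_true_cm = [100 * v for v in y_true_m]
y_pred_cm = [100 * v for v in y_pred_m]

mse_m = mean_squared_error(y_true_m, y_pred_m)
mse_cm = mean_squared_error(y_true_cm, y_pred_cm)
print(mse_m, mse_cm)  # roughly 0.02 and 200
```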

Scikit-learn implements a variety of closely related measures: mean absolute error, mean squared logarithmic error, and median absolute error. I'd say that one should choose among these metrics based on how the output values are scaled and distributed. For instance, the median absolute error will be less sensitive to outliers than the others, and mean squared logarithmic error might be more appropriate where the outputs are not strictly speaking linearly increasing.

R2 scores

The R$^{2}$ score is probably the most prominent method for summarizing regression model performance, in statistics, social sciences, and ML/NLP. This is the value that sklearn's regression models deliver with their score functions.

In [83]:
def r2(y_true, y_pred):
    mu = y_true.mean()
    # Total sum of squares:
    total = ((y_true - mu)**2).sum()
    # Sum of squared errors:
    res = ((y_true - y_pred)**2).sum()    
    return 1.0 - (res / total)
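Three reference points help in reading this score. The sketch below restates the function in plain Python (so that it runs without NumPy) and applies it to made-up values: perfect predictions, a baseline that always predicts the gold mean, and predictions that are systematically wrong:

```python
def r2(y_true, y_pred):
    mu = sum(y_true) / len(y_true)
    # Total sum of squares:
    total = sum((t - mu) ** 2 for t in y_true)
    # Sum of squared errors:
    res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - (res / total)

y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4              # always predict the gold mean
reversed_preds = [4.0, 3.0, 2.0, 1.0]  # systematically wrong

print(r2(y_true, y_true))          # 1.0
print(r2(y_true, mean_baseline))   # 0.0
print(r2(y_true, reversed_preds))  # -3.0
```

The last case shows that, as computed here, the score can fall below 0 when a model does worse than the mean baseline.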


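($-\infty$, 1], with 1 the best. A score of 0 corresponds to always predicting the mean of the gold values, and scores below 0 are possible for models that do worse than that. For a least-squares linear regression with an intercept, evaluated on its own training data, the score is in [0, 1].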

Value encoded

The numerator in the R$^{2}$ calculation is the sum of squared errors. In the context of regular linear regression, this is the quantity that the model's objective seeks to minimize. The denominator is the total sum of squares, which is the sum of squared errors one would get by always predicting the mean of the gold values. Thus, R$^{2}$ is based in the ratio between the error the model made and the error of this simple baseline, which is a measure of the goodness of fit of the model.


For comparative purposes, it's nice that R$^{2}$ is normalized by the variation in the gold data, so that scores do not depend on the scale of the outputs; as noted above, the lack of such scaling makes mean squared error hard to interpret. But this also represents a trade-off: R$^{2}$ doesn't tell us about the magnitude of the errors.

  • R$^{2}$ is closely related to the squared Pearson correlation coefficient.

  • R$^{2}$ is closely related to the explained variance, which is also defined in terms of a ratio of the residuals and the variation in the gold data. For explained variance, the numerator is the variance of the residuals and the denominator is the variance of the gold values.

  • Adjusted R$^{2}$ seeks to take into account the number of predictors in the model, to reduce the incentive to simply add more features in the hope of lucking into a better score. In ML/NLP, relatively little attention is paid to model complexity in this sense. The attitude is like: if you can improve your model by adding features, you might as well do that!

Sequence prediction

Sequence prediction metrics all seek to summarize and quantify the extent to which a model has managed to reproduce, or accurately match, some gold standard sequences. Such problems arise throughout NLP. Examples:

  1. Mapping speech signals to their desired transcriptions.
  2. Mapping texts in a language $L_{1}$ to their translations in a distinct language or dialect $L_{2}$.
  3. Mapping input dialogue acts to their desired responses.
  4. Mapping a sentence to one of its paraphrases.
  5. Mapping real-world scenes or contexts (non-linguistic) to descriptions of them (linguistic).

Evaluation is very challenging because the relationships tend to be one-to-many: a given sentence might have multiple suitable translations; a given dialogue act will always have numerous felicitous responses; any scene can be described in multiple ways; and so forth. The most constrained of these problems is the speech-to-text case in 1, but even that one has indeterminacy in real-world contexts (humans often disagree about how to transcribe spoken language).

Word error rate

The word error rate (WER) metric is a word-level, length-normalized measure of Levenshtein string-edit distance:

In [16]:
def wer(seq_true, seq_pred):
    d = edit_distance(seq_true, seq_pred)
    return d / len(seq_true)    
In [35]:
wer(['A', 'B', 'C'], ['A', 'A', 'C'])
In [36]:
wer(['A', 'B', 'C', 'D'], ['A', 'A', 'C', 'D'])

To calculate this over the entire test-set, one gets the edit-distances for each gold–predicted pair and normalizes their sum by the total length of all the gold examples, rather than normalizing each case:

In [37]:
def corpus_wer(y_true, y_pred):
    dists = [edit_distance(seq_true, seq_pred) 
             for seq_true, seq_pred in zip(y_true, y_pred)]
    lengths = [len(seq) for seq in y_true]
    return sum(dists) / sum(lengths)

This gives a single summary value for the entire set of errors.
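The choice of normalization matters when sequence lengths vary. The sketch below uses a small pure-Python Levenshtein function (a stand-in for `edit_distance`, written here so the example runs on its own) and made-up data to compare the corpus-level value with the mean of the per-sequence values:

```python
def levenshtein(a, b):
    """Edit distance between sequences `a` and `b` with unit costs for
    insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical gold/predicted pairs of very different lengths.
y_true = [['A'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']]
y_pred = [['B'], ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']]

dists = [levenshtein(t, p) for t, p in zip(y_true, y_pred)]
corpus = sum(dists) / sum(len(t) for t in y_true)
per_sentence_mean = sum(d / len(t) for d, t in zip(dists, y_true)) / len(y_true)
print(corpus, per_sentence_mean)  # 0.1 0.5
```

A single error in a one-word sequence dominates the per-sequence mean, whereas the corpus-level value treats every gold token equally.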


$[0, \infty)$, where 0 is best. (The lack of a finite upper bound derives from the fact that the normalizing constant is given by the true sequences, and the predicted sequences can differ from them in any conceivable way in principle.)

Value encoded

This method says that our desired notion of closeness or accuracy can be operationalized in terms of the low-level operations of insertion, deletion, and substitution. The guiding intuition is very much like that of F scores.


The value encoded reveals a potential weakness in certain domains. Roughly, the more semantic the task, the less appropriate WER is likely to be. For example, adding a negation to a sentence will radically change its meaning but incur only a small WER penalty, whereas passivizing a sentence (Kim won the race $\to$ The race was won by Kim) will hardly change its meaning at all but incur a large WER penalty. See also Liu et al. 2016 for similar arguments in the context of dialogue generation.
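The contrast can be made concrete with the sentences from the text. This sketch uses a pure-Python Levenshtein function as a stand-in for `edit_distance`, lowercases the tokens, and uses "never" as the negation; these are choices made for this example:

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]

def wer(seq_true, seq_pred):
    return levenshtein(seq_true, seq_pred) / len(seq_true)

gold = "kim won the race".split()
negated = "kim never won the race".split()
passive = "the race was won by kim".split()

print(wer(gold, negated))  # 0.25
print(wer(gold, passive))  # 1.25
```

The meaning-reversing edit costs a quarter of the gold length, while the meaning-preserving paraphrase scores worse than an empty output would (an empty output has WER 1.0).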

  • WER can be thought of as a family of different metrics varying in the notion of edit distance that they employ.

  • The Word Accuracy Rate is 1.0 minus the WER, which, despite its name, is intuitively more like recall than accuracy.

BLEU scores

BLEU (Bilingual Evaluation Understudy) scores were originally developed in the context of machine translation, but they are applied in other generation tasks as well. For BLEU scoring, we require a set of gold outputs. The metric has two main components:

  • Modified n-gram precision: A direct application of precision would divide the number of correct n-grams in the predicted output (n-grams that appear in any translation) by the number of n-grams in the predicted output. This has a degenerate solution in which the predicted output contains only one word. BLEU's modified version clips the count of each n-gram in the predicted output at the maximum number of times it appears in any single translation.

  • Brevity penalty (BP): to avoid favoring outputs that are too short, a penalty is applied. Let $Y$ be the set of gold outputs, $\widetilde{y}$ the predicted output, $c$ the length of the predicted output, and $r$ the length of the gold output in $Y$ whose length is closest to $c$ (that is, the one with the smallest absolute difference in length from the predicted output). Then:

$$\textbf{BP}(Y, \widetilde{y}) = \begin{cases} 1 & \textrm{ if } c > r \\ \exp(1 - \frac{r}{c}) & \textrm{otherwise} \end{cases}$$
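A minimal sketch of the brevity penalty follows, assuming tokenized inputs and a non-empty predicted output. The function name is hypothetical, and breaking ties in closeness toward the shorter gold output is a convention assumed for this example:

```python
import math

def brevity_penalty(references, hypothesis):
    """BP as defined above. `references` is a list of token lists and
    `hypothesis` is a non-empty token list. Ties in closeness are broken
    toward the shorter reference (an assumption of this sketch)."""
    c = len(hypothesis)
    r = min((len(ref) for ref in references),
            key=lambda ref_len: (abs(ref_len - c), ref_len))
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
bp_full = brevity_penalty(refs, "the cat is on the mat".split())
bp_short = brevity_penalty(refs, "the cat".split())
print(bp_full, round(bp_short, 4))  # 1.0 0.1353
```

The two-word output is heavily penalized: its closest gold output has length 6, so the penalty is exp(1 - 6/2).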

The BLEU score itself is typically a combination of modified n-gram precision for various $n$ (usually up to 4):

$$\textbf{BLEU}(Y, \widetilde{y}) = \textbf{BP}(Y, \widetilde{y}) \cdot \exp\left(\sum_{n=1}^{N} w_{n} \cdot \log\left(\textbf{modified-precision}(Y, \widetilde{y}, n)\right)\right)$$

where $Y$ is the set of gold outputs, $\widetilde{y}$ is the predicted output, and $w_{n}$ is a weight for each $n$-gram level (usually set to $1/N$).
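To illustrate the clipping in modified n-gram precision, here is a minimal pure-Python sketch on a commonly used degenerate example, in which the predicted output repeats a single frequent word. The function names are assumptions of this example (NLTK's own implementation should be preferred in real experiments), and the predicted output is assumed to contain at least one n-gram:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, hypothesis, n):
    """Clip each hypothesis n-gram count at the maximum count of that
    n-gram in any single reference, then divide by the hypothesis total."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
hyp = "the the the the the the the".split()
unigram_score = modified_precision(refs, hyp, 1)
print(round(unigram_score, 4))  # 0.2857, i.e., 2/7
```

Without clipping, every one of the seven tokens would count as correct and the unigram precision would be 1.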

NLTK has implementations of BLEU scoring at the sentence level, as defined above, and at the corpus level (nltk.translate.bleu_score.corpus_bleu). At the corpus level, it is typical to do a kind of micro-averaging of the modified precision scores and use a cumulative version of the brevity penalty.


[0, 1], with 1 being the best, though with no expectation that any system will achieve 1, since even sets of human-created translations do not reach this level.

Value encoded

BLEU scores attempt to achieve the same balance between precision and recall that runs through the majority of the metrics discussed here. They have many affinities with word error rate, but seek to accommodate the fact that there are typically multiple suitable outputs for a given input.


  • Callison-Burch et al. (2006) criticize BLEU as a machine translation metric on the grounds that it fails to correlate with human scoring of translations. They highlight its insensitivity to n-gram order and its insensitivity to n-gram types (e.g., function vs. content words) as causes of this lack of correlation.

  • Liu et al. (2016) specifically argue against BLEU as a metric for assessing dialogue systems, based on a lack of correlation with human judgments about dialogue coherence.

There are many competitors/alternatives to BLEU, most proposed in the context of machine translation. Examples: ROUGE, METEOR, HyTER, Orange (smoothed Bleu).


Perplexity

Perplexity is a common metric for directly assessing generation models by calculating the probability that they assign to sequences in the test data. It is based in a measure of average surprisal:

$$H(P, x) = -\frac{1}{m}\log_{2} P(x)$$

where $P$ is a model assigning probabilities to sequences and $x$ is a sequence.

Perplexity is then the exponentiation of this:

$$\textbf{perplexity}(P, x) = 2^{H(P, x)}$$

Using any base $n$ both in defining $H$ and as the base in $\textbf{perplexity}$ will lead to identical results.
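The base-independence claim can be checked numerically. This sketch takes the normalizer $m$ to be 1, which is an assumption made here for simplicity, so that perplexity reduces to the inverse probability; the probability value is made up:

```python
import math

def perplexity(prob, base):
    """base ** (-log_base(prob)), following the definitions above with m = 1."""
    surprisal = -math.log(prob, base)
    return base ** surprisal

p = 0.02  # hypothetical probability assigned to a sequence
values = [perplexity(p, 2), perplexity(p, math.e), perplexity(p, 10)]
print(values, 1 / p)  # all approximately 50
```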

Minimizing perplexity is equivalent to maximizing probability.

It is common to report per-token perplexity; here the averaging should be done in log-space to deliver a geometric mean:

$$\textbf{token-perplexity}(P, x) = \exp\left(\frac{\log\textbf{perplexity}(P, x)}{\textbf{length}(x)}\right)$$

When averaging perplexity values obtained from all the sequences in a text corpus, one should again use the geometric mean:

$$\textbf{mean-perplexity}(P, X) = \exp\left(\frac{1}{m}\sum_{x\in X}\log(\textbf{token-perplexity}(P, x))\right)$$

for a set of $m$ examples $X$.
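Here is a minimal sketch of the two log-space averages, working directly from per-token probabilities. It assumes that the sequence probability is the product of the token probabilities and that $\textbf{perplexity}(P, x)$ is the inverse of that product; the function names and the probabilities are made up for illustration:

```python
import math

def token_perplexity(token_probs):
    """Per-token perplexity of one sequence, given the probability the model
    assigned to each token. The sequence probability is taken to be the
    product of these values (an assumption of this sketch)."""
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / len(token_probs))

def mean_perplexity(corpus_token_probs):
    """Geometric mean of the per-sequence token perplexities."""
    logs = [math.log(token_perplexity(seq)) for seq in corpus_token_probs]
    return math.exp(sum(logs) / len(logs))

# Hypothetical per-token probabilities for two sequences.
corpus = [[0.5, 0.5, 0.5, 0.5], [0.25, 0.25]]
per_sequence = [token_perplexity(seq) for seq in corpus]
overall = mean_perplexity(corpus)
print(per_sequence)  # approximately [2.0, 4.0]
print(overall)       # approximately 2.83, the square root of 2 * 4
```

An arithmetic mean of the same two values would give 3.0; the geometric mean is lower because it averages the underlying log probabilities.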


[1, $\infty$), where 1 is best.

Value encoded

The guiding idea behind perplexity is that a good model will assign high probability to the sequences in the test data. This is an intuitive, expedient intrinsic evaluation, and it matches well with the objective for models trained with a cross-entropy or logistic objective.


  • Perplexity is heavily dependent on the nature of the underlying vocabulary in the following sense: one can artificially lower one's perplexity by having a lot of UNK tokens in the training and test sets. Consider the extreme case in which everything is mapped to UNK and perplexity is thus perfect on any test set. The more worrisome thing is that any amount of UNK usage side-steps the pervasive challenge of dealing with infrequent words.

  • As Hal Daumé discusses in this post, the perplexity metric imposes an artificial constraint that one's model outputs are probabilistic.

Perplexity is the inverse of probability and, with some assumptions, can be seen as an approximation of the cross-entropy between the model's predictions and the true underlying sequence probabilities.