BLEU Explained

Outline

  • Introduction
  • The Paper
  • Code
  • Reflections

Introduction

What

BLEU (bilingual evaluation understudy) is a method for automatically evaluating machine translation quality.
Its selling points: it is quick, inexpensive, and language-independent, and it correlates highly with human evaluation.

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$


Let c be the length of the candidate translation and r the length of the reference translation closest in length to the candidate. If c > r, then $BP = 1$; otherwise $BP = e^{1-r/c}$.
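
Written out as a piecewise expression:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1-r/c} & \text{if } c \le r \end{cases}$$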

How

BLEU evaluates a translation by comparing the n-grams of the candidate against those of the references. The steps and details are as follows (a quick sanity check with NLTK follows the list):

  • Compute the precision p_n for each n
    • For each n, count the occurrences of every n-gram in the candidate translation
    • For each such n-gram, find its maximum count across the given references, and take the smaller of that and the candidate count (clipping)
    • p_n is the sum from the second step divided by the sum from the first step
  • Compute the brevity penalty BP
  • Combine the two into the BLEU score
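
As a sanity check of these steps, here is a minimal sketch using NLTK's built-in implementation (nltk.translate.bleu_score.sentence_bleu). The sentences are the classic example from the paper, reused in the Code section below, so the score should come out around 0.5046:

from nltk.translate.bleu_score import sentence_bleu

candidate = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures',
             'that', 'the', 'military', 'always', 'obeys', 'the', 'commands',
             'of', 'the', 'party']
references = [
    ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures', 'that',
     'the', 'military', 'will', 'forever', 'heed', 'Party', 'commands'],
    ['It', 'is', 'the', 'guiding', 'principle', 'which', 'guarantees', 'the',
     'military', 'forces', 'always', 'being', 'under', 'the', 'command', 'of',
     'the', 'Party'],
    ['It', 'is', 'the', 'practical', 'guide', 'for', 'the', 'army', 'always',
     'to', 'heed', 'the', 'directions', 'of', 'the', 'party'],
]
# Default weights are uniform over 1- to 4-grams: (0.25, 0.25, 0.25, 0.25).
print(sentence_bleu(references, candidate))  # ≈ 0.5046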

Why

The closer a machine translation is to a professional human translation, the better it is.

  • A translation using the same words (1-grams) as in the references tends to satisfy adequacy. The longer n-gram matches account for fluency.
  • The brevity penalty keeps candidates from gaming precision by being too short.

The Paper

bleu.dvi (Papineni et al., 2002, "BLEU: a Method for Automatic Evaluation of Machine Translation")
This section traces the authors' line of reasoning through the paper.

Introduction

Rationale

Background and motivation: evaluating translations with human judges is slow and labor-intensive.

Viewpoint

Lays out the central viewpoint and the structure of the paper. The core claim: the closer a machine translation is to a professional human translation, the better it is.
Evaluation therefore reduces to measuring the similarity between machine output and expert references.
Similarity depends on both word choice and word order, and the paper's key idea is to measure it with phrases of fixed length, i.e. n-grams.

The Baseline BLEU Metric

Introduces the basic approach: n-gram counting.

Modified n-gram precision

  • How to compute n-gram matches (with unigrams as the example)
  • n-gram matches capture both adequacy of word choice and fluency
Modified n-gram precision on blocks of text
  • The formula for p_n (reconstructed below)
  • Computing it for different values of n
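
For reference, the corpus-level modified n-gram precision from the paper is:

$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{clip}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}$$
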
Ranking systems using only modified n-gram precision
  • Compares machine and human translations from 1-grams up to 4-grams
  • At every n, the metric separates machines from humans consistently
    • Precision drops as n increases
    • Human precision exceeds machine precision at every n, and the gap widens as n grows
Combining the modified n-gram precisions

Because machine precision is already very low at 4-grams, the paper caps n at 4 and combines the precisions with uniform weights w_n = 1/N (so 0.25 each for N = 4).

Sentence length

  • The candidate's length matters: it should be neither too long nor too short.
    • Modified n-gram precision already penalizes "spurious" words (those that do not appear in the references)
    • It also penalizes n-grams that occur more often in the candidate than in any reference
    • But it cannot penalize absurdly short candidates whose few words all happen to appear in the references
The trouble with recall
  • Traditional recall is the usual fix for length problems
    • But a candidate that contains every word from all the references (and is therefore very long) would score high on recall despite being a bad translation
Sentence brevity penalty
  • The brevity penalty (BP) penalizes candidates that are too short
    • Too long is already handled by modified n-gram precision
    • Too short is handled by the brevity penalty

With this brevity penalty in place, a high-scoring candidate translation must now match the reference translations in length, in word choice, and in word order.

BLEU details

Details of the BP and BLEU expressions.
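
The paper also gives the ranking-friendly log form:

$$\log \mathrm{BLEU} = \min\!\left(1-\frac{r}{c},\ 0\right) + \sum_{n=1}^{N} w_n \log p_n$$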

The BLEU Evaluation

  • 500 sentences (40 stories), two reference translations, five systems in total (two humans plus three MT systems): the humans' BLEU scores are higher than the machines'
  • Questions:
    • Is this difference reliable?
    • What is the variance of BLEU?
    • Would another 500 sentences give similar results?
  • Split the 500 sentences into 20 groups of 25; averaging BLEU within each group gives 20 data points per system
    • Each group's ranking should agree with the overall ranking (it does)
    • Pairwise t-tests should find the systems different (null hypothesis: equal means; we expect, and obtain, rejection)
  • How many reference translations are needed?
    • Simulate a single-reference test corpus by randomly choosing one of the four reference translators for each of the 40 stories, which preserves some stylistic variation
    • The resulting ranking agrees with the multi-reference result

The Human Evaluation

To check the reliability of these results, human judges rated the translations (scores from 1 to 5), again using two groups of different kinds of judges; 50 sentences were drawn from the 500, yielding 250 (source, translation) pairs.

  • Monolingual group: 10 native English speakers, judging only readability and fluency
  • Bilingual group: 10 native Chinese speakers who had lived in the US for several years; none were professional translators
  • Followed again by t-tests

In pairwise t-tests at the 95% confidence level, both the monolingual and the bilingual judgments distinguish every pair of systems, and the resulting ranking agrees with the earlier one.

BLEU vs The Human Evaluation

  • Linear regression of the human scores against the BLEU scores (two reference translations) over the five systems: correlation coefficient 0.99 for the monolingual group and 0.96 for the bilingual group
  • Taking the worst system as a reference point, compare BLEU, the monolingual scores, and the bilingual scores across the five systems (each scale normalized to 0-1 within its own group)
    • BLEU and the monolingual group are highly correlated
    • The bilingual group scores somewhat higher and deviates a little from the other two
    • The MT systems remain clearly below the human translators

Conclusion

BLEU’s strength is that it correlates highly with human judgments by averaging out individual sentence judgment errors over a test corpus rather than attempting to divine the exact human judgment for every sentence: quantity leads to quality.

Code

In [3]:
import collections
import nltk
import math
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.util import ngrams
In [4]:
weights = [0.25, 0.25, 0.25, 0.25]
candidate1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which','ensures', 'that', 'the', 'military', 'always','obeys', 'the', 'commands', 'of', 'the', 'party']
candidate2 = ['It', 'is', 'to', 'insure', 'the', 'troops','forever', 'hearing', 'the', 'activity', 'guidebook','that', 'party', 'direct']
reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that','ensures', 'that', 'the', 'military', 'will', 'forever','heed', 'Party', 'commands']
reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which','guarantees', 'the', 'military', 'forces', 'always','being', 'under', 'the', 'command', 'of', 'the','Party']
reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the','army', 'always', 'to', 'heed', 'the', 'directions','of', 'the', 'party']
In [6]:
references = [reference1, reference2, reference3]

Pn

In [7]:
def modified_precision(candidate, references, n):
    counts = Counter(ngrams(candidate, n))

    if not counts:
        return 0

    max_counts = {}
    # iterate over each reference
    for reference in references:
        reference_counts = Counter(ngrams(reference, n))
        # for each candidate n-gram, track its maximum count over all references
        for ngram in counts:
            max_counts[ngram] = max(max_counts.get(ngram, 0), reference_counts[ngram])

    clipped_counts = dict((ngram, min(count, max_counts[ngram])) for ngram, count in counts.items())

    return sum(clipped_counts.values()) / sum(counts.values())
In [8]:
counts = Counter(ngrams(candidate1, 1))
In [9]:
for ngram in counts:
    print(ngram)
('a',)
('to',)
('that',)
('commands',)
('party',)
('action',)
('always',)
('is',)
('obeys',)
('military',)
('It',)
('guide',)
('the',)
('of',)
('which',)
('ensures',)
In [10]:
max_counts = {}
for reference in references:
    reference_counts = Counter(ngrams(reference, 1))
    for ngram in counts:
        max_counts[ngram] = max(max_counts.get(ngram, 0), reference_counts[ngram])
In [11]:
clipped_counts = dict((ngram, min(count, max_counts[ngram])) for ngram, count in counts.items())
In [12]:
clipped_counts
Out[12]:
{('It',): 1,
 ('a',): 1,
 ('action',): 1,
 ('always',): 1,
 ('commands',): 1,
 ('ensures',): 1,
 ('guide',): 1,
 ('is',): 1,
 ('military',): 1,
 ('obeys',): 0,
 ('of',): 1,
 ('party',): 1,
 ('that',): 1,
 ('the',): 3,
 ('to',): 1,
 ('which',): 1}
In [13]:
sum(clipped_counts.values()) / sum(counts.values())
Out[13]:
0.9444444444444444
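
Note the effect of clipping: ('obeys',) is clipped to 0 because it appears in none of the references, while ('the',) keeps all 3 of its candidate occurrences because reference2 contains 'the' four times. The clipped counts sum to 17 against 18 candidate unigrams, giving 17/18 ≈ 0.944.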

BP

In [14]:
def brevity_penalty(candidate, references):

    c = len(candidate)
    ref_lens = (len(reference) for reference in references)
    # Tuples compare element-wise in Python, e.g. (0, 1) > (1, 0) is False. Using the
    # (distance, length) tuple as the key picks the reference whose length is closest
    # to the candidate; ties are broken in favor of the shorter reference. E.g. if the
    # candidate has length 10 and two references have lengths 9 and 11, then r = 9.
    r = min(ref_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))
    print('r:', r)

    if c > r:
        return 1
    else:
        return math.exp(1 - r / c)
In [15]:
brevity_penalty(candidate2, references)
r: 16
Out[15]:
0.8668778997501817
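
candidate2 has 14 tokens and the closest reference length is 16 (hence the printed r: 16), so BP = e^{1 - 16/14} ≈ 0.8669.
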
In [16]:
c = len(candidate1)
c
Out[16]:
18
In [17]:
ref_lens = (len(reference) for reference in references)
In [18]:
list(ref_lens)
Out[18]:
[16, 18, 16]
In [19]:
min((1,10),(1,16))
Out[19]:
(1, 10)
In [20]:
brevity_penalty(candidate1, references) * modified_precision(candidate1,[reference1, reference2, reference3], 1)
r: 18
Out[20]:
0.9444444444444444
In [21]:
modified_precision(candidate1,[reference1, reference2, reference3], 1)
Out[21]:
0.9444444444444444
In [22]:
# nltk.ngrams
list(ngrams([1,2,3,4,5],1))
Out[22]:
[(1,), (2,), (3,), (4,), (5,)]

BLEU

In [25]:
def bleu(candidate, references, weights):
    p_ns = (
        modified_precision(candidate, references, i)
        for i, _ in enumerate(weights, start=1)
    )

    try:
        s = math.fsum(w * math.log(p_n) for w, p_n in zip(weights, p_ns))
    except ValueError:
        # at least one p_n is 0, so math.log raised ValueError
        return 0

    bp = brevity_penalty(candidate, references)
    return bp * math.exp(s)
In [26]:
bleu(candidate1, references, weights)
r: 18
Out[26]:
0.5045666840058485
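
Cross-check: here c = r = 18, so BP = e^{1 - 18/18} = 1, and the geometric mean of the four modified precisions reported later by compute_bleu (0.944, 0.588, 0.4375, 0.267) is (0.944 · 0.588 · 0.4375 · 0.267)^{1/4} ≈ 0.5046, matching this output.
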
In [27]:
bleu(candidate2, references, weights)
Out[27]:
0

GNMT

Google's NMT release: nmt/bleu.py at master · tensorflow/nmt
The abstraction is very clean.

In [60]:
def get_ngrams(segment, max_order):
    
    """Extracts all n-grams upto a given maximum order from an input segment.

    Args:
    segment: text segment from which n-grams will be extracted.
    max_order: maximum length in tokens of the n-grams returned by this
        methods.

    Returns:
    The Counter containing all n-grams upto max_order in segment
    with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i:i+order])
            ngram_counts[ngram] += 1
    return ngram_counts
In [56]:
get_ngrams(candidate1, 4)
Out[56]:
Counter({('It',): 1,
         ('It', 'is'): 1,
         ('It', 'is', 'a'): 1,
         ('It', 'is', 'a', 'guide'): 1,
         ('a',): 1,
         ('a', 'guide'): 1,
         ('a', 'guide', 'to'): 1,
         ('a', 'guide', 'to', 'action'): 1,
         ('action',): 1,
         ('action', 'which'): 1,
         ('action', 'which', 'ensures'): 1,
         ('action', 'which', 'ensures', 'that'): 1,
         ('always',): 1,
         ('always', 'obeys'): 1,
         ('always', 'obeys', 'the'): 1,
         ('always', 'obeys', 'the', 'commands'): 1,
         ('commands',): 1,
         ('commands', 'of'): 1,
         ('commands', 'of', 'the'): 1,
         ('commands', 'of', 'the', 'party'): 1,
         ('ensures',): 1,
         ('ensures', 'that'): 1,
         ('ensures', 'that', 'the'): 1,
         ('ensures', 'that', 'the', 'military'): 1,
         ('guide',): 1,
         ('guide', 'to'): 1,
         ('guide', 'to', 'action'): 1,
         ('guide', 'to', 'action', 'which'): 1,
         ('is',): 1,
         ('is', 'a'): 1,
         ('is', 'a', 'guide'): 1,
         ('is', 'a', 'guide', 'to'): 1,
         ('military',): 1,
         ('military', 'always'): 1,
         ('military', 'always', 'obeys'): 1,
         ('military', 'always', 'obeys', 'the'): 1,
         ('obeys',): 1,
         ('obeys', 'the'): 1,
         ('obeys', 'the', 'commands'): 1,
         ('obeys', 'the', 'commands', 'of'): 1,
         ('of',): 1,
         ('of', 'the'): 1,
         ('of', 'the', 'party'): 1,
         ('party',): 1,
         ('that',): 1,
         ('that', 'the'): 1,
         ('that', 'the', 'military'): 1,
         ('that', 'the', 'military', 'always'): 1,
         ('the',): 3,
         ('the', 'commands'): 1,
         ('the', 'commands', 'of'): 1,
         ('the', 'commands', 'of', 'the'): 1,
         ('the', 'military'): 1,
         ('the', 'military', 'always'): 1,
         ('the', 'military', 'always', 'obeys'): 1,
         ('the', 'party'): 1,
         ('to',): 1,
         ('to', 'action'): 1,
         ('to', 'action', 'which'): 1,
         ('to', 'action', 'which', 'ensures'): 1,
         ('which',): 1,
         ('which', 'ensures'): 1,
         ('which', 'ensures', 'that'): 1,
         ('which', 'ensures', 'that', 'the'): 1})
In [310]:
def compute_bleu(reference_corpus, translation_corpus, max_order=4,
                 smooth=False):
    """Computes BLEU score of translated segments against one or more references.

    Args:
      reference_corpus: list of lists of references for each translation. Each
          reference should be tokenized into a list of tokens.
      translation_corpus: list of translations to score. Each translation
          should be tokenized into a list of tokens.
      max_order: Maximum n-gram order to use when computing BLEU score.
      smooth: Whether or not to apply Lin et al. 2004 smoothing.

    Returns:
      3-Tuple with the BLEU score, n-gram precisions, geometric mean of n-gram
      precisions and brevity penalty.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    # references and translations are zipped together,
    # so multiple translations can be scored in one pass:
    # each zipped element pairs one translation with its list of references
    for (references, translation) in zip(reference_corpus, translation_corpus):
        # accumulate the shortest reference length (see the note on bp below)
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)

        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            # Counter union (|=) collects every n-gram seen in any reference,
            # keeping the maximum count per n-gram
            merged_ref_ngram_counts |= get_ngrams(reference, max_order)
        translation_ngram_counts = get_ngrams(translation, max_order)
        # Counter intersection (&) with the candidate counts
        # keeps the minimum count per n-gram, i.e. the clipped counts
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            # numerator of each p_n: sum of clipped counts (candidate count capped at the max reference count)
            matches_by_order[len(ngram)-1] += overlap[ngram]
        for order in range(1, max_order+1):
            # denominator of each p_n: total number of candidate n-grams of this order
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order-1] += possible_matches

    precisions = [0] * max_order
    for i in range(0, max_order):
        if smooth:
            precisions[i] = ((matches_by_order[i] + 1.) /
                       (possible_matches_by_order[i] + 1.))
        else:
            if possible_matches_by_order[i] > 0:
                precisions[i] = (float(matches_by_order[i]) / possible_matches_by_order[i])
            # covers the case where no n-grams of this order exist (the sentence
            # is shorter than n); the precision, and hence BLEU, is then 0
            else:
                precisions[i] = 0.0

    if min(precisions) > 0:
        p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ratio = float(translation_length) / reference_length
    # ratio = c/r (candidate length / reference length).
    # The paper picks r as the reference length closest to the candidate; this code uses
    # the shortest. If ratio > 1 then bp = 1 regardless; if ratio < 1, the candidate is
    # shorter than every reference, so the shortest reference is also the closest one.
    if ratio > 1.0:
        bp = 1.
    else:
        bp = math.exp(1 - 1. / ratio)

    bleu = geo_mean * bp

    return (bleu, precisions, bp, ratio, translation_length, reference_length)
In [215]:
compute_bleu([references], [candidate1])
Out[215]:
(0.5045666840058485,
 [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666],
 1.0,
 1.125,
 18,
 16)
In [157]:
compute_bleu([references], [candidate2])
Out[157]:
(0.0,
 [0.5714285714285714, 0.07692307692307693, 0.0, 0.0],
 0.8668778997501817,
 0.875,
 14,
 16)
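
candidate2 comes out as 0 because its 3- and 4-gram precisions are 0, which collapses the geometric mean. The smooth flag applies the add-one smoothing shown in compute_bleu above, which yields a nonzero score; this call was not part of the original notebook run, so no output is reproduced:

# With add-one smoothing, the zero higher-order precisions no longer
# zero out the geometric mean:
compute_bleu([references], [candidate2], smooth=True)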

Step-by-step breakdown

In [288]:
weights = [0.25, 0.25, 0.25, 0.25]
candidate1 = ['the', 'commands', 'of', 'the', 'party']
reference1 = ['will', 'forever','heed', 'Party', 'commands']
reference2 = ['the', 'command', 'of', 'the','Party']
reference3 = ['the', 'the','of', 'the', 'party']
In [289]:
references = [reference1, reference2, reference3]
In [290]:
reference_corpus = [references]
In [291]:
translation_corpus = [candidate1]
In [292]:
max_order = 1
In [293]:
matches_by_order = [0] * max_order
possible_matches_by_order = [0] * max_order
reference_length = 0
translation_length = 0
In [294]:
for (references, translation) in zip(reference_corpus, translation_corpus):
    reference_length = min(len(r) for r in references)
    translation_length = len(translation)

    merged_ref_ngram_counts = collections.Counter()
    for reference in references:
        merged_ref_ngram_counts |= get_ngrams(reference, max_order)
    translation_ngram_counts = get_ngrams(translation, max_order)
    overlap = translation_ngram_counts & merged_ref_ngram_counts
    for ngram in overlap:
        matches_by_order[len(ngram)-1] += overlap[ngram]
    for order in range(1, max_order+1):
        possible_matches = len(translation) - order + 1
        if possible_matches > 0:
            possible_matches_by_order[order-1] += possible_matches
In [295]:
get_ngrams(reference1, max_order)
Out[295]:
Counter({('Party',): 1,
         ('commands',): 1,
         ('forever',): 1,
         ('heed',): 1,
         ('will',): 1})
In [296]:
get_ngrams(reference2, max_order)
Out[296]:
Counter({('Party',): 1, ('command',): 1, ('of',): 1, ('the',): 2})
In [297]:
get_ngrams(reference3, max_order)
Out[297]:
Counter({('of',): 1, ('party',): 1, ('the',): 3})
In [298]:
merged_ref_ngram_counts
Out[298]:
Counter({('Party',): 1,
         ('command',): 1,
         ('commands',): 1,
         ('forever',): 1,
         ('heed',): 1,
         ('of',): 1,
         ('party',): 1,
         ('the',): 3,
         ('will',): 1})
In [299]:
get_ngrams(reference3, max_order) | get_ngrams(reference2, max_order) | get_ngrams(reference1, max_order)
Out[299]:
Counter({('Party',): 1,
         ('command',): 1,
         ('commands',): 1,
         ('forever',): 1,
         ('heed',): 1,
         ('of',): 1,
         ('party',): 1,
         ('the',): 3,
         ('will',): 1})
In [300]:
translation_ngram_counts
Out[300]:
Counter({('commands',): 1, ('of',): 1, ('party',): 1, ('the',): 2})
In [301]:
overlap
Out[301]:
Counter({('commands',): 1, ('of',): 1, ('party',): 1, ('the',): 2})
In [302]:
matches_by_order
Out[302]:
[5]
In [303]:
possible_matches_by_order
Out[303]:
[5]
In [304]:
reference_length
Out[304]:
5
In [305]:
translation_length
Out[305]:
5
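
Putting the pieces together: p_1 = matches_by_order[0] / possible_matches_by_order[0] = 5/5 = 1, the length ratio is c/r = 5/5 = 1 (not greater than 1, so bp = e^{1 - 1/1} = 1), and the unigram BLEU is therefore 1.0. Every candidate unigram survives clipping: 'the' occurs twice in the candidate but up to three times in reference3.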

Reflections

Evaluation methodology

The method itself is simple: neither the idea nor the formulas are complicated. What makes the paper convincing is its rigorous evaluation methodology,
which is a useful template for evaluating many of the algorithms we deploy.

It breaks down into three steps:

  • Algorithm evaluation
    • Use randomly selected test data
    • Evaluate at different levels of granularity, expecting the metric to discriminate between systems
    • Split the data into groups (or switch to a fresh dataset); after averaging within each group, the groups should still discriminate between systems and agree with the overall result
    • Probe the other variables that affect the metric (e.g. the number of reference translations in the paper)
  • Human evaluation
    • Randomly sample a subset of the test data
    • Have humans score the outputs that the algorithm ranked
    • Consider whether different kinds of judges affect the outcome, i.e. whether the judges need to be segmented
    • Check that the human scores also discriminate between systems and agree with the algorithm's ranking
  • Correlation between the algorithm and the human evaluation
    • Fit a linear regression of the algorithm's scores against the human scores and inspect the correlation
    • Normalize each evaluation method's scores within its own group and compare them across systems; this relative view shows how similar the methods are and how they behave at each level

Code

The nltk-style implementation is simple, readable, and direct;
Google's implementation is nicely abstracted and additionally handles smoothing and multi-segment corpora.

References