BLEU, short for bilingual evaluation understudy, is a method for evaluating translation quality automatically.
Its key properties: it is fast, cheap, and language-independent, and its results correlate highly with human evaluation.
If the length c of the candidate translation is greater than r, the length of the reference translation closest in length to the candidate, then $BP = 1$; otherwise $BP = e^{1-r/c}$.
BLEU evaluates a translation by comparing the n-grams of the candidate against those of the reference translations. The steps and details are as follows:
The closer a machine translation is to a professional human translation, the better it is.
bleu.dvi
The notes below trace the authors' line of reasoning through the original paper.
The paper opens with the background and motivation: evaluating translations with human judges is slow and labor-intensive.
It then states the main idea and the structure of the paper. The core claim: the closer a machine translation is to a professional human translation, the better it is.
So what is needed is a way to score the similarity between machine output and expert translations.
Similarity depends on both word choice and word order; the paper's key idea is to measure it over phrases of bounded length.
The basic method: count matching n-grams.
Because machine precision is already very low at the 4-gram level, n is capped at 4.
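The clipping behind "modified precision" is easiest to see on the paper's well-known degenerate example; a stdlib-only sketch:

```python
from collections import Counter

# The paper's degenerate example: a candidate consisting only of "the"
# matches a reference word at every position, so plain unigram precision
# would be a perfect 7/7.
candidate = ['the'] * 7
references = [['the', 'cat', 'is', 'on', 'the', 'mat'],
              ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]

# Modified precision clips each candidate n-gram count at the maximum
# count observed for that n-gram in any single reference.
cand_counts = Counter(candidate)
max_ref_counts = Counter()
for ref in references:
    ref_counts = Counter(ref)
    for token in cand_counts:
        max_ref_counts[token] = max(max_ref_counts[token], ref_counts[token])

clipped = sum(min(c, max_ref_counts[t]) for t, c in cand_counts.items())
print(clipped, '/', sum(cand_counts.values()))  # 2 / 7
```

"the" occurs at most twice in any single reference, so the clipped count is 2 and the modified precision drops to 2/7.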
With this brevity penalty in place, a high-scoring candidate translation must now match the reference translations in length, in word choice, and in word order.
Details of the BP and BLEU formulas:
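Written out, the two formulas are:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1-r/c} & \text{if } c \le r \end{cases}$$

$$\mathrm{BLEU} = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big)$$

where $p_n$ is the modified (clipped) n-gram precision, and the weights are uniform, $w_n = 1/N$ with $N = 4$.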
To check the metric's reliability, human raters scored translations on a scale of 1 to 5, again using two groups of different types (monolingual and bilingual); 50 sentences were drawn from a 500-sentence corpus, yielding 250 rated pairs.
Pairwise t-tests at the 95% confidence level: both the monolingual and the bilingual group pass (every pair of systems differs significantly), and the resulting ranking agrees with the earlier one.
BLEU’s strength is that it correlates highly with human judgments by averaging out individual sentence judgment errors over a test corpus rather than attempting to divine the exact human judgment for every sentence: quantity leads to quality.
import collections
import nltk
import math
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.util import ngrams
weights = [0.25, 0.25, 0.25, 0.25]
candidate1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which','ensures', 'that', 'the', 'military', 'always','obeys', 'the', 'commands', 'of', 'the', 'party']
candidate2 = ['It', 'is', 'to', 'insure', 'the', 'troops','forever', 'hearing', 'the', 'activity', 'guidebook','that', 'party', 'direct']
reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that','ensures', 'that', 'the', 'military', 'will', 'forever','heed', 'Party', 'commands']
reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which','guarantees', 'the', 'military', 'forces', 'always','being', 'under', 'the', 'command', 'of', 'the','Party']
reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the','army', 'always', 'to', 'heed', 'the', 'directions','of', 'the', 'party']
references = [reference1, reference2, reference3]
def modified_precision(candidate, references, n):
    counts = Counter(ngrams(candidate, n))
    if not counts:
        return 0
    max_counts = {}
    # iterate over each reference
    for reference in references:
        reference_counts = Counter(ngrams(reference, n))
        # record, for each candidate n-gram, its maximum count across references
        for ngram in counts:
            max_counts[ngram] = max(max_counts.get(ngram, 0), reference_counts[ngram])
    clipped_counts = dict((ngram, min(count, max_counts[ngram])) for ngram, count in counts.items())
    return sum(clipped_counts.values()) / sum(counts.values())
counts = Counter(ngrams(candidate1, 1))
for ngram in counts:
    print(ngram)
('a',) ('to',) ('that',) ('commands',) ('party',) ('action',) ('always',) ('is',) ('obeys',) ('military',) ('It',) ('guide',) ('the',) ('of',) ('which',) ('ensures',)
max_counts = {}
for reference in references:
    reference_counts = Counter(ngrams(reference, 1))
    for ngram in counts:
        max_counts[ngram] = max(max_counts.get(ngram, 0), reference_counts[ngram])
clipped_counts = dict((ngram, min(count, max_counts[ngram])) for ngram, count in counts.items())
clipped_counts
{('It',): 1, ('a',): 1, ('action',): 1, ('always',): 1, ('commands',): 1, ('ensures',): 1, ('guide',): 1, ('is',): 1, ('military',): 1, ('obeys',): 0, ('of',): 1, ('party',): 1, ('that',): 1, ('the',): 3, ('to',): 1, ('which',): 1}
sum(clipped_counts.values()) / sum(counts.values())
0.9444444444444444
def brevity_penalty(candidate, references):
    c = len(candidate)
    ref_lens = (len(reference) for reference in references)
    # Python tuples compare element-wise: (0, 1) > (1, 0) is False. The tuple
    # key below therefore picks the reference length closest to the candidate's,
    # breaking ties in favor of the shorter reference. E.g. with candidate
    # length 10 and reference lengths 9 and 11, r = 9.
    r = min(ref_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))
    print('r:', r)
    if c > r:
        return 1
    else:
        return math.exp(1 - r / c)
brevity_penalty(candidate2, references)
r: 16
0.8668778997501817
c = len(candidate1)
c
18
ref_lens = (len(reference) for reference in references)
list(ref_lens)
[16, 18, 16]
min((1,10),(1,16))
(1, 10)
brevity_penalty(candidate1, references) * modified_precision(candidate1,[reference1, reference2, reference3], 1)
r: 18
0.9444444444444444
modified_precision(candidate1,[reference1, reference2, reference3], 1)
0.9444444444444444
# nltk.ngrams
list(ngrams([1,2,3,4,5],1))
[(1,), (2,), (3,), (4,), (5,)]
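As an aside, nltk.util.ngrams can be mimicked with the standard library alone; my_ngrams below is a hypothetical stand-in, not part of nltk:

```python
def my_ngrams(sequence, n):
    # slide n staggered copies of the sequence against each other;
    # zip stops at the shortest copy, yielding each length-n window once
    return zip(*(sequence[i:] for i in range(n)))

print(list(my_ngrams([1, 2, 3, 4, 5], 1)))  # [(1,), (2,), (3,), (4,), (5,)]
print(list(my_ngrams([1, 2, 3, 4, 5], 2)))  # [(1, 2), (2, 3), (3, 4), (4, 5)]
```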
def bleu(candidate, references, weights):
    p_ns = (
        modified_precision(candidate, references, i)
        for i, _ in enumerate(weights, start=1)
    )
    try:
        s = math.fsum(w * math.log(p_n) for w, p_n in zip(weights, p_ns))
    except ValueError:
        # some p_n is 0, so its log is undefined and BLEU falls back to 0
        return 0
    bp = brevity_penalty(candidate, references)
    return bp * math.exp(s)
bleu(candidate1, references, weights)
r: 18
0.5045666840058485
bleu(candidate2, references, weights)
0
Google's NMT release: nmt/bleu.py at master · tensorflow/nmt. The abstraction is very clean.
def get_ngrams(segment, max_order):
    """Extracts all n-grams up to a given maximum order from an input segment.

    Args:
      segment: text segment from which n-grams will be extracted.
      max_order: maximum length in tokens of the n-grams returned by this
        method.

    Returns:
      The Counter containing all n-grams up to max_order in segment
      with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i:i + order])
            ngram_counts[ngram] += 1
    return ngram_counts
get_ngrams(candidate1, 4)
Counter({('It',): 1, ('It', 'is'): 1, ('It', 'is', 'a'): 1, ('It', 'is', 'a', 'guide'): 1, ('a',): 1, ('a', 'guide'): 1, ('a', 'guide', 'to'): 1, ('a', 'guide', 'to', 'action'): 1, ('action',): 1, ('action', 'which'): 1, ('action', 'which', 'ensures'): 1, ('action', 'which', 'ensures', 'that'): 1, ('always',): 1, ('always', 'obeys'): 1, ('always', 'obeys', 'the'): 1, ('always', 'obeys', 'the', 'commands'): 1, ('commands',): 1, ('commands', 'of'): 1, ('commands', 'of', 'the'): 1, ('commands', 'of', 'the', 'party'): 1, ('ensures',): 1, ('ensures', 'that'): 1, ('ensures', 'that', 'the'): 1, ('ensures', 'that', 'the', 'military'): 1, ('guide',): 1, ('guide', 'to'): 1, ('guide', 'to', 'action'): 1, ('guide', 'to', 'action', 'which'): 1, ('is',): 1, ('is', 'a'): 1, ('is', 'a', 'guide'): 1, ('is', 'a', 'guide', 'to'): 1, ('military',): 1, ('military', 'always'): 1, ('military', 'always', 'obeys'): 1, ('military', 'always', 'obeys', 'the'): 1, ('obeys',): 1, ('obeys', 'the'): 1, ('obeys', 'the', 'commands'): 1, ('obeys', 'the', 'commands', 'of'): 1, ('of',): 1, ('of', 'the'): 1, ('of', 'the', 'party'): 1, ('party',): 1, ('that',): 1, ('that', 'the'): 1, ('that', 'the', 'military'): 1, ('that', 'the', 'military', 'always'): 1, ('the',): 3, ('the', 'commands'): 1, ('the', 'commands', 'of'): 1, ('the', 'commands', 'of', 'the'): 1, ('the', 'military'): 1, ('the', 'military', 'always'): 1, ('the', 'military', 'always', 'obeys'): 1, ('the', 'party'): 1, ('to',): 1, ('to', 'action'): 1, ('to', 'action', 'which'): 1, ('to', 'action', 'which', 'ensures'): 1, ('which',): 1, ('which', 'ensures'): 1, ('which', 'ensures', 'that'): 1, ('which', 'ensures', 'that', 'the'): 1})
def compute_bleu(reference_corpus, translation_corpus, max_order=4,
                 smooth=False):
    """Computes BLEU score of translated segments against one or more references.

    Args:
      reference_corpus: list of lists of references for each translation. Each
        reference should be tokenized into a list of tokens.
      translation_corpus: list of translations to score. Each translation
        should be tokenized into a list of tokens.
      max_order: Maximum n-gram order to use when computing BLEU score.
      smooth: Whether or not to apply Lin et al. 2004 smoothing.

    Returns:
      6-tuple with the BLEU score, n-gram precisions, brevity penalty,
      length ratio, translation length and reference length.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    # references and translations are zipped together, so several
    # translations can be scored in one call: each iteration pairs one
    # translation with its set of references
    for (references, translation) in zip(reference_corpus, translation_corpus):
        # accumulate the length of the shortest reference
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)
        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            # Counter union (|) keeps, for each n-gram, the maximum count
            # across all references
            merged_ref_ngram_counts |= get_ngrams(reference, max_order)
        translation_ngram_counts = get_ngrams(translation, max_order)
        # Counter intersection (&) keeps the minimum of the two counts,
        # i.e. the clipped matches
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            # numerator of each n-gram precision: sum of clipped counts
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            # denominator of each n-gram precision: total number of
            # candidate n-grams of this order
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches
    precisions = [0] * max_order
    for i in range(0, max_order):
        if smooth:
            precisions[i] = ((matches_by_order[i] + 1.) /
                             (possible_matches_by_order[i] + 1.))
        else:
            if possible_matches_by_order[i] > 0:
                precisions[i] = (float(matches_by_order[i]) /
                                 possible_matches_by_order[i])
            else:
                # the sentence is too short (or n too large) for any n-gram
                # of this order to exist, in which case BLEU is also 0
                precisions[i] = 0.0
    if min(precisions) > 0:
        p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0
    ratio = float(translation_length) / reference_length
    # ratio = c / r (candidate / reference); r here is the shortest reference
    # rather than the closest in length: if ratio > 1, bp is 1 anyway, and if
    # ratio < 1, the shortest reference is also the closest to the candidate
    if ratio > 1.0:
        bp = 1.
    else:
        bp = math.exp(1 - 1. / ratio)
    bleu = geo_mean * bp
    return (bleu, precisions, bp, ratio, translation_length, reference_length)
compute_bleu([references], [candidate1])
(0.5045666840058485, [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], 1.0, 1.125, 18, 16)
compute_bleu([references], [candidate2])
(0.0, [0.5714285714285714, 0.07692307692307693, 0.0, 0.0], 0.8668778997501817, 0.875, 14, 16)
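To see what the smooth=True branch changes, the candidate2 score can be redone by hand with the add-one smoothing from compute_bleu above; the matches and possible counts are read off the precisions in the previous output, and the final value is my own arithmetic, so treat this as a sketch:

```python
import math

# candidate2's clipped n-gram matches and possible n-gram counts for
# n = 1..4, recovered from the precisions printed above (8/14, 1/13, ...)
matches = [8, 1, 0, 0]
possible = [14, 13, 12, 11]
bp = math.exp(1 - 16 / 14)  # r = 16 (shortest reference), c = 14

# smooth=True branch: (matches + 1) / (possible + 1), so the zero-match
# orders no longer force the geometric mean (and hence BLEU) to 0
precisions = [(m + 1.) / (p + 1.) for m, p in zip(matches, possible)]
geo_mean = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(geo_mean * bp)
```

With smoothing, candidate2 gets a small positive score instead of a hard 0, which matters when averaging over short sentences.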
weights = [0.25, 0.25, 0.25, 0.25]
candidate1 = ['the', 'commands', 'of', 'the', 'party']
reference1 = ['will', 'forever','heed', 'Party', 'commands']
reference2 = ['the', 'command', 'of', 'the','Party']
reference3 = ['the', 'the','of', 'the', 'party']
references = [reference1, reference2, reference3]
reference_corpus = [references]
translation_corpus = [candidate1]
max_order = 1
matches_by_order = [0] * max_order
possible_matches_by_order = [0] * max_order
reference_length = 0
translation_length = 0
for (references, translation) in zip(reference_corpus, translation_corpus):
    reference_length = min(len(r) for r in references)
    translation_length = len(translation)
    merged_ref_ngram_counts = collections.Counter()
    for reference in references:
        merged_ref_ngram_counts |= get_ngrams(reference, max_order)
    translation_ngram_counts = get_ngrams(translation, max_order)
    overlap = translation_ngram_counts & merged_ref_ngram_counts
    for ngram in overlap:
        matches_by_order[len(ngram) - 1] += overlap[ngram]
    for order in range(1, max_order + 1):
        possible_matches = len(translation) - order + 1
        if possible_matches > 0:
            possible_matches_by_order[order - 1] += possible_matches
get_ngrams(reference1, max_order)
Counter({('Party',): 1, ('commands',): 1, ('forever',): 1, ('heed',): 1, ('will',): 1})
get_ngrams(reference2, max_order)
Counter({('Party',): 1, ('command',): 1, ('of',): 1, ('the',): 2})
get_ngrams(reference3, max_order)
Counter({('of',): 1, ('party',): 1, ('the',): 3})
merged_ref_ngram_counts
Counter({('Party',): 1, ('command',): 1, ('commands',): 1, ('forever',): 1, ('heed',): 1, ('of',): 1, ('party',): 1, ('the',): 3, ('will',): 1})
get_ngrams(reference3, max_order) | get_ngrams(reference2, max_order) | get_ngrams(reference1, max_order)
Counter({('Party',): 1, ('command',): 1, ('commands',): 1, ('forever',): 1, ('heed',): 1, ('of',): 1, ('party',): 1, ('the',): 3, ('will',): 1})
translation_ngram_counts
Counter({('commands',): 1, ('of',): 1, ('party',): 1, ('the',): 2})
overlap
Counter({('commands',): 1, ('of',): 1, ('party',): 1, ('the',): 2})
matches_by_order
[5]
possible_matches_by_order
[5]
reference_length
5
translation_length
5
The method itself is simple; neither the idea nor the formulas are complicated, but the evaluation is carried out rigorously and convincingly.
That rigor is well worth borrowing whenever we evaluate our own algorithms in practice.
Concretely, the computation breaks down into three steps: clipped n-gram precisions, the brevity penalty, and their combination via a weighted geometric mean.
The nltk-style code is simple, readable, and intuitive;
Google's code is well abstracted and additionally handles smoothing and batches of translations.