
Not All Errors Are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Saxon,
Lei Li, William Yang Wang
UC Santa Barbara
{wendaxu, ytuan, yujielu, mssaxon, leili, william}@cs.ucsb.edu
Abstract
Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating datasets are already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks, including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves comparable performance to the best supervised metric, COMET, despite receiving no human-annotated training data.¹

¹Code and data are available at https://github.com/xu1998hz/SEScore
1 Introduction
Text generation tasks such as translation and image captioning have seen considerable progress in the past few years (Chen et al., 2015; Birch, 2021). However, precisely and automatically evaluating the quality of generated text remains a challenge. Long-dominant n-gram-based evaluation techniques, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), are sensitive to surface-level lexical and syntactic variations and have repeatedly been reported to correlate poorly with human judgements (Zhang* et al., 2020; Xu et al., 2021).
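To make this surface sensitivity concrete, the minimal example below scores an exact copy and a meaning-preserving paraphrase with NLTK's sentence-level BLEU. It is an illustrative sketch with invented sentences, not an experiment from this paper.

# Illustrative only: n-gram overlap rewards exact surface matches and heavily
# penalizes a paraphrase with the same meaning but different wording.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]
exact_copy = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on the rug".split()  # same meaning, different surface form

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, exact_copy, smoothing_function=smooth))  # 1.0
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # much lower, despite adequacy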
Multiple learned metrics have been proposed to better approximate human judgements. These metrics can be categorized into unsupervised and supervised methods, depending on whether human ratings are used. The former include PRISM (Thompson and Post, 2020), BERTScore (Zhang* et al., 2020), and BARTScore (Yuan et al., 2021), among others; the latter include BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020).
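For concreteness, the snippet below shows how one such unsupervised metric is typically invoked, assuming the open-source bert-score package; it is a usage sketch only and is unrelated to SESCORE's implementation.

# Usage sketch (assumes the `bert-score` pip package): an unsupervised learned
# metric needs only candidates and references, with no human ratings involved.
from bert_score import score

candidates = ["A cat was sitting on the rug."]
references = ["The cat sat on the mat."]

P, R, F1 = score(candidates, references, lang="en")  # precision/recall/F1 tensors
print(F1.mean().item())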
Unsupervised learned metrics are particularly useful because task-specific human annotations of generated text can be expensive or impractical to gather at scale. While these metrics are applicable to a variety of NLG tasks (Zhang* et al., 2020; Yuan et al., 2021), they tend to target a narrow set of aspects such as semantic coverage or faithfulness, and have limited applicability to other aspects that matter to humans, such as fluency and style (Freitag et al., 2021a; Saxon et al., 2021). While supervised metrics can address these different attributes by modeling the conditional distribution of real human opinions, training data for quality assessment is often task- and domain-specific, with limited generalizability.
We introduce SESCORE, a general technique to produce nuanced reference-based metrics for automatic text generation evaluation without using human-annotated reference-candidate text pairs. Our method is motivated by the observation that a diverse set of distinct error types can co-occur in candidate texts, and that human evaluators do not view all errors as equally problematic (Freitag et al., 2021a). To this end, we develop a stratified error synthesis procedure to construct (reference, candidate, score) triples from raw text. The candidates contain non-overlapping, plausible simulations of NLG model errors, iteratively applied to the input text. At each iteration, a severity scoring module isolates the individual simulated errors and assesses the human-perceived degradation in quality incurred.
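The loop below is a minimal, hypothetical sketch of this procedure; the helper names (error_fns, severity_fn) and the additive penalty are illustrative placeholders, not the paper's actual implementation.

# Hypothetical sketch of the stratified error synthesis loop described above.
import random
from typing import Callable, List, Tuple

def synthesize_training_triple(
    reference: str,
    error_fns: List[Callable[[str], str]],      # e.g. span deletion, word swap, ...
    severity_fn: Callable[[str, str], float],   # entailment-based severity estimate
    max_errors: int = 5,
) -> Tuple[str, str, float]:
    """Iteratively corrupt a reference sentence and accumulate a severity penalty."""
    candidate, total_penalty = reference, 0.0
    for _ in range(random.randint(1, max_errors)):
        corrupted = random.choice(error_fns)(candidate)
        # Score only the newly introduced error against the previous version,
        # so penalties for earlier errors are not counted twice.
        total_penalty += severity_fn(candidate, corrupted)
        candidate = corrupted
    # (reference, candidate, score) triple used to train the regression metric.
    return reference, candidate, -total_penalty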
Our contributions are as follows:
• SESCORE, an approach to train automatic text evaluation metrics without human ratings;
• A procedure to synthesize different types of