Not All Errors Are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
Wenda Xu, Yilin Tuan, Yujie Lu, Michael Saxon,
Lei Li, William Yang Wang
UC Santa Barbara
{wendaxu, ytuan, yujielu, mssaxon, leili, william}@cs.ucsb.edu
Abstract
Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves comparable performance to the best supervised metric, COMET, despite receiving no human-annotated training data.¹

¹ Code and data are available at https://github.com/xu1998hz/SEScore
1 Introduction
Text generation tasks such as translation and image captioning have seen considerable progress in the past few years (Chen et al., 2015; Birch, 2021). However, precisely and automatically evaluating generated text quality remains a challenge. Long-dominant n-gram-based evaluation techniques, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), are sensitive to surface-level lexical and syntactic variations and have repeatedly been reported to correlate poorly with human judgements (Zhang* et al., 2020; Xu et al., 2021).
Multiple learned metrics have been proposed to better approximate human judgements. These metrics can be categorized into unsupervised and supervised methods based on whether human ratings are used. The former includes PRISM (Thompson and Post, 2020), BERTScore (Zhang* et al., 2020), and BARTScore (Yuan et al., 2021); the latter includes BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020).
Unsupervised learned metrics are particularly useful as task-specific human annotations of generated text can be expensive or impractical to gather at scale. While these metrics are applicable to a variety of NLG tasks (Zhang* et al., 2020; Yuan et al., 2021), they tend to target a narrow set of aspects such as semantic coverage or faithfulness, and have limited applicability to other aspects, such as fluency and style, that matter to humans (Freitag et al., 2021a; Saxon et al., 2021). While supervised metrics can address different attributes by modeling the conditional distribution of real human opinions, training data for quality assessment is often task- and domain-specific with limited generalizability.

We introduce SESCORE, a general technique to produce nuanced reference-based metrics for automatic text generation evaluation without using human-annotated reference-candidate text pairs. Our method is motivated by the observation that a diverse set of distinct error types can co-occur in candidate texts, and that human evaluators do not view all errors as equally problematic (Freitag et al., 2021a). To this end, we develop a stratified error synthesis procedure to construct (reference, candidate, score) triples from raw text. The candidates contain non-overlapping, plausible simulations of NLG model errors, iteratively applied to the input text. At each iteration, a severity scoring module isolates individual simulated errors and assesses the human-perceived degradation in quality incurred. Our contributions are as follows:
- SESCORE, an approach to train automatic text evaluation metrics without human ratings;
- A procedure to synthesize different types of errors in text at varying severity levels;
- Experiments showing that SESCORE is effective on a diverse set of NLG tasks including WMT 20/21, WebNLG, and image captioning, and outperforms all previous unsupervised learned metrics; it is even comparable to the best learned metric on WMT 20/21.
2 Related Work
Traditional n-gram-matching-based (Papineni et al., 2002; Banerjee and Lavie, 2005) and edit-distance-based approaches (Levenshtein, 1965; Snover et al., 2006) have proven to be limited in recognizing semantic similarity beyond the lexical level. Learned metrics (Zhang* et al., 2020; Sellam et al., 2020; Yuan et al., 2021) have been proposed to align better with human judgements. We categorize these metrics as either unsupervised or supervised with respect to learning from human-annotated scores.
Unsupervised Metrics attempt to extract features from large pretrained models. Embedding-based metrics (e.g. BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019)) create soft alignments between reference and hypothesis in the embedding space, but they are largely confined to assessing semantic coverage. Text-generation-based metrics (Yuan et al., 2021) use the conditional probability of the generated sentence to evaluate the faithfulness of the candidates. However, Freitag et al. (2021a) point out that text generation can produce errors beyond semantic coverage or faithfulness (e.g. style and fluency errors), which results in poor correlations with human evaluations.
Supervised Metrics attempt to learn from limited human-labelled severity annotations. Rei et al. (2020) trained COMET on a small set of domain-specific human ratings; this model has limited extensibility to the general domain. BLEURT (Sellam et al., 2020) first pretrains on millions of synthetic examples and then uses WMT data to fine-tune the model. Unlike our fine-grained stratified error synthesis, the labels on its synthetic data are derived from prior metrics or other tasks, limiting the quality and precision of the pretraining process.
3 The SESCORE Approach
Given a reference text x and a candidate y, a metric is expected to output a score s. Training such a metric model requires (reference, candidate, score) triples.
[Figure 1: Overview of the Quality Prediction Model. In the pre-training stage, raw text (x) and synthesized text (y′) pass through Transformer layers and a pooling layer to produce sentence embedding features, which a feedforward NN maps to SEScore(x, y′); the model is trained with MSE loss against the synthetic quality score (s′). The inference stage applies the same stack to a reference (x) and a candidate (y).]
However, large-scale human-annotated triple data are not available for many tasks. We consider a general setup in which only a large raw text corpus is available.

SESCORE is trained from a pretrained language model (e.g. BERT) on synthetic triples generated from raw text. It synthesizes candidate sentences y′ that mimic plausible errors by transforming raw input sentences x multiple times. At each step, it inserts, deletes, or substitutes a random span of text; these errors are non-overlapping. It then assesses the severity of the errors introduced by each transformation. This allows us to pretrain quality prediction models on corpora containing only raw text samples {x}, enabling the use of learned quality prediction models in any text generation domain.
The process of generating y′ from x, stratified error synthesis, is so called because of its incremental and multi-category nature: a stochastic perturbation function G_es, which randomly samples from a set of potential errors, is recursively applied to x (eq. (1)) M times to produce a sequence of perturbed sentences Z = {z_i}_{i=1}^{M} that interpolates between the raw text x and the final synthetic sentence y′ = z_M (§ 3.2).

z_i = \begin{cases} x, & \text{if } i = 0 \\ G_{es}(z_{i-1}), & \text{if } 0 < i \le M \end{cases} \qquad (1)

The resulting sentence sequence Z is then used in the subsequent severity scoring step, which applies a pairwise severity scoring function S_es to consecutive pairs and cumulatively yields the training label s′ = \sum_{i=1}^{M} S_es(z_{i-1}, z_i) (§ 3.3). A concrete example is illustrated in fig. 2. Finally, we train SESCORE's quality prediction model f_θ (fig. 1) using synthetic ⟨x, y′, s′⟩ triples (§ 3.4).
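To make the two equations concrete, the following sketch shows how one synthetic training triple could be produced. It is illustrative Python pseudocode only: `perturb_fns` and `severity_fn` are placeholder names standing in for the edit operations of § 3.2 and the scorer of § 3.3, not functions from the released code.

```python
import random

def synthesize_example(x, perturb_fns, severity_fn, m_max=5):
    """Illustrative SESCORE-style synthesis of one (reference, candidate, score) triple.
    perturb_fns: list of edit operations (insert/delete/replace/swap), each mapping
                 a sentence to a perturbed sentence.
    severity_fn: returns -1 (minor) or -5 (severe) for a (z_prev, z_cur) pair, as in eq. (2)."""
    z_prev = x                          # z_0 = x
    score = 0                           # s' accumulates S_es(z_{i-1}, z_i)
    m = random.randint(1, m_max)        # number of stratified edits, at most M_max
    for _ in range(m):
        g_es = random.choice(perturb_fns)      # draw one edit operation
        z_cur = g_es(z_prev)                   # eq. (1): z_i = G_es(z_{i-1})
        score += severity_fn(z_prev, z_cur)    # eq. (2)
        z_prev = z_cur
    return z_prev, score                       # (y', s')
```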
[Figure 2: SESCORE stratified error synthesis and severity scoring pipeline. Starting from the raw text z_0 = x, "He will not accept it because he will not like it", the pipeline applies Step 1, insertion of "he hates the plan" (seq-to-seq; severe, −5); Step 2, deletion of "not" (severe, −5); Step 3, replacement of "like" with "fancy" (MLM; minor, −1); and Step 4, swap of "He will" to "will He" (minor, −1). # indicates the start index of each error in the previous sentence. Both MLM and seq-to-seq models can be used to produce inserted or replaced tokens. Each z_i corresponds to a perturbed sentence. The final synthesized sentence y′ = z_4 has the score s′ = \sum_{i=1}^{4} S_es(z_{i-1}, z_i) = −12.]
Category | Error | Description | Synthesis Procedure in SESCORE
Accuracy | Addition | Text includes information not present in the reference | Insertion using MLM or seq2seq generation
Accuracy | Omission | Text is missing content from the reference | Deletion of a random span of tokens
Accuracy | Mistranslation | Text does not accurately represent the reference | Replacement of a random span using masked or seq2seq generation
Fluency | Punctuation | Incorrect punctuation (for locale or style) | Insertion and replacement using masked filling, and deletion
Fluency | Spelling | Incorrect spelling or capitalization | Insertion, replacement, deletion, and swap
Fluency | Grammar | Problems with grammar | Insertion, replacement, deletion, and swap

Table 1: Error categories in MQM and our synthesis procedures. SESCORE generalizes these simulated model-output errors beyond machine translation.
3.1 Background: Quality Measured by Errors
Our method is inspired by the multidimensional quality metrics (MQM) framework (Mariana, 2014; Freitag et al., 2021a). MQM is a human evaluation scheme for machine translation: it determines the quality of a translated text by manually labeling errors and their severity levels. Errors are categorized into multiple types such as accuracy and fluency, and each error is associated with a severity level, with a penalty of 5 for a major error and 1 for a minor error.

In table 1, we use the two major error categories in the MQM framework, accuracy and fluency, to classify and decide the perturbations in G_es. There are two main motivations for simulating the errors in the table: 1) they are the two major error categories in machine translation; 2) these errors are general and extensible to new domains. We use six techniques to simulate the errors in table 1: insertion and replacement with either a masked language model (MLM) or a seq-to-seq language model, and N-gram word drop and swap.
3.2 Stratified Error Synthesis
Tuan et al. (2021) suggest that multiple errors can co-occur in one segment, so we construct each sentence with up to M_max perturbations (M_max = 5 in our experiments). At each iteration, we randomly draw one perturbation G_es from the set of edit operations E = {e_ins, e_del, e_repl, e_swap} (insertion, deletion, replacement, and swap, respectively).

Our technique is stratified so as to enable accurate evaluation of the severity at each step and to prevent subsequent errors from overwriting prior ones. To achieve this, we propose a novel stratified error synthesis algorithm. For an input sentence x with L tokens, we initialize an array q of length L with q_j = L − j, 1 ≤ j ≤ L; each value indicates how many tokens after the current token can still be modified by the perturbation function G_es. Each G_es randomly selects a start index j from 1 to L at which to modify the text, and an error synthesis table keeps track of how many candidate tokens can be modified after index j. G_es is accepted only if q_j is greater than the span length of the perturbation. The implementation details of the stratified error synthesis algorithm for each edit operation are given in algorithm 1 in Appendix A. All perturbations are recursively applied to the raw text x, as shown in eq. (1). A minimal sketch of the acceptance check is given below.
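The sketch below shows only the budget initialization and the acceptance test; the budget-update rule after an accepted edit is an illustrative assumption on our part, since the exact bookkeeping is specified by Algorithm 1 in Appendix A.

```python
def init_edit_budget(num_tokens):
    """q[j] = number of tokens after position j that are still untouched
    (the paper indexes from 1; this sketch uses 0-based indexing)."""
    return [num_tokens - j for j in range(num_tokens)]

def accept_edit(q, start, span_len):
    """Accept a perturbation at `start` covering `span_len` tokens only if enough
    untouched tokens remain after it (q[start] > span_len).  The update below is an
    assumed, simplified stand-in for the paper's Algorithm 1, not a reproduction of it."""
    if q[start] <= span_len:
        return False
    for j in range(start + 1):
        # earlier start positions may no longer reach into the edited span
        q[j] = min(q[j], start - j)
    return True
```

For example, with a 10-token sentence, `init_edit_budget(10)` yields q = [10, 9, ..., 1], and a 3-token replacement starting at index 6 is accepted because q[6] = 4 > 3.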
Synthesize Addition Error by Insertion (e_ins) Given a start index, we add an additional phrase to the raw text in one of two ways: a) using an MLM (e.g. BERT or RoBERTa), or b) using a seq-to-seq language model (e.g. mBART). For the first approach, we insert a <mask> token at the given position in the sentence and use an MLM to fill it based on its context, with top-k sampling (k = 4) to randomly select the filling token. Our primary aim is to introduce semantically close sentences exhibiting all three fluency errors; with the insertion of <mask>, we can further synthesize Addition errors. For the second approach, we use a pre-trained seq-to-seq model (e.g. mBART) to generate a phrase of variable length given the context text.
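As a concrete illustration of the MLM route, the sketch below inserts a <mask> token and fills it with a top-k sample (k = 4). The RoBERTa checkpoint and the word-level tokenization are assumptions made purely for illustration; the seq-to-seq variant is not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative MLM used for mask filling (assumption: roberta-base).
tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def insert_token(words, position, k=4):
    """Insert <mask> at `position`, fill it with one of the MLM's top-k
    predictions, and return the perturbed word list."""
    masked = words[:position] + [tok.mask_token] + words[position:]
    enc = tok(" ".join(masked), return_tensors="pt")
    mask_idx = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_idx]
    top_ids = torch.topk(logits, k).indices
    choice = top_ids[torch.randint(k, (1,))].item()   # uniform pick among top-k
    filled = tok.decode([choice]).strip()
    return words[:position] + [filled] + words[position:]
```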
Synthesize Omission Error by Deletion (e_del) We delete a random span of tokens from a raw text sentence. The start of the span is drawn uniformly over the token indices, and the span length is drawn from a Poisson distribution (λ_d = 1.5). Our primary aim is to mimic Omission errors; however, depending on the specific words dropped, this technique can also create Mistranslation and all Fluency errors.
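A minimal sketch of this operation, under the assumption that it acts on a word-level token list:

```python
import numpy as np

def delete_span(words, lam=1.5):
    """Illustrative omission synthesis: drop a random span whose length is drawn
    from a Poisson distribution (lambda_d = 1.5); the start index is uniform."""
    if len(words) < 2:
        return words
    span = max(1, np.random.poisson(lam))
    span = min(span, len(words) - 1)               # keep at least one token
    start = np.random.randint(0, len(words) - span + 1)
    return words[:start] + words[start + span:]
```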
Synthesize Phrasal Error by Replacement (e_repl) Sometimes specific terms in a reference sentence are systematically misphrased in generated samples. This is difficult to simulate directly; instead, we use either an MLM or a seq-to-seq model to replace a segment of tokens in the original text. For the first approach, the replaced span is always a single token, which is first replaced with a <mask> token; we then use an MLM to fill the blank, as in the insertion operation. For the second approach, we use a denoising seq-to-seq model (e.g. mBART) to generate tokens for the mask tags, randomly choosing the starting index of the span and drawing the span length from a Poisson distribution (λ_d = 1.5). The denoising seq-to-seq model synthesizes fluent sentences with Addition and Mistranslation errors.
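The single-token MLM route can be sketched by composing the deletion and mask-filling steps shown above; `insert_token` refers to the illustrative helper from the insertion sketch, and the span-level seq-to-seq route is not shown.

```python
def replace_token(words, position, k=4):
    """Illustrative single-token replacement: drop the token at `position`, then
    let the MLM fill a <mask> at the same spot (see `insert_token` above).
    The paper's seq-to-seq variant instead masks a Poisson-length span
    (lambda_d = 1.5) and regenerates it with a denoising model such as mBART."""
    remainder = words[:position] + words[position + 1:]
    return insert_token(remainder, position, k=k)
```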
Synthesize Grammar and Other Errors by Swapping (e_swap) We swap two random words within a span of length λ_s in the sentence (λ_s = 4). Our primary aim is to generate grammatically incorrect sentences with disordered words, such as subject-verb disagreement; this further introduces Spelling and Punctuation errors.
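A minimal sketch of the swap operation, again assuming a word-level token list:

```python
import random

def swap_words(words, span=4):
    """Illustrative grammar-error synthesis: swap two words that lie at most
    `span` positions apart (lambda_s = 4 in the paper)."""
    if len(words) < 2:
        return words
    i = random.randrange(len(words) - 1)
    j = min(len(words) - 1, i + random.randint(1, span))
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out
```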
3.3 Assessing Severity Score
Following Freitag et al. (2021a), we consider an error severe if it alters the core meaning of the sentence. Prior work has suggested that sentence entailment is strongly correlated with semantic similarity (Khobragade et al., 2019). To capture changes in semantic meaning, we define a bidirectional entailment relation such that "text a entails b and b entails a" is equivalent to "a is semantically equivalent to b". Therefore, for a given perturbation function G_es applied to the sentence z_{i-1}, we measure the bidirectional entailment likelihood of z_{i-1} and z_i. If, after the transformation, z_i remains bidirectionally entailed with z_{i-1}, we can assume that G_es does not severely alter the semantic meaning of z_{i-1}, and the error is therefore minor. We define the entailment likelihood ρ(a, b) as the probability of predicting that a entails b; the formulation is given in eq. (2). Setting the threshold γ to 0.9 yields the highest inter-rater agreement of severity measures on our validation dataset. Following Freitag et al. (2021a), we assign −5 to severe errors and −1 to minor errors, so the score range is [−25, 0]. We evaluate severity at each perturbation of the sentence and cumulatively yield the training label s′ for the final synthesized sentence y′, s′ = \sum_{i=1}^{M} S_es(z_{i-1}, z_i).

S_{es}(z_{i-1}, z_i) = \begin{cases} -1, & \text{if } \rho(z_{i-1}, z_i) \ge \gamma \text{ and } \rho(z_i, z_{i-1}) \ge \gamma \\ -5, & \text{otherwise} \end{cases} \qquad (2)
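A minimal sketch of this severity scorer, assuming an off-the-shelf NLI checkpoint from Hugging Face (roberta-large-mnli is used here only as an example; the paper does not prescribe a particular entailment model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()
ENTAIL = 2  # entailment index for this checkpoint; check model.config.id2label for others

def entail_prob(a, b):
    """rho(a, b): probability that sentence a entails sentence b."""
    enc = nli_tok(a, b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli(**enc).logits[0], dim=-1)
    return probs[ENTAIL].item()

def severity(z_prev, z_cur, gamma=0.9):
    """Eq. (2): -1 (minor) if the edit preserves bidirectional entailment at
    threshold gamma = 0.9, otherwise -5 (severe)."""
    if entail_prob(z_prev, z_cur) >= gamma and entail_prob(z_cur, z_prev) >= gamma:
        return -1
    return -5
```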
3.4 Quality Prediction Model
As shown in fig. 1, we feed both the raw text x (reference) and the synthetic error sentence y′ into a pre-trained language model (e.g. BERT or RoBERTa). The resulting word embeddings are average-pooled to derive two sentence embeddings. We then use the approach proposed by RUSE (Shimanaka et al., 2018) to extract two features: 1) the element-wise product of the synthesized and reference sentence embeddings, and 2) their element-wise difference. Following the COMET (Rei et al., 2020) implementation, these features are concatenated into a single vector and fed into a feed-forward neural network regressor, f_θ.

However, a key distinction between our model and COMET is that we do not use the source input during training or inference. SESCORE can therefore generalize to other text generation tasks without requiring task-specific source data. Detailed architecture choices can be found in § 4.1.
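A minimal sketch of this architecture, assuming a Hugging Face encoder with average pooling; the encoder name, hidden size, and exact feature vector are illustrative assumptions and may differ from the released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QualityPredictor(nn.Module):
    """Sketch of f_theta: encode reference and candidate, average-pool,
    combine with RUSE-style product/difference features, regress a score."""
    def __init__(self, encoder_name="roberta-base", hidden=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.regressor = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def embed(self, enc):
        out = self.encoder(**enc).last_hidden_state           # (B, T, d)
        mask = enc["attention_mask"].unsqueeze(-1)             # (B, T, 1)
        return (out * mask).sum(1) / mask.sum(1)               # average pooling

    def forward(self, ref_enc, hyp_enc):
        r, h = self.embed(ref_enc), self.embed(hyp_enc)
        feats = torch.cat([r * h, torch.abs(r - h)], dim=-1)   # product + difference
        return self.regressor(feats).squeeze(-1)               # predicted score

# Usage sketch: the predicted score is trained against s' with MSE loss,
# mirroring the pre-training stage in fig. 1.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = QualityPredictor()
ref = tok(["He will not accept it because he will not like it"], return_tensors="pt", padding=True)
hyp = tok(["will He accept it because he hates the plan he will not fancy it"], return_tensors="pt", padding=True)
score = model(ref, hyp)
```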