Not All Errors Are Equal: Learning Text Generation Metrics using
Stratified Error Synthesis
Wenda Xu, Yilin Tuan, Yujie Lu, Michael Saxon,
Lei Li, William Yang Wang
UC Santa Barbara
{wendaxu, ytuan, yujielu, mssaxon, leili, william}@cs.ucsb.edu
Abstract
Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves comparable performance to the best supervised metric, COMET, despite receiving no human-annotated training data.¹

¹ Code and data are available at https://github.com/xu1998hz/SEScore
1 Introduction
Text generation tasks such as translation and image captioning have seen considerable progress in the past few years (Chen et al., 2015; Birch, 2021). However, precisely and automatically evaluating generated text quality remains a challenge. Long-dominant n-gram-based evaluation techniques, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), are sensitive to surface-level lexical and syntactic variations and have repeatedly been reported to correlate poorly with human judgements (Zhang* et al., 2020; Xu et al., 2021).
Multiple learned metrics have been proposed to better approximate human judgements. These metrics can be categorized into unsupervised and supervised methods based on whether human ratings are used. The former includes PRISM (Thompson and Post, 2020), BERTScore (Zhang* et al., 2020), and BARTScore (Yuan et al., 2021); the latter includes BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020).
Unsupervised learned metrics are particularly useful as task-specific human annotations of generated text can be expensive or impractical to gather at scale. While these metrics are applicable to a variety of NLG tasks (Zhang* et al., 2020; Yuan et al., 2021), they tend to target a narrow set of aspects such as semantic coverage or faithfulness, and have limited applicability to other aspects, such as fluency and style, that matter to humans (Freitag et al., 2021a; Saxon et al., 2021). While supervised metrics can address different attributes by modeling the conditional distribution of real human opinions, training data for quality assessment is often task- and domain-specific with limited generalizability.

We introduce SESCORE, a general technique to produce nuanced reference-based metrics for automatic text generation evaluation without using human-annotated reference-candidate text pairs. Our method is motivated by the observation that a diverse set of distinct error types can co-occur in candidate texts, and that human evaluators do not view all errors as equally problematic (Freitag et al., 2021a). To this end, we develop a stratified error synthesis procedure to construct (reference, candidate, score) triples from raw text. The candidates contain non-overlapping, plausible simulations of NLG model errors, iteratively applied to the input text. At each iteration, a severity scoring module isolates individual simulated errors and assesses the human-perceived degradation in quality incurred. Our contributions are as follows:
- SESCORE, an approach to train automatic text evaluation metrics without human ratings;
- A procedure to synthesize different types of errors in text at varying severity levels;
- Experiments showing that SESCORE is effective on a diverse set of NLG tasks including WMT 20/21, WebNLG, and image captioning, and outperforms all previous unsupervised learned metrics; it is even comparable to the best learned metric on WMT 20/21.
2 Related Work
Traditional n-gram-matching-based (Papineni et al., 2002; Banerjee and Lavie, 2005) and edit-distance-based approaches (Levenshtein, 1965; Snover et al., 2006) have proven to be limited in recognizing semantic similarity beyond the lexical level. Learned metrics (Zhang* et al., 2020; Sellam et al., 2020; Yuan et al., 2021) have been proposed to align better with human judgements. We categorize these metrics as either unsupervised or supervised with respect to learning from human-annotated scores.
Unsupervised Metrics attempt to extract features from large pretrained models. Embedding-based metrics (e.g. BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019)) create soft alignments between reference and hypothesis in the embedding space, but they are largely confined to assessing semantic coverage. Text-generation-based metrics (Yuan et al., 2021) use the conditional probability of the generated sentence to evaluate the faithfulness of the candidates. However, Freitag et al. (2021a) point out that text generation can produce errors beyond semantic coverage or faithfulness (e.g. style and fluency errors), which results in poor correlations with human evaluations.
Supervised Metrics attempt to learn from limited human-labelled severity annotations. Rei et al. (2020) trained COMET on a small set of domain-specific human ratings; this model has limited extensibility to the general domain. BLEURT (Sellam et al., 2020) first pretrains on millions of synthetic examples and then uses WMT data to fine-tune the model. Unlike our fine-grained stratified error synthesis, the labels on its synthetic data are derived from prior metrics or other tasks, limiting the quality and precision of the pretraining process.
3 The SESCORE Approach
Given a reference text x and a candidate y, a metric is expected to output a score s. Training such a metric model requires (reference, candidate, score) triples.
[Figure 1: Overview of the Quality Prediction Model. In the pre-training stage, raw text (x) and synthesized text (y′) pass through Transformer layers and a pooling layer to produce sentence embedding features, which a feedforward NN maps to SEScore(x, y′); the model is trained with MSE loss against the synthetic quality score (s′). The inference stage applies the same stack to a reference (x) and a candidate (y).]
However, large-scale human-annotated triple data are not available for many tasks. We consider a general setup in which only a large raw text corpus is available.

SESCORE is trained from a pretrained language model (e.g. BERT) on synthetic triples generated from raw text. It synthesizes candidate sentences y′ that mimic plausible errors by transforming raw input sentences x multiple times. At each step, it inserts, deletes, or substitutes a random span of text; these errors are non-overlapping. It then assesses the severity of the errors introduced by each transformation. This allows us to pretrain quality prediction models on corpora containing only raw text samples {x}, enabling the use of learned quality prediction models in any text generation domain.
The process of generating y′ from x, stratified error synthesis, is so called because of its incremental and multi-category nature: a stochastic perturbation function G_es, which randomly samples from a set of potential errors, is recursively applied to x (eq. (1)) M times to produce a sequence of perturbed sentences Z = {z_i}_{i=1}^{M} that interpolates between the raw text x and the final synthetic sentence y′ = z_M (§ 3.2).

z_i = \begin{cases} x, & \text{if } i = 0 \\ G_{es}(z_{i-1}), & \text{if } 0 < i \le M \end{cases} \qquad (1)

The resulting sentence sequence Z is then used in the subsequent severity scoring step, which applies a pairwise severity scoring function S_es to consecutive pairs and cumulatively yields the training label s′ = \sum_{i=1}^{M} S_es(z_{i-1}, z_i) (§ 3.3). A concrete example is illustrated in fig. 2. Finally, we train SESCORE's quality prediction model f_θ (fig. 1) using synthetic ⟨x, y′, s′⟩ triples (§ 3.4).
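To make the two equations concrete, the following sketch shows how one synthetic training triple could be produced. It is illustrative Python pseudocode only: `perturb_fns` and `severity_fn` are placeholder names standing in for the edit operations of § 3.2 and the scorer of § 3.3, not functions from the released code.

```python
import random

def synthesize_example(x, perturb_fns, severity_fn, m_max=5):
    """Illustrative SESCORE-style synthesis of one (reference, candidate, score) triple.
    perturb_fns: list of edit operations (insert/delete/replace/swap), each mapping
                 a sentence to a perturbed sentence.
    severity_fn: returns -1 (minor) or -5 (severe) for a (z_prev, z_cur) pair, as in eq. (2)."""
    z_prev = x                          # z_0 = x
    score = 0                           # s' accumulates S_es(z_{i-1}, z_i)
    m = random.randint(1, m_max)        # number of stratified edits, at most M_max
    for _ in range(m):
        g_es = random.choice(perturb_fns)      # draw one edit operation
        z_cur = g_es(z_prev)                   # eq. (1): z_i = G_es(z_{i-1})
        score += severity_fn(z_prev, z_cur)    # eq. (2)
        z_prev = z_cur
    return z_prev, score                       # (y', s')
```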
[Figure 2: SESCORE stratified error synthesis and severity scoring pipeline. Starting from the raw text z_0 = x, "He will not accept it because he will not like it", the pipeline applies Step 1, insertion of "he hates the plan" (seq-to-seq; severe, −5); Step 2, deletion of "not" (severe, −5); Step 3, replacement of "like" with "fancy" (MLM; minor, −1); and Step 4, swap of "He will" to "will He" (minor, −1). # indicates the start index of each error in the previous sentence. Both MLM and seq-to-seq models can be used to produce inserted or replaced tokens. Each z_i corresponds to a perturbed sentence. The final synthesized sentence y′ = z_4 has the score s′ = \sum_{i=1}^{4} S_es(z_{i-1}, z_i) = −12.]
Category | Error | Description | Synthesis Procedure in SESCORE
Accuracy | Addition | Text includes information not present in the reference | Insertion using MLM or seq2seq generation
Accuracy | Omission | Text is missing content from the reference | Deletion of a random span of tokens
Accuracy | Mistranslation | Text does not accurately represent the reference | Replacement of a random span using masked or seq2seq generation
Fluency | Punctuation | Incorrect punctuation (for locale or style) | Insertion and replacement using masked filling, and deletion
Fluency | Spelling | Incorrect spelling or capitalization | Insertion, replacement, deletion, and swap
Fluency | Grammar | Problems with grammar | Insertion, replacement, deletion, and swap

Table 1: Error categories in MQM and our synthesis procedures. SESCORE generalizes these simulated model-output errors beyond machine translation.
3.1 Background: Quality Measured by Errors
Our method is inspired by the multidimensional quality metrics (MQM) framework (Mariana, 2014; Freitag et al., 2021a). MQM is a human evaluation scheme for machine translation: it determines the quality of a translated text by manually labeling errors and their severity levels. Errors are categorized into multiple types such as accuracy and fluency, and each error is associated with a severity level, with a penalty of 5 for a major error and 1 for a minor error.

In table 1, we use the two major error categories in the MQM framework, accuracy and fluency, to classify and decide the perturbations in G_es. There are two main motivations for simulating the errors in the table: 1) they are the two major error categories in machine translation; 2) these errors are general and extensible to new domains. We use six techniques to simulate the errors in table 1: insertion and replacement with either a masked language model (MLM) or a seq-to-seq language model, and N-gram word drop and swap.
3.2 Stratified Error Synthesis
Tuan et al. (2021) suggest that multiple errors can co-occur in one segment, so we construct each sentence with up to M_max perturbations (M_max = 5 in our experiments). At each iteration, we randomly draw one perturbation G_es from the set of edit operations E = {e_ins, e_del, e_repl, e_swap} (insertion, deletion, replacement, and swap, respectively).

Our technique is stratified so as to enable accurate evaluation of the severity at each step and to prevent subsequent errors from overwriting prior ones. To achieve this, we propose a novel stratified error synthesis algorithm. For an input sentence x with L tokens, we initialize an array q of length L with q_j = L − j, 1 ≤ j ≤ L; each value indicates how many tokens after the current token can still be modified by the perturbation function G_es. Each G_es randomly selects a start index j from 1 to L at which to modify the text, and an error synthesis table keeps track of how many candidate tokens can be modified after index j. G_es is accepted only if q_j is greater than the span length of the perturbation. The implementation details of the stratified error synthesis algorithm for each edit operation are given in algorithm 1 in Appendix A. All perturbations are recursively applied to the raw text x, as shown in eq. (1). A minimal sketch of the acceptance check is given below.
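The sketch below shows only the budget initialization and the acceptance test; the budget-update rule after an accepted edit is an illustrative assumption on our part, since the exact bookkeeping is specified by Algorithm 1 in Appendix A.

```python
def init_edit_budget(num_tokens):
    """q[j] = number of tokens after position j that are still untouched
    (the paper indexes from 1; this sketch uses 0-based indexing)."""
    return [num_tokens - j for j in range(num_tokens)]

def accept_edit(q, start, span_len):
    """Accept a perturbation at `start` covering `span_len` tokens only if enough
    untouched tokens remain after it (q[start] > span_len).  The update below is an
    assumed, simplified stand-in for the paper's Algorithm 1, not a reproduction of it."""
    if q[start] <= span_len:
        return False
    for j in range(start + 1):
        # earlier start positions may no longer reach into the edited span
        q[j] = min(q[j], start - j)
    return True
```

For example, with a 10-token sentence, `init_edit_budget(10)` yields q = [10, 9, ..., 1], and a 3-token replacement starting at index 6 is accepted because q[6] = 4 > 3.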
Synthesize Addition Error by Insertion (e_ins) Given a start index, we add an additional phrase to the raw text in one of two ways: a) using an MLM (e.g. BERT or RoBERTa), or b) using a seq-to-seq language model (e.g. mBART). For the first approach, we insert a <mask> token at the given position in the sentence and use an MLM to fill it based on its context, with top-k sampling (k = 4) to randomly select the filling token. Our primary aim is to introduce semantically close sentences exhibiting all three fluency errors; with the insertion of <mask>, we can further synthesize Addition errors. For the second approach, we use a pre-trained seq-to-seq model (e.g. mBART) to generate a phrase of variable length given the context text.
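As a concrete illustration of the MLM route, the sketch below inserts a <mask> token and fills it with a top-k sample (k = 4). The RoBERTa checkpoint and the word-level tokenization are assumptions made purely for illustration; the seq-to-seq variant is not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative MLM used for mask filling (assumption: roberta-base).
tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def insert_token(words, position, k=4):
    """Insert <mask> at `position`, fill it with one of the MLM's top-k
    predictions, and return the perturbed word list."""
    masked = words[:position] + [tok.mask_token] + words[position:]
    enc = tok(" ".join(masked), return_tensors="pt")
    mask_idx = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_idx]
    top_ids = torch.topk(logits, k).indices
    choice = top_ids[torch.randint(k, (1,))].item()   # uniform pick among top-k
    filled = tok.decode([choice]).strip()
    return words[:position] + [filled] + words[position:]
```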
Synthesize Omission Error by Deletion (e_del) We delete a random span of tokens from a raw text sentence. The start of the span is drawn uniformly over the token indices, and the span length is drawn from a Poisson distribution (λ_d = 1.5). Our primary aim is to mimic Omission errors; however, depending on the specific words dropped, this technique can also create Mistranslation and all Fluency errors.
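A minimal sketch of this operation, under the assumption that it acts on a word-level token list:

```python
import numpy as np

def delete_span(words, lam=1.5):
    """Illustrative omission synthesis: drop a random span whose length is drawn
    from a Poisson distribution (lambda_d = 1.5); the start index is uniform."""
    if len(words) < 2:
        return words
    span = max(1, np.random.poisson(lam))
    span = min(span, len(words) - 1)               # keep at least one token
    start = np.random.randint(0, len(words) - span + 1)
    return words[:start] + words[start + span:]
```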
Synthesize Phrasal Error by Replacement (e_repl) Sometimes specific terms in a reference sentence are systematically misphrased in generated samples. This is difficult to simulate directly; instead, we use either an MLM or a seq-to-seq model to replace a segment of tokens in the original text. For the first approach, the replaced span is always a single token, which is first replaced with a <mask> token; we then use an MLM to fill the blank, as in the insertion operation. For the second approach, we use a denoising seq-to-seq model (e.g. mBART) to generate tokens for the mask tags, randomly choosing the starting index of the span and drawing the span length from a Poisson distribution (λ_d = 1.5). The denoising seq-to-seq model synthesizes fluent sentences with Addition and Mistranslation errors.
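The single-token MLM route can be sketched by composing the deletion and mask-filling steps shown above; `insert_token` refers to the illustrative helper from the insertion sketch, and the span-level seq-to-seq route is not shown.

```python
def replace_token(words, position, k=4):
    """Illustrative single-token replacement: drop the token at `position`, then
    let the MLM fill a <mask> at the same spot (see `insert_token` above).
    The paper's seq-to-seq variant instead masks a Poisson-length span
    (lambda_d = 1.5) and regenerates it with a denoising model such as mBART."""
    remainder = words[:position] + words[position + 1:]
    return insert_token(remainder, position, k=k)
```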
Synthesize Grammar and Other Errors by Swapping (e_swap) We swap two random words within a span of length λ_s in the sentence (λ_s = 4). Our primary aim is to generate grammatically incorrect sentences with disordered words, such as subject-verb disagreement; this further introduces Spelling and Punctuation errors.
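A minimal sketch of the swap operation, again assuming a word-level token list:

```python
import random

def swap_words(words, span=4):
    """Illustrative grammar-error synthesis: swap two words that lie at most
    `span` positions apart (lambda_s = 4 in the paper)."""
    if len(words) < 2:
        return words
    i = random.randrange(len(words) - 1)
    j = min(len(words) - 1, i + random.randint(1, span))
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out
```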
3.3 Assessing Severity Score
Following Freitag et al. (2021a), we consider an error severe if it alters the core meaning of the sentence. Prior work has suggested that sentence entailment is strongly correlated with semantic similarity (Khobragade et al., 2019). To capture changes in semantic meaning, we define a bidirectional entailment relation such that "text a entails b and b entails a" is equivalent to "a is semantically equivalent to b". Therefore, for a given perturbation function G_es applied to the sentence z_{i-1}, we measure the bidirectional entailment likelihood of z_{i-1} and z_i. If, after the transformation, z_i remains bidirectionally entailed with z_{i-1}, we can assume that G_es does not severely alter the semantic meaning of z_{i-1}, and the error is therefore minor. We define the entailment likelihood ρ(a, b) as the probability of predicting that a entails b; the formulation is given in eq. (2). Setting the threshold γ to 0.9 yields the highest inter-rater agreement of severity measures on our validation dataset. Following Freitag et al. (2021a), we assign −5 to severe errors and −1 to minor errors, so the score range is [−25, 0]. We evaluate severity at each perturbation of the sentence and cumulatively yield the training label s′ for the final synthesized sentence y′, s′ = \sum_{i=1}^{M} S_es(z_{i-1}, z_i).

S_{es}(z_{i-1}, z_i) = \begin{cases} -1, & \text{if } \rho(z_{i-1}, z_i) \ge \gamma \text{ and } \rho(z_i, z_{i-1}) \ge \gamma \\ -5, & \text{otherwise} \end{cases} \qquad (2)
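A minimal sketch of this severity scorer, assuming an off-the-shelf NLI checkpoint from Hugging Face (roberta-large-mnli is used here only as an example; the paper does not prescribe a particular entailment model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()
ENTAIL = 2  # entailment index for this checkpoint; check model.config.id2label for others

def entail_prob(a, b):
    """rho(a, b): probability that sentence a entails sentence b."""
    enc = nli_tok(a, b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli(**enc).logits[0], dim=-1)
    return probs[ENTAIL].item()

def severity(z_prev, z_cur, gamma=0.9):
    """Eq. (2): -1 (minor) if the edit preserves bidirectional entailment at
    threshold gamma = 0.9, otherwise -5 (severe)."""
    if entail_prob(z_prev, z_cur) >= gamma and entail_prob(z_cur, z_prev) >= gamma:
        return -1
    return -5
```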
3.4 Quality Prediction Model
As shown in fig. 1, we feed both the raw text x (reference) and the synthetic error sentence y′ into a pre-trained language model (e.g. BERT or RoBERTa). The resulting word embeddings are average-pooled to derive two sentence embeddings. We then use the approach proposed by RUSE (Shimanaka et al., 2018) to extract two features: 1) the element-wise product of the synthesized and reference sentence embeddings, and 2) their element-wise difference. Following the COMET (Rei et al., 2020) implementation, these features are concatenated into a single vector and fed into a feed-forward neural network regressor, f_θ.

However, a key distinction between our model and COMET is that we do not use the source input during training or inference. SESCORE can therefore generalize to other text generation tasks without requiring task-specific source data. Detailed architecture choices can be found in § 4.1.
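A minimal sketch of this architecture, assuming a Hugging Face encoder with average pooling; the encoder name, hidden size, and exact feature vector are illustrative assumptions and may differ from the released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QualityPredictor(nn.Module):
    """Sketch of f_theta: encode reference and candidate, average-pool,
    combine with RUSE-style product/difference features, regress a score."""
    def __init__(self, encoder_name="roberta-base", hidden=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.regressor = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def embed(self, enc):
        out = self.encoder(**enc).last_hidden_state           # (B, T, d)
        mask = enc["attention_mask"].unsqueeze(-1)             # (B, T, 1)
        return (out * mask).sum(1) / mask.sum(1)               # average pooling

    def forward(self, ref_enc, hyp_enc):
        r, h = self.embed(ref_enc), self.embed(hyp_enc)
        feats = torch.cat([r * h, torch.abs(r - h)], dim=-1)   # product + difference
        return self.regressor(feats).squeeze(-1)               # predicted score

# Usage sketch: the predicted score is trained against s' with MSE loss,
# mirroring the pre-training stage in fig. 1.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = QualityPredictor()
ref = tok(["He will not accept it because he will not like it"], return_tensors="pt", padding=True)
hyp = tok(["will He accept it because he hates the plan he will not fancy it"], return_tensors="pt", padding=True)
score = model(ref, hyp)
```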