On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch,∗† Rotem Dror,‡ and Dan Roth‡
†Google Research
‡University of Pennsylvania
dandeutsch@google.com
{rtmdrr,danroth}@seas.upenn.edu
Abstract
There is significant interest in developing evaluation metrics which accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time-consuming and expensive to collect or entirely unavailable in online applications. However, in this work, we demonstrate that these reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and we argue that they should not be used to measure progress on tasks like machine translation or summarization. We show how reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations: (1) the metrics can be optimized at test time to find the approximate best-possible output, (2) they are inherently biased toward models which are more similar to their own, and (3) they can be biased against higher-quality outputs, including those written by humans. Therefore, we recommend that reference-free metrics should be used as diagnostic tools for analyzing and understanding model behavior instead of measures of how well models perform a task, in which the goal is to achieve as high of a score as possible.1
1 Introduction
Automatically evaluating the quality of generated texts is essential for the development of natural language generation systems. The most common type of evaluation for generation tasks such as machine translation (MT) and summarization is done with reference-based automatic metrics, which evaluate a text by comparing it to a gold-standard reference text, usually written by humans (Papineni et al., 2002; Lin, 2004; Zhang et al., 2020; Sellam et al., 2020; Deutsch et al., 2021a; Zhang and Bansal, 2021, inter alia).
1 https://cogcomp.seas.upenn.edu/page/publication_view/991
∗ Work done while at the University of Pennsylvania.
[Figure 1 table; Prism-src score shown on the right]
Source:    Doch er ist nicht krank, er hat nur einen mächtigen Kater.
Reference: But he is not ill, he only has quite a hangover.   (-1.6)
Candidate: But he is not sick, he has only one powerful cat.  (-0.4)
Source:    Und mit Mann und Maus gegen Mainz verteidigt.
Reference: And threw everything they had into our defense.    (-4.8)
Candidate: And defended with man and mouse against Mainz.     (-0.4)
Figure 1: Here, Prism-src was optimized to generate the candidate translations. They are clearly wrong (Kater means both “cat” and “hangover”; mit Mann und Maus is an expression that means “with all means available”), but they have better Prism-src scores than the references. Comparing systems with reference-free metrics will favor systems that are more similar to the metrics’ underlying models rather than higher-quality output.
Reference texts can be expensive to collect or are entirely unavailable when there is a need to estimate the quality of text in real time, so there is an increased interest in developing automatic metrics that do not use references to evaluate text, commonly referred to as reference-free metrics (Louis and Nenkova, 2013; Fonseca et al., 2019; Scialom et al., 2019, 2021; Vasilyev et al., 2020; Rei et al., 2021, inter alia). While these metrics do not always achieve performance parity with their reference-based counterparts, their high correlations to human judgments suggest that reference-free evaluation is a promising direction of future research (Fonseca et al., 2019).2
2 Some reference-free metrics actually already outperform reference-based metrics (Freitag et al., 2021).
However, in this work, we demonstrate that reference-free evaluation metrics have inherent
limitations and argue that they should not be used to measure progress on tasks, even in domains in which no reference texts are available. Central to our argument is the idea that because reference-free metrics evaluate text using the same input provided to the generation models, they are either explicitly or implicitly using an underlying generation model to evaluate other models (§2). There are several implications of this, which we explore through an analysis of three reference-free evaluation metrics, Prism-src (Thompson and Post, 2020) and COMET-QE (Rei et al., 2021) for MT and QuestEval (Scialom et al., 2021) for summarization.
First, the metrics’ underlying models will achieve the best possible metric score by definition. Therefore, the “perfect” model is already known, and we show that it is possible to define simple approximate inference algorithms which use these models to find the approximate best output according to the metrics (§4, §5.1).
Then, the metrics have inherent, undesirable biases that originate from their underlying models. Not only do they favor the underlying models’ outputs, but they are also biased toward outputs from models which are similar to their own, and biased against higher-quality outputs, such as those written by humans (Fig. 1, §5.2, §5.3). Thus, if they were used as primary evaluation methods for a task, they would encourage other models to be more similar to their own and less human-like, an undesirable property of an evaluation metric.
Our recommendation is that reference-free metrics should not be used as methods for measuring progress on generation tasks such as MT, in which the goal is to achieve the highest possible value of the metric. Instead, they are better suited to be diagnostic statistics for analyzing model behavior with the understanding that they are inherently limited and biased (§6).
The contributions of this work include: (1) insight on the equivalence of reference-free metrics and generation models, (2) a demonstration that reference-free metrics’ values can be optimized at test time to achieve high-scoring outputs, and (3) an analysis that reveals reference-free metrics’ inherent biases and limitations.
2 Reference-Free Metrics as Models
Conditional text generation models can be viewed as a function $\theta(\cdot)$ which scores an output text $y \in \mathcal{Y}$ for some input text $x$. Then $\theta(\cdot)$ is used in conjunction with an inference procedure $f_\theta(\cdot)$ to find the best output at test time.3

$$\theta(x, y) \in \mathbb{R} \quad (1)$$
$$f_\theta(x) = \arg\max_{y \in \mathcal{Y}} \theta(x, y) \quad (2)$$

For instance, $\theta(\cdot)$ could be a learned sequence-to-sequence model and $f_\theta(\cdot)$ could be beam search.
The output of $f_\theta(\cdot)$, denoted $\hat{y}$, is typically evaluated by some automatic metric $\mathcal{M}$. Reference-based metrics do this by scoring $\hat{y}$ using some gold-standard text $y$ (which is not available to the model during inference) and the input $x$ (which is not always used). For instance, $\mathcal{M}_{\text{Ref-Based}}$ could calculate a BLEU score (Papineni et al., 2002) between the output translation $\hat{y}$ and the gold translation $y$.

$$\mathcal{M}_{\text{Ref-Based}}(x, \hat{y}, y) \in \mathbb{R} \quad (3)$$

In contrast, reference-free metrics calculate a score for $\hat{y}$ without $y$:

$$\mathcal{M}_{\text{Ref-Free}}(x, \hat{y}) \in \mathbb{R} \quad (4)$$
Such metrics include the three analyzed in this work, namely, Prism-src (Thompson and Post, 2020), COMET-QE (Rei et al., 2021), and QuestEval (Scialom et al., 2021).
Because $\theta(\cdot)$ and $\mathcal{M}_{\text{Ref-Free}}$ are both functions of only $x$ and $y$ (equivalently $\hat{y}$), $\mathcal{M}_{\text{Ref-Free}}$ itself can be viewed as a conditional generation model. For some metrics, such as Prism-src, this is explicitly stated, whereas others are implicitly making this assumption. This is not the case for reference-based metrics since they additionally require $y$ as input.
Since reference-free metrics are equivalent to generation models, there must exist some inference procedure which finds the best output text under the metric, denoted $g_{\mathcal{M}_{\text{Ref-Free}}}(\cdot)$:

$$g_{\mathcal{M}_{\text{Ref-Free}}}(x) = \arg\max_{y \in \mathcal{Y}} \mathcal{M}_{\text{Ref-Free}}(x, y) \quad (5)$$

Computing $g_{\mathcal{M}_{\text{Ref-Free}}}(\cdot)$ may be computationally expensive because $\mathcal{M}_{\text{Ref-Free}}$ may not support efficient inference. However, the inference procedure does always exist, and will return the best possible output according to the reference-free metric by definition.
3 In practice, $f_\theta(\cdot)$ finds the approximate best output, not the global maximum of $\theta(\cdot)$.
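To make the equivalence concrete, the sketch below (our own illustration, not code from the paper or from any of the metrics' releases) treats a reference-free metric as a black-box scoring function of $(x, y)$ and derives the inference procedure of Eq. (5) by searching over a candidate set; `toy_metric` is a hypothetical placeholder for Prism-src, COMET-QE, or QuestEval.

```python
from typing import Callable, Iterable

# A reference-free metric has the same signature as a conditional generation
# model's scoring function: it maps an (input, output) pair to a real number.
RefFreeMetric = Callable[[str, str], float]

def g_ref_free(metric: RefFreeMetric, x: str, candidates: Iterable[str]) -> str:
    """Approximate g_M(x) = argmax_y M(x, y) by searching over a candidate set.

    Enumerating all of Y is intractable, so `candidates` stands in for the
    output of an approximate inference procedure (beam search, reranking, ...).
    """
    return max(candidates, key=lambda y: metric(x, y))

# Hypothetical toy metric: any callable with this signature can be optimized
# at test time, because computing it never requires a reference translation.
def toy_metric(x: str, y: str) -> float:
    return -abs(len(x.split()) - len(y.split()))

print(g_ref_free(toy_metric, "Doch er ist nicht krank.",
                 ["But he is not sick.", "Cat.", "He is fine, thanks."]))
```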
We explore the implications of using a model to
evaluate other models by analyzing the behavior of
three different reference-free evaluation metrics on
two text generation tasks, MT and summarization.
3 Analysis Setup
Here, we discuss the datasets and metrics used in our analysis of reference-free metrics.
Datasets
Our MT experiments are run on the data collected for the WMT’19 metrics shared task (Ma et al., 2019), which includes reference translations and human-judged model outputs for 10 to 20 translation systems across 18 language pairs. The summarization experiments use the SummEval (Fabbri et al., 2021) and REALSumm (Bhandari et al., 2020) datasets, which consist of reference summaries and human-judged model outputs for 16 and 25 summarization models, respectively, collected from the CNN/DailyMail dataset (Nallapati et al., 2016).
Prism-src
Prism-src is a reference-free translation evaluation metric that scores a translated text according to the log-probability of the translation conditioned on the original source text under a learned sequence-to-sequence translation model (Thompson and Post, 2020). The model is a multilingual MT model, meaning it was trained using many different language pairs, so the same learned parameters can be used to score translations in various languages.
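As a rough illustration of this style of scoring (a sketch only; the actual Prism release ships its own multilingual model and scoring code), the snippet below computes the average token log-probability of a candidate conditioned on the source under a generic Hugging Face MT model, with the model name chosen purely as an example.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# Example stand-in model; the real Prism-src uses its own multilingual seq2seq model.
MODEL_NAME = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME).eval()

def avg_log_prob(source: str, candidate: str) -> float:
    """Average token log-probability of `candidate` given `source`,
    in the spirit of Prism-src's reference-free scoring."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(text_target=candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(**inputs, labels=labels).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[0].gather(1, labels[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.mean().item()

# Example: score the candidate translation from Figure 1.
print(avg_log_prob("Doch er ist nicht krank, er hat nur einen mächtigen Kater.",
                   "But he is not sick, he has only one powerful cat."))
```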
COMET-QE
COMET-QE (Rei et al., 2021) is a modification of the learned reference-based MT evaluation metric COMET (Rei et al., 2020). COMET embeds the candidate translation, source text, and reference translation using a cross-lingual encoder, creates a pooled feature representation using the three encodings, and trains the model end-to-end to predict human judgments of the quality of the candidate translation. COMET-QE uses the same architecture to predict a score for the candidate translation but only uses the candidate translation and source text to create the pooled feature representation, and is therefore reference-free.
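A high-level sketch of that recipe is shown below. It is our own simplified rendering (mean pooling, a small feed-forward head, an example encoder name), not the released COMET-QE code, and it would need to be trained on human judgments before its scores mean anything.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QualityEstimator(nn.Module):
    """COMET-QE-style regressor: encode source and candidate with a
    cross-lingual encoder, pool a joint feature vector, regress to a score."""

    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Joint features: [src; mt; |src - mt|; src * mt] -> scalar quality score.
        self.head = nn.Sequential(nn.Linear(4 * hidden, 256), nn.Tanh(), nn.Linear(256, 1))

    def embed(self, text: str) -> torch.Tensor:
        batch = self.tokenizer(text, return_tensors="pt", truncation=True)
        out = self.encoder(**batch).last_hidden_state
        return out.mean(dim=1).squeeze(0)  # mean-pooled sentence embedding

    def forward(self, source: str, candidate: str) -> torch.Tensor:
        src, mt = self.embed(source), self.embed(candidate)
        feats = torch.cat([src, mt, (src - mt).abs(), src * mt])
        return self.head(feats)  # trained end-to-end against human judgments

model = QualityEstimator()
print(model("Doch er ist nicht krank.", "But he is not sick.").item())  # untrained score
```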
QuestEval
Scialom et al. (2021) proposed a reference-free summarization metric called QuestEval which generates QA pairs from both the source document and generated summary, then scores the summary based on the proportion of those pairs which are answered correctly in the opposite text. The metric optionally includes a step in which the QA pairs generated from the source document are weighted based on a learned query weighting model. The query weighter was trained to predict the probability that a question is answered in the CNN/DailyMail reference summaries using a pre-trained QA model. We use the query weighter in our experiments since it improved the performance of QuestEval in Scialom et al. (2021).
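The sketch below captures only the aggregation logic under simplifying assumptions (exact-match answer comparison and a plain average of the two directions); the question-generation, QA, and query-weighting components are passed in as callables because the real QuestEval pipeline wraps its own pretrained models.

```python
from typing import Callable, Iterable, Tuple

QAPair = Tuple[str, str]  # (question, gold answer taken from the text it came from)

def questeval_like_score(
    source: str,
    summary: str,
    gen_qa: Callable[[str], Iterable[QAPair]],       # question-generation model
    answer: Callable[[str, str], str],               # QA model: (question, text) -> answer
    weight: Callable[[str], float] = lambda q: 1.0,  # optional learned query weighter
) -> float:
    """Score a summary by the (weighted) proportion of QA pairs from each text
    that are answered correctly in the opposite text."""

    def agreement(pairs: Iterable[QAPair], other_text: str, weigh) -> float:
        pairs = list(pairs)
        total = sum(weigh(q) for q, _ in pairs)
        if total == 0:
            return 0.0
        correct = sum(weigh(q) for q, gold in pairs if answer(q, other_text) == gold)
        return correct / total

    precision = agreement(gen_qa(summary), source, lambda q: 1.0)  # summary questions vs. source
    recall = agreement(gen_qa(source), summary, weight)            # weighted source questions vs. summary
    return (precision + recall) / 2.0
```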
Reference-Based Metrics
We analyze the reference-free metrics with respect to various reference-based metrics which have been demonstrated to have strong correlations to human judgments of translation/summary quality. BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) compare the two texts using n-gram overlap statistics. BERTScore calculates a quality score based on how similar the reference and candidate texts’ BERT (Devlin et al., 2019) embeddings are (Zhang et al., 2020). QAEval is a QA-based metric for summarization, which generates wh-questions from the reference summary and calculates a score for the candidate summary based on the proportion of questions answered correctly (Deutsch et al., 2021a). Finally, BLEURT is a learned MT metric which predicts a translation quality score using encoded BERT representations of the reference and candidate translations (Sellam et al., 2020).
Implementation details can be found in Appendix A.
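For readers who want to reproduce this kind of comparison, the snippet below scores the Figure 1 example with common open-source implementations of some of these metrics; these are illustrative defaults, not necessarily the configurations used in the paper (see its Appendix A).

```python
import sacrebleu
from bert_score import score as bert_score
from rouge_score import rouge_scorer

reference = "But he is not ill, he only has quite a hangover."
candidate = "But he is not sick, he has only one powerful cat."

bleu = sacrebleu.sentence_bleu(candidate, [reference]).score          # n-gram overlap
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
P, R, F1 = bert_score([candidate], [reference], lang="en")            # embedding similarity

print(f"BLEU={bleu:.1f}  ROUGE-L={rouge['rougeL'].fmeasure:.3f}  BERTScore-F1={F1.item():.3f}")
```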
4 Metric Optimization
Since reference-free metrics are equivalent to models, it is possible to define inference procedures which produce the best-possible outputs according to the metrics. Here, we discuss three such (approximate) inference procedures. Importantly, they can all be run at test time because they do not rely on a reference text.
4.1 Direct Optimization
If a reference-free metric scores a candidate output in a way that an efficient approximate inference procedure can be defined, then finding the best possible output under the metric is straightforward. Among the metrics analyzed in this paper, only Prism-src falls into this category. Because Prism-src assigns a score to a translation equal to its average log-probability under a learned sequence-to-sequence MT model, the approximate best translation under Prism-src can be found by running beam search with the MT model conditioned on the source text.
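A sketch of this direct optimization is below, with a generic Hugging Face MT checkpoint standing in for the actual Prism model: running beam search with the metric's own sequence-to-sequence model returns (approximately) the translation that the metric scores highest.

```python
from transformers import MarianMTModel, MarianTokenizer

# Stand-in generation model; Prism-src would use its own multilingual seq2seq model.
MODEL_NAME = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME).eval()

def approx_best_translation(source: str, beam_size: int = 5) -> str:
    """Beam search under the scoring model itself: when the metric is the
    model's (length-normalized) log-probability, this approximately finds
    the metric's highest-scoring output for the source."""
    inputs = tokenizer(source, return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=beam_size, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(approx_best_translation("Und mit Mann und Maus gegen Mainz verteidigt."))
```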
4.2 Greedy Optimization for Extractive Summarization
Summarization models are generally categorized as being either extractive or abstractive. Extractive systems create a summary by selecting k salient document sentences, whereas abstractive systems typically autoregressively generate a summary with a sequence-to-sequence model.
The best possible extractive summary according to a reference-free metric can be found by enumerating all possible summaries of k sentences, scoring them with the metric, and selecting the summary with the highest score. Since the number of k-sentence summaries may be large, this may be computationally expensive. However, an approximate inference procedure can be used instead.
Rather than enumerate all possible extractive summaries, the approximate inference algorithm constructs a summary by greedily selecting one sentence that increases the score of the metric the most (Lin and Bilmes, 2011). This is repeated until a target summary length of k sentences is reached, resulting in an approximation of the best possible summary under the reference-free metric.
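A minimal version of this greedy loop is sketched below; `metric` is any reference-free scorer of (document, summary) pairs (e.g., QuestEval), and sentence segmentation is assumed to be done beforehand.

```python
from typing import Callable, List

def greedy_extractive_summary(
    sentences: List[str],
    metric: Callable[[str, str], float],  # reference-free metric(document, summary)
    k: int = 3,
) -> List[str]:
    """Greedily add the document sentence that most increases the metric's score
    until the summary contains k sentences (cf. Lin and Bilmes, 2011)."""
    document = " ".join(sentences)
    selected: List[str] = []
    remaining = list(sentences)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda s: metric(document, " ".join(selected + [s])))
        selected.append(best)
        remaining.remove(best)
    return selected
```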
A near-identical procedure is commonly used for creating sentence-level labels for training extractive summarization models, except a reference-based evaluation metric, such as ROUGE, is typically used for scoring the sentences instead of a reference-free metric (Nallapati et al., 2017). The key difference is that the output summary from the reference-based procedure is used to train a model which later predicts k salient sentences during inference, whereas the reference-free procedure can be directly used during inference (i.e., without training) to pick the approximately best summary under the reference-free metric.
4.3 Reranking
Exact inference for any reference-free metric can be performed by enumerating all possible outputs, calculating the score of each one, and selecting the output with the highest score. However, it is almost certainly true that this is computationally intractable for any practical application of text generation due to the size of the output space.
To that end, we propose to use reranking (Shen et al., 2004; Och et al., 2004) as an approximate inference procedure in which a pre-trained model for the task at hand is used to restrict the search space to a small set of high-quality candidate outputs. These outputs are then scored and reranked using the reference-free metric to identify an approximately best output under the metric.
In practice, we identify a set of k high-quality outputs using standard beam search with pre-trained sequence-to-sequence summarization and MT models and a beam size of k. The top-k partial outputs sorted by their log-likelihood under the pre-trained models are kept at each step of beam search. The final outputs are then reranked by a reference-free metric. For summarization, we use BART (Lewis et al., 2020) trained on the CNN/DailyMail dataset. For MT, we use Facebook’s submission to the WMT’19 translation shared task (Ng et al., 2019). The model is available for en→de, de→en, en→ru, and ru→en.
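A sketch of the reranking procedure follows; the publicly available `facebook/bart-large-cnn` checkpoint is used as a convenient stand-in candidate generator (the paper's exact BART and WMT'19 systems may differ), and the reference-free metric is again a black-box callable.

```python
from typing import Callable, List
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Stand-in candidate generator for summarization; the paper reranks outputs
# from BART (CNN/DailyMail) and Facebook's WMT'19 submission.
MODEL_NAME = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

def rerank(source: str, metric: Callable[[str, str], float], k: int = 8) -> str:
    """Generate k beam-search candidates with a pretrained model, then return
    the candidate that the reference-free metric scores highest."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs, num_beams=k, num_return_sequences=k, max_new_tokens=128
    )
    candidates: List[str] = [
        tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids
    ]
    return max(candidates, key=lambda y: metric(source, y))
```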
5 Analysis
5.1 Approximate Inference Effectiveness
Although inference methods for the reference-free metrics can be defined, it is possible that they fail to find high-scoring outputs due to the complexity of the search problem. However, in this analysis, we show that the simple approximate inference procedures defined in §4 are effective at optimizing the metrics’ scores.
We compared the outputs obtained by the inference algorithms to those from systems included in the WMT’19, SummEval, and REALSumm datasets. Fig. 2 evaluates using the direct optimization procedure (§4.1) to select the best Prism-src output, Fig. 3 shows the results of using reranking (§4.3) to pick the best outputs according to COMET-QE, and Fig. 4 contains the results of using the greedy extractive procedure (§4.2) to optimize QuestEval. The figures also include the systems’ scores under the reference-based metrics BLEURT for MT and ROUGE for summarization. Other combinations of reference-based metrics and inference algorithms can be found in Appendix B.
In all MT language pairs and both summarization datasets, the inference algorithms produce the highest-scoring outputs under the reference-free metrics, often by a large margin. For example, reranking translations according to their COMET-QE scores on de→en results in a relative 38% improvement [...]