On the Limitations of Reference-Free Evaluations of Generated Text
Daniel Deutsch,∗† Rotem Dror,‡ and Dan Roth‡
†Google Research
‡University of Pennsylvania
dandeutsch@google.com
{rtmdrr,danroth}@seas.upenn.edu
Abstract
There is significant interest in developing evaluation metrics which accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time-consuming and expensive to collect or entirely unavailable in online applications. However, in this work, we demonstrate that these reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and we argue that they should not be used to measure progress on tasks like machine translation or summarization. We show how reference-free metrics are equivalent to using one generation model to evaluate another, which has several limitations: (1) the metrics can be optimized at test time to find the approximate best-possible output, (2) they are inherently biased toward models which are more similar to their own, and (3) they can be biased against higher-quality outputs, including those written by humans. Therefore, we recommend that reference-free metrics should be used as diagnostic tools for analyzing and understanding model behavior instead of measures of how well models perform a task, in which the goal is to achieve as high of a score as possible.1
1 Introduction
Automatically evaluating the quality of generated texts is essential for the development of natural language generation systems. The most common type of evaluation for generation tasks such as machine translation (MT) and summarization is done with reference-based automatic metrics, which evaluate a text by comparing it to a gold-standard reference text, usually written by humans (Papineni et al., 2002; Lin, 2004; Zhang et al., 2020; Sellam et al., 2020; Deutsch et al., 2021a; Zhang and Bansal, 2021, inter alia).
1 https://cogcomp.seas.upenn.edu/page/publication_view/991
∗ Work done while at the University of Pennsylvania.
[Figure 1 table; Prism-src score shown on the right]
Source:    Doch er ist nicht krank, er hat nur einen mächtigen Kater.
Reference: But he is not ill, he only has quite a hangover.   (-1.6)
Candidate: But he is not sick, he has only one powerful cat.  (-0.4)
Source:    Und mit Mann und Maus gegen Mainz verteidigt.
Reference: And threw everything they had into our defense.    (-4.8)
Candidate: And defended with man and mouse against Mainz.     (-0.4)
Figure 1: Here, Prism-src was optimized to generate the candidate translations. They are clearly wrong (Kater means both “cat” and “hangover”; mit Mann und Maus is an expression that means “with all means available”), but they have better Prism-src scores than the references. Comparing systems with reference-free metrics will favor systems that are more similar to the metrics’ underlying models rather than higher-quality output.
Reference texts can be expensive to collect or are entirely unavailable when there is a need to estimate the quality of text in real time, so there is an increased interest in developing automatic metrics that do not use references to evaluate text, commonly referred to as reference-free metrics (Louis and Nenkova, 2013; Fonseca et al., 2019; Scialom et al., 2019, 2021; Vasilyev et al., 2020; Rei et al., 2021, inter alia). While these metrics do not always achieve performance parity with their reference-based counterparts, their high correlations to human judgments suggest that reference-free evaluation is a promising direction of future research (Fonseca et al., 2019).2
2 Some reference-free metrics actually already outperform reference-based metrics (Freitag et al., 2021).
However, in this work, we demonstrate that reference-free evaluation metrics have inherent
limitations and argue that they should not be used to measure progress on tasks, even in domains in which no reference texts are available. Central to our argument is the idea that because reference-free metrics evaluate text using the same input provided to the generation models, they are either explicitly or implicitly using an underlying generation model to evaluate other models (§2). There are several implications of this, which we explore through an analysis of three reference-free evaluation metrics, Prism-src (Thompson and Post, 2020) and COMET-QE (Rei et al., 2021) for MT and QuestEval (Scialom et al., 2021) for summarization.
First, the metrics’ underlying models will achieve the best possible metric score by definition. Therefore, the “perfect” model is already known, and we show that it is possible to define simple approximate inference algorithms which use these models to find the approximate best output according to the metrics (§4, §5.1).
Then, the metrics have inherent, undesirable biases that originate from their underlying models. Not only do they favor the underlying models’ outputs, but they are also biased toward outputs from models which are similar to their own, and biased against higher-quality outputs, such as those written by humans (Fig. 1, §5.2, §5.3). Thus, if they were used as primary evaluation methods for a task, they would encourage other models to be more similar to their own and less human-like, an undesirable property of an evaluation metric.
Our recommendation is that reference-free metrics should not be used as methods for measuring progress on generation tasks such as MT, in which the goal is to achieve the highest possible value of the metric. Instead, they are better suited to be diagnostic statistics for analyzing model behavior with the understanding that they are inherently limited and biased (§6).
The contributions of this work include: (1) insight on the equivalence of reference-free metrics and generation models, (2) a demonstration that reference-free metrics’ values can be optimized at test time to achieve high-scoring outputs, and (3) an analysis that reveals reference-free metrics’ inherent biases and limitations.
2 Reference-Free Metrics as Models
Conditional text generation models can be viewed as a function $\theta(\cdot)$ which scores an output text $y \in \mathcal{Y}$ for some input text $x$. Then $\theta(\cdot)$ is used in conjunction with an inference procedure $f_\theta(\cdot)$ to find the best output at test time.3

$$\theta(x, y) \in \mathbb{R} \quad (1)$$
$$f_\theta(x) = \arg\max_{y \in \mathcal{Y}} \theta(x, y) \quad (2)$$

For instance, $\theta(\cdot)$ could be a learned sequence-to-sequence model and $f_\theta(\cdot)$ could be beam search.
The output of $f_\theta(\cdot)$, denoted $\hat{y}$, is typically evaluated by some automatic metric $\mathcal{M}$. Reference-based metrics do this by scoring $\hat{y}$ using some gold-standard text $y$ (which is not available to the model during inference) and the input $x$ (which is not always used). For instance, $\mathcal{M}_{\text{Ref-Based}}$ could calculate a BLEU score (Papineni et al., 2002) between the output translation $\hat{y}$ and the gold translation $y$.

$$\mathcal{M}_{\text{Ref-Based}}(x, \hat{y}, y) \in \mathbb{R} \quad (3)$$

In contrast, reference-free metrics calculate a score for $\hat{y}$ without $y$:

$$\mathcal{M}_{\text{Ref-Free}}(x, \hat{y}) \in \mathbb{R} \quad (4)$$
Such metrics include the three analyzed in this work, namely, Prism-src (Thompson and Post, 2020), COMET-QE (Rei et al., 2021), and QuestEval (Scialom et al., 2021).
Because $\theta(\cdot)$ and $\mathcal{M}_{\text{Ref-Free}}$ are both functions of only $x$ and $y$ (equivalently $\hat{y}$), $\mathcal{M}_{\text{Ref-Free}}$ itself can be viewed as a conditional generation model. For some metrics, such as Prism-src, this is explicitly stated, whereas others are implicitly making this assumption. This is not the case for reference-based metrics since they additionally require $y$ as input.
Since reference-free metrics are equivalent to generation models, there must exist some inference procedure which finds the best output text under the metric, denoted $g_{\mathcal{M}_{\text{Ref-Free}}}(\cdot)$:

$$g_{\mathcal{M}_{\text{Ref-Free}}}(x) = \arg\max_{y \in \mathcal{Y}} \mathcal{M}_{\text{Ref-Free}}(x, y) \quad (5)$$

Computing $g_{\mathcal{M}_{\text{Ref-Free}}}(\cdot)$ may be computationally expensive because $\mathcal{M}_{\text{Ref-Free}}$ may not support efficient inference. However, the inference procedure does always exist, and will return the best possible output according to the reference-free metric by definition.
3 In practice, $f_\theta(\cdot)$ finds the approximate best output, not the global maximum of $\theta(\cdot)$.
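To make the equivalence concrete, the sketch below (our own illustration, not code from the paper or from any of the metrics' releases) treats a reference-free metric as a black-box scoring function of $(x, y)$ and derives the inference procedure of Eq. (5) by searching over a candidate set; `toy_metric` is a hypothetical placeholder for Prism-src, COMET-QE, or QuestEval.

```python
from typing import Callable, Iterable

# A reference-free metric has the same signature as a conditional generation
# model's scoring function: it maps an (input, output) pair to a real number.
RefFreeMetric = Callable[[str, str], float]

def g_ref_free(metric: RefFreeMetric, x: str, candidates: Iterable[str]) -> str:
    """Approximate g_M(x) = argmax_y M(x, y) by searching over a candidate set.

    Enumerating all of Y is intractable, so `candidates` stands in for the
    output of an approximate inference procedure (beam search, reranking, ...).
    """
    return max(candidates, key=lambda y: metric(x, y))

# Hypothetical toy metric: any callable with this signature can be optimized
# at test time, because computing it never requires a reference translation.
def toy_metric(x: str, y: str) -> float:
    return -abs(len(x.split()) - len(y.split()))

print(g_ref_free(toy_metric, "Doch er ist nicht krank.",
                 ["But he is not sick.", "Cat.", "He is fine, thanks."]))
```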
We explore the implications of using a model to
evaluate other models by analyzing the behavior of
three different reference-free evaluation metrics on
two text generation tasks, MT and summarization.
3 Analysis Setup
Here, we discuss the datasets and metrics used in our analysis of reference-free metrics.
Datasets
Our MT experiments are run on the data collected for the WMT’19 metrics shared task (Ma et al., 2019), which includes reference translations and human-judged model outputs for 10 to 20 translation systems across 18 language pairs. The summarization experiments use the SummEval (Fabbri et al., 2021) and REALSumm (Bhandari et al., 2020) datasets, which consist of reference summaries and human-judged model outputs for 16 and 25 summarization models, respectively, collected from the CNN/DailyMail dataset (Nallapati et al., 2016).
Prism-src
Prism-src is a reference-free translation evaluation metric that scores a translated text according to the log-probability of the translation conditioned on the original source text under a learned sequence-to-sequence translation model (Thompson and Post, 2020). The model is a multilingual MT model, meaning it was trained using many different language pairs, so the same learned parameters can be used to score translations in various languages.
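As a rough illustration of this style of scoring (a sketch only; the actual Prism release ships its own multilingual model and scoring code), the snippet below computes the average token log-probability of a candidate conditioned on the source under a generic Hugging Face MT model, with the model name chosen purely as an example.

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

# Example stand-in model; the real Prism-src uses its own multilingual seq2seq model.
MODEL_NAME = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME).eval()

def avg_log_prob(source: str, candidate: str) -> float:
    """Average token log-probability of `candidate` given `source`,
    in the spirit of Prism-src's reference-free scoring."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(text_target=candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(**inputs, labels=labels).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[0].gather(1, labels[0].unsqueeze(1)).squeeze(1)
    return token_log_probs.mean().item()

# Example: score the candidate translation from Figure 1.
print(avg_log_prob("Doch er ist nicht krank, er hat nur einen mächtigen Kater.",
                   "But he is not sick, he has only one powerful cat."))
```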
COMET-QE
COMET-QE (Rei et al., 2021) is a modification of the learned reference-based MT evaluation metric COMET (Rei et al., 2020). COMET embeds the candidate translation, source text, and reference translation using a cross-lingual encoder, creates a pooled feature representation using the three encodings, and trains the model end-to-end to predict human judgments of the quality of the candidate translation. COMET-QE uses the same architecture to predict a score for the candidate translation but only uses the candidate translation and source text to create the pooled feature representation, and is therefore reference-free.
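A high-level sketch of that recipe is shown below. It is our own simplified rendering (mean pooling, a small feed-forward head, an example encoder name), not the released COMET-QE code, and it would need to be trained on human judgments before its scores mean anything.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QualityEstimator(nn.Module):
    """COMET-QE-style regressor: encode source and candidate with a
    cross-lingual encoder, pool a joint feature vector, regress to a score."""

    def __init__(self, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Joint features: [src; mt; |src - mt|; src * mt] -> scalar quality score.
        self.head = nn.Sequential(nn.Linear(4 * hidden, 256), nn.Tanh(), nn.Linear(256, 1))

    def embed(self, text: str) -> torch.Tensor:
        batch = self.tokenizer(text, return_tensors="pt", truncation=True)
        out = self.encoder(**batch).last_hidden_state
        return out.mean(dim=1).squeeze(0)  # mean-pooled sentence embedding

    def forward(self, source: str, candidate: str) -> torch.Tensor:
        src, mt = self.embed(source), self.embed(candidate)
        feats = torch.cat([src, mt, (src - mt).abs(), src * mt])
        return self.head(feats)  # trained end-to-end against human judgments

model = QualityEstimator()
print(model("Doch er ist nicht krank.", "But he is not sick.").item())  # untrained score
```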
QuestEval
Scialom et al. (2021) proposed a reference-free summarization metric called QuestEval which generates QA pairs from both the source document and generated summary, then scores the summary based on the proportion of those pairs which are answered correctly in the opposite text. The metric optionally includes a step in which the QA pairs generated from the source document are weighted based on a learned query weighting model. The query weighter was trained to predict the probability that a question is answered in the CNN/DailyMail reference summaries using a pre-trained QA model. We use the query weighter in our experiments since it improved the performance of QuestEval in Scialom et al. (2021).
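The sketch below captures only the aggregation logic under simplifying assumptions (exact-match answer comparison and a plain average of the two directions); the question-generation, QA, and query-weighting components are passed in as callables because the real QuestEval pipeline wraps its own pretrained models.

```python
from typing import Callable, Iterable, Tuple

QAPair = Tuple[str, str]  # (question, gold answer taken from the text it came from)

def questeval_like_score(
    source: str,
    summary: str,
    gen_qa: Callable[[str], Iterable[QAPair]],       # question-generation model
    answer: Callable[[str, str], str],               # QA model: (question, text) -> answer
    weight: Callable[[str], float] = lambda q: 1.0,  # optional learned query weighter
) -> float:
    """Score a summary by the (weighted) proportion of QA pairs from each text
    that are answered correctly in the opposite text."""

    def agreement(pairs: Iterable[QAPair], other_text: str, weigh) -> float:
        pairs = list(pairs)
        total = sum(weigh(q) for q, _ in pairs)
        if total == 0:
            return 0.0
        correct = sum(weigh(q) for q, gold in pairs if answer(q, other_text) == gold)
        return correct / total

    precision = agreement(gen_qa(summary), source, lambda q: 1.0)  # summary questions vs. source
    recall = agreement(gen_qa(source), summary, weight)            # weighted source questions vs. summary
    return (precision + recall) / 2.0
```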
Reference-Based Metrics
We analyze the reference-free metrics with respect to various reference-based metrics which have been demonstrated to have strong correlations to human judgments of translation/summary quality. BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) compare the two texts using n-gram overlap statistics. BERTScore calculates a quality score based on how similar the reference and candidate texts’ BERT (Devlin et al., 2019) embeddings are (Zhang et al., 2020). QAEval is a QA-based metric for summarization, which generates wh-questions from the reference summary and calculates a score for the candidate summary based on the proportion of questions answered correctly (Deutsch et al., 2021a). Finally, BLEURT is a learned MT metric which predicts a translation quality score using encoded BERT representations of the reference and candidate translations (Sellam et al., 2020).
Implementation details can be found in Appendix A.
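For readers who want to reproduce this kind of comparison, the snippet below scores the Figure 1 example with common open-source implementations of some of these metrics; these are illustrative defaults, not necessarily the configurations used in the paper (see its Appendix A).

```python
import sacrebleu
from bert_score import score as bert_score
from rouge_score import rouge_scorer

reference = "But he is not ill, he only has quite a hangover."
candidate = "But he is not sick, he has only one powerful cat."

bleu = sacrebleu.sentence_bleu(candidate, [reference]).score          # n-gram overlap
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)
P, R, F1 = bert_score([candidate], [reference], lang="en")            # embedding similarity

print(f"BLEU={bleu:.1f}  ROUGE-L={rouge['rougeL'].fmeasure:.3f}  BERTScore-F1={F1.item():.3f}")
```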
4 Metric Optimization
Since reference-free metrics are equivalent to models, it is possible to define inference procedures which produce the best-possible outputs according to the metrics. Here, we discuss three such (approximate) inference procedures. Importantly, they can all be run at test time because they do not rely on a reference text.
4.1 Direct Optimization
If a reference-free metric scores a candidate output in a way that an efficient approximate inference procedure can be defined, then finding the best possible output under the metric is straightforward. Among the metrics analyzed in this paper, only Prism-src falls into this category. Because Prism-src assigns a score to a translation equal to its average log-probability under a learned sequence-to-sequence MT model, the approximate best translation under Prism-src can be found by running beam search with the MT model conditioned on the source text.
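A sketch of this direct optimization is below, with a generic Hugging Face MT checkpoint standing in for the actual Prism model: running beam search with the metric's own sequence-to-sequence model returns (approximately) the translation that the metric scores highest.

```python
from transformers import MarianMTModel, MarianTokenizer

# Stand-in generation model; Prism-src would use its own multilingual seq2seq model.
MODEL_NAME = "Helsinki-NLP/opus-mt-de-en"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME).eval()

def approx_best_translation(source: str, beam_size: int = 5) -> str:
    """Beam search under the scoring model itself: when the metric is the
    model's (length-normalized) log-probability, this approximately finds
    the metric's highest-scoring output for the source."""
    inputs = tokenizer(source, return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=beam_size, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(approx_best_translation("Und mit Mann und Maus gegen Mainz verteidigt."))
```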
4.2 Greedy Optimization for Extractive Summarization
Summarization models are generally categorized as being either extractive or abstractive. Extractive systems create a summary by selecting k salient document sentences, whereas abstractive systems typically autoregressively generate a summary with a sequence-to-sequence model.
The best possible extractive summary according to a reference-free metric can be found by enumerating all possible summaries of k sentences, scoring them with the metric, and selecting the summary with the highest score. Since the number of k-sentence summaries may be large, this may be computationally expensive. However, an approximate inference procedure can be used instead.
Rather than enumerate all possible extractive summaries, the approximate inference algorithm constructs a summary by greedily selecting one sentence that increases the score of the metric the most (Lin and Bilmes, 2011). This is repeated until a target summary length of k sentences is reached, resulting in an approximation of the best possible summary under the reference-free metric.
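A minimal version of this greedy loop is sketched below; `metric` is any reference-free scorer of (document, summary) pairs (e.g., QuestEval), and sentence segmentation is assumed to be done beforehand.

```python
from typing import Callable, List

def greedy_extractive_summary(
    sentences: List[str],
    metric: Callable[[str, str], float],  # reference-free metric(document, summary)
    k: int = 3,
) -> List[str]:
    """Greedily add the document sentence that most increases the metric's score
    until the summary contains k sentences (cf. Lin and Bilmes, 2011)."""
    document = " ".join(sentences)
    selected: List[str] = []
    remaining = list(sentences)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda s: metric(document, " ".join(selected + [s])))
        selected.append(best)
        remaining.remove(best)
    return selected
```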
A near-identical procedure is commonly used for creating sentence-level labels for training extractive summarization models, except a reference-based evaluation metric, such as ROUGE, is typically used for scoring the sentences instead of a reference-free metric (Nallapati et al., 2017). The key difference is that the output summary from the reference-based procedure is used to train a model which later predicts k salient sentences during inference, whereas the reference-free procedure can be directly used during inference (i.e., without training) to pick the approximately best summary under the reference-free metric.
4.3 Reranking
Exact inference for any reference-free metric can be performed by enumerating all possible outputs, calculating the score of each one, and selecting the output with the highest score. However, it is almost certainly true that this is computationally intractable for any practical application of text generation due to the size of the output space.
To that end, we propose to use reranking (Shen et al., 2004; Och et al., 2004) as an approximate inference procedure in which a pre-trained model for the task at hand is used to restrict the search space to a small set of high-quality candidate outputs. These outputs are then scored and reranked using the reference-free metric to identify an approximately best output under the metric.
In practice, we identify a set of k high-quality outputs using standard beam search with pre-trained sequence-to-sequence summarization and MT models and a beam size of k. The top-k partial outputs sorted by their log-likelihood under the pre-trained models are kept at each step of beam search. The final outputs are then reranked by a reference-free metric. For summarization, we use BART (Lewis et al., 2020) trained on the CNN/DailyMail dataset. For MT, we use Facebook’s submission to the WMT’19 translation shared task (Ng et al., 2019). The model is available for en→de, de→en, en→ru, and ru→en.
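A sketch of the reranking procedure follows; the publicly available `facebook/bart-large-cnn` checkpoint is used as a convenient stand-in candidate generator (the paper's exact BART and WMT'19 systems may differ), and the reference-free metric is again a black-box callable.

```python
from typing import Callable, List
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Stand-in candidate generator for summarization; the paper reranks outputs
# from BART (CNN/DailyMail) and Facebook's WMT'19 submission.
MODEL_NAME = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

def rerank(source: str, metric: Callable[[str, str], float], k: int = 8) -> str:
    """Generate k beam-search candidates with a pretrained model, then return
    the candidate that the reference-free metric scores highest."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs, num_beams=k, num_return_sequences=k, max_new_tokens=128
    )
    candidates: List[str] = [
        tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids
    ]
    return max(candidates, key=lambda y: metric(source, y))
```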
5 Analysis
5.1 Approximate Inference Effectiveness
Although inference methods for the reference-free metrics can be defined, it is possible that they fail to find high-scoring outputs due to the complexity of the search problem. However, in this analysis, we show that the simple approximate inference procedures defined in §4 are effective at optimizing the metrics’ scores.
We compared the outputs obtained by the inference algorithms to those from systems included in the WMT’19, SummEval, and REALSumm datasets. Fig. 2 evaluates using the direct optimization procedure (§4.1) to select the best Prism-src output, Fig. 3 shows the results of using reranking (§4.3) to pick the best outputs according to COMET-QE, and Fig. 4 contains the results of using the greedy extractive procedure (§4.2) to optimize QuestEval. The figures also include the systems’ scores under the reference-based metrics BLEURT for MT and ROUGE for summarization. Other combinations of reference-based metrics and inference algorithms can be found in Appendix B.
In all MT language pairs and both summarization datasets, the inference algorithms produce the highest-scoring outputs under the reference-free metrics, often by a large margin. For example, reranking translations according to their COMET-QE scores on de→en results in a relative 38% improvement [...]