On the Effectiveness of Automated Metrics for Text Generation Systems
Pius von Däniken and Jan Deriu and Don Tuggener and Mark Cieliebak
Centre for Artificial Intelligence
ZHAW School of Engineering
{vode,deri,tuge,ciel}@zhaw.ch
Abstract
A major challenge in the field of Text Generation is evaluation, because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.
1 Introduction
The field of Text Generation is a subfield of Natural Language Processing (Celikyilmaz et al., 2020). We define text generation tasks as those where many different texts may constitute an optimal solution to a given problem. Examples are automated summarization, machine translation, dialogue systems, paraphrasing, caption generation, or natural language generation.
One unsolved issue in the field of Text Generation is the evaluation, be it human or automated evaluation. Human evaluation is more reliable but more cost- and time-intensive, while automated evaluation is erroneous but performed in a fraction of the time and cost (Amidei et al., 2019; Hashimoto et al., 2019; Celikyilmaz et al., 2020; Deriu et al., 2021). One of the main issues is the lack of theoretically founded guidelines for running an evaluation. For instance, how many samples are needed to be able to significantly distinguish the performance of two systems? How do we handle the errors made by automated metrics? Under which circumstances is it still possible to run an evaluation campaign that yields significant results?
[Figure 1 (plot): measurable difference (y-axis, 0.0 to 0.4) versus metric accuracy (x-axis, 0.6 to 0.9), with one curve for 1000 metric ratings, one for 10000 metric ratings, and a horizontal 2% goal line.]

Figure 1: Measurable difference of the performance of two text generation systems depending on the accuracy of a binary metric. We add the 2% line as discussed in the text.
In this work, we make a first step towards developing such a theoretical foundation, which can be used as a guideline to answer the above questions. For this, we consider what we call binary metrics. These are metrics that classify the output of a text generation system as being either adequate or inadequate. This allows us to measure the performance of a text generation system as the ratio of adequate responses it generates. Furthermore, it allows us to reason about the performance of the metric in terms of true positives and true negatives.
#Human Ratings    #Automated Ratings
                      0      1k      5k     10k     50k
     0            1.000   0.109   0.049   0.035   0.015
    10            0.379   0.106   0.049   0.034   0.015
   100            0.134   0.085   0.046   0.033   0.015
    1k            0.043   0.040   0.032   0.027   0.015
  2.5k            0.027   0.026   0.024  *0.020   0.013
    5k           *0.019   0.019   0.018   0.017   0.012

Table 1: Mixed Case: Measurable difference for a metric with an accuracy of 70%, depending on the number of human ratings mixed with the number of automated ratings. The values discussed in the text are marked with an asterisk.
For this setting, we derive various theoretically founded guarantees and guidelines that can be used to run an evaluation campaign. For instance, consider Figure 1 (derived by our theory). If we assume a binary metric that has an accuracy of 70%, and if we have access to 1000 automatically rated samples (blue line), then we can reliably distinguish between two text generation systems that have a difference in performance of 10 percentage points. To distinguish two systems with a smaller difference, for instance of 2%, we would need a better metric and many more samples. That is, we need for instance a metric with an accuracy of at least 85% and 10000 samples automatically rated by this metric.
Our theory provides analogous assessments of how many human evaluations are required to reliably distinguish text generation systems. When we say that the performance of two systems can be reliably distinguished, we mean that the difference in their performance is statistically significant. Similarly, a measurable difference in performance is one that leads to statistical significance given the experiment parameters.
In addition, our theory allows for mixing human and automated evaluation. For this, consider Table 1, where we depict the measurable difference for a metric with 70% accuracy depending on the mix of human and automated ratings. For instance, to distinguish two text generators with a difference of 2 percentage points, we need either at least 5000 human ratings, or 2500 human ratings mixed with 10'000 automated ratings.
Our theoretical framework allows us to design our evaluation with theoretical guarantees regarding the significance of the resulting measurements. Given a monetary budget and our theory, one can decide whether to invest in more human annotations, in developing better automated metrics, or in sampling more automated ratings. Our approach can also be used to showcase the limits of a given setting: for instance, in Figure 1 we see that using only 1000 automated ratings leads to a minimally measurable difference of 4%, even with a perfect metric.
In the remainder of the paper, we derive the theoretical framework for binary metrics and apply it to two showcases: the WMT-21 shared task (Freitag et al., 2021b) and the Spot-The-Bot evaluation (Deriu et al., 2020). We analyse how well these evaluations adhere to the constraints imposed by our theory and demonstrate how the quality of the evaluations can be improved. To serve the community, we will release the formulas as code and as a web interface¹ that allows practitioners to enter their evaluation settings and receive an analysis of the measurable differences in their settings.
2 Definitions
In this section, we introduce the basic definitions that we need for the derivations. First, we define the general setting of Text Generation, then we cover binary metrics, and finally we describe text generation systems.
2.1 General Setting
Definition 1 (Text Generation Environment) A text generation environment is a triple $\langle \mathcal{I}, \mathcal{O}, \Phi \rangle$, where $\mathcal{I}$ denotes the set of inputs, $\mathcal{O}$ the output space, and $\Phi : \mathcal{I} \times \mathcal{O} \to \{0, 1\}$ an oracle that assesses whether an output is adequate for a given input.

For instance, for Machine Translation $\mathcal{I}$ denotes all sentences in the source language and $\mathcal{O}$ all sentences in the target language, while for a chatbot $\mathcal{I}$ contains all dialogue contexts and $\mathcal{O}$ all possible responses in a dialogue. Note that $\mathcal{I}$ and $\mathcal{O}$ can be of infinite size. We regard $\Phi$ as an oracle that segments the output space for a given input into adequate and inadequate outputs².
Definition 2 (Adequate Responses) For each $i \in \mathcal{I}$, we call $\mathcal{R}^i_+ = \{o \in \mathcal{O} \mid \Phi(i, o) = 1\}$ the set of adequate responses for input $i$, and $\mathcal{R}^i_- = \{o \in \mathcal{O} \mid \Phi(i, o) = 0\}$ the set of inadequate responses.
2.2 Binary Metric
In this work, we focus on binary metrics, i.e., metrics that classify the output of a text generation system as being either adequate or inadequate. The choice of binary metrics allows us to reason about the performance of a text generation (TG) system as the ratio of adequate responses³.
¹https://github.com/vodezhaw/binary_metric_tool
²In most real-world settings, $\Phi$ is approximated with human ratings.
³This lies in contrast with metrics that simply return a scalar value (e.g., BLEU (Papineni et al., 2002), COMET (Rei et al., 2020), USR (Mehri and Eskenazi, 2020)), which is difficult to interpret. For instance, if BLEU returns a value of 0.34 for one system and 0.32 for the second system, can we really state that the first system is better than the second (Callison-Burch et al., 2006)? We can use these types of metrics to create binary metrics by selecting a threshold that defines the border between adequate and inadequate responses (e.g., all COMET values above 0.78 are regarded as adequate). This introduces errors, which can be measured.
We first define the notion of a binary metric; then we show what it means for a binary metric to be error-free or error-prone with regard to $\Phi$.
Definition 3 (Binary Metric) A binary metric $M_b$ is a function $M_b : \mathcal{I} \times \mathcal{O} \to \{0, 1\}$ which takes a pair of input and output and returns either 0 or 1. We interpret a return value of 1 as claiming that the output is adequate for the given input, and 0 as claiming that the output is not adequate.
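In practice, as footnote 3 notes, such a binary metric can be obtained by thresholding a scalar metric. A minimal sketch of this wrapping (the scoring function is a stand-in; the 0.78 threshold echoes the footnote's COMET example):

```python
from typing import Callable

def make_binary_metric(score_fn: Callable[[str, str], float],
                       threshold: float) -> Callable[[str, str], int]:
    """Wrap a scalar metric into a binary metric M_b: return 1
    (adequate) iff the score exceeds the threshold."""
    def m_b(inp: str, out: str) -> int:
        return 1 if score_fn(inp, out) > threshold else 0
    return m_b

# Hypothetical usage with a stand-in scorer.
m_b = make_binary_metric(lambda i, o: 0.84, threshold=0.78)
print(m_b("source sentence", "candidate translation"))  # -> 1
```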
Next, we define the notion of an error-free metric, i.e., how we expect the metric to behave in the optimal case (namely, its ability to replicate the oracle $\Phi$).
Definition 4 (Error-Free Binary Metric) $M^*_b$ is an error-free binary metric $\iff \forall i \in \mathcal{I}, o \in \mathcal{O} : (M^*_b(i, o) = 1 \iff o \in \mathcal{R}^i_+)$.
That is, an error-free binary metric always rates an adequate output as 1 and an inadequate output as 0. Since most metrics do not perform perfectly with respect to $\Phi$, we formulate the cases where a metric makes mistakes, and the calculation of its performance, as follows.
Definition 5 (($\rho, \eta$)-Optimal Binary Metric) Let $\rho, \eta \in [0, 1]$ and $M_b$ a binary metric. Then $M_b$ is a $(\rho, \eta)$-optimal binary metric if $Pr[M_b(i, o) = 1 \mid o \in \mathcal{R}^i_+] = \rho$ and $Pr[M_b(i, o) = 0 \mid o \notin \mathcal{R}^i_+] = \eta$.
That is, we define the performance of a binary metric as its probability of correctly classifying an output as being adequate or not. Thus, the error of a binary metric can be assessed similarly to the error of a binary classifier, i.e., $\rho$ is equivalent to the true positive rate and $\eta$ to the true negative rate. Note that $\rho = \eta = 1$ defines an error-free binary metric, whereas all other cases are error-prone. In the case where $\rho$ and $\eta$ have the same value, $\rho = \eta$, this value is the accuracy of $M^{\rho,\eta}_b$. Note that in practice, $\rho$ and $\eta$ must be estimated from data.
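To see why, note that the probability of a correct rating decomposes over adequate and inadequate outputs (a one-line check we add for completeness):

$$\Pr[\text{correct}] = \alpha\rho + (1 - \alpha)\eta \overset{\rho = \eta}{=} \rho\,(\alpha + 1 - \alpha) = \rho,$$

independent of $\alpha$.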
2.3 Text Generation
We define a text generation system as a function that takes an input from the input space and generates an output.
Definition 6 ((Optimal) Text Generator) A Text Generator (TG) is a mapping $\pi : \mathcal{I} \to \mathcal{O}$ which generates an output $o$ for each input $i$. A TG is optimal $\iff \forall i \in \mathcal{I} : \pi(i) \in \mathcal{R}^i_+$.
Next, we introduce the notion of an imperfect text generator. There are many different ways in which the errors of a TG can be modeled; we model them as the TG's capability of generating adequate responses.
Definition 7 ($\alpha$-Optimal TG) Let $\pi$ be a TG and $\alpha \in [0, 1]$. Then $\pi$ is an $\alpha$-optimal TG if $Pr[\pi(i) \in \mathcal{R}^i_+] = \alpha$ for all $i \in \mathcal{I}$.
That is, the probability of the text generation system generating an adequate output is denoted as $\alpha$. The task of a binary metric is to estimate the $\alpha$ value of a TG system, which has a concrete meaning: assume that we compare two systems, where $\alpha_{\pi_1} = 0.5$ and $\alpha_{\pi_2} = 0.49$; then these numbers have a clear semantics: $\pi_1$ outputs an adequate output in 50% of cases and $\pi_2$ in 49% of cases. Thus, one system generates adequate outputs more often than the other. We denote the difference in performance as $\Delta$. In the following, we will use $\alpha_\pi$ to denote the rate at which a system $\pi$ generates adequate responses, and $\pi_\alpha$ to refer to a system which is $\alpha$-optimal.
3 Theory: Estimating $\alpha$ with Binary Metrics

In this section, we show how binary metrics can be used to estimate the performance $\alpha$ of text generation systems. For the remainder of the text, assume that $T_\Phi = \{(i_j, o_j, r^*_j) \mid 1 \le j \le n_\phi\}$ is a set of input-output-rating triples of size $n_\phi$, where the $i_j$ are inputs, $o_j = \pi_\alpha(i_j)$ denotes the output generated by an $\alpha$-optimal TG system for input $i_j$, and $r^*_j = M^*_b(i_j, o_j)$ denotes the error-free rating of the $j$-th input-output pair. Analogously, let $T_M = \{(i_j, o_j, r_j) \mid 1 \le j \le n_M\}$ be a set of input-output-rating triples of size $n_M$, where $r_j = M^{\rho,\eta}_b(i_j, o_j)$ denotes the rating of an error-prone $(\rho, \eta)$-optimal binary metric.

We consider three different cases: 1) the error-free case, 2) the error-prone metric case, and 3) the mixed case. The error-free case is the one where we have access to the $r^*_j$; for instance, we can interpret human evaluation as an example of the error-free case. In the error-prone metric case, we have access only to a $(\rho, \eta)$-optimal binary metric. Finally, the mixed case is a novel approach that combines error-free ratings, which are usually costly to obtain, with error-prone ratings, which are cheaper but are needed en masse for automated metrics with low $\rho$ and $\eta$ values, as we will see. Usually, in evaluation campaigns, either the first or the second setting is applied.

We apply a Bayesian approach to estimate $\alpha$ by treating it as a random variable, which allows us to model various sources of uncertainty stemming from $\alpha$, $\rho$, and $\eta$, which all need to be estimated from data. The full derivations are given in Appendix A.
3.1 Error-Free Case

Here, we start with the simplest case and introduce the formula to estimate $\alpha$ given error-free ratings $r^*_j$. Given $n_\phi$ error-free ratings, $\alpha$ is estimated by $\tilde{\alpha} = \frac{n_+}{n_\phi}$, where $n_+ = \sum_{j=1}^{n_\phi} r^*_j$. This formula can be derived via the frequentist approach or the Bayesian one. For the Bayesian approach, we assume a uniform prior over $\alpha$ (i.e., $\alpha \sim \text{Beta}(1, 1)$). The resulting posterior distribution for $\alpha$ given $n_+$ is:

$$P(\alpha \mid N_+ = n_+) \propto P(N_+ = n_+ \mid \alpha)\,P(\alpha) \sim \text{Beta}(n_+ + 1,\, n_\phi - n_+ + 1) \quad (1)$$

and the value of $\alpha$ is estimated using the mode of $\text{Beta}(n_+ + 1, n_\phi - n_+ + 1)$, which corresponds to $\frac{n_+}{n_\phi}$.
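As a minimal illustration (our own sketch, not the authors' released tool; the counts in the example are made up), the posterior in Eq. (1) and its mode can be computed as follows:

```python
from scipy.stats import beta

def error_free_posterior(n_plus: int, n_phi: int):
    """Posterior over alpha given n_plus adequate ratings out of
    n_phi error-free ratings, under a uniform Beta(1, 1) prior."""
    return beta(n_plus + 1, n_phi - n_plus + 1)

# Example: 620 of 1000 outputs rated adequate by human annotators.
post = error_free_posterior(620, 1000)
alpha_tilde = 620 / 1000            # mode of Beta(621, 381)
lo, hi = post.ppf([0.025, 0.975])   # 95% credible interval for alpha
print(alpha_tilde, lo, hi)
```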
3.2 Error-Prone Metric Case

In the error-prone metric case, the probability that $r_j = 1$ depends on $\rho$ and $\eta$. Hence, if $r_j = 1$, we cannot assume that $r^*_j = 1$ as well, since the binary metric can be error-prone. For the error-prone setting, we consider two cases: one where $\rho$ and $\eta$ are provided (e.g., from an earlier evaluation campaign), and one where $\rho$ and $\eta$ must be estimated from data (i.e., from comparison to error-free ratings).

3.2.1 Provided $\rho, \eta$

Here, we assume that the exact values of $\rho$ and $\eta$ are known. The probability that the binary metric returns a positive label is thus given by:

$$P(r_j = 1) = \alpha\rho + (1 - \alpha)(1 - \eta) \quad (2)$$

From this, we derive the formula to estimate $\alpha$ using the Bayesian formulation.
Theorem 1 (Estimate $\alpha$ with an error-prone metric) Let $m_+ = \sum_{j=1}^{n_M} r_j \sim \text{Binom}(P(r_j = 1), n_M)$ be the number of pairs $(i_j, o_j)$ rated as adequate, i.e., $M^{\rho,\eta}_b(i_j, o_j) = 1$. Then we estimate $\alpha$ by computing the mode of the following distribution:

$$P(\alpha \mid M_+ = m_+, \rho, \eta) \propto P(M_+ = m_+ \mid \alpha, \rho, \eta)\,P(\alpha) \quad (3)$$

If we assume a uniform prior over $\alpha$, i.e., $P(\alpha) \sim U(0, 1)$, this reduces to:

$$\tilde{\alpha} = \frac{\frac{m_+}{n_M} + \eta - 1}{\rho + \eta - 1}$$
Note that the above formulation does not allow for $\rho + \eta = 1$, in which case our estimator would be undefined. In the following, we will assume that $\rho + \eta > 1$. This is a relatively safe assumption, since in the case where $\rho + \eta < 1$, we can derive a new metric $M^{\rho',\eta'}_b$ by flipping the predictions of $M^{\rho,\eta}_b$: $M^{\rho',\eta'}_b(i, o) = 1 - M^{\rho,\eta}_b(i, o)$. In this case, $\rho' + \eta' = (1 - \rho) + (1 - \eta) = 2 - (\rho + \eta) > 1$.
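A sketch of the resulting point estimator, including the prediction-flipping step for $\rho + \eta < 1$ (the helper name and example counts are our own):

```python
def estimate_alpha(m_plus: int, n_m: int, rho: float, eta: float) -> float:
    """Point estimate of alpha (Theorem 1) from m_plus positive ratings
    out of n_m, given a (rho, eta)-optimal binary metric."""
    if abs(rho + eta - 1.0) < 1e-9:
        raise ValueError("estimator undefined for rho + eta = 1")
    if rho + eta < 1.0:
        # Flip the metric's predictions: rho' = 1 - rho, eta' = 1 - eta,
        # and the flipped metric rates n_m - m_plus samples as positive.
        rho, eta, m_plus = 1.0 - rho, 1.0 - eta, n_m - m_plus
    alpha = (m_plus / n_m + eta - 1.0) / (rho + eta - 1.0)
    # Clamp to [0, 1]; sampling noise can push the raw estimate outside.
    return min(max(alpha, 0.0), 1.0)

# Example: a metric with rho = eta = 0.7 rates 5800 of 10000 outputs
# as adequate, yielding an estimate of roughly 0.7.
print(estimate_alpha(5800, 10000, 0.7, 0.7))
```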
3.2.2 Estimated $\rho, \eta$

Here, we assume that $\rho$ and $\eta$ must be estimated from data, which introduces uncertainty. In our case, we estimate $\rho$ and $\eta$ from error-free ratings (i.e., from how well the error-prone metric agrees with the error-free ratings). In practice, the error-free assessments stem from human annotations, which are regarded as the ground truth. To weave the estimation of $\rho$ and $\eta$ into the Bayesian framework, we treat them as random variables. For this, assume that we have access to a dataset $T_{\rho,\eta} = \{(i_j, o_j, r^*_j, r_j) \mid 1 \le j \le M\}$ of both error-free and error-prone ratings for pairs of inputs and outputs. Denote $T^+_{\rho,\eta} = \{(i_j, o_j) \mid r^*_j = 1\}$ the set of true positive samples, and $T^-_{\rho,\eta} = \{(i_j, o_j) \mid r^*_j = 0\}$ the set of true negative samples. Thus, assuming a uniform prior over $\rho$, we apply the same reasoning as in Section 3.1 to compute the posterior distribution $\rho \sim \text{Beta}(m_{TP} + 1, |T^+_{\rho,\eta}| - m_{TP} + 1)$, where $m_{TP}$ denotes the number of true positive samples rated as positive by $M^{\rho,\eta}_b$. Analogously, $\eta \sim \text{Beta}(m_{TN} + 1, |T^-_{\rho,\eta}| - m_{TN} + 1)$, where $m_{TN}$ denotes the number of true negative samples rated as negative by $M^{\rho,\eta}_b$. Note that to estimate $\rho$ and $\eta$, having a large sample size for both $T^+_{\rho,\eta}$ and $T^-_{\rho,\eta}$ is important; otherwise, the estimation of $\rho$ or $\eta$ would have a higher uncertainty.
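Both posteriors follow the same Beta-posterior pattern as in Section 3.1; a short sketch (function and variable names are our own), assuming the human ratings serve as ground truth:

```python
from scipy.stats import beta

def rho_eta_posteriors(ratings):
    """ratings: iterable of (r_star, r) pairs of error-free (human) and
    error-prone (metric) ratings. Returns Beta posteriors for rho (true
    positive rate) and eta (true negative rate) under uniform priors."""
    ratings = list(ratings)
    n_pos = sum(1 for r_star, _ in ratings if r_star == 1)
    n_neg = len(ratings) - n_pos
    m_tp = sum(1 for r_star, r in ratings if r_star == 1 and r == 1)
    m_tn = sum(1 for r_star, r in ratings if r_star == 0 and r == 0)
    rho_post = beta(m_tp + 1, n_pos - m_tp + 1)
    eta_post = beta(m_tn + 1, n_neg - m_tn + 1)
    return rho_post, eta_post
```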
To incorporate the uncertainty of $\rho$ and $\eta$ into the estimation of $\alpha$, we need to marginalize $\rho$ and $\eta$ out of the joint likelihood $P(m_+, \rho, \eta \mid \alpha)$ to get $P(m_+ \mid \alpha)$.
Theorem 2 (Estimate $\alpha$, $\rho$, $\eta$ with an error-prone metric) Let $m_+ = \sum_{j=1}^{n} r_j \sim \text{Binom}(P(r_j = 1), n)$ be the number of samples rated positively by $M^{\rho,\eta}_b$. Then we estimate $\alpha$ by computing the mode of the following distribution:

$$P(\alpha \mid M_+ = m_+) \propto P(M_+ = m_+ \mid \alpha)\,P(\alpha) \propto P(\alpha) \int_0^1 \int_0^1 p(m_+ \mid \alpha, \rho, \eta)\,p(\rho)\,p(\eta)\,d\rho\,d\eta \quad (4)$$
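The double integral in Eq. (4) has no convenient closed form, so it can be approximated numerically. Below is a minimal sketch that evaluates the posterior on a grid over $\alpha$ and integrates out $\rho$ and $\eta$ via Monte Carlo samples from their Beta posteriors (an implementation choice of ours, not necessarily the authors'):

```python
import numpy as np
from scipy.stats import binom

def marginal_posterior_mode(m_plus, n_m, rho_post, eta_post,
                            n_samples=2_000, grid_size=501, seed=0):
    """Approximate the mode of P(alpha | m_plus) from Eq. (4), with the
    Beta posteriors rho_post and eta_post (e.g., from rho_eta_posteriors
    above) integrated out by Monte Carlo."""
    rng = np.random.default_rng(seed)
    rhos = rho_post.rvs(n_samples, random_state=rng)
    etas = eta_post.rvs(n_samples, random_state=rng)
    alphas = np.linspace(0.0, 1.0, grid_size)
    # P(r_j = 1 | alpha, rho, eta) for every grid point / sample pair.
    p_pos = alphas[:, None] * rhos[None, :] \
        + (1.0 - alphas[:, None]) * (1.0 - etas[None, :])
    # Averaging the binomial likelihood over the (rho, eta) samples
    # approximates the double integral; the uniform prior P(alpha) is
    # constant and drops out of the argmax.
    marginal = binom.pmf(m_plus, n_m, p_pos).mean(axis=1)
    return alphas[np.argmax(marginal)]
```

In this sketch, a small human-rated subset widens the posteriors over $\rho$ and $\eta$, which in turn flattens the marginal posterior over $\alpha$; this is the uncertainty that the mixed case trades against.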