
founded guarantees and guidelines that can be used to run an evaluation campaign. For instance, consider Figure 1 (derived from our theory). If we assume a binary metric with an accuracy of 70%, and if we have access to 1000 automatically rated samples (blue line), then we can reliably distinguish between two text generation systems whose performance differs by 10 percentage points. To distinguish two systems with a smaller difference, for instance 2 percentage points, we would need a better metric and many more samples: for instance, a metric with an accuracy of at least 85% and 10,000 samples rated by this metric.
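The precise trade-off between metric accuracy, number of ratings, and measurable difference follows from the theory derived in this paper. As a rough intuition for why shrinking the detectable difference requires disproportionately more samples, the following sketch uses a standard two-proportion sample-size calculation; it is not the paper's formula (which additionally accounts for metric accuracy), and the significance level, power, and success rates are illustrative assumptions.

```python
from scipy.stats import norm

def samples_per_system(p1: float, p2: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of rated samples needed per system so that a
    two-proportion z-test can detect the gap between success rates p1 and
    p2 at significance level alpha with the given statistical power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

# A 10-point gap needs only a few hundred ratings per system ...
print(samples_per_system(0.50, 0.60))   # ~385
# ... while a 2-point gap needs roughly 25 times as many.
print(samples_per_system(0.50, 0.52))   # ~9804
```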
Our theory provides analogous assessments of
how many human evaluations are required to re-
liably distinguish text generation systems. When
we say that the performance of two systems can
be reliably distinguished, we mean that the differ-
ence in their performance is statistically significant.
Similarly, a measurable difference in performance
is one that leads to statistical significance given the
experiment parameters.
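To make this notion concrete, one standard way to test whether two systems' ratios of adequate responses differ significantly, given binary ratings, is Fisher's exact test; this generic test is our own illustration, not necessarily the test prescribed by our theory.

```python
from scipy.stats import fisher_exact

def reliably_distinguished(adequate_a: int, n_a: int,
                           adequate_b: int, n_b: int,
                           alpha: float = 0.05) -> bool:
    """Return True if the ratio of adequate responses of system A differs
    significantly (at level alpha) from that of system B."""
    table = [[adequate_a, n_a - adequate_a],
             [adequate_b, n_b - adequate_b]]
    _, p_value = fisher_exact(table)
    return p_value < alpha

# With 1000 binary ratings per system, 55% vs. 60% adequate responses:
print(reliably_distinguished(550, 1000, 600, 1000))  # prints True for these counts
```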
In addition, our theory allows for mixing human and automated evaluation. For this, consider Table 1, which depicts the number of human and automatic ratings required by a metric with 70% accuracy. For instance, to distinguish two text generators with a difference of 2 percentage points, we need either at least 5000 human ratings, or 2500 human ratings mixed with 10,000 automated ratings.
Our theoretical framework allows us to design our evaluation with theoretical guarantees regarding the significance of the resulting measurements. Given a monetary budget and our theory, one can decide whether to invest in more human annotations, in developing better automated metrics, or in sampling more automated ratings. Our approach can also be used to showcase the limits of a given setting: for instance, in Figure 1 we see that using only 1000 automated ratings leads to a minimal measurable difference of 4 percentage points, even with a perfect metric.
In the remainder of the paper, we derive the theoretical framework for binary metrics and apply it to two showcases: the WMT-21 shared task (Freitag et al., 2021b) and the Spot-The-Bot evaluation (Deriu et al., 2020). We analyse how well these evaluations adhere to the constraints imposed by our theory and demonstrate how the quality of the evaluations can be improved. To serve the community, we will release the formulas as code and as a web interface¹ that allows practitioners to enter their evaluation setting and receive an analysis of the measurable differences in that setting.
2 Definitions
In this section, we introduce the basic definitions
that we need for the derivations. First, we define
the general setting of Text Generation, then we
cover binary metrics, and finally we describe text
generation systems.
2.1 General Setting
Definition 1 (Text Generation Environment). A text generation environment is composed of a triple $\langle \mathcal{I}, \mathcal{O}, \Phi \rangle$, where $\mathcal{I}$ denotes the set of inputs, $\mathcal{O}$ the output space, and $\Phi : \mathcal{I} \times \mathcal{O} \to \{0, 1\}$ an oracle that assesses whether an output is adequate for a given input.
For instance, for Machine Translation, $\mathcal{I}$ denotes all sentences in the source language and $\mathcal{O}$ all sentences in the target language, while for a chatbot, $\mathcal{I}$ contains all dialogue contexts and $\mathcal{O}$ all possible responses in a dialogue. Note that $\mathcal{I}$ and $\mathcal{O}$ can be of infinite size. We regard $\Phi$ as an oracle that segments the output space for a given input into adequate and inadequate outputs².
Definition 2 (Adequate Responses). $\forall i \in \mathcal{I}$, we call $R_i^+ = \{o \in \mathcal{O} \mid \Phi(i, o) = 1\}$ the set of adequate responses for input $i$, and $R_i^- = \{o \in \mathcal{O} \mid \Phi(i, o) = 0\}$ the set of inadequate responses.
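To make the notation concrete, the following Python sketch models Definitions 1 and 2 over finite candidate sets; the names and types are our own illustration, since $\mathcal{I}$ and $\mathcal{O}$ may be infinite and $\Phi$ is in practice only approximated, e.g., by human ratings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set

Input = str    # stands in for an element of I
Output = str   # stands in for an element of O

@dataclass
class TextGenerationEnvironment:
    """A finite stand-in for the triple <I, O, Phi> of Definition 1."""
    inputs: Iterable[Input]                 # (a finite subset of) I
    outputs: Iterable[Output]               # (a finite subset of) O
    oracle: Callable[[Input, Output], int]  # Phi: I x O -> {0, 1}

    def adequate(self, i: Input) -> Set[Output]:
        """R_i^+ of Definition 2: outputs the oracle accepts for input i."""
        return {o for o in self.outputs if self.oracle(i, o) == 1}

    def inadequate(self, i: Input) -> Set[Output]:
        """R_i^- of Definition 2: outputs the oracle rejects for input i."""
        return {o for o in self.outputs if self.oracle(i, o) == 0}
```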
2.2 Binary Metric
In this work, we set our focus to binary metrics, i.e.,
metrics that classify the output of a text generation
system as being either adequate or inadequate. The
choice of binary metrics allows us to reason about
the performance of a text generation (TG) system
as the ratio of adequate responses3.
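As footnote 3 below notes, a scalar metric such as COMET can be turned into a binary metric by thresholding its score. A minimal sketch, where the threshold value is an illustrative assumption that would need to be calibrated:

```python
from typing import List

def binarize(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Map scalar metric scores (e.g., COMET) to binary adequacy labels:
    1 (adequate) if the score reaches the threshold, 0 otherwise."""
    return [1 if score >= threshold else 0 for score in scores]

# Example: three outputs scored by a scalar metric.
print(binarize([0.34, 0.62, 0.48]))  # -> [0, 1, 0]
```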
¹https://github.com/vodezhaw/binary_metric_tool
²In most real-world settings, $\Phi$ is approximated with human ratings.
³This lies in contrast with metrics that simply return a scalar value (e.g., BLEU (Papineni et al., 2002), COMET (Rei et al., 2020), USR (Mehri and Eskenazi, 2020)), which is difficult to interpret. For instance, if BLEU returns a value of 0.34 for one system and 0.32 for the second system, can we really state that the first system is better than the second (Callison-Burch et al., 2006)? We can use these types of metrics to create binary metrics by selecting a threshold that defines the border between adequate and inadequate responses (e.g., all COMET