
founded guarantees and guidelines that can be used to run an evaluation campaign. For instance, consider Figure 1 (derived from our theory). If we assume a binary metric with an accuracy of 70%, and if we have access to 1000 automatically rated samples (blue line), then we can reliably distinguish between two text generation systems whose performance differs by 10 percentage points. To distinguish two systems with a smaller difference, for instance 2 percentage points, we would need a better metric and many more samples: for instance, a metric with an accuracy of at least 85% and 10,000 samples rated by this metric.
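The precise trade-off between metric accuracy, number of ratings, and measurable difference follows from the theory derived in this paper. As a rough intuition for why shrinking the detectable difference requires disproportionately more samples, the following sketch uses a standard two-proportion sample-size calculation; it is not the paper's formula (which additionally accounts for metric accuracy), and the significance level, power, and success rates are illustrative assumptions.

```python
from scipy.stats import norm

def samples_per_system(p1: float, p2: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of rated samples needed per system so that a
    two-proportion z-test can detect the gap between success rates p1 and
    p2 at significance level alpha with the given statistical power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

# A 10-point gap needs only a few hundred ratings per system ...
print(samples_per_system(0.50, 0.60))   # ~385
# ... while a 2-point gap needs roughly 25 times as many.
print(samples_per_system(0.50, 0.52))   # ~9804
```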
Our theory provides analogous assessments of
how many human evaluations are required to re-
liably distinguish text generation systems. When
we say that the performance of two systems can
be reliably distinguished, we mean that the differ-
ence in their performance is statistically significant.
Similarly, a measurable difference in performance
is one that leads to statistical significance given the
experiment parameters.
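To make this notion concrete, one standard way to test whether two systems' ratios of adequate responses differ significantly, given binary ratings, is Fisher's exact test; this generic test is our own illustration, not necessarily the test prescribed by our theory.

```python
from scipy.stats import fisher_exact

def reliably_distinguished(adequate_a: int, n_a: int,
                           adequate_b: int, n_b: int,
                           alpha: float = 0.05) -> bool:
    """Return True if the ratio of adequate responses of system A differs
    significantly (at level alpha) from that of system B."""
    table = [[adequate_a, n_a - adequate_a],
             [adequate_b, n_b - adequate_b]]
    _, p_value = fisher_exact(table)
    return p_value < alpha

# With 1000 binary ratings per system, 55% vs. 60% adequate responses:
print(reliably_distinguished(550, 1000, 600, 1000))  # prints True for these counts
```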
In addition, our theory allows for mixing human and automated evaluation. For this, consider Table 1, which depicts the number of human and automatic ratings required by a metric with 70% accuracy. For instance, to distinguish two text generators with a difference of 2 percentage points, we need either at least 5000 human ratings, or 2500 human ratings mixed with 10,000 automated ratings.
Our theoretical framework allows us to design our evaluation with theoretical guarantees regarding the significance of the resulting measurements. Given a monetary budget and our theory, one can decide whether to invest in more human annotations, in developing better automated metrics, or in sampling more automated ratings. Our approach can also be used to showcase the limits of a given setting: for instance, in Figure 1 we see that using only 1000 automated ratings leads to a minimal measurable difference of 4 percentage points, even with a perfect metric.
In the remainder of the paper, we derive the theoretical framework for binary metrics and apply it to two showcases: the WMT-21 shared task (Freitag et al., 2021b) and the Spot-The-Bot evaluation (Deriu et al., 2020). We analyse how well these evaluations adhere to the constraints imposed by our theory and demonstrate how the quality of the evaluations can be improved. To serve the community, we will release the formulas as code and as a web interface¹ that allows practitioners to enter their evaluation setting and receive an analysis of the measurable differences in that setting.
2 Definitions
In this section, we introduce the basic definitions
that we need for the derivations. First, we define
the general setting of Text Generation, then we
cover binary metrics, and finally we describe text
generation systems.
2.1 General Setting
Definition 1 (Text Generation Environment). A text generation environment is composed of a triple $\langle \mathcal{I}, \mathcal{O}, \Phi \rangle$, where $\mathcal{I}$ denotes the set of inputs, $\mathcal{O}$ the output space, and $\Phi : \mathcal{I} \times \mathcal{O} \to \{0, 1\}$ an oracle that assesses whether an output is adequate for a given input.
For instance, for Machine Translation, $\mathcal{I}$ denotes all sentences in the source language and $\mathcal{O}$ all sentences in the target language, while for a chatbot, $\mathcal{I}$ contains all dialogue contexts and $\mathcal{O}$ all possible responses in a dialogue. Note that $\mathcal{I}$ and $\mathcal{O}$ can be of infinite size. We regard $\Phi$ as an oracle that segments the output space for a given input into adequate and inadequate outputs².
Definition 2 (Adequate Responses). $\forall i \in \mathcal{I}$, we call $R_i^+ = \{o \in \mathcal{O} \mid \Phi(i, o) = 1\}$ the set of adequate responses for input $i$, and $R_i^- = \{o \in \mathcal{O} \mid \Phi(i, o) = 0\}$ the set of inadequate responses.
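To make the notation concrete, the following Python sketch models Definitions 1 and 2 over finite candidate sets; the names and types are our own illustration, since $\mathcal{I}$ and $\mathcal{O}$ may be infinite and $\Phi$ is in practice only approximated, e.g., by human ratings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set

Input = str    # stands in for an element of I
Output = str   # stands in for an element of O

@dataclass
class TextGenerationEnvironment:
    """A finite stand-in for the triple <I, O, Phi> of Definition 1."""
    inputs: Iterable[Input]                 # (a finite subset of) I
    outputs: Iterable[Output]               # (a finite subset of) O
    oracle: Callable[[Input, Output], int]  # Phi: I x O -> {0, 1}

    def adequate(self, i: Input) -> Set[Output]:
        """R_i^+ of Definition 2: outputs the oracle accepts for input i."""
        return {o for o in self.outputs if self.oracle(i, o) == 1}

    def inadequate(self, i: Input) -> Set[Output]:
        """R_i^- of Definition 2: outputs the oracle rejects for input i."""
        return {o for o in self.outputs if self.oracle(i, o) == 0}
```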
2.2 Binary Metric
In this work, we set our focus to binary metrics, i.e.,
metrics that classify the output of a text generation
system as being either adequate or inadequate. The
choice of binary metrics allows us to reason about
the performance of a text generation (TG) system
as the ratio of adequate responses3.
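As footnote 3 below notes, a scalar metric such as COMET can be turned into a binary metric by thresholding its score. A minimal sketch, where the threshold value is an illustrative assumption that would need to be calibrated:

```python
from typing import List

def binarize(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Map scalar metric scores (e.g., COMET) to binary adequacy labels:
    1 (adequate) if the score reaches the threshold, 0 otherwise."""
    return [1 if score >= threshold else 0 for score in scores]

# Example: three outputs scored by a scalar metric.
print(binarize([0.34, 0.62, 0.48]))  # -> [0, 1, 0]
```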
¹https://github.com/vodezhaw/binary_metric_tool
²In most real-world settings, $\Phi$ is approximated with human ratings.
³This lies in contrast with metrics that simply return a scalar value (e.g., BLEU (Papineni et al., 2002), COMET (Rei et al., 2020), USR (Mehri and Eskenazi, 2020)), which is difficult to interpret. For instance, if BLEU returns a value of 0.34 for one system and 0.32 for the second system, can we really state that the first system is better than the second (Callison-Burch et al., 2006)? We can use these types of metrics to create binary metrics by selecting a threshold that defines the border between adequate and inadequate responses (e.g., all COMET