
actual span-level error localization performance has not been systematically evaluated in prior work.
In this paper, we aim to answer the following question: does the actual behavior of QA-based metrics align with their motivation? Specifically, we evaluate whether these models successfully identify error spans in generated summaries, independent of their final summary-level judgment. We conduct our analysis on two recent factuality datasets (Cao and Wang, 2021; Goyal and Durrett, 2021) derived from pre-trained summarization models on two popular benchmark datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) and XSum (Narayan et al., 2018). Our results are surprising: we find that good summary-level performance is rarely accompanied by correct span-level error detection. Moreover, even trivial exact-match baselines outperform QA metrics at error localization. Our results clearly show that although QA metrics are motivated by a span-level decomposition of the factuality problem, their actual span-level predictions are very poor.
Next, we analyze these failure cases to understand why QA-based metrics diverge from their intended behavior. We find that the most serious problem lies in the question generation (QG) stage: generated questions for non-factual summaries inherit errors from the input summaries (see Figure 1). This results in poor localization wherein factual spans get classified as non-factual due to presupposition failures during QA. Furthermore, we show that such inherited errors cannot be easily avoided: decreasing the length of generated questions reduces the number of inherited errors, but very short questions can be under-specified and fail to provide enough context for the QA model. In fact, replacing automatic QG with human QG also does not improve the error localization of QA metrics. These results demonstrate fundamental issues with the current QA-based factuality framework that cannot be patched by stronger QA/QG methods.
Our contributions are as follows. (1) We show that QA-based factuality models for summarization exhibit poor error localization capabilities. (2) We provide a detailed study of the factors in QG that hamper these models: inherited errors in long generated questions, and the trade-off between these and short, under-specified questions. (3) We conduct a human study to illustrate the issues with the QA-based factuality framework independent of particular QA or QG systems.
2 QA-Based Factuality Metrics
Recent work has proposed numerous QA-based metrics for summarization evaluation, particularly factuality (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Durmus et al., 2020; Wang et al., 2020; Deutsch and Roth, 2021). These proposed metrics follow the same basic framework (described in Section 2.1) and primarily differ in the choice of off-the-shelf models used for the different framework components (discussed in Section 2.2).
2.1 Basic Framework
Given a source document $D$ and a generated summary $S$, QA-based metrics output a summary-level factuality score $y_S$ that denotes the factual consistency of $S$. This includes the following steps (also outlined in Figure 2):
1. Answer Selection: First, candidate answer spans $a_i \in S$ are extracted. These correspond to the base set of facts that are compared against the source document $D$. Metrics evaluated in this work (Scialom et al., 2021; Fabbri et al., 2022) consider all noun phrases and named entities in the generated summary as the answer candidate set, denoted by $\mathrm{span}(S)$.
2. Question Generation: Next, a question generation model ($G$) is used to generate questions for these answer candidates with the generated summary $S$ as context. Let $q_i = G(a_i, S)$ denote the corresponding question for span $a_i$.
3. Question Filtering: Questions for which the question answering ($A$) model's predicted answer $A(q_i, S)$ from the summary does not match the original span $a_i$ are discarded, i.e., when $a_i \neq A(q_i, S)$. This step is used to ensure that the effects of erroneous question generation do not percolate down the pipeline; however, answer spans that do not pass this phase cannot be evaluated by the method.
4. Question Answering: For each generated question $q_i$, the $A$ model is used to predict an answer using the source document $D$ as context. Let $p_i = A(q_i, D)$ denote the predicted answer.
5. Answer Comparison: Finally, the predicted answer $p_i$ is compared to the expected answer $a_i$ to compute a similarity score $\mathrm{sim}(p_i, a_i)$. The overall summary score $y_S$ is computed by averaging over all span-level similarity scores:
$$y_S = \frac{1}{|\mathrm{span}(S)|} \sum_{a_i \in \mathrm{span}(S)} \mathrm{sim}(A(q_i, D), a_i)$$
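To make the pipeline concrete, the following is a minimal Python sketch of these five steps, assuming spaCy for answer-span extraction and Hugging Face pipelines for QG and QA. The specific checkpoints (valhalla/t5-base-qg-hl, deepset/roberta-base-squad2) and the token-level F1 used for $\mathrm{sim}$ are illustrative placeholders rather than the exact components of the metrics studied in this paper.

```python
# Minimal sketch of the five-step QA-based factuality pipeline described above.
# The spaCy model and the two Hugging Face checkpoints are illustrative
# placeholders, not the exact components used by the evaluated metrics.
from collections import Counter

import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")                       # answer selection
qg = pipeline("text2text-generation",
              model="valhalla/t5-base-qg-hl")            # question generation (placeholder)
qa = pipeline("question-answering",
              model="deepset/roberta-base-squad2")       # question answering (placeholder)


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1, one common choice for sim(p_i, a_i)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def factuality_score(document: str, summary: str) -> float:
    # Step 1: answer selection -- noun phrases and named entities in the summary.
    summary_doc = nlp(summary)
    candidates = {s.text for s in list(summary_doc.noun_chunks) + list(summary_doc.ents)}

    scores = []
    for a_i in candidates:
        # Step 2: question generation with the summary as context
        # (the <hl> highlight format is specific to this placeholder QG model).
        highlighted = summary.replace(a_i, f"<hl> {a_i} <hl>", 1)
        q_i = qg(f"generate question: {highlighted}")[0]["generated_text"]

        # Step 3: question filtering -- discard q_i if answering it against the
        # summary itself does not recover a_i (implementations vary in strictness).
        if qa(question=q_i, context=summary)["answer"].strip().lower() != a_i.strip().lower():
            continue

        # Step 4: question answering against the source document.
        p_i = qa(question=q_i, context=document)["answer"]

        # Step 5: answer comparison.
        scores.append(token_f1(p_i, a_i))

    # Summary-level score y_S: average of the span-level similarity scores
    # (computed here only over spans that survive filtering).
    return sum(scores) / len(scores) if scores else 0.0
```

Note that the formula above averages over all of $\mathrm{span}(S)$, whereas this sketch scores only the spans that survive the filtering step; published implementations handle discarded spans in slightly different ways.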