Shortcomings of Question Answering Based Factuality Frameworks
for Error Localization
Ryo Kamoi Tanya Goyal Greg Durrett
Department of Computer Science
The University of Texas at Austin
ryokamoi@utexas.edu
Abstract
Despite recent progress in abstractive summarization, models often generate summaries with factual errors. Numerous approaches to detect these errors have been proposed, the most popular of which are question answering (QA)-based factuality metrics. These have been shown to work well at predicting summary-level factuality and have potential to localize errors within summaries, but this latter capability has not been systematically evaluated in past research. In this paper, we conduct the first such analysis and find that, contrary to our expectations, QA-based frameworks fail to correctly identify error spans in generated summaries and are outperformed by trivial exact match baselines. Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules. Moreover, even human-in-the-loop question generation cannot easily offset these problems. Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
1 Introduction
Although abstractive summarization systems (Rush et al., 2015; See et al., 2017; Lewis et al., 2020) have improved drastically over the past few years, these systems often introduce factual errors into generated summaries (Cao et al., 2018; Kryscinski et al., 2019). Recent work has proposed a number of approaches to detect these errors, including using off-the-shelf entailment models (Falke et al., 2019; Laban et al., 2022), question answering (QA) models (Chen et al., 2018; Wang et al., 2020; Durmus et al., 2020), and discriminators trained on synthetic data (Kryscinski et al., 2020). Such methods have also been explored to identify error spans within summaries (Goyal and Durrett, 2020) and perform post-hoc error correction (Dong et al., 2020; Cao et al., 2020).

Figure 1: Factual error localization using QA metrics. Questions are generated for summary spans and then answered by a QA model using the source article as context. For factual spans (e.g., matchbox labels), we expect the predicted answers to match the original spans. However, non-factual spans in generated questions inherited from summaries may render these unanswerable and lead to incorrect error localization.
Among these different approaches for evaluating factuality, QA-based frameworks are the most widely adopted (Chen et al., 2018; Scialom et al., 2019; Durmus et al., 2020; Wang et al., 2020; Scialom et al., 2021; Fabbri et al., 2022). These evaluate the factuality of a set of spans in isolation, then combine them to render a summary-level judgment. Figure 1 illustrates the core mechanism: question generation (QG) is used to generate questions for a collection of summary spans, typically noun phrases or entities, which are then compared with those questions' answers based on the source document to determine factuality. Due to this span-level decomposition of factuality, QA frameworks are widely believed to localize errors (Chen et al., 2018; Wang et al., 2020; Gunasekara et al., 2021). Therefore, the metrics have been applied in settings like post-hoc error correction (Dong et al., 2020), salient (Deutsch and Roth, 2021) and incorrect (Scialom et al., 2021) span detection, and text alignment (Weiss et al., 2021). However, their actual span-level error localization performance has not been systematically evaluated in prior work.
In this paper, we aim to answer the following question: does the actual behavior of QA-based metrics align with their motivation? Specifically, we evaluate whether these models successfully identify error spans in generated summaries, independent of their final summary-level judgment. We conduct our analysis on two recent factuality datasets (Cao and Wang, 2021; Goyal and Durrett, 2021) derived from pre-trained summarization models on two popular benchmark datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) and XSum (Narayan et al., 2018). Our results are surprising: we find that good summary-level performance is rarely accompanied by correct span-level error detection. Moreover, even trivial exact match baselines outperform QA metrics at error localization. Our results clearly show that although motivated by span-level decomposition of the factuality problem, the actual span-level predictions of QA metrics are very poor.
Next, we analyze these failure cases to understand why QA-based metrics diverge from their intended behavior. We find that the most serious problem lies in the question generation (QG) stage: generated questions for non-factual summaries inherit errors from the input summaries (see Figure 1). This results in poor localization wherein factual spans get classified as non-factual due to presupposition failures during QA. Furthermore, we show that such inherited errors cannot be easily avoided: decreasing the length of generated questions reduces the number of inherited errors, but very short questions can be under-specified and not provide enough context for the QA model. In fact, replacing automatic QG with human QG also does not improve the error localization of QA metrics. These results demonstrate fundamental issues with the current QA-based factuality frameworks that cannot be patched by stronger QA/QG methods.
Our contributions are as follows. (1) We show that QA-based factuality models for summarization exhibit poor error localization capabilities. (2) We provide a detailed study of factors in QG that hamper these models: inherited errors in long generated questions and trade-offs between these and short under-specified questions. (3) We conduct a human study to illustrate the issues with the QA-based factuality framework independent of particular QA or QG systems.
2 QA-Based Factuality Metrics
Recent work has proposed numerous QA-based metrics for summarization evaluation, particularly factuality (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Durmus et al., 2020; Wang et al., 2020; Deutsch and Roth, 2021). These proposed metrics follow the same basic framework (described in Section 2.1), and primarily differ in the choice of off-the-shelf models used for the different framework components (discussed in Section 2.2).
2.1 Basic Framework
Given a source document $D$ and generated summary $S$, the QA-based metrics output a summary-level factuality score $y_S$ that denotes the factual consistency of $S$. This includes the following steps (also outlined in Figure 2):
1. Answer Selection: First, candidate answer spans $a_i \in S$ are extracted. These correspond to the base set of facts that are compared against the source document $D$. Metrics evaluated in this work (Scialom et al., 2021; Fabbri et al., 2022) consider all noun phrases and named entities in generated summaries as the answer candidate set, denoted by $\text{span}(S)$.
2. Question Generation: Next, a question generation model ($G$) is used to generate questions for these answer candidates with the generated summary $S$ as context. Let $q_i = G(a_i, S)$ denote the corresponding question for span $a_i$.
3. Question Filtering: Questions for which the question answering ($A$) model's predicted answer $A(q_i, S)$ from the summary does not match the original span $a_i$ are discarded, i.e., when $a_i \neq A(q_i, S)$. This step is used to ensure that the effects of erroneous question generation do not percolate down the pipeline; however, answer spans that do not pass this phase cannot be evaluated by the method.
4. Question Answering: For each generated question $q_i$, the $A$ model is used to predict answers using the source document $D$ as context. Let $p_i = A(q_i, D)$ denote the predicted answer.
5. Answer Comparison: Finally, the predicted answer $p_i$ is compared to the expected answer $a_i$ to compute a similarity score $\text{sim}(p_i, a_i)$. The overall summary score $y_S$ is computed by averaging over all span-level similarity scores:

$$
y_S = \frac{1}{|\text{span}(S)|} \sum_{a_i \in \text{span}(S)} \text{sim}(A(q_i, D), a_i)
$$
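To make the pipeline concrete, the sketch below walks through these five steps in Python. It is a minimal sketch rather than the implementation of any particular metric: extract_spans, generate_question, answer_question, and similarity are hypothetical stand-ins for the span extractor, the $G$ model, the $A$ model, and the answer-comparison function that a concrete metric such as QuestEval or QAFactEval would plug in.

```python
from typing import Callable, List

def qa_factuality_score(
    summary: str,
    document: str,
    extract_spans: Callable[[str], List[str]],     # answer selection: NPs/NEs in S
    generate_question: Callable[[str, str], str],  # QG model G(a_i, S)
    answer_question: Callable[[str, str], str],    # QA model A(q, context)
    similarity: Callable[[str, str], float],       # sim(p_i, a_i)
) -> float:
    """Summary-level score y_S as an average of span-level similarity scores."""
    span_scores = []
    for a_i in extract_spans(summary):             # 1. Answer selection
        q_i = generate_question(a_i, summary)      # 2. Question generation
        if answer_question(q_i, summary) != a_i:   # 3. Question filtering: drop q_i if the
            continue                               #    QA model cannot recover a_i from S
        p_i = answer_question(q_i, document)       # 4. Question answering against D
        span_scores.append(similarity(p_i, a_i))   # 5. Answer comparison
    # Average the span-level scores that survived filtering (the equation above
    # averages over span(S); filtered-out spans simply cannot be evaluated).
    return sum(span_scores) / len(span_scores) if span_scores else 0.0
```

The individual similarity(p_i, a_i) values computed inside the loop are exactly the span-level scores whose localization quality this paper evaluates.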
Figure 2: Overall workflow for the QA metrics. First, questions are generated for all NEs and NPs in the generated summary. Answers to these questions are obtained from the source document. Then, a factuality score is computed for each summary span based on its similarity with the predicted span from the previous step. Finally, all span-level scores are aggregated to obtain the final summary-level factuality score.
Based on the motivation behind QA metrics, these similarity scores $\text{sim}(p_i, a_i)$ should indicate the factuality of the corresponding spans. If span $a_i$ is factual, then the $G$-$A$ pipeline should output $p_i \in D$ with high similarity to $a_i$. Conversely, if $a_i$ is non-factual, the similarity score $\text{sim}(p_i, a_i)$ should be low. While prior research has only evaluated their sentence-level performance, we use these span-level factuality scores to additionally evaluate the localization performance of QA metrics.
2.2 QA Metrics Compared
In this work, we focus our analysis on the two best-performing QA-based metrics from prior work:
QuestEval (QE). Scialom et al. (2021) generate questions for answer spans extracted from both the summary ("precision questions") and source document ("recall questions"). We only use the former in our experiments as these are shown to correlate better with factuality. Both the $A$ and $G$ components of QuestEval use T5-Large models (Raffel et al., 2020) fine-tuned on question answering datasets (Rajpurkar et al., 2018; Trischler et al., 2017). The similarity score $\text{sim}(p_i, a_i)$ in this framework is computed as the average of the lexical overlap, BERTScore, and the answerability score predicted by $A$.
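As a rough illustration of this combination, the sketch below averages a simple token-overlap F1 with externally supplied BERTScore and answerability values. The helper names and the bag-of-words overlap are assumptions made for illustration; QuestEval's actual implementation relies on its own scoring code and models.

```python
def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 between two answer strings (illustrative lexical overlap)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = set(pred) & set(ref)
    if not common:
        return 0.0
    precision, recall = len(common) / len(pred), len(common) / len(ref)
    return 2 * precision * recall / (precision + recall)

def questeval_style_sim(p_i: str, a_i: str,
                        bertscore: float, answerability: float) -> float:
    """Average of lexical overlap, BERTScore, and answerability (Section 2.2).

    bertscore and answerability are assumed to come from external models: a
    BERTScore scorer over (p_i, a_i) and the QA model's answerability estimate.
    """
    return (token_f1(p_i, a_i) + bertscore + answerability) / 3.0
```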
QAFactEval (QAFE). Fabbri et al. (2022) conduct an ablation study over the different combinations of available $A$ and $G$ models. Here, we use their best-performing combination: an ELECTRA-based $A$ model and a BART-based $G$ model fine-tuned on the QA2D dataset (Demszky et al., 2018). The $\text{sim}(p_i, a_i)$ score is obtained using the learned metric LERC (Chen et al., 2020). If $A(q_i, D)$ is unanswerable for span $a_i$, QAFactEval sets the similarity score $\text{sim}(\cdot, a_i) = 0$ instead of using the LERC metric.
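A small wrapper makes the unanswerable case explicit; in the sketch below, lerc_score stands in for the learned LERC model, and representing an unanswerable prediction as None is a convention of this sketch, not of the released QAFactEval code.

```python
from typing import Callable, Optional

def qafe_style_sim(p_i: Optional[str], a_i: str,
                   lerc_score: Callable[[str, str], float]) -> float:
    """Span-level similarity in the QAFactEval style (illustrative sketch).

    If the QA model deems the question unanswerable from the source document
    (represented here as p_i being None), the span receives a score of 0
    instead of a LERC score.
    """
    return 0.0 if p_i is None else lerc_score(p_i, a_i)
```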
3 Experimental Setup
3.1 Task Definition
Given document $D$ and a generated summary $S$, let $y^*_S \in \{0, 1\}$ denote the gold summary-level factuality label. Additionally, we assume access to $L = \{(a, y^*_a)\}$, which denotes the set of spans $a \in \text{span}(S)$ and their corresponding span-level gold factuality labels $y^*_a \in \{0, 1\}$.
First, we evaluate the summary-level performance of factuality models, i.e., is the predicted factuality equal to the gold factuality judgment $y^*_S$? To do this, we convert the predicted factuality score $y_S$ to a binary judgment using dataset-specific thresholds. For each factuality model evaluated, we select thresholds that yield the best F1 scores on the validation set of each dataset.
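The thresholding step can be written as a small search over candidate cutoffs; the sketch below is one straightforward way to do it and is an assumption about, not a reproduction of, the exact selection procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(val_scores: np.ndarray, val_gold: np.ndarray) -> float:
    """Return the binarization threshold that maximizes F1 on validation data.

    val_scores holds continuous metric outputs (e.g., y_S values) and val_gold
    the corresponding binary gold factuality labels.
    """
    candidates = np.unique(val_scores)
    f1s = [f1_score(val_gold, (val_scores >= t).astype(int)) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])
```

The chosen threshold is then applied unchanged when binarizing scores on the corresponding test split.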
Next, we evaluate the span-level (localization) performance of factuality models. Similar to the previous setting, we convert span-level predictions $y_a$ to binary labels using the best-F1 threshold derived from the validation set. We report the macro-averaged performance at correctly predicting the span-level labels $y^*_a$ for all $a \in \text{span}(S)$, across all $(D, S)$ pairs in the evaluation dataset.
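The sketch below shows one reasonable reading of this evaluation: pool all annotated spans across $(D, S)$ pairs, binarize their scores with the validation-tuned threshold, and macro-average F1 over the factual and non-factual classes. The exact averaging scheme is an assumption of this sketch rather than a specification of the paper's evaluation code.

```python
from typing import Sequence
from sklearn.metrics import f1_score

def span_level_macro_f1(span_scores: Sequence[float],
                        span_gold: Sequence[int],
                        threshold: float) -> float:
    """Macro-averaged F1 for span-level (localization) predictions.

    span_scores are per-span similarity scores sim(p_i, a_i), span_gold the
    binary gold labels y*_a, and threshold the best-F1 cutoff from validation.
    """
    predictions = [1 if s >= threshold else 0 for s in span_scores]
    return f1_score(span_gold, predictions, average="macro")
```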
To align with the current QA frameworks, we restrict our evaluation to spans that correspond to named entities and noun phrases. This takes a generous view of the QA metrics' performance, as it does not penalize them for failing to identify factual errors outside NPs and NEs. This setting allows us to study the fundamental issues with the QA framework instead of those that can potentially be addressed by extending the question types considered in the framework.