
actual span-level error localization performance has not been systematically evaluated in prior work.
In this paper, we aim to answer the following question: does the actual behavior of QA-based metrics align with their motivation? Specifically, we evaluate whether these models successfully identify error spans in generated summaries, independent of their final summary-level judgment. We conduct our analysis on two recent factuality datasets (Cao and Wang, 2021; Goyal and Durrett, 2021) derived from pre-trained summarization models on two popular benchmark datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) and XSum (Narayan et al., 2018). Our results are surprising: we find that good summary-level performance is rarely accompanied by correct span-level error detection. Moreover, even trivial exact-match baselines outperform QA metrics at error localization. Our results clearly show that although QA metrics are motivated by a span-level decomposition of the factuality problem, their actual span-level predictions are very poor.
Next, we analyze these failure cases to understand why QA-based metrics diverge from their intended behavior. We find that the most serious problem lies in the question generation (QG) stage: generated questions for non-factual summaries inherit errors from the input summaries (see Figure 1). This results in poor localization wherein factual spans get classified as non-factual due to presupposition failures during QA. Furthermore, we show that such inherited errors cannot be easily avoided: decreasing the length of generated questions reduces the number of inherited errors, but very short questions can be under-specified and fail to provide enough context for the QA model. In fact, replacing automatic QG with human QG also does not improve the error localization of QA metrics. These results demonstrate fundamental issues with the current QA-based factuality framework that cannot be patched by stronger QA/QG methods.
Our contributions are as follows. (1) We show that QA-based factuality models for summarization exhibit poor error localization capabilities. (2) We provide a detailed study of the factors in QG that hamper these models: inherited errors in long generated questions, and the trade-off between these and short, under-specified questions. (3) We conduct a human study to illustrate the issues with the QA-based factuality framework independent of particular QA or QG systems.
2 QA-Based Factuality Metrics
Recent work has proposed numerous QA-based metrics for summarization evaluation, particularly factuality (Chen et al., 2018; Scialom et al., 2019; Eyal et al., 2019; Durmus et al., 2020; Wang et al., 2020; Deutsch and Roth, 2021). These proposed metrics follow the same basic framework (described in Section 2.1) and primarily differ in the choice of off-the-shelf models used for the different framework components (discussed in Section 2.2).
2.1 Basic Framework
Given a source document $D$ and a generated summary $S$, QA-based metrics output a summary-level factuality score $y_S$ that denotes the factual consistency of $S$. This includes the following steps (also outlined in Figure 2):
1. Answer Selection: First, candidate answer spans $a_i \in S$ are extracted. These correspond to the base set of facts that are compared against the source document $D$. Metrics evaluated in this work (Scialom et al., 2021; Fabbri et al., 2022) consider all noun phrases and named entities in the generated summary as the answer candidate set, denoted by $\mathrm{span}(S)$.
2. Question Generation: Next, a question generation model ($G$) is used to generate questions for these answer candidates with the generated summary $S$ as context. Let $q_i = G(a_i, S)$ denote the corresponding question for span $a_i$.
3. Question Filtering: Questions for which the question answering ($A$) model's predicted answer $A(q_i, S)$ from the summary does not match the original span $a_i$ are discarded, i.e., when $a_i \neq A(q_i, S)$. This step is used to ensure that the effects of erroneous question generation do not percolate down the pipeline; however, answer spans that do not pass this phase cannot be evaluated by the method.
4. Question Answering: For each generated question $q_i$, the $A$ model is used to predict an answer using the source document $D$ as context. Let $p_i = A(q_i, D)$ denote the predicted answer.
5. Answer Comparison: Finally, the predicted answer $p_i$ is compared to the expected answer $a_i$ to compute a similarity score $\mathrm{sim}(p_i, a_i)$. The overall summary score $y_S$ is computed by averaging over all span-level similarity scores:
$$y_S = \frac{1}{|\mathrm{span}(S)|} \sum_{a_i \in \mathrm{span}(S)} \mathrm{sim}(A(q_i, D), a_i)$$
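To make the pipeline concrete, the following is a minimal Python sketch of these five steps, assuming spaCy for answer-span extraction and Hugging Face pipelines for QG and QA. The specific checkpoints (valhalla/t5-base-qg-hl, deepset/roberta-base-squad2) and the token-level F1 used for $\mathrm{sim}$ are illustrative placeholders rather than the exact components of the metrics studied in this paper.

```python
# Minimal sketch of the five-step QA-based factuality pipeline described above.
# The spaCy model and the two Hugging Face checkpoints are illustrative
# placeholders, not the exact components used by the evaluated metrics.
from collections import Counter

import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")                       # answer selection
qg = pipeline("text2text-generation",
              model="valhalla/t5-base-qg-hl")            # question generation (placeholder)
qa = pipeline("question-answering",
              model="deepset/roberta-base-squad2")       # question answering (placeholder)


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1, one common choice for sim(p_i, a_i)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def factuality_score(document: str, summary: str) -> float:
    # Step 1: answer selection -- noun phrases and named entities in the summary.
    summary_doc = nlp(summary)
    candidates = {s.text for s in list(summary_doc.noun_chunks) + list(summary_doc.ents)}

    scores = []
    for a_i in candidates:
        # Step 2: question generation with the summary as context
        # (the <hl> highlight format is specific to this placeholder QG model).
        highlighted = summary.replace(a_i, f"<hl> {a_i} <hl>", 1)
        q_i = qg(f"generate question: {highlighted}")[0]["generated_text"]

        # Step 3: question filtering -- discard q_i if answering it against the
        # summary itself does not recover a_i (implementations vary in strictness).
        if qa(question=q_i, context=summary)["answer"].strip().lower() != a_i.strip().lower():
            continue

        # Step 4: question answering against the source document.
        p_i = qa(question=q_i, context=document)["answer"]

        # Step 5: answer comparison.
        scores.append(token_f1(p_i, a_i))

    # Summary-level score y_S: average of the span-level similarity scores
    # (computed here only over spans that survive filtering).
    return sum(scores) / len(scores) if scores else 0.0
```

Note that the formula above averages over all of $\mathrm{span}(S)$, whereas this sketch scores only the spans that survive the filtering step; published implementations handle discarded spans in slightly different ways.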