Rich Knowledge Sources Bring Complex Knowledge Conflicts:
Recalibrating Models to Reflect Conflicting Evidence
Hung-Ting Chen Michael J.Q. Zhang Eunsol Choi
Department of Computer Science
The University of Texas at Austin
{hungtingchen, mjqzhang, eunsol}@utexas.edu
Abstract
Question answering models can use rich knowledge sources: up to one hundred retrieved passages and the parametric knowledge in a large-scale language model (LM). Prior work assumes that the information in such knowledge sources is consistent, paying little attention to how models blend information stored in their LM parameters with that from retrieved evidence documents. In this paper, we simulate knowledge conflicts (i.e., where parametric knowledge suggests one answer and different passages suggest different answers) and examine model behaviors. We find that retrieval performance heavily impacts which sources models rely on, and that current models mostly rely on non-parametric knowledge in their best-performing settings. We discover a troubling trend: contradictions among knowledge sources affect model confidence only marginally. To address this issue, we present a new calibration study, in which models are discouraged from presenting any single answer when faced with multiple conflicting answer candidates in the retrieved evidence.
1 Introduction
Traditionally, QA models have relied on retrieved documents to provide provenance for their answers (Chen et al., 2017). Recent studies (Petroni et al., 2019) have shown that large language models are able to retain vast amounts of factual knowledge seen during pretraining, and closed-book QA systems (Roberts et al., 2020) build upon this foundation by memorizing facts from QA finetuning. Retrieval-based generation approaches (Izacard and Grave, 2021; Lewis et al., 2020) emerge as the best of both worlds, generating free-form answers from the question paired with retrieved evidence documents. They further combine these parametric knowledge sources with a large number of retrieved evidence documents, achieving state-of-the-art performance on open retrieval QA datasets (Joshi et al., 2017; Kwiatkowski et al., 2019).
[Figure 1 illustration: For the question "Which country won the most medals in winter olympics?", the model's parametric knowledge (facts memorized during training) recalls that "The U.S. team had a historic Winter Games, winning an unprecedented 37 medals," while the non-parametric knowledge (documents retrieved at inference time) includes Passages 1 and 2 stating that Norway set the record with 39 total medals and Passage 3 stating that Germany set a record with 36 total medals. Faced with passages suggesting conflicting answers, the model should abstain from answering.]
Figure 1: Models can use both parametric and non-parametric knowledge sources. In this example, the answer could be the U.S., Norway, or Germany. We investigate, for a given question, which knowledge source was the most influential in producing the answer. The model should be able to abstain from answering on such examples, as it is difficult for the model to decide which answer candidate is correct.
Understanding how retrieval-based generation models combine information from parametric and non-parametric knowledge sources is crucial for interpreting and debugging such complex systems, particularly in adversarial and complex real-world scenarios where these sources may conflict with each other (see the example in Figure 1). This can help developers debug such models and help users estimate how much they should trust an answer (Ribeiro et al., 2016). Thus, we focus on the following core question: when provided with numerous evidence passages and a pretrained and finetuned language model, which knowledge source do models ground their answers in?
A recent study (Longpre et al., 2021) investigated this in a limited single-evidence-document setting. We expand this study to consider a more realistic scenario, where models consider multiple evidence passages (up to 100), and observe results diverging from their reported heavy reliance on parametric knowledge. We further simulate
a setting where a subset of evidence passages is perturbed to suggest a different answer, reflecting the realistic scenario where retrieval returns a mixed bag of information. Such scenarios are common in settings where some passages are updated with new information while other passages remain outdated (Shah et al., 2020; Zhang and Choi, 2021). Such conflicts can also occur when passages are adversarially edited to contain false information (Du et al., 2022), or when passages are authored by multiple people who have differing opinions about an answer (Chen et al., 2019).
Our extensive studies on two datasets (Joshi et al., 2017; Kwiatkowski et al., 2019) and two models (Izacard and Grave, 2020; Lewis et al., 2020) show that retrieval-based generation models are primarily extractive and are heavily influenced by the few most relevant documents rather than aggregating information over a large set of documents. Having established that models mostly rely on evidence passages rather than parametric knowledge, we evaluate how sensitive models are to semantic perturbations of the evidence documents (e.g., adding negation). We find that retrieval-based generation models behave similarly to extractive models, sharing their weakness of returning answer candidates with high confidence even after the context is modified to no longer support the answer (Ribeiro et al., 2020).
What should models do when confronted with conflicting knowledge sources? We propose a new calibration setting (Section 5), where a model is encouraged to abstain from proposing a single answer in such scenarios. We find that teaching models to abstain when there is more than one plausible answer is challenging, and that training a separate calibrator with augmented data helps moderately.
To summarize, we empirically test how QA models (Izacard and Grave, 2021; Lewis et al., 2020) use diverse knowledge sources. We present the first analysis of knowledge conflicts where (1) the model uses multiple passages, (2) knowledge conflicts arise from ambiguous and context-dependent user queries, and (3) there are knowledge conflicts between different passages. Our findings are as follows: when provided with a high-recall retriever, models rely almost exclusively on the evidence passages without hallucinating answers from parametric knowledge. When different passages suggest multiple conflicting answers, models prefer the answer that matches their parametric knowledge.
Model | Generative | Retrieval-Based | Multi-Passage
DPR   |            | X               |
REALM |            | X               |
T5    | X          |                 |
RAG   | X          | X               |
FiD   | X          | X               | X

Table 1: Overview of recent open retrieval QA approaches. Generative indicates whether the model generates the answer and can therefore produce answers not found in the retrieved documents. Retrieval-Based indicates whether the model uses retrieval to find relevant passages that help produce an answer. Multi-Passage indicates whether the system is able to model interactions between separate evidence passages.
Lastly, we identify various weaknesses of retrieval-based generation models, including their confidence scores not reflecting the existence of conflicting answers across knowledge sources. Our initial calibration study suggests that dissuading models from presenting a single answer in the presence of rich, potentially conflicting, knowledge sources is challenging and demands future study.
2 Background
We first describe the task setting, the QA models, and the calibrator used in our study.
We study open retrieval QA, where the goal is to find an appropriate answer $y$ for a given question $q$. Systems for open retrieval QA may also be provided with access to a knowledge corpus consisting of a large number of passages, $p$, which is used to help answer the question. We use the open retrieval split (Lee et al., 2019) of the NaturalQuestions dataset (NQ-Open) (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), and use Wikipedia as our knowledge corpus.[1]

[1] Following Lee et al. (2019), we use the English Wikipedia dump from Dec. 20, 2018. We use 100-word text segments as passages, following Karpukhin et al. (2020).
2.1 Model
We investigate two retrieval-based generation QA models: Fusion-in-Decoder (Izacard and Grave, 2021) and the Retrieval Augmented Generation model (Lewis et al., 2020). Both architectures have reader and retriever components, using the same dense passage retriever (Karpukhin et al., 2020), which learns embeddings of the question and passages and retrieves a fixed number ($N$) of passages that are most similar to the query embedding. They mainly differ in their reader architecture and learning objective, which we describe below.
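To make the shared retrieval step concrete, here is a minimal sketch of DPR-style dense retrieval: the question and each passage are embedded, and the $N$ passages with the highest inner-product similarity to the question embedding are returned. The `encode_question` and `encode_passage` functions are hypothetical stand-ins for the trained DPR encoders, not an actual library API.

```python
import numpy as np

def retrieve_top_n(question, passages, encode_question, encode_passage, n=100):
    """Sketch of DPR-style dense retrieval: score each passage by the inner
    product between its embedding and the question embedding, and return
    the n highest-scoring passages."""
    q_emb = encode_question(question)                         # shape: (d,)
    p_embs = np.stack([encode_passage(p) for p in passages])  # shape: (num_passages, d)
    scores = p_embs @ q_emb                                   # inner-product similarity
    top_idx = np.argsort(-scores)[:n]
    return [passages[i] for i in top_idx]
```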
Fusion-in-Decoder (FiD)
The reader model is based on a pretrained language model (specifically, T5-large (Raffel et al., 2020)). Each retrieved passage, $p_i$ ($i = 1, \dots, N$), is concatenated with the question, $q$, before being encoded by T5 to generate representations $[h^i_1, \dots, h^i_m]$, where $m$ is the length of the $i$-th passage prepended with the question. All $N$ passages are then concatenated to form a single sequence, $[h^1_1, \dots, h^1_m, \dots, h^N_1, \dots, h^N_m]$, which the decoder interacts with using cross-attention to generate the answer.[2]
We use the trained FiD (large) checkpoint provided by the authors for most of our analysis.[3] When evaluating models with access to different numbers of passages, we retrain the FiD model (pretrained weights loaded from T5-large) using 1, 5, 20, and 50 passages retrieved by DPR. Refer to Appendix A.2 for full model and training details.
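As a concrete illustration, below is a minimal sketch of the FiD encode-and-fuse layout using the Hugging Face T5 interface. It is not the authors' implementation: the prompt format and the choice to score a fixed candidate answer (rather than decode one) are assumptions for illustration.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").eval()

@torch.no_grad()
def fid_answer_logprob(question, passages, answer):
    """Log-probability of `answer` under a FiD-style fused representation."""
    # 1) Encode each (question, passage) pair independently.
    inputs = [f"question: {question} context: {p}" for p in passages]
    enc = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)
    hidden = model.encoder(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask).last_hidden_state
    # 2) Flatten the N per-passage representations into one long sequence,
    #    [h^1_1, ..., h^1_m, ..., h^N_1, ..., h^N_m].
    fused = hidden.reshape(1, -1, hidden.size(-1))
    fused_mask = enc.attention_mask.reshape(1, -1)
    # 3) Let the decoder cross-attend over the fused sequence to score the answer.
    labels = tokenizer(answer, return_tensors="pt").input_ids
    out = model(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                attention_mask=fused_mask, labels=labels)
    return -out.loss.item() * labels.size(1)  # sum of per-token log-probabilities
```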
Retrieval Augmented Generation (RAG)
RAG conditions on each retrieved evidence document individually to produce an answer, marginalizing the probability of producing an answer over all retrieved evidence documents.[4] By applying this constraint, RAG is able to jointly train the reader and retriever, at the cost of ignoring interactions between evidence documents. FiD, in contrast, is able to model such interactions during decoding, while its reader and retriever are completely disjoint.
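Concretely, in the RAG-Sequence formulation of Lewis et al. (2020), the answer probability is approximated by marginalizing a per-passage reader likelihood under the retriever distribution over the top-$N$ passages:

$p(y \mid q) \approx \sum_{i=1}^{N} p_\eta(p_i \mid q)\; p_\theta(y \mid q, p_i)$,

where $p_\eta$ is the retriever and $p_\theta$ is the reader. Each term conditions on a single passage, so no cross-passage interaction is modeled.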
Recent work has explored jointly training the reader and retriever in FiD (Izacard and Grave, 2020; Sachan et al., 2021; Yang and Seo, 2020), showing small gains. Table 1 summarizes different architectures, including two open-book approaches (Karpukhin et al., 2020; Guu et al., 2020), one closed-book approach (Roberts et al., 2020), and two retrieval-based generation approaches. As FiD is efficient and effective, we focus most of our analysis (Section 4 and Appendix B) on it. We only report RAG results on a few of our main analyses to verify that the general trends of the FiD model hold for RAG (which they typically do).
[2] We use the version proposed in Izacard and Grave (2020) with knowledge distillation from the reader.
[3] https://github.com/facebookresearch/FiD
[4] RAG also presents a model variant that marginalizes over the retrieved documents separately for each generated token, but it shows worse performance. We use the version at https://huggingface.co/facebook/rag-sequence-nq
2.2 Model Confidence Study
We analyze the model confidence score, asking a more nuanced question: is the model's confidence in the gold answer decreased after we perturb the knowledge sources? We compare the model confidence on the same example before and after perturbation. We determine the confidence of the model using either (1) the generation probability of the answer (i.e., the product of the probabilities of generating each token conditioned on all previously generated tokens) or (2) the confidence score of a separately trained answer calibrator, which provides a score indicating the probability of the model correctly predicting the answer for each example. We train a binary calibrator following prior work (Kamath et al., 2020; Zhang et al., 2021), using the gradient boosting library XGBoost (Chen and Guestrin, 2016). The goal of the calibrator is to enable selective question answering, equipping models to decide when to abstain from answering. Given an input question $q$ and a learned model $M_\theta$, the calibrator predicts whether the predicted answer $\hat{y} = M_\theta(q)$ will match the annotated answer $y$. We follow the calibrator settings from prior work (Zhang et al., 2021); details can be found in Appendix A.1.
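The following is a rough sketch of these two confidence measures; the calibrator features and XGBoost hyperparameters are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
import xgboost as xgb

def generation_probability(token_logprobs):
    """Confidence measure (1): the generation probability of the answer,
    i.e., the product of per-token probabilities (sum of log-probabilities)."""
    return float(np.exp(np.sum(token_logprobs)))

def train_calibrator(features, is_correct):
    """Confidence measure (2): a binary XGBoost calibrator predicting whether
    the model's answer will match the gold answer. `features` is a
    (num_examples, num_features) array; illustrative features could include
    the generation probability, top retrieval scores, and answer length."""
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(np.asarray(features), np.asarray(is_correct))
    return clf

# Selective QA: abstain when the calibrator's confidence is below a threshold.
# confidence = calibrator.predict_proba(features)[:, 1]  # P(prediction is correct)
# abstain = confidence < threshold
```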
3 When do retrieval-based generation models rely on parametric knowledge?
As an initial step in investigating whether retrieval-based generation models ground their answers in the retrieval corpus or in the pretrained language model's parametric knowledge, we evaluate whether models generate novel answers that are not present in the set of evidence documents. Unlike extractive QA models (Seo et al., 2017), generation-based approaches (Roberts et al., 2020; Izacard and Grave, 2021) do not require the evidence documents to contain the gold answer span. Thus, we first analyze whether they actually generate novel answer spans not found in the retrieved passages.
Model (Data) | Retrieval suc. | CBQA Diff % | Extractive % | Extractive EM | Abstractive % | Abstractive EM
FiD (NQ)  | Y (89%) | 68.4 | 98.3 | 59.6 | 1.7  | 0.8
          | N (11%) | 90.9 | 82.9 | -    | 17.1 | 21.3
          | Total   | 70.9 | 96.6 | 53.9 | 3.4  | 12.4
RAG (NQ)  | Y (63%) | 65.7 | 92.9 | 60.2 | 7.0  | 3.6
          | N (37%) | 88.3 | 57.9 | -    | 42.1 | 11.2
          | Total   | 74.2 | 79.8 | 43.9 | 20.2 | 9.6
FiD (TQA) | Y (88%) | 68.6 | 97.1 | 82.9 | 2.9  | 38.1
          | N (12%) | 89.9 | 69.6 | -    | 30.4 | 16.9
          | Total   | 71.1 | 93.8 | 75.5 | 6.2  | 25.6

Table 2: Performance of hybrid models on the NQ-Open (NQ) and TriviaQA (TQA) development sets, broken down by retrieval performance. Results are split by whether retrieval was successful, i.e., whether the gold answer string is within the top K retrieved documents (K = 100 for FiD; K = 5 for RAG): Y if so, N otherwise; the percentage in parentheses is the fraction of examples in each subset. CBQA Diff reports the proportion of predictions that do not match the CBQA model prediction. '-' indicates the cell's value is zero by definition.

Table 2 reports how often models generate a span not found in the evidence passages, split by the retrieval performance on the NQ-Open (Kwiatkowski et al., 2019; Lee et al., 2019) and TriviaQA (Joshi et al., 2017) development sets. We observe that models typically copy a span from the evidence passages, only generating novel spans for 3.4%/6.2% of examples in NQ/TriviaQA for FiD and 20.2% for RAG in NQ. Even for the small subset of examples where the retrieved documents do not contain the answer string, FiD remains extractive
for 82.9%/69.6% of examples in NQ/TriviaQA. In contrast, for RAG, where retrieved documents frequently miss the gold answer (37% of examples), such copying behavior was less common, with the model generating unseen text for 42.1% of examples. The results suggest that reliance on retrieved documents increases as retriever performance increases. We also report the percentage of examples where the model prediction differs from that of a T5 closed-book question answering (CBQA) model trained on the same data.[5] Over 70% of examples have different answers from the CBQA model, even when the answer is abstractive, suggesting hybrid models use passages even when there is no exact string match.
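For reference, here is a minimal sketch of how one might test whether a prediction is extractive, i.e., appears as a span in the retrieved passages; the SQuAD-style answer normalization is an assumption for illustration, not necessarily the exact matching procedure we use.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_extractive(prediction, passages):
    """True if the normalized prediction appears as a span in any retrieved
    passage, i.e., the model copied rather than generated a novel answer."""
    pred = normalize(prediction)
    return any(pred in normalize(p) for p in passages)
```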
Revisiting the knowledge conflict study of Longpre et al. (2021)
This observation stands at odds with an earlier study on knowledge conflicts (Longpre et al., 2021), which simulates knowledge conflicts by substituting the existing answer with a new answer candidate in the evidence passage (see Table 3 for an example), creating a mismatch between parametric knowledge and the evidence document. They showed that models frequently rely on parametric knowledge, generating answers not present in the evidence passage. The original passage is minimally changed, yet now suggests an alternative, incorrect answer candidate that likely contradicts knowledge from the LM. The model produced the original answer 17% of the time, even when the answer no longer appears in the passage.

Question: When was the last time the Bills won their division?

Type           | Perturbation       | Passage                                          | Answer
None           | Original Entity    | ... the 1995 Bills won the AFC East ...          | 1995
Entity Sub.    | Random (Same Type) | ... the 1936 Bills won the AFC East ...          | 1936
Semantic Pert. | Negation           | ... the 1995 Bills did not win the AFC East ...  | -
Semantic Pert. | Modality           | ... the 1995 Bills might win the AFC East ...    | -
Semantic Pert. | Future             | ... the 1995 Bills will win the AFC East ...     | -
Semantic Pert. | Infilling          | ... the 1995 Bills lost the AFC East             | -

Table 3: Example perturbations. Entity substitutions modify the passage by replacing the answer entity mention with another answer candidate of the same entity type; given the modified passage, the new answer is the substituted entity. Semantic perturbations invalidate the previous answer without introducing a new answer.

[5] The training details are in Appendix A.2.
We identify that the main difference in their experimental setup is the use of a single evidence passage rather than multiple evidence passages. We revisit their study, as the single-document setting is impractical. Most open retrieval QA models (Lewis et al., 2020; Karpukhin et al., 2020; Izacard and Grave, 2021) are trained with multiple passages to make up for imperfect passage retrieval. According to the answer recall in Tables 4 and 5, when the model is provided with 100 passages, the correct span is available nearly 90% of the time (compared to at most 50% when provided with one passage); thus, the model remains extractive.
Following their setup, we only evaluate on examples that the model has answered correctly (as perturbing examples where models are already confused is unnecessary) and where the answer is an entity.[6] We then substitute every answer entity mention in all evidence passages with a random entity of the same type sampled from the training data.[7] All manipulation was done only at inference time, after the passages are retrieved.
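A minimal sketch of this substitution procedure is shown below; the `entity_pool` mapping and the list of answer mentions are assumed structures for illustration.

```python
import random

def substitute_answer_entity(passages, answer_mentions, entity_pool, entity_type):
    """Replace every mention of the original answer entity in every retrieved
    passage with a random entity of the same type sampled from the training
    data. Applied only at inference time, after retrieval."""
    substitute = random.choice(entity_pool[entity_type])
    perturbed = []
    for passage in passages:
        for mention in answer_mentions:
            passage = passage.replace(mention, substitute)
        perturbed.append(passage)
    return perturbed, substitute  # the substitute becomes the new expected answer
```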
We report the exact match score against the original answer. Prior to perturbation, the exact match score against the original answer is 100%. We also report the exact match score against the substituted answer and

[6] This process removes roughly 70-80% of examples in the NQ dataset and 60% in the TriviaQA dataset. Because of this filtering process, each row in Tables 4 and 5 is its own subset of the data.
[7] The entity type is coarsely defined as person, date, numeric, organization, or location.