Rich Knowledge Sources Bring Complex Knowledge Conflicts:
Recalibrating Models to Reflect Conflicting Evidence
Hung-Ting Chen Michael J.Q. Zhang Eunsol Choi
Department of Computer Science
The University of Texas at Austin
{hungtingchen, mjqzhang, eunsol}@utexas.edu
Abstract
Question answering models can use rich knowledge sources: up to one hundred retrieved passages and the parametric knowledge in a large-scale language model (LM). Prior work assumes that the information in such knowledge sources is consistent, paying little attention to how models blend information stored in their LM parameters with that from retrieved evidence documents. In this paper, we simulate knowledge conflicts (i.e., where parametric knowledge suggests one answer and different passages suggest different answers) and examine model behaviors. We find that retrieval performance heavily impacts which sources models rely on, and that current models mostly rely on non-parametric knowledge in their best-performing settings. We discover a troubling trend: contradictions among knowledge sources affect model confidence only marginally. To address this issue, we present a new calibration study, in which models are discouraged from presenting any single answer when faced with multiple conflicting answer candidates in the retrieved evidence.
1 Introduction
Traditionally, QA models have relied on retrieved documents to provide provenance for their answers (Chen et al., 2017). Recent studies (Petroni et al., 2019) have shown that large language models are able to retain vast amounts of factual knowledge seen during pretraining, and closed-book QA systems (Roberts et al., 2020) build upon this foundation by memorizing facts from QA finetuning. Retrieval-based generation approaches (Izacard and Grave, 2021; Lewis et al., 2020) emerge as the best of both worlds, generating free-form answers from the question paired with retrieved evidence documents. They further combine these parametric knowledge sources with a large number of retrieved evidence documents, achieving state-of-the-art performance on open retrieval QA datasets (Joshi et al., 2017; Kwiatkowski et al., 2019).
[Figure 1 illustration: For the question "Which country won the most medals in winter olympics?", the model's parametric knowledge (facts memorized during training) recalls that "The U.S. team had a historic Winter Games, winning an unprecedented 37 medals," while the non-parametric knowledge (documents retrieved at inference time) includes Passages 1 and 2 stating that Norway set the record with 39 total medals and Passage 3 stating that Germany set a record with 36 total medals. Faced with passages suggesting conflicting answers, the model should abstain from answering.]
Figure 1: Models can use both parametric and non-parametric knowledge sources. In this example, the answer could be the U.S., Norway, or Germany. We investigate, for a given question, which knowledge source was the most influential in producing the answer. The model should be able to abstain from answering on such examples, as it is difficult for the model to decide which answer candidate is correct.
Understanding how retrieval-based generation models combine information from parametric and non-parametric knowledge sources is crucial for interpreting and debugging such complex systems, particularly in adversarial and complex real-world scenarios where these sources may conflict with each other (see the example in Figure 1). This can help developers debug such models and help users estimate how much they should trust an answer (Ribeiro et al., 2016). Thus, we focus on the following core question: when provided with numerous evidence passages and a pretrained and finetuned language model, which knowledge source do models ground their answers in?
A recent study (Longpre et al., 2021) investigated this in a limited single-evidence-document setting. We expand this study to consider a more realistic scenario, where models consider multiple evidence passages (up to 100), and observe results diverging from their reported heavy reliance on parametric knowledge. We further simulate
a setting where a subset of evidence passages is perturbed to suggest a different answer, reflecting the realistic scenario where retrieval returns a mixed bag of information. Such scenarios are common in settings where some passages are updated with new information while other passages remain outdated (Shah et al., 2020; Zhang and Choi, 2021). Such conflicts can also occur when passages are adversarially edited to contain false information (Du et al., 2022), or when passages are authored by multiple people who have differing opinions about an answer (Chen et al., 2019).
Our extensive studies on two datasets (Joshi et al., 2017; Kwiatkowski et al., 2019) and two models (Izacard and Grave, 2020; Lewis et al., 2020) show that retrieval-based generation models are primarily extractive and are heavily influenced by the few most relevant documents rather than aggregating information over a large set of documents. Having established that models mostly rely on evidence passages rather than parametric knowledge, we evaluate how sensitive models are to semantic perturbations of the evidence documents (e.g., adding negation). We find that retrieval-based generation models behave similarly to extractive models, sharing their weakness of returning answer candidates with high confidence even after the context is modified to no longer support the answer (Ribeiro et al., 2020).
What should models do when confronted with conflicting knowledge sources? We propose a new calibration setting (Section 5), where a model is encouraged to abstain from proposing a single answer in such scenarios. We find that teaching models to abstain when there is more than one plausible answer is challenging, and that training a separate calibrator with augmented data helps moderately.
To summarize, we empirically test how QA models (Izacard and Grave, 2021; Lewis et al., 2020) use diverse knowledge sources. We present the first analysis of knowledge conflicts where (1) the model uses multiple passages, (2) knowledge conflicts arise from ambiguous and context-dependent user queries, and (3) there are knowledge conflicts between different passages. Our findings are as follows: when provided with a high-recall retriever, models rely almost exclusively on the evidence passages without hallucinating answers from parametric knowledge. When different passages suggest multiple conflicting answers, models prefer the answer that matches their parametric knowledge.
Model | Generative | Retrieval-Based | Multi-Passage
DPR   |            | X               |
REALM |            | X               |
T5    | X          |                 |
RAG   | X          | X               |
FiD   | X          | X               | X

Table 1: Overview of recent open retrieval QA approaches. Generative indicates whether the model generates the answer and can therefore produce answers not found in the retrieved documents. Retrieval-Based indicates whether the model uses retrieval to find relevant passages that help produce an answer. Multi-Passage indicates whether the system is able to model interactions between separate evidence passages.
Lastly, we identify various weaknesses of retrieval-based generation models, including their confidence scores not reflecting the existence of conflicting answers across knowledge sources. Our initial calibration study suggests that dissuading models from presenting a single answer in the presence of rich, potentially conflicting, knowledge sources is challenging and demands future study.
2 Background
We first describe the task setting, the QA models, and the calibrator used in our study.
We study open retrieval QA, where the goal is to find an appropriate answer $y$ for a given question $q$. Systems for open retrieval QA may also be provided with access to a knowledge corpus consisting of a large number of passages, $p$, which is used to help answer the question. We use the open retrieval split (Lee et al., 2019) of the NaturalQuestions dataset (NQ-Open) (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), and use Wikipedia as our knowledge corpus.[1]

[1] Following Lee et al. (2019), we use the English Wikipedia dump from Dec. 20, 2018. We use 100-word text segments as passages, following Karpukhin et al. (2020).
2.1 Model
We investigate two retrieval-based generation QA models: Fusion-in-Decoder (Izacard and Grave, 2021) and the Retrieval Augmented Generation model (Lewis et al., 2020). Both architectures have reader and retriever components, using the same dense passage retriever (Karpukhin et al., 2020), which learns embeddings of the question and passages and retrieves a fixed number ($N$) of passages that are most similar to the query embedding. They mainly differ in their reader architecture and learning objective, which we describe below.
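To make the shared retrieval step concrete, here is a minimal sketch of DPR-style dense retrieval: the question and each passage are embedded, and the $N$ passages with the highest inner-product similarity to the question embedding are returned. The `encode_question` and `encode_passage` functions are hypothetical stand-ins for the trained DPR encoders, not an actual library API.

```python
import numpy as np

def retrieve_top_n(question, passages, encode_question, encode_passage, n=100):
    """Sketch of DPR-style dense retrieval: score each passage by the inner
    product between its embedding and the question embedding, and return
    the n highest-scoring passages."""
    q_emb = encode_question(question)                         # shape: (d,)
    p_embs = np.stack([encode_passage(p) for p in passages])  # shape: (num_passages, d)
    scores = p_embs @ q_emb                                   # inner-product similarity
    top_idx = np.argsort(-scores)[:n]
    return [passages[i] for i in top_idx]
```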
Fusion-in-Decoder (FiD)
The reader model is based on a pretrained language model (specifically, T5-large (Raffel et al., 2020)). Each retrieved passage, $p_i$ ($i = 1, \dots, N$), is concatenated with the question, $q$, before being encoded by T5 to generate representations $[h^i_1, \dots, h^i_m]$, where $m$ is the length of the $i$-th passage prepended with the question. All $N$ passages are then concatenated to form a single sequence, $[h^1_1, \dots, h^1_m, \dots, h^N_1, \dots, h^N_m]$, which the decoder interacts with using cross-attention to generate the answer.[2]
We use the trained FiD (large) checkpoint provided by the authors for most of our analysis.[3] When evaluating models with access to different numbers of passages, we retrain the FiD model (pretrained weights loaded from T5-large) using 1, 5, 20, and 50 passages retrieved by DPR. Refer to Appendix A.2 for full model and training details.
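As a concrete illustration, below is a minimal sketch of the FiD encode-and-fuse layout using the Hugging Face T5 interface. It is not the authors' implementation: the prompt format and the choice to score a fixed candidate answer (rather than decode one) are assumptions for illustration.

```python
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").eval()

@torch.no_grad()
def fid_answer_logprob(question, passages, answer):
    """Log-probability of `answer` under a FiD-style fused representation."""
    # 1) Encode each (question, passage) pair independently.
    inputs = [f"question: {question} context: {p}" for p in passages]
    enc = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)
    hidden = model.encoder(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask).last_hidden_state
    # 2) Flatten the N per-passage representations into one long sequence,
    #    [h^1_1, ..., h^1_m, ..., h^N_1, ..., h^N_m].
    fused = hidden.reshape(1, -1, hidden.size(-1))
    fused_mask = enc.attention_mask.reshape(1, -1)
    # 3) Let the decoder cross-attend over the fused sequence to score the answer.
    labels = tokenizer(answer, return_tensors="pt").input_ids
    out = model(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                attention_mask=fused_mask, labels=labels)
    return -out.loss.item() * labels.size(1)  # sum of per-token log-probabilities
```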
Retrieval Augmented Generation (RAG)
RAG conditions on each retrieved evidence document individually to produce an answer, marginalizing the probability of producing an answer over all retrieved evidence documents.[4] By applying this constraint, RAG is able to jointly train the reader and retriever, at the cost of ignoring interactions between evidence documents. FiD, in contrast, is able to model such interactions during decoding, while its reader and retriever are completely disjoint.
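Concretely, in the RAG-Sequence formulation of Lewis et al. (2020), the answer probability is approximated by marginalizing a per-passage reader likelihood under the retriever distribution over the top-$N$ passages:

$p(y \mid q) \approx \sum_{i=1}^{N} p_\eta(p_i \mid q)\; p_\theta(y \mid q, p_i)$,

where $p_\eta$ is the retriever and $p_\theta$ is the reader. Each term conditions on a single passage, so no cross-passage interaction is modeled.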
Recent work has explored jointly training the reader and retriever in FiD (Izacard and Grave, 2020; Sachan et al., 2021; Yang and Seo, 2020), showing small gains. Table 1 summarizes different architectures, including two open-book approaches (Karpukhin et al., 2020; Guu et al., 2020), one closed-book approach (Roberts et al., 2020), and two retrieval-based generation approaches. As FiD is efficient and effective, we focus most of our analysis (Section 4 and Appendix B) on it. We only report RAG results on a few of our main analyses to verify that the general trends of the FiD model hold for RAG (which they typically do).
[2] We use the version proposed in Izacard and Grave (2020) with knowledge distillation from the reader.
[3] https://github.com/facebookresearch/FiD
[4] RAG also presents a model variant that marginalizes over the retrieved documents separately for each generated token, but it shows worse performance. We use the version at https://huggingface.co/facebook/rag-sequence-nq
2.2 Model Confidence Study
We analyze the model confidence score, asking a more nuanced question: is the model's confidence in the gold answer decreased after we perturb the knowledge sources? We compare the model confidence on the same example before and after perturbation. We determine the confidence of the model using either (1) the generation probability of the answer (i.e., the product of the probabilities of generating each token conditioned on all previously generated tokens) or (2) the confidence score of a separately trained answer calibrator, which provides a score indicating the probability of the model correctly predicting the answer for each example. We train a binary calibrator following prior work (Kamath et al., 2020; Zhang et al., 2021), using the gradient boosting library XGBoost (Chen and Guestrin, 2016). The goal of the calibrator is to enable selective question answering, equipping models to decide when to abstain from answering. Given an input question $q$ and a learned model $M_\theta$, the calibrator predicts whether the predicted answer $\hat{y} = M_\theta(q)$ will match the annotated answer $y$. We follow the calibrator settings from prior work (Zhang et al., 2021); details can be found in Appendix A.1.
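The following is a rough sketch of these two confidence measures; the calibrator features and XGBoost hyperparameters are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
import xgboost as xgb

def generation_probability(token_logprobs):
    """Confidence measure (1): the generation probability of the answer,
    i.e., the product of per-token probabilities (sum of log-probabilities)."""
    return float(np.exp(np.sum(token_logprobs)))

def train_calibrator(features, is_correct):
    """Confidence measure (2): a binary XGBoost calibrator predicting whether
    the model's answer will match the gold answer. `features` is a
    (num_examples, num_features) array; illustrative features could include
    the generation probability, top retrieval scores, and answer length."""
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(np.asarray(features), np.asarray(is_correct))
    return clf

# Selective QA: abstain when the calibrator's confidence is below a threshold.
# confidence = calibrator.predict_proba(features)[:, 1]  # P(prediction is correct)
# abstain = confidence < threshold
```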
3 When do retrieval-based generation models rely on parametric knowledge?
As an initial step in investigating whether retrieval-based generation models ground their answers in the retrieval corpus or in the pretrained language model's parametric knowledge, we evaluate whether models generate novel answers that are not present in the set of evidence documents. Unlike extractive QA models (Seo et al., 2017), generation-based approaches (Roberts et al., 2020; Izacard and Grave, 2021) do not require the evidence documents to contain the gold answer span. Thus, we first analyze whether they actually generate novel answer spans not found in the retrieved passages.
Model (Data) | Retrieval suc. | CBQA Diff % | Extractive % | Extractive EM | Abstractive % | Abstractive EM
FiD (NQ)  | Y (89%) | 68.4 | 98.3 | 59.6 | 1.7  | 0.8
          | N (11%) | 90.9 | 82.9 | -    | 17.1 | 21.3
          | Total   | 70.9 | 96.6 | 53.9 | 3.4  | 12.4
RAG (NQ)  | Y (63%) | 65.7 | 92.9 | 60.2 | 7.0  | 3.6
          | N (37%) | 88.3 | 57.9 | -    | 42.1 | 11.2
          | Total   | 74.2 | 79.8 | 43.9 | 20.2 | 9.6
FiD (TQA) | Y (88%) | 68.6 | 97.1 | 82.9 | 2.9  | 38.1
          | N (12%) | 89.9 | 69.6 | -    | 30.4 | 16.9
          | Total   | 71.1 | 93.8 | 75.5 | 6.2  | 25.6

Table 2: Performance of hybrid models on the NQ-Open (NQ) and TriviaQA (TQA) development sets, broken down by retrieval performance. Results are split by whether retrieval was successful, i.e., whether the gold answer string is within the top K retrieved documents (K = 100 for FiD; K = 5 for RAG): Y if so, N otherwise; the percentage in parentheses is the fraction of examples in each subset. CBQA Diff reports the proportion of predictions that do not match the CBQA model prediction. '-' indicates the cell's value is zero by definition.

Table 2 reports how often models generate a span not found in the evidence passages, split by the retrieval performance on the NQ-Open (Kwiatkowski et al., 2019; Lee et al., 2019) and TriviaQA (Joshi et al., 2017) development sets. We observe that models typically copy a span from the evidence passages, only generating novel spans for 3.4%/6.2% of examples in NQ/TriviaQA for FiD and 20.2% for RAG in NQ. Even for the small subset of examples where the retrieved documents do not contain the answer string, FiD remains extractive
for 82.9%/69.6% of examples in NQ/TriviaQA. In contrast, for RAG, where retrieved documents frequently miss the gold answer (37% of examples), such copying behavior was less common, with the model generating unseen text for 42.1% of examples. The results suggest that reliance on retrieved documents increases as retriever performance increases. We also report the percentage of examples where the model prediction differs from that of a T5 closed-book question answering (CBQA) model trained on the same data.[5] Over 70% of examples have different answers from the CBQA model, even when the answer is abstractive, suggesting hybrid models use passages even when there is no exact string match.
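For reference, here is a minimal sketch of how one might test whether a prediction is extractive, i.e., appears as a span in the retrieved passages; the SQuAD-style answer normalization is an assumption for illustration, not necessarily the exact matching procedure we use.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def is_extractive(prediction, passages):
    """True if the normalized prediction appears as a span in any retrieved
    passage, i.e., the model copied rather than generated a novel answer."""
    pred = normalize(prediction)
    return any(pred in normalize(p) for p in passages)
```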
Revisiting the knowledge conflict study of Longpre et al. (2021)
This observation stands at odds with an earlier study on knowledge conflicts (Longpre et al., 2021), which simulates knowledge conflicts by substituting the existing answer with a new answer candidate in the evidence passage (see Table 3 for an example), creating a mismatch between parametric knowledge and the evidence document. They showed that models frequently rely on parametric knowledge, generating answers not present in the evidence passage. The original passage is minimally changed, yet now suggests an alternative, incorrect answer candidate that likely contradicts knowledge from the LM. The model produced the original answer 17% of the time, even when the answer no longer appears in the passage.

Question: When was the last time the Bills won their division?

Type           | Perturbation       | Passage                                          | Answer
None           | Original Entity    | ... the 1995 Bills won the AFC East ...          | 1995
Entity Sub.    | Random (Same Type) | ... the 1936 Bills won the AFC East ...          | 1936
Semantic Pert. | Negation           | ... the 1995 Bills did not win the AFC East ...  | -
Semantic Pert. | Modality           | ... the 1995 Bills might win the AFC East ...    | -
Semantic Pert. | Future             | ... the 1995 Bills will win the AFC East ...     | -
Semantic Pert. | Infilling          | ... the 1995 Bills lost the AFC East             | -

Table 3: Example perturbations. Entity substitutions modify the passage by replacing the answer entity mention with another answer candidate of the same entity type; given the modified passage, the new answer is the substituted entity. Semantic perturbations invalidate the previous answer without introducing a new answer.

[5] The training details are in Appendix A.2.
We identify that the main difference in their experimental setup is the use of a single evidence passage rather than multiple evidence passages. We revisit their study, as the single-document setting is impractical. Most open retrieval QA models (Lewis et al., 2020; Karpukhin et al., 2020; Izacard and Grave, 2021) are trained with multiple passages to make up for imperfect passage retrieval. According to the answer recall in Tables 4 and 5, when the model is provided with 100 passages, the correct span is available nearly 90% of the time (compared to at most 50% when provided with one passage); thus, the model remains extractive.
Following their setup, we only evaluate on examples that the model has answered correctly (as perturbing examples where models are already confused is unnecessary) and where the answer is an entity.[6] We then substitute every answer entity mention in all evidence passages with a random entity of the same type sampled from the training data.[7] All manipulation was done only at inference time, after the passages are retrieved.
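A minimal sketch of this substitution procedure is shown below; the `entity_pool` mapping and the list of answer mentions are assumed structures for illustration.

```python
import random

def substitute_answer_entity(passages, answer_mentions, entity_pool, entity_type):
    """Replace every mention of the original answer entity in every retrieved
    passage with a random entity of the same type sampled from the training
    data. Applied only at inference time, after retrieval."""
    substitute = random.choice(entity_pool[entity_type])
    perturbed = []
    for passage in passages:
        for mention in answer_mentions:
            passage = passage.replace(mention, substitute)
        perturbed.append(passage)
    return perturbed, substitute  # the substitute becomes the new expected answer
```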
We report the exact match score against the original answer. Prior to perturbation, the exact match score against the original answer is 100%. We also report the exact match score against the substituted answer and

[6] This process removes roughly 70-80% of examples in the NQ dataset and 60% in the TriviaQA dataset. Because of this filtering process, each row in Tables 4 and 5 is its own subset of the data.
[7] The entity type is coarsely defined as person, date, numeric, organization, or location.