Calibrating Factual Knowledge in Pretrained Language Models
Qingxiu Dong1, Damai Dai1, Yifan Song1, Jingjing Xu2, Zhifang Sui1 and Lei Li3
1MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University
2Shanghai AI Lab 3University of California, Santa Barbara
dqx@stu.pku.edu.cn, {daidamai,yfsong,jingjingxu,szf}@pku.edu.cn,
lilei@cs.ucsb.edu
Abstract
Previous literature has shown that Pretrained
Language Models (PLMs) can store factual
knowledge. However, we find that facts stored
in PLMs are not always correct. This mo-
tivates us to explore a fundamental question:
How do we calibrate factual knowledge in
PLMs without re-training from scratch? In this
work, we propose a simple and lightweight
method CALINET to achieve this goal. To
be specific, we first detect whether PLMs can
learn the right facts via a contrastive score be-
tween right and fake facts. If not, we then use
a lightweight method to add and adapt new pa-
rameters to specific factual texts. Experiments
on the knowledge probing task show the cal-
ibration effectiveness and efficiency. In ad-
dition, through closed-book question answer-
ing, we find that the calibrated PLM pos-
sesses knowledge generalization ability after
fine-tuning. Beyond the calibration perfor-
mance, we further investigate and visualize the
knowledge calibration mechanism. The code
and data are available at https://github.com/dqxiu/CaliNet.
1 Introduction
Recently, Pretrained Language Models (PLMs)
have improved performance on various Natural
Language Processing (NLP) tasks (Devlin et al.,
2019;Raffel et al.,2020;Brown et al.,2020).
Probing tasks like LAMA (Petroni et al.,2019;
Elazar et al.,2021;Jiang et al.,2020) have shown
that PLMs can store factual knowledge and act as
knowledge bases. Leveraging knowledge in PLMs
can benefit knowledge-intensive downstream tasks
such as fact checking and question answering (Lee
et al.,2020;Bouraoui et al.,2020;Roberts et al.,
2020a). However, knowledge stored in PLMs may
have factual errors, which hinder the performance
in downstream tasks (Elazar et al., 2021; Cao et al., 2021a). It is essential and fundamental to detect and calibrate false facts stored in a PLM.

Figure 1: Illustration of knowledge calibration. Knowledge stored in PLMs may contain factual errors, which impair model performance on question answering and generation. Knowledge calibration aims to rectify such incorrect knowledge.
In order to deal with the false facts, previous
work focuses on complementing or modifying
knowledge for a specific downstream task. Yao
et al. (2022) proposed retrieving external knowl-
edge during fine-tuning. Cao et al. (2021b) modi-
fied specific knowledge after finetuning. However,
these methods do not generalize to multiple tasks.
In this paper, we explore a task-agnostic method
to directly calibrate general factual knowledge in
PLMs without re-training from scratch. We aim to
correct the false facts in PLMs. Since every sin-
gle fact has multiple surfaces, we also expect that
the calibrated knowledge should be generalizable
to various text surfaces. Figure 1 illustrates the
process of calibration. First, we detect the false
knowledge in PLMs with a Contrastive Knowledge
Assessing (CKA) method (demonstrated in Fig-
ure 2). Since PLMs make black-box decisions, we evaluate them via their predictions for simplicity. The key motivation behind CKA is a plain
argument that a PLM correctly learns a fact if and
only if the model assigns the right fact higher scores
than possible negative facts. For the detected false knowledge, we then propose CALINET to calibrate it by telling the PLM what the right facts are. Our approach calibrates the false knowledge by fine-tuning a small number of new parameters while the original PLM parameters stay fixed during calibration. Inspired by Dai
et al. (2022) who state that the Feed-Forward Net-
works (FFNs) in PLMs store factual knowledge, we
extend a specific FFN in the PLM with a calibrating
FFN, which consists of several calibration memory
slots. As shown in Figure 3, without modifying
parameters in the original PLM, our approach cal-
ibrates the false knowledge through paraphrased
natural sentences that express the corresponding
correct facts.
Extensive experiments on probing tasks and
question answering tasks demonstrate that CA-
LINET calibrates false facts in PLMs efficiently
and exhibits a remarkable generalization ability.
We also analyze the calibration memory slots and
the calibration mechanism to better understand how
the proposed method works. Further, we explain
how and where CALINET calibrates the factual
knowledge in a PLM by tracing the evolution of
the model prediction.
In summary, our contributions are three-fold:
• We propose Contrastive Knowledge Assessment to evaluate factual knowledge stored in PLMs. The assessment shows that nearly 50% of facts randomly sampled from T-REx (ElSahar et al., 2018) are stored incorrectly in PLMs.
• We propose CALINET to calibrate incorrect factual knowledge in PLMs. Without compromising parameters in the original PLMs, our method can rectify incorrect knowledge and generalize broadly.
• We investigate how CALINET works via its calibration memory slots.
2 Contrastive Knowledge Assessment
The first step for calibration is to detect which
wrong facts are learned by PLMs. We propose
Contrastive Knowledge Assessment (CKA) and
implement it to identify false knowledge in PLMs.
Traditional evaluation usually adopts rank-based
metrics. It evaluates a PLM based on how highly
it ranks the ground truth entity against other enti-
ties. However, it comes with two main problems. The first is the problem of inexhaustible answers: the rank-based method fails to assess PLMs when multiple predictions are valid, since only one entity can rank first even though several answers may be correct. The second is the problem of frequency bias: the ranking is particularly susceptible to token frequency in the pretraining corpus. When the tail entity o frequently co-occurs with a head entity s, even if their co-occurrence expresses nothing about a specific fact, the model will still assign o a high rank when assessing this fact.

Figure 2: CKA assesses the knowledge stored in PLMs in a contrastive manner. The probing set includes one positive probing prompt and several negative probing prompts. For simplicity, we set α = 0.
To address these limitations, we propose CKA to
detect the false factual knowledge stored in PLMs.
The core idea is assessing model prediction under
a positive right fact and negative wrong facts in
a contrastive manner. For each fact, we sample a
prompt to transform it into natural text.
Let the triplet ⟨s, r, o⟩ denote a correct fact, where s and o denote the subject entity and the object entity, respectively. We define r as the correct relation in a positive probing prompt, and r′ as the incorrect relation in a negative probing prompt.¹ For a PLM M, we consider the probability it assigns to o given ⟨s, r⟩ and ⟨s, r′⟩. As ⟨s, r, o⟩ is correct and ⟨s, r′, o⟩ is erroneous, P_M(o | s, r) should be larger than P_M(o | s, r′) if M knows the fact. Thus, CKA calculates the factual correctness of a fact ⟨s, r, o⟩ for the model M by

\mathrm{CKA}_M(s, r, o) = \frac{P_M(o \mid s, r) + \alpha}{\mathbb{E}_{r'}\left[P_M(o \mid s, r')\right] + \alpha}, \qquad (1)

where α is a smoothing factor. For a more stable comparison, we sample multiple erroneous relations r′ for negative probing prompts and calculate the expectation of the corresponding P_M(o | s, r′).
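To make Eq. (1) concrete, the following Python sketch computes the CKA score for a single fact. The prob_of_object callable is an assumed interface for querying P_M(o | prompt) from a PLM, and the toy probabilities mirror Figure 2 rather than real model outputs; this is an illustration, not the paper's released implementation.

```python
# Minimal sketch of the CKA score in Eq. (1).
from typing import Callable, Sequence

def cka_score(
    prob_of_object: Callable[[str], float],  # assumed PLM scoring interface: P_M(o | prompt)
    positive_prompt: str,
    negative_prompts: Sequence[str],
    alpha: float = 0.0,
) -> float:
    """CKA(s, r, o) = (P_M(o | s, r) + alpha) / (E_{r'}[P_M(o | s, r')] + alpha)."""
    pos = prob_of_object(positive_prompt)
    neg_expectation = sum(prob_of_object(p) for p in negative_prompts) / len(negative_prompts)
    return (pos + alpha) / (neg_expectation + alpha)

# Toy example mirroring Figure 2 (alpha = 0): probabilities assigned to "Hawaii".
probs = {
    "Obama was born in": 0.09,       # positive prompt
    "Obama died in": 0.01,           # negative prompts
    "Obama worked in": 0.02,
    "Obama got married in": 0.01,
}
score = cka_score(lambda p: probs[p], "Obama was born in",
                  ["Obama died in", "Obama worked in", "Obama got married in"])
print(round(score, 2))  # 0.09 / ((0.01 + 0.02 + 0.01) / 3) ≈ 6.75
```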
In our implementation, the templates of the positive prompts come from LAMA (Petroni et al., 2019), and the templates of the negative prompts are manually designed to guarantee their quality. The negative prompts have contradictory semantics with the positive prompts but still prompt for the same type of entities. For example, the positive prompt template of <x, subclass of, y> is “[X] is the subclass of [Y]”, and a negative prompt template can be “[X] is the parent class of [Y]”.

¹ Our contrastive assessing framework is not limited in which part is replaced for contrast, but relation replacement is more practical than entity replacement because relations are limited compared with entities.
An example of calculating the CKA score is shown in Figure 2. Further, we can set a threshold (usually < 1.0) on the CKA score to detect false knowledge in PLMs.

We compare the CKA score with the rank-based assessment used by previous work (Petroni et al., 2019) to show our advantages. As shown in Table 1, the rank-based knowledge assessment suffers from inexhaustible answers and frequency bias. In contrast, CKA evaluates each tail entity o independently, so we no longer need to know all the other valid objects. In addition, s appears in both the numerator and the denominator of the CKA score, which neutralizes the influence of the frequency bias.
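As a usage note, flagging false knowledge then reduces to a threshold test on the CKA score. The sketch below reuses the cka_score function from the example above; the ProbedFact container and the detect_false_facts helper are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class ProbedFact:
    positive_prompt: str
    negative_prompts: Sequence[str]
    prob_of_object: Callable[[str], float]  # assumed PLM scoring interface

def detect_false_facts(facts: Sequence[ProbedFact], threshold: float = 1.0) -> List[ProbedFact]:
    # A CKA score below 1.0 means the negative prompts receive a higher probability
    # than the positive one on average, i.e. the fact is stored incorrectly.
    return [f for f in facts
            if cka_score(f.prob_of_object, f.positive_prompt, f.negative_prompts) < threshold]
```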
3 Knowledge Calibration
The CKA method outputs which wrong facts a
PLM learns. This section describes how we cali-
brate them.
Suppose that we have detected k false facts in a PLM. We aim to calibrate them to the correct ones so that downstream tasks will not access false factual knowledge from the PLM. Previous work (Geva et al., 2021; Dai et al., 2022) points out that FFNs in Transformers can be regarded as key-value memories that store factual knowledge. Inspired by this, we design an FFN-like CALINET and take advantage of the properties of FFNs to calibrate factual knowledge in PLMs directly. It is also worth noting that the proposed method could be applied to any part of the parameters; in this work, we apply it to the FFNs because they have been shown to take more responsibility for storing facts. In this section, we introduce the architecture of CALINET, the construction of the calibration data, and how to perform calibration on a pretrained model.
3.1 CALINET
In order to calibrate factual knowledge in PLMs,
we propose a lightweight CALINET to adjust the
output of FFNs in a pretrained Transformer. Let H ∈ ℝ^{n×d} denote the output of the attention layer in a Transformer block. The original FFN layer can be formulated as

\mathrm{FFN}(H) = \mathrm{GELU}(H K^{\top})\, V,
where K, V ∈ ℝ^{d_m×d} are the parameter matrices of the first and second linear layers in the FFN, respectively.

Figure 3: Illustration of CALINET. Calibration memory slots calibrate the erroneous knowledge stored in the FFN by adjusting its predicted token distributions.
Our CALINET shares the same architecture with the FFN but has a smaller intermediate dimension d_c. As shown in Figure 3, we regard each key-value pair as a calibration memory slot that stores factual knowledge. When computing the final FFN output, we add the output of CALINET to the original FFN output as an adjustment term for knowledge calibration, namely

\Delta\mathrm{FFN}(H) = \mathrm{GELU}(H \tilde{K}^{\top})\, \tilde{V},
\mathrm{FFN}'(H) = \mathrm{FFN}(H) + \Delta\mathrm{FFN}(H),

where K̃, Ṽ ∈ ℝ^{d_c×d} are the parameter matrices of CALINET, and FFN′(H) is the calibrated FFN output. Note that d_c ≪ d_m, so our method introduces only a small number of parameters.
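For readers who prefer code, here is a minimal PyTorch sketch of the calibrated computation FFN′(H) = FFN(H) + ΔFFN(H). The class name, dimension arguments, and freezing logic are illustrative assumptions rather than the released CALINET implementation; in practice K and V would be loaded from a pretrained model instead of being randomly initialized.

```python
import torch
import torch.nn as nn

class CalibratedFFN(nn.Module):
    """Sketch of FFN'(H) = FFN(H) + dFFN(H) from Section 3.1 (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, d_cal: int):
        super().__init__()
        # Original FFN of the pretrained model: parameters K, V (kept frozen).
        # In practice these weights come from the pretrained PLM.
        self.K = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_ff, d_model, bias=False)
        # CaliNet: a much smaller FFN (d_cal << d_ff) whose key-value pairs act
        # as calibration memory slots; only these parameters are trained.
        self.K_tilde = nn.Linear(d_model, d_cal, bias=False)
        self.V_tilde = nn.Linear(d_cal, d_model, bias=False)
        self.act = nn.GELU()
        for p in (*self.K.parameters(), *self.V.parameters()):
            p.requires_grad = False  # original PLM parameters stay fixed

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        ffn_out = self.V(self.act(self.K(hidden)))             # FFN(H)
        delta = self.V_tilde(self.act(self.K_tilde(hidden)))   # dFFN(H)
        return ffn_out + delta                                 # FFN'(H)

# Usage: wrap the FFN of one chosen Transformer block, e.g. with d_cal = 64.
ffn = CalibratedFFN(d_model=768, d_ff=3072, d_cal=64)
h = torch.randn(2, 16, 768)   # (batch, sequence, d_model)
print(ffn(h).shape)           # torch.Size([2, 16, 768])
```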
3.2 Calibration Data Construction
A fact can be expressed in multiple surface forms.
For example, “Obama was born in Hawaii.” and
“The birthplace of Obama is Hawaii.” describe the
same factual knowledge. In order to calibrate a fact
instead of merely fitting a specific surface form,
we consider multiple paraphrased expressions for
each fact. To be specific, we construct the calibra-
tion data based on the PARAREL dataset (Elazar
et al.,2021), which contains various surface form
templates for 38 relations. First, for each of the k detected false triplets, we fill the head entity or the tail entity into more than five paraphrased templates of its relation. Then, we replace the other entity with a mask token to be predicted.
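To illustrate this construction step, the sketch below fills paraphrased templates for one detected false triple and masks the entity to be predicted. The template strings, the helper name, and the T5-style sentinel token are toy assumptions for the example; the actual calibration data is built from PARAREL templates.

```python
# Minimal sketch of calibration data construction (Section 3.2).
CAPITAL_TEMPLATES = [            # toy paraphrases of the "capital" relation
    "The capital of [X] is [Y].",
    "[Y] is the capital of [X].",
    "[X]'s capital city is [Y].",
]

def build_calibration_examples(subject, obj, templates, mask_token="<extra_id_0>"):
    """Fill in one entity and mask the other, yielding (input, target) pairs."""
    examples = []
    for t in templates:
        # Mask the tail entity, keep the head entity.
        examples.append((t.replace("[X]", subject).replace("[Y]", mask_token), obj))
        # Mask the head entity, keep the tail entity.
        examples.append((t.replace("[X]", mask_token).replace("[Y]", obj), subject))
    return examples

for inp, target in build_calibration_examples("Sri Lanka", "Kotte", CAPITAL_TEMPLATES):
    print(inp, "->", target)
```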