Calibrating Factual Knowledge in Pretrained Language Models
Qingxiu Dong1, Damai Dai1, Yifan Song1, Jingjing Xu2, Zhifang Sui1 and Lei Li3
1MOE Key Lab of Computational Linguistics, School of Computer Science, Peking University
2Shanghai AI Lab 3University of California, Santa Barbara
dqx@stu.pku.edu.cn, {daidamai,yfsong,jingjingxu,szf}@pku.edu.cn,
lilei@cs.ucsb.edu
Abstract
Previous literature has shown that Pretrained
Language Models (PLMs) can store factual
knowledge. However, we find that facts stored
in PLMs are not always correct. This mo-
tivates us to explore a fundamental question:
How do we calibrate factual knowledge in
PLMs without re-training from scratch? In this
work, we propose a simple and lightweight
method CALINET to achieve this goal. To
be specific, we first detect whether PLMs can
learn the right facts via a contrastive score be-
tween right and fake facts. If not, we then use
a lightweight method to add and adapt new pa-
rameters to specific factual texts. Experiments
on the knowledge probing task show the cal-
ibration effectiveness and efficiency. In ad-
dition, through closed-book question answer-
ing, we find that the calibrated PLM pos-
sesses knowledge generalization ability after
fine-tuning. Beyond the calibration perfor-
mance, we further investigate and visualize the
knowledge calibration mechanism. The code
and data are available at https://github.com/dqxiu/CaliNet.
1 Introduction
Recently, Pretrained Language Models (PLMs)
have improved performance on various Natural
Language Processing (NLP) tasks (Devlin et al.,
2019;Raffel et al.,2020;Brown et al.,2020).
Probing tasks like LAMA (Petroni et al.,2019;
Elazar et al.,2021;Jiang et al.,2020) have shown
that PLMs can store factual knowledge and act as
knowledge bases. Leveraging knowledge in PLMs
can benefit knowledge-intensive downstream tasks
such as fact checking and question answering (Lee
et al.,2020;Bouraoui et al.,2020;Roberts et al.,
2020a). However, knowledge stored in PLMs may
have factual errors, which hinder the performance
in downstream tasks (Elazar et al., 2021; Cao et al., 2021a). It is essential and fundamental to detect and calibrate false facts stored in a PLM.

Figure 1: Illustration of knowledge calibration. Knowledge stored in PLMs may contain factual errors, which impair model performance on question answering and generation. Knowledge calibration aims to rectify such incorrect knowledge.
In order to deal with the false facts, previous
work focuses on complementing or modifying
knowledge for a specific downstream task. Yao
et al. (2022) proposed retrieving external knowl-
edge during fine-tuning. Cao et al. (2021b) modi-
fied specific knowledge after finetuning. However,
these methods do not generalize to multiple tasks.
In this paper, we explore a task-agnostic method
to directly calibrate general factual knowledge in
PLMs without re-training from scratch. We aim to
correct the false facts in PLMs. Since every sin-
gle fact has multiple surfaces, we also expect that
the calibrated knowledge should be generalizable
to various text surfaces. Figure 1 illustrates the
process of calibration. First, we detect the false
knowledge in PLMs with a Contrastive Knowledge
Assessing (CKA) method (demonstrated in Fig-
ure 2). Since PLMs make black-box decisions, we evaluate them via their predictions for simplicity. The key motivation behind CKA is a plain
argument that a PLM correctly learns a fact if and
only if the model assigns the right fact higher scores
than possible negative facts. For the detected false knowledge, we then propose CALINET to calibrate it by telling the PLM what the right facts are. Our approach calibrates the false knowledge by fine-tuning a small number of new parameters while the original PLM parameters stay fixed during calibration. Inspired by Dai
et al. (2022) who state that the Feed-Forward Net-
works (FFNs) in PLMs store factual knowledge, we
extend a specific FFN in the PLM with a calibrating
FFN, which consists of several calibration memory
slots. As shown in Figure 3, without modifying
parameters in the original PLM, our approach cal-
ibrates the false knowledge through paraphrased
natural sentences that express the corresponding
correct facts.
Extensive experiments on probing tasks and
question answering tasks demonstrate that CA-
LINET calibrates false facts in PLMs efficiently
and exhibits a remarkable generalization ability.
We also analyze the calibration memory slots and
the calibration mechanism to better understand how
the proposed method works. Further, we explain
how and where CALINET calibrates the factual
knowledge in a PLM by tracing the evolution of
the model prediction.
In summary, our contributions are three-fold:
• We propose Contrastive Knowledge Assessment to evaluate factual knowledge stored in PLMs. The assessment shows that nearly 50% of facts randomly sampled from T-REx (ElSahar et al., 2018) are stored incorrectly in PLMs.
• We propose CALINET to calibrate incorrect factual knowledge in PLMs. Without compromising parameters in the original PLMs, our method can rectify incorrect knowledge and generalize broadly.
• We investigate how CALINET works via its calibration memory slots.
2 Contrastive Knowledge Assessment
The first step for calibration is to detect which
wrong facts are learned by PLMs. We propose
Contrastive Knowledge Assessment (CKA) and
implement it to identify false knowledge in PLMs.
Traditional evaluation usually adopts rank-based
metrics. It evaluates a PLM based on how highly
it ranks the ground truth entity against other enti-
ties. However, it comes with two main problems. The first is the problem of inexhaustible answers: the rank-based method fails to assess PLMs when multiple predictions are valid, since only one entity can rank first even though several answers may be correct. The second is the problem of frequency bias: the ranking is particularly susceptible to token frequency in the pretraining corpus. When the tail entity o frequently co-occurs with a head entity s, even if their co-occurrence expresses nothing about a specific fact, the model will still assign o a high rank when assessing this fact.

Figure 2: CKA assesses the knowledge stored in PLMs in a contrastive manner. The probing set includes one positive probing prompt and several negative probing prompts. For simplicity, we set α = 0.
To address these limitations, we propose CKA to
detect the false factual knowledge stored in PLMs.
The core idea is assessing model prediction under
a positive right fact and negative wrong facts in
a contrastive manner. For each fact, we sample a
prompt to transform it into natural text.
Let the triplet ⟨s, r, o⟩ denote a correct fact, where s and o denote the subject entity and the object entity, respectively. We define r as the correct relation in a positive probing prompt, and r′ as the incorrect relation in a negative probing prompt.¹ For a PLM M, we consider the probability it assigns to o given ⟨s, r⟩ and ⟨s, r′⟩. As ⟨s, r, o⟩ is correct and ⟨s, r′, o⟩ is erroneous, P_M(o | s, r) should be larger than P_M(o | s, r′) if M knows the fact. Thus, CKA calculates the factual correctness of a fact ⟨s, r, o⟩ for the model M by

\mathrm{CKA}_M(s, r, o) = \frac{P_M(o \mid s, r) + \alpha}{\mathbb{E}_{r'}\left[P_M(o \mid s, r')\right] + \alpha}, \qquad (1)

where α is a smoothing factor. For a more stable comparison, we sample multiple erroneous relations r′ for negative probing prompts and calculate the expectation of the corresponding P_M(o | s, r′).
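To make Eq. (1) concrete, the following Python sketch computes the CKA score for a single fact. The prob_of_object callable is an assumed interface for querying P_M(o | prompt) from a PLM, and the toy probabilities mirror Figure 2 rather than real model outputs; this is an illustration, not the paper's released implementation.

```python
# Minimal sketch of the CKA score in Eq. (1).
from typing import Callable, Sequence

def cka_score(
    prob_of_object: Callable[[str], float],  # assumed PLM scoring interface: P_M(o | prompt)
    positive_prompt: str,
    negative_prompts: Sequence[str],
    alpha: float = 0.0,
) -> float:
    """CKA(s, r, o) = (P_M(o | s, r) + alpha) / (E_{r'}[P_M(o | s, r')] + alpha)."""
    pos = prob_of_object(positive_prompt)
    neg_expectation = sum(prob_of_object(p) for p in negative_prompts) / len(negative_prompts)
    return (pos + alpha) / (neg_expectation + alpha)

# Toy example mirroring Figure 2 (alpha = 0): probabilities assigned to "Hawaii".
probs = {
    "Obama was born in": 0.09,       # positive prompt
    "Obama died in": 0.01,           # negative prompts
    "Obama worked in": 0.02,
    "Obama got married in": 0.01,
}
score = cka_score(lambda p: probs[p], "Obama was born in",
                  ["Obama died in", "Obama worked in", "Obama got married in"])
print(round(score, 2))  # 0.09 / ((0.01 + 0.02 + 0.01) / 3) ≈ 6.75
```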
In our implementation, the templates of the positive prompts come from LAMA (Petroni et al., 2019), and the templates of the negative prompts are manually designed to guarantee their quality. The negative prompts have contradictory semantics with the positive prompts but still prompt for the same type of entities. For example, the positive prompt template of <x, subclass of, y> is “[X] is the subclass of [Y]”, and a negative prompt template can be “[X] is the parent class of [Y]”.

¹ Our contrastive assessing framework is not limited in which part is replaced for contrast, but relation replacement is more practical than entity replacement because relations are limited compared with entities.
An example of calculating the CKA score is shown in Figure 2. Further, we can set a threshold (usually < 1.0) on the CKA score to detect false knowledge in PLMs.

We compare the CKA score with the rank-based assessment used by previous work (Petroni et al., 2019) to show our advantages. As shown in Table 1, the rank-based knowledge assessment suffers from inexhaustible answers and frequency bias. In contrast, CKA evaluates each tail entity o independently, so we no longer need to know all the other valid objects. In addition, s appears in both the numerator and the denominator of the CKA score, which neutralizes the influence of the frequency bias.
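As a usage note, flagging false knowledge then reduces to a threshold test on the CKA score. The sketch below reuses the cka_score function from the example above; the ProbedFact container and the detect_false_facts helper are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class ProbedFact:
    positive_prompt: str
    negative_prompts: Sequence[str]
    prob_of_object: Callable[[str], float]  # assumed PLM scoring interface

def detect_false_facts(facts: Sequence[ProbedFact], threshold: float = 1.0) -> List[ProbedFact]:
    # A CKA score below 1.0 means the negative prompts receive a higher probability
    # than the positive one on average, i.e. the fact is stored incorrectly.
    return [f for f in facts
            if cka_score(f.prob_of_object, f.positive_prompt, f.negative_prompts) < threshold]
```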
3 Knowledge Calibration
The CKA method outputs which wrong facts a
PLM learns. This section describes how we cali-
brate them.
Suppose that we have detected k false facts in a PLM. We aim to calibrate them to the correct ones so that downstream tasks will not access false factual knowledge from the PLM. Previous work (Geva et al., 2021; Dai et al., 2022) points out that FFNs in Transformers can be regarded as key-value memories that store factual knowledge. Inspired by this, we design an FFN-like CALINET and take advantage of the properties of FFNs to calibrate factual knowledge in PLMs directly. It is also worth noting that the proposed method could be applied to any part of the parameters; in this work, we apply it to the FFNs because they have been shown to take more responsibility for storing facts. In this section, we introduce the architecture of CALINET, the construction of the calibration data, and how to perform calibration on a pretrained model.
3.1 CALINET
In order to calibrate factual knowledge in PLMs,
we propose a lightweight CALINET to adjust the
output of FFNs in a pretrained Transformer. Let H ∈ ℝ^{n×d} denote the output of the attention layer in a Transformer block. The original FFN layer can be formulated as

\mathrm{FFN}(H) = \mathrm{GELU}(H K^{\top})\, V,
where K, V ∈ ℝ^{d_m×d} are the parameter matrices of the first and second linear layers in the FFN, respectively.

Figure 3: Illustration of CALINET. Calibration memory slots calibrate the erroneous knowledge stored in the FFN by adjusting its predicted token distributions.
Our CALINET shares the same architecture with the FFN but has a smaller intermediate dimension d_c. As shown in Figure 3, we regard each key-value pair as a calibration memory slot that stores factual knowledge. When computing the final FFN output, we add the output of CALINET to the original FFN output as an adjustment term for knowledge calibration, namely

\Delta\mathrm{FFN}(H) = \mathrm{GELU}(H \tilde{K}^{\top})\, \tilde{V},
\mathrm{FFN}'(H) = \mathrm{FFN}(H) + \Delta\mathrm{FFN}(H),

where K̃, Ṽ ∈ ℝ^{d_c×d} are the parameter matrices of CALINET, and FFN′(H) is the calibrated FFN output. Note that d_c ≪ d_m, so our method introduces only a small number of parameters.
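For readers who prefer code, here is a minimal PyTorch sketch of the calibrated computation FFN′(H) = FFN(H) + ΔFFN(H). The class name, dimension arguments, and freezing logic are illustrative assumptions rather than the released CALINET implementation; in practice K and V would be loaded from a pretrained model instead of being randomly initialized.

```python
import torch
import torch.nn as nn

class CalibratedFFN(nn.Module):
    """Sketch of FFN'(H) = FFN(H) + dFFN(H) from Section 3.1 (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, d_cal: int):
        super().__init__()
        # Original FFN of the pretrained model: parameters K, V (kept frozen).
        # In practice these weights come from the pretrained PLM.
        self.K = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_ff, d_model, bias=False)
        # CaliNet: a much smaller FFN (d_cal << d_ff) whose key-value pairs act
        # as calibration memory slots; only these parameters are trained.
        self.K_tilde = nn.Linear(d_model, d_cal, bias=False)
        self.V_tilde = nn.Linear(d_cal, d_model, bias=False)
        self.act = nn.GELU()
        for p in (*self.K.parameters(), *self.V.parameters()):
            p.requires_grad = False  # original PLM parameters stay fixed

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        ffn_out = self.V(self.act(self.K(hidden)))             # FFN(H)
        delta = self.V_tilde(self.act(self.K_tilde(hidden)))   # dFFN(H)
        return ffn_out + delta                                 # FFN'(H)

# Usage: wrap the FFN of one chosen Transformer block, e.g. with d_cal = 64.
ffn = CalibratedFFN(d_model=768, d_ff=3072, d_cal=64)
h = torch.randn(2, 16, 768)   # (batch, sequence, d_model)
print(ffn(h).shape)           # torch.Size([2, 16, 768])
```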
3.2 Calibration Data Construction
A fact can be expressed in multiple surface forms.
For example, “Obama was born in Hawaii.” and
“The birthplace of Obama is Hawaii.” describe the
same factual knowledge. In order to calibrate a fact
instead of merely fitting a specific surface form,
we consider multiple paraphrased expressions for
each fact. To be specific, we construct the calibra-
tion data based on the PARAREL dataset (Elazar
et al.,2021), which contains various surface form
templates for 38 relations. First, for each of the k detected false triplets, we fill the head entity or the tail entity into more than five paraphrased templates of its relation. Then, we replace the other entity with a mask token to be predicted.
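To illustrate this construction step, the sketch below fills paraphrased templates for one detected false triple and masks the entity to be predicted. The template strings, the helper name, and the T5-style sentinel token are toy assumptions for the example; the actual calibration data is built from PARAREL templates.

```python
# Minimal sketch of calibration data construction (Section 3.2).
CAPITAL_TEMPLATES = [            # toy paraphrases of the "capital" relation
    "The capital of [X] is [Y].",
    "[Y] is the capital of [X].",
    "[X]'s capital city is [Y].",
]

def build_calibration_examples(subject, obj, templates, mask_token="<extra_id_0>"):
    """Fill in one entity and mask the other, yielding (input, target) pairs."""
    examples = []
    for t in templates:
        # Mask the tail entity, keep the head entity.
        examples.append((t.replace("[X]", subject).replace("[Y]", mask_token), obj))
        # Mask the head entity, keep the tail entity.
        examples.append((t.replace("[X]", mask_token).replace("[Y]", obj), subject))
    return examples

for inp, target in build_calibration_examples("Sri Lanka", "Kotte", CAPITAL_TEMPLATES):
    print(inp, "->", target)
```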