Pre-training Language Models with Deterministic Factual Knowledge
Shaobo Li1, Xiaoguang Li2, Lifeng Shang2,
Chengjie Sun1, Bingquan Liu1, Zhenzhou Ji1, Xin Jiang2and Qun Liu2
1Harbin Institute of Technology, 2Huawei Noah’s Ark Lab
shli@insun.hit.edu.cn, {sunchengjie, liubq, jizhenzhou}@hit.edu.cn
{lixiaoguang11, shang.lifeng, Jiang.Xin, qun.liu}@huawei.com
Abstract

Previous works show that Pre-trained Language Models (PLMs) can capture factual knowledge. However, some analyses reveal that PLMs fail to do so robustly, e.g., being sensitive to changes of prompts when extracting factual knowledge. To mitigate this issue, we propose to let PLMs learn the deterministic relationship between the remaining context and the masked content. The deterministic relationship ensures that the masked factual content is deterministically inferable from the existing clues in the context. This provides more stable patterns for PLMs to capture factual knowledge than random masking. Two pre-training tasks are further introduced to motivate PLMs to rely on the deterministic relationship when filling masks. Specifically, we use an external Knowledge Base (KB) to identify deterministic relationships and continuously pre-train PLMs with the proposed methods. Factual knowledge probing experiments indicate that the continuously pre-trained PLMs achieve better robustness in capturing factual knowledge. Further experiments on question-answering datasets show that learning the deterministic relationship with the proposed methods also helps other knowledge-intensive tasks.
1 Introduction
Petroni et al. (2019); Jiang et al. (2020); Shin et al. (2020); Zhong et al. (2021) show that we can successfully extract factual knowledge from Pre-trained Language Models (PLMs) using cloze-style prompts such as “The director of the film Saving Private Ryan is [MASK].” Some recent works (Cao et al., 2021; Pörner et al., 2020) find that PLMs may rely on superficial cues to achieve this and cannot respond robustly. Table 1 gives examples of inconsistent predictions exposed by changing the surface forms of prompts on the same fact.
This phenomenon questions whether PLMs can robustly capture factual knowledge through Masked Language Modeling (MLM) (Devlin et al., 2018) and further motivates us to inspect the masked contents in the pre-training samples. After reviewing several masking methods, we find that they focus on limiting the granularity of masked contents, e.g., restricting the masked content to be entities and then randomly masking the entities (Guu et al., 2020), and pay less attention to checking whether the obtained MLM samples are appropriate for capturing factual knowledge. For instance, when we want PLMs to capture the corresponding factual knowledge by recovering the masked entities, we should check whether the remaining context provides sufficient clues to recover the missing entity.

Cloze-style Prompt and Prediction                                       | Is Correct?
War Horse is an American war film directed by Steven Spielberg.         | ✓
The director of the American war film War Horse is Keanu Reeves.        | ✗
Christopher Nolan is the director of the American war film War Horse.   | ✗

Table 1: A PLM can give inconsistent results when probing the same fact with different prompts. The underlined words are the predictions.
Inspired by the above analysis, we can categorize MLM samples based on the relationship between the remaining context and the masked content:

Non-deterministic samples: The clues in the remaining context are insufficient to constrain the value of the masked content. Multiple values are valid to fill in the masks.

Deterministic samples: The remaining context holds deterministic clues for the masked content. We can get one and only one valid value for the masked content.
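The distinction above can be sketched as a lookup over a toy knowledge base. This is only an illustration: the triples, the `valid_objects`/`valid_subjects` helpers, and the KB contents are made up here, not taken from the paper's pipeline.

```python
# Toy KB of (subject, predicate, object) triples -- illustrative only.
KB = {
    ("War Horse", "directed by", "Steven Spielberg"),
    ("Saving Private Ryan", "directed by", "Steven Spielberg"),
    ("War Horse", "genre", "war film"),
}

def valid_objects(subject, predicate, kb):
    """All objects that can complete (subject, predicate, ?) in the KB."""
    return {o for s, p, o in kb if s == subject and p == predicate}

def valid_subjects(predicate, obj, kb):
    """All subjects that can complete (?, predicate, object) in the KB."""
    return {s for s, p, o in kb if p == predicate and o == obj}

# Masking the object of (War Horse, directed by, ?) leaves exactly one
# valid fill -> a deterministic sample.
assert len(valid_objects("War Horse", "directed by", KB)) == 1

# Masking the subject of (?, directed by, Steven Spielberg) leaves two
# valid fills -> a non-deterministic sample.
assert len(valid_subjects("directed by", "Steven Spielberg", KB)) == 2
```

A sample is deterministic exactly when the corresponding candidate set has size one.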
For example, the first cloze in Table 1 masks the director of the film “War Horse.” Since the film has only one director in the real world, we can get a unique answer deterministically, so it is a deterministic MLM sample. The crucial clues “War Horse” and “directed by” have a deterministic relationship with the missing entity “Steven Spielberg.” For brevity, we refer to these clues as deterministic clues and the outcome “Steven Spielberg” as the deterministic span. In contrast, if the sample becomes “[MASK]s is an American war film directed by Steven Spielberg,” multiple names can fill the masks because Steven Spielberg directed more than one American war film. We cannot tell which one is better based on the existing clues, so it is a non-deterministic sample.

arXiv:2210.11165v1 [cs.CL] 20 Oct 2022
The non-deterministic samples establish a multi-label problem (Zhang and Zhou, 2006) for MLM, where more than one ground-truth output value is associated with a single input. If we enforce the PLMs to promote one specified ground truth over the others, the other ground truths become false negatives that can plague training or cause a performance downgrade (Durand et al., 2019; Cole et al., 2021). The non-deterministic samples are adequate for obtaining contextualized representations but become questionable for learning the intrinsic relationship between factual entities. In contrast, the deterministic samples are less confusing since the answer is always unique, providing a stable relationship for PLMs to learn.
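The false-negative effect can be illustrated numerically (the probabilities below are invented for the illustration, not results from the paper): because an MLM's output distribution sums to one, a one-hot cross-entropy target for one valid answer necessarily suppresses every other valid answer.

```python
import math

def cross_entropy(probs, target):
    """One-hot cross-entropy: -log p(target)."""
    return -math.log(probs[target])

# Hypothetical model probabilities over candidate fills for
# "[MASK] is an American war film directed by Steven Spielberg."
probs = {"War Horse": 0.45, "Saving Private Ryan": 0.45, "Munich": 0.10}

# Both "War Horse" and "Saving Private Ryan" are factually valid fills,
# but a one-hot target picks only the span that appeared in the text.
loss = cross_entropy(probs, "War Horse")
assert loss > 0

# Minimizing this loss pushes p("War Horse") toward 1; since the
# probabilities sum to 1, p("Saving Private Ryan") -- a valid answer --
# is driven down as a false negative.
assert abs(sum(probs.values()) - 1.0) < 1e-9
```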
Therefore, we propose deterministic masking, which always masks and predicts the deterministic spans in MLM pre-training, to improve PLMs' ability to capture factual knowledge. The deterministic clues and spans are identified based on a KB. Two pre-training tasks, clue contrastive learning and clue classification, are introduced to make PLMs more aware of the deterministic clues when predicting the missing entities. Clue contrastive learning encourages PLMs to be more confident in prediction (Vu et al., 2019; Luo et al., 2021) when the deterministic clues are unmasked. Clue classification detects whether the remaining context contains deterministic clues. Experiments on factual knowledge probing and question-answering tasks show the effectiveness of the proposed methods.
The contributions of this paper are: (1) We propose to model the deterministic relationship in MLM samples to improve the robustness (i.e., both consistency and accuracy) of factual knowledge capturing. (2) We design two pre-training tasks to enhance the deterministic relationship between entities and gain further improvements in robustness. (3) The experimental results show that learning the deterministic relationship also helps other knowledge-intensive tasks, such as question answering.
2 Methods
Section 2.1 details deterministic masking, including how we align texts with triplets and identify deterministic clues and spans in texts. Clue contrastive learning and clue classification are described in Sections 2.2 and 2.3, respectively.
2.1 Deterministic Masking
In addition to masking only factual content, deterministic masking also constrains the remaining context and the masked content to have a deterministic relationship: the remaining context should provide conclusive clues to predict the masked content, and the valid value to fill in the mask is unique.

To this end, we align each text with a KB triplet and match the spans in the text with (subject, predicate, object), respectively. We select the spans aligned with objects as the candidates to be masked for pre-training. To further make the masked object deterministic, we query the KB with the aligned (subject, predicate) pair and check whether the valid object that exists in the KB is unique.

If the KB emits this object exclusively, i.e., only the aligned object can compose a valid triplet with the aligned subject and predicate, the object is deterministic. The object is non-deterministic if multiple objects suit the aligned subject and predicate in the KB. The span aligned with the deterministic object is a deterministic span, and it is masked to construct a deterministic MLM sample¹. We pre-train PLMs on only the deterministic samples.
Figure 1 shows a deterministic sample aligned with the triplet (“War Horse,” “directed by,” “Steven Spielberg”). When querying the KB with “War Horse” as the subject and “directed by” as the predicate, the resulting object “Steven Spielberg” is unique because there is only one director who made this film, so the first sample is deterministic. In contrast, when using “Steven Spielberg” and “director of” as the subject and the predicate, multiple valid objects exist in the KB, so the second sample is non-deterministic and is filtered out.
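The filtering step can be sketched as follows. This is a simplified sketch under stated assumptions: the text is already aligned with a triplet (the paper's entity linking and predicate matching are in its Appendix B), the toy KB and the function name `make_deterministic_sample` are ours, and masking is done with a naive string replacement rather than tokenizer-level masking.

```python
# Toy KB of (subject, predicate, object) triples -- illustrative only.
KB = {
    ("War Horse", "directed by", "Steven Spielberg"),
    ("Saving Private Ryan", "directed by", "Steven Spielberg"),
}

def make_deterministic_sample(text, subject, predicate, obj, kb):
    """Mask the object span only if (subject, predicate) determines it
    uniquely in the KB; otherwise drop the sample."""
    objects = {o for s, p, o in kb if s == subject and p == predicate}
    if objects != {obj}:      # zero, several, or a conflicting object
        return None           # non-deterministic: filtered out
    return text.replace(obj, "[MASK]")

text = "War Horse is an American war film directed by Steven Spielberg."
sample = make_deterministic_sample(text, "War Horse", "directed by",
                                   "Steven Spielberg", KB)
assert sample == "War Horse is an American war film directed by [MASK]."
```

Reversing the direction (masking the subject given “directed by” and “Steven Spielberg”) would find two candidates in this KB, so that sample would return `None` and be dropped.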
By dropping the non-deterministic samples, we prevent PLMs from fixating on one object while ignoring others that are also valid based on the existing clues. In the deterministic samples, the relationship between the remaining clues and the missing span is more stable and unambiguous. Training on the deterministic samples encourages PLMs to infer the missing object based on its deterministic factual clues. It helps PLMs grasp a stronger relationship between entities to model the factual contents and can aid in accomplishing some knowledge-intensive tasks.

¹We put the detailed procedure (including entity linking and predicate string matching) in Appendix B.

[Figure 1: Construct a deterministic sample. The spans with blue background correspond to entities (subject or object), and the spans with yellow background describe relations (predicate). The text “War Horse is an American war film directed by Steven Spielberg.” is aligned with a KB triplet, yielding two candidate pre-training samples: 1. “War Horse is an American war film directed by [MASK]s.” (deterministic inference, selected); 2. “[MASK]s is an American war film directed by Steven Spielberg.” (non-deterministic inference, filtered out).]
2.2 Clue Contrastive Learning
To stimulate PLMs to capture the deterministic relationship between entities, we design the pre-training task clue contrastive learning following this intuition: PLMs should be more confident in generating a masked span when its deterministic clues exist in the context. We introduce a contrastive objective accordingly and explain it with the pair of samples in Figure 2. Figure 2a shows a deterministic MLM sample that masks the span “Steven Spielberg” and keeps its deterministic clues. Figure 2b masks both the deterministic clues and the deterministic span. The remaining context in Figure 2a contains fewer [MASK]s and provides more information, naturally reducing the uncertainty in prediction. So PLMs should assign a higher probability to the ground truth when given the context in Figure 2a than that in Figure 2b.
[Figure 2: The two samples in clue contrastive learning. (a) Deterministic sample: masks the deterministic span (object) and keeps the deterministic clues (subject and predicate); the PLM and LM Head compute P(O = ô | S = ŝ, P = p̂, R = r̂). (b) Contrastive sample without deterministic clues: masks both the deterministic clues and the deterministic span; the PLM and LM Head compute P(O = ô | S = [MASK], P = [MASK], R = r̂). The first sample (a) has a more informative context, so the PLM should be more confident when predicting the masked object O. The texts with purple background denote the spans other than entities and relations.]

Formally, we use S and P to denote the deterministic clues (subject and predicate) and O to denote the masked deterministic span (object). R represents the random spans in the context other
than S, P, and O. The objective function to be maximized is:

P(O = ô | S = ŝ, P = p̂, R = r̂) / P(O = ô | S = [MASK], P = [MASK], R = r̂),    (1)

where S = [MASK] and P = [MASK] denote replacing the deterministic clues with [MASK]s, and ŝ, p̂, ô, and r̂ are the ground-truth values of S, P, O, and R, respectively. P(O = ô | ·) denotes the probability that the PLM correctly predicts the masked span O, i.e., the average probability that the PLM assigns to the ground-truth tokens. It is calculated by a Language Model Head (LM Head) based on the embeddings of O from the PLM.
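The aggregation of per-token probabilities into P(O = ô | ·) can be sketched as below. Only the averaging step is shown; the per-token probabilities would come from an LM head over the [MASK] positions, and the numbers here are made up for illustration.

```python
def span_probability(token_probs):
    """P(O = ô | ·): average probability the LM head assigns to the
    ground-truth tokens of the masked span."""
    return sum(token_probs) / len(token_probs)

# Hypothetical LM-head probabilities for the tokens of "Steven Spielberg"
# at their [MASK] positions, with and without the deterministic clues.
p_with_clues    = span_probability([0.8, 0.9])  # "War Horse", "directed by" visible
p_without_clues = span_probability([0.2, 0.3])  # clues replaced by [MASK]s

# The clued context should yield the higher span probability.
assert p_with_clues > p_without_clues
```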
This task encourages PLMs to give the ground truth ô a higher probability when the deterministic clues exist in the context. It is somewhat conservative since we consider the noise in the data construction: the objective is still reasonable even when S, P, and R are randomly labeled. Raw words are always more informative than the ordinary [MASK]s and can reduce the uncertainty of the context (Cover, 1999), so the uncertainty of prediction degrades accordingly (Vu et al., 2019; Luo et al., 2021). On the other hand, this objective trains PLMs to react to changes in the context, i.e., to learn how to tune the output as the input changes. We employ a large-scale KB as the approximation of real-world knowledge (Reiter, 1981) to obtain the pre-training samples.
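One way to turn the ratio in Eq. (1) into a trainable loss is to minimize its negative logarithm. This is a sketch under stated assumptions: the log-ratio form, the `eps` smoothing, and the function name are our choices for illustration, not necessarily the paper's exact implementation.

```python
import math

def clue_contrastive_loss(p_clued, p_masked, eps=1e-9):
    """Negative log of the ratio in Eq. (1): maximizing
    P(O=ô | clues kept) / P(O=ô | clues masked) is equivalent to
    minimizing log p_masked - log p_clued."""
    return math.log(p_masked + eps) - math.log(p_clued + eps)

# The loss rewards being more confident when the clues are visible
# (p_clued > p_masked gives a negative, i.e. lower, loss).
assert clue_contrastive_loss(0.85, 0.25) < 0   # desired direction
assert clue_contrastive_loss(0.25, 0.85) > 0   # penalized direction
```

In practice `p_clued` and `p_masked` would be the span probabilities computed from the same text under the two maskings of Figure 2a and 2b.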