Pre-training Language Models with Deterministic Factual Knowledge
Shaobo Li1, Xiaoguang Li2, Lifeng Shang2,
Chengjie Sun1, Bingquan Liu1, Zhenzhou Ji1, Xin Jiang2and Qun Liu2
1Harbin Institute of Technology, 2Huawei Noah’s Ark Lab
shli@insun.hit.edu.cn, {sunchengjie, liubq, jizhenzhou}@hit.edu.cn
{lixiaoguang11, shang.lifeng, Jiang.Xin, qun.liu}@huawei.com
Abstract

Previous works show that Pre-trained Language Models (PLMs) can capture factual knowledge. However, some analyses reveal that PLMs fail to do so robustly, e.g., being sensitive to changes of prompts when extracting factual knowledge. To mitigate this issue, we propose to let PLMs learn the deterministic relationship between the remaining context and the masked content. The deterministic relationship ensures that the masked factual content is deterministically inferable from the existing clues in the context. This provides more stable patterns for PLMs to capture factual knowledge than random masking. Two pre-training tasks are further introduced to motivate PLMs to rely on the deterministic relationship when filling masks. Specifically, we use an external Knowledge Base (KB) to identify deterministic relationships and continuously pre-train PLMs with the proposed methods. Factual knowledge probing experiments indicate that the continuously pre-trained PLMs achieve better robustness in capturing factual knowledge. Further experiments on question-answering datasets show that learning the deterministic relationship with the proposed methods also helps other knowledge-intensive tasks.
1 Introduction
Petroni et al. (2019); Jiang et al. (2020); Shin et al. (2020); Zhong et al. (2021) show that we can successfully extract factual knowledge from Pre-trained Language Models (PLMs) using cloze-style prompts such as “The director of the film Saving Private Ryan is [MASK].” Some recent works (Cao et al., 2021; Pörner et al., 2020) find that PLMs may rely on superficial cues to achieve this and cannot respond robustly. Table 1 gives examples of inconsistent predictions exposed by changing the surface forms of prompts on the same fact.
This phenomenon questions whether PLMs can robustly capture factual knowledge through Masked Language Modeling (MLM) (Devlin et al., 2018) and further motivates us to inspect the masked contents in the pre-training samples. After reviewing several masking methods, we find that they focus on limiting the granularity of masked contents, e.g., restricting the masked content to be entities and then randomly masking the entities (Guu et al., 2020), and pay less attention to checking whether the obtained MLM samples are appropriate for capturing factual knowledge. For instance, when we want PLMs to capture the corresponding factual knowledge by recovering the masked entities, we should check whether the remaining context provides sufficient clues to recover the missing entity.

Cloze-style Prompt and Prediction                                       | Is Correct?
War Horse is an American war film directed by Steven Spielberg.         | ✓
The director of the American war film War Horse is Keanu Reeves.        | ✗
Christopher Nolan is the director of the American war film War Horse.   | ✗

Table 1: A PLM can give inconsistent results when probing the same fact with different prompts. The underlined words are the predictions.
Inspired by the above analysis, we can categorize MLM samples based on the relationship between the remaining context and the masked content:

Non-deterministic samples: The clues in the remaining context are insufficient to constrain the value of the masked content. Multiple values are valid to fill in the masks.

Deterministic samples: The remaining context holds deterministic clues for the masked content. We can get one and only one valid value for the masked content.
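The distinction above can be sketched as a lookup over a toy knowledge base. This is only an illustration: the triples, the `valid_objects`/`valid_subjects` helpers, and the KB contents are made up here, not taken from the paper's pipeline.

```python
# Toy KB of (subject, predicate, object) triples -- illustrative only.
KB = {
    ("War Horse", "directed by", "Steven Spielberg"),
    ("Saving Private Ryan", "directed by", "Steven Spielberg"),
    ("War Horse", "genre", "war film"),
}

def valid_objects(subject, predicate, kb):
    """All objects that can complete (subject, predicate, ?) in the KB."""
    return {o for s, p, o in kb if s == subject and p == predicate}

def valid_subjects(predicate, obj, kb):
    """All subjects that can complete (?, predicate, object) in the KB."""
    return {s for s, p, o in kb if p == predicate and o == obj}

# Masking the object of (War Horse, directed by, ?) leaves exactly one
# valid fill -> a deterministic sample.
assert len(valid_objects("War Horse", "directed by", KB)) == 1

# Masking the subject of (?, directed by, Steven Spielberg) leaves two
# valid fills -> a non-deterministic sample.
assert len(valid_subjects("directed by", "Steven Spielberg", KB)) == 2
```

A sample is deterministic exactly when the corresponding candidate set has size one.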
For example, the first cloze in Table 1 masks the director of the film “War Horse.” Since the film has only one director in the real world, we can get a unique answer deterministically, so it is a deterministic MLM sample. The crucial clues “War Horse” and “directed by” have a deterministic relationship with the missing entity “Steven Spielberg.” For brevity, we refer to these clues as deterministic clues and the outcome “Steven Spielberg” as the deterministic span. In contrast, if the sample becomes “[MASK]s is an American war film directed by Steven Spielberg,” multiple names can fill the masks because Steven Spielberg directed more than one American war film. We cannot tell which one is better based on the existing clues, so it is a non-deterministic sample.

arXiv:2210.11165v1 [cs.CL] 20 Oct 2022
The non-deterministic samples establish a multi-label problem (Zhang and Zhou, 2006) for MLM, where more than one ground-truth output value is associated with a single input. If we enforce the PLMs to promote one specified ground truth over the others, the other ground truths become false negatives that can plague training or cause a performance downgrade (Durand et al., 2019; Cole et al., 2021). The non-deterministic samples are adequate for obtaining contextualized representations but become questionable for learning the intrinsic relationship between factual entities. In contrast, the deterministic samples are less confusing since the answer is always unique, providing a stable relationship for PLMs to learn.
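The false-negative effect can be illustrated numerically (the probabilities below are invented for the illustration, not results from the paper): because an MLM's output distribution sums to one, a one-hot cross-entropy target for one valid answer necessarily suppresses every other valid answer.

```python
import math

def cross_entropy(probs, target):
    """One-hot cross-entropy: -log p(target)."""
    return -math.log(probs[target])

# Hypothetical model probabilities over candidate fills for
# "[MASK] is an American war film directed by Steven Spielberg."
probs = {"War Horse": 0.45, "Saving Private Ryan": 0.45, "Munich": 0.10}

# Both "War Horse" and "Saving Private Ryan" are factually valid fills,
# but a one-hot target picks only the span that appeared in the text.
loss = cross_entropy(probs, "War Horse")
assert loss > 0

# Minimizing this loss pushes p("War Horse") toward 1; since the
# probabilities sum to 1, p("Saving Private Ryan") -- a valid answer --
# is driven down as a false negative.
assert abs(sum(probs.values()) - 1.0) < 1e-9
```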
Therefore, we propose deterministic masking, which always masks and predicts the deterministic spans in MLM pre-training, to improve PLMs' ability to capture factual knowledge. The deterministic clues and spans are identified based on a KB. Two pre-training tasks, clue contrastive learning and clue classification, are introduced to make PLMs more aware of the deterministic clues when predicting the missing entities. Clue contrastive learning encourages PLMs to be more confident in prediction (Vu et al., 2019; Luo et al., 2021) when the deterministic clues are unmasked. Clue classification detects whether the remaining context contains deterministic clues. Experiments on factual knowledge probing and question-answering tasks show the effectiveness of the proposed methods.
The contributions of this paper are: (1) We propose to model the deterministic relationship in MLM samples to improve the robustness (i.e., both consistency and accuracy) of factual knowledge capturing. (2) We design two pre-training tasks to enhance the deterministic relationship between entities and gain further improvements in robustness. (3) The experimental results show that learning the deterministic relationship also helps other knowledge-intensive tasks, such as question answering.
2 Methods
Section 2.1 details deterministic masking, including how we align texts with triplets and identify deterministic clues and spans in texts. Clue contrastive learning and clue classification are described in Sections 2.2 and 2.3, respectively.
2.1 Deterministic Masking
In addition to masking only factual content, deterministic masking also constrains the remaining context and the masked content to have a deterministic relationship: the remaining context should provide conclusive clues to predict the masked content, and the valid value to fill in the mask is unique.

To this end, we align each text with a KB triplet and match the spans in the text with (subject, predicate, object), respectively. We select the spans aligned with objects as the candidates to be masked for pre-training. To further make the masked object deterministic, we query the KB with the aligned (subject, predicate) pair and check whether the valid object that exists in the KB is unique.

If the KB emits this object exclusively, i.e., only the aligned object can compose a valid triplet with the aligned subject and predicate, the object is deterministic. The object is non-deterministic if multiple objects suit the aligned subject and predicate in the KB. The span aligned with the deterministic object is a deterministic span, and it is masked to construct a deterministic MLM sample¹. We pre-train PLMs on only the deterministic samples.
Figure 1 shows a deterministic sample aligned with the triplet (“War Horse,” “directed by,” “Steven Spielberg”). When querying the KB with “War Horse” as the subject and “directed by” as the predicate, the resulting object “Steven Spielberg” is unique because there is only one director who made this film, so the first sample is deterministic. In contrast, when using “Steven Spielberg” and “director of” as the subject and the predicate, multiple valid objects exist in the KB, so the second sample is non-deterministic and is filtered out.
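The filtering step can be sketched as follows. This is a simplified sketch under stated assumptions: the text is already aligned with a triplet (the paper's entity linking and predicate matching are in its Appendix B), the toy KB and the function name `make_deterministic_sample` are ours, and masking is done with a naive string replacement rather than tokenizer-level masking.

```python
# Toy KB of (subject, predicate, object) triples -- illustrative only.
KB = {
    ("War Horse", "directed by", "Steven Spielberg"),
    ("Saving Private Ryan", "directed by", "Steven Spielberg"),
}

def make_deterministic_sample(text, subject, predicate, obj, kb):
    """Mask the object span only if (subject, predicate) determines it
    uniquely in the KB; otherwise drop the sample."""
    objects = {o for s, p, o in kb if s == subject and p == predicate}
    if objects != {obj}:      # zero, several, or a conflicting object
        return None           # non-deterministic: filtered out
    return text.replace(obj, "[MASK]")

text = "War Horse is an American war film directed by Steven Spielberg."
sample = make_deterministic_sample(text, "War Horse", "directed by",
                                   "Steven Spielberg", KB)
assert sample == "War Horse is an American war film directed by [MASK]."
```

Reversing the direction (masking the subject given “directed by” and “Steven Spielberg”) would find two candidates in this KB, so that sample would return `None` and be dropped.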
By dropping the non-deterministic samples, we prevent PLMs from fixating on one object while ignoring others that are also valid based on the existing clues. In the deterministic samples, the relationship between the remaining clues and the missing span is more stable and unambiguous. Training on the deterministic samples encourages PLMs to infer the missing object based on its deterministic factual clues. It helps PLMs grasp a stronger relationship between entities to model the factual contents and can aid in accomplishing some knowledge-intensive tasks.

¹We put the detailed procedure (including entity linking and predicate string matching) in Appendix B.

[Figure 1: Construct a deterministic sample. The spans with blue background correspond to entities (subject or object), and the spans with yellow background describe relations (predicate). The text “War Horse is an American war film directed by Steven Spielberg.” is aligned with a KB triplet, yielding two candidate pre-training samples: 1. “War Horse is an American war film directed by [MASK]s.” (deterministic inference, selected); 2. “[MASK]s is an American war film directed by Steven Spielberg.” (non-deterministic inference, filtered out).]
2.2 Clue Contrastive Learning
To stimulate PLMs to capture the deterministic relationship between entities, we design the pre-training task clue contrastive learning following this intuition: PLMs should be more confident in generating a masked span when its deterministic clues exist in the context. We introduce a contrastive objective accordingly and explain it with the pair of samples in Figure 2. Figure 2a shows a deterministic MLM sample that masks the span “Steven Spielberg” and keeps its deterministic clues. Figure 2b masks both the deterministic clues and the deterministic span. The remaining context in Figure 2a contains fewer [MASK]s and provides more information, naturally reducing the uncertainty in prediction. So PLMs should assign a higher probability to the ground truth when given the context in Figure 2a than that in Figure 2b.
[Figure 2: The two samples in clue contrastive learning. (a) Deterministic sample: masks the deterministic span (object) and keeps the deterministic clues (subject and predicate); the PLM and LM Head compute P(O = ô | S = ŝ, P = p̂, R = r̂). (b) Contrastive sample without deterministic clues: masks both the deterministic clues and the deterministic span; the PLM and LM Head compute P(O = ô | S = [MASK], P = [MASK], R = r̂). The first sample (a) has a more informative context, so the PLM should be more confident when predicting the masked object O. The texts with purple background denote the spans other than entities and relations.]

Formally, we use S and P to denote the deterministic clues (subject and predicate) and O to denote the masked deterministic span (object). R represents the random spans in the context other
than S, P, and O. The objective function to be maximized is:

P(O = ô | S = ŝ, P = p̂, R = r̂) / P(O = ô | S = [MASK], P = [MASK], R = r̂),    (1)

where S = [MASK] and P = [MASK] denote replacing the deterministic clues with [MASK]s, and ŝ, p̂, ô, and r̂ are the ground-truth values of S, P, O, and R, respectively. P(O = ô | ·) denotes the probability that the PLM correctly predicts the masked span O, i.e., the average probability that the PLM assigns to the ground-truth tokens. It is calculated by a Language Model Head (LM Head) based on the embeddings of O from the PLM.
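The aggregation of per-token probabilities into P(O = ô | ·) can be sketched as below. Only the averaging step is shown; the per-token probabilities would come from an LM head over the [MASK] positions, and the numbers here are made up for illustration.

```python
def span_probability(token_probs):
    """P(O = ô | ·): average probability the LM head assigns to the
    ground-truth tokens of the masked span."""
    return sum(token_probs) / len(token_probs)

# Hypothetical LM-head probabilities for the tokens of "Steven Spielberg"
# at their [MASK] positions, with and without the deterministic clues.
p_with_clues    = span_probability([0.8, 0.9])  # "War Horse", "directed by" visible
p_without_clues = span_probability([0.2, 0.3])  # clues replaced by [MASK]s

# The clued context should yield the higher span probability.
assert p_with_clues > p_without_clues
```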
This task encourages PLMs to give the ground truth ô a higher probability when the deterministic clues exist in the context. It is somewhat conservative since we consider the noise in the data construction: the objective is still reasonable even when S, P, and R are randomly labeled. Raw words are always more informative than the ordinary [MASK]s and can reduce the uncertainty of the context (Cover, 1999), so the uncertainty of prediction degrades accordingly (Vu et al., 2019; Luo et al., 2021). On the other hand, this objective trains PLMs to react to changes in the context, i.e., to learn how to tune the output as the input changes. We employ a large-scale KB as the approximation of real-world knowledge (Reiter, 1981) to obtain the pre-training samples.
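One way to turn the ratio in Eq. (1) into a trainable loss is to minimize its negative logarithm. This is a sketch under stated assumptions: the log-ratio form, the `eps` smoothing, and the function name are our choices for illustration, not necessarily the paper's exact implementation.

```python
import math

def clue_contrastive_loss(p_clued, p_masked, eps=1e-9):
    """Negative log of the ratio in Eq. (1): maximizing
    P(O=ô | clues kept) / P(O=ô | clues masked) is equivalent to
    minimizing log p_masked - log p_clued."""
    return math.log(p_masked + eps) - math.log(p_clued + eps)

# The loss rewards being more confident when the clues are visible
# (p_clued > p_masked gives a negative, i.e. lower, loss).
assert clue_contrastive_loss(0.85, 0.25) < 0   # desired direction
assert clue_contrastive_loss(0.25, 0.85) > 0   # penalized direction
```

In practice `p_clued` and `p_masked` would be the span probabilities computed from the same text under the two maskings of Figure 2a and 2b.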