
Horse” and “directed by” have a deterministic rela-
tionship with the missing entity “Steven Spielberg.”
For brevity, we refer to these clues as deterministic clues and the outcome “Steven Spielberg” as the deterministic span. In contrast, if the sample becomes “[MASK]s is an American war film directed
by Steven Spielberg,” multiple names can fill the
masks because Steven Spielberg produced more
than one American war film. We cannot tell which
one is better based on the existing clues, so it is a
non-deterministic sample.
The non-deterministic samples pose a multi-label problem (Zhang and Zhou, 2006) for MLM, where a single input is associated with more than one ground-truth output. If we force the PLMs to promote one specified ground truth over the others, the remaining ground truths become false negatives that can plague training or degrade performance (Durand et al., 2019; Cole et al., 2021). The non-deterministic samples are adequate for learning contextualized representations but questionable for understanding the intrinsic relationship between factual entities.
In contrast, the deterministic samples are less con-
fusing since the answer is always unique, providing
a stable relationship for PLMs to learn.
Therefore, we propose deterministic masking, which always masks and predicts the deterministic spans in MLM pre-training to improve PLMs’ ability to capture factual knowledge. The deterministic clues and spans are identified based on a KB. Two pre-training tasks, clue contrastive learning and clue classification, are introduced to make PLMs more aware of the deterministic clues when predicting the missing entities. Clue contrastive learning encourages PLMs to be more confident in prediction (Vu et al., 2019; Luo et al., 2021) when the deterministic clues are unmasked. Clue classification detects whether the remaining context contains deterministic clues. Experiments on factual knowledge probing and question-answering tasks show the effectiveness of the proposed methods.
The contributions of this paper are: (1) We propose modeling the deterministic relationship in MLM samples to improve the robustness (i.e., both consistency and accuracy) of factual knowledge capture. (2) We design two pre-training tasks that reinforce the deterministic relationship between entities to further improve robustness. (3) The experimental results show that learning the deterministic relationship is also helpful for other knowledge-intensive tasks, such as question answering.
2 Methods
Section 2.1 details the deterministic masking, including how we align texts with triplets and identify deterministic clues and spans in texts. Clue contrastive learning and clue classification are described in Sections 2.2 and 2.3, respectively.
2.1 Deterministic Masking
In addition to masking only factual content, the de-
terministic masking also constrains the remaining
context and the masked content to have a determin-
istic relationship: the remaining context should pro-
vide conclusive clues to predict the masked content,
and the valid value to fill in the mask is unique.
To this end, we align each text with a KB triplet and match the spans in the text with (subject, predicate, object) respectively. We select the spans aligned with objects as the candidates to be masked for pre-training. To further make the masked object deterministic, we query the KB with the aligned (subject, predicate) and check whether the valid object that exists in the KB is unique.
If the KB returns this object exclusively, i.e., only the aligned object can compose a valid triplet with the aligned subject and predicate, the object is deterministic. The object is non-deterministic if multiple objects suit the aligned subject and predicate in the KB. The span aligned with the deterministic object is a deterministic span, and it is masked to construct a deterministic MLM sample¹. We pre-train PLMs on only the deterministic samples.
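Assuming the KB is available as a set of (subject, predicate, object) triplets, the uniqueness check above can be sketched as follows; the function names and the third toy triplet are illustrative, not from the paper:

```python
from collections import defaultdict

def build_index(triplets):
    """Index the KB objects by (subject, predicate) for fast querying."""
    index = defaultdict(set)
    for subj, pred, obj in triplets:
        index[(subj, pred)].add(obj)
    return index

def is_deterministic(index, subject, predicate):
    """A masked object is deterministic iff the KB yields exactly one
    valid object for the aligned (subject, predicate) query."""
    return len(index[(subject, predicate)]) == 1

# Toy KB containing the paper's running example.
kb = [
    ("War Horse", "directed by", "Steven Spielberg"),
    ("Steven Spielberg", "director of", "War Horse"),
    ("Steven Spielberg", "director of", "Saving Private Ryan"),
]
index = build_index(kb)

# Unique object -> deterministic span: keep the sample, mask "Steven Spielberg".
print(is_deterministic(index, "War Horse", "directed by"))         # True
# Multiple valid objects -> non-deterministic sample: filter it out.
print(is_deterministic(index, "Steven Spielberg", "director of"))  # False
```

Only samples for which the check succeeds are kept, and the aligned object span is the one replaced by [MASK] during pre-training.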
Figure 1 shows a deterministic sample aligned with the triplet (“War Horse,” “directed by,” “Steven Spielberg”). When querying the KB with “War Horse” as the subject and “directed by” as the predicate, the resulting object “Steven Spielberg” is unique because the film has only one director, so the first sample is deterministic. In contrast, when using “Steven Spielberg” and “director of” as the subject and the predicate, multiple valid objects exist in the KB, so the second sample is non-deterministic and is filtered out.
By dropping the non-deterministic samples, we prevent PLMs from fixating on one object
¹ We put the detailed procedure (including entity linking and predicate string matching) in Appendix B.