1.1 Related work
Knowledge-augmented LM pretraining.
Knowledge integration is an active area of research for improving
LMs. One line of work is retrieval-augmented LMs [20, 21, 22], which retrieve relevant text from a
corpus and integrate it into LMs as additional knowledge. Orthogonal to these works, we focus on
using knowledge bases as background knowledge to ground reasoning about entities and facts.
Closest to our work are methods that integrate knowledge bases into LM pretraining. One line of research
aims to add entity features to LMs [12, 23, 24]; some works use KG entity information or structure
to create additional training signals [13, 25, 14, 26, 27, 28]; several works add KG triplet information
directly to the LM input [29, 16, 15, 30, 31]. While these methods have achieved substantial progress,
they typically propagate information between text and KG in a shallow or uni-directional (e.g.,
KG to text) manner, which might limit the potential to perform fully joint reasoning over the two
modalities. To improve on the above works, we propose to fuse text and KG bidirectionally
via a deep cross-modal model and joint self-supervision, so that text and KG are grounded and
contextualized by each other. We find that this improves model performance on various reasoning
tasks (§3). Another distinction is that existing works in this space typically focus on adding entity- or
triplet-level knowledge from KGs to LMs and on solving entity/relation classification tasks.
Our work significantly expands this scope: we use larger KG subgraphs (200 nodes) as input to
enable richer contextualization between KG and text, and we achieve performance improvements on
a broader set of NLP tasks, including QA, reasoning, and text classification.
KG-augmented question answering.
Various works have designed KG-augmented reasoning models
for question answering [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. In particular, recent works such
as QAGNN [8] and GreaseLM [9] suggest that a KG can scaffold reasoning about entities with
its graph structure and help with complex question answering (e.g., negation, multi-hop reasoning).
These works typically focus on training or finetuning models on particular QA datasets. In contrast,
we generalize this idea and integrate KG-augmented reasoning into general-purpose pretraining. Our
motivation is that self-supervised pretraining allows the model to learn from larger and more diverse
data, helping it learn richer interactions between text and KGs and acquire more diverse reasoning
abilities beyond specific QA tasks. We find that our proposed pretraining approach (DRAGON) offers
significant boosts over the baseline QA models (e.g., GreaseLM) on diverse downstream tasks (§3).
This opens a new research avenue of scaling up various carefully designed QA models to pretraining.
KG representation learning.
The link prediction task used in our pretraining is motivated by research
in KG representation learning. Link prediction is a fundamental task for KGs [43, 44], and various
works study methods to learn KG entity and relation embeddings for link prediction, such as TransE
[45], DistMult [46] and RotatE [47]. Several works additionally use textual data or pretrained LMs to
help learn KG embeddings and perform link prediction [48, 49, 50, 51, 52, 53]. While these works focus on the
KG-side representations, we extend the scope and use the KG-side objective (link prediction) jointly
with a text-side objective (language modeling) to train a mutually interactive text-KG model.
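For concreteness, such KG embedding methods score candidate triplets $(h, r, t)$ with simple functions of the learned entity and relation embeddings. The sketch below shows the TransE and DistMult scoring functions in generic notation; the symbols are illustrative and are not this paper's notation.

```latex
% Illustrative triplet scoring functions (generic notation, not this paper's):
% TransE scores by translation distance; DistMult by a trilinear product.
\phi_{\mathrm{TransE}}(h, r, t)   = -\,\lVert \mathbf{e}_h + \mathbf{r} - \mathbf{e}_t \rVert
\qquad
\phi_{\mathrm{DistMult}}(h, r, t) = \sum_{i} \mathbf{e}_h[i]\,\mathbf{r}[i]\,\mathbf{e}_t[i]
```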
2 Deep Bidirectional Language-Knowledge Graph Pretraining (DRAGON)
We propose DRAGON, an approach that performs deeply bidirectional, self-supervised pretraining of
a language-knowledge model from text and KG. Specifically, as illustrated in Figure 1, we take a text
corpus and a large knowledge graph as raw data, and create input instances for the model by sampling
coarsely-aligned (text segment, local KG) pairs (§2.1). To learn mutual interactions over text and
KG, DRAGON consists of a cross-modal encoder (GreaseLM) that fuses the input text-KG pair
bidirectionally (§2.2), and a pretraining objective that performs bidirectional self-supervision on the
text-KG input (§2.3). Our pretraining objective unifies masked language modeling (MLM) and KG
link prediction (LinkPred) to make text and KG mutually inform each other and learn joint reasoning
over them. Finally, we describe how we finetune the pretrained DRAGON model for downstream
tasks (§2.4). While each individual piece of our approach (GreaseLM, MLM, LinkPred) is not new
in itself, we are the first to bring them together effectively and demonstrate that the resulting model
achieves strong empirical results (§3, §4).
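As a minimal sketch of how the two objectives are combined (the precise formulation is given in §2.3), the pretraining loss can be viewed as a weighted sum of the two self-supervised terms; the weight $\lambda$ below is an illustrative hyperparameter, not a quantity defined in this section.

```latex
% Sketch of the joint self-supervised objective (exact form in §2.3);
% \lambda is an illustrative weighting hyperparameter.
\mathcal{L}_{\mathrm{pretrain}} \;=\; \mathcal{L}_{\mathrm{MLM}} \;+\; \lambda\,\mathcal{L}_{\mathrm{LinkPred}}
```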
Definitions.
We define a text corpus $\mathcal{W}$ as a set of text segments, $\mathcal{W} = \{W\}$, and each text segment
$W$ as a sequence of tokens (words), $W = (w_1, \dots, w_I)$. We define a knowledge graph (KG) as a
multi-relational graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of entity nodes in the KG and
$\mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}$ is the set of edges (triplets) that connect nodes in $\mathcal{V}$, with $\mathcal{R}$ being the set of relation types $\{r\}$.
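As a minimal illustration of these definitions, an input instance to the model can be represented as a pair of a token sequence and a local KG. The Python sketch below uses hypothetical class and field names, not identifiers from the DRAGON codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical containers mirroring the definitions above;
# names are illustrative and not taken from the DRAGON codebase.
Triplet = Tuple[int, int, int]  # (head entity id, relation id, tail entity id)

@dataclass
class TextSegment:
    tokens: List[str]        # W = (w_1, ..., w_I)

@dataclass
class LocalKG:
    nodes: List[int]         # subset of the entity nodes V
    edges: List[Triplet]     # subset of the triplets E ⊆ V x R x V

@dataclass
class InputInstance:
    text: TextSegment        # text segment W
    graph: LocalKG           # coarsely-aligned local KG (sampled as in §2.1)
```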