Deep Bidirectional Language-Knowledge Graph
Pretraining
Michihiro Yasunaga1, Antoine Bosselut2, Hongyu Ren1, Xikun Zhang1,
Christopher D. Manning1, Percy Liang1, Jure Leskovec1
1Stanford University 2EPFL Equal senior authorship
{myasu,antoineb,hyren,xikunz2,manning,pliang,jure}@cs.stanford.edu
Abstract
Pretraining a language model (LM) on text has been shown to help various down-
stream NLP tasks. Recent works show that a knowledge graph (KG) can comple-
ment text data, offering structured background knowledge that provides a useful
scaffold for reasoning. However, these works are not pretrained to learn a deep
fusion of the two modalities at scale, limiting the potential to acquire fully joint
representations of text and KG. Here we propose
DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised method to pretrain
a deeply joint language-knowledge foundation model from text and KG at scale.
Specifically, our model takes pairs of text segments and relevant KG subgraphs
as input and bidirectionally fuses information from both modalities. We pretrain
this model by unifying two self-supervised reasoning tasks, masked language mod-
eling and KG link prediction. DRAGON outperforms existing LM and LM+KG
models on diverse downstream tasks including question answering across gen-
eral and biomedical domains, with +5% absolute gain on average. In particular,
DRAGON achieves strong performance on complex reasoning about language and
knowledge (+10% on questions involving long contexts or multi-step reasoning)
and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art
results on various BioNLP tasks. Our code and trained models are available at
https://github.com/michiyasunaga/dragon.
1 Introduction
Pretraining learns self-supervised representations from massive raw data to help various downstream tasks [1]. Language models (LMs) pretrained on large amounts of text data, such as BERT [2] and GPTs [3], have shown strong performance on many natural language processing (NLP) tasks. The success of these models comes from deeply interactive (contextualized) representations of input tokens learned at scale via self-supervision [2, 4]. Meanwhile, large knowledge graphs (KGs), such as Freebase [5], Wikidata [6] and ConceptNet [7], can provide complementary information to text data. KGs offer structured background knowledge by representing entities as nodes and relations between them as edges, and also offer scaffolds for structured, multi-step reasoning about entities [8, 9, 10, 11] (§3.4.1). The dual strengths of text data and KGs motivate research in pretraining deeply interactive representations of the two modalities at scale.
How to effectively combine text and KGs for pretraining is an open problem and presents challenges.
Given text and KG, we need both (i) a deeply bidirectional model for the two modalities to interact,
and (ii) a self-supervised objective to learn joint reasoning over text and KG at scale. Several existing
works [12, 13, 14, 15, 16] propose methods for self-supervised pretraining, but they fuse text and KG in a shallow or uni-directional manner. Another line of work [8, 9] proposes bidirectional models for text and KG, but these models focus on finetuning on labeled downstream tasks and do not perform self-supervised learning.
Figure 1: Overview of our approach, DRAGON. Left: Given raw data of a text corpus and a large knowledge graph, we create aligned (text, local KG) pairs by sampling a text segment from the corpus and extracting a relevant subgraph from the KG (§2.1). As the structured knowledge in the KG can ground the text and the text can provide the KG with rich context for reasoning, we aim to pretrain a language-knowledge model jointly from the text-KG pairs (DRAGON). Right: To model the interactions over text and KG, DRAGON uses a cross-modal encoder that bidirectionally exchanges information between them to produce fused text token and KG node representations (§2.2). To pretrain DRAGON jointly on text and KG, we unify two self-supervised reasoning tasks: (1) masked language modeling, which masks some tokens in the input text and then predicts them, and (2) link prediction, which holds out some edges from the input KG and then predicts them. This joint objective encourages text and KG to mutually inform each other, facilitating the model to learn joint reasoning over text and KG (§2.3).
Consequently, existing methods may have limited their potential to model and learn deep interactions over text and KG.
To address both of the above challenges and fully unify the strengths of text and KG, we propose
DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), an approach that performs deeply bidirectional, self-supervised pretraining of a language-knowledge model from text and KG.
DRAGON has two core components: a cross-modal model that bidirectionally fuses text and KG, and
a bidirectional self-supervised objective that learns joint reasoning over text and KG. Concretely,
as in Figure 1, we take a text corpus and a KG as raw data, and create inputs for the model by
sampling a text segment from the corpus and extracting a relevant subgraph from the KG via entity
linking, obtaining a (text, local KG) pair. We use a cross-modal model to encode this input into
fused representations, where each layer of the model encodes the text with an LM and the KG with
a graph neural network (GNN), and fuses the two with a bidirectional modality interaction module
(GreaseLM [9]). We pretrain this model by unifying two self-supervised reasoning tasks: (1) masked
language modeling (MLM), which masks and predicts tokens in the input text, and (2) link prediction,
which drops and predicts edges in the input KG. The intuition is that by combining the two tasks,
MLM makes the model use the text jointly with structured knowledge in the KG to reason about
masked tokens in the text (e.g., in Figure 1, using the “round brush”–“art supply” multi-hop path
from the KG helps), and link prediction makes the model use the KG structure jointly with the textual
context to reason about missing links in the KG (e.g., recognizing that “round brush could be used
for hair” from the text helps). This joint objective thus enables text to be grounded by KG structure
and KG to be contextualized by text simultaneously, producing a deeply-unified language-knowledge
pretrained model where information flows bidirectionally between text and KG for reasoning.
We pretrain DRAGON in two domains: a general domain, using the Book corpus and ConceptNet
KG [7] (§3), and a biomedical domain, using the PubMed corpus and UMLS KG [17] (§4). We
show that DRAGON improves on existing LM and LM+KG models on diverse downstream tasks
across domains. For the general domain, DRAGON outperforms RoBERTa [18], our base LM without
KGs, on various commonsense reasoning tasks such as CSQA, OBQA, RiddleSense and HellaSwag,
with +8% absolute accuracy gain on average. For the biomedical domain, DRAGON improves on
the previous best LM, BioLinkBERT [19], and sets a new state of the art on BioNLP tasks such
as MedQA and PubMedQA, with +3% accuracy gain. In particular, DRAGON exhibits notable
improvements on QA tasks involving complex reasoning (+10% gain on multi-step, negation, hedge,
or long context reasoning) and on downstream tasks with limited training data (+8% gain). These
results show that our deep bidirectional self-supervision over text and KG produces significantly
improved language-knowledge representations compared to existing models.
1.1 Related work
Knowledge-augmented LM pretraining.
Knowledge integration is an active research area for improving LMs. One line of work is retrieval-augmented LMs [20, 21, 22], which retrieve relevant text from a
corpus and integrate it into LMs as additional knowledge. Orthogonal to these works, we focus on
using knowledge bases as background knowledge, to ground reasoning about entities and facts.
Closest to our work are works that integrate knowledge bases in LM pretraining. One line of research
aims to add entity features to LMs [12, 23, 24]; some works use the KG entity information or structure to create additional training signals [13, 25, 14, 26, 27, 28]; several works add KG triplet information directly to the LM input [29, 16, 15, 30, 31]. While these methods have achieved substantial progress,
they typically propagate information between text and KG in a shallow or uni-directional (e.g.,
KG to text) manner, which might limit the potential to perform fully joint reasoning over the two
modalities. To improve on the above works, we propose to bidirectionally interact text and KG
via a deep cross-modal model and joint self-supervision, so that text and KG are grounded and
contextualized by each other. We find that this improves model performance on various reasoning
tasks (§3). Another distinction is that existing works in this space typically focus on adding entity- or
triplet-level knowledge from KGs to LMs, and focus on solving entity/relation classification tasks.
Our work significantly expands this scope in that we use larger KG subgraphs (200 nodes) as input to
enable richer contextualization between KG and text, and we achieve performance improvements on
a broader set of NLP tasks including QA, reasoning and text classification tasks.
KG-augmented question answering.
Various works designed KG-augmented reasoning models
for question answering [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. In particular, recent works such as QAGNN [8] and GreaseLM [9] suggest that a KG can scaffold reasoning about entities with
its graph structure, and help for complex question answering (e.g., negation, multi-hop reasoning).
These works typically focus on training or finetuning models on particular QA datasets. In contrast,
we generalize this and integrate KG-augmented reasoning into general-purpose pretraining. Our
motivation is that self-supervised pretraining allows the model to learn from larger and more diverse
data, helping to learn richer interactions between text and KGs and to acquire more diverse reasoning
abilities beyond specific QA tasks. We find that our proposed pretraining approach (DRAGON) offers
significant boosts over the baseline QA models (e.g. GreaseLM) on diverse downstream tasks (§3).
This opens a new research avenue in scaling up various carefully-designed QA models to pretraining.
KG representation learning.
Our link prediction task used in pretraining is motivated by research
in KG representation learning. Link prediction is a fundamental task in KGs [43, 44], and various works study methods to learn KG entity and relation embeddings for link prediction, such as TransE [45], DistMult [46] and RotatE [47]. Several works additionally use textual data or pretrained LMs to help learn KG embeddings and link prediction [48, 49, 50, 51, 52, 53]. While these works focus on the
KG-side representations, we extend the scope and use the KG-side objective (link prediction) jointly
with a text-side objective (language modeling) to train a mutually-interactive text-KG model.
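For reference, the triplet scoring functions of the embedding methods named above are (these are the standard definitions of those methods, not specific to our model): TransE scores $f(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$, DistMult scores $f(h, r, t) = \sum_i h_i\, r_i\, t_i$, and RotatE scores $f(h, r, t) = -\|\mathbf{h} \circ \mathbf{r} - \mathbf{t}\|$ with complex-valued embeddings and $|r_i| = 1$, where $\mathbf{h}$, $\mathbf{r}$, $\mathbf{t}$ denote the head, relation, and tail embeddings and a higher score indicates a more plausible triplet.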
2 Deep Bidirectional Language-Knowledge Graph Pretraining (DRAGON)
We propose DRAGON, an approach that performs deeply bidirectional, self-supervised pretraining of
a language-knowledge model from text and KG. Specifically, as illustrated in Figure 1, we take a text
corpus and a large knowledge graph as raw data, and create input instances for the model by sampling
coarsely-aligned (text segment, local KG) pairs (§2.1). To learn mutual interactions over text and
KG, DRAGON consists of a cross-modal encoder (GreaseLM) that fuses the input text-KG pair
bidirectionally (§2.2), and a pretraining objective that performs bidirectional self-supervision on the
text-KG input (§2.3). Our pretraining objective unifies masked language modeling (MLM) and KG
link prediction (LinkPred) to make text and KG mutually inform each other and learn joint reasoning
over them. Finally, we describe how we finetune the pretrained DRAGON model for downstream
tasks (§2.4). While each individual piece of our approach (GreaseLM, MLM, LinkPred) is not new
in itself, we are the first to bring them together effectively and demonstrate that the resulting model
has strong empirical results (§3, §4).
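Before giving formal definitions, here is a minimal sketch of how one training step could combine the two self-supervised losses. It is an illustration under assumptions, not the paper's implementation: the module and batch field names (`model.encode`, `mlm_head`, `relation_embeddings`, the `batch` fields) are hypothetical, the link prediction head is a DistMult-style stand-in trained against sampled negative tails, and the equal loss weighting is an assumption (the actual objective is detailed in §2.3).

```python
import torch
import torch.nn.functional as F

def dragon_pretraining_loss(model, batch, mlm_weight=1.0, link_weight=1.0):
    """Hypothetical joint loss: masked language modeling + KG link prediction.

    Assumes `model` returns fused token states H and node states V for a
    (masked text, corrupted local KG) input, and exposes an MLM head and a
    DistMult-style relation embedding table -- all names are illustrative.
    """
    # Encode the masked text together with the KG whose held-out edges are removed.
    H, V = model.encode(batch.masked_tokens, batch.kg_nodes, batch.kept_edges)

    # (1) MLM: predict the masked tokens from the fused token representations.
    vocab_logits = model.mlm_head(H)                       # [num_tokens, vocab]
    mlm_loss = F.cross_entropy(
        vocab_logits.view(-1, vocab_logits.size(-1)),
        batch.mlm_labels.view(-1),
        ignore_index=-100,                                 # unmasked positions
    )

    # (2) Link prediction: score held-out (head, relation, tail) triplets
    # against sampled negatives with a DistMult-style trilinear product.
    h = V[batch.heldout_heads]                             # [num_edges, dim]
    t = V[batch.heldout_tails]
    r = model.relation_embeddings(batch.heldout_relations)
    pos_scores = (h * r * t).sum(-1)                       # [num_edges]
    t_neg = V[batch.negative_tails]                        # [num_edges, num_neg, dim]
    neg_scores = ((h * r).unsqueeze(1) * t_neg).sum(-1)    # [num_edges, num_neg]
    link_loss = F.binary_cross_entropy_with_logits(
        torch.cat([pos_scores, neg_scores.flatten()]),
        torch.cat([torch.ones_like(pos_scores),
                   torch.zeros_like(neg_scores.flatten())]),
    )

    return mlm_weight * mlm_loss + link_weight * link_loss
```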
Definitions. We define a text corpus $\mathcal{W}$ as a set of text segments, $\mathcal{W} = \{W\}$, and each text segment $W$ as a sequence of tokens (words), $W = (w_1, ..., w_I)$. We define a knowledge graph (KG) as a multi-relational graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of entity nodes in the KG and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}$ is the set of edges (triplets) that connect nodes in $\mathcal{V}$, with $\mathcal{R}$ being the set of relation types $\{r\}$. Each triplet $(h, r, t)$ in a KG can represent a knowledge fact such as (Paris, in, France). As a raw KG is often large, with millions of nodes, a subgraph of the raw KG (local KG) is considered: $G = (V, E)$, where $V = \{v_1, ..., v_J\} \subseteq \mathcal{V}$ and $E \subseteq \mathcal{E}$. We define a language-knowledge model to be a composition of two functions, $f_{\text{head}}(f_{\text{enc}}(X))$, where the encoder $f_{\text{enc}}$ takes in an input $X =$ (text segment $W$, local KG $G$), and produces a contextualized vector representation for each text token, $(\mathbf{H}_1, ..., \mathbf{H}_I)$, and for each KG node, $(\mathbf{V}_1, ..., \mathbf{V}_J)$. A language model is a special case of a language-knowledge model with no KG ($J = 0$). The head $f_{\text{head}}$ uses these representations to perform self-supervised tasks in the pretraining step and to perform downstream tasks in the finetuning step.
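As a concrete rendering of these definitions, the encoder input could be represented by the data structures below. This is a minimal sketch; the class and field names are illustrative, not the repository's actual types.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LocalKG:
    """Local KG G = (V, E): a subgraph of the raw KG."""
    nodes: List[int]                    # entity ids v_1, ..., v_J (a subset of the raw KG's nodes)
    edges: List[Tuple[int, int, int]]   # triplets (head, relation, tail) drawn from the raw KG

@dataclass
class LanguageKnowledgeInput:
    """One input instance X = (text segment W, local KG G)."""
    tokens: List[int]                   # token ids w_1, ..., w_I of the text segment W
    kg: LocalKG                         # the paired local KG

# A plain language model is the special case with an empty local KG (J = 0):
lm_only_input = LanguageKnowledgeInput(tokens=[5, 17, 42], kg=LocalKG(nodes=[], edges=[]))
```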
2.1 Input representation

Given a text corpus $\mathcal{W}$ and a large knowledge graph $\mathcal{G}$, we create input instances for the model by preparing (text segment $W$, local KG $G$) pairs. We want each pair's text and KG to be (roughly) semantically aligned so that the text and KG can mutually inform each other and facilitate the model to learn interactive reasoning between the two modalities. Specifically, for each text segment $W$ from $\mathcal{W}$, we extract a relevant local KG $G$ for it from $\mathcal{G}$ via the following KG retrieval process.

KG retrieval. Given a text segment $W$, we link entity mentions in $W$ to entity nodes in $\mathcal{G}$ to get an initial set of nodes $V_{\text{el}}$. We then add their 2-hop bridge nodes from $\mathcal{G}$ to get the total retrieved nodes $V \subseteq \mathcal{V}$. Lastly, we add all edges that span these nodes in $\mathcal{G}$ to get $E \subseteq \mathcal{E}$, which yields the final local KG, $G = (V, E)$, as well as our final input instance $X = (W, G)$. Appendix B.1 provides more details on KG retrieval. Henceforth, we use “KG” to refer to this local KG $G$ unless noted otherwise.

Modality interaction token/node. For each resulting (text, KG) pair, we further add a special token (interaction token) $w_{\text{int}}$ to the text and a special node (interaction node) $v_{\text{int}}$ to the KG, which will serve as an information pooling point for each modality as well as an interface for modality interaction in our cross-modal encoder (§2.2). Specifically, we prepend $w_{\text{int}}$ to the original text $W = (w_1, ..., w_I)$, and connect $v_{\text{int}}$ to the entity-linked nodes in the original KG, $V_{\text{el}} \subseteq V = \{v_1, ..., v_J\}$, using a new relation type $r_{\text{el}}$. The interaction token and node can also be used to produce a pooled representation of the whole input, e.g., when finetuning for classification tasks (§2.4).
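The input construction above can be sketched as follows. This is a simplified illustration under assumptions: entity linking is reduced to a token-id lookup, the 2-hop bridge heuristic is a stand-in for the procedure detailed in Appendix B.1 of the paper, and the special ids for $w_{\text{int}}$, $v_{\text{int}}$, $r_{\text{el}}$ are placeholders.

```python
from typing import Dict, List, Set, Tuple

W_INT = 0      # placeholder id for the interaction token w_int
V_INT = -1     # placeholder id for the interaction node v_int
R_EL = 999     # placeholder id for the new relation type r_el

def bridge_nodes_2hop(seed: Set[int], edges: Set[Tuple[int, int, int]]) -> Set[int]:
    """Simplified 2-hop bridges: nodes adjacent to at least two linked entities."""
    neighbors: Dict[int, Set[int]] = {}
    for h, _, t in edges:
        neighbors.setdefault(h, set()).add(t)
        neighbors.setdefault(t, set()).add(h)
    return {v for v, nbrs in neighbors.items() if v not in seed and len(nbrs & seed) >= 2}

def build_input(tokens: List[int],
                raw_kg_edges: Set[Tuple[int, int, int]],
                mention_to_entity: Dict[int, int]):
    """Build one (text, local KG) input pair; a hypothetical, simplified sketch."""
    # 1) Entity linking: mentions in the text -> initial node set V_el.
    v_el = {mention_to_entity[t] for t in tokens if t in mention_to_entity}

    # 2) Add 2-hop bridge nodes to get the total retrieved nodes V.
    nodes = v_el | bridge_nodes_2hop(v_el, raw_kg_edges)

    # 3) Keep all raw-KG edges whose endpoints both lie in V, giving E.
    edges = {(h, r, t) for (h, r, t) in raw_kg_edges if h in nodes and t in nodes}

    # 4) Prepend the interaction token w_int to the text, and connect the
    #    interaction node v_int to the entity-linked nodes via relation r_el.
    tokens = [W_INT] + tokens
    nodes = nodes | {V_INT}
    edges = edges | {(V_INT, R_EL, v) for v in v_el}
    return tokens, nodes, edges
```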
2.2 Cross-modal encoder

To model mutual interactions over the text and KG, we use a bidirectional sequence-graph encoder for $f_{\text{enc}}$, which takes in the text tokens and KG nodes and exchanges information across them for multiple layers to produce a fused representation of each token and node (Figure 1 right):

$$(\mathbf{H}_{\text{int}}, \mathbf{H}_1, ..., \mathbf{H}_I),\ (\mathbf{V}_{\text{int}}, \mathbf{V}_1, ..., \mathbf{V}_J) = f_{\text{enc}}\big((w_{\text{int}}, w_1, ..., w_I),\ (v_{\text{int}}, v_1, ..., v_J)\big). \quad (1)$$

While we may use any deep bidirectional sequence-graph encoder for $f_{\text{enc}}$, for controlled comparison with existing works, we adopt the existing top-performing sequence-graph architecture, GreaseLM [9], which combines Transformers [54] and graph neural networks (GNNs) to fuse text-KG inputs. Specifically, GreaseLM first uses $N$ Transformer language model (LM) layers to map the input text into initial token representations, and uses KG node embeddings to map the input KG nodes into initial node representations,

$$(\mathbf{H}^{(0)}_{\text{int}}, \mathbf{H}^{(0)}_1, ..., \mathbf{H}^{(0)}_I) = \text{LM-Layers}(w_{\text{int}}, w_1, ..., w_I), \quad (2)$$
$$(\mathbf{V}^{(0)}_{\text{int}}, \mathbf{V}^{(0)}_1, ..., \mathbf{V}^{(0)}_J) = \text{Node-Embedding}(v_{\text{int}}, v_1, ..., v_J). \quad (3)$$

Then it uses $M$ text-KG fusion layers to encode these token/node representations jointly into the final token/node representations,

$$(\mathbf{H}_{\text{int}}, ..., \mathbf{H}_I),\ (\mathbf{V}_{\text{int}}, ..., \mathbf{V}_J) = \text{Fusion-Layers}\big((\mathbf{H}^{(0)}_{\text{int}}, ..., \mathbf{H}^{(0)}_I),\ (\mathbf{V}^{(0)}_{\text{int}}, ..., \mathbf{V}^{(0)}_J)\big), \quad (4)$$

where each fusion layer ($\ell = 1, ..., M$) performs the following:

$$(\widetilde{\mathbf{H}}^{(\ell)}_{\text{int}}, \mathbf{H}^{(\ell)}_1, ..., \mathbf{H}^{(\ell)}_I) = \text{LM-Layer}(\mathbf{H}^{(\ell-1)}_{\text{int}}, \mathbf{H}^{(\ell-1)}_1, ..., \mathbf{H}^{(\ell-1)}_I), \quad (5)$$
$$(\widetilde{\mathbf{V}}^{(\ell)}_{\text{int}}, \mathbf{V}^{(\ell)}_1, ..., \mathbf{V}^{(\ell)}_J) = \text{GNN-Layer}(\mathbf{V}^{(\ell-1)}_{\text{int}}, \mathbf{V}^{(\ell-1)}_1, ..., \mathbf{V}^{(\ell-1)}_J), \quad (6)$$
$$[\mathbf{H}^{(\ell)}_{\text{int}}; \mathbf{V}^{(\ell)}_{\text{int}}] = \text{MInt}([\widetilde{\mathbf{H}}^{(\ell)}_{\text{int}}; \widetilde{\mathbf{V}}^{(\ell)}_{\text{int}}]). \quad (7)$$

Here the GNN induces graph structure-aware representations of the KG nodes, $[\cdot\,;\cdot]$ denotes concatenation, and MInt (the modality interaction module) exchanges information between the interaction token (text side) and the interaction node (KG side) via an MLP. For more details on GreaseLM, we refer readers to [9].
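A minimal sketch of one fusion layer (Eqs. 5-7) is given below, assuming PyTorch. The module choices are stand-ins, not the actual GreaseLM layers: `nn.TransformerEncoderLayer` substitutes for the pretrained LM layer, and a plain linear map substitutes for the real GNN layer (which would use the graph structure); see [9] for the actual architecture.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One text-KG fusion layer: LM layer + GNN layer + modality interaction (MInt)."""

    def __init__(self, text_dim: int, node_dim: int, n_heads: int = 8):
        super().__init__()
        # Stand-ins for GreaseLM's LM layer and GNN layer.
        self.lm_layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=n_heads,
                                                   batch_first=True)
        self.gnn_layer = nn.Linear(node_dim, node_dim)   # placeholder for a real GNN layer
        # MInt: an MLP over the concatenated interaction token/node states.
        self.mint = nn.Sequential(
            nn.Linear(text_dim + node_dim, text_dim + node_dim),
            nn.GELU(),
            nn.Linear(text_dim + node_dim, text_dim + node_dim),
        )
        self.text_dim = text_dim

    def forward(self, H: torch.Tensor, V: torch.Tensor):
        # H: [batch, 1 + I, text_dim] token states (index 0 = interaction token)
        # V: [batch, 1 + J, node_dim] node states  (index 0 = interaction node)
        H = self.lm_layer(H)                      # Eq. (5)
        V = self.gnn_layer(V)                     # Eq. (6); a real GNN would message-pass over edges
        joint = self.mint(torch.cat([H[:, 0], V[:, 0]], dim=-1))   # Eq. (7)
        # Write the exchanged states back into the interaction token/node slots.
        H = torch.cat([joint[:, :self.text_dim].unsqueeze(1), H[:, 1:]], dim=1)
        V = torch.cat([joint[:, self.text_dim:].unsqueeze(1), V[:, 1:]], dim=1)
        return H, V
```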