Deep Bidirectional Language-Knowledge Graph
Pretraining
Michihiro Yasunaga1, Antoine Bosselut2, Hongyu Ren1, Xikun Zhang1,
Christopher D. Manning1, Percy Liang1, Jure Leskovec1
1Stanford University 2EPFL Equal senior authorship
{myasu,antoineb,hyren,xikunz2,manning,pliang,jure}@cs.stanford.edu
Abstract
Pretraining a language model (LM) on text has been shown to help various down-
stream NLP tasks. Recent works show that a knowledge graph (KG) can comple-
ment text data, offering structured background knowledge that provides a useful
scaffold for reasoning. However, these works are not pretrained to learn a deep
fusion of the two modalities at scale, limiting the potential to acquire fully joint
representations of text and KG. Here we propose
DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised method to pretrain
a deeply joint language-knowledge foundation model from text and KG at scale.
Specifically, our model takes pairs of text segments and relevant KG subgraphs
as input and bidirectionally fuses information from both modalities. We pretrain
this model by unifying two self-supervised reasoning tasks, masked language mod-
eling and KG link prediction. DRAGON outperforms existing LM and LM+KG
models on diverse downstream tasks including question answering across gen-
eral and biomedical domains, with +5% absolute gain on average. In particular,
DRAGON achieves strong performance on complex reasoning about language and
knowledge (+10% on questions involving long contexts or multi-step reasoning)
and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art
results on various BioNLP tasks. Our code and trained models are available at
https://github.com/michiyasunaga/dragon.
1 Introduction
Pretraining learns self-supervised representations from massive raw data to help various downstream tasks [1]. Language models (LMs) pretrained on large amounts of text data, such as BERT [2] and GPTs [3], have shown strong performance on many natural language processing (NLP) tasks. The success of these models comes from deeply interactive (contextualized) representations of input tokens learned at scale via self-supervision [2, 4]. Meanwhile, large knowledge graphs (KGs), such as Freebase [5], Wikidata [6] and ConceptNet [7], can provide complementary information to text data. KGs offer structured background knowledge by representing entities as nodes and relations between them as edges, and also offer scaffolds for structured, multi-step reasoning about entities [8, 9, 10, 11] (§3.4.1). The dual strengths of text data and KGs motivate research in pretraining deeply interactive representations of the two modalities at scale.
How to effectively combine text and KGs for pretraining is an open problem and presents challenges.
Given text and KG, we need both (i) a deeply bidirectional model for the two modalities to interact,
and (ii) a self-supervised objective to learn joint reasoning over text and KG at scale. Several existing
works [12, 13, 14, 15, 16] propose methods for self-supervised pretraining, but they fuse text and KG in a shallow or uni-directional manner. Another line of work [8, 9] proposes bidirectional models for text and KG, but these models focus on finetuning on labeled downstream tasks and do not perform self-supervised learning.
Figure 1: Overview of our approach, DRAGON. Left: Given raw data of a text corpus and a large knowledge graph, we create aligned (text, local KG) pairs by sampling a text segment from the corpus and extracting a relevant subgraph from the KG (§2.1). As the structured knowledge in the KG can ground the text and the text can provide the KG with rich context for reasoning, we aim to pretrain a language-knowledge model jointly from the text-KG pairs (DRAGON). Right: To model the interactions over text and KG, DRAGON uses a cross-modal encoder that bidirectionally exchanges information between them to produce fused text token and KG node representations (§2.2). To pretrain DRAGON jointly on text and KG, we unify two self-supervised reasoning tasks: (1) masked language modeling, which masks some tokens in the input text and then predicts them, and (2) link prediction, which holds out some edges from the input KG and then predicts them. This joint objective encourages text and KG to mutually inform each other, facilitating the model to learn joint reasoning over text and KG (§2.3).
Consequently, existing methods may have limited their potential to model and learn deep interactions over text and KG.
To address both of the above challenges and fully unify the strengths of text and KG, we propose
DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), an approach that performs deeply bidirectional, self-supervised pretraining of a language-knowledge model from text and KG.
DRAGON has two core components: a cross-modal model that bidirectionally fuses text and KG, and
a bidirectional self-supervised objective that learns joint reasoning over text and KG. Concretely,
as in Figure 1, we take a text corpus and a KG as raw data, and create inputs for the model by
sampling a text segment from the corpus and extracting a relevant subgraph from the KG via entity
linking, obtaining a (text, local KG) pair. We use a cross-modal model to encode this input into
fused representations, where each layer of the model encodes the text with an LM and the KG with
a graph neural network (GNN), and fuses the two with a bidirectional modality interaction module
(GreaseLM [9]). We pretrain this model by unifying two self-supervised reasoning tasks: (1) masked
language modeling (MLM), which masks and predicts tokens in the input text, and (2) link prediction,
which drops and predicts edges in the input KG. The intuition is that by combining the two tasks,
MLM makes the model use the text jointly with structured knowledge in the KG to reason about
masked tokens in the text (e.g., in Figure 1, using the “round brush”–“art supply” multi-hop path
from the KG helps), and link prediction makes the model use the KG structure jointly with the textual
context to reason about missing links in the KG (e.g., recognizing that “round brush could be used
for hair” from the text helps). This joint objective thus enables text to be grounded by KG structure
and KG to be contextualized by text simultaneously, producing a deeply-unified language-knowledge
pretrained model where information flows bidirectionally between text and KG for reasoning.
We pretrain DRAGON in two domains: a general domain, using the Book corpus and ConceptNet
KG [7] (§3), and a biomedical domain, using the PubMed corpus and UMLS KG [17] (§4). We
show that DRAGON improves on existing LM and LM+KG models on diverse downstream tasks
across domains. For the general domain, DRAGON outperforms RoBERTa [18], our base LM without
KGs, on various commonsense reasoning tasks such as CSQA, OBQA, RiddleSense and HellaSwag,
with +8% absolute accuracy gain on average. For the biomedical domain, DRAGON improves on
the previous best LM, BioLinkBERT [19], and sets a new state of the art on BioNLP tasks such
as MedQA and PubMedQA, with +3% accuracy gain. In particular, DRAGON exhibits notable
improvements on QA tasks involving complex reasoning (+10% gain on multi-step, negation, hedge,
or long context reasoning) and on downstream tasks with limited training data (+8% gain). These
results show that our deep bidirectional self-supervision over text and KG produces significantly
improved language-knowledge representations compared to existing models.
1.1 Related work
Knowledge-augmented LM pretraining.
Knowledge integration is an active research area for improving LMs. One line of work is retrieval-augmented LMs [20, 21, 22], which retrieve relevant text from a
corpus and integrate it into LMs as additional knowledge. Orthogonal to these works, we focus on
using knowledge bases as background knowledge, to ground reasoning about entities and facts.
Closest to our work are works that integrate knowledge bases in LM pretraining. One line of research
aims to add entity features to LMs [12, 23, 24]; some works use the KG entity information or structure to create additional training signals [13, 25, 14, 26, 27, 28]; several works add KG triplet information directly to the LM input [29, 16, 15, 30, 31]. While these methods have achieved substantial progress,
they typically propagate information between text and KG in a shallow or uni-directional (e.g.,
KG to text) manner, which might limit the potential to perform fully joint reasoning over the two
modalities. To improve on the above works, we propose to bidirectionally interact text and KG
via a deep cross-modal model and joint self-supervision, so that text and KG are grounded and
contextualized by each other. We find that this improves model performance on various reasoning
tasks (§3). Another distinction is that existing works in this space typically focus on adding entity- or
triplet-level knowledge from KGs to LMs, and focus on solving entity/relation classification tasks.
Our work significantly expands this scope in that we use larger KG subgraphs (200 nodes) as input to
enable richer contextualization between KG and text, and we achieve performance improvements on
a broader set of NLP tasks including QA, reasoning and text classification tasks.
KG-augmented question answering.
Various works designed KG-augmented reasoning models
for question answering [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. In particular, recent works such as QAGNN [8] and GreaseLM [9] suggest that a KG can scaffold reasoning about entities with
its graph structure, and help for complex question answering (e.g., negation, multi-hop reasoning).
These works typically focus on training or finetuning models on particular QA datasets. In contrast,
we generalize this and integrate KG-augmented reasoning into general-purpose pretraining. Our
motivation is that self-supervised pretraining allows the model to learn from larger and more diverse
data, helping to learn richer interactions between text and KGs and to acquire more diverse reasoning
abilities beyond specific QA tasks. We find that our proposed pretraining approach (DRAGON) offers
significant boosts over the baseline QA models (e.g. GreaseLM) on diverse downstream tasks (§3).
This opens a new research avenue in scaling up various carefully-designed QA models to pretraining.
KG representation learning.
Our link prediction task used in pretraining is motivated by research
in KG representation learning. Link prediction is a fundamental task in KGs [43, 44], and various works study methods to learn KG entity and relation embeddings for link prediction, such as TransE [45], DistMult [46] and RotatE [47]. Several works additionally use textual data or pretrained LMs to help learn KG embeddings and link prediction [48, 49, 50, 51, 52, 53]. While these works focus on the
KG-side representations, we extend the scope and use the KG-side objective (link prediction) jointly
with a text-side objective (language modeling) to train a mutually-interactive text-KG model.
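For reference, the triplet scoring functions of the embedding methods named above are (these are the standard definitions of those methods, not specific to our model): TransE scores $f(h, r, t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$, DistMult scores $f(h, r, t) = \sum_i h_i\, r_i\, t_i$, and RotatE scores $f(h, r, t) = -\|\mathbf{h} \circ \mathbf{r} - \mathbf{t}\|$ with complex-valued embeddings and $|r_i| = 1$, where $\mathbf{h}$, $\mathbf{r}$, $\mathbf{t}$ denote the head, relation, and tail embeddings and a higher score indicates a more plausible triplet.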
2 Deep Bidirectional Language-Knowledge Graph Pretraining (DRAGON)
We propose DRAGON, an approach that performs deeply bidirectional, self-supervised pretraining of
a language-knowledge model from text and KG. Specifically, as illustrated in Figure 1, we take a text
corpus and a large knowledge graph as raw data, and create input instances for the model by sampling
coarsely-aligned (text segment, local KG) pairs (§2.1). To learn mutual interactions over text and
KG, DRAGON consists of a cross-modal encoder (GreaseLM) that fuses the input text-KG pair
bidirectionally (§2.2), and a pretraining objective that performs bidirectional self-supervision on the
text-KG input (§2.3). Our pretraining objective unifies masked language modeling (MLM) and KG
link prediction (LinkPred) to make text and KG mutually inform each other and learn joint reasoning
over them. Finally, we describe how we finetune the pretrained DRAGON model for downstream
tasks (§2.4). While each individual piece of our approach (GreaseLM, MLM, LinkPred) is not new
in itself, we are the first to bring them together effectively and demonstrate that the resulting model
has strong empirical results (§3, §4).
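Before giving formal definitions, here is a minimal sketch of how one training step could combine the two self-supervised losses. It is an illustration under assumptions, not the paper's implementation: the module and batch field names (`model.encode`, `mlm_head`, `relation_embeddings`, the `batch` fields) are hypothetical, the link prediction head is a DistMult-style stand-in trained against sampled negative tails, and the equal loss weighting is an assumption (the actual objective is detailed in §2.3).

```python
import torch
import torch.nn.functional as F

def dragon_pretraining_loss(model, batch, mlm_weight=1.0, link_weight=1.0):
    """Hypothetical joint loss: masked language modeling + KG link prediction.

    Assumes `model` returns fused token states H and node states V for a
    (masked text, corrupted local KG) input, and exposes an MLM head and a
    DistMult-style relation embedding table -- all names are illustrative.
    """
    # Encode the masked text together with the KG whose held-out edges are removed.
    H, V = model.encode(batch.masked_tokens, batch.kg_nodes, batch.kept_edges)

    # (1) MLM: predict the masked tokens from the fused token representations.
    vocab_logits = model.mlm_head(H)                       # [num_tokens, vocab]
    mlm_loss = F.cross_entropy(
        vocab_logits.view(-1, vocab_logits.size(-1)),
        batch.mlm_labels.view(-1),
        ignore_index=-100,                                 # unmasked positions
    )

    # (2) Link prediction: score held-out (head, relation, tail) triplets
    # against sampled negatives with a DistMult-style trilinear product.
    h = V[batch.heldout_heads]                             # [num_edges, dim]
    t = V[batch.heldout_tails]
    r = model.relation_embeddings(batch.heldout_relations)
    pos_scores = (h * r * t).sum(-1)                       # [num_edges]
    t_neg = V[batch.negative_tails]                        # [num_edges, num_neg, dim]
    neg_scores = ((h * r).unsqueeze(1) * t_neg).sum(-1)    # [num_edges, num_neg]
    link_loss = F.binary_cross_entropy_with_logits(
        torch.cat([pos_scores, neg_scores.flatten()]),
        torch.cat([torch.ones_like(pos_scores),
                   torch.zeros_like(neg_scores.flatten())]),
    )

    return mlm_weight * mlm_loss + link_weight * link_loss
```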
Definitions. We define a text corpus $\mathcal{W}$ as a set of text segments, $\mathcal{W} = \{W\}$, and each text segment $W$ as a sequence of tokens (words), $W = (w_1, ..., w_I)$. We define a knowledge graph (KG) as a multi-relational graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of entity nodes in the KG and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}$ is the set of edges (triplets) that connect nodes in $\mathcal{V}$, with $\mathcal{R}$ being the set of relation types $\{r\}$. Each triplet $(h, r, t)$ in a KG can represent a knowledge fact such as (Paris, in, France). As a raw KG is often large, with millions of nodes, a subgraph of the raw KG (local KG) is considered: $G = (V, E)$, where $V = \{v_1, ..., v_J\} \subseteq \mathcal{V}$ and $E \subseteq \mathcal{E}$. We define a language-knowledge model to be a composition of two functions, $f_{\text{head}}(f_{\text{enc}}(X))$, where the encoder $f_{\text{enc}}$ takes in an input $X =$ (text segment $W$, local KG $G$), and produces a contextualized vector representation for each text token, $(\mathbf{H}_1, ..., \mathbf{H}_I)$, and for each KG node, $(\mathbf{V}_1, ..., \mathbf{V}_J)$. A language model is a special case of a language-knowledge model with no KG ($J = 0$). The head $f_{\text{head}}$ uses these representations to perform self-supervised tasks in the pretraining step and to perform downstream tasks in the finetuning step.
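As a concrete rendering of these definitions, the encoder input could be represented by the data structures below. This is a minimal sketch; the class and field names are illustrative, not the repository's actual types.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LocalKG:
    """Local KG G = (V, E): a subgraph of the raw KG."""
    nodes: List[int]                    # entity ids v_1, ..., v_J (a subset of the raw KG's nodes)
    edges: List[Tuple[int, int, int]]   # triplets (head, relation, tail) drawn from the raw KG

@dataclass
class LanguageKnowledgeInput:
    """One input instance X = (text segment W, local KG G)."""
    tokens: List[int]                   # token ids w_1, ..., w_I of the text segment W
    kg: LocalKG                         # the paired local KG

# A plain language model is the special case with an empty local KG (J = 0):
lm_only_input = LanguageKnowledgeInput(tokens=[5, 17, 42], kg=LocalKG(nodes=[], edges=[]))
```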
2.1 Input representation

Given a text corpus $\mathcal{W}$ and a large knowledge graph $\mathcal{G}$, we create input instances for the model by preparing (text segment $W$, local KG $G$) pairs. We want each pair's text and KG to be (roughly) semantically aligned so that the text and KG can mutually inform each other and facilitate the model to learn interactive reasoning between the two modalities. Specifically, for each text segment $W$ from $\mathcal{W}$, we extract a relevant local KG $G$ for it from $\mathcal{G}$ via the following KG retrieval process.

KG retrieval. Given a text segment $W$, we link entity mentions in $W$ to entity nodes in $\mathcal{G}$ to get an initial set of nodes $V_{\text{el}}$. We then add their 2-hop bridge nodes from $\mathcal{G}$ to get the total retrieved nodes $V \subseteq \mathcal{V}$. Lastly, we add all edges that span these nodes in $\mathcal{G}$ to get $E \subseteq \mathcal{E}$, which yields the final local KG, $G = (V, E)$, as well as our final input instance $X = (W, G)$. Appendix B.1 provides more details on KG retrieval. Henceforth, we use “KG” to refer to this local KG $G$ unless noted otherwise.

Modality interaction token/node. For each resulting (text, KG) pair, we further add a special token (interaction token) $w_{\text{int}}$ to the text and a special node (interaction node) $v_{\text{int}}$ to the KG, which will serve as an information pooling point for each modality as well as an interface for modality interaction in our cross-modal encoder (§2.2). Specifically, we prepend $w_{\text{int}}$ to the original text $W = (w_1, ..., w_I)$, and connect $v_{\text{int}}$ to the entity-linked nodes in the original KG, $V_{\text{el}} \subseteq V = \{v_1, ..., v_J\}$, using a new relation type $r_{\text{el}}$. The interaction token and node can also be used to produce a pooled representation of the whole input, e.g., when finetuning for classification tasks (§2.4).
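The input construction above can be sketched as follows. This is a simplified illustration under assumptions: entity linking is reduced to a token-id lookup, the 2-hop bridge heuristic is a stand-in for the procedure detailed in Appendix B.1 of the paper, and the special ids for $w_{\text{int}}$, $v_{\text{int}}$, $r_{\text{el}}$ are placeholders.

```python
from typing import Dict, List, Set, Tuple

W_INT = 0      # placeholder id for the interaction token w_int
V_INT = -1     # placeholder id for the interaction node v_int
R_EL = 999     # placeholder id for the new relation type r_el

def bridge_nodes_2hop(seed: Set[int], edges: Set[Tuple[int, int, int]]) -> Set[int]:
    """Simplified 2-hop bridges: nodes adjacent to at least two linked entities."""
    neighbors: Dict[int, Set[int]] = {}
    for h, _, t in edges:
        neighbors.setdefault(h, set()).add(t)
        neighbors.setdefault(t, set()).add(h)
    return {v for v, nbrs in neighbors.items() if v not in seed and len(nbrs & seed) >= 2}

def build_input(tokens: List[int],
                raw_kg_edges: Set[Tuple[int, int, int]],
                mention_to_entity: Dict[int, int]):
    """Build one (text, local KG) input pair; a hypothetical, simplified sketch."""
    # 1) Entity linking: mentions in the text -> initial node set V_el.
    v_el = {mention_to_entity[t] for t in tokens if t in mention_to_entity}

    # 2) Add 2-hop bridge nodes to get the total retrieved nodes V.
    nodes = v_el | bridge_nodes_2hop(v_el, raw_kg_edges)

    # 3) Keep all raw-KG edges whose endpoints both lie in V, giving E.
    edges = {(h, r, t) for (h, r, t) in raw_kg_edges if h in nodes and t in nodes}

    # 4) Prepend the interaction token w_int to the text, and connect the
    #    interaction node v_int to the entity-linked nodes via relation r_el.
    tokens = [W_INT] + tokens
    nodes = nodes | {V_INT}
    edges = edges | {(V_INT, R_EL, v) for v in v_el}
    return tokens, nodes, edges
```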
2.2 Cross-modal encoder

To model mutual interactions over the text and KG, we use a bidirectional sequence-graph encoder for $f_{\text{enc}}$, which takes in the text tokens and KG nodes and exchanges information across them for multiple layers to produce a fused representation of each token and node (Figure 1 right):

$$(\mathbf{H}_{\text{int}}, \mathbf{H}_1, ..., \mathbf{H}_I),\ (\mathbf{V}_{\text{int}}, \mathbf{V}_1, ..., \mathbf{V}_J) = f_{\text{enc}}\big((w_{\text{int}}, w_1, ..., w_I),\ (v_{\text{int}}, v_1, ..., v_J)\big). \quad (1)$$

While we may use any deep bidirectional sequence-graph encoder for $f_{\text{enc}}$, for controlled comparison with existing works, we adopt the existing top-performing sequence-graph architecture, GreaseLM [9], which combines Transformers [54] and graph neural networks (GNNs) to fuse text-KG inputs. Specifically, GreaseLM first uses $N$ Transformer language model (LM) layers to map the input text into initial token representations, and uses KG node embeddings to map the input KG nodes into initial node representations,

$$(\mathbf{H}^{(0)}_{\text{int}}, \mathbf{H}^{(0)}_1, ..., \mathbf{H}^{(0)}_I) = \text{LM-Layers}(w_{\text{int}}, w_1, ..., w_I), \quad (2)$$
$$(\mathbf{V}^{(0)}_{\text{int}}, \mathbf{V}^{(0)}_1, ..., \mathbf{V}^{(0)}_J) = \text{Node-Embedding}(v_{\text{int}}, v_1, ..., v_J). \quad (3)$$

Then it uses $M$ text-KG fusion layers to encode these token/node representations jointly into the final token/node representations,

$$(\mathbf{H}_{\text{int}}, ..., \mathbf{H}_I),\ (\mathbf{V}_{\text{int}}, ..., \mathbf{V}_J) = \text{Fusion-Layers}\big((\mathbf{H}^{(0)}_{\text{int}}, ..., \mathbf{H}^{(0)}_I),\ (\mathbf{V}^{(0)}_{\text{int}}, ..., \mathbf{V}^{(0)}_J)\big), \quad (4)$$

where each fusion layer ($\ell = 1, ..., M$) performs the following:

$$(\widetilde{\mathbf{H}}^{(\ell)}_{\text{int}}, \mathbf{H}^{(\ell)}_1, ..., \mathbf{H}^{(\ell)}_I) = \text{LM-Layer}(\mathbf{H}^{(\ell-1)}_{\text{int}}, \mathbf{H}^{(\ell-1)}_1, ..., \mathbf{H}^{(\ell-1)}_I), \quad (5)$$
$$(\widetilde{\mathbf{V}}^{(\ell)}_{\text{int}}, \mathbf{V}^{(\ell)}_1, ..., \mathbf{V}^{(\ell)}_J) = \text{GNN-Layer}(\mathbf{V}^{(\ell-1)}_{\text{int}}, \mathbf{V}^{(\ell-1)}_1, ..., \mathbf{V}^{(\ell-1)}_J), \quad (6)$$
$$[\mathbf{H}^{(\ell)}_{\text{int}}; \mathbf{V}^{(\ell)}_{\text{int}}] = \text{MInt}([\widetilde{\mathbf{H}}^{(\ell)}_{\text{int}}; \widetilde{\mathbf{V}}^{(\ell)}_{\text{int}}]). \quad (7)$$

Here the GNN induces graph structure-aware representations of the KG nodes, $[\cdot\,;\cdot]$ denotes concatenation, and MInt (the modality interaction module) exchanges information between the interaction token (text side) and the interaction node (KG side) via an MLP. For more details on GreaseLM, we refer readers to [9].
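A minimal sketch of one fusion layer (Eqs. 5-7) is given below, assuming PyTorch. The module choices are stand-ins, not the actual GreaseLM layers: `nn.TransformerEncoderLayer` substitutes for the pretrained LM layer, and a plain linear map substitutes for the real GNN layer (which would use the graph structure); see [9] for the actual architecture.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One text-KG fusion layer: LM layer + GNN layer + modality interaction (MInt)."""

    def __init__(self, text_dim: int, node_dim: int, n_heads: int = 8):
        super().__init__()
        # Stand-ins for GreaseLM's LM layer and GNN layer.
        self.lm_layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=n_heads,
                                                   batch_first=True)
        self.gnn_layer = nn.Linear(node_dim, node_dim)   # placeholder for a real GNN layer
        # MInt: an MLP over the concatenated interaction token/node states.
        self.mint = nn.Sequential(
            nn.Linear(text_dim + node_dim, text_dim + node_dim),
            nn.GELU(),
            nn.Linear(text_dim + node_dim, text_dim + node_dim),
        )
        self.text_dim = text_dim

    def forward(self, H: torch.Tensor, V: torch.Tensor):
        # H: [batch, 1 + I, text_dim] token states (index 0 = interaction token)
        # V: [batch, 1 + J, node_dim] node states  (index 0 = interaction node)
        H = self.lm_layer(H)                      # Eq. (5)
        V = self.gnn_layer(V)                     # Eq. (6); a real GNN would message-pass over edges
        joint = self.mint(torch.cat([H[:, 0], V[:, 0]], dim=-1))   # Eq. (7)
        # Write the exchanged states back into the interaction token/node slots.
        H = torch.cat([joint[:, :self.text_dim].unsqueeze(1), H[:, 1:]], dim=1)
        V = torch.cat([joint[:, self.text_dim:].unsqueeze(1), V[:, 1:]], dim=1)
        return H, V
```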