KALM: Knowledge-Aware Integration of Local, Document, and Global
Contexts for Long Document Understanding
Shangbin Feng¹, Zhaoxuan Tan², Wenqian Zhang², Zhenyu Lei², Yulia Tsvetkov¹
¹University of Washington  ²Xi'an Jiaotong University
{shangbin, yuliats}@cs.washington.edu {tanzhaoxuan, 2194510944, fischer}@stu.xjtu.edu.cn
Abstract
With the advent of pretrained language mod-
els (LMs), increasing research efforts have
been focusing on infusing commonsense and
domain-specific knowledge to prepare LMs
for downstream tasks. These works attempt to
leverage knowledge graphs, the de facto stan-
dard of symbolic knowledge representation,
along with pretrained LMs. While existing ap-
proaches have leveraged external knowledge,
it remains an open question how to jointly in-
corporate knowledge graphs representing vary-
ing contexts—from local (e.g., sentence), to
document-level, to global knowledge—to en-
able knowledge-rich exchange across these
contexts. Such rich contextualization can be
especially beneficial for long document under-
standing tasks since standard pretrained LMs
are typically bounded by the input sequence
length. In light of these challenges, we pro-
pose KALM, a Knowledge-Aware Language
Model that jointly leverages knowledge in lo-
cal, document-level, and global contexts for
long document understanding. KALM first en-
codes long documents and knowledge graphs
into the three knowledge-aware context repre-
sentations. It then processes each context with
context-specific layers, followed by a “con-
text fusion” layer that facilitates knowledge
exchange to derive an overarching document
representation. Extensive experiments demon-
strate that KALM achieves state-of-the-art per-
formance on six long document understand-
ing tasks and datasets. Further analyses re-
veal that the three knowledge-aware contexts
are complementary and they all contribute to
model performance, while the importance and
information exchange patterns of different con-
texts vary with respect to different tasks and
datasets.¹
¹ Code and data are publicly available at https://github.com/BunsenFeng/KALM.
1 Introduction
Large language models (LMs) have become the
dominant paradigm in NLP research, while knowl-
edge graphs (KGs) are the de facto standard of
symbolic knowledge representation. Recent ad-
vances in knowledge-aware NLP focus on combin-
ing the two paradigms (Wang et al., 2021b; Zhang et al., 2021; He et al., 2021), infusing encyclopedic (Vrandečić and Krötzsch, 2014; Pellissier Tanon et al., 2020), commonsense (Speer et al., 2017), and domain-specific (Feng et al., 2021; Chang et al., 2020) knowledge with LMs. Knowledge-
grounded models achieved state-of-the-art perfor-
mance in tasks including question answering (Sun
et al.,2022), commonsense reasoning (Kim et al.,
2022;Liu et al.,2021), and social text analysis
(Zhang et al.,2022;Hu et al.,2021).
Prior approaches to infusing LMs with knowl-
edge typically focused on three hitherto orthogonal
directions: incorporating knowledge related to lo-
cal (e.g., sentence-level), document-level, or global
context.
Local context approaches argue that sentences mention entities, and that external knowledge about these entities, such as textual descriptions (Balachandran et al., 2021; Wang et al., 2021b) and metadata (Ostapenko et al., 2022), helps LMs realize they are more than tokens.
Document-level
approaches ar-
gue that core idea entities are repeatedly mentioned
throughout the document, while related concepts
might be discussed in different paragraphs. These
methods attempt to leverage entities and knowledge
across paragraphs with document graphs (Feng
et al.,2021;Zhang et al.,2022;Hu et al.,2021).
Global
context approaches argue that unmentioned
yet connecting entities help connect the dots for
knowledge-based reasoning, thus knowledge graph
subgraphs are encoded with graph neural networks
alongside textual content (Zhang et al.,2021;Ya-
sunaga et al.,2021). However, despite their indi-
vidual pros and cons, how to integrate the three
document contexts in a knowledge-aware way re-
mains an open problem.
Controlling for varying scopes of knowledge and
context representations could benefit numerous lan-
guage understanding tasks, especially those cen-
tered around long documents. Bounded by the
inherent limitation of input sequence length, exist-
ing knowledge-aware LMs are mostly designed to
handle short texts (Wang et al.,2021b;Zhang et al.,
2021). However, processing long documents con-
taining thousands of tokens (Beltagy et al.,2021)
requires attending to varying document contexts,
disambiguating long-distance co-referring entities
and events, and more.
In light of these challenges, we propose KALM, a Knowledge-Aware Language Model for long
document understanding. Specifically, KALM first
derives three context- and knowledge-aware rep-
resentations from the long input document and
an external knowledge graph: the local context
represented as raw text, the document-level con-
text represented as a document graph, and the
global context represented as a knowledge graph
subgraph. KALM layers then encode each con-
text with context-specific layers, followed by our
proposed novel ContextFusion layers to enable
knowledge-rich information exchange across the
three knowledge-aware contexts. A unified docu-
ment representation is then derived from context-
specific representations that also interact with other
contexts. An illustration of the proposed KALM is
presented in Figure 1.
While KALM is a general method for long doc-
ument understanding, we evaluate the model on
six tasks and datasets that are particularly sensi-
tive to broader contexts and external knowledge:
political perspective detection, misinformation de-
tection, and roll call vote prediction. Extensive
experiments demonstrate that KALM outperforms
pretrained LMs, task-agnostic knowledge-aware
baselines, and strong task-specific baselines on all
six datasets. In ablation experiments, we further
establish KALM’s ability to enable information
exchange, better handle long documents, and im-
prove data efficiency. In addition, KALM and the
proposed ContextFusion layers reveal and help in-
terpret the roles and information exchange patterns
of different contexts.
2 KALM Methodology
2.1 Problem Definition
Let $d = \{d_1, \ldots, d_n\}$ denote a document with $n$ paragraphs, where each paragraph $d_i = \{w_{i1}, \ldots, w_{in_i}\}$ contains a sequence of $n_i$ tokens. Knowledge-aware long document understanding assumes access to an external knowledge graph $\mathrm{KG} = (\mathcal{E}, \mathcal{R}, A, \psi, \phi)$, where $\mathcal{E} = \{e_1, \ldots, e_N\}$ denotes the entity set, $\mathcal{R} = \{r_1, \ldots, r_M\}$ denotes the relation set, $A$ is the adjacency matrix where $a_{ij} = k$ indicates $(e_i, r_k, e_j) \in \mathrm{KG}$, and $\psi(\cdot): \mathcal{E} \rightarrow \mathrm{str}$ and $\phi(\cdot): \mathcal{R} \rightarrow \mathrm{str}$ map entities and relations to their textual descriptions.
Given pre-defined document labels, knowledge-aware natural language understanding aims to learn document representations and classify $d$ into its corresponding label with the help of the KG.
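For concreteness, the following minimal Python sketch shows one way the inputs assumed by this formulation could be represented; the class and field names are illustrative rather than part of KALM.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    entities: list[str]                        # E = {e_1, ..., e_N}
    relations: list[str]                       # R = {r_1, ..., r_M}
    # adjacency: (head entity index i, tail entity index j) -> relation index k,
    # i.e. a_ij = k  <=>  (e_i, r_k, e_j) is a triple in the KG
    adjacency: dict[tuple[int, int], int] = field(default_factory=dict)
    entity_desc: dict[int, str] = field(default_factory=dict)    # psi: E -> str
    relation_desc: dict[int, str] = field(default_factory=dict)  # phi: R -> str

@dataclass
class Document:
    paragraphs: list[str]    # d = {d_1, ..., d_n}
    label: int | None = None
```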
2.2 Knowledge-Aware Contexts
We hypothesize that a holistic representation of
long documents should incorporate contexts and
relevant knowledge at three levels: the local context
(e.g., a sentence with descriptions of mentioned en-
tities), the broader document context (e.g., a long
document with cross-paragraph entity reference
structure), and the global/external context repre-
sented as external knowledge (e.g., relevant knowl-
edge base subgraphs). Each of the three contexts
uses different granularities of external knowledge,
while existing works fall short of jointly integrat-
ing the three types of representations. To this end,
KALM first employs a different mechanism to introduce knowledge at each level of context.
Local context.
Represented as the raw text of
sentences and paragraphs, the local context models
the smallest unit in long document understanding.
Prior works attempted to add sentence metadata
(e.g., tense, sentiment, topic) (Zhang et al.,2022),
adopt sentence-level pretraining tasks based on KG
triples (Wang et al.,2021b), or leverage knowledge
graph embeddings along with textual representa-
tions (Hu et al.,2021). While these methods were
effective, in the face of LM-centered NLP research,
they are ad-hoc add-ons and not fully compatible
with existing pretrained LMs. As a result, KALM proposes to directly concatenate the textual descriptions $\psi(e_i)$ of mentioned entities $e_i$ to the paragraph. In this way, the original text is directly augmented with the entity descriptions, informing the LM that entities such as "Kepler" are more than mere tokens and helping to combat the spurious correlations of pretrained LMs (McMilin).
[Figure 1: Overview of KALM, which encodes long documents and knowledge graphs into local, document, and global contexts while enabling information exchange across contexts.]
For each augmented paragraph $d'_i$, we adopt $\mathrm{LM}(\cdot)$ with mean pooling to extract a paragraph representation. We use the pretrained BART encoder (Lewis et al., 2020) as $\mathrm{LM}(\cdot)$ unless otherwise noted. We also add a fusion token at the beginning of the paragraph sequence for information exchange across contexts.
After processing all $n$ paragraphs, we obtain the local context representation $T^{(0)}$ as follows:
$$T^{(0)} = \{t_0^{(0)}, \ldots, t_n^{(0)}\} = \{\theta_{\mathrm{rand}}, \mathrm{LM}(d'_1), \ldots, \mathrm{LM}(d'_n)\}$$
where $\theta_{\mathrm{rand}}$ denotes a randomly initialized vector for the fusion token in the local context and the superscript $(0)$ indicates the 0-th layer.
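For concreteness, a minimal sketch of this step is given below. It assumes a HuggingFace BART encoder as $\mathrm{LM}(\cdot)$, an upstream entity linker that provides per-paragraph mention sets, and illustrative helper names (encode_paragraph, build_local_context) that are not part of the released code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
encoder = AutoModel.from_pretrained("facebook/bart-base").get_encoder()
HIDDEN = encoder.config.d_model  # 768 for bart-base

def encode_paragraph(text: str) -> torch.Tensor:
    """Mean-pooled BART encoder representation of one (augmented) paragraph."""
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        states = encoder(**batch).last_hidden_state     # (1, seq_len, hidden)
    return states.mean(dim=1).squeeze(0)                 # (hidden,)

def build_local_context(paragraphs, mentions, entity_desc):
    """mentions[i]: entity ids mentioned in paragraph i (assumed given by an
    entity linker); entity_desc maps entity id -> description psi(e)."""
    fusion = torch.nn.Parameter(torch.randn(HIDDEN))      # theta_rand for the fusion token
    reps = [fusion]
    for i, para in enumerate(paragraphs):
        descs = " ".join(entity_desc[e] for e in mentions[i])
        reps.append(encode_paragraph(para + " " + descs))  # d'_i = paragraph + entity descriptions
    return torch.stack(reps)                               # T^(0): (n + 1, hidden)
```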
Document-level context.
Represented as the
structure of the full document, the document-
level context is responsible for modeling cross-
paragraph entities and knowledge on a document
level. While existing works attempted to incorpo-
rate external knowledge in documents via docu-
ment graphs (Feng et al.,2021;Hu et al.,2021),
they fall short of leveraging the overlapping entities
and concepts between paragraphs that underpin the
reasoning of long documents. To this end, we pro-
pose knowledge coreference, a simple and effective
mechanism for modeling text-knowledge interac-
tion on the document level. Specifically, a document graph with $n+1$ nodes is constructed, consisting of one fusion node and $n$ paragraph nodes. If paragraphs $i$ and $j$ both mention entity $e_k$ in the external KB, nodes $i$ and $j$ in the document graph are connected with relation type $k$. In addition, the fusion node is connected to every paragraph node with a super-relation. As a result, we obtain the adjacency matrix $A^g$ of the document graph. Paired with the knowledge-guided GNN introduced in Section 2.3, knowledge coreference enables information flow across paragraphs guided by external knowledge. Node feature initialization of the document graph is as follows:
$$G^{(0)} = \{g_0^{(0)}, \ldots, g_n^{(0)}\} = \{\theta_{\mathrm{rand}}, \mathrm{LM}(d_1), \ldots, \mathrm{LM}(d_n)\}$$
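A possible implementation of knowledge coreference is sketched below. It assumes per-paragraph entity-mention sets from an entity linker and, as a simplification, keeps a single shared entity per paragraph pair when several overlap.

```python
import numpy as np

def build_document_graph(mentions: list[set], num_entities: int) -> np.ndarray:
    """mentions[i]: ids of entities mentioned in paragraph i.
    Returns A^g of shape (n+1, n+1); entry (i, j) holds an edge type, -1 means no edge."""
    n = len(mentions)
    NO_EDGE, SUPER_REL = -1, num_entities          # reserve one extra type for the super-relation
    adj = np.full((n + 1, n + 1), NO_EDGE, dtype=np.int64)

    # knowledge coreference: paragraphs i and j are linked with type k if both mention entity e_k
    for i in range(n):
        for j in range(i + 1, n):
            shared = mentions[i] & mentions[j]
            if shared:
                k = min(shared)                    # simplification: keep one shared entity per pair
                adj[i + 1, j + 1] = adj[j + 1, i + 1] = k
    # the fusion node (index 0) is connected to every paragraph node with the super-relation
    for i in range(1, n + 1):
        adj[0, i] = adj[i, 0] = SUPER_REL
    return adj
```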
Global context.
Represented as external knowl-
edge graphs, the global context is responsible for
leveraging unseen entities and facilitating KG-
based reasoning. Existing works mainly focused on
extracting knowledge graph subgraphs (Yasunaga
et al.,2021;Zhang et al.,2021) and encoding them
alongside document content. Though many tricks
are proposed to extract and prune KG subgraphs,
in KALM, we employ a straightforward approach:
for all mentioned entities in the long document, KALM merges their $k$-hop neighborhoods to obtain a knowledge graph subgraph. We use $k = 2$ following previous works (Zhang et al., 2021; Vashishth et al., 2019), striking a balance between KB structure and computational efficiency, while KALM could support any $k$ setting. A fusion entity is then introduced and connected with every other entity, resulting in a connected graph. In this way, KALM cuts back on the preprocessing for modeling global knowledge and better preserves the information in the KG. Knowledge graph embedding methods (Bordes et al., 2013) are then adopted to initialize the node features of the KG subgraph:
$$K^{(0)} = \{k_0^{(0)}, \ldots, k_{|\rho(d)|}^{(0)}\} = \{\theta_{\mathrm{rand}}, \mathrm{KGE}(e_1), \ldots, \mathrm{KGE}(e_{|\rho(d)|})\}$$
where $\mathrm{KGE}(\cdot)$ denotes the knowledge graph embeddings trained on the original KG and $|\rho(d)|$ indicates the number of mentioned entities identified in document $d$. We use TransE (Bordes et al., 2013) to learn the KB embeddings used as $\mathrm{KGE}(\cdot)$; these embeddings are kept frozen during KALM training.
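The subgraph extraction itself can be a plain breadth-first expansion. The sketch below assumes an undirected adjacency view of the KG and a frozen TransE embedding table; it is illustrative rather than the exact preprocessing used in KALM.

```python
import torch

def k_hop_subgraph(mentioned: set[int], neighbors: dict[int, set[int]], k: int = 2) -> set[int]:
    """neighbors[e]: entities adjacent to e in the KG (undirected view).
    Returns the union of the k-hop neighborhoods of all mentioned entities."""
    nodes, frontier = set(mentioned), set(mentioned)
    for _ in range(k):
        frontier = {v for u in frontier for v in neighbors.get(u, set())} - nodes
        nodes |= frontier
    return nodes

def init_global_context(mentioned, neighbors, transe_emb: torch.Tensor, k: int = 2):
    """transe_emb: frozen (num_entities, dim) TransE embedding table.
    Returns the subgraph node ids and K^(0) with a random fusion entity at index 0."""
    nodes = sorted(k_hop_subgraph(mentioned, neighbors, k))
    fusion = torch.randn(1, transe_emb.size(1))             # theta_rand for the fusion entity
    feats = torch.cat([fusion, transe_emb[nodes]], dim=0)    # (|subgraph| + 1, dim)
    return nodes, feats
```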
2.3 KALM Layers
After obtaining the local, document-level, and
global context representations of long documents,
we employ KALM layers to learn document repre-
sentations. Specifically, each KALM layer consists
of three context-specific layers to process each con-
text. A ContextFusion layer is then adopted to
enable the knowledge-rich information exchange
across the three contexts.
2.3.1 Context-Specific Layers
Local context layer.
The local context is represented as a sequence of vectors extracted from the knowledge-enriched text with the help of pretrained LMs. We adopt transformer encoder layers (Vaswani et al., 2017) to encode the local context:
$$\tilde{T}^{(\ell)} = \{\tilde{t}_0^{(\ell)}, \ldots, \tilde{t}_n^{(\ell)}\} = \phi\big(\mathrm{TrmEnc}\big(\{t_0^{(\ell)}, \ldots, t_n^{(\ell)}\}\big)\big)$$
where $\phi(\cdot)$ denotes a non-linearity, $\mathrm{TrmEnc}$ denotes the transformer encoder layer, and $\tilde{t}_0^{(\ell)}$ denotes the transformed representation of the fusion token. We omit the layer superscript $(\ell)$ for brevity in what follows.
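In PyTorch, this corresponds to a single self-attention encoder layer applied over the $n+1$ paragraph vectors (fusion token included); a minimal sketch, with ReLU assumed as the non-linearity $\phi$ and illustrative sizes:

```python
import torch
import torch.nn as nn

hidden = 768
local_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)

T = torch.randn(1, 12, hidden)          # T^(l): one document, 11 paragraphs + 1 fusion token
T_tilde = torch.relu(local_layer(T))     # phi(TrmEnc(.)); position 0 is the transformed fusion token
```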
Document-level context layer.
The document-
level context is represented as a document graph
based on knowledge coreference. To better exploit
the entity-based relations in the document graph,
we propose a knowledge-aware GNN architecture
to enable
knowledge-guided message passing
on
the document graph:
$$\tilde{G} = \{\tilde{g}_0, \ldots, \tilde{g}_n\} = \mathrm{GNN}\big(\{g_0, \ldots, g_n\}\big)$$
where $\mathrm{GNN}(\cdot)$ denotes the proposed knowledge-guided graph neural network:
$$\tilde{g}_i = \phi\Big(\alpha_{i,i}\,\Theta g_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\,\Theta g_j\Big)$$
where $\alpha_{i,j}$ denotes the knowledge-guided attention weight, defined as:
$$\alpha_{i,j} = \frac{\exp\Big(\mathrm{ELU}\big(a^\top \big[\Theta g_i \,\|\, \Theta g_j \,\|\, \Theta f(\mathrm{KGE}(a^g_{ij}))\big]\big)\Big)}{\sum_{k \in \mathcal{N}(i)} \exp\Big(\mathrm{ELU}\big(a^\top \big[\Theta g_i \,\|\, \Theta g_k \,\|\, \Theta f(\mathrm{KGE}(a^g_{ik}))\big]\big)\Big)}$$
where $\tilde{g}_0$ denotes the transformed representation of the fusion node, $a$ and $\Theta$ are learnable parameters, $a^g_{ij}$ is the value in the $i$-th row and $j$-th column of the adjacency matrix $A^g$ of the document graph, $\mathrm{ELU}$ denotes the exponential linear unit activation function (Clevert et al., 2015), and $f(\cdot)$ is a learnable linear layer. The term $\Theta f(\mathrm{KGE}(a^g_{ij}))$ is responsible for enabling knowledge-guided message passing on the document graph, enabling KALM to incorporate the entity and concept patterns in different paragraphs and their document-level interactions.
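A dense-adjacency sketch of this layer is shown below. It assumes edge types in $A^g$ are the shared-entity ids from knowledge coreference (with $-1$ marking missing edges), uses a frozen KG embedding table for edge features, and, as a simplification, reuses the embedding of type 0 for self-loops.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class KnowledgeGuidedGNN(nn.Module):
    def __init__(self, hidden: int, kge_dim: int):
        super().__init__()
        self.theta = nn.Linear(hidden, hidden, bias=False)    # Θ
        self.f = nn.Linear(kge_dim, hidden)                    # f(.), maps KGE(a^g_ij) to hidden size
        self.a = nn.Linear(3 * hidden, 1, bias=False)          # attention vector a

    def forward(self, g: torch.Tensor, adj: torch.Tensor, kge: torch.Tensor) -> torch.Tensor:
        # g: (n+1, hidden) node features; adj: (n+1, n+1) long tensor of edge-type ids, -1 = no edge
        # kge: frozen (num_edge_types, kge_dim) embedding table indexed by edge type
        h = self.theta(g)                                       # Θ g_j for every node
        edge_feat = self.theta(self.f(kge[adj.clamp(min=0)]))   # Θ f(KGE(a^g_ij)), (n+1, n+1, hidden)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)                    # Θ g_i broadcast over j
        hj = h.unsqueeze(0).expand(n, n, -1)                    # Θ g_j broadcast over i
        logits = F.elu(self.a(torch.cat([hi, hj, edge_feat], dim=-1)).squeeze(-1))
        mask = adj.ge(0) | torch.eye(n, dtype=torch.bool)       # self-loops participate via α_ii
        logits = logits.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(logits, dim=-1)                   # knowledge-guided attention α_ij
        return torch.relu(alpha @ h)                            # φ(α_ii Θ g_i + Σ_j α_ij Θ g_j)
```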
Global context layer.
The global context is
represented as a relevant knowledge graph sub-
graph. We follow previous works and adopt GATs
(Veliˇ
ckovi´
c et al.,2018) to encode the global con-
text:
˜
K={˜
k0,...,˜
k|ρ(d)|}
= GAT{k0,...,k|ρ(d)|}
where
˜
k0
denotes the transformed representation
of the fusion entity.
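As one possible realization (an assumption, not necessarily the paper's exact configuration), the global layer can be a standard GATConv from PyTorch Geometric applied to the subgraph, with the fusion entity as node 0:

```python
import torch
from torch_geometric.nn import GATConv

kge_dim = 100                                    # assumed TransE embedding dimension
gat = GATConv(in_channels=kge_dim, out_channels=kge_dim, heads=4, concat=False)

K0 = torch.randn(6, kge_dim)                     # fusion entity (node 0) + 5 subgraph entities
# toy edge list in COO format; for an undirected view, edges would also be added in reverse
edge_index = torch.tensor([[0, 0, 0, 0, 0, 1, 2],
                           [1, 2, 3, 4, 5, 3, 4]])
K_tilde = gat(K0, edge_index)                    # K_tilde[0] is the transformed fusion entity
```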
2.3.2 ContextFusion Layer
The local, document, and global contexts model
external knowledge within sentences, across the
document, and beyond the document. These con-
texts are closely connected and a robust long doc-
ument understanding method should reflect their
interactions. Existing approaches mostly leverage
only one or two of the contexts (Wang et al.,2021b;
Feng et al.,2021;Zhang et al.,2022), falling short
of jointly leveraging the three knowledge-aware
contexts. In addition, they mostly adopted direct
concatenation or MLP layers (Zhang et al.,2022,
2021;Hu et al.,2021), falling short of enabling
context-specific information to flow across con-
texts in a knowledge-rich manner. As a result, we
propose the ContextFusion layer to tackle these
challenges. We firstly take a local perspective and
extract the representations of the fusion tokens,
nodes, and entities in each context:
$$\langle t_L, g_L, k_L \rangle = \langle \tilde{t}_0, \tilde{g}_0, \tilde{k}_0 \rangle$$
We then take a global perspective and use the fusion token/node/entity as the query to conduct attentive pooling $\mathrm{ap}(\cdot, \cdot)$ across all other tokens/nodes/entities in each context:
$$\langle t_G, g_G, k_G \rangle = \big\langle \mathrm{ap}\big(\tilde{t}_0, \{\tilde{t}_i\}_{i=1}^{n}\big),\ \mathrm{ap}\big(\tilde{g}_0, \{\tilde{g}_i\}_{i=1}^{n}\big),\ \mathrm{ap}\big(\tilde{k}_0, \{\tilde{k}_i\}_{i=1}^{|\rho(d)|}\big) \big\rangle$$
where attentive pooling $\mathrm{ap}(\cdot, \cdot)$ is defined as:
$$\mathrm{ap}\big(q, \{k_i\}_{i=1}^{n}\big) = \sum_{i=1}^{n} \frac{\exp(q \cdot k_i)}{\sum_{j=1}^{n} \exp(q \cdot k_j)}\, k_i$$
In this way, the fusion token/node/entity in each
context serves as the information exchange portal.
We then use a transformer encoder layer to enable
information exchange across the contexts:
$$\langle \tilde{t}_L, \tilde{g}_L, \tilde{k}_L, \tilde{t}_G, \tilde{g}_G, \tilde{k}_G \rangle = \phi\big(\mathrm{TrmEnc}\big(\langle t_L, g_L, k_L, t_G, g_G, k_G \rangle\big)\big)$$
As a result, $\tilde{t}_L$, $\tilde{g}_L$, and $\tilde{k}_L$ are the representations of the fusion token/node/entity that incorporate information from the other contexts. We formulate the output of the $\ell$-th layer as follows:
$$T^{(\ell+1)} = \{\tilde{t}_L^{(\ell)}, \tilde{t}_1^{(\ell)}, \ldots, \tilde{t}_n^{(\ell)}\}, \quad G^{(\ell+1)} = \{\tilde{g}_L^{(\ell)}, \tilde{g}_1^{(\ell)}, \ldots, \tilde{g}_n^{(\ell)}\}, \quad K^{(\ell+1)} = \{\tilde{k}_L^{(\ell)}, \tilde{k}_1^{(\ell)}, \ldots, \tilde{k}_{|\rho(d)|}^{(\ell)}\}$$
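Putting these pieces together, a minimal sketch of one ContextFusion step is given below. It assumes a shared hidden size across contexts, ReLU as $\phi$, and writes the updated fusion representations back into each context sequence for the next KALM layer.

```python
import torch, torch.nn as nn

hidden = 768
fusion_enc = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)

def ap(q, ks):                                    # attentive pooling, as defined above
    return torch.softmax(ks @ q, dim=0) @ ks

def context_fusion(T, G, K):
    """T: (n+1, hidden) local, G: (n+1, hidden) document, K: (m+1, hidden) global context;
    index 0 of each sequence is its fusion token/node/entity."""
    local_views = [T[0], G[0], K[0]]                                 # t_L, g_L, k_L
    global_views = [ap(T[0], T[1:]), ap(G[0], G[1:]), ap(K[0], K[1:])]  # t_G, g_G, k_G
    seq = torch.stack(local_views + global_views).unsqueeze(0)       # (1, 6, hidden)
    fused = torch.relu(fusion_enc(seq)).squeeze(0)                   # phi(TrmEnc(.))
    # the first three outputs are the updated fusion representations; write them back
    T_next = torch.cat([fused[0:1], T[1:]], dim=0)
    G_next = torch.cat([fused[1:2], G[1:]], dim=0)
    K_next = torch.cat([fused[2:3], K[1:]], dim=0)
    return T_next, G_next, K_next
```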
Our proposed ContextFusion layer is interactive
since it enables the information to flow across dif-
ferent document contexts, instead of direct concate-
nation or hierarchical processing. The attention
weights in
TrmEnc(·)
of the ContextFusion layer
could also provide insights into the roles and im-
portance of each document context, which will be
further explored in Section 3.3. To the best of
our knowledge, KALM is the first work to jointly
consider the three levels of document context and
enable information exchange across document con-
texts.
2.4 Learning and Inference
After a total of $P$ KALM layers, we obtain the final document representation $\langle \tilde{t}_L^{(P)}, \tilde{g}_L^{(P)}, \tilde{k}_L^{(P)} \rangle$. Given the document label $a \in \mathcal{A}$, the label probability is formulated as $p(a \mid d) \propto \exp\big(\mathrm{MLP}_a\big([\tilde{t}_L^{(P)}, \tilde{g}_L^{(P)}, \tilde{k}_L^{(P)}]\big)\big)$. We then optimize KALM with the cross-entropy loss function. At inference time, the predicted label is $\arg\max_a p(a \mid d)$.
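A minimal sketch of the classification head and the corresponding training/inference steps is given below; the hidden size, label count, optimizer, and learning rate are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch, torch.nn as nn

hidden, num_labels = 768, 2
mlp = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_labels))
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

def train_step(tL, gL, kL, label: int) -> float:
    """tL, gL, kL: final-layer fusion representations, each of shape (hidden,)."""
    logits = mlp(torch.cat([tL, gL, kL], dim=-1)).unsqueeze(0)   # p(a|d) ∝ exp(MLP_a([...]))
    loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

def predict(tL, gL, kL) -> int:
    logits = mlp(torch.cat([tL, gL, kL], dim=-1))
    return int(logits.argmax())                                   # argmax_a p(a|d)
```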
3 Experiment
3.1 Experiment Settings
Tasks and Datasets.
We propose KALM, a gen-
eral method for knowledge-aware long document
understanding. We evaluate KALM on three tasks
that especially benefit from external knowledge
and broader context: political perspective detec-
tion, misinformation detection, and roll call vote
prediction. We follow previous works to adopt Se-
mEval (Kiesel et al.,2019) and Allsides (Li and
Goldwasser,2019) for political perspective detec-
tion, LUN (Rashkin et al.,2017) and SLN (Rubin
et al.,2016) for misinformation detection, and the
2 datasets proposed in Mou et al. (2021) for roll
call vote prediction. For external KGs, we follow
existing works to adopt the KGs in KGAP (Feng
et al.,2021), CompareNet (Hu et al.,2021), and
ConceptNet (Speer et al.,2017) for the three tasks.
Baseline methods.
We compare KALM with
three types of baseline methods for holistic evalu-
ation: pretrained LMs, task-agnostic knowledge-
aware methods, and task-specific models. For pre-
trained LMs, we evaluate RoBERTa (Liu et al.,
2019b), Electra (Clark et al.,2019), DeBERTa (He
et al.,2020), BART (Lewis et al.,2020), and Long-
Former (Beltagy et al.,2020) on the three tasks.
For task-agnostic baselines, we evaluate KGAP
(Feng et al.,2021), GreaseLM (Zhang et al.,2021),
and GreaseLM+ on the three tasks. Task-specific
models are introduced in the following sections.
For pretrained LMs, task-agnostic methods, and
KALM, we run each method five times and report
the average performance and standard deviation.
For task-specific models, we compare with the re-
sults originally reported since we follow the exact
same experiment settings and data splits.
3.2 Model Performance
We present the performance of task-specific meth-
ods, pretrained LMs, task-agnostic knowledge-