Generative Knowledge Graph Construction: A Review
Hongbin Ye1,2, Ningyu Zhang1,2∗, Hui Chen3, Huajun Chen1,2
1Zhejiang University & AZFT Joint Lab for Knowledge Engine
2Hangzhou Innovation Center, Zhejiang University
3Alibaba Group
{yehongbin,zhangningyu,huajunsir}@zju.edu.cn,weidu.ch@alibaba-inc.com
Abstract
Generative Knowledge Graph Construction (KGC) refers to methods that leverage the sequence-to-sequence framework for building knowledge graphs, which are flexible and can be adapted to a wide range of tasks. In this study, we
summarize the recent compelling progress in
generative knowledge graph construction. We
present the advantages and weaknesses of each
paradigm in terms of different generation tar-
gets and provide theoretical insight and empiri-
cal analysis. Based on the review, we suggest
promising research directions for the future.
Our contributions are threefold: (1) We present
a detailed, complete taxonomy for the genera-
tive KGC methods; (2) We provide a theoretical
and empirical analysis of the generative KGC
methods; (3) We propose several research di-
rections that can be developed in the future.
1 Introduction
Knowledge Graphs (KGs) as a form of structured
knowledge have drawn significant attention from
academia and the industry (Ji et al.,2022). How-
ever, high-quality KGs rely almost exclusively on
human-curated structured or semi-structured data.
To this end, Knowledge Graph Construction (KGC)
is proposed, which is the process of populating (or
building from scratch) a KG with new knowledge
elements (e.g., entities, relations, events). Conven-
tionally, KGC is solved by employing task-specific
discriminators for the various types of information
in a pipeline manner (Angeli et al.,2015;Luan
et al.,2018;de Sá Mesquita et al.,2019;Zhang
et al.,2022a), typically including (1) entity discov-
ery or named entity recognition (Sang and Meulder,
2003), (2) entity linking (Milne and Witten,2008),
(3) relation extraction (Zelenko et al.,2003) and (4)
event extraction (Du and Cardie, 2020). However, this pipeline manner suffers from error propagation and poor adaptability to different tasks.
∗Corresponding author.
Figure 1: Discrimination and generation methodologies for relation extraction. Given the input text "The United States President Joe Biden visited Samsung.": (a) a classification model predicts the relation Country-President for the entity pair; (b) a tagging model labels tokens with a BIESO scheme (e.g., B-CP-1, E-CP-1, B-CP-2, E-CP-2); (c) a generation model emits the Seq2Seq text "<triplet> United States <subj> Joe Biden <obj> Country-President", which is delinearized into the triple {United States, Country-President, Joe Biden}. "CP" is short for "Country-President".
Generative Knowledge Graph Construction.
Some generative KGC methods based on the
sequence-to-sequence (Seq2Seq) framework are
proposed to overcome this barrier. Early work
(Zeng et al.,2018) has explored using the gener-
ative paradigm to solve different entity and rela-
tion extraction tasks. Powered by fast advances
of generative pre-training such as T5 (Raffel et al.,
2020), and BART (Lewis et al.,2020), Seq2Seq
paradigm has shown its great potential in unifying
widespread NLP tasks. Hence, more generative
KGC works (Yan et al.,2021a;Paolini et al.,2021;
Lu et al.,2022) have been proposed, showing ap-
pealing performance in benchmark datasets. Fig-
ure 1illustrates an example of generative KGC for
relation extraction. The target triple is preceded
by the tag <triple>, and the head entity, tail entity,
and relations are also specially tagged, allowing the
structural knowledge (corresponding to the output)
to be obtained by inverse linearization. Despite the
success of numerous generative KGC approaches,
these works scattered among various tasks have not
been systematically reviewed and analyzed.
arXiv:2210.12714v3 [cs.CL] 18 Sep 2023
Present work In this paper, we summarize recent progress in generative KGC (a timeline of generative KGC can be found in Appendix A) and maintain a public repository for research convenience.1 We propose to organize relevant work by the generation target of models and also present the axis of the task level (Figure 3):
• Comprehensive review with new taxonomies. We conduct the first comprehensive review of generative KGC together with new taxonomies. We review the research with different generation targets for KGC with a comprehensive comparison and summary (§3).

• Theoretical insight and empirical analysis. We provide in-depth theoretical and empirical analysis for typical generative KGC methods, illustrating the advantages and disadvantages of different methodologies as well as remaining issues (§4).

• Wide coverage on emerging advances and outlook on future directions. We provide comprehensive coverage of emerging areas, including prompt-based learning. This review provides a summary of generative KGC and highlights future research directions (§5).
Related work As this topic is relatively nascent,
only a few surveys exist. Closest to our work, Ji
et al. (2022) covers methods for knowledge graph
construction, representation learning, and applica-
tions, which mainly focus on general methods for
KGC. Zhu et al. (2022) provides a systematic survey of multi-modal knowledge graph construction and reviews the challenges, progress, and opportunities. For general NLP, Min et al. (2021) survey
recent work that uses these large language mod-
els to solve tasks via text generation approaches,
which has overlaps in generation methodologies
for information extraction. Different from those
surveys, in this paper, we conduct a literature re-
view on generative KGC, hoping to systematically
understand the methodologies, compare different
methods and inspire new ideas.
1https://github.com/zjunlp/Generative_KG_Construction_Papers
Figure 2: Sankey diagram of knowledge graph construc-
tion tasks with different generative paradigms.
2 Preliminary on Knowledge Graph
Construction
2.1 Knowledge Graph Construction
Knowledge Graph Construction mainly aims to
extract structural information from unstructured
texts, such as Named Entity Recognition (NER)
(Chiu and Nichols,2016), Relation Extraction (RE)
(Zeng et al.,2015), Event Extraction (EE) (Chen
et al.,2015), Entity Linking (EL) (Shen et al.,
2015), and Knowledge Graph Completion (Lin
et al.,2015).
Generally, KGC can be regarded as a structure prediction task, where a model is trained to approximate a target function $F: \mathcal{X} \to \mathcal{Y}$, where $x \in \mathcal{X}$ denotes the input data and $y \in \mathcal{Y}$ denotes the output structure sequence. For instance, given the sentence "Steve Jobs and Steve Wozniak co-founded Apple in 1977.":
Named Entity Recognition aims to identify the types of entities, e.g., "Steve Jobs" and "Steve Wozniak" as PERSON and "Apple" as ORG;
Relation Extraction aims to identify the relationship of the given entity pair (Steve Jobs, Apple) as founder;
Event Extraction aims to identify the event type as Business:Start-Org, where "co-founded" triggers the event, (Steve Jobs, Steve Wozniak) are participants in the event as AGENT, and Apple as ORG;
Entity Linking aims to link the mention Steve Jobs to Steven Jobs (Q19837) on Wikidata, and Apple to Apple (Q312) as well;
Knowledge Graph Completion aims to complete the incomplete triple (Steve Jobs, create, ?) with the blank entities Apple, NeXT Inc., and Pixar.
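To make the unified $F(x) \to y$ view concrete, all five tasks can be cast as mapping the same input string to different target strings. The exact linearized formats below are hypothetical (each surveyed method defines its own), and the function `F` is a sketch of the shared interface, not any particular model:

```python
# Hypothetical target linearizations for one input sentence; every surveyed
# method defines its own exact output format, so these are illustrative only.
x = "Steve Jobs and Steve Wozniak co-founded Apple in 1977."
targets = {
    "NER": "(PERSON Steve Jobs) (PERSON Steve Wozniak) (ORG Apple)",
    "RE":  "(Steve Jobs, founder, Apple)",
    "EE":  "(Business:Start-Org co-founded (AGENT Steve Jobs) "
           "(AGENT Steve Wozniak) (ORG Apple))",
    "EL":  "Steve Jobs -> Q19837 ; Apple -> Q312",
    "KGC": "(Steve Jobs, create, Apple)",
}

def F(task: str, text: str) -> str:
    """A single text-to-text interface: every KGC task maps x to a string y."""
    return targets[task]  # a real model would condition on `text`

assert all(isinstance(F(t, x), str) for t in targets)
```

The point of the shared signature is that one Seq2Seq architecture can serve every task, differing only in the target linearization.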
Generative KGC Taxonomy

Generation Target:
• Copy-based Sequence: CopyRE (Zeng et al., 2018), CopyRRL (Zeng et al., 2019), CopyMTL (Zeng et al., 2020), TEMPGEN (Huang et al., 2021), Seq2rel (Giorgi et al., 2022)
• Structure-based Sequence: Seq2Seq4ATE (Ma et al., 2019), Nested-seq (Straková et al., 2019), CGT (Zhang et al., 2021b; Ye et al., 2021), PolicyIE (Ahmad et al., 2021), Text2Event (Lu et al., 2021), HySPA (Ren et al., 2021), REBEL (Cabot and Navigli, 2021), SQUIRE (Bai et al., 2022), GenKGC (Xie et al., 2022), EPGEL (Lai et al., 2022), HuSe-Gen (Saha et al., 2022), UIE (Lu et al., 2022), DEEPSTRUCT (Wang et al., 2022), De-Bias (Zhang et al., 2022b), KGT5 (Saxena et al., 2022), KG-S2S (Chen et al., 2022a)
• Label-based Sequence: ANL (Athiwaratkun et al., 2020a), GENRE (Cao et al., 2021), TANL (Paolini et al., 2021)
• Indice-based Sequence: PNDec (Nayak and Ng, 2020), SEQ2SEQ-PTR (Rongali et al., 2020), GRIT (Du et al., 2021a), UGF for NER (Yan et al., 2021b), UGF for ABSA (Yan et al., 2021a)
• Blank-based Sequence: COMET (Bosselut et al., 2019), BART-Gen (Li et al., 2021b), GTT (Du et al., 2021b), DEGREE (Hsu et al., 2022), ClarET (Zhou et al., 2022), GTEE (Liu et al., 2022), X-GEAR (Huang et al., 2022), PAIE (Ma et al., 2022)

Generation Tasks:
• Named Entity Recognition: Nested-seq (Straková et al., 2019), ANL (Athiwaratkun et al., 2020a), TANL (Paolini et al., 2021), HySPA (Ren et al., 2021), UGF for NER (Yan et al., 2021b), DEEPSTRUCT (Wang et al., 2022), De-Bias (Zhang et al., 2022b), UIE (Lu et al., 2022)
• Relation Extraction: CopyRE (Zeng et al., 2018), CopyRRL (Zeng et al., 2019), PNDec (Nayak and Ng, 2020), CopyMTL (Zeng et al., 2020), CGT (Zhang et al., 2021b; Ye et al., 2021), TANL (Paolini et al., 2021), HySPA (Ren et al., 2021), TEMPGEN (Huang et al., 2021), REBEL (Cabot and Navigli, 2021), DEEPSTRUCT (Wang et al., 2022), UIE (Lu et al., 2022), Seq2rel (Giorgi et al., 2022)
• Event Extraction: CGT (Zhang et al., 2021b; Ye et al., 2021), TANL (Paolini et al., 2021), BART-Gen (Li et al., 2021b), GTT (Du et al., 2021b), GRIT (Du et al., 2021a), Text2Event (Lu et al., 2021), DEGREE (Hsu et al., 2022), ClarET (Zhou et al., 2022), GTEE (Liu et al., 2022), X-GEAR (Huang et al., 2022), DEEPSTRUCT (Wang et al., 2022), PAIE (Ma et al., 2022), UIE (Lu et al., 2022)
• Entity Linking: GENRE (Cao et al., 2021), EPGEL (Lai et al., 2022)
• Knowledge Graph Completion: COMET (Bosselut et al., 2019), SQUIRE (Bai et al., 2022), GenKGC (Xie et al., 2022), HuSe-Gen (Saha et al., 2022), ClarET (Zhou et al., 2022), KGT5 (Saxena et al., 2022), KG-S2S (Chen et al., 2022a)

Figure 3: Taxonomy of Generative Knowledge Graph Construction.
2.2 Discrimination and Generation
Methodologies
In this section, we introduce the background of
discrimination and generation methodologies for
KGC. The goal of the discrimination model is to
predict the possible label based on the characteris-
tics of the input sentence. As shown in Figure 1,
given an annotated sentence $x$ and a set of potentially overlapping triples $t_j = \{(s, r, o)\}$ in $x$, we aim to maximize the data likelihood during training:

$$p_{\text{cls}}(t \mid x) = \prod_{(s,r,o) \in t_j} p((s, r, o) \mid x) \quad (1)$$
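As a minimal illustration of Eq. (1), the joint likelihood of a triple set factorizes into per-triple probabilities. The probability values below are purely illustrative, not outputs of any cited model:

```python
import math

# Assumed per-triple classifier probabilities p((s, r, o) | x) for one sentence.
triple_probs = {
    ("United States", "Country-President", "Joe Biden"): 0.9,
    ("Joe Biden", "Visited", "Samsung"): 0.8,
}

# Eq. (1): the data likelihood is the product over all annotated triples.
p_cls = math.prod(triple_probs.values())
```

Training a discriminative model then amounts to pushing each factor, and hence the product, toward 1.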
Another discriminative method is to output tags using sequence tagging at each position $i$ (Zheng et al., 2017; Dai et al., 2019; Yu et al., 2020; Li et al., 2020b; Liu et al., 2021a). As shown in Figure 1, for an $n$-word sentence $x$, $n$ different tag sequences are annotated based on the "BIESO" (Begin, Inside, End, Single, Outside) notation schema. The size of the set of pre-defined relations is $|R|$, and the related role orders are represented by "1" and "2". During training, we maximize the log-likelihood of the target tag sequence using the hidden vector $h_i$ at each position $i$:

$$p_{\text{tag}}(y_i \mid x) = \frac{\exp(h_i \cdot y_i)}{\sum_{y' \in R} \exp(h_i \cdot y')} \quad (2)$$
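A minimal sketch of the per-position tagging softmax in Eq. (2), assuming dot-product scores between the hidden vector and learned tag embeddings (the toy vectors and tag names below are illustrative, not taken from any cited model):

```python
import numpy as np

def tag_probabilities(h_i, label_embeddings):
    """Softmax over tag scores exp(h_i . y) / sum_y' exp(h_i . y'), as in Eq. (2)."""
    scores = label_embeddings @ h_i          # one score per candidate tag
    exp = np.exp(scores - scores.max())      # numerically stabilized softmax
    return exp / exp.sum()

# Toy hidden vector and three tag embeddings (e.g. B-CP-1, E-CP-1, O).
h = np.array([0.5, -0.2, 0.1])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
p = tag_probabilities(h, labels)             # a valid distribution over tags
```

Maximizing the log of the target tag's probability at every position recovers the training objective described above.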
For the generation model, if $x$ is the input sentence and $y$ the result of linearized triplets, the target for the generation model is to autoregressively generate $y$ given $x$:

$$p_{\text{gen}}(y \mid x) = \prod_{i=1}^{\operatorname{len}(y)} p_{\text{gen}}(y_i \mid y_{<i}, x) \quad (3)$$

By fine-tuning a Seq2Seq model (e.g., MASS (Song et al., 2019), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020)) on such a task with the cross-entropy loss, we can maximize the log-likelihood of the generated linearized triplets.
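The linearization and inverse delinearization step shown in Figure 1 can be sketched in a few lines. The tag format follows the REBEL-style markers from the figure; the helper names are our own, not from any cited codebase:

```python
def linearize(triples):
    """Render (head, relation, tail) triples in a REBEL-style tagged string."""
    return " ".join(f"<triplet> {h} <subj> {t} <obj> {r}" for h, r, t in triples)

def delinearize(seq):
    """Invert the linearization back into (head, relation, tail) triples."""
    triples = []
    for chunk in seq.split("<triplet>")[1:]:
        head, rest = chunk.split("<subj>")
        tail, rel = rest.split("<obj>")
        triples.append((head.strip(), rel.strip(), tail.strip()))
    return triples

t = [("United States", "Country-President", "Joe Biden")]
s = linearize(t)          # "<triplet> United States <subj> Joe Biden <obj> ..."
assert delinearize(s) == t
```

The delinearization must be an exact inverse of the linearization, since structural knowledge is only recovered from the generated text.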
2.3 Advantages of the Generation Methods
While previous discriminative methods (Wei et al., 2020; Shang et al., 2022) extract relational triples from unstructured text according to a pre-defined schema and can efficiently construct large-scale knowledge graphs, these elaborate models each focus on a specific KGC task, such as predicting relation or event information from a segment of input text, so a full pipeline often requires multiple models. Formulating KGC tasks as sequence-to-sequence problems (Lu et al., 2022) is of great benefit for developing a universal architecture that can solve different tasks, free from the constraints of dedicated architectures, isolated models, and specialized knowledge sources.
Figure 4: Copy-based Sequence. For the input "News of the list's existence unnerved officials in Khartoum, Sudan's capital.", the generation model copies the entities (Sudan, Khartoum) from the input via attention and predicts the relations (capital, contains), yielding "capital, Sudan, Khartoum, contains, Sudan, Khartoum".
In addition, generative models can be pre-trained in
multiple downstream tasks by structurally consis-
tent linearization of the text, which facilitates the
transition from traditional understanding to struc-
tured understanding and increases knowledge shar-
ing (Wang et al.,2022). In contexts with nested
labels in NER (Straková et al.,2019), the proposed
generative method implicitly models the structure
between named entities, thus avoiding complex multi-label mapping. Extracting overlapping triples in RE is also difficult for traditional discriminative models; Zeng et al. (2018) introduce a fresh perspective that revisits the RE task with a general generative framework, addressing the problem with an end-to-end model. In short, such paradigm shifts open new directions for some hard-to-solve problems.
Note that discriminative and generative methods are not simply superior or inferior to each other, as the continuing proliferation of studies on both sides shows. The aim of this paper is to summarize the characteristics of different generative paradigms in KGC tasks and provide a promising perspective for future research.
3 Taxonomy of Generative Knowledge
Graph Construction
In this paper, we mainly consider the following five
paradigms that are widely used in KGC tasks based
on generation target, i.e., copy-based Sequence, structure-linearized Sequence, label-augmented Sequence, indice-based Sequence, and blank-based Sequence. As shown in Figure 2, these paradigms
have demonstrated strong dominance in many
mainstream KGC tasks. In the following sections,
we introduce each paradigm as shown in Figure 3.
3.1 Copy-based Sequence
This paradigm refers to developing more robust
models to copy the corresponding token (entity) di-
rectly from the input sentence during the generation
process. Zeng et al. (2018) designs an end-to-end
model based on a copy mechanism to solve the
Figure 5: Structure-linearized Sequence. For the input "The man returned to Los Angeles from Mexico following his capture Tuesday by bounty hunters.", the generation model, guided by the event schema, outputs the linearization ((Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico)) (Arrest-Jail capture (Person The man) (Time Tuesday) (Agent bounty hunters))).
triple overlapping problem. As shown in Figure 4,
the model copies the head entity from the input sen-
tence and then the tail entity. Similarly, relations
are generated from target vocabulary, which is re-
stricted to the set of special relation tokens. This
paradigm avoids models generating ambiguous or
hallucinative entities. In order to identify a rea-
sonable triple extraction order, Zeng et al. (2019)
converts the triplet generation process into a re-
inforcement learning process, enabling the copy
mechanism to follow an efficient generative order.
Since the entity copy mechanism relies on unnat-
ural masks to distinguish between head and tail
entities, Zeng et al. (2020) maps the head and tail
entities to fused feature space for entity replication
by an additional nonlinear layer, which strengthens
the stability of the mechanism. For document-level
extraction, Huang et al. (2021) proposes a TOP-k copy mechanism to alleviate the computational
complexity of entity pairs.
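The core of the copy mechanism is a distribution over source tokens rather than over the whole vocabulary. The following is a simplified sketch of that idea using dot-product attention; it is an illustration of pointer-style copying in general, not the specific architecture of Zeng et al. (2018), and all vectors are randomly initialized:

```python
import numpy as np

def copy_distribution(decoder_state, encoder_states):
    """Probability of copying each source token: softmax over attention scores."""
    scores = encoder_states @ decoder_state   # one score per source token
    exp = np.exp(scores - scores.max())       # stabilized softmax
    return exp / exp.sum()

tokens = ["News", "of", "Khartoum", ",", "Sudan"]
rng = np.random.default_rng(0)
enc = rng.standard_normal((len(tokens), 4))   # toy encoder states
dec = rng.standard_normal(4)                  # toy decoder state
p_copy = copy_distribution(dec, enc)          # distribution over the 5 tokens
```

Because the output support is exactly the set of input tokens, the decoder cannot emit an entity that does not appear in the sentence, which is what rules out hallucinated entities.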
3.2 Structure-linearized Sequence
This paradigm refers to utilizing structural knowl-
edge and label semantics, making it prone to han-
dling a unified output format. Lu et al. (2021) pro-
poses an end-to-end event extraction model based
on T5, where the output is a linearization of the
extracted knowledge structure as shown in Figure 5.
In order to avoid introducing noise, it utilizes the
event schema to constrain decoding space, ensuring
the output text is semantically and structurally legit-
imate. Lou et al. (2021) reformulates event detec-
tion as a Seq2Seq task and proposes a Multi-Layer
Bidirectional Network (MLBiNet) to capture the
document-level association of events and semantic
information simultaneously. Besides, Zhang et al. (2021b); Ye et al. (2021) introduce a contrastive learning framework with a batch dynamic attention-masking mechanism to mitigate the risk that generative architectures produce unreliable sequences whose meaning contradicts the input (Zhu et al., 2020). Similarly, Cabot and Navigli (2021) employs a simple