A Decade of Knowledge Graphs in Natural Language Processing: A Survey
Phillip Schneider1, Tim Schopf1, Juraj Vladika1, Mikhail Galkin2,
Elena Simperl3 and Florian Matthes1
1Technical University of Munich, Department of Computer Science, Germany
2Mila Quebec AI Institute & McGill University, School of Computer Science, Canada
3King’s College London, Department of Informatics, United Kingdom
{phillip.schneider, tim.schopf, juraj.vladika, matthes}@tum.de
mikhail.galkin@mila.quebec
elena.simperl@kcl.ac.uk
Abstract
In pace with developments in the research field
of artificial intelligence, knowledge graphs
(KGs) have attracted a surge of interest from
both academia and industry. As a represen-
tation of semantic relations between entities,
KGs have proven to be particularly relevant for
natural language processing (NLP), experienc-
ing a rapid spread and wide adoption within
recent years. Given the increasing amount of
research work in this area, several KG-related
approaches have been surveyed in the NLP re-
search community. However, a comprehen-
sive study that categorizes established topics
and reviews the maturity of individual research
streams remains absent to this day. Contribut-
ing to closing this gap, we systematically ana-
lyzed 507 papers from the literature on KGs in
NLP. Our survey encompasses a multifaceted
review of tasks, research types, and contribu-
tions. As a result, we present a structured
overview of the research landscape, provide
a taxonomy of tasks, summarize our findings,
and highlight directions for future work.
1 Introduction
Knowledge acquisition and application are inherent
to natural language. Humans use language as a
means of communicating facts, arguing about
decisions, or questioning beliefs. Therefore, it is not
surprising that computational linguists began as
early as the 1950s and 1960s to work out ideas on how
to represent knowledge as relations between
concepts in semantic networks (Richens, 1956;
Quillian, 1963; Collins and Quillian, 1969).
More recently, knowledge graphs (KGs) have
emerged as an approach for semantically repre-
senting knowledge about real-world entities in a
machine-readable format. They originated from
research on semantic networks, domain-specific
ontologies, as well as linked data, and are thus not
an entirely new concept (Hitzler, 2021). Despite
their growing popularity, there is still no general
understanding of what exactly a KG is or for what
tasks it is applicable. Although prior work has
already attempted to define KGs (Pujara et al., 2013;
Ehrlinger and Wöß, 2016; Paulheim, 2017; Färber
et al., 2018), the term is not yet used uniformly by
researchers. Most studies implicitly adopt a broad
definition of KGs, where they are understood as "a
graph of data intended to accumulate and convey
knowledge of the real world, whose nodes represent
entities of interest and whose edges represent
relations between these entities" (Hogan et al., 2022).
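To make this broad definition concrete, the following minimal Python sketch stores a KG as a set of (head, relation, tail) triples, with entities as nodes and relations as labeled edges. The entity and relation names are illustrative assumptions and are not taken from any KG discussed in this survey.

```python
from collections import defaultdict

# Illustrative triples (head entity, relation, tail entity).
# All names are made up for this example.
TRIPLES = {
    ("Munich", "located_in", "Germany"),
    ("TUM", "located_in", "Munich"),
    ("TUM", "instance_of", "University"),
}


def build_adjacency(triples):
    """Index outgoing edges by head entity: head -> list of (relation, tail)."""
    adjacency = defaultdict(list)
    for head, relation, tail in sorted(triples):
        adjacency[head].append((relation, tail))
    return dict(adjacency)


def entities(triples):
    """Nodes of the graph: every entity that appears as a head or a tail."""
    heads = {head for head, _, _ in triples}
    tails = {tail for _, _, tail in triples}
    return heads | tails


GRAPH = build_adjacency(TRIPLES)
```

A triple set is only one possible encoding; property graphs or RDF stores would fit the same definition.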
KGs have attracted a lot of research attention
in both academia and industry since the
introduction of Google’s KG in 2012 (Singhal, 2012).
Particularly in natural language processing (NLP)
research, the adoption of KGs has become
increasingly popular over the past 5 years, and this trend
seems to be accelerating. The underlying paradigm
is that the combination of structured and
unstructured knowledge can benefit all kinds of NLP tasks.
For instance, structured knowledge from KGs can
be injected into the contextual knowledge
found in language models, which improves the
performance in downstream tasks (Colon-Hernandez
et al., 2021). Furthermore, with the growing
importance of KGs, there are also expanding efforts to
construct new KGs from unstructured texts.
Ten years after Google coined the term knowl-
edge graph in 2012, a plethora of novel approaches
has been proposed by scholars. Therefore, it is im-
portant to assemble insights, consolidate existing
results, and provide a structured overview. However,
to our knowledge, there are no studies that
offer an overview of the whole research landscape
of KGs in the NLP field. Contributing to closing
this gap, we performed a comprehensive survey
to analyze all research performed in this area by
classifying established topics, identifying trends,
and outlining areas for future research. Our three
main contributions are as follows:
arXiv:2210.00105v1 [cs.CL] 30 Sep 2022
[Figure 1: Taxonomy of tasks in the literature on KGs in NLP. Category labels shown in the figure: Knowledge Acquisition, Knowledge Application, Knowledge Graph Construction, Knowledge Graph Reasoning, Knowledge Extraction, Knowledge Integration, Natural Language Understanding, Natural Language Generation. Task labels shown in the figure: Attribute Extraction, Entity Extraction, Relation Extraction, Entity Alignment, Entity Linking, Ontology Construction, Entity Classification, Error Detection, Knowledge Graph Embedding, Link Prediction, Relation Linking, Relation Classification, Triple Classification, Natural Language Inference, Semantic Parsing, Semantic Search, Semantic Similarity, Text Analysis, Text Classification, Data-to-Text Generation, Machine Translation, Question Generation, Text Generation, Text Summarization, Augmented Language Models, Conversational Interfaces, Question Answering.]
1. We systematically extract information from
   507 included papers and report insights about
   tasks, research types, and contributions.
2. We provide a taxonomy of tasks in the literature
   on KGs in NLP, shown in Figure 1.
3. We assess the maturity of individual research
   streams, identify trends, and highlight
   directions for future work.
Our survey sheds light on the evolution and
current research progress regarding KGs in NLP.
Although we cannot achieve complete coverage of all
relevant papers on this topic, we aim to provide
a representative overview that can help both NLP
scholars and practitioners by offering a starting
point in the literature. Moreover, our multifaceted
analysis can guide the research community in
closing existing gaps and finding novel ways to
combine KGs with NLP.
2 Related Work
Related literature that includes both KGs and NLP
seems to be relatively scarce. Most survey papers
focus either only on KGs or only on NLP. In their
broad introduction to KGs, Hogan et al. (2022)
point out that existing surveys on KGs tend to re-
volve around specific aspects of KGs, most com-
monly their construction and embedding.
Such surveys with a KG focus usually bring up
NLP only in the context of employed NLP
methods, like information extraction, being used to
populate and refine graphs (Nickel et al., 2016). Other
surveys on KGs mention some downstream
applications of KGs for NLP tasks, such as for
constructing augmented language models, question
answering over knowledge bases (KBQA), or
recommender systems (Ji et al., 2021).
As noted previously, related work that includes
both KGs and NLP focuses strictly on a specific
application or task. For example, Safavi and Koutra
(2021) provide an overview of applying relational
world knowledge from KGs to augment large
contextual language models. Other surveys on specific
applications include KG reasoning (Chen et al.,
2019), biomedical KGs (Nicholson and Greene,
2020), and the task of KBQA (Fu et al., 2020).
The survey on graphs in NLP by Nastase et al.
(2015) covers only smaller graphs such as
dependency graphs and dialogue trees. Even though it
does not include KGs, the survey concludes that
graphs are a powerful representation formalism and
that NLP tasks can benefit from harnessing the
potential of data presented in graph structures.
To the best of our knowledge, this is the first
survey covering a wide spectrum of techniques,
methods as well as applications of KGs within the
NLP research field.
3 Method
To achieve our objective of providing a thorough
overview of the research landscape, we conducted
a systematic mapping study following the process
defined by Petersen et al. (2008). Its three main
steps are explained in the next subsections.
3.1 Research Questions
The goal of our study is a multifaceted analysis
of KGs in the field of NLP, such as identifying
and quantifying research topics, domains, and
outcomes. These objectives are reflected in the
research questions (RQs) stated below.
RQ1: What are the characteristics and trends of
the research literature on KGs in NLP?
RQ2: What are the different tasks mentioned in
the existing research studies?
RQ3: What are the research types and main
contributions of the studies?
3.2 Search and Screening Procedure
After specifying the RQs, we defined a set of
related keywords for KGs and NLP to be used for
the database search of relevant studies. From
initial test searches, we observed that including terms
associated with KGs (e.g., “semantic network” or
“ontology”) yielded too many irrelevant results. To
restrict the research scope to the concept of KGs,
we decided to use the following search string:
("knowledge graph") AND ("NLP" OR "natural
language processing" OR "computational
linguistics"). The search string was applied to title,
abstract, and keywords. If a given paper had no
keywords, we used index keywords from the database
if they were available.
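The boolean logic of this search string can be sketched in Python as follows. This is only an illustration under stated assumptions: each database applies its own matching rules (stemming, field handling, phrase search), and the function name `matches_search_string`, the case-insensitive matching, and the tolerance for a plural "s" are choices made for this example rather than part of the study.

```python
import re

# Terms taken from the search string in the text.
KG_TERMS = ("knowledge graph",)
NLP_TERMS = ("nlp", "natural language processing", "computational linguistics")


def _contains(text, term):
    """Case-insensitive whole-phrase match; an optional plural "s" is an assumption."""
    pattern = r"\b" + re.escape(term) + r"s?\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_search_string(title, abstract, keywords):
    """Return True if the record satisfies (KG term) AND (any NLP term).

    The query is evaluated over title, abstract, and keywords combined,
    mirroring the fields named in the text.
    """
    text = " ".join([title, abstract, " ".join(keywords)])
    has_kg = any(_contains(text, term) for term in KG_TERMS)
    has_nlp = any(_contains(text, term) for term in NLP_TERMS)
    return has_kg and has_nlp
```

A record that mentions only "ontology" or "semantic network" is rejected by this filter, which reflects the decision to restrict the scope to the term "knowledge graph".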
For our search of relevant publications, we
queried six academic databases, as listed in Table 1.
The ACL Anthology is a digital archive of
prestigious conferences and journals in NLP. ACM and
IEEE provide access to publications of additional
reputable venues in the broader computer science
field. The remaining databases are commonly
chosen in other related surveys to further increase the
coverage of the respective field of interest.
In the first week of 2022, we applied our search
string to the databases and restricted the time
window to the ten years from 2012 to 2021. Then, the
exported files were merged, ensuring that each
publication record was either a conference or a journal
paper. We also automatically identified and removed
duplicate records. Through this, we obtained
a dataset of 746 unique papers. From this initial
dataset, we selected the relevant
studies by screening for the following inclusion
criteria: (1) peer-reviewed studies from conferences
or journals, (2) studies with a clear focus on KGs
in NLP, and (3) studies written in English whose full
texts are electronically accessible. Conversely,
publications that did not satisfy all three
inclusion criteria were excluded from the dataset.
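The merging and duplicate removal step could look roughly like the sketch below. The text does not say how duplicates were identified, so matching on a normalized title is an assumption, as are the record fields (`title`, `year`, `type`) and the function names.

```python
import re


def normalize_title(title):
    """Lowercase and collapse non-alphanumerics so near-identical titles collide."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_and_deduplicate(exports, first_year=2012, last_year=2021,
                          allowed_types=("conference", "journal")):
    """Merge per-database exports into one list of unique records.

    `exports` is a list of record lists, one per database. Each record is a
    dict with hypothetical keys "title", "year", and "type".
    """
    seen = {}
    for records in exports:
        for record in records:
            # Keep only records inside the time window.
            if not first_year <= record["year"] <= last_year:
                continue
            # Keep only conference and journal papers.
            if record["type"] not in allowed_types:
                continue
            key = normalize_title(record["title"])
            seen.setdefault(key, record)  # the first occurrence wins
    return list(seen.values())
```

The remaining inclusion criteria (a clear focus on KGs in NLP, English language, accessible full text) require human judgment and are therefore not automated here.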
As part of the screening procedure, two of the
authors read title, abstract, and keywords to
determine if a paper matched the inclusion criteria. In
ambiguous cases, the full text of the paper was
examined. The two authors screened all papers and
decided together on keeping or dropping records
from the dataset. The final dataset included a total
of 507 papers, as listed in Table 1. We make our
annotated dataset available through a public GitHub
repository.1

Academic Database      No. of Papers
ACL Anthology          164
ACM Digital Library    26
IEEE Xplore            76
ScienceDirect          34
Scopus                 200
Web of Science         7
Total                  507
Table 1: Overview of academic databases and number
of included papers.
3.3 Classification Scheme and Data Extraction
According to our RQs, the included papers had to
be categorized with respect to three facets: task,
research type, and contribution. Established classi-
fication schemes from Wieringa et al. (2006) and
Shaw (2003) were adapted for the research and
contribution type as presented in Appendix A. For
classifying tasks, we constructed a task taxonomy,
following the iterative procedure suggested by Pe-
tersen et al. (2008), in which an initial classifica-
tion scheme derived from keywords continuously
evolves through adding, merging, or splitting cate-
gories during the classification process. Our task
taxonomy is based on existing schemes from Paul-
heim (2017), Liu et al. (2020a), and Ji et al. (2021).
Once the initial schemes were set up, all papers
were sorted into the classes as part of the data
extraction process. The 507 included studies were
divided between two of the authors. In regular
sessions, they discussed changes to the classification
schemes or clarified uncertain labels. While each
paper was assigned exactly one label for the research
type, multiple labels were possible with regard
to tasks and contributions. To assess
inter-annotator agreement, the two authors
independently classified a random sample of 50
papers. We calculated Cohen’s Kappa coefficient
of these annotations for each facet (Cohen, 1960).
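For reference, Cohen's Kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for the single-label case, such as the research type facet. How the multi-label facets (tasks and contributions) were handled is not stated in the text, so they are outside the scope of this example, and the function name is a hypothetical choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' marginal label frequencies,
    # summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With the made-up annotations ["x", "x", "y", "y"] and ["x", "y", "y", "y"], the observed agreement is 0.75 and the chance agreement is 0.5, which gives a Kappa of 0.5.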
1 https://github.com/sebischair/KG-in-NLP-survey