Clustering Semantic Predicates in the
Open Research Knowledge Graph
Omar Arab Oghli[0000000290929096], Jennifer D’Souza[0000000266169509],
and Sören Auer[0000000206982864]
TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
{omar.araboghli,jennifer.dsouza,auer}@tib.eu
Abstract. When semantically describing knowledge graphs (KGs), users
have to make a critical choice of a vocabulary (i.e. predicates and re-
sources). The success of KG building is determined by the convergence
of shared vocabularies so that meaning can be established. The typical
lifecycle for a new KG construction can be defined as follows: nascent
phases of graph construction experience terminology divergence, while
later phases of graph construction experience terminology convergence
and reuse. In this paper, we describe our approach tailoring two AI-
based clustering algorithms for recommending predicates (in RDF state-
ments) about resources in the Open Research Knowledge Graph (ORKG)
https://orkg.org/. Such a service to recommend existing predicates to
semantify new incoming data of scholarly publications is of paramount
importance for fostering terminology convergence in the ORKG.
Our experiments show very promising results: a high precision with rela-
tively high recall in linear runtime performance. Furthermore, this work
offers novel insights into the predicate groups that automatically accrue
loosely as generic semantification patterns for semantification of schol-
arly knowledge spanning 44 research fields.
Keywords: Content-based recommender systems · Open research knowledge
graph · Artificial Intelligence · Clustering algorithms.
1 Introduction
Traditional, discourse-based scholarly communication in “pseudo-digitized” PDF
format is now increasingly being transformed into a completely new representation
that leverages semantified, born-digital formats, e.g., within the Open Research
Knowledge Graph (ORKG) [7], among other initiatives [3,8,11,19,26,35,50]. This
“digital-first” scholarly information representation is based on a fundamentally
new information organization paradigm that creates and uses structured, fine-
grained scholarly content. Specifically, in the ORKG, scholarly communication
is based on a large, interconnected knowledge graph (KG) of fine-grained schol-
arly content. Such an information organization paradigm facilitates the evolution
of scholarly communication from documents for humans to read towards human
and machine-readable knowledge with the aim of alleviating human reading cog-
nitive tie-ups. To this end, the ORKG-based scholarly communication comprises
arXiv:2210.02034v1 [cs.DL] 5 Oct 2022
a crucial machine-actionable unit of scholarly content in the form of human and
machine-readable comparisons of semantified scholarly contributions [44]. These
comparisons are meant to be used by researchers to quickly get familiar with
existing work in a specific research domain. One example is a comparison of
estimates of the reproduction number R0 of the SARS-CoV-2 virus from a number of
studies in various regions across the world (https://orkg.org/comparison/R44930). The
semantically represented scholarly contribution comparisons in ORKG are espe-
cially necessary in our era of the deluge of peer-reviewed publications [29] and
preprints [18] to help researchers stay on top of the fast-paced scientific progress.
Concretely, they help scientists keep an overview of scientific progress by
relieving the unnecessary cognitive tie-ups involved in searching for key
information buried in large volumes of text.
The ORKG machine-readable comparisons depend on the availability of a
knowledge base of machine-actionable, semantified scholarly contributions. A
scholarly contribution is a unit of information defined in the context of the
ORKG that describes the addressed problem and comprises the utilized materials,
employed methods, and yielded results of a scholarly article, a model which
subsumes Leaderboards [27,31]. A large community of researchers has recently
been growing around the crowdsourced curation of scholarly contributions in
the ORKG (e.g., https://orkg.org/paper/R163747).1 To describe the scholarly
contributions, RDF statements are used as structured semantic units, which makes
the contributions machine-actionable. A core semantic construct of these
contribution-centric statements is the predicate, or property, used to describe the
contribution of an article. While the subject and object are content-based, predicates
can generically span contributions across articles. E.g., task name, dataset name,
metric, and score are a group of four predicates used to semantically describe
the leaderboard contribution across AI articles [31] in the Computer Science
domain; the predicates basic reproduction number, confidence interval (95%),
location, and time period are used to describe Covid-19 reproductive number
estimates in epidemiology articles [43].
Predicates are a core construct for semantically describing contributions in
ORKG. To base the ORKG on meaningfully described semantic scholarly contri-
butions, certain, specific groups of predicates that can capture key contribution
aspects of the scholarly articles are essential. Each such group then becomes a
contribution-centric predicate group. Further, the group varies in applicability
from being applicable to only a specific scholarly contribution or generalizing
across a group of contributions from different papers. In this respect, the ORKG
follows an agile, iterative Wiki-style collaboration approach giving curators the
autonomy to coin new properties easily, but aims in the long-term trajectory
to be coherent in terms of vocabulary for both predicates and resources. Note
that, for the machine-readable ORKG comparisons, contributions can only be
compared if they share a standard predicate terminology. Further, the typical
lifecycle of a new KG construction must also be accounted for: it starts with nascent
1 The construct related to ORKG contributions, Leaderboards in AI
(https://paperswithcode.com/), has also garnered large-scale crowdsourcing interest.
phases of graph construction experiencing terminology divergence, while later
phases of graph construction aim at terminology convergence and reuse. In this
background setting of building the ORKG, the overarching research question
investigated in this paper is: How to ensure that individuals, free to use arbi-
trary terminology, converge towards shared vocabularies for contribution-centric
semantic predicates?
Allowing users to make arbitrary statements is important, since it ensures
that the expression of the diverse discoveries in Science is not lost or left
unrepresented due to restricted semantic vocabularies. However, some authoring
considerations need to be made. Without them, the authoring
freedom of contributions in the ORKG would result in statements with different
vocabularies, defeating the purpose of semantifying contributions. A
terminology policy could be enforced, but that would severely restrict users. Instead, a
suggestion mechanism, recommending terminology based on the dataset, would
help converge terminology without forcing users, as demonstrated in collabora-
tive tagging [34,37]. In collaborative data entry, participants construct a dataset
by continuously and independently adding further statements to existing data.
Each curation participant faces the question: Which vocabulary elements to use?
To ensure convergence, the answer is: use the most relevant and frequently oc-
curring vocabulary elements. Finding the most frequent vocabulary elements is
straightforward: one can simply count the occurrences. We therefore focus on
finding the relevant vocabulary elements. Science comprises very heterogeneous
contributions. Finding the vocabulary that is relevant for one contribution
therefore means finding similar contributions and reusing their vocabulary.
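The counting step described above can be illustrated with a minimal Python sketch. The contribution identifiers and predicate lists below are hypothetical toy data (not taken from the ORKG dump), and the function name is our own choice for illustration.

```python
from collections import Counter

# Hypothetical toy data: each contribution is represented by the predicates it uses.
contributions = {
    "C1": ["task name", "dataset name", "metric", "score"],
    "C2": ["task name", "dataset name", "metric"],
    "C3": ["basic reproduction number", "location", "time period"],
}

def predicate_frequencies(contribs):
    """Count in how many contributions each predicate occurs."""
    counts = Counter()
    for predicates in contribs.values():
        # dict.fromkeys removes duplicates while keeping the original order,
        # so a predicate is counted at most once per contribution.
        for predicate in dict.fromkeys(predicates):
            counts[predicate] += 1
    return counts

freq = predicate_frequencies(contributions)
most_frequent = {predicate for predicate, _ in freq.most_common(3)}
```

Frequency alone would favour the three AI leaderboard predicates here even for an epidemiology paper, which is why relevance has to be established separately.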
To this end, this work describes our implementation of an unsupervised
AI service based on clustering similar papers and recommending contribution-
centric predicate groups from the existing ORKG contributions. Similar scholarly
contributions should be semantified with homogeneous contribution-centric
semantic predicate groups. This is our intuition behind adopting clustering, since
the method aims to group data points that have similar features, whereas data
points in different groups should have highly dissimilar features. We chose
hierarchical (Agglomerative1) and non-hierarchical (K-means2) clustering strategies.
We avoid computationally intensive methods (e.g., Affinity Propagation) or methods
that can handle only small cluster sizes (e.g., Spectral clustering).
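To make the clustering intuition concrete, the following is a plain-Python sketch of the K-means idea. It is not the scikit-learn implementation referenced in the footnotes. The 2-D points are hypothetical stand-ins for paper embeddings, and the initial centroids are fixed by hand so that the run is deterministic.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing centroids, until the assignment stops changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed its cluster
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean(members)
    return labels, centroids

# Hypothetical 2-D "embeddings" of four papers from two research fields.
points = [[1.0, 0.9], [0.9, 1.0], [0.0, 0.1], [0.1, 0.0]]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])
```

Papers that end up with the same label would then share one contribution-centric predicate group.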
In summary, the contributions of our work are:
1. a formalization of the application of homogeneous related groups of predi-
cates to semantically describe scholarly contributions;
2. the evaluation of two contrasting flavors of clustering objectives (hierarchi-
cal and non-hierarchical) to semantify contributions based on contribution-
centric predicate groups. Since the task itself is formalized for the first time
in this work, the application of an AI approach is correspondingly novel;
3. detailed empirical evaluations of four machine learning model variants re-
sulting from testing two different embedding representations; and
1 https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
2 https://scikit-learn.org/stable/modules/clustering.html#k-means
4. the demonstration of a predicates recommendation service for the ORKG
scholarly knowledge digitalization platform. Its objectives are two-fold: i)
expedite adding a new contribution to the graph, and ii) semantify the con-
tributions with a shared vocabulary. The recommender service takes as input
a paper’s title and abstract and in turn recommends a group of semantically
related predicates based on earlier similar semantified papers if found by the
clustering method, otherwise an empty set of predicates is returned. Such a
system is described for the first time.
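The input/output behaviour of the recommender described in item 4 can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the bag-of-words representation, the cosine similarity, the `min_similarity` threshold, and the two cluster profiles are all hypothetical choices made for this example, whereas the actual service relies on embeddings and the clustering models evaluated later in the paper.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very rough text representation: lower-cased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical clusters: a text profile plus the predicate group of its papers.
CLUSTERS = [
    {"profile": bag_of_words("question answering benchmark dataset accuracy model"),
     "predicates": ["task name", "dataset name", "metric", "score"]},
    {"profile": bag_of_words("covid-19 basic reproduction number estimate epidemic"),
     "predicates": ["basic reproduction number", "confidence interval (95%)",
                    "location", "time period"]},
]

def recommend_predicates(title, abstract, clusters=CLUSTERS, min_similarity=0.2):
    """Return the predicate group of the most similar cluster, or an empty
    list when no cluster is similar enough (threshold is an assumption)."""
    query = bag_of_words(title + " " + abstract)
    best = max(clusters, key=lambda c: cosine(query, c["profile"]))
    if cosine(query, best["profile"]) < min_similarity:
        return []
    return list(best["predicates"])

epi = recommend_predicates(
    "Estimating the basic reproduction number of covid-19",
    "We estimate the epidemic reproduction number in Wuhan",
)
unknown = recommend_predicates("Medieval poetry", "A study of rhyme")
```

Returning an empty result for unmatched papers mirrors the behaviour described above: the curator is then free to coin new predicates.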
The remainder of the paper is organized as follows. In Section 2, we first define
the core concepts that are relevant to this work but may be new to the community.
We then offer the formalized definition of our contribution-centric predicates
grouping task in Section 3, following which, in Section 4, we explain the custom
dataset created from the ORKG RDF data dump incorporating our novel task.
Next, we introduce our method for the contribution-centric predicates group
recommendation service in Section 5. We then show the experimental results
from our methods on our created custom task corpus in Section 6. Finally, we
conclude with discussions on the possibility of further improvement and future
work in Section 7.
2 Definitions
We first define the concepts central to the task attempted in this work.
Contribution. Highlights the findings of a research endeavour. An ORKG con-
tribution addresses a research problem, and can be described in terms of the
materials and methods used and the results achieved. Contributions in different
papers addressing the same research problem can be expected to have compa-
rable semantic descriptions at least for their key properties whose values, i.e.
resources and literals, then are specific to the research endeavour.
Contribution Triple. Contributions are semantically described in a series of
(subject, predicate, object) RDF triple statements that build the ORKG.
Contribution Predicates’ Set (cps). A set of predicates in contribution triples.
Comparison. The ORKG supports downstream smart applications such as the
creation of comparisons/surveys over its structured contributions. In other words,
given the ORKG structured contributions, it is possible to compare the values
of several such machine-actionable contributions provided their cpss are more
or less similar. Comparisons can either be generated over several contributions
of a single article (e.g., a comparison of AI benchmark characteristics having
similar cpss but differing values over the different data domains annotated,
https://orkg.org/comparison/R163843/); or over contributions with similar cpss
in different articles (e.g., a comparison of the Covid-19 reproductive number (R0)
estimates from a set of studies conducted by different research groups for
different countries, https://orkg.org/comparison/R44930/).
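The definitions above can be made concrete with a short Python sketch that derives a cps from (subject, predicate, object) triples and checks how similar two cpss are. The triples and identifiers are hypothetical toy data. Jaccard overlap is used here only as one plausible reading of "more or less similar", and it is our assumption rather than a measure prescribed by the ORKG. The sketch also simplifies by treating only triples whose subject is the contribution itself.

```python
# Hypothetical contribution triples: (subject, predicate, object).
triples = [
    ("C1", "basic reproduction number", "2.2"),
    ("C1", "confidence interval (95%)", "1.4-3.9"),
    ("C1", "location", "Wuhan"),
    ("C1", "time period", "2020-01"),
    ("C2", "basic reproduction number", "3.1"),
    ("C2", "location", "Italy"),
    ("C2", "time period", "2020-02"),
]

def contribution_predicate_set(all_triples, contribution):
    """Collect the predicates of all triples whose subject is the contribution."""
    return {p for s, p, o in all_triples if s == contribution}

def jaccard(a, b):
    """Share of predicates common to both sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

cps_1 = contribution_predicate_set(triples, "C1")
cps_2 = contribution_predicate_set(triples, "C2")
overlap = jaccard(cps_1, cps_2)
```

Two contributions with a high overlap, as in this toy case, can be lined up predicate by predicate in a comparison table, while the values (objects) remain specific to each study.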