Clustering Semantic Predicates in the Open Research Knowledge Graph

Omar Arab Oghli[0000-0002-9092-9096], Jennifer D’Souza[0000-0002-6616-9509], and Sören Auer[0000-0002-0698-2864]

TIB Leibniz Information Centre for Science and Technology, Hannover, Germany

{omar.araboghli,jennifer.dsouza,auer}@tib.eu

Abstract. When semantically describing knowledge graphs (KGs), users have to make a critical choice of vocabulary (i.e., predicates and resources). The success of KG building is determined by the convergence of shared vocabularies so that meaning can be established. The typical lifecycle of a new KG construction can be defined as follows: nascent phases of graph construction experience terminology divergence, while later phases experience terminology convergence and reuse. In this paper, we describe our approach of tailoring two AI-based clustering algorithms for recommending predicates (in RDF statements) about resources in the Open Research Knowledge Graph (ORKG, https://orkg.org/). Such a service, which recommends existing predicates to semantify new incoming data of scholarly publications, is of paramount importance for fostering terminology convergence in the ORKG. Our experiments show very promising results: high precision with relatively high recall in linear runtime performance. Furthermore, this work offers novel insights into the predicate groups that automatically accrue loosely as generic semantification patterns for scholarly knowledge spanning 44 research fields.

Keywords: Content-based recommender systems · Open research knowledge graph · Artificial Intelligence · Clustering algorithms.

1 Introduction

Traditional, discourse-based scholarly communication in “pseudo-digitized” PDF format is now increasingly being transformed into a completely new representation leveraging semantified, digital-born formats, e.g., within the Open Research Knowledge Graph (ORKG) [7] among other initiatives [3,8,11,19,26,35,50]. This “digital-first” scholarly information representation is based on a fundamentally new information organization paradigm that creates and uses structured, fine-grained scholarly content. Specifically, in the ORKG, scholarly communication is based on a large, interconnected knowledge graph (KG) of fine-grained scholarly content. Such an information organization paradigm facilitates the evolution of scholarly communication from documents for humans to read towards human- and machine-readable knowledge, with the aim of alleviating the cognitive tie-ups of human reading. To this end, ORKG-based scholarly communication comprises a crucial machine-actionable unit of scholarly content in the form of human- and machine-readable comparisons of semantified scholarly contributions [44]. These comparisons are meant to be used by researchers to quickly become familiar with existing work in a specific research domain, for example, to determine the reproduction number estimate R0 of the SARS-CoV-2 virus from a number of studies in various regions across the world (https://orkg.org/comparison/R44930). The semantically represented scholarly contribution comparisons in the ORKG are especially necessary in our era of the deluge of peer-reviewed publications [29] and preprints [18] to help researchers stay on top of fast-paced scientific progress. Concretely, they help scientists maintain an overview of scientific progress by freeing them from the unnecessary cognitive tie-ups involved in searching for key information buried in large volumes of text.

arXiv:2210.02034v1 [cs.DL] 5 Oct 2022

The ORKG machine-readable comparisons depend on the availability of a knowledge base of machine-actionable, semantified scholarly contributions. A scholarly contribution is a unit of information defined in the context of the ORKG that describes the addressed problem and comprises the utilized materials, employed methods, and yielded results in a scholarly article, a model which subsumes Leaderboards [27,31]. A large community of researchers has recently been growing around the crowdsourced curation of scholarly contributions in the ORKG (e.g., https://orkg.org/paper/R163747).¹ To describe the scholarly contributions, RDF statements are used as structured semantic units that are machine-actionable as a result. A core semantic construct of these contribution-centric statements is the predicate, or property, used to describe the contribution of an article. While the subject and object are content-based, predicates can generically span contributions across articles. For example, task name, dataset name, metric, and score are a group of four predicates used to semantically describe the leaderboard contribution across AI articles [31] in the Computer Science domain; the predicates basic reproduction number, confidence interval (95%), location, and time period are used to describe Covid-19 reproductive number estimates in epidemiology articles [43].

Predicates are a core construct for semantically describing contributions in the ORKG. To base the ORKG on meaningfully described semantic scholarly contributions, specific groups of predicates that can capture key contribution aspects of the scholarly articles are essential. Each such group then becomes a contribution-centric predicate group. Further, a group varies in applicability from being applicable to only a specific scholarly contribution to generalizing across a group of contributions from different papers. In this respect, the ORKG follows an agile, iterative Wiki-style collaboration approach that gives curators the autonomy to coin new properties easily, but aims in the long term to be coherent in terms of vocabulary for both predicates and resources. Note that, for the machine-readable ORKG comparisons, contributions can only be compared on the basis of standardized predicate terminology. Further, the typical lifecycle of a new KG construction must also be accounted for: it starts with nascent phases of graph construction experiencing terminology divergence, while later phases aim at terminology convergence and reuse. Against this background of building the ORKG, the overarching research question investigated in this paper is: How to ensure that individuals, free to use arbitrary terminology, converge towards shared vocabularies for contribution-centric semantic predicates?

¹ The related construct to ORKG contributions, Leaderboards in AI (https://paperswithcode.com/), has also garnered large-scale crowdsourcing interest.

Allowing users to make arbitrary statements is important, since it ensures that the expression of the diverse discoveries in science is not lost or left unrepresented due to restricted semantic vocabularies. However, some authoring considerations need to be made. Without them, the authoring freedom of contributions in the ORKG would result in statements with differing vocabularies, defeating the purpose of semantifying contributions. A terminology policy could be enforced, but that would highly restrict users. Instead, a suggestion mechanism that recommends terminology based on the dataset would help terminology converge without forcing users, as demonstrated in collaborative tagging [34,37]. In collaborative data entry, participants construct a dataset by continuously and independently adding further statements to existing data. Each curation participant faces the question: Which vocabulary elements to use? To ensure convergence, the answer is: use the most relevant and frequently occurring vocabulary elements. Finding the most frequent vocabulary elements is straightforward: one can simply count the occurrences. We therefore focus on finding the relevant vocabulary elements. Science comprises very heterogeneous contributions. Finding the vocabulary that is relevant for one contribution therefore means finding similar contributions and reusing their vocabulary.
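
To make the distinction between frequent and relevant vocabulary concrete, the following minimal sketch counts predicate occurrences, first over all contributions and then over only those contributions assumed to be similar to a new one. The paper identifiers and the toy data are hypothetical; the predicate labels are the ones quoted earlier in this section, and the set of similar contributions is simply given here rather than computed.

```python
from collections import Counter

# Hypothetical toy data: each contribution is reduced to its predicate set.
contributions = {
    "paper_a": {"task name", "dataset name", "metric", "score"},
    "paper_b": {"task name", "dataset name", "metric", "score"},
    "paper_c": {"task name", "metric", "score"},
    "paper_d": {"basic reproduction number", "location", "time period"},
}


def predicate_frequencies(cps_by_paper, papers=None):
    """Count in how many of the selected contributions each predicate occurs."""
    if papers is None:
        selected = cps_by_paper.values()
    else:
        selected = [cps_by_paper[p] for p in papers]
    return Counter(pred for cps in selected for pred in cps)


# Frequency over everything: dominated by the largest group of papers.
global_counts = predicate_frequencies(contributions)

# Frequency restricted to contributions assumed similar to a new
# epidemiology paper. Finding that similar set is the hard part.
local_counts = predicate_frequencies(contributions, papers=["paper_d"])

print(global_counts.most_common(3))
print(sorted(local_counts))
```

Counting alone would suggest leaderboard predicates to every curator, including those describing an epidemiology study, which is why the similar contributions have to be identified first.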

To this end, this work describes our implementation of an unsupervised AI service based on clustering similar papers and recommending contribution-centric predicate groups from the existing ORKG contributions. Similar scholarly contributions should be semantified with homogeneous contribution-centric semantic predicate groups. This is our intuition behind adopting clustering, since the method aims to group data points that have similar features, whereas data points in different groups should have highly dissimilar features. We chose hierarchical (Agglomerative¹) and non-hierarchical (K-means²) clustering strategies. We avoid computationally intensive methods (e.g., Affinity Propagation) or methods that can handle only small cluster sizes (e.g., Spectral clustering).
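
The two clustering flavors can be contrasted with a small, dependency-free sketch. It is an illustration only and not the implementation used in this work, which relies on the scikit-learn algorithms footnoted above. The bag-of-words vectors stand in for real text embeddings, and the toy documents, the seed indices, and the use of average linkage over squared Euclidean distances are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text, vocab):
    """L2-normalised bag-of-words vector; a stand-in for a real embedding."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, seeds, iterations=10):
    """Non-hierarchical: assign each point to its nearest centroid, then
    recompute the centroids, for a fixed number of rounds."""
    centroids = [list(points[i]) for i in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def agglomerative(points, n_clusters):
    """Hierarchical: start from singletons and repeatedly merge the two
    clusters with the smallest average pairwise squared distance."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(sq_dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Hypothetical paper texts (title plus abstract, heavily shortened).
docs = [
    "neural question answering benchmark score",
    "question answering neural model benchmark",
    "covid reproduction number estimate region",
    "reproduction number covid epidemic estimate",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
points = [embed(d, vocab) for d in docs]

kmeans_labels = kmeans(points, seeds=[0, 2])
agglo_labels = agglomerative(points, n_clusters=2)
print(kmeans_labels, agglo_labels)
```

On this toy input both strategies separate the two AI papers from the two epidemiology papers; on real data they can disagree, which is one reason to evaluate both.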

In summary, the contributions of our work are:

1. a formalization of the application of homogeneous, related groups of predicates to semantically describe scholarly contributions;

2. the evaluation of two contrasting flavors of clustering objectives (hierarchical and non-hierarchical) to semantify contributions based on contribution-centric predicate groups. Since the task itself is formalized for the first time in this work, the application of an AI approach is correspondingly novel;

3. detailed empirical evaluations of four machine learning model variants resulting from testing two different embedding representations with each clustering algorithm; and

4. the demonstration of a predicate recommendation service for the ORKG scholarly knowledge digitalization platform. Its objectives are two-fold: i) to expedite adding a new contribution to the graph, and ii) to semantify the contributions with a shared vocabulary. The recommender service takes as input a paper’s title and abstract and in turn recommends a group of semantically related predicates based on earlier, similar semantified papers if any are found by the clustering method; otherwise, an empty set of predicates is returned. Such a system is described for the first time.

¹ https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

² https://scikit-learn.org/stable/modules/clustering.html#k-means
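
The input/output contract of the recommender service in item 4 can be sketched as follows. This is a hedged illustration under stated assumptions, not the deployed ORKG service: the function name `recommend_predicates`, the `CLUSTERS` structure, the bag-of-words similarity, and the `min_similarity` threshold of 0.3 are all hypothetical. Only the behaviour is taken from the text: title and abstract go in, and either a predicate group or an empty set comes out.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts; a placeholder for the real text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical clusters of already-semantified papers: a text centroid plus
# the predicates with which the papers in that cluster were described.
CLUSTERS = [
    {"centroid": bow("question answering benchmark neural model score"),
     "predicates": {"task name", "dataset name", "metric", "score"}},
    {"centroid": bow("covid reproduction number estimate region epidemic"),
     "predicates": {"basic reproduction number", "location", "time period"}},
]


def recommend_predicates(title, abstract, clusters=CLUSTERS, min_similarity=0.3):
    """Return the predicate group of the closest cluster, or an empty set
    when no cluster is similar enough to the new paper."""
    query = bow(f"{title} {abstract}")
    best = max(clusters, key=lambda c: cosine(query, c["centroid"]))
    if cosine(query, best["centroid"]) < min_similarity:
        return set()
    return set(best["predicates"])


hit = recommend_predicates(
    "Estimating the reproduction number",
    "We estimate the covid reproduction number per region",
)
miss = recommend_predicates(
    "Medieval poetry", "A study of rhyme in old manuscripts"
)
print(sorted(hit), miss)
```

Returning an empty set, rather than the nearest cluster regardless of distance, keeps the service from pushing unrelated vocabulary onto a paper that resembles nothing in the graph.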

The remainder of the paper is organized as follows. In Section 2, we first define the core concepts that are relevant to this work but may be new to the community. We then offer the formalized definition of our contribution-centric predicate grouping task in Section 3, after which, in Section 4, we explain the custom dataset created from the ORKG RDF data dump for our novel task. Next, we introduce our method for the contribution-centric predicate group recommendation service in Section 5. We then show the experimental results of our methods on our custom task corpus in Section 6. Finally, we conclude with discussions on the possibility of further improvement and future work in Section 7.

2 Definitions

We first define the concepts central to the task attempted in this work.

Contribution. A contribution highlights the findings of a research endeavour. An ORKG contribution addresses a research problem and can be described in terms of the materials and methods used and the results achieved. Contributions in different papers addressing the same research problem can be expected to have comparable semantic descriptions, at least for their key properties, whose values (i.e., resources and literals) are then specific to the research endeavour.

Contribution Triple. Contributions are semantically described in a series of (subject, predicate, object) RDF triple statements that build the ORKG.

Contribution Predicates’ Set (cps). The set of predicates occurring in contribution triples.
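
As a minimal illustration of the last two definitions, the sketch below models contribution triples as plain Python tuples and derives a cps as the set of their distinct predicates. The subject identifier `R_example` and all object values are made-up placeholders; the predicate labels come from the leaderboard example in Section 1.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (subject, predicate, object) statement describing a contribution."""
    subject: str
    predicate: str
    obj: str


def contribution_predicate_set(triples):
    """Return the cps: the set of distinct predicates used in the triples."""
    return frozenset(t.predicate for t in triples)


# Hypothetical contribution; the subject ID and object values are invented.
leaderboard_contribution = [
    Triple("R_example", "task name", "question answering"),
    Triple("R_example", "dataset name", "some benchmark"),
    Triple("R_example", "metric", "F1"),
    Triple("R_example", "score", "placeholder value"),
    Triple("R_example", "metric", "exact match"),  # predicate used twice
]

cps = contribution_predicate_set(leaderboard_contribution)
print(sorted(cps))
```

Because the cps is a set, a predicate that is used in several triples of the same contribution is counted once.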

Comparison. The ORKG supports downstream smart applications such as the creation of comparisons/surveys over its structured contributions. In other words, given the ORKG structured contributions, it is possible to compare the values of several such machine-actionable contributions, provided their cpss are more or less similar. Comparisons can be generated either over several contributions of a single article (e.g., a comparison of the characteristics of an AI benchmark, with similar cpss but differing values over the different annotated data domains: https://orkg.org/comparison/R163843/) or over contributions with similar cpss in different articles (e.g., a comparison of the Covid-19 reproductive number (R0) estimates from a set of studies conducted by different research groups for different countries: https://orkg.org/comparison/R44930/).
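
One way to read "more or less similar" is as an overlap measure between predicate sets. The sketch below uses Jaccard similarity for this purpose and tabulates values over the shared predicates. Jaccard, the 0.5 threshold, the function names, and every value in the toy studies are assumptions for illustration; the text does not specify how the ORKG decides that cpss are similar enough.

```python
def jaccard(a, b):
    """Share of predicates that two sets have in common (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare(contributions, min_overlap=0.5):
    """Tabulate values predicate by predicate, provided every pair of
    contributions has sufficiently similar predicate sets."""
    names = list(contributions)
    cps = {n: set(contributions[n]) for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(cps[a], cps[b]) < min_overlap:
                raise ValueError(f"{a} and {b} are too dissimilar to compare")
    shared = set.intersection(*cps.values())
    return {
        pred: {n: contributions[n][pred] for n in names}
        for pred in sorted(shared)
    }


# Hypothetical contributions; all values are invented placeholders.
studies = {
    "study_1": {
        "basic reproduction number": "value A",
        "location": "Region A",
        "time period": "period A",
    },
    "study_2": {
        "basic reproduction number": "value B",
        "location": "Region B",
        "time period": "period B",
        "confidence interval (95%)": "interval B",
    },
}

table = compare(studies)
for predicate, row in table.items():
    print(predicate, row)
```

The predicate that only one study uses drops out of the table, which illustrates why a shared vocabulary matters for comparisons.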
