Question Answering Over Biological Knowledge Graph via Amazon Alexa


Md. Rezaul Karim†∗, Hussain Ali†, Prinon Das†, Mohamed Abdelwaheb†, Stefan Decker†∗

†Computer Science 5 - Information Systems and Databases, RWTH Aachen University, Germany

∗Fraunhofer Institute for Applied Information Technology FIT, Germany

Abstract—Structured and unstructured data and facts about drugs, genes, proteins, viruses, and their mechanisms are spread across a huge number of scientific articles. These articles are a large-scale knowledge source and can have a huge impact on disseminating knowledge about the mechanisms of certain biological processes. A knowledge graph (KG) can be constructed by integrating such facts and data and can be used for data integration, exploration, and federated queries. However, exploring and querying large-scale KGs is tedious for certain groups of users due to a lack of knowledge about the underlying data assets or semantic technologies. A question answering (QA) system allows natural language questions to be answered over KGs automatically using the triples contained in a KG. Recently, the use and adoption of digital assistants has been growing owing to their ability to let users control smart systems or devices with voice commands. This paper is about using Amazon Alexa’s voice-enabled interface for QA over KGs. As a proof-of-concept, we use the well-known DisGeNET KG, which contains knowledge covering 1.13 million gene-disease associations between 21,671 genes and 30,170 diseases, disorders, and clinical or abnormal human phenotypes. Our study shows how Alexa could help to find facts about certain biological entities from large-scale knowledge bases.

Index Terms—Question answering, Knowledge graphs, Ontology, Semantic web, Bioinformatics, Digital assistants, Amazon Alexa.

1 INTRODUCTION

Domain experts are often interested in gathering and comprehending knowledge about the mechanisms of certain biological processes, e.g., diseases, in order to design strategies for prevention and to support therapeutic decision making. “Knowledge is something that is known and can be written down” [1]. Knowledge containing simple statements, e.g., “TP53 is an oncogene”, or quantified statements, such as “All oncogenes are responsible for cancer”, can be extracted from structured sources such as knowledge or rule bases. Moreover, knowledge can be extracted from external sources like scientific articles, where a KG could be an effective means to capture facts from heterogeneous data sources. For example, scientific literature and patents provide a huge treasure of structured and unstructured information about different biological entities. One prominent example is PubMed, which contains millions of scientific articles and is a great source of knowledge in the biomedical domain [2]. PubMed data are mostly unstructured and heterogeneous, which makes the knowledge extraction process very challenging.

The problem of semantic heterogeneity is further compounded by the flexibility of semi-structured data and the various tagging methods applied to documents or unstructured data. Semantic Web (SW) technologies offer functionality to connect previously isolated pieces of data and knowledge, associate meaning with them, and represent the knowledge extracted from them. In particular, ontology-based named entity extraction and disambiguation help with the unambiguous identification of entities in heterogeneous data and the assertion of applicable named relationships that connect these entities. SW addresses data variety by proposing graphs as a unifying data model, to which data can be mapped in the form of a graph structure. A graph may contain not only data but also metadata and domain knowledge (ontologies containing axioms or rules), all in the same uniform structure; such graphs are then called knowledge graphs (KGs) [3], [4].

A simple statement can be accumulated as an edge in a KG, while quantified statements provide a more expressive way to represent knowledge, which however requires ontologies [4]. Hogan et al. [4] defined a KG as “a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent potentially different relations between these entities”. Nodes in a KG represent entities and edges represent binary relations between those entities [4]. A KG can be defined as $G = \{E, R, T\}$, where $G$ is a labelled and directed multi-graph and $E$, $R$, $T$ are the sets of entities, relations, and triples, respectively. A triple can be represented as $(u, e, v) \in T$, where $u \in E$ is the head node, $v \in E$ is the tail node, and $e \in R$ is the edge connecting $u$ and $v$ [4].
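The definition above can be made concrete with a minimal sketch. It assumes plain strings as identifiers and is not the representation used by the authors. The first triple restates the example statement from the introduction; the `GENE_A` and `DISEASE_X` entries and all relation names are hypothetical placeholders, included only to show that a multi-graph may hold several differently labelled edges between the same pair of nodes.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    head: str      # u in E
    relation: str  # e in R
    tail: str      # v in E


# T: the set of triples. Relation names are illustrative assumptions.
T: Set[Triple] = {
    Triple("TP53", "is_a", "oncogene"),
    # Placeholder entities: two edges between the same nodes (multi-graph).
    Triple("GENE_A", "associated_with", "DISEASE_X"),
    Triple("GENE_A", "biomarker_of", "DISEASE_X"),
}

# E and R are induced by the triples.
E: Set[str] = {t.head for t in T} | {t.tail for t in T}
R: Set[str] = {t.relation for t in T}

# G = {E, R, T}
G = (E, R, T)

# All edges that connect GENE_A to DISEASE_X.
parallel_edges = {t.relation for t in T
                  if t.head == "GENE_A" and t.tail == "DISEASE_X"}
```

Storing $T$ as a set mirrors the set-based definition: duplicate triples collapse, while distinct relations between the same head and tail are preserved.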

However, building a domain KG has several core requirements, such as a formal conceptualization that indicates the logical design of the KG, given by a predefined domain-specific ontology, and the modelling of domain knowledge, represented by semantically interrelated entities and relations [5]. Ontologies are semantic data models that define the types of things that exist in a domain and the properties that can be used to describe them, including the relationships between them [4]. An ontology not only defines the relationships between concepts [6], but also provides a formal representation of domain-specific entities [7].

arXiv:2210.06040v1 [cs.AI] 12 Oct 2022

Information extraction (IE) is the process of automatically extracting structured knowledge and facts from such unstructured and/or semi-structured documents or electronically represented sources [8]. IE is typically divided into named entity recognition (NER), entity linking, and relation extraction. Relation extraction also involves relation classification, which is typically formulated as a classification problem over the relationship between the entities identified in the text [9]: a classifier takes a piece of text and two entities as inputs and predicts possible relations between the entities as output. Once instances are extracted, they can be stored as Resource Description Framework (RDF)¹ triples, where each triple forms a connected component of a sentence for the KG. A number of languages have been proposed for querying graph data [4], including the SPARQL query language² for RDF, and the Cypher Query Language³ and the Gremlin Query Language⁴ for property graphs.
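To illustrate what querying stored triples amounts to, the following is a minimal sketch of single triple-pattern matching, the basic building block of a SPARQL query. It is a pure-Python stand-in and not a SPARQL engine; the function `match`, the prefixes `rdf:` and `ex:`, and the `GENE_A`/`DISEASE_X` entries are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that agree with every bound position.

    A position left as None behaves like a query variable.
    """
    for ts, tp, to in triples:
        if ((s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)):
            yield (ts, tp, to)


triples = [
    ("TP53", "rdf:type", "Oncogene"),
    ("GENE_A", "rdf:type", "Oncogene"),            # placeholder entity
    ("GENE_A", "ex:associatedWith", "DISEASE_X"),  # placeholder relation
]

# Roughly corresponds to: SELECT ?g WHERE { ?g rdf:type Oncogene }
oncogenes = sorted(s for s, _, _ in match(triples, p="rdf:type", o="Oncogene"))
```

A real query language additionally joins several such patterns on shared variables and supports filters and aggregation; this sketch covers only the single-pattern case.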

Reasoning over KGs enables consistency checking to recognize conflicting facts, classification by defining taxonomies, and deductive inferencing by revealing implicit knowledge from a set of facts [10]. Further, deductive reasoning can be used to entail extended knowledge such as “TP53 is responsible for cancer” [11]. However, a large-scale KG can have billions of linked entities expressing their relationships, where each node represents an entity and each edge signifies a semantic relationship between entities [12]. This makes the exploration, processing, and analysis of large-scale KGs a great challenge for current computational methods. To provide cancer diagnosis reasoning over deep neural network (DNN) models, an integrated domain-specific KG is required, which is subject to the availability of an efficient natural language processing (NLP)-based information extraction method and a domain-specific ontology [11].
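The deductive step mentioned above, from “TP53 is an oncogene” and “All oncogenes are responsible for cancer” to “TP53 is responsible for cancer”, can be sketched as naive forward chaining. This is a minimal illustration with one hard-coded rule and hypothetical relation names (`is_a`, `responsible_for`); an actual KG reasoner would derive such rules from ontology axioms.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def infer(facts: Set[Triple]) -> Set[Triple]:
    """Apply the rule (x, is_a, oncogene) => (x, responsible_for, cancer)
    repeatedly until no new triple is derived (a fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(derived):
            if p == "is_a" and o == "oncogene":
                new = (s, "responsible_for", "cancer")
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived


facts = {("TP53", "is_a", "oncogene")}
closure = infer(facts)
```

The input set is left unmodified, so the explicit facts and the entailed facts can be told apart by taking `closure - facts`.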

In this paper, we report a case study of using digital assistants, in particular Amazon Alexa’s voice-enabled interface, for QA over KGs. We use the well-known DisGeNET KG, which contains knowledge covering 1.13 million gene-disease associations between 21,671 genes and 30,170 diseases, disorders, and clinical or abnormal human phenotypes. Our study shows how Alexa could help to find facts about certain biological entities from large-scale knowledge bases. The rest of the paper is structured as follows: Section 2 critically reviews related work. Section 3 describes the proposed approach in detail, covering construction of black-box models, training, interpreting the black-box model, and generation of decision rules and local explanations. Section 4 illustrates the experimental results, including a comparative analysis with baseline models on all datasets. Section 5 summarizes this research with its potential limitations and points to some possible future directions.

2 RELATED WORKS

Research initiatives are gradually adopting SW technologies [13] such as KGs, knowledge bases (KBs), and domain-specific ontologies as the means of building structured networks of interconnected knowledge [10]. Hogan et al. [4] have provided a comprehensive review of articles on KGs, covering knowledge graph creation, enrichment, quality assessment, and refinement. Apart from this literature, which focuses on theoretical concepts, several large-scale KGs have been constructed either by manual annotation, crowd-sourcing (e.g., DBpedia), or automatic extraction from unstructured data (e.g., YAGO⁵) [14], targeting KG analytics for specific use cases. Life sciences is an early adopter of SW technologies, and scientific communities have focused on constructing large-scale KGs for life science research. For example, Bio2RDF⁶ and PubMed KG [2] were developed to accelerate bioinformatics research. The former integrates 35 life sciences datasets such as dbSNP, GenAge, GenDR, LSR, OrphaNet, PubMed, SIDER, and WormBase, contributing 11 billion RDF triples.

¹ In RDF, the linking structure forms a directed graph and triples are represented in the form (subject, predicate, object).
² SPARQL is the protocol/language to query RDF, which allows querying not only over graph data but also between disparate graphs. Link: https://www.w3.org/TR/sparql11-query/
³ https://neo4j.com/developer/cypher/
⁴ https://docs.janusgraph.org/basics/gremlin/
⁵ https://github.com/yago-naga/yago3

Alshahrani et al. [15] built a biological KG based on the gene ontology (GO), human phenotype ontology (HPO), and disease ontology, and then performed feature learning over the KG. Their method combines knowledge representation using symbolic logic and automated reasoning with neural networks to generate embeddings of nodes. The learned embeddings are used in downstream applications such as link prediction, finding candidate genes of diseases, protein-protein interaction prediction, and drug-target relation prediction. Hasan et al. [16] developed a prototype KG based on the Louisiana Tumor Registry dataset⁷. Their approach provides scenario-specific querying, schema evolution for iterative analysis, and data visualization. Although the resultant KG was found effective for population-level treatment sequences, it does not provide comprehensive knowledge about cancer genomics for the majority of cancer types, as it is built on limited data.

Although numerous works have focused on data integration, building KGs, and querying in the biological domain, only a few have focused on question answering and retrieving information from KGs via a voice-enabled interface. Other works focus on improving the quality of and enriching multiple KGs in order to query them via voice-enabled devices [17]. Haase et al. [18] focused on using Wikidata to answer factual questions via Alexa. Another approach [19] uses self-attention and cross-attention mechanisms and factors in information about the KGs to perform entity alignment, which entails determining which elements of different graphs refer to the same entities, and shows how this is helpful for voice-enabled interfaces. The idea is to improve computational efficiency and performance at the same time, speeding up graph-related tasks such as question answering via Alexa.

3 METHODS

To ease information retrieval from large-scale KGs with many layers and intricacies, and to develop a proof-of-concept for Amazon Alexa’s voice-enabled interface, we use the well-known DisGeNET KG, which contains knowledge covering 1.13 million gene-disease associations between 21,671 genes and 30,170 diseases, disorders, and clinical or abnormal human phenotypes. The database covers gene-disease associations (GDAs), disease-disease associations (DDAs), and variant-disease associations (VDAs). Disease gene identification is a process by which experts identify the mutant genotype responsible for an inherited genetic disorder. This dataset provides robust coverage for our task, but this does not mean that it has RDF triples for all known diseases and genes: association triples exist only if the association between a disease and a gene or variant is supported by evidence from external sources.
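The idea of answering a spoken question from evidence-supported association triples can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' Alexa skill and not the DisGeNET schema: the question template, the `Association` record, the `answer` function, and every gene, disease, and source identifier are hypothetical placeholders, and the in-memory list stands in for a query against the KG.

```python
import re
from typing import List, NamedTuple


class Association(NamedTuple):
    gene: str
    disease: str
    evidence: List[str]  # identifiers of supporting external sources


# Placeholder records, not real DisGeNET content.
ASSOCIATIONS = [
    Association("GENE_A", "DISEASE_X", ["SOURCE_1"]),
    Association("GENE_A", "DISEASE_Y", []),  # no evidence: never reported
    Association("GENE_B", "DISEASE_X", ["SOURCE_2"]),
]

# One hypothetical question template; a voice skill would define several.
QUESTION = re.compile(
    r"which diseases are associated with (?P<gene>\w+)\??$", re.IGNORECASE)


def answer(utterance: str) -> str:
    """Map a transcribed question to a short spoken-style answer."""
    m = QUESTION.match(utterance.strip())
    if m is None:
        return "Sorry, I did not understand the question."
    gene = m.group("gene").upper()
    # Keep only associations backed by at least one evidence source.
    diseases = sorted(a.disease for a in ASSOCIATIONS
                      if a.gene == gene and a.evidence)
    if not diseases:
        return f"I found no evidence-supported associations for {gene}."
    return f"{gene} is associated with: {', '.join(diseases)}."
```

For instance, `answer("Which diseases are associated with gene_a?")` reports only `DISEASE_X`, because the `DISEASE_Y` record carries no evidence.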

⁶ https://github.com/MaastrichtU-IDS/bio2rdf and PubMed KG [2]
⁷ https://sph.lsuhsc.edu/louisiana-tumor-registry/
