Question Answering Over Biological Knowledge
Graph via Amazon Alexa
Md. Rezaul Karim, Hussain Ali, Prinon Das, Mohamed Abdelwaheb, Stefan Decker
Computer Science 5 - Information Systems and Databases, RWTH Aachen University, Germany
Fraunhofer Institute for Applied Information Technology FIT, Germany
Abstract—Structured and unstructured data and facts about drugs, genes, proteins, viruses, and their mechanisms are spread across
a huge number of scientific articles. These articles are a large-scale knowledge source and can have a huge impact in disseminating
knowledge about the mechanisms of certain biological processes. A knowledge graph (KG) can be constructed by integrating such facts and
data and can be used for data integration, exploration, and federated queries. However, exploring and querying large-scale KGs is tedious
for certain groups of users due to a lack of knowledge about the underlying data assets or semantic technologies. A question answering (QA)
system allows natural language questions to be answered automatically over a KG using the triples it contains. Recently, the use and adoption
of digital assistants has been growing owing to their ability to let users control smart systems or devices through voice commands. This
paper is about using Amazon Alexa’s voice-enabled interface for QA over KGs. As a proof-of-concept, we use the well-known DisGeNET
KG, which contains knowledge covering 1.13 million gene-disease associations between 21,671 genes and 30,170 diseases, disorders,
and clinical or abnormal human phenotypes. Our study shows how Alexa could help to find facts about certain biological entities
from large-scale knowledge bases.
Index Terms—Question answering, Knowledge graphs, Ontology, Semantic web, Bioinformatics, Digital assistants, Amazon Alexa.
1 INTRODUCTION
Domain experts are often interested in gathering and
comprehending knowledge about the mechanisms of certain
biological processes, e.g., diseases, in order to design
strategies for prevention and to support therapeutic decision
making. “Knowledge is something that is known and can be
written down” [1]. Knowledge containing simple statements,
e.g., “TP53 is an oncogene”, or quantified statements, such as
“All oncogenes are responsible for cancer”, can be extracted from
structured sources such as knowledge or rule bases. Moreover,
knowledge can be extracted from external sources like
scientific articles, where a KG could be an effective means to
capture facts from heterogeneous data sources. For example,
scientific literature and patents provide a huge treasure
of structured and unstructured information about different
biological entities. One prominent example is PubMed,
which contains millions of scientific articles and is a great
source of knowledge in the biomedical domain [2]. PubMed data
are mostly unstructured and heterogeneous, which makes the
knowledge extraction process very challenging.
The problem of semantic heterogeneity is further compounded
by the flexibility of semi-structured data and the
various tagging methods applied to documents or unstructured
data. Semantic Web (SW) technologies offer functionality
to connect previously isolated pieces of data and knowledge,
associate meaning with them, and represent the knowledge
extracted from them. In particular, ontology-based named
entity extraction and disambiguation help with the unambiguous
identification of entities in heterogeneous data and the
assertion of applicable named relationships that connect these
entities. The SW addresses data variety
by proposing graphs as a unifying data model, to which
data can be mapped in the form of a graph structure.
A graph may contain not only data but also metadata
and domain knowledge (ontologies containing axioms or
rules), all in the same uniform structure; such graphs are
called knowledge graphs (KGs) [3], [4].
A simple statement can be accumulated as an edge in
a KG, while quantified statements provide a more expressive
way to represent knowledge, which however requires
ontologies [4]. Hogan et al. [4] defined a KG as a graph of
data intended to accumulate and convey knowledge of the real
world, whose nodes represent entities of interest and whose edges
represent potentially different relations between these entities.
Nodes in a KG represent entities and edges represent binary
relations between those entities [4]. A KG can be defined
as $G = \{E, R, T\}$, where $G$ is a labelled and directed
multi-graph, and $E$, $R$, $T$ are the sets of entities, relations,
and triples, respectively. A triple can be represented as
$(u, e, v) \in T$, where $u \in E$ is the head node, $v \in E$ is the
tail node, and $e \in R$ is the edge connecting $u$ and $v$ [4].
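The definition above can be made concrete with a minimal sketch. It assumes that the entity set E and the relation set R are simply derived from the triple set T; the `Triple` type, the `build_kg` helper, and the toy identifiers are hypothetical and are not taken from any real KG.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # u, an entity in E
    relation: str  # e, a relation in R
    tail: str      # v, an entity in E


def build_kg(triples):
    """Return (E, R, T) for a labelled, directed multi-graph."""
    T = set(triples)
    E = {t.head for t in T} | {t.tail for t in T}
    R = {t.relation for t in T}
    return E, R, T


# Toy triples with placeholder names (illustrative assumption only).
toy = [
    Triple("gene:G1", "associated_with", "disease:D1"),
    Triple("gene:G1", "associated_with", "disease:D2"),
    # A second edge between the same pair of nodes with a different label:
    # this is what makes G a multi-graph.
    Triple("gene:G1", "studied_in", "disease:D1"),
]
E, R, T = build_kg(toy)
```

Every triple in `T` then satisfies the membership conditions of the definition: its head and tail are in `E` and its relation is in `R`.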
However, building a domain KG has several core requirements,
such as a formal conceptualization to indicate the
logical design of the KG, depicted by a specific, predefined
domain-specific ontology, and the modelling of domain
knowledge, represented by semantically interrelated entities
and relations [5]. Ontologies are semantic data models that
define the types of things that exist in a domain and the
properties that can be used to describe them, including
the relationships between them [4]. An ontology not only
defines the relationships between concepts [6], but also provides
a formal representation of domain-specific entities [7].
arXiv:2210.06040v1 [cs.AI] 12 Oct 2022
Information extraction (IE) is the process of automatically
extracting structured knowledge and facts from
unstructured and/or semi-structured documents or electronically
represented sources [8]. IE is typically divided
into named entity recognition (NER), entity linking, and
relation extraction. Relation extraction also involves relation
classification, which is typically formulated as a classification
problem to classify the relationship between the
entities identified in the text [9]. A classifier takes a piece
of text and two entities as inputs and predicts possible
relations between the entities as output. Once instances
are extracted, they can be stored as Resource Description
Framework (RDF)¹ triples, where each triple represents one
extracted statement as a connected component of the KG. A number of
languages have been proposed for querying graph data [4],
including the SPARQL query language² for RDF, and the Cypher Query
Language³ and the Gremlin Query Language⁴ for property graphs.
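As an illustration of how extracted relations can be stored as (subject, predicate, object) triples and then queried, the following sketch matches a single triple pattern, which is roughly what one pattern in a SPARQL `WHERE` clause does. It is a plain-Python approximation, not a SPARQL engine; the `match` helper, the `ex:` prefix, and all identifiers are hypothetical.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard,
    playing the role of a variable in a SPARQL triple pattern."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


# Hypothetical output of a relation classifier, already in triple form:
# (entity 1, predicted relation, entity 2).
extracted = [
    ("ex:geneA", "ex:associatedWith", "ex:diseaseX"),
    ("ex:geneB", "ex:associatedWith", "ex:diseaseX"),
    ("ex:geneA", "ex:associatedWith", "ex:diseaseY"),
]

# Roughly: SELECT ?gene WHERE { ?gene ex:associatedWith ex:diseaseX }
genes = [s for (s, _, _) in match(extracted, p="ex:associatedWith", o="ex:diseaseX")]
```

A real deployment would load the triples into an RDF store and issue the SPARQL query directly; the list comprehension only mirrors the matching semantics on a toy scale.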
Reasoning over KGs enables consistency checking to
recognize conflicting facts, classification by defining taxonomies,
and deductive inferencing by revealing implicit
knowledge from a set of facts [10]. Further, deductive reasoning
can be used to entail extended knowledge such as “TP53
is responsible for cancer” [11]. However, a large-scale KG can
have billions of linked entities expressing their relationships,
where each node represents an entity and each edge signifies
a semantic relationship between entities [12]. This makes
the exploration, processing, and analysis of large-scale KGs
a great challenge for current computational methods. To
provide cancer diagnosis reasoning over DNN models,
an integrated domain-specific KG is required, which is subject
to the availability of an efficient NLP-based information
extraction method and a domain-specific ontology [11].
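The deductive step mentioned above, from “TP53 is an oncogene” and “All oncogenes are responsible for cancer” to “TP53 is responsible for cancer”, can be sketched as naive forward chaining. This is only an illustration under simplifying assumptions: rules are restricted to the form "if (x, p1, o1) then (x, p2, o2)", and the `infer` helper and predicate names are hypothetical. An ontology reasoner would be used in practice.

```python
def infer(facts, rules):
    """Forward-chain until no new facts can be derived.

    facts: iterable of (subject, predicate, object) triples.
    rules: list of ((p1, o1), (p2, o2)), read as
           "for all x: if (x, p1, o1) holds, then (x, p2, o2) holds".
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (p1, o1), (p2, o2) in rules:
            for (s, p, o) in list(facts):
                if p == p1 and o == o1 and (s, p2, o2) not in facts:
                    facts.add((s, p2, o2))
                    changed = True
    return facts


# Simple statement from the text, stored as an edge.
facts = {("TP53", "is_a", "oncogene")}
# Quantified statement from the text, encoded as a rule.
rules = [(("is_a", "oncogene"), ("responsible_for", "cancer"))]

closed = infer(facts, rules)
```

The closure contains the original fact plus the entailed triple `("TP53", "responsible_for", "cancer")`, which was never stated explicitly.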
In this paper, we report a case study of using digital
assistants, in particular Amazon Alexa’s voice-enabled
interface, for QA over KGs. We use the well-known DisGeNET
KG, which contains knowledge covering 1.13 million
gene-disease associations between 21,671 genes and
30,170 diseases, disorders, and clinical or abnormal human
phenotypes. Our study shows how Alexa could help
to find facts about certain biological entities from
large-scale knowledge bases. The rest of the paper is structured
as follows: Section 2 critically reviews related works.
Section 3 describes the proposed approach in detail, covering
construction of black-box models, training, interpreting the
black-box model, and generation of decision rules and local
explanations. Section 4 illustrates experiment results,
including a comparative analysis with baseline models on all
datasets. Section 5 summarizes this research with potential
limitations and points to some possible outlook.
2 RELATED WORKS
Research initiatives are gradually adopting SW technologies
[13] such as KGs, knowledge bases (KBs), and domain-specific
ontologies as the means of building structured networks
of interconnected knowledge [10]. Hogan et al. [4]
have provided a comprehensive review of articles on KGs,
covering knowledge graph creation, enrichment, quality
assessment, and refinement. Apart from this literature, which
focuses on theoretical concepts, several large-scale KGs
have been constructed either by manual annotation,
crowd-sourcing (e.g., DBpedia), or automatic extraction from
unstructured data (e.g., YAGO⁵) [14], targeting KG analytics
for specific use cases. Life sciences is an early adopter of
SW technologies. Scientific communities have focused on
constructing large-scale KGs for life science research. For
example, Bio2RDF⁶ and PubMed KG [2] were developed to
accelerate bioinformatics research. The former integrates 35 life
sciences datasets such as dbSNP, GenAge, GenDR, LSR, OrphaNet,
PubMed, SIDER, and WormBase, contributing 11 billion RDF triples.
¹ In RDF, the linking structure forms a directed graph
and triples are represented in the form of (subject, predicate, object).
² SPARQL is the protocol/language to query RDF, which
allows querying not only over graph data but also between
disparate graphs. Link: https://www.w3.org/TR/sparql11-query/
³ https://neo4j.com/developer/cypher/
⁴ https://docs.janusgraph.org/basics/gremlin/
⁵ https://github.com/yago-naga/yago3
Alshahrani et al. [15] built a biological KG based on
the gene ontology (GO), human phenotype ontology (HPO),
and disease ontology, and then performed feature learning
over the KG. Their method combines knowledge representation
using symbolic logic and automated reasoning with
neural networks to generate embeddings of nodes. The
learned embeddings are used in downstream applications
such as link prediction, finding candidate
genes of diseases, protein-protein interactions, and drug
target relation prediction. Hasan et al. [16] developed a
prototype KG based on the Louisiana Tumor Registry dataset⁷.
Their approach provides scenario-specific querying, schema
evolution for iterative analysis, and data visualization.
Although the resultant KG was found effective for population-level
treatment sequences, it does not provide comprehensive
knowledge about cancer genomics for the majority of
cancer types, as it is built on limited data.
Although numerous works have focused on data integration,
building KGs, and querying in the biological domain, only
a few have focused on question answering
and retrieving information from KGs via a voice-enabled
interface. Other works focus on improving the quality of and
enriching multiple KGs in order to query them via voice-enabled
devices [17]. Haase et al. [18] focused on using Wikidata
to answer factual questions via Alexa. Another work uses
self-attention and cross-attention mechanisms, factoring in
information about KGs, to perform entity alignment, which entails
determining which elements of different graphs refer to the same
“entities”, and shows how this can help voice-enabled
interfaces [19]. The idea is to improve computational efficiency
while at the same time improving performance, speeding up
graph-related tasks such as question answering via Alexa.
3 METHODS
To ease information retrieval from large-scale KGs
with many layers and intricacies, and to develop a
proof-of-concept for Amazon Alexa’s voice-enabled interface,
we use the well-known DisGeNET KG, which contains
knowledge covering 1.13 million gene-disease associations
between 21,671 genes and 30,170 diseases, disorders, and
clinical or abnormal human phenotypes. The database covers
gene-disease associations (GDAs), disease-disease associations
(DDAs), and variant-disease associations (VDAs).
Disease gene identification is a process by which experts
identify the mutant genotype responsible for an inherited
genetic disorder. This dataset provides robust coverage for
our task, but this does not mean that it has RDF triples for
all diseases and genes. Association triples exist
only if the link between a disease and a gene or variant is
supported by evidence from external sources.
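The evidence constraint described above, and the way a spoken answer could be assembled from the resulting triples, can be sketched as follows. This is an assumption-laden illustration, not the system described in this paper: the `Association` type, the helper functions, the predicate name, and all gene, disease, and source identifiers are placeholders, and no Alexa SDK call is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    gene: str
    disease: str
    evidence: tuple  # identifiers of supporting sources; empty = unsupported


def to_triples(candidates):
    """Keep only associations backed by at least one evidence source."""
    return [
        (a.gene, "associated_with", a.disease)
        for a in candidates
        if a.evidence
    ]


def answer_genes_for(disease, triples):
    """Build a plain-text response that a voice interface could read out."""
    genes = sorted({g for (g, _, d) in triples if d.lower() == disease.lower()})
    if not genes:
        return f"I could not find genes associated with {disease}."
    return f"Genes associated with {disease}: {', '.join(genes)}."


# Placeholder candidates (not real DisGeNET records).
candidates = [
    Association("GENE_A", "Disease X", ("source-1",)),
    Association("GENE_B", "Disease X", ()),  # no evidence, so no triple
    Association("GENE_C", "Disease X", ("source-2", "source-3")),
]
triples = to_triples(candidates)
speech = answer_genes_for("disease x", triples)
```

In a voice skill, the disease name would come from a slot filled by the speech recognizer, and the lookup would be a query against the KG endpoint rather than an in-memory list.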
⁶ https://github.com/MaastrichtU-IDS/bio2rdf and PubMed KG [2]
⁷ https://sph.lsuhsc.edu/louisiana-tumor-registry/