sification problem to classify the relationship between the
entities identified in the text [9]. A classifier takes a piece
of text and two entities as inputs and predicts possible
relations between the entities as output. Once instances
are extracted, they can be stored as Resource Description
Framework (RDF)¹ triples, where each triple forms a connected
component of a sentence for the KG. A number of
languages have been proposed for querying RDF and other
graph data [4], including the SPARQL query language²,
the Cypher query language³, and the Gremlin query language⁴.
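As a minimal illustration of the (subject, predicate, object) form and of
pattern-based querying, the following sketch stores a few triples in memory
and matches them against a pattern in which None plays the role of a query
variable, loosely analogous to a basic graph pattern in SPARQL. The
identifiers (gene:TP53, ex:associatedWith, and so on) and the helper names
are hypothetical and serve only as an example; a real system would rely on
an RDF store and a query engine.

```python
from typing import Iterable, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical toy facts; the identifiers are illustrative only.
TRIPLES: Set[Triple] = {
    ("gene:TP53", "rdf:type", "ex:Gene"),
    ("gene:BRCA1", "rdf:type", "ex:Gene"),
    ("gene:TP53", "ex:associatedWith", "disease:LiFraumeniSyndrome"),
    ("gene:BRCA1", "ex:associatedWith", "disease:BreastCancer"),
}


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the pattern; None is a wildcard."""
    for subj, pred, obj in triples:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)


def diseases_for(gene: str) -> List[str]:
    """Return the sorted objects of all ex:associatedWith triples for a gene."""
    return sorted(obj for _, _, obj in match(TRIPLES, s=gene, p="ex:associatedWith"))


if __name__ == "__main__":
    print(diseases_for("gene:TP53"))
```

Leaving a position as None corresponds to leaving a variable unbound in a
query, so the same function serves lookups by subject, predicate, or object.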
Reasoning over KGs enables consistency checking to
recognize conflicting facts, classification by defining tax-
onomies, and deductive inferencing by revealing implicit
knowledge from a set of facts [10]. Further, deductive reasoning
can be used to entail extended knowledge such as “TP53
is responsible for cancer” [11]. However, a large-scale KG can
have billions of linked entities expressing their relationships,
where each node represents an entity and each edge signifies
a semantic relationship between entities [12]. This makes
the exploration, processing, and analysis of large-scale KGs
a great challenge for current computational methods. To
provide cancer diagnosis reasoning over the DNN models,
an integrated domain-specific KG is required, which is sub-
ject to the availability of an efficient NLP-based information
extraction method and a domain-specific ontology [11].
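The reasoning tasks mentioned above, deductive inference and consistency
checking, can be sketched on a toy scale as follows. The two rules
(transitivity of subClassOf, and propagation of associatedWith along
subClassOf), the predicate names, and the facts are assumptions made purely
for illustration; they are not taken from any ontology or from the cited
works, and a production system would use a dedicated reasoner.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def infer(facts: Set[Triple]) -> Set[Triple]:
    """Forward-chain two toy rules until no new triple is produced.

    Rule 1: (a subClassOf b), (b subClassOf c)     => (a subClassOf c)
    Rule 2: (g associatedWith d), (d subClassOf c) => (g associatedWith c)
    """
    closed = set(facts)
    while True:
        new: Set[Triple] = set()
        for a, p1, b in closed:
            if p1 not in ("subClassOf", "associatedWith"):
                continue
            for c, p2, d in closed:
                if p2 == "subClassOf" and c == b:
                    new.add((a, p1, d))
        if new <= closed:  # fixed point reached
            return closed
        closed |= new


def conflicts(facts: Set[Triple]) -> Set[Tuple[str, str]]:
    """Return (subject, object) pairs asserted both positively and negatively."""
    positive = {(s, o) for s, p, o in facts if p == "associatedWith"}
    negative = {(s, o) for s, p, o in facts if p == "notAssociatedWith"}
    return positive & negative


# Hypothetical facts used only to exercise the rules.
FACTS: Set[Triple] = {
    ("TP53", "associatedWith", "LiFraumeniSyndrome"),
    ("LiFraumeniSyndrome", "subClassOf", "HereditaryCancerSyndrome"),
    ("HereditaryCancerSyndrome", "subClassOf", "Cancer"),
}

if __name__ == "__main__":
    closure = infer(FACTS)
    print(("TP53", "associatedWith", "Cancer") in closure)
```

The nested loop is quadratic in the number of triples per iteration, which
is acceptable for a sketch but hints at why reasoning over KGs with billions
of edges is computationally demanding.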
In this paper, we report a case study of using digital
assistants, in particular Amazon Alexa’s voice-enabled
interface, for QA over KGs. We use the well-known DisGeNET
KG, which contains knowledge covering 1.13 million
gene-disease associations between 21,671 genes and
30,170 diseases, disorders, and clinical or abnormal human
phenotypes. Our study shows how Alexa can help
find facts about certain biological entities in large-scale
knowledge bases. The rest of the paper is structured
as follows: Section 2 critically reviews related work. Section 3
describes the proposed approach in detail, covering
construction of black-box models, training, interpreting the
black-box model, and generation of decision rules and local
explanations. Section 4 presents experimental results,
including a comparative analysis with baseline models on all
datasets. Section 5 summarizes this research, discusses potential
limitations, and outlines a possible outlook.
2 RELATED WORKS
Research initiatives are gradually adopting SW technolo-
gies [13] such as KGs, knowledge bases (KBs), and domain-
specific ontologies as the means of building structured net-
works of interconnected knowledge [10]. Hogan et al. [4]
have provided a comprehensive review of articles on KGs,
covering knowledge graph creation, enrichment, quality
assessment, and refinement. Apart from this literature, which
focuses on theoretical concepts, several large-scale KGs
have been constructed by manual annotation, crowdsourcing
(e.g., DBpedia), or automatic extraction from unstructured
data (e.g., YAGO⁵) [14], targeting KG analytics for specific
use cases. The life sciences were an early adopter of SW
technologies, and scientific communities have focused on
constructing large-scale KGs for life science research. For
example, Bio2RDF⁶ and PubMed KG [2] were developed to
accelerate bioinformatics research. The former integrates 35
life sciences datasets such as dbSNP, GenAge, GenDR, LSR,
OrphaNet, PubMed, SIDER, and WormBase, contributing 11
billion RDF triples.
¹ In RDF, the linking structure forms a directed graph, and
triples are represented in the form (subject, predicate, object).
² SPARQL is the protocol and language for querying RDF; it
allows querying not only over graph data but also across
disparate graphs. Link: https://www.w3.org/TR/sparql11-query/
³ https://neo4j.com/developer/cypher/
⁴ https://docs.janusgraph.org/basics/gremlin/
⁵ https://github.com/yago-naga/yago3
Alshahrani et al. [15] built a biological KG based on
the gene ontology (GO), human phenotype ontology (HPO),
and disease ontology. Then they performed feature learning
over the KG. Their method combines knowledge represen-
tation using symbolic logic and automated reasoning, with
neural networks to generate embeddings of nodes. The
learned embeddings are used in downstream application
development such as link prediction, finding candidate
genes of diseases, protein-protein interactions, and drug
target relation prediction. Hasan et al. [16] developed a prototype
KG based on the Louisiana Tumor Registry dataset⁷.
Their approach provides scenario-specific querying, schema
evolution for iterative analysis, and data visualization.
Although the resulting KG was found effective for analyzing
population-level treatment sequences, it does not provide
comprehensive knowledge about cancer genomics for the
majority of cancer types, as it is built on limited data.
Although numerous works have focused on data integration,
KG construction, and querying in the biological domain, only
a few have addressed question answering and information
retrieval from KGs via voice-enabled interfaces.
Other works focus on improving the quality of and
enriching multiple KGs in order to query them via voice-enabled
devices [17]. Haase et al. [18] focused on using Wikidata
to answer factual questions via Alexa. Self-attention
and cross-attention mechanisms that factor in information
about KGs have been used to perform entity alignment, which
entails determining which elements of different graphs refer
to the same “entities”, and this has been shown to be helpful
for voice-enabled interfaces [19]. The idea is to improve
computational efficiency while also improving performance,
thereby speeding up graph-related tasks such as question
answering via Alexa.
3 METHODS
To ease information retrieval from large-scale KGs
with many layers and intricacies, and to develop a
proof of concept for Amazon Alexa’s voice-enabled interface,
we use the well-known DisGeNET KG, which contains
knowledge covering 1.13 million gene-disease associations
between 21,671 genes and 30,170 diseases, disorders, and
clinical or abnormal human phenotypes. The database covers
gene-disease associations (GDAs), disease-disease associations
(DDAs), and variant-disease associations (VDAs).
Disease gene identification is a process by which experts
identify the mutant genotype responsible for an inherited
genetic disorder. This dataset provides robust coverage for
our task, but it does not contain RDF triples for every
known disease and gene: an association triple exists
only if the link between a disease and a gene or variant is
supported by evidence from external sources.
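To make the intended interaction concrete, the following sketch shows how a
voice request naming a gene could be turned into a spoken answer that lists
only evidence-supported associations. It is a simplified illustration under
stated assumptions, not the implementation used in this study: the intent
name GeneDiseaseIntent, the slot name gene, and the in-memory GDA_INDEX with
its records are hypothetical, and the index stands in for a query against
the KG. Only the outer request and response fields follow the JSON envelope
of the Alexa Skills Kit.

```python
from typing import Dict, List

# Hypothetical stand-in for gene-disease associations (GDAs) retrieved from
# the KG. Records and source labels are illustrative, not DisGeNET content.
GDA_INDEX: Dict[str, List[dict]] = {
    "TP53": [
        {"disease": "Li-Fraumeni syndrome", "sources": ["example-source"]},
        {"disease": "unsupported example", "sources": []},  # no evidence
    ],
}


def supported_diseases(gene: str) -> List[str]:
    """Return diseases for a gene, keeping only evidence-backed records."""
    records = GDA_INDEX.get(gene.upper(), [])
    return [r["disease"] for r in records if r["sources"]]


def build_response(text: str) -> dict:
    """Wrap plain text in an Alexa-style JSON response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }


def handle_request(event: dict) -> dict:
    """Answer a (hypothetical) GeneDiseaseIntent carrying a 'gene' slot."""
    request = event.get("request", {})
    intent = request.get("intent", {})
    if request.get("type") != "IntentRequest" or intent.get("name") != "GeneDiseaseIntent":
        return build_response("Sorry, I can only answer gene to disease questions.")

    gene = (intent.get("slots", {}).get("gene", {}).get("value") or "").strip()
    diseases = supported_diseases(gene)
    if not diseases:
        return build_response(
            f"I found no supported associations for {gene or 'that gene'}."
        )
    return build_response(f"{gene.upper()} is associated with {', '.join(diseases)}.")


if __name__ == "__main__":
    demo = {
        "request": {
            "type": "IntentRequest",
            "intent": {
                "name": "GeneDiseaseIntent",
                "slots": {"gene": {"name": "gene", "value": "tp53"}},
            },
        }
    }
    print(handle_request(demo)["response"]["outputSpeech"]["text"])
```

Filtering on the evidence list mirrors the constraint stated above: an
association is reported only when at least one external source supports it.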
⁶ https://github.com/MaastrichtU-IDS/bio2rdf (Bio2RDF; for PubMed KG, see [2])
⁷ https://sph.lsuhsc.edu/louisiana-tumor-registry/