
RQ2d How does that compare to the most recent stock-taking,
that of the 2007 review?
RQ3 What is the status of tooling support for developing
multilingual ontologies?
First, the aforementioned increase in research papers
includes numerous approaches for modelling language
information in or for ontologies that go well beyond a simple
language-tagged string, and these may benefit from a systematic
comparison (RQ1). Such a comparison is also needed to
find multilingual ontologies and to assess which
multilingual approaches are actually used and how
many multilingual ontologies there are (RQ2). Their presence, or
absence, may well be linked to the tooling support available to
conveniently develop and maintain them; hence RQ3.
To answer the questions, the first step was to analyse
proposed and related variant approaches to multilingualism in
ontologies. To facilitate comparison, we devised a
visualisation notation for them. Nine main approaches were
identified, which can be grouped into the categories of
multilingual labels, linguistic models, and mapping-based
strategies. For the assessment of ontologies, we consulted
NCBO BioPortal¹, a repository of biomedical ontologies [11],
since there has been wide adoption of (OWL) ontologies in the
biomedical domain since the early 2000s [12, 13, 14], and
Linked Open Vocabularies (LOV)², a curated repository of
RDF and OWL vocabularies that is not limited to any one
specific domain. Further, we looked beyond the mere number
of multilingual ontologies, to also assess how multilingual the
more-than-one-language ontologies are, introducing the notion
of coverage within an ontology and language-specific
completeness. We found that only a few of the multilingual
modelling methods are used, principally just those in the labels
category, and even fewer ontologies can be considered fully
multilingual in the sense that there is more than one primary
language. Tooling for developing and managing multilingual
models is also very limited: of the several relevant tools,
none meets all the requirements specified.
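To make the notion of per-language coverage more tangible, the following is a minimal sketch under stated assumptions, not the measure as defined in this paper. It assumes that the coverage of a language is the fraction of entities that carry at least one label tagged with that language. The function name `language_coverage`, the input structure, and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict


def language_coverage(labels):
    """Return {language tag: fraction of entities with >= 1 label in it}.

    `labels` maps an entity identifier to a list of (text, language_tag)
    pairs. This reading of "coverage" is an assumption for illustration.
    """
    if not labels:
        return {}
    entities_per_language = defaultdict(set)
    for entity, tagged_strings in labels.items():
        for _text, lang in tagged_strings:
            # Language tags are case-insensitive, so normalise them.
            entities_per_language[lang.lower()].add(entity)
    total = len(labels)
    return {
        lang: len(entities) / total
        for lang, entities in entities_per_language.items()
    }


# Hypothetical toy ontology with three classes.
toy = {
    "ex:Heart": [("heart", "en"), ("coeur", "fr")],
    "ex:Lung": [("lung", "en")],
    "ex:Liver": [("liver", "en"), ("foie", "fr"), ("Leber", "de")],
}
coverage = language_coverage(toy)
```

In this toy input every class has an English label, two of three have a French one, and one of three has a German one, so an ontology can count as multilingual while being far from complete in its secondary languages.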
In the remainder of this paper, we first describe related work
and key reviews of RDF datasets and knowledge graphs in
Section 2. The ways to model multilinguality are described and
compared in Section 3. This is followed by the BioPortal and
LOV evaluations in Section 4, and the tools assessment in
Section 5. A discussion of the reviews and specific answers to
the research questions is provided in Section 6. The paper
concludes with Section 7.
2. Related Work
As noted in the Introduction, counts of multilingual
ontologies have been sampled only sparsely, limited to [8, 9, 10].
We therefore cast the net wider to also include the literature on
RDF datasets and knowledge graphs pertaining to multilingualism.
The studies with such data are also included in Figure 1 and
are briefly discussed here.
¹ https://bioportal.bioontology.org/ontologies
² https://lov.linkeddata.es/dataset/lov/
Ell et al. [15] conducted a review of the Billion Triple
Challenge (BTC) 2010 corpus, which contained over 3 billion
triples, including metadata on the source of the resource from
which each triple was crawled. Excluding metadata, 1.4 billion
distinct triples remained, of which 0.7% were
identified as containing two or more language tags [15]. This was
followed up in 2018 with a cross-sectional study of labels
across seven datasets [16], including, among others, the
BTC 2010 corpus (with the metadata), a 2014 version of BTC
(4 billion triples), and a 2017 version of Wikidata (2 billion
triples). BTC 2014 was found to support 183 languages (up
from 2010’s 55) and the Wikidata dataset was found to support
424 languages [16]. That is, languages other than English
are certainly being used in the Semantic Web. Due to the way
the results were reported, however, it is not possible to
determine accurately the percentage of triples
containing two or more multilingual labels, but it is
approximately 6% for BTC 2014 and 40% for Wikidata³.
Coverage was measured for all languages together, rather than
per language.
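As an illustration of how such counts can be obtained from a serialised dataset, the sketch below extracts language tags from N-Triples lines and computes the share of subjects whose literals span two or more languages. It is a simplified sketch, not the procedure used in [15] or [16]: it counts per subject rather than per triple, compares only the primary language subtag (so `en` and `en-GB` count as one language), and skips blank-node subjects. The function names and the `example.org` IRIs are hypothetical.

```python
import re
from collections import defaultdict

# Subject IRI, predicate IRI, then a quoted literal followed by a language
# tag. The tag pattern follows the N-Triples LANGTAG production.
TRIPLE_WITH_LANGTAG = re.compile(
    r'^\s*<([^>]*)>\s+<[^>]*>\s+'
    r'"(?:[^"\\]|\\.)*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)\s*\.\s*$'
)


def languages_per_subject(lines):
    """Map each subject IRI to the set of primary language subtags used."""
    found = defaultdict(set)
    for line in lines:
        match = TRIPLE_WITH_LANGTAG.match(line)
        if match:
            subject, tag = match.groups()
            # Keep only the primary subtag, e.g. "en" from "en-GB".
            found[subject].add(tag.split("-")[0].lower())
    return dict(found)


def share_multilingual(lines):
    """Fraction of language-tagged subjects with two or more languages."""
    per_subject = languages_per_subject(lines)
    if not per_subject:
        return 0.0
    multilingual = sum(1 for tags in per_subject.values() if len(tags) >= 2)
    return multilingual / len(per_subject)


LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
sample = [
    f'<http://example.org/a> {LABEL} "heart"@en .',
    f'<http://example.org/a> {LABEL} "coeur"@fr .',
    f'<http://example.org/b> {LABEL} "lung"@en .',
    f'<http://example.org/b> {LABEL} "pulmonary organ"@en-GB .',
    f'<http://example.org/c> {LABEL} "untagged" .',
]
share = share_multilingual(sample)
```

Whether regional variants are treated as separate languages, and whether untagged literals enter the denominator, changes the resulting percentage, so any such figure should state these choices explicitly.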
Other dataset and knowledge graph assessments also
indicate substantial use of multiple languages. Notably, the
2015-10 version of Wikidata was found to support 395
languages in 749 million triples, the 2015-04 version of
DBpedia supported 13 languages with 412 million triples, and
YAGO 3 (with the 2014 Wikipedia dump), a version of Yet
Another Great Ontology (YAGO), which integrates
statements of the different language versions of Wikipedia,
was found to support 326 languages in 1 billion triples [17].
Completeness was discussed as a data quality dimension
concerning whether the knowledge graph was suited to the task
at hand, within the context of its data. The coverage of
selected languages
(English, French, German, Spanish, and Italian) was
considered relative to the annotations: German and French had
a coverage of over 30% in Wikidata, and German of over 10%
in YAGO.
Given the increase in data in multiple languages and an
increase in multilingualism over the years, one may expect an
increase in multilingual ontologies as well.
3. Modelling Options for Multilingual Ontologies
Several options to develop multilingual ontologies have been
identified [5, 18, 19]:
1. Multilingual labels: the ontology vocabulary is annotated
with language-tagged strings.
2. Linguistic models: the ontology is associated with a
linguistic model external to it.
3. Mapping models: a common interlingua that can be
specified between monolingual ontologies.
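The first option can be sketched as a simple data structure: each vocabulary element carries a set of language-tagged strings, and a consumer picks a label according to a language preference order. This is an illustrative sketch only. The `Label` type, the `preferred_label` function, and the `ex:` identifiers are hypothetical and do not correspond to any specific ontology language or tool.

```python
from typing import NamedTuple, Optional, Sequence


class Label(NamedTuple):
    entity: str  # identifier of the annotated class or property
    text: str    # the human-readable string
    lang: str    # language tag, e.g. "en" or "fr"


def preferred_label(
    labels: Sequence[Label], entity: str, preferences: Sequence[str]
) -> Optional[str]:
    """Return the first label of `entity` matching the preference order."""
    for wanted in preferences:
        for label in labels:
            if label.entity == entity and label.lang.lower() == wanted.lower():
                return label.text
    return None  # no label in any of the preferred languages


# Hypothetical annotations on two classes.
labels = [
    Label("ex:Heart", "heart", "en"),
    Label("ex:Heart", "coeur", "fr"),
    Label("ex:Lung", "lung", "en"),
]
heart_label = preferred_label(labels, "ex:Heart", ["de", "fr", "en"])
lung_label = preferred_label(labels, "ex:Lung", ["de", "fr"])
```

Here the lookup for `ex:Heart` falls back from German to French, while `ex:Lung` has no label in either preferred language, which is the kind of gap that a per-language assessment would expose.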
³ The data were reported in groups of 2, 2–5, 5–10, and >10. The group of 2
was ignored as it is included in 2–5. The sum of the groups for BTC 2014 was
6.3% and for Wikidata, 4.4%. Both values were rounded down due to the
unavoidable double-counting of 5.