A Review of Multilingualism in and for Ontologies
Frances Gillis-Webberᵃ, C. Maria Keetᵃ
ᵃ Department of Computer Science, University of Cape Town, Cape Town, 7701, South Africa
Abstract
The Multilingual Semantic Web has been in focus for over a decade. Multilingualism in Linked Data and RDF has shown
substantial adoption, but this is unclear for ontologies since the last review 15 years ago. One of the design goals for OWL was
internationalisation, with the aim that an ontology is usable across languages and cultures. Much research to improve on
multilingual ontologies has taken place in the meantime, and presumably multilingual linked data could use multilingual
ontologies. Therefore, this review seeks to (i) elucidate and compare the modelling options for multilingual ontologies, (ii)
examine extant ontologies for their multilingualism, and (iii) evaluate ontology editors for their ability to manage a multilingual
ontology.
Nine different principal approaches for modelling multilinguality in ontologies were identified, which fall into one of the
following categories: using multilingual labels, linguistic models, or a mapping-based approach. They are compared on design by
means of an ad hoc visualisation mode of modelling multilingual information for ontologies, shortcomings, and what issues they
aim to solve. For the ontologies, we extracted production-level and accessible ontologies from BioPortal and the LOV repositories,
which had, at best, 6.77% and 15.74% multilingual ontologies, respectively, where most of them have only partial translations
and they all use a labels-based approach only. Based on a set of nine tool requirements for managing multilingual ontologies, the
assessment of seven relevant ontology editors showed that there are significant gaps in tooling support, with VocBench 3 coming closest
to meeting them all. This stock-taking may function as a new baseline and motivate new research directions for multilingual
ontologies.
Keywords: Multilingual ontologies, Multilingual ontology management, OWL
1. Introduction
The Semantic Web was envisaged as an extension of the
World Wide Web [1]. To support this vision, several languages
were developed, notably Resource Description Framework
(RDF) and Web Ontology Language (OWL). We focus on the
latter and in particular its internationalisation goal [2, 3], with
multilingualism as a common component in an interconnected
world.
The Semantic Web landscape has grown substantially since
the 2000s, and its multilinguality since circa 2010, with the
‘Multilingual Semantic Web’ (MSW) and ‘Multilingual Web
of Data’ (MWD) being two widely used terms in the literature
[4, 5, 6, 7]. The indexed research literature illustrates this: a
Google Scholar search for the date range 2001–2005 returned
14 hits for MSW and zero for MWD; for 2006–2010, this
increased to 27 hits for MSW, still with zero for MWD; for
2011–2015, it rose sharply to 296 for both terms
combined, with another 298 for the period 2016–2020. This
increase is not as evident for the various stock-takings of
multilingual ontologies and related artefacts specifically
[8, 9, 10], as shown in Figure 1. Those 2004 and 2007 reviews
on ontologies [8, 9] focussed on language-tagged strings,
which were practically the only option available in the early
years of the Semantic Web, and do not mention the extent of
label coverage within an ontology for each natural language,
relative to its total class and property axiom count. The very
recent review of BioPortal-indexed ontologies [10] considered
only domain ontologies with ‘production’ status, on limited
metrics.
Corresponding author
Email addresses: fgilliswebber@cs.uct.ac.za (Frances
Gillis-Webber), mkeet@cs.uct.ac.za (C. Maria Keet)
This raises several questions, including whether the increase
in research on the Multilingual Semantic Web has translated to
an increase in the ratio of multilingual ontologies in a
repository, compared to the latest review in 2007, as it did for
support for natural languages in RDF datasets and knowledge
graphs (elaborated on below in Section 2). With this review we
seek to answer the following main questions (motivated
afterward):
RQ1 What are the available modelling options to develop a
multilingual ontology?
RQ2 Regarding extant ontologies and multilingual ontologies:
RQ2a What are the modelling approach(es) used for
extant multilingual ontologies?
RQ2b What is the percentage of multilingual OWL
ontologies compared to monolingual ones (in
popular repositories)?
RQ2c What is the percentage of natural language
completeness of each multilingual ontology?
RQ2d How does that compare to the latest stock-taking,
that of the 2007 review?
RQ3 What is the status of tooling support for developing
multilingual ontologies?
Preprint submitted to TBA October 7, 2022
arXiv:2210.02807v1 [cs.AI] 6 Oct 2022
First, the aforementioned increase in research papers does
include numerous approaches for modelling language
information in or for ontologies well beyond a simple
language-tagged string, which may benefit from a systematic
comparison (RQ1). This is also needed in order to be able to
find multilingual ontologies and to assess them (RQ2) on
which multilingual approaches are actually used, and how
many multilingual ontologies there are. Their presence, or
absence, may well be linked to tooling support to be able to
conveniently develop and maintain them, and, hence, RQ3.
To answer the questions, the first step was to analyse
proposed and related variant approaches to multilingualism in
ontologies. To facilitate comparison, we devised a
visualisation notation for them. Nine main approaches were
identified, which can be grouped into the categories of
multilingual labels, linguistic models, and mapping-based
strategies. For the assessment of ontologies, we consulted
NCBO BioPortal¹, a repository of biomedical ontologies [11],
since there has been wide adoption of (OWL) ontologies in the
biomedical domain since the early 2000s [12, 13, 14], and
Linked Open Vocabularies (LOV)², a curated repository of
RDF and OWL vocabularies that is not limited to any one
specific domain. Further, we looked beyond the mere number
of multilingual ontologies, to also assess how multilingual the
more-than-one-language ontologies are, introducing the notion
of coverage within an ontology and language-specific
completeness. Only a few of the multilingual modelling
methods are used, principally just in the labels category, and
even fewer ontologies can be considered fully multilingual, where there is
more than one primary language. Tooling for developing and
managing multilingual models is also very limited: of
the relevant tools assessed, none meets all the requirements
specified.
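To make the notion of language-specific completeness concrete, the following Python sketch computes, per language tag, the percentage of entities that carry at least one label in that language. It is one plausible reading under stated assumptions, not necessarily the exact metric used in the evaluation: the function name `language_completeness`, the input shapes, and the toy entities are all hypothetical.

```python
from collections import defaultdict


def language_completeness(entities, labels):
    """Return {language_tag: percentage of entities labelled in that language}.

    entities: iterable of entity identifiers (classes, properties, ...).
    labels:   iterable of (entity, text, language_tag) triples.
    """
    entities = set(entities)
    if not entities:
        return {}
    labelled = defaultdict(set)  # language tag -> entities labelled in it
    for entity, _text, lang in labels:
        # Ignore untagged labels and labels on unknown entities.
        if entity in entities and lang:
            labelled[lang.lower()].add(entity)
    return {lang: 100.0 * len(ents) / len(entities)
            for lang, ents in labelled.items()}


# Hypothetical toy ontology: fully labelled in English, partly in Dutch.
entities = [":Person", ":Organisation", ":memberOf"]
labels = [
    (":Person", "Person", "en"),
    (":Person", "Persoon", "nl"),
    (":Organisation", "Organisation", "en"),
    (":memberOf", "member of", "en"),
]
completeness = language_completeness(entities, labels)
```

In this toy case English covers all three entities and Dutch only one, which is the kind of partial translation the assessment is meant to expose.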
In the remainder of this paper, we first describe related work
and key reviews of RDF datasets and knowledge graphs in
Section 2. The ways to model multilinguality are described and
compared in Section 3. This is followed by the BioPortal and
LOV evaluations in Section 4, and the tools assessment in
Section 5. A discussion of the reviews and specific answers to
the research questions is provided in Section 6. The paper
concludes with Section 7.
2. Related Work
As noted in the Introduction, the sampling of counts of
multilingual ontologies has been sparse, with just [8, 9, 10].
We therefore cast the net wider to also include RDF datasets
and knowledge graph literature pertaining to multilingualism.
The ones with such data are included also in Figure 1 and they
are briefly discussed here.
¹ https://bioportal.bioontology.org/ontologies
² https://lov.linkeddata.es/dataset/lov/
Ell et al. [15] conducted a review on the Billion Triple
Challenge (BTC) 2010 corpus, which contained over 3 billion
triples, including metadata on the source of the resource the
triple was crawled from. Excluding metadata, 1.4 billion
distinct triples remained and of these triples, 0.7% were
identified to contain two or more language tags [15]. This was
followed-up in 2018 with a cross-sectional study of labels
across seven datasets [16], including, among others, the
BTC 2010 corpus (with the metadata), a 2014 version of BTC
(4 billion triples), and a 2017 version of Wikidata (2 billion
triples). BTC 2014 was found to support 183 languages (up
from 2010’s 55) and the Wikidata dataset was found to support
424 languages [16]. That is: languages other than English
certainly are being used in the Semantic Web. Due to the way
the results were reported, however, it is not possible to
determine an accurate percentage of the number of triples
containing two or more multilingual labels, but it is
approximately 6% for BTC 2014 and 40% for Wikidata³.
Coverage was measured for languages together, rather than by
language.
Other dataset and knowledge graph assessments also
indicate substantial use of multiple languages. Notably, the
2015-10 version of Wikidata was found to support 395
languages in 749 million triples, the 2015-04 version of
DBpedia supported 13 languages with 412 million triples, and
YAGO 3 (Yet Another Great Ontology, which integrates
statements from the different language versions of Wikipedia;
built with the 2014 Wikipedia dump) was found to
support 326 languages in 1 billion triples [17]. Completeness
was discussed as a data quality dimension on whether the
knowledge graph was suited to the task at hand, within the
context of its data. The coverage of selected languages
(English, French, German, Spanish, and Italian) was
considered relative to the annotations: German and French had
a coverage of over 30% in Wikidata, and German of over 10%
in YAGO.
Given the increase in data in multiple languages and an
increase in multilingualism over the years, one may expect an
increase in multilingual ontologies as well.
3. Modelling Options for Multilingual Ontologies
Several options to develop multilingual ontologies have been
identified [5, 18, 19]:
1. Multilingual labels: the ontology vocabulary is annotated
with language-tagged strings.
2. Linguistic models: the ontology is associated with a
linguistic model external to it.
3. Mapping models: common interlingua which can be
specified between one or more monolingual ontologies.
³ The data was reported in groups of 2, 2–5, 5–10, and >10. 2 was ignored
as it is included in 2–5. The sum of the groups for BTC 2014 was 6.3% and
for Wikidata, 4.4%. Both values were rounded down due to the unavoidable
double-counting of 5.
Figure 1: Timeline of the MSW (T1) and MWD (T2) searches, contrasted with reviews pertaining to multilingualism conducted for the period 2001–2021: available
data for ontologies, RDF datasets (collections of triples) and knowledge graphs (RDF graphs) are shown in the blue, red and green bars respectively.
Table 1: Comparison of the Main Approaches for Modelling Multilinguality in Ontologies
Criterion                                                         | Labels                                        | Linguistic Models          | Mapping Models
------------------------------------------------------------------|-----------------------------------------------|----------------------------|----------------
Possible OWL profiles                                             | All                                           | All                        | All
Can account for cross-lingual terms where there is no 1-1 mapping | No                                            | Within the annotation only | Yes
Can support inflectional languages                                | No                                            | Limited                    | No
Annotations reusable by other resources                           | ± (not if the annotation is a data literal)   | Yes                        | No
Example axiom count for two natural languages in an ontology      | 2                                             | 14                         | 7
Annotations contained within the OWL file                         | Yes                                           | No                         | No
Number of files to keep in sync                                   | 1                                             | Minimum 3                  | Minimum 2
Can be managed by an ontology editor                              | Yes                                           | Limited support            | Limited support
Notes: For linguistic models, the values for inflectional languages, axiom count, and ontology editor are for OntoLex-Lemon only.
The axiom count was determined for a class in an ontology for two natural languages. Language-specific grammatical features were excluded.
OntoLex-Lemon was used for the linguistic model and the simplest method was used, that is, associating two lexical entries with a class. For
the mapping model, OWL 2 was used.
A brief comparison of these approaches is included in Table 1.
We will elaborate, compare, and discuss each approach in turn
in the subsections below.
In order to represent the options and sub-variants
schematically to facilitate comparisons, we use an ad hoc
visual language. This consists of a set of primitives. First, an
ontology can be visualised as two layers, which separate the
semantic and the linguistic components, in line with [20]. The
semantic layer, which contains the classes, object properties
and other axioms, is indicated with a dark grey box that has
further information, such as the type of identifier (opaque or
descriptive). The linguistic layer, which typically contains the
annotations used to associate language information with an
ontology, is indicated by a light grey box, and contains the
various options for the language annotations. An example
visualisation is shown in Figure 2 for a monolingual ontology
with descriptive identifiers, i.e., naming the vocabulary
elements with a string that is in a natural language or natural
language-like (indicated with $L_n$) and meaningful to a human,
such as naming a class Person or RumAndRaisinIcecream
instead of ABC:0000012. The knowledge represented in the
TBox of the ontology may have a natural language-specific
vocabulary either in or attached to the ontology as a separate
file. It is possible that two monolingual ontologies in two
different natural languages may share the same
conceptualisation [21].
To show the properties and cardinality constraints between
the two layers and their objects, the crow’s feet notation from
Entity-Relationship modelling is used. The notation is
extended to indicate the modelling of language-specific label
instances. We also introduce a relation that is used to indicate
when two elements must have structural equivalence. This
pertains specifically to the different annotation values available
in OWL 2. See Figure 2 for a legend of the different relations.
We now proceed to the three groups of options, going from
the basic to more elaborate options.
3.1. Multilingual Labels
Multilingual labels are the ‘low-hanging fruit’, being the
easiest way to develop a multilingual ontology. An entity in an
ontology (i.e., a class, object property, data property, or
individual) can have multilingual labels by using rdfs:label,
which takes as domain rdfs:Resource and as range an
rdfs:Literal. Each such label is language-tagged. The
language tag typically consists of an ISO 639 language code⁴.
However, as ISO 639’s Parts 1-3 [22] do not provide language
codes for all the world’s languages, the language tag can be
extended to capture other information, as long as it conforms
to IETF’s BCP 47 [23, 24]. When this is insufficient [25],
other language models, such as MoLA [26], may be added to
specify additional languages or lects.
Figure 2: Demonstration of the language to visualise the interaction between
ontologies, any identifiers, and natural language aspects, shown here for a
monolingual ontology that uses descriptive identifiers. The semantic and
linguistic layers of the ontology are indicated by the grey boxes; the names
of the classes (and other vocabulary elements) are in a natural language $L_n$.
The possible relations with their constraints are shown in the legend.
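As an illustration of how a language tag decomposes into subtags, the Python sketch below splits a tag of the common shape language[-Script][-REGION]. It deliberately covers only a simplified subset of BCP 47 (no extended language subtags, variants, extensions, or private-use subtags), and the function name `split_language_tag` is hypothetical.

```python
import re

# Simplified shape: language[-Script][-REGION]; full BCP 47 allows more subtags.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_language_tag(tag):
    """Split a tag such as 'zh-Hant-TW' into subtags, or return None."""
    match = _TAG.fullmatch(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Conventional casing: language lower, Script title case, REGION upper.
    parts["language"] = parts["language"].lower()
    if parts["script"]:
        parts["script"] = parts["script"].title()
    if parts["region"]:
        parts["region"] = parts["region"].upper()
    return parts


dutch = split_language_tag("nl")
south_african_english = split_language_tag("en-ZA")
```

Tags outside this simplified shape return None here, which is a limitation of the sketch rather than of BCP 47 itself.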
An example OWL fragment of this labelling approach is
shown in Listing 1 in Functional-Style syntax (FSS). (Unless
explicitly stated otherwise, all further examples are in FSS.) In
this example, the class Person has two annotations: one in
English and the other in Dutch. Both annotations are a
language-tagged string.
    SubClassOf(:Person owl:Thing)
    AnnotationAssertion(rdfs:label :Person "Person"@en)
    AnnotationAssertion(rdfs:label :Person "Persoon"@nl)
Listing 1: OWL fragment illustrating multilingual labels
⁴ https://www.iso.org/iso-639-language-codes.html
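The labelling approach of Listing 1 can be inspected programmatically. The following Python sketch, using only the standard library, collects rdfs:label annotations from a fragment in Functional-Style syntax and groups them by entity and language tag. It is a minimal sketch that assumes one simple assertion per match, abbreviated IRIs, and no escaped quotes; the names `LABEL` and `labels_by_entity` are hypothetical, and a real ontology would be better handled by a proper OWL parser.

```python
import re
from collections import defaultdict

# Matches e.g.: AnnotationAssertion(rdfs:label :Person "Persoon"@nl)
# The language tag is optional, since a plain literal carries none.
LABEL = re.compile(
    r'AnnotationAssertion\(\s*rdfs:label\s+(?P<entity>\S+)\s+'
    r'"(?P<text>[^"]*)"(?:@(?P<lang>[A-Za-z0-9-]+))?\s*\)'
)


def labels_by_entity(fss_text):
    """Return {entity: {language_tag_or_None: [label, ...]}}."""
    result = defaultdict(lambda: defaultdict(list))
    for match in LABEL.finditer(fss_text):
        result[match["entity"]][match["lang"]].append(match["text"])
    return {entity: dict(by_lang) for entity, by_lang in result.items()}


fragment = '''
SubClassOf(:Person owl:Thing)
AnnotationAssertion(rdfs:label :Person "Person"@en)
AnnotationAssertion(rdfs:label :Person "Persoon"@nl)
'''
person_labels = labels_by_entity(fragment)[":Person"]
```

Grouping by language tag in this way is a first step towards counting how many entities are labelled in each language.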
The class :Person is a named entity, where Person is the
identifier portion of the URI. An identifier of a URI can be
opaque or descriptive [27, 28, 29]. An opaque identifier is
semantic-free (non-meaningful) and used in a number of
ontologies; e.g., [27, 30, 31]. If an opaque identifier is used as
part of the URI, then an additional sign in the form of an
rdfs:label annotation should be added for both humans and
machines to interpret the URI. For a descriptive identifier,
there is a direct relationship between the natural language term
used as an identifier and its semantics, e.g., [30, 32]. The
descriptive identifier is typically in the primary language of the
ontology.
Figure 3 shows a schematic representation for the TBox for
three ways of realising multilingual ontologies with this
approach. Consider the TBox of $O_{LI}$, which presents the
sub-variant where a class has an opaque identifier and at least
one natural language label, with each label in a different
language. An example OWL fragment is given in Listing 2.
    SubClassOf(:493Dk owl:Thing)
    AnnotationAssertion(rdfs:label :493Dk "Person"@en)
Listing 2: OWL fragment with an opaque identifier
An entity can have any number of labels, where $i$ is a label and
$L$ is the set of languages used for an entity: $(L_1, \ldots, L_n)$ with
$1 \le i \le n$. For a primary language ontology, an entity can either
have an opaque or descriptive identifier in $L_i$.
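Because an opaque identifier carries no meaning by itself, a tool has to choose a label to show to the user. The Python sketch below illustrates one possible fallback strategy: try the preferred languages in order, then any available label, and only then the raw identifier. The function name `display_label` and the fallback order are assumptions made for illustration, not a prescribed behaviour.

```python
def display_label(identifier, labels, preferred=("en",)):
    """Pick a human-readable label for an entity.

    labels: {language_tag: label} for this entity.
    Tries each preferred language in order, then any available label
    (the alphabetically lowest tag, for determinism), then the identifier.
    """
    for lang in preferred:
        if lang in labels:
            return labels[lang]
    if labels:
        return labels[min(labels)]
    return identifier


# The opaque identifier from Listing 2, with and without a label.
opaque = display_label(":493Dk", {"en": "Person"}, preferred=("nl", "en"))
unlabelled = display_label(":493Dk", {})
```

With no label at all, the user is left with ":493Dk", which shows why at least one label is mandatory in the opaque-identifier variant.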
Because meaning can be surmised by a human from a
descriptive identifier, a label is not required. However, as
shown in O
PLD in Figure 3, if the ontology is multilingual, all
entities may have labels in one language, with the language tag
sometimes omitted. Where there are labels for other
languages, these are translations from the primary language.
O
PLO (Figure 3) has an opaque identifier, but since all entities
have labels in one language and to a limited extent, labels for
other languages, we consider it to be a primary language
multilingual ontology.
The adaptation of a monolingual ontology to another
language is called ontology localisation [33], although this is
used for different contexts as well, such as a culture or a
geo-political environment for which there is typically a shared
language [34, 35, 36]. If an ontology is localised using labels,
then only the linguistic layer of the ontology is affected, and
the concept space of the original ontology remains unchanged.
Despite the differences in the labels and identifiers
between each of the TBoxes in Figure 3, $O_{LI}$, $O_{PLD}$ and $O_{PLO}$
have commonality in that they share the same so-called
‘concept spaces’, which can be seen as a figurative area of a
concept, from a fine-grained concept to a category of concepts,
where a concept space applies to each OWL class and object
property; e.g., the notion of house and the colour blue. Natural
languages may divide up the same concept space differently,
which may indicate underlying ontological distinctions. One
language may underspecify it compared to another. For
instance, there is only one notion of ‘River’ in English,
whereas the French equivalent distinguishes between two
types of river (Rivière and Fleuve). Another example is the
representation of part-whole relations in isiZulu when
compared to English, which also revealed ontological
distinctions, in that the list of ‘universal’ part-whole relations
required both generalisation and refinements in order to
accurately represent them [37].
Figure 3: Three principal options for multilingual labels in the TBox of the ontology: fully language independent (indicated with $L_0$) with a mandatory requirement
for at least one label in some language (any one of $L_1 \ldots L_n$) (left), a primary language for the ontology ($L_n$), using a descriptive identifier and optional labels for
other languages (centre), and mostly language-independent ($L_0$ in the semantic layer) but with the requirement to have at least one label in one specified language
(right).
Thus the use of labels has its limitations. In addition, not all
grammatical features for a natural language can be dealt with
by a label, such as inflected forms, concordial agreement, and
agglutination (where words are formed with the addition of
affixes to a word root or stem). Inflectional forms on nouns
include the grammatical categories number and gender, and
tense, aspect and number on verbs, which are typically used
for naming classes and properties, respectively. For each of
these grammatical categories, the word formation may need to
change. An example is the Organization Ontology [38], which
provides support for English, Spanish, French, Italian and
Japanese. For instance, the property org:changedBy with the
English label changed by has labels for most of the supported
languages, but there are two for Spanish: es modificada por
and es modificado por. This is to account for grammatical
gender which is determined by the gender of the noun of the
name of the class that is in the position of the domain in the
axiom. Similarly for plural terms, the concordial agreement in
a language such as isiXhosa (a Niger-Congo B (‘Bantu’)
language spoken in South Africa), combined with its
agglutination, will alter the surface realisation of the
annotation depending on its domain and range. Indeed, as
highlighted by Keet and Khumalo, highly agglutinative
languages can represent a challenge in ontologies, where it is
possible that no single human-readable label can be prescribed
to a property, due to the use of context-dependent affixes that
modify the entity’s name or their label [37].
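The Spanish example above can be sketched in Python to show why a single label per language falls short: the label to use depends on the grammatical gender of the noun naming the domain class. The two labels are those quoted above; the small gender lexicon, the nouns in it, and the function name `verbalise_es` are hypothetical, and articles are omitted for brevity.

```python
# Spanish labels of org:changedBy, keyed by grammatical gender.
CHANGED_BY_ES = {"f": "es modificada por", "m": "es modificado por"}

# Hypothetical lexicon: gender of the Spanish noun naming each class.
NOUN_GENDER_ES = {"organización": "f", "puesto": "m"}


def verbalise_es(domain_noun, range_noun):
    """Render 'domain changedBy range' with the gender-agreeing label."""
    gender = NOUN_GENDER_ES[domain_noun]  # agreement is with the domain noun
    return f"{domain_noun} {CHANGED_BY_ES[gender]} {range_noun}"


sentence = verbalise_es("organización", "evento")
```

A plain label annotation cannot encode the gender lookup itself, which is the kind of information a linguistic model is meant to supply.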
It may be the case that more than one label for a single
language has to be added, for which Simple Knowledge
Organization System (SKOS) can be used. SKOS has the
prefLabel and altLabel properties (subsumed by
rdfs:label) where labels can be identified as preferred or
alternative, respectively. rdfs:label does not have
cardinality restrictions, but skos:prefLabel can only be