Bootstrapping NLP tools across low-resourced African languages: an
overview and prospects
C. Maria Keet
Department of Computer Science
University of Cape Town
South Africa
mkeet@cs.uct.ac.za
Abstract
Computing and Internet access are substantially growing markets in Southern Africa, which brings with it increasing demands for local content and tools in indigenous African languages. Since most of those languages are low-resourced, efforts have gone into the notion of bootstrapping tools for one African language from another. This paper provides an overview of these efforts for Niger-Congo B (‘Bantu’) languages. Bootstrapping grammars for geographically distant languages has been shown to still have positive outcomes for morphology and rule-based or grammar-based natural language generation. Bootstrapping with data-driven approaches to NLP tasks is difficult to use meaningfully regardless of geographic proximity, largely because of lexical diversity in both orthography and vocabulary. Cladistic approaches in comparative linguistics may inform bootstrapping strategies, and similarity measures might serve as a proxy for bootstrapping potential as well; both are fertile ground for further research.
1 Introduction
Nearly 1.5 billion people live in Africa, many of whom speak multiple languages other than the relatively well-resourced English, French, and Arabic, among an estimated 1441 to 2169 African languages (Hammarström, 2018). Notably, by first language speakers, Swahili of the Niger-Congo family is the next-largest language (around 50 million L1 speakers, some 100-200 million overall), closely followed by the Afroasiatic Hausa and then Oromo, and then Yoruba and Igbo in the Niger-Congo family, each with around 28 million L1 speakers¹. While illiteracy exists in Sub-Saharan Africa, there are still very many people who can read and write in indigenous languages, some of which have official status in one or more countries where they are used in education, work, and social life. This entails a need for language support for tasks such as spelling and grammar checking, translation, and natural language generation. Within the context of the United Nations’ Sustainable Development Goals², such needs include language technologies to ameliorate the language gap (and consequent lower service (Hussey, 2012/2013)) in healthcare (Byamugisha et al., 2017; Marais et al., 2020), educational digital assistants to ease the burden of overworked teachers in crowded classrooms of up to 100 learners (Keet, 2021), and many other supportive tasks for society, such as machine translation for humanitarian response (Öktem et al., 2021). In this paper, we zoom in on the Niger-Congo B (‘Bantu’), or NCB, family of languages. In Joshi et al. (2020)’s classification of NLP support for languages, the NCB languages fall into three of the six categories: the “left behinds”, the “scraping-bys”, and the “hopefuls”, with isiZulu (spoken in South Africa) and Swahili (spoken primarily in Tanzania and Kenya) in the last group.
¹ Numbers from various open and paywalled sources, collated at https://en.wikipedia.org/wiki/Languages_of_Africa.
A selection of the usual NLP tasks has been taken up for a few languages of the Niger-Congo family, notably indeed Swahili and isiZulu, and to a lesser extent Yoruba, Igbo, isiXhosa, and Runyankore. Examples are diverse. They range from corpus creation for data-driven NLP, such as the IsiZulu National Corpus (Khumalo, 2015) that was used for a statistical language model for a spellchecker (Ndaba et al., 2016), and the Masakhane grassroots initiative³ that focuses on data-driven machine translation for multiple African languages (Nekoto et al., 2020), to data-driven text-to-speech (Marais et al., 2020) based on Qfrency⁴, and other language modelling and data augmentation (e.g., (Byamugisha, 2020; Mesham et al., 2021); see also Kambarami et al. (2021) for an overview). The main knowledge-driven approaches include terminology development, both general (Khumalo, 2017) and domain-specific (e.g., (Engelbrecht et al., 2010)), as well as rule-based morphological analysers (Pretorius and Bosch, 2003; Bosch and Pretorius, 2017), grammars (Bamutura et al., 2020; Pretorius et al., 2017), and natural language generation (Keet and Khumalo, 2017; Byamugisha, 2019; Mahlaza and Keet, 2020). Most of the research has taken place over the past 5-10 years and is gaining pace, albeit still for only a slowly increasing number of NCB languages.
² https://sdgs.un.org/goals
³ https://www.masakhane.io/
⁴ http://www.qfrency.com/
arXiv:2210.12027v1 [cs.CL] 21 Oct 2022
The low-resourced and very low-resourced languages⁵ face a ‘catch-22’, however: there are few language resources, but one needs language and linguistics resources to increase the language resources. A well-known idea is to try to ‘bootstrap’ resources for a very low-resourced language from a low-resourced one; e.g., to bootstrap a spellchecker for isiNdebele from an isiZulu spellchecker, as both are languages in the Nguni group of the NCB languages in South Africa.
Theoretically this makes sense, but practically it is nontrivial to figure out bootstrapping potential and strategies. In this paper, we report on a preliminary review of published research on bootstrapping for the NCB languages, to provide better insight into it so that it can better inform NLP tasks for NCB languages. While it remains imprecise what should be assessed quantitatively to gauge bootstrapping potential, the reviewed research demonstrates that bootstrapping with rules and grammars extends to more languages than initially assumed, whereas bootstrapping with data-driven approaches extends to fewer languages, because lexical proximity is limited by variations in vocabulary and orthography.
In the remainder of the paper, we first summarise NLP-relevant features of NCB languages and linguistics-focussed categorisations of NCB languages in Section 2. We then proceed to the key questions for bootstrapping and the review in Section 3, discuss the findings in Section 4, and close in Section 5.
2 NCB languages: some key features
This section describes linguistic features of NCB languages insofar as they are relevant to computation thus far, with first key language features and then the grouping of subsets of NCB languages.
⁵ While there is no crisp demarcation of ‘low’ in low-resourced languages, it is to be understood as having only small (e.g., 20K tokens) or no curated monolingual or parallel corpora, limited (including outdated) or no grammar books, and typically also comparatively few researchers and little funding.
2.1 Grammar and orthography
The system of noun classes is emblematic of the NCB languages. Each noun belongs to a noun class and there are up to 23 noun classes; see Table 1 for a summarised overview. All NCB languages retained the lower-numbered classes and are fairly similar up to noun class 11: the odd-numbered classes contain nouns in the singular and the even-numbered classes nouns in the plural, they pair up mostly in the same way, and they have a large overlap in the kind of things that can be found in each pair of noun classes. After that, the NCB languages diverge on which noun classes are retained in the language, and there may not be a singular/plural pairing. For instance, isiZulu’s ubuntu ‘humanity’ is an abstract concept in noun class 14 for which there is no singular or plural, whereas in Chichewa a noun in noun class 14 may have a plural in noun class 6. The nouns with their noun classes govern a rich system of concordial agreement across sentence constituents, ranging from verb conjugation to modifying adjectives, possessives, and other relational notions.
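To make the singular/plural pairing of the lower noun classes concrete, the following is a minimal sketch that pluralises the isiZulu example nouns of Table 1 by swapping noun class prefixes. The prefix segmentation, the `PAIRS` table, and the function name `pluralise` are illustrative assumptions rather than part of any reviewed tool, and the sketch ignores phonological conditioning (which is why the class 11 example ucingo/izingcingo is left out).

```python
# Singular -> (plural class, singular prefix, plural prefix) for isiZulu,
# read off the example nouns in Table 1. The prefix segmentation is an
# illustrative assumption, not a complete description of isiZulu.
PAIRS = {
    1: (2, "um", "aba"),    # umfana -> abafana
    3: (4, "umu", "imi"),   # umuthi -> imithi
    5: (6, "i", "ama"),     # ikhala -> amakhala
    7: (8, "isi", "izi"),   # isihlalo -> izihlalo
    9: (10, "in", "izin"),  # inja -> izinja
}


def pluralise(noun: str, noun_class: int) -> tuple[str, int]:
    """Return (plural form, plural noun class) for a singular noun."""
    if noun_class not in PAIRS:
        # E.g. class 14 (abstract concepts) has no plural in isiZulu.
        raise ValueError(f"no singular/plural pairing listed for class {noun_class}")
    plural_class, sg_prefix, pl_prefix = PAIRS[noun_class]
    if not noun.startswith(sg_prefix):
        raise ValueError(f"{noun!r} does not start with prefix {sg_prefix!r}")
    # Replace the singular prefix with the plural one; the stem is unchanged.
    return pl_prefix + noun[len(sg_prefix):], plural_class


if __name__ == "__main__":
    for noun, nc in [("umfana", 1), ("umuthi", 3), ("inja", 9)]:
        print(noun, "->", pluralise(noun, nc))
```

A table-driven design like this is what makes bootstrapping plausible in principle: for a related language, one would swap out the prefix table while keeping the procedure, although real prefixes vary with the stem and so need more than a lookup.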
The verbs have a so-called “slot system” where each slot, if used, fulfils a specific function (Khumalo, 2007): there are eight ordered slots, being the pre-initial, initial, post-initial, pre-radical, (verb) radical, pre-final, final, and post-final slot. The pre-initial and post-initial slots can take tense, aspect, mood and negation, and the pre-final slot can take tense, aspect, mood and valence change (causative, accusative, reciprocative, and passive). The initial slot is for the subject concord, which conjugates the verb depending on the subject in the sentence, and the pre-radical slot is for the object concord. The final slot is for the final vowel (e.g., default /a/ in isiZulu, but /i/ if the verb is negated) and the post-final slot is used for extensions, including the wh-questions and locative suffix.
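The eight ordered slots can be sketched as a simple record whose field order is the slot order. This is only an illustration under stated assumptions: the class `VerbSlots`, the helper `present`, and the segmentation of the isiZulu forms ngibona ‘I see’ and angiboni ‘I do not see’ are ours, the negation handling covers only that one subject concord, and plain concatenation ignores the phonological conditioning that a real generator needs.

```python
from dataclasses import dataclass, astuple


@dataclass
class VerbSlots:
    """The eight ordered verb slots; an empty string means 'slot unused'."""
    pre_initial: str = ""   # tense, aspect, mood, negation
    initial: str = ""       # subject concord
    post_initial: str = ""  # tense, aspect, mood, negation
    pre_radical: str = ""   # object concord
    radical: str = ""       # verb radical (root)
    pre_final: str = ""     # tense, aspect, mood, valence change
    final: str = ""         # final vowel
    post_final: str = ""    # extensions (wh-questions, locative suffix)

    def surface(self) -> str:
        # Field order equals slot order, so joining the fields is enough
        # for this sketch (no phonological conditioning is applied).
        return "".join(astuple(self))


def present(subject_concord: str, radical: str, negated: bool = False) -> VerbSlots:
    """Hypothetical helper: simple present with default or negative final vowel."""
    return VerbSlots(
        pre_initial="a" if negated else "",  # assumed negative marker
        initial=subject_concord,
        radical=radical,
        final="i" if negated else "a",       # /a/ by default, /i/ when negated
    )


if __name__ == "__main__":
    print(present("ngi", "bon").surface())
    print(present("ngi", "bon", negated=True).surface())
```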
Many of the NCB languages are agglutinating and thus have a substantial set of phonological conditioning rules, especially for vowel coalescence and vowel elision. For instance, for ‘(located) in the envelope’ in isiZulu, one has to modify imvilophu ‘envelope’ with the phonologically conditioned locative prefix (e-) and suffix (-ini) to result in emvilophini, and the enumerative ‘and’ na- merges with the successive noun, as in, e.g., (na- + umfana =) nomfana ‘and the boy’ and (na- + inja =) nenja ‘and the dog’.
Table 1: Generalisation of the semantics of the kinds of objects that the nouns in the respective noun classes (NCs) refer to. Examples from isiZulu (1-11, 14, 15), Chichewa (12, 13, 16-18), Hunde (19), Runyankore (20, 21), and Luganda (22, 23). (Source: adapted from (Byamugisha et al., 2018).)
NCs   Semantics (generalised)                                        Examples
1     People and kinship                                             umfana (nc1) ‘boy’
2                                                                    abafana (nc2) ‘boys’
3     Plants, nature, some parts of the body                         umuthi (nc3) ‘tree’
4                                                                    imithi (nc4) ‘trees’
5     Fruits, liquids, parts of the body, loan words, paired things  ikhala ‘nose’
6                                                                    amakhala ‘noses’
7     Inanimate objects                                              isihlalo ‘chair’
8                                                                    izihlalo ‘chairs’
9     Loan words, tools, and animals                                 inja ‘dog’
10                                                                   izinja ‘dogs’
11    Long thin stringy objects, languages, inanimate objects        ucingo ‘wire’
(10)                                                                 izingcingo ‘wires’
12    Diminutives                                                    kagalimoto ‘small car’
13                                                                   timagalimoto ‘small cars’
14    Abstract concepts                                              ubuhle ‘beauty’
15    Infinitive nouns                                               ukucula ‘to sing’
16    Locative classes                                               pamsika ‘round the market’
17                                                                   kumsika ‘at the market’
18                                                                   mumsika ‘in the market’
19    Diminutives                                                    hyùndù ‘a little bit of porridge’
20    Augmentative and pejorative                                    ogusajja ‘big ugly man’
21                                                                   agasajja ‘big ugly men’
22                                                                   gubwa ‘mutt’ (pejorative of dog)
23    Locative class                                                 eka ‘at home’
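The vowel coalescence in the na- examples above can be written down as a small rule table. The sketch below is a hedged illustration only: the function name `with_na` is hypothetical, the rules a + i → e and a + u → o are read off the two examples in the text, and a + a → a is an added assumption; other phonological conditioning (such as the locative e-...-ini) is not covered.

```python
# Coalescence of the /a/ of na- with the initial vowel of the noun.
# a+i -> e and a+u -> o follow the examples nenja and nomfana;
# a+a -> a is an assumption added for completeness.
COALESCE = {"a": "a", "i": "e", "u": "o"}


def with_na(noun: str) -> str:
    """Attach the enumerative 'and' (na-) to an isiZulu noun."""
    if not noun or noun[0] not in COALESCE:
        # No rule for this initial segment: fall back to plain prefixing.
        return "na" + noun
    # The /a/ of na- and the noun's initial vowel merge into one vowel.
    return "n" + COALESCE[noun[0]] + noun[1:]


if __name__ == "__main__":
    for noun in ["umfana", "inja"]:
        print(f"na- + {noun} = {with_na(noun)}")
```

Rules of this kind are a natural candidate for rule-based bootstrapping, since a related language may share the rule format while differing in the entries of the table.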
The NCB languages overwhelmingly use Latin script, with some also using Arabic script and modern indigenous writing systems. Among the ones that use Latin script, there can be language-specific variations; e.g., isiZulu typically does not have words with an /r/ and Swahili has no /q/, Mboshi has an ε variant of /e/ in addition to the /e/, and there are letter combinations to stand in for more consonants, such as a ‘hard b’ and a ‘soft b’ (/bh/ and /b/, respectively), and for ‘click sounds’, such as a nasalised click that may be written as /nc/. Many NCB languages are tonal, although this may not be reflected in the orthography (Maddieson and Sands, 2019).
2.2 Categorising NCB languages
There have been multiple attempts at grouping the NCB languages according to various parameters. The best-known one is based on geographic regions and was devised by Guthrie (1971); it has been updated in (Moho, 2003) and again informally in 2009 with detailed maps and many references⁶. The system counts from A to S, from the top left in Cameroon to the bottom right in South Africa. Each zone has groups, indicated by increments of 10 (e.g., A10), where all languages within the group have arbitrarily ordered increments of 1 (e.g., A11 and A12), and possibly further minor increments, such as A111 and A12a. A map overlaid with the NCB languages mentioned in this paper is shown in Fig. 1 (the non-NCB Niger-Congo languages Yoruba and Igbo are located left of A86c, in Nigeria); see also Table 2.
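The structure of these codes (zone letter, group in tens, language in units, optional minor increment) lends itself to a small parser, which could serve as a crude first check of whether two languages sit in the same zone or group. The sketch below is an assumption-laden illustration: the names `GuthrieCode`, `parse`, and `same_group` are hypothetical, and the pattern accepts any letter from A to S even though not every letter in that range need be an actual zone.

```python
import re
from typing import NamedTuple


class GuthrieCode(NamedTuple):
    zone: str      # e.g. "A"
    group: str     # e.g. "A10" (increments of 10)
    language: str  # e.g. "A11" (increments of 1 within the group)
    extra: str     # further minor increment, e.g. "1" in A111 or "a" in A12a


# Zone letter, tens digit, units digit, optional minor increment.
_CODE = re.compile(r"([A-S])(\d)(\d)([0-9a-z]*)")


def parse(code: str) -> GuthrieCode:
    match = _CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a recognised code: {code!r}")
    zone, tens, units, extra = match.groups()
    return GuthrieCode(zone, f"{zone}{tens}0", f"{zone}{tens}{units}", extra)


def same_group(a: str, b: str) -> bool:
    """True if both codes fall in the same group (same zone and tens digit)."""
    return parse(a).group == parse(b).group


if __name__ == "__main__":
    print(parse("A12a"))
    print(same_group("A11", "A12a"), same_group("A11", "A86c"))
```

Since the zones are geographic, sharing a zone or group is at best a rough hint of proximity and not a measure of linguistic similarity.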
In addition to descriptive linguistics-based overviews, comparisons, and high-level groupings (Güldemann, 2018), some of those research efforts have a computational component that may also inform the potential for bootstrapping language resources. Petzell and Hammarström (2013) com-
⁶ The document is available at https://brill.com/fileasset/downloads_products/35125_Bantu-New-updated-Guthrie-List.pdf (last accessed 3 Sep 2022), but does not seem to have been published.