Bootstrapping NLP tools across low-resourced African languages: an overview and prospects

C. Maria Keet
Department of Computer Science
University of Cape Town
South Africa
mkeet@cs.uct.ac.za

Abstract

Computing and Internet access are substantially growing markets in Southern Africa, which brings increasing demand for local content and tools in indigenous African languages. Since most of those languages are low-resourced, efforts have gone into bootstrapping tools for one African language from another. This paper provides an overview of these efforts for Niger-Congo B (‘Bantu’) languages. Bootstrapping grammars for geographically distant languages has been shown to still have positive outcomes for morphology and for rule-based or grammar-based natural language generation. Bootstrapping with data-driven approaches to NLP tasks is difficult to use meaningfully regardless of geographic proximity, largely because of lexical diversity arising from both orthography and vocabulary. Cladistic approaches in comparative linguistics may inform bootstrapping strategies, and similarity measures might serve as a proxy for bootstrapping potential as well; both are fertile ground for further research.

1 Introduction

Nearly 1.5 billion people live in Africa, many of whom speak multiple languages other than the relatively well-resourced English, French, and Arabic, out of an estimated 1441 to 2169 African languages (Hammarström, 2018). Notably, by number of first-language (L1) speakers, Swahili of the Niger-Congo family is the next-largest language (around 50 million L1 speakers, some 100-200 million overall), with the Afroasiatic Hausa a close second, followed by Oromo, and then Yoruba and Igbo of the Niger-Congo family, each with around 28 million L1 speakers.¹ While illiteracy exists in Sub-Saharan Africa, there are still very many people who can read and write in indigenous languages, some of which have official status in one or more countries, where they are used in education, work, and social life. This entails a need for language support for tasks such as spelling and grammar checking, translation, and natural language generation. Within the context of the United Nations’ Sustainable Development Goals,² such needs include language technologies to ameliorate the language gap (and consequent lower service (Hussey, 2012/2013)) in healthcare (Byamugisha et al., 2017; Marais et al., 2020), educational digital assistants to ease the burden of overworked teachers in crowded classrooms of up to 100 learners (Keet, 2021), and many other supportive tasks for society, such as machine translation for humanitarian response (Öktem et al., 2021). In this paper, we zoom in on the Niger-Congo B (‘Bantu’), or NCB, family of languages. In the classification of NLP support for languages by Joshi et al. (2020), the NCB languages fall into three of the six categories: the “left behinds”, the “scraping-bys”, and the “hopefuls”, with isiZulu (spoken in South Africa) and Swahili (spoken primarily in Tanzania and Kenya) in the last of these.

¹ Numbers from various open and paywalled sources, collated at https://en.wikipedia.org/wiki/Languages_of_Africa.

A selection of the usual NLP tasks has been taken up for a few languages of the Niger-Congo family, notably Swahili and isiZulu, and to a lesser extent Yoruba, Igbo, isiXhosa, and Runyankore. Examples are diverse. They range from corpus creation for data-driven NLP, such as the IsiZulu National Corpus (Khumalo, 2015) that was used for a statistical language model for a spellchecker (Ndaba et al., 2016), and the Masakhane grassroots initiative³ that focuses on data-driven machine translation for multiple African languages (Nekoto et al., 2020), to data-driven text-to-speech (Marais et al., 2020) based on Qfrency,⁴ and other language modelling and data augmentation (e.g., Byamugisha, 2020; Mesham et al., 2021; see also Kambarami et al. (2021) for an overview). The main knowledge-driven approaches include terminology development, both general (Khumalo, 2017) and domain-specific (e.g., Engelbrecht et al., 2010), rule-based morphological analysers (Pretorius and Bosch, 2003; Bosch and Pretorius, 2017), grammars (Bamutura et al., 2020; Pretorius et al., 2017), and natural language generation (Keet and Khumalo, 2017; Byamugisha, 2019; Mahlaza and Keet, 2020). Most of the research has taken place over the past 5-10 years and is gaining pace, albeit still for only a slowly increasing number of NCB languages.

² https://sdgs.un.org/goals
³ https://www.masakhane.io/
⁴ http://www.qfrency.com/

arXiv:2210.12027v1 [cs.CL] 21 Oct 2022

The low-resourced and very low-resourced languages⁵ face a ‘catch-22’, however: there are few language resources, but one needs language and linguistics resources to create more of them. A well-known idea is to try to ‘bootstrap’ resources for a very low-resourced language from a low-resourced one; e.g., to bootstrap a spellchecker for isiNdebele from an isiZulu spellchecker, both being languages in the Nguni group of the NCB languages in South Africa.

Theoretically this makes sense, but in practice it is nontrivial to determine bootstrapping potential and strategies. In this paper, we report on a preliminary review of published research on bootstrapping for the NCB languages, to provide better insight into bootstrapping so that it can better inform NLP tasks for these languages. While it remains imprecise what should be assessed quantitatively to gauge bootstrapping potential, the reviewed research demonstrates that bootstrapping with rules and grammars extends to more languages than initially assumed, whereas bootstrapping with data-driven approaches extends to fewer languages, because lexical proximity is limited by variations in vocabulary and orthography.
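Lexical proximity is not given a quantitative definition at this point in the paper. As one hypothetical way to make the notion concrete, the sketch below computes the Jaccard overlap between the character trigram inventories of two word lists. The function names, the choice of trigrams, and the toy word lists are assumptions made for illustration; this is not a measure taken from the reviewed work.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '#' at both ends."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_profile(words, n=3):
    """Union of the character n-grams of all words in a word list."""
    profile = set()
    for word in words:
        profile |= char_ngrams(word, n)
    return profile


def lexical_overlap(words_a, words_b, n=3):
    """Jaccard overlap (0.0 to 1.0) between the n-gram profiles of two word lists."""
    a, b = ngram_profile(words_a, n), ngram_profile(words_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy word lists (hypothetical stand-ins for real lexicons or corpora).
source_words = ["umfana", "inja", "umuthi"]
target_words = ["umfana", "inja", "isihlalo"]
score = lexical_overlap(source_words, target_words)
print(f"trigram overlap: {score:.2f}")
```

Because the profiles are built from surface strings, such a score would be sensitive to both vocabulary and orthographic conventions, which are the two sources of lexical diversity named above.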

In the remainder of the paper, we first summarise NLP-relevant features of NCB languages and linguistics-focussed categorisations of NCB languages in Section 2. We then proceed to the key questions for bootstrapping and the review in Section 3, discuss in Section 4, and close in Section 5.

2 NCB languages: some key features

This section describes the language and linguistic features of NCB languages insofar as they have been relevant to computation thus far: first the key language features, and then the grouping of subsets of NCB languages.

⁵ While there is no crisp demarcation of ‘low’ in low-resourced languages, it is to be understood as having only small (e.g., 20K tokens) or no curated monolingual or parallel corpora, limited (including outdated) or no grammar books, and typically also comparatively few researchers and little funding.

2.1 Grammar and orthography

The system of noun classes is emblematic of the NCB languages. Each noun belongs to a noun class and there are up to 23 noun classes; see Table 1 for a summarised overview. All NCB languages retained the lower-numbered classes and are fairly similar up to noun class 11: the odd-numbered classes contain nouns in the singular and the even-numbered classes nouns in the plural, they pair up mostly in the same way, and they have a large overlap in the kinds of things that can be found in each pair of noun classes. After that, the NCB languages diverge on which noun classes are retained in the language, and there may not be a singular/plural pairing. For instance, isiZulu’s ubuntu ‘humanity’ is an abstract concept in noun class 14 for which there is no singular or plural, whereas in Chichewa a noun in noun class 14 may have a plural in noun class 6. The nouns with their noun classes govern a rich system of concordial agreement across sentence constituents, ranging from verb conjugation to modifying adjectives, possessives, and other relational notions.
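The singular/plural pairing of the lower noun classes can be sketched as a small lookup. The example below is a deliberately simplified illustration for isiZulu, not a morphological analyser: the prefix strings are assumptions read off the Table 1 examples (real prefixes have variants and phonological conditioning, so, e.g., class 11 is left out), and the function name is hypothetical.

```python
# Odd (singular) class -> even (plural) class, for the classes that pair up regularly.
SINGULAR_TO_PLURAL_CLASS = {1: 2, 3: 4, 5: 6, 7: 8, 9: 10}

# Simplified isiZulu noun class prefixes, assumed from the Table 1 examples only.
CLASS_PREFIX = {
    1: "um", 2: "aba",
    3: "umu", 4: "imi",
    5: "i", 6: "ama",
    7: "isi", 8: "izi",
    9: "in", 10: "izin",
}


def pluralise(noun, noun_class):
    """Swap the singular class prefix for the prefix of the paired plural class."""
    if noun_class not in SINGULAR_TO_PLURAL_CLASS:
        raise ValueError(f"noun class {noun_class} has no plural pairing in this sketch")
    singular_prefix = CLASS_PREFIX[noun_class]
    if not noun.startswith(singular_prefix):
        raise ValueError(f"{noun!r} does not start with prefix {singular_prefix!r}")
    stem = noun[len(singular_prefix):]
    plural_class = SINGULAR_TO_PLURAL_CLASS[noun_class]
    return CLASS_PREFIX[plural_class] + stem, plural_class


for noun, nc in [("umfana", 1), ("umuthi", 3), ("ikhala", 5), ("isihlalo", 7), ("inja", 9)]:
    plural, plural_nc = pluralise(noun, nc)
    print(f"{noun} (nc{nc}) -> {plural} (nc{plural_nc})")
```

A noun in class 14, such as ubuntu, is rejected by this sketch, which mirrors the isiZulu case described above; a Chichewa version would instead need an entry pairing class 14 with class 6.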

The verbs have a so-called “slot system” where each slot fulfils a specific function, if used (Khumalo, 2007). There are eight ordered slots: the pre-initial, initial, post-initial, pre-radical, (verb) radical, pre-final, final, and post-final slot. The pre-initial and post-initial slots can take tense, aspect, mood, and negation, and the pre-final slot can take tense, aspect, mood, and valence change (causative, accusative, reciprocative, and passive). The initial slot is for the subject concord, which conjugates the verb depending on the subject of the sentence, and the pre-radical slot is for the object concord. The final slot is for the final vowel (e.g., default /a/ in isiZulu, but /i/ if the verb is negated), and the post-final slot is used for extensions, including the wh-questions and the locative suffix.
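The ordered slots lend themselves to a simple template representation. The following is a minimal sketch under stated assumptions: it concatenates whatever fills each slot in the fixed order and applies only the isiZulu final-vowel default mentioned above (/a/, or /i/ when negated). The function name is hypothetical, and the fillers used in the usage lines (a first person singular subject concord ngi- and a negative pre-initial a-) are illustrative assumptions rather than examples from the source; no phonological conditioning is applied.

```python
# The eight verb slots, in their fixed order.
SLOTS = (
    "pre_initial", "initial", "post_initial", "pre_radical",
    "radical", "pre_final", "final", "post_final",
)


def build_verb(radical, negated=False, **fillers):
    """Concatenate slot fillers in slot order; unused slots stay empty."""
    unknown = set(fillers) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slot(s): {sorted(unknown)}")
    slots = dict.fromkeys(SLOTS, "")
    slots.update(fillers)
    slots["radical"] = radical
    if not slots["final"]:
        # isiZulu default final vowel: /a/, or /i/ if the verb is negated.
        slots["final"] = "i" if negated else "a"
    return "".join(slots[name] for name in SLOTS)


# -cul- is the radical of ukucula 'to sing' (Table 1); the concords are assumed.
print(build_verb("cul", initial="ngi"))                                 # roughly 'I sing'
print(build_verb("cul", negated=True, pre_initial="a", initial="ngi"))  # roughly 'I do not sing'
```

Keeping the slot order in one tuple means a grammar for a related language could, in principle, reuse the template and swap only the fillers and defaults.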

Table 1: Generalisation of the semantics of the kinds of objects that the nouns in the respective noun classes (NCs) refer to. Examples from isiZulu (1-11, 14, 15), Chichewa (12, 13, 16-18), Hunde (19), Runyankore (20, 21), and Luganda (22, 23). (Source: adapted from Byamugisha et al. (2018).)

NC    Semantics (generalised)                                        Examples
1     People and kinship                                             umfana (nc1) ‘boy’
2                                                                    abafana (nc2) ‘boys’
3     Plants, nature, some parts of the body                         umuthi (nc3) ‘tree’
4                                                                    imithi (nc4) ‘trees’
5     Fruits, liquids, parts of the body, loan words, paired things  ikhala ‘nose’
6                                                                    amakhala ‘noses’
7     Inanimate objects                                              isihlalo ‘chair’
8                                                                    izihlalo ‘chairs’
9     Loan words, tools, and animals                                 inja ‘dog’
10                                                                   izinja ‘dogs’
11    Long thin stringy objects, languages, inanimate objects        ucingo ‘wire’
(10)                                                                 izingcingo ‘wires’
12    Diminutives                                                    kagalimoto ‘small car’
13                                                                   timagalimoto ‘small cars’
14    Abstract concepts                                              ubuhle ‘beauty’
15    Infinitive nouns                                               ukucula ‘to sing’
16    Locative classes                                               pamsika ‘round the market’
17                                                                   kumsika ‘at the market’
18                                                                   mumsika ‘in the market’
19    Diminutives                                                    hyùndù ‘a little bit of porridge’
20    Augmentative and pejorative                                    ogusajja ‘big ugly man’
21                                                                   agasajja ‘big ugly men’
22                                                                   gubwa ‘mutt’ (pejorative of dog)
23    Locative class                                                 eka ‘at home’

Many of the NCB languages are agglutinating and thus have a substantial set of phonological conditioning rules, especially for vowel coalescence and vowel elision. For instance, for ‘(located) in the envelope’ in isiZulu, one has to modify imvilophu ‘envelope’ with the phonologically conditioned locative prefix (e-) and suffix (-ini) to obtain emvilophini, and the enumerative ‘and’ na- merges with the noun that follows it, as in, e.g., (na- + umfana =) nomfana ‘and the boy’ and (na- + inja =) nenja ‘and the dog’.
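The vowel coalescence in the na- examples above can be sketched as a lookup on the two adjacent vowels. This is a minimal illustration under assumptions: the pairs a + u = o and a + i = e are read off the two examples in the text, the pair a + a = a is an added assumption, and the function name is hypothetical. It does not cover the locative e- ... -ini pattern or vowel elision.

```python
# Coalescence of the final vowel of na- with the initial vowel of the noun.
# a+u -> o and a+i -> e follow the examples in the text; a+a -> a is assumed.
COALESCENCE = {
    ("a", "u"): "o",
    ("a", "i"): "e",
    ("a", "a"): "a",
}


def attach_na(noun):
    """Prefix the enumerative na- 'and' to a noun, coalescing adjacent vowels."""
    if not noun:
        raise ValueError("noun must be a non-empty string")
    prefix = "na"
    merged = COALESCENCE.get((prefix[-1], noun[0]))
    if merged is None:
        # No rule in this sketch: fall back to plain concatenation.
        return prefix + noun
    return prefix[:-1] + merged + noun[1:]


for noun in ["umfana", "inja"]:
    print(f"na- + {noun} = {attach_na(noun)}")
```

A rule-based generator would typically hold many such rules, which is one reason why a rule set written for one language may transfer in part to a related language with similar conditioning.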

NCB languages overwhelmingly use Latin script, with some also using Arabic script and modern indigenous writing systems. Among the ones that use Latin script, there can be language-specific variations; e.g., isiZulu typically does not have words with an /r/ and Swahili has no /q/, Mboshi has an additional variant of /e/ alongside the /e/, and there are letter combinations that stand in for further consonants, such as a ‘hard b’ and a ‘soft b’ (/bh/ and /b/, respectively), and for ‘click sounds’, such as a nasalised click that may be written as /nc/. Many NCB languages are tonal, although this may not be reflected in the orthography (Maddieson and Sands, 2019).

2.2 Categorising NCB languages

There have been multiple attempts at grouping the NCB languages according to various parameters. The best-known one is based on geographic regions and was devised by Guthrie (1971); it has been updated in (Maho, 2003) and again informally in 2009, with detailed maps and many references.⁶ The system labels zones from A to S, running from the top left in Cameroon to the bottom right in South Africa. Each zone has groups, indicated by increments of 10 (e.g., A10), where all languages within a group have arbitrarily ordered increments of 1 (e.g., A11 and A12), and possibly further minor increments, such as A111 and A12a. A map overlaid with the NCB languages mentioned in this paper is shown in Fig. 1 (the non-NCB Niger-Congo languages Yoruba and Igbo are located left of A86c, in Nigeria); see also Table 2.
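The zone/group/language structure of these codes can be made explicit with a small parser. The sketch below is an assumption-laden illustration: it handles only the code shapes mentioned above (a zone letter, two digits, and at most one further digit or lowercase letter), accepts any letter from A to S as a zone, and uses hypothetical names (`GuthrieCode`, `parse_guthrie`, `same_group`). Codes of other shapes in the updated lists would need a richer pattern.

```python
import re
from typing import NamedTuple


class GuthrieCode(NamedTuple):
    zone: str      # e.g. "A"
    group: str     # increments of 10, e.g. "A10"
    language: str  # increments of 1 within the group, e.g. "A12"
    minor: str     # optional further increment, e.g. "a" or "1"; "" if absent


# Zone letter, tens digit, units digit, optional minor increment.
_PATTERN = re.compile(r"([A-S])(\d)(\d)(\d|[a-z])?")


def parse_guthrie(code):
    """Split a Guthrie-style code into zone, group, language, and minor increment."""
    match = _PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"unsupported code shape: {code!r}")
    zone, tens, units, minor = match.groups()
    return GuthrieCode(zone, f"{zone}{tens}0", f"{zone}{tens}{units}", minor or "")


def same_group(code_a, code_b):
    """True if both codes fall in the same group, a coarse geographic proxy."""
    return parse_guthrie(code_a).group == parse_guthrie(code_b).group


for code in ["A11", "A12a", "A111", "A86c"]:
    print(code, parse_guthrie(code))
```

Since the zones are geographic, sharing a group says something about where languages are spoken rather than directly about how similar they are linguistically.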

In addition to descriptive linguistics-based overviews, comparisons, and high-level groupings (Güldemann, 2018), some of those research efforts also have a computational component that may inform the potential for bootstrapping language resources. Petzell and Hammarström (2013) com-

⁶ The document is available at https://brill.com/fileasset/downloads_products/35125_Bantu-New-updated-Guthrie-List.pdf (last accessed 3 Sep 2022), but does not seem to have been published.
