
also Kambarami et al. (2021) for an overview).
The main knowledge-driven approaches include
terminology development, both general (Khumalo,
2017) and domain-specific (e.g., Engelbrecht et al.,
2010), rule-based morphological analysers
(Pretorius and Bosch, 2003; Bosch and Pretorius,
2017), grammars (Bamutura et al., 2020; Pretorius
et al., 2017), and natural language generation (Keet
and Khumalo, 2017; Byamugisha, 2019; Mahlaza
and Keet, 2020). Most of the research has taken
place over the past 5-10 years and is gaining pace,
albeit still for only a slowly increasing number of
NCB languages.
The low-resourced and very low-resourced languages⁵
face a ‘catch-22’, however: there are few
language resources but one needs language and
linguistics resources to increase the language re-
sources. A well-known idea is to try to ‘bootstrap’
resources for a very low-resourced language from a
low-resourced one; e.g., to bootstrap a spellchecker
for isiNdebele from an isiZulu spellchecker, which
are both languages in the Nguni group of the NCB
languages in South Africa.
Theoretically this makes sense, but practically it
is nontrivial to figure out bootstrapping potential
and strategies. In this paper, we report on a preliminary
review of published research on bootstrapping
for the NCB languages to provide better insight into
bootstrapping, so that it can better inform NLP tasks for NCB
languages. While it remains imprecise
what should be assessed quantitatively to gauge
bootstrapping potential, the research demonstrates
that bootstrapping with rules and grammars extends
to more languages than initially assumed, whereas
bootstrapping with data-driven approaches extends to fewer
languages, because variations in vocabulary and
orthography limit lexical proximity.
In the remainder of the paper, we first sum-
marise NLP-relevant features of NCB languages
and linguistics-focussed categorisations of NCB
languages in Section 2. We then proceed to the key
questions for bootstrapping and the review in Section 3,
discuss them in Section 4, and conclude in Section 5.
2 NCB languages: some key features
This section describes language and linguistic features of
NCB languages insofar as they have been relevant to computation
thus far, with first key language features
and then the grouping of subsets of NCB languages.
⁵ While there is no crisp demarcation of ‘low’ in low-resourced
languages, it is to be understood as having only
small (e.g., 20K tokens) or no curated monolingual or parallel
corpora, limited (including outdated) or no grammar books,
and typically also comparatively few researchers and little funding.
2.1 Grammar and orthography
The system of noun classes is emblematic of the
NCB languages. Each noun belongs to a noun class
and there are up to 23 noun classes; see Table 1
for a summarised overview. All NCB languages
retained the lower-numbered classes and are fairly similar up
to noun class 11: the odd-numbered classes
contain nouns in the singular and the even-numbered
classes nouns in the plural, they pair up mostly
in the same way, and they have a large overlap in
the kind of things that can be found in each pair
of noun classes. After that, the NCB languages
diverge on which noun classes are retained in the
language and there may not be a singular/plural
pairing. For instance, isiZulu’s ubuntu ‘humanity’
is an abstract concept in noun class 14 for which
there is no singular or plural, whereas in Chichewa
a noun in noun class 14 may have a plural in noun
class 6. The nouns with their noun classes gov-
ern a rich system of concordial agreement across
sentence constituents, ranging from verb conjuga-
tion to modifying adjectives, possessives, and other
relational notions.
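The singular/plural pairing described above can be sketched as a small lookup. This is a minimal illustration, not a linguistic resource: the table `DEFAULT_PAIRS`, the table `OVERRIDES`, and the function `plural_class` are hypothetical names introduced here, the default pairs are a simplification of "pair up mostly in the same way", and the only language-specific entries are the two class 14 examples given in the text.

```python
from typing import Dict, Optional

# Simplifying assumption: among the lower classes, each odd-numbered
# (singular) class pairs with the next even-numbered (plural) class.
DEFAULT_PAIRS: Dict[int, int] = {1: 2, 3: 4, 5: 6, 7: 8, 9: 10}

# Language-specific behaviour, limited to the two examples in the text:
# isiZulu class 14 has no plural; a Chichewa class 14 noun may take class 6.
OVERRIDES: Dict[str, Dict[int, Optional[int]]] = {
    "isiZulu": {14: None},
    "Chichewa": {14: 6},
}


def plural_class(language: str, singular_class: int) -> Optional[int]:
    """Return the plural noun class for a singular class, or None if unknown or absent."""
    language_rules = OVERRIDES.get(language, {})
    if singular_class in language_rules:
        return language_rules[singular_class]
    return DEFAULT_PAIRS.get(singular_class)


print(plural_class("isiZulu", 1))
print(plural_class("isiZulu", 14))
print(plural_class("Chichewa", 14))
```

Keeping the shared defaults separate from the per-language overrides mirrors the observation that the languages agree on the lower classes and diverge afterwards, which is also where a bootstrapped resource would need language-specific adjustment.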
The verbs have a so-called “slot system” where
each slot fulfils a specific function, if used (Khumalo,
2007): there are eight ordered slots, being the
pre-initial, initial, post-initial, pre-radical, (verb)
radical, pre-final, final, and post-final slot. The
pre-initial and post-initial can take tense, aspect,
mood and negation, and the pre-final can take tense,
aspect, mood and valence change (causative, applicative,
reciprocative, and passive). The initial
is for the subject concord to conjugate the verb
depending on the subject in the sentence and the
pre-radical slot is for the object concord. The final
slot is for the final vowel (e.g., default /a/ in
isiZulu, but /i/ if the verb is negated) and the post-final
slot is used for extensions, including the wh-question
and locative suffixes.
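The slot system lends itself to a simple ordered-template sketch. The code below is a minimal illustration under stated assumptions: `SLOTS` and `assemble_verb` are hypothetical names, the isiZulu morphemes in the usage lines are chosen for illustration only, and the function merely concatenates slot fillers in order without applying any phonological conditioning.

```python
from typing import Dict

# The eight ordered slots named in the text (Khumalo, 2007).
SLOTS = (
    "pre-initial", "initial", "post-initial", "pre-radical",
    "radical", "pre-final", "final", "post-final",
)


def assemble_verb(fillers: Dict[str, str]) -> str:
    """Concatenate the filled slots in their fixed order; unused slots stay empty."""
    unknown = set(fillers) - set(SLOTS)
    if unknown:
        raise ValueError(f"unknown slot(s): {sorted(unknown)}")
    if "radical" not in fillers:
        raise ValueError("a verb needs at least a radical")
    return "".join(fillers.get(slot, "") for slot in SLOTS)


# Illustrative isiZulu forms: subject concord ngi- in the initial slot,
# -ya- in the post-initial slot, radical -hamb-, default final vowel -a.
positive = assemble_verb(
    {"initial": "ngi", "post-initial": "ya", "radical": "hamb", "final": "a"}
)
# Negation in the pre-initial slot, with the final vowel switching to -i.
negative = assemble_verb(
    {"pre-initial": "a", "initial": "ngi", "radical": "hamb", "final": "i"}
)
print(positive, negative)
```

A rule-based analyser or generator would add further constraints on top of such a template, such as which fillers may co-occur and how adjacent morphemes change at their boundaries.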
Many of the NCB languages are agglutinating and
thus have a substantial set of phonological condi-
tioning rules especially for vowel coalescence and
vowel elision. For instance, for ‘(located) in the envelope’
in isiZulu, one has to modify imvilophu ‘envelope’
with the phonologically conditioned locative
prefix (e-) and suffix (-ini) to result in emvilophini,
and the enumerative ‘and’ na- merges with the following
noun, as in, e.g., (na- + umfana =) nomfana