AfroLID: A Neural Language Identification Tool for African Languages
Ife Adebara1,* AbdelRahim Elmadany1,* Muhammad Abdul-Mageed1,2 Alcides Alcoba Inciarte1
1Deep Learning & Natural Language Processing Group, The University of British Columbia
2Department of Natural Language Processing & Department of Machine Learning, MBZUAI
{ife.adebara@,a.elmadany@,muhammad.mageed@,alcobaaj@mail.}ubc.ca
Abstract
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7,000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.1
1 Introduction
Language identification (LID) is the task of identifying the human language a piece of text or speech segment belongs to. The proliferation of social media has allowed greater access to multilingual data, making automatic LID an important first step in processing human language appropriately (Tjandra et al., 2021; Thara and Poornachandran, 2021). This includes applications in speech, sign language, handwritten text, and other modalities of language. It also includes distinguishing languages in code-mixed datasets (Abdul-Mageed et al., 2020; Thara and Poornachandran, 2021). Unfortunately, for the majority of languages in the world, including most African languages, we do not have the resources for developing LID tools.
* Authors contributed equally.
1 AfroLID is publicly available at https://github.com/UBC-NLP/afrolid.
Figure 1: All 50 African countries in our data, with our 517 languages/language varieties in colored circles overlaid within respective countries. More details are in Appendix E.
This situation has implications for future NLP technologies. For instance, LID has facilitated the development of widely multilingual models such as mT5 (Xue et al., 2021) and large multilingual datasets such as CCAligned (El-Kishky et al., 2020), ParaCrawl (Esplà et al., 2019), WikiMatrix (Schwenk et al., 2021), OSCAR (Ortiz Suárez et al., 2020), and mC4 (Xue et al., 2021), which have advanced research in NLP. Comparable resources are completely unavailable for the majority of the world's 7,000+ languages today, with only poor coverage of the so-called low-resource (LR) languages. This is partly due to the absence of LID tools, and it impedes future NLP progress on these languages (Adebara and Abdul-Mageed, 2022). The state of African languages is not any better than that of other regions: Kreutzer et al. (2021) perform a manual evaluation of 205 datasets involving African languages, such as those in CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, and show that at
least 15 corpora were completely erroneous, a significant fraction contained less than 50% correct data, and 82 corpora were mislabelled or used ambiguous language codes. These issues consequently affect the quality of models built with these datasets. Alabi et al. (2020) find that 135K out of 150K words in the fastText embeddings for Yorùbá belong to other languages such as English, French, and Arabic. New embedding models created by Alabi et al. (2020) with a curated high-quality dataset outperform off-the-shelf fastText embeddings, even though the curated data is smaller.
In addition to resource creation, the lack (or poor performance) of LID tools negatively impacts preprocessing of LR languages, since LID can be a prerequisite for determining, e.g., appropriate tokenization (Duvenhage et al., 2017a). Furthermore, some preprocessing approaches may be necessary for certain languages but may hurt performance in other languages (Adebara and Abdul-Mageed, 2022). Developing LID tools is thus vital for all NLP. In this work, we focus on LID for African languages and introduce AfroLID.
AfroLID is a neural LID tool that covers 517 African languages and language varieties2 across 14 language families. The languages covered belong to 50 African countries and are written in five diverse scripts. We show the countries covered by AfroLID in Figure 1. Examples of the different scripts involved in the 517 languages are displayed in Figure 2. To the best of our knowledge, AfroLID supports the largest subset of African languages to date. AfroLID is also usable without any end-user training, and it exploits data from a variety of domains to ensure robustness. We manually curate our clean training data, which is of special significance in low-resource settings. We show the utility of AfroLID in the wild by applying it to two Twitter datasets and compare its performance with existing LID tools that cover any number of African languages, such as CLD2 (McCandless, 2010), CLD3 (Salcianu et al., 2018), Franc, LangDetect (Shuyo, 2010), and Langid.py (Lui and Baldwin, 2012). Our results show that AfroLID consistently outperforms all other LID tools for almost all languages, and serves as the new SOTA for language identification for African languages.
2 Our dataset involves different forms that can arguably be viewed as varieties of the same language, such as Twi and Akan.

Figure 2: Examples from the five scripts in our data.

To summarize, we offer the following main contributions:
1. We develop AfroLID, a SOTA LID tool for 517 African languages and language varieties. To facilitate NLP research, we make our models publicly available.
2. We carry out a study of LID tool performance on African languages where we compare our models in controlled settings with several tools such as CLD2, CLD3, Franc, LangDetect, and Langid.py.
3. Our models exhibit highly accurate performance in the wild, as demonstrated by applying AfroLID on Twitter data.
4. We provide a wide range of controlled case studies and carry out a linguistically-motivated error analysis of AfroLID. This allows us to motivate plausible directions for future research, including potentially beyond African languages.
The rest of the paper is organized as follows: In Section 2 we discuss a number of typological features of our supported languages. We describe AfroLID's training data in Section 3. Next, we introduce AfroLID in Section 4. This includes our experimental datasets and their splits, preprocessing, vocabulary, implementation and training details, and our evaluation settings. We present the performance of AfroLID in Section 5 and compare it to other LID tools. Our analysis shows that AfroLID outperforms other models for most languages. In the same section, we also describe the utility of AfroLID on non-Latin scripts, Creole languages, and languages in close geographical proximity. Although AfroLID is not trained on Twitter data, we experiment with tweets in Section 6 in order to investigate the performance of AfroLID in out-of-domain scenarios. Through two diagnostic studies, we demonstrate AfroLID's robustness. We provide an overview of related work in Section 7. We conclude in Section 8, and outline a number of limitations of our work in Section 9.
2 Typological Information
Language Families. We experiment with 517 African languages and language varieties across 50 African countries. These languages belong to 14 language families (Eberhard et al., 2021), as follows: Afro-Asiatic, Austronesian, Creole (English based), Creole (French based), Creole (Kongo based), Creole (Ngbadi based), Creole (Portuguese based), Indo-European, Khoe-Kwadi (Hainum), Khoe-Kwadi (Nama), Khoe-Kwadi (Southwest), Niger-Congo, and Nilo-Saharan. The large and typologically diverse data we exploit hence endows our work with wide coverage. We show in Figure 1 a map of Africa with the countries AfroLID covers. We also show the number of languages we cover, per country, in Figure E in the Appendix. Table E.1, Table E.2, and Table E.3 in the Appendix also provide a list of the languages AfroLID handles. We represent the languages using ISO-3 codes3 for both individual languages and macro-languages. We use a macro-language tag when the language is known but the specific dialect is unknown. For this reason, we specify that AfroLID supports 517 African languages and language varieties.
Sentential Word Order. There are seven categories of word order across human languages around the world. These are subject-verb-object (SVO), subject-object-verb (SOV), object-verb-subject (OVS), object-subject-verb (OSV), verb-object-subject (VOS), verb-subject-object (VSO), and languages lacking a dominant order (which often combine two or more orders within their grammar) (Dryer and Haspelmath, 2013). Again, our dataset is very diverse: we cover five out of these seven types of word order. Table 1 shows sentential word order in our data, with some representative languages for each category.
Word Order Example Languages
SVO Xhosa, Zulu, Yorùbá
SOV Khoekhoe, Somali, Amharic
VSO Murle, Kalenjin
VOS Malagasy
No dominant order Siswati, Nyamwezi, Bassa
Table 1: Sentential word order in our data.

Diacritics. Diacritic marks are used to overcome the inadequacies of an alphabet in capturing important linguistic information by adding a distinguishing mark to a character in an alphabet. Diacritics are often used to indicate tone, length, case, nasalization, or even to distinguish different letters of a language's alphabet (Wells, 2000; Hyman, 2003; Creissels et al., 2008). Diacritics can be placed above, below, or through a character. Diacritics are common features of the orthographies of African languages. Out of the 517 languages/language varieties in our training data, 295 use some diacritics in their orthographies. We also provide a list of languages with diacritics in our training data in Table C.3 in the Appendix.

3 https://glottolog.org/glottolog/language.
Script Languages
Ethiopic Amharic, Basketo, Maale, Oromo*, Sebat Bet Gurage, Tigrinya, Xamtanga
Arabic Fulfude Adamawa, Fulfude Caka, Tarifit
Vai Vai
Coptic Coptic
Table 2: Non-Latin scripts in AfroLID data. *Oromo is available in Latin script as well.
Scripts. Our dataset consists of 14 languages written in four different non-Latin scripts and 499 languages written in Latin script. The non-Latin scripts are Ethiopic, Arabic, Vai, and Coptic.
3 Curating an African Language Dataset
AfroLID is trained using a multi-domain, multi-script language identification dataset that we manually curated for building our tool. To collect the dataset, we perform an extensive manual analysis of African language presence on the web, identifying as much publicly available data as possible from the 517 language varieties we treat. We adopt this manual curation approach since only a few African languages have any LID tool coverage. In addition, available LID tools that treat African languages tend to perform unreliably (Kreutzer et al., 2021). We therefore consult research papers that focus on African languages, such as Adebara and Abdul-Mageed (2022), or that provide language data (Muhammad et al., 2022; Alabi et al., 2020), sifting through references to find additional African data sources. Moreover, we search for newspapers across all 54 African countries.4 We also collect data from social media such as blogs and web fora written in African languages, as well as databases that store African language data. These include LANAFRICA, SADiLaR, Masakhane, Niger-Volta-LTI, and ALTI. Our resulting multi-domain dataset contains religious texts, government documents, health documents, crawls from curated web pages, news articles, and existing human-identified datasets for African languages. As an additional sanity check, we ask a number of native speakers from a subset of the languages to verify the correctness of the self-labels assigned in respective sources within our collections.5 Our manual inspection step gave us confidence about the quality of our dataset, providing near-perfect agreement by native speakers with labels from data sources. In total, we collect 100 million sentences in 528 languages across 14 language families in Africa and select the 517 languages that have at least 2,000 sentences. Again, the dataset has various orthographic scripts, including 499 languages in Latin script, eight languages in Ethiopic script, four languages in Arabic script, one language in Vai script, and one in Coptic script.
4 AfroLID
Experimental Dataset and Splits. From our manually-curated dataset, we randomly select 5,000, 50, and 100 sentences for train, development, and test, respectively, for each language.6 Overall, AfroLID data comprises 2,496,980 sentences for training (Train), 25,850 for development (Dev), and 51,400 for test (Test) for 517 languages and language varieties.
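To make the split procedure concrete, the following is a minimal sketch of a per-language random split, assuming each language's sentences are stored in a plain-text file named by its ISO-3 code; the file layout and function names are illustrative, not the authors' actual pipeline.

```python
import random
from pathlib import Path

def split_language(path: Path, n_train=5000, n_dev=50, n_test=100, seed=42):
    """Randomly split one language's sentences into train/dev/test."""
    sentences = [s for s in path.read_text(encoding="utf-8").splitlines() if s.strip()]
    random.Random(seed).shuffle(sentences)
    test = sentences[:n_test]
    dev = sentences[n_test:n_test + n_dev]
    train = sentences[n_test + n_dev:n_test + n_dev + n_train]
    return train, dev, test

# Build the three splits for every language file in a (hypothetical) data directory.
splits = {"train": [], "dev": [], "test": []}
for lang_file in Path("data/langs").glob("*.txt"):   # e.g. data/langs/yor.txt
    train, dev, test = split_language(lang_file)
    iso = lang_file.stem
    splits["train"] += [(iso, s) for s in train]
    splits["dev"]   += [(iso, s) for s in dev]
    splits["test"]  += [(iso, s) for s in test]
```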
Preprocessing. We ensure that our data represent naturally occurring text by performing only minimal preprocessing. Specifically, we tokenize our data into characters, byte-pairs, and words. We do not remove diacritics, and we use both precomposed and decomposed characters to cater for the inconsistent use of precomposed and decomposed characters by many African languages in digital media.7 We create our character-level tokenization scripts and generate our vocabulary using Fairseq. We use the SentencePiece tokenizer for the word-level and byte-pair tokens before we preprocess in Fairseq.

4 https://www.worldometers.info/geography/how-many-countries-in-africa/.
5 We had access to native speakers of Afrikaans, Yorùbá, Igbo, Hausa, Luganda, Kinyarwanda, Chichewa, Shona, Somali, Swahili, Xhosa, Bemba, and Zulu.
6 We remove languages with fewer than 2,000 sentences, as explained earlier.
7 A Unicode entity that combines two or more other characters may be precomposed or decomposed. For example, ä can be decomposed into U+0061 U+0308 or precomposed as U+00E4. Precomposed characters are included in Unicode primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.
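As a concrete illustration of the precomposed/decomposed distinction in footnote 7, the snippet below uses Python's standard unicodedata module to show both forms of ä; it is only an illustration of the Unicode behaviour, not part of the AfroLID pipeline.

```python
import unicodedata

precomposed = "\u00E4"        # ä as a single code point (U+00E4)
decomposed = "\u0061\u0308"   # 'a' followed by a combining diaeresis (U+0061 U+0308)

print(precomposed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True after decomposition
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True after composition

# Keeping both forms in the training data (rather than normalizing) exposes the model
# to the same inconsistency that occurs in African-language text on the web.
```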
Vocabulary. We experiment with byte-pair (BPE), word, and character level encodings. We use vocabulary sizes of 64K, 100K, and 2,260 for the BPE, word, and character level models, respectively, across the 517 language varieties. The characters include letters, diacritics, and symbols from the non-Latin scripts of the respective languages.
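For illustration, a BPE vocabulary of the size used here could be trained with the SentencePiece library roughly as follows; the file names and exact training options are assumptions, not the authors' released configuration.

```python
import sentencepiece as spm

# Train a 64K BPE model on the combined multilingual training text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train_all_languages.txt",   # assumed file with all 517 languages mixed
    model_prefix="afrolid_bpe",
    vocab_size=64000,
    model_type="bpe",
    character_coverage=1.0,            # keep rare characters, e.g. diacritics and non-Latin scripts
)

sp = spm.SentencePieceProcessor(model_file="afrolid_bpe.model")
print(sp.encode("Báwo ni o ṣe wà?", out_type=str))   # subword pieces for a Yorùbá sentence
```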
Figure 3: F1 distribution on AfroLID Dev set.
Implementation. AfroLID is built using a Transformer architecture trained from scratch. We use 12 attention layers with 12 heads in each layer and 768 hidden dimensions, making up 200M parameters.8
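A minimal PyTorch sketch of a comparable encoder-classifier (12 layers, 12 heads, 768-dimensional) is shown below; it mirrors the described architecture in spirit but is not the authors' Fairseq implementation, and the vocabulary size and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class LIDClassifier(nn.Module):
    """Transformer encoder with a classification head over 517 language labels."""
    def __init__(self, vocab_size=64000, num_labels=517, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))
        pooled = hidden.mean(dim=1)                # mean-pool over the sequence
        return self.classifier(pooled)             # (batch, 517) logits

model = LIDClassifier()
logits = model(torch.randint(0, 64000, (2, 32)))   # two dummy sequences of 32 subword ids
print(logits.shape)                                 # torch.Size([2, 517])
```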
Hyperparameter Search and Training. To identify our best hyperparameters, we use a subset of our training data and the full development set for our hyperparameter search. Namely, we randomly sample 200 examples from each language in our training data to create a smaller train set,9 while using our full Dev set. We train for up to 100 epochs, with early stopping. We search over dropout rates {0.1, 0.2, 0.3, 0.4, 0.5}, learning rates {5e-5, 5e-6}, and patience values {10, 20, 30}, and find dropout 0.1, learning rate 5e-6, and patience 10 to work best. Other hyperparameters are similar to those of XLM-R (Conneau et al., 2020). We perform hyperparameter search only with our character level model and use the identified values with both the BPE and word models.
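The search itself is a small grid over three hyperparameters. A sketch of how such a sweep could be organized is shown below; train_and_evaluate is a hypothetical placeholder for a full Fairseq training run and is not part of the released code.

```python
from itertools import product

def train_and_evaluate(dropout, lr, patience, max_epochs=100):
    """Hypothetical stand-in: train on the 200-example-per-language subset with
    early stopping and return macro F1 on the full Dev set."""
    raise NotImplementedError

dropouts = [0.1, 0.2, 0.3, 0.4, 0.5]
learning_rates = [5e-5, 5e-6]
patiences = [10, 20, 30]

best = {"dev_f1": -1.0}
for dropout, lr, patience in product(dropouts, learning_rates, patiences):
    dev_f1 = train_and_evaluate(dropout=dropout, lr=lr, patience=patience)
    if dev_f1 > best["dev_f1"]:
        best = {"dropout": dropout, "lr": lr, "patience": patience, "dev_f1": dev_f1}

# The paper reports dropout=0.1, lr=5e-6, and patience=10 as the best setting.
```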
Evaluation. We report our results in both macro F1-score and accuracy, selecting our best model on Dev based on F1. For all our models, we report the average of three runs.

8 This architecture is similar to XLM-R Base (Conneau et al., 2020).
9 This helps us limit the GPU hours needed for hyperparameter search.

Figure 4: F1 distribution on AfroLID Test set.
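Macro F1 treats every language equally regardless of its test-set size; given per-sentence gold and predicted ISO-3 labels, it can be computed with scikit-learn as in the short sketch below (illustrative, not the authors' evaluation script).

```python
from sklearn.metrics import accuracy_score, f1_score

# gold and pred are lists of ISO-3 language codes, one per test sentence (toy example).
gold = ["yor", "hau", "amh", "yor", "swh"]
pred = ["yor", "hau", "amh", "ibo", "swh"]

print("accuracy:", accuracy_score(gold, pred))
print("macro F1:", f1_score(gold, pred, average="macro"))
```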
5 Model Performance and Analysis
As Table 3 shows, our BPE model outperforms both the char and word models on both Dev and Test data. On Dev, our BPE model acquires 96.14 F1 and 96.19 accuracy, compared to 85.75 F1 and 85.85 accuracy for the char model, and 90.22 F1 and 90.34 accuracy for the word model. Our BPE model similarly excels on Test, with 95.95 F1 and 96.01 accuracy. We inspect the distribution of F1 on the entire Dev and Test sets using our BPE model, as shown in Figures 3 and 4. As annotated in Figure 3, a total of 212 languages out of the 517 (41%) are identified with 100 F1, 197 languages (38.10%) are identified with between 95 and 99 F1, and 69 languages (13.30%) are identified with between 90 and 95 F1. For Test data (Figure 4), on the other hand, 128 languages (24.75%) are identified with 100 F1, 299 languages (57.83%) are between 95 and 99 F1, while 56 languages (10.83%) are between 90 and 95 F1.
Model Split F1-score Accuracy Checkpoint
Char Dev 85.75 85.85 69
Test 81.20 81.30
BPE Dev 96.14 96.19 73
Test 95.95 96.01
Word Dev 90.22 90.34 65
Test 89.04 89.01
Table 3: Results on the BPE, word level, and character level models. Bolded: best result on Test. Underlined: best result on Dev.
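The per-language breakdown behind Figures 3 and 4 can be derived from per-class F1 scores. Below is a small illustrative sketch of that bucketing, assuming gold and pred are per-sentence ISO-3 label lists as above; it is not the authors' analysis code.

```python
from collections import Counter
from sklearn.metrics import f1_score

def f1_buckets(gold, pred):
    """Count how many languages fall into each F1 band (scores on a 0-100 scale)."""
    labels = sorted(set(gold))
    per_class = f1_score(gold, pred, labels=labels, average=None) * 100
    buckets = Counter()
    for score in per_class:
        if score == 100:
            buckets["100"] += 1
        elif score >= 95:
            buckets["95-99"] += 1
        elif score >= 90:
            buckets["90-95"] += 1
        else:
            buckets["<90"] += 1
    return buckets
```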
AfroLID in Comparison. Using our Dev and Test data, we compare our best AfroLID model (the BPE model) with the following LID tools: CLD2, CLD3, Franc, LangDetect, and Langid.py. Since these tools do not support all our AfroLID languages, we compare accuracy and F1-scores of our models only on languages supported by each of these tools. As Tables A.1 and 4 show, AfroLID outperforms other tools on 7 and 8 languages out of 16 languages on the Dev set and Test set, respectively. We also compare the F1-scores of Franc on the 88 African languages Franc supports with the F1-scores of AfroLID on those languages. As shown in Tables 5 and 6, AfroLID outperforms Franc on 78 languages and has a similar F1-score on five languages on the Dev set. AfroLID also outperforms Franc on 76 languages, and has a similar F1-score on five languages on the Test set.
Lang. CLD2 CLD3 Langid.py LangDetect Franc AfroLID
afr 94.00 91.00 69.00 88.23 81.00 97.00
amh – 97.00 100.00 – 35.00 97.00
hau – 83.00 – – 77.00 88.00
ibo – 96.00 – – 88.00 97.00
kin 92.00 – 45.00 – 47.00 89.00
lug 84.00 – – – 64.00 87.00
mlg – 100.00 98.00 – – 100.00
nya – 96.00 – – 75.00 92.00
sna – 100.00 – – 91.00 97.00
som – 92.00 – – 89.00 95.00
sot – 99.00 – – 93.00 88.00
swa 99.00 91.00 90.00 100.00 – 92.00
swc 93.00 94.00 96.00 97.02 – 87.00
swh 89.00 92.00 88.23 87.19 70.00 77.00
xho – 59.00 88.00 – 30.00 67.00
yor – 25.00 – – 66.00 98.00
zul – 89.00 20.00 – 40.00 50.00
Table 4: A comparison of results on AfroLID with CLD2, CLD3, Langid.py, LangDetect, and Franc using F1-score on the Test set. – indicates that the tool does not support the language.
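For reference, the off-the-shelf tools in Table 4 expose simple classification calls. For example, langid.py returns a two-letter language code and a score, which can then be mapped to the ISO-3 codes used here; the language list and mapping dictionary below are a hypothetical excerpt based on Table 4, not a complete table.

```python
import langid

# Constrain langid.py to (some of) the African languages it supports, per Table 4.
langid.set_languages(["af", "am", "mg", "rw", "sw", "xh", "zu"])

iso1_to_iso3 = {"af": "afr", "am": "amh", "mg": "mlg", "rw": "kin",
                "sw": "swa", "xh": "xho", "zu": "zul"}

code, score = langid.classify("Habari za asubuhi, karibu sana.")   # a Swahili sentence
print(iso1_to_iso3.get(code, code), score)
```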
Effect of Non-Latin Script. We investigate the performance of AfroLID on languages that use one of the Arabic, Ethiopic, Vai, and Coptic scripts. Specifically, we investigate the performance of AfroLID on Amharic (amh), Basketo (bst), Maale (mdy), Sebat Bet Gurage (sgw), Tigrinya (tir), Xamtanga (xan), Fulfude Adamawa (fub), Fulfude Caka (fuv), Tarifit (rif), Vai (vai), and Coptic (cop).10 Vai and Coptic, the two unique scripts in AfroLID, have an F1-score of 100 each. This corroborates research findings that languages written in unique scripts within an LID tool can be identified with up to 100% recall, F1-score, and/or accuracy even using a small training dataset (Jauhiainen et al., 2017a). We assume this to be the reason Langid.py outperforms AfroLID on Amharic, as seen in Table 4, since Amharic is the only language that employs an Ethiopic script in Langid.py. AfroLID, on the other hand, has 8 languages using Ethiopic scripts. However, it is not clear why Basketo, which uses an Ethiopic script, has a 100 F1-score. We, how-

10 We do not investigate performance on Oromo because we had both Latin and Ethiopic scripts for Oromo in our training data.