AfroLID: A Neural Language Identification Tool for African Languages
Ife Adebara1,* AbdelRahim Elmadany1,* Muhammad Abdul-Mageed1,2 Alcides Alcoba Inciarte1
1Deep Learning & Natural Language Processing Group, The University of British Columbia
2Department of Natural Language Processing & Department of Machine Learning, MBZUAI
{ife.adebara@,a.elmadany@,muhammad.mageed@,alcobaaj@mail.}ubc.ca
Abstract
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7,000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.1
1 Introduction
Language identification (LID) is the task of identifying the human language a piece of text or speech segment belongs to. The proliferation of social media has allowed greater access to multilingual data, making automatic LID an important first step in processing human language appropriately (Tjandra et al., 2021; Thara and Poornachandran, 2021). This includes applications in speech, sign language, handwritten text, and other modalities of language. It also includes distinguishing languages in code-mixed datasets (Abdul-Mageed et al., 2020; Thara and Poornachandran, 2021). Unfortunately, for the majority of languages in the world, including most African languages, we do not have the resources for developing LID tools.
* Authors contributed equally.
1 AfroLID is publicly available at https://github.com/UBC-NLP/afrolid.
Figure 1: All 50 African countries in our data, with our 517 languages/language varieties in colored circles overlaid within respective countries. More details are in Appendix E.
This situation has implications for future NLP technologies. For instance, LID has facilitated the development of widely multilingual models such as mT5 (Xue et al., 2021) and large multilingual datasets such as CCAligned (El-Kishky et al., 2020), ParaCrawl (Esplà et al., 2019), WikiMatrix (Schwenk et al., 2021), OSCAR (Ortiz Suárez et al., 2020), and mC4 (Xue et al., 2021), which have advanced research in NLP. Comparable resources are completely unavailable for the majority of the world's 7,000+ languages today, with only poor coverage of the so-called low-resource (LR) languages. This is partly due to the absence of LID tools, and it impedes future NLP progress on these languages (Adebara and Abdul-Mageed, 2022). The state of African languages is not any better than that of other regions: Kreutzer et al. (2021) perform a manual evaluation of 205 datasets involving African languages, such as those in CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, and show that at
least 15 corpora were completely erroneous, a significant fraction contained less than 50% correct data, and 82 corpora were mislabelled or used ambiguous language codes. These issues consequently affect the quality of models built with these datasets. Alabi et al. (2020) find that 135K out of 150K words in the fastText embeddings for Yorùbá belong to other languages such as English, French, and Arabic. New embedding models created by Alabi et al. (2020) with a curated high-quality dataset outperform off-the-shelf fastText embeddings, even though the curated data is smaller.
In addition to resource creation, the lack (or poor performance) of LID tools negatively impacts preprocessing of LR languages, since LID can be a prerequisite for determining, e.g., appropriate tokenization (Duvenhage et al., 2017a). Furthermore, some preprocessing approaches may be necessary for certain languages but may hurt performance in other languages (Adebara and Abdul-Mageed, 2022). Developing LID tools is thus vital for all NLP. In this work, we focus on LID for African languages and introduce AfroLID.
AfroLID is a neural LID tool that covers 517 African languages and language varieties2 across 14 language families. The languages covered belong to 50 African countries and are written in five diverse scripts. We show the countries covered by AfroLID in Figure 1. Examples of the different scripts involved in the 517 languages are displayed in Figure 2. To the best of our knowledge, AfroLID supports the largest subset of African languages to date. AfroLID is also usable without any end-user training, and it exploits data from a variety of domains to ensure robustness. We manually curate our clean training data, which is of special significance in low-resource settings. We show the utility of AfroLID in the wild by applying it to two Twitter datasets and compare its performance with existing LID tools that cover any number of African languages, such as CLD2 (McCandless, 2010), CLD3 (Salcianu et al., 2018), Franc, LangDetect (Shuyo, 2010), and Langid.py (Lui and Baldwin, 2012). Our results show that AfroLID consistently outperforms all other LID tools for almost all languages, and serves as the new SOTA for language identification for African languages.
2 Our dataset involves different forms that can arguably be viewed as varieties of the same language, such as Twi and Akan.

Figure 2: Examples from the five scripts in our data.

To summarize, we offer the following main contributions:
1. We develop AfroLID, a SOTA LID tool for 517 African languages and language varieties. To facilitate NLP research, we make our models publicly available.
2. We carry out a study of LID tool performance on African languages where we compare our models in controlled settings with several tools such as CLD2, CLD3, Franc, LangDetect, and Langid.py.
3. Our models exhibit highly accurate performance in the wild, as demonstrated by applying AfroLID on Twitter data.
4. We provide a wide range of controlled case studies and carry out a linguistically-motivated error analysis of AfroLID. This allows us to motivate plausible directions for future research, including potentially beyond African languages.
The rest of the paper is organized as follows: In Section 2 we discuss a number of typological features of our supported languages. We describe AfroLID's training data in Section 3. Next, we introduce AfroLID in Section 4. This includes our experimental datasets and their splits, preprocessing, vocabulary, implementation and training details, and our evaluation settings. We present the performance of AfroLID in Section 5 and compare it to other LID tools. Our analysis shows that AfroLID outperforms other models for most languages. In the same section, we also describe the utility of AfroLID on non-Latin scripts, Creole languages, and languages in close geographical proximity. Although AfroLID is not trained on Twitter data, we experiment with tweets in Section 6 in order to investigate the performance of AfroLID in out-of-domain scenarios. Through two diagnostic studies, we demonstrate AfroLID's robustness. We provide an overview of related work in Section 7. We conclude in Section 8, and outline a number of limitations of our work in Section 9.
2 Typological Information
Language Families. We experiment with 517 African languages and language varieties across 50 African countries. These languages belong to 14 language families (Eberhard et al., 2021), as follows: Afro-Asiatic, Austronesian, Creole (English based), Creole (French based), Creole (Kongo based), Creole (Ngbadi based), Creole (Portuguese based), Indo-European, Khoe-Kwadi (Hainum), Khoe-Kwadi (Nama), Khoe-Kwadi (Southwest), Niger-Congo, and Nilo-Saharan. The large and typologically diverse data we exploit hence endows our work with wide coverage. We show in Figure 1 a map of Africa with the countries AfroLID covers. We also show the number of languages we cover, per country, in Figure E in the Appendix. Table E.1, Table E.2, and Table E.3 in the Appendix also provide a list of the languages AfroLID handles. We represent the languages using ISO-3 codes3 for both individual languages and macro-languages. We use a macro-language tag when the language is known but the specific dialect is unknown. For this reason, we specify that AfroLID supports 517 African languages and language varieties.
Sentential Word Order. There are seven categories of word order across human languages around the world. These are subject-verb-object (SVO), subject-object-verb (SOV), object-verb-subject (OVS), object-subject-verb (OSV), verb-object-subject (VOS), verb-subject-object (VSO), and languages lacking a dominant order (which often combine two or more orders within their grammar) (Dryer and Haspelmath, 2013). Again, our dataset is very diverse: we cover five out of these seven types of word order. Table 1 shows sentential word order in our data, with some representative languages for each category.
Word Order Example Languages
SVO Xhosa, Zulu, Yorùbá
SOV Khoekhoe, Somali, Amharic
VSO Murle, Kalenjin
VOS Malagasy
No dominant order Siswati, Nyamwezi, Bassa
Table 1: Sentential word order in our data.

Diacritics. Diacritic marks are used to overcome the inadequacies of an alphabet in capturing important linguistic information by adding a distinguishing mark to a character in an alphabet. Diacritics are often used to indicate tone, length, case, nasalization, or even to distinguish different letters of a language's alphabet (Wells, 2000; Hyman, 2003; Creissels et al., 2008). Diacritics can be placed above, below, or through a character. Diacritics are common features of the orthographies of African languages. Out of the 517 languages/language varieties in our training data, 295 use some diacritics in their orthographies. We also provide a list of languages with diacritics in our training data in Table C.3 in the Appendix.

3 https://glottolog.org/glottolog/language.
Script Languages
Ethiopic Amharic, Basketo, Maale, Oromo*, Sebat Bet Gurage, Tigrinya, Xamtanga
Arabic Fulfude Adamawa, Fulfude Caka, Tarifit
Vai Vai
Coptic Coptic
Table 2: Non-Latin scripts in AfroLID data. *Oromo is available in Latin script as well.
Scripts. Our dataset consists of 14 languages written in four different non-Latin scripts and 499 languages written in Latin script. The non-Latin scripts are Ethiopic, Arabic, Vai, and Coptic.
3 Curating an African Language Dataset
AfroLID is trained using a multi-domain, multi-script language identification dataset that we manually curated for building our tool. To collect the dataset, we perform an extensive manual analysis of African language presence on the web, identifying as much publicly available data as possible from the 517 language varieties we treat. We adopt this manual curation approach since only a few African languages have any LID tool coverage. In addition, available LID tools that treat African languages tend to perform unreliably (Kreutzer et al., 2021). We therefore consult research papers that focus on African languages, such as Adebara and Abdul-Mageed (2022), or that provide language data (Muhammad et al., 2022; Alabi et al., 2020), sifting through references to find additional African data sources. Moreover, we search for newspapers across all 54 African countries.4 We also collect data from social media such as blogs and web fora written in African languages, as well as databases that store African language data. These include LANAFRICA, SADiLaR, Masakhane, Niger-Volta-LTI, and ALTI. Our resulting multi-domain dataset contains religious texts, government documents, health documents, crawls from curated web pages, news articles, and existing human-identified datasets for African languages. As an additional sanity check, we ask a number of native speakers from a subset of the languages to verify the correctness of the self-labels assigned in respective sources within our collections.5 Our manual inspection step gave us confidence about the quality of our dataset, providing near-perfect agreement by native speakers with labels from data sources. In total, we collect 100 million sentences in 528 languages across 14 language families in Africa and select the 517 languages that have at least 2,000 sentences. Again, the dataset has various orthographic scripts, including 499 languages in Latin script, eight languages in Ethiopic script, four languages in Arabic script, one language in Vai script, and one in Coptic script.
4 AfroLID
Experimental Dataset and Splits. From our manually-curated dataset, we randomly select 5,000, 50, and 100 sentences for train, development, and test, respectively, for each language.6 Overall, AfroLID data comprises 2,496,980 sentences for training (Train), 25,850 for development (Dev), and 51,400 for test (Test) for 517 languages and language varieties.
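To make the split procedure concrete, the following is a minimal sketch of a per-language random split, assuming each language's sentences are stored in a plain-text file named by its ISO-3 code; the file layout and function names are illustrative, not the authors' actual pipeline.

```python
import random
from pathlib import Path

def split_language(path: Path, n_train=5000, n_dev=50, n_test=100, seed=42):
    """Randomly split one language's sentences into train/dev/test."""
    sentences = [s for s in path.read_text(encoding="utf-8").splitlines() if s.strip()]
    random.Random(seed).shuffle(sentences)
    test = sentences[:n_test]
    dev = sentences[n_test:n_test + n_dev]
    train = sentences[n_test + n_dev:n_test + n_dev + n_train]
    return train, dev, test

# Build the three splits for every language file in a (hypothetical) data directory.
splits = {"train": [], "dev": [], "test": []}
for lang_file in Path("data/langs").glob("*.txt"):   # e.g. data/langs/yor.txt
    train, dev, test = split_language(lang_file)
    iso = lang_file.stem
    splits["train"] += [(iso, s) for s in train]
    splits["dev"]   += [(iso, s) for s in dev]
    splits["test"]  += [(iso, s) for s in test]
```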
Preprocessing. We ensure that our data represent naturally occurring text by performing only minimal preprocessing. Specifically, we tokenize our data into characters, byte-pairs, and words. We do not remove diacritics, and we use both precomposed and decomposed characters to cater for the inconsistent use of precomposed and decomposed characters by many African languages in digital media.7 We create our character-level tokenization scripts and generate our vocabulary using Fairseq. We use the SentencePiece tokenizer for the word-level and byte-pair tokens before we preprocess in Fairseq.

4 https://www.worldometers.info/geography/how-many-countries-in-africa/.
5 We had access to native speakers of Afrikaans, Yorùbá, Igbo, Hausa, Luganda, Kinyarwanda, Chichewa, Shona, Somali, Swahili, Xhosa, Bemba, and Zulu.
6 We remove languages with fewer than 2,000 sentences, as explained earlier.
7 A Unicode entity that combines two or more other characters may be precomposed or decomposed. For example, ä can be decomposed into U+0061 U+0308 or precomposed as U+00E4. Precomposed characters are included in Unicode primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.
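As a concrete illustration of the precomposed/decomposed distinction in footnote 7, the snippet below uses Python's standard unicodedata module to show both forms of ä; it is only an illustration of the Unicode behaviour, not part of the AfroLID pipeline.

```python
import unicodedata

precomposed = "\u00E4"        # ä as a single code point (U+00E4)
decomposed = "\u0061\u0308"   # 'a' followed by a combining diaeresis (U+0061 U+0308)

print(precomposed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True after decomposition
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True after composition

# Keeping both forms in the training data (rather than normalizing) exposes the model
# to the same inconsistency that occurs in African-language text on the web.
```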
Vocabulary. We experiment with byte-pair (BPE), word, and character level encodings. We use vocabulary sizes of 64K, 100K, and 2,260 for the BPE, word, and character level models, respectively, across the 517 language varieties. The characters include letters, diacritics, and symbols from the non-Latin scripts of the respective languages.
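For illustration, a BPE vocabulary of the size used here could be trained with the SentencePiece library roughly as follows; the file names and exact training options are assumptions, not the authors' released configuration.

```python
import sentencepiece as spm

# Train a 64K BPE model on the combined multilingual training text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train_all_languages.txt",   # assumed file with all 517 languages mixed
    model_prefix="afrolid_bpe",
    vocab_size=64000,
    model_type="bpe",
    character_coverage=1.0,            # keep rare characters, e.g. diacritics and non-Latin scripts
)

sp = spm.SentencePieceProcessor(model_file="afrolid_bpe.model")
print(sp.encode("Báwo ni o ṣe wà?", out_type=str))   # subword pieces for a Yorùbá sentence
```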
Figure 3: F1 distribution on AfroLID Dev set.
Implementation. AfroLID is built using a Transformer architecture trained from scratch. We use 12 attention layers with 12 heads in each layer and 768 hidden dimensions, making up 200M parameters.8
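A minimal PyTorch sketch of a comparable encoder-classifier (12 layers, 12 heads, 768-dimensional) is shown below; it mirrors the described architecture in spirit but is not the authors' Fairseq implementation, and the vocabulary size and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class LIDClassifier(nn.Module):
    """Transformer encoder with a classification head over 517 language labels."""
    def __init__(self, vocab_size=64000, num_labels=517, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))
        pooled = hidden.mean(dim=1)                # mean-pool over the sequence
        return self.classifier(pooled)             # (batch, 517) logits

model = LIDClassifier()
logits = model(torch.randint(0, 64000, (2, 32)))   # two dummy sequences of 32 subword ids
print(logits.shape)                                 # torch.Size([2, 517])
```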
Hyperparameter Search and Training. To identify our best hyperparameters, we use a subset of our training data and the full development set for our hyperparameter search. Namely, we randomly sample 200 examples from each language in our training data to create a smaller train set,9 while using our full Dev set. We train for up to 100 epochs, with early stopping. We search over dropout rates {0.1, 0.2, 0.3, 0.4, 0.5}, learning rates {5e-5, 5e-6}, and patience values {10, 20, 30}, and find dropout 0.1, learning rate 5e-6, and patience 10 to work best. Other hyperparameters are similar to those of XLM-R (Conneau et al., 2020). We perform hyperparameter search only with our character level model and use the identified values with both the BPE and word models.
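The search itself is a small grid over three hyperparameters. A sketch of how such a sweep could be organized is shown below; train_and_evaluate is a hypothetical placeholder for a full Fairseq training run and is not part of the released code.

```python
from itertools import product

def train_and_evaluate(dropout, lr, patience, max_epochs=100):
    """Hypothetical stand-in: train on the 200-example-per-language subset with
    early stopping and return macro F1 on the full Dev set."""
    raise NotImplementedError

dropouts = [0.1, 0.2, 0.3, 0.4, 0.5]
learning_rates = [5e-5, 5e-6]
patiences = [10, 20, 30]

best = {"dev_f1": -1.0}
for dropout, lr, patience in product(dropouts, learning_rates, patiences):
    dev_f1 = train_and_evaluate(dropout=dropout, lr=lr, patience=patience)
    if dev_f1 > best["dev_f1"]:
        best = {"dropout": dropout, "lr": lr, "patience": patience, "dev_f1": dev_f1}

# The paper reports dropout=0.1, lr=5e-6, and patience=10 as the best setting.
```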
Evaluation. We report our results in both macro F1-score and accuracy, selecting our best model on Dev based on F1. For all our models, we report the average of three runs.

8 This architecture is similar to XLM-R Base (Conneau et al., 2020).
9 This helps us limit the GPU hours needed for hyperparameter search.

Figure 4: F1 distribution on AfroLID Test set.
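Macro F1 treats every language equally regardless of its test-set size; given per-sentence gold and predicted ISO-3 labels, it can be computed with scikit-learn as in the short sketch below (illustrative, not the authors' evaluation script).

```python
from sklearn.metrics import accuracy_score, f1_score

# gold and pred are lists of ISO-3 language codes, one per test sentence (toy example).
gold = ["yor", "hau", "amh", "yor", "swh"]
pred = ["yor", "hau", "amh", "ibo", "swh"]

print("accuracy:", accuracy_score(gold, pred))
print("macro F1:", f1_score(gold, pred, average="macro"))
```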
5 Model Performance and Analysis
As Table 3 shows, our BPE model outperforms both the char and word models on both Dev and Test data. On Dev, our BPE model acquires 96.14 F1 and 96.19 accuracy, compared to 85.75 F1 and 85.85 accuracy for the char model, and 90.22 F1 and 90.34 accuracy for the word model. Our BPE model similarly excels on Test, with 95.95 F1 and 96.01 accuracy. We inspect the distribution of F1 on the entire Dev and Test sets using our BPE model, as shown in Figures 3 and 4. As annotated in Figure 3, a total of 212 languages out of the 517 (41%) are identified with 100 F1, 197 languages (38.10%) are identified with between 95 and 99 F1, and 69 languages (13.30%) are identified with between 90 and 95 F1. For Test data (Figure 4), on the other hand, 128 languages (24.75%) are identified with 100 F1, 299 languages (57.83%) are between 95 and 99 F1, while 56 languages (10.83%) are between 90 and 95 F1.
Model Split F1-score Accuracy Checkpoint
Char Dev 85.75 85.85 69
Test 81.20 81.30
BPE Dev 96.14 96.19 73
Test 95.95 96.01
Word Dev 90.22 90.34 65
Test 89.04 89.01
Table 3: Results on the BPE, word level, and character level models. Bolded: best result on Test. Underlined: best result on Dev.
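The per-language breakdown behind Figures 3 and 4 can be derived from per-class F1 scores. Below is a small illustrative sketch of that bucketing, assuming gold and pred are per-sentence ISO-3 label lists as above; it is not the authors' analysis code.

```python
from collections import Counter
from sklearn.metrics import f1_score

def f1_buckets(gold, pred):
    """Count how many languages fall into each F1 band (scores on a 0-100 scale)."""
    labels = sorted(set(gold))
    per_class = f1_score(gold, pred, labels=labels, average=None) * 100
    buckets = Counter()
    for score in per_class:
        if score == 100:
            buckets["100"] += 1
        elif score >= 95:
            buckets["95-99"] += 1
        elif score >= 90:
            buckets["90-95"] += 1
        else:
            buckets["<90"] += 1
    return buckets
```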
AfroLID in Comparison. Using our Dev and Test data, we compare our best AfroLID model (the BPE model) with the following LID tools: CLD2, CLD3, Franc, LangDetect, and Langid.py. Since these tools do not support all our AfroLID languages, we compare accuracy and F1-scores of our models only on languages supported by each of these tools. As Tables A.1 and 4 show, AfroLID outperforms other tools on 7 and 8 languages out of 16 languages on the Dev set and Test set, respectively. We also compare the F1-scores of Franc on the 88 African languages Franc supports with the F1-scores of AfroLID on those languages. As shown in Tables 5 and 6, AfroLID outperforms Franc on 78 languages and has a similar F1-score on five languages on the Dev set. AfroLID also outperforms Franc on 76 languages, and has a similar F1-score on five languages on the Test set.
Lang. CLD2 CLD3 Langid.py LangDetect Franc AfroLID
afr 94.00 91.00 69.00 88.23 81.00 97.00
amh – 97.00 100.00 – 35.00 97.00
hau – 83.00 – – 77.00 88.00
ibo – 96.00 – – 88.00 97.00
kin 92.00 – 45.00 – 47.00 89.00
lug 84.00 – – – 64.00 87.00
mlg – 100.00 98.00 – – 100.00
nya – 96.00 – – 75.00 92.00
sna – 100.00 – – 91.00 97.00
som – 92.00 – – 89.00 95.00
sot – 99.00 – – 93.00 88.00
swa 99.00 91.00 90.00 100.00 – 92.00
swc 93.00 94.00 96.00 97.02 – 87.00
swh 89.00 92.00 88.23 87.19 70.00 77.00
xho – 59.00 88.00 – 30.00 67.00
yor – 25.00 – – 66.00 98.00
zul – 89.00 20.00 – 40.00 50.00
Table 4: A comparison of results on AfroLID with CLD2, CLD3, Langid.py, LangDetect, and Franc using F1-score on the Test set. – indicates that the tool does not support the language.
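For reference, the off-the-shelf tools in Table 4 expose simple classification calls. For example, langid.py returns a two-letter language code and a score, which can then be mapped to the ISO-3 codes used here; the language list and mapping dictionary below are a hypothetical excerpt based on Table 4, not a complete table.

```python
import langid

# Constrain langid.py to (some of) the African languages it supports, per Table 4.
langid.set_languages(["af", "am", "mg", "rw", "sw", "xh", "zu"])

iso1_to_iso3 = {"af": "afr", "am": "amh", "mg": "mlg", "rw": "kin",
                "sw": "swa", "xh": "xho", "zu": "zul"}

code, score = langid.classify("Habari za asubuhi, karibu sana.")   # a Swahili sentence
print(iso1_to_iso3.get(code, code), score)
```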
Effect of Non-Latin Script. We investigate the performance of AfroLID on languages that use one of the Arabic, Ethiopic, Vai, and Coptic scripts. Specifically, we investigate the performance of AfroLID on Amharic (amh), Basketo (bst), Maale (mdy), Sebat Bet Gurage (sgw), Tigrinya (tir), Xamtanga (xan), Fulfude Adamawa (fub), Fulfude Caka (fuv), Tarifit (rif), Vai (vai), and Coptic (cop).10 Vai and Coptic, the two unique scripts in AfroLID, have an F1-score of 100 each. This corroborates research findings that languages written in unique scripts within an LID tool can be identified with up to 100% recall, F1-score, and/or accuracy even using a small training dataset (Jauhiainen et al., 2017a). We assume this to be the reason Langid.py outperforms AfroLID on Amharic, as seen in Table 4, since Amharic is the only language that employs an Ethiopic script in Langid.py. AfroLID, on the other hand, has 8 languages using Ethiopic scripts. However, it is not clear why Basketo, which uses an Ethiopic script, has a 100 F1-score. We, how-

10 We do not investigate performance on Oromo because we had both Latin and Ethiopic scripts for Oromo in our training data.