University of Cape Town’s WMT22 System: Multilingual Machine Translation for Southern African Languages
Khalid N. Elmadani Francois Meyer Jan Buys
Department of Computer Science
University of Cape Town
{ahmkha009,myrfra008}@myuct.ac.za, jbuys@cs.uct.ac.za
Abstract
The paper describes the University of Cape Town’s submission to the constrained track of the WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African Languages. Our system is a single multilingual translation model that translates between English and 8 South / South East African languages, as well as between specific pairs of the African languages. We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training. Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.[1]

[1] Our model is available at https://github.com/Khalid-Nabigh/UCT-s-WMT22-shared-task.
1 Introduction
Southern African languages are underrepresented in NLP research, in part because most of them are low-resource languages: it is not always possible to find high-quality datasets that are large enough to train effective deep learning models (Kreutzer et al., 2021). The WMT22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages (Adelani et al., 2022) presented an opportunity to apply one of the most promising recent developments in NLP, multilingual neural machine translation, to Southern African languages. For many languages, the parallel corpora released for the shared task are the largest publicly available datasets yet. For some translation directions (e.g. between Southern African languages), no parallel corpora were previously available.
In this paper we present our submission to the shared task. Our system is a Transformer-based encoder-decoder (Vaswani et al., 2017) that translates between English and 8 South / South East African languages (Afrikaans, Northern Sotho, Shona, Swati, Tswana, Xhosa, Xitsonga, Zulu) and in 8 additional directions (Xhosa to Zulu, Zulu to Shona, Shona to Afrikaans, Afrikaans to Swati, Swati to Tswana, Tswana to Xitsonga, Xitsonga to Northern Sotho, Northern Sotho to Xhosa). We trained a single model with shared encoder and decoder parameters and a shared subword vocabulary.
We applied several methods aimed at improving translation performance in a low-resource setting. We experimented with BPE (Sennrich et al., 2016b) and overlap BPE (Patil et al., 2022), the latter of which increases the representation of low-resource language tokens in the shared subword vocabulary. We used initial multilingual and bilingual models to generate back-translated sentences (Sennrich et al., 2016a) for subsequent training.
First, we trained a model to translate between English and the 8 Southern African languages. Then we added the 8 additional translation directions and continued training. For some of these additional directions no parallel corpora were available, so we generated synthetic training data with our existing model. By downsampling some of the parallel corpora to ensure a balanced dataset, we were able to train our model effectively in the new directions, while retaining performance in the old directions.
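As a minimal sketch of such per-direction downsampling (the dictionary layout, cap value, and helper name below are illustrative assumptions, not our actual pipeline):

    import random

    def balance_corpora(corpora, cap, seed=0):
        """Downsample each parallel corpus to at most `cap` sentence pairs.

        corpora: dict mapping a direction (e.g. "eng-ssw") to a list of
        (source, target) sentence pairs. `cap` controls how strongly the
        larger corpora are cut back, so no direction dominates training.
        """
        rng = random.Random(seed)
        balanced = {}
        for direction, pairs in corpora.items():
            if len(pairs) > cap:
                pairs = rng.sample(pairs, cap)  # uniform random subsample
            balanced[direction] = pairs
        return balanced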
We describe the development of our model and report translation performance at each training stage. Our final results compare favourably to existing work with overlapping translation directions. While there is considerable disparity in performance across languages, our model nonetheless achieves results that indicate some degree of effective MT across all directions (most BLEU scores are above 10 and most chrF++ scores are above 40). We also discuss our findings regarding techniques for low-resource MT. We found overlap BPE and back-translation to improve performance for most translation directions. Furthermore, our results confirm the value of multilingual modelling, which proves critical for the lowest-resource languages.
2 Background

2.1 Multilingual Neural Machine Translation (MNMT)
Multilingual models help low-resource languages (LRLs) by leveraging the massive amount of training data available in high-resource languages (HRLs) (Aharoni et al., 2019; Zhang et al., 2020). In the context of neural machine translation, a multilingual model can translate between more than two languages. Current research in MNMT can be divided into two main areas: training language-specific parameters (Kim et al., 2019; Philip et al., 2020) and training a single massive model that shares all parameters among all languages (Fan et al., 2020; NLLB Team et al., 2022). Our work lies in the second category: we build a single multilingual translation system and explore back-translation and different vocabulary generation approaches.
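A single shared-parameter model needs some way to select the output language. A common recipe, shown here purely as an illustrative assumption rather than a description of our exact input format, is to prepend a target-language token to each source sentence:

    def add_target_tag(source_sentence, target_lang):
        """Prepend a target-language token so one shared encoder-decoder
        can serve many directions. The `__lang__` token format is an
        assumption made for this illustration."""
        return f"__{target_lang}__ {source_sentence}"

    # e.g. add_target_tag("good morning", "zul") -> "__zul__ good morning"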
2.2 Back-Translation
Given parallel sentences in two languages $A$ and $B$, denoted $(A_b, B_a)$, the goal is to train a model that translates sentences from $A$ to $B$ ($A{\rightarrow}B$). Back-translation works as follows: first, one trains a reverse ($B{\rightarrow}A$) model using the available $(A_b, B_a)$ data. Then the $B_a$ sentences are passed to this model to regenerate $A_b$. The model's output ($A'_b$) is then treated as additional synthetic parallel data $(A'_b, B_a)$. The final step of back-translation is training an ($A{\rightarrow}B$) translation model using $(A'_b, B_a)$ as parallel data. The motivation behind back-translation is that the noise added to the $A'_b$ sentences by regeneration increases the model's robustness (Edunov et al., 2018). The same approach can be extended to multilingual models (Liao et al., 2021).
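This procedure can be sketched in a few lines of Python. Here, train_model and translate are hypothetical stand-ins for an NMT toolkit's training and inference routines, not functions from our actual pipeline:

    def train_model(pairs):
        """Hypothetical stand-in for an NMT toolkit's training routine."""
        raise NotImplementedError("plug in an NMT toolkit here")

    def translate(model, sentence):
        """Hypothetical stand-in for an NMT toolkit's inference routine."""
        raise NotImplementedError("plug in an NMT toolkit here")

    def back_translation(parallel_ab):
        """Train an A -> B model with back-translated data.

        parallel_ab: list of (a_sentence, b_sentence) pairs, i.e. (A_b, B_a).
        """
        # Step 1: train the reverse B -> A model on the available data.
        model_ba = train_model([(b, a) for a, b in parallel_ab])

        # Step 2: pass the B_a sentences through the reverse model to
        # regenerate noisy synthetic sentences A'_b.
        synthetic = [(translate(model_ba, b), b) for _, b in parallel_ab]

        # Step 3: train the forward A -> B model, using the synthetic
        # (A'_b, B_a) pairs as (additional) parallel data; extra monolingual
        # B sentences can be back-translated the same way.
        return train_model(synthetic + parallel_ab)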
2.3 Overlap-based BPE (OBPE)
Byte Pair Encoding (BPE) is a vocabulary creation method that relies on $n$-gram frequency (Sennrich et al., 2016b). The starting point is a character-based vocabulary. At each step, the BPE algorithm identifies the two adjacent tokens with the highest frequency, joins them together as a single token, and adds the new token to the vocabulary. The dataset is then restructured based on the expanded vocabulary. In the case of multilingual training, a single BPE vocabulary can handle all languages by running the BPE algorithm on the union of the data from all the languages. However, when constructing a multilingual vocabulary, BPE will prefer frequent word types, most of which are from HRLs, leaving a smaller proportion of the vocabulary for words from LRLs.

Language pair   Parallel sentences (WMT22_african)
eng-sna         8.7M
eng-xho         8.6M
eng-tsn         5.9M
eng-zul         3.8M
eng-nso         3M
eng-afr         1.6M
eng-tso         630K
eng-ssw         165K
xho-zul         1M
zul-sna         1.1M
sna-afr         1.6M*
afr-ssw         165K*
ssw-tsn         85K
tsn-tso         285K
tso-nso         212K
nso-xho         200K

Table 1: Number of available parallel sentences for all language pairs. * indicates that no data is available for these pairs and the number represents the amount of synthetic data we generated.

Language family   $L_{HRL}$        $L_{LRL}$
Germanic          English (eng)    Afrikaans (afr)
Nguni             Xhosa (xho)      Zulu (zul), Swati (ssw)
Sotho-Tswana      Tswana (tsn)     Sepedi (nso)
Bantu             Shona (sna)      Xitsonga (tso)

Table 2: The languages included in our translation system, grouped by language family and whether they are used as $L_{HRL}$ or $L_{LRL}$ for the OBPE algorithm.
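As a concrete illustration of the merge loop described above, here is a toy Python version of BPE vocabulary learning in the style of Sennrich et al. (2016b); the helper names are our own, and a real system would use an optimized implementation such as subword-nmt or SentencePiece:

    import re
    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs, weighted by word frequency."""
        counts = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for left, right in zip(symbols, symbols[1:]):
                counts[(left, right)] += freq
        return counts

    def apply_merge(pair, vocab):
        """Join every occurrence of `pair`, respecting symbol boundaries."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    def learn_bpe(word_freqs, num_merges):
        # Start from a character-based segmentation of each word type.
        vocab = {" ".join(word): freq for word, freq in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            counts = pair_counts(vocab)
            if not counts:
                break
            best = max(counts, key=counts.get)  # most frequent adjacent pair
            vocab = apply_merge(best, vocab)
            merges.append(best)
        return merges

    # e.g. learn_bpe({"lower": 3, "newest": 2}, 10) returns the first ten
    # merge operations learned from those word frequencies.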
Overlap-based BPE (OBPE) is a modification of the BPE vocabulary creation algorithm which enhances overlap across related languages (Patil et al., 2022). OBPE takes into account the frequency of tokens as well as their existence across different languages. Given a list of HRLs ($L_{HRL}$) and LRLs ($L_{LRL}$), OBPE tries to balance cross-lingual sharing (tokens shared between HRLs and LRLs) against the representation of individual languages. The optimal OBPE vocabulary for a set of languages from different families is produced by considering the highest-resource language from each family as $L_{HRL}$ and the rest of the languages as $L_{LRL}$.
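As a rough illustration of how the merge criterion changes, the scoring function below interpolates between a pair's total frequency and its minimum frequency across the HRL and LRL groups, so that pairs shared by both groups are favoured. This is our own simplification, not the exact objective of Patil et al. (2022):

    def obpe_score(pair, counts_hrl, counts_lrl, alpha=0.5):
        """Overlap-aware merge score (a simplification of the OBPE idea).

        counts_hrl / counts_lrl: pair-frequency Counters computed separately
        over the HRL and LRL corpora (e.g. with pair_counts above). alpha
        controls how strongly sharing between the two groups is rewarded.
        """
        f_hrl = counts_hrl.get(pair, 0)
        f_lrl = counts_lrl.get(pair, 0)
        # The min() term is zero unless the pair occurs in BOTH groups, so
        # cross-lingually shared pairs are boosted relative to HRL-only ones.
        return (1 - alpha) * (f_hrl + f_lrl) + alpha * min(f_hrl, f_lrl)

At each merge step, learn_bpe would then pick the pair maximising this score instead of the raw frequency count.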