SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

Alireza Mohammadshahi 1,2,3,∗   Vassilina Nikoulina 1   Alexandre Berard 1
Caroline Brun 1   James Henderson 2   Laurent Besacier 1
1 NAVER LABS Europe   2 IDIAP Research Institute   3 EPFL
{first.last}@naverlabs.com
{alireza.mohammadshahi,james.henderson}@idiap.ch

∗ Work done during an internship at NAVER LABS Europe.

arXiv:2210.11621v1 [cs.CL] 20 Oct 2022
Abstract
In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the "curse of multilinguality", these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks: FLORES-101, Tatoeba, and TICO-19, and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6× smaller and 4.3× faster at inference.[1]

[1] The code and the pre-trained SMaLL-100 model are available at https://github.com/alirezamshi/small100.
1 Introduction
Neural Machine Translation (NMT) systems are usually trained on datasets consisting of millions of parallel sentences, and thus still perform poorly on low-resource languages, i.e., languages without a large amount of training data. Over the past few years, previous work has proposed several approaches to improve the quality of translations in low-resource languages, e.g., Multilingual Neural Machine Translation (MNMT) models (Johnson et al., 2017; Fan et al., 2020; Tang et al., 2021; Goyal et al., 2021), back-translation (Sennrich et al., 2016; Edunov et al., 2018), and unsupervised machine translation (Garcia et al., 2021; Ko et al., 2021). Massively MNMT models are particularly interesting for low-resource languages, as they benefit the most from knowledge transfer from related languages (Arivazhagan et al., 2019). However, the "curse of multilinguality" hurts the performance of high-resource languages, so previous work has increased the model size to maintain translation performance on both high and low-resource languages. This makes the use of these massively MNMT models challenging in real-world resource-constrained environments.
To overcome this problem, we propose SMaLL-100, a Shallow Multilingual Machine Translation Model for Low-Resource Languages covering 100 languages, which is a distilled alternative of M2M-100 (12B) (Fan et al., 2020), the most recent and biggest available multilingual NMT model. In this paper, we focus on very-low and low-resource language pairs, as there is no reasonable-size universal model that achieves acceptable performance over a great number of low-resource languages. We do so by training SMaLL-100 on a perfectly balanced dataset.[2] While this leads to lower performance on the high-resource languages, we claim that this loss is easily recoverable through further fine-tuning. We evaluate SMaLL-100 on different low-resource benchmarks, e.g., FLORES-101 (Goyal et al., 2021), Tatoeba (Tiedemann, 2020), and TICO-19 (Anastasopoulos et al., 2020).

[2] All language pairs have the same sampling probability, regardless of their training data size.
To summarize, our contributions are as follows:

• We propose SMaLL-100, a shallow multilingual NMT model focusing on low-resource language pairs.
• We evaluate SMaLL-100 on several low-resource NMT benchmarks.
• We show that our model significantly outperforms previous multilingual models of comparable size while being faster at inference. Additionally, it achieves comparable results to the M2M-100 (1.2B) model, with 4.3× faster inference and a 3.6× smaller size.
• While SMaLL-100 reaches 87.2% of the performance of the 12B teacher model, we show that this gap can be closed with a few fine-tuning steps, both for low and high-resource languages.
2 Model and Training
2.1 SMaLL-100 Architecture
Kasai et al. (2021) have shown that deep-encoder/shallow-decoder architectures can achieve good translation quality while being significantly faster at inference, and Berard et al. (2021) have confirmed that this also holds for multilingual NMT. Here, we use a 12-layer Transformer encoder (Vaswani et al., 2017) and a 3-layer decoder. Table 8 in Appendix B reports further details of the SMaLL-100 architecture. Unlike the M2M-100 model, we use language codes on the encoder side, as this has been shown to perform better with shallow-decoder architectures (Berard et al., 2021).
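To make the deep-encoder/shallow-decoder layout concrete, below is a minimal PyTorch sketch, not the authors' implementation: the 12/3 layer split follows the text above, while the model width, number of heads, feed-forward size, and vocabulary size are illustrative assumptions rather than the values in Table 8.

```python
import torch
import torch.nn as nn

class DeepEncoderShallowDecoder(nn.Module):
    """Sketch of a 12-encoder / 3-decoder Transformer for multilingual MT.
    Hidden sizes and vocabulary size are illustrative assumptions;
    positional encodings are omitted for brevity."""

    def __init__(self, vocab_size=128_000, d_model=1024, n_heads=16,
                 ffn_dim=4096, enc_layers=12, dec_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=n_heads,
            num_encoder_layers=enc_layers,   # deep encoder
            num_decoder_layers=dec_layers,   # shallow decoder
            dim_feedforward=ffn_dim,
            batch_first=True,
        )
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens are assumed to already carry the language code on the
        # encoder side, as described in Section 2.1 (token prep not shown).
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt.device)
        hidden = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.output_proj(hidden)
```

In practice SMaLL-100 is not trained from scratch with this recipe alone; it is distilled from M2M-100 as described next.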
2.2 Training Strategy
SMaLL-100 is trained with a combination of two loss functions: a standard cross-entropy loss (CE) and a knowledge distillation loss (KD). Given a source sequence X and a gold target translation Y = (y_0, ..., y_m), the CE loss is calculated as:

\mathcal{L}_{ce} = -\sum_{j=0}^{m} \sum_{z=1}^{|K|} \mathbb{1}\{y_j = z\} \log p(y_j = z \mid y_{<j}, X, \theta_S)    (1)

where |K| is the target vocabulary size, 1{·} is the indicator function, θ_S denotes the parameters of the student model, and p(·) is its conditional probability.
We additionally use a word-level distillation loss, which is the Kullback-Leibler divergence between the output distributions of the student and teacher models (Hu et al., 2018). Specifically, it is calculated as:

\mathcal{L}_{kd} = -\sum_{j=0}^{m} \sum_{z=1}^{|K|} q(y_j = z \mid y_{<j}, X, \theta_T) \log p(y_j = z \mid y_{<j}, X, \theta_S)    (2)

where θ_T denotes the parameters of the teacher model and q(·) is the teacher's conditional probability.
Resource Type    Criteria
Very-Low         |K| ≤ 100K
Low              100K < |K| ≤ 1M
Medium           1M < |K| ≤ 100M
High             100M < |K|

Table 1: The criteria used to split languages into the different resource categories. |K| is the amount of training data to/from English.
The total loss is computed as:

\mathcal{L}_{total} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{kd}    (3)

where α is a trainable parameter.
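As an illustration of Equations 1-3 (not the authors' training code), the following PyTorch sketch computes the combined objective from student and teacher logits; the α weight and padding index are placeholder values. Minimizing the teacher-weighted term of Equation 2 is equivalent to minimizing KL(q ‖ p), since the teacher entropy is constant with respect to the student.

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, targets,
                           alpha=1.0, pad_id=1):
    """Combined CE + word-level KD loss, a sketch of Eq. 1-3.

    student_logits, teacher_logits: (batch, tgt_len, |K|)
    targets: (batch, tgt_len) gold token ids
    """
    log_p = F.log_softmax(student_logits, dim=-1)   # student log-probs
    with torch.no_grad():
        q = F.softmax(teacher_logits, dim=-1)       # teacher probs (no grad)

    # Eq. 1: standard cross-entropy against the gold tokens
    ce = F.nll_loss(log_p.transpose(1, 2), targets, ignore_index=pad_id)

    # Eq. 2: teacher-weighted term, averaged over non-padding positions
    mask = targets.ne(pad_id).float()
    kd = -(q * log_p).sum(dim=-1)
    kd = (kd * mask).sum() / mask.sum()

    # Eq. 3: total loss
    return ce + alpha * kd
```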
2.3 Training Data
Our training data includes parallel sentences from the CCMatrix (Schwenk et al., 2019) and CCAligned (El-Kishky et al., 2020) datasets, which are part of the training data used by Fan et al. (2020) to train the M2M-100 models. As our goal is to maintain the performance of low-resource languages, we balance the training data across all language pairs; specifically, 100K sentence pairs are sampled for each language pair.[3] As a result, our training data contains nearly 456M parallel sentences, which is less than 6% of the original data on which M2M-100 (Fan et al., 2020) was trained. We use the same languages as M2M-100.

[3] For language pairs with less than 100K sentence pairs, we repeat their data. We randomly select 100K sentences for language pairs with more than 100K training sentences.
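The balancing scheme amounts to uniform sampling with upsampling of small corpora; a minimal sketch (with illustrative function and variable names) is shown below.

```python
import random

def balance_corpus(pairs_by_lang, target_size=100_000, seed=0):
    """Return exactly target_size sentence pairs per language pair:
    corpora larger than the target are randomly subsampled, smaller
    ones are repeated (upsampled) and then trimmed."""
    rng = random.Random(seed)
    balanced = {}
    for lang_pair, pairs in pairs_by_lang.items():
        if len(pairs) >= target_size:
            balanced[lang_pair] = rng.sample(pairs, target_size)
        else:
            repeats = -(-target_size // len(pairs))  # ceil(target / n)
            balanced[lang_pair] = (pairs * repeats)[:target_size]
    return balanced
```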
3 Experimental Setup
3.1 Evaluation Benchmarks
FLORES-101 is a multilingual NMT benchmark containing 3,001 sentences from different domains, derived from English Wikipedia. The sentences are translated into 101 languages by human translators (Goyal et al., 2021). It mostly includes low and medium-resource languages. We use the devtest subset for evaluation.
Tatoeba is a crowd-sourced collection of user-provided translations in different languages (Tiedemann, 2020). We choose a subset of languages from the test set of the Tatoeba Challenge[4] that are covered by M2M-100.

[4] https://github.com/Helsinki-NLP/Tatoeba-Challenge
Model | params | Speed | VL2VL | VL2L | VL2M | VL2H | L2VL | L2L | L2M | L2H | M2VL | M2L | H2VL | H2L | AVG

FLORES-101
FLORES-124 | 175M | 5.3× | 3.3 | 3.4 | 6.0 | 7.8 | 3.7 | 3.1 | 6.9 | 8.8 | 6.9 | 5.2 | 8.1 | 6.0 | 5.8
M2M-100 | 418M | 3.1× | 4.3 | 3.7 | 7.8 | 9.4 | 5.4 | 3.4 | 9.1 | 11.3 | 9.9 | 5.8 | 11.4 | 6.6 | 7.3
FLORES-124 | 615M | 2.9× | 5.1 | 5.1 | 9.2 | 11.2 | 5.8 | 4.7 | 10.6 | 13.1 | 10.3 | 7.6 | 11.5 | 8.5 | 8.6
Finetuned-100 | 330M | 7.8× | 6.1 | 5.4 | 8.7 | 11.3 | 5.7 | 4.1 | 9.0 | 11.8 | 10.4 | 6.8 | 13.0 | 8.0 | 8.4
SMaLL-100 | 330M | 7.8× | 7.9 | 7.0 | 10.3 | 12.6 | 8.4 | 6.1 | 11.6 | 14.3 | 13.7 | 9.0 | 16.7 | 10.2 | 10.7
M2M-100 | 1.2B | 1.8× | 6.7 | 6.1 | 10.8 | 12.8 | 8.7 | 6.1 | 13.0 | 15.9 | 13.6 | 8.8 | 15.4 | 9.7 | 10.6
M2M-100 | 12B | 1× | 8.7 | 8.8 | 11.9 | 13.7 | 11.7 | 9.7 | 15.4 | 18.2 | 16.5 | 12.6 | 18.7 | 13.9 | 13.3

Tatoeba
FLORES-124 | 175M | 5.3× | - | 7.6 | 15.7 | 10.1 | 4.6 | 5.3 | 11.5 | 10.8 | 14.0 | 10.2 | 6.4 | 7.5 | 9.4
M2M-100 | 418M | 3.1× | - | 7.4 | 19.7 | 12.3 | 5.9 | 5.3 | 13.8 | 13.2 | 14.9 | 11.7 | 7.7 | 9.0 | 10.9
FLORES-124 | 615M | 2.9× | - | 9.1 | 19.4 | 11.4 | 6.9 | 7.6 | 12.7 | 13.7 | 14.4 | 13.3 | 8.0 | 9.7 | 11.4
Finetuned-100 | 330M | 7.8× | - | 4.0 | 21.1 | 14.4 | 7.7 | 5.2 | 15.3 | 14.2 | 14.0 | 12.1 | 8.9 | 8.3 | 11.4
SMaLL-100 | 330M | 7.8× | - | 4.6 | 22.1 | 16.4 | 8.7 | 7.0 | 16.7 | 15.8 | 16.3 | 14.5 | 10.6 | 11.2 | 13.1
M2M-100 | 1.2B | 1.8× | - | 8.8 | 19.5 | 13.1 | 8.7 | 7.2 | 16.3 | 17.0 | 17.2 | 13.4 | 10.7 | 11.1 | 13.0
M2M-100 | 12B | 1× | - | 8.6 | 23.5 | 13.1 | 9.8 | 10.2 | 17.8 | 17.9 | 18.5 | 15.2 | 10.7 | 13.2 | 14.4

TICO-19
FLORES-124 | 175M | 5.3× | 4.6 | 5.5 | 8.1 | 11.5 | 4.4 | 5.6 | 9.7 | 12.2 | 3.9 | 8.0 | 4.2 | 8.7 | 7.2
M2M-100 | 418M | 3.1× | 4.0 | 5.5 | 9.8 | 13.7 | 4.2 | 5.7 | 11.6 | 14.9 | 4.1 | 8.8 | 5.3 | 9.4 | 8.1
FLORES-124 | 615M | 2.9× | 4.6 | 7.4 | 11.5 | 16.4 | 4.8 | 7.6 | 12.9 | 16.7 | 4.4 | 10.7 | 4.4 | 11.5 | 9.4
Finetuned-100 | 330M | 7.8× | 6.1 | 7.2 | 11.9 | 17.4 | 5.5 | 6.1 | 12.1 | 15.2 | 6.4 | 9.0 | 9.5 | 10.3 | 9.7
SMaLL-100 | 330M | 7.8× | 7.8 | 8.8 | 13.3 | 19.0 | 8.0 | 8.5 | 14.3 | 17.8 | 8.3 | 11.5 | 11.3 | 12.7 | 11.8
M2M-100 | 1.2B | 1.8× | 5.4 | 8.2 | 13.2 | 18.9 | 6.0 | 8.7 | 14.0 | 19.2 | 5.2 | 11.5 | 6.1 | 12.5 | 10.8
M2M-100 | 12B | 1× | 6.4 | 10.9 | 15.4 | 20.6 | 7.8 | 11.9 | 16.6 | 21.4 | 6.4 | 15.4 | 8.7 | 16.4 | 13.1
Table 2: Average spBLEU performance on the FLORES-101, Tatoeba, and TICO-19 benchmarks for the different language-direction categories defined in Appendix A (e.g., VL2L denotes translation from a very-low-resource source into a low-resource target language). FLORES-101 results are computed on language pairs where M2M-100 (12B) obtains spBLEU scores higher than 3, to avoid polluting the analysis with meaningless scores. The params and Speed columns give the model size and the speed-up ratio compared to M2M-100 (12B). The last column is the average spBLEU over all listed language directions. The best scores are shown in bold, and the second-best results are underlined.
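spBLEU scores in Table 2 are BLEU computed on SentencePiece-tokenized text rather than on words. The snippet below is a hedged example of how such a score can be obtained with the sacrebleu library, assuming a version that ships the FLORES-101 SentencePiece tokenizer (exposed as tokenize="flores101"; earlier releases named it "spm"); the sentences are toy data, not taken from the benchmarks.

```python
import sacrebleu

# Toy hypothesis/reference pair; a real evaluation would use the devtest sets.
hypotheses = ["Das ist ein kleiner Test ."]
references = [["Dies ist ein kleiner Test ."]]

# spBLEU: BLEU over SentencePiece pieces instead of word tokens.
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(f"spBLEU = {score.score:.1f}")
```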
TICO-19 was created during the COVID-19 pandemic (Anastasopoulos et al., 2020). It contains sentences from 36 languages in the medical domain, including 26 low-resource languages. We evaluate on languages which are covered by M2M-100 (Fan et al., 2020).
Inspired by Goyal et al. (2021), we split the languages into 4 categories based on the amount of available training sentences aligned with English: Very-Low (VL), Low (L), Medium (M), and High-resource (H). As the effective amount of training data depends on both the quality and the quantity of parallel sentences, Goyal et al. (2021) suggested estimating it by counting the bitext sentences aligned with English, computed from the statistics of the OPUS corpora (Tiedemann, 2012). Table 1 shows the criteria used to assign each language to a category. More details about the distribution of language-pair categories in each benchmark are provided in Appendix A.
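For reference, the split in Table 1 reduces to simple thresholds on the English-aligned bitext count |K|; the sketch below spells this out (the handling of the exact boundary values reflects our reading of Table 1).

```python
def resource_category(n_bitext_with_english: int) -> str:
    """Map a language to a resource category from the amount |K| of
    training bitext aligned with English (Table 1)."""
    if n_bitext_with_english <= 100_000:
        return "Very-Low"
    elif n_bitext_with_english <= 1_000_000:
        return "Low"
    elif n_bitext_with_english <= 100_000_000:
        return "Medium"
    return "High"
```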
3.2 Baselines
M2M-100 (Fan et al., 2020) is a recent many-to-many NMT model covering 100 languages. Fan et al. (2020) provide 3 variants with 418M, 1.2B, and 12B parameters, respectively. We compare against all three variants.
FLORES-124 is an extension of M2M-100 covering 24 additional languages. The training data for the additional languages is derived from OPUS (Tiedemann, 2012). Goyal et al. (2021) provide two models, with 175M and 615M parameters. We use both models as baselines.
Finetuned-100 uses the same architecture as defined in Section 2, but the KD loss (L_kd) is not used during training. For a fair comparison, it is trained for the same number of steps as the SMaLL-100 model.
3.3 Implementation Details
SMaLL-100 contains nearly 330M parameters, with 12 encoder and 3 decoder Transformer layers.[5] It is trained for 30 days on 16 TESLA

[5] It is initialized with M2M-100 (418M), using its first 3 decoder layers to initialize the student's decoder.
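The initialization described in footnote 5 can be sketched as follows: copy every M2M-100 (418M) weight whose name and shape match the student, keeping only the first 3 decoder layers. The fairseq-style parameter names ("decoder.layers.<i>. ...") and checkpoint paths used here are assumptions for illustration, not the authors' actual script.

```python
import torch

def init_student_from_m2m100(teacher_state: dict, student_state: dict,
                             n_student_dec_layers: int = 3) -> dict:
    """Sketch: initialize the 12+3-layer student from M2M-100 (418M).

    Encoder and embedding weights are copied whenever shapes match;
    only the first n_student_dec_layers decoder layers of the teacher
    are kept for the student's shallow decoder.
    """
    new_state = dict(student_state)
    for key, value in teacher_state.items():
        if key.startswith("decoder.layers."):
            layer_idx = int(key.split(".")[2])
            if layer_idx >= n_student_dec_layers:
                continue  # skip teacher decoder layers beyond the student depth
        if key in new_state and new_state[key].shape == value.shape:
            new_state[key] = value.clone()
    return new_state

# Usage sketch (checkpoint path and model object are placeholders):
# teacher = torch.load("m2m100_418M.pt")["model"]
# model.load_state_dict(init_student_from_m2m100(teacher, model.state_dict()))
```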