SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages

Alireza Mohammadshahi 1,2,3,∗   Vassilina Nikoulina 1   Alexandre Berard 1
Caroline Brun 1   James Henderson 2   Laurent Besacier 1
1 NAVER LABS Europe   2 IDIAP Research Institute   3 EPFL
{first.last}@naverlabs.com
{alireza.mohammadshahi,james.henderson}@idiap.ch

∗ Work done during an internship at NAVER LABS Europe.

arXiv:2210.11621v1 [cs.CL] 20 Oct 2022
Abstract
In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the "curse of multilinguality", these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks: FLORES-101, Tatoeba, and TICO-19, and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6× smaller and 4.3× faster at inference.[1]

[1] The code and the pre-trained SMaLL-100 model are available at https://github.com/alirezamshi/small100.
1 Introduction
Neural Machine Translation (NMT) systems are usually trained on datasets consisting of millions of parallel sentences, and thus still perform poorly on low-resource languages, i.e., languages without a large amount of training data. Over the past few years, previous work has proposed several approaches to improve the quality of translations in low-resource languages, e.g., Multilingual Neural Machine Translation (MNMT) models (Johnson et al., 2017; Fan et al., 2020; Tang et al., 2021; Goyal et al., 2021), back-translation (Sennrich et al., 2016; Edunov et al., 2018), and unsupervised machine translation (Garcia et al., 2021; Ko et al., 2021). Massively MNMT models are particularly interesting for low-resource languages, as they benefit the most from knowledge transfer from related languages (Arivazhagan et al., 2019). However, the "curse of multilinguality" hurts the performance of high-resource languages, so previous work has increased the model size to maintain translation performance on both high and low-resource languages. This makes the use of these massively MNMT models challenging in real-world resource-constrained environments.
To overcome this problem, we propose SMaLL-100, a Shallow Multilingual Machine Translation Model for Low-Resource Languages covering 100 languages, which is a distilled alternative of M2M-100 (12B) (Fan et al., 2020), the most recent and biggest available multilingual NMT model. In this paper, we focus on very-low and low-resource language pairs, as there is no reasonable-size universal model that achieves acceptable performance over a great number of low-resource languages. We do so by training SMaLL-100 on a perfectly balanced dataset.[2] While this leads to lower performance on the high-resource languages, we claim that this loss is easily recoverable through further fine-tuning. We evaluate SMaLL-100 on different low-resource benchmarks, e.g., FLORES-101 (Goyal et al., 2021), Tatoeba (Tiedemann, 2020), and TICO-19 (Anastasopoulos et al., 2020).

[2] All language pairs have the same sampling probability, regardless of their training data size.
To summarize, our contributions are as follows:

• We propose SMaLL-100, a shallow multilingual NMT model focusing on low-resource language pairs.
• We evaluate SMaLL-100 on several low-resource NMT benchmarks.
• We show that our model significantly outperforms previous multilingual models of comparable size while being faster at inference. Additionally, it achieves comparable results to the M2M-100 (1.2B) model, with 4.3× faster inference and a 3.6× smaller size.
• While SMaLL-100 reaches 87.2% of the performance of the 12B teacher model, we show that this gap can be closed with a few fine-tuning steps, both for low and high-resource languages.
2 Model and Training
2.1 SMaLL-100 Architecture
Kasai et al. (2021) have shown that deep-encoder/shallow-decoder architectures can achieve good translation quality while being significantly faster at inference, and Berard et al. (2021) have confirmed that this also holds for multilingual NMT. Here, we use a 12-layer Transformer encoder (Vaswani et al., 2017) and a 3-layer decoder. Table 8 in Appendix B reports further details of the SMaLL-100 architecture. Unlike the M2M-100 model, we use language codes on the encoder side, as this has been shown to perform better with shallow-decoder architectures (Berard et al., 2021).
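To make the deep-encoder/shallow-decoder layout concrete, below is a minimal PyTorch sketch, not the authors' implementation: the 12/3 layer split follows the text above, while the model width, number of heads, feed-forward size, and vocabulary size are illustrative assumptions rather than the values in Table 8.

```python
import torch
import torch.nn as nn

class DeepEncoderShallowDecoder(nn.Module):
    """Sketch of a 12-encoder / 3-decoder Transformer for multilingual MT.
    Hidden sizes and vocabulary size are illustrative assumptions;
    positional encodings are omitted for brevity."""

    def __init__(self, vocab_size=128_000, d_model=1024, n_heads=16,
                 ffn_dim=4096, enc_layers=12, dec_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=n_heads,
            num_encoder_layers=enc_layers,   # deep encoder
            num_decoder_layers=dec_layers,   # shallow decoder
            dim_feedforward=ffn_dim,
            batch_first=True,
        )
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens are assumed to already carry the language code on the
        # encoder side, as described in Section 2.1 (token prep not shown).
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt.device)
        hidden = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.output_proj(hidden)
```

In practice SMaLL-100 is not trained from scratch with this recipe alone; it is distilled from M2M-100 as described next.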
2.2 Training Strategy
SMaLL-100 is trained with a combination of two loss functions: a standard cross-entropy loss (CE) and a knowledge distillation loss (KD). Given a source sequence X and a gold target translation Y = (y_0, ..., y_m), the CE loss is calculated as:

\mathcal{L}_{ce} = -\sum_{j=0}^{m} \sum_{z=1}^{|K|} \mathbb{1}\{y_j = z\} \log p(y_j = z \mid y_{<j}, X, \theta_S)    (1)

where |K| is the target vocabulary size, 1{·} is the indicator function, θ_S denotes the parameters of the student model, and p(·) is its conditional probability.
We additionally use a word-level distillation loss, which is the Kullback-Leibler divergence between the output distributions of the student and teacher models (Hu et al., 2018). Specifically, it is calculated as:

\mathcal{L}_{kd} = -\sum_{j=0}^{m} \sum_{z=1}^{|K|} q(y_j = z \mid y_{<j}, X, \theta_T) \log p(y_j = z \mid y_{<j}, X, \theta_S)    (2)

where θ_T denotes the parameters of the teacher model and q(·) is the teacher's conditional probability.
Resource Type    Criteria
Very-Low         |K| ≤ 100K
Low              100K < |K| ≤ 1M
Medium           1M < |K| ≤ 100M
High             100M < |K|

Table 1: The criteria used to split languages into the different resource categories. |K| is the amount of training data to/from English.
The total loss is computed as:

\mathcal{L}_{total} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{kd}    (3)

where α is a trainable parameter.
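As an illustration of Equations 1-3 (not the authors' training code), the following PyTorch sketch computes the combined objective from student and teacher logits; the α weight and padding index are placeholder values. Minimizing the teacher-weighted term of Equation 2 is equivalent to minimizing KL(q ‖ p), since the teacher entropy is constant with respect to the student.

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, targets,
                           alpha=1.0, pad_id=1):
    """Combined CE + word-level KD loss, a sketch of Eq. 1-3.

    student_logits, teacher_logits: (batch, tgt_len, |K|)
    targets: (batch, tgt_len) gold token ids
    """
    log_p = F.log_softmax(student_logits, dim=-1)   # student log-probs
    with torch.no_grad():
        q = F.softmax(teacher_logits, dim=-1)       # teacher probs (no grad)

    # Eq. 1: standard cross-entropy against the gold tokens
    ce = F.nll_loss(log_p.transpose(1, 2), targets, ignore_index=pad_id)

    # Eq. 2: teacher-weighted term, averaged over non-padding positions
    mask = targets.ne(pad_id).float()
    kd = -(q * log_p).sum(dim=-1)
    kd = (kd * mask).sum() / mask.sum()

    # Eq. 3: total loss
    return ce + alpha * kd
```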
2.3 Training Data
Our training data includes parallel sentences from the CCMatrix (Schwenk et al., 2019) and CCAligned (El-Kishky et al., 2020) datasets, which are part of the training data used by Fan et al. (2020) to train the M2M-100 models. As our goal is to maintain the performance of low-resource languages, we balance the training data across all language pairs; specifically, 100K sentence pairs are sampled for each language pair.[3] As a result, our training data contains nearly 456M parallel sentences, which is less than 6% of the original data on which M2M-100 (Fan et al., 2020) was trained. We use the same languages as M2M-100.

[3] For language pairs with less than 100K sentence pairs, we repeat their data. We randomly select 100K sentences for language pairs with more than 100K training sentences.
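The balancing scheme amounts to uniform sampling with upsampling of small corpora; a minimal sketch (with illustrative function and variable names) is shown below.

```python
import random

def balance_corpus(pairs_by_lang, target_size=100_000, seed=0):
    """Return exactly target_size sentence pairs per language pair:
    corpora larger than the target are randomly subsampled, smaller
    ones are repeated (upsampled) and then trimmed."""
    rng = random.Random(seed)
    balanced = {}
    for lang_pair, pairs in pairs_by_lang.items():
        if len(pairs) >= target_size:
            balanced[lang_pair] = rng.sample(pairs, target_size)
        else:
            repeats = -(-target_size // len(pairs))  # ceil(target / n)
            balanced[lang_pair] = (pairs * repeats)[:target_size]
    return balanced
```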
3 Experimental Setup
3.1 Evaluation Benchmarks
FLORES-101 is a multilingual NMT benchmark containing 3,001 sentences from different domains, derived from English Wikipedia. The sentences are translated into 101 languages by human translators (Goyal et al., 2021). It mostly includes low and medium-resource languages. We use the devtest subset for evaluation.
Tatoeba is a crowd-sourced collection of user-provided translations in different languages (Tiedemann, 2020). We choose a subset of languages from the test set of the Tatoeba Challenge[4] that are covered by M2M-100.

[4] https://github.com/Helsinki-NLP/Tatoeba-Challenge
Model | params | Speed | VL2VL | VL2L | VL2M | VL2H | L2VL | L2L | L2M | L2H | M2VL | M2L | H2VL | H2L | AVG

FLORES-101
FLORES-124 | 175M | 5.3× | 3.3 | 3.4 | 6.0 | 7.8 | 3.7 | 3.1 | 6.9 | 8.8 | 6.9 | 5.2 | 8.1 | 6.0 | 5.8
M2M-100 | 418M | 3.1× | 4.3 | 3.7 | 7.8 | 9.4 | 5.4 | 3.4 | 9.1 | 11.3 | 9.9 | 5.8 | 11.4 | 6.6 | 7.3
FLORES-124 | 615M | 2.9× | 5.1 | 5.1 | 9.2 | 11.2 | 5.8 | 4.7 | 10.6 | 13.1 | 10.3 | 7.6 | 11.5 | 8.5 | 8.6
Finetuned-100 | 330M | 7.8× | 6.1 | 5.4 | 8.7 | 11.3 | 5.7 | 4.1 | 9.0 | 11.8 | 10.4 | 6.8 | 13.0 | 8.0 | 8.4
SMaLL-100 | 330M | 7.8× | 7.9 | 7.0 | 10.3 | 12.6 | 8.4 | 6.1 | 11.6 | 14.3 | 13.7 | 9.0 | 16.7 | 10.2 | 10.7
M2M-100 | 1.2B | 1.8× | 6.7 | 6.1 | 10.8 | 12.8 | 8.7 | 6.1 | 13.0 | 15.9 | 13.6 | 8.8 | 15.4 | 9.7 | 10.6
M2M-100 | 12B | 1× | 8.7 | 8.8 | 11.9 | 13.7 | 11.7 | 9.7 | 15.4 | 18.2 | 16.5 | 12.6 | 18.7 | 13.9 | 13.3

Tatoeba
FLORES-124 | 175M | 5.3× | - | 7.6 | 15.7 | 10.1 | 4.6 | 5.3 | 11.5 | 10.8 | 14.0 | 10.2 | 6.4 | 7.5 | 9.4
M2M-100 | 418M | 3.1× | - | 7.4 | 19.7 | 12.3 | 5.9 | 5.3 | 13.8 | 13.2 | 14.9 | 11.7 | 7.7 | 9.0 | 10.9
FLORES-124 | 615M | 2.9× | - | 9.1 | 19.4 | 11.4 | 6.9 | 7.6 | 12.7 | 13.7 | 14.4 | 13.3 | 8.0 | 9.7 | 11.4
Finetuned-100 | 330M | 7.8× | - | 4.0 | 21.1 | 14.4 | 7.7 | 5.2 | 15.3 | 14.2 | 14.0 | 12.1 | 8.9 | 8.3 | 11.4
SMaLL-100 | 330M | 7.8× | - | 4.6 | 22.1 | 16.4 | 8.7 | 7.0 | 16.7 | 15.8 | 16.3 | 14.5 | 10.6 | 11.2 | 13.1
M2M-100 | 1.2B | 1.8× | - | 8.8 | 19.5 | 13.1 | 8.7 | 7.2 | 16.3 | 17.0 | 17.2 | 13.4 | 10.7 | 11.1 | 13.0
M2M-100 | 12B | 1× | - | 8.6 | 23.5 | 13.1 | 9.8 | 10.2 | 17.8 | 17.9 | 18.5 | 15.2 | 10.7 | 13.2 | 14.4

TICO-19
FLORES-124 | 175M | 5.3× | 4.6 | 5.5 | 8.1 | 11.5 | 4.4 | 5.6 | 9.7 | 12.2 | 3.9 | 8.0 | 4.2 | 8.7 | 7.2
M2M-100 | 418M | 3.1× | 4.0 | 5.5 | 9.8 | 13.7 | 4.2 | 5.7 | 11.6 | 14.9 | 4.1 | 8.8 | 5.3 | 9.4 | 8.1
FLORES-124 | 615M | 2.9× | 4.6 | 7.4 | 11.5 | 16.4 | 4.8 | 7.6 | 12.9 | 16.7 | 4.4 | 10.7 | 4.4 | 11.5 | 9.4
Finetuned-100 | 330M | 7.8× | 6.1 | 7.2 | 11.9 | 17.4 | 5.5 | 6.1 | 12.1 | 15.2 | 6.4 | 9.0 | 9.5 | 10.3 | 9.7
SMaLL-100 | 330M | 7.8× | 7.8 | 8.8 | 13.3 | 19.0 | 8.0 | 8.5 | 14.3 | 17.8 | 8.3 | 11.5 | 11.3 | 12.7 | 11.8
M2M-100 | 1.2B | 1.8× | 5.4 | 8.2 | 13.2 | 18.9 | 6.0 | 8.7 | 14.0 | 19.2 | 5.2 | 11.5 | 6.1 | 12.5 | 10.8
M2M-100 | 12B | 1× | 6.4 | 10.9 | 15.4 | 20.6 | 7.8 | 11.9 | 16.6 | 21.4 | 6.4 | 15.4 | 8.7 | 16.4 | 13.1
Table 2: Average spBLEU performance on the FLORES-101, Tatoeba, and TICO-19 benchmarks for the different language-direction categories defined in Appendix A (e.g., VL2L denotes translation from a very-low-resource source into a low-resource target language). FLORES-101 results are computed on language pairs where M2M-100 (12B) obtains spBLEU scores higher than 3, to avoid polluting the analysis with meaningless scores. The params and Speed columns give the model size and the speed-up ratio compared to M2M-100 (12B). The last column is the average spBLEU over all listed language directions. The best scores are shown in bold, and the second-best results are underlined.
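spBLEU scores in Table 2 are BLEU computed on SentencePiece-tokenized text rather than on words. The snippet below is a hedged example of how such a score can be obtained with the sacrebleu library, assuming a version that ships the FLORES-101 SentencePiece tokenizer (exposed as tokenize="flores101"; earlier releases named it "spm"); the sentences are toy data, not taken from the benchmarks.

```python
import sacrebleu

# Toy hypothesis/reference pair; a real evaluation would use the devtest sets.
hypotheses = ["Das ist ein kleiner Test ."]
references = [["Dies ist ein kleiner Test ."]]

# spBLEU: BLEU over SentencePiece pieces instead of word tokens.
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(f"spBLEU = {score.score:.1f}")
```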
TICO-19 was created during the COVID-19 pandemic (Anastasopoulos et al., 2020). It contains sentences from 36 languages in the medical domain, including 26 low-resource languages. We evaluate on languages which are covered by M2M-100 (Fan et al., 2020).
Inspired by Goyal et al. (2021), we split the languages into 4 categories based on the amount of available training sentences aligned with English: Very-Low (VL), Low (L), Medium (M), and High-resource (H). As the effective amount of training data depends on both the quality and the quantity of parallel sentences, Goyal et al. (2021) suggested estimating it by counting the bitext sentences aligned with English, computed from the statistics of the OPUS corpora (Tiedemann, 2012). Table 1 shows the criteria used to assign each language to a category. More details about the distribution of language-pair categories in each benchmark are provided in Appendix A.
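For reference, the split in Table 1 reduces to simple thresholds on the English-aligned bitext count |K|; the sketch below spells this out (the handling of the exact boundary values reflects our reading of Table 1).

```python
def resource_category(n_bitext_with_english: int) -> str:
    """Map a language to a resource category from the amount |K| of
    training bitext aligned with English (Table 1)."""
    if n_bitext_with_english <= 100_000:
        return "Very-Low"
    elif n_bitext_with_english <= 1_000_000:
        return "Low"
    elif n_bitext_with_english <= 100_000_000:
        return "Medium"
    return "High"
```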
3.2 Baselines
M2M-100 (Fan et al., 2020) is a recent many-to-many NMT model covering 100 languages. Fan et al. (2020) provide 3 variants with 418M, 1.2B, and 12B parameters, respectively. We compare against all three variants.
FLORES-124 is an extension of M2M-100 covering 24 additional languages. The training data for the additional languages is derived from OPUS (Tiedemann, 2012). Goyal et al. (2021) provide two models, with 175M and 615M parameters. We use both models as baselines.
Finetuned-100 uses the same architecture as defined in Section 2, but the KD loss (L_kd) is not used during training. For a fair comparison, it is trained for the same number of steps as the SMaLL-100 model.
3.3 Implementation Details
SMaLL-100 contains nearly 330M parameters, with 12 encoder and 3 decoder Transformer layers.[5] It is trained for 30 days on 16 TESLA

[5] It is initialized with M2M-100 (418M), using its first 3 decoder layers to initialize the student's decoder.
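The initialization described in footnote 5 can be sketched as follows: copy every M2M-100 (418M) weight whose name and shape match the student, keeping only the first 3 decoder layers. The fairseq-style parameter names ("decoder.layers.<i>. ...") and checkpoint paths used here are assumptions for illustration, not the authors' actual script.

```python
import torch

def init_student_from_m2m100(teacher_state: dict, student_state: dict,
                             n_student_dec_layers: int = 3) -> dict:
    """Sketch: initialize the 12+3-layer student from M2M-100 (418M).

    Encoder and embedding weights are copied whenever shapes match;
    only the first n_student_dec_layers decoder layers of the teacher
    are kept for the student's shallow decoder.
    """
    new_state = dict(student_state)
    for key, value in teacher_state.items():
        if key.startswith("decoder.layers."):
            layer_idx = int(key.split(".")[2])
            if layer_idx >= n_student_dec_layers:
                continue  # skip teacher decoder layers beyond the student depth
        if key in new_state and new_state[key].shape == value.shape:
            new_state[key] = value.clone()
    return new_state

# Usage sketch (checkpoint path and model object are placeholders):
# teacher = torch.load("m2m100_418M.pt")["model"]
# model.load_state_dict(init_student_from_m2m100(teacher, model.state_dict()))
```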