lingual language representation (Conneau et al., 2017; Linger and Hajaiej, 2020; Conneau et al., 2019).
Competitive multilingual models have been released and open-sourced. mBART (Liu et al., 2019) was first trained with the BART (Lewis et al., 2019) objective before being fine-tuned on an English-centric multilingual dataset (Tang et al., 2020). M2M100 (Fan et al., 2020) scaled large Transformer layers (Vaswani et al., 2017) with large amounts of mined data to build an mNMT model that can translate directly between any pair among 100 languages without using English as a pivot. More recently, NLLB was released (NLLB Team et al., 2022), extending the M2M100 framework to 200 languages. These models are extremely competitive: they reach performance similar to their bilingual counterparts while allowing training data and resources to be pooled.
Our experiments rely on M2M100 and mBART, but our approach can be generalized to any new pre-trained multilingual model (NLLB Team et al., 2022).
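For illustration only, both models are distributed with public checkpoints; the following minimal sketch runs translation with the smallest public M2M100 checkpoint through the HuggingFace transformers library (checkpoint name and language pair are illustrative choices, not the setup used in our experiments):

    # Minimal sketch: translating with a public M2M100 checkpoint.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    tokenizer.src_lang = "en"  # source language of the input sentence
    batch = tokenizer("Domain adaptation is a key real-world task.", return_tensors="pt")
    # Force the decoder to start with the target-language token (French here).
    generated = model.generate(**batch, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))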
3.2 Domain Adaptation
Domain adaptation in the field of NMT is a key real-world oriented task. It aims at maximizing model performance on a given in-domain data distribution. Dominant approaches are based on fine-tuning a generic model using either in-domain data only or a mixture of out-of-domain and in-domain data to reduce overfitting (Servan et al., 2016a; Van Der Wees et al., 2017). Many works have extended domain adaptation to the multi-domain setting, where the model is fine-tuned on multiple different domains (Sajjad et al., 2017; Zeng et al., 2018; Mghabbar and Ratnamogan, 2020).
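A minimal sketch of the in-domain/out-of-domain data mixing mentioned above (the mixing ratio, loader names, and the use of itertools.cycle are illustrative assumptions, not details from the works cited):

    import itertools
    import random

    def mixed_batches(in_domain_loader, out_domain_loader, p_in=0.7):
        """Yield fine-tuning batches: with probability p_in take the next
        in-domain batch, otherwise substitute an out-of-domain batch so the
        model keeps seeing generic examples and overfits less."""
        out_stream = itertools.cycle(out_domain_loader)  # reuse generic data as needed
        for in_batch in in_domain_loader:
            yield in_batch if random.random() < p_in else next(out_stream)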
However, to the best of our knowledge, our work is the first to explore domain adaptation in the context of recent pre-trained multilingual neural machine translation systems while focusing on keeping the model performant on out-of-domain data in all languages.
3.3 Learning without forgetting
Training on a new task or on new data without losing past performance is a generic machine learning problem known as learning without forgetting (Li and Hoiem, 2016).
Limiting updates to pre-trained weights, using either trust regions or an adversarial loss, is a recent idea that has been used to improve training stability in both natural language processing and computer vision (Zhu et al., 2019; Jiang et al., 2020; Aghajanyan et al., 2020). These methods have not been explored in the context of NMT, but they are key assets that have demonstrated their capabilities on other NLP tasks (Natural Language Inference in particular). Our work applies a combination of these methods to our task.
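As a concrete illustration of constraining weight updates, the sketch below adds a simple L2 anchor toward the pre-trained weights. This is only one way to realize the trust-region intuition; the regularizers in the works cited above act on output distributions rather than directly on parameters, and the penalty weight here is an arbitrary illustrative value.

    import torch

    def anchored_loss(model, pretrained_state, task_loss, lam=0.01):
        """Penalize deviation of the fine-tuned parameters from their
        pre-trained values, limiting how far updates can drift."""
        penalty = torch.zeros((), device=task_loss.device)
        for name, param in model.named_parameters():
            penalty = penalty + (param - pretrained_state[name]).pow(2).sum()
        return task_loss + lam * penalty

    # Snapshot of the pre-trained weights, taken once before fine-tuning:
    # pretrained_state = {n: p.detach().clone() for n, p in model.named_parameters()}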
3.4 Zero Shot Translation
mNMT has shown the capability of direct translation between language pairs unseen in training: an mNMT system can automatically translate between unseen pairs without any direct supervision, as long as both the source and target languages were included in the training data (Johnson et al., 2017). However, prior works (Johnson et al., 2017; Firat et al., 2016; Arivazhagan et al., 2019a) showed that the quality of zero-shot NMT significantly lags behind pivot-based translation (Gu et al., 2019).
Based on these ideas, some works (Liu et al., 2021) have focused on training an mNMT model that supports the addition of new languages by relaxing the correspondence between input tokens and encoder representations, thereby improving its zero-shot capacity. We were interested in this method, as learning less specific input tokens during the fine-tuning procedure could help our model avoid overfitting the training pairs. Indeed, generalizing to a new domain can be seen as a task that includes generalizing to an unseen language.
4 Methods
Our new real-world oriented task lies at the crossroads of many existing tasks; we therefore applied ideas from the current literature and combined different approaches to achieve the best results.
4.1 Hyperparameter search heuristics for efficient fine-tuning
We seek to adapt a generic multilingual model to a specific task or domain (Cettolo et al., 2014; Servan et al., 2016b). Recent works in NMT (Domingo et al., 2019) have proposed methods to incrementally adapt a model to a specific domain. We continue the training of the generic model on specific data through several iterations (see Algorithm 1).
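A minimal sketch of this continued-training loop is given below; it is a simplified stand-in for Algorithm 1, assuming HuggingFace-style seq2seq batches that already contain labels and an illustrative iteration count.

    def specialize(model, in_domain_loader, optimizer, n_iterations=3):
        """Continue training the generic mNMT model on in-domain data for
        several iterations, starting from its pre-trained state."""
        model.train()
        for _ in range(n_iterations):
            for batch in in_domain_loader:
                loss = model(**batch).loss  # cross-entropy on in-domain pairs
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
        return model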
This post-training fine-tuning procedure is done
without dropping the previous learning states of
the multilingual model. The resulting model is considered adapted, or specialized, to a specific