Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models
Mathieu Grosso, Pirashanth Ratnamogan, Alexis Mathey, William Vanhuffel, Michael Fotso Fotso
BNP Paribas
(mathieu.grosso, pirashanth.ratnamogan, alexis.mathey)@bnpparibas.com
(william.vanhuffel, michael.fotsofotso)@bnpparibas.com
arXiv:2210.14979v1 [cs.CL] 26 Oct 2022
Abstract
Recent literature has demonstrated the potential of multilingual Neural Machine Translation (mNMT) models. However, the most efficient models are not well suited to specialized industries, where internal data is scarce and expensive to obtain in all language pairs. Fine-tuning an mNMT model on a specialized domain is therefore hard. In this context, we focus on a new task: domain adaptation of a pre-trained mNMT model on a single language pair while maintaining model quality on generic-domain data for all language pairs. The risk of losing performance on the generic domain and on the other pairs is high. This task is key for industrial adoption of mNMT models and sits at the border of many others. We propose a fine-tuning procedure for the generic mNMT model that combines embedding freezing and an adversarial loss. Our experiments demonstrate that the procedure improves performance on specialized data with a minimal loss of initial performance on the generic domain for all language pairs, compared to a naive standard approach (+10.0 BLEU on specialized data, -0.01 to -0.5 BLEU on the WMT and Tatoeba datasets for the other pairs with M2M100).
1 Introduction
Building an NMT model supporting multiple language pairs is an active and emerging area of research (NLLB Team et al., 2022; Fan et al., 2020; Tang et al., 2020). Multilingual NMT (mNMT) uses a single model that supports translation between multiple language pairs. Multilingual models have several advantages over their bilingual counterparts (Arivazhagan et al., 2019b). This modeling proves to be both efficient and effective, as it reduces operational cost (a single model is deployed for all language pairs) and improves translation performance, especially for low-resource languages.
All these advantages make mNMT models interesting for real-world applications. However, they are not suitable for specialized industries that require domain-specific translation. Training a model from scratch or fine-tuning all the pairs of a pre-trained mNMT model is almost impossible for most companies, as it requires access to a large amount of resources and specialized data. That said, fine-tuning a single pair of a pre-trained mNMT model on a specialized domain seems possible. Ideally, this domain adaptation would be learned while sharing parameters with the original pairs, without suffering from catastrophic forgetting (Mccloskey and Cohen, 1989). This is rarely the case: the risk of degrading performance on the original pairs is high, due to the limited data available from the target domain and to the extremely high complexity of the pre-trained model. In our case, overfitting on the fine-tuning data may even mean that the model is no longer multilingual.

Figure 1: Domain adaptation of a pre-trained mNMT model: a model pre-trained on generic data for language pairs 1 to N is adapted on specialized data for a single language pair i.
In this context, this article focuses on a new real-world oriented task: fine-tuning a pre-trained mNMT model on a single language pair in a specific domain, without losing the initial performance on the other pairs and on generic data.
Our research focuses on fine-tuning two freely available state-of-the-art pre-trained multilingual NMT models, M2M100 (Fan et al., 2020) and mBART50 (Tang et al., 2020), which both provide high BLEU scores and translate up to 100 languages.
We explored multiple approaches for this domain adaptation. Our experiments were carried out on English-French data from the medical domain (EMEA corpus, https://opus.nlpl.eu/EMEA-v3.php). This paper shows that fine-tuning a pre-trained model with its initial layers frozen, for a few steps and with a small learning rate, is the best performing approach.
The paper is organized as follows: first, we introduce the standard components of modern NMT; second, we describe related work; third, we present our methods. Finally, we systematically study the impact of several state-of-the-art fine-tuning methods and present our results.
Our main contributions can be separated into two parts:
- defining a new real-world oriented task that focuses on domain adaptation and catastrophic forgetting in multilingual NMT models;
- defining a procedure that allows fine-tuning a pre-trained generic model on a specific domain.
2 Background
2.1 Neural Machine Translation
Neural Machine Translation (NMT) has become the dominant approach to machine translation. It studies how to automatically translate from one language to another using neural networks.
Most NMT systems are trained using Seq2Seq architectures (Sutskever et al., 2014; Cho et al., 2014) by maximizing the probability of the target sequence $V_T = (v_1, \ldots, v_T)$ given the source sentence $W_S = (w_1, \ldots, w_S)$:

$P(v_1, \ldots, v_T \mid w_1, \ldots, w_S)$
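In practice, this probability is factorized autoregressively over the target tokens and the model is trained by minimizing the negative log-likelihood of the reference translation (a standard formulation, stated here for completeness):

$P(v_1, \ldots, v_T \mid w_1, \ldots, w_S) = \prod_{t=1}^{T} P(v_t \mid v_{<t}, w_1, \ldots, w_S)$

$\mathcal{L} = - \sum_{t=1}^{T} \log P(v_t \mid v_{<t}, w_1, \ldots, w_S)$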
Today, the best performing Seq2Seq architecture for NMT is based on the Transformer (Vaswani et al., 2017). It is built from several layers, chiefly multi-head attention and feed-forward layers. These are applied sequentially, and each is followed by a residual connection (He et al., 2015) and layer normalization (Ba et al., 2016).
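To make the block structure concrete, below is a minimal PyTorch sketch of one such encoder block (post-norm variant); the hyperparameters are illustrative defaults and not those of the pre-trained models studied in this paper.

# One Transformer encoder block: multi-head self-attention and feed-forward
# sub-layers, each followed by a residual connection and layer normalization.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # Self-attention sub-layer + residual + layer norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Feed-forward sub-layer + residual + layer norm
        return self.norm2(x + self.drop(self.ff(x)))

x = torch.randn(2, 10, 512)          # 2 sentences, 10 tokens, 512-dim embeddings
print(EncoderBlock()(x).shape)       # torch.Size([2, 10, 512])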
Although powerful, traditional NMT only translates from one language to another, at a high computational cost compared to its statistical predecessors. It has been shown that a simple language token can condition the network to translate a sentence into any target language from any source language (Johnson et al., 2017). This makes it possible to create multilingual models that can translate between multiple languages. Using the previous notation, the multilingual model adds a condition on the target language to the previous modeling:

$P(v_1, \ldots, v_T \mid w_1, \ldots, w_S, \ell)$

where $\ell$ is the target language.
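Concretely, with the Hugging Face implementation of M2M100 (used here only as an illustration; the sentence is a hypothetical example), the target language $\ell$ is selected by forcing its language token as the first decoded token:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# The source language is set on the tokenizer; the target language is passed
# to generate() as the forced first token of the decoder.
tokenizer.src_lang = "en"
encoded = tokenizer("The patient received 20 mg of the drug.", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

The same single model handles every supported pair simply by changing the source language and the forced target token.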
2.2 Transfer Learning
Transfer learning is a key topic in Natural Language Processing (Devlin et al., 2018; Liu et al., 2019). It is based on the assumption that pre-training a model on a large set of data across various tasks helps initialize a network trained on another task where data is scarce.

It is already a key area of research in NMT, where large sets of generic data are freely available (news, Common Crawl, ...). However, real-world applications require specialized models. In-domain data is rare and more costly to gather for industries (finance, legal, medical, ...), making specialized models harder to train. This is even more true for multilingual models.

In our work, we study how to adapt an mNMT model to a specific domain by fine-tuning on only one language pair, without losing too much generality across all language pairs.
3 Related work
3.1 Multilingual Neural Machine Translation
While initial research on NMT started with bilingual translation systems (Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015; Yang et al., 2020), it has been shown that the NMT framework extends to multilingual models (Dong et al., 2015; Firat et al., 2016; Johnson et al., 2017; Dabre et al., 2020). mNMT has seen a sharp increase in the number of publications, since it is easily extendable and allows both end-to-end modeling and cross-lingual language representations (Conneau et al., 2017; Linger and Hajaiej, 2020; Conneau et al., 2019).
Competitive multilingual models have been released and open-sourced. mBART (Liu et al., 2019) was first trained following the BART (Lewis et al., 2019) objective before being fine-tuned on an English-centric multilingual dataset (Tang et al., 2020). M2M100 (Fan et al., 2020) scaled large Transformer layers (Vaswani et al., 2017) with a large amount of mined data in order to create an mNMT model that does not use English as a pivot and can translate between any pair among 100 languages. More recently, NLLB was released (NLLB Team et al., 2022), extending the M2M100 framework to 200 languages. These models are extremely competitive, as they have performance similar to their bilingual counterparts while allowing a pooling of training and resources.

Our experiments rely on M2M100 and mBART, but our approach can be generalized to any new pre-trained multilingual model (NLLB Team et al., 2022).
3.2 Domain Adaptation
Domain adaptation in the field of NMT is a key real-world oriented task. It aims at maximizing model performance on a given in-domain data distribution. Dominant approaches are based on fine-tuning a generic model using either in-domain data only, or a mixture of out-of-domain and in-domain data to reduce overfitting (Servan et al., 2016a; Van Der Wees et al., 2017). Many works have extended domain adaptation to the multi-domain setting, where the model is fine-tuned on multiple different domains (Sajjad et al., 2017; Zeng et al., 2018; Mghabbar and Ratnamogan, 2020).

However, to the best of our knowledge, our work is the first to explore domain adaptation in the context of recent pre-trained multilingual neural machine translation systems while focusing on keeping the model performant on out-of-domain data in all languages.
3.3 Learning without forgetting
Training on a new task or new data without losing past performance is a generic machine learning problem, named learning without forgetting (Li and Hoiem, 2016).

Limiting the updates of pre-trained weights, using either trust regions or an adversarial loss, is a recent idea that has been used to improve training stability in both natural language processing and computer vision (Zhu et al., 2019; Jiang et al., 2020; Aghajanyan et al., 2020). These methods have not been explored in the context of NMT, but they are key assets that have demonstrated their capabilities on other NLP tasks (Natural Language Inference in particular). Our work applies a combination of these methods to our task.
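As an illustration of this family of methods, the sketch below implements an R3F-style smoothness regularizer (in the spirit of Aghajanyan et al., 2020) for a Hugging Face seq2seq model: the input embeddings are perturbed with small noise and the symmetric KL divergence between the clean and perturbed output distributions is added to the translation loss. The noise scale and weight are illustrative assumptions and not the exact settings used in our experiments.

import torch
import torch.nn.functional as F

def r3f_loss(model, input_ids, attention_mask, labels, eps=1e-5, lam=1.0):
    # Token embeddings of the source sentence. Note: some architectures scale
    # embeddings internally when fed input_ids; that scaling is omitted here.
    inputs_embeds = model.get_input_embeddings()(input_ids)

    # Clean forward pass: standard cross-entropy on the in-domain batch.
    clean = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)

    # Perturbed forward pass: small uniform noise added to the embeddings.
    noise = torch.empty_like(inputs_embeds).uniform_(-eps, eps)
    noisy = model(inputs_embeds=inputs_embeds + noise, attention_mask=attention_mask, labels=labels)

    # Symmetric KL divergence between the two output distributions.
    p = F.log_softmax(clean.logits, dim=-1)
    q = F.log_softmax(noisy.logits, dim=-1)
    sym_kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                    + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return clean.loss + lam * sym_kl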
3.4 Zero Shot Translation
mNMT has shown the capability of direct translation between language pairs unseen in training: an mNMT system can automatically translate between unseen pairs without any direct supervision, as long as both the source and target languages were included in the training data (Johnson et al., 2017). However, prior works (Johnson et al., 2017; Firat et al., 2016; Arivazhagan et al., 2019a) showed that the quality of zero-shot NMT significantly lags behind pivot-based translation (Gu et al., 2019). Based on these ideas, some papers (Liu et al., 2021) have focused on training an mNMT model that supports the addition of new languages by relaxing the correspondence between input tokens and encoder representations, thereby improving its zero-shot capacity. We were interested in this method, as learning less specific input tokens during the fine-tuning procedure could help our model avoid overfitting the training pairs. Indeed, generalizing to a new domain can be seen as a task that includes generalizing to an unseen language.
4 Methods
Since our new real-world oriented task lies at the crossroads of many existing tasks, we applied ideas from the current literature and combined different approaches to achieve the best results.
4.1 Hyperparameter search heuristics for efficient fine-tuning

We seek to adapt a generic multilingual model to a specific task or domain (Cettolo et al., 2014; Servan et al., 2016b). Recent works in NMT (Domingo et al., 2019) have proposed methods to incrementally adapt a model to a specific domain. We continue the training of the generic model on specific data, through several iterations (see Algorithm 1). This post-training fine-tuning procedure is done without dropping the previous learning states of the multilingual model. The resulting model is considered as adapted, or specialized, to a specific domain.
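A minimal sketch of this procedure is given below, assuming the Hugging Face implementation of M2M100; it freezes the shared embeddings (the embedding-freezing component mentioned in the abstract) and fine-tunes the remaining parameters on the in-domain pair with a small learning rate for a few steps. The model size, learning rate and example sentences are illustrative assumptions rather than the exact settings of our experiments.

import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Freeze the shared embedding matrix so that the multilingual vocabulary
# representation is not overwritten by the single-pair in-domain data.
for param in model.get_input_embeddings().parameters():
    param.requires_grad = False

# Small learning rate, few steps: only the remaining parameters are updated.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-5)

tokenizer.src_lang, tokenizer.tgt_lang = "en", "fr"
batch = tokenizer(
    ["The patient received 20 mg of the drug."],
    text_target=["Le patient a reçu 20 mg du médicament."],
    return_tensors="pt",
)

model.train()
loss = model(**batch).loss   # cross-entropy on the in-domain pair
loss.backward()
optimizer.step()
optimizer.zero_grad()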