lingual language representation (Conneau et al., 2017; Linger and Hajaiej, 2020; Conneau et al., 2019).
Competitive multilingual models have been released and open-sourced. mBART (Liu et al., 2019) was first trained with the BART (Lewis et al., 2019) objective before being fine-tuned on an English-centric multilingual dataset (Tang et al., 2020). M2M100 (Fan et al., 2020) scaled large Transformer layers (Vaswani et al., 2017) with large amounts of mined data to build an mNMT model that can translate directly between any pair among 100 languages without using English as a pivot. More recently, NLLB was released (NLLB Team et al., 2022), extending the M2M100 framework to 200 languages. These models are extremely competitive: they reach performance similar to their bilingual counterparts while allowing training data and resources to be pooled.
Our experiments rely on M2M100 and mBART, but our approach can be generalized to any new pre-trained multilingual model (NLLB Team et al., 2022).
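For illustration only, both models are distributed with public checkpoints; the following minimal sketch runs translation with the smallest public M2M100 checkpoint through the HuggingFace transformers library (checkpoint name and language pair are illustrative choices, not the setup used in our experiments):

    # Minimal sketch: translating with a public M2M100 checkpoint.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    tokenizer.src_lang = "en"  # source language of the input sentence
    batch = tokenizer("Domain adaptation is a key real-world task.", return_tensors="pt")
    # Force the decoder to start with the target-language token (French here).
    generated = model.generate(**batch, forced_bos_token_id=tokenizer.get_lang_id("fr"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))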
3.2 Domain Adaptation
Domain adaptation in the field of NMT is a key real-world oriented task. It aims at maximizing model performance on a given in-domain data distribution. Dominant approaches are based on fine-tuning a generic model using either in-domain data only or a mixture of out-of-domain and in-domain data to reduce overfitting (Servan et al., 2016a; Van Der Wees et al., 2017). Many works have extended domain adaptation to the multi-domain setting, where the model is fine-tuned on multiple different domains (Sajjad et al., 2017; Zeng et al., 2018; Mghabbar and Ratnamogan, 2020).
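A minimal sketch of the in-domain/out-of-domain data mixing mentioned above (the mixing ratio, loader names, and the use of itertools.cycle are illustrative assumptions, not details from the works cited):

    import itertools
    import random

    def mixed_batches(in_domain_loader, out_domain_loader, p_in=0.7):
        """Yield fine-tuning batches: with probability p_in take the next
        in-domain batch, otherwise substitute an out-of-domain batch so the
        model keeps seeing generic examples and overfits less."""
        out_stream = itertools.cycle(out_domain_loader)  # reuse generic data as needed
        for in_batch in in_domain_loader:
            yield in_batch if random.random() < p_in else next(out_stream)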
However, to the best of our knowledge, our work is the first to explore domain adaptation in the context of recent pre-trained multilingual neural machine translation systems while focusing on keeping the model performant on out-of-domain data in all languages.
3.3 Learning without forgetting
Training on a new task or on new data without losing past performance is a generic machine learning problem known as learning without forgetting (Li and Hoiem, 2016).
Limiting updates to pre-trained weights, using either trust regions or an adversarial loss, is a recent idea that has been used to improve training stability in both natural language processing and computer vision (Zhu et al., 2019; Jiang et al., 2020; Aghajanyan et al., 2020). These methods have not been explored in the context of NMT, but they are key assets that have demonstrated their capabilities on other NLP tasks (Natural Language Inference in particular). Our work applies a combination of these methods to our task.
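As a concrete illustration of constraining weight updates, the sketch below adds a simple L2 anchor toward the pre-trained weights. This is only one way to realize the trust-region intuition; the regularizers in the works cited above act on output distributions rather than directly on parameters, and the penalty weight here is an arbitrary illustrative value.

    import torch

    def anchored_loss(model, pretrained_state, task_loss, lam=0.01):
        """Penalize deviation of the fine-tuned parameters from their
        pre-trained values, limiting how far updates can drift."""
        penalty = torch.zeros((), device=task_loss.device)
        for name, param in model.named_parameters():
            penalty = penalty + (param - pretrained_state[name]).pow(2).sum()
        return task_loss + lam * penalty

    # Snapshot of the pre-trained weights, taken once before fine-tuning:
    # pretrained_state = {n: p.detach().clone() for n, p in model.named_parameters()}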
3.4 Zero Shot Translation
mNMT has shown the capability of direct translation between language pairs unseen in training: an mNMT system can automatically translate between unseen pairs without any direct supervision, as long as both the source and target languages were included in the training data (Johnson et al., 2017). However, prior works (Johnson et al., 2017; Firat et al., 2016; Arivazhagan et al., 2019a) showed that the quality of zero-shot NMT significantly lags behind pivot-based translation (Gu et al., 2019).
Based on these ideas, some works (Liu et al., 2021) have focused on training an mNMT model that supports the addition of new languages by relaxing the correspondence between input tokens and encoder representations, thereby improving its zero-shot capacity. We were interested in this method, as learning less specific input tokens during the fine-tuning procedure could help our model avoid overfitting the training pairs. Indeed, generalizing to a new domain can be seen as a task that includes generalizing to an unseen language.
4 Methods
Our new real-world oriented task lies at the crossroads of many existing tasks; we therefore applied ideas from the current literature and combined different approaches to achieve the best results.
4.1 Hyperparameter search heuristics for efficient fine-tuning
We seek to adapt a generic multilingual model to a specific task or domain (Cettolo et al., 2014; Servan et al., 2016b). Recent works in NMT (Domingo et al., 2019) have proposed methods to incrementally adapt a model to a specific domain. We continue the training of the generic model on specific data through several iterations (see Algorithm 1).
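A minimal sketch of this continued-training loop is given below; it is a simplified stand-in for Algorithm 1, assuming HuggingFace-style seq2seq batches that already contain labels and an illustrative iteration count.

    def specialize(model, in_domain_loader, optimizer, n_iterations=3):
        """Continue training the generic mNMT model on in-domain data for
        several iterations, starting from its pre-trained state."""
        model.train()
        for _ in range(n_iterations):
            for batch in in_domain_loader:
                loss = model(**batch).loss  # cross-entropy on in-domain pairs
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
        return model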
This post-training fine-tuning procedure is done
without dropping the previous learning states of
the multilingual model. The resulting model is considered adapted, or specialized, to a specific