Domain Specific Sub-network for Multi-Domain Neural Machine
Translation
Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify and Ahmed Y. Tawfik
Microsoft Egypt Development Center, Cairo, Egypt
{amrhendy,mohamed.abdelghaar,mafify,atawfik}@microsoft.com
Abstract
This paper presents the Domain-Specific Sub-network (DoSS). It uses a set of masks obtained through pruning to define a sub-network for each domain and finetunes the sub-network parameters on domain data. This performs very close to finetuning the whole network on each domain while drastically reducing the number of domain-specific parameters. In addition, a method to make the masks unique per domain is proposed and shown to greatly improve generalization to unseen domains. In our experiments on German to English machine translation the proposed method outperforms the strong baseline of continued training on multi-domain (medical, tech and religion) data by 1.47 BLEU points. Furthermore, continued training of DoSS on a new domain (legal) outperforms the multi-domain (medical, tech, religion, legal) baseline by 1.52 BLEU points.
1 Introduction
Neural machine translation (NMT) has witnessed significant advances based on transformer models (Vaswani et al., 2017). These models are typically trained on large amounts of data from different sources, i.e. general data, for a single language pair or multiple languages (Aharoni et al., 2019). The fact that the models are trained on general data usually leads to poor, or less than average, performance on specific domains. This has many practical implications since many users of machine translation are interested in the performance on some specific domain(s). Therefore, improving the performance of NMT on specific domains has become an active area of research; we refer the reader to (Chu and Wang, 2018) for a review. Broadly speaking, the proposed techniques can be divided into data-centric and model-centric approaches. The goal of the former is to acquire, often automatically, monolingual and bilingual data that is representative of the domain of interest. The latter techniques, on the other hand, focus on modifying the model to perform well on the domain of interest without sacrificing the performance on general data.
Finetuning of the model parameters using domain data is perhaps one of the earliest and most popular techniques for domain adaptation (Freitag and Al-Onaizan, 2016). Parallel domain data is usually limited, and to avoid overfitting, different techniques such as model interpolation (Wortsman et al., 2021), regularization (Miceli Barone et al., 2017) and mixing domain and general data (Chu et al., 2017) are used. Other methods that introduce additional parameters in a controllable way have also been proposed, such as adapters (Bapna and Firat, 2019) and low-rank adaptation (LoRA) (Hu et al., 2021).
In (Frankle and Carbin, 2018) it is shown that identifying sub-networks by pruning a large network, referred to as winning tickets, and retraining them leads to accuracy equal to that of the original network. This idea is explored for multilingual neural machine translation (MNMT) using the so-called language-specific sub-networks (LaSS) (Lin et al., 2021). Here we further explore the idea for domain finetuning and refer to it as Domain-Specific Sub-network (DoSS). The basic idea is to identify a sub-network per domain via pruning and masking. The sub-network has parameters shared with other domains as well as domain-specific parameters. It should be noted that the masks can overlap for multiple domains, which results in some parameters being shared by multiple domains. We also explore using constrained masks where we ensure that each mask represents only one domain. The latter is expected to work better for adding unseen domains. In contrast to language, domain information may not necessarily be known at inference time. In this work, similar to common domain finetuning setups, we assume the domain information is known, but using a domain classifier at runtime should be straightforward. Given the domain information,
inference can be carried out with the trained model and the domain mask.
The paper is organized as follows. Section 2
gives a detailed description of the proposed method
followed by the experimental results in Section 3.
Finally, the conclusion is given in Section 4.
2 Method
We present the DoSS method in this section as shown in Figure 1. We focus on the bilingual setting and defer the multilingual case to future work. Assume we have an initial model $\lambda_0$ that is trained on large amounts of general data. We also have the data sets $\{D_i\}_{i=1}^{N}$ corresponding to $N$ domains, and each data set consists of $L_i$ sentence pairs $(x_j, y_j)$. Typically, the initial model is finetuned for each domain, resulting in $N$ domain models. Here, we first create a mask for each domain using pruning and then train a domain sub-network using the resulting masks. We explain the two steps below.
2.1 Creating Domain Masks
We create a binary mask $M_i$ for each domain that has a 0 or 1 for each model parameter. Following (Lin et al., 2021) we calculate the domain masks as follows (a code sketch is given after the list):

1. Start from the initial model $\lambda_0$.

2. For each domain $i$, finetune $\lambda_0$ using the corresponding domain data $D_i$ for 5 to 10 epochs. This will intuitively amplify the important weights for the domain and diminish other weights. This finetuning stage requires only a few epochs compared to the full finetuning training budget, which makes it an effective way to build the mask.

3. Sort the weights of the finetuned model and prune the lowest $\alpha$ in the encoder and the lowest $\beta$ in the decoder. We found that using separate pruning parameters for the encoder and the decoder gives better control over the resulting sub-networks. The mask for domain $i$ is created by setting the top $1-\alpha$ percent in the encoder and $1-\beta$ percent in the decoder to 1 and all other elements to 0.
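The following is a minimal PyTorch-style sketch of this magnitude-based mask creation. It assumes a standard transformer NMT model whose parameter names are prefixed with "encoder." or "decoder."; the function name build_domain_mask and its arguments are illustrative assumptions, not the authors' code.

```python
import torch

def build_domain_mask(finetuned_model, alpha, beta):
    """Keep the top (1 - alpha) of encoder weights and the top (1 - beta)
    of decoder weights of a briefly domain-finetuned model (step 2 above);
    pruned positions get mask value 0."""
    mask = {}
    for name, param in finetuned_model.named_parameters():
        # Assumption: parameter names start with "encoder." or "decoder.".
        prune_ratio = alpha if name.startswith("encoder.") else beta
        flat = param.detach().abs().flatten()
        k = int(prune_ratio * flat.numel())
        if k == 0:
            mask[name] = torch.ones_like(param, dtype=torch.bool)
            continue
        # Magnitude threshold: the k smallest absolute values are pruned.
        threshold = torch.kthvalue(flat, k).values
        mask[name] = param.detach().abs() > threshold
    return mask
```

In this sketch every non-encoder parameter (including embeddings) falls under the decoder ratio $\beta$; how such parameters are grouped in the original work is not specified in this section.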
The above mask creation algorithm is unconstrained in the sense that multiple domains can share the same weight. This poses no problem as long as we train multiple domains simultaneously as described below, but it will degrade performance if we want to add a new domain after the model has been trained for a set of domains. Therefore, we also experiment with a simple constrained mask creation where step 3 is modified to set a mask element to 1 only if it belongs to the top $1-\alpha$ ($1-\beta$) percent in the encoder (decoder) and does not belong to any other domain mask. This makes the sub-network parameters unique but is dependent on the order in which the domains are presented and can cover at most $\min(1/(1-\alpha), 1/(1-\beta))$ domains. Looking into more sophisticated constrained methods could be a topic for future research. Once the domain masks are created we train the sub-networks again following a similar algorithm to (Lin et al., 2021).
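A sketch of the constrained variant could reuse the function above and simply exclude weights already claimed by earlier domains; the existing_masks argument and its ordering are assumptions for illustration.

```python
def build_constrained_mask(finetuned_model, alpha, beta, existing_masks):
    """Constrained mask: a weight belongs to the new domain only if it
    survives pruning AND is not already set in any earlier domain mask,
    so the result depends on the order in which domains are processed."""
    new_mask = build_domain_mask(finetuned_model, alpha, beta)
    for name in new_mask:
        for other in existing_masks:
            # Remove weights already owned by a previously created mask.
            new_mask[name] &= ~other[name]
    return new_mask
```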
2.2 Training the Sub-networks
Here we follow the so-called structure-aware joint training. Given the initial model $\lambda_0$ and the domain masks $M_i$, we finetune the initial model using the domain data. The finetuning is done in a mask-aware manner where the mini-batches are formed per domain $i$, and for each mini-batch we only update the parameters where $M_i$ equals 1. This way we end up with a single model $\lambda$ where the shared parameters come from the original model and the domain-specific parameters come from the structure-aware training.
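One way to realize such a mask-aware update, sketched here under the assumption of a plain gradient-based optimizer, is to zero the gradients of all parameters outside the current domain's sub-network before the optimizer step; the function and argument names are illustrative, not the authors' implementation.

```python
def structure_aware_step(model, masks, domain_id, batch, optimizer, loss_fn):
    """One mask-aware update on a mini-batch drawn from a single domain:
    gradients are kept only where that domain's mask equals 1, so all
    other parameters are left at their current (shared) values."""
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    mask = masks[domain_id]
    for name, param in model.named_parameters():
        if param.grad is not None:
            # Zero out gradients of parameters outside the domain sub-network.
            param.grad.mul_(mask[name].to(param.grad.dtype))
    optimizer.step()
    return loss.item()
```

Note that with a stateful optimizer such as Adam, momentum accumulated from other domains could still move masked parameters, so this sketch is only exact for plain SGD.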
2.3 Inference
Inference is done using the model $\lambda$ and its masks $M$. For an input utterance coming from domain $i$, inference is done using the parameters $\lambda_{M_i}$, where this stands for using the finetuned parameters selected by the mask and the original parameters otherwise. Domain information is often not known at test time, but in this work we assume that the domain is known and perform inference on batches from the same domain for efficiency. When the domain is unknown we can use a domain classifier at run-time. We will test this approach in future work.
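Following the description above, the per-domain inference parameters could be assembled as a mask-weighted combination of the jointly trained model and the original general model. This is a minimal sketch; select_domain_parameters, initial_state and the handling of unmasked entries are assumptions, not the authors' implementation.

```python
def select_domain_parameters(model, initial_state, masks, domain_id):
    """Compose the inference parameters for one domain: finetuned values
    where the domain mask is 1, original (general) values elsewhere."""
    composed = {}
    mask = masks[domain_id]
    for name, finetuned in model.state_dict().items():
        if name in mask:
            m = mask[name].to(finetuned.dtype)
            composed[name] = m * finetuned + (1.0 - m) * initial_state[name]
        else:
            # Buffers and any unmasked entries keep their current values.
            composed[name] = finetuned
    return composed
```

The composed dictionary can then be loaded with model.load_state_dict(composed) before decoding a batch from domain $i$.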
3 Experiments and Results
We evaluate the performance of DoSS on German to English translation, and we consider three domains: medicine, religion, and technology. The baseline model was a German to English model trained on 32.13M parallel sentences that were provided by the WMT19 news translation shared task.¹

¹https://www.statmt.org/wmt19/translation-task.html