Domain Specific Sub-network for Multi-Domain Neural Machine
Translation
Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify and Ahmed Y. Tawfik
Microsoft Egypt Development Center, Cairo, Egypt
{amrhendy,mohamed.abdelghaar,mafify,atawfik}@microsoft.com
Abstract
This paper presents the Domain-Specific Sub-network (DoSS). It uses a set of masks obtained through pruning to define a sub-network for each domain and finetunes the sub-network parameters on domain data. This performs very close to finetuning the whole network on each domain while drastically reducing the number of domain-specific parameters. In addition, a method to make the masks unique per domain is proposed and shown to greatly improve generalization to unseen domains. In our experiments on German to English machine translation the proposed method outperforms the strong baseline of continued training on multi-domain (medical, tech and religion) data by 1.47 BLEU points. Furthermore, continued training of DoSS on a new domain (legal) outperforms the multi-domain (medical, tech, religion, legal) baseline by 1.52 BLEU points.
1 Introduction
Neural machine translation (NMT) has witnessed significant advances based on transformer models (Vaswani et al., 2017). These models are typically trained on large amounts of data from different sources, i.e. general data, for a single language pair or multiple languages (Aharoni et al., 2019). The fact that the models are trained on general data usually leads to poor, or less than average, performance on specific domains. This has many practical implications since many users of machine translation are interested in the performance on some specific domain(s). Therefore, improving the performance of NMT on specific domains has become an active area of research; we refer the reader to (Chu and Wang, 2018) for a review. Broadly speaking, the proposed techniques can be divided into data-centric and model-centric approaches. The goal of the former is to acquire, often automatically, monolingual and bilingual data that is representative of the domain of interest. The latter techniques, on the other hand, focus on modifying the model to perform well on the domain of interest without sacrificing the performance on general data.
Finetuning of the model parameters using domain data is perhaps one of the earliest and most popular techniques for domain adaptation (Freitag and Al-Onaizan, 2016). Parallel domain data is usually limited, and to avoid overfitting, different techniques such as model interpolation (Wortsman et al., 2021), regularization (Miceli Barone et al., 2017) and mixing domain and general data (Chu et al., 2017) are used. Other methods that introduce additional parameters in a controllable way have also been proposed, such as adapters (Bapna and Firat, 2019) and low-rank adaptation (LoRA) (Hu et al., 2021).
In (Frankle and Carbin, 2018) it is shown that identifying sub-networks by pruning a large network, referred to as winning tickets, and retraining them leads to accuracy equal to that of the original network. This idea is explored for multilingual neural machine translation (MNMT) using the so-called language-specific sub-networks (LaSS) (Lin et al., 2021). Here we further explore the idea for domain finetuning and refer to it as Domain-Specific Sub-network (DoSS). The basic idea is to identify a sub-network per domain via pruning and masking. The sub-network has parameters shared with other domains as well as domain-specific parameters. It should be noted that the masks can overlap for multiple domains, which results in some parameters being shared by multiple domains. We also explore using constrained masks where we ensure that each mask represents only one domain. The latter is expected to work better for adding unseen domains. In contrast to language, domain information may not necessarily be known at inference time. In this work, similar to common domain finetuning setups, we assume the domain information is known, but using a domain classifier at runtime should be straightforward. Given the domain information,
inference can be carried out with the trained model and the domain mask.
The paper is organized as follows. Section 2
gives a detailed description of the proposed method
followed by the experimental results in Section 3.
Finally, the conclusion is given in Section 4.
2 Method
We present the DoSS method in this section as shown in Figure 1. We focus on the bilingual setting and defer the multilingual case to future work. Assume we have an initial model $\lambda_0$ that is trained on large amounts of general data. We also have the data sets $\{D_i\}_{i=1}^{N}$ corresponding to $N$ domains, and each data set consists of $L_i$ sentence pairs $(x_j, y_j)$. Typically, the initial model is finetuned for each domain, resulting in $N$ domain models. Here, we first create a mask for each domain using pruning and then train a domain sub-network using the resulting masks. We explain the two steps below.
2.1 Creating Domain Masks
We create a binary mask $M_i$ for each domain that has a 0 or 1 for each model parameter. Following (Lin et al., 2021) we calculate the domain masks as follows (a code sketch is given after the list):

1. Start from the initial model $\lambda_0$.

2. For each domain $i$, finetune $\lambda_0$ using the corresponding domain data $D_i$ for 5 to 10 epochs. This will intuitively amplify the important weights for the domain and diminish other weights. This finetuning stage requires only a few epochs compared to the full finetuning training budget, which makes it an effective way to build the mask.

3. Sort the weights of the finetuned model and prune the lowest $\alpha$ in the encoder and the lowest $\beta$ in the decoder. We found that using separate pruning parameters for the encoder and the decoder gives better control over the resulting sub-networks. The mask for domain $i$ is created by setting the top $1-\alpha$ percent in the encoder and $1-\beta$ percent in the decoder to 1 and all other elements to 0.
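The following is a minimal PyTorch-style sketch of this magnitude-based mask creation. It assumes a standard transformer NMT model whose parameter names are prefixed with "encoder." or "decoder."; the function name build_domain_mask and its arguments are illustrative assumptions, not the authors' code.

```python
import torch

def build_domain_mask(finetuned_model, alpha, beta):
    """Keep the top (1 - alpha) of encoder weights and the top (1 - beta)
    of decoder weights of a briefly domain-finetuned model (step 2 above);
    pruned positions get mask value 0."""
    mask = {}
    for name, param in finetuned_model.named_parameters():
        # Assumption: parameter names start with "encoder." or "decoder.".
        prune_ratio = alpha if name.startswith("encoder.") else beta
        flat = param.detach().abs().flatten()
        k = int(prune_ratio * flat.numel())
        if k == 0:
            mask[name] = torch.ones_like(param, dtype=torch.bool)
            continue
        # Magnitude threshold: the k smallest absolute values are pruned.
        threshold = torch.kthvalue(flat, k).values
        mask[name] = param.detach().abs() > threshold
    return mask
```

In this sketch every non-encoder parameter (including embeddings) falls under the decoder ratio $\beta$; how such parameters are grouped in the original work is not specified in this section.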
The above mask creation algorithm is unconstrained in the sense that multiple domains can share the same weight. This poses no problem as long as we train multiple domains simultaneously as described below, but it will degrade performance if we want to add a new domain after the model has been trained for a set of domains. Therefore, we also experiment with a simple constrained mask creation where step 3 is modified to set a mask element to 1 only if it belongs to the top $1-\alpha$ ($1-\beta$) percent in the encoder (decoder) and does not belong to any other domain mask. This makes the sub-network parameters unique but is dependent on the order in which the domains are presented and can cover at most $\min(1/(1-\alpha), 1/(1-\beta))$ domains. Looking into more sophisticated constrained methods could be a topic for future research. Once the domain masks are created we train the sub-networks again following a similar algorithm to (Lin et al., 2021).
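A sketch of the constrained variant could reuse the function above and simply exclude weights already claimed by earlier domains; the existing_masks argument and its ordering are assumptions for illustration.

```python
def build_constrained_mask(finetuned_model, alpha, beta, existing_masks):
    """Constrained mask: a weight belongs to the new domain only if it
    survives pruning AND is not already set in any earlier domain mask,
    so the result depends on the order in which domains are processed."""
    new_mask = build_domain_mask(finetuned_model, alpha, beta)
    for name in new_mask:
        for other in existing_masks:
            # Remove weights already owned by a previously created mask.
            new_mask[name] &= ~other[name]
    return new_mask
```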
2.2 Training the Sub-networks
Here we follow the so-called structure-aware joint training. Given the initial model $\lambda_0$ and the domain masks $M_i$, we finetune the initial model using the domain data. The finetuning is done in a mask-aware manner where the mini-batches are formed per domain $i$, and for each mini-batch we only update the parameters where $M_i$ equals 1. This way we end up with a single model $\lambda$ where the shared parameters come from the original model and the domain-specific parameters come from the structure-aware training.
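One way to realize such a mask-aware update, sketched here under the assumption of a plain gradient-based optimizer, is to zero the gradients of all parameters outside the current domain's sub-network before the optimizer step; the function and argument names are illustrative, not the authors' implementation.

```python
def structure_aware_step(model, masks, domain_id, batch, optimizer, loss_fn):
    """One mask-aware update on a mini-batch drawn from a single domain:
    gradients are kept only where that domain's mask equals 1, so all
    other parameters are left at their current (shared) values."""
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    mask = masks[domain_id]
    for name, param in model.named_parameters():
        if param.grad is not None:
            # Zero out gradients of parameters outside the domain sub-network.
            param.grad.mul_(mask[name].to(param.grad.dtype))
    optimizer.step()
    return loss.item()
```

Note that with a stateful optimizer such as Adam, momentum accumulated from other domains could still move masked parameters, so this sketch is only exact for plain SGD.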
2.3 Inference
Inference is done using the model $\lambda$ and its masks $M$. For an input utterance coming from domain $i$, inference is done using the parameters $\lambda_{M_i}$, where this stands for using the finetuned parameters selected by the mask and the original parameters otherwise. Domain information is often not known at test time, but in this work we assume that the domain is known and perform inference on batches from the same domain for efficiency. When the domain is unknown we can use a domain classifier at run-time. We will test this approach in future work.
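Following the description above, the per-domain inference parameters could be assembled as a mask-weighted combination of the jointly trained model and the original general model. This is a minimal sketch; select_domain_parameters, initial_state and the handling of unmasked entries are assumptions, not the authors' implementation.

```python
def select_domain_parameters(model, initial_state, masks, domain_id):
    """Compose the inference parameters for one domain: finetuned values
    where the domain mask is 1, original (general) values elsewhere."""
    composed = {}
    mask = masks[domain_id]
    for name, finetuned in model.state_dict().items():
        if name in mask:
            m = mask[name].to(finetuned.dtype)
            composed[name] = m * finetuned + (1.0 - m) * initial_state[name]
        else:
            # Buffers and any unmasked entries keep their current values.
            composed[name] = finetuned
    return composed
```

The composed dictionary can then be loaded with model.load_state_dict(composed) before decoding a batch from domain $i$.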
3 Experiments and Results
We evaluate the performance of DoSS on German to English translation, and we consider three domains: medicine, religion, and technology. The baseline model was a German to English model trained on 32.13M parallel sentences that were provided by the WMT19 news translation shared task.¹

¹https://www.statmt.org/wmt19/translation-task.html