Specializing Multi-domain NMT via Penalizing Low Mutual Information

Jiyoung Lee†*, Hantae Kim‡, Hyunchang Cho‡, Edward Choi†, and Cheonbok Park‡
†KAIST   ‡Papago, NAVER Corp.
{jiyounglee0523, edwardchoi}@kaist.ac.kr
{hantae.kim, hyunchang.cho, cbok.park}@navercorp.com
* Work done during an internship at NAVER Corp.
Abstract

Multi-domain Neural Machine Translation (NMT) trains a single model with multiple domains. It is appealing because of its efficacy in handling multiple domains within one model. An ideal multi-domain NMT should learn distinctive domain characteristics simultaneously; however, grasping the domain peculiarity is a non-trivial task. In this paper, we investigate domain-specific information through the lens of mutual information (MI) and propose a new objective that penalizes low MI to become higher. Our method achieves state-of-the-art performance among current competitive multi-domain NMT models. We also empirically show that our objective promotes low MI to become higher, resulting in a domain-specialized multi-domain NMT.
1 Introduction
Multi-domain Neural Machine Translation (NMT) (Sajjad et al., 2017; Farajian et al., 2017) has been an attractive topic due to its efficacy in handling multiple domains with a single model. Ideally, a multi-domain NMT model should capture both general knowledge (e.g., sentence structure, common words) and domain-specific knowledge (e.g., domain terminology) unique to each domain. While the shared knowledge can be easily acquired by sharing parameters across domains (Kobus et al., 2017), obtaining domain-specialized knowledge is a challenging task. Haddow and Koehn (2012) demonstrate that a model trained on multiple domains sometimes underperforms one trained on a single domain. Pham et al. (2021) show that separate domain-specific adaptation modules are not sufficient to fully gain specialized knowledge.
In this paper, we reinterpret domain-specialized knowledge from a mutual information (MI) perspective and propose a method to strengthen it.
[Figure 1: Overview of two models with different MI distributions. The example sentence is from the IT domain. Model A mostly has low MI and Model B has large MI. For an identical sample, Model A outputs the generic term 'calculation' while Model B properly maintains 'computing totals'.
Source (De): Beschreib… Summenberechnung für ein gegebenes Feld oder einen gegebenen Ausdruck.
Reference: Describes a way of computing totals for a given field or expression.
A (Baseline): Describes the kind of calculation for a given field or expression.
B (Ours): Describes the way of computing totals for a given field or expression.]
Given a source sentence X, target sentence Y, and corresponding domain D, the MI between D and the translation Y|X (i.e., MI(D;Y|X)) measures the dependency between the domain and the translated sentence. Here, we assume that the larger MI(D;Y|X) is, the more the translation incorporates domain knowledge. Low MI is undesirable because it indicates the model is not sufficiently utilizing domain characteristics in translation. In other words, low MI can be interpreted as domain-specific information the model has yet to learn. For example, as shown in Fig. 1, we found that a model with low MI translates the IT term 'computing totals' into the vague and plain term 'calculation'. However, once we force the model to have high MI, 'computing totals' is correctly retained in its translation. Thus, maximizing MI promotes multi-domain NMT to be domain-specialized.
Motivated by this idea, we introduce a new method that specializes multi-domain NMT by penalizing low MI. We first theoretically derive MI(D;Y|X) and formulate a new objective that puts more penalty on subword tokens with low MI. Our results show that the proposed method improves translation quality in all domains. The MI visualization also confirms that our method is effective in maximizing MI. We further observe that our model performs particularly better on samples with strong domain characteristics.
The main contributions of our paper are as follows:

• We investigate MI in multi-domain NMT and present a new objective that penalizes low MI to have a higher value.

• Extensive experimental results show that our method truly yields high MI, resulting in a domain-specialized model.
2 Related Works
Multi-Domain Neural Machine Translation. Multi-domain NMT focuses on developing a proper usage of domain information to improve translation. Early studies took two main approaches: injecting source domain information and adding a domain classifier. For the first approach, Kobus et al. (2017) insert a source domain label as an additional tag in the input or as a complementary feature. For the second approach, Britz et al. (2017) train the sentence embedding to be domain-specific by updating it with the gradient from a domain classifier.

While previous work leverages domain information by injection or by implementing an auxiliary classifier, we view domain information from an MI perspective and propose a loss that encourages the model to explore domain-specific knowledge.
Information-Theoretic Approaches in NMT. Mutual information in NMT is primarily used either as a metric or as a loss function. As a metric, Bugliarello et al. (2020) propose cross-mutual information (XMI) to quantify the difficulty of translating between languages, and Fernandes et al. (2021) modify XMI to measure the usage of the given context during translation. As a loss function, Xu et al. (2021) propose bilingual mutual information (BMI), which measures word-mapping diversity and is further applied in NMT training. Zhang et al. (2022) improve translation by maximizing the MI between a target token and its source sentence based on its context.

The above work only considers general machine translation scenarios. Our work differs in that we integrate mutual information into multi-domain NMT to learn domain-specific information. Unlike other methods that require training an additional model, our method calculates MI within a single model, which is more computation-efficient.
3 Proposed Method
In this section, we first derive MI in multi-domain NMT. Then, we introduce a new method that penalizes low MI to have a higher value, resulting in a domain-specialized model.
3.1 Mutual Information in Multi-Domain NMT
Mutual Information (MI) measures the mutual dependency between two random variables. In multi-domain NMT, the MI between the domain (D) and the translation (Y|X), expressed as MI(D;Y|X), represents how much domain-specific information is contained in the translation. MI(D;Y|X) can be written as follows:

    MI(D; Y|X) = \mathbb{E}_{D,X,Y} \left[ \log \frac{p(Y|X, D)}{p(Y|X)} \right]    (1)

The full derivation can be found in Appendix B. Note that the final form of MI(D;Y|X) is the log quotient of the translation probability conditioned on the domain and the translation probability without the domain.
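As a brief sketch of that derivation (Appendix B of the paper gives the full version), Eq. (1) follows from the definition of conditional mutual information together with the chain rule p(D, Y|X) = p(D|X) p(Y|X, D):

    MI(D; Y|X) = \mathbb{E}_{D,X,Y} \left[ \log \frac{p(D, Y|X)}{p(D|X)\, p(Y|X)} \right]
               = \mathbb{E}_{D,X,Y} \left[ \log \frac{p(D|X)\, p(Y|X, D)}{p(D|X)\, p(Y|X)} \right]
               = \mathbb{E}_{D,X,Y} \left[ \log \frac{p(Y|X, D)}{p(Y|X)} \right]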
Since the true distributions are unknown, we approximate them with a parameterized model (Bugliarello et al., 2020; Fernandes et al., 2021), namely the cross-MI (XMI). Naturally, a generic, domain-agnostic model (further referred to as general and abbreviated as G) output would be the appropriate approximation of p(Y|X). A domain-adapted (further shortened as DA) model output would be suitable for p(Y|X, D). Hence, XMI(D;Y|X) can be expressed as Eq. (2) with each model output:

    XMI(D; Y|X) = \mathbb{E}_{D,X,Y} \left[ \log \frac{p_{DA}(Y|X, D)}{p_{G}(Y|X)} \right]    (2)
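As a concrete illustration of Eq. (2), the sketch below estimates XMI on a held-out set of one domain as the average difference between the two models' sentence log-likelihoods. The scorer interface (score_da, score_g) is a hypothetical stand-in for however the two models expose token log-probabilities; it is not taken from the paper's code.

    from typing import Callable, List, Tuple

    # Hypothetical interface: a scorer maps (source, target) to the sum of
    # token-level log-probabilities log p(y_i | y_<i, x) of the target sentence.
    Scorer = Callable[[str, str], float]

    def estimate_xmi(pairs: List[Tuple[str, str]],
                     score_da: Scorer,
                     score_g: Scorer) -> float:
        """Monte-Carlo estimate of XMI(D; Y|X) as in Eq. (2): the expected
        log-ratio of domain-adapted and general model likelihoods."""
        total = 0.0
        for src, tgt in pairs:
            # log p_DA(y|x, d) - log p_G(y|x) is the log of the quotient in Eq. (2)
            total += score_da(src, tgt) - score_g(src, tgt)
        return total / len(pairs)

Dividing by the number of target tokens instead of sentences would give a per-token estimate, which is closer to the quantity visualized in Section 4.3.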
3.2 MI-based Token Weighted Loss
To calculate XMI, we need outputs from both the general and domain-adapted models. Motivated by the success of adapters (Houlsby et al., 2019) in multi-domain NMT (Pham et al., 2021), we assign adapters φ1, ..., φN to each domain (N is the total number of domains) and add an extra adapter φG for the general model. We denote the shared parameters (e.g., the self-attention and feed-forward layers) as θ. A source sentence x from domain d passes through the model twice: once through the corresponding domain adapter φd, and once through the general adapter φG. We then treat the output probability from the domain adapter as p_DA and that from the general adapter as p_G. For the i-th target token y_i, we calculate XMI as in Eq. (3):

    \log \frac{p(y_i \mid y_{<i}, x, \theta, \phi_d)}{p(y_i \mid y_{<i}, x, \theta, \phi_G)}    (3)
where y_<i denotes the target subword tokens up to, but excluding, y_i. For simplicity, we denote Eq. (3) as XMI(i). A low XMI(i) means that our domain-adapted model is not thoroughly utilizing domain information during translation. Therefore, we put more weight on tokens with low XMI(i), which amounts to minimizing Eq. (4):

    L_{MI} = \sum_{i=0}^{n_T} \left( 1 - XMI(i) \right) \cdot \left( 1 - p(y_i \mid y_{<i}, x, \theta, \phi_d) \right)    (4)

where n_T is the number of subword tokens in the target sentence.
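A minimal PyTorch-style sketch of Eqs. (3)-(4) for a single sentence is given below. It assumes the two forward passes (through φd and φG) have already produced the probabilities of the reference tokens; the function and tensor names are illustrative, not the paper's implementation, and XMI(i) is taken as the per-token log-ratio consistent with Eq. (2).

    import torch

    def mi_token_weighted_loss(p_da: torch.Tensor, p_g: torch.Tensor) -> torch.Tensor:
        """Eq. (4): weight each target token by (1 - XMI(i)) * (1 - p_DA(i)).

        p_da: p(y_i | y_<i, x, theta, phi_d) of the reference tokens, shape (n_T,)
        p_g:  p(y_i | y_<i, x, theta, phi_G) of the reference tokens, shape (n_T,)
        """
        eps = 1e-9                                          # numerical safety
        xmi = torch.log(p_da + eps) - torch.log(p_g + eps)  # Eq. (3), per token
        weights = (1.0 - xmi) * (1.0 - p_da)                # larger when XMI is low
        return weights.sum()                                # L_MI for one sentence

Whether gradients are allowed to flow through the weight term, and whether XMI(i) is clipped or rescaled, are implementation details the sketch leaves open.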
The final loss of our method is given in Eq. (7), where λ1 and λ2 are hyperparameters:

    L_{DA} = -\sum_{i=0}^{n_T} \log p(y_i \mid y_{<i}, x, \theta, \phi_d)    (5)

    L_{G} = -\sum_{i=0}^{n_T} \log p(y_i \mid y_{<i}, x, \theta, \phi_G)    (6)

    L = L_{DA} + \lambda_1 L_{G} + \lambda_2 L_{MI}    (7)
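Putting the pieces together, a rough sketch of the per-sentence objective in Eqs. (5)-(7) could look as follows. The assumption that the model exposes the two adapter passes as logit tensors (logits_da, logits_g) is ours for illustration; it is not the paper's actual interface.

    import torch
    import torch.nn.functional as F

    def total_loss(logits_da: torch.Tensor,  # (n_T, vocab) from adapter phi_d
                   logits_g: torch.Tensor,   # (n_T, vocab) from adapter phi_G
                   targets: torch.Tensor,    # (n_T,) reference token ids
                   lambda1: float, lambda2: float) -> torch.Tensor:
        """Eq. (7): L = L_DA + lambda1 * L_G + lambda2 * L_MI."""
        log_p_da = F.log_softmax(logits_da, dim=-1)
        log_p_g = F.log_softmax(logits_g, dim=-1)

        # Per-token log-probabilities of the reference tokens under each adapter.
        lp_da = log_p_da.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        lp_g = log_p_g.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

        loss_da = -lp_da.sum()                                   # Eq. (5)
        loss_g = -lp_g.sum()                                     # Eq. (6)

        xmi = lp_da - lp_g                                       # Eq. (3)
        loss_mi = ((1.0 - xmi) * (1.0 - lp_da.exp())).sum()      # Eq. (4)

        return loss_da + lambda1 * loss_g + lambda2 * loss_mi

In practice the two logit tensors come from the same shared parameters θ with only the adapter swapped, and the loss would be averaged over the sentences in a batch.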
4 Experiments
4.1 Experiment Setting
Dataset. We leverage the preprocessed dataset released by Aharoni and Goldberg (2020), consisting of five domains (IT, Koran, Law, Medical, Subtitles) available in OPUS (Tiedemann, 2012; Aulamo and Tiedemann, 2019). More details on the dataset and preprocessing are described in Appendix A.
Baselines. We compare our method with the following baseline models: (1) Mixed trains a model on all domains with a uniform distribution; (2) Domain-Tag (Kobus et al., 2017) inserts domain information as an additional token in the input; (3) Multitask Learning (MTL) (Britz et al., 2017) trains a domain classifier simultaneously and encourages the sentence embedding to encompass its domain characteristics; (4) Adversarial Learning (AdvL) (Britz et al., 2017) makes the sentence embedding domain-agnostic by flipping the gradient from the domain classifier before back-propagation; (5) Word-Level Domain Context Discrimination (WDC) (Zeng et al., 2018) integrates two sentence embeddings trained with MTL and AdvL, respectively; (6) Word-Adaptive Domain Mixing¹ (Jiang et al., 2020) has domain-specific attention heads, and the final representation is a combination of each head's output based on the predicted domain proportion; and (7) Domain-Adapter (Pham et al., 2021) has separate domain adapters (Houlsby et al., 2019), and a source sentence passes through its corresponding domain adapter. The last baseline can be regarded as our model without the general adapter, trained only with L_DA.
4.2 Main Results
Table 1 presents the sacreBLEU (Post, 2018) and chrF (Popović, 2015) scores of each model in all domains. For a fair comparison, we matched the number of parameters across all models. Baseline results following their original implementations with different parameter sizes are provided in Appendix C. Interestingly, Mixed performs on par with Domain-Tag and outperforms Word-Adaptive Domain Mixing, suggesting that not all multi-domain NMT methods are effective. Although the adapter-based models (i.e., Ours (w/o L_MI) and Domain-Adapter) outperform Mixed, the performance increase is still marginal. Our model gains a 1.15 BLEU improvement over Mixed. It also outperforms all baselines with a statistically significant difference.

As an ablation study of our MI objective, we conduct experiments without L_MI to prove its effectiveness. The result confirms that L_MI encourages the model to learn domain-specific knowledge, leading to refined translations.
4.3 Mutual Information Distribution
We visualize XMI(i) in Eq. (3) to verify that our proposed loss penalizes low XMI. Figure 2 shows the histogram of XMI(i) over the test samples in Law; the distributions for the other domains are in Appendix D. We use Domain-Adapter for comparison since it performs the best among the baselines. For p_G, we use the output probability of Mixed in both cases. From the distributions, our method indeed penalizes low XMI …
¹ We conducted experiments using publicly available code.
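For reference, a histogram like the one described above could be drawn from the collected per-token XMI(i) values with a few lines of matplotlib. The variable names (xmi_ours, xmi_baseline) are illustrative placeholders, not tied to the paper's code.

    import matplotlib.pyplot as plt

    def plot_xmi_histogram(xmi_ours, xmi_baseline, domain: str = "Law") -> None:
        """Overlay histograms of per-token XMI(i) values (Eq. (3)) collected on a
        domain's test set for our model and the Domain-Adapter baseline."""
        plt.hist(xmi_baseline, bins=50, alpha=0.5, label="Domain-Adapter")
        plt.hist(xmi_ours, bins=50, alpha=0.5, label="Ours")
        plt.xlabel("XMI(i)")
        plt.ylabel("Token count")
        plt.title(f"Per-token XMI distribution ({domain})")
        plt.legend()
        plt.show()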