
number of domains) and have an extra adapter $\phi_G$ for the general domain. We denote the shared parameters (e.g., the self-attention and feed-forward layers) as $\theta$.
For a source sentence $x$ from domain $d$, $x$ passes through the model twice: once through the corresponding domain adapter $\phi_d$, and once through the general adapter $\phi_G$. We treat the output probability from the domain adapter as $p_{DA}$ and that from the general adapter as $p_G$. For the $i$-th target token $y_i$, we calculate XMI as in Eq. (3),
$$p(y_i \mid y_{<i}, x, \theta, \phi_d) - p(y_i \mid y_{<i}, x, \theta, \phi_G) \tag{3}$$
where $y_{<i}$ denotes the target subword tokens up to, but excluding, $y_i$. For simplicity, we denote Eq. (3) as $\mathrm{XMI}(i)$. A low $\mathrm{XMI}(i)$ means that our domain-adapted model is not thoroughly utilizing domain information during translation. Therefore, we put more weight on tokens with low $\mathrm{XMI}(i)$, which amounts to minimizing Eq. (4),
$$\mathcal{L}_{MI} = \sum_{i=0}^{n_T} \bigl(1 - \mathrm{XMI}(i)\bigr) \cdot \bigl(1 - p(y_i \mid y_{<i}, x, \theta, \phi_d)\bigr) \tag{4}$$
where $n_T$ is the number of subword tokens in the target sentence.
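As a concrete illustration, the following PyTorch-style sketch computes the token-level XMI of Eq. (3) and the weighted term of Eq. (4) from the gold-token probabilities gathered in the two forward passes. The function name, tensor shapes, and calling convention are our own illustrative assumptions, not the authors' released code.

```python
import torch

def mi_weighted_loss(p_domain: torch.Tensor, p_general: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. (3)-(4).

    p_domain:  p(y_i | y_<i, x, theta, phi_d) for each target token, shape (n_T,),
               from the pass through the domain adapter.
    p_general: p(y_i | y_<i, x, theta, phi_G), same shape, from the general adapter.
    """
    # Eq. (3): token-level XMI is the probability gap between the two adapters.
    xmi = p_domain - p_general
    # Eq. (4): tokens with low XMI (i.e., little domain information used)
    # receive a larger weight in the loss.
    return ((1.0 - xmi) * (1.0 - p_domain)).sum()
```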
The final loss of our method is in Eq. (7), where $\lambda_1$ and $\lambda_2$ are hyperparameters.
$$\mathcal{L}_{DA} = -\sum_{i=0}^{n_T} \log p(y_i \mid y_{<i}, x, \theta, \phi_d) \tag{5}$$
$$\mathcal{L}_{G} = -\sum_{i=0}^{n_T} \log p(y_i \mid y_{<i}, x, \theta, \phi_G) \tag{6}$$
$$\mathcal{L} = \mathcal{L}_{DA} + \lambda_1 \mathcal{L}_{G} + \lambda_2 \mathcal{L}_{MI} \tag{7}$$
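Putting Eqs. (4)-(7) together, a minimal sketch of the full training objective might look as follows. Again, the interface (gold-token probability vectors from the two passes) and the epsilon added for numerical stability are our assumptions, and $\lambda_1$, $\lambda_2$ must be chosen by the user since their values are not specified here.

```python
import torch

def total_loss(p_domain: torch.Tensor, p_general: torch.Tensor,
               lambda1: float, lambda2: float) -> torch.Tensor:
    """Sketch of the final objective in Eq. (7)."""
    eps = 1e-9  # avoid log(0); not part of the paper's formulation
    # Eq. (5): negative log-likelihood through the domain adapter.
    l_da = -(p_domain + eps).log().sum()
    # Eq. (6): negative log-likelihood through the general adapter.
    l_g = -(p_general + eps).log().sum()
    # Eq. (4): MI-weighted term, as in the sketch above.
    xmi = p_domain - p_general
    l_mi = ((1.0 - xmi) * (1.0 - p_domain)).sum()
    # Eq. (7): weighted combination of the three terms.
    return l_da + lambda1 * l_g + lambda2 * l_mi
```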
4 Experiments
4.1 Experiment Setting
Dataset.
We leverage the preprocessed dataset released by Aharoni and Goldberg (2020), consisting of five domains (IT, Koran, Law, Medical, Subtitles) available in OPUS (Tiedemann, 2012; Aulamo and Tiedemann, 2019). More details on the dataset and preprocessing are described in Appendix A.
Baseline.
We compare our method with the following baseline models: (1) Mixed trains a model on all domains with a uniform distribution, (2) Domain-Tag (Kobus et al., 2017) inserts domain information as an additional token in the input, (3) Multitask Learning (MTL) (Britz et al., 2017) trains a domain classifier simultaneously and encourages the sentence embedding to encompass its domain characteristics, (4) Adversarial Learning (AdvL) (Britz et al., 2017) makes the sentence embedding domain-agnostic by flipping the gradient from the domain classifier before back-propagation, (5) Word-Level Domain Context Discrimination (WDC) (Zeng et al., 2018) integrates two sentence embeddings trained with MTL and AdvL, respectively, (6) Word-Adaptive Domain Mixing¹ (Jiang et al., 2020) has domain-specific attention heads, and the final representation combines each head's output according to the predicted domain proportions, and (7) Domain-Adapter (Pham et al., 2021) has separate domain adapters (Houlsby et al., 2019), and a source sentence passes through the adapter of its domain. This can be regarded as our model without the general adapter, trained only with $\mathcal{L}_{DA}$.
4.2 Main Results
Table 1 presents the sacreBLEU (Post, 2018) and chrF (Popović, 2015) scores of each model in all domains. For a fair comparison, we matched the number of parameters across all models. Baseline results following their original implementations, with different parameter sizes, are provided in Appendix C. Interestingly, Mixed performs on par with Domain-Tag and outperforms Word-Adaptive Domain Mixing, suggesting that not all multi-domain NMT methods are effective. Although the adapter-based models (i.e., Ours (w/o $\mathcal{L}_{MI}$) and Domain-Adapter) outperform Mixed, the performance increase is still marginal. Our model gains a 1.15 BLEU improvement over Mixed and outperforms all baselines with a statistically significant difference.
As an ablation study of our MI objective, we conduct experiments without $\mathcal{L}_{MI}$ to verify its effectiveness. The result confirms that $\mathcal{L}_{MI}$ encourages the model to learn domain-specific knowledge, leading to more refined translations.
4.3 Mutual Information Distribution
We visualize $\mathrm{XMI}(i)$ in Eq. (3) to verify that our proposed loss penalizes low XMI. Figure 2 shows the histogram of $\mathrm{XMI}(i)$ over the test samples in Law; distributions for the other domains are in Appendix D. We use Domain-Adapter for comparison since it performs best among the baselines. For $p_G$, we use the output probability of Mixed in both cases.
From the distributions, our method indeed penal-
¹ We conducted experiments using publicly available code.