
number of domains) and have an extra adapter $\phi_G$ for the general domain. We denote the shared parameters (e.g., the self-attention and feed-forward layers) as $\theta$.
For a source sentence $x$ from domain $d$, $x$ passes through the model twice: once through the corresponding domain adapter $\phi_d$, and once through the general adapter $\phi_G$. We treat the output probability from the domain adapter as $p_{DA}$ and that from the general adapter as $p_G$. For the $i$-th target token $y_i$, we calculate XMI as in Eq. (3),
$$p(y_i \mid y_{<i}, x, \theta, \phi_d) - p(y_i \mid y_{<i}, x, \theta, \phi_G) \tag{3}$$
where $y_{<i}$ denotes the target subword tokens up to, but excluding, $y_i$. For simplicity, we denote Eq. (3) as $\mathrm{XMI}(i)$. A low $\mathrm{XMI}(i)$ means that our domain-adapted model is not thoroughly utilizing domain information during translation. Therefore, we put more weight on tokens with low $\mathrm{XMI}(i)$, which amounts to minimizing Eq. (4),
$$\mathcal{L}_{MI} = \sum_{i=0}^{n_T} \bigl(1 - \mathrm{XMI}(i)\bigr) \cdot \bigl(1 - p(y_i \mid y_{<i}, x, \theta, \phi_d)\bigr) \tag{4}$$
where $n_T$ is the number of subword tokens in the target sentence.
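As a concrete illustration, the following PyTorch-style sketch computes the token-level XMI of Eq. (3) and the weighted term of Eq. (4) from the gold-token probabilities gathered in the two forward passes. The function name, tensor shapes, and calling convention are our own illustrative assumptions, not the authors' released code.

```python
import torch

def mi_weighted_loss(p_domain: torch.Tensor, p_general: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. (3)-(4).

    p_domain:  p(y_i | y_<i, x, theta, phi_d) for each target token, shape (n_T,),
               from the pass through the domain adapter.
    p_general: p(y_i | y_<i, x, theta, phi_G), same shape, from the general adapter.
    """
    # Eq. (3): token-level XMI is the probability gap between the two adapters.
    xmi = p_domain - p_general
    # Eq. (4): tokens with low XMI (i.e., little domain information used)
    # receive a larger weight in the loss.
    return ((1.0 - xmi) * (1.0 - p_domain)).sum()
```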
The final loss of our method is in Eq. (7), where $\lambda_1$ and $\lambda_2$ are hyperparameters.
$$\mathcal{L}_{DA} = -\sum_{i=0}^{n_T} \log p(y_i \mid y_{<i}, x, \theta, \phi_d) \tag{5}$$
$$\mathcal{L}_{G} = -\sum_{i=0}^{n_T} \log p(y_i \mid y_{<i}, x, \theta, \phi_G) \tag{6}$$
$$\mathcal{L} = \mathcal{L}_{DA} + \lambda_1 \mathcal{L}_{G} + \lambda_2 \mathcal{L}_{MI} \tag{7}$$
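Putting Eqs. (4)-(7) together, a minimal sketch of the full training objective might look as follows. Again, the interface (gold-token probability vectors from the two passes) and the epsilon added for numerical stability are our assumptions, and $\lambda_1$, $\lambda_2$ must be chosen by the user since their values are not specified here.

```python
import torch

def total_loss(p_domain: torch.Tensor, p_general: torch.Tensor,
               lambda1: float, lambda2: float) -> torch.Tensor:
    """Sketch of the final objective in Eq. (7)."""
    eps = 1e-9  # avoid log(0); not part of the paper's formulation
    # Eq. (5): negative log-likelihood through the domain adapter.
    l_da = -(p_domain + eps).log().sum()
    # Eq. (6): negative log-likelihood through the general adapter.
    l_g = -(p_general + eps).log().sum()
    # Eq. (4): MI-weighted term, as in the sketch above.
    xmi = p_domain - p_general
    l_mi = ((1.0 - xmi) * (1.0 - p_domain)).sum()
    # Eq. (7): weighted combination of the three terms.
    return l_da + lambda1 * l_g + lambda2 * l_mi
```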
4 Experiments
4.1 Experiment Setting
Dataset.
We leverage the preprocessed dataset released by Aharoni and Goldberg (2020), consisting of five domains (IT, Koran, Law, Medical, Subtitles) available in OPUS (Tiedemann, 2012; Aulamo and Tiedemann, 2019). More details on the dataset and preprocessing are described in Appendix A.
Baseline.
We compare our method with the following baseline models: (1) Mixed trains a model on all domains with a uniform distribution, (2) Domain-Tag (Kobus et al., 2017) inserts domain information as an additional token in the input, (3) Multitask Learning (MTL) (Britz et al., 2017) trains a domain classifier simultaneously and encourages the sentence embedding to encompass its domain characteristics, (4) Adversarial Learning (AdvL) (Britz et al., 2017) makes the sentence embedding domain-agnostic by flipping the gradient from the domain classifier before back-propagation, (5) Word-Level Domain Context Discrimination (WDC) (Zeng et al., 2018) integrates two sentence embeddings trained with MTL and AdvL, respectively, (6) Word-Adaptive Domain Mixing¹ (Jiang et al., 2020) has domain-specific attention heads, and the final representation combines each head's output according to the predicted domain proportions, and (7) Domain-Adapter (Pham et al., 2021) has separate domain adapters (Houlsby et al., 2019), and a source sentence passes through the adapter of its domain. This can be regarded as our model without the general adapter, trained only with $\mathcal{L}_{DA}$.
4.2 Main Results
Table 1 presents the sacreBLEU (Post, 2018) and chrF (Popović, 2015) scores of each model in all domains. For a fair comparison, we matched the number of parameters across all models. Baseline results following their original implementations, with different parameter sizes, are provided in Appendix C. Interestingly, Mixed performs on par with Domain-Tag and outperforms Word-Adaptive Domain Mixing, suggesting that not all multi-domain NMT methods are effective. Although the adapter-based models (i.e., Ours (w/o $\mathcal{L}_{MI}$) and Domain-Adapter) outperform Mixed, the performance increase is still marginal. Our model gains a 1.15 BLEU improvement over Mixed and outperforms all baselines with a statistically significant difference.
As an ablation study of our MI objective, we conduct experiments without $\mathcal{L}_{MI}$ to verify its effectiveness. The result confirms that $\mathcal{L}_{MI}$ encourages the model to learn domain-specific knowledge, leading to more refined translations.
4.3 Mutual Information Distribution
We visualize $\mathrm{XMI}(i)$ in Eq. (3) to verify that our proposed loss penalizes low XMI. Figure 2 shows the histogram of $\mathrm{XMI}(i)$ over the test samples in Law; distributions for the other domains are in Appendix D. We use Domain-Adapter for comparison since it performs best among the baselines. For $p_G$, we use the output probability of Mixed in both cases.
From the distributions, our method indeed penal-
¹ We conducted experiments using publicly available code.