2.1 Multilingual NMT
Given a set of languages $L$, the primary goal of MNMT is to learn a single NMT model that can handle all translation directions of interest within this set (Dabre et al., 2020). According to the parameter-sharing strategy, MNMT can be categorised into: 1) partial parameter sharing (Dong et al., 2015; Firat et al., 2016; Zhang et al., 2021), and 2) full parameter sharing (Ha et al., 2016; Johnson et al., 2017). The latter has been widely adopted because of its simplicity, light weight, and zero-shot capability; we therefore adopt the full parameter-sharing strategy in our work.
In fully parameter-shared MNMT, all parameters of the encoder, decoder, and attention modules are shared across tasks. Special language tags are introduced to indicate the target language; these tags can be prepended to either the source or the target sentences. The model is then trained jointly to minimise the negative log-likelihood across all training instances:
$$\mathcal{L}_{\text{ML}}(\boldsymbol{\theta}) := -\sum_{(s,t) \in T} \; \sum_{(\boldsymbol{x},\boldsymbol{y}) \in C_{s,t}} \log P(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta}) \tag{1}$$
where $\boldsymbol{\theta}$ denotes the model parameters, $C_{s,t}$ denotes a bilingual corpus for the source language $s$ and the target language $t$, $(\boldsymbol{x}, \boldsymbol{y})$ is a pair of parallel sentences in the source and target languages, and $T$ denotes the translation tasks for which bitext is available. Among all possible language pairs $(s,t) \in L \times L$, we often only have access to bilingual data for a subset of them. We denote these pairs as seen (observed) translation tasks, and the rest as unseen tasks, corresponding to the zero-shot setting.
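To make the tagging scheme concrete, the following minimal sketch (our own illustration, not the paper's code; the tag format `<2xx>` and the toy `log_prob` scorer are assumptions) pools the seen translation tasks into a single training set by prepending a target-language tag to each source sentence, and computes the objective of Eq. (1) over the pooled instances.

```python
# Minimal sketch (not the authors' code) of target-language tagging for
# fully parameter-shared MNMT. The tag strings like "<2de>" and the toy
# log_prob() scorer are assumptions for illustration only.
from typing import Callable, Dict, List, Tuple

Corpus = List[Tuple[str, str]]  # (source sentence, target sentence)

def build_tagged_instances(corpora: Dict[Tuple[str, str], Corpus]) -> Corpus:
    """Pool all seen translation tasks (s, t) into one training set,
    prepending the target-language tag to every source sentence."""
    instances = []
    for (src_lang, tgt_lang), corpus in corpora.items():
        tag = f"<2{tgt_lang}>"
        for x, y in corpus:
            instances.append((f"{tag} {x}", y))
    return instances

def multilingual_nll(instances: Corpus,
                     log_prob: Callable[[str, str], float]) -> float:
    """Eq. (1): negative log-likelihood summed over all pooled instances."""
    return -sum(log_prob(y, x) for x, y in instances)

# Toy usage with a dummy scorer that assigns the same log-probability everywhere.
corpora = {
    ("en", "de"): [("hello", "hallo")],
    ("en", "fr"): [("hello", "bonjour")],
}
tagged = build_tagged_instances(corpora)   # [("<2de> hello", "hallo"), ...]
loss = multilingual_nll(tagged, lambda y, x: -1.0)
print(tagged, loss)
```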
2.2 Multi-domain NMT
Multi-domain NMT aims to handle translation
tasks across multiple domains for a given language
pair. As in MNMT, the most popular approach is to tag the training corpus, with a tag indicating the domain of each sentence pair. The model is again trained by minimising the negative log-likelihood across all domains:
$$\mathcal{L}_{\text{MD}}(\boldsymbol{\theta}) := -\sum_{d \in D} \; \sum_{(\boldsymbol{x},\boldsymbol{y}) \in C^{d}_{s,t}} \log P(\boldsymbol{y} \mid \boldsymbol{x}; \boldsymbol{\theta}) \tag{2}$$
where $D$ is the set of domains, and $C^{d}_{s,t}$ denotes the parallel bitext in the source language $s$, the target language $t$, and the domain $d$.
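For a fixed language pair, the analogous sketch below (again our own illustration; domain tag strings such as `<medical>` are assumptions) builds domain-tagged instances; Eq. (2) is then the same negative log-likelihood computed over the pooled, domain-tagged corpus.

```python
# Minimal sketch (ours, not the paper's) of domain tagging for a fixed
# language pair; the domain tag strings are assumptions.
from typing import Dict, List, Tuple

def build_domain_tagged_instances(
        domain_corpora: Dict[str, List[Tuple[str, str]]]) -> List[Tuple[str, str]]:
    """domain_corpora maps a domain name to its list of (source, target) pairs."""
    instances = []
    for domain, corpus in domain_corpora.items():
        for x, y in corpus:
            instances.append((f"<{domain}> {x}", y))
    return instances

# Eq. (2) is the same NLL as in the previous sketch, summed over these instances:
# loss = multilingual_nll(build_domain_tagged_instances(...), log_prob)
```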
Apart from tagging, some auxiliary tasks have
also been incorporated into the training process. A
common practice is the use of domain discrimina-
tion, which aims to force the encoder to capture
domain-aware characteristics (Britz et al., 2017). For this purpose, a domain discriminator is added to the NMT model at training time. The discriminator takes the encoder output as input and predicts the domain of the source sentence. It is trained jointly with the NMT model and discarded at inference time.
Let $h = \text{enc}(\boldsymbol{x})$ be the representation of sentence $\boldsymbol{x}$, computed by mean-pooling the hidden states of the top layer of the encoder. The training objective for the domain-aware encoder is as follows:
$$\mathcal{L}_{\text{disc}}(\boldsymbol{\theta}, \boldsymbol{\psi}) := -\sum_{d \in D} \; \sum_{(\boldsymbol{x},\boldsymbol{y}) \in C^{d}_{s,t}} \log \Pr(d \mid h; \boldsymbol{\psi}) \tag{3}$$

$$\mathcal{L}_{\text{MD-aware}}(\boldsymbol{\theta}, \boldsymbol{\psi}) := \mathcal{L}_{\text{MD}}(\boldsymbol{\theta}) + \lambda \, \mathcal{L}_{\text{disc}}(\boldsymbol{\theta}, \boldsymbol{\psi}) \tag{4}$$
where $\boldsymbol{\psi}$ denotes the parameters of the domain discriminator, and $\lambda$ controls the contribution of the discriminator to the training objective of the multi-domain NMT model.
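The sketch below (our own illustration in PyTorch, not the authors' implementation; module and argument names are hypothetical) shows how the mean-pooled top-layer encoder states feed a domain classifier whose cross-entropy loss is added to the translation loss with weight $\lambda$, as in Eqs. (3) and (4).

```python
# Minimal sketch (ours, not the paper's implementation) of the domain-aware
# objective in Eqs. (3)-(4), assuming PyTorch and a Transformer-style encoder
# whose top layer returns hidden states of shape (batch, seq_len, d_model).
import torch
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    def __init__(self, d_model: int, num_domains: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_domains)

    def forward(self, enc_states: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool the top-layer encoder states over non-padding positions.
        mask = src_mask.unsqueeze(-1).float()          # (B, T, 1)
        h = (enc_states * mask).sum(1) / mask.sum(1)   # (B, d_model)
        return self.classifier(h)                      # domain logits

def domain_aware_loss(nmt_loss: torch.Tensor,
                      domain_logits: torch.Tensor,
                      domain_labels: torch.Tensor,
                      lam: float = 0.1) -> torch.Tensor:
    """L_MD-aware = L_MD + lambda * L_disc (Eq. 4); lam is an assumed value."""
    disc_loss = nn.functional.cross_entropy(domain_logits, domain_labels)
    return nmt_loss + lam * disc_loss
```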
Alternatively, one can design an adversarial training objective so that the encoder learns domain-agnostic representations. This is achieved by inserting a gradient reversal layer (Ganin and Lempitsky, 2015) between the encoder and the domain discriminator. The gradient reversal layer behaves as an identity function in the forward pass but reverses the sign of the gradient during back-propagation. As a result, the discriminator loss has the opposite effect on the encoder, pushing it towards domain-agnostic representations and encouraging domain-specific characteristics to be captured mainly by the decoder of the NMT model.
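A common way to realise this, sketched below under the assumption of a PyTorch implementation (not necessarily the authors'), is a custom autograd function that acts as the identity in the forward pass and multiplies the incoming gradient by a negative constant in the backward pass; it is inserted between the pooled encoder representation and the discriminator.

```python
# Minimal sketch of a gradient reversal layer (Ganin and Lempitsky, 2015)
# in PyTorch; the scaling constant `lambd` is an assumption.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, lambd: float) -> torch.Tensor:
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Reverse (and optionally scale) the gradient flowing back to the encoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage: domain_logits = discriminator(grad_reverse(h, lambd=1.0))
# The discriminator is still trained to predict the domain, while the
# reversed gradient pushes the encoder towards domain-agnostic features.
```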
2.3 Composition of Domains and Languages
In this paper, we explore strategies for composing
multi-domain and multilingual NMT. We consider
the incomplete multi-domain multilingual data condition, where in-domain data may only be available for a subset of language pairs. For example, Table 1 shows one of the data conditions explored in our experiments in Section 3: given the five language pairs and five domains, we assume that the in-domain data for some language pairs are missing.
Our goal is to investigate effective techniques to
train a high-quality MDML-NMT model covering
all combinations of domains and language pairs.
Given a specific domain, we define in-domain
languages as those having data available in the
domain as part of some bilingual corpora; the rest