Is Encoder-Decoder Redundant for Neural Machine Translation?
Yingbo Gao Christian Herold Zijian Yang Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{ygao|herold|zyang|ney}@cs.rwth-aachen.de
Abstract
The encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks, plus the introduction and development of the attention mechanism, encoder-decoder is still the de facto neural network architecture for state-of-the-art models. While the motivation for decoding information from some hidden space is straightforward, the strict separation of the encoding and decoding steps into an encoder and a decoder in the model architecture is not necessarily a must. Compared to the task of autoregressive language modeling in the target language, machine translation simply has an additional source sentence as context. Given that neural language models nowadays can already handle rather long contexts in the target language, it is natural to ask whether simply concatenating the source and target sentences and training a language model to do translation would work. In this work, we investigate the aforementioned concept for machine translation. Specifically, we experiment with bilingual translation, translation with additional target monolingual data, and multilingual translation. In all cases, this alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.
1 Introduction
Sequence-to-sequence modeling is nowadays often approached with Neural Networks (NNs), most prominently encoder-decoder NNs. For the task of Machine Translation (MT), which is by definition also a sequence-to-sequence task, the default choice of NN topology is likewise an encoder-decoder architecture. For example, in early works like Kalchbrenner and Blunsom (2013), the authors already make the distinction between their convolutional sentence model (encoder) and their recurrent language model (decoder) conditioned on the former. In follow-up works like Sutskever et al. (2014) and Cho et al. (2014a,b), the concept of the encoder-decoder network is further developed. While extensions such as attention (Bahdanau et al., 2014), multi-task learning (Luong et al., 2015), convolutional networks (Gehring et al., 2017), and self-attention (Vaswani et al., 2017) are considered for sequence-to-sequence learning, the idea of encoding information into some hidden space and decoding from that hidden representation sticks around.
Given the success and wide popularity of the Transformer network (Vaswani et al., 2017), many works focus on understanding and improving individual components, e.g. positional encoding (Shaw et al., 2018), multi-head attention (Voita et al., 2019), and an alignment interpretation of cross attention (Alkhouli et al., 2018). In works that go a bit further and make bigger changes in terms of modeling, e.g. performing round-trip translation (Tu et al., 2017) or going from autoregressive to non-autoregressive generation (Gu et al., 2017), the encoder-decoder setup itself is not really questioned. In the meantime, this is not to say that the field is completely dominated by one approach: works such as the direct neural hidden Markov model (Wang et al., 2017, 2018, 2021b), the investigation into dropping attention and separate encoding and decoding steps (Press and Smith, 2018), and completely encoder-free models (Tang et al., 2019) do exist, where the default encoder-decoder regime is not directly applied.
Meanwhile, in the field of language modeling, significant progress is achieved with the wide application of NNs. With the progress from early feedforward language models (LMs) (Bengio et al., 2000), to the successful long short-term memory network LMs (Sundermeyer et al., 2012), and to the more recent Transformer LMs (Irie et al., 2019),
the modeling capacity of LMs nowadays is much greater than that of their historic counterparts. This is especially true when considering some of the most recent extensions, such as large-scale modeling (Brown et al., 2020), modeling very long context (Dai et al., 2019), and going from autoregressive modeling to non-autoregressive modeling (Devlin et al., 2019). Because MT can be thought of as a contextualized language modeling task with the source sentence being additional context, one natural question is whether simply concatenating the source and target sentences and training an LM to do translation would work (Irie, 2020). This idea is simple and straightforward, but special care needs to be taken regarding the attention mechanism and source reconstruction. In this work, we explore this alternative approach and conduct experiments in bilingual translation, translation with additional target monolingual data, and multilingual translation. Our results show that dropping the encoder-decoder architecture and simply treating the task of MT as contextualized language modeling is sufficient to obtain state-of-the-art results in translation. This result has several subtleties and implications, which we discuss in Sec. 5, and opens up possibilities for more general interfaces for multimodal modeling.
2 Related Work
In the literature, few but interesting works exist which closely relate to the idea mentioned above. In Mikolov and Zweig (2012), the authors mention the possibility of using the source sentence as context for contextualized language modeling. In He et al. (2018), with the intuition to coordinate the learning of the Transformer encoder and decoder layer by layer, the authors share the encoder and decoder parameters and learn a joint model on concatenated source and target sentences. However, no explicit source-side reconstruction loss is included. Similarly, in Irie (2020), a small degradation in translation quality is observed when a causal mask is used and no source reconstruction is included. Because the masking is critical for correctly modeling the dependencies regarding the concatenated sequence, in Raffel et al. (2020), the authors put special focus on discussing the differences and implications of three types of attention masks. In Wang et al. (2021a), the authors expand upon the idea and propose a two-step decaying learning rate schedule for reconstructing the source sentence to regularize the training process. In that work, the authors show competitive performance compared to Transformer baselines in several settings. More recently, in Zhang et al. (2022), the authors also use a language-modeling-style source-side reconstruction loss to regularize the model, and additionally explore model scaling and cross-lingual transfer capabilities. Another work that explores the long-context modeling potential of LMs is Hawthorne et al. (2022), where data from domains other than translation is included in model training. Hao et al. (2022) is a more recent addition to this direction of research, where using an LM as a general interface for multimodal data is investigated. Because our focus is on MT, we refer to such models, where the encoder-decoder architecture is dropped and an LM is used to model the concatenation of source and target sentences, as Translation Language Models (TLMs¹).
The work by Wang et al. (2021a) is probably the most directly related to ours, therefore we believe it is important to highlight the similarities and differences between their work and ours. The core concept of dropping the encoder-decoder architecture is similar between Wang et al. (2021a) and our work, and competitive performance of TLMs compared to encoder-decoder models in various settings is achieved in both works. However, we additionally explore the task of autoencoding on the source side, adding Bidirectional-Encoder-Representations-from-Transformers-style (BERT) noise (Devlin et al., 2019), using alternative learning rate schedules, training MT models with back-translated (BT) data, and doing multilingual training. Further, we discuss subtleties and implications associated with the TLM.
3 Methodology
The core concept of the TLM is to concatenate the source and the target sentences and treat the translation task as a language modeling task during training. The two major points of concern are the attention mechanism and the source-side reconstruction loss. In this section, we explain the details related to these two points, and additionally discuss the implications when additional target-side monolingual data or multilingual data is available.
¹To be differentiated from TLMs in Conneau and Lample (2019), where the pretraining objective is a cloze task on both the source and the target side, using bilingual context.
3.1 Translation Language Model
Denoting the source words/subwords as $f$ and the target words/subwords as $e$, with running indices $j = 1, \dots, J$ and $i = 1, \dots, I$ respectively, the usual way to approach the translation problem in encoder-decoder models is to directly model the posterior probabilities via a discriminative model $P(e_1^I \mid f_1^J)$. This is used in the Transformer and can be expressed as:
\[
P(e_1^I \mid f_1^J) = \prod_{i=1}^{I} P(e_i \mid e_0^{i-1}, f_1^J).
\]
The model is usually trained with the cross entropy criterion (often regularized with label smoothing (Gao et al., 2020b)), and the search aims to find the target sentence $\hat{e}_1^{\hat{I}}$ with the highest probability (often approximated with beam search):
\[
\mathcal{L}_{\mathrm{MT}} = \sum_{i=1}^{I} \log P(e_i \mid e_0^{i-1}, f_1^J),
\qquad
\hat{e}_1^{\hat{I}} = \operatorname*{arg\,max}_{e_1^I,\, I} \bigl\{ \log P(e_1^I \mid f_1^J) \bigr\}.
\]
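To make the training criterion concrete, below is a minimal sketch in PyTorch of a label-smoothed cross-entropy loss over the target positions; it is our own illustration, not code from the paper, and the tensor shapes, the pad_id, and the smoothing value of 0.1 are assumptions.

```python
# Minimal sketch (our illustration): label-smoothed cross entropy for L_MT.
# Assumptions: logits has shape (batch, I, vocab) and position i scores
# P(e_i | e_0^{i-1}, f_1^J); targets has shape (batch, I); pad_id marks padding.
import torch
import torch.nn.functional as F

def mt_loss(logits: torch.Tensor, targets: torch.Tensor,
            pad_id: int = 0, smoothing: float = 0.1) -> torch.Tensor:
    return F.cross_entropy(
        logits.transpose(1, 2),   # (batch, vocab, I), the layout cross_entropy expects
        targets,                  # (batch, I) integer token ids
        ignore_index=pad_id,      # padded target positions do not contribute
        label_smoothing=smoothing,
        reduction="sum",          # sum over positions, matching the sum in L_MT
    )
```

The search itself, i.e. the arg max above, is typically approximated with beam search and does not depend on whether the loss is summed or averaged over positions.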
Alternatively, one can model the joint probability of the source and target sentences via a generative model $P(f_1^J, e_1^I)$, and it can be expressed as:
\[
P(f_1^J, e_1^I) = \prod_{j=1}^{J} P(f_j \mid f_0^{j-1}) \prod_{i=1}^{I} P(e_i \mid e_0^{i-1}, f_1^J).
\]
Here, because $f_1^J$ is given at search time, and $\operatorname*{arg\,max}_{e_1^I,\, I} P(f_1^J, e_1^I) = \operatorname*{arg\,max}_{e_1^I,\, I} P(e_1^I \mid f_1^J)$, the search stays the same as in the baseline case. But the training criterion has an additional loss term on the source sentence, which we refer to as the reconstruction loss ($\mathcal{L}_{\mathrm{RE}}$), the learning rate $\lambda$ of which can be controlled by some schedule:
\[
\mathcal{L}_{\mathrm{RE}} = \sum_{j=1}^{J} \log P(f_j \mid f_0^{j-1}),
\qquad
\mathcal{L}_{\mathrm{TLM}} = \lambda \mathcal{L}_{\mathrm{RE}} + \mathcal{L}_{\mathrm{MT}}.
\]
One can think of the reconstruction loss (decomposed in an autoregressive manner here, but it does not have to be) as a second task in addition to the translation task, or simply a regularization term for better learning of the source hidden representations. Although this formulation is simple and straightforward, there could be variations in how the source-side dependencies are defined.
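As a sketch of the combined criterion, the following shows one way to compute L_TLM = λ·L_RE + L_MT with a single decoder-only LM scoring the concatenated sequence [f_1 ... f_J, e_1 ... e_I]. This is our own illustration under simplifying assumptions, not the paper's implementation: the logits at position t predict the token at position t, the source length is fixed per batch, and both terms are written as negative log-likelihoods, i.e. as losses to be minimized.

```python
# Minimal sketch (our illustration) of the TLM criterion on a concatenated
# source+target sequence. Assumptions: logits has shape (batch, J+I, vocab)
# and position t scores tokens[:, t]; src_len = J is shared within the batch.
import torch
import torch.nn.functional as F

def tlm_loss(logits: torch.Tensor, tokens: torch.Tensor, src_len: int,
             lam: float, pad_id: int = 0) -> torch.Tensor:
    # Per-position negative log-likelihood, keeping the position dimension.
    nll = F.cross_entropy(logits.transpose(1, 2), tokens,
                          ignore_index=pad_id, reduction="none")  # (batch, J+I)
    l_re = nll[:, :src_len].sum()  # reconstruction term on the source sentence
    l_mt = nll[:, src_len:].sum()  # translation term on the target sentence
    return lam * l_re + l_mt       # lam plays the role of the schedule-controlled lambda
```

Setting lam to zero drops the reconstruction term entirely, while decaying lam over the course of training is one way to realize the schedules mentioned in Sec. 2.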
Figure 1: Attention masks in the TLM with (a) a triangular mask and (b) a full mask at the source side. The horizontal direction is the query direction and the vertical direction is the key direction. Shaded areas mean that the attention is valid and white areas mean that the attention is blocked. The matrices C, B, and D correspond to the encoder self attention, the decoder self attention, and the encoder-decoder cross attention in the Transformer, respectively. The matrix A is whitened in both cases because we should not allow source positions to attend to future target positions.
3.1.1 On the Attention Mechanism
In the original Transformer (Vaswani et al., 2017) model, the attention mechanism is used in three places, namely, a J×J encoder self attention matrix, an I×I decoder self attention matrix, and a J×I encoder-decoder cross attention matrix. As shown in Fig. 1, they correspond to matrices C, B, and D, respectively. The attention masks in B and D are straightforward. The triangular attention mask in the B matrix needs to be causal by definition, because otherwise target positions could attend to future positions and cheat. The attention mask in D needs to be full, because we want each target position to be able to look at each source position so that there is no information loss. The attention mask in C, however, is where the variations mentioned above come into play: as shown in Fig. 1, the source-side self attention can either use a triangular (causal) mask or a full mask.
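To illustrate the four blocks of Fig. 1, here is a minimal sketch, again our own and in PyTorch, of how the two source-side mask variants could be built for a concatenated sequence of J source and I target positions; unlike the figure, rows index queries and columns index keys, and True means the attention is allowed.

```python
# Minimal sketch (our illustration) of the TLM attention masks from Fig. 1.
# Rows are query positions, columns are key positions; True = attention allowed.
import torch

def tlm_attention_mask(J: int, I: int, full_source: bool) -> torch.Tensor:
    L = J + I
    # A causal (lower-triangular) mask over the whole concatenated sequence:
    # block B (target self attention) is causal, block D (target-to-source)
    # is full, and block A (source-to-future-target) is blocked.
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
    if full_source:
        # Fig. 1b: block C becomes a full J x J mask, i.e. the source side is
        # encoded bidirectionally while block A stays blocked.
        mask[:J, :J] = True
    # Fig. 1a: keep the triangular (causal) mask inside the source block C.
    return mask
```

With full_source=False the whole sequence is modeled as one causal LM; with full_source=True the source block behaves like a standard Transformer encoder's self attention while the rest of the mask is unchanged.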