Is Encoder-Decoder Redundant for Neural Machine Translation?
Yingbo Gao Christian Herold Zijian Yang Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{ygao|herold|zyang|ney}@cs.rwth-aachen.de
Abstract
The encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks, plus the introduction and development of the attention mechanism, encoder-decoder is still the de facto neural network architecture for state-of-the-art models. While the motivation for decoding information from some hidden space is straightforward, the strict separation of the encoding and decoding steps into an encoder and a decoder in the model architecture is not necessarily a must. Compared to the task of autoregressive language modeling in the target language, machine translation simply has an additional source sentence as context. Given that neural language models nowadays can already handle rather long contexts in the target language, it is natural to ask whether simply concatenating the source and target sentences and training a language model to do translation would work. In this work, we investigate the aforementioned concept for machine translation. Specifically, we experiment with bilingual translation, translation with additional target monolingual data, and multilingual translation. In all cases, this alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.
1 Introduction
Sequence-to-sequence modeling is nowadays often approached with Neural Networks (NNs), most prominently encoder-decoder NNs. For the task of Machine Translation (MT), which is by definition also a sequence-to-sequence task, the default choice of NN topology is likewise an encoder-decoder architecture. For example, in early works like Kalchbrenner and Blunsom (2013), the authors already make the distinction between their convolutional sentence model (encoder) and their recurrent language model (decoder) conditioned on the former. In follow-up works like Sutskever et al. (2014) and Cho et al. (2014a,b), the concept of the encoder-decoder network is further developed. While extensions such as attention (Bahdanau et al., 2014), multi-task learning (Luong et al., 2015), convolutional networks (Gehring et al., 2017), and self-attention (Vaswani et al., 2017) are considered for sequence-to-sequence learning, the idea of encoding information into some hidden space and decoding from that hidden representation sticks around.
Given the success and wide popularity of the Transformer network (Vaswani et al., 2017), many works focus on understanding and improving individual components, e.g. positional encoding (Shaw et al., 2018), multi-head attention (Voita et al., 2019), and an alignment interpretation of cross attention (Alkhouli et al., 2018). In works that go a bit further and make bigger changes in terms of modeling, e.g. performing round-trip translation (Tu et al., 2017) or going from autoregressive to non-autoregressive generation (Gu et al., 2017), the encoder-decoder setup itself is not really questioned. In the meantime, this is not to say that the field is completely dominated by one approach: works such as the direct neural hidden Markov model (Wang et al., 2017, 2018, 2021b), the investigation into dropping attention and separate encoding and decoding steps (Press and Smith, 2018), and completely encoder-free models (Tang et al., 2019) do exist, where the default encoder-decoder regime is not directly applied.
Meanwhile, in the field of language modeling, significant progress is achieved with the wide application of NNs. With the progress from early feedforward language models (LMs) (Bengio et al., 2000), to the successful long short-term memory network LMs (Sundermeyer et al., 2012), and to the more recent Transformer LMs (Irie et al., 2019),
the modeling capacity of LMs nowadays is much greater than that of their historic counterparts. This is especially true when considering some of the most recent extensions, such as large-scale modeling (Brown et al., 2020), modeling very long context (Dai et al., 2019), and going from autoregressive modeling to non-autoregressive modeling (Devlin et al., 2019). Because MT can be thought of as a contextualized language modeling task with the source sentence being additional context, one natural question is whether simply concatenating the source and target sentences and training an LM to do translation would work (Irie, 2020). This idea is simple and straightforward, but special care needs to be taken regarding the attention mechanism and source reconstruction. In this work, we explore this alternative approach and conduct experiments in bilingual translation, translation with additional target monolingual data, and multilingual translation. Our results show that dropping the encoder-decoder architecture and simply treating the task of MT as contextualized language modeling is sufficient to obtain state-of-the-art results in translation. This result has several subtleties and implications, which we discuss in Sec. 5, and opens up possibilities for more general interfaces for multimodal modeling.
2 Related Work
In the literature, few but interesting works exist which closely relate to the idea mentioned above. In Mikolov and Zweig (2012), the authors mention the possibility of using the source sentence as context for contextualized language modeling. In He et al. (2018), with the intuition to coordinate the learning of the Transformer encoder and decoder layer by layer, the authors share the encoder and decoder parameters and learn a joint model on concatenated source and target sentences. However, no explicit source-side reconstruction loss is included. Similarly, in Irie (2020), a small degradation in translation quality is observed when a causal mask is used and no source reconstruction is included. Because the masking is critical for correctly modeling the dependencies regarding the concatenated sequence, in Raffel et al. (2020), the authors put special focus on discussing the differences and implications of three types of attention masks. In Wang et al. (2021a), the authors expand upon the idea and propose a two-step decaying learning rate schedule for reconstructing the source sentence to regularize the training process. In that work, the authors show competitive performance compared to Transformer baselines in several settings. More recently, in Zhang et al. (2022), the authors also use a language-modeling-style source-side reconstruction loss to regularize the model, and additionally explore model scaling and cross-lingual transfer capabilities. Another work that explores the long-context modeling potential of LMs is Hawthorne et al. (2022), where data from domains other than translation is included in model training. Hao et al. (2022) is a more recent addition to this direction of research, where using an LM as a general interface for multimodal data is investigated. Because our focus is on MT, we refer to such models, where the encoder-decoder architecture is dropped and an LM is used to model the concatenation of source and target sentences, as Translation Language Models (TLMs¹).
The work by Wang et al. (2021a) is probably the most directly related to ours, therefore we believe it is important to highlight the similarities and differences between their work and ours. The core concept of dropping the encoder-decoder architecture is similar between Wang et al. (2021a) and our work, and competitive performance of TLMs compared to encoder-decoder models in various settings is achieved in both works. However, we additionally explore the task of autoencoding on the source side, adding Bidirectional-Encoder-Representations-from-Transformers-style (BERT) noise (Devlin et al., 2019), using alternative learning rate schedules, training MT models with back-translated (BT) data, and doing multilingual training. Further, we discuss subtleties and implications associated with the TLM.
3 Methodology
The core concept of the TLM is to concatenate the source and the target sentences and treat the translation task as a language modeling task during training. The two major points of concern are the attention mechanism and the source-side reconstruction loss. In this section, we explain the details related to these two points, and additionally discuss the implications when additional target-side monolingual data or multilingual data is available.
¹To be differentiated from TLMs in Conneau and Lample (2019), where the pretraining objective is a cloze task on both the source and the target side, using bilingual context.
3.1 Translation Language Model
Denoting the source words/subwords as $f$ and the target words/subwords as $e$, with running indices $j = 1, \dots, J$ and $i = 1, \dots, I$ respectively, the usual way to approach the translation problem in encoder-decoder models is to directly model the posterior probabilities via a discriminative model $P(e_1^I \mid f_1^J)$. This is used in the Transformer and can be expressed as:
\[
P(e_1^I \mid f_1^J) = \prod_{i=1}^{I} P(e_i \mid e_0^{i-1}, f_1^J).
\]
The model is usually trained with the cross entropy criterion (often regularized with label smoothing (Gao et al., 2020b)), and the search aims to find the target sentence $\hat{e}_1^{\hat{I}}$ with the highest probability (often approximated with beam search):
\[
\mathcal{L}_{\mathrm{MT}} = \sum_{i=1}^{I} \log P(e_i \mid e_0^{i-1}, f_1^J),
\qquad
\hat{e}_1^{\hat{I}} = \operatorname*{arg\,max}_{e_1^I,\, I} \bigl\{ \log P(e_1^I \mid f_1^J) \bigr\}.
\]
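To make the training criterion concrete, below is a minimal sketch in PyTorch of a label-smoothed cross-entropy loss over the target positions; it is our own illustration, not code from the paper, and the tensor shapes, the pad_id, and the smoothing value of 0.1 are assumptions.

```python
# Minimal sketch (our illustration): label-smoothed cross entropy for L_MT.
# Assumptions: logits has shape (batch, I, vocab) and position i scores
# P(e_i | e_0^{i-1}, f_1^J); targets has shape (batch, I); pad_id marks padding.
import torch
import torch.nn.functional as F

def mt_loss(logits: torch.Tensor, targets: torch.Tensor,
            pad_id: int = 0, smoothing: float = 0.1) -> torch.Tensor:
    return F.cross_entropy(
        logits.transpose(1, 2),   # (batch, vocab, I), the layout cross_entropy expects
        targets,                  # (batch, I) integer token ids
        ignore_index=pad_id,      # padded target positions do not contribute
        label_smoothing=smoothing,
        reduction="sum",          # sum over positions, matching the sum in L_MT
    )
```

The search itself, i.e. the arg max above, is typically approximated with beam search and does not depend on whether the loss is summed or averaged over positions.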
Alternatively, one can model the joint probability of the source and target sentences via a generative model $P(f_1^J, e_1^I)$, and it can be expressed as:
\[
P(f_1^J, e_1^I) = \prod_{j=1}^{J} P(f_j \mid f_0^{j-1}) \prod_{i=1}^{I} P(e_i \mid e_0^{i-1}, f_1^J).
\]
Here, because $f_1^J$ is given at search time, and $\operatorname*{arg\,max}_{e_1^I,\, I} P(f_1^J, e_1^I) = \operatorname*{arg\,max}_{e_1^I,\, I} P(e_1^I \mid f_1^J)$, the search stays the same as in the baseline case. But the training criterion has an additional loss term on the source sentence, which we refer to as the reconstruction loss ($\mathcal{L}_{\mathrm{RE}}$), the learning rate $\lambda$ of which can be controlled by some schedule:
\[
\mathcal{L}_{\mathrm{RE}} = \sum_{j=1}^{J} \log P(f_j \mid f_0^{j-1}),
\qquad
\mathcal{L}_{\mathrm{TLM}} = \lambda \mathcal{L}_{\mathrm{RE}} + \mathcal{L}_{\mathrm{MT}}.
\]
One can think of the reconstruction loss (decomposed in an autoregressive manner here, but it does not have to be) as a second task in addition to the translation task, or simply a regularization term for better learning of the source hidden representations. Although this formulation is simple and straightforward, there could be variations in how the source-side dependencies are defined.
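As a sketch of the combined criterion, the following shows one way to compute L_TLM = λ·L_RE + L_MT with a single decoder-only LM scoring the concatenated sequence [f_1 ... f_J, e_1 ... e_I]. This is our own illustration under simplifying assumptions, not the paper's implementation: the logits at position t predict the token at position t, the source length is fixed per batch, and both terms are written as negative log-likelihoods, i.e. as losses to be minimized.

```python
# Minimal sketch (our illustration) of the TLM criterion on a concatenated
# source+target sequence. Assumptions: logits has shape (batch, J+I, vocab)
# and position t scores tokens[:, t]; src_len = J is shared within the batch.
import torch
import torch.nn.functional as F

def tlm_loss(logits: torch.Tensor, tokens: torch.Tensor, src_len: int,
             lam: float, pad_id: int = 0) -> torch.Tensor:
    # Per-position negative log-likelihood, keeping the position dimension.
    nll = F.cross_entropy(logits.transpose(1, 2), tokens,
                          ignore_index=pad_id, reduction="none")  # (batch, J+I)
    l_re = nll[:, :src_len].sum()  # reconstruction term on the source sentence
    l_mt = nll[:, src_len:].sum()  # translation term on the target sentence
    return lam * l_re + l_mt       # lam plays the role of the schedule-controlled lambda
```

Setting lam to zero drops the reconstruction term entirely, while decaying lam over the course of training is one way to realize the schedules mentioned in Sec. 2.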
Figure 1: Attention masks in the TLM with (a) a triangular mask and (b) a full mask at the source side. The horizontal direction is the query direction and the vertical direction is the key direction. Shaded areas mean that the attention is valid and white areas mean that the attention is blocked. The matrices C, B, and D correspond to the encoder self attention, the decoder self attention, and the encoder-decoder cross attention in the Transformer, respectively. The matrix A is whitened in both cases because we should not allow source positions to attend to future target positions.
3.1.1 On the Attention Mechanism
In the original Transformer (Vaswani et al., 2017) model, the attention mechanism is used in three places, namely, a J×J encoder self attention matrix, an I×I decoder self attention matrix, and a J×I encoder-decoder cross attention matrix. As shown in Fig. 1, they correspond to matrices C, B, and D, respectively. The attention masks in B and D are straightforward. The triangular attention mask in the B matrix needs to be causal by definition, because otherwise target positions could attend to future positions and cheat. The attention mask in D needs to be full, because we want each target position to be able to look at each source position so that there is no information loss. The attention mask in C, however, is where the variations mentioned above come into play: as shown in Fig. 1, the source-side self attention can either use a triangular (causal) mask or a full mask.
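To illustrate the four blocks of Fig. 1, here is a minimal sketch, again our own and in PyTorch, of how the two source-side mask variants could be built for a concatenated sequence of J source and I target positions; unlike the figure, rows index queries and columns index keys, and True means the attention is allowed.

```python
# Minimal sketch (our illustration) of the TLM attention masks from Fig. 1.
# Rows are query positions, columns are key positions; True = attention allowed.
import torch

def tlm_attention_mask(J: int, I: int, full_source: bool) -> torch.Tensor:
    L = J + I
    # A causal (lower-triangular) mask over the whole concatenated sequence:
    # block B (target self attention) is causal, block D (target-to-source)
    # is full, and block A (source-to-future-target) is blocked.
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
    if full_source:
        # Fig. 1b: block C becomes a full J x J mask, i.e. the source side is
        # encoded bidirectionally while block A stays blocked.
        mask[:J, :J] = True
    # Fig. 1a: keep the triangular (causal) mask inside the source block C.
    return mask
```

With full_source=False the whole sequence is modeled as one causal LM; with full_source=True the source block behaves like a standard Transformer encoder's self attention while the rest of the mask is unchanged.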