
the modeling capacity of LMs nowadays far exceeds that of their historic counterparts. This is especially true when considering some of the most recent extensions, such as large-scale modeling (Brown et al., 2020), modeling very long context (Dai et al., 2019), and moving from autoregressive to non-autoregressive modeling (Devlin et al., 2019). Because MT can be thought of as a contextualized language modeling task, with the source sentence serving as additional context, one natural question is whether simply concatenating the source and target sentences and training an LM to do translation would work (Irie, 2020). This idea is simple and straightforward, but special care needs to be taken with the attention mechanism and source reconstruction. In this work, we explore this alternative approach and conduct experiments in bilingual translation, translation with additional target-side monolingual data, and multilingual translation. Our results show that dropping the encoder-decoder architecture and simply treating MT as contextualized language modeling is sufficient to obtain state-of-the-art translation results. This result has several subtleties and implications, which we discuss in Sec. 5, and opens up possibilities for more general interfaces for multimodal modeling.
2 Related Work
In the literature, a few but interesting works exist that closely relate to the idea mentioned above. In Mikolov and Zweig (2012), the authors mention the possibility of using the source sentence as context for contextualized language modeling. In He et al. (2018), with the intuition of coordinating the learning of the Transformer encoder and decoder layer by layer, the authors share the encoder and decoder parameters and learn a joint model on concatenated source and target sentences. However, no explicit source-side reconstruction loss is included. Similarly, in Irie (2020), a small degradation in translation quality is observed when a causal mask is used and no source reconstruction is included. Because the masking is critical for correctly modeling the dependencies within the concatenated sequence, Raffel et al. (2020) put special focus on discussing the differences and implications of three types of attention masks. In Wang et al. (2021a), the authors expand upon the idea, proposing a two-step decaying learning rate schedule and reconstructing the source sentence to regularize the training process. In that work, the authors show competitive performance compared to Transformer baselines in several settings. More recently, Zhang et al. (2022) also use a language-modeling-style source-side reconstruction loss to regularize the model, and additionally explore the model's scaling and cross-lingual transfer capabilities. Another work that explores the long-context modeling potential of LMs is Hawthorne et al. (2022), where data from domains other than translation is included in model training. Hao et al. (2022) is a more recent addition to this line of research, investigating the LM as a general interface for multimodal data. Because our focus is on MT, we refer to such a model, where the encoder-decoder architecture is dropped and an LM is used to model the concatenation of source and target sentences, as a Translation Language Model (TLM1).
The work by Wang et al. (2021a) is probably the most directly related to ours; we therefore believe it is important to highlight the similarities and differences between their work and ours. The core concept of dropping the encoder-decoder architecture is shared between Wang et al. (2021a) and our work, and competitive performance of TLMs compared to encoder-decoder models in various settings is achieved in both works. However, we additionally explore autoencoding on the source side, adding Bidirectional-Encoder-Representations-from-Transformers-style (BERT) noise (Devlin et al., 2019), using alternative learning rate schedules, training MT models with back-translated (BT) data, and performing multilingual training. Further, we discuss subtleties and implications associated with the TLM.
3 Methodology
The core concept of the TLM is to concatenate the source and the target sentences and treat the translation task as a language modeling task during training. The two major points of concern are the attention mechanism and the source-side reconstruction loss. In this section, we explain the details related to these two points, and additionally discuss the implications when additional target-side monolingual data or multilingual data is available.
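To make this setup concrete, the following is a minimal PyTorch sketch of how a single TLM training example could be assembled under these two design choices. It is an illustration only: the function names, the causal_over_source switch, and the lambda_src weighting coefficient are assumptions made here for exposition, not part of any implementation discussed above.

import torch

def build_tlm_batch(src_ids, tgt_ids, causal_over_source=True):
    """Concatenate source and target ids and build an attention mask.

    With causal_over_source=False, a prefix-LM-style mask is produced:
    source positions attend to the whole source segment bidirectionally,
    while target positions attend causally to everything before them.
    """
    seq = torch.cat([src_ids, tgt_ids])          # [S + T]
    n_src, n = len(src_ids), len(seq)
    mask = torch.tril(torch.ones(n, n)).bool()   # standard causal mask
    if not causal_over_source:
        mask[:n_src, :n_src] = True              # bidirectional over the source
    return seq, mask

def tlm_loss(logits, seq, n_src, lambda_src=1.0):
    """Next-token cross-entropy over the concatenated sequence.

    The target part is the usual translation loss; the source part acts
    as a source-reconstruction regularizer weighted by lambda_src.
    """
    pred, gold = logits[:-1], seq[1:]            # position i predicts token i+1
    losses = torch.nn.functional.cross_entropy(pred, gold, reduction="none")
    src_part = losses[: n_src - 1].mean()        # reconstruct the source
    tgt_part = losses[n_src - 1 :].mean()        # predict the target given the source
    return tgt_part + lambda_src * src_part

Under the fully causal mask, the loss over the source positions amounts to unconditional language modeling of the source sentence; with the prefix-LM-style mask, the source segment is encoded bidirectionally. Both choices appear in the works discussed in Sec. 2.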
1 To be differentiated from TLMs in Conneau and Lample (2019), where the pretraining objective is a cloze task on both the source and target side, using bilingual context.