Focused Concatenation for Context-Aware Neural Machine Translation
Lorenzo Lupo¹, Marco Dinarelli¹, Laurent Besacier²
¹Université Grenoble Alpes, France
²Naver Labs Europe, France
lorenzo.lupo@univ-grenoble-alpes.fr
marco.dinarelli@univ-grenoble-alpes.fr
laurent.besacier@naverlabs.com
Abstract
A straightforward approach to context-aware neural machine translation consists in feeding the standard encoder-decoder architecture with a window of consecutive sentences, formed by the current sentence and a number of sentences from its context concatenated to it. In this work, we propose an improved concatenation approach that encourages the model to focus on the translation of the current sentence, discounting the loss generated by target context. We also propose an additional improvement that strengthens the notion of sentence boundaries and of relative sentence distance, facilitating model compliance with the context-discounted objective. We evaluate our approach with both average translation quality metrics and contrastive test sets for the translation of inter-sentential discourse phenomena, proving its superiority to the vanilla concatenation approach and other sophisticated context-aware systems.
1 Introduction
While current neural machine translation (NMT) systems have reached close-to-human quality in the translation of decontextualized sentences (Wu et al., 2016), they still have a wide margin of improvement ahead when it comes to translating full documents (Läubli et al., 2018). Many works have tried to reduce this margin, proposing various approaches to context-aware NMT (CANMT).¹ A common taxonomy (Kim et al., 2019; Li et al., 2020) divides them into two broad categories: multi-encoding approaches and concatenation (single-encoding) approaches. Despite their simplicity, concatenation approaches have been shown to achieve performance competitive with or superior to more sophisticated multi-encoding systems (Lopes et al., 2020; Ma et al., 2021).

¹ Unless otherwise specified, we refer to context as the sentences that precede or follow a current sentence to be translated, within the same document.

Figure 1: Example of the proposed approach applied over a window of 2 sentences, with context discount CD and segment-shifted positions by a factor of 10.
Nonetheless, it has been shown that Transformer-based NMT systems (Vaswani et al., 2017) struggle to learn locality properties (Hardmeier, 2012; Rizzi, 2013) of both the language itself and the source-target alignment when the input sequence grows in length, as in the case of concatenation (Bao et al., 2021). Unsurprisingly, the presence of context makes learning harder for concatenation models by distracting attention. Moreover, we know from recent literature that NMT systems require context only for a sparse set of inter-sentential discourse phenomena (Voita et al., 2019; Lupo et al., 2022). Therefore, it is desirable to make concatenation models more focused on local linguistic phenomena, belonging to the current sentence, while also processing its context to enable inter-sentential contextualization whenever it is needed. We propose an improved concatenation approach to CANMT that is more focused on the translation of the current sentence by means of two simple, parameter-free solutions:
- Context-discounting: a simple modification of the NMT loss that improves the context-aware translation of a sentence by making the model less distracted by its concatenated context;
- Segment-shifted positions: a simple, parameter-free modification of position embeddings that facilitates the achievement of the context-discounted objective by supporting the learning of locality properties in the document translation task.
We support our solutions with extensive experiments, analysis and benchmarking.
2 Background
2.1 Multi-encoding approaches
Multi-encoding models couple a self-standing sentence-level NMT system, with parameters $\theta_S$, with additional parameters $\theta_C$ that encode and integrate the context of the current sentence, either on the source side, the target side, or both. The full context-aware architecture has parameters $\Theta = [\theta_S; \theta_C]$. Multi-encoding models differ from each other in the way they encode the context or integrate its representations with those of the current sentence. For instance, the representations coming from the context encoder can be integrated with the encoding of the current sentence outside the decoder (Maruf et al., 2018; Voita et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Maruf et al., 2019; Zheng et al., 2020) or inside the decoder (Tu et al., 2018; Kuang et al., 2018; Bawden et al., 2018; Voita et al., 2019; Tan et al., 2019), by making it attend to the context representations directly, using its internal representation of the decoded history as query.
2.2 Single-encoder approaches
The concatenation approaches are the simplest in terms of architecture, as they mainly consist in concatenating each (current) source sentence with its context before feeding it to the standard encoder-decoder architecture (Tiedemann and Scherrer, 2017; Junczys-Dowmunt, 2019; Agrawal et al., 2018; Ma et al., 2020), without the addition of extra learnable parameters. The decoding can then be limited to the current sentence, although decoding the full target concatenation is more effective thanks to the availability of target context. A typical strategy to train a concatenation approach and generate translations is by sliding windows (Tiedemann and Scherrer, 2017). An sKtoK model decodes the translation $y_K^j$ of a source window $x_K^j$, formed by $K$ consecutive sentences belonging to the same document: the current ($j$th) sentence and $K-1$ sentences concatenated as source-side context. Besides the end-of-sequence token <E>, another special token <S> is introduced to mark sentence boundaries in the concatenation:

$x_K^j = x^{j-K+1}\ \texttt{<S>}\ x^{j-K+2}\ \texttt{<S>}\ \dots\ \texttt{<S>}\ x^{j-1}\ \texttt{<S>}\ x^{j}\ \texttt{<E>}$
$y_K^j = y^{j-K+1}\ \texttt{<S>}\ y^{j-K+2}\ \texttt{<S>}\ \dots\ \texttt{<S>}\ y^{j-1}\ \texttt{<S>}\ y^{j}\ \texttt{<E>}$
Both past and future contexts can be concatenated to the current pair $(x^j, y^j)$, although in this work we consider only the past context, for simplicity. At training time, the loss is calculated over the whole output $y_K^j$, but only the translation $y^j$ of the current sentence is kept at inference time, while the translation of the context is discarded. Then, the window is slid forward by one position to repeat the process for the $(j+1)$th sentence and its context.
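As a concrete illustration of the sliding-window construction just described, the following minimal Python sketch builds sKtoK source and target windows from a document. The function name, the whitespace-joined sentence strings, and the literal <S>/<E> markers are our own illustrative assumptions, not the authors' implementation.

def build_windows(doc_src, doc_tgt, K):
    # doc_src, doc_tgt: aligned lists of sentence strings from one document.
    # Returns one (source window, target window) pair per current sentence j,
    # each containing up to K-1 preceding sentences as context.
    windows = []
    for j in range(len(doc_src)):
        start = max(0, j - K + 1)
        src = " <S> ".join(doc_src[start:j + 1]) + " <E>"
        tgt = " <S> ".join(doc_tgt[start:j + 1]) + " <E>"
        windows.append((src, tgt))
    return windows

# With K=2, each current sentence is paired with one preceding sentence:
# build_windows(["s1", "s2"], ["t1", "t2"], K=2)[1]
# -> ("s1 <S> s2 <E>", "t1 <S> t2 <E>")

At inference time, only the part of the decoded window that follows the last <S> is kept as the translation of the current sentence.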
Concatenation approaches are trained by optimizing the same objective function as standard NMT over a window of sentences:

$\mathcal{L}(x_K^j, y_K^j) = \sum_{t=1}^{|y_K^j|} \log P(y_{K,t}^j \mid y_{K,<t}^j, x_K^j)$,   (1)

so that the likelihood of the current target sentence is conditioned on source and target context.
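In code, Equation 1 amounts to a sum of token-level log-probabilities over the whole concatenated target, context tokens included. A minimal PyTorch sketch follows; the tensor shapes and variable names are our assumptions.

import torch
import torch.nn.functional as F

def window_log_likelihood(logits, target_ids):
    # logits: [T, V] decoder outputs for the concatenated target window,
    # target_ids: [T] gold token ids of y_K^j (teacher forcing).
    log_probs = F.log_softmax(logits, dim=-1)                            # [T, V]
    token_ll = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)   # [T]
    return token_ll.sum()  # log P summed over all positions, as in Eq. 1
    # (in practice, training minimizes the negative of this quantity)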
2.3 Closing the gap
Concatenation approaches have the advantage of treating the task of CANMT in the same way as context-agnostic NMT, which eases learning because the learnable parameters responsible for inter-sentential contextualization are the same that undertake intra-sentential contextualization. Indeed, learning the parameters responsible for inter-sentential contextualization in multi-encoding approaches ($\theta_C$) has been shown to be challenging, because the training signal is sparse and the task of retrieving useful context elements difficult (Lupo et al., 2022). Nonetheless, encoding current and context sentences together comes at a cost. In fact, when sequences are long, the risk of paying attention to irrelevant elements increases. Paying attention to the "wrong tokens" can harm their intra- and inter-sentential contextualization, associating them with the wrong latent features. Indeed, Liu et al. (2020) and Sun et al. (2022) showed that learning to translate long sequences, comprised of many sentences, fails without the use of large-scale pre-training or data augmentation (e.g., as done by Junczys-Dowmunt (2019) and Ma et al. (2021)). Bao et al. (2021) provided some evidence about this learning difficulty, showing that failed models, i.e., models stuck in local minima with a high validation loss, present a distribution of attention weights that is flatter (with higher entropy), both in the encoder and the decoder, than the distribution occurring in models that converge to a lower validation loss. In other words, attention struggles to learn the locality properties of both the language itself and the source-target alignment (Hardmeier, 2012; Rizzi, 2013). As a solution, Zhang et al. (2020) and Bao et al. (2021) propose two slightly different masking methods that allow both the encoding of the current sentence concatenated with its context and the separate encoding of each sentence in the window. The representations generated by the two encoding schemes are then integrated together, at the cost of adding extra learnable parameters to the standard Transformer architecture.
3 Proposed approach
3.1 Context discounting
Evidently, Equation 1 defines an objective function that does not factor in the fact that we only care about the translation of the current sentence $x^j$, because the context translation will be discarded during inference. Moreover, as discussed above, we need attention to stay focused locally, relying on context only for the disambiguation of relatively sparse inter-sentential discourse phenomena that are ambiguous at sentence level. Hence, we propose to encourage the model to focus on the translation of the current sentence $x^j$ by applying a discount $0 \leq CD < 1$ to the loss generated by context tokens:

$\mathcal{L}_{CD}(x_K^j, y_K^j) = CD \cdot \mathcal{L}_{context} + \mathcal{L}_{current} = CD \cdot \mathcal{L}(x_{K-1}^{j-1}, y_{K-1}^{j-1}) + \mathcal{L}(x^j, y^j)$.   (2)
This is equivalent to considering an sKtoK concatenation approach as the result of a multi-task sequence-to-sequence setting (Luong et al., 2016), where an sKto1 model performs the reference task of translating the current sentence given a concatenation of its source with K-1 context sentences, while the translation of the context sentences is added as a secondary, complementary task. The reference task is assigned a larger weight than the secondary task in the multi-task composite loss. As we will see in Section 4.5, this simple modification of the loss allows the model to learn a self-attentive mechanism that is less distracted by noisy context information, thus achieving net improvements in the translation of inter-sentential discourse phenomena occurring in the current sentence (Section 4.3), and helping concatenation systems generalize to wider context after training (Section 4.5.3).
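A minimal sketch of how the context-discounted loss of Equation 2 could be realized at the token level in PyTorch: every target position up to and including the last <S> separator is treated as context and weighted by CD, while current-sentence positions keep weight 1. The helper name, the per-token weighting strategy, and the default value cd=0.5 are our own illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def context_discounted_loss(logits, target_ids, sep_id, cd=0.5):
    # logits: [T, V] decoder outputs, target_ids: [T] gold ids of y_K^j,
    # sep_id: vocabulary id of the <S> separator, cd: context discount in [0, 1).
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)  # [T]

    weights = torch.ones_like(token_nll)
    sep_positions = (target_ids == sep_id).nonzero(as_tuple=True)[0]
    if sep_positions.numel() > 0:
        last_sep = sep_positions[-1].item()
        weights[: last_sep + 1] = cd  # discount every context token
    return (weights * token_nll).sum()  # CD * L_context + L_current

Setting cd=1 recovers the standard objective of Equation 1 (up to sign), while smaller values shift the training signal toward the reference task of translating the current sentence.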
3.2 Segment-shifted positions
Context discounting pushes the model to discriminate between the current sentence and the context. Such discrimination can be undertaken by cross-referencing the information provided by two elements: sentence separation tokens <S>, and sinusoidal position encodings, as defined by Vaswani et al. (2017). In order to facilitate this task, we propose to provide the model with extra information about sentence boundaries and their relative distance. Devlin et al. (2019) achieve this goal by adding segment embeddings to every token representation in input to the model, on top of token and position embeddings, such that every segment embedding represents the sentence position in the window of sentences. However, we propose an alternative solution that requires neither extra learnable parameters nor memory allocation: segment-shifted positions. As shown in Figure 1, we apply a constant shift after every separation token <S>, so that the resulting token position is equal to its original position plus a total shift depending on the chosen constant shift and the index $k = 1, 2, \dots, K$ of the sentence the token belongs to: $t' = t + k \cdot shift$. As a result, the position distance between tokens belonging to different sentences is increased. For example, the distance between the first token of the current sentence and the last token of the preceding context sentence increases from $1$ to $1 + shift$. By increasing the distance between the sinusoidal position embeddings² of tokens belonging to different sentences, their dot product, which is at the core of the attention mechanism, becomes smaller, possibly resulting in smaller attention weights. In other words, the resulting attention becomes more localized, as confirmed by the empirical analysis reported in Section 4.6.1. In Section 4.3, we present results of segment-shifted positions, and then compare them with both sinusoidal segment embeddings and learned segment embeddings in Section 4.6.2.
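To make the mechanism concrete, below is a minimal PyTorch sketch of segment-shifted positions combined with sinusoidal encodings. The function names are ours, the shift value of 10 mirrors the example in Figure 1, and the sin/cos concatenation is one common formulation of the sinusoidal encodings of Vaswani et al. (2017).

import torch

def segment_shifted_positions(token_ids, sep_id, shift=10):
    # Returns t' = t + k * shift, where k = 1, ..., K is the index of the
    # sentence each token belongs to in the concatenated window (a new
    # sentence starts right after every <S> separator).
    t = torch.arange(token_ids.size(0))
    sep = (token_ids == sep_id).long()
    k = 1 + torch.cumsum(sep, dim=0) - sep  # the <S> token stays with the preceding sentence
    return t + k * shift

def sinusoidal_embeddings(positions, d_model=512):
    # Sinusoidal encodings evaluated at the (possibly shifted) positions.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))
    angles = positions.float().unsqueeze(1) * inv_freq.unsqueeze(0)   # [T, d/2]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [T, d]

# Example: ids of "x1 x2 <S> x3 x4 <E>" with <S>=3 and shift=10 yield
# positions [10, 11, 12, 23, 24, 25] instead of [0, 1, 2, 3, 4, 5].
pos = segment_shifted_positions(torch.tensor([7, 8, 3, 9, 9, 2]), sep_id=3)
emb = sinusoidal_embeddings(pos)

The shifted positions simply replace the default 0, 1, 2, ... indices when position encodings are added to the token embeddings; the rest of the Transformer is unchanged.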
4 Experiments
4.1 Setup³

We conduct experiments with two language pairs and domains. For En→Ru, we adopt a document-level corpus released by Voita et al. (2019), based on OpenSubtitles2018 (with dev and test sets), comprised of 1.5M parallel sentences. For En→De, we train models on TED talks subtitles released by IWSLT17 (Cettolo et al., 2012). Models are tested

² Positions can be shifted by segment also in the case of learned position embeddings, both absolute and relative. We leave such experiments for future work.

³ See Appendix A for more details.