Reinforcement Learning with Large Action Spaces
for Neural Machine Translation
Asaf Yehudai, Leshem Choshen, Lior Fox, Omri Abend
Department of Computer Science
The Hebrew University of Jerusalem
{first.last}@mail.huji.ac.il
Abstract
Applying reinforcement learning (RL) following maximum likelihood estimation (MLE) pre-training is a versatile method for enhancing neural machine translation (NMT) performance. However, recent work has argued that the gains produced by RL for NMT are mostly due to promoting tokens that have already received a fairly high probability in pre-training. We hypothesize that the large action space is a main obstacle to RL's effectiveness in MT, and conduct two sets of experiments that lend support to our hypothesis. First, we find that reducing the size of the vocabulary improves RL's effectiveness. Second, we find that effectively reducing the dimension of the action space without changing the vocabulary also yields notable improvement, as evaluated by BLEU, semantic similarity, and human evaluation. Indeed, by initializing the network's final fully connected layer (which maps the network's internal dimension to the vocabulary dimension) with a layer that generalizes over similar actions, we obtain a substantial improvement in RL performance: 1.5 BLEU points on average.1
1 Introduction
The standard training method for sequence-to-sequence tasks, and specifically for NMT, is to maximize the likelihood of a token in the target sentence, given a gold-standard prefix (henceforth, maximum likelihood estimation or MLE). However, despite the strong performance displayed by MLE-trained models, this token-level objective function is limited in its ability to penalize sequence-level errors and is at odds with the sequence-level evaluation metrics it aims to improve. One appealing method for addressing this gap is applying policy gradient methods that allow incorporating non-differentiable reward functions, such as the ones often used for MT evaluation (Shen et al., 2016; see §2). For brevity, we will refer to these methods simply as RL.

1 https://github.com/AsafYehudai/Reinforcement-Learning-with-Large-Action-Spaces-for-Neural-Machine-Translation
The RL training procedure consists of several steps: (1) generating a translation with the pre-trained MLE model, (2) computing some sequence-level reward function, usually one that assesses the similarity of the generated translation and a reference, and (3) updating the model so that its future outputs receive higher rewards. The method's flexibility, as well as its ability to address the exposure bias (Ranzato et al., 2016; Wang and Sennrich, 2020), makes RL an appealing avenue for improving NMT performance. However, a recent study (C19; Choshen et al., 2019) suggests that current RL practices are likely to improve the prediction of target tokens only where the MLE model has already assigned that token a fairly high probability.
In this work, we observe that one main difference between NMT and other tasks in which RL methods excel is the size of the action space. Typically, the action space in NMT includes all tokens in the vocabulary, usually tens of thousands. By contrast, common RL settings have either small discrete action spaces (e.g., Atari games (Mnih et al., 2013)) or continuous action spaces of low dimension (e.g., MuJoCo (Todorov et al., 2012) and similar control problems). Intuitively, RL takes (samples) actions and assesses their outcome, unlike supervised learning (MLE), which directly receives a score for all actions. Therefore, a large action space will make RL less efficient, as individual actions have to be sampled in order to assess their quality. Accordingly, we experiment with two methods for decreasing the size of the action space and evaluate their impact on RL's effectiveness.
We begin by decreasing the vocabulary size (or equivalently, the number of actions), conducting experiments in low-resource settings on translating four languages into English, using BLEU both as the reward function and the evaluation metric. Our results show that RL yields a considerably larger performance increase (about 1 BLEU point on average) over MLE training than is achieved by RL with the standard vocabulary size. Moreover, our findings indicate that reducing the size of the vocabulary can improve upon the MLE model even in cases where it was not close to being correct. See §4.
However, in some cases it may be undesirable or unfeasible to change the vocabulary. We therefore experiment with two methods that effectively reduce the dimensionality of the action space without changing the vocabulary. We note that, generally in NMT architectures, the dimensionality of the decoder's internal layers (henceforth, $d$) is significantly smaller than the target vocabulary size (henceforth, $|V_T|$), which is the size of the action space. A fully connected layer is generally used to map the internal representation to suitable outputs. We may therefore refer to the rows of the matrix (parameters) of this layer as target embeddings, mapping the network's internal low-dimensional representation back to the vocabulary size, i.e., the actions. We use this term to underscore the analogy between the network's first embedding layer, which maps vectors of dimension $|V_T|$ to vectors of dimension $d$, and target embeddings, which work in an inverse fashion. Indeed, it is often the case (e.g., in BERT, Devlin et al., 2019) that the weights of the source and target embeddings are shared during training, emphasizing the relation between the two.
Using this terminology, we show in simulations (§5.1) that when similar actions share target embeddings, RL is more effective. Moreover, when target embeddings are initialized based on high-quality embeddings (BERT's in our case), freezing them during RL yields further improvement still. We obtain similar results when experimenting on NMT. Indeed, using BERT's embeddings as target embeddings improves performance on the four language pairs, and freezing them yields an additional improvement for both MLE and RL, as reported by both automatic metrics and human evaluation. Both initialization and freezing are novel in the context of RL training for NMT. Moreover, when using BERT's embeddings, RL's ability to improve performance on target tokens to which the pre-trained MLE model did not assign a high probability is enhanced (§5.2).
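As a rough sketch of this second method, the final fully connected layer can be initialized from a pretrained embedding table and frozen during RL fine-tuning. The snippet below is illustrative only: it assumes $d = 768$, uses a random placeholder in place of BERT's embedding matrix (which in practice must be re-indexed to match the target vocabulary), and omits the rest of the model.

```python
import torch
import torch.nn as nn

d, vocab_size = 768, 40000  # decoder hidden size and |V_T| (illustrative values)

# Output projection ("target embeddings"): each row maps one vocabulary item
# back from the d-dimensional internal representation.
target_embeddings = nn.Linear(d, vocab_size, bias=False)

# Initialize from a pretrained table of shape (vocab_size, d), e.g., BERT's
# word embeddings aligned with the target vocabulary (placeholder here).
pretrained = torch.randn(vocab_size, d)
with torch.no_grad():
    target_embeddings.weight.copy_(pretrained)

# Freeze the target embeddings during RL so that the generalization over
# similar actions encoded in nearby rows is preserved.
target_embeddings.weight.requires_grad = False
```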
2 Background
2.1 RL in Machine Translation
RL is used in text generation (TG) for its ability to incorporate non-differentiable signals, to tackle the exposure bias, and to introduce sequence-level constraints. The latter two are persistent challenges in the development of TG systems, and have also been addressed by non-RL methods (e.g., Zhang et al., 2019; Ren et al., 2019). In addition, RL is grounded within a broad theoretical and empirical literature, which adds to its appeal.

These properties have led to much interest in RL for TG in general (Shah et al., 2018) and NMT in particular (Wu et al., 2018a). Numerous policy gradient methods are commonly used, notably REINFORCE (Williams, 1992) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016). However, despite increasing interest and strong results, only a handful of works have studied the source of the performance gains observed with RL in NLP and its training dynamics, and some of these have suggested that RL's gains are partly due to artifacts (Caccia et al., 2018; Choshen et al., 2019).
In a recent paper, C19 showed that existing RL training protocols for MT (REINFORCE and MRT) take a prohibitively long time to converge. Their results suggest that RL practices in MT are likely to improve performance only where the MLE parameters are already close to yielding the correct translation. They further suggest that observed gains may be due not to the training signal, but rather to changes in the shape of the distribution curve. These results may suggest that one of the drawbacks of RL is the uncommonly large action space, which in TG includes all tokens in the vocabulary, typically tens of thousands of actions or more.
To the best of our knowledge, no previous work has considered the challenge of large action spaces in TG, and relatively few studies have considered it in different contexts. One line of work assumed prior domain knowledge about the problem and partitioned actions into sub-groups (Sharma et al., 2017), or, similar to our approach, embedded actions in a continuous space where some metric over this space allows generalization over similar actions (Dulac-Arnold et al., 2016). More recent work proposed to learn target embeddings from expert demonstrations when the underlying structure of the action space is a priori unknown (Tennenholtz and Mannor, 2019; Chandak et al., 2019).

This paper establishes that large action spaces are a limiting factor in the application of RL to NMT, and proposes methods to tackle this challenge. Our techniques restrict the size of the embedding space, either explicitly, or implicitly by using an underlying continuous representation.
2.2 Technical Background and Notation
Notation. We denote the source sentence with $X = (x_1, \ldots, x_S)$ and the reference sentence with $Y = (y_1, \ldots, y_T)$. Given $X$, the network generates a sentence in the target language $Y' = (y'_1, \ldots, y'_M)$. Target tokens are taken from a vocabulary $V_T$. During inference, at each step $i$, the probability of generating a token $y'_i \in V_T$ is conditioned on the source sentence and the previously predicted tokens, i.e., $P_\theta(y'_i \mid X, y'_{<i})$, where $\theta$ denotes the model parameters. We assume there is exactly one valid target token, the reference token, since in practice training is done against a single reference (Schulz et al., 2018).
NMT with RL. In RL terminology, one can think of an NMT model as an agent, which interacts with the environment. In this case, the environment state consists of the previous words $y'_{<i}$ and the source sentence $X$. At each step, the agent selects an action according to its policy, where actions are tokens. The policy is defined by the parameters of the model, i.e., the conditional probability $P_\theta(y'_i \mid y'_{<i}, X)$. Reward is given only once the agent generates a complete sequence $Y'$. The standard reward for MT is the sentence-level BLEU metric (Papineni et al., 2002), matching the evaluation metric. Our goal is to find the parameters that will maximize the expected reward.
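For illustration, such a sentence-level reward can be computed with the sacrebleu package; our experiments use the smoothed Moses script described in §3, so treat this only as a sketch of the reward interface.

```python
import sacrebleu

def reward(hypothesis: str, reference: str) -> float:
    """Sentence-level BLEU of a single hypothesis against a single reference."""
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

# The reward becomes available only once a complete sequence has been generated.
print(reward("the cat sat on the mat", "the cat is on the mat"))
```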
In this work, we use MRT (Och, 2003; Shen et al., 2015), a policy gradient method adapted to MT. The key idea of this method is to optimize at each step a re-normalized risk, defined only over the sampled batch. Concretely, the expected risk is defined as:

$$\mathcal{L}_{risk} = \sum_{u \in U(X)} R(Y, u) \cdot \frac{P(u \mid X)^{\beta}}{\sum_{u' \in U(X)} P(u' \mid X)^{\beta}} \qquad (1)$$

where $u$ is a candidate hypothesis sentence, $U(X)$ is the sample of $k$ candidate hypotheses, $Y$ is the reference, $P$ is the conditional probability that the model assigns to a candidate hypothesis $u$ given the source sentence $X$, $\beta$ is a smoothness parameter, and $R$ is BLEU.
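A sketch of Eq. (1) in PyTorch, for a single source sentence with $k$ sampled hypotheses. `seq_log_probs` and `rewards` are assumed to be precomputed (the model's sequence log-probabilities and their BLEU against the reference), and the sign is flipped so that minimizing the loss increases the expected reward. This is an illustration, not our exact fairseq-based implementation.

```python
import torch

def mrt_risk(seq_log_probs: torch.Tensor, rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Renormalized expected risk of Eq. (1) over the sampled candidates U(X).

    seq_log_probs: shape (k,), log P(u | X) for each sampled hypothesis u
    rewards:       shape (k,), R(Y, u), e.g., sentence-level BLEU against the reference
    """
    # P(u|X)^beta / sum_u' P(u'|X)^beta, computed stably in log space via softmax
    q = torch.softmax(beta * seq_log_probs, dim=0)
    # Negative expected reward: minimizing it shifts probability mass toward
    # hypotheses with higher BLEU.
    return -(q * rewards).sum()

# Example with k = 8 hypotheses per source sentence
seq_log_probs = torch.randn(8, requires_grad=True)
rewards = 100.0 * torch.rand(8)
loss = mrt_risk(seq_log_probs, rewards)
loss.backward()
```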
3 Methodology
Architecture. We use a setup similar to that of Wieting et al. (2019), adapting their fairseq-based (Ott et al., 2019) codebase to our purposes.2 Similar to their architecture, we use gated convolutional encoders and decoders (Gehring et al., 2017). We use 4 layers for the encoder and 3 for the decoder, the size of the hidden state is 768 for all layers, and the filter width of the kernels is 3. Additionally, the dimension of the BPE embeddings is set to 768.
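For reference, the gated convolutional building block of Gehring et al. (2017) can be sketched as below with our dimensions (hidden size 768, kernel width 3). This simplified block omits attention, weight normalization, and the causal padding used in the actual decoder, so it should be read as a sketch rather than the fairseq implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """Simplified gated convolutional (GLU) layer as in Gehring et al. (2017)."""

    def __init__(self, d: int = 768, kernel_width: int = 3):
        super().__init__()
        # The convolution produces 2*d channels; GLU splits them into a
        # candidate output and a sigmoid gate.
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=kernel_width, padding=kernel_width // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d) -> convolve over time -> gated output of the same shape
        h = self.conv(x.transpose(1, 2))   # (batch, 2*d, time)
        h = F.glu(h, dim=1)                # (batch, d, time)
        return h.transpose(1, 2) + x       # residual connection

# Example: a stack of 3 decoder blocks over a batch of BPE embeddings
blocks = nn.Sequential(*[GatedConvBlock(768, 3) for _ in range(3)])
x = torch.randn(2, 10, 768)                # (batch, time, embedding dim)
print(blocks(x).shape)                     # torch.Size([2, 10, 768])
```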
Data Preprocessing. We use BPE (Sennrich et al., 2016) for tokenization. The vocabulary size is set to 40K for the combined source and target vocabulary, as done by Wieting et al. (2019). For the small target vocabulary experiments, we change the target vocabulary size to 1K and keep the source vocabulary unchanged.
Objective Functions. Following Edunov et al. (2018), we train models with MLE with label smoothing (Szegedy et al., 2016; Pereyra et al., 2017) of size 0.1. For RL, we fine-tune the model with a weighted average of the MRT loss $\mathcal{L}_{risk}$ and the token-level loss $\mathcal{L}_{mle}$. Our fine-tuning objective thus becomes:

$$\mathcal{L}_{Average} = \alpha \cdot \mathcal{L}_{mle} + (1 - \alpha) \cdot \mathcal{L}_{risk} \qquad (2)$$

We set $\alpha$ to 0.3, shown to work best by Wu et al. (2018b), and set $\beta$ to 1. We generate eight hypotheses for each MRT step ($k = 8$) with beam search. We train with smoothed BLEU (Lin and Och, 2004) from the Moses implementation.3 Moreover, we use this metric to report results and verify that they match sacrebleu (Post, 2018).4
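A sketch of Eq. (2), combining the label-smoothed token-level loss with the MRT risk from the earlier sketch; the scalar values below are placeholders, not measured losses.

```python
import torch

def combined_loss(mle_loss: torch.Tensor, risk_loss: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    """Eq. (2): weighted average of the token-level and sequence-level objectives."""
    return alpha * mle_loss + (1.0 - alpha) * risk_loss

# Example with placeholder scalar losses
mle_loss = torch.tensor(2.3)     # label-smoothed cross-entropy (placeholder)
risk_loss = torch.tensor(-41.7)  # MRT risk from the sketch above (placeholder)
print(combined_loss(mle_loss, risk_loss))
```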
Optimization. We train the MLE objective over 200 epochs and the combined RL objective over 15. We perform early stopping by selecting the model with the lowest validation loss. We optimize with Nesterov's accelerated gradient method (Sutskever et al., 2013) with a learning rate of 0.25 and a momentum of 0.99, and re-normalize gradients to a 0.1 norm (Pascanu et al., 2012).
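These settings correspond roughly to the following PyTorch components (a sketch; fairseq's `nag` optimizer is not identical to `torch.optim.SGD` with `nesterov=True`, and `model` here is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 40000)  # placeholder for the NMT model
optimizer = torch.optim.SGD(model.parameters(), lr=0.25, momentum=0.99, nesterov=True)

def update(loss: torch.Tensor) -> None:
    """One optimization step with gradient re-normalization to a 0.1 norm."""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
```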
2 https://github.com/jwieting/beyond-bleu
3 https://github.com/jwieting/beyond-bleu/blob/master/multi-bleu.perl
4 https://github.com/mjpost/sacrebleu