the reward function and the evaluation metric. Our
results show that RL with the reduced vocabulary yields a considerably larger performance increase (about 1 BLEU point on average) over MLE training than RL achieves with the standard vocabulary size. Moreover, our findings indicate that reducing the vocabulary size allows RL to improve upon the MLE model even in cases where the MLE model was not close to being correct. See §4.
However, in some cases, it may be undesirable
or infeasible to change the vocabulary. We therefore experiment with two methods that effectively reduce the dimensionality of the action space without changing the vocabulary. We note that, generally in NMT architectures, the dimensionality of the decoder's internal layers (henceforth, d) is significantly smaller than the target vocabulary size (henceforth, |V_T|), which is the size of the action space. A fully connected layer is typically used to map the internal representation to suitable outputs.
We may therefore refer to the rows of this layer's parameter matrix as target embeddings: they map the network's internal low-dimensional representation back to the vocabulary-sized space of actions. We use this term to underscore the analogy with the network's first embedding layer, which maps vectors of dimension |V_T| to vectors of dimension d, whereas target embeddings work in the inverse direction. Indeed, it is often the case (e.g., in BERT; Devlin et al., 2019) that the weights of the source and target embeddings are shared during training, emphasizing the relation between the two.
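To make this terminology concrete, the following is a minimal PyTorch-style sketch (illustrative only; the class and argument names are ours, not taken from any specific implementation) of an output layer whose weight rows serve as target embeddings, optionally tied to an input embedding in the spirit of the weight sharing mentioned above.

```python
import torch
import torch.nn as nn

class TargetEmbeddingOutput(nn.Module):
    """Maps the decoder's d-dimensional hidden state to |V_T| logits.

    The rows of `self.proj.weight` (shape |V_T| x d) are the "target
    embeddings": each row maps the low-dimensional internal
    representation back to one vocabulary item, i.e., one action.
    """
    def __init__(self, d, vocab_size, tied_embedding=None):
        super().__init__()
        self.proj = nn.Linear(d, vocab_size, bias=False)
        if tied_embedding is not None:
            # Share weights with the input embedding (shape |V_T| x d),
            # analogous to the source/target sharing noted above.
            self.proj.weight = tied_embedding.weight

    def forward(self, hidden):
        # hidden: (..., d) -> logits over the |V_T| actions
        return self.proj(hidden)
```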
Using this terminology, we show in simulations
(§5.1) that when similar actions share target em-
beddings, RL is more effective. Moreover, when
target embeddings are initialized based on high-
quality embeddings (BERT’s in our case), freezing
them during RL yields further improvement still.
We obtain similar results when experimenting on
NMT. Indeed, using BERT’s embeddings for target
embeddings improves performance on the four lan-
guage pairs, and freezing them yields an additional improvement for both MLE and RL, as reported by both automatic metrics and human evaluation. Both initialization and freezing are novel in the context of RL training for NMT. Moreover, when using BERT's embeddings, RL is better able to improve performance on target tokens to which the pre-trained MLE model did not assign a high probability (§5.2).
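For illustration only (the attribute name `model.output_proj` and the shapes are our assumptions, not the actual implementation), initializing the target embeddings from pre-trained embeddings and freezing them before RL fine-tuning could look roughly as follows.

```python
import torch

def init_and_freeze_target_embeddings(model, pretrained_weight):
    """Copy pre-trained (e.g., BERT) embeddings into the output
    projection and freeze it, so that RL updates only the rest of
    the network. `pretrained_weight` must have shape (|V_T|, d),
    matching `model.output_proj.weight`.
    """
    with torch.no_grad():
        model.output_proj.weight.copy_(pretrained_weight)
    for p in model.output_proj.parameters():
        p.requires_grad = False

# RL fine-tuning would then optimize only the remaining parameters:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```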
2 Background
2.1 RL in Machine Translation
RL is used in text generation (TG) for its ability
to incorporate non-differentiable signals, to tackle
exposure bias, and to introduce sequence-level
constraints. The latter two are persistent challenges
in the development of TG systems, and have also
been addressed by non-RL methods (e.g., Zhang et al., 2019; Ren et al., 2019). In addition, RL is
grounded within a broad theoretical and empirical
literature, which adds to its appeal.
These properties have led to much interest in
RL for TG in general (Shah et al., 2018) and NMT in particular (Wu et al., 2018a). Several policy gradient methods are commonly used, notably REINFORCE (Williams, 1992) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016).
However, despite increasing interest and strong re-
sults, only a handful of works have studied the sources of the performance gains observed with RL in NLP, or its training dynamics, and some of these have suggested that RL's gains are partly due to artifacts (Caccia et al., 2018; Choshen et al., 2019).
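For concreteness, a bare-bones single-sample REINFORCE objective for sequence generation is sketched below; this is our simplified illustration, and the protocols studied in the works cited above typically add baselines, multiple samples per source sentence, or the MRT formulation.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, sampled_ids, rewards):
    """Single-sample REINFORCE loss for sequence generation.

    logits:      (batch, seq_len, |V_T|) decoder scores at each step
    sampled_ids: (batch, seq_len) tokens sampled from the policy
    rewards:     (batch,) sentence-level reward, e.g., BLEU

    Minimizing this loss follows the gradient of -E[R * log p(y|x)],
    estimated with one sampled translation per source sentence.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(
        -1, sampled_ids.unsqueeze(-1)).squeeze(-1)   # (batch, seq_len)
    seq_log_prob = token_log_probs.sum(dim=-1)       # log p(y|x)
    return -(rewards * seq_log_prob).mean()
```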
In a recent paper, C19 showed that existing RL
training protocols for MT (REINFORCE and MRT)
take a prohibitively long time to converge. Their
results suggest that RL practices in MT are likely
to improve performance only where the MLE pa-
rameters are already close to yielding the correct
translation. They further suggest that observed
gains may not stem from the training signal itself, but rather from changes in the shape of the output distribution. These results may suggest
that one of the drawbacks of RL is the uncommonly
large action space, which in TG includes all tokens
in the vocabulary, typically tens of thousands of
actions or more.
To the best of our knowledge, no previous work
considered the challenge of large action spaces in
TG, and relatively few studies have considered it in other contexts. One line of work assumed prior domain knowledge about the problem, either partitioning actions into sub-groups (Sharma et al., 2017) or, similar to our approach, embedding actions in a continuous space in which a metric allows generalization over similar actions (Dulac-Arnold et al., 2016). More recent work proposed to learn target embeddings from expert demonstrations when the underlying structure of the action space is a priori unknown (Tennenholtz and Mannor, 2019; Chandak et al., 2019).