the reward function and the evaluation metric. Our
results show that RL with the reduced vocabulary yields a considerably larger performance increase (about 1 BLEU point on average) over MLE training than RL achieves with the standard vocabulary size. Moreover, our findings indicate that reducing the vocabulary size allows RL to improve upon the MLE model even in cases where the MLE model was not close to being correct. See §4.
However, in some cases, it may be undesirable
or infeasible to change the vocabulary. We therefore experiment with two methods that effectively reduce the dimensionality of the action space without changing the vocabulary. We note that, generally in NMT architectures, the dimensionality of the decoder's internal layers (henceforth, d) is significantly smaller than the target vocabulary size (henceforth, |V_T|), which is the size of the action space. A fully connected layer is typically used to map the internal representation to suitable outputs.
We may therefore refer to the rows of this layer's parameter matrix as target embeddings: they map the network's internal low-dimensional representation back to the vocabulary-sized space of actions. We use this term to underscore the analogy with the network's first embedding layer, which maps vectors of dimension |V_T| to vectors of dimension d, whereas target embeddings work in the inverse direction. Indeed, it is often the case (e.g., in BERT; Devlin et al., 2019) that the weights of the source and target embeddings are shared during training, emphasizing the relation between the two.
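To make this terminology concrete, the following is a minimal PyTorch-style sketch (illustrative only; the class and argument names are ours, not taken from any specific implementation) of an output layer whose weight rows serve as target embeddings, optionally tied to an input embedding in the spirit of the weight sharing mentioned above.

```python
import torch
import torch.nn as nn

class TargetEmbeddingOutput(nn.Module):
    """Maps the decoder's d-dimensional hidden state to |V_T| logits.

    The rows of `self.proj.weight` (shape |V_T| x d) are the "target
    embeddings": each row maps the low-dimensional internal
    representation back to one vocabulary item, i.e., one action.
    """
    def __init__(self, d, vocab_size, tied_embedding=None):
        super().__init__()
        self.proj = nn.Linear(d, vocab_size, bias=False)
        if tied_embedding is not None:
            # Share weights with the input embedding (shape |V_T| x d),
            # analogous to the source/target sharing noted above.
            self.proj.weight = tied_embedding.weight

    def forward(self, hidden):
        # hidden: (..., d) -> logits over the |V_T| actions
        return self.proj(hidden)
```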
Using this terminology, we show in simulations
(§5.1) that when similar actions share target em-
beddings, RL is more effective. Moreover, when
target embeddings are initialized based on high-
quality embeddings (BERT’s in our case), freezing
them during RL yields further improvement still.
We obtain similar results when experimenting on
NMT. Indeed, using BERT’s embeddings for target
embeddings improves performance on the four lan-
guage pairs, and freezing them yields an additional improvement for both MLE and RL, as reported by both automatic metrics and human evaluation. Both initialization and freezing are novel in the context of RL training for NMT. Moreover, when using BERT's embeddings, RL is better able to improve performance on target tokens to which the pre-trained MLE model did not assign a high probability (§5.2).
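For illustration only (the attribute name `model.output_proj` and the shapes are our assumptions, not the actual implementation), initializing the target embeddings from pre-trained embeddings and freezing them before RL fine-tuning could look roughly as follows.

```python
import torch

def init_and_freeze_target_embeddings(model, pretrained_weight):
    """Copy pre-trained (e.g., BERT) embeddings into the output
    projection and freeze it, so that RL updates only the rest of
    the network. `pretrained_weight` must have shape (|V_T|, d),
    matching `model.output_proj.weight`.
    """
    with torch.no_grad():
        model.output_proj.weight.copy_(pretrained_weight)
    for p in model.output_proj.parameters():
        p.requires_grad = False

# RL fine-tuning would then optimize only the remaining parameters:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```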
2 Background
2.1 RL in Machine Translation
RL is used in text generation (TG) for its ability
to incorporate non-differentiable signals, to tackle
exposure bias, and to introduce sequence-level
constraints. The latter two are persistent challenges
in the development of TG systems, and have also
been addressed by non-RL methods (e.g., Zhang et al., 2019; Ren et al., 2019). In addition, RL is
grounded within a broad theoretical and empirical
literature, which adds to its appeal.
These properties have led to much interest in
RL for TG in general (Shah et al., 2018) and NMT in particular (Wu et al., 2018a). Several policy gradient methods are commonly used, notably REINFORCE (Williams, 1992) and Minimum Risk Training (MRT; e.g., Och, 2003; Shen et al., 2016).
However, despite increasing interest and strong re-
sults, only a handful of works have studied the sources of the performance gains observed with RL in NLP, or its training dynamics, and some of these have suggested that RL's gains are partly due to artifacts (Caccia et al., 2018; Choshen et al., 2019).
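For concreteness, a bare-bones single-sample REINFORCE objective for sequence generation is sketched below; this is our simplified illustration, and the protocols studied in the works cited above typically add baselines, multiple samples per source sentence, or the MRT formulation.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, sampled_ids, rewards):
    """Single-sample REINFORCE loss for sequence generation.

    logits:      (batch, seq_len, |V_T|) decoder scores at each step
    sampled_ids: (batch, seq_len) tokens sampled from the policy
    rewards:     (batch,) sentence-level reward, e.g., BLEU

    Minimizing this loss follows the gradient of -E[R * log p(y|x)],
    estimated with one sampled translation per source sentence.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(
        -1, sampled_ids.unsqueeze(-1)).squeeze(-1)   # (batch, seq_len)
    seq_log_prob = token_log_probs.sum(dim=-1)       # log p(y|x)
    return -(rewards * seq_log_prob).mean()
```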
In a recent paper, C19 showed that existing RL
training protocols for MT (REINFORCE and MRT)
take a prohibitively long time to converge. Their
results suggest that RL practices in MT are likely
to improve performance only where the MLE pa-
rameters are already close to yielding the correct
translation. They further suggest that observed
gains may not stem from the training signal itself, but rather from changes in the shape of the output distribution. These results may suggest
that one of the drawbacks of RL is the uncommonly
large action space, which in TG includes all tokens
in the vocabulary, typically tens of thousands of
actions or more.
To the best of our knowledge, no previous work
considered the challenge of large action spaces in
TG, and relatively few studies have considered it in other contexts. One line of work assumed prior domain knowledge about the problem, either partitioning actions into sub-groups (Sharma et al., 2017) or, similar to our approach, embedding actions in a continuous space in which a metric allows generalization over similar actions (Dulac-Arnold et al., 2016). More recent work proposed to learn target embeddings from expert demonstrations when the underlying structure of the action space is a priori unknown (Tennenholtz and Mannor, 2019; Chandak et al., 2019).