Revisiting Checkpoint Averaging for Neural Machine Translation
Yingbo Gao Christian Herold Zijian Yang Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{ygao|herold|zyang|ney}@cs.rwth-aachen.de
Abstract
Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models. The calculation is cheap to perform, and the fact that the translation improvement comes almost for free makes it widely adopted in neural machine translation research. Despite its popularity, the method itself simply takes the mean of the model parameters from several checkpoints, the selection of which is mostly based on empirical recipes without much justification. In this work, we revisit the concept of checkpoint averaging and consider several extensions. Specifically, we experiment with ideas such as using different checkpoint selection strategies, calculating a weighted average instead of a simple mean, making use of gradient information, and fine-tuning the interpolation weights on development data. Our results confirm the necessity of applying checkpoint averaging for optimal performance, but also suggest that the landscape between the converged checkpoints is rather flat and not much further improvement over simple averaging is to be obtained.
1 Introduction
Checkpoint averaging is a simple method to improve model performance at low computational cost. The procedure is straightforward: select some model checkpoints, average the model parameters, and obtain a better model. Because of its simplicity and effectiveness, it is widely used in neural machine translation (NMT), e.g. in the original Transformer paper (Vaswani et al., 2017), in systems participating in public machine translation (MT) evaluations such as the Conference on Machine Translation (WMT) (Barrault et al., 2021) and the International Conference on Spoken Language Translation (IWSLT) (Anastasopoulos et al., 2022): Barrault et al. (2021); Erdmann et al. (2021); Li et al. (2021); Subramanian et al. (2021); Tran et al. (2021); Wang et al. (2021b); Wei et al. (2021); Di Gangi et al. (2019); Li et al. (2022), and in numerous MT research papers (Junczys-Dowmunt et al., 2016; Shaw et al., 2018; Liu et al., 2018; Zhao et al., 2019; Kim et al., 2021). Apart from NMT, checkpoint averaging also finds applications in Transformer-based automatic speech recognition models (Karita et al., 2019; Dong et al., 2018; Higuchi et al., 2020; Tian et al., 2020; Wang et al., 2020). Despite the popularity of the method, the recipes in each work are rather empirical and do not differ much except in how many and exactly which checkpoints are averaged.
In this work, we revisit the concept of checkpoint averaging and consider several extensions. We examine the straightforward hyperparameters like the number of checkpoints to average, the checkpoint selection strategy and the mean calculation itself. Because the gradient information is often available at the time of checkpointing, we also explore the idea of using this piece of information. Additionally, we experiment with the idea of fine-tuning the interpolation weights of the checkpoints on development data. As reported in countless works, we confirm that the translation performance improvement can be robustly obtained with checkpoint averaging. However, our results suggest that the landscape between the converged checkpoints is rather flat, and it is hard to squeeze out further performance improvements with advanced tricks.
2 Related Work
The idea of combining multiple models for more stable and potentially better prediction is not new in statistical learning (Dietterich, 2000; Dong et al., 2020). In NMT, ensembling, more specifically ensembling systems with different architectures, is shown to be helpful (Stahlberg et al., 2019; Rosendahl et al., 2019; Zhang and van Genabith, 2019). In contrast, checkpoint averaging uses checkpoints from the same training run with the same neural network (NN) architecture.
Figure 1: An illustration of checkpoint averaging and our extensions, with three panels: (a) vanilla, (b) using gradient information, (c) optimized on development data. The isocontour plot illustrates some imaginary loss surface. C1 and C2 are model parameters from two checkpoints; Cavg denotes the averaged parameters. In (a), the mean of C1 and C2 is taken. In (b), the dashed arrows refer to the gradients (could also include the momentum terms) stored in the checkpoints, and a further step (with step size η) is taken. In (c), a NN is parametrized with the interpolation weights w1 and w2, and the weights are learned on the development data.
Compared to ensembling, checkpoint averaging is cheaper to calculate and does not require one to store and query multiple models at test time. The distinction can also be made from the perspective of the interpolation space, i.e. model parameter space for checkpoint averaging, and posterior probability space for ensembling. As a trade-off, the performance boost from checkpoint averaging is typically smaller than that from ensembling (Liu et al., 2018).
In the literature, Chen et al. (2017) study the use of checkpoints from the same training run for ensembling; Smith (2017) proposes cyclic learning rate schedules to improve accuracy and convergence; Huang et al. (2017) propose to use a cyclic learning rate to obtain snapshots of the same model during training and ensemble them in the probability space; Izmailov et al. (2018) perform model parameter averaging on-the-fly during training and argue for better generalization in this way; Popel and Bojar (2018) discuss empirical findings related to checkpoint averaging for NMT; Zhang et al. (2020) and Karita et al. (2021) maintain an exponential moving average during model training; Wang et al. (2021a) propose a boosting algorithm and ensemble checkpoints in the probability space; Matena and Raffel (2021) exploit the Fisher information matrix to calculate a weighted average of model parameters. Here, we are interested in the interpolation happening in the model parameter space, and therefore refrain from further discussing topics like ensembling or continuing training on the development data.
3 Methodology
In this section, we discuss the extensions to checkpoint averaging considered in this work. An intuitive illustration is shown in Fig. 1.
3.1 Extending Vanilla Checkpoint Averaging
Vanilla checkpoint averaging is straightforward and can be expressed as in Eq. 1. Here, $\theta$ denotes the model parameters and $\hat{\theta}$ is the averaged parameters, $k$ is a running index over the $K$ selected checkpoints, and $S$, where $|S| = K$, is a set of checkpoint indices selected by some specific strategy, e.g. top-$K$ or last-$K$. In the vanilla case, $w_k = \frac{1}{K}$, i.e. uniform weights are used.

$$\hat{\theta} = \sum_{k \in S} w_k \theta_k \qquad (1)$$
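As a concrete illustration of Eq. 1, the following is a minimal PyTorch-style sketch of checkpoint averaging. It assumes each checkpoint file stores its parameters under a "model" key; that key name (and the file layout in general) is an assumption about the checkpoint format, not part of the method itself.

```python
import torch

def average_checkpoints(paths, weights=None):
    """Average model parameters from several checkpoint files (Eq. 1)."""
    k_total = len(paths)
    if weights is None:
        weights = [1.0 / k_total] * k_total  # vanilla case: w_k = 1/K

    avg_state = None
    for w, path in zip(weights, paths):
        # Assumption: each checkpoint stores its parameters under the key "model".
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            # Accumulate in float32 to avoid precision issues with half-precision checkpoints.
            avg_state = {name: w * p.float() for name, p in state.items()}
        else:
            for name, p in state.items():
                avg_state[name] += w * p.float()
    return avg_state  # load into the model with model.load_state_dict(avg_state)
```

Passing `weights=None` yields the uniform weights of the vanilla case; non-uniform weights such as those of Eq. 2 can be supplied instead.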
As shown in Eq. 2, we further consider non-uniform weights and propose to use the softmax-normalized logarithm of development set perplexities (DEVPPL) with temperature $\tau$ as interpolation weights. We define $w$ in this way such that it is in the probability space.

$$w_k = \frac{\exp(\tau \log \mathrm{DEVPPL}_k)}{\sum_{k' \in S} \exp(\tau \log \mathrm{DEVPPL}_{k'})} \qquad (2)$$
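A small helper implementing Eq. 2 might look as follows; this is an illustrative sketch only, where `dev_ppls` is a hypothetical list of development set perplexities, one per selected checkpoint.

```python
import math

def devppl_weights(dev_ppls, tau):
    """Softmax-normalized log of dev perplexities with temperature tau (Eq. 2)."""
    logits = [tau * math.log(ppl) for ppl in dev_ppls]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that $\tau = 0$ recovers uniform weights, while the sign and magnitude of $\tau$ control how strongly the weights favor checkpoints with lower or higher perplexity.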
3.2 Making Use of Gradient Information
Nowadays, NMT models are commonly trained with stateful optimizers like Adam (Kingma and Ba, 2015). To provide the "continue-training" utility, the gradients of the most recent batch are therefore also saved. As shown in Eq. 3, we can thus take a further step in the parameter space during checkpoint averaging to make use of this information. Here, $\eta$ is the step size and $\frac{1}{K} \sum_{k \in S} \nabla_\theta L(\theta_k)$ is the mean of the gradients stored in the checkpoints.

$$\hat{\theta} = \sum_{k \in S} w_k \theta_k - \eta \frac{1}{K} \sum_{k \in S} \nabla_\theta L(\theta_k) \qquad (3)$$
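The sketch below illustrates Eq. 3 under the assumption that the per-checkpoint gradients can be read back as tensors keyed by parameter name; how (and whether) gradients are stored alongside the parameters depends on the training framework, so this is not a drop-in recipe.

```python
from typing import Dict, List, Optional
import torch

def average_with_gradient_step(
    states: List[Dict[str, torch.Tensor]],
    grads: List[Dict[str, torch.Tensor]],
    eta: float,
    weights: Optional[List[float]] = None,
) -> Dict[str, torch.Tensor]:
    """Checkpoint averaging followed by one extra gradient step (Eq. 3).

    `states` holds the parameter dicts of the selected checkpoints and `grads`
    the matching gradient dicts read from the same checkpoints (an assumed format);
    `eta` is the step size.
    """
    k_total = len(states)
    if weights is None:
        weights = [1.0 / k_total] * k_total

    avg = {}
    for name in states[0]:
        # Weighted average of the parameters (first term of Eq. 3).
        param = sum(w * s[name].float() for w, s in zip(weights, states))
        # Mean of the stored gradients (second term of Eq. 3).
        grad = sum(g[name].float() for g in grads) / k_total
        avg[name] = param - eta * grad
    return avg
```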