Revisiting Checkpoint Averaging for Neural Machine Translation
Yingbo Gao Christian Herold Zijian Yang Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{ygao|herold|zyang|ney}@cs.rwth-aachen.de
Abstract
Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models. The calculation is cheap to perform, and the fact that the translation improvement comes almost for free makes it widely adopted in neural machine translation research. Despite its popularity, the method itself simply takes the mean of the model parameters from several checkpoints, the selection of which is mostly based on empirical recipes without much justification. In this work, we revisit the concept of checkpoint averaging and consider several extensions. Specifically, we experiment with ideas such as using different checkpoint selection strategies, calculating a weighted average instead of a simple mean, making use of gradient information, and fine-tuning the interpolation weights on development data. Our results confirm the necessity of applying checkpoint averaging for optimal performance, but also suggest that the landscape between the converged checkpoints is rather flat and not much further improvement over simple averaging is to be obtained.
1 Introduction
Checkpoint averaging is a simple method to improve model performance at low computational cost. The procedure is straightforward: select some model checkpoints, average the model parameters, and obtain a better model. Because of its simplicity and effectiveness, it is widely used in neural machine translation (NMT), e.g. in the original Transformer paper (Vaswani et al., 2017), in systems participating in public machine translation (MT) evaluations such as the Conference on Machine Translation (WMT) (Barrault et al., 2021) and the International Conference on Spoken Language Translation (IWSLT) (Anastasopoulos et al., 2022): Barrault et al. (2021); Erdmann et al. (2021); Li et al. (2021); Subramanian et al. (2021); Tran et al. (2021); Wang et al. (2021b); Wei et al. (2021); Di Gangi et al. (2019); Li et al. (2022), and in numerous MT research papers (Junczys-Dowmunt et al., 2016; Shaw et al., 2018; Liu et al., 2018; Zhao et al., 2019; Kim et al., 2021). Apart from NMT, checkpoint averaging also finds applications in Transformer-based automatic speech recognition models (Karita et al., 2019; Dong et al., 2018; Higuchi et al., 2020; Tian et al., 2020; Wang et al., 2020). Despite the popularity of the method, the recipes in each work are rather empirical and do not differ much except in how many and exactly which checkpoints are averaged.
In this work, we revisit the concept of checkpoint averaging and consider several extensions. We examine the straightforward hyperparameters like the number of checkpoints to average, the checkpoint selection strategy and the mean calculation itself. Because the gradient information is often available at the time of checkpointing, we also explore the idea of using this piece of information. Additionally, we experiment with the idea of fine-tuning the interpolation weights of the checkpoints on development data. As reported in countless works, we confirm that the translation performance improvement can be robustly obtained with checkpoint averaging. However, our results suggest that the landscape between the converged checkpoints is rather flat, and it is hard to squeeze out further performance improvements with advanced tricks.
2 Related Work
The idea of combining multiple models for more stable and potentially better prediction is not new in statistical learning (Dietterich, 2000; Dong et al., 2020). In NMT, ensembling, more specifically ensembling systems with different architectures, is shown to be helpful (Stahlberg et al., 2019; Rosendahl et al., 2019; Zhang and van Genabith, 2019). In contrast, checkpoint averaging uses checkpoints from the same training run with the same neural network (NN) architecture.
Figure 1: An illustration of checkpoint averaging and our extensions, with three panels: (a) vanilla, (b) using gradient information, (c) optimized on development data. The isocontour plot illustrates some imaginary loss surface. C1 and C2 are model parameters from two checkpoints; Cavg denotes the averaged parameters. In (a), the mean of C1 and C2 is taken. In (b), the dashed arrows refer to the gradients (could also include the momentum terms) stored in the checkpoints, and a further step (with step size η) is taken. In (c), a NN is parametrized with the interpolation weights w1 and w2, and the weights are learned on the development data.
Compared to ensembling, checkpoint averaging is cheaper to calculate and does not require one to store and query multiple models at test time. The distinction can also be made from the perspective of the interpolation space, i.e. model parameter space for checkpoint averaging, and posterior probability space for ensembling. As a trade-off, the performance boost from checkpoint averaging is typically smaller than that from ensembling (Liu et al., 2018).
In the literature, Chen et al. (2017) study the use of checkpoints from the same training run for ensembling; Smith (2017) proposes cyclic learning rate schedules to improve accuracy and convergence; Huang et al. (2017) propose to use a cyclic learning rate to obtain snapshots of the same model during training and ensemble them in the probability space; Izmailov et al. (2018) perform model parameter averaging on-the-fly during training and argue for better generalization in this way; Popel and Bojar (2018) discuss empirical findings related to checkpoint averaging for NMT; Zhang et al. (2020) and Karita et al. (2021) maintain an exponential moving average during model training; Wang et al. (2021a) propose a boosting algorithm and ensemble checkpoints in the probability space; Matena and Raffel (2021) exploit the Fisher information matrix to calculate a weighted average of model parameters. Here, we are interested in the interpolation happening in the model parameter space, and therefore refrain from further discussing topics like ensembling or continuing training on the development data.
3 Methodology
In this section, we discuss the extensions to checkpoint averaging considered in this work. An intuitive illustration is shown in Fig. 1.
3.1 Extending Vanilla Checkpoint Averaging
Vanilla checkpoint averaging is straightforward and can be expressed as in Eq. 1. Here, $\theta$ denotes the model parameters and $\hat{\theta}$ is the averaged parameters, $k$ is a running index over the $K$ selected checkpoints, and $S$, where $|S| = K$, is a set of checkpoint indices selected by some specific strategy, e.g. top-$K$ or last-$K$. In the vanilla case, $w_k = \frac{1}{K}$, i.e. uniform weights are used.

$$\hat{\theta} = \sum_{k \in S} w_k \theta_k \qquad (1)$$
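As a concrete illustration of Eq. 1, the following is a minimal PyTorch-style sketch of checkpoint averaging. It assumes each checkpoint file stores its parameters under a "model" key; that key name (and the file layout in general) is an assumption about the checkpoint format, not part of the method itself.

```python
import torch

def average_checkpoints(paths, weights=None):
    """Average model parameters from several checkpoint files (Eq. 1)."""
    k_total = len(paths)
    if weights is None:
        weights = [1.0 / k_total] * k_total  # vanilla case: w_k = 1/K

    avg_state = None
    for w, path in zip(weights, paths):
        # Assumption: each checkpoint stores its parameters under the key "model".
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            # Accumulate in float32 to avoid precision issues with half-precision checkpoints.
            avg_state = {name: w * p.float() for name, p in state.items()}
        else:
            for name, p in state.items():
                avg_state[name] += w * p.float()
    return avg_state  # load into the model with model.load_state_dict(avg_state)
```

Passing `weights=None` yields the uniform weights of the vanilla case; non-uniform weights such as those of Eq. 2 can be supplied instead.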
As shown in Eq. 2, we further consider non-uniform weights and propose to use the softmax-normalized logarithm of development set perplexities (DEVPPL) with temperature $\tau$ as interpolation weights. We define $w$ in this way such that it is in the probability space.

$$w_k = \frac{\exp(\tau \log \mathrm{DEVPPL}_k)}{\sum_{k' \in S} \exp(\tau \log \mathrm{DEVPPL}_{k'})} \qquad (2)$$
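A small helper implementing Eq. 2 might look as follows; this is an illustrative sketch only, where `dev_ppls` is a hypothetical list of development set perplexities, one per selected checkpoint.

```python
import math

def devppl_weights(dev_ppls, tau):
    """Softmax-normalized log of dev perplexities with temperature tau (Eq. 2)."""
    logits = [tau * math.log(ppl) for ppl in dev_ppls]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that $\tau = 0$ recovers uniform weights, while the sign and magnitude of $\tau$ control how strongly the weights favor checkpoints with lower or higher perplexity.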
3.2 Making Use of Gradient Information
Nowadays, NMT models are commonly trained with stateful optimizers like Adam (Kingma and Ba, 2015). To provide the "continue-training" utility, the gradients of the most recent batch are therefore also saved. As shown in Eq. 3, we can thus take a further step in the parameter space during checkpoint averaging to make use of this information. Here, $\eta$ is the step size and $\frac{1}{K} \sum_{k \in S} \nabla_\theta L(\theta_k)$ is the mean of the gradients stored in the checkpoints.

$$\hat{\theta} = \sum_{k \in S} w_k \theta_k - \eta \frac{1}{K} \sum_{k \in S} \nabla_\theta L(\theta_k) \qquad (3)$$
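The sketch below illustrates Eq. 3 under the assumption that the per-checkpoint gradients can be read back as tensors keyed by parameter name; how (and whether) gradients are stored alongside the parameters depends on the training framework, so this is not a drop-in recipe.

```python
from typing import Dict, List, Optional
import torch

def average_with_gradient_step(
    states: List[Dict[str, torch.Tensor]],
    grads: List[Dict[str, torch.Tensor]],
    eta: float,
    weights: Optional[List[float]] = None,
) -> Dict[str, torch.Tensor]:
    """Checkpoint averaging followed by one extra gradient step (Eq. 3).

    `states` holds the parameter dicts of the selected checkpoints and `grads`
    the matching gradient dicts read from the same checkpoints (an assumed format);
    `eta` is the step size.
    """
    k_total = len(states)
    if weights is None:
        weights = [1.0 / k_total] * k_total

    avg = {}
    for name in states[0]:
        # Weighted average of the parameters (first term of Eq. 3).
        param = sum(w * s[name].float() for w, s in zip(weights, states))
        # Mean of the stored gradients (second term of Eq. 3).
        grad = sum(g[name].float() for g in grads) / k_total
        avg[name] = param - eta * grad
    return avg
```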