Revisiting Checkpoint Averaging for Neural Machine Translation
Yingbo Gao Christian Herold Zijian Yang Hermann Ney
Human Language Technology and Pattern Recognition Group
Computer Science Department
RWTH Aachen University
D-52056 Aachen, Germany
{ygao|herold|zyang|ney}@cs.rwth-aachen.de
Abstract
Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models. The calculation is cheap to perform, and the fact that the translation improvement comes almost for free makes it widely adopted in neural machine translation research. Despite this popularity, the method itself simply takes the mean of the model parameters from several checkpoints, the selection of which is mostly based on empirical recipes without much justification. In this work, we revisit the concept of checkpoint averaging and consider several extensions. Specifically, we experiment with different checkpoint selection strategies, calculating a weighted average instead of a simple mean, making use of gradient information, and fine-tuning the interpolation weights on development data. Our results confirm the necessity of applying checkpoint averaging for optimal performance, but also suggest that the landscape between the converged checkpoints is rather flat and that little further improvement over simple averaging is to be obtained.
1 Introduction
Checkpoint averaging is a simple method to improve model performance at low computational cost. The procedure is straightforward: select some model checkpoints, average the model parameters, and obtain a better model. Because of its simplicity and effectiveness, it is widely used in neural machine translation (NMT), e.g. in the original Transformer paper (Vaswani et al., 2017), in systems participating in public machine translation (MT) evaluations such as the Conference on Machine Translation (WMT) (Barrault et al., 2021) and the International Conference on Spoken Language Translation (IWSLT) (Anastasopoulos et al., 2022): Barrault et al. (2021); Erdmann et al. (2021); Li et al. (2021); Subramanian et al. (2021); Tran et al. (2021); Wang et al. (2021b); Wei et al. (2021); Di Gangi et al. (2019); Li et al. (2022), and in numerous MT research papers (Junczys-Dowmunt et al., 2016; Shaw et al., 2018; Liu et al., 2018; Zhao et al., 2019; Kim et al., 2021). Apart from NMT, checkpoint averaging also finds applications in Transformer-based automatic speech recognition models (Karita et al., 2019; Dong et al., 2018; Higuchi et al., 2020; Tian et al., 2020; Wang et al., 2020). Despite the popularity of the method, the recipes in each work are rather empirical and do not differ much except in how many and exactly which checkpoints are averaged.
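For concreteness, the following is a minimal sketch of the basic procedure, assuming PyTorch-style checkpoints that store the model parameters in a state dict under a "model" key; the file names and key layout are illustrative and not tied to any particular toolkit.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints of the same model.

    Assumes each file at `paths` is a PyTorch checkpoint whose "model"
    entry is a state dict; adjust the key to match the actual format.
    """
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg_state is None:
            # Clone so the first checkpoint is not modified in place.
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # Simple mean over the selected checkpoints.
    return {k: v / len(paths) for k, v in avg_state.items()}

# Hypothetical usage: average the last three checkpoints of a run.
# averaged = average_checkpoints(["ckpt_18.pt", "ckpt_19.pt", "ckpt_20.pt"])
# model.load_state_dict(averaged)
```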
In this work, we revisit the concept of checkpoint averaging and consider several extensions. We examine straightforward hyperparameters such as the number of checkpoints to average, the checkpoint selection strategy, and the mean calculation itself. Because gradient information is often available at the time of checkpointing, we also explore the idea of making use of it. Additionally, we experiment with fine-tuning the interpolation weights of the checkpoints on development data. In line with numerous previous works, we confirm that translation performance improvements can be robustly obtained with checkpoint averaging. However, our results suggest that the landscape between the converged checkpoints is rather flat, and it is hard to squeeze out further performance improvements with more advanced techniques.
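To illustrate the weighted-averaging extension, the sketch below treats the interpolation weights over the checkpoints as learnable parameters and tunes them to minimize a development-set loss. The helper `dev_loss_fn` is an assumption for illustration only: it is expected to evaluate the model with the combined parameters in a differentiable way (e.g., via a functional forward pass), so that gradients flow back to the weights.

```python
import torch

def weighted_average(states, weights):
    """Combine checkpoint state dicts with softmax-normalized weights."""
    norm = torch.softmax(weights, dim=0)
    keys = states[0].keys()
    return {k: sum(norm[i] * states[i][k].float() for i in range(len(states)))
            for k in keys}

def tune_weights(states, dev_loss_fn, steps=100, lr=1e-2):
    """Fine-tune interpolation weights on development data.

    `dev_loss_fn` is a hypothetical callable that maps a combined state
    dict to a scalar dev loss and must be differentiable w.r.t. it.
    """
    # Zero logits correspond to a uniform average as the starting point.
    weights = torch.zeros(len(states), requires_grad=True)
    opt = torch.optim.Adam([weights], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = dev_loss_fn(weighted_average(states, weights))
        loss.backward()
        opt.step()
    return torch.softmax(weights.detach(), dim=0)
```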
2 Related Work
The idea of combining multiple models for more stable and potentially better predictions is not new in statistical learning (Dietterich, 2000; Dong et al., 2020). In NMT, ensembling, more specifically ensembling systems with different architectures, has been shown to be helpful (Stahlberg et al., 2019; Rosendahl et al., 2019; Zhang and van Genabith, 2019). In contrast, checkpoint averaging uses checkpoints from the same training run with the