
This underfitting would worsen over time as the distribution continues to shift. We formalize this
problem and further show that there is no straightforward way to control this trade-off; the issue
persists even when common adaptive learning rate heuristics are applied.
Orthogonal to continual learning, one recently proposed remedy to guarantee SGD convergence with
high learning rates is to use the moving average of SGD iterates (Mandt et al., 2016; Tarvainen &
Valpola, 2017). Informally, SGD with large learning rates bounces around the optimum; averaging
its trajectory dampens the bouncing and tracks the optimum more closely (Mandt et al., 2016). We apply
these ideas to OCL for the first time to improve information retention.
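As a toy illustration of this averaging effect (illustrative only: the quadratic objective, learning rate, and noise scale are our own choices, not the paper's setup), constant-step SGD on a noisy quadratic keeps oscillating around the optimum, while the running mean of its iterates settles much closer to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimize f(w) = 0.5 * w^2 with noisy gradients and a large, constant
# learning rate. The raw SGD iterate keeps bouncing around the optimum
# at w = 0; the running mean of the iterates averages out the bouncing.
lr = 0.5
w = 5.0
avg = 0.0
for t in range(1, 2001):
    grad = w + rng.normal(scale=1.0)  # noisy gradient of 0.5 * w^2
    w -= lr * grad
    avg += (w - avg) / t              # running mean of SGD iterates

# The averaged iterate is typically far closer to the optimum than w.
print(abs(w), abs(avg))
```

On a non-stationary stream, the appeal of this combination is that the large-step SGD iterate can still chase a moving optimum while the average supplies a better-converged model.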
To summarize, we theoretically analyze the behavior of SGD for OCL. Following this analysis, we
propose a moving average strategy to optimize information retention. Our method uses SGD with
large learning rates to adapt to non-stationarity, and utilizes the average of SGD iterates for better
convergence. We propose an adaptive moving average (AMA) algorithm to control the moving
average weight over time. Based on the statistics of the SGD and AMA models, we further propose a
moving-average-based learning rate schedule (MALR) to better control the learning rate. Experiments
on Continual Localization (CLOC) (Cai et al., 2021), Google Landmarks (Weyand et al., 2020), and
ImageNet (Deng et al., 2009) demonstrate superior information retention and long-term transfer for
large-scale OCL.
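The basic ingredient behind such a strategy is an exponential moving average of the model weights. The sketch below uses our own notation with a fixed `alpha`; the adaptive, time-varying control of this weight (AMA) and the derived learning rate schedule (MALR) are not shown:

```python
import numpy as np

def update_moving_average(ma_params, sgd_params, alpha=0.01):
    """Exponential moving average of model weights:
    ma <- (1 - alpha) * ma + alpha * sgd.
    The adaptive choice of alpha over time (AMA) is beyond this sketch."""
    for name, p in sgd_params.items():
        ma_params[name] = (1 - alpha) * ma_params[name] + alpha * p

# Tiny example with a single weight tensor.
sgd = {"w": np.array([2.0, -1.0])}
ma = {"w": np.array([0.0, 0.0])}
update_moving_average(ma, sgd, alpha=0.5)
# ma["w"] is now [1.0, -0.5]
```

After each SGD step on the stream, the averaged model is updated in this way and used for evaluation, while the SGD model continues to adapt with a large learning rate.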
2 RELATED WORK
Optimization in OCL.
OCL methods typically focus on improving learning efficacy. Cai et al. (2021)
proposed several strategies, including adaptive learning rates, adaptive replay buffer sizes, and small
batch sizes. Hu et al. (2020) proposed a new optimizer, ConGrad, which, at each time step, adaptively
controls the number of online gradient descent steps (Hazan, 2019) to balance generalization and
training loss reduction. Our work instead focuses on information retention and proposes a new
optimizer and learning rate schedule that trade off learning efficacy to improve long-term transfer.
In terms of replay buffer strategies, mixed replay (Chaudhry et al., 2019), originating from offline
continual learning, forms a minibatch by sampling half of the data from the online stream and the
other half from the history. It has been applied in OCL to optimize learning efficacy (Cai et al., 2021).
Our work uses pure replay instead to optimize information retention, where a minibatch is formed by
sampling uniformly from all history.
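The two sampling schemes can be sketched as follows; the function names and batch sizes are our own, and plain lists stand in for the stream and the replay storage:

```python
import random

def mixed_replay_batch(stream, history, batch_size, rng):
    """Mixed replay: half of the minibatch from the current online
    stream, half sampled from the replay history."""
    half = batch_size // 2
    online = stream[:half]
    replayed = [rng.choice(history) for _ in range(batch_size - half)]
    return online + replayed

def pure_replay_batch(history, batch_size, rng):
    """Pure replay: the entire minibatch is sampled uniformly from
    all history (the scheme used here for information retention)."""
    return [rng.choice(history) for _ in range(batch_size)]

rng = random.Random(0)
history = list(range(100))       # indices of all past samples
stream = list(range(100, 110))   # newest samples from the online stream
mixed = mixed_replay_batch(stream, history, 8, rng)
pure = pure_replay_batch(history, 8, rng)
```

Mixed replay biases each minibatch toward the newest data, which helps learning efficacy; sampling uniformly from all history instead weights past and present equally, which is the behavior wanted for retention.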
Continual learning algorithms.
We focus on the optimization aspect of OCL. Other aspects, such
as data integration (Aljundi et al., 2019b) and the sampling procedure of the replay buffer (Aljundi
et al., 2019a; Chrysakis & Moens, 2020), are complementary and orthogonal to our study. These
aspects are critical for a successful OCL strategy and can potentially be used in conjunction with our
optimizers. Offline continual learning (Li & Hoiem, 2017; Kirkpatrick et al., 2017) aims to improve
information retention with limited storage. Unlike the online setting, SGD works in this case since
the model can be retrained until convergence at each time step. We refer readers to Delange et al.
(2021) for a detailed survey of offline continual learning algorithms.
Moving average in optimization.
We propose a new moving-average-based optimizer for OCL.
Although we are the first to apply this idea to OCL, moving average optimizers have been widely
utilized for convex (Ruppert, 1988; Polyak & Juditsky, 1992) and non-convex optimization (Izmailov
et al., 2018; Maddox et al., 2019; He et al., 2020). Beyond supervised learning (Izmailov et al., 2018;
Maddox et al., 2019), the moving average model has also been used as a teacher of the SGD model
in semi-supervised (Tarvainen & Valpola, 2017) and self-supervised learning (He et al., 2020). The
moving average of stochastic gradients (rather than model weights) has also been widely used in
Adam-based optimizers (Kingma & Ba, 2014).
Continual learning benchmarks.
We need a large-scale and realistic benchmark to evaluate OCL.
For language modeling, Hu et al. (2020) created the Firehose benchmark using a large stream of
Twitter posts. The task of Firehose is continual per-user tweet prediction, which is self-supervised
and multi-task. For visual recognition, Lin et al. (2021) created the CLEAR benchmark by manually
labeling images on a subset of YFCC100M (Thomee et al., 2016). Though the images are ordered
in time, the number of labeled images is small (33K). Cai et al. (2021) proposed the continual
localization (CLOC) benchmark using a subset of YFCC100M with time stamps and geographic
locations. The task of CLOC is geolocalization, which is formulated as image classification. CLOC