IMPROVING INFORMATION RETENTION IN LARGE
SCALE ONLINE CONTINUAL LEARNING
Zhipeng Cai
Intel Labs
Vladlen Koltun
Apple
Ozan Sener
Apple
ABSTRACT
Given a stream of data sampled from non-stationary distributions, online continual
learning (OCL) aims to adapt efficiently to new data while retaining existing
knowledge. The typical approach to address information retention (the ability to
retain previous knowledge) is keeping a replay buffer of a fixed size and computing
gradients using a mixture of new data and the replay buffer. Surprisingly, recent work (Cai et al., 2021) suggests that information retention remains a problem in large-scale OCL even when the replay buffer is unlimited, i.e., the gradients are computed using all past data. This paper focuses on this peculiarity to understand and address information retention. To pinpoint the source of this problem, we theoretically show that, given a limited computation budget at each time step, even without a strict storage limit, naively applying SGD with constant or constantly decreasing learning rates fails to optimize information retention in the long term.
We propose using a moving average family of methods to improve optimization
for non-stationary objectives. Specifically, we design an adaptive moving average
(AMA) optimizer and a moving-average-based learning rate schedule (MALR).
We demonstrate the effectiveness of AMA+MALR on large-scale benchmarks,
including Continual Localization (CLOC), Google Landmarks, and ImageNet.
Code will be released upon publication.
1 INTRODUCTION
Supervised learning commonly assumes that the data is independent and identically distributed
(iid.). This assumption is violated in practice when the data comes from a non-stationary distribution
that evolves over time. Continual learning aims to solve this problem by designing algorithms that
efficiently learn and retain knowledge over time from a data stream. Continual learning can be
classified into online and offline. The offline setting (Li & Hoiem,2017) mainly limits the storage:
only a fixed amount of training data can be stored at each time step. The computation is not limited
in offline continual learning: the model can be trained from scratch until convergence at each step. In
contrast, the online setting only allows a limited amount of storage and computation at each time step.
A number of metrics can be used to evaluate a continual learner. If the model is directly evaluated
on the incoming data, the objective is learning efficacy; this evaluates the ability to efficiently adapt
to new data. If the model is evaluated on historical data, the objective is information retention; this
evaluates the ability to retain existing knowledge. These two objectives are in conflict, and their
trade-off is known as the plasticity-stability dilemma (McCloskey & Cohen,1989).
Following recent counterintuitive results, we single out information retention in this work. A common
assumption in the continual learning literature is that information retention is only a problem due to
the storage constraint. Take replay-buffer-based methods as an exemplar. It is tacitly understood that
since they cannot store the entire history, they forget past knowledge which is not stored. However,
this intuition is challenged by recent empirical results. For example, Cai et al. (2021) show that the
information retention problem persists even when past data is stored in its entirety. We argue that, at
least in part, the culprit for information loss is optimization.
Direct application of SGD to a continual data stream is problematic. Informally, consider the learning
rate (i.e., step size). It needs to decrease to zero over time to guarantee convergence (Ghadimi &
Lan,2013). However, this cannot be applied to a continual and non-iid. stream since infinitesimal
learning rates would simply ignore the new information and fail to adapt, resulting in underfitting.
This underfitting would worsen over time as the distribution continues to shift. We formalize this
problem and further show that there is no straightforward method to control this trade-off, and this
issue holds even when common adaptive learning rate heuristics are applied.
Orthogonal to continual learning, one recently proposed remedy to guarantee SGD convergence with
high learning rates is using the moving average of SGD iterates (Mandt et al.,2016;Tarvainen &
Valpola,2017). Informally, SGD with large learning rates bounces around the optimum. Averaging
its trajectory dampens the bouncing and tracks the optimum better (Mandt et al.,2016). We apply
these ideas to OCL for the first time to improve information retention.
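As a concrete illustration of this idea, the following minimal sketch maintains an averaged copy of the SGD iterates in PyTorch and uses it for evaluation. It is our illustration of plain weight averaging, not the adaptive AMA optimizer proposed later; the helper names and the fixed decay value are assumptions.

```python
import copy
import torch

def make_average_model(model):
    # A frozen copy that will track the running average of the SGD iterates.
    avg_model = copy.deepcopy(model)
    for p in avg_model.parameters():
        p.requires_grad_(False)
    return avg_model

@torch.no_grad()
def update_average(avg_model, model, decay=0.99):
    # theta_avg <- decay * theta_avg + (1 - decay) * theta_sgd,
    # called after every SGD update; the SGD model keeps a large learning rate.
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(decay).add_(p, alpha=1.0 - decay)
```

Evaluation (and deployment) would then use the averaged model, while gradients are computed only for the SGD model.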
To summarize, we theoretically analyze the behavior of SGD for OCL. Following this analysis, we
propose a moving average strategy to optimize information retention. Our method uses SGD with
large learning rates to adapt to non-stationarity, and utilizes the average of SGD iterates for better
convergence. We propose an adaptive moving average (AMA) algorithm to control the moving
average weight over time. Based on the statistics of the SGD and AMA models, we further propose a
moving-average-based learning rate schedule (MALR) to better control the learning rate. Experiments
on Continual Localization (CLOC) (Cai et al.,2021), Google Landmarks (Weyand et al.,2020), and
ImageNet (Deng et al.,2009) demonstrate superior information retention and long-term transfer for
large-scale OCL.
2 RELATED WORK
Optimization in OCL.
OCL methods typically focus on improving learning efficacy. Cai et al. (2021)
proposed several strategies, including adaptive learning rates, adaptive replay buffer sizes, and small
batch sizes. Hu et al. (2020) proposed a new optimizer, ConGrad, which, at each time step, adaptively
controls the number of online gradient descent steps (Hazan,2019) to balance generalization and
training loss reduction. Our work instead focuses on information retention and proposes a new
optimizer and learning rate schedule that trade off learning efficacy to improve long-term transfer.
In terms of replay buffer strategies, mixed replay (Chaudhry et al.,2019), originating from offline
continual learning, forms a minibatch by sampling half of the data from the online stream and the
other half from the history. It has been applied in OCL to optimize learning efficacy (Cai et al.,2021).
Our work uses pure replay instead to optimize information retention, where a minibatch is formed by
sampling uniformly from all history.
Continual learning algorithms.
We focus on the optimization aspect of OCL. Other aspects, such
as the data integration (Aljundi et al.,2019b) and the sampling procedure of the replay buffer (Aljundi
et al.,2019a;Chrysakis & Moens,2020), are complementary and orthogonal to our study. These
aspects are critical for a successful OCL strategy and can potentially be used in conjunction with our
optimizers. Offline continual learning (Li & Hoiem,2017;Kirkpatrick et al.,2017) aims to improve
information retention with limited storage. Unlike the online setting, SGD works in this case since
the model can be retrained until convergence at each time step. We refer the readers to Delange et al.
(2021) for a detailed survey of offline continual learning algorithms.
Moving average in optimization.
We propose a new moving-average-based optimizer for OCL.
Although we are the first to apply this idea to OCL, moving average optimizers have been widely
utilized for convex (Ruppert,1988;Polyak & Juditsky,1992) and non-convex optimization (Izmailov
et al.,2018;Maddox et al.,2019;He et al.,2020). Beyond supervised learning (Izmailov et al.,2018;
Maddox et al.,2019), the moving average model has also been used as a teacher of the SGD model
in semi-supervised (Tarvainen & Valpola,2017) and self-supervised learning (He et al.,2020). The
moving average of stochastic gradients (rather than model weights) has also been widely used in
ADAM-based optimizers (Kingma & Ba,2014).
Continual learning benchmarks.
We need a large-scale and realistic benchmark to evaluate OCL.
For language modeling, Hu et al. (2020) created the Firehose benchmark using a large stream of
Twitter posts. The task of Firehose is continual per-user tweet prediction, which is self-supervised
and multi-task. For visual recognition, Lin et al. (2021) created the CLEAR benchmark by manually
labeling images on a subset of YFCC100M (Thomee et al.,2016). Though the images are ordered
in time, the number of labeled images is small (33K). Cai et al. (2021) proposed the continual
localization (CLOC) benchmark using a subset of YFCC100M with time stamps and geographic
locations. The task of CLOC is geolocalization, which is formulated as image classification. CLOC
has significant scale (39 million images taken over more than 8 years) compared to other benchmarks.
Due to the large scale and natural distribution shifts, we use CLOC as the main dataset to study OCL.
3 PRELIMINARIES
In this section, we formalize the online continual learning problem and introduce our notation. We
then discuss the dataset and the metrics we use.
Problem definition.
Given the input domain $\mathcal{X}$ and the label space $\mathcal{Y}$, online continual learning learns a parametric function $f(\cdot\,|\,\theta): \mathcal{X} \rightarrow \mathcal{Y}$ that maps an input $x \in \mathcal{X}$ to the corresponding label $y \in \mathcal{Y}$. At each time step $t \in \{1, 2, \ldots, \infty\}$, the learner interacts with the environment as follows:
1. The environment samples the data $\{X_t, Y_t\} \sim \pi_t$ and reveals the inputs $X_t$ to the learner.
2. The learner predicts the label for each datum $x_t^{i_t} \in X_t$ via $\hat{y}_t^{i_t} = f(x_t^{i_t}\,|\,\theta_{t-1})$.
3. The environment reveals the true labels $Y_t$ and the learner integrates $\{X_t, Y_t\}$ into the data pool $S_t$ via $S_t = \mathrm{update}(S_{t-1}, \{X_t, Y_t\})$.
4. The learner updates the model $\theta_t$ using $S_t$.
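The interaction above can be read as a simple loop. The sketch below is our illustration of this protocol (not code released with the paper); `stream`, `model`, `update_pool`, and `update_model` are hypothetical placeholders.

```python
def run_ocl(stream, model, update_pool, update_model):
    """Illustrative OCL loop following steps 1-4 above."""
    data_pool = []                                    # S_0: empty history
    for t, (X_t, Y_t) in enumerate(stream, start=1):  # step 1: {X_t, Y_t} ~ pi_t
        # Step 2: predict labels for the revealed inputs using theta_{t-1}.
        Y_pred = [model.predict(x) for x in X_t]
        # Step 3: true labels are revealed; integrate them into the pool S_t.
        data_pool = update_pool(data_pool, list(zip(X_t, Y_t)))
        # Step 4: update the model to theta_t using S_t, subject to the
        # per-step computation budget of the online setting.
        model = update_model(model, data_pool)
        yield Y_pred                                  # predictions scored online
```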
Dataset.
Due to the large scale, the smooth data stream, and natural distribution shifts, we mainly use
the Continual Localization (CLOC) benchmark (Cai et al.,2021) to study OCL. CLOC formulates
image geolocalization as a classification problem. It contains 39 million labeled images for training, 39
thousand images for performance evaluation, and another 2 million images for initial hyperparameter
tuning. The images are ordered according to their time stamps, so that images taken earlier are seen
by the model first. Images for training and performance evaluation cover the same time span and are
sampled uniformly over time. Images for hyperparameter tuning are from a different time span that
does not overlap with the training and evaluation data. We refer to the 39 million training images as
the training data, the 39 thousand evaluation images as the evaluation data, and the 2 million images
for initial hyperparameter tuning as the preprocessing data.
Evaluation protocol. We evaluate OCL algorithms using the following metrics.
Learning Efficacy: For online learning applications, the model $\theta_t$ is deployed for inference on data from the next time step, i.e., $X_{t+1}$. In this case, we measure the ability to adapt to new data as
$$P_{LE}(t) = \frac{1}{t}\sum_{j=1}^{t} \mathrm{acc}(\{X_{j+1}, Y_{j+1}\}, \theta_j), \qquad (1)$$
where $\mathrm{acc}(S, \theta)$ is the average accuracy of the model $\theta$ on a set of data $S$.
Information Retention: When the data for inference comes from the history, we measure the ability to retain previous knowledge as
$$P_{IR}(t) = \mathrm{acc}\Big(\Big\{\bigcup_{j=1}^{t} X^E_j,\; \bigcup_{j=1}^{t} Y^E_j\Big\}, \theta_t\Big), \qquad (2)$$
where $\{X^E_j, Y^E_j\}$ is the evaluation data from time step $j$, which has the same distribution as $\{X_j, Y_j\}$ but is unseen by the model. $P_{IR}(t)$ measures the generalization of $\theta_t$ to historical data.
Forward Transfer: When long-term predictions are required, we measure the ability to generalize from the current time ($t$) to future time steps ($t+k_1$ to $t+k_2$, $k_2 > k_1 \geq 1$) as
$$P_{FT}(t) = \mathrm{acc}\Big(\Big\{\bigcup_{j=t+k_1}^{t+k_2} X^E_j,\; \bigcup_{j=t+k_1}^{t+k_2} Y^E_j\Big\}, \theta_t\Big). \qquad (3)$$
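For concreteness, the following sketch shows one way these three metrics could be computed. It is our illustration, assuming a hypothetical `accuracy(data, model)` helper, 1-indexed containers `stream[j]` (the online data of step j) and `eval_sets[j]` (the held-out evaluation data $\{X^E_j, Y^E_j\}$), and stored snapshots `models[j]` of $\theta_j$.

```python
def learning_efficacy(t, stream, models, accuracy):
    # P_LE(t): average accuracy of theta_j on the next incoming batch, Eq. (1).
    return sum(accuracy(stream[j + 1], models[j]) for j in range(1, t + 1)) / t

def information_retention(t, eval_sets, models, accuracy):
    # P_IR(t): accuracy of theta_t on all held-out data up to time t, Eq. (2).
    history = [pair for j in range(1, t + 1) for pair in eval_sets[j]]
    return accuracy(history, models[t])

def forward_transfer(t, k1, k2, eval_sets, models, accuracy):
    # P_FT(t): accuracy of theta_t on held-out data from t+k1 to t+k2, Eq. (3).
    future = [pair for j in range(t + k1, t + k2 + 1) for pair in eval_sets[j]]
    return accuracy(future, models[t])
```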
4 METHOD
We consider replay-based methods, where the common choice for continual learning is mixed replay (Chaudhry et al., 2019). The objective of mixed replay at time step $t$ is
$$\underset{\theta \in \mathbb{R}^d}{\mathrm{minimize}}\;\; l(\{X_t, Y_t\}, \theta) + l(S_{t-1}, \theta), \qquad (4)$$
where $\{X_t, Y_t\}$ is the data received at time $t$, $S_{t-1}$ is the historical data accumulated in the replay buffer, and $l(\cdot, \theta)$ is the loss function. Mixed replay constructs a minibatch of training data by sampling half from the current time step $t$ and the other half from the history (from $t - B_t$ to $t - 1$, where $B_t$ decides the time range). In offline continual learning (Chaudhry et al., 2019), $B_t$ is set to $t - 1$ for full coverage. Cai et al. (2021) adaptively choose $B_t$ to optimize learning efficacy in OCL.
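A minimal sketch of this minibatch construction follows (our illustration, not the paper's implementation; `pool[j]` is assumed to hold the data of step j, and $t > 1$ so the history window is non-empty).

```python
import random

def mixed_replay_batch(pool, t, B_t, batch_size):
    # Half of the batch from the current step t, half from the
    # history window [t - B_t, t - 1], following Eq. (4).
    current = random.choices(pool[t], k=batch_size // 2)
    window = [x for j in range(max(1, t - B_t), t) for x in pool[j]]
    past = random.choices(window, k=batch_size - batch_size // 2)
    return current + past
```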
In this paper, we focus on information retention, more specifically its optimization. A reasonable objective for pure information retention is pure replay, which is defined as follows at each time step $t$:
$$\underset{\theta \in \mathbb{R}^d}{\mathrm{minimize}}\;\; l(S_t, \theta). \qquad (5)$$
Pure replay constructs a minibatch of training data by sampling uniformly from all historical time steps, i.e., from $1$ to $t$, which encourages remembering all historical knowledge. As shown in Appx. C.6, optimizing (5) (as opposed to (4)) improves both information retention and forward transfer by sacrificing learning efficacy.
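The corresponding sketch for pure replay (again our illustration, with the same hypothetical `pool` layout) samples uniformly over the entire history, so every past step keeps contributing gradients:

```python
import random

def pure_replay_batch(pool, t, batch_size):
    # Sample uniformly from all data seen so far (steps 1..t), following Eq. (5).
    history = [x for j in range(1, t + 1) for x in pool[j]]
    return random.choices(history, k=batch_size)
```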
4.1 OPTIMIZING INFORMATION RETENTION
When the distribution of $S_t$ remains unchanged over time (i.e., the loss function is stationary), the standard optimizer for problem (5) is Stochastic Gradient Descent (SGD). This choice is typically carried over to continual learning. At each update iteration $k$ ($k \neq t$ if multiple updates are allowed at each time step of continual learning), SGD updates the model via
$$\theta_k \leftarrow \theta_{k-1} - \alpha_k g_k(\theta_{k-1}), \qquad (6)$$
where $\alpha_k$ is the learning rate and $g_k(\theta_{k-1})$ is the stochastic gradient of the objective at iteration $k$.
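Spelled out as code (our sketch, not the paper's implementation), one continual-learning step runs a small number of such updates, each on a replay minibatch; `grad` is a hypothetical stochastic-gradient routine and `pure_replay_batch` is the sampling sketch above.

```python
def sgd_time_step(theta, t, pool, loss_fn, grad, lr, num_updates, batch_size):
    # Several SGD iterations (Eq. 6) within one continual-learning step t,
    # limited by the per-step computation budget.
    for _ in range(num_updates):
        batch = pure_replay_batch(pool, t, batch_size)
        g = grad(loss_fn, theta, batch)                    # g_k(theta_{k-1})
        theta = [p - lr * gp for p, gp in zip(theta, g)]   # theta_k
    return theta
```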
Although the convergence behavior of SGD is well understood for the case of stationary losses, its implications for OCL are not clear. Hence, we extend the analysis of SGD (Ghadimi & Lan, 2013) to the continual learning case. Denote the loss function at iteration $k$ by $l_k(\theta)$ and assume:

A1 $\|\nabla l_k(\theta) - \nabla l_k(\theta')\| \leq L\|\theta - \theta'\|$, $\forall \theta, \theta' \in \mathbb{R}^d$ and $k$ (Lipschitz smoothness).
A2 $\mathbb{E}[g_k(\theta)] = \nabla l_k(\theta)$, $\forall \theta \in \mathbb{R}^d$ and $k$ (Unbiased gradient estimates).
A3 $\|\nabla l_k(\theta) - g_k(\theta)\|^2 \leq \rho^2$, $\forall \theta \in \mathbb{R}^d$ (Bounded gradient noise).
A4 $|l_{k+1}(\theta) - l_k(\theta)| \leq \chi_k$, $\forall \theta$ and $k$ (Bounded non-stationarity).

A1 to A3 extend the standard assumptions for SGD analysis (Ghadimi & Lan, 2013) from the stationary case to the non-stationary case. And A4 bounds the degree of non-stationarity between consecutive iterations. The expectations are taken over the randomness of the stochastic gradient noise. With these assumptions, we state the following theorem and defer its proof to Appx. A.2 (also see Appx. A.2 for details about how to interpret this result under limited storage):
Theorem 1. Assume A1 to A4, and let $\alpha_k < \frac{2}{L}$. Then
$$\min_{j \in \{0,1,\ldots,k\}} \mathbb{E}\big[\|\nabla l_{j+1}(\theta_j)\|^2\big] \leq T_1 + T_2 + T_3, \qquad (7)$$
where $T_1 = \frac{2\,(l_1(\theta_0) - \mathbb{E}[l_{k+2}(\theta_{k+1})])}{\sum_{j=0}^{k}(2\alpha_{j+1} - L\alpha_{j+1}^2)}$, $T_2 = \frac{L\rho^2 \sum_{j=0}^{k}\alpha_{j+1}^2}{\sum_{j=0}^{k}(2\alpha_{j+1} - L\alpha_{j+1}^2)}$, and $T_3 = \frac{2\sum_{j=0}^{k}\chi_{j+1}}{\sum_{j=0}^{k}(2\alpha_{j+1} - L\alpha_{j+1}^2)}$.
Theorem 1 bounds the minimum gradient norm achievable by SGD during OCL. A similar bound in the stationary case (Theorem 2 in Appx. A.1) has only the terms $T_1$ and $T_2$. Hence, $T_3$ is the cost of the non-stationarity. In the stationary case (when $T_3 = 0$), we can simply find the optimal learning rate by trading off between $T_1$ and $T_2$. Specifically, for a constant learning rate, $T_1$ converges to $0$ at a linear rate, but $T_2$ is constant. Hence, the strategy for convergence is reducing learning rates to control $T_2$ simultaneously with $T_1$. This result supports the effectiveness of the standard "reduce-when-plateau" (RWP) learning rate heuristic in the stationary case. Since access to the true gradient norm is not practical, RWP often uses the validation accuracy/loss as a surrogate to control the learning rate $\alpha_k$, where we reduce $\alpha_k$ by a certain factor $\beta$ once the validation accuracy/loss plateaus (Goodfellow et al., 2016; He et al., 2016).
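To make this trade-off concrete, consider the illustrative specialization (our example, not stated in the paper) of a constant learning rate $\alpha_{j+1} = \alpha$ for all $j$ in Theorem 1. The common denominator becomes $(k+1)(2\alpha - L\alpha^2)$, so
$$T_1 = \frac{2\,(l_1(\theta_0) - \mathbb{E}[l_{k+2}(\theta_{k+1})])}{(k+1)(2\alpha - L\alpha^2)} \;\to\; 0 \;\text{ as } k \to \infty, \qquad T_2 = \frac{L\rho^2 (k+1)\alpha^2}{(k+1)(2\alpha - L\alpha^2)} = \frac{L\rho^2 \alpha}{2 - L\alpha},$$
i.e., $T_1$ vanishes as more updates are performed while $T_2$ settles at a level set by $\alpha$, and shrinking $\alpha$ shrinks $T_2$, which is what RWP exploits. Under the same specialization, $T_3 = \frac{2\bar{\chi}}{2\alpha - L\alpha^2}$ with $\bar{\chi}$ the average of the $\chi_{j+1}$, which grows as $\alpha$ is reduced; this is the term analyzed next.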
In the continual learning setting (when $T_3 \neq 0$), RWP can be less effective. For example, if $\chi_k \geq \chi > 0$ for a constant $\chi$, i.e., the loss function $l_k(\cdot)$ changes at least at a linear rate, then $T_3$ can grow linearly when the learning rate is reduced to 0. Hence in OCL, there can be cases where