
This underfitting would worsen over time as the distribution continues to shift. We formalize this
problem and further show that there is no straightforward way to control this trade-off; the issue
persists even when common adaptive learning rate heuristics are applied.
Orthogonal to continual learning, one recently proposed remedy to guarantee SGD convergence with
high learning rates is to use the moving average of SGD iterates (Mandt et al., 2016; Tarvainen &
Valpola, 2017). Informally, SGD with large learning rates bounces around the optimum; averaging
its trajectory dampens the bouncing and tracks the optimum more closely (Mandt et al., 2016). We apply
these ideas to OCL for the first time to improve information retention.
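As a toy illustration of this averaging effect (illustrative only: the quadratic objective, learning rate, and noise scale are our own choices, not the paper's setup), constant-step SGD on a noisy quadratic keeps oscillating around the optimum, while the running mean of its iterates settles much closer to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimize f(w) = 0.5 * w^2 with noisy gradients and a large, constant
# learning rate. The raw SGD iterate keeps bouncing around the optimum
# at w = 0; the running mean of the iterates averages out the bouncing.
lr = 0.5
w = 5.0
avg = 0.0
for t in range(1, 2001):
    grad = w + rng.normal(scale=1.0)  # noisy gradient of 0.5 * w^2
    w -= lr * grad
    avg += (w - avg) / t              # running mean of SGD iterates

# The averaged iterate is typically far closer to the optimum than w.
print(abs(w), abs(avg))
```

On a non-stationary stream, the appeal of this combination is that the large-step SGD iterate can still chase a moving optimum while the average supplies a better-converged model.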
To summarize, we theoretically analyze the behavior of SGD for OCL. Following this analysis, we
propose a moving average strategy to optimize information retention. Our method uses SGD with
large learning rates to adapt to non-stationarity, and utilizes the average of SGD iterates for better
convergence. We propose an adaptive moving average (AMA) algorithm to control the moving
average weight over time. Based on the statistics of the SGD and AMA models, we further propose a
moving-average-based learning rate schedule (MALR) to better control the learning rate. Experiments
on Continual Localization (CLOC) (Cai et al., 2021), Google Landmarks (Weyand et al., 2020), and
ImageNet (Deng et al., 2009) demonstrate superior information retention and long-term transfer for
large-scale OCL.
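The basic ingredient behind such a strategy is an exponential moving average of the model weights. The sketch below uses our own notation with a fixed `alpha`; the adaptive, time-varying control of this weight (AMA) and the derived learning rate schedule (MALR) are not shown:

```python
import numpy as np

def update_moving_average(ma_params, sgd_params, alpha=0.01):
    """Exponential moving average of model weights:
    ma <- (1 - alpha) * ma + alpha * sgd.
    The adaptive choice of alpha over time (AMA) is beyond this sketch."""
    for name, p in sgd_params.items():
        ma_params[name] = (1 - alpha) * ma_params[name] + alpha * p

# Tiny example with a single weight tensor.
sgd = {"w": np.array([2.0, -1.0])}
ma = {"w": np.array([0.0, 0.0])}
update_moving_average(ma, sgd, alpha=0.5)
# ma["w"] is now [1.0, -0.5]
```

After each SGD step on the stream, the averaged model is updated in this way and used for evaluation, while the SGD model continues to adapt with a large learning rate.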
2 RELATED WORK
Optimization in OCL.
OCL methods typically focus on improving learning efficacy. Cai et al. (2021)
proposed several strategies, including adaptive learning rates, adaptive replay buffer sizes, and small
batch sizes. Hu et al. (2020) proposed a new optimizer, ConGrad, which, at each time step, adaptively
controls the number of online gradient descent steps (Hazan, 2019) to balance generalization and
training loss reduction. Our work instead focuses on information retention and proposes a new
optimizer and learning rate schedule that trade off learning efficacy to improve long-term transfer.
In terms of replay buffer strategies, mixed replay (Chaudhry et al., 2019), originating from offline
continual learning, forms a minibatch by sampling half of the data from the online stream and the
other half from the history. It has been applied in OCL to optimize learning efficacy (Cai et al., 2021).
Our work uses pure replay instead to optimize information retention, where a minibatch is formed by
sampling uniformly from all history.
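The two sampling schemes can be sketched as follows; the function names and batch sizes are our own, and plain lists stand in for the stream and the replay storage:

```python
import random

def mixed_replay_batch(stream, history, batch_size, rng):
    """Mixed replay: half of the minibatch from the current online
    stream, half sampled from the replay history."""
    half = batch_size // 2
    online = stream[:half]
    replayed = [rng.choice(history) for _ in range(batch_size - half)]
    return online + replayed

def pure_replay_batch(history, batch_size, rng):
    """Pure replay: the entire minibatch is sampled uniformly from
    all history (the scheme used here for information retention)."""
    return [rng.choice(history) for _ in range(batch_size)]

rng = random.Random(0)
history = list(range(100))       # indices of all past samples
stream = list(range(100, 110))   # newest samples from the online stream
mixed = mixed_replay_batch(stream, history, 8, rng)
pure = pure_replay_batch(history, 8, rng)
```

Mixed replay biases each minibatch toward the newest data, which helps learning efficacy; sampling uniformly from all history instead weights past and present equally, which is the behavior wanted for retention.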
Continual learning algorithms.
We focus on the optimization aspect of OCL. Other aspects, such
as data integration (Aljundi et al., 2019b) and the sampling procedure of the replay buffer (Aljundi
et al., 2019a; Chrysakis & Moens, 2020), are complementary and orthogonal to our study. These
aspects are critical for a successful OCL strategy and can potentially be used in conjunction with our
optimizers. Offline continual learning (Li & Hoiem, 2017; Kirkpatrick et al., 2017) aims to improve
information retention with limited storage. Unlike the online setting, SGD works in this case since
the model can be retrained until convergence at each time step. We refer readers to Delange et al.
(2021) for a detailed survey of offline continual learning algorithms.
Moving average in optimization.
We propose a new moving-average-based optimizer for OCL.
Although we are the first to apply this idea to OCL, moving average optimizers have been widely
utilized for convex (Ruppert, 1988; Polyak & Juditsky, 1992) and non-convex optimization (Izmailov
et al., 2018; Maddox et al., 2019; He et al., 2020). Beyond supervised learning (Izmailov et al., 2018;
Maddox et al., 2019), the moving average model has also been used as a teacher of the SGD model
in semi-supervised (Tarvainen & Valpola, 2017) and self-supervised learning (He et al., 2020). The
moving average of stochastic gradients (rather than model weights) has also been widely used in
Adam-based optimizers (Kingma & Ba, 2014).
Continual learning benchmarks.
We need a large-scale and realistic benchmark to evaluate OCL.
For language modeling, Hu et al. (2020) created the Firehose benchmark using a large stream of
Twitter posts. The task of Firehose is continual per-user tweet prediction, which is self-supervised
and multi-task. For visual recognition, Lin et al. (2021) created the CLEAR benchmark by manually
labeling images on a subset of YFCC100M (Thomee et al., 2016). Though the images are ordered
in time, the number of labeled images is small (33K). Cai et al. (2021) proposed the continual
localization (CLOC) benchmark using a subset of YFCC100M with time stamps and geographic
locations. The task of CLOC is geolocalization, which is formulated as image classification. CLOC