sampling (Vitter, 1985), with more sophisticated methods like gradient-based sample selection (Aljundi et al., 2019b) and meta-ER (Riemer et al., 2019). As in Eq. (3), replay-based methods typically mix the current data with replay data for the predictor update (e.g., Aljundi et al., 2019b; Riemer et al., 2019; Caccia et al., 2022). In contrast, Prabhu et al. (2020) learn the predictor using only replay data. We will adopt a similar strategy and discuss how it is crucial for achieving schedule-robustness.
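To make the buffer-management step concrete, reservoir sampling (Vitter, 1985) keeps a fixed-size buffer such that, after t observed examples, each example is retained with probability m/t for a buffer of size m. The sketch below is purely illustrative; the class and method names are ours and do not come from any of the cited works:

```python
import random


class ReservoirBuffer:
    """Fixed-size replay buffer maintained by reservoir sampling:
    after t observed examples, each is in the buffer with probability m/t."""

    def __init__(self, size):
        self.size = size
        self.buffer = []
        self.num_seen = 0

    def add(self, example):
        self.num_seen += 1
        if len(self.buffer) < self.size:
            self.buffer.append(example)
        else:
            # Replace a uniformly chosen slot with probability size / num_seen.
            idx = random.randrange(self.num_seen)
            if idx < self.size:
                self.buffer[idx] = example

    def sample(self, k):
        # Draw a replay mini-batch without replacement.
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

Under this sketch, a predictor update in the style of Prabhu et al. (2020) would train on `sample(k)` alone, whereas Eq. (3)-style methods would concatenate the replay batch with the current data.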
Predictor Initialization (f0). The initial predictor f0 in Eq. (2) represents the prior knowledge available to CL algorithms before learning from the sequence S(D). Most methods are designed for a randomly initialized f0 with no prior knowledge (e.g., Rebuffi et al., 2017; Gupta et al., 2020; Prabhu et al., 2020; Kirkpatrick et al., 2017). However, this assumption may be overly restrictive for several applications (e.g., vision-based tasks like image classification), where available domain knowledge and data can endow CL algorithms with more informative priors than random initialization. We review two strategies for predictor initialization relevant to this work.
Initialization by Pre-training. One way to initialize f0 is to pre-train a representation on data related to the CL task (e.g., ImageNet for vision-based tasks) via either self-supervised learning (Shanahan et al., 2021) or multi-class classification (Mehta et al., 2021; Wang et al., 2022; Wu et al., 2022). Boschini et al. (2022) observed that while pre-training mitigates forgetting, model updates quickly drift the current f_t away from f0, diminishing the benefits of prior knowledge as CL algorithms continuously learn from more data. To mitigate this, Shanahan et al. (2021) and Wang et al. (2022) keep the pre-trained representation fixed while introducing additional parameters for learning the sequence. In this work, we will offer effective routines for updating the pre-trained representation that significantly improve test performance.
Initialization by Meta-Learning. Another approach to initializing f0 is meta-learning (Hospedales et al., 2021). Given the CL generalization error in Eq. (2), we may learn f0 by solving the meta-CL problem below,

f0 = arg min_f  E_{(D,S)∼T} L(S(D), f, Alg),   (4)

where T is a meta-distribution over datasets D and schedules S. For instance, Javed and White (2019) set Alg(·) to be MAML (Finn et al., 2017) and observed that the learned f0 encodes sparse representations that mitigate forgetting. However, directly optimizing Eq. (4) is computationally expensive, since the cost of gradient computation scales with the size of D. To overcome this, we will leverage Wang et al. (2021) to show that meta-learning f0 is analogous to pre-training for certain predictors, which provides a much more efficient procedure for learning f0 without directly solving Eq. (4).
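To make Eq. (4) concrete, its outer expectation can be estimated by Monte-Carlo sampling of (dataset, schedule) pairs. The sketch below is purely illustrative: `tasks`, `inner_alg` (standing in for Alg), and `eval_loss` (standing in for L) are hypothetical placeholders, and differentiating through `inner_alg` is precisely the step whose cost scales with the size of D:

```python
import random


def meta_cl_loss(f0, tasks, inner_alg, eval_loss, num_samples=4):
    """Monte-Carlo estimate of the meta-CL objective in Eq. (4):
    sample (dataset, schedule) pairs from the meta-distribution T,
    run the CL algorithm from the shared initialization f0, and
    average the resulting generalization errors."""
    total = 0.0
    for _ in range(num_samples):
        dataset, schedule = random.choice(tasks)    # (D, S) ~ T
        f_final = inner_alg(f0, schedule(dataset))  # run Alg on S(D) from f0
        total += eval_loss(f_final, dataset)        # L(S(D), f0, Alg)
    return total / num_samples
```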
2.2 Schedule-Robustness
The performance of many existing CL methods implicitly depends on the data schedule, leading to unpredictable behavior when their scheduling assumptions are not met (Farquhar and Gal, 2018; Yoon et al., 2020; Mundt et al., 2022). To tackle this challenge, we introduce the notion of schedule-robustness for CL. Given a dataset D, we say that a CL algorithm is schedule-robust if

L(S1(D), f0, Alg) ≈ L(S2(D), f0, Alg)   for all schedules S1, S2.   (5)

Eq. (5) captures the idea that CL algorithms should perform consistently across arbitrary schedules over the same dataset D. We argue that achieving robustness to different data schedules is a key challenge in real-world scenarios, where data schedules are often unknown and possibly dynamic. CL algorithms should therefore satisfy Eq. (5) for safe deployment.
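Empirically, the condition in Eq. (5) can be probed by running an algorithm from the same initialization on several random orderings of one dataset and measuring the spread of the resulting losses. The sketch below is a hypothetical diagnostic of ours, not a procedure from the paper; `alg` and `eval_loss` stand in for Alg and L:

```python
import random


def schedule_robustness_gap(dataset, f0, alg, eval_loss,
                            num_schedules=5, seed=0):
    """Empirical probe of Eq. (5): run `alg` from the same initialization f0
    on several random orderings (schedules) of the same dataset and report
    max - min of the resulting losses. A schedule-robust algorithm should
    yield a gap close to zero."""
    rng = random.Random(seed)
    losses = []
    for _ in range(num_schedules):
        schedule = list(dataset)
        rng.shuffle(schedule)        # a schedule S over D
        f_final = alg(f0, schedule)  # run Alg on S(D) from f0
        losses.append(eval_loss(f_final))
    return max(losses) - min(losses)
```

An order-invariant algorithm (e.g., one whose update commutes over examples) attains a gap of exactly zero; order-sensitive algorithms generally do not.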