Schedule-Robust Online Continual Learning
Ruohan Wang*,1
wang_ruohan@i2r.a-star.edu.sg
Marco Ciccone*,2
marco.ciccone@polito.it
Giulia Luise3
g.luise16@ucl.ac.uk
Andrew Yapp5
e0360570@u.nus.edu
Massimiliano Pontil3,4
massimiliano.pontil@iit.it
Carlo Ciliberto3
c.ciliberto@ucl.ac.uk
Abstract
A continual learning (CL) algorithm learns from a non-stationary data stream. The non-
stationarity is modeled by some schedule that determines how data is presented over time. Most
current methods make strong assumptions on the schedule and have unpredictable performance
when such requirements are not met. A key challenge in CL is thus to design methods robust
against arbitrary schedules over the same underlying data, since in real-world scenarios schedules
are often unknown and dynamic. In this work, we introduce the notion of schedule-robustness for CL
and a novel approach satisfying this desirable property in the challenging online class-incremental
setting. We also present a new perspective on CL, as the process of learning a schedule-robust
predictor, followed by adapting the predictor using only replay data. Empirically, we demonstrate
that our approach outperforms existing methods on CL benchmarks for image classification by a
large margin.
1 Introduction
A hallmark of natural intelligence is its ability to continually absorb new knowledge while retaining
and updating existing knowledge. Achieving this objective in machines is the goal of continual learning (CL).
Ideally, CL algorithms learn online from a never-ending and non-stationary stream of data, without
catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; French, 1999).
The non-stationarity of the data stream is modeled by some schedule that defines what data
arrives and how its distribution evolves over time. Two families of schedules commonly investigated
are task-based (De Lange et al., 2021) and task-free (Aljundi et al., 2019a). The task-based setting
assumes that new data arrives one task at a time and that the data distribution is stationary within each task.
Many CL algorithms (e.g., Buzzega et al., 2020; Kirkpatrick et al., 2017; Hou et al., 2019) thus train
offline, with multiple passes and shuffles over task data. The task-free setting does not assume the
existence of separate tasks but instead expects CL algorithms to learn online from streaming data
with an evolving sample distribution (Caccia et al., 2022; Shanahan et al., 2021). In this work, we
*Equal Contribution.
1Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore.
2Politecnico di Torino, Torino, Italy (Work done while at University College London).
3Centre for Artificial Intelligence, Department of Computer Science, University College London, United Kingdom.
4Computational Statistics and Machine Learning Group, Istituto Italiano di Tecnologia, Genova, Italy.
5National University of Singapore (Work done while at A*STAR).
arXiv:2210.05561v2 [cs.LG] 14 Oct 2022
tackle the task-free setting with focus on class-incremental learning, where novel classes are observed
incrementally and a single predictor is trained to discriminate all of them (Rebuffi et al.,2017).
Existing works are typically designed for specific schedules, since explicitly modeling and evaluating
across all possible data schedules is intractable. Consequently, methods often have unpredictable
performance when scheduling assumptions fail to hold (Farquhar and Gal, 2018; Mundt et al., 2022;
Yoon et al., 2020). This is a considerable issue for practical applications, where the actual schedule is
either unknown or may differ from what these methods were designed for. This challenge calls for an
ideal notion of schedule-robustness: CL methods should behave consistently when trained on different
schedules over the same underlying data.
To achieve schedule-robustness, we introduce a new strategy based on a two-stage approach: 1)
learning online a schedule-robust predictor, followed by 2) adapting the predictor using only data from
experience replay (ER) (Chaudhry et al.,2019a). We will show that both stages are robust to diverse
data schedules, making the whole algorithm schedule-robust. We refer to it as SChedule-Robust
Online continuaL Learning (SCROLL). Specifically, we propose two online predictors that by design
are robust against arbitrary data schedules and catastrophic forgetting. To learn appropriate priors
for these predictors, we present a meta-learning perspective (Finn et al.,2017;Wang et al.,2021)
and connect it to the pre-training strategies in CL (Mehta et al.,2021). We show that pre-training
offers an alternative and efficient procedure for learning predictor priors instead of directly solving
the meta-learning formulation. This makes our method computationally competitive and at the same
time offers a clear justification for adopting pre-training in CL. Finally, we present effective routines
for adapting the predictors from the first stage. We show that using only ER data for this step is key
to preserving schedule-robustness, and discuss how to mitigate overfitting when ER data is limited.
Contributions.
1) We introduce the notion of schedule-robustness for CL and propose a novel online
approach satisfying this key property. 2) We present a meta-learning perspective on CL and connect
it to pre-training strategies in CL. 3) Empirically, we demonstrate SCROLL outperforms a number of
baselines by large margins, and highlight key properties of our method in ablation studies.
2 Preliminaries and Related Works
We formalize CL as learning from non-stationary data sequences. A data sequence consists of a
dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ regulated by a schedule $S = (\sigma, \beta)$. Applying the schedule $S$ to $D$ is
denoted by $S(D) \triangleq \beta(\sigma(D))$, where $\sigma(D)$ is a specific ordering of $D$, and $\beta(\sigma(D)) = \{B_t\}_{t=1}^{T}$ splits
the sequence $\sigma(D)$ into $T$ batches of samples $B_t = \{(x_{\sigma(i)}, y_{\sigma(i)})\}_{i=k_t}^{k_{t+1}}$, with $k_t$ the batch boundaries.
Intuitively, $\sigma$ determines the order in which $(x, y) \in D$ are observed, while $\beta$ determines how many
samples are observed at a time. Fig. 1 (Left) illustrates how the same dataset $D$ could be streamed
according to different schedules. For example, $S_1(D)$ in Fig. 1 (Left) depicts the standard schedule that
splits and streams $D$ in batches of $C$ classes at a time ($C = 2$).
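To make the schedule notation concrete, the following sketch (our own illustration, not code from the paper; the helper name `apply_schedule` and its arguments are our choices) applies a schedule $S = (\sigma, \beta)$ to a small labeled dataset:

```python
def apply_schedule(D, order, boundaries):
    """Apply schedule S = (sigma, beta) to dataset D.

    order:      a permutation of range(len(D)) -- plays the role of sigma
    boundaries: batch end indices k_1 < ... < k_T = len(D) -- plays the role of beta
    Returns the list of batches {B_t}_{t=1}^T.
    """
    seq = [D[i] for i in order]      # sigma(D): reorder the samples
    batches, start = [], 0
    for k in boundaries:             # beta: cut the ordered sequence into batches
        batches.append(seq[start:k])
        start = k
    return batches

# The same dataset D streamed under two different schedules
D = [(0.1, "a"), (0.2, "a"), (0.3, "b"), (0.4, "b")]
S1 = apply_schedule(D, order=[0, 1, 2, 3], boundaries=[2, 4])  # class-incremental
S2 = apply_schedule(D, order=[3, 0, 2, 1], boundaries=[1, 4])  # shuffled, uneven batches
```

Both schedules present exactly the samples of $D$, differing only in ordering and batch sizes, which is precisely the degree of freedom that schedule-robustness must cope with.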
2.1 Continual Learning
Figure 1: Left. Illustration of a classification dataset $D$ streamed according to different schedules (dashed
vertical lines identify separate batches). Right. Pre-training + the two stages of SCROLL: 1) learning online
and storing replay samples from the data stream, 2) adapting the predictor using the replay buffer (green
indicates whether the representation $\psi$ is being updated).

A CL algorithm learns from $S(D)$ one batch $B_t$ at a time, iteratively training a predictor $f_t : \mathcal{X} \to \mathcal{Y}$
to fit the observed data. Some formulations assume access to a fixed-size replay buffer $M$, which
mitigates forgetting by storing and reusing samples for future training. Given an initial predictor $f_0$
and an initial buffer $M_0$, we define the update rule of a CL algorithm $\mathrm{Alg}(\cdot)$ at step $t$ as

$$(f_t, M_t) = \mathrm{Alg}(B_t, f_{t-1}, M_{t-1}), \qquad (1)$$

where the algorithm learns from the current batch $B_t$ and updates both the replay buffer $M_{t-1}$ and
the predictor $f_{t-1}$ from the previous iteration.
At test time, the performance of the algorithm is evaluated on a distribution $\pi_D$ that samples
$(x, y)$ sharing the same labels with the samples in $D$. The generalization error is denoted by

$$\mathcal{L}(S(D), f_0, \mathrm{Alg}) = \mathbb{E}_{(x,y) \sim \pi_D} \, \ell(f_T(x), y), \qquad (2)$$

where the final predictor $f_T$ is recursively obtained from Eq. (1) and $f_0$.¹ In the following, we review
existing approaches for updating $f_t$ and $M_t$, and strategies for initializing $f_0$.
Predictor Update ($f_t$). Most CL methods learn $f_t$ by solving an optimization problem of the form:

$$f_t = \arg\min_f \; \underbrace{\alpha_1 \cdot \sum_{(x,y) \in B_t} \ell(f(x), y)}_{\text{current batch loss}} \; + \; \underbrace{\alpha_2 \cdot \sum_{(x,y) \in M_{t-1}} \ell(f(x), y)}_{\text{replay loss}} \; + \; \underbrace{\alpha_3 \cdot R(f, f_{t-1})}_{\text{regularization loss}} \qquad (3)$$

where $\alpha_{1,2,3}$ are prescribed non-negative weights, $\ell$ is a loss function, and $R$ is a regularizer. This general
formulation for updating $f_t$ recovers replay-based methods such as iCaRL (Rebuffi et al., 2017) and
DER (Buzzega et al., 2020) for specific choices of $\ell$ and $R$. Moreover, if $M_t = \emptyset$ (or we set $\alpha_2 = 0$),
the replay loss is omitted and we recover regularization-based methods (e.g., Kirkpatrick et al., 2017;
Li and Hoiem, 2017; Yu et al., 2020) that update selective parameters of $f_{t-1}$ to prevent forgetting.
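As a minimal illustration of the objective in Eq. (3) (our own sketch, not the paper's code: we instantiate $\ell$ as a squared loss on a linear predictor and $R$ as a squared distance to the previous weights), the three terms could be evaluated as:

```python
import numpy as np

def cl_objective(w, batch, buffer, w_prev, a1=1.0, a2=1.0, a3=0.1):
    """Value of Eq. (3) for a linear predictor f(x) = <w, x>."""
    sq = lambda v, data: sum((np.dot(v, x) - y) ** 2 for x, y in data)
    current = a1 * sq(w, batch)                # loss on the incoming batch B_t
    replay  = a2 * sq(w, buffer)               # loss on replay samples M_{t-1}
    reg     = a3 * np.sum((w - w_prev) ** 2)   # R(f, f_{t-1}): stay close to f_{t-1}
    return current + replay + reg

batch  = [(np.array([1.0, 0.0]), 1.0)]
buffer = [(np.array([0.0, 1.0]), -1.0)]
w_prev = np.zeros(2)
loss = cl_objective(np.zeros(2), batch, buffer, w_prev)  # 1 + 1 + 0 = 2.0
```

Setting `a2 = 0` (or passing an empty `buffer`) drops the replay term and leaves the regularization-based family, mirroring the discussion above.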
Replay Buffer Update ($M_t$). In replay-based methods, $\mathrm{Alg}(\cdot)$ must also define a buffering strategy
that decides which samples to store in $M_t$ for future reuse, and which ones to discard from a full
buffer. Common strategies include exemplar selection (Rebuffi et al., 2017) and random reservoir
sampling (Vitter, 1985), with more sophisticated methods like gradient-based sample selection (Aljundi
et al., 2019b) and meta-ER (Riemer et al., 2019). As in Eq. (3), replay-based methods typically mix
the current data with replay data for the predictor update (e.g., Aljundi et al., 2019b; Riemer et al., 2019;
Caccia et al., 2022). In contrast, Prabhu et al. (2020) learn the predictor using only replay data. We
will adopt a similar strategy and discuss how it is crucial for achieving schedule-robustness.

¹ We omitted the initial memory buffer $M_0$ since it is typically empty.
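Reservoir sampling (Vitter, 1985), mentioned above, maintains a uniform random subset of the stream in a fixed-size buffer without knowing the stream length in advance. A standard sketch (our own, not the paper's implementation):

```python
import random

def reservoir_update(buffer, sample, n_seen, capacity):
    """One step of reservoir sampling.

    n_seen counts samples observed so far, including the current one (1-based).
    After the stream ends, each observed sample sits in the buffer with equal
    probability capacity / n_seen.
    """
    if len(buffer) < capacity:
        buffer.append(sample)              # buffer not full yet: always store
    else:
        j = random.randrange(n_seen)       # uniform index in [0, n_seen)
        if j < capacity:
            buffer[j] = sample             # evict a uniformly chosen stored sample
    return buffer

buffer, capacity = [], 5
for n, sample in enumerate(range(100), start=1):
    reservoir_update(buffer, sample, n, capacity)
# buffer now holds 5 of the 100 stream samples, each kept with equal probability
```

Because the retention probability depends only on the stream position, the buffer contents are statistically insensitive to how the schedule groups samples into batches.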
Predictor Initialization ($f_0$). The initial predictor $f_0$ in Eq. (2) represents the prior knowledge
available to CL algorithms before learning from the sequence $S(D)$. Most methods are designed for
a randomly initialized $f_0$ with no prior knowledge (e.g., Rebuffi et al., 2017; Gupta et al., 2020; Prabhu
et al., 2020; Kirkpatrick et al., 2017). However, this assumption may be overly restrictive for several
applications (e.g., vision-based tasks like image classification), where available domain knowledge and
data can endow CL algorithms with more informative priors than random initialization. We review
two strategies for predictor initialization relevant to this work.

Initialization by Pre-training. One way to initialize $f_0$ is to pre-train a representation on
data related to the CL task (e.g., ImageNet for vision-based tasks) via either self-supervised learning
(Shanahan et al., 2021) or multi-class classification (Mehta et al., 2021; Wang et al., 2022; Wu
et al., 2022). Boschini et al. (2022) observed that while pre-training mitigates forgetting, model
updates quickly drift the current $f_t$ away from $f_0$, diminishing the benefits of prior knowledge as CL
algorithms continuously learn from more data. To mitigate this, Shanahan et al. (2021) and Wang et al.
(2022) keep the pre-trained representation fixed while introducing additional parameters for learning
the sequence. In this work, we will offer effective routines for updating the pre-trained representation
that significantly improve test performance.
Initialization by Meta-Learning. Another approach to initializing $f_0$ is meta-learning (Hospedales
et al., 2021). Given the CL generalization error in Eq. (2), we may learn $f_0$ by solving the meta-CL
problem below,

$$f_0 = \arg\min_f \; \mathbb{E}_{(D, S) \sim \mathcal{T}} \; \mathcal{L}(S(D), f, \mathrm{Alg}), \qquad (4)$$

where $\mathcal{T}$ is a meta-distribution over datasets $D$ and schedules $S$. For instance, Javed and White
(2019) set $\mathrm{Alg}(\cdot)$ to be MAML (Finn et al., 2017) and observed that the learned $f_0$ encodes sparse
representations to mitigate forgetting. However, directly optimizing Eq. (4) is computationally
expensive, since the cost of gradient computation scales with the size of $D$. To overcome this, we will
leverage Wang et al. (2021) to show that meta-learning $f_0$ is analogous to pre-training for certain
predictors, which provides a much more efficient procedure to learn $f_0$ without directly solving Eq. (4).
2.2 Schedule-Robustness
The performance of many existing CL methods implicitly depends on the data schedule, leading to
unpredictable behavior when scheduling requirements are not met (Farquhar and Gal, 2018; Yoon et al.,
2020; Mundt et al., 2022). To tackle this challenge, we introduce the notion of schedule-robustness for
CL. Given a dataset $D$, we say that a CL algorithm is schedule-robust if

$$\mathcal{L}(S_1(D), f_0, \mathrm{Alg}) \approx \mathcal{L}(S_2(D), f_0, \mathrm{Alg}), \qquad \forall \, S_1, S_2 \text{ schedules.} \qquad (5)$$

Eq. (5) captures the idea that CL algorithms should perform consistently across arbitrary schedules
over the same dataset $D$. We argue that achieving robustness to different data schedules is a key
challenge in real-world scenarios, where data schedules are often unknown and possibly dynamic. CL
algorithms should thus carefully satisfy Eq. (5) for safe deployment.
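Eq. (5) can be checked empirically by running the same algorithm on two schedules of one dataset and comparing the resulting predictors or errors. A toy sketch (our own illustration, not the paper's experiments) with an order-invariant class-mean learner:

```python
def train_on_schedule(batches):
    """Fit per-class means online, one batch at a time (order-invariant)."""
    sums, counts = {}, {}
    for batch in batches:
        for x, y in batch:
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

D = [(0.0, "a"), (1.0, "a"), (4.0, "b"), (5.0, "b")]
S1 = [D[:2], D[2:]]                        # class-incremental schedule
S2 = [[D[3]], [D[0], D[2]], [D[1]]]        # shuffled schedule, uneven batches
f1, f2 = train_on_schedule(S1), train_on_schedule(S2)
# schedule-robust: both schedules yield the same predictor, hence equal test error
```

A gradient-trained predictor would generally fail this check, since its final weights depend on the order and batching of updates; that gap is exactly what the definition in Eq. (5) isolates.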
We note that schedule-robustness is a more general and stronger notion than order-robustness from
Yoon et al. (2020). Our definition applies to online task-free CL, while order-robustness only considers
the offline task-based setting. We also allow arbitrary ordering of individual samples instead of task-level
ordering. In the following, we present our method and show how it achieves schedule-robustness.
3 Method
We present SChedule-Robust Online continuaL Learning (SCROLL) as a two-stage process: 1)
learning online a schedule-robust predictor for CL, followed by 2) adapting the predictor using only
replay data. In the first stage, we consider two schedule-robust predictors and discuss how to initialize
them, motivated by the meta-CL perspective introduced in Eq. (4). In the second stage, we tackle
how to adapt the predictors from the first stage with ER and the buffering strategy. We will show
that SCROLL satisfies schedule-robustness by construction, given that optimizing a CL algorithm
against all possible schedules as formulated in Eq. (5) is clearly intractable.
3.1 Schedule-robust Online Predictor
We model the predictors introduced in Eq. (1) as the composition $f = \phi \circ \psi$ of a feature extractor
$\psi : \mathcal{X} \to \mathbb{R}^m$ and a classifier $\phi : \mathbb{R}^m \to \mathcal{Y}$. In line with recent meta-learning strategies (e.g., Bertinetto
et al., 2019; Raghu et al., 2020), we keep $\psi$ fixed during our method's first stage while adapting only
the classifier $\phi$ to learn from data streams. We will discuss how to learn $\psi$ in Sec. 3.2.

A key observation is that some choices of $\phi$, such as the Nearest Centroid Classifier (NCC) (Salakhutdinov
and Hinton, 2007) and Ridge Regression (Kailath et al., 2000), are schedule-robust by design.
Nearest Centroid Classifier (NCC). NCC classifies a sample $x$ by comparing it to the learned
"prototypes" $c_y$ for each class $X_y \triangleq \{x \mid (x, y) \in D\}$ in the dataset,

$$f(x) = \arg\min_y \; \|\psi(x) - c_y\|_2^2 \quad \text{where} \quad c_y = \frac{1}{n_y} \sum_{x \in X_y} \psi(x). \qquad (6)$$

The prototypes $c_y$ can be learned online: given a new $(x, y)$, we update only the corresponding $c_y$ as

$$c_y^{\mathrm{new}} = \frac{n_y \cdot c_y^{\mathrm{old}} + \psi(x)}{n_y + 1}, \qquad (7)$$

where $n_y$ is the number of observed samples for class $y$ so far. We note that the prototype for each
class is invariant to any ordering of $D$ once the dataset has been fully observed. Since $\psi$ is fixed, the
resulting predictor in Eq. (6) is schedule-robust and online. Further, $f$ is unaffected by catastrophic
forgetting by construction, as keeping $\psi$ fixed prevents forgetting, while learning each prototype
independently mitigates cross-class interference (Hou et al., 2019).
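The online NCC of Eqs. (6)-(7) can be sketched as follows (our own illustration; `psi` stands for the fixed feature extractor, here simply the identity):

```python
import numpy as np

class OnlineNCC:
    """Nearest Centroid Classifier with the online prototype update of Eq. (7)."""

    def __init__(self, psi=lambda x: x):
        self.psi = psi      # fixed feature extractor: never updated in stage one
        self.c = {}         # class prototypes c_y
        self.n = {}         # per-class sample counts n_y

    def update(self, x, y):
        z = np.asarray(self.psi(x), dtype=float)
        n = self.n.get(y, 0)
        c = self.c.get(y, np.zeros_like(z))
        self.c[y] = (n * c + z) / (n + 1)   # Eq. (7): running mean of features
        self.n[y] = n + 1

    def predict(self, x):
        z = np.asarray(self.psi(x), dtype=float)
        # Eq. (6): assign to the nearest prototype in feature space
        return min(self.c, key=lambda y: np.sum((z - self.c[y]) ** 2))

ncc = OnlineNCC()
for x, y in [([0.0, 0.0], "a"), ([1.0, 1.0], "a"), ([5.0, 5.0], "b")]:
    ncc.update(x, y)
# the prototype for "a" is [0.5, 0.5] regardless of the order samples arrived in
```

Since each `update` touches only the prototype of the observed class, and running means are permutation-invariant, the final predictor is identical under any schedule over the same data.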
Ridge Regression. Ridge Regression enjoys the same desirable properties as NCC, being schedule-robust
and unaffected by forgetting (see Appendix A for a full derivation). Let $\mathrm{OneHot}(y)$ denote the
one-hot encoding of class $y$; we obtain the predictor $f$ as the one-vs-all classifier

$$f(x) = \arg\max_y \; w_y^\top \psi(x), \quad \text{where} \quad W = [w_1 \ldots w_K] = \arg\min_W \; \frac{1}{|D|} \sum_{(x,y) \in D} \left\| W^\top \psi(x) - \mathrm{OneHot}(y) \right\|^2 + \lambda \|W\|^2. \qquad (8)$$