sampling (Vitter, 1985), with more sophisticated methods like gradient-based sample selection (Aljundi et al., 2019b) and meta-ER (Riemer et al., 2019). As in Eq. (3), replay-based methods typically mix the current data with replay data for the predictor update (e.g., Aljundi et al., 2019b; Riemer et al., 2019; Caccia et al., 2022). In contrast, Prabhu et al. (2020) learn the predictor using only replay data. We will adopt a similar strategy and discuss how it is crucial for achieving schedule-robustness.
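To make the buffer-management step concrete, reservoir sampling (Vitter, 1985) keeps a fixed-size buffer such that, after t observed examples, each example is retained with probability m/t for a buffer of size m. The sketch below is purely illustrative; the class and method names are ours and do not come from any of the cited works:

```python
import random


class ReservoirBuffer:
    """Fixed-size replay buffer maintained by reservoir sampling:
    after t observed examples, each is in the buffer with probability m/t."""

    def __init__(self, size):
        self.size = size
        self.buffer = []
        self.num_seen = 0

    def add(self, example):
        self.num_seen += 1
        if len(self.buffer) < self.size:
            self.buffer.append(example)
        else:
            # Replace a uniformly chosen slot with probability size / num_seen.
            idx = random.randrange(self.num_seen)
            if idx < self.size:
                self.buffer[idx] = example

    def sample(self, k):
        # Draw a replay mini-batch without replacement.
        return random.sample(self.buffer, min(k, len(self.buffer)))
```

Under this sketch, a predictor update in the style of Prabhu et al. (2020) would train on `sample(k)` alone, whereas Eq. (3)-style methods would concatenate the replay batch with the current data.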
Predictor Initialization (f0). The initial predictor f0 in Eq. (2) represents the prior knowledge available to CL algorithms before learning from the sequence S(D). Most methods are designed for a randomly initialized f0 with no prior knowledge (e.g., Rebuffi et al., 2017; Gupta et al., 2020; Prabhu et al., 2020; Kirkpatrick et al., 2017). However, this assumption may be overly restrictive for several applications (e.g., vision-based tasks like image classification), where available domain knowledge and data can endow CL algorithms with more informative priors than random initialization. We review two strategies for predictor initialization relevant to this work.
Initialization by Pre-training. One way to initialize f0 is to pre-train a representation on data related to the CL task (e.g., ImageNet for vision-based tasks) via either self-supervised learning (Shanahan et al., 2021) or multi-class classification (Mehta et al., 2021; Wang et al., 2022; Wu et al., 2022). Boschini et al. (2022) observed that while pre-training mitigates forgetting, model updates quickly drift the current f_t away from f0, diminishing the benefits of prior knowledge as CL algorithms continuously learn from more data. To mitigate this, Shanahan et al. (2021) and Wang et al. (2022) keep the pre-trained representation fixed while introducing additional parameters for learning the sequence. In this work, we will offer effective routines for updating the pre-trained representation that significantly improve test performance.
Initialization by Meta-Learning. Another approach to initializing f0 is meta-learning (Hospedales et al., 2021). Given the CL generalization error in Eq. (2), we may learn f0 by solving the meta-CL problem below,

f0 = arg min_f  E_{(D,S)∼T} L(S(D), f, Alg),   (4)

where T is a meta-distribution over datasets D and schedules S. For instance, Javed and White (2019) set Alg(·) to be MAML (Finn et al., 2017) and observed that the learned f0 encodes sparse representations that mitigate forgetting. However, directly optimizing Eq. (4) is computationally expensive, since the cost of gradient computation scales with the size of D. To overcome this, we will leverage Wang et al. (2021) to show that meta-learning f0 is analogous to pre-training for certain predictors, which provides a much more efficient procedure for learning f0 without directly solving Eq. (4).
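To make Eq. (4) concrete, its outer expectation can be estimated by Monte-Carlo sampling of (dataset, schedule) pairs. The sketch below is purely illustrative: `tasks`, `inner_alg` (standing in for Alg), and `eval_loss` (standing in for L) are hypothetical placeholders, and differentiating through `inner_alg` is precisely the step whose cost scales with the size of D:

```python
import random


def meta_cl_loss(f0, tasks, inner_alg, eval_loss, num_samples=4):
    """Monte-Carlo estimate of the meta-CL objective in Eq. (4):
    sample (dataset, schedule) pairs from the meta-distribution T,
    run the CL algorithm from the shared initialization f0, and
    average the resulting generalization errors."""
    total = 0.0
    for _ in range(num_samples):
        dataset, schedule = random.choice(tasks)    # (D, S) ~ T
        f_final = inner_alg(f0, schedule(dataset))  # run Alg on S(D) from f0
        total += eval_loss(f_final, dataset)        # L(S(D), f0, Alg)
    return total / num_samples
```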
2.2 Schedule-Robustness
The performance of many existing CL methods implicitly depends on the data schedule, leading to unpredictable behavior when their scheduling assumptions are not met (Farquhar and Gal, 2018; Yoon et al., 2020; Mundt et al., 2022). To tackle this challenge, we introduce the notion of schedule-robustness for CL. Given a dataset D, we say that a CL algorithm is schedule-robust if

L(S1(D), f0, Alg) ≈ L(S2(D), f0, Alg)   for all schedules S1, S2.   (5)

Eq. (5) captures the idea that CL algorithms should perform consistently across arbitrary schedules over the same dataset D. We argue that achieving robustness to different data schedules is a key challenge in real-world scenarios, where data schedules are often unknown and possibly dynamic. CL algorithms should therefore satisfy Eq. (5) for safe deployment.
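Empirically, the condition in Eq. (5) can be probed by running an algorithm from the same initialization on several random orderings of one dataset and measuring the spread of the resulting losses. The sketch below is a hypothetical diagnostic of ours, not a procedure from the paper; `alg` and `eval_loss` stand in for Alg and L:

```python
import random


def schedule_robustness_gap(dataset, f0, alg, eval_loss,
                            num_schedules=5, seed=0):
    """Empirical probe of Eq. (5): run `alg` from the same initialization f0
    on several random orderings (schedules) of the same dataset and report
    max - min of the resulting losses. A schedule-robust algorithm should
    yield a gap close to zero."""
    rng = random.Random(seed)
    losses = []
    for _ in range(num_schedules):
        schedule = list(dataset)
        rng.shuffle(schedule)        # a schedule S over D
        f_final = alg(f0, schedule)  # run Alg on S(D) from f0
        losses.append(eval_loss(f_final))
    return max(losses) - min(losses)
```

An order-invariant algorithm (e.g., one whose update commutes over examples) attains a gap of exactly zero; order-sensitive algorithms generally do not.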