regularization, parameter isolation, and replay methods.
Regularization methods [1,22,45] mitigate forgetting by preventing the learned parameters from deviating too far from those of previous tasks. Parameter isolation methods counter forgetting entirely by dedicating a non-overlapping set of parameters to each task [35,42]. Replay methods alleviate forgetting by either storing previous instances [14,17,24,33] or generating pseudo-instances [12,36,37] on the fly for replay.
While the aforementioned approaches all show promising results in different CL scenarios, we specifically explore regularization- and memory replay-based methods, given their popularity in the recent literature, and study how pre-trained models behave under these methods.
Continual Learning with Pre-trained Models. While
most of the CL work investigates training the learner from
scratch [4,14,22,24,26,33,45], there is also some work that
initializes the learner from a pre-trained model [3,9,11,20,
29,30]. They harness pre-trained models for reasons such as coping with data scarcity in the downstream task [9] and simulating the prior knowledge of the continual learner [20].
However, they do not 1) systematically demonstrate the substantial benefits of pre-trained models over models trained from scratch, 2) investigate different types of pre-trained models or fine-tuning strategies, or 3) examine pre-trained models under different CL scenarios (incremental and online learning). Note that we do not claim to be the first to apply pre-trained models to CL; rather, we study the aforementioned aspects comprehensively.
3. Methodology
We mainly focus on online class incremental learning
(CIL), which is formally defined in Sec. 3.1. Next, we discuss various pre-trained models and how we leverage them
(Sec. 3.2). In Sec. 3.3, we introduce the two-stage training
pipeline that combines online training and offline training.
3.1. Problem Formulation
The most widely adopted continual learning scenarios are 1) task incremental learning, 2) domain incremental learning, and 3) class incremental learning (CIL). Among them, CIL is the most challenging and draws the most attention because it more closely resembles real-world scenarios: the model is required to make predictions over all classes seen so far, with no task identifiers given (we refer interested readers to [38] for more details). In this paper, we focus on an even more difficult scenario, online CIL, where the model has access to each sample only once unless it is stored in a replay buffer. In other words, the model cannot iterate over the data of the current task for multiple epochs, as is common in standard CIL. Unless noted otherwise, the experiments in the remainder of the paper are based on online CIL (we also evaluate standard CIL).
Formally, we define the problem as follows. $C_{\text{total}}$ classes are split into $N$ tasks, and each task $t$ contains $C_t$ non-overlapping classes (e.g., CIFAR100 is divided into 20 tasks, with each task containing 5 unique classes). The model is presented with the $N$ tasks sequentially. Each task is a data stream $\{\mathcal{S}_t \mid 0 < t \leq N\}$ that presents a mini-batch of samples to the model at a time. Each sample belongs to one of the $C_t$ unique classes, which do not overlap with the classes of other tasks. We explore pre-trained models with $N = 20$ tasks and present the results in Sec. 5.2.
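To make the protocol concrete, the following minimal sketch (our illustration, not the authors' implementation; it assumes the dataset is given as a list of (image, label) pairs) constructs such an online CIL stream, where each mini-batch is presented exactly once:

    import random

    def make_online_cil_stream(dataset, num_tasks=20, batch_size=10, seed=0):
        # Randomly split all classes into equally sized, non-overlapping tasks.
        classes = sorted({y for _, y in dataset})
        rng = random.Random(seed)
        rng.shuffle(classes)
        per_task = len(classes) // num_tasks
        for t in range(num_tasks):
            task_classes = set(classes[t * per_task:(t + 1) * per_task])
            task_data = [(x, y) for x, y in dataset if y in task_classes]
            rng.shuffle(task_data)
            # Online CIL: yield each mini-batch once; no second epoch over the task.
            for i in range(0, len(task_data), batch_size):
                yield t, task_data[i:i + batch_size]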
3.2. Fine-Tuning Strategy
When using a pre-trained model, we initialize the model
with the pre-trained weights. Then, we fine-tune the model in either 1) a supervised or 2) a self-supervised manner.
For self-supervised fine-tuning, we experiment with the SimCLR pre-trained RN50 and fine-tune it with the SimCLR loss. Specifically, we leverage a replay buffer to store images and labels, which are then used to train the classifier on top of the fine-tuned feature representation at the end of each CL task. Note that we train the SimCLR features on both images sampled from the memory and images from the data stream.
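As a concrete illustration of one such fine-tuning step, consider the following sketch (ours, under our own assumptions: encoder, projector, and augment are placeholders for the backbone, projection head, and augmentation pipeline; labels are deliberately unused here, since they only serve the classifier trained at task boundaries):

    import torch
    import torch.nn.functional as F

    def simclr_step(encoder, projector, stream_x, memory_x, augment, temperature=0.5):
        # Mix stream images with images sampled from the replay buffer.
        x = torch.cat([stream_x, memory_x], dim=0)
        B = x.size(0)
        # Two augmented views per image, as in standard SimCLR.
        z1 = F.normalize(projector(encoder(augment(x))), dim=1)
        z2 = F.normalize(projector(encoder(augment(x))), dim=1)
        z = torch.cat([z1, z2], dim=0)           # (2B, d)
        sim = z @ z.t() / temperature            # pairwise cosine similarities
        sim.fill_diagonal_(float('-inf'))        # exclude self-similarity
        # The positive for view i is its counterpart at index (i + B) mod 2B.
        targets = torch.arange(2 * B, device=z.device).roll(B)
        return F.cross_entropy(sim, targets)     # NT-Xent (SimCLR) loss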
3.3. Two-Stage Pipeline
We combine the two-stage training pipeline proposed
in [18] with pre-trained models to build a strong baseline
for continual learning with pre-training.
The two-stage pipeline divides the learning process into two phases: a streaming phase, where the learner (sampler) is exposed to the data stream, and an offline phase, where the learner learns from the memory. Similar to GDumb [31], another widely used method, the two-stage pipeline trains offline on samples in the memory after all data have been streamed. Specifically, we iterate over the samples in the memory for 30 epochs after the streaming phase. However, GDumb performs no learning in the streaming phase and discards most of the data without learning from them, making it sub-optimal. By contrast, the two-stage pipeline improves over GDumb by training the model on the streaming data while simultaneously storing them in the memory.
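The overall procedure can be summarized by the following sketch (our paraphrase of the pipeline, not the reference code; model.loss, memory.update, and memory.loader are hypothetical interfaces):

    def two_stage_train(model, stream, memory, opt, offline_epochs=30):
        # Stage 1 (streaming): train on each incoming mini-batch once,
        # while simultaneously storing samples in the memory.
        for task_id, batch in stream:
            loss = model.loss(batch)
            opt.zero_grad(); loss.backward(); opt.step()
            memory.update(batch)
        # Stage 2 (offline): iterate over the (shuffled) memory for many epochs.
        for _ in range(offline_epochs):
            for batch in memory.loader():
                loss = model.loss(batch)
                opt.zero_grad(); loss.backward(); opt.step()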
We find that this simple two-stage pipeline is particularly effective when coupled with pre-trained models: with two-stage training, ER [34] can outperform the best-performing methods when all of them leverage a pre-trained model.
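ER-style replay is commonly implemented with a reservoir-sampled buffer; a minimal sketch of such a memory (our illustration, matching the hypothetical memory.update interface assumed above) is:

    import random

    class ReservoirBuffer:
        # Fixed-size memory maintained with reservoir sampling.
        def __init__(self, capacity, seed=0):
            self.capacity = capacity
            self.data = []
            self.n_seen = 0
            self.rng = random.Random(seed)

        def update(self, batch):
            # Each streamed sample is retained with equal probability.
            for sample in batch:
                self.n_seen += 1
                if len(self.data) < self.capacity:
                    self.data.append(sample)
                else:
                    j = self.rng.randrange(self.n_seen)
                    if j < self.capacity:
                        self.data[j] = sample

With a buffer of capacity $K$, each of the $n$ samples seen so far remains in memory with probability $K/n$, so the memory approximates a uniform sample of the stream.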
4. Experimental Setting
4.1. Datasets
We experiment on five datasets. Classes in each dataset are equally and randomly split into 20 tasks with no overlaps. The class orderings are random but kept the same across different experiments.
Split CIFAR100. We follow the suggested train and test
split, where each category has 500 training and 100 test