Do Pre-trained Models Benefit Equally in Continual Learning?
Kuan-Ying Lee Yuanyi Zhong Yu-Xiong Wang
University of Illinois at Urbana-Champaign
{kylee5, yuanyiz2, yxw}@illinois.edu
Abstract
Existing work on continual learning (CL) is primarily devoted to developing algorithms for models trained from scratch. Despite their encouraging performance on contrived benchmarks, these algorithms show a dramatic performance drop in real-world scenarios. Therefore, this paper advocates the systematic introduction of pre-training to CL, which is a general recipe for transferring knowledge to downstream tasks but is substantially missing in the CL community. Our investigation reveals the multifaceted complexity of exploiting pre-trained models for CL, along three different axes: pre-trained models, CL algorithms, and CL scenarios. Perhaps most intriguingly, improvements in CL algorithms from pre-training are very inconsistent – an underperforming algorithm could become competitive and even state of the art when all algorithms start from a pre-trained model. This indicates that the current paradigm, where all CL methods are compared in from-scratch training, is not well reflective of the true CL objective and desired progress. In addition, we make several other important observations, including that 1) CL algorithms that exert less regularization benefit more from a pre-trained model; and 2) a stronger pre-trained model such as CLIP does not guarantee a better improvement. Based on these findings, we introduce a simple yet effective baseline that employs minimum regularization and leverages the more beneficial pre-trained model, coupled with a two-stage training pipeline. We recommend including this strong baseline in the future development of CL algorithms, due to its demonstrated state-of-the-art performance. Our code is available at https://github.com/eric11220/pretrained-models-in-CL.
1. Introduction
Continual learning (CL) has gained increasing research momentum recently, due to the ever-changing nature of real-world data [1,2,12,14,22,24,33,35,42,45]. Despite their encouraging performance, many notable CL algorithms were developed to work with a model trained from scratch. As one of the key objectives, this paper advocates the systematic introduction of pre-training to CL. This is rooted in the following two fundamental limitations we observe in building CL algorithms on top of a from-scratch trained model, which fails to reflect the true progress of CL research in real-world scenarios, as shown in Fig. 1.

Figure 1. (a) CL algorithms trained from scratch fail on Split CUB200, a more complex dataset than Split CIFAR100, which necessitates the use of pre-trained models (denoted as ‘+ RN18’) that dramatically increase the accuracy of a wide spectrum of algorithms. (b) Different CL algorithms receive vastly different benefits from pre-trained models, and the superiority between algorithms changes. These findings suggest that it is critical for the community to develop CL algorithms with a pre-trained model and understand their behaviors. [Best viewed in color.]
First, training from scratch does not reflect the actual performance, because if one were to apply a CL algorithm to real-world scenarios, it would be counter-intuitive not to build upon off-the-shelf pre-trained models given the large performance gap (Fig. 1). One might argue that applying all algorithms to a from-scratch trained model simplifies comparison between different algorithms. However, intriguingly, our study shows that an underperforming algorithm could become competitive and even achieve state-of-the-art performance when all algorithms start from a pre-trained model. In particular, iCaRL [33], which shows mediocre performance in online class incremental learning (CIL) when trained from scratch, is comparable to or even outperforms SCR [27] when both are initialized from a ResNet18 (we refer to ResNet as RN throughout the paper) pre-trained on ImageNet: the accuracy increases from 14.26% to 56.64% for iCaRL vs. from 25.80% to 51.93% for SCR on Split CIFAR100 (Fig. 1 and Table 2). This potentially indicates that the efforts funneled into the development of CL algorithms could be directed in a less effective direction and may not reflect the actual progress in CL. Therefore, we should develop any future CL algorithms in the context of how we are going to use them in practice – starting from a pre-trained model.
Second, for many more realistic datasets with diverse visual concepts, data scarcity makes it impossible to train a CL learner from scratch [9] (as also shown by the results on Split CUB200 in Fig. 1). We believe that this is partially the reason why the CL classification literature still heavily evaluates on contrived benchmarks such as Split MNIST and Split CIFAR [14,24], as opposed to the much more complex datasets typically used in offline learning.
Through our investigation, this paper reveals the multifaceted complexity of exploiting pre-trained models for CL. As summarized in Table 1, we conduct the investigation along three different axes: different pre-trained models, different CL algorithms, and different CL scenarios. In particular, we analyze models pre-trained in either supervised or self-supervised fashion and from three distinct sources of supervision – curated labeled images, non-curated image-text pairs, and unlabeled images. These models cover supervised RN18/50 [19] trained on ImageNet classification [15], CLIP RN50 [32], and self-supervised RN50 trained with SimCLR [10], SwAV [6], or Barlow Twins [44].
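To make this model zoo concrete, below is a minimal sketch of how such backbones could be instantiated. It is our illustration rather than the authors' code: the checkpoint paths for the self-supervised models are hypothetical placeholders, and the torchvision and CLIP loading calls reflect the commonly used public APIs.

```python
# Minimal sketch (not the authors' implementation) of instantiating the studied
# backbones. Self-supervised checkpoint paths below are hypothetical placeholders.
import torch
import torchvision.models as models

def build_backbone(name: str) -> torch.nn.Module:
    if name == "rn18_imagenet":                      # label-supervised on ImageNet
        return models.resnet18(pretrained=True)
    if name == "rn50_imagenet":
        return models.resnet50(pretrained=True)
    if name == "clip_rn50":                          # image-text supervised
        import clip                                  # https://github.com/openai/CLIP
        model, _ = clip.load("RN50")
        return model.visual                          # keep only the image encoder
    if name in {"simclr_rn50", "swav_rn50", "barlow_twins_rn50"}:
        net = models.resnet50(pretrained=False)      # self-supervised on ImageNet
        state = torch.load(f"checkpoints/{name}.pth", map_location="cpu")
        net.load_state_dict(state, strict=False)     # projector heads are discarded
        return net
    raise ValueError(f"unknown backbone: {name}")
```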
We make several important observations. 1) The benefits of a pre-trained model for different CL algorithms vary widely, as represented by the aforementioned comparison between iCaRL and SCR. 2) As shown in Fig. 1, algorithms applying less regularization to the gradient (i.e., replay-based methods like ER [34]) seem to benefit the most from pre-trained models. 3) Intriguingly, despite its impressive zero-shot capability, CLIP RN50 mostly underperforms ImageNet RN50. 4) Self-supervised fine-tuning helps alleviate catastrophic forgetting. For example, fine-tuning SimCLR RN50 on the downstream dataset in a self-supervised fashion with the SimCLR loss yields a large reduction in forgetting compared with supervised models (17.99% forgetting for SimCLR RN50 vs. 91.12% forgetting for supervised RN50). 5) Iterating over the data of a given task for multiple epochs, as in class incremental learning (CIL), does not necessarily improve the performance over online CIL.
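For context on the forgetting percentages quoted above, a common average forgetting measure in the CL literature is $F_T = \frac{1}{T-1} \sum_{i=1}^{T-1} \big( \max_{j \in \{i,\dots,T-1\}} a_{j,i} - a_{T,i} \big)$, where $a_{j,i}$ denotes the accuracy on task $i$ after training on task $j$ and $T$ is the final task; we sketch this standard definition only for reference, and it is our assumption rather than a restatement of this paper's exact evaluation protocol.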
Based on these observations, we further propose a strong baseline by applying ER, which exerts minimum regularization (the second observation), on an ImageNet pre-trained model (the third observation). Coupled with a two-stage training pipeline [18] (Sec. 3.3), we show that such a simple baseline achieves state-of-the-art performance. We recommend including this strong baseline in the future development of CL algorithms.
Axis | Configurations
Pre-trained Models (7) | Reduced RN18, RN18, RN50, CLIP RN50, SimCLR RN50, SwAV RN50, Barlow Twins RN50
CL Algorithms (11) | ER, MIR, GSS, iCaRL, GDumb, SCR, LwF, EWC++, AGEM, Co2L, DER++
CL Scenarios (2) | CIL, Online CIL

Table 1. We conduct the analyses of pre-trained models in CL by dissecting the space into three axes: 1) different pre-trained models (label-supervised on ImageNet, image-text supervised, and self-supervised on ImageNet), 2) different CL algorithms, and 3) different CL scenarios.
Our contributions are summarized as follows. 1) We show the necessity of pre-trained models on more complex CL datasets and the dramatic difference in their benefits on different CL algorithms, which may overturn the comparison results between algorithms. Therefore, we suggest the community consider pre-trained models when developing and evaluating new CL algorithms. 2) We show that replay-based CL algorithms seem to benefit more from a pre-trained model, compared with regularization-based counterparts. 3) We propose a simple yet strong baseline based on ER and ImageNet RN50, which achieves state-of-the-art performance for CL with pre-training.
2. Related Work
Continual Learning Scenarios. A large portion of the CL literature focuses on incremental learning, which can be further divided into three different scenarios – task, domain, and class incremental learning [38]. Among them, the most challenging scenario is class incremental learning (CIL), where the model has to predict all previously seen classes with a single head in the absence of task information. Most recent work [17,20,26] has investigated this setting.
However, being able to iterate over the entire data of a specific task for multiple epochs is not realistic [9,31]. To this end, an online version of CIL has been proposed [9,14], where the model trains in an online fashion and thereby can access each example only once. In this work, we mainly investigate pre-trained models in online CIL, but also report results in CIL for several representative algorithms.
Continual Learning Methods. According to [13], continual learning approaches can be divided into three classes: regularization, parameter isolation, and replay methods. Regularization methods [1,22,45] prevent forgetting by keeping the learned parameters from deviating too much. Parameter isolation methods counter forgetting completely by dedicating a non-overlapping set of parameters to each task [35,42]. Replay methods either store previous instances [14,17,24,33] or generate pseudo-instances [12,36,37] on the fly for replay to alleviate forgetting.
While the aforementioned approaches all show promising results in different CL scenarios, we specifically explore regularization and memory replay-based methods, given their popularity in recent literature, and study the behaviors of pre-trained models on these methods.
Continual Learning with Pre-trained Models. While most CL work investigates training the learner from scratch [4,14,22,24,26,33,45], there is also some work that initializes the learner from a pre-trained model [3,9,11,20,29,30]. They harness pre-trained models for reasons such as coping with data scarcity in the downstream task [9] and simulating prior knowledge of the continual learner [20]. However, they do not 1) systematically show the substantial benefits of pre-trained models over from-scratch trained models, 2) investigate different types of pre-trained models or fine-tuning strategies, or 3) investigate pre-trained models in different CL scenarios (incremental and online learning). Note that we do not claim to be the first to apply pre-trained models to CL; rather, we study the aforementioned aspects comprehensively.
3. Methodology
We mainly focus on online class incremental learning (CIL), which is formally defined in Sec. 3.1. Next, we discuss various pre-trained models and how we leverage them (Sec. 3.2). In Sec. 3.3, we introduce the two-stage training pipeline that combines online training and offline training.
3.1. Problem Formulation
The most widely adopted continual learning scenarios are 1) task incremental learning, 2) domain incremental learning, and 3) class incremental learning (CIL). Among them, CIL is the most challenging and draws the most attention, for its closer resemblance to real-world scenarios, where the model is required to make predictions on all classes seen so far with no task identifiers given (we refer interested readers to [38] for more details). In this paper, we focus on a more difficult scenario – online CIL, where the model can access the data only once, except through a replay buffer. In other words, the model cannot iterate over the data of the current task for multiple epochs, which is common in CIL. The experiments in the rest of the paper are based on online CIL unless noted otherwise (we also evaluate CIL).
Formally, we define the problem as follows. $C_{total}$ classes are split into $N$ tasks, and each task $t$ contains $C_t$ non-overlapping classes (e.g., CIFAR100 is divided into 20 tasks, with each task containing 5 unique classes). The model is presented with the $N$ tasks sequentially. Each task $t$ corresponds to a data stream $S_t$, $0 < t \le N$, that presents a mini-batch of samples to the model at a time. Each sample belongs to one of the $C_t$ unique classes, which do not overlap with those of other tasks. We explore pre-trained models with $N = 20$ tasks and present the results in Sec. 5.2.
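As a concrete illustration of this protocol, the sketch below splits a labeled dataset into disjoint-class tasks and exposes each task as a single-pass stream of mini-batches. It is our own minimal example with assumed details (batch size, use of torch.utils.data), not the authors' implementation.

```python
# Minimal sketch of the online CIL protocol described above.
import random
from torch.utils.data import DataLoader, Subset

def make_task_streams(dataset, num_tasks=20, batch_size=10, seed=0):
    """Split `dataset` into `num_tasks` disjoint-class tasks and return one
    single-pass DataLoader (the stream S_t) per task."""
    labels = [y for _, y in dataset]
    classes = sorted(set(labels))
    rng = random.Random(seed)
    rng.shuffle(classes)                      # random but fixed class ordering
    per_task = len(classes) // num_tasks      # e.g., 100 classes / 20 tasks = 5

    streams = []
    for t in range(num_tasks):
        task_classes = set(classes[t * per_task:(t + 1) * per_task])
        idx = [i for i, y in enumerate(labels) if y in task_classes]
        # shuffle once; each sample is presented to the learner exactly once
        streams.append(DataLoader(Subset(dataset, idx),
                                  batch_size=batch_size, shuffle=True))
    return streams
```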
3.2. Fine-Tuning Strategy
When using a pre-trained model, we initialize the model with the pre-trained weights. Then, we fine-tune the model in either a 1) supervised or 2) self-supervised manner.

For self-supervised fine-tuning, we experiment with the SimCLR pre-trained RN50 and fine-tune it with the SimCLR loss. Specifically, we leverage a replay buffer to store images and labels, which are then used to train the classifier on top of the fine-tuned feature representation at the end of each CL task. Note that we train the SimCLR feature with both images sampled from the memory and images from the data stream.
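A simplified sketch of one self-supervised fine-tuning step is shown below. The augmentation pipeline, projection head, and temperature are our assumptions rather than the paper's exact hyper-parameters; only the overall recipe (a SimCLR-style loss over the union of stream and memory images, with labels reserved for the classifier trained at task end) follows the description above.

```python
# Sketch of one self-supervised (SimCLR-style) fine-tuning step; hyper-parameters
# and the projector architecture are illustrative assumptions.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Contrastive NT-Xent loss over two batches of projections (N x D each)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D)
    sim = z @ z.t() / tau                                       # scaled cosine similarity
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    # positives: view i pairs with view i + n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def ssl_step(encoder, projector, optimizer, stream_x, memory_x, augment):
    """Update the encoder with the SimCLR loss on stream + replayed images.
    Labels are not used here; they stay in the buffer for the classifier
    trained on top of the fine-tuned features at the end of each task."""
    x = torch.cat([stream_x, memory_x], dim=0)
    v1, v2 = augment(x), augment(x)                             # two random views
    z1, z2 = projector(encoder(v1)), projector(encoder(v2))
    loss = nt_xent(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```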
3.3. Two-Stage Pipeline
We combine the two-stage training pipeline proposed in [18] with pre-trained models to build a strong baseline for continual learning with pre-training.

The two-stage pipeline divides the learning process into two phases – a streaming phase, where the learner (sampler) is exposed to the data stream, and an offline phase, where the learner learns from the memory. Similar to another widely used method, GDumb [31], the two-stage pipeline trains offline on samples in the memory after all data have been streamed. Specifically, we iterate over the samples in the memory for 30 epochs after the streaming phase. However, GDumb performs no learning in the streaming phase and discards most of the data without learning from them, making it sub-optimal. By contrast, the two-stage pipeline improves over GDumb by training the model on the data while storing them at the same time.
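A minimal sketch of this pipeline combined with experience replay is given below; the buffer size, replay batch size, reservoir-sampling buffer policy, and plain cross-entropy loss are illustrative assumptions, not the released implementation.

```python
# Sketch of the two-stage pipeline: a streaming phase with replay, followed by
# an offline phase over the memory buffer. Hyper-parameters are illustrative.
import random
import torch
import torch.nn.functional as F

def reservoir_update(memory, example, seen, capacity):
    """Standard reservoir sampling so the buffer holds a uniform sample of the stream."""
    if len(memory) < capacity:
        memory.append(example)
    else:
        j = random.randint(0, seen)
        if j < capacity:
            memory[j] = example
    return seen + 1

def two_stage_train(model, optimizer, streams, capacity=1000, offline_epochs=30, replay_bs=10):
    memory, seen = [], 0
    # Stage 1: streaming phase -- learn from each mini-batch once while storing it.
    for stream in streams:                                   # one stream per task
        for x, y in stream:
            bx, by = x, y
            if memory:                                       # mix in a replayed batch
                mem_x, mem_y = zip(*random.sample(memory, min(replay_bs, len(memory))))
                bx = torch.cat([bx, torch.stack(mem_x)])
                by = torch.cat([by, torch.tensor(mem_y)])
            loss = F.cross_entropy(model(bx), by)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            for xi, yi in zip(x, y):
                seen = reservoir_update(memory, (xi, int(yi)), seen, capacity)
    # Stage 2: offline phase -- iterate over the buffered samples for several epochs.
    for _ in range(offline_epochs):
        random.shuffle(memory)
        for i in range(0, len(memory), replay_bs):
            xs, ys = zip(*memory[i:i + replay_bs])
            loss = F.cross_entropy(model(torch.stack(xs)), torch.tensor(ys))
            optimizer.zero_grad(); loss.backward(); optimizer.step()
```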
We find that this simple two-stage pipeline is particularly effective when coupled with pre-trained models. With two-stage training, ER [34] can outperform the best-performing CL algorithms when all of them leverage a pre-trained model.
4. Experimental Setting
4.1. Datasets
We experiment on five datasets. Classes in each dataset are equally and randomly split into 20 tasks with no overlaps. The orderings are random, but kept the same across different experiments.
Split CIFAR100. We follow the suggested train and test split, where each category has 500 training and 100 test images.