regularization, parameter isolation, and replay methods.
Regularization methods [1,22,45] mitigate forgetting by preventing the learned parameters from deviating too far from those of previous tasks. Parameter isolation methods counter forgetting entirely by dedicating a non-overlapping set of parameters to each task [35,42]. Replay methods alleviate forgetting by either storing previous instances [14,17,24,33] or generating pseudo-instances [12,36,37] on the fly for replay.
While the aforementioned approaches all show promising results in different CL scenarios, we specifically explore regularization- and memory replay-based methods, given their popularity in the recent literature, and study how pre-trained models behave under these methods.
Continual Learning with Pre-trained Models. While
most of the CL work investigates training the learner from
scratch [4,14,22,24,26,33,45], there is also some work that
initializes the learner from a pre-trained model [3,9,11,20,
29,30]. They harness pre-trained models for reasons such as coping with data scarcity in the downstream task [9] and simulating the prior knowledge of the continual learner [20].
However, they do not 1) systematically demonstrate the substantial benefits of pre-trained models over models trained from scratch, 2) investigate different types of pre-trained models or fine-tuning strategies, or 3) examine pre-trained models under different CL scenarios (incremental and online learning). Note that we do not claim to be the first to apply pre-trained models to CL; rather, we study the aforementioned aspects comprehensively.
3. Methodology
We mainly focus on online class incremental learning
(CIL), which is formally defined in Sec. 3.1. Next, we discuss various pre-trained models and how we leverage them
(Sec. 3.2). In Sec. 3.3, we introduce the two-stage training
pipeline that combines online training and offline training.
3.1. Problem Formulation
The most widely adopted continual learning scenarios are 1) task incremental learning, 2) domain incremental learning, and 3) class incremental learning (CIL). Among them, CIL is the most challenging and draws the most attention because it more closely resembles real-world scenarios: the model is required to make predictions over all classes seen so far, with no task identifiers given (we refer interested readers to [38] for more details). In this paper, we focus on an even more difficult scenario, online CIL, where the model has access to each sample only once unless it is stored in a replay buffer. In other words, the model cannot iterate over the data of the current task for multiple epochs, as is common in standard CIL. Unless noted otherwise, the experiments in the remainder of the paper are based on online CIL (we also evaluate standard CIL).
Formally, we define the problem as follows. $C_{\text{total}}$ classes are split into $N$ tasks, and each task $t$ contains $C_t$ non-overlapping classes (e.g., CIFAR100 is divided into 20 tasks, with each task containing 5 unique classes). The model is presented with the $N$ tasks sequentially. Each task is a data stream $\{\mathcal{S}_t \mid 0 < t \leq N\}$ that presents a mini-batch of samples to the model at a time. Each sample belongs to one of the $C_t$ unique classes, which do not overlap with the classes of other tasks. We explore pre-trained models with $N = 20$ tasks and present the results in Sec. 5.2.
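To make the protocol concrete, the following minimal sketch (our illustration, not the authors' implementation; it assumes the dataset is given as a list of (image, label) pairs) constructs such an online CIL stream, where each mini-batch is presented exactly once:

    import random

    def make_online_cil_stream(dataset, num_tasks=20, batch_size=10, seed=0):
        # Randomly split all classes into equally sized, non-overlapping tasks.
        classes = sorted({y for _, y in dataset})
        rng = random.Random(seed)
        rng.shuffle(classes)
        per_task = len(classes) // num_tasks
        for t in range(num_tasks):
            task_classes = set(classes[t * per_task:(t + 1) * per_task])
            task_data = [(x, y) for x, y in dataset if y in task_classes]
            rng.shuffle(task_data)
            # Online CIL: yield each mini-batch once; no second epoch over the task.
            for i in range(0, len(task_data), batch_size):
                yield t, task_data[i:i + batch_size]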
3.2. Fine-Tuning Strategy
When using a pre-trained model, we initialize the model
with the pre-trained weights. Then, we fine-tune the model in either 1) a supervised or 2) a self-supervised manner.
For self-supervised fine-tuning, we experiment with the SimCLR pre-trained RN50 and fine-tune it with the SimCLR loss. Specifically, we leverage a replay buffer to store images and labels, which are then used to train the classifier on top of the fine-tuned feature representation at the end of each CL task. Note that we train the SimCLR features on both images sampled from the memory and images from the data stream.
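As a concrete illustration of one such fine-tuning step, consider the following sketch (ours, under our own assumptions: encoder, projector, and augment are placeholders for the backbone, projection head, and augmentation pipeline; labels are deliberately unused here, since they only serve the classifier trained at task boundaries):

    import torch
    import torch.nn.functional as F

    def simclr_step(encoder, projector, stream_x, memory_x, augment, temperature=0.5):
        # Mix stream images with images sampled from the replay buffer.
        x = torch.cat([stream_x, memory_x], dim=0)
        B = x.size(0)
        # Two augmented views per image, as in standard SimCLR.
        z1 = F.normalize(projector(encoder(augment(x))), dim=1)
        z2 = F.normalize(projector(encoder(augment(x))), dim=1)
        z = torch.cat([z1, z2], dim=0)           # (2B, d)
        sim = z @ z.t() / temperature            # pairwise cosine similarities
        sim.fill_diagonal_(float('-inf'))        # exclude self-similarity
        # The positive for view i is its counterpart at index (i + B) mod 2B.
        targets = torch.arange(2 * B, device=z.device).roll(B)
        return F.cross_entropy(sim, targets)     # NT-Xent (SimCLR) loss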
3.3. Two-Stage Pipeline
We combine the two-stage training pipeline proposed
in [18] with pre-trained models to build a strong baseline
for continual learning with pre-training.
The two-stage pipeline divides the learning process into two phases: a streaming phase, where the learner (sampler) is exposed to the data stream, and an offline phase, where the learner learns from the memory. Similar to GDumb [31], another widely used method, the two-stage pipeline trains offline on samples in the memory after all data have been streamed. Specifically, we iterate over the samples in the memory for 30 epochs after the streaming phase. However, GDumb performs no learning in the streaming phase and discards most of the data without learning from them, making it sub-optimal. By contrast, the two-stage pipeline improves over GDumb by training the model on the streaming data while simultaneously storing them in the memory.
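The overall procedure can be summarized by the following sketch (our paraphrase of the pipeline, not the reference code; model.loss, memory.update, and memory.loader are hypothetical interfaces):

    def two_stage_train(model, stream, memory, opt, offline_epochs=30):
        # Stage 1 (streaming): train on each incoming mini-batch once,
        # while simultaneously storing samples in the memory.
        for task_id, batch in stream:
            loss = model.loss(batch)
            opt.zero_grad(); loss.backward(); opt.step()
            memory.update(batch)
        # Stage 2 (offline): iterate over the (shuffled) memory for many epochs.
        for _ in range(offline_epochs):
            for batch in memory.loader():
                loss = model.loss(batch)
                opt.zero_grad(); loss.backward(); opt.step()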
We find that this simple two-stage pipeline is particularly effective when coupled with pre-trained models: with two-stage training, ER [34] can outperform the best-performing methods when all of them leverage a pre-trained model.
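ER-style replay is commonly implemented with a reservoir-sampled buffer; a minimal sketch of such a memory (our illustration, matching the hypothetical memory.update interface assumed above) is:

    import random

    class ReservoirBuffer:
        # Fixed-size memory maintained with reservoir sampling.
        def __init__(self, capacity, seed=0):
            self.capacity = capacity
            self.data = []
            self.n_seen = 0
            self.rng = random.Random(seed)

        def update(self, batch):
            # Each streamed sample is retained with equal probability.
            for sample in batch:
                self.n_seen += 1
                if len(self.data) < self.capacity:
                    self.data.append(sample)
                else:
                    j = self.rng.randrange(self.n_seen)
                    if j < self.capacity:
                        self.data[j] = sample

With a buffer of capacity $K$, each of the $n$ samples seen so far remains in memory with probability $K/n$, so the memory approximates a uniform sample of the stream.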
4. Experimental Setting
4.1. Datasets
We experiment on five datasets. Classes in each dataset are equally and randomly split into 20 tasks with no overlaps. The class orderings are random but kept the same across different experiments.
Split CIFAR100. We follow the suggested train and test
split, where each category has 500 training and 100 test