
Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Li & Hoiem, 2017) avoid catastrophic forgetting by limiting the plasticity of the model parameters that are important for previous tasks. Even though these methods do not require memory replay, they have only been shown to work in the simplistic task-incremental setting (where the task identity is assumed to be known at inference time) and on smaller datasets (Wu et al., 2019). Memory replay-based methods use exemplars that are either stored in memory or synthesized with generative techniques, and are effective on more challenging settings and datasets (Rebuffi et al., 2017; Kamra et al., 2017; Buzzega et al., 2020; Cha et al., 2021). However, their performance degrades for smaller memory sizes (Cha et al., 2021), and storing these exemplars can introduce security and privacy concerns (Shokri & Shmatikov, 2015). Architecture-driven CL methods either dynamically expand a network (Rusu et al., 2016; Li et al., 2019b; Zhao et al., 2022) or divide it into sub-networks to accommodate new tasks (Zhao et al., 2022; Wang et al., 2020; Ke et al., 2020; Rajasegaran et al., 2019a). Such approaches lack scalability, since the network capacity grows with the number of tasks.
Vision-language Models: Training a joint vision-language embedding space enables interaction between text and image data, and is critical for solving problems such as zero-shot learning, visual grounding, and image captioning. While the initial vision-language models were single-stream and processed the concatenated visual and text inputs as a single sequence (Li et al., 2019a; Kim et al., 2021), more recent approaches, such as Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), are dual-stream, with dedicated encoders for image and text inputs. The representations from the two encoders are projected into a unified embedding space, and a contrastive learning objective is employed to minimize the distance between matching image-caption pairs and maximize it between non-matching pairs. Subsequent works have shown that CLIP is scalable, and its capabilities improve when trained on large-scale noisy data of 1 billion non-curated samples (Jia et al., 2021). The representations learned by CLIP generalize well across numerous downstream tasks, including image-text retrieval, with excellent zero-shot transfer capabilities (Li et al., 2021a). CLIP can also be flexibly adapted to videos (Yuan et al., 2021; Xu et al., 2021; Fang et al., 2021) and to capture object-level interactions (Yao et al., 2021; Li et al., 2021a; Rasheed et al., 2022). However, the applicability of CLIP representations to CL has not yet been investigated.
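To make the contrastive objective concrete, the following is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss, assuming the image and text features have already been produced by the two encoders and projected into the shared embedding space; the function and argument names are illustrative rather than taken from the CLIP codebase.

# Minimal sketch of a CLIP-style symmetric contrastive objective (illustrative,
# not the official implementation). Inputs are batch-aligned: the i-th image
# corresponds to the i-th caption.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize both modalities so the dot product is the cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits between every image and every caption in the batch.
    logits_per_image = image_features @ text_features.t() / temperature
    logits_per_text = logits_per_image.t()

    # Matching pairs lie on the diagonal; all other pairs act as negatives.
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy: pull matching pairs together, push the rest apart.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2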
With recent advances in Vision Transformers (Khan et al., 2021) and prompt-based fine-tuning in NLP (Li & Liang, 2021), Wang et al. have shown that interacting with an ImageNet pre-trained model via prompt learning is a promising approach for continual learning (Wang et al., 2022a;b). A small set of learnable parameters, called prompts, is appended to the input and enables quick adaptation of a frozen ImageNet pre-trained model to new streaming tasks. In our analysis, we show that directly leveraging a pre-trained vision-language model, without introducing any learnable parameters, is a simple yet promising approach to continual learning. We argue that adapting a joint vision-language model like CLIP (Radford et al., 2021) for continual learning presents multiple advantages. It caters for practical scenarios where there are no well-defined task identities and boundaries, and the model is required to dynamically adapt to streaming data in a task-agnostic manner. Moreover, leveraging the CLIP model requires no compute-expensive training or fine-tuning on the data of new tasks. Further, in contrast to current state-of-the-art methods that require a memory buffer to store training examples from previous tasks, the Continual-CLIP approach is rehearsal-free and is better suited to scenarios where storing training examples raises privacy and security concerns or storage constraints (Shokri & Shmatikov, 2015). Instead of storing past samples in a memory buffer, whose performance can deteriorate for small buffer sizes, the Continual-CLIP approach requires constant memory throughout all learning episodes. In summary, the Continual-CLIP approach is memory-free, does not require test-time task identity information, and can be flexibly and easily adapted to any number of classes without requiring any additional learnable parameters.
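As a concrete illustration, the snippet below sketches how a frozen CLIP model can classify an image against all classes seen so far using only hand-crafted prompts, in the spirit of Continual-CLIP. It relies on the open-source clip package released with Radford et al. (2021); the prompt template, class names, and image path are illustrative assumptions, and extending the class list is the only change needed when new classes arrive, since no parameters are trained.

# Zero-shot, task-agnostic classification with a frozen CLIP model via
# hand-crafted prompts (illustrative sketch; class names and image path are
# placeholders, not from the paper).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Class names accumulated over all learning episodes so far.
seen_classes = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {c}" for c in seen_classes]

with torch.no_grad():
    # Encode one text prompt per seen class and normalize.
    text_tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Encode the test image and normalize.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity against every class prompt; the most similar prompt wins.
    logits = image_features @ text_features.t()
    prediction = seen_classes[logits.argmax(dim=-1).item()]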
3 METHODOLOGY
In this section, we first briefly discuss different continual learning settings. Next, we introduce Contrastive Language-Image Pre-training (CLIP; Radford et al., 2021) and explain how to apply it to different downstream continual learning tasks in a zero-shot manner with hand-crafted prompts.