Preprint
CLIP MODEL IS AN EFFICIENT CONTINUAL LEARNER
Vishal Thengane1,✉, Salman Khan1,2, Munawar Hayat3, Fahad Khan1,4
1Mohamed bin Zayed University of Artificial Intelligence, UAE
2Australian National University, Australia
3Monash University, Australia
4Linköping University, Sweden
✉ vishal.thengane@mbzuai.ac.ae
ABSTRACT
The continual learning setting aims to learn new tasks over time without forgetting the previous ones. The literature reports several significant efforts to tackle this problem with limited or no access to previous task data. Among such efforts, typical solutions offer sophisticated techniques involving memory replay, knowledge distillation, model regularization, and dynamic network expansion. The resulting methods have a retraining cost at each learning task, dedicated memory requirements, and setting-specific design choices. In this work, we show that a frozen CLIP (Contrastive Language-Image Pretraining) model offers astounding continual learning performance without any fine-tuning (zero-shot evaluation). We evaluate CLIP under a variety of settings including class-incremental, domain-incremental, and task-agnostic incremental learning on five popular benchmarks (ImageNet-100 & 1K, CORe50, CIFAR-100, and TinyImageNet). Without any bells and whistles, the CLIP model outperforms the state-of-the-art continual learning approaches in the majority of settings. We show the effect of varying the text inputs with simple prompt templates on the CLIP model's performance. To the best of our knowledge, this is the first work to report the CLIP zero-shot performance in a continual setting. We advocate the use of this strong yet embarrassingly simple baseline for future comparisons in continual learning tasks. Code is available at https://github.com/vgthengane/Continual-CLIP.
1 INTRODUCTION
Traditionally, deep neural networks (DNNs) trained in a supervised manner on training sets comprising all the classes of interest have shown excellent results. Such models can presumably learn all the relevant features from the dataset in a single training episode. However, in the real world, all data samples may not be available at once. To cater for such scenarios, continual learning provides a promising paradigm, since it enables learning where the data distribution shifts over time. DNNs trained on such an incremental stream of data, however, suffer from catastrophic forgetting since the previous task data cannot be accessed in its entirety (McCloskey & Cohen, 1989).
In the literature, four popular continual learning protocols exist. Task-incremental learning (TIL) uses task-specific neural networks, where the task identity is assumed to be known at inference. Class-incremental learning (CIL) adds classes in a sequence, with the task identity unknown at inference. In domain-incremental learning (DIL), the number of classes remains the same but the data domain evolves over time. Task-free (or task-agnostic) continual learning (TFCL) is a more general setting where there are no explicit task boundaries and data can appear freely across the continual learning phases (De Lange et al., 2021). The major challenge faced by all these methods is to avoid forgetting previously learned knowledge while updating on new data.
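To make the distinction between these protocols concrete, the following minimal Python sketch (ours, for illustration only; all field names are hypothetical and not part of any benchmark API) summarizes what a single incremental step exposes to the learner under each protocol.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IncrementalStep:
    images: List            # training images observed at this step
    labels: List            # corresponding labels
    new_classes: List[str] = field(default_factory=list)  # non-empty when classes are added (TIL/CIL)
    domain: Optional[str] = None   # shifts over time in DIL while the label set stays fixed
    task_id: Optional[int] = None  # revealed at inference time only in TIL

# TIL : task_id is available at test time, so a per-task classifier head can be selected.
# CIL : new_classes arrive over steps, but task_id is withheld at test time.
# DIL : the label set stays fixed while the domain drifts across steps.
# TFCL: no explicit step boundaries; the data distribution changes freely over time.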
Several specialized methods have been developed in the continual learning literature to reduce catastrophic forgetting. Among such methods, typical solutions offer sophisticated techniques involving memory replay (Rebuffi et al., 2017; Shin et al., 2017; Lopez-Paz & Ranzato, 2017), knowledge distillation (Hinton et al., 2015; Li & Hoiem, 2017), model regularization (Kirkpatrick et al., 2017), parameter isolation (Mallya & Lazebnik, 2018; Fernando et al., 2017), and dynamic network expansion (Yan et al., 2021; Douillard et al., 2022; 2020). The resulting methods incur a retraining cost at each learning task, need dedicated memory for storing exemplars or past models, and involve complex hyper-parameter tuning, which limits their practical utility. Furthermore, the above continual learning protocols are generally addressed separately, and the existing approaches involve setting-specific design choices, making them non-transferable across different continual learning settings.

Figure 1: Left: Traditional continual learning approaches can require memory buffers, complex hyper-parameter tuning, and saving a copy of previous models, and their number of classifier heads increases at each learning step. Right: Our goal is to design a simple unified model that works well across multiple continual learning settings without incurring task-wise training, dedicated memory requirements, or careful hyper-parameter selection. A CLIP-based continual model is shown to perform exceptionally well on a number of continual learning settings without requiring any training/fine-tuning, memory buffers, or growth in model size with the increasing number of tasks.
In this work, we aim to test the progress made so far towards a truly continual learning system. Our main question is whether the state-of-the-art narrow models can be replaced with a simple generic approach that does not require training at each incremental step, works without any exemplar memory storage, and can operate across all of the existing incremental learning protocols with minimal or no hyper-parameter tuning. To this end, we show that a frozen CLIP model (Radford et al., 2021) offers great promise due to its generalizable representations and zero-shot behaviour, without requiring any parameter tuning. We refer to CLIP evaluated on a diverse set of continual learning settings as Continual-CLIP. Figure 1 gives an overview of traditional continual learning methods and frozen CLIP in the continual learning system.
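As a concrete illustration, the zero-shot inference that Continual-CLIP relies on can be sketched with the publicly released clip package as follows; this is a minimal sketch rather than the authors' code, and the checkpoint, class names, and image path are placeholders.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # frozen; never fine-tuned

class_names = ["plane", "dog", "cat"]  # placeholder label space
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # cosine similarity between the image and every class prompt
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T

prediction = class_names[logits.argmax(dim=-1).item()]  # highest-scoring class name

When new classes arrive, only class_names is extended and the prompts are re-tokenized; no model parameter is updated.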
Our extensive evaluations across four diverse settings (TIL, CIL, DIL, and TFCL) and seven datasets (ImageNet-100 & 1K (Deng et al., 2009), CLEAR (Lin et al., 2021), CIFAR-100 (Krizhevsky et al., 2009), TinyImageNet (Le & Yang, 2015), CORe50 (Lomonaco & Maltoni, 2017), and Gaussian-scheduled CIFAR-100 (Shanahan et al., 2021)) demonstrate CLIP's competitiveness in all these CL settings. This generalization behaviour is due to the large-scale pre-training of vision-language models like CLIP, which optimize a contrastive training objective on 400M image-text pairs scraped from the internet. During pre-training, CLIP learns a diverse range of high-level representations which are transferable to multiple downstream tasks, including the incremental tasks. We also show how simple prompt engineering for the text inputs affects CLIP's performance in CL.
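As a sketch of this prompt-engineering effect (the template set and class names below are illustrative, not the exact ones evaluated in the paper), each hand-crafted template produces a different zero-shot text classifier from the same frozen model.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

class_names = ["plane", "dog", "cat"]
templates = ["{}.", "a photo of a {}.", "a blurry photo of a {}."]

def text_classifier(template):
    # one normalized text embedding per class, built from the given template
    tokens = clip.tokenize([template.format(c) for c in class_names]).to(device)
    with torch.no_grad():
        features = model.encode_text(tokens)
    return features / features.norm(dim=-1, keepdim=True)

# Accuracy on the continual learning benchmarks can then be measured per template.
classifiers = {t: text_classifier(t) for t in templates}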
In summary, this work lays out a baseline for future directions in continual learning based on pre-trained vision-language models. We evaluate the pre-trained frozen CLIP model in a variety of continual learning settings on popular image recognition benchmarks and compare against current state-of-the-art methods to show that out-of-the-box CLIP representations perform competitively in all cases. Our results aim to consolidate the fragmented efforts in the continual learning landscape that target specific settings, highlighting the need for generic approaches that can work across multiple settings.
2 RELATED WORKS
Continual Learning: The existing continual learning methods mostly employ one of the following schemes: (1) model regularization, (2) memory replay, and (3) dynamic network expansion.
Model regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Li & Hoiem, 2017) avoid catastrophic forgetting by limiting the plasticity of the model parameters that are important for previous tasks. Even though these methods do not require memory replay, they have only been shown to work in the simplistic task-incremental setting (where the task identity is assumed to be known at inference time) and on smaller datasets (Wu et al., 2019). Memory replay-based methods use exemplars that are either stored in memory or synthesized using generative techniques, and are effective on more challenging settings and datasets (Rebuffi et al., 2017; Kamra et al., 2017; Buzzega et al., 2020; Cha et al., 2021). However, their performance degrades for smaller memory sizes (Cha et al., 2021), and storing these exemplars can introduce security and privacy concerns (Shokri & Shmatikov, 2015). Architecture-driven CL methods either dynamically expand a network (Rusu et al., 2016; Li et al., 2019b; Zhao et al., 2022) or divide it into sub-networks to cater for new tasks (Zhao et al., 2022; Wang et al., 2020; Ke et al., 2020; Rajasegaran et al., 2019a). Such approaches lack scalability, since the network capacity grows with the number of tasks.
Vision-language Models: Training a joint vision-language embedding space enables interaction between text and image data, and is critical for solving problems such as zero-shot learning, visual grounding, and image captioning. While the initial vision-language models were single-stream and processed the concatenated visual and text data as a single input sequence (Li et al., 2019a; Kim et al., 2021), more recent approaches, such as Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), are dual-stream with dedicated encoders for image and text inputs. The representations from the two encoders are projected into a unified embedding space, and a contrastive learning objective is employed to minimize the distance between matching image-caption pairs and maximize it otherwise. Subsequent works have shown that CLIP is scalable, and its capabilities improve when trained on large-scale noisy data of 1 billion non-curated samples (Jia et al., 2021). The representations learned by CLIP have been shown to generalize well across numerous downstream tasks, including image-text retrieval, with excellent zero-shot transfer capabilities (Li et al., 2021a). CLIP can also be flexibly adapted to videos (Yuan et al., 2021; Xu et al., 2021; Fang et al., 2021) and to capture object-level interactions (Yao et al., 2021; Li et al., 2021a; Rasheed et al., 2022). However, the applicability of CLIP representations to CL has not yet been investigated.
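The contrastive objective mentioned above can be written as a symmetric cross-entropy over a batch of matching image-caption pairs. The sketch below is an illustrative re-implementation (assuming the batch embeddings are already L2-normalized), not CLIP's actual training code.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # similarity matrix: entry (i, j) compares image i with caption j
    logits = image_features @ text_features.T / temperature
    # matching image-caption pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img_to_txt = F.cross_entropy(logits, targets)   # image -> text direction
    loss_txt_to_img = F.cross_entropy(logits.T, targets) # text -> image direction
    return (loss_img_to_txt + loss_txt_to_img) / 2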
With recent advances in Vision Transformers (Khan et al., 2021) and prompt-based fine-tuning in NLP (Li & Liang, 2021), Wang et al. have shown that interacting with an ImageNet pre-trained model via prompt learning is a promising approach for continual learning (Wang et al., 2022a;b). A small set of learnable parameters, called prompts, is appended to the input and enables quick adaptation of a frozen ImageNet pre-trained model to new streaming tasks. In our analysis, we show that directly leveraging a pre-trained vision-language model, without introducing any learnable parameters, is a simple yet promising approach to continual learning. We argue that adapting a joint vision-language model like CLIP (Radford et al., 2021) for continual learning presents multiple advantages. It caters for practical scenarios where there are no well-defined task identities and boundaries, and the model is required to dynamically adapt to streaming data in a task-agnostic manner. Moreover, leveraging the CLIP model requires no compute-expensive training or fine-tuning on the data of new tasks. Further, in contrast to the current state-of-the-art methods that require a memory buffer to store training examples from previous tasks, the Continual-CLIP approach is rehearsal-free and is more suitable for scenarios where storing training examples could raise practical privacy and security concerns or storage constraints (Shokri & Shmatikov, 2015). Instead of storing past samples in a memory buffer, where performance can deteriorate for small buffer sizes, the Continual-CLIP approach requires constant memory throughout all the learning episodes. In summary, the Continual-CLIP approach is memory-free, does not require test-time task identity information, and can be flexibly and easily adapted to any number of classes without requiring any additional learnable parameters.
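To make this concrete, the sketch below (not the authors' released code; the prompt template and the assumption of a test loader over all classes seen so far are illustrative) shows how a frozen CLIP model handles a class-incremental step: only the list of seen class names grows, while the weights, memory footprint, and architecture stay fixed.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

seen_class_names = []  # accumulated label space; no exemplar buffer, no new parameters

def evaluate_after_task(task_class_names, test_loader):
    # extend the label space with the classes introduced by the current task
    seen_class_names.extend(task_class_names)
    prompts = clip.tokenize(
        [f"a photo of a {c}." for c in seen_class_names]
    ).to(device)
    correct, total = 0, 0
    with torch.no_grad():
        text_features = model.encode_text(prompts)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        for images, labels in test_loader:  # labels index into seen_class_names
            image_features = model.encode_image(images.to(device))
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            predictions = (image_features @ text_features.T).argmax(dim=-1)
            correct += (predictions.cpu() == labels).sum().item()
            total += labels.numel()
    return correct / total  # accuracy over all classes seen so far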
3 METHODOLOGY
In this section, we first briefly discuss different continual learning settings. Next, we introduce
Contrastive Language-Image Pre-training (CLIP, Radford et al. (2021)) and explain how to apply it
to different downstream continual learning tasks in a zero-shot manner with hand-crafted prompts.