
Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Li & Hoiem, 2017) avoid catastrophic forgetting by limiting the plasticity of the model parameters that are important for previous tasks. Even though these methods do not require memory replay, they have only been shown to work in the simplistic task-incremental setting (where the task identity is assumed to be known at inference time) and on smaller datasets (Wu et al., 2019). Memory replay-based methods use exemplars that are either stored in memory or synthesized with generative techniques, and are effective on more challenging settings and datasets (Rebuffi et al., 2017; Kamra et al., 2017; Buzzega et al., 2020; Cha et al., 2021). However, their performance degrades for smaller memory sizes (Cha et al., 2021), and storing these exemplars can introduce security and privacy concerns (Shokri & Shmatikov, 2015). Architecture-driven CL methods either dynamically expand a network (Rusu et al., 2016; Li et al., 2019b; Zhao et al., 2022) or divide it into sub-networks to accommodate new tasks (Zhao et al., 2022; Wang et al., 2020; Ke et al., 2020; Rajasegaran et al., 2019a). Such approaches lack scalability, since the network capacity grows with the number of tasks.
Vision-language Models: Training a joint vision-language embedding space enables interaction between text and image data, and is critical for solving problems such as zero-shot learning, visual grounding, and image captioning. While the initial vision-language models were single-stream and processed the concatenated visual and text inputs as a single sequence (Li et al., 2019a; Kim et al., 2021), more recent approaches, such as Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021), are dual-stream, with dedicated encoders for image and text inputs. The representations from the two encoders are projected into a unified embedding space, and a contrastive learning objective is employed to minimize the distance between matching image-caption pairs and maximize it between non-matching pairs. Subsequent works have shown that CLIP is scalable, and its capabilities improve when trained on large-scale noisy data of 1 billion non-curated samples (Jia et al., 2021). The representations learned by CLIP generalize well across numerous downstream tasks, including image-text retrieval, with excellent zero-shot transfer capabilities (Li et al., 2021a). CLIP can also be flexibly adapted to videos (Yuan et al., 2021; Xu et al., 2021; Fang et al., 2021) and to capture object-level interactions (Yao et al., 2021; Li et al., 2021a; Rasheed et al., 2022). However, the applicability of CLIP representations to CL has not yet been investigated.
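To make the contrastive objective concrete, the following is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss, assuming the image and text features have already been produced by the two encoders and projected into the shared embedding space; the function and argument names are illustrative rather than taken from the CLIP codebase.

# Minimal sketch of a CLIP-style symmetric contrastive objective (illustrative,
# not the official implementation). Inputs are batch-aligned: the i-th image
# corresponds to the i-th caption.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize both modalities so the dot product is the cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits between every image and every caption in the batch.
    logits_per_image = image_features @ text_features.t() / temperature
    logits_per_text = logits_per_image.t()

    # Matching pairs lie on the diagonal; all other pairs act as negatives.
    targets = torch.arange(image_features.size(0), device=image_features.device)

    # Symmetric cross-entropy: pull matching pairs together, push the rest apart.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2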
With recent advances in Vision Transformers (Khan et al., 2021) and prompt-based fine-tuning in NLP (Li & Liang, 2021), Wang et al. have shown that interacting with an ImageNet pre-trained model via prompt learning is a promising approach for continual learning (Wang et al., 2022a;b). A small set of learnable parameters, called prompts, is appended to the input and enables quick adaptation of a frozen ImageNet pre-trained model to new streaming tasks. In our analysis, we show that directly leveraging a pre-trained vision-language model, without introducing any learnable parameters, is a simple yet promising approach to continual learning. We argue that adapting a joint vision-language model like CLIP (Radford et al., 2021) for continual learning presents multiple advantages. It caters for practical scenarios where there are no well-defined task identities and boundaries, and the model is required to dynamically adapt to streaming data in a task-agnostic manner. Moreover, leveraging the CLIP model requires no compute-expensive training or fine-tuning on the data of new tasks. Further, in contrast to current state-of-the-art methods that require a memory buffer to store training examples from previous tasks, the Continual-CLIP approach is rehearsal-free and is better suited to scenarios where storing training examples raises privacy and security concerns or storage constraints (Shokri & Shmatikov, 2015). Instead of storing past samples in a memory buffer, whose performance can deteriorate for small buffer sizes, the Continual-CLIP approach requires constant memory throughout all learning episodes. In summary, the Continual-CLIP approach is memory-free, does not require test-time task identity information, and can be flexibly and easily adapted to any number of classes without requiring any additional learnable parameters.
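As a concrete illustration, the snippet below sketches how a frozen CLIP model can classify an image against all classes seen so far using only hand-crafted prompts, in the spirit of Continual-CLIP. It relies on the open-source clip package released with Radford et al. (2021); the prompt template, class names, and image path are illustrative assumptions, and extending the class list is the only change needed when new classes arrive, since no parameters are trained.

# Zero-shot, task-agnostic classification with a frozen CLIP model via
# hand-crafted prompts (illustrative sketch; class names and image path are
# placeholders, not from the paper).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Class names accumulated over all learning episodes so far.
seen_classes = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {c}" for c in seen_classes]

with torch.no_grad():
    # Encode one text prompt per seen class and normalize.
    text_tokens = clip.tokenize(prompts).to(device)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Encode the test image and normalize.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity against every class prompt; the most similar prompt wins.
    logits = image_features @ text_features.t()
    prediction = seen_classes[logits.argmax(dim=-1).item()]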
3 METHODOLOGY
In this section, we first briefly discuss different continual learning settings. Next, we introduce Contrastive Language-Image Pre-training (CLIP; Radford et al., 2021) and explain how to apply it to different downstream continual learning tasks in a zero-shot manner with hand-crafted prompts.