for the first time. This is in contrast with prior soft-prompt
learning methods that only capture V&L interactions.
Key contributions: Based on the above, we propose a
novel framework for soft prompt learning which we call
Language-Aware Soft Prompting (LASP). Our main con-
tributions within the LASP framework are as follows:
• We propose, for the first time, language-only optimization for V&L model adaptation. Specifically, we propose a novel text-to-text cross-entropy loss that maximizes the probability of the learned prompts being correctly classified with respect to the hand-engineered ones, and we show its effectiveness in alleviating base-class overfitting (a minimal sketch of this loss is given after this list).
• To increase the representation capacity of the prompts,
and inspired by grouped convolution and multi-head at-
tention, we propose a grouped language-aware prompt
representation where each group of prompts specializes
to a different subset of the pre-defined manual templates.
• We identify a visual-language misalignment introduced by prompt learning and LASP that impacts generalization. More importantly, we propose a re-calibration mechanism to address it, based on (a) Layer Normalization fine-tuning and (b) learning a class-agnostic bias.
• Thanks to our language-only learning framework, we pro-
pose training LASP with virtual classes by including, dur-
ing training, class names for which no visual samples are
available. Importantly, we show that this further increases
the robustness of the learned prompts.
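To make the first contribution concrete, the following is a minimal PyTorch-style sketch of a text-to-text cross-entropy of the kind described above; the function name, the temperature value and the use of cosine similarities are illustrative assumptions rather than the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def text_to_text_loss(learned_feats, handcrafted_feats, tau=0.07):
    """Text-to-text cross-entropy (illustrative sketch).

    learned_feats:     (C, D) text-encoder outputs of the learned prompts, one row per class
    handcrafted_feats: (C, D) text-encoder outputs of the hand-engineered prompts
    The learned prompt of class c should be "classified" as class c against
    the hand-engineered class embeddings.
    """
    learned = F.normalize(learned_feats, dim=-1)
    handcrafted = F.normalize(handcrafted_feats, dim=-1)
    logits = learned @ handcrafted.t() / tau                      # (C, C) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # class c should match class c
    return F.cross_entropy(logits, targets)
```

Because this objective involves only text features, rows corresponding to virtual classes (class names without any visual samples) can be appended to both inputs, which is what makes the fourth contribution possible; the grouped representation of the second contribution can be read as applying the same idea per prompt group, each specializing to its own subset of the hand-crafted templates.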
Main results: Our methods set a new state of the art for few-shot and zero-shot image classification on 11 datasets, significantly outperforming all prior soft-prompting works. Importantly, we present, for the first time, a prompt learning method that outperforms, on the majority of the test datasets (8 out of 11), the very strong baseline based on hand-crafted prompts and CLIP for the recognition of novel classes (i.e. the zero-shot setting).
2. Related work
Contrastive V&L Models: Recently, large-scale V&L pre-training with contrastive learning has been used to train foundation models, resulting in robust representations that transfer to new tasks under both few-shot and zero-shot settings [13,18,24,33,34]. Such networks consist of a vi-
sion encoder (typically a ViT [8]) and a Transformer-based
text encoder [30]. Highly parameterized instantiations of
such architectures are trained on large corpora of image-
caption pairs (e.g. [24] uses 400M and [13] 1B pairs) using
contrastive learning. We use CLIP [24] as the foundation model for our method.
Prompt Learning is about adapting pre-trained foundation models to (downstream) tasks, typically in a zero-shot or few-shot setting. First proposed in the context of Language Models (LMs), prompting initially amounted to prepending hand-crafted instructions/examples to the task input so that the LM generates the appropriate output conditioned on the input [4,25]. In [27,28], the main idea is
to reformulate the downstream task as a cloze task using
hand-crafted patterns (or templates), thus avoiding the need
to train a task-specific classifier. As finding the optimal pat-
terns is laborious, recent works have attempted to address
this by learning a set of soft (continuous) prompts [16,17].
In V&L foundation models, like CLIP, the class names
are used to create hand-crafted prompts [24] that are fed as
input to the text encoder, enabling zero-shot visual recogni-
tion. CoOp [36] extends work on soft prompt optimization to the V&L domain by learning a set of M prompts which are used as input to the text encoder alongside the class name. The prompts are learned by minimizing the classification error on a training set consisting of the given base classes. One major limitation of CoOp is weak generaliza-
tion: the learned prompts overfit the base classes and do not
work well when tested on novel classes. To alleviate this,
CoCoOp [35] proposes a dynamic version of [36] where a
small network is trained to produce a visual feature from
the input image that is added to the learned prompts, hence
making them input specific (i.e. dynamic). ProDA [19]
adopts a probabilistic approach by modelling the distribu-
tion of the prompts at the output of the text encoder as
a multivariate Gaussian distribution. The estimated mean
is used during inference. UPL [12] uses CLIP to generate pseudo-labels on the target dataset and then applies self-training to learn the soft prompts. Finally, ProGrad [37]
aims to adapt the V&L model to each target domain by en-
couraging it “not to forget” CLIP’s zero-shot predictions us-
ing a KL visual-text loss between CLIP's logits and their
model’s logits (i.e. they use visual features). The weights
are then updated in the direction perpendicular to CLIP gra-
dients. In contrast, our loss is a pure text-to-text loss, fur-
ther allowing for the incorporation of virtual classes. Un-
like [37], we outperform CLIP on novel classes.
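To ground the discussion above, here is a schematic PyTorch-style sketch of the shared soft-prompt formulation that CoOp [36] and its successors build on: M learnable context vectors are prepended to the embedded class name and passed through the frozen text encoder, and the prompts are trained by minimizing a classification loss against frozen image features. All names, shapes and the temperature are assumptions made for illustration, not the exact implementation of any of the cited methods.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPrompts(nn.Module):
    """Schematic CoOp-style soft prompts: M learnable context vectors shared
    across all classes, prepended to each embedded class name and passed
    through a frozen text encoder (encoder interface and shapes are assumed)."""

    def __init__(self, text_encoder, class_token_embs, M=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder          # frozen; assumed to map (C, M + L, dim) -> (C, D)
        self.class_token_embs = class_token_embs  # (C, L, dim) embedded class names, kept frozen
        self.ctx = nn.Parameter(torch.randn(M, dim) * 0.02)  # the learnable prompts

    def class_features(self):
        C = self.class_token_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # (C, M, dim)
        tokens = torch.cat([ctx, self.class_token_embs], dim=1)  # prepend prompts to class names
        return F.normalize(self.text_encoder(tokens), dim=-1)    # (C, D) per-class text features

def prompt_classification_loss(prompts, image_feats, labels, tau=0.07):
    """Minimize classification error on the base classes: image features
    (from the frozen image encoder) are matched against the prompt-generated
    class features via cosine similarity."""
    text_feats = prompts.class_features()
    logits = F.normalize(image_feats, dim=-1) @ text_feats.t() / tau  # (B, C)
    return F.cross_entropy(logits, labels)
```

In these terms, the text-to-text loss sketched earlier operates purely on the output of class_features() and a second set of hand-crafted class features, which is what removes the dependence on visual samples for that part of training and allows virtual classes to be added.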
The proposed LASP framework alleviates base class
overfitting and significantly improves upon the previously
reported best results without resorting to a dynamic ap-
proach as in CoCoOp [35]. In its basic version, LASP de-
ploys a text-to-text loss that enforces the learned prompts to
be “close” to a set of manually defined textual prompts in
the text encoder space. Importantly, the basic LASP can be extended in three ways: (1) by allowing the incor-
poration of virtual classes, i.e. novel class name information
for which no (visual) training data is available (LASP-V).
This is shown to significantly improve the robustness of the
learned prompts at no extra cost during inference; (2) by al-
lowing the use of a grouped prompt representation within
the proposed language-aware training which is shown to in-
crease the representation capacity of the learned prompts;
(3) by performing further optimization of the visual encoder