1 Introduction
Prompt learning (Li and Liang, 2021; Gao et al., 2021b; Sanh et al., 2022) is a new paradigm that reformulates
downstream tasks as the pretraining tasks of pretrained language models (PLMs) with the help of a
textual prompt. Compared with the conventional "pre-train, fine-tune" paradigm, prompt learning is
particularly useful for few-shot learning, where there is insufficient training data to fine-tune the whole
pre-trained model. Recently, lightweight but effective prompt learning methods have been developed for
various few-shot learning tasks (Schick and Schütze, 2021; Gao et al., 2021b; Shin et al., 2020) in natural
language processing (NLP), such as few-shot sentiment analysis and natural language inference.
With the success of prompt learning in NLP, it is natural to generalize prompt learning to pretrained
vision-language models (PVLMs) (Radford et al., 2021; Kim et al., 2021; Jin et al., 2022b; Zhou et al.,
2022b; Tsimpoukelli et al., 2021; Liang et al., 2022; Sanh et al., 2022) for vision-language tasks. In this work,
we focus on few-shot image recognition in the prompt learning paradigm, which has not been fully explored
in prompt learning research. The motivation originates from the fact that
PVLMs, such as CLIP (Radford et al., 2021) and ViLT (Kim et al., 2021), are pre-trained with image-text
matching and masked language modeling (MLM) style tasks on images and their aligned descriptions. Since
class labels in image recognition have a textual form (e.g., "faces", "Hummer SUV"), the task can
be converted into an image-text matching task. For example, one simple manually crafted prompt template could
be "a photo of a [CLASS]", where [CLASS] is replaced by each candidate category name. The PVLM
matches the query image against all the prompted candidate category names and chooses the one with the
highest matching score.
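To make this concrete, the following is a minimal sketch of zero-shot recognition as image-text matching with the open-source `clip` package (Radford et al., 2021); the class names, image path, and model variant are illustrative assumptions, not choices made in this paper.

```python
# A minimal sketch (not this paper's method): zero-shot image recognition
# cast as image-text matching with the open-source CLIP package.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # illustrative model variant

class_names = ["faces", "Hummer SUV", "golden retriever"]  # illustrative categories
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # hypothetical image path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # cosine similarity between the query image and each prompted class name
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.T).squeeze(0)

print(class_names[scores.argmax().item()])  # class with the highest matching score
```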
Similar to NLP, the essence of prompt learning for PVLMs is designing the most appropriate prompts for
the downstream tasks. The latest methods to construct prompts include: i) manually crafted prompts (Petroni
et al.; Jin et al., 2022b), where researchers manually create intuitive templates based on human introspection;
ii) automatically searched prompts (Shin et al., 2020; Zhong et al., 2021; Zhou et al., 2022b), where
researchers search over the discrete input token space or a continuous embedding space for prompts that elicit
correct predictions on the training set (see the sketch after this paragraph); iii) instance-level prompt learning (Zhou et al., 2022a; Rao et al., 2022;
Jin et al., 2022a), where instead of learning one universal prompt that works for all inputs, they learn
instance-level prompts conditioned on the given input. Although manually written prompts are interpretable,
they require manual effort and might not be optimal for eliciting correct predictions. The automated
approaches overcome the limitations of manual prompts by training a statistical model, but they learn one
universal prompt per task, which may be sub-optimal. Instance-level prompt learning
methods learn different prompts conditioned on the given inputs; however, they usually need to maintain
a complex neural module that maps inputs to prompts, which makes them perform poorly in few-shot
settings.
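As an illustration of searching a continuous embedding space, below is a minimal sketch of a CoOp-style learnable prompt (Zhou et al., 2022b): a few context vectors are prepended to each class-name token embedding and are the only trainable parameters, while the PVLM stays frozen. The module name and sizes (n_ctx, embed_dim) are illustrative assumptions, not from the cited work.

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Minimal CoOp-style sketch: learnable context vectors shared by all classes."""

    def __init__(self, n_ctx: int = 4, embed_dim: int = 512):  # illustrative sizes
        super().__init__()
        # the only trainable parameters; the PVLM itself stays frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        # class_embeds: (n_classes, n_tokens, embed_dim) token embeddings of class names
        ctx = self.ctx.unsqueeze(0).expand(class_embeds.size(0), -1, -1)
        # prepend the shared context vectors to every class-name sequence; the
        # result is fed to the frozen text encoder in place of a manual template
        return torch.cat([ctx, class_embeds], dim=1)
```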
Meanwhile, besides prompt learning on PVLMs, researchers are also exploring parameter-efficient
fine-tuning methods for few-shot learning, such as linear probing (Tian et al., 2020), Adapter (Houlsby et al.,
2019), BitFit (Zaken et al., 2022) and Calibration (Zhao et al., 2021), where only a small set
of the pre-trained model's parameters is fine-tuned. These works have demonstrated superior performance when training
samples are not very scarce. Our experimental study, however, shows that accuracy decreases significantly
when #shots ≤ 4, as the limited training samples restrict the learning and generalization capability of
fine-tuning.
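For contrast with prompt learning, a parameter-efficient method touches the model weights themselves; below is a minimal sketch of the BitFit idea (Zaken et al., 2022), freezing everything except bias terms. The helper name apply_bitfit is ours, introduced only for illustration.

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """BitFit-style sketch: freeze all parameters except bias terms."""
    for name, param in model.named_parameters():
        # only parameters whose names end in "bias" stay trainable
        param.requires_grad = name.endswith("bias")
```

After calling apply_bitfit on a pre-trained model, an optimizer built over the remaining trainable parameters updates only the bias vectors, a tiny fraction of the full model.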
There are two considerations when designing an elegant prompt learning method on PVLMs for few-shot
learning. First, the method should be generic and easily adaptable to different architectures, such as
the bi-encoder CLIP (Radford et al., 2021) and the single-encoder ViLT (Kim et al., 2021). Second, the
prompt learning method should be lightweight and competitive with, or even superior to, parameter-efficient
fine-tuning methods.