Prompting through Prototype: A Prototype-based
Prompt Learning on Pretrained Vision-Language
Models
Yue Zhang, Hongliang Fei, Dingcheng Li, Tan Yu, Ping Li
Cognitive Computing Lab
Baidu Research
10900 NE 8th St. Bellevue, WA 98004, USA
{yuezhang030, feihongliang0, dingchengl, tanyu1503, pingli98}@gmail.com
Abstract
Prompt learning is a new learning paradigm that reformulates downstream tasks as pretraining-style tasks on pretrained models with the help of textual prompts. Recent works have demonstrated that prompt learning is particularly useful for few-shot learning, where training data is limited. Depending on the granularity of the prompts, these methods can be roughly divided into task-level prompting and instance-level prompting. Task-level prompting methods learn one universal prompt for all input samples, which is efficient but ineffective at capturing subtle differences among classes. Instance-level prompting methods learn a specific prompt for each input, which is effective but inefficient. In this work, we develop a novel prototype-based prompt learning method to overcome the above limitations. In particular, we focus on few-shot image recognition tasks on pretrained vision-language models (PVLMs) and develop a method of prompting through prototype (PTP), where we define K image prototypes and K prompt prototypes. In PTP, an image prototype represents the centroid of a certain image cluster in the latent space, and a prompt prototype is a soft prompt in the continuous space. The similarity between a query image and an image prototype determines how much the prediction relies on the corresponding prompt prototype. Hence, in PTP, similar images utilize similar prompts. Through extensive experiments on seven real-world benchmarks, we show that PTP is an effective method to leverage the latent pre-trained knowledge and adapts to various PVLMs. Moreover, through detailed analysis, we discuss the pros and cons of prompt learning and parameter-efficient fine-tuning in the context of few-shot learning.
arXiv:2210.10841v1 [cs.CL] 19 Oct 2022
1 Introduction
Prompt learning (Li and Liang, 2021; Gao et al., 2021b; Sanh et al., 2022) is a new paradigm that reformulates downstream tasks as pretraining-style tasks on pretrained language models (PLMs) with the help of a textual prompt. Compared with the conventional “pre-train, fine-tune” paradigm, prompt learning is particularly useful for few-shot learning, where there is not sufficient training data to fine-tune the whole pre-trained model. Recently, lightweight but effective prompt learning methods have been developed for various few-shot learning tasks (Schick and Schütze, 2021; Gao et al., 2021b; Shin et al., 2020) in natural language processing (NLP), such as few-shot sentiment analysis and natural language inference.
With the success of prompt learning in NLP, it is natural to generalize prompt learning to pretrained vision-language models (PVLMs) (Radford et al., 2021; Kim et al., 2021; Jin et al., 2022b; Zhou et al., 2022b; Tsimpoukelli et al., 2021; Liang et al., 2022; Sanh et al., 2022) for vision-language tasks. In this work, we focus on few-shot image recognition in the prompt learning paradigm, which has not been fully explored in the prompt learning literature. The motivation originates from the fact that PVLMs, such as CLIP (Radford et al., 2021) and ViLT (Kim et al., 2021), are pre-trained with image-text matching and masked language modeling (MLM) style tasks on images and their aligned descriptions. Since class labels in image recognition have a textual form (e.g., “faces”, “Hummer SUV”), the task can be converted into an image-text matching task. For example, one simple manually crafted prompt template could be “a photo of a [CLASS]”, where [CLASS] is replaced by each candidate category name. The PVLM matches the query image with all the prompted candidate category names and chooses the one with the highest matching score.
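To make this concrete, the following is a minimal sketch of such prompt-based zero-shot classification using the open-source CLIP package; the model variant, class names, and image path are illustrative placeholders rather than the exact setup used in this paper.

```python
# A minimal sketch of prompt-based zero-shot image recognition with CLIP.
# Assumes the open-source `clip` package (github.com/openai/CLIP) and PIL are installed;
# the class names and image path below are illustrative placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["faces", "Hummer SUV", "airplane"]        # candidate category names
prompts = [f"a photo of a {c}" for c in class_names]     # manual prompt template

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # cosine similarity between the query image and each prompted category name
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.T).squeeze(0)

pred = class_names[scores.argmax().item()]  # category with the highest matching score
```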
Similar to NLP, the essence of prompt learning for PVLMs is designing the most appropriate prompts for the downstream tasks. The latest methods to construct prompts include: i) manually crafted prompts (Petroni et al.; Jin et al., 2022b), where researchers manually create intuitive templates based on human introspection; ii) automatically searched prompts (Shin et al., 2020; Zhong et al., 2021; Zhou et al., 2022b), where researchers search over the discrete input token space or a continuous embedding space for prompts that elicit correct predictions on the training set; iii) instance-level prompt learning (Zhou et al., 2022a; Rao et al., 2022; Jin et al., 2022a), where, instead of one universal prompt that works for all inputs, an instance-level prompt is learned conditional on the given input. Although manually written prompts are interpretable, they are limited by manual effort and might not be optimal for eliciting correct predictions. The automated approaches overcome the limitations of manual prompts by training a statistical model, but they learn one universal prompt per task, which may be sub-optimal. Instance-level prompt learning methods learn different prompts conditional on the given inputs; however, they usually need to maintain a complex neural module mapping inputs into prompts, which makes them work poorly in few-shot settings.
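For illustration, below is a rough sketch of what a task-level prompt learned in continuous embedding space (a soft prompt, in the style of Zhou et al., 2022b) can look like; the module name, context length, and embedding dimension are assumptions for exposition only.

```python
# A rough sketch of a task-level soft prompt: a small set of learnable context vectors
# prepended to each class-name embedding and tuned while the PVLM stays frozen.
# The names and dimensions (n_ctx, embed_dim) are illustrative assumptions.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_ctx: int = 16, embed_dim: int = 512):
        super().__init__()
        # one universal prompt shared by all classes and all inputs
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (num_classes, n_tokens, embed_dim) token embeddings of class names
        num_classes = class_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(num_classes, -1, -1)
        # prepend the learned context vectors to every class-name token sequence
        return torch.cat([ctx, class_embeddings], dim=1)
```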
Meanwhile, besides prompt learning on PVLMs, researchers are also exploring parameter-efficient fine-tuning methods for few-shot learning, such as linear probing (Tian et al., 2020), Adapter (Houlsby et al., 2019), BitFit (Zaken et al., 2022), and Calibration (Zhao et al., 2021), where only a small set of parameters of the pre-trained model is fine-tuned. These works have demonstrated superior performance when training samples are not very scarce. Our experimental study, however, shows that their accuracy decreases significantly when #shots ≤ 4, as the limited training samples restrict the learning and generalization capability of fine-tuning.
There are two considerations when designing an elegant prompt learning method on PVLMs for few-shot learning. First, the method should be generic and easily adaptable to different architectures, such as the bi-encoder CLIP (Radford et al., 2021) and the single-encoder ViLT (Kim et al., 2021). Second, the prompt learning method should be lightweight and competitive with, or even outperform, parameter-efficient fine-tuning methods.
In this work, we propose our model, Prompting through Prototype (PTP), a prototype-based prompt learning method on PVLMs to effectively solve downstream few-shot image recognition tasks. Based on the observations that 1) aligned image-text pairs have high matching scores, and 2) similar images are close to each other in the embedding space of PVLMs, we hypothesize that similar images should use similar prompts in prompt learning. Observation 1) holds because image-text matching is one of the pre-training objectives of vision-language models; as a result, PVLMs have remarkable zero-shot performance on image-text matching, and aligned image-text pairs naturally receive high matching scores. Observation 2) is verified in our experiments.
Intuitively, assume that the training images can be coarsely divided into K clusters based on the similarity between their latent embedding vectors; then each cluster can have its own textual prompt used for category name (label word) prompting. Based on this hypothesis, we define K prototype components, where each prototype component contains an image prototype and a prompt prototype. In our context, an image prototype is a point in the image latent space representing the centroid of a certain cluster. The similarity between a query image and an image prototype determines how much this query image's category prediction relies on the corresponding prompt prototype. The final prediction is the weighted summation of the prediction scores obtained with the different prompt prototypes.
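The following is a simplified sketch of this prototype-weighted prediction; the exact similarity measure, softmax temperature, and the per-prompt scoring function `score_with_prompt` are illustrative assumptions rather than PTP's precise formulation.

```python
# A simplified sketch of prototype-weighted prediction as described above.
# Assumes K image prototypes (vectors in the image latent space), K prompt prototypes,
# and a scoring function that matches one image against all classes under one prompt;
# the cosine similarity, temperature, and `score_with_prompt` are illustrative assumptions.
import torch
import torch.nn.functional as F

def ptp_predict(image_feat, image_prototypes, prompt_prototypes, score_with_prompt, tau=1.0):
    """image_feat: (d,) query-image embedding; image_prototypes: (K, d);
    prompt_prototypes: list of K soft prompts; returns per-class scores of shape (C,)."""
    # similarity between the query image and each image prototype -> prototype weights
    sims = F.cosine_similarity(image_feat.unsqueeze(0), image_prototypes, dim=-1)  # (K,)
    weights = F.softmax(sims / tau, dim=0)                                          # (K,)
    # class scores obtained with each prompt prototype, combined by the prototype weights
    per_prototype = torch.stack(
        [score_with_prompt(image_feat, p) for p in prompt_prototypes]               # (K, C)
    )
    return (weights.unsqueeze(1) * per_prototype).sum(dim=0)                        # (C,)
```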
We summarize our contributions as follows.
We propose a novel prompt learning method, PTP, on PVLMs to overcome the drawbacks of task-level (manual/auto-searched) and instance-level prompting. Instead of designing a universal prompt regardless of instances (Shin et al., 2020; Zhou et al., 2022b,a) or an instance-specific prompt for each instance (Zhou et al., 2022a; Rao et al., 2022), we develop a prototype-based prompting method, wherein similar query images utilize similar prompts. During training, we only update the parameters related to prompting while freezing the weights of the PVLM to ensure a lightweight and efficient model.
We conduct extensive experiments on 7 real-world benchmarks across 2 types of PVLMs and show that PTP makes full use of the pre-trained knowledge in downstream few-shot image recognition tasks. The absolute improvement in average accuracy over auto-searched prompts (Zhou et al., 2022a), averaged over all experiments, is around 4% for 1/2-shot, 5% for 4-shot, 6% for 8-shot, and 7% for 16-shot.
We make empirical analyses of prompting versus fine-tuning and reveal that both methods have their advantages and limitations. In particular, good prompt learning performance relies heavily on the knowledge acquired during pre-training: a prompt learning method will have difficulty triggering the correct answers if the PVLM itself lacks the relevant visual or textual knowledge. Through a detailed hyper-parameter analysis, we show how to choose the number of prototypes based on performance and parameter efficiency. We also show the importance of our novel regularizers for learning the image prototypes.
2 Related Work
2.1 Pretrained Vision-and-Language Models
Recently, many vision-language models have been proposed. Large-scale pre-training allows PVLMs to transfer zero-shot to various downstream classification tasks. They can be coarsely divided into two groups based on their architecture: bi-encoder models (Radford et al., 2021; Jia et al., 2021) and single-encoder models (Kim et al., 2021; Lu et al., 2019). Bi-encoder models, such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), consist of two encoders, one for images and the other for text. This work uses CLIP as the representative bi-encoder model; it has remarkable zero-shot performance on image-text retrieval. By default, CLIP uses “a photo of [CLASS]” on the text side for image recognition tasks.
Single-encoder models, such as ViLBERT (Lu et al., 2019) and ViLT (Kim et al., 2021), concatenate the object features from the image and the word features from the sentence into one long sequence, so the two modalities interact with each other in the self-attention layers. This work uses ViLT as the representative single-encoder model.
2.2 Few-shot Learning
Parameter-Efficient Fine-tuning.
Parameter-efficient fine-tuning methods mainly include: i) Adapters (Houlsby et al., 2019; Gao et al., 2021a; Zhang et al., 2021), where neural network layers are inserted between the feed-forward portions of the Transformer architecture; ii) BitFit (Zaken et al., 2022; IV et al., 2022), where only the bias terms inside the Transformer are updated; iii) Calibration (Zhao et al., 2021), where an affine transformation is learned on top of the logits output by the Transformer; iv) Linear probe (Tian et al., 2020), where a linear classifier is trained on top of the pre-trained model's features.
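As a concrete example of how little is trained in such methods, here is a minimal sketch of a linear probe on frozen pre-trained features; the precomputed feature tensors and hyper-parameters are assumptions for illustration.

```python
# A minimal sketch of linear probing: a single linear classifier trained on frozen
# pre-trained features, with the PVLM itself never updated. The feature tensors
# (train_feats: float, train_labels: long) are assumed to be precomputed with a frozen encoder.
import torch
import torch.nn as nn

def train_linear_probe(train_feats, train_labels, num_classes, epochs=100, lr=1e-3):
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(train_feats), train_labels)
        loss.backward()
        opt.step()
    return probe  # only these (d + 1) * num_classes parameters are learned
```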
Prompt Learning Methods.
Recently, multiple prompt learning works on PVLMs have been proposed (Jin et al., 2022b; Zhou et al., 2022b; Tsimpoukelli et al., 2021; Liang et al., 2022; Rao et al., 2022). Jin et al. (2022b) first pre-trained a prompt-aware vision-language model and then transferred it to downstream tasks, such as VQA, with the help of hand-crafted prompts. Zhou et al. (2022b) learned universal soft prompts for downstream few-shot image classification tasks. Tsimpoukelli et al. (2021) developed an image-to-text generation model with a dynamic prefix to control the generation. Liang et al. (2022) learned soft prompts to align the different modalities. Rao et al. (2022) learned instance-aware prompts for dense prediction.
In this work, we focus on designing an efficient and effective prompt learning method on PVLMs for downstream few-shot image classification tasks. We leverage prototype-based prompting. Our image prototypes are similar in concept and usage to the “this looks like that” idea of previous works (Li et al., 2018; Chen et al., 2019), which learn and utilize prototypes to make interpretable predictions.
3 Methodology
3.1 Problem Setup
We define a few-shot image recognition training dataset as $\mathcal{D}=\{(x_i, y_i, c_i)\}_{i=1}^{N}$, where $x_i$ is the input image, $y_i$ is the corresponding discrete label, and $c_i$ is the corresponding category name, e.g., “faces” or “Hummer SUV”. We define the candidate pool of category names as $\mathcal{C}=\{c_j\}_{j=1}^{C}$, where $C$ is the total number of categories. Given a pre-trained vision-language model (PVLM) and a few-shot training dataset $\mathcal{D}$, our task is to solve the downstream few-shot image classification task via the prompt learning paradigm.
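As an illustration of this setup, the sketch below samples such a few-shot training set with a fixed number of shots per category; the structure of the base dataset and the `category_names` list are assumptions for exposition only.

```python
# A small sketch of sampling a few-shot training set D = {(x_i, y_i, c_i)} with a fixed
# number of shots per category. `full_dataset` (a list of (image, label) pairs) and
# `category_names` (label index -> textual category name) are illustrative assumptions.
import random
from collections import defaultdict

def sample_few_shot(full_dataset, category_names, shots=4, seed=0):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for image, label in full_dataset:
        by_label[label].append(image)
    few_shot = []
    for label, images in by_label.items():
        for image in rng.sample(images, min(shots, len(images))):
            # each example keeps its image x_i, discrete label y_i, and category name c_i
            few_shot.append((image, label, category_names[label]))
    return few_shot
```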