1 Introduction
Prompt learning (Li and Liang, 2021; Gao et al., 2021b; Sanh et al., 2022) is a new paradigm that reformulates
downstream tasks as the pretraining tasks of pretrained language models (PLMs) with the help of a
textual prompt. Compared with the conventional "pre-train, fine-tune" paradigm, prompt learning is
particularly useful for few-shot learning, where there is insufficient training data to fine-tune the whole
pre-trained model. Recently, lightweight but effective prompt learning methods have been developed for
various few-shot learning tasks (Schick and Schütze, 2021; Gao et al., 2021b; Shin et al., 2020) in natural
language processing (NLP), such as few-shot sentiment analysis and natural language inference.
With the success of prompt learning in NLP, it is natural to generalize prompt learning to pretrained
vision-language models (PVLMs) (Radford et al., 2021; Kim et al., 2021; Jin et al., 2022b; Zhou et al.,
2022b; Tsimpoukelli et al., 2021; Liang et al., 2022; Sanh et al., 2022) for vision-language tasks. In this work,
we focus on few-shot image recognition in the prompt learning paradigm, which has not been fully explored
in prompt learning research. The motivation originates from the fact that
PVLMs, such as CLIP (Radford et al., 2021) and ViLT (Kim et al., 2021), are pre-trained with image-text
matching and masked language modeling (MLM) style tasks on images and their aligned descriptions. Since
class labels in image recognition have a textual form (e.g., "faces", "Hummer SUV"), the task can
be converted into an image-text matching task. For example, one simple manually crafted prompt template could
be "a photo of a [CLASS]", where [CLASS] is replaced by each candidate category name. The PVLM
matches the query image against all the prompted candidate category names and chooses the one with the
highest matching score.
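To make this concrete, the following is a minimal sketch of zero-shot recognition as image-text matching with the open-source `clip` package (Radford et al., 2021); the class names, image path, and model variant are illustrative assumptions, not choices made in this paper.

```python
# A minimal sketch (not this paper's method): zero-shot image recognition
# cast as image-text matching with the open-source CLIP package.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # illustrative model variant

class_names = ["faces", "Hummer SUV", "golden retriever"]  # illustrative categories
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # hypothetical image path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # cosine similarity between the query image and each prompted class name
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.T).squeeze(0)

print(class_names[scores.argmax().item()])  # class with the highest matching score
```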
Similar to NLP, the essence of prompt learning for PVLMs is designing the most appropriate prompts for
the downstream tasks. The latest methods to construct prompts include: i) manually crafted prompts (Petroni
et al.; Jin et al., 2022b), where researchers manually create intuitive templates based on human introspection;
ii) automatically searched prompts (Shin et al., 2020; Zhong et al., 2021; Zhou et al., 2022b), where
researchers search over the discrete input token space or a continuous embedding space for prompts that elicit
correct predictions on the training set (see the sketch after this paragraph); iii) instance-level prompt learning (Zhou et al., 2022a; Rao et al., 2022;
Jin et al., 2022a), where instead of learning one universal prompt that works for all inputs, they learn
instance-level prompts conditioned on the given input. Although manually written prompts are interpretable,
they require manual effort and might not be optimal for eliciting correct predictions. The automated
approaches overcome the limitations of manual prompts by training a statistical model, but they learn one
universal prompt per task, which may be sub-optimal. Instance-level prompt learning
methods learn different prompts conditioned on the given inputs; however, they usually need to maintain
a complex neural module that maps inputs to prompts, which makes them perform poorly in few-shot
settings.
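As an illustration of searching a continuous embedding space, below is a minimal sketch of a CoOp-style learnable prompt (Zhou et al., 2022b): a few context vectors are prepended to each class-name token embedding and are the only trainable parameters, while the PVLM stays frozen. The module name and sizes (n_ctx, embed_dim) are illustrative assumptions, not from the cited work.

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Minimal CoOp-style sketch: learnable context vectors shared by all classes."""

    def __init__(self, n_ctx: int = 4, embed_dim: int = 512):  # illustrative sizes
        super().__init__()
        # the only trainable parameters; the PVLM itself stays frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        # class_embeds: (n_classes, n_tokens, embed_dim) token embeddings of class names
        ctx = self.ctx.unsqueeze(0).expand(class_embeds.size(0), -1, -1)
        # prepend the shared context vectors to every class-name sequence; the
        # result is fed to the frozen text encoder in place of a manual template
        return torch.cat([ctx, class_embeds], dim=1)
```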
Meanwhile, besides prompt learning on PVLMs, researchers are also exploring parameter-efficient
fine-tuning methods for few-shot learning, such as linear probing (Tian et al., 2020), Adapter (Houlsby et al.,
2019), BitFit (Zaken et al., 2022) and Calibration (Zhao et al., 2021), where only a small set
of the pre-trained model's parameters is fine-tuned. These works have demonstrated superior performance when training
samples are not very scarce. Our experimental study, however, shows that accuracy decreases significantly
when #shots ≤ 4, as the limited training samples restrict the learning and generalization capability of
fine-tuning.
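For contrast with prompt learning, a parameter-efficient method touches the model weights themselves; below is a minimal sketch of the BitFit idea (Zaken et al., 2022), freezing everything except bias terms. The helper name apply_bitfit is ours, introduced only for illustration.

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """BitFit-style sketch: freeze all parameters except bias terms."""
    for name, param in model.named_parameters():
        # only parameters whose names end in "bias" stay trainable
        param.requires_grad = name.endswith("bias")
```

After calling apply_bitfit on a pre-trained model, an optimizer built over the remaining trainable parameters updates only the bias vectors, a tiny fraction of the full model.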
There are two considerations when designing an elegant prompt learning method on PVLMs for few-shot
learning. First, the method should be generic and easily adaptable to different architectures, such as
the bi-encoder CLIP (Radford et al., 2021) and the single-encoder ViLT (Kim et al., 2021). Second, the
prompt learning method should be lightweight and competitive with, or even superior to, parameter-efficient
fine-tuning methods.