Bayesian Prompt Learning for Image-Language Model Generalization
Mohammad Mahdi Derakhshani1*, Enrique Sanchez5, Adrian Bulat3,5, Victor Guilherme Turrisi da Costa2*, Cees G. M. Snoek1†, Georgios Tzimiropoulos4,5†, Brais Martinez5†
1University of Amsterdam  2University of Trento  3Technical University of Iasi  4Queen Mary University of London  5Samsung AI Cambridge
Abstract
Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning
1. Introduction
In the continuous quest for better pre-training strategies, models based on image and language supervision have set impressive milestones, with CLIP [38], ALIGN [22] and Flamingo [1] being leading examples. Contrastively trained image-language models consist of image and text encoders that align semantically-related concepts in a joint embedding space. Such models offer impressive zero-shot image classification by using the text encoder to generate classifier weights from arbitrary, newly defined categories without relying on any visual data. In particular, the class name is used within a handcrafted prompt template and then tokenized and encoded into the shared embedding space to generate new classifier weights. Rather than manually defining prompts, both Lester et al. [28] and Zhou et al. [55] demonstrated that prompts can instead be optimized in a data-driven manner through backpropagation. However, as prompt learning typically has access to only a few training examples per prompt, overfitting to the seen prompts in lieu of the unseen prompts is common [55]. In this paper, we strive to mitigate the overfitting behavior of prompt learning so as to improve generalization to unseen prompts.

Figure 1: We present a Bayesian perspective on prompt learning by formulating it as a variational inference problem (right column). Our framework models the prompt space as an a priori distribution, which makes our proposal compatible with common prompt learning approaches that are unconditional (top) or conditional on the image (bottom).

*Most work done during an internship at Samsung AI Cambridge. †Equal advising.
Others before us have considered the generalization problem in prompt learning as well, e.g., [55, 54], but they all optimize a so-called Empirical Risk Minimization objective. It is, however, well known that Empirical Risk Minimization based models degrade drastically when training and testing distributions differ [37, 2]. To relax the i.i.d. assumption, Peters et al. [37] suggest exploiting the "invariance principle" for better generalization. Unfortunately, Invariant Risk Minimization methods for deep neural networks as of yet fail to deliver competitive results, as observed in [15, 31, 30]. To alleviate this limitation, Lin et al. [30] propose a Bayesian treatment of Invariant Risk Minimization that mitigates overfitting in deep models by defining a regularization term over the posterior distribution of classifiers, minimizing this term, and pushing the model's backbone to learn invariant features. We take inspiration from this Bayesian Invariant Risk Minimization [30] and propose the first Bayesian prompt learning approach.
We make three contributions. First, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem (see Figure 1). This formulation provides several benefits: it naturally injects noise during prompt learning and induces a regularization term that encourages the model to learn informative prompts, scattered in the prompt space, for each downstream task; as a direct result, we regularize the prompt space, reduce overfitting to seen prompts, and improve generalization on unseen prompts. Second, our framework models the input prompt space in a probabilistic manner, as an a priori distribution, which makes our proposal compatible with prompt learning approaches that are unconditional [55] or conditional on the image [54]. Third, we empirically demonstrate on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features, leading to better generalization of unseen prompts, even across different datasets and domains.
2. Related Work
Prompt learning in language. Prompt learning was originally proposed within natural language processing (NLP), by models such as GPT-3 [4]. Early methods constructed prompts by combining words in the language space such that the model would perform better on downstream evaluation [46, 24]. Li and Liang [29] prepend a set of learnable prompts to the different layers of a frozen model and optimize them through back-propagation. In parallel, Lester et al. [28] demonstrate that, with no intermediate layer prefixes or task-specific output layers, adding prefixes to the input of the frozen model alone is enough to compete with fine-tuning. Alternatively, [16] uses a HyperNetwork to conditionally generate task-specific and layer-specific prompts prepended to the values and keys inside the self-attention layers of a frozen model. Inspired by this progress in NLP, we propose a prompt learning method intended for image and language models.
Prompt learning in image and language. Zhou et al. [55] propose Context Optimization (CoOp), a prompt learner for CLIP, which optimizes prompts in the continuous space through back-propagation. The work demonstrates the benefit of prompt learning over prompt engineering. While CoOp obtains good accuracy on prompts seen during training, it has difficulty generalizing to unseen prompts. This motivated Zhou et al. to introduce Conditional Context Optimization (CoCoOp) [54], which generates instance-specific prompt residuals through a conditioning mechanism dependent on the image data and generalizes better. ProGrad by Zhu et al. [56] also strives to bridge the generalization gap by matching the gradient of the prompt to the general knowledge of the CLIP model to prevent forgetting. Alternative directions consist of test-time prompt learning [47], where consistency across multiple views is the supervisory signal, and unsupervised prompt learning [21], where a pseudo-labeling strategy drives the prompt learning. More similar to ours is ProDA by Lu et al. [32], who propose an ensemble of a fixed number of hand-crafted prompts and model the distribution exclusively within the language embedding. Conversely, we prefer to model the input prompt space rather than relying on a fixed number of templates, as it provides us with a mechanism to cover the prompt space by sampling. Moreover, our approach is not limited to unconditional prompt learning from the language embedding but, like Zhou et al. [54], also allows for prompt learning conditioned on an image.
Prompt learning in vision and language. While beyond our current scope, it is worth noting that prompt learning has been applied to a wider range of vision problems and scenarios, which highlights its power and flexibility. Among them are important topics such as unsupervised domain adaptation [13], multi-label classification [49], video classification [25], object detection [9, 12] and pixel-level labelling [40]. Finally, prompt learning has also been applied to vision-only models [23, 44], providing an efficient and flexible means to adapt pre-trained models.
Variational inference in computer vision. Variational inference and, more specifically, variational autoencoder variants have been extensively applied to computer vision tasks as diverse as image generation [39, 43, 42], action recognition [34], instance segmentation [20], few-shot learning [53, 45], domain generalization [10], and continual learning [7]. For example, Zhang et al. [53] focus on reducing noise vulnerability and estimation bias in few-shot learning through variational inference. In the same vein, Du et al. [10] propose a variational information bottleneck to better manage prediction uncertainty and unknown domains. Our proposed method shares the advantages of variational inference in avoiding overfitting in low-shot settings and improving generalization, and encourages the prompt space to be resilient against these challenges. To the best of our knowledge, we are the first to introduce variational inference in prompt learning.
3. Method
3.1. Background
Contrastive Language-Image Pretraining (CLIP) [38] consists of an image encoder $f(x)$ and a text encoder $g(t)$, each producing a $d$-dimensional ($L_2$-normalized) embedding from an arbitrary image $x \in \mathbb{R}^{3 \times H \times W}$ and word embeddings $t \in \mathbb{R}^{L \times e}$, with $L$ representing the text length and $e$ the embedding dimension¹. Both encoders are trained together using a contrastive loss on a large-scale dataset composed of paired images and captions. Once trained, CLIP enables zero-shot $C$-class image classification by generating each of the $C$ classifier weights $w_c$ as the $d$-dimensional text encoding $g(t_c)$. Here $t_c$ results from adding the class-specific word embedding $e_c$ to a pre-defined prompt $p \in \mathbb{R}^{(L-1) \times e}$, i.e., $w_c = g(t_c)$ with $t_c = \{p, e_c\}$. The prompt $p$ is manually crafted to capture the semantic meaning of the downstream task, e.g., $t_c =$ "An image of a {class}". The probability of image $x$ being classified as $y \in \{1, \dots, C\}$ is thus defined as

$$p(y|x) = \frac{e^{f(x)^T w_y}}{\sum_{c}^{C} e^{f(x)^T w_c}}.$$

¹In CLIP the word embedding is learned together with the text encoder. A tokenizer is used to convert the text into one-hot vectors, or tokens, that can be directly mapped into the word embeddings. For the sake of clarity we refer indistinctly to words and word embeddings.
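As a concrete illustration of the zero-shot classifier above, the following minimal sketch builds the classifier weights $w_c$ from a handcrafted prompt template and scores a single image. It assumes the interface of the openai/CLIP Python package (clip.load, clip.tokenize, encode_image, encode_text, logit_scale); the model name, class names and image path are placeholders, not part of this paper.

```python
import torch
import clip                      # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["cat", "dog", "car"]                         # arbitrary, newly defined categories
prompts = [f"An image of a {c}." for c in class_names]      # handcrafted prompt template

with torch.no_grad():
    # w_c = g(t_c): one classifier weight per class, L2-normalized
    w = model.encode_text(clip.tokenize(prompts).to(device))
    w = w / w.norm(dim=-1, keepdim=True)

    # f(x): L2-normalized image embedding ("example.jpg" is a placeholder path)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    f_x = model.encode_image(image)
    f_x = f_x / f_x.norm(dim=-1, keepdim=True)

    # p(y|x): softmax over temperature-scaled cosine similarities f(x)^T w_c
    logits = model.logit_scale.exp() * f_x @ w.t()
    probs = logits.softmax(dim=-1)                           # shape [1, C]
```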
Context Optimization (CoOp) [55] provides a learned alternative to manually defining prompts. CoOp learns a fixed prompt from a few annotated samples. The prompt is designed as a learnable embedding matrix $p \in \mathbb{R}^{L \times e}$ which is updated by back-propagating the classification error through the frozen CLIP model. Specifically, for a set of $N$ annotated meta-training samples $\{x_i, y_i\}_{i=1}^{N}$, the prompt $p$ is obtained by minimizing the cross-entropy loss, as:

$$p = \arg\min_{p} \; \mathbb{E}_{x_i, y_i}\big[-\log p(y_i|x_i, p)\big]. \quad (1)$$

Note that this approach, while resembling common meta-learning approaches, can still be deployed in a zero-shot scenario, provided that for new classes the classification weights are given by the text encoder. Although this approach generalizes to new tasks with few training iterations, learning a fixed prompt is sensitive to domain shifts between the annotated samples and the unseen prompts.
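A minimal sketch of the optimization in Eq. 1 follows, under simplifying assumptions: the frozen text encoder $g$ is abstracted as a callable text_encoder that maps a sequence of token embeddings to a $d$-dimensional embedding, class_token_embeddings holds pre-computed word embeddings $e_c$, and the fixed temperature mirrors CLIP. None of these names come from the CoOp code base.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L, e = 16, 512                                   # context length and word-embedding dimension

# The only trainable quantity: the prompt p in R^{L x e}.
prompt = nn.Parameter(0.02 * torch.randn(L, e))
optimizer = torch.optim.SGD([prompt], lr=2e-3)

def classifier_weights(prompt, class_token_embeddings, text_encoder):
    """Build w_c = g({p, e_c}) for every class; text_encoder is the frozen g."""
    weights = []
    for e_c in class_token_embeddings:                  # e_c: [n_c, e] word embeddings of one class name
        t_c = torch.cat([prompt, e_c], dim=0)           # t_c = {p, e_c}
        weights.append(text_encoder(t_c.unsqueeze(0)).squeeze(0))
    return F.normalize(torch.stack(weights), dim=-1)    # [C, d]

def coop_step(images, labels, image_encoder, text_encoder, class_token_embeddings):
    """One training step of Eq. 1: only `prompt` receives gradients."""
    with torch.no_grad():                               # the image encoder f stays frozen
        f_x = F.normalize(image_encoder(images), dim=-1)
    w = classifier_weights(prompt, class_token_embeddings, text_encoder)
    logits = 100.0 * f_x @ w.t()                        # fixed CLIP-style temperature
    loss = F.cross_entropy(logits, labels)              # E[-log p(y|x, p)]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```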
Conditional Prompt Learning (CoCoOp) [54] attempts to overcome domain shifts by learning an instance-specific continuous prompt that is conditioned on the input image. To ease the training of a conditional prompt generator, CoCoOp defines each conditional token in a residual way, with a task-specific, learnable set of tokens $p$ and a residual prompt that is conditioned on the input image. Assuming $p$ to be composed of $L$ learnable tokens $p = [p_1, p_2, \cdots, p_L]$, the residual prompt $r(x) = \pi_\phi(f(x)) \in \mathbb{R}^{e}$ is produced by a small neural network $\pi_\phi$ with the image features $f(x)$ as input. The new prompt is then computed as $p(x) = [p_1 + r(x), p_2 + r(x), \cdots, p_L + r(x)]$. The training now comprises learning the task-specific prompt $p$ and the parameters $\phi$ of the neural network $\pi_\phi$. Defining the context-specific text embedding $t_c(x) = \{p(x), e_c\}$, and $p(y|x)$ as:

$$p(y|x) = \frac{e^{f(x)^T g(t_y(x))}}{\sum_{c}^{C} e^{f(x)^T g(t_c(x))}}, \quad (2)$$

the learning is formulated as:

$$p, \phi = \arg\min_{p, \phi} \; \mathbb{E}_{x_i, y_i}\big[-\log p(y_i|x_i, p, \phi)\big]. \quad (3)$$

While CoCoOp achieves good results on many downstream tasks, it is still prone to the domain shift problem, considering that $\pi_\phi$ provides a deterministic residual prompt from the domain-specific image features $f(x)$.
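To illustrate the conditioning mechanism behind Eqs. 2 and 3, the sketch below implements a small meta-network $\pi_\phi$ that maps image features to a single residual token $r(x)$, which is added to every context token. The bottleneck sizes and module names are illustrative assumptions rather than CoCoOp's published configuration; both the prompt and the parameters $\phi$ would be optimized with the same cross-entropy objective as before.

```python
import torch
import torch.nn as nn

class ResidualPromptGenerator(nn.Module):
    """pi_phi: image features f(x) in R^d  ->  residual token r(x) in R^e (sizes are assumptions)."""
    def __init__(self, d=512, e=512, reduction=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(d // reduction, e),
        )

    def forward(self, f_x):          # f_x: [B, d]
        return self.net(f_x)         # r(x): [B, e]

def conditional_prompt(prompt, f_x, pi_phi):
    """p(x) = [p_1 + r(x), ..., p_L + r(x)] for every image in the batch."""
    r = pi_phi(f_x)                                  # [B, e]
    return prompt.unsqueeze(0) + r.unsqueeze(1)      # broadcast the residual over the L tokens: [B, L, e]
```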
Prompt Distribution Learning (ProDA) [32] learns a distribution of prompts that generalize to a broader set of tasks. It learns a collection of prompts $P = \{p_k\}_{k=1}^{K}$ that subsequently generate an a posteriori distribution of the classifier weights for each of the target classes. For a given mini-batch of $K$ sampled prompts $p_k \in P$, the classifier weights $w_c$ are sampled from the posterior distribution $q = \mathcal{N}(\mu_{w_{1:C}}, \Sigma_{w_{1:C}})$, with mean $\mu_{w_{1:C}}$ and covariance $\Sigma_{w_{1:C}}$ computed from the collection $\{w_{k,c} = g(t_{k,c})\}_{c=1:C,\, k=1:K}$, with $t_{k,c} = \{p_k, e_c\}$. The objective is formulated as:

$$P = \arg\min_{P} \; \mathbb{E}_{x_i, y_i}\big[-\log \mathbb{E}_{w_l \sim q}\, p(y_i|x_i, w_l)\big]. \quad (4)$$

Computing $\mathbb{E}_{w_l}[p(y_i|x_i, w_l)]$ is intractable, so an upper bound to Eq. 4 is derived. During inference, the classifier weights are set to those given by the predictive mean $w_c = \mu_{w_{1:C}}$, computed across the set of learned prompts $P$.
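The sketch below illustrates how a collection of $K$ learned prompts can induce a distribution over classifier weights in the spirit of ProDA: each prompt yields one weight vector per class, and their per-class mean and (here, diagonal) variance define a Gaussian to sample from. The diagonal covariance and the helper names are simplifications for illustration; ProDA's actual estimator and the upper bound used for training differ in detail.

```python
import torch
import torch.nn.functional as F

def classifier_weight_statistics(prompts, class_token_embeddings, text_encoder):
    """Compute w_{k,c} = g({p_k, e_c}) for K prompts and C classes, then summarize them
    with a per-class mean and a diagonal variance (a simplified Gaussian posterior)."""
    all_w = []
    for p_k in prompts:                                     # K learned prompts
        w_k = [text_encoder(torch.cat([p_k, e_c], dim=0).unsqueeze(0)).squeeze(0)
               for e_c in class_token_embeddings]           # C classes
        all_w.append(torch.stack(w_k))                      # [C, d]
    W = torch.stack(all_w)                                  # [K, C, d]
    return W.mean(dim=0), W.var(dim=0)                      # mean and diagonal variance per class

def sample_classifier(mu, var):
    """Draw classifier weights W ~ N(mu, diag(var)) and L2-normalize the rows;
    at inference, the predictive mean mu is used instead of a sample."""
    w = mu + var.sqrt() * torch.randn_like(mu)
    return F.normalize(w, dim=-1)
```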
3.2. Conditional Bayesian Prompt Learning
We propose to model the input prompt space in a probabilistic manner, as an a priori, conditional distribution. We define a distribution $p_\gamma$ over the prompts $p$ that is conditional on the image, i.e., $p \sim p_\gamma(x)$. To this end, we assume that $p$ can be split into a fixed set of prompts $p_i$ and a conditional residual prompt $r$ that acts as a latent variable over $p$. The conditional prompt is then defined as:

$$p_\gamma(x) = [p_1 + r_\gamma, p_2 + r_\gamma, \cdots, p_L + r_\gamma], \quad r_\gamma \sim p_\gamma(x), \quad (5)$$

where $p_\gamma(x)$ refers to the real posterior distribution over $r$ conditioned on the observed features $x$. Denoting the class-
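To make Eq. 5 concrete, the sketch below samples the conditional residual $r_\gamma$ from an amortized Gaussian posterior over the image features using the reparameterization trick, and returns the KL term that variational inference adds to the training objective. The Gaussian parameterization, layer sizes and module names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class VariationalResidual(nn.Module):
    """Amortized posterior q(r | f(x)) = N(mu(x), diag(sigma(x)^2)) over the residual token r
    (the Gaussian form and layer sizes are assumptions for illustration)."""
    def __init__(self, d=512, e=512, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(inplace=True))
        self.mu = nn.Linear(hidden, e)
        self.log_var = nn.Linear(hidden, e)

    def forward(self, f_x):                     # f_x: [B, d]
        h = self.backbone(f_x)
        return self.mu(h), self.log_var(h)      # each [B, e]

def sample_bayesian_prompt(prompt, f_x, posterior, n_samples=4):
    """Monte Carlo samples of p_gamma(x) = [p_1 + r, ..., p_L + r], with r ~ q(r | f(x))."""
    mu, log_var = posterior(f_x)
    std = (0.5 * log_var).exp()
    eps = torch.randn(n_samples, *mu.shape, device=mu.device)      # reparameterization trick
    r = mu.unsqueeze(0) + std.unsqueeze(0) * eps                   # [S, B, e]
    prompts = prompt.view(1, 1, *prompt.shape) + r.unsqueeze(2)    # [S, B, L, e]
    # KL(q(r|x) || N(0, I)): the regularizer over the prompt space induced by variational inference
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
    return prompts, kl
```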