Bayesian Prompt Learning for Image-Language Model Generalization
Mohammad Mahdi Derakhshani1*, Enrique Sanchez5, Adrian Bulat3,5, Victor Guilherme Turrisi da Costa2*, Cees G. M. Snoek1†, Georgios Tzimiropoulos4,5†, Brais Martinez5†
1University of Amsterdam  2University of Trento  3Technical University of Iasi  4Queen Mary University of London  5Samsung AI Cambridge
Abstract
Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning
1. Introduction
In the continuous quest for better pre-training strategies, models based on image and language supervision have set impressive milestones, with CLIP [38], ALIGN [22] and Flamingo [1] being leading examples. Contrastively trained image-language models consist of image and text encoders that align semantically-related concepts in a joint embedding space. Such models offer impressive zero-shot image classification by using the text encoder to generate classifier weights from arbitrary, newly defined categories without relying on any visual data. In particular, the class name is used within a handcrafted prompt template and then tokenized and encoded into the shared embedding space to generate new classifier weights. Rather than manually defining prompts, both Lester et al. [28] and Zhou et al. [55] demonstrated that prompts can instead be optimized in a data-driven manner through backpropagation. However, as prompt learning typically has access to only a few training examples per prompt, overfitting to the seen prompts in lieu of the unseen prompts is common [55]. In this paper, we strive to mitigate the overfitting behavior of prompt learning so as to improve generalization to unseen prompts.

Figure 1: We present a Bayesian perspective on prompt learning by formulating it as a variational inference problem (right column). Our framework models the prompt space as an a priori distribution, which makes our proposal compatible with common prompt learning approaches that are unconditional (top) or conditional on the image (bottom).

*Most work done during an internship at Samsung AI Cambridge. †Equal advising.
Others before us have considered the generalization problem in prompt learning as well, e.g., [55, 54], but they all optimize a so-called Empirical Risk Minimization objective. It is, however, well known that Empirical Risk Minimization based models degrade drastically when training and testing distributions differ [37, 2]. To relax the i.i.d. assumption, Peters et al. [37] suggest exploiting the "invariance principle" for better generalization. Unfortunately, Invariant Risk Minimization methods for deep neural networks as of yet fail to deliver competitive results, as observed in [15, 31, 30]. To alleviate this limitation, Lin et al. [30] propose a Bayesian treatment of Invariant Risk Minimization that mitigates overfitting in deep models by defining a regularization term over the posterior distribution of classifiers, minimizing this term, and pushing the model's backbone to learn invariant features. We take inspiration from this Bayesian Invariant Risk Minimization [30] and propose the first Bayesian prompt learning approach.
We make three contributions. First, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem (see Figure 1). This formulation provides several benefits: it naturally injects noise during prompt learning and induces a regularization term that encourages the model to learn informative prompts, scattered in the prompt space, for each downstream task; as a direct result, we regularize the prompt space, reduce overfitting to seen prompts, and improve generalization on unseen prompts. Second, our framework models the input prompt space in a probabilistic manner, as an a priori distribution, which makes our proposal compatible with prompt learning approaches that are unconditional [55] or conditional on the image [54]. Third, we empirically demonstrate on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features, leading to better generalization of unseen prompts, even across different datasets and domains.
2. Related Work
Prompt learning in language. Prompt learning was originally proposed within natural language processing (NLP), by models such as GPT-3 [4]. Early methods constructed prompts by combining words in the language space such that the model would perform better on downstream evaluation [46, 24]. Li and Liang [29] prepend a set of learnable prompts to the different layers of a frozen model and optimize them through back-propagation. In parallel, Lester et al. [28] demonstrate that, with no intermediate layer prefixes or task-specific output layers, adding prefixes to the input of the frozen model alone is enough to compete with fine-tuning. Alternatively, [16] uses a HyperNetwork to conditionally generate task-specific and layer-specific prompts prepended to the values and keys inside the self-attention layers of a frozen model. Inspired by this progress in NLP, we propose a prompt learning method intended for image and language models.
Prompt learning in image and language. Zhou et al. [55] propose Context Optimization (CoOp), a prompt learner for CLIP, which optimizes prompts in the continuous space through back-propagation. The work demonstrates the benefit of prompt learning over prompt engineering. While CoOp obtains good accuracy on prompts seen during training, it has difficulty generalizing to unseen prompts. This motivated Zhou et al. to introduce Conditional Context Optimization (CoCoOp) [54], which generates instance-specific prompt residuals through a conditioning mechanism dependent on the image data and generalizes better. ProGrad by Zhu et al. [56] also strives to bridge the generalization gap by matching the gradient of the prompt to the general knowledge of the CLIP model to prevent forgetting. Alternative directions consist of test-time prompt learning [47], where consistency across multiple views is the supervisory signal, and unsupervised prompt learning [21], where a pseudo-labeling strategy drives the prompt learning. More similar to ours is ProDA by Lu et al. [32], who propose an ensemble of a fixed number of hand-crafted prompts and model the distribution exclusively within the language embedding. Conversely, we prefer to model the input prompt space rather than relying on a fixed number of templates, as it provides us with a mechanism to cover the prompt space by sampling. Moreover, our approach is not limited to unconditional prompt learning from the language embedding but, like Zhou et al. [54], also allows for prompt learning conditioned on an image.
Prompt learning in vision and language. While beyond our current scope, it is worth noting that prompt learning has been applied to a wider range of vision problems and scenarios, which highlights its power and flexibility. Among them are important topics such as unsupervised domain adaptation [13], multi-label classification [49], video classification [25], object detection [9, 12] and pixel-level labelling [40]. Finally, prompt learning has also been applied to vision-only models [23, 44], providing an efficient and flexible means to adapt pre-trained models.
Variational inference in computer vision. Variational inference and, more specifically, variational autoencoder variants have been extensively applied to computer vision tasks as diverse as image generation [39, 43, 42], action recognition [34], instance segmentation [20], few-shot learning [53, 45], domain generalization [10], and continual learning [7]. For example, Zhang et al. [53] focus on reducing noise vulnerability and estimation bias in few-shot learning through variational inference. In the same vein, Du et al. [10] propose a variational information bottleneck to better manage prediction uncertainty and unknown domains. Our proposed method shares the advantages of variational inference in avoiding overfitting in low-shot settings and improving generalization, and encourages the prompt space to be resilient against these challenges. To the best of our knowledge, we are the first to introduce variational inference in prompt learning.
3. Method
3.1. Background
Contrastive Language-Image Pretraining (CLIP) [38] consists of an image encoder $f(x)$ and a text encoder $g(t)$, each producing a $d$-dimensional ($L_2$-normalized) embedding from an arbitrary image $x \in \mathbb{R}^{3 \times H \times W}$ and word embeddings $t \in \mathbb{R}^{L \times e}$, with $L$ representing the text length and $e$ the embedding dimension¹. Both encoders are trained together using a contrastive loss on a large-scale dataset composed of paired images and captions. Once trained, CLIP enables zero-shot $C$-class image classification by generating each of the $C$ classifier weights $w_c$ as the $d$-dimensional text encoding $g(t_c)$. Here $t_c$ results from adding the class-specific word embedding $e_c$ to a pre-defined prompt $p \in \mathbb{R}^{(L-1) \times e}$, i.e., $w_c = g(t_c)$ with $t_c = \{p, e_c\}$. The prompt $p$ is manually crafted to capture the semantic meaning of the downstream task, e.g., $t_c =$ "An image of a {class}". The probability of image $x$ being classified as $y \in \{1, \dots, C\}$ is thus defined as

$$p(y|x) = \frac{e^{f(x)^T w_y}}{\sum_{c}^{C} e^{f(x)^T w_c}}.$$

¹In CLIP the word embedding is learned together with the text encoder. A tokenizer is used to convert the text into one-hot vectors, or tokens, that can be directly mapped into the word embeddings. For the sake of clarity we refer indistinctly to words and word embeddings.
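As a concrete illustration of the zero-shot classifier above, the following minimal sketch builds the classifier weights $w_c$ from a handcrafted prompt template and scores a single image. It assumes the interface of the openai/CLIP Python package (clip.load, clip.tokenize, encode_image, encode_text, logit_scale); the model name, class names and image path are placeholders, not part of this paper.

```python
import torch
import clip                      # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

class_names = ["cat", "dog", "car"]                         # arbitrary, newly defined categories
prompts = [f"An image of a {c}." for c in class_names]      # handcrafted prompt template

with torch.no_grad():
    # w_c = g(t_c): one classifier weight per class, L2-normalized
    w = model.encode_text(clip.tokenize(prompts).to(device))
    w = w / w.norm(dim=-1, keepdim=True)

    # f(x): L2-normalized image embedding ("example.jpg" is a placeholder path)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    f_x = model.encode_image(image)
    f_x = f_x / f_x.norm(dim=-1, keepdim=True)

    # p(y|x): softmax over temperature-scaled cosine similarities f(x)^T w_c
    logits = model.logit_scale.exp() * f_x @ w.t()
    probs = logits.softmax(dim=-1)                           # shape [1, C]
```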
Context Optimization (CoOp) [55] provides a learned alternative to manually defining prompts. CoOp learns a fixed prompt from a few annotated samples. The prompt is designed as a learnable embedding matrix $p \in \mathbb{R}^{L \times e}$ which is updated by back-propagating the classification error through the frozen CLIP model. Specifically, for a set of $N$ annotated meta-training samples $\{x_i, y_i\}_{i=1}^{N}$, the prompt $p$ is obtained by minimizing the cross-entropy loss, as:

$$p = \arg\min_{p} \; \mathbb{E}_{x_i, y_i}\big[-\log p(y_i|x_i, p)\big]. \quad (1)$$

Note that this approach, while resembling common meta-learning approaches, can still be deployed in a zero-shot scenario, provided that for new classes the classification weights are given by the text encoder. Although this approach generalizes to new tasks with few training iterations, learning a fixed prompt is sensitive to domain shifts between the annotated samples and the unseen prompts.
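A minimal sketch of the optimization in Eq. 1 follows, under simplifying assumptions: the frozen text encoder $g$ is abstracted as a callable text_encoder that maps a sequence of token embeddings to a $d$-dimensional embedding, class_token_embeddings holds pre-computed word embeddings $e_c$, and the fixed temperature mirrors CLIP. None of these names come from the CoOp code base.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L, e = 16, 512                                   # context length and word-embedding dimension

# The only trainable quantity: the prompt p in R^{L x e}.
prompt = nn.Parameter(0.02 * torch.randn(L, e))
optimizer = torch.optim.SGD([prompt], lr=2e-3)

def classifier_weights(prompt, class_token_embeddings, text_encoder):
    """Build w_c = g({p, e_c}) for every class; text_encoder is the frozen g."""
    weights = []
    for e_c in class_token_embeddings:                  # e_c: [n_c, e] word embeddings of one class name
        t_c = torch.cat([prompt, e_c], dim=0)           # t_c = {p, e_c}
        weights.append(text_encoder(t_c.unsqueeze(0)).squeeze(0))
    return F.normalize(torch.stack(weights), dim=-1)    # [C, d]

def coop_step(images, labels, image_encoder, text_encoder, class_token_embeddings):
    """One training step of Eq. 1: only `prompt` receives gradients."""
    with torch.no_grad():                               # the image encoder f stays frozen
        f_x = F.normalize(image_encoder(images), dim=-1)
    w = classifier_weights(prompt, class_token_embeddings, text_encoder)
    logits = 100.0 * f_x @ w.t()                        # fixed CLIP-style temperature
    loss = F.cross_entropy(logits, labels)              # E[-log p(y|x, p)]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```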
Conditional Prompt Learning (CoCoOp) [54] attempts to overcome domain shifts by learning an instance-specific continuous prompt that is conditioned on the input image. To ease the training of a conditional prompt generator, CoCoOp defines each conditional token in a residual way, with a task-specific, learnable set of tokens $p$ and a residual prompt that is conditioned on the input image. Assuming $p$ to be composed of $L$ learnable tokens $p = [p_1, p_2, \cdots, p_L]$, the residual prompt $r(x) = \pi_\phi(f(x)) \in \mathbb{R}^{e}$ is produced by a small neural network $\pi_\phi$ with the image features $f(x)$ as input. The new prompt is then computed as $p(x) = [p_1 + r(x), p_2 + r(x), \cdots, p_L + r(x)]$. The training now comprises learning the task-specific prompt $p$ and the parameters $\phi$ of the neural network $\pi_\phi$. Defining the context-specific text embedding $t_c(x) = \{p(x), e_c\}$, and $p(y|x)$ as:

$$p(y|x) = \frac{e^{f(x)^T g(t_y(x))}}{\sum_{c}^{C} e^{f(x)^T g(t_c(x))}}, \quad (2)$$

the learning is formulated as:

$$p, \phi = \arg\min_{p, \phi} \; \mathbb{E}_{x_i, y_i}\big[-\log p(y_i|x_i, p, \phi)\big]. \quad (3)$$

While CoCoOp achieves good results on many downstream tasks, it is still prone to the domain shift problem, considering that $\pi_\phi$ provides a deterministic residual prompt from the domain-specific image features $f(x)$.
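To illustrate the conditioning mechanism behind Eqs. 2 and 3, the sketch below implements a small meta-network $\pi_\phi$ that maps image features to a single residual token $r(x)$, which is added to every context token. The bottleneck sizes and module names are illustrative assumptions rather than CoCoOp's published configuration; both the prompt and the parameters $\phi$ would be optimized with the same cross-entropy objective as before.

```python
import torch
import torch.nn as nn

class ResidualPromptGenerator(nn.Module):
    """pi_phi: image features f(x) in R^d  ->  residual token r(x) in R^e (sizes are assumptions)."""
    def __init__(self, d=512, e=512, reduction=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(d // reduction, e),
        )

    def forward(self, f_x):          # f_x: [B, d]
        return self.net(f_x)         # r(x): [B, e]

def conditional_prompt(prompt, f_x, pi_phi):
    """p(x) = [p_1 + r(x), ..., p_L + r(x)] for every image in the batch."""
    r = pi_phi(f_x)                                  # [B, e]
    return prompt.unsqueeze(0) + r.unsqueeze(1)      # broadcast the residual over the L tokens: [B, L, e]
```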
Prompt Distribution Learning (ProDA) [32] learns a distribution of prompts that generalize to a broader set of tasks. It learns a collection of prompts $P = \{p_k\}_{k=1}^{K}$ that subsequently generate an a posteriori distribution of the classifier weights for each of the target classes. For a given mini-batch of $K$ sampled prompts $p_k \in P$, the classifier weights $w_c$ are sampled from the posterior distribution $q = \mathcal{N}(\mu_{w_{1:C}}, \Sigma_{w_{1:C}})$, with mean $\mu_{w_{1:C}}$ and covariance $\Sigma_{w_{1:C}}$ computed from the collection $\{w_{k,c} = g(t_{k,c})\}_{c=1:C,\, k=1:K}$, with $t_{k,c} = \{p_k, e_c\}$. The objective is formulated as:

$$P = \arg\min_{P} \; \mathbb{E}_{x_i, y_i}\big[-\log \mathbb{E}_{w_l \sim q}\, p(y_i|x_i, w_l)\big]. \quad (4)$$

Computing $\mathbb{E}_{w_l}[p(y_i|x_i, w_l)]$ is intractable, so an upper bound to Eq. 4 is derived. During inference, the classifier weights are set to those given by the predictive mean $w_c = \mu_{w_{1:C}}$, computed across the set of learned prompts $P$.
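The sketch below illustrates how a collection of $K$ learned prompts can induce a distribution over classifier weights in the spirit of ProDA: each prompt yields one weight vector per class, and their per-class mean and (here, diagonal) variance define a Gaussian to sample from. The diagonal covariance and the helper names are simplifications for illustration; ProDA's actual estimator and the upper bound used for training differ in detail.

```python
import torch
import torch.nn.functional as F

def classifier_weight_statistics(prompts, class_token_embeddings, text_encoder):
    """Compute w_{k,c} = g({p_k, e_c}) for K prompts and C classes, then summarize them
    with a per-class mean and a diagonal variance (a simplified Gaussian posterior)."""
    all_w = []
    for p_k in prompts:                                     # K learned prompts
        w_k = [text_encoder(torch.cat([p_k, e_c], dim=0).unsqueeze(0)).squeeze(0)
               for e_c in class_token_embeddings]           # C classes
        all_w.append(torch.stack(w_k))                      # [C, d]
    W = torch.stack(all_w)                                  # [K, C, d]
    return W.mean(dim=0), W.var(dim=0)                      # mean and diagonal variance per class

def sample_classifier(mu, var):
    """Draw classifier weights W ~ N(mu, diag(var)) and L2-normalize the rows;
    at inference, the predictive mean mu is used instead of a sample."""
    w = mu + var.sqrt() * torch.randn_like(mu)
    return F.normalize(w, dim=-1)
```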
3.2. Conditional Bayesian Prompt Learning
We propose to model the input prompt space in a probabilistic manner, as an a priori, conditional distribution. We define a distribution $p_\gamma$ over the prompts $p$ that is conditional on the image, i.e., $p \sim p_\gamma(x)$. To this end, we assume that $p$ can be split into a fixed set of prompts $p_i$ and a conditional residual prompt $r$ that acts as a latent variable over $p$. The conditional prompt is then defined as:

$$p_\gamma(x) = [p_1 + r_\gamma, p_2 + r_\gamma, \cdots, p_L + r_\gamma], \quad r_\gamma \sim p_\gamma(x), \quad (5)$$

where $p_\gamma(x)$ refers to the real posterior distribution over $r$ conditioned on the observed features $x$. Denoting the class-
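To make Eq. 5 concrete, the sketch below samples the conditional residual $r_\gamma$ from an amortized Gaussian posterior over the image features using the reparameterization trick, and returns the KL term that variational inference adds to the training objective. The Gaussian parameterization, layer sizes and module names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class VariationalResidual(nn.Module):
    """Amortized posterior q(r | f(x)) = N(mu(x), diag(sigma(x)^2)) over the residual token r
    (the Gaussian form and layer sizes are assumptions for illustration)."""
    def __init__(self, d=512, e=512, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(inplace=True))
        self.mu = nn.Linear(hidden, e)
        self.log_var = nn.Linear(hidden, e)

    def forward(self, f_x):                     # f_x: [B, d]
        h = self.backbone(f_x)
        return self.mu(h), self.log_var(h)      # each [B, e]

def sample_bayesian_prompt(prompt, f_x, posterior, n_samples=4):
    """Monte Carlo samples of p_gamma(x) = [p_1 + r, ..., p_L + r], with r ~ q(r | f(x))."""
    mu, log_var = posterior(f_x)
    std = (0.5 * log_var).exp()
    eps = torch.randn(n_samples, *mu.shape, device=mu.device)      # reparameterization trick
    r = mu.unsqueeze(0) + std.unsqueeze(0) * eps                   # [S, B, e]
    prompts = prompt.view(1, 1, *prompt.shape) + r.unsqueeze(2)    # [S, B, L, e]
    # KL(q(r|x) || N(0, I)): the regularizer over the prompt space induced by variational inference
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=-1).mean()
    return prompts, kl
```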