tion. It is, however, well known that Empirical Risk Minimization-based models degrade drastically when training and testing distributions differ [37, 2]. To relax the
i.i.d. assumption, Peters et al. [37] suggest exploiting the
“invariance principle” for better generalization. Unfortu-
nately, Invariant Risk Minimization methods for deep neural networks have so far failed to deliver competitive results, as observed in [15, 31, 30]. To address this limitation, Lin et al. [30] propose a Bayesian treatment of Invariant Risk Minimization that mitigates overfitting in deep models by defining a regularization term over the posterior distribution of classifiers; minimizing this term pushes the model's backbone to learn invariant features. We take inspi-
ration from this Bayesian Invariant Risk Minimization [30]
and propose the first Bayesian prompt learning approach.
We make three contributions. First, we frame prompt
learning from the Bayesian perspective and formulate it as a
variational inference problem (see Figure 1). This formula-
tion provides several benefits: it naturally injects noise during prompt learning and induces a regularization term that encourages the model to learn informative prompts, scattered across the prompt space, for each downstream task.
As a direct result, we regularize the prompt space, reduce
overfitting to seen prompts, and improve generalization on
unseen prompts. Second, our framework models the input
prompt space in a probabilistic manner, as an a priori distribution, which makes our proposal compatible with prompt
learning approaches that are unconditional [55] or condi-
tional on the image [54]. Third, we empirically demon-
strate on 15 benchmarks that Bayesian prompt learning pro-
vides an appropriate coverage of the prompt space, prevents
learning spurious features, and exploits transferable invari-
ant features, leading to better generalization to unseen prompts, even across different datasets and domains.
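To make this concrete, the formulation can be sketched in generic variational terms; the notation here (latent prompt variable z, variational posterior q_φ, prior p(z), weight β) is illustrative shorthand rather than the exact objective developed later in the paper:
\[
\max_{\phi}\;\; \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[\log p(y \mid x, z)\right] \;-\; \beta\, \mathrm{KL}\!\left(q_{\phi}(z \mid x)\,\|\,p(z)\right),
\]
where sampling the latent prompt z injects noise into prompt learning, the KL term acts as the regularizer that scatters prompts across the prompt space, and dropping the conditioning on the image x recovers the unconditional variant.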
2. Related Work
Prompt learning in language. Prompt learning was
originally proposed within natural language processing
(NLP), by models such as GPT-3 [4]. Early methods con-
structed prompts by combining words in the language space
such that the model would perform better on downstream
evaluation [46, 24]. Li and Liang [29] prepend a set of
learnable prompts to the different layers of a frozen model
and optimize through back-propagation. In parallel, Lester
et al. [28] demonstrate that, without intermediate-layer prefixes or task-specific output layers, prepending prefixes only to the input of the frozen model is enough to compete with fine-tuning. Alternatively, [16] uses a HyperNetwork to conditionally generate task-specific and layer-specific prompts that are prepended to the values and keys inside the self-attention layers of a frozen model. Inspired by progress
from NLP, we propose a prompt learning method intended
for image and language models.
Prompt learning in image and language. Zhou et
al. [55] propose Context Optimization (CoOp), a prompt
learner for CLIP, which optimizes prompts in the contin-
uous space through back-propagation. The work demon-
strates the benefit of prompt learning over prompt engi-
neering. While CoOp obtains good accuracy on prompts
seen during training, it has difficulty generalizing to un-
seen prompts. This motivated Zhou et al. to introduce Conditional Context Optimization (CoCoOp) [54], which generates instance-specific prompt residuals through a conditioning mechanism dependent on the image data and thereby generalizes better. ProGrad by Zhu et al. [56] also strives to bridge
the generalization gap by matching the gradient of the
prompt to the general knowledge of the CLIP model to pre-
vent forgetting. Alternative directions include test-time
prompt learning [47], where consistency across multiple
views is the supervisory signal, and unsupervised prompt
learning [21], where a pseudo-labeling strategy drives the
prompt learning. More similar to ours is ProDA by Lu
et al. [32], who propose an ensemble of a fixed number of hand-crafted prompts and model the prompt distribution exclusively within the language embedding space. Conversely, we pre-
fer to model the input prompt space rather than relying on
a fixed number of templates, as it provides us with a mech-
anism to cover the prompt space by sampling. Moreover,
our approach is not limited to unconditional prompt learn-
ing from the language embedding, but like Zhou et al. [54],
also allows for prompt learning conditioned on an image.
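For concreteness, the following minimal sketch (in PyTorch; module names and dimensions are our own illustrative assumptions, not the official CoOp or CoCoOp code) shows the two mechanisms discussed above: a shared set of learnable context vectors prepended to frozen class-name token embeddings, and an optional instance-specific residual produced by a small meta-network from image features.

import torch
import torch.nn as nn


class ContextPrompter(nn.Module):
    """Illustrative prompt learner: CoOp-style shared context vectors,
    optionally with a CoCoOp-style image-conditional residual."""

    def __init__(self, n_ctx=4, dim=512, img_dim=512, conditional=True):
        super().__init__()
        # Shared, learnable context vectors optimized by back-propagation (CoOp).
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))
        # Small meta-network mapping image features to a prompt residual (CoCoOp).
        self.meta_net = nn.Sequential(
            nn.Linear(img_dim, img_dim // 16),
            nn.ReLU(),
            nn.Linear(img_dim // 16, dim),
        ) if conditional else None

    def forward(self, class_token_embs, image_features=None):
        # class_token_embs: (n_classes, n_tokens, dim) frozen class-name embeddings.
        # image_features:   (batch, img_dim) features from the frozen image encoder.
        n_classes = class_token_embs.size(0)
        if self.meta_net is not None and image_features is not None:
            # Instance-specific residual shifts the shared context per image.
            residual = self.meta_net(image_features)            # (batch, dim)
            ctx = self.ctx.unsqueeze(0) + residual[:, None, :]  # (batch, n_ctx, dim)
            ctx = ctx.unsqueeze(1).expand(-1, n_classes, -1, -1)
            cls = class_token_embs.unsqueeze(0).expand(ctx.size(0), -1, -1, -1)
            # (batch, n_classes, n_ctx + n_tokens, dim)
            return torch.cat([ctx, cls], dim=2)
        # Unconditional case: the same learned prompt for every image.
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # (n_classes, n_ctx + n_tokens, dim)
        return torch.cat([ctx, class_token_embs], dim=1)

In both cases the resulting token sequences would be passed through the frozen text encoder and the prompt parameters optimized by back-propagation through the frozen model.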
Prompt learning in vision and language. While be-
yond our current scope, it is worth noting that prompt learn-
ing has been applied to a wider range of vision problems
and scenarios, which highlights its power and flexibility.
Among them are important topics such as unsupervised do-
main adaptation [13], multi-label classification [49], video
classification [25], object detection [9, 12] and pixel-level
labelling [40]. Finally, prompt learning has also been ap-
plied to vision-only models [23, 44], providing an efficient and flexible means to adapt pre-trained models.
Variational inference in computer vision. Variational
inference and, more specifically, variational autoencoder
variants have been extensively applied to computer vi-
sion tasks as diverse as image generation [39, 43, 42], ac-
tion recognition [34], instance segmentation [20], few-shot
learning [53, 45], domain generalization [10], and contin-
ual learning [7]. For example, Zhang et al. [53] focus
on the reduction of noise vulnerability and estimation bias
in few-shot learning through variational inference. In the
same vein, Du et al. [10] propose a variational informa-
tion bottleneck to better manage prediction uncertainty and
unknown domains. Our proposed method also shares the
advantages of variational inference in avoiding overfitting
in low-shot settings, improving generalization, and encouraging the prompt space to be resilient against these chal-