LASP: Text-to-Text Optimization for Language-Aware Soft Prompting
of Vision & Language Models
Adrian Bulat1,2, Georgios Tzimiropoulos1,3
1Samsung AI Cambridge 2Technical University of Iasi 3Queen Mary University of London
Abstract
Soft prompt learning has recently emerged as one of the
methods of choice for adapting V&L models to a down-
stream task using a few training examples. However, cur-
rent methods significantly overfit the training data, suffer-
ing from large accuracy degradation when tested on un-
seen classes from the same domain. To this end, in this
paper, we make the following 4 contributions: (1) To alle-
viate base class overfitting, we propose a novel Language-
Aware Soft Prompting (LASP) learning method by means of
a text-to-text cross-entropy loss that maximizes the proba-
bility that the learned prompts are correctly classified with
respect to pre-defined hand-crafted textual prompts. (2) To
increase the representation capacity of the prompts, we pro-
pose grouped LASP where each group of prompts is opti-
mized with respect to a separate subset of textual prompts.
(3) We identify a visual-language misalignment introduced
by prompt learning and LASP, and more importantly, pro-
pose a re-calibration mechanism to address it. (4) We show
that LASP is inherently amenable to including, during train-
ing, virtual classes, i.e. class names for which no visual
samples are available, further increasing the robustness of
the learned prompts. Through evaluations on 11 datasets,
we show that our approach (a) significantly outperforms all
prior works on soft prompting, and (b) matches and sur-
passes, for the first time, the accuracy on novel classes ob-
tained by hand-crafted prompts and CLIP for 8 out of 11
test datasets. Code will be made available here.
1. Introduction
Large-scale pre-training of neural networks has recently
resulted in the construction of a multitude of foundation
models for Language [7,25] and Vision & Language (V&L)
understanding [1,13,24,34]. Unlike the previous genera-
tion of neural networks, such models can better capture the
distribution of the world from which new favorable prop-
erties and characteristics emerge. Of particular interest to
this work are V&L models trained with contrastive learn-
ing (i.e. CLIP-like models [13,18,24,33,34]), which have
enabled seamless few-shot and even zero-shot adaptation to
new downstream tasks and datasets. Specifically, this pa-
per proposes a simple yet highly effective way to drastically
improve soft prompt learning for the few-shot adaptation of
the V&L model to a given downstream task.
Similarly to their NLP counterparts [16,17,24], prompt
engineering and learning have emerged as among the
most powerful techniques for adapting a V&L model to new
tasks. Initially, in [24], a set of manually-defined hand-
engineered templates (or prompts) like a photo of a
{cls name}, or a black and white photo of
a {cls name} were passed through the text encoder of
the V&L model to create class-specific weights for category
cls name that can be used for zero-shot recognition. Fol-
lowing research in NLP [16,17], subsequent work [35,36]
has proposed replacing the manually picked templates with
a sequence of learnable vectors, also coined soft prompts,
which are fed as input to the text encoder along with the
class name cls name. The soft prompts are learned from
a few training examples with the entire V&L model kept
frozen. The whole process can be seen as parameter effi-
cient fine-tuning of the model on a small training dataset.
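To make this setup concrete, the snippet below is a minimal PyTorch sketch of such soft prompt learning with a frozen V&L model. The toy text encoder, tensor shapes, and names such as SoftPromptLearner and few_shot_loss are illustrative assumptions standing in for CLIP's frozen encoders, not the interface of any released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    """Stand-in for CLIP's frozen transformer text encoder: mean-pool + linear."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_embeddings):                 # (C, L, dim)
        return self.proj(token_embeddings.mean(dim=1))   # (C, dim)

class SoftPromptLearner(nn.Module):
    """Learnable context vectors prepended to (frozen) class-name embeddings."""
    def __init__(self, text_encoder, class_name_embeddings, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))   # the soft prompts
        self.register_buffer("cls_emb", class_name_embeddings)    # (C, L_cls, dim)
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():                  # keep the V&L model frozen
            p.requires_grad_(False)

    def forward(self):
        C = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)             # (C, n_ctx, dim)
        tokens = torch.cat([ctx, self.cls_emb], dim=1)            # "[ctx ...][CLASS]"
        return F.normalize(self.text_encoder(tokens), dim=-1)     # per-class weights

def few_shot_loss(image_features, labels, prompt_learner, logit_scale=100.0):
    """Cross-entropy between (frozen) image features and prompt-generated class weights."""
    img = F.normalize(image_features, dim=-1)                     # (B, dim)
    txt = prompt_learner()                                        # (C, dim)
    logits = logit_scale * img @ txt.t()                          # (B, C)
    return F.cross_entropy(logits, labels)
```

Only the context vectors receive gradients; the rest of the computation mirrors zero-shot CLIP inference with hand-crafted prompts replaced by learned ones.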
However, a clearly identifiable problem with prompt
learning is base class overfitting: while the accuracy on
the classes used for training (base classes) significantly in-
creases, the accuracy on unseen, during training, (novel)
classes significantly drops. This is to some extent expected,
as soft prompts are learned from few examples belonging to
the base classes. Notably, on novel classes, direct, zero-shot
recognition using hand-engineered prompts outperforms all
existing soft prompt learning methods.
Key idea: To alleviate base class overfitting, in this work,
we propose a solution motivated by the following obser-
vation: since prompt learning improves the accuracy on
base classes, but prompt engineering is significantly bet-
ter on novel classes, we propose to learn the soft prompts
by adding a cross entropy text-to-text loss that enforces the
learned prompts to be close, in embedding space, to the
textual ones, thus exploiting the intrinsic information cap-
tured by the text encoder. The proposed text-to-text loss en-
ables language-only optimization for V&L model adaptation
for the first time. This is in contrast with prior soft-prompt
learning methods that only capture V&L interactions.
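The sketch below, continuing the toy setup above, illustrates one way such a text-to-text cross-entropy could be written; the temperature tau and the loss weighting alpha are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_to_text_loss(learned_txt, handcrafted_txt, tau=0.07):
    """Language-aware loss sketch: each learned-prompt embedding must be classified
    as its own class against the hand-crafted text embeddings of *all* classes.

    learned_txt:     (C, dim) text-encoder outputs of the learnable prompts.
    handcrafted_txt: (C, dim) text-encoder outputs of the hand-engineered templates
                     (e.g. averaged over several templates per class).
    """
    q = F.normalize(learned_txt, dim=-1)
    k = F.normalize(handcrafted_txt, dim=-1)
    logits = q @ k.t() / tau                                # (C, C) text-to-text similarities
    targets = torch.arange(q.shape[0], device=q.device)     # class c should match class c
    return F.cross_entropy(logits, targets)

# Illustrative combined objective: the usual vision-language term plus the
# language-only term (alpha is a hypothetical weighting hyper-parameter).
# loss = few_shot_loss(img_feats, labels, prompt_learner) \
#        + alpha * text_to_text_loss(learned_txt, handcrafted_txt)
```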
Key contributions: Based on the above, we propose a
novel framework for soft prompt learning which we call
Language-Aware Soft Prompting (LASP). Our main con-
tributions within the LASP framework are as follows:
• We propose, for the first time, language-only optimiza-
tion for V&L model adaptation. Specifically, we propose
a novel text-to-text cross-entropy loss that maximizes the
probability that the learned prompts are correctly classi-
fied with respect to the hand-engineered ones, and show its
effectiveness in terms of alleviating base-class overfitting.
• To increase the representation capacity of the prompts,
and inspired by grouped convolution and multi-head at-
tention, we propose a grouped language-aware prompt
representation where each group of prompts specializes
to a different subset of the pre-defined manual templates.
• We identify a visual-language misalignment introduced
by prompt learning and LASP which impacts the gener-
alization. More importantly, we propose a re-calibration
mechanism based on (a) Layer Normalization fine-tuning
and (b) learning a class-agnostic bias to address it.
• Thanks to our language-only learning framework, we pro-
pose training LASP with virtual classes by including, dur-
ing training, class names for which no visual samples are
available. Importantly, we show that this further increases
the robustness of the learned prompts.
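Because the language-only term needs only class names, the class list can be padded with names for which no images exist. The following sketch, under the same illustrative assumptions as the earlier snippets, shows how virtual classes could enter a combined objective: images are scored only against base classes, while the text-to-text term spans the extended list.

```python
import torch
import torch.nn.functional as F

def lasp_v_style_loss(img_feats, labels, learned_txt_all, handcrafted_txt_all,
                      n_base, alpha=1.0, logit_scale=100.0, tau=0.07):
    """Illustrative combined loss with virtual classes.

    img_feats:           (B, dim) frozen image-encoder features (base-class images only).
    labels:              (B,) base-class indices in [0, n_base).
    learned_txt_all:     (C, dim) learnable-prompt embeddings, base + virtual classes.
    handcrafted_txt_all: (C, dim) hand-crafted-template embeddings, base + virtual classes.
    """
    # Vision-language term: only the base classes have training images.
    img = F.normalize(img_feats, dim=-1)
    txt_base = F.normalize(learned_txt_all[:n_base], dim=-1)
    vl = F.cross_entropy(logit_scale * img @ txt_base.t(), labels)

    # Language-only term: virtual class names participate for free (no images needed).
    q = F.normalize(learned_txt_all, dim=-1)
    k = F.normalize(handcrafted_txt_all, dim=-1)
    tt = F.cross_entropy(q @ k.t() / tau,
                         torch.arange(q.shape[0], device=q.device))

    return vl + alpha * tt
```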
Main results: Our methods set a new state-of-the-art for
few-shot and zero-shot image classification on 11 datasets,
significantly outperforming all soft prompting prior works.
Importantly, we present, for the first time, a prompt learn-
ing method that outperforms, for the majority of the test
datasets (8 out of 11), the very strong baseline based on
hand-crafted prompts and CLIP for the recognition of novel
classes (i.e. zero-shot setting).
2. Related work
Contrastive V&L Models: Recently, large scale V&L pre-
training with contrastive learning has been used to train
foundation models resulting in robust representations, trans-
ferable to new tasks both under few-shot and zero-shot set-
tings [13,18,24,33,34]. Such networks consist of a vi-
sion encoder (typically a ViT [8]) and a Transformer-based
text encoder [30]. Highly parameterized instantiations of
such architectures are trained on large corpora of image-
caption pairs (e.g. [24] uses 400M and [13] 1B pairs) using
contrastive learning. We used CLIP [24] as the foundation
model for our method.
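For reference, the snippet below is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that such CLIP-like models are pre-trained with; the fixed logit scale stands in for CLIP's learnable temperature, and feature extraction is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, logit_scale=100.0):
    """Symmetric InfoNCE over a batch of matched image-caption pairs.

    image_feats, text_feats: (B, dim); caption i describes image i, so matched
    pairs lie on the diagonal of the similarity matrix and the rest are negatives.
    """
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = logit_scale * img @ txt.t()                    # (B, B)
    targets = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)          # image -> text
                  + F.cross_entropy(logits.t(), targets))   # text -> image
```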
Prompt Learning is about adapting pre-trained founda-
tion models to (downstream) tasks, typically in a zero-
shot or few-shot setting. First proposed in the context
of Language Models (LM), prompting was initially about
prepending hand-crafted instructions/examples to the task
input so that the LM generates the appropriate output con-
ditioned on the input [4,25]. In [27,28], the main idea is
to reformulate the downstream task as a cloze task using
hand-crafted patterns (or templates), thus avoiding the need
to train a task-specific classifier. As finding the optimal pat-
terns is laborious, recent works have attempted to address
this by learning a set of soft (continuous) prompts [16,17].
In V&L foundation models, like CLIP, the class names
are used to create hand-crafted prompts [24] that are fed as
input to the text encoder, enabling zero-shot visual recogni-
tion. CoOp [36] extends work on soft prompt optimization
to the V&L domain by learning a set of M prompts which
are used as input to the text encoder alongside the class
name. The prompts are learned by minimizing the classi-
fication error on a training set consisting of the given base
classes. One major limitation of CoOp is weak generaliza-
tion: the learned prompts overfit the base classes and do not
work well when tested on novel classes. To alleviate this,
CoCoOp [35] proposes a dynamic version of [36] where a
small network is trained to produce a visual feature from
the input image that is added to the learned prompts, hence
making them input specific (i.e. dynamic). ProDA [19]
adopts a probabilistic approach by modelling the distribu-
tion of the prompts at the output of the text encoder as
a multivariate Gaussian distribution. The estimated mean
is used during inference. UPL [12] uses CLIP to
generate pseudo-labels on the target dataset and then applies
self-training to learn the soft prompts. Finally, ProGrad [37]
aims to adapt the V&L model to each target domain by en-
couraging it “not to forget” CLIP’s zero-shot predictions us-
ing a KL visual-text loss between CLIP's logits and their
model’s logits (i.e. they use visual features). The weights
are then updated in the direction perpendicular to CLIP gra-
dients. In contrast, our loss is a pure text-to-text loss, fur-
ther allowing for the incorporation of virtual classes. Un-
like [37], we outperform CLIP on novel classes.
The proposed LASP framework alleviates base class
overfitting and significantly improves upon the previously
reported best results without resorting to a dynamic ap-
proach as in CoCoOp [35]. In its basic version, LASP de-
ploys a text-to-text loss that enforces the learned prompts to
be “close” to a set of manually defined textual prompts in
the text encoder space. Importantly, the basic LASP can be
extended in three important ways: (1) by allowing the incor-
poration of virtual classes, i.e. novel class name information
for which no (visual) training data is available (LASP-V).
This is shown to significantly improve the robustness of the
learned prompts at no extra cost during inference; (2) by al-
lowing the use of a grouped prompt representation within
the proposed language-aware training which is shown to in-
crease the representation capacity of the learned prompts;
(3) by performing further optimization of the visual encoder
via the re-calibration mechanism described above (Layer
Normalization fine-tuning and a learned class-agnostic bias).