for the first time. This is in contrast with prior soft-prompt
learning methods that only capture V&L interactions.
Key contributions: Based on the above, we propose a
novel framework for soft prompt learning which we call
Language-Aware Soft Prompting (LASP). Our main con-
tributions within the LASP framework are as follows:
• We propose, for the first time, language-only optimization for V&L model adaptation. Specifically, we propose a novel text-to-text cross-entropy loss that maximizes the probability of the learned prompts being correctly classified with respect to the hand-engineered ones, and we show its effectiveness in alleviating base-class overfitting (a minimal sketch of this loss is given after this list).
• To increase the representation capacity of the prompts,
and inspired by grouped convolution and multi-head at-
tention, we propose a grouped language-aware prompt
representation where each group of prompts specializes
to a different subset of the pre-defined manual templates.
• We identify a visual-language misalignment introduced by prompt learning and LASP that impacts generalization. More importantly, we propose a re-calibration mechanism to address it, based on (a) Layer Normalization fine-tuning and (b) learning a class-agnostic bias.
• Thanks to our language-only learning framework, we pro-
pose training LASP with virtual classes by including, dur-
ing training, class names for which no visual samples are
available. Importantly, we show that this further increases
the robustness of the learned prompts.
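To make the first contribution concrete, the following is a minimal PyTorch-style sketch of a text-to-text cross-entropy of the kind described above; the function name, the temperature value and the use of cosine similarities are illustrative assumptions rather than the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def text_to_text_loss(learned_feats, handcrafted_feats, tau=0.07):
    """Text-to-text cross-entropy (illustrative sketch).

    learned_feats:     (C, D) text-encoder outputs of the learned prompts, one row per class
    handcrafted_feats: (C, D) text-encoder outputs of the hand-engineered prompts
    The learned prompt of class c should be "classified" as class c against
    the hand-engineered class embeddings.
    """
    learned = F.normalize(learned_feats, dim=-1)
    handcrafted = F.normalize(handcrafted_feats, dim=-1)
    logits = learned @ handcrafted.t() / tau                      # (C, C) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # class c should match class c
    return F.cross_entropy(logits, targets)
```

Because this objective involves only text features, rows corresponding to virtual classes (class names without any visual samples) can be appended to both inputs, which is what makes the fourth contribution possible; the grouped representation of the second contribution can be read as applying the same idea per prompt group, each specializing to its own subset of the hand-crafted templates.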
Main results: Our methods set a new state of the art for few-shot and zero-shot image classification on 11 datasets, significantly outperforming all prior soft-prompting works. Importantly, we present, for the first time, a prompt learning method that outperforms, on the majority of the test datasets (8 out of 11), the very strong baseline based on hand-crafted prompts and CLIP for the recognition of novel classes (i.e. the zero-shot setting).
2. Related work
Contrastive V&L Models: Recently, large-scale V&L pre-training with contrastive learning has been used to train foundation models, resulting in robust representations that transfer to new tasks under both few-shot and zero-shot settings [13,18,24,33,34]. Such networks consist of a vi-
sion encoder (typically a ViT [8]) and a Transformer-based
text encoder [30]. Highly parameterized instantiations of
such architectures are trained on large corpora of image-
caption pairs (e.g. [24] uses 400M and [13] 1B pairs) using
contrastive learning. We use CLIP [24] as the foundation model for our method.
Prompt Learning is about adapting pre-trained foundation models to (downstream) tasks, typically in a zero-shot or few-shot setting. First proposed in the context of Language Models (LMs), prompting initially amounted to prepending hand-crafted instructions/examples to the task input so that the LM generates the appropriate output conditioned on the input [4,25]. In [27,28], the main idea is
to reformulate the downstream task as a cloze task using
hand-crafted patterns (or templates), thus avoiding the need
to train a task-specific classifier. As finding the optimal pat-
terns is laborious, recent works have attempted to address
this by learning a set of soft (continuous) prompts [16,17].
In V&L foundation models, like CLIP, the class names
are used to create hand-crafted prompts [24] that are fed as
input to the text encoder, enabling zero-shot visual recogni-
tion. CoOp [36] extends work on soft prompt optimization to the V&L domain by learning a set of M prompts which are used as input to the text encoder alongside the class name. The prompts are learned by minimizing the classification error on a training set consisting of the given base classes. One major limitation of CoOp is weak generaliza-
tion: the learned prompts overfit the base classes and do not
work well when tested on novel classes. To alleviate this,
CoCoOp [35] proposes a dynamic version of [36] where a
small network is trained to produce a visual feature from
the input image that is added to the learned prompts, hence
making them input specific (i.e. dynamic). ProDA [19]
adopts a probabilistic approach by modelling the distribu-
tion of the prompts at the output of the text encoder as
a multivariate Gaussian distribution. The estimated mean
is used during inference. UPL [12] uses CLIP to generate pseudo-labels on the target dataset and then applies self-training to learn the soft prompts. Finally, ProGrad [37]
aims to adapt the V&L model to each target domain by en-
couraging it “not to forget” CLIP’s zero-shot predictions us-
ing a KL visual-text loss between CLIP's logits and their
model’s logits (i.e. they use visual features). The weights
are then updated in the direction perpendicular to CLIP gra-
dients. In contrast, our loss is a pure text-to-text loss, fur-
ther allowing for the incorporation of virtual classes. Un-
like [37], we outperform CLIP on novel classes.
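To ground the discussion above, here is a schematic PyTorch-style sketch of the shared soft-prompt formulation that CoOp [36] and its successors build on: M learnable context vectors are prepended to the embedded class name and passed through the frozen text encoder, and the prompts are trained by minimizing a classification loss against frozen image features. All names, shapes and the temperature are assumptions made for illustration, not the exact implementation of any of the cited methods.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPrompts(nn.Module):
    """Schematic CoOp-style soft prompts: M learnable context vectors shared
    across all classes, prepended to each embedded class name and passed
    through a frozen text encoder (encoder interface and shapes are assumed)."""

    def __init__(self, text_encoder, class_token_embs, M=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder          # frozen; assumed to map (C, M + L, dim) -> (C, D)
        self.class_token_embs = class_token_embs  # (C, L, dim) embedded class names, kept frozen
        self.ctx = nn.Parameter(torch.randn(M, dim) * 0.02)  # the learnable prompts

    def class_features(self):
        C = self.class_token_embs.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # (C, M, dim)
        tokens = torch.cat([ctx, self.class_token_embs], dim=1)  # prepend prompts to class names
        return F.normalize(self.text_encoder(tokens), dim=-1)    # (C, D) per-class text features

def prompt_classification_loss(prompts, image_feats, labels, tau=0.07):
    """Minimize classification error on the base classes: image features
    (from the frozen image encoder) are matched against the prompt-generated
    class features via cosine similarity."""
    text_feats = prompts.class_features()
    logits = F.normalize(image_feats, dim=-1) @ text_feats.t() / tau  # (B, C)
    return F.cross_entropy(logits, labels)
```

In these terms, the text-to-text loss sketched earlier operates purely on the output of class_features() and a second set of hand-crafted class features, which is what removes the dependence on visual samples for that part of training and allows virtual classes to be added.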
The proposed LASP framework alleviates base class
overfitting and significantly improves upon the previously
reported best results without resorting to a dynamic ap-
proach as in CoCoOp [35]. In its basic version, LASP de-
ploys a text-to-text loss that enforces the learned prompts to
be “close” to a set of manually defined textual prompts in
the text encoder space. Importantly, the basic LASP can be extended in three ways: (1) by allowing the incor-
poration of virtual classes, i.e. novel class name information
for which no (visual) training data is available (LASP-V).
This is shown to significantly improve the robustness of the
learned prompts at no extra cost during inference; (2) by al-
lowing the use of a grouped prompt representation within
the proposed language-aware training which is shown to in-
crease the representation capacity of the learned prompts;
(3) by performing further optimization of the visual encoder