MaPLe: Multi-modal Prompt Learning
Muhammad Uzair Khattak1, Hanoona Rasheed1, Muhammad Maaz1, Salman Khan1,2, Fahad Shahbaz Khan1,3
1Mohamed bin Zayed University of AI  2Australian National University  3Linköping University
Abstract
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal, since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both the vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model stage-wise feature relationships and allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on the overall harmonic mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.
1. Introduction
Foundational vision-language (V-L) models such as CLIP (Contrastive Language-Image Pretraining) [32] have shown excellent generalization ability to downstream tasks. Such models are trained to align the language and vision modalities on web-scale data, e.g., 400 million text-image pairs in CLIP. Thanks to the rich supervision provided by natural language, these models can reason about open-vocabulary visual concepts. During inference, hand-engineered text prompts such as ‘a photo of a <category>’ are used as queries to the text encoder. The output text embeddings are matched with the visual embeddings from the image encoder to predict the output class. Designing high-quality contextual prompts has been shown to enhance the performance of CLIP and other V-L models [17,42].
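For readers who want to see this zero-shot pipeline end to end, here is a minimal sketch using OpenAI's publicly released `clip` package; the model variant, image path, and class names are illustrative placeholders rather than settings from the paper.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hand-engineered prompts, one per candidate class (placeholder class names).
classes = ["cat", "dog", "car"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity between the image embedding and each class text embedding.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```

Prompt learning methods such as those discussed below replace the hand-written template with learnable vectors while keeping this matching scheme intact.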
Despite the effectiveness of CLIP in generalizing to new concepts, its massive scale and the scarcity of training data in downstream settings (e.g., few-shot learning) make it infeasible to fine-tune the full model. Such fine-tuning can also cause the model to forget the useful knowledge acquired during large-scale pretraining and poses a risk of overfitting to the downstream task. To address these challenges, existing works propose language prompt learning, which avoids manually adjusting the prompt templates and provides a mechanism to adapt the model while keeping the original weights frozen [14,25,29,48,49]. Inspired by Natural Language Processing (NLP), these approaches explore prompt learning only for the text encoder of CLIP (Fig. 1:a), while adaptation choices for the equally important image encoder of CLIP remain unexplored in the literature.
Our motivation derives from the multi-modal nature of CLIP, where a text and an image encoder co-exist and both contribute to properly aligning the V-L modalities. We argue that any prompting technique should adapt the model completely; learning prompts only for the text encoder of CLIP is therefore not sufficient to model the adaptations needed for the image encoder. To this end, we set out to achieve completeness in the prompting approach and propose Multi-modal Prompt Learning (MaPLe) to adequately fine-tune the text and image encoder representations such that their optimal alignment can be achieved on downstream tasks (Fig. 1:b). Our extensive experiments on three key representative settings, including base-to-novel generalization, cross-dataset evaluation, and domain generalization, demonstrate the strength of MaPLe. On base-to-novel generalization, MaPLe outperforms existing prompt learning approaches across 11 diverse image recognition datasets (Fig. 1:c) and achieves an absolute average gain of 3.45% on novel classes and 2.72% on harmonic mean over the state-of-the-art method Co-CoOp [48].
Figure 1. Comparison of MaPLe with standard prompt learning methods. (a) Existing methods adopt uni-modal prompting techniques to fine-tune CLIP representations, as prompts are learned only in a single branch of CLIP (language or vision). (b) MaPLe introduces branch-aware hierarchical prompts that adapt both the language and vision branches simultaneously for improved generalization. (c) MaPLe surpasses state-of-the-art methods on 11 diverse image recognition datasets on the novel-class generalization task.
Further, MaPLe demonstrates favorable generalization ability and robustness in cross-dataset transfer and domain generalization settings, leading to consistent improvements over existing approaches. Owing to its streamlined architectural design, MaPLe exhibits improved efficiency during both training and inference with little overhead, in contrast to Co-CoOp, whose image-instance-conditioned design makes it less efficient. In summary, the main contributions of this work include:
• We propose multi-modal prompt learning in CLIP to favourably align its vision-language representations. To the best of our knowledge, this is the first multi-modal prompting approach for fine-tuning CLIP.
• To link the prompts learned in the text and image encoders, we propose a coupling function to explicitly condition vision prompts on their language counterparts. It acts as a bridge between the two modalities and allows mutual propagation of gradients to promote synergy.
• Our multi-modal prompts are learned across multiple transformer blocks in both the vision and language branches to progressively learn the synergistic behaviour of the two modalities. This deep prompting strategy allows modeling the contextual relationships independently, thus providing more flexibility to align the vision-language representations.
2. Related Work
Vision Language Models: The combined use of language supervision with natural images has attracted great interest in the computer vision community. In contrast to models learned with image supervision alone, these vision-language (V-L) models encode rich multi-modal representations. Recently, V-L models such as CLIP [32], ALIGN [15], LiT [45], FILIP [41] and Florence [43] have demonstrated exceptional performance on a wide spectrum of tasks, including few-shot and zero-shot visual recognition. These models learn joint image-language representations in a self-supervised manner using abundantly available data from the web; for example, CLIP and ALIGN use 400M and 1B image-text pairs, respectively, to train their multi-modal networks. Although these pre-trained V-L models learn generalized representations, efficiently adapting them to downstream tasks remains a challenging problem. Many works have demonstrated better downstream performance by using tailored methods to adapt V-L models for few-shot image recognition [9,19,46], object detection [8,10,27,34,44,50], and segmentation [5,22,26,33]. In this work, we propose a novel multi-modal prompt learning technique to effectively adapt CLIP for few-shot and zero-shot visual recognition tasks.
Prompt Learning: Instructions in the form of a sentence, known as a text prompt, are usually given to the language branch of a V-L model, allowing it to better understand the task. Prompts can be handcrafted for a downstream task or learned automatically during the fine-tuning stage. The latter is referred to as ‘Prompt Learning’, which was first used in NLP [21,23,24] and later adapted to V-L [48,49,51] and vision-only [16,38,39,47] models. Similar to [16], our design also uses deep ‘vision’ prompting. However, ours is the first multi-modal prompting design, while [16] is uni-modal.
Figure 2. Overview of our proposed MaPLe (Multi-modal Prompt Learning) framework for prompt learning in V-L models. MaPLe tunes both the vision and language branches, where only the context prompts are learned while the rest of the model is frozen. MaPLe conditions the vision prompts on the language prompts via a V-L coupling function F to induce mutual synergy between the two modalities. Our framework uses deep contextual prompting, where separate context prompts are learned across multiple transformer blocks.
Prompt Learning in Vision-Language Models: Full fine-tuning and linear probing [9] are two typical approaches for adapting a V-L model (i.e., CLIP) to downstream tasks. Complete fine-tuning degrades the previously learned joint V-L representation, while linear probing limits the zero-shot capability of CLIP. To address this, and inspired by prompt learning in NLP, many works propose adapting V-L models by learning prompt tokens in end-to-end training. CoOp [49] fine-tunes CLIP for few-shot transfer by optimizing a continuous set of prompt vectors in its language branch. Co-CoOp [48] highlights the inferior performance of CoOp on novel classes and addresses the generalization issue by explicitly conditioning prompts on image instances. [25] proposes to optimize multiple sets of prompts by learning a distribution over prompts. [18] adapts CLIP by learning prompts for video understanding tasks. [1] performs visual prompt tuning on CLIP by prompting the vision branch. We note that existing methods follow independent uni-modal designs and learn prompts either in the language or in the vision branch of CLIP, thus adapting CLIP only partially. In this paper, we explore an important question: given the multi-modal nature of CLIP, is complete prompting (i.e., in both the language and vision branches) better suited to adapt CLIP? Our work is the first to answer this question by investigating the effectiveness of multi-modal prompt learning for improving the alignment between the vision and language representations.
3. Method
Our approach concerns fine-tuning a pre-trained multi-modal CLIP for better generalization to downstream tasks through context optimization via prompting. Fig. 2 shows the overall architecture of our proposed MaPLe (Multi-modal Prompt Learning) framework. Unlike previous approaches [48,49], which learn context prompts only at the language branch, MaPLe adopts a joint prompting approach in which the context prompts are learned in both the vision and language branches. Specifically, we append learnable context tokens in the language branch and explicitly condition the vision prompts on the language prompts via a coupling function to establish interaction between them. To learn hierarchical contextual representations, we introduce deep prompting in both branches through separate learnable context prompts across different transformer blocks. During fine-tuning, only the context prompts along with their coupling function are learned, while the rest of the model is kept frozen. Below, we first outline the pre-trained CLIP architecture and then present our proposed fine-tuning approach.
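To make this concrete, below is a minimal, illustrative PyTorch sketch of the coupled prompt parameters described above. It is not the paper's released implementation (names such as `MultiModalPrompts`, the dimensions, and the prompt depth are assumptions chosen for illustration); the linked repository is the authoritative reference.

```python
import torch
import torch.nn as nn

class MultiModalPrompts(nn.Module):
    """Illustrative sketch of MaPLe-style coupled prompts (not the official code).

    For each of the first `depth` transformer blocks, a set of language context
    tokens is learned directly, while the corresponding vision prompts are
    produced from them by a per-layer linear coupling function, so gradients
    flow to both branches through the coupler.
    """

    def __init__(self, n_ctx=2, ctx_dim=512, vision_dim=768, depth=9):
        super().__init__()
        # Learnable language context tokens, one set per prompted layer.
        self.language_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim)) for _ in range(depth)]
        )
        # Per-layer coupling functions projecting language prompts to the
        # vision branch's embedding width.
        self.couplers = nn.ModuleList(
            [nn.Linear(ctx_dim, vision_dim) for _ in range(depth)]
        )

    def forward(self):
        # Vision prompts are explicitly conditioned on the language prompts.
        vision_prompts = [f(p) for p, f in zip(self.language_prompts, self.couplers)]
        return list(self.language_prompts), vision_prompts
```

During fine-tuning, these context tokens would be prepended to the token sequences of the corresponding (frozen) language and vision transformer layers, and only the parameters above receive gradients.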
3.1. Revisiting CLIP
We build our approach on a pre-trained vision-language (V-L) model, CLIP, which consists of a text and vision encoder. Consistent with existing prompting methods [48,49], we use a vision transformer (ViT) based CLIP model [6]. CLIP encodes an image $I \in \mathbb{R}^{H \times W \times 3}$ and a corresponding text description as explained below.
Encoding Image: The image encoder $V$, with $K$ transformer layers $\{V_i\}_{i=1}^{K}$, splits the image $I$ into $M$ fixed-size patches, which are projected into patch embeddings $E_0 \in \mathbb{R}^{M \times d_v}$. Patch embeddings $E_i$ are input to the $(i+1)$-th transformer block $(V_{i+1})$ along with a learnable class (CLS) token $c_i$.
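As a sketch of the sequential processing just described, using only the symbols defined above (the paper's exact equation may differ slightly in notation), each transformer block jointly updates the class token and the patch embeddings:

```latex
[c_i, E_i] = V_i([c_{i-1}, E_{i-1}]), \qquad i = 1, 2, \ldots, K
```

In ViT-based CLIP, the class token of the final block is then projected into the shared V-L embedding space to obtain the image representation that is matched against the text embeddings.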