MaPLe: Multi-modal Prompt Learning
Muhammad Uzair Khattak1, Hanoona Rasheed1, Muhammad Maaz1, Salman Khan1,2, Fahad Shahbaz Khan1,3
1Mohamed bin Zayed University of AI  2Australian National University  3Linköping University
Abstract
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal, since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both the vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model stage-wise feature relationships and allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks: generalization to novel classes, new target datasets, and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on the overall harmonic mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.
1. Introduction
Foundational vision-language (V-L) models such as CLIP (Contrastive Language-Image Pretraining) [32] have shown excellent generalization ability to downstream tasks. Such models are trained to align the language and vision modalities on web-scale data, e.g., 400 million text-image pairs in CLIP. Thanks to the rich supervision provided by natural language, these models can reason about open-vocabulary visual concepts. During inference, hand-engineered text prompts such as ‘a photo of a <category>’ are used as queries to the text encoder. The output text embeddings are matched with the visual embeddings from the image encoder to predict the output class. Designing high-quality contextual prompts has been shown to enhance the performance of CLIP and other V-L models [17,42].
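For readers who want to see this zero-shot pipeline end to end, here is a minimal sketch using OpenAI's publicly released `clip` package; the model variant, image path, and class names are illustrative placeholders rather than settings from the paper.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hand-engineered prompts, one per candidate class (placeholder class names).
classes = ["cat", "dog", "car"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity between the image embedding and each class text embedding.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```

Prompt learning methods such as those discussed below replace the hand-written template with learnable vectors while keeping this matching scheme intact.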
Despite the effectiveness of CLIP in generalizing to new concepts, its massive scale and the scarcity of training data in downstream settings (e.g., few-shot learning) make it infeasible to fine-tune the full model. Such fine-tuning can also cause the model to forget the useful knowledge acquired during large-scale pretraining and poses a risk of overfitting to the downstream task. To address these challenges, existing works propose language prompt learning, which avoids manually adjusting the prompt templates and provides a mechanism to adapt the model while keeping the original weights frozen [14,25,29,48,49]. Inspired by Natural Language Processing (NLP), these approaches explore prompt learning only for the text encoder of CLIP (Fig. 1:a), while adaptation choices for the equally important image encoder of CLIP remain unexplored in the literature.
Our motivation derives from the multi-modal nature of CLIP, where a text and an image encoder co-exist and both contribute to properly aligning the V-L modalities. We argue that any prompting technique should adapt the model completely; learning prompts only for the text encoder of CLIP is therefore not sufficient to model the adaptations needed for the image encoder. To this end, we set out to achieve completeness in the prompting approach and propose Multi-modal Prompt Learning (MaPLe) to adequately fine-tune the text and image encoder representations such that their optimal alignment can be achieved on downstream tasks (Fig. 1:b). Our extensive experiments on three key representative settings, including base-to-novel generalization, cross-dataset evaluation, and domain generalization, demonstrate the strength of MaPLe. On base-to-novel generalization, MaPLe outperforms existing prompt learning approaches across 11 diverse image recognition datasets (Fig. 1:c) and achieves an absolute average gain of 3.45% on novel classes and 2.72% on harmonic mean over the state-of-the-art method Co-CoOp [48].
Figure 1. Comparison of MaPLe with standard prompt learning methods. (a) Existing methods adopt uni-modal prompting techniques to fine-tune CLIP representations, as prompts are learned only in a single branch of CLIP (language or vision). (b) MaPLe introduces branch-aware hierarchical prompts that adapt both the language and vision branches simultaneously for improved generalization. (c) MaPLe surpasses state-of-the-art methods on 11 diverse image recognition datasets on the novel-class generalization task.
Further, MaPLe demonstrates favorable generalization ability and robustness in cross-dataset transfer and domain generalization settings, leading to consistent improvements over existing approaches. Owing to its streamlined architectural design, MaPLe exhibits improved efficiency during both training and inference with little overhead, in contrast to Co-CoOp, whose image-instance-conditioned design makes it less efficient. In summary, the main contributions of this work include:
• We propose multi-modal prompt learning in CLIP to favourably align its vision-language representations. To the best of our knowledge, this is the first multi-modal prompting approach for fine-tuning CLIP.
• To link the prompts learned in the text and image encoders, we propose a coupling function to explicitly condition vision prompts on their language counterparts. It acts as a bridge between the two modalities and allows mutual propagation of gradients to promote synergy.
• Our multi-modal prompts are learned across multiple transformer blocks in both the vision and language branches to progressively learn the synergistic behaviour of the two modalities. This deep prompting strategy allows modeling the contextual relationships independently, thus providing more flexibility to align the vision-language representations.
2. Related Work
Vision Language Models: The combined use of language supervision with natural images has attracted great interest in the computer vision community. In contrast to models learned with image supervision alone, these vision-language (V-L) models encode rich multi-modal representations. Recently, V-L models such as CLIP [32], ALIGN [15], LiT [45], FILIP [41] and Florence [43] have demonstrated exceptional performance on a wide spectrum of tasks, including few-shot and zero-shot visual recognition. These models learn joint image-language representations in a self-supervised manner using abundantly available data from the web; for example, CLIP and ALIGN use 400M and 1B image-text pairs, respectively, to train their multi-modal networks. Although these pre-trained V-L models learn generalized representations, efficiently adapting them to downstream tasks remains a challenging problem. Many works have demonstrated better downstream performance by using tailored methods to adapt V-L models for few-shot image recognition [9,19,46], object detection [8,10,27,34,44,50], and segmentation [5,22,26,33]. In this work, we propose a novel multi-modal prompt learning technique to effectively adapt CLIP for few-shot and zero-shot visual recognition tasks.
Prompt Learning: Instructions in the form of a sentence, known as a text prompt, are usually given to the language branch of a V-L model, allowing it to better understand the task. Prompts can be handcrafted for a downstream task or learned automatically during the fine-tuning stage. The latter is referred to as ‘Prompt Learning’, which was first used in NLP [21,23,24] and later adapted to V-L [48,49,51] and vision-only [16,38,39,47] models. Similar to [16], our design also uses deep ‘vision’ prompting. However, ours is the first multi-modal prompting design, while [16] is uni-modal.
Figure 2. Overview of our proposed MaPLe (Multi-modal Prompt Learning) framework for prompt learning in V-L models. MaPLe tunes both the vision and language branches, where only the context prompts are learned while the rest of the model is frozen. MaPLe conditions the vision prompts on the language prompts via a V-L coupling function F to induce mutual synergy between the two modalities. Our framework uses deep contextual prompting, where separate context prompts are learned across multiple transformer blocks.
Prompt Learning in Vision-Language Models: Full fine-tuning and linear probing [9] are two typical approaches for adapting a V-L model (i.e., CLIP) to downstream tasks. Complete fine-tuning degrades the previously learned joint V-L representation, while linear probing limits the zero-shot capability of CLIP. To address this, and inspired by prompt learning in NLP, many works propose adapting V-L models by learning prompt tokens in end-to-end training. CoOp [49] fine-tunes CLIP for few-shot transfer by optimizing a continuous set of prompt vectors in its language branch. Co-CoOp [48] highlights the inferior performance of CoOp on novel classes and addresses the generalization issue by explicitly conditioning prompts on image instances. [25] proposes to optimize multiple sets of prompts by learning a distribution over prompts. [18] adapts CLIP by learning prompts for video understanding tasks. [1] performs visual prompt tuning on CLIP by prompting the vision branch. We note that existing methods follow independent uni-modal designs and learn prompts either in the language or in the vision branch of CLIP, thus adapting CLIP only partially. In this paper, we explore an important question: given the multi-modal nature of CLIP, is complete prompting (i.e., in both the language and vision branches) better suited to adapt CLIP? Our work is the first to answer this question by investigating the effectiveness of multi-modal prompt learning for improving the alignment between the vision and language representations.
3. Method
Our approach concerns fine-tuning a pre-trained multi-modal CLIP for better generalization to downstream tasks through context optimization via prompting. Fig. 2 shows the overall architecture of our proposed MaPLe (Multi-modal Prompt Learning) framework. Unlike previous approaches [48,49], which learn context prompts only at the language branch, MaPLe adopts a joint prompting approach in which the context prompts are learned in both the vision and language branches. Specifically, we append learnable context tokens in the language branch and explicitly condition the vision prompts on the language prompts via a coupling function to establish interaction between them. To learn hierarchical contextual representations, we introduce deep prompting in both branches through separate learnable context prompts across different transformer blocks. During fine-tuning, only the context prompts along with their coupling function are learned, while the rest of the model is kept frozen. Below, we first outline the pre-trained CLIP architecture and then present our proposed fine-tuning approach.
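To make this concrete, below is a minimal, illustrative PyTorch sketch of the coupled prompt parameters described above. It is not the paper's released implementation (names such as `MultiModalPrompts`, the dimensions, and the prompt depth are assumptions chosen for illustration); the linked repository is the authoritative reference.

```python
import torch
import torch.nn as nn

class MultiModalPrompts(nn.Module):
    """Illustrative sketch of MaPLe-style coupled prompts (not the official code).

    For each of the first `depth` transformer blocks, a set of language context
    tokens is learned directly, while the corresponding vision prompts are
    produced from them by a per-layer linear coupling function, so gradients
    flow to both branches through the coupler.
    """

    def __init__(self, n_ctx=2, ctx_dim=512, vision_dim=768, depth=9):
        super().__init__()
        # Learnable language context tokens, one set per prompted layer.
        self.language_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim)) for _ in range(depth)]
        )
        # Per-layer coupling functions projecting language prompts to the
        # vision branch's embedding width.
        self.couplers = nn.ModuleList(
            [nn.Linear(ctx_dim, vision_dim) for _ in range(depth)]
        )

    def forward(self):
        # Vision prompts are explicitly conditioned on the language prompts.
        vision_prompts = [f(p) for p, f in zip(self.language_prompts, self.couplers)]
        return list(self.language_prompts), vision_prompts
```

During fine-tuning, these context tokens would be prepended to the token sequences of the corresponding (frozen) language and vision transformer layers, and only the parameters above receive gradients.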
3.1. Revisiting CLIP
We build our approach on a pre-trained vision-language (V-L) model, CLIP, which consists of a text and vision encoder. Consistent with existing prompting methods [48,49], we use a vision transformer (ViT) based CLIP model [6]. CLIP encodes an image $I \in \mathbb{R}^{H \times W \times 3}$ and a corresponding text description as explained below.
Encoding Image: The image encoder $V$, with $K$ transformer layers $\{V_i\}_{i=1}^{K}$, splits the image $I$ into $M$ fixed-size patches, which are projected into patch embeddings $E_0 \in \mathbb{R}^{M \times d_v}$. Patch embeddings $E_i$ are input to the $(i+1)$-th transformer block $(V_{i+1})$ along with a learnable class (CLS) token $c_i$.
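As a sketch of the sequential processing just described, using only the symbols defined above (the paper's exact equation may differ slightly in notation), each transformer block jointly updates the class token and the patch embeddings:

```latex
[c_i, E_i] = V_i([c_{i-1}, E_{i-1}]), \qquad i = 1, 2, \ldots, K
```

In ViT-based CLIP, the class token of the final block is then projected into the shared V-L embedding space to obtain the image representation that is matched against the text embeddings.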