MaPLe: Multi-modal Prompt Learning
Muhammad Uzair Khattak1  Hanoona Rasheed1  Muhammad Maaz1
Salman Khan1,2  Fahad Shahbaz Khan1,3
1Mohamed bin Zayed University of AI  2Australian National University  3Linköping University
Abstract
Pre-trained vision-language (V-L) models such as CLIP
have shown excellent generalization ability to downstream
tasks. However, they are sensitive to the choice of input text
prompts and require careful selection of prompt templates to
perform well. Inspired by the Natural Language Processing
(NLP) literature, recent CLIP adaptation approaches learn
prompts as the textual inputs to fine-tune CLIP for down-
stream tasks. We note that using prompting to adapt repre-
sentations in a single branch of CLIP (language or vision) is
sub-optimal since it does not allow the flexibility to dynam-
ically adjust both representation spaces on a downstream
task. In this work, we propose Multi-modal Prompt Learn-
ing (MaPLe) for both vision and language branches to im-
prove alignment between the vision and language represen-
tations. Our design promotes strong coupling between the
vision-language prompts to ensure mutual synergy and dis-
courages learning independent uni-modal solutions. Fur-
ther, we learn separate prompts across different early stages
to progressively model the stage-wise feature relationships
to allow rich context learning. We evaluate the effectiveness
of our approach on three representative tasks of generaliza-
tion to novel classes, new target datasets and unseen do-
main shifts. Compared with the state-of-the-art method Co-
CoOp, MaPLe exhibits favorable performance and achieves
an absolute gain of 3.45% on novel classes and 2.72% on
overall harmonic-mean, averaged over 11 diverse image
recognition datasets. Our code and pre-trained models are
available at https://github.com/muzairkhattak/multimodal-
prompt-learning.
1. Introduction
Foundational vision-language (V-L) models such as CLIP
(Contrastive Language-Image Pretraining) [32] have shown
excellent generalization ability to downstream tasks. Such
models are trained to align language and vision modali-
ties on web-scale data e.g., 400 million text-image pairs in
CLIP. These models can reason about open-vocabulary vi-
sual concepts, thanks to the rich supervision provided by
natural language. During inference, hand-engineered text prompts, e.g., ‘a photo of a <category>’, are used as a query for the text encoder. The output text embeddings are
matched with the visual embeddings from an image encoder
to predict the output class. Designing high-quality contextual prompts has been shown to enhance the performance of CLIP and other V-L models [17,42].
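To make this zero-shot inference pipeline concrete, the following sketch scores an image against a handful of hand-crafted prompts. It is only an illustration: it assumes OpenAI's clip Python package, a toy label set, and a placeholder image path, none of which come from this paper's released code.

# Zero-shot CLIP classification with a hand-engineered prompt template.
# Label set and image path are illustrative placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

classes = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in classes]   # hand-crafted template

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(tokens)

# Cosine similarity between the image and text embeddings gives class scores.
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))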
Despite the effectiveness of CLIP towards generalization to new concepts, its massive scale and the scarcity of training data (e.g., in the few-shot setting) make it infeasible to fine-tune the full model for downstream tasks. Such fine-tuning can also forget useful knowledge acquired in the large-scale pretraining phase and can pose a risk of overfitting to the
downstream task. To address the above challenges, existing works propose language prompt learning to avoid manually adjusting the prompt templates and to provide a mechanism to adapt the model while keeping the original weights frozen [14,25,29,48,49]. Inspired by Natural Language Processing (NLP), these approaches only explore prompt learning for the text encoder in CLIP (Fig. 1:a), while adaptation of the equally important image encoder of CLIP remains an unexplored topic in the literature.
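As a rough illustration of such text-only prompt learning (in the spirit of CoOp), the sketch below prepends a few learnable context vectors to frozen class-name embeddings while the rest of CLIP stays frozen. The dimensions, names, and placeholder class embeddings are assumptions made for illustration, not the exact design of any cited method.

import torch
import torch.nn as nn

class TextPromptLearner(nn.Module):
    # Learnable context vectors for the text branch; all CLIP weights stay frozen.
    def __init__(self, n_ctx: int = 4, ctx_dim: int = 512, n_classes: int = 10):
        super().__init__()
        # Shared, learnable context tokens that replace the hand-written template.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Frozen class-name token embeddings (random placeholders here).
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, ctx_dim))

    def forward(self) -> torch.Tensor:
        # Prepend the shared context to every class embedding; the result of shape
        # (n_classes, n_ctx + 1, ctx_dim) would be fed to the frozen text encoder.
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)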
Our motivation derives from the multi-modal nature of
CLIP, where a text and image encoder co-exist and both
contribute towards properly aligning the V-L modalities.
We argue that any prompting technique should adapt the model completely; therefore, learning prompts only for the text encoder in CLIP is not sufficient to model the adaptations needed for the image encoder. To this end, we set out
to achieve completeness in the prompting approach and pro-
pose Multi-modal Prompt Learning (MaPLe) to adequately
fine-tune the text and image encoder representations such
that their optimal alignment can be achieved on the down-
stream tasks (Fig. 1:b). Our extensive experiments on three
key representative settings including base-to-novel gener-
alization, cross-dataset evaluation, and domain generaliza-
tion demonstrate the strength of MaPLe. On base-to-novel
generalization, our proposed MaPLe outperforms existing
prompt learning approaches across 11 diverse image recog-
nition datasets (Fig. 1:c) and achieves an absolute average gain
of 3.45% on novel classes and 2.72% on harmonic-mean