
MAMO: Fine-Grained Vision-Language Representations Learning
with Masked Multimodal Modeling
Zijia Zhao1,3∗†, Longteng Guo2∗, Xingjian He1, Shuai Shao2, Zehuan Yuan2, Jing Liu1,3‡
1Laboratory of Cognition and Decision Intelligence for Complex Systems,
Institute of Automation, Chinese Academy of Sciences
2Bytedance Inc. 3School of Artificial Intelligence, University of Chinese Academy of Sciences
zhaozijia2021@ia.ac.cn, {xingjian.he,jliu}@nlpr.ia.ac.cn, {guolongteng.lt,shaoshuai.0516,yuanzehuan}@bytedance.com
ABSTRACT
Multimodal representation learning has shown promising improvements on various vision-language tasks (e.g., image-text retrieval, visual question answering, etc.) and has significantly advanced the development of multimedia information systems. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels, discrete vision tokens), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
CCS CONCEPTS
• Information systems → Multimedia information systems; Multimedia and multimodal retrieval.
*Equal contribution.
†This work was performed while Zijia worked as an intern at ByteDance.
‡Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGIR ’23, July 23–27, 2023, Taipei, Taiwan
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9408-6/23/07. . . $15.00
https://doi.org/10.1145/3539618.3591721
KEYWORDS
vision-language pretraining, masked modeling, image-text retrieval,
visual question answering
ACM Reference Format:
Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing
Liu. 2023. MAMO: Fine-Grained Vision-Language Representations Learning
with Masked Multimodal Modeling. In Proceedings of the 46th International
ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA,
11 pages. https://doi.org/10.1145/3539618.3591721
1 INTRODUCTION
Vision-Language Pre-training (VLP) is an emerging research topic in multimedia information systems. It aims to learn the interaction between image and text and produce semantically rich multimodal representations that transfer well to downstream Vision-and-Language (V+L) tasks, including image-text retrieval, visual question answering, etc. In order to learn cross-modal interaction, most existing VLP methods [6, 18, 23, 25, 32, 38] rely on the consistency between the global views of image and text, using pre-training objectives like image-text contrastive loss [23] and image-text matching loss [38]. While effective, such global interaction fails to model the subtle local association within image-text pairs. The fine-grained interaction between image patches and word tokens is therefore lacking.
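To make these global-alignment objectives concrete, the following is a minimal, illustrative sketch in PyTorch of an InfoNCE-style image-text contrastive loss and a binary image-text matching loss. The tensor shapes, the temperature value, and the itm_head classifier are assumptions made for this example, not the exact formulation of any of the cited methods.

```python
# Illustrative sketch of the two global-alignment objectives discussed above:
# a symmetric InfoNCE image-text contrastive loss and a binary image-text
# matching loss. Shapes and the fused-feature input are assumptions.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over global image/text embeddings.

    image_feats, text_feats: (batch, dim) global features of paired images and
    captions; the i-th image matches the i-th text.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2


def image_text_matching_loss(fused_cls, labels, itm_head):
    """Binary matched / not-matched classification on fused multimodal features.

    fused_cls: (batch, dim) cross-modal encoder output for (image, text) pairs.
    labels:    (batch,) with 1 for true pairs and 0 for mined negative pairs.
    """
    logits = itm_head(fused_cls)                           # (batch, 2)
    return F.cross_entropy(logits, labels)


# Toy usage with random features, only to show the expected shapes.
if __name__ == "__main__":
    B, D = 8, 256
    itc = image_text_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    itm = image_text_matching_loss(torch.randn(B, D),
                                   torch.randint(0, 2, (B,)),
                                   torch.nn.Linear(D, 2))
    print(itc.item(), itm.item())
```

Both losses operate only on global, sentence- and image-level features, which is precisely why they cannot by themselves capture patch-to-token associations.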
Masked signal modeling is an effective self-supervised pre-training task that masks a portion of the input signals and predicts the masked signals from the visible ones. It has been actively explored in natural language processing (NLP) and computer vision (CV) separately, and has brought powerful generalization performance across diverse downstream tasks. For example, BERT [20] formulates masked language modeling (MLM) to predict masked linguistic tokens, while MAE [16] and BEiT [2] formulate masked image modeling (MIM) to reconstruct raw pixels and dVAE visual tokens [33] of image patches, respectively. In the domain of VLP, however, there is a lack of a jointly masked signal modeling method for both the vision and language modalities. Although previous VLP methods [6, 23] have adopted conditional MLM to predict masked words given the unmasked image and the other words, masking of the image side has not been fully explored. As a result, the images' internal structures and their interactions with text tokens are not sufficiently learned, as shown in Fig. 1.
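For illustration, the sketch below shows one way joint masking of both modalities could look in PyTorch: randomly hiding a subset of image patch embeddings and text token ids before they are fed to the model. The mask ratios, the learnable mask embedding, and the function name are hypothetical choices for this example, not the exact recipe of MAMO or of the cited MLM/MIM methods.

```python
# A minimal sketch of jointly masking both modalities: random masking applied to
# image patch embeddings and text token ids at the same time. All hyperparameters
# here are illustrative assumptions.
import torch


def joint_mask(patch_embeds, token_ids, mask_embed, mask_token_id,
               image_mask_ratio=0.5, text_mask_ratio=0.15):
    """Randomly masks image patches and text tokens.

    patch_embeds: (batch, num_patches, dim) patch embeddings from a vision encoder.
    token_ids:    (batch, seq_len) input ids for a text encoder.
    mask_embed:   (dim,) learnable embedding that replaces masked patches.
    Returns the masked inputs plus boolean masks marking which positions were hidden.
    """
    B, N, D = patch_embeds.shape
    img_mask = torch.rand(B, N, device=patch_embeds.device) < image_mask_ratio
    masked_patches = torch.where(img_mask.unsqueeze(-1),
                                 mask_embed.expand(B, N, D),
                                 patch_embeds)

    txt_mask = torch.rand_like(token_ids, dtype=torch.float) < text_mask_ratio
    masked_tokens = torch.where(txt_mask,
                                torch.full_like(token_ids, mask_token_id),
                                token_ids)
    return masked_patches, img_mask, masked_tokens, txt_mask
```

The masked positions returned here are the ones a jointly masked model would later be asked to recover, e.g., by predicting latent representations of the unmasked input (the implicit target) or higher-level features and word concepts (the explicit target) described in the abstract.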
The challenge of designing a jointly masked signal modeling method for VLP lies in the natural differences between image and text modalities: image is continuous, low-level, highly redundant