MAMO: Fine-Grained Vision-Language Representations Learning
with Masked Multimodal Modeling
Zijia Zhao1,3∗†, Longteng Guo2∗, Xingjian He1, Shuai Shao2, Zehuan Yuan2, Jing Liu1,3‡
1Laboratory of Cognition and Decision Intelligence for Complex Systems,
Institute of Automation, Chinese Academy of Sciences
2Bytedance Inc. 3School of Artificial Intelligence, University of Chinese Academy of Sciences
zhaozijia2021@ia.ac.cn, {xingjian.he,jliu}@nlpr.ia.ac.cn, {guolongteng.lt,shaoshuai.0516,yuanzehuan}@bytedance.com
ABSTRACT
Multimodal representation learning has shown promising improvements on various vision-language tasks (e.g., image-text retrieval, visual question answering, etc.) and has significantly advanced the development of multimedia information systems. Most existing methods excel at building global-level alignment between vision and language but lack effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels, discrete vision tokens), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
CCS CONCEPTS
• Information systems → Multimedia information systems; Multimedia and multimodal retrieval.
∗Equal contribution.
†This work was performed while Zijia worked as an intern at ByteDance.
‡Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGIR ’23, July 23–27, 2023, Taipei, Taiwan
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9408-6/23/07. . . $15.00
https://doi.org/10.1145/3539618.3591721
KEYWORDS
vision-language pretraining, masked modeling, image-text retrieval,
visual question answering
ACM Reference Format:
Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing
Liu. 2023. MAMO: Fine-Grained Vision-Language Representations Learning
with Masked Multimodal Modeling. In Proceedings of the 46th International
ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA,
11 pages. https://doi.org/10.1145/3539618.3591721
1 INTRODUCTION
Vision-Language Pre-training (VLP) is an emerging research topic
in multimedia information systems. It aims to learn the interac-
tion between image and text and produce semantically rich multi-
modal representations that transfer well to downstream Vision-and-
Language (V+L) tasks including image-text retrieval, visual question
answering, etc. In order to learn cross-modal interaction, most existing VLP methods [6, 18, 23, 25, 32, 38] rely on the consistency between the global views of image and text, using pre-training objectives like image-text contrastive loss [23] and image-text matching loss [38]. While effective, such global interaction fails to model the subtle local association within image-text pairs. The fine-grained interaction between image patches and word tokens is therefore lacking.
Masked signal modeling is an effective self-supervised pre-training task that masks a portion of input signals and tries to predict these masked signals from the visible ones. It has been actively explored in natural language processing (NLP) and computer vision (CV) separately, and has brought powerful generalization performance across diverse downstream tasks. For example, BERT [20] formulates masked language modeling (MLM) to predict masked linguistic tokens, while MAE [16] and BEiT [2] formulate masked image modeling (MIM) to reconstruct raw pixels and dVAE visual tokens [33] of image patches, respectively. In the domain of VLP, however, there is a lack of a jointly masked signal modeling method for both vision and language modalities. Although previous VLP methods [6, 23] have adopted conditional MLM to predict masked words given the unmasked image and the other words, masking of the image side has not been fully explored. As a result, the images' internal structures and their interactions with text tokens are not sufficiently learned, as shown in Fig. 1.
The challenge of designing a jointly masked signal modeling method for VLP lies in the natural differences between image and text modalities: images are continuous, low-level, highly redundant
raw signals, while text tokens are discrete, high-level, highly compressed concepts generated by humans. This contrast raises two questions: (1) How to design a unified prediction target that applies to masked multimodal data comprising both continuous visual signals and discrete text tokens? (2) How to avoid the semantic gap between the learning of high-level representations and the prediction of low-level image signals?
In this paper, we propose the MAsked Multimodal mOdel (MAMO), a VLP model with a jointly masked learning strategy on both vision and language modalities. MAMO performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language; its core idea is to predict latent multimodal representations of the masked signals under the guidance of a self-distillation teacher that takes the unmasked view as input. Such a bootstrapping latent target avoids the learning biases introduced by modality-specific designs. While the implicit prediction target is empirically effective, it can collapse into meaningless solutions, e.g., outputting the same vector for all tokens. To further enrich the multimodal representations and avoid such potential trivial solutions, we also add auxiliary explicit prediction targets that are naturally distinguishable at each masked position. These targets are explicit in that they are semantically meaningful features or concepts extracted from the raw data. In particular, for the masked image tokens, instead of reconstructing low-level raw pixels [16] or predicting mid-level pre-defined visual tokens [2] (which encapsulate mostly patch details, e.g., color and texture, according to [48]), the model is enforced to predict the high-level momentum visual features extracted from the image encoder. As for masked text tokens, we directly predict word tokens since they are already high-level concepts.
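To make the two explicit targets concrete, the sketch below shows one plausible way to compute the auxiliary losses. It is our illustration rather than the authors' released code; the tensor names, shapes, and the cosine-style regression loss for masked patches are assumptions.

```python
# Illustrative sketch of the two explicit targets (assumed names, shapes, and loss form).
import torch.nn.functional as F

def explicit_losses(pred_patch_feats, momentum_patch_feats, pred_word_logits,
                    word_labels, patch_mask, word_mask):
    # pred_patch_feats:     (B, N, D) online-network outputs at image-patch positions
    # momentum_patch_feats: (B, N, D) momentum image-encoder features (regression targets)
    # pred_word_logits:     (B, M, V) vocabulary logits at text positions
    # word_labels:          (B, M)    ground-truth token ids
    # patch_mask, word_mask: boolean masks marking the masked positions

    # Masked image modeling: regress high-level momentum visual features on masked
    # patches; here a cosine-similarity loss on normalized features (an assumed choice).
    p = F.normalize(pred_patch_feats[patch_mask], dim=-1)
    t = F.normalize(momentum_patch_feats[patch_mask], dim=-1).detach()  # teacher gives no gradient
    loss_mim = (2.0 - 2.0 * (p * t).sum(dim=-1)).mean()

    # Masked language modeling: standard cross-entropy on masked word tokens.
    loss_mlm = F.cross_entropy(pred_word_logits[word_mask], word_labels[word_mask])
    return loss_mim, loss_mlm
```

The `.detach()` call reflects the role of the momentum image encoder as a fixed teacher whose features are targets, not trainable predictions.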
MAMO naturally addresses the two aforementioned questions: first, the prediction targets of both masked vision and language signals are unified as regressing the implicit latent multimodal representations; second, our implicit and explicit prediction targets are both high-level representations and features/concepts, thus avoiding the semantic gap caused by predicting low- or mid-level image signals. Such a masked modeling process forces the model to build both intra- and inter-modality interactions between image patches and word tokens, and to produce fine-grained and semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings.
We demonstrate the effectiveness of MAMO on various downstream V+L tasks including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding. MAMO achieves substantial improvements over existing state-of-the-art methods. On zero-shot image-text retrieval, MAMO even outperforms methods that are pre-trained on orders-of-magnitude larger datasets, e.g., it outperforms the state-of-the-art methods CLIP [32] and ALIGN [18] by absolute improvements of 11.9% and 4.9% on image retrieval R@1, respectively. On fine-tuned image-text retrieval, MAMO also outperforms other methods (e.g., ALBEF [23], METER [10], VLMo [3]) by a large margin. Quantitative and qualitative analyses on MAMO using Grad-CAM [35] further demonstrate its ability to perform more fine-grained vision-language interaction. Our contributions are summarized as follows:
Figure 1: Per-word Grad-CAM visualization of word-to-image attention (example caption: "a cat on the desk near a lamp"; ALBEF vs. MAMO for the words "cat", "lamp", and "desk"). Compared with ALBEF [23], MAMO focuses on the corresponding region for each word more precisely, indicating that more fine-grained interactions between words and image patches are built.
• We propose a jointly masked multimodal modeling method that integrates both implicit and explicit prediction targets to learn fine-grained multimodal representations.
• Our implicit target provides a unified and debiased objective for VLP, and our explicit target shows that high-level momentum visual features can serve as a better auxiliary target for masked images compared with low-level pixels or mid-level visual tokens.
• Qualitative and quantitative results across a broad range of downstream tasks show that our method learns fine-grained and transferable vision-language representations.
2 RELATED WORK
2.1 Vision-Language Representation Learning
Depending on how vision and language modalities interact, most previous VLP methods fall into two categories: shallow interaction and deep interaction. Shallow interaction methods [18, 32] use light-weight operations (e.g., dot product) for interaction, while deep interaction methods [6, 24, 25, 38, 40, 47] use deep networks (e.g., a transformer-based multimodal encoder) to fuse image and text features. ALBEF [23] combines the above two types of methods, learning both shallow and deep interactions in a single framework. To train such interaction networks, previous VLP methods often employ contrastive learning [23] or image-text matching [24] as pre-training tasks, which excel at learning global-level image-text alignment but lack effective modeling of fine-grained interaction between image patches and word tokens. MAMO inherits the architecture of ALBEF to combine shallow and deep interactions, and further introduces a new masked multimodal modeling method to enforce fine-grained multimodal interaction.
2.2 Masked Signal Modeling
Masked signal modeling has been actively explored in NLP and
CV separately. The prediction target for masked signals to recover
is one of the main differences among previous works. In NLP, the word
token is the most commonly used prediction target. In CV, various targets have been explored, e.g., raw pixels [16, 43], HOG features [42], visual tokens [2] from a pre-trained dVAE [33], visual tokens from a momentum model [48], etc.
In VLP, conditional MLM is a commonly used pre-training objective, which predicts masked text tokens based on unmasked images. Some works perform MIM in VLP. UNITER [6] applies masking on pre-extracted regional features and lets the model predict the class labels or features of those regions. OSCAR [25] inputs additional object tags into the model and adds a mask-and-predict task on these object tags. However, these methods require a pre-trained model, e.g., an object detector like Faster R-CNN [34], to extract the prediction target, causing domain bias and error propagation. Most recently, inspired by MAE [16], several concurrent works, e.g., M3AE [12], VLC [15] and MaskVLM [22], transfer the pixel reconstruction task into VLP by simply adding low-level pixel reconstruction tasks on top of VLP models. Besides, some methods, e.g., FLAVA [37] and BEiT-3 [41], explore a different mid-level discrete visual token prediction task on the vision modality. These VLP methods, however, neglect the semantic gap between low-level pixels (or mid-level visual tokens) and high-level multimodal representations, which can disturb the learning of semantically rich representations. Our MAMO unifies masked multimodal modeling of both vision and language as predicting high-level latent representations and features/concepts, thus avoiding the potential negative impacts brought by the semantic gaps.
2.3 Self-Distillation
Self-distillation methods [4, 14] attempt to utilize historical knowledge to drive the model to learn from itself. BYOL [14] proposes a self-supervised image representation learning approach that iteratively bootstraps the outputs of a network to serve as targets for enhanced representations. Some methods [1, 17, 48] combine this learning strategy with masked modeling for separate modalities. ALBEF [23] uses a momentum-distillation strategy to generate pseudo logits, which helps the model learn from noisy web data. The implicit prediction target of MAMO draws inspiration from these methods, but our work differs in that we operate on multimodal inputs and representations with the help of mask-and-predict tasks, and further enhance the representations with the aid of explicit and semantically meaningful targets.
3 METHOD
3.1 Model Architecture
As illustrated in Fig. 2, our model includes an image encoder, a text encoder, and a multimodal fusion encoder. The image encoder is a pre-trained visual transformer, ViT-B/16 [9]. The text encoder and the multimodal fusion encoder are both transformer-based, initialized with the first 6 layers and the last 6 layers of BERT$_{base}$ [20], respectively. The image encoder encodes an input image $I$ into a sequence of visual features $\{v_{cls}, v_1, v_2, \ldots, v_N\}$, where $v_{cls}$ represents the global feature on the [CLS] token and the others correspond to each visible image patch. In particular, we append a shared, learnable vector (i.e., a mask token) to those visual features to indicate the presence of a missing patch to be predicted, and add positional embeddings to all tokens in this full set. The text encoder transforms an input text $T$ into a sequence of linguistic token features $\{w_{cls}, w_1, w_2, \ldots, w_N\}$. The visual and linguistic tokens are concatenated and fed to the multimodal fusion encoder to generate fused multimodal representations.

Figure 2: Architecture of MAMO (image encoder over visible image patches, text encoder over language tokens, and a multimodal fusion module).
We pre-train our model with two categories of pre-training tasks: 1) masked multimodal modeling, which enables learning fine-grained multimodal interaction by way of mask-and-predict; 2) global-level image-text alignment, which aligns image and text from the perspective of their global consistency.
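For readers who prefer code, the following minimal sketch shows how the three encoders could be wired together. It uses generic PyTorch transformer blocks as stand-ins for ViT-B/16 and the two BERT halves, so the module choices, dimensions, and the omission of positional embeddings are simplifications, not the actual MAMO implementation.

```python
# Minimal architectural sketch (illustrative stand-ins, not the exact MAMO code).
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, dim=768, nhead=12):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(layer(), num_layers=12)  # stand-in for ViT-B/16
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=6)    # stand-in for first 6 BERT layers
        self.fusion_encoder = nn.TransformerEncoder(layer(), num_layers=6)  # stand-in for last 6 BERT layers
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))              # shared learnable mask token

    def forward(self, visible_patch_embeds, text_embeds, num_masked=0):
        v = self.image_encoder(visible_patch_embeds)   # features of [CLS] + visible patches
        if num_masked > 0:
            # one shared mask token per missing patch; positional embeddings for the
            # full set are omitted here for brevity
            v = torch.cat([v, self.mask_token.expand(v.size(0), num_masked, -1)], dim=1)
        w = self.text_encoder(text_embeds)             # {w_cls, w_1, ..., w_N}
        # concatenate visual and linguistic tokens and fuse them
        return self.fusion_encoder(torch.cat([v, w], dim=1))
```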
3.2 Masked Multimodal Modeling
Masked multimodal modeling is designed to learn fine-grained multimodal interaction for vision-language input. Given an image-text pair $(I, T)$, we create two masked views, $(\hat{I}, T)$ and $(I, \hat{T})$, by randomly masking a portion of the input, i.e., either removing some image patches or replacing some sub-words in the text with the [MASK] token. The masked views, $(\hat{I}, T)$ and $(I, \hat{T})$, are sent into our model, the online network $f$ parameterized by $\theta$, to get their multimodal representations, $f_\theta(\hat{I}, T)$ and $f_\theta(I, \hat{T})$. We then design an implicit (masked representation modeling) and two explicit (masked image/language modeling) prediction sub-tasks for those masked views.
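The construction of the two masked views can be sketched as follows; the mask ratios, the [MASK] token id, and the helper names are illustrative assumptions rather than values taken from the paper.

```python
# Illustrative construction of the two masked views (I-hat, T) and (I, T-hat).
import torch

MASK_TOKEN_ID = 103  # the usual [MASK] id in the BERT vocabulary (assumed here)

def mask_image(patch_embeds, mask_ratio=0.5):
    """Randomly drop a portion of image patches, keeping only the visible ones."""
    B, N, D = patch_embeds.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # indices of visible patches
    visible = torch.gather(patch_embeds, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                        # keep_idx records which positions survive

def mask_text(token_ids, mask_ratio=0.15):
    """Replace a random subset of sub-word tokens with [MASK] (special tokens ignored for brevity)."""
    masked = token_ids.clone()
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_ratio
    masked[mask] = MASK_TOKEN_ID
    return masked, mask

# Two masked views of one image-text pair:
#   view_1 = (mask_image(I), T)  ->  f_theta(I-hat, T)
#   view_2 = (I, mask_text(T))   ->  f_theta(I, T-hat)
```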
Masked Representation Modeling (MRM). MRM serves as an implicit, unified, and debiased prediction target for both vision and language. It requires predicting latent multimodal representations at each masked position under the guidance of a self-distillation teacher, which is referred to as the target network.

The target network has the same architecture as the online network, both defined as $f$, but uses a different set of parameters. We do not simply copy the online network to the target network, because the frequent change in the target network makes the learning process diverge. To acquire a smoothly evolving target, we utilize a momentum target network [14] whose parameters $\bar{\theta}$ are updated by an exponential moving average (EMA) of the online parameters $\theta$: $\bar{\theta} \leftarrow \alpha \bar{\theta} + (1 - \alpha)\theta$. We stop gradient propagation in the target network.
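A minimal sketch of the momentum target network and the MRM-style latent regression is given below, assuming a generic PyTorch setup; the EMA coefficient, the smooth-L1 regression loss, and the helper names are illustrative choices, not the paper's exact settings.

```python
# Sketch of the EMA target-network update and an MRM-style latent regression.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(online_net, target_net, alpha=0.995):
    """theta_bar <- alpha * theta_bar + (1 - alpha) * theta; no gradients flow here."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.data.mul_(alpha).add_(p_online.data, alpha=1 - alpha)

def mrm_loss(online_out, target_out, masked_positions):
    """Regress the target network's latent representations at the masked positions only."""
    pred = online_out[masked_positions]             # online net sees the masked view
    target = target_out[masked_positions].detach()  # target net saw the unmasked view; stop-gradient
    return F.smooth_l1_loss(pred, target)

# Typical usage per training step (illustrative):
#   target_net = copy.deepcopy(online_net)          # initialize once, before training
#   loss = mrm_loss(online_net(masked_view), target_net(full_view), mask)
#   loss.backward(); optimizer.step()
#   ema_update(online_net, target_net)              # then refresh the momentum teacher
```

The stop-gradient on the target output and the slow EMA update together keep the teacher from chasing the student too quickly, which is what stabilizes the bootstrapped latent target.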