
MAMO: Fine-Grained Vision-Language Representations Learning
with Masked Multimodal Modeling
Zijia Zhao1,3∗†, Longteng Guo2∗, Xingjian He1, Shuai Shao2, Zehuan Yuan2, Jing Liu1,3‡
1Laboratory of Cognition and Decision Intelligence for Complex Systems,
Institute of Automation, Chinese Academy of Sciences
2Bytedance Inc. 3School of Artificial Intelligence, University of Chinese Academy of Sciences
zhaozijia2021@ia.ac.cn, {xingjian.he,jliu}@nlpr.ia.ac.cn, {guolongteng.lt,shaoshuai.0516,yuanzehuan}@bytedance.com
ABSTRACT
Multimodal representation learning has shown promising improvements on various vision-language tasks (e.g., image-text retrieval, visual question answering, etc.) and has significantly advanced the development of multimedia information systems. Most existing methods excel at building global-level alignment between vision and language while lacking effective fine-grained image-text interaction. In this paper, we propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations. Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover. The implicit target provides a unified and debiased objective for vision and language, where the model predicts latent multimodal representations of the unmasked input. The explicit target further enriches the multimodal representations by recovering high-level and semantically meaningful information: momentum visual features of image patches and concepts of word tokens. Through such a masked modeling process, our model not only learns fine-grained multimodal interaction, but also avoids the semantic gap between high-level representations and low- or mid-level prediction targets (e.g., image pixels, discrete vision tokens), thus producing semantically rich multimodal representations that perform well in both zero-shot and fine-tuned settings. Our pre-trained model (named MAMO) achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
CCS CONCEPTS
• Information systems → Multimedia information systems; Multimedia and multimodal retrieval.
*Equal contribution.
†This work was performed while Zijia worked as an intern at ByteDance.
‡Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGIR ’23, July 23–27, 2023, Taipei, Taiwan
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9408-6/23/07. . . $15.00
https://doi.org/10.1145/3539618.3591721
KEYWORDS
vision-language pretraining, masked modeling, image-text retrieval,
visual question answering
ACM Reference Format:
Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing
Liu. 2023. MAMO: Fine-Grained Vision-Language Representations Learning
with Masked Multimodal Modeling. In Proceedings of the 46th International
ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA,
11 pages. https://doi.org/10.1145/3539618.3591721
1 INTRODUCTION
Vision-Language Pre-training (VLP) is an emerging research topic in multimedia information systems. It aims to learn the interaction between image and text and produce semantically rich multimodal representations that transfer well to downstream Vision-and-Language (V+L) tasks, including image-text retrieval, visual question answering, etc. In order to learn cross-modal interaction, most existing VLP methods [6, 18, 23, 25, 32, 38] rely on the consistency between the global views of image and text, using pre-training objectives like image-text contrastive loss [23] and image-text matching loss [38]. While effective, such global interaction fails to model the subtle local association within image-text pairs. The fine-grained interaction between image patches and word tokens is therefore lacking.
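To make these global-alignment objectives concrete, the following is a minimal, illustrative sketch in PyTorch of an InfoNCE-style image-text contrastive loss and a binary image-text matching loss. The tensor shapes, the temperature value, and the itm_head classifier are assumptions made for this example, not the exact formulation of any of the cited methods.

```python
# Illustrative sketch of the two global-alignment objectives discussed above:
# a symmetric InfoNCE image-text contrastive loss and a binary image-text
# matching loss. Shapes and the fused-feature input are assumptions.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over global image/text embeddings.

    image_feats, text_feats: (batch, dim) global features of paired images and
    captions; the i-th image matches the i-th text.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2


def image_text_matching_loss(fused_cls, labels, itm_head):
    """Binary matched / not-matched classification on fused multimodal features.

    fused_cls: (batch, dim) cross-modal encoder output for (image, text) pairs.
    labels:    (batch,) with 1 for true pairs and 0 for mined negative pairs.
    """
    logits = itm_head(fused_cls)                           # (batch, 2)
    return F.cross_entropy(logits, labels)


# Toy usage with random features, only to show the expected shapes.
if __name__ == "__main__":
    B, D = 8, 256
    itc = image_text_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
    itm = image_text_matching_loss(torch.randn(B, D),
                                   torch.randint(0, 2, (B,)),
                                   torch.nn.Linear(D, 2))
    print(itc.item(), itm.item())
```

Both losses operate only on global, sentence- and image-level features, which is precisely why they cannot by themselves capture patch-to-token associations.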
Masked signal modeling is an effective self-supervised pre-training task that masks a portion of the input signals and predicts the masked signals from the visible ones. It has been actively explored in natural language processing (NLP) and computer vision (CV) separately, and has brought powerful generalization performance across diverse downstream tasks. For example, BERT [20] formulates masked language modeling (MLM) to predict masked linguistic tokens, while MAE [16] and BEiT [2] formulate masked image modeling (MIM) to reconstruct raw pixels and dVAE visual tokens [33] of image patches, respectively. In the domain of VLP, however, there is a lack of a jointly masked signal modeling method for both the vision and language modalities. Although previous VLP methods [6, 23] have adopted conditional MLM to predict masked words given the unmasked image and the other words, masking of the image side has not been fully explored. As a result, the images' internal structures and their interactions with text tokens are not sufficiently learned, as shown in Fig. 1.
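For illustration, the sketch below shows one way joint masking of both modalities could look in PyTorch: randomly hiding a subset of image patch embeddings and text token ids before they are fed to the model. The mask ratios, the learnable mask embedding, and the function name are hypothetical choices for this example, not the exact recipe of MAMO or of the cited MLM/MIM methods.

```python
# A minimal sketch of jointly masking both modalities: random masking applied to
# image patch embeddings and text token ids at the same time. All hyperparameters
# here are illustrative assumptions.
import torch


def joint_mask(patch_embeds, token_ids, mask_embed, mask_token_id,
               image_mask_ratio=0.5, text_mask_ratio=0.15):
    """Randomly masks image patches and text tokens.

    patch_embeds: (batch, num_patches, dim) patch embeddings from a vision encoder.
    token_ids:    (batch, seq_len) input ids for a text encoder.
    mask_embed:   (dim,) learnable embedding that replaces masked patches.
    Returns the masked inputs plus boolean masks marking which positions were hidden.
    """
    B, N, D = patch_embeds.shape
    img_mask = torch.rand(B, N, device=patch_embeds.device) < image_mask_ratio
    masked_patches = torch.where(img_mask.unsqueeze(-1),
                                 mask_embed.expand(B, N, D),
                                 patch_embeds)

    txt_mask = torch.rand_like(token_ids, dtype=torch.float) < text_mask_ratio
    masked_tokens = torch.where(txt_mask,
                                torch.full_like(token_ids, mask_token_id),
                                token_ids)
    return masked_patches, img_mask, masked_tokens, txt_mask
```

The masked positions returned here are the ones a jointly masked model would later be asked to recover, e.g., by predicting latent representations of the unmasked input (the implicit target) or higher-level features and word concepts (the explicit target) described in the abstract.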
The challenge of designing a jointly masked signal modeling method for VLP lies in the natural differences between image and text modalities: image is continuous, low-level, highly redundant