MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji1* Junjie Wang2* Yuan Gong1 Lin Zhang3 Yanru Zhu1
Hongfa Wang4 Jiaxing Zhang3 Tetsuya Sakai2† Yujiu Yang1†
1Tsinghua University 2Waseda University 3IDEA 4Tencent TEG
{jyt21, gong-y21, zhuyr20}@mails.tsinghua.edu.cn yang.yujiu@sz.tsinghua.edu.cn
wjj1020181822@toki.waseda.jp tetsuyasakai@acm.org
{zhanglin, zhangjiaxing}@idea.edu.cn hongfawang@tencent.com
Abstract
Multimodal semantic understanding often has to deal with uncertainty, which means that the obtained messages tend to refer to multiple targets. Such uncertainty, both inter-modal and intra-modal, is problematic for interpretation. Little effort has been devoted to modeling this uncertainty, particularly during pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared with existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
1. Introduction
Precise understanding is a fundamental ability of human intelligence, whether it involves localizing objects among similar semantics or finding correspondences across multiple modalities. Our artificial models are supposed to do the same, pinpointing exact concepts in rich multimodal semantic scenarios. However, this kind of precise understanding is challenging. Information from different modalities can enrich each other's semantics, but the resulting ambiguity and noise are also greater than with a single modality.
*Equal contribution. †Corresponding author.
[Figure 1: (a) vision uncertainty — an image region (a.0) containing a billboard (a.1), zebras (a.2) and snowy mountains (a.3); (b) language uncertainty — three captions of the same lunch scene; (c) vision-to-language and (d) language-to-vision uncertainty — one scene described by several captions; (e) the language example in (b) modeled as point representations vs. distribution representations (avocado sandwich / food / meal).]
Figure 1. Multimodal uncertainties and an example of language uncertainty (b) modeled as point representations and as distribution representations. The images and text are from MSCOCO [30].
Multimodal representation learning methods hold the promise of promoting the desired precise understanding across different modalities [13]. While these methods have shown promising results, current methods face the challenge of uncertainty [7,51], both within a modality and between modalities. Consider image (a.0) in Fig. 1 as an example: one vision region includes multiple objects, such as a billboard and several zebras, so it is unclear which object is meant when this region is mentioned. In the language domain, the complex relationships between words, such as synonymy and hyponymy, lead to uncertainty. In Fig. 1 (c)&(d), the same object often has different descriptions across modalities, such as text and images, which manifests inter-modal uncertainty. However, previous methods often neglect this uncertainty [11,19,46], resulting in a limited ability to understand complicated concept hierarchies and poor prediction diversity. Therefore, it is desirable to model such uncertainty.
Moreover, with multimodal datasets becoming more commonplace, there is a flourishing trend of building pre-training models, particularly Vision-Language Pre-training (VLP) models, to support downstream applications [6,18,23,36,50]. Existing deterministic representations, however, often fail to capture uncertainty in pre-training data, as they can only express positions in semantic space and measure the relationships between targets with certainty, e.g., via Euclidean distance. How can we efficiently model uncertainty across modalities when dealing with pre-training models?
Applying Gaussian distributions is one of the prominent approaches to modeling uncertainty in the representation space [40,45,51,54]. In these methods, however, the obtained uncertainty depends on individual features rather than considering all features together, which ignores the inner connections between features. To exploit these connections, we model them implicitly when formulating the uncertainty with a module called the Probability Distribution Encoder (PDE). Inspired by the self-attention mechanism [44], we further add interactions between text tokens and image patches when constructing our distribution representations to capture more information. In Fig. 1 (e), we provide an example of two different types of representations describing language uncertainty, where the distribution representations express richer semantic relationships than the conventional point representations. The distribution variance measures the uncertainty of the corresponding text. As a byproduct, distribution representations enable diverse generation, providing multiple reasonable predictions through random sampling.
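As a minimal illustration of this sampling property (not the paper's implementation), the following PyTorch sketch draws several point representations from one diagonal Gaussian representation via the reparameterization trick; the dimensionality and parameter values are arbitrary.

```python
import torch

def sample_from_gaussian_rep(mu: torch.Tensor, log_var: torch.Tensor, n_samples: int = 5) -> torch.Tensor:
    """Draw n_samples point representations from N(mu, diag(exp(log_var)))."""
    std = torch.exp(0.5 * log_var)                   # sigma from log-variance (a common stability choice)
    eps = torch.randn(n_samples, mu.shape[-1])       # independent standard-normal noise
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)  # reparameterization: each row is one plausible point

# Toy 4-dimensional representation of "meal"; each sample could decode to a
# different but reasonable prediction downstream.
mu = torch.zeros(4)
log_var = torch.full((4,), -1.0)
print(sample_from_gaussian_rep(mu, log_var).shape)   # torch.Size([5, 4])
```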
In this paper, we integrate this uncertainty modeling into the pre-training framework, resulting in three new pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). All of these tasks deal with cross-modality alignment. More specifically, D-VLC handles coarse-grained cross-modal alignment, measuring whole distributions to align representations from different domains. D-MLM and D-ITM are applied after the fine-grained interaction between modalities, providing token-level and overall-level alignment for images and text.
Our contributions are summarized as follows:
1) We focus on the semantic uncertainty of multimodal understanding and propose a new module, called the Probability Distribution Encoder, to frame the uncertainty in multimodal representations as Gaussian distributions.
2) We develop three uncertainty-aware pre-training tasks to deal with large-scale unlabeled datasets: the D-VLC, D-MLM, and D-ITM tasks. To the best of our knowledge, this is the first attempt to harness probability distributions of representations in VLP.
3) We wrap the proposed pre-training tasks into an end-to-end Multimodal uncertainty-Aware vision-language Pre-training model, called MAP, for downstream tasks. Experiments show that MAP achieves State-of-The-Art (SoTA) performance. Our code is available at https://github.com/IIGROUP/MAP.
2. Related Works
2.1. Probability Distribution Representations
Current popular representation learning methods extract features as point representations and focus on searching for the position closest to the ground truth in a high-level representation space. However, there is usually more than one suitable point representation, which reflects the uncertainty present in many tasks. To address this problem, researchers have introduced probability distribution representations to enable diverse inference and improve robustness, avoiding overfitting to a single solution. In the Natural Language Processing (NLP) field, multivariate Gaussian distributions were used to represent words [45] because of their capability to represent asymmetric relations among words. Since then, different distribution families have been exploited for word representations [2,28]. In Computer Vision (CV), to model vision uncertainty, researchers have introduced Gaussian representations into specific tasks, such as face recognition [4], person re-identification [54], 3D skeleton action representation [40] and pose estimation [42]. To address the long-tail problem in relation prediction, Gaussian distributions have been used to model object relationships in scene graph generation [52]. Recently, constructing distributions has also made progress in yielding diverse predictions for cross-modal retrieval in the multimodal field [7]. However, these existing methods consider only the feature level when building distributions for a whole image or sentence. In this work, we model not only the whole image or sentence as distribution representations but also each of their tokens, such as patches and words. Furthermore, our approach learns multimodal uncertainty from sequence-level and feature-level interactions.
2.2. Vision-Language Pre-training (VLP)
Inspired by the Transformer architecture [44] and the pre-training tasks from BERT [8], vision-language pre-training tasks and models have recently been explored to learn multimodal representations. The main process is to first pre-train the models with auxiliary tasks that mine hidden supervision from large-scale unlabeled data. The pre-trained models then embed real-world objects into multimodal representations.
[Figure 2: the language and vision feature encoders each feed a PDE that maps point representations to distribution representations (mean µ and variance σ², i.e., X ~ N(µ, σ²)); the unimodal distributions are aligned with the D-VLC contrastive loss; an N_L-layer cross-modal transformer (self-attention, cross-attention, and feed-forward per branch) then feeds a final PDE whose image and text [CLS] distributions are used for D-ITM and D-MLM.]
Figure 2. Pre-training model architecture and objectives of MAP. We propose PDE to model distribution representations as multivariate Gaussian Distributions (GD). "N_L" indicates the number of layers of the cross-modal transformer. A two-dimensional Gaussian distribution is shown as an example.
With effective universal multimodal representations, they can achieve good performance by fine-tuning on relatively small labeled datasets for VL downstream tasks. The key challenge of VLP is to design suitable pre-training objectives. Mainstream strategies include Masked Language Modeling (MLM) [15,16,20,23,26], Image-Text Matching (ITM) [15,16,23,26] and Vision-Language Contrastive learning (VLC) [18,20,26,36]. MLM in VLP requires the model to predict masked language tokens from the remaining language tokens and the vision context tokens. To capture the alignment between the language and vision contexts, ITM requires the model to judge whether inputs from different modalities match. VLC learns similarity from inter-modal information and aligns point representations of different modalities. However, those methods are designed only in the point representation space, without considering multimodal uncertainty. Therefore, we propose D-VLC, D-MLM and D-ITM to pre-train our model in the distribution representation space. The details are explained in Sec. 3.2.
3. Approaches
In this section, we introduce our proposed PDE and the
architecture of MAP (Sec. 3.1), and the overall structure is
described in Fig. 2. The details of our proposed distribution-
based pre-training tasks are presented in Sec. 3.2. In addition,
we further discuss the ethical considerations in Appendix C.
3.1. Model Overview
3.1.1 Probability Distribution Encoder (PDE).
The input features of the PDE come from the point representation spaces of the different modalities. To model multimodal uncertainty, we frame the input features as multivariate Gaussian distributions. Specifically, the PDE predicts a mean vector ($\mu$) and a variance vector ($\sigma^2$) for each input feature. The mean vector represents the center position of the distribution in probabilistic space, and the variance vector expresses the scope of the distribution in each dimension.
As shown in Fig. 3, we design the PDE so that modeling the mean and variance vectors involves both feature-level and sequence-level interactions. Specifically, a Feed-Forward layer handles feature-level interactions, and a Multi-Head (MH) operation is responsible for sequence-level interactions. In the MH operation, the input hidden states $H \in \mathbb{R}^{T \times D}$ are split into $k$ heads, where $T$ is the sequence length and $D$ is the hidden size. In each head, we split the features and send them to two paths ($\mu$ and $\sigma^2$). In each path, the input hidden states $H^{(i)} \in \mathbb{R}^{T \times D/2k}$ of the $i$-th head are projected to $Q^{(i)}$, $K^{(i)}$, $V^{(i)}$. As an example, the operation in the $\mu$ path is:

$$
\begin{aligned}
[Q^{(i)}_{\mu}, K^{(i)}_{\mu}, V^{(i)}_{\mu}] &= H^{(i)}_{\mu} W_{qkv}, \\
\mathrm{Head}^{(i)}_{\mu} &= \mathrm{Act}\big(Q^{(i)}_{\mu} {K^{(i)}_{\mu}}^{\top} / \sqrt{d_k}\big)\, V^{(i)}_{\mu}, \\
\mathrm{MH}_{\mu} &= \mathrm{concat}_{i \in [k]}\big(\mathrm{Head}^{(i)}_{\mu}\big)\, W_O,
\end{aligned}
\tag{1}
$$

where $d_k$ is set to $D/(2k)$. The weight $W_{qkv} \in \mathbb{R}^{d_k \times 3d_k}$ projects the inputs into the subspace of each head, and the weight $W_O \in \mathbb{R}^{k d_k \times D}$ projects the concatenation of the $k$ head outputs into the output space. "$\mathrm{Act}$" consists of an activation function and a normalization function that account for the sequence-level interaction. The $\sigma^2$ path is similar to the $\mu$ path. Since the input point representation correlates with the mean vector, an add (residual) operation is employed to learn the mean vector. The motivations behind these design choices are discussed in Sec. 4.3.2. After the PDE, each vision or language token is represented as a Gaussian distribution in a high-dimensional probabilistic space.
[Figure 3: each head splits the input features into a µ path and a σ² path; each path applies a Q/K/V interaction, concatenation, and a Feed-Forward layer, with a residual add on the µ path.]
Figure 3. The architecture of the Probability Distribution Encoder (PDE) block.
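To make the construction concrete, the following PyTorch sketch implements one PDE block along the lines of Eq. (1). It is a minimal illustration, not the released implementation: the head count, the use of softmax as the "Act" function, predicting a log-variance instead of $\sigma^2$ directly, and applying $W_{qkv}$ jointly over all heads are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDE(nn.Module):
    """Minimal sketch of the Probability Distribution Encoder (Eq. 1)."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 8):
        super().__init__()
        assert hidden_size % (2 * num_heads) == 0
        self.k = num_heads
        self.dk = hidden_size // (2 * num_heads)                 # d_k = D / (2k)
        # W_qkv per path, implemented jointly over all heads for brevity.
        self.qkv_mu = nn.Linear(hidden_size // 2, 3 * (hidden_size // 2), bias=False)
        self.qkv_var = nn.Linear(hidden_size // 2, 3 * (hidden_size // 2), bias=False)
        self.out_mu = nn.Linear(hidden_size // 2, hidden_size)   # W_O
        self.out_var = nn.Linear(hidden_size // 2, hidden_size)
        self.ffn_mu = nn.Linear(hidden_size, hidden_size)        # feature-level interaction
        self.ffn_var = nn.Linear(hidden_size, hidden_size)

    def _path(self, h, qkv, out):
        B, T, D2 = h.shape                                       # D2 = D / 2
        q, k, v = qkv(h).chunk(3, dim=-1)
        # Split into k heads: [B, k, T, d_k]
        q, k, v = (x.view(B, T, self.k, self.dk).transpose(1, 2) for x in (q, k, v))
        # "Act" over the sequence dimension: softmax as activation + normalization (assumption).
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(B, T, D2)     # concat_{i in [k]} Head^(i)
        return out(heads)                                        # project with W_O

    def forward(self, hidden_states: torch.Tensor):
        """hidden_states: [B, T, D] point reps -> (mu, log_var), each [B, T, D]."""
        h_mu, h_var = hidden_states.chunk(2, dim=-1)             # half of the features per path
        mu = self.ffn_mu(self._path(h_mu, self.qkv_mu, self.out_mu)) + hidden_states  # residual add
        log_var = self.ffn_var(self._path(h_var, self.qkv_var, self.out_var))
        return mu, log_var

# mu, log_var = PDE()(torch.randn(2, 16, 768))
```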
3.1.2 Feature Extraction.
To extract features, we utilize an image encoder and a language encoder. In detail, we employ CLIP-ViT [36] as the image encoder and RoBERTa-Base [31] as the language encoder. An image is encoded as a patch feature sequence $\{v_{[CLS]}, v_1, \ldots, v_N\}$, where $v_{[CLS]}$ is the overall feature. Similarly, the input text is embedded into a sequence of tokens $\{w_{[CLS]}, w_1, \ldots, w_M\}$.
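As a concrete (hypothetical) illustration of this step, the snippet below loads two unimodal encoders with the Hugging Face transformers library; the specific checkpoints ("openai/clip-vit-base-patch16", "roberta-base") and the random pixel tensor are placeholders, since the paper only names CLIP-ViT and RoBERTa-Base.

```python
import torch
from transformers import CLIPVisionModel, RobertaModel, RobertaTokenizer

# Hypothetical checkpoints; the paper does not specify these exact model IDs.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

text = "A woman poses with an avocado sandwich at an outdoor restaurant."
text_inputs = tokenizer(text, return_tensors="pt")
text_feats = text_encoder(**text_inputs).last_hidden_state   # {w_[CLS], w_1, ..., w_M}: [1, M+1, 768]

pixel_values = torch.randn(1, 3, 224, 224)                   # stand-in for a preprocessed image
image_feats = vision_encoder(pixel_values=pixel_values).last_hidden_state
# {v_[CLS], v_1, ..., v_N}: [1, N+1, 768]; the first token is the overall [CLS] feature
```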
3.1.3 Cross-modal Transformer.
There are two main types of multimodal transformers for fusing different modalities: single-stream [6,39,56] and dual-stream [26,32,43] models. In our method, the image patch sequences are much longer than the text sequences, so vision features would dominate the attention scores if both modalities were attended to jointly. To address this issue, we choose a dual-stream module with two transformer branches, in which self-attention scores are calculated separately for each modality.
As shown in Fig. 2, the main structure consists of $N_L$ layers of cross-modal encoders. Each encoder mainly consists of two Self-Attention (SA) blocks and two Cross-Attention (CA) blocks. In the SA block of each modality, the query, key and value vectors are all linearly projected from the vision or language features. In the vision-to-language cross-attention block of the $i$-th layer, the query vectors come from the language features $T'_i$ after the self-attention block, and the key/value vectors come from the vision features $I'_i$. By employing the Multi-Head Attention (MHA) operation, this CA block enables language features to learn visual information across modalities. The language-to-vision CA block is defined symmetrically. The workflow of the $i$-th encoder layer with SA and CA is as follows:

$$
\begin{aligned}
\mathrm{SA}_{\mathrm{vision}}:\; & I'_i = \mathrm{MHA}(I_{i-1}, I_{i-1}, I_{i-1}), \\
\mathrm{SA}_{\mathrm{language}}:\; & T'_i = \mathrm{MHA}(T_{i-1}, T_{i-1}, T_{i-1}), \\
\mathrm{CA}_{\mathrm{vision}}:\; & I_i = \mathrm{MHA}(I'_i, T'_i, T'_i), \\
\mathrm{CA}_{\mathrm{language}}:\; & T_i = \mathrm{MHA}(T'_i, I'_i, I'_i).
\end{aligned}
\tag{2}
$$
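A minimal PyTorch sketch of one such dual-stream encoder layer is given below; the head count, feed-forward width, and the placement of residual connections and normalization are assumptions not specified by Eq. (2).

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Minimal sketch of one dual-stream cross-modal encoder layer (Eq. 2)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.sa_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sa_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ca_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # vision queries text
        self.ca_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # text queries vision
        self.ffn_v = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
        self.ffn_t = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Self-attention within each modality: I'_i, T'_i
        img_sa, _ = self.sa_v(img, img, img)
        txt_sa, _ = self.sa_t(txt, txt, txt)
        # Cross-attention: each branch queries the other modality: I_i, T_i
        img_out, _ = self.ca_v(img_sa, txt_sa, txt_sa)
        txt_out, _ = self.ca_t(txt_sa, img_sa, img_sa)
        # Feed-forward (LayerNorm and exact residual placement omitted / assumed)
        return img + self.ffn_v(img_out), txt + self.ffn_t(txt_out)

# img_feats, txt_feats = CrossModalLayer()(torch.randn(2, 197, 768), torch.randn(2, 32, 768))
```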
For the overall structure of MAP, we apply PDEs after the feature extractors and after the cross-modal transformer, respectively. The PDEs after the feature extractors learn unimodal distribution representations for the D-VLC pre-training task. The PDE at the end of MAP is responsible for D-MLM, D-ITM and the downstream tasks.
3.2. Distribution-based Pre-Training Tasks
In order to learn the multimodal uncertainty present in common data, we pre-train our model with distribution-based pre-training tasks on large-scale datasets.
3.2.1 Coarse-grained Pre-training.
We propose Distribution-based Vision-Language Contrastive Learning (D-VLC) to realize coarse-grained semantic alignment of the overall unimodal distribution representations before fusion. We compute the 2-Wasserstein distance [21,22,33] to measure the distance between multivariate Gaussian distributions. For two Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, their 2-Wasserstein distance is defined as:

$$
D_{2W} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2}\Big).
\tag{3}
$$

In our modeled distributions, $\Sigma_1$ and $\Sigma_2$ are both diagonal matrices, which implies $\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} = \Sigma_1 \Sigma_2$. The above formula can therefore be rewritten as:

$$
D_{2W} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\Big(\big(\Sigma_1^{1/2} - \Sigma_2^{1/2}\big)^2\Big)
       = \|\mu_1 - \mu_2\|_2^2 + \|\sigma_1 - \sigma_2\|_2^2,
\tag{4}
$$

where $\sigma$ refers to a standard deviation vector. The overall unimodal features are the distribution representations of $[CLS]$ from the PDEs that follow the single-modal feature extractors. The similarity between an image and a text is given by:

$$
s(I, T) = a \cdot D_{2W}(v_{[CLS]}, w_{[CLS]}) + b,
\tag{5}
$$

where $a$ is a negative scale factor, since similarity is inversely proportional to distance, and $b$ is a shift value. For $N$ image-text pairs in a batch, there are $N$ positive matched samples and $N(N-1)$ negative samples. We use the InfoNCE loss as follows:

$$
\mathcal{L}^{I2T}_{NCE}(i) = -\log \frac{\exp\!\big(s(I_i, T_i)/\tau\big)}{\sum_{n=1}^{N} \exp\!\big(s(I_i, T_n)/\tau\big)}, \qquad
\mathcal{L}^{T2I}_{NCE}(i) = -\log \frac{\exp\!\big(s(T_i, I_i)/\tau\big)}{\sum_{n=1}^{N} \exp\!\big(s(T_i, I_n)/\tau\big)},
\tag{6}
$$

where $\tau$ is a learned temperature parameter. These two losses are summed to form the D-VLC loss $\mathcal{L}_{D\text{-}VLC}$.
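The sketch below shows how Eqs. (4)-(6) can be computed in a vectorized way for a batch of [CLS] distribution parameters; the toy values for $a$, $b$, and $\tau$ are illustrative only, since in practice they would be learned or tuned.

```python
import torch
import torch.nn.functional as F

def wasserstein2(mu1, sigma1, mu2, sigma2):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians (Eq. 4).

    mu*, sigma*: [B, D] mean and standard-deviation vectors of the [CLS] distributions.
    Returns a [B, B] matrix of pairwise distances (rows: images, columns: texts).
    """
    mu_term = ((mu1[:, None, :] - mu2[None, :, :]) ** 2).sum(-1)
    sigma_term = ((sigma1[:, None, :] - sigma2[None, :, :]) ** 2).sum(-1)
    return mu_term + sigma_term

def d_vlc_loss(mu_img, sigma_img, mu_txt, sigma_txt, a, b, tau):
    """Sketch of the D-VLC objective (Eqs. 5-6); the two InfoNCE terms are summed."""
    sim = a * wasserstein2(mu_img, sigma_img, mu_txt, sigma_txt) + b  # s(I, T), a < 0
    labels = torch.arange(sim.size(0), device=sim.device)             # i-th image matches i-th text
    loss_i2t = F.cross_entropy(sim / tau, labels)                     # L^{I2T}_{NCE}
    loss_t2i = F.cross_entropy(sim.t() / tau, labels)                 # L^{T2I}_{NCE}
    return loss_i2t + loss_t2i

# Toy usage with random [CLS] distribution parameters for a batch of 4 pairs.
B, D = 4, 768
mu_i, mu_t = torch.randn(B, D), torch.randn(B, D)
sig_i, sig_t = torch.rand(B, D), torch.rand(B, D)
print(d_vlc_loss(mu_i, sig_i, mu_t, sig_t, a=-1.0, b=0.0, tau=0.07))
```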