MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji1* Junjie Wang2* Yuan Gong1 Lin Zhang3 Yanru Zhu1
Hongfa Wang4 Jiaxing Zhang3 Tetsuya Sakai2† Yujiu Yang1†
1Tsinghua University 2Waseda University 3IDEA 4Tencent TEG
{jyt21, gong-y21, zhuyr20}@mails.tsinghua.edu.cn yang.yujiu@sz.tsinghua.edu.cn
wjj1020181822@toki.waseda.jp tetsuyasakai@acm.org
{zhanglin, zhangjiaxing}@idea.edu.cn hongfawang@tencent.com
Abstract
Multimodal semantic understanding often has to deal with uncertainty, which means that the obtained messages tend to refer to multiple targets. Such uncertainty, both inter-modal and intra-modal, is problematic for interpretation. Little effort has been devoted to modeling this uncertainty, particularly during pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared with existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
1. Introduction
Precise understanding is a fundamental ability of human intelligence, whether it involves localizing objects among similar semantics or finding correspondences across multiple modalities. Our artificial models are supposed to do the same, pinpointing exact concepts in rich multimodal semantic scenarios. However, this kind of precise understanding is challenging. Information from different modalities can enrich each other's semantics, but the resulting ambiguity and noise are also greater than with a single modality.
*Equal contribution. †Corresponding author.
[Figure 1: (a) vision uncertainty — an image region (a.0) containing a billboard (a.1), zebras (a.2) and snowy mountains (a.3); (b) language uncertainty — three captions of the same lunch scene; (c) vision-to-language and (d) language-to-vision uncertainty — one scene described by several captions; (e) the language example in (b) modeled as point representations vs. distribution representations (avocado sandwich / food / meal).]
Figure 1. Multimodal uncertainties and an example of language uncertainty (b) modeled as point representations and as distribution representations. The images and text are from MSCOCO [30].
Multimodal representation learning methods hold the promise of promoting the desired precise understanding across different modalities [13]. While these methods have shown promising results, current methods face the challenge of uncertainty [7,51], both within a modality and between modalities. Consider image (a.0) in Fig. 1 as an example: one vision region includes multiple objects, such as a billboard and several zebras, so it is unclear which object is meant when this region is mentioned. In the language domain, the complex relationships between words, such as synonymy and hyponymy, lead to uncertainty. In Fig. 1 (c)&(d), the same object often has different descriptions across modalities, such as text and images, which manifests inter-modal uncertainty. However, previous methods often neglect this uncertainty [11,19,46], resulting in a limited ability to understand complicated concept hierarchies and poor prediction diversity. Therefore, it is desirable to model such uncertainty.
Moreover, with multimodal datasets becoming more commonplace, there is a flourishing trend of building pre-training models, particularly Vision-Language Pre-training (VLP) models, to support downstream applications [6,18,23,36,50]. Existing deterministic representations, however, often fail to capture uncertainty in pre-training data, as they can only express positions in semantic space and measure the relationships between targets with certainty, e.g., via Euclidean distance. How can we efficiently model uncertainty across modalities when dealing with pre-training models?
Applying Gaussian distributions is one of the prominent approaches to modeling uncertainty in the representation space [40,45,51,54]. In these methods, however, the obtained uncertainty depends on individual features rather than considering all features together, which ignores the inner connections between features. To exploit these connections, we model them implicitly when formulating the uncertainty with a module called the Probability Distribution Encoder (PDE). Inspired by the self-attention mechanism [44], we further add interactions between text tokens and image patches when constructing our distribution representations to capture more information. In Fig. 1 (e), we provide an example of two different types of representations describing language uncertainty, where the distribution representations express richer semantic relationships than the conventional point representations. The distribution variance measures the uncertainty of the corresponding text. As a byproduct, distribution representations enable diverse generation, providing multiple reasonable predictions through random sampling.
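As a minimal illustration of this sampling property (not the paper's implementation), the following PyTorch sketch draws several point representations from one diagonal Gaussian representation via the reparameterization trick; the dimensionality and parameter values are arbitrary.

```python
import torch

def sample_from_gaussian_rep(mu: torch.Tensor, log_var: torch.Tensor, n_samples: int = 5) -> torch.Tensor:
    """Draw n_samples point representations from N(mu, diag(exp(log_var)))."""
    std = torch.exp(0.5 * log_var)                   # sigma from log-variance (a common stability choice)
    eps = torch.randn(n_samples, mu.shape[-1])       # independent standard-normal noise
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)  # reparameterization: each row is one plausible point

# Toy 4-dimensional representation of "meal"; each sample could decode to a
# different but reasonable prediction downstream.
mu = torch.zeros(4)
log_var = torch.full((4,), -1.0)
print(sample_from_gaussian_rep(mu, log_var).shape)   # torch.Size([5, 4])
```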
In this paper, we integrate this uncertainty modeling into the pre-training framework, resulting in three new pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). All of these tasks deal with cross-modality alignment. More specifically, D-VLC handles coarse-grained cross-modal alignment, measuring whole distributions to align representations from different domains. D-MLM and D-ITM are applied after the fine-grained interaction between modalities, providing token-level and overall-level alignment for images and text.
Our contributions are summarized as follows:
1) We focus on the semantic uncertainty of multimodal understanding and propose a new module, called the Probability Distribution Encoder, to frame the uncertainty in multimodal representations as Gaussian distributions.
2) We develop three uncertainty-aware pre-training tasks to deal with large-scale unlabeled datasets: the D-VLC, D-MLM, and D-ITM tasks. To the best of our knowledge, this is the first attempt to harness probability distributions of representations in VLP.
3) We wrap the proposed pre-training tasks into an end-to-end Multimodal uncertainty-Aware vision-language Pre-training model, called MAP, for downstream tasks. Experiments show that MAP achieves State-of-The-Art (SoTA) performance. Our code is available at https://github.com/IIGROUP/MAP.
2. Related Works
2.1. Probability Distribution Representations
Current popular representation learning methods extract features as point representations and focus on searching for the position closest to the ground truth in a high-level representation space. However, there is usually more than one suitable point representation, which reflects the uncertainty present in many tasks. To address this problem, researchers have introduced probability distribution representations to enable diverse inference and improve robustness, avoiding overfitting to a single solution. In the Natural Language Processing (NLP) field, multivariate Gaussian distributions were used to represent words [45] because of their capability to represent asymmetric relations among words. Since then, different distribution families have been exploited for word representations [2,28]. In Computer Vision (CV), to model vision uncertainty, researchers have introduced Gaussian representations into specific tasks, such as face recognition [4], person re-identification [54], 3D skeleton action representation [40] and pose estimation [42]. To address the long-tail problem in relation prediction, Gaussian distributions have been used to model object relationships in scene graph generation [52]. Recently, constructing distributions has also made progress in yielding diverse predictions for cross-modal retrieval in the multimodal field [7]. However, these existing methods consider only the feature level when building distributions for a whole image or sentence. In this work, we model not only the whole image or sentence as distribution representations but also each of their tokens, such as patches and words. Furthermore, our approach learns multimodal uncertainty from sequence-level and feature-level interactions.
2.2. Vision-Language Pre-training (VLP)
Inspired by the Transformer architecture [44] and the pre-training tasks from BERT [8], vision-language pre-training tasks and models have recently been explored to learn multimodal representations. The main process is to first pre-train the models with auxiliary tasks that mine hidden supervision from large-scale unlabeled data. The pre-trained models then embed real-world objects into multimodal representations.
[Figure 2: the language and vision feature encoders each feed a PDE that maps point representations to distribution representations (mean µ and variance σ², i.e., X ~ N(µ, σ²)); the unimodal distributions are aligned with the D-VLC contrastive loss; an N_L-layer cross-modal transformer (self-attention, cross-attention, and feed-forward per branch) then feeds a final PDE whose image and text [CLS] distributions are used for D-ITM and D-MLM.]
Figure 2. Pre-training model architecture and objectives of MAP. We propose PDE to model distribution representations as multivariate Gaussian Distributions (GD). "N_L" indicates the number of layers of the cross-modal transformer. A two-dimensional Gaussian distribution is shown as an example.
With effective universal multimodal representations, they can achieve good performance by fine-tuning on relatively small labeled datasets for VL downstream tasks. The key challenge of VLP is to design suitable pre-training objectives. Mainstream strategies include Masked Language Modeling (MLM) [15,16,20,23,26], Image-Text Matching (ITM) [15,16,23,26] and Vision-Language Contrastive learning (VLC) [18,20,26,36]. MLM in VLP requires the model to predict masked language tokens from the remaining language tokens and the vision context tokens. To capture the alignment between the language and vision contexts, ITM requires the model to judge whether inputs from different modalities match. VLC learns similarity from inter-modal information and aligns point representations of different modalities. However, those methods are designed only in the point representation space, without considering multimodal uncertainty. Therefore, we propose D-VLC, D-MLM and D-ITM to pre-train our model in the distribution representation space. The details are explained in Sec. 3.2.
3. Approaches
In this section, we introduce our proposed PDE and the
architecture of MAP (Sec. 3.1), and the overall structure is
described in Fig. 2. The details of our proposed distribution-
based pre-training tasks are presented in Sec. 3.2. In addition,
we further discuss the ethical considerations in Appendix C.
3.1. Model Overview
3.1.1 Probability Distribution Encoder (PDE).
The input features of the PDE come from the point representation spaces of the different modalities. To model multimodal uncertainty, we frame the input features as multivariate Gaussian distributions. Specifically, the PDE predicts a mean vector ($\mu$) and a variance vector ($\sigma^2$) for each input feature. The mean vector represents the center position of the distribution in probabilistic space, and the variance vector expresses the scope of the distribution in each dimension.
As shown in Fig. 3, we design the PDE so that modeling the mean and variance vectors involves both feature-level and sequence-level interactions. Specifically, a Feed-Forward layer handles feature-level interactions, and a Multi-Head (MH) operation is responsible for sequence-level interactions. In the MH operation, the input hidden states $H \in \mathbb{R}^{T \times D}$ are split into $k$ heads, where $T$ is the sequence length and $D$ is the hidden size. In each head, we split the features and send them to two paths ($\mu$ and $\sigma^2$). In each path, the input hidden states $H^{(i)} \in \mathbb{R}^{T \times D/2k}$ of the $i$-th head are projected to $Q^{(i)}$, $K^{(i)}$, $V^{(i)}$. As an example, the operation in the $\mu$ path is:

$$
\begin{aligned}
[Q^{(i)}_{\mu}, K^{(i)}_{\mu}, V^{(i)}_{\mu}] &= H^{(i)}_{\mu} W_{qkv}, \\
\mathrm{Head}^{(i)}_{\mu} &= \mathrm{Act}\big(Q^{(i)}_{\mu} {K^{(i)}_{\mu}}^{\top} / \sqrt{d_k}\big)\, V^{(i)}_{\mu}, \\
\mathrm{MH}_{\mu} &= \mathrm{concat}_{i \in [k]}\big(\mathrm{Head}^{(i)}_{\mu}\big)\, W_O,
\end{aligned}
\tag{1}
$$

where $d_k$ is set to $D/(2k)$. The weight $W_{qkv} \in \mathbb{R}^{d_k \times 3d_k}$ projects the inputs into the subspace of each head, and the weight $W_O \in \mathbb{R}^{k d_k \times D}$ projects the concatenation of the $k$ head outputs into the output space. "$\mathrm{Act}$" consists of an activation function and a normalization function that account for the sequence-level interaction. The $\sigma^2$ path is similar to the $\mu$ path. Since the input point representation correlates with the mean vector, an add (residual) operation is employed to learn the mean vector. The motivations behind these design choices are discussed in Sec. 4.3.2. After the PDE, each vision or language token is represented as a Gaussian distribution in a high-dimensional probabilistic space.
[Figure 3: each head splits the input features into a µ path and a σ² path; each path applies a Q/K/V interaction, concatenation, and a Feed-Forward layer, with a residual add on the µ path.]
Figure 3. The architecture of the Probability Distribution Encoder (PDE) block.
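To make the construction concrete, the following PyTorch sketch implements one PDE block along the lines of Eq. (1). It is a minimal illustration, not the released implementation: the head count, the use of softmax as the "Act" function, predicting a log-variance instead of $\sigma^2$ directly, and applying $W_{qkv}$ jointly over all heads are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDE(nn.Module):
    """Minimal sketch of the Probability Distribution Encoder (Eq. 1)."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 8):
        super().__init__()
        assert hidden_size % (2 * num_heads) == 0
        self.k = num_heads
        self.dk = hidden_size // (2 * num_heads)                 # d_k = D / (2k)
        # W_qkv per path, implemented jointly over all heads for brevity.
        self.qkv_mu = nn.Linear(hidden_size // 2, 3 * (hidden_size // 2), bias=False)
        self.qkv_var = nn.Linear(hidden_size // 2, 3 * (hidden_size // 2), bias=False)
        self.out_mu = nn.Linear(hidden_size // 2, hidden_size)   # W_O
        self.out_var = nn.Linear(hidden_size // 2, hidden_size)
        self.ffn_mu = nn.Linear(hidden_size, hidden_size)        # feature-level interaction
        self.ffn_var = nn.Linear(hidden_size, hidden_size)

    def _path(self, h, qkv, out):
        B, T, D2 = h.shape                                       # D2 = D / 2
        q, k, v = qkv(h).chunk(3, dim=-1)
        # Split into k heads: [B, k, T, d_k]
        q, k, v = (x.view(B, T, self.k, self.dk).transpose(1, 2) for x in (q, k, v))
        # "Act" over the sequence dimension: softmax as activation + normalization (assumption).
        attn = F.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(B, T, D2)     # concat_{i in [k]} Head^(i)
        return out(heads)                                        # project with W_O

    def forward(self, hidden_states: torch.Tensor):
        """hidden_states: [B, T, D] point reps -> (mu, log_var), each [B, T, D]."""
        h_mu, h_var = hidden_states.chunk(2, dim=-1)             # half of the features per path
        mu = self.ffn_mu(self._path(h_mu, self.qkv_mu, self.out_mu)) + hidden_states  # residual add
        log_var = self.ffn_var(self._path(h_var, self.qkv_var, self.out_var))
        return mu, log_var

# mu, log_var = PDE()(torch.randn(2, 16, 768))
```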
3.1.2 Feature Extraction.
To extract features, we utilize an image encoder and a language encoder. In detail, we employ CLIP-ViT [36] as the image encoder and RoBERTa-Base [31] as the language encoder. An image is encoded as a patch feature sequence $\{v_{[CLS]}, v_1, \ldots, v_N\}$, where $v_{[CLS]}$ is the overall feature. Similarly, the input text is embedded into a sequence of tokens $\{w_{[CLS]}, w_1, \ldots, w_M\}$.
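As a concrete (hypothetical) illustration of this step, the snippet below loads two unimodal encoders with the Hugging Face transformers library; the specific checkpoints ("openai/clip-vit-base-patch16", "roberta-base") and the random pixel tensor are placeholders, since the paper only names CLIP-ViT and RoBERTa-Base.

```python
import torch
from transformers import CLIPVisionModel, RobertaModel, RobertaTokenizer

# Hypothetical checkpoints; the paper does not specify these exact model IDs.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

text = "A woman poses with an avocado sandwich at an outdoor restaurant."
text_inputs = tokenizer(text, return_tensors="pt")
text_feats = text_encoder(**text_inputs).last_hidden_state   # {w_[CLS], w_1, ..., w_M}: [1, M+1, 768]

pixel_values = torch.randn(1, 3, 224, 224)                   # stand-in for a preprocessed image
image_feats = vision_encoder(pixel_values=pixel_values).last_hidden_state
# {v_[CLS], v_1, ..., v_N}: [1, N+1, 768]; the first token is the overall [CLS] feature
```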
3.1.3 Cross-modal Transformer.
There are two main types of multimodal transformers for fusing different modalities: single-stream [6,39,56] and dual-stream [26,32,43] models. In our method, the image patch sequences are much longer than the text sequences, so vision features would dominate the attention scores if both modalities were attended to jointly. To address this issue, we choose a dual-stream module with two transformer branches, in which self-attention scores are calculated separately for each modality.
As shown in Fig. 2, the main structure consists of $N_L$ layers of cross-modal encoders. Each encoder mainly consists of two Self-Attention (SA) blocks and two Cross-Attention (CA) blocks. In the SA block of each modality, the query, key and value vectors are all linearly projected from the vision or language features. In the vision-to-language cross-attention block of the $i$-th layer, the query vectors come from the language features $T'_i$ after the self-attention block, and the key/value vectors come from the vision features $I'_i$. By employing the Multi-Head Attention (MHA) operation, this CA block enables language features to learn visual information across modalities. The language-to-vision CA block is defined symmetrically. The workflow of the $i$-th encoder layer with SA and CA is as follows:

$$
\begin{aligned}
\mathrm{SA}_{\mathrm{vision}}:\; & I'_i = \mathrm{MHA}(I_{i-1}, I_{i-1}, I_{i-1}), \\
\mathrm{SA}_{\mathrm{language}}:\; & T'_i = \mathrm{MHA}(T_{i-1}, T_{i-1}, T_{i-1}), \\
\mathrm{CA}_{\mathrm{vision}}:\; & I_i = \mathrm{MHA}(I'_i, T'_i, T'_i), \\
\mathrm{CA}_{\mathrm{language}}:\; & T_i = \mathrm{MHA}(T'_i, I'_i, I'_i).
\end{aligned}
\tag{2}
$$
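A minimal PyTorch sketch of one such dual-stream encoder layer is given below; the head count, feed-forward width, and the placement of residual connections and normalization are assumptions not specified by Eq. (2).

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Minimal sketch of one dual-stream cross-modal encoder layer (Eq. 2)."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.sa_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sa_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ca_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # vision queries text
        self.ca_t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # text queries vision
        self.ffn_v = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
        self.ffn_t = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # Self-attention within each modality: I'_i, T'_i
        img_sa, _ = self.sa_v(img, img, img)
        txt_sa, _ = self.sa_t(txt, txt, txt)
        # Cross-attention: each branch queries the other modality: I_i, T_i
        img_out, _ = self.ca_v(img_sa, txt_sa, txt_sa)
        txt_out, _ = self.ca_t(txt_sa, img_sa, img_sa)
        # Feed-forward (LayerNorm and exact residual placement omitted / assumed)
        return img + self.ffn_v(img_out), txt + self.ffn_t(txt_out)

# img_feats, txt_feats = CrossModalLayer()(torch.randn(2, 197, 768), torch.randn(2, 32, 768))
```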
For the overall structure of MAP, we apply PDEs after the feature extractors and after the cross-modal transformer, respectively. The PDEs after the feature extractors learn unimodal distribution representations for the D-VLC pre-training task. The PDE at the end of MAP is responsible for D-MLM, D-ITM and the downstream tasks.
3.2. Distribution-based Pre-Training Tasks
In order to learn the multimodal uncertainty present in common data, we pre-train our model with distribution-based pre-training tasks on large-scale datasets.
3.2.1 Coarse-grained Pre-training.
We propose Distribution-based Vision-Language Contrastive Learning (D-VLC) to realize coarse-grained semantic alignment of the overall unimodal distribution representations before fusion. We compute the 2-Wasserstein distance [21,22,33] to measure the distance between multivariate Gaussian distributions. For two Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, their 2-Wasserstein distance is defined as:

$$
D_{2W} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2}\Big).
\tag{3}
$$

In our modeled distributions, $\Sigma_1$ and $\Sigma_2$ are both diagonal matrices, which implies $\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} = \Sigma_1 \Sigma_2$. The above formula can therefore be rewritten as:

$$
D_{2W} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\Big(\big(\Sigma_1^{1/2} - \Sigma_2^{1/2}\big)^2\Big)
       = \|\mu_1 - \mu_2\|_2^2 + \|\sigma_1 - \sigma_2\|_2^2,
\tag{4}
$$

where $\sigma$ refers to a standard deviation vector. The overall unimodal features are the distribution representations of $[CLS]$ from the PDEs that follow the single-modal feature extractors. The similarity between an image and a text is given by:

$$
s(I, T) = a \cdot D_{2W}(v_{[CLS]}, w_{[CLS]}) + b,
\tag{5}
$$

where $a$ is a negative scale factor, since similarity is inversely proportional to distance, and $b$ is a shift value. For $N$ image-text pairs in a batch, there are $N$ positive matched samples and $N(N-1)$ negative samples. We use the InfoNCE loss as follows:

$$
\mathcal{L}^{I2T}_{NCE}(i) = -\log \frac{\exp\!\big(s(I_i, T_i)/\tau\big)}{\sum_{n=1}^{N} \exp\!\big(s(I_i, T_n)/\tau\big)}, \qquad
\mathcal{L}^{T2I}_{NCE}(i) = -\log \frac{\exp\!\big(s(T_i, I_i)/\tau\big)}{\sum_{n=1}^{N} \exp\!\big(s(T_i, I_n)/\tau\big)},
\tag{6}
$$

where $\tau$ is a learned temperature parameter. These two losses are summed to form the D-VLC loss $\mathcal{L}_{D\text{-}VLC}$.
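The sketch below shows how Eqs. (4)-(6) can be computed in a vectorized way for a batch of [CLS] distribution parameters; the toy values for $a$, $b$, and $\tau$ are illustrative only, since in practice they would be learned or tuned.

```python
import torch
import torch.nn.functional as F

def wasserstein2(mu1, sigma1, mu2, sigma2):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians (Eq. 4).

    mu*, sigma*: [B, D] mean and standard-deviation vectors of the [CLS] distributions.
    Returns a [B, B] matrix of pairwise distances (rows: images, columns: texts).
    """
    mu_term = ((mu1[:, None, :] - mu2[None, :, :]) ** 2).sum(-1)
    sigma_term = ((sigma1[:, None, :] - sigma2[None, :, :]) ** 2).sum(-1)
    return mu_term + sigma_term

def d_vlc_loss(mu_img, sigma_img, mu_txt, sigma_txt, a, b, tau):
    """Sketch of the D-VLC objective (Eqs. 5-6); the two InfoNCE terms are summed."""
    sim = a * wasserstein2(mu_img, sigma_img, mu_txt, sigma_txt) + b  # s(I, T), a < 0
    labels = torch.arange(sim.size(0), device=sim.device)             # i-th image matches i-th text
    loss_i2t = F.cross_entropy(sim / tau, labels)                     # L^{I2T}_{NCE}
    loss_t2i = F.cross_entropy(sim.t() / tau, labels)                 # L^{T2I}_{NCE}
    return loss_i2t + loss_t2i

# Toy usage with random [CLS] distribution parameters for a batch of 4 pairs.
B, D = 4, 768
mu_i, mu_t = torch.randn(B, D), torch.randn(B, D)
sig_i, sig_t = torch.rand(B, D), torch.rand(B, D)
print(d_vlc_loss(mu_i, sig_i, mu_t, sig_t, a=-1.0, b=0.0, tau=0.07))
```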