from different modalities, such as text and images, which
manifests inter-modal uncertainty. However, previous methods often neglect this uncertainty [11,19,46], resulting in a limited ability to understand complicated concept hierarchies and poor prediction diversity. It is therefore desirable to model such uncertainty.
Moreover, as multimodal datasets become more commonplace, there is a flourishing trend of building pre-training models, particularly Vision-Language Pre-training (VLP) models, to support downstream applications [6,18,23,36,50]. Existing deterministic representations, however, often fail to capture the uncertainty in pre-training data, as they can only express positions in semantic space and measure relationships between targets deterministically, e.g., with Euclidean distance. How can we efficiently model multimodal uncertainty when building pre-training models?
Applying Gaussian distributions is one of the prominent approaches for modeling uncertainty in the representation space [40,45,51,54]. In these methods, however, the obtained uncertainty depends on individual features in isolation rather than on all features jointly, which ignores the inner connections between features. To exploit these connections, we model them implicitly when formulating the uncertainty, using a module called Probability Distribution Encoder (PDE).
Inspired by the self-attention mechanism [44], we further add interaction between text tokens and image patches when constructing our distribution representations, so as to capture more information. Figure 1(e) shows an example of the two types of representations describing language uncertainty, where distribution representations express richer semantic relationships than conventional point representations. The variance of a distribution measures the uncertainty of the corresponding text. As a byproduct, distribution representations enable diverse generation, providing multiple reasonable predictions through random sampling.
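To make the idea concrete, the following is a minimal PyTorch-style sketch of how a module like PDE could map a sequence of text tokens and image patches to per-token Gaussian parameters and draw stochastic samples from them; the class name, attention configuration, and dimensions are illustrative assumptions rather than the exact architecture used in MAP.

import torch
import torch.nn as nn

class ProbabilityDistributionEncoder(nn.Module):
    # Illustrative sketch: maps token features to per-token Gaussian
    # parameters (mean, log-variance).
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Attention lets each token's uncertainty depend on all tokens jointly
        # (text tokens and image patches alike), not on its own feature alone.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_mu = nn.Linear(dim, dim)      # distribution mean
        self.to_logvar = nn.Linear(dim, dim)  # distribution log-variance

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim), e.g. concatenated text and patch features
        ctx, _ = self.attn(tokens, tokens, tokens)
        return self.to_mu(ctx), self.to_logvar(ctx)

def sample(mu, logvar):
    # Reparameterized draw: each sample is one plausible point representation,
    # which is what enables diverse predictions downstream.
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

In this sketch, the log-variance head plays the role of the uncertainty measure discussed above, and repeated calls to sample yield the diverse predictions.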
In this paper, we integrate this uncertainty modeling into the pre-training framework, resulting in three new pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). All three tasks address cross-modality alignment. Specifically, D-VLC handles coarse-grained cross-modal alignment, measuring whole distributions to align representations from different domains. D-MLM and D-ITM are applied after fine-grained interaction between modalities, providing token-level and holistic alignment between images and text.
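As a rough illustration of what a distribution-level contrastive objective such as D-VLC might look like, the sketch below scores image-text pairs with the squared 2-Wasserstein distance between diagonal Gaussians inside a symmetric InfoNCE loss; the choice of distance, the temperature, and the function names are assumptions made for exposition, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def wasserstein2_sq(mu1, logvar1, mu2, logvar2):
    # Squared 2-Wasserstein distance between diagonal Gaussians:
    # ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2
    s1, s2 = (0.5 * logvar1).exp(), (0.5 * logvar2).exp()
    return ((mu1 - mu2) ** 2).sum(-1) + ((s1 - s2) ** 2).sum(-1)

def d_vlc_loss(img_mu, img_logvar, txt_mu, txt_logvar, temperature=0.07):
    # Pairwise distribution-level similarity between every image Gaussian i
    # and text Gaussian j in the batch (a batch x batch matrix).
    dist = wasserstein2_sq(img_mu.unsqueeze(1), img_logvar.unsqueeze(1),
                           txt_mu.unsqueeze(0), txt_logvar.unsqueeze(0))
    logits = -dist / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric InfoNCE: matched image-text pairs lie on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))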
Our contributions are summarized as follows:
1) We focus on the semantic uncertainty of multimodal un-
derstanding and propose a new module, called Probability
Distribution Encoder, to frame the uncertainty in multimodal
representations as Gaussian distributions.
2) We develop three uncertainty-aware pre-training tasks to
deal with large-scale unlabeled datasets, including D-VLC,
D-MLM, and D-ITM tasks. To the best of our knowledge, this is the first attempt to harness probability distributions of representations in VLP.
3) We wrap the proposed pre-training tasks into an end-to-end Multimodal uncertainty-Aware vision-language Pre-training model, called MAP, for downstream tasks. Experiments show that MAP achieves State-of-The-Art (SoTA) performance. Our code
is available at https://github.com/IIGROUP/MAP.
2. Related Works
2.1. Probability Distribution Representations
Current popular representation learning methods extract features as point representations and focus on finding the position closest to the ground truth in a high-level representation space. However, there is usually more than one suitable point representation, which reflects the uncertainty inherent in many tasks. To address this problem, researchers have introduced probability distribution representations to enable diverse inference and improve robustness, preventing models from overfitting to a single solution. In the Natural Language Processing (NLP) field, multivariate Gaussian distributions were used to represent words [45] owing to their capability to capture asymmetric relations among words. Since then, different distribution families have been explored for word representations [2,28]. In Computer Vision (CV), Gaussian representations have been introduced to model visual uncertainty in specific tasks, such as face recognition [4], person re-identification [54], 3D skeleton action representation [40], and pose estimation [42]. To address the long-tail problem in relation prediction, Gaussian distributions were used to model object relationships in scene graph generation [52]. Recently, in the multimodal field, distribution-based representations have made progress in yielding diverse predictions for cross-modal retrieval [7]. However, these existing methods build distributions only at the feature level and only for a whole image or sentence. In this work, we build distribution representations not only for the whole image or sentence but also for each of their tokens, such as patches and words. Furthermore, our approach learns multimodal uncertainty from both sequence-level and feature-level interactions.
2.2. Vision-Language Pre-training (VLP)
Inspired by the Transformer architecture [44] and the pre-training tasks of BERT [8], a variety of vision-language pre-training tasks and models have recently been explored to learn multimodal representations. The main process is to first pre-train models with auxiliary tasks that exploit the latent supervision in large-scale unlabeled data; the pre-trained models then embed real-world objects into multimodal representations.