
Multimodal Contrastive Learning via Uni-Modal Coding and
Cross-Modal Prediction for Multimodal Sentiment Analysis
Ronghao Lin and Haifeng Hu∗
Sun Yat-sen University, China
linrh7@mail2.sysu.edu.cn, huhaif@mail.sysu.edu.cn
∗Corresponding author.
Abstract
Multimodal representation learning is a challenging task in which previous work mostly focuses on either uni-modality pre-training or cross-modality fusion. In fact, we regard modeling multimodal representation as building a skyscraper, where laying a stable foundation and designing the main structure are equally essential. The former is like encoding robust uni-modal representations, while the latter is like integrating interactive information among different modalities; both are critical to learning an effective multimodal representation. Recently, contrastive learning has been successfully applied to representation learning. It can serve as the pillar of the skyscraper and help the model extract the most important features contained in the multimodal data. In this paper, we propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously. Specifically, we devise uni-modal contrastive coding with an efficient uni-modal feature augmentation strategy to filter the inherent noise contained in the acoustic and visual modalities and acquire more robust uni-modal representations. Besides, a pseudo-siamese network is presented to predict representations across different modalities, which successfully captures cross-modal dynamics. Moreover, we design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the prediction process and learn more interactive information related to sentiment. Extensive experiments conducted on two public datasets demonstrate that our method surpasses the state-of-the-art methods.
1 Introduction
With the surge of user-generated videos, Multimodal Sentiment Analysis (MSA) has become an active research field, which aims to infer people's sentiment from multimodal data including text, audio, and video (Zadeh et al., 2017; Tsai et al., 2019a, 2020; Poria et al., 2020). To successfully understand human behaviours and interpret human intents, it is necessary for the model to attain an effective and powerful multimodal representation. However, two major challenges exist in learning such a multimodal representation: accurately extracting uni-modal features, and modeling cross-modal interactions despite the heterogeneity across different modalities.
To acquire powerful uni-modal features, Devlin et al. (2019) present a large-scale language model named BERT for the textual modality, and Wu et al. (2022) introduce an audio representation learning method for the acoustic modality by distilling from Radford et al. (2021), which targets transferable models for the visual modality. In MSA, previous methods (Yu et al., 2021; Han et al., 2021) mainly utilize BERT for the textual modality while relying on hand-crafted feature extractors such as COVAREP (Degottex et al., 2014) and Facet (iMotions, 2017) for the acoustic and visual modalities, so the inherent noise contained in these uni-modal features may still remain. To prevent uni-modal noise from interfering with the downstream sentiment inference task, we design Uni-Modal Contrastive Coding (UMCC), which employs a feature cutoff strategy inspired by Shen et al. (2020) to generate augmented features and constructs a contrastive learning task between them and the original uni-modal representations. As shown in Figure 1, we thereby obtain robust and efficient representations for the acoustic and visual modalities.
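To make the idea concrete, the following minimal PyTorch sketch illustrates the general scheme described above; it is our illustration rather than the released implementation, and the function names, the 0.2 cutoff ratio, and the InfoNCE loss form are assumptions. Each uni-modal feature vector is paired with a feature-cutoff view of itself, and the matching pair is treated as the positive within a batch:

```python
import torch
import torch.nn.functional as F

def feature_cutoff(x, cutoff_ratio=0.2):
    """Feature cutoff augmentation (in the spirit of Shen et al., 2020):
    randomly zero out a fraction of feature dimensions. Hypothetical sketch."""
    # x: (batch, dim) uni-modal representations
    mask = (torch.rand(x.size(-1), device=x.device) > cutoff_ratio).float()
    return x * mask

def unimodal_contrastive_loss(h, temperature=0.1):
    """InfoNCE-style loss between original features and their cutoff view;
    the matching (original, augmented) pair is the positive, others negatives."""
    h_aug = feature_cutoff(h)                    # augmented view of the same batch
    z = F.normalize(h, dim=-1)
    z_aug = F.normalize(h_aug, dim=-1)
    logits = z @ z_aug.t() / temperature         # (batch, batch) similarity matrix
    targets = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, targets)

# Example usage with a hypothetical batch of acoustic features.
acoustic = torch.randn(32, 128)
loss = unimodal_contrastive_loss(acoustic)
```

Under this setup, feature dimensions dominated by noise contribute less to the agreement between the two views, which is one way to encourage the encoder to keep the more reliable components of the acoustic and visual features.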
To alleviate the impact of modality heterogeneity, previous MSA models propose various modality fusion methods to learn cross-modal interaction information (Hazarika et al., 2020; Rahman et al., 2020). Modality translation is a popular approach that explicitly translates a source modality into the target one and thereby directly exploits the commonalities across modalities (Tsai et al., 2019a; Wu et al.,