Multimodal Contrastive Learning via Uni-Modal Coding and
Cross-Modal Prediction for Multimodal Sentiment Analysis
Ronghao Lin and Haifeng Hu
Sun Yat-sen University, China
linrh7@mail2.sysu.edu.cn, huhaif@mail.sysu.edu.cn
Abstract
Multimodal representation learning is a challenging task in which previous work mostly focuses on either uni-modality pre-training or cross-modality fusion. In fact, we regard modeling multimodal representation as building a skyscraper, where laying a stable foundation and designing the main structure are equally essential. The former is like encoding robust uni-modal representations, while the latter is like integrating interactive information among different modalities; both are critical to learning an effective multimodal representation. Recently, contrastive learning has been successfully applied in representation learning, and it can serve as the pillar of the skyscraper, helping the model extract the most important features contained in the multimodal data. In this paper, we propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously. Specifically, we devise uni-modal contrastive coding with an efficient uni-modal feature augmentation strategy to filter the inherent noise contained in the acoustic and visual modalities and acquire more robust uni-modality representations. Besides, a pseudo siamese network is presented to predict representations across different modalities, which successfully captures cross-modal dynamics. Moreover, we design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and to learn more interactive information related to sentiment. Extensive experiments conducted on two public datasets demonstrate that our method surpasses the state-of-the-art methods.
1 Introduction
With the surge of user-generated videos, Multimodal Sentiment Analysis (MSA) has become
a hot research field, which aims to infer people's sentiment based on multimodal data including text, audio and video (Zadeh et al., 2017; Tsai et al., 2019a, 2020; Poria et al., 2020). To successfully understand human behaviours and interpret human intents, it is necessary to attain an effective and powerful multimodal representation for the model. However, learning such a multimodal representation faces two major challenges: the accurate extraction of uni-modal features, and the heterogeneity across different modalities, which makes it difficult to model cross-modal interactions.
To acquire powerful uni-modal features, Devlin et al. (2019) present a large-scale language model named BERT for the textual modality, and Wu et al. (2022) introduce an audio representation learning method for the acoustic modality by distilling from Radford et al. (2021), which targets a transferable model for the visual modality. In MSA, previous methods (Yu et al., 2021; Han et al., 2021) mainly utilize BERT for the textual modality while relying on feature extractors such as COVAREP (Degottex et al., 2014) and Facet (iMotions 2017), whose configurations are only vaguely documented, for the acoustic and visual modalities, so the inherent noise contained in uni-modal features may still remain. To prevent uni-modal noise from interfering with the downstream sentiment inference task, we design Uni-Modal Contrastive Coding (UMCC), which employs a feature cutoff strategy inspired by Shen et al. (2020) and generates augmented features to construct a contrastive learning task with the original uni-modal representations. As shown in Figure 1, we thereby obtain robust and efficient representations for the acoustic and visual modalities.
To alleviate the impact of modality heterogeneity, previous MSA models propose various modality fusion methods to learn cross-modality interaction information (Hazarika et al., 2020; Rahman et al., 2020). Modality translation is a popular method that explicitly translates a source modality into a target one, directly manipulating the commonalities across modalities (Tsai et al., 2019a; Wu et al.,
2021; Zhao et al., 2021). However, due to the discrepancy in modality-specific information and the huge modality gap, it is undesirable and extremely difficult to project the representations of different modalities into the same one. Different from these explicit modality translation methods, we propose Cross-Modal Contrastive Prediction (CMCP), composed of a pseudo siamese predictive network and two designed contrastive learning tasks, to predict cross-modal representations in an implicitly contrastive way. The predictive representations efficiently capture cross-modality dynamics while concurrently preserving modality-specific features for the modalities.
The novel contributions of our work can be summarized as follows:

1) We propose a framework named MultiModal Contrastive Learning (MMCL), consisting of Uni-Modal Contrastive Coding (UMCC), which mitigates the interference of inherent modality noise and learns robust uni-modal representations, and Cross-Modal Contrastive Prediction (CMCP) with a pseudo siamese predictive network, which learns commonalities and interactive features across different modalities.

2) We design two contrastive learning tasks, instance- and sentiment-based contrastive learning, in order to improve the convergence of the predictive network and capture sentiment-related information contained in the multimodal data.

3) We conduct extensive experiments on two publicly available datasets and obtain results superior to those of state-of-the-art MSA models.
2 Related Work
2.1 Multimodal Sentiment Analysis (MSA)
MSA focuses on integrating the textual, acoustic and visual modalities to comprehend varied human sentiment (Morency et al., 2011). Previous research mainly comprises two steps: uni-modal representation learning and multimodal fusion. For uni-modal representation, Tsai et al. (2019b) factorize multimodal representations into two independent sets of factors while Hazarika et al. (2020) project them into two distinct subspaces. Large pre-trained Transformer-based language models such as BERT have shown great performance improvements on downstream NLP tasks (Devlin et al., 2019). However, for the acoustic and visual modalities in MSA tasks, the features are extracted by the CMU-MultimodalSDK with only a vague description of feature and backbone selection (Zadeh et al., 2018c). We argue that without a powerful pre-trained language tokenizer to extract features, as the textual modality has, the inherent noise of acoustic and visual features may disturb sentiment inference.

Different from directly processing the uni-modal features, we present uni-modal contrastive coding to learn robust acoustic and visual representations.
For multimodal fusion, Zadeh et al. (2017), Liu et al. (2018) and Zadeh et al. (2018c) perform early fusion at the feature level, while Poria et al. (2017), Zadeh et al. (2018a) and Yu et al. (2021) adopt late fusion at the decision level. However, the former methods have limited capability in modeling cross-modal dynamics due to the inconsistent spaces of different modalities, while the latter methods neglect modality-specific information in the absence of low-level feature processing. To avoid the respective issues of the two approaches, Tsai et al. (2019a,b) and Hazarika et al. (2020) propose hybrid fusion, which performs multimodal fusion at both the input and output levels. Guided by this idea, we construct a cross-modal predictive network that processes representations from different modalities at both the early and late fusion stages, effectively exploiting the intra- and inter-modality dynamics in a prediction manner.
2.2 Contrastive Learning
The core idea of contrastive learning is to measure the similarities of sample pairs in the representation space (Hadsell et al., 2006). It was first adopted in the field of computer vision (He et al., 2020) and then extended to natural language processing (Gao et al., 2021). Previous work based on contrastive learning mostly considers only uni-modal data and utilizes contrastive losses in a discriminative manner. Different from such discriminative models, Oord et al. (2018) combine the prediction of future observations, termed predictive coding, with a probabilistic contrastive loss called InfoNCE. Inspired by, but diverging from, this work, we apply a predictive network with contrastive learning to multimodal features in order to capture cross-modal dynamics and enhance the interaction among different modalities.
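For reference, a minimal sketch of the InfoNCE objective from Oord et al. (2018) is given below, assuming PyTorch; the in-batch negative sampling and the temperature value are illustrative assumptions rather than the exact configuration used in our losses.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive_key, temperature=0.07):
    """InfoNCE: each query should be most similar to its own positive key,
    with the other keys in the batch serving as negatives.

    query, positive_key: tensors of shape (batch, dim).
    """
    q = F.normalize(query, dim=-1)
    k = F.normalize(positive_key, dim=-1)
    logits = q @ k.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)   # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```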
[Figure 1: The overall architecture of our proposed MMCL framework. The diagram shows an example utterance ("[CLS] But the way it pulled off is like: Wow! That was sick! [SEP]") with its aligned acoustic and visual streams from t = 0 to t = n; a pre-trained BERT encoder for text together with audio and vision feature extractors and encoders producing original and cutoff views in the Uni-Modal Coding module (losses $\mathcal{L}^{a}_{uni}$ and $\mathcal{L}^{v}_{uni}$); and a Cross-Modal Prediction module with bi-modal fusion (Text+Audio and Text+Vision encoders), autoregressive (AR) prediction and tri-modal fusion yielding the sentiment output, trained with the contrastive losses $\mathcal{L}^{m}_{cross}$, $\mathcal{L}^{m}_{align}$, $\mathcal{L}^{m}_{uniform}$ and $\mathcal{L}^{m}_{sent}$ for $m \in \{a, v\}$.]
3 Method
3.1 Problem Definition
In the MSA task, the input is an utterance consisting of three modalities: the textual, acoustic and visual modality, indexed by $m \in \{t, a, v\}$. The sequences of the three modalities are represented as a triplet $(T, A, V)$, with $T \in \mathbb{R}^{N_t \times d_t}$, $A \in \mathbb{R}^{N_a \times d_a}$ and $V \in \mathbb{R}^{N_v \times d_v}$, where $N_m$ denotes the sequence length of the corresponding modality and $d_m$ denotes its dimensionality. The goal of the MSA task is to learn a mapping $f(T, A, V)$ to infer the sentiment score $\hat{y} \in \mathbb{R}$.
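For concreteness, the input triplet can be viewed as three padded feature tensors; the batch size, sequence lengths and feature dimensions below are illustrative placeholders, not the datasets' actual values.

```python
import torch

batch = 8
N_t, N_a, N_v = 50, 375, 500   # assumed sequence lengths per modality
d_t, d_a, d_v = 768, 74, 35    # assumed feature dimensions per modality

T = torch.randn(batch, N_t, d_t)   # textual sequence
A = torch.randn(batch, N_a, d_a)   # acoustic sequence
V = torch.randn(batch, N_v, d_v)   # visual sequence
# The MSA model learns a mapping f(T, A, V) -> y_hat, a real-valued sentiment score.
```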
3.2 Overall Architecture
As shown in Figure 1, we first process the raw inputs into sequential feature vectors, using fixed feature extractors for the audio and vision data and a pre-trained BERT (Devlin et al., 2019) encoder for the text. We then utilize contrastive learning in both uni-modal coding and cross-modal prediction, the two key modules of our proposed model. The uni-modal coding drives the model to focus on informative features, implicitly filtering out inherent noise and producing robust and effective uni-modal representations for the acoustic and visual modalities. The cross-modal prediction captures commonalities among different modalities and outputs predictive representations rich in interaction dynamics. Lastly, we fuse the predictive acoustic and visual representations with the textual representation to derive the final multimodal representation, which contains both the modality-specific and the cross-modal dynamics most related to sentiment.
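To make the data flow concrete, the following structural sketch (in PyTorch) mirrors the pipeline described above; the module names, their interfaces and the regression head are assumptions introduced for illustration, not the released implementation.

```python
import torch.nn as nn

class MMCLSketch(nn.Module):
    """Structural sketch of the MMCL forward pass; module internals are omitted."""
    def __init__(self, text_encoder, uni_coding, cross_prediction, fusion, head):
        super().__init__()
        self.text_encoder = text_encoder          # pre-trained BERT encoder
        self.uni_coding = uni_coding              # uni-modal coding (UMCC)
        self.cross_prediction = cross_prediction  # cross-modal prediction (CMCP)
        self.fusion = fusion                      # tri-modal fusion
        self.head = head                          # sentiment regression head

    def forward(self, text_inputs, A, V):
        f_t = self.text_encoder(text_inputs)              # textual representation
        f_a, f_v = self.uni_coding(A, V)                  # robust acoustic/visual representations
        p_a, p_v = self.cross_prediction(f_t, f_a, f_v)   # predictive cross-modal representations
        m = self.fusion(f_t, p_a, p_v)                    # final multimodal representation
        return self.head(m)                               # sentiment score
```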
3.3 Uni-Modal Contrastive Coding
For each uni-modality, we encode the sequential triplet $(T, A, V)$ into corresponding representations. Specifically, we use BERT (Devlin et al., 2019) to encode the input sentences and obtain the hidden representations of the textual modality. The embedding from the last Transformer layer's output can be represented as:
$$F_t = \mathrm{BERT}(T; \theta^{\mathrm{BERT}}_t) \in \mathbb{R}^{L_t \times d_t} \quad (1)$$
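As a concrete illustration of Eq. (1), text encoding with the HuggingFace transformers library could look like the sketch below; the checkpoint name and maximum length are assumptions rather than the paper's exact settings.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; this section does not pin a specific BERT variant.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentences = ["But the way it pulled off is like: Wow! That was sick!"]
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=50, return_tensors="pt")

with torch.no_grad():
    # F_t: hidden states of the last Transformer layer, shape (batch, L_t, d_t).
    F_t = bert(**inputs).last_hidden_state
```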
To acquire more robust acoustic and visual representations, we design Uni-Modal Contrastive Coding (UMCC) for both modalities. Firstly, we encode the audio and vision inputs with two uni-modal bi-directional LSTMs (Hochreiter and Schmidhuber, 1997) to capture their temporal characteristics:
$$h_a = \mathrm{bLSTM}(A; \theta^{\mathrm{bLSTM}}_a) \in \mathbb{R}^{L_a \times d_a}$$
$$h_v = \mathrm{bLSTM}(V; \theta^{\mathrm{bLSTM}}_v) \in \mathbb{R}^{L_v \times d_v} \quad (2)$$
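A minimal sketch of the bidirectional LSTM encoders in Eq. (2), assuming PyTorch; the hidden size and the projection back to the input dimension are illustrative choices made so that the outputs keep the shapes stated above.

```python
import torch
import torch.nn as nn

class UniModalBiLSTM(nn.Module):
    """Encodes a uni-modal feature sequence with a bidirectional LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Project the concatenated forward/backward states back to input_dim
        # so the output keeps the (L_m, d_m) shape of Eq. (2).
        self.proj = nn.Linear(2 * hidden_dim, input_dim)

    def forward(self, x):          # x: (batch, L_m, d_m)
        out, _ = self.lstm(x)      # (batch, L_m, 2 * hidden_dim)
        return self.proj(out)      # (batch, L_m, d_m)

audio_encoder = UniModalBiLSTM(input_dim=74, hidden_dim=64)   # illustrative dims
h_a = audio_encoder(torch.randn(8, 375, 74))
```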
To construct the contrastive learning task, we treat the encoded acoustic and visual representations as query samples $q$ and obtain the corresponding positive key samples $k^+$ by a feature augmentation strategy. For natural language understanding and generation tasks, Shen et al. (2020) introduce an efficient data augmentation approach named cutoff, which erases part of the information within an input sentence to yield its restricted views during the fine-tuning stage. Inspired by this work, we apply a random feature cutoff strategy to the acoustic and visual representations, which randomly converts a certain proportion of the embedding dimensions of every token within the sequence into zeros. As shown in Figure 2, we thereby generate augmented versions of the uni-modal representations, denoted as $h'_a$ and $h'_v$.
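A minimal sketch of the random feature cutoff described above, assuming PyTorch; the cutoff ratio is an illustrative hyperparameter, not necessarily the value used in our experiments.

```python
import torch

def feature_cutoff(h, cutoff_ratio=0.2):
    """Zero out a random subset of embedding dimensions for every token in a sequence.

    h: (batch, seq_len, dim) uni-modal representation, e.g. h_a or h_v.
    Returns the augmented view used as the positive key k+.
    """
    batch, _, dim = h.shape
    n_cut = int(dim * cutoff_ratio)
    h_aug = h.clone()
    for b in range(batch):
        cut_dims = torch.randperm(dim)[:n_cut]   # dimensions erased for this sample
        h_aug[b, :, cut_dims] = 0.0
    return h_aug

h_a = torch.randn(8, 375, 74)    # e.g. an acoustic representation from Eq. (2)
h_a_aug = feature_cutoff(h_a)    # positive key paired with the query h_a
```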
In order to later fuse with the textual representations in a similar semantic space, we design uni-modal Transformer models for the acoustic and visual