Multimodal Contrastive Learning via Uni-Modal Coding and
Cross-Modal Prediction for Multimodal Sentiment Analysis
Ronghao Lin and Haifeng Hu
Sun Yat-sen University, China
linrh7@mail2.sysu.edu.cn, huhaif@mail.sysu.edu.cn
Abstract
Multimodal representation learning is a challenging task in which previous work mostly focuses on either uni-modality pre-training or cross-modality fusion. In fact, we regard modeling multimodal representation as building a skyscraper, where laying a stable foundation and designing the main structure are equally essential. The former is like encoding robust uni-modal representations, while the latter is like integrating interactive information among different modalities; both are critical to learning an effective multimodal representation. Recently, contrastive learning has been successfully applied in representation learning, and it can serve as the pillar of the skyscraper, helping the model extract the most important features contained in the multimodal data. In this paper, we propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously. Specifically, we devise uni-modal contrastive coding with an efficient uni-modal feature augmentation strategy to filter the inherent noise contained in the acoustic and visual modalities and acquire more robust uni-modality representations. Besides, a pseudo siamese network is presented to predict representations across different modalities, which successfully captures cross-modal dynamics. Moreover, we design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and to learn more interactive information related to sentiment. Extensive experiments conducted on two public datasets demonstrate that our method surpasses the state-of-the-art methods.
1 Introduction
With the surge of user-generated videos, Multimodal Sentiment Analysis (MSA) has become
a hot research field, which aims to infer people's sentiment based on multimodal data including text, audio and video (Zadeh et al., 2017; Tsai et al., 2019a, 2020; Poria et al., 2020). To successfully understand human behaviours and interpret human intents, it is necessary to attain an effective and powerful multimodal representation for the model. However, learning such a multimodal representation faces two major challenges: the accurate extraction of uni-modal features, and the heterogeneity across different modalities, which makes it difficult to model cross-modal interactions.
To acquire powerful uni-modal features, Devlin et al. (2019) present a large-scale language model named BERT for the textual modality, and Wu et al. (2022) introduce an audio representation learning method for the acoustic modality by distilling from Radford et al. (2021), which targets a transferable model for the visual modality. In MSA, previous methods (Yu et al., 2021; Han et al., 2021) mainly utilize BERT for the textual modality while relying on feature extractors such as COVAREP (Degottex et al., 2014) and Facet (iMotions 2017), whose configurations are only vaguely documented, for the acoustic and visual modalities, so the inherent noise contained in uni-modal features may still remain. To prevent uni-modal noise from interfering with the downstream sentiment inference task, we design Uni-Modal Contrastive Coding (UMCC), which employs a feature cutoff strategy inspired by Shen et al. (2020) and generates augmented features to construct a contrastive learning task with the original uni-modal representations. As shown in Figure 1, we thereby obtain robust and efficient representations for the acoustic and visual modalities.
To alleviate the impact of modality heterogeneity, previous MSA models propose various modality fusion methods to learn cross-modality interaction information (Hazarika et al., 2020; Rahman et al., 2020). Modality translation is a popular method that explicitly translates a source modality into a target one, directly manipulating the commonalities across modalities (Tsai et al., 2019a; Wu et al.,
2021; Zhao et al., 2021). However, due to the discrepancy in modality-specific information and the huge modality gap, it is undesirable and extremely difficult to project the representations of different modalities into the same one. Different from these explicit modality translation methods, we propose Cross-Modal Contrastive Prediction (CMCP), composed of a pseudo siamese predictive network and two designed contrastive learning tasks, to predict cross-modal representations in an implicitly contrastive way. The predictive representations efficiently capture cross-modality dynamics while concurrently preserving modality-specific features for the modalities.
The novel contributions of our work can be summarized as follows:

1) We propose a framework named MultiModal Contrastive Learning (MMCL), consisting of Uni-Modal Contrastive Coding (UMCC), which mitigates the interference of inherent modality noise and learns robust uni-modal representations, and Cross-Modal Contrastive Prediction (CMCP) with a pseudo siamese predictive network, which learns commonalities and interactive features across different modalities.

2) We design two contrastive learning tasks, instance- and sentiment-based contrastive learning, in order to improve the convergence of the predictive network and capture sentiment-related information contained in the multimodal data.

3) We conduct extensive experiments on two publicly available datasets and obtain results superior to those of state-of-the-art MSA models.
2 Related Work
2.1 Multimodal Sentiment Analysis (MSA)
MSA focuses on integrating the textual, acoustic and visual modalities to comprehend varied human sentiment (Morency et al., 2011). Previous research mainly comprises two steps: uni-modal representation learning and multimodal fusion. For uni-modal representation, Tsai et al. (2019b) factorize multimodal representations into two independent sets of factors while Hazarika et al. (2020) project them into two distinct subspaces. Large pre-trained Transformer-based language models such as BERT have shown great performance improvements on downstream NLP tasks (Devlin et al., 2019). However, for the acoustic and visual modalities in MSA tasks, the features are extracted by the CMU-MultimodalSDK with only a vague description of feature and backbone selection (Zadeh et al., 2018c). We argue that without a powerful pre-trained language tokenizer to extract features, as the textual modality has, the inherent noise of acoustic and visual features may disturb sentiment inference.

Different from directly processing the uni-modal features, we present uni-modal contrastive coding to learn robust acoustic and visual representations.
For multimodal fusion, Zadeh et al. (2017), Liu et al. (2018) and Zadeh et al. (2018c) perform early fusion at the feature level, while Poria et al. (2017), Zadeh et al. (2018a) and Yu et al. (2021) adopt late fusion at the decision level. However, the former methods have limited capability in modeling cross-modal dynamics due to the inconsistent spaces of different modalities, while the latter methods neglect modality-specific information in the absence of low-level feature processing. To avoid the respective issues of the two approaches, Tsai et al. (2019a,b) and Hazarika et al. (2020) propose hybrid fusion, which performs multimodal fusion at both the input and output levels. Guided by this idea, we construct a cross-modal predictive network that processes representations from different modalities at both the early and late fusion stages, effectively exploiting the intra- and inter-modality dynamics in a prediction manner.
2.2 Contrastive Learning
The core idea of contrastive learning is to measure the similarities of sample pairs in the representation space (Hadsell et al., 2006). It was first adopted in the field of computer vision (He et al., 2020) and then extended to natural language processing (Gao et al., 2021). Previous work based on contrastive learning mostly considers only uni-modal data and utilizes contrastive losses in a discriminative manner. Different from such discriminative models, Oord et al. (2018) combine the prediction of future observations, termed predictive coding, with a probabilistic contrastive loss called InfoNCE. Inspired by, but diverging from, this work, we apply a predictive network with contrastive learning to multimodal features in order to capture cross-modal dynamics and enhance the interaction among different modalities.
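For reference, a minimal sketch of the InfoNCE objective from Oord et al. (2018) is given below, assuming PyTorch; the in-batch negative sampling and the temperature value are illustrative assumptions rather than the exact configuration used in our losses.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive_key, temperature=0.07):
    """InfoNCE: each query should be most similar to its own positive key,
    with the other keys in the batch serving as negatives.

    query, positive_key: tensors of shape (batch, dim).
    """
    q = F.normalize(query, dim=-1)
    k = F.normalize(positive_key, dim=-1)
    logits = q @ k.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)   # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```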
[Figure 1: The overall architecture of our proposed MMCL framework. The diagram shows an example utterance ("[CLS] But the way it pulled off is like: Wow! That was sick! [SEP]") with its aligned acoustic and visual streams from t = 0 to t = n; a pre-trained BERT encoder for text together with audio and vision feature extractors and encoders producing original and cutoff views in the Uni-Modal Coding module (losses $\mathcal{L}^{a}_{uni}$ and $\mathcal{L}^{v}_{uni}$); and a Cross-Modal Prediction module with bi-modal fusion (Text+Audio and Text+Vision encoders), autoregressive (AR) prediction and tri-modal fusion yielding the sentiment output, trained with the contrastive losses $\mathcal{L}^{m}_{cross}$, $\mathcal{L}^{m}_{align}$, $\mathcal{L}^{m}_{uniform}$ and $\mathcal{L}^{m}_{sent}$ for $m \in \{a, v\}$.]
3 Method
3.1 Problem Definition
In the MSA task, the input is an utterance consisting of three modalities: the textual, acoustic and visual modality, indexed by $m \in \{t, a, v\}$. The sequences of the three modalities are represented as a triplet $(T, A, V)$, with $T \in \mathbb{R}^{N_t \times d_t}$, $A \in \mathbb{R}^{N_a \times d_a}$ and $V \in \mathbb{R}^{N_v \times d_v}$, where $N_m$ denotes the sequence length of the corresponding modality and $d_m$ denotes its dimensionality. The goal of the MSA task is to learn a mapping $f(T, A, V)$ to infer the sentiment score $\hat{y} \in \mathbb{R}$.
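For concreteness, the input triplet can be viewed as three padded feature tensors; the batch size, sequence lengths and feature dimensions below are illustrative placeholders, not the datasets' actual values.

```python
import torch

batch = 8
N_t, N_a, N_v = 50, 375, 500   # assumed sequence lengths per modality
d_t, d_a, d_v = 768, 74, 35    # assumed feature dimensions per modality

T = torch.randn(batch, N_t, d_t)   # textual sequence
A = torch.randn(batch, N_a, d_a)   # acoustic sequence
V = torch.randn(batch, N_v, d_v)   # visual sequence
# The MSA model learns a mapping f(T, A, V) -> y_hat, a real-valued sentiment score.
```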
3.2 Overall Architecture
As shown in Figure 1, we first process the raw inputs into sequential feature vectors, using fixed feature extractors for the audio and vision data and a pre-trained BERT (Devlin et al., 2019) encoder for the text. We then utilize contrastive learning in both uni-modal coding and cross-modal prediction, the two key modules of our proposed model. The uni-modal coding drives the model to focus on informative features, implicitly filtering out inherent noise and producing robust and effective uni-modal representations for the acoustic and visual modalities. The cross-modal prediction captures commonalities among different modalities and outputs predictive representations rich in interaction dynamics. Lastly, we fuse the predictive acoustic and visual representations with the textual representation to derive the final multimodal representation, which contains both the modality-specific and the cross-modal dynamics most related to sentiment.
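To make the data flow concrete, the following structural sketch (in PyTorch) mirrors the pipeline described above; the module names, their interfaces and the regression head are assumptions introduced for illustration, not the released implementation.

```python
import torch.nn as nn

class MMCLSketch(nn.Module):
    """Structural sketch of the MMCL forward pass; module internals are omitted."""
    def __init__(self, text_encoder, uni_coding, cross_prediction, fusion, head):
        super().__init__()
        self.text_encoder = text_encoder          # pre-trained BERT encoder
        self.uni_coding = uni_coding              # uni-modal coding (UMCC)
        self.cross_prediction = cross_prediction  # cross-modal prediction (CMCP)
        self.fusion = fusion                      # tri-modal fusion
        self.head = head                          # sentiment regression head

    def forward(self, text_inputs, A, V):
        f_t = self.text_encoder(text_inputs)              # textual representation
        f_a, f_v = self.uni_coding(A, V)                  # robust acoustic/visual representations
        p_a, p_v = self.cross_prediction(f_t, f_a, f_v)   # predictive cross-modal representations
        m = self.fusion(f_t, p_a, p_v)                    # final multimodal representation
        return self.head(m)                               # sentiment score
```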
3.3 Uni-Modal Contrastive Coding
For each uni-modality, we encode the sequential triplet $(T, A, V)$ into corresponding representations. Specifically, we use BERT (Devlin et al., 2019) to encode the input sentences and obtain the hidden representations of the textual modality. The embedding from the last Transformer layer's output can be represented as:
$$F_t = \mathrm{BERT}(T; \theta^{\mathrm{BERT}}_t) \in \mathbb{R}^{L_t \times d_t} \quad (1)$$
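As a concrete illustration of Eq. (1), text encoding with the HuggingFace transformers library could look like the sketch below; the checkpoint name and maximum length are assumptions rather than the paper's exact settings.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; this section does not pin a specific BERT variant.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentences = ["But the way it pulled off is like: Wow! That was sick!"]
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=50, return_tensors="pt")

with torch.no_grad():
    # F_t: hidden states of the last Transformer layer, shape (batch, L_t, d_t).
    F_t = bert(**inputs).last_hidden_state
```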
To acquire more robust acoustic and visual representations, we design Uni-Modal Contrastive Coding (UMCC) for both modalities. Firstly, we encode the audio and vision inputs with two uni-modal bi-directional LSTMs (Hochreiter and Schmidhuber, 1997) to capture their temporal characteristics:
$$h_a = \mathrm{bLSTM}(A; \theta^{\mathrm{bLSTM}}_a) \in \mathbb{R}^{L_a \times d_a}$$
$$h_v = \mathrm{bLSTM}(V; \theta^{\mathrm{bLSTM}}_v) \in \mathbb{R}^{L_v \times d_v} \quad (2)$$
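A minimal sketch of the bidirectional LSTM encoders in Eq. (2), assuming PyTorch; the hidden size and the projection back to the input dimension are illustrative choices made so that the outputs keep the shapes stated above.

```python
import torch
import torch.nn as nn

class UniModalBiLSTM(nn.Module):
    """Encodes a uni-modal feature sequence with a bidirectional LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Project the concatenated forward/backward states back to input_dim
        # so the output keeps the (L_m, d_m) shape of Eq. (2).
        self.proj = nn.Linear(2 * hidden_dim, input_dim)

    def forward(self, x):          # x: (batch, L_m, d_m)
        out, _ = self.lstm(x)      # (batch, L_m, 2 * hidden_dim)
        return self.proj(out)      # (batch, L_m, d_m)

audio_encoder = UniModalBiLSTM(input_dim=74, hidden_dim=64)   # illustrative dims
h_a = audio_encoder(torch.randn(8, 375, 74))
```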
To construct the contrastive learning task, we treat the encoded acoustic and visual representations as query samples $q$ and obtain the corresponding positive key samples $k^+$ by a feature augmentation strategy. For natural language understanding and generation tasks, Shen et al. (2020) introduce an efficient data augmentation approach named cutoff, which erases part of the information within an input sentence to yield its restricted views during the fine-tuning stage. Inspired by this work, we apply a random feature cutoff strategy to the acoustic and visual representations, which randomly converts a certain proportion of the embedding dimensions of every token within the sequence into zeros. As shown in Figure 2, we thereby generate augmented versions of the uni-modal representations, denoted as $h'_a$ and $h'_v$.
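A minimal sketch of the random feature cutoff described above, assuming PyTorch; the cutoff ratio is an illustrative hyperparameter, not necessarily the value used in our experiments.

```python
import torch

def feature_cutoff(h, cutoff_ratio=0.2):
    """Zero out a random subset of embedding dimensions for every token in a sequence.

    h: (batch, seq_len, dim) uni-modal representation, e.g. h_a or h_v.
    Returns the augmented view used as the positive key k+.
    """
    batch, _, dim = h.shape
    n_cut = int(dim * cutoff_ratio)
    h_aug = h.clone()
    for b in range(batch):
        cut_dims = torch.randperm(dim)[:n_cut]   # dimensions erased for this sample
        h_aug[b, :, cut_dims] = 0.0
    return h_aug

h_a = torch.randn(8, 375, 74)    # e.g. an acoustic representation from Eq. (2)
h_a_aug = feature_cutoff(h_a)    # positive key paired with the query h_a
```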
In order to later fuse with the textual representations in a similar semantic space, we design uni-modal Transformer models for the acoustic and visual