form in sampling, following the model-agnostic
meta-learning (MAML) framework (Finn et al.,
2017). M³S retains the advantages of meta-learning and enables models to adapt easily to data with different missing rates. M³S can be treated as an efficient add-on training component for existing models, significantly improving their performance on multimodal data with a mixture of missing modalities. We conduct experiments on the IEMOCAP (Busso et al., 2008), SIMS (Yu et al., 2020), and CMU-MOSI (Zadeh et al., 2016) datasets, achieving superior performance compared with recent state-of-the-art (SOTA) methods. A simple example is shown in Figure 1, demonstrating the effectiveness of our proposed M³S compared with other methods. More details are provided in the experiment section.
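To make the meta-training scheme concrete, below is a minimal first-order sketch of MAML-style training in which each meta-task samples its own missing rate. It assumes a PyTorch model that accepts a dict of modality features; the helper names (`mask_modalities`, `m3s_meta_step`), the uniform missing-rate distribution, and the first-order gradient approximation are illustrative assumptions, not our exact implementation. Here `tasks` would be a list of (support, query) batch pairs drawn from the training set.

```python
import copy
import random
import torch

def mask_modalities(batch, missing_rate):
    # Zero out each modality with probability `missing_rate`
    # to simulate missing inputs (hypothetical masking scheme).
    inputs = {}
    for name, feats in batch["inputs"].items():
        drop = random.random() < missing_rate
        inputs[name] = torch.zeros_like(feats) if drop else feats
    return {"inputs": inputs, "label": batch["label"]}

def m3s_meta_step(model, meta_opt, tasks, loss_fn, inner_lr=1e-3):
    # One first-order MAML-style meta-update: adapt a copy of the
    # model on a support set under one sampled missing rate, then
    # accumulate the query-set gradient into the meta-gradient.
    meta_opt.zero_grad()
    for support, query in tasks:
        fast = copy.deepcopy(model)  # task-specific fast weights
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        s = mask_modalities(support, random.uniform(0.0, 0.7))
        inner_loss = loss_fn(fast(s["inputs"]), s["label"])
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()
        q = mask_modalities(query, random.uniform(0.0, 0.7))
        inner_opt.zero_grad()  # clear leftover inner-loop gradients
        query_loss = loss_fn(fast(q["inputs"]), q["label"])
        query_loss.backward()
        # First-order approximation: treat the adapted weights'
        # gradients as gradients of the original parameters.
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()
```

In the full second-order MAML of Finn et al. (2017), the outer gradient is taken through the inner update; the first-order variant sketched above is a common, cheaper approximation.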
The main contributions of our work are as follows:
• We formulate a simple yet effective meta-training framework to address the problem of a mixture of partially missing modalities in MSA tasks.
• The proposed method M³S can be treated as an efficient add-on training component for existing models and significantly improves their performance when dealing with missing modalities.
• We conduct comprehensive experiments on widely used MSA datasets, including IEMOCAP, SIMS, and CMU-MOSI, achieving superior performance compared with recent SOTA methods.
2 Related Work
2.1 Emotion Recognition
Emotion recognition aims to identify and predict emotions from physiological and behavioral responses. Emotions are expressed in a variety of modalities. However, early studies on emotion recognition often focus on a single modality. Shaheen
et al. (2014) and Calefato et al. (2017) present
novel approaches to automatic emotion recognition
from text. Burkert et al. (2015) and Deng et al. (2020) conduct research on facial expressions
and the emotions behind them. Koolagudi and Rao (2012) and Yoon et al. (2019) exploit acoustic data from different types of speech for emotion recognition and classification tasks. Though much progress has been made in emotion recognition with single-modality data, how to combine information from diverse modalities has become an interesting direction in this area.
2.2 Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) is currently a popular research area, since the world we live in presents information in several modalities. When a dataset contains more than one modality, traditional single-modality methods struggle to handle it. MSA mainly focuses on three modalities: text, audio, and video. It makes use of the complementarity of multimodal information to improve the accuracy of emotion recognition. However, the heterogeneity of data and signals brings significant challenges because it creates distributional modality gaps. Hazarika et al. (2020)
propose a novel framework, MISA, which projects
each modality to two distinct subspaces to aid the
fusion process. Hori et al. (2017) introduce
a multimodal attention model that can selectively
utilize features from different modalities. Since
the performance of a model highly depends on the
quality of multimodal fusion, Han et al. (2021b)
construct a framework named MultiModal InfoMax
(MMIM) to maximize the mutual information in
unimodal input pairs as well as to obtain task-related information through the multimodal fusion process.
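For intuition only, mutual information between unimodal representations is typically maximized through a tractable contrastive lower bound such as InfoNCE; the sketch below shows the general form with hypothetical encoder outputs of shape (B, d), and is not MMIM's exact objective.

```python
import torch
import torch.nn.functional as F

def infonce_loss(z_a, z_b, temperature=0.1):
    # Contrastive (InfoNCE) lower bound on the mutual information
    # between two batches of unimodal representations of shape (B, d):
    # matching rows are positive pairs, all other pairings are negatives.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Minimizing this cross-entropy raises the MI lower bound.
    return F.cross_entropy(logits, targets)
```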
Besides, Han et al. (2021a) employ an end-to-end network, the Bi-Bimodal Fusion Network (BBFN), to better exploit the dynamics of independence and correlation between modalities. Because of the unified multimodal annotation, previous methods are restricted in capturing differentiated information. Yu et al. (2021) design a label generation module based on a self-supervised learning strategy and then jointly train the multimodal and unimodal tasks to learn both the consistency and the differences between modalities. However, limited by the pre-processed features, their results show that the generated audio and vision labels are not informative enough.
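As a rough sketch of this joint training idea (with hypothetical head names and loss weighting, not the authors' exact formulation), the overall objective combines the multimodal task loss, supervised by the human annotation, with unimodal losses supervised by the generated labels:

```python
import torch.nn.functional as F

def joint_loss(outputs, labels, alpha=0.1):
    # Multimodal head supervised by the human annotation; each unimodal
    # head supervised by its self-supervised, generated label.
    loss = F.mse_loss(outputs["multi"], labels["multi"])
    for m in ("text", "audio", "video"):
        loss = loss + alpha * F.mse_loss(outputs[m], labels[m])
    return loss
```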
2.3 Missing Modality Problem
Compared with unimodal learning methods, multimodal learning has achieved great success. It improves the performance of emotion recognition tasks by effectively combining information from different modalities. However, in reality multimodal data may have missing modalities for a variety of reasons, such as signal transmission error