
MM-Align: Learning Optimal Transport-based Alignment Dynamics for
Fast and Accurate Inference on Missing Modality Sequences
Wei Han
Hui Chen
Min-Yen Kan♣
Soujanya Poria
DeCLaRe Lab, Singapore University of Technology and Design, Singapore
♣National University of Singapore, Singapore
{wei_han,hui_chen}@mymail.sutd.edu.sg
kanmy@comp.nus.edu.sg, sporia@sutd.edu.sg
Abstract
Existing multimodal tasks mostly target the complete input modality setting, i.e., each modality is either complete or completely missing in both training and test sets. However, the randomly missing situation remains underexplored. In this paper, we present a novel approach named MM-Align to address the missing-modality inference problem. Concretely, we propose 1) an alignment dynamics learning module based on the theory of optimal transport (OT) for indirect missing data imputation; 2) a denoising training algorithm to simultaneously enhance the imputation results and the backbone network performance. Compared with previous methods, which are devoted to reconstructing the missing inputs, MM-Align learns to capture and imitate the alignment dynamics between modality sequences. Results of comprehensive experiments on three datasets covering two multimodal tasks empirically demonstrate that our method can perform more accurate and faster inference and relieve overfitting under various missing conditions. Our code is available at https://github.com/declare-lab/MM-Align.
1 Introduction
The topic of multimodal learning has grown unprecedentedly prevalent in recent years (Ramachandram and Taylor, 2017; Baltrušaitis et al., 2018), spanning a variety of machine learning tasks such as computer vision (Zhu et al., 2017; Nam et al., 2017), natural language processing (Fei et al., 2021; Ilharco et al., 2021), autonomous driving (Caesar et al., 2020), and medical care (Nascita et al., 2021). Despite the promising achievements in these fields, most existing approaches assume a complete input modality setting of the training data, in which every modality is either complete or completely missing (at inference time) in both the training and test sets (Pham et al., 2019; Tang et al., 2021; Zhao et al., 2021), as shown in Fig. 1a and 1b.
Such synergy between the train and test sets in the modality input patterns is usually far from realistic scenarios, where a certain portion of the data lacks parallel modality sequences, probably due to noise pollution during collection and preprocessing. In other words, data from each modality are more likely to be missing at random (Fig. 1c and 1d) than completely present or missing (Fig. 1a and 1b) (Pham et al., 2019; Tang et al., 2021; Zhao et al., 2021). Under the complete input modality setting, a popular family of routines for missing-modality inference designs intricate generative modules attached to the main network and trains the model under full supervision with complete modality data. By minimizing a customized reconstruction loss, the data restoration (a.k.a. missing data imputation (Van Buuren, 2018)) capability of the generative modules is enhanced (Pham et al., 2019; Wang et al., 2020; Tang et al., 2021) so that the model can be tested in the missing situations (Fig. 1b). However, we notice that (i) if modality-complete data in the training set is scarce, a severe overfitting issue may occur, especially when the generative model is large (Robb et al., 2020; Schick and Schütze, 2021; Ojha et al., 2021); and (ii) global attention-based imputation (i.e., attention over the whole sequence) may introduce unexpected noise, since true correspondence mainly exists between temporally adjacent parallel signals (Sakoe and Chiba, 1978). Ma et al. (2021) proposed to leverage a unit-length sequential representation, derived from the complete modality observed in the input, to represent the missing modality during training. Nevertheless, such methods inevitably overlook the temporal correlation between modality sequences and achieve only fair performance on the downstream tasks.
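To make the contrast concrete, the alignment that an OT-based view provides between two modality sequences can be sketched with entropy-regularized optimal transport computed by Sinkhorn iterations. This is an illustrative sketch only, not the authors' implementation: the function `sinkhorn_alignment`, its hyperparameters, and the uniform marginals are our assumptions.

```python
# Minimal sketch (our assumption, not MM-Align's code): entropy-regularized
# optimal transport between two modality sequences via Sinkhorn iterations,
# yielding a soft timestep-to-timestep alignment plan.
import numpy as np

def sinkhorn_alignment(x, y, eps=0.1, n_iters=1000):
    """Soft alignment (transport plan) between sequences x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    # Pairwise squared-Euclidean cost between timesteps, rescaled to [0, 1]
    # so the Gibbs kernel below does not underflow.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    cost /= cost.max()
    K = np.exp(-cost / eps)                            # Gibbs kernel
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)    # uniform marginals
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iters):                           # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                 # plan P: mass from x_i to y_j

rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
P = sinkhorn_alignment(x, y)
# Marginals of P match the uniform distributions over each sequence's timesteps.
print(np.allclose(P.sum(axis=0), 1.0 / 7))
```

Each entry P[i, j] measures how strongly timestep i of one modality corresponds to timestep j of the other; in the alignment-dynamics view, a model learns to imitate such plans so that an alignment can still be produced when one modality is missing, rather than regenerating the raw missing inputs.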
To mitigate these issues, in this paper we present
MM-Align, a novel framework for fast and effec-
tive multimodal learning on randomly missing mul-
arXiv:2210.12798v1 [cs.CL] 23 Oct 2022