
A Multimodal Sensor Fusion Framework Robust to Missing
Modalities for Person Recognition
Vijay John
vijay.john@riken.jp
Guardian Robot Project, RIKEN
Japan
Yasutomo Kawanishi
yasutomo.kawanishi@riken.jp
Guardian Robot Project, RIKEN
Japan
ABSTRACT
Utilizing the complementary characteristics of the audio, visible-camera,
and thermal-camera sensors, the robustness of person recognition can be
enhanced. Existing multimodal person recognition frameworks
are primarily formulated assuming that multimodal data is always
available. In this paper, we propose a novel trimodal sensor fusion
framework using the audio, visible, and thermal camera, which
addresses the missing modality problem. In the framework, a novel
deep latent embedding framework, termed the AVTNet, is proposed
to learn multiple latent embeddings. Also, a novel loss function,
termed missing modality loss, accounts for possible missing modal-
ities based on the triplet loss calculation while learning the indi-
vidual latent embeddings. Additionally, a joint latent embedding
utilizing the trimodal data is learnt using the multi-head attention transformer, which assigns attention weights to the different
modalities. The different latent embeddings are subsequently used
to train a deep neural network. The proposed framework is vali-
dated on the Speaking Faces dataset. A comparative analysis with
baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy while accounting
for missing modalities.
KEYWORDS
missing modality loss, multimodal transformer, person recognition
ACM Reference Format:
Vijay John and Yasutomo Kawanishi. 2022. A Multimodal Sensor Fusion
Framework Robust to Missing Modalities for Person Recognition. In ACM
Multimedia Asia (MMAsia ’22), December 13–16, 2022, Tokyo, Japan. ACM,
New York, NY, USA, 5 pages. https://doi.org/10.1145/3551626.3564965
1 INTRODUCTION
Audio-visible person recognition (AVPR) [5, 11, 15, 19] and thermal-visible person recognition (TVPR) [2, 7, 8, 16, 18] report enhanced
recognition accuracy owing to the complementary characteristics
of the different sensors. The performance of AVPR and TVPR
can be further enhanced using an audio-visible-thermal camera
person recognition (AVTPR) framework by utilizing the sensor
characteristics of all three sensors.
Existing works in audio-video and thermal-video fusion are
formulated with the assumption that all the input modalities are
present during training and inference. However, in real-world ap-
plications, modalities may be missing due to conditions such as
sensor malfunction or failure (Fig. 1(a)). Under these circumstances,
the performance of existing deep fusion frameworks is affected.
In this paper, we propose an AVTPR framework that addresses
the missing modality problem. The proposed AVTPR framework
consists of a deep latent embedding learning framework, termed
AVTNet, which simultaneously performs multimodal sensor fusion
while learning multiple latent embeddings. To account for the miss-
ing modalities, the AVTNet utilizes a loss function strategy, termed
missing modality loss, and transformers [20] to learn multiple latent embeddings.
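To illustrate the idea, a minimal sketch follows, assuming a standard triplet formulation in PyTorch; the function name, margin value, and mask convention are our illustrative assumptions, not the paper's exact formulation. Triplets for which a modality is unobserved are simply masked out of the loss:

```python
import torch
import torch.nn.functional as F


def missing_modality_triplet_loss(anchor, positive, negative, present, margin=0.2):
    """Illustrative triplet loss that skips triplets with a missing modality.

    anchor, positive, negative: (B, D) embeddings from one modality branch.
    present: (B,) boolean mask; True where the modality is observed for
    the whole triplet. All names here are illustrative assumptions.
    """
    d_ap = F.pairwise_distance(anchor, positive)  # anchor-positive distances
    d_an = F.pairwise_distance(anchor, negative)  # anchor-negative distances
    hinge = F.relu(d_ap - d_an + margin)          # standard triplet hinge
    # Mask out triplets whose modality is missing so they contribute
    # neither loss nor gradient; average over observed triplets only.
    hinge = hinge * present.float()
    return hinge.sum() / present.float().sum().clamp(min=1.0)


# Example: a batch of 4 triplets from one modality branch, one with the
# modality missing.
emb = lambda: torch.randn(4, 128)
mask = torch.tensor([True, True, False, True])
loss = missing_modality_triplet_loss(emb(), emb(), emb(), mask)
```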
The AVTNet learns four embeddings from the multimodal data,
represented by three modal-specific embeddings and one joint multimodal embedding. The missing modality loss is used to learn the
three modal-specific embeddings. The joint multimodal embedding
learns the joint latent representation of the visible, thermal, and au-
dio features using the multi-head attention-based transformer [20].
By utilizing the attention mechanism, the attention weights account
for any missing modality while learning the joint embedding. The
four learnt embeddings are used to train a deep learning-based
person recognition model.
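As an illustrative sketch of such attention-based fusion (the class name, feature dimensions, and masking convention below are our assumptions, not the paper's implementation), the three modality features can be treated as a three-token sequence and fused with multi-head self-attention, with a padding mask diverting attention away from missing modalities:

```python
import torch
import torch.nn as nn


class TrimodalAttentionFusion(nn.Module):
    """Illustrative joint-embedding fusion over three modality tokens.

    Visible, thermal, and audio features are projected to a shared width,
    stacked as a 3-token sequence, and fused with multi-head self-attention;
    the attention weights act as per-modality importance weights.
    """

    def __init__(self, dims=(512, 512, 128), d_model=256, n_heads=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feats, present):
        # feats: [visible, thermal, audio], each (B, d_i); present: (B, 3)
        # boolean, True where observed (assume >= 1 modality per sample).
        tokens = torch.stack(
            [proj(f) for proj, f in zip(self.proj, feats)], dim=1)
        # key_padding_mask marks positions to IGNORE, hence the inversion;
        # attention then renormalizes over the observed modalities.
        fused, weights = self.attn(tokens, tokens, tokens,
                                   key_padding_mask=~present)
        # Mean-pool the observed tokens into one joint embedding.
        m = present.unsqueeze(-1).float()
        joint = (fused * m).sum(1) / m.sum(1).clamp(min=1.0)
        return joint, weights  # (B, d_model), (B, 3, 3)
```

Because the softmax renormalizes over the unmasked positions, the weight of an absent modality is redistributed to the remaining ones, matching the intuition that the attention weights account for any missing modality.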
The proposed framework is validated on the Speaking Faces pub-
lic dataset [1]. A comparative analysis is performed with baseline
algorithms, and a detailed ablation study is conducted. The results
show that the proposed framework addresses the missing modality
problem.
The main contributions to literature are as follows:
• A novel computer vision application framework, termed the AVTNet, for visible camera, thermal camera, and audio-based person recognition. The AVTNet addresses the missing modality problem.
• We introduce a tailored loss function strategy, termed the missing modality loss, which learns the individual modal-specific embeddings while accounting for missing modalities.
The remainder of the paper is structured as follows. The literature
is reviewed in Section 2. The proposed framework is presented in
Section 3, and the experimental results are presented in Section 4.
Finally, we summarize the research in Section 5.
2 LITERATURE REVIEW
In the literature, thermal-visible [2, 7, 8, 16, 18] (TVPR) and audio-visible [3, 4, 9, 12, 17, 19, 22] (AVPR) person recognition approaches report enhanced recognition accuracy owing to the complementary