A Multimodal Sensor Fusion Framework Robust to Missing Modalities for Person Recognition

Vijay John
vijay.john@riken.jp
Guardian Robot Project, RIKEN
Japan
Yasutomo Kawanishi
yasutomo.kawanishi@riken.jp
Guardian Robot Project, RIKEN
Japan
ABSTRACT
Utilizing the sensor characteristics of the audio, visible camera, and thermal camera, the robustness of person recognition can be enhanced. Existing multimodal person recognition frameworks are primarily formulated assuming that multimodal data is always available. In this paper, we propose a novel trimodal sensor fusion framework using the audio, visible, and thermal cameras, which addresses the missing modality problem. In the framework, a novel deep latent embedding framework, termed the AVTNet, is proposed to learn multiple latent embeddings. Also, a novel loss function, termed the missing modality loss, accounts for possible missing modalities based on the triplet loss calculation while learning the individual latent embeddings. Additionally, a joint latent embedding utilizing the trimodal data is learnt using the multi-head attention transformer, which assigns attention weights to the different modalities. The different latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the Speaking Faces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy while accounting for missing modalities.
KEYWORDS
missing modality loss, multimodal transformer, person recognition
ACM Reference Format:
Vijay John and Yasutomo Kawanishi. 2022. A Multimodal Sensor Fusion
Framework Robust to Missing Modalities for Person Recognition. In ACM
Multimedia Asia (MMAsia ’22), December 13–16, 2022, Tokyo, Japan. ACM,
New York, NY, USA, 5 pages. https://doi.org/10.1145/3551626.3564965
1 INTRODUCTION
Audio-visible person recognition (AVPR) [5, 11, 15, 19] and thermal-visible person recognition (TVPR) [2, 7, 8, 16, 18] report enhanced recognition accuracy owing to the complementary characteristics of the different sensors. The performance of the AVPR and TVPR
can be further enhanced using an audio-visible-thermal camera
person recognition (AVTPR) framework by utilizing the sensor
characteristics of all three sensors.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MMAsia ’22, December 13–16, 2022, Tokyo, Japan
©2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9478-9/22/12. . . $15.00
https://doi.org/10.1145/3551626.3564965
Existing works in audio-video and thermal-video fusion are formulated with the assumption that all the input modalities are present during training and inference. However, in real-world applications, there is the possibility of missing modalities due to conditions such as sensor malfunction or failure, resulting in missing data (Fig. 1 (a)). Under these circumstances, the performance of existing deep fusion frameworks is affected.
In this paper, we propose an AVTPR framework that addresses the missing modality problem. The proposed AVTPR framework consists of a deep latent embedding learning framework, termed AVTNet, which simultaneously performs multimodal sensor fusion while learning multiple latent embeddings. To account for the missing modalities, the AVTNet utilizes a loss function strategy, termed the missing modality loss, and transformers [20] to learn multiple latent embeddings.
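The excerpt does not spell out the missing modality loss itself; the following is a minimal pure-Python sketch of one plausible formulation, assuming the triplet loss is computed per modality and that terms for absent modalities are masked out of the average. The function names, the margin value, and the masking scheme are illustrative assumptions, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on a single embedding.
    The margin of 0.2 is an assumed hyperparameter."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

def missing_modality_loss(anchors, positives, negatives, present, margin=0.2):
    """Average the per-modality triplet losses, skipping any modality
    flagged as missing so that absent sensors contribute nothing.
    anchors/positives/negatives: dicts of modality name -> embedding.
    present: dict of modality name -> bool availability flag."""
    losses = [
        triplet_loss(anchors[m], positives[m], negatives[m], margin)
        for m in anchors
        if present[m]
    ]
    return sum(losses) / max(len(losses), 1)
```

Under this assumed scheme, a sample recorded with a failed thermal camera would simply be trained on the audio and visible triplet terms alone.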
The AVTNet learns four embeddings from the multimodal data, represented by three modal-specific embeddings and one joint multimodal embedding. The missing modality loss is used to learn the three modal-specific embeddings. The joint multimodal embedding learns the joint latent representation of the visible, thermal, and audio features using the multi-head attention-based transformer [20]. By utilizing the attention mechanism, the attention weights account for any missing modality while learning the joint embedding. The four learnt embeddings are used to train a deep learning-based person recognition model.
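As a rough illustration of how attention weights can account for a missing modality, the sketch below implements single-head scaled dot-product attention pooling over the three modality embeddings. This is a strong simplification of the paper's multi-head transformer; the mean-of-present-tokens query, the masking, and the function name are assumptions for illustration only. A masked modality receives a score of negative infinity and hence a softmax weight of exactly zero, so the joint embedding is a convex combination of the present modalities.

```python
import math

def masked_attention_pool(tokens, present):
    """Single-head attention pooling over modality tokens.
    tokens: list of equal-length embedding vectors (e.g. audio, visible, thermal).
    present: list of bools; at least one modality must be present."""
    dim = len(tokens[0])
    # Illustrative query: the mean of the present modality tokens.
    present_tokens = [t for t, p in zip(tokens, present) if p]
    query = [sum(t[i] for t in present_tokens) / len(present_tokens) for i in range(dim)]
    # Scaled dot-product scores; missing modalities are masked to -inf.
    scores = [
        sum(q * x for q, x in zip(query, t)) / math.sqrt(dim) if p else float('-inf')
        for t, p in zip(tokens, present)
    ]
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Joint embedding: attention-weighted sum of the modality tokens.
    joint = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
    return joint, weights
```

The key property this sketch demonstrates is that masking before the softmax renormalizes the attention over whatever modalities remain, rather than leaving a hole in the fused representation.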
The proposed framework is validated on the Speaking Faces public dataset [1]. A comparative analysis is performed with baseline algorithms, and a detailed ablation study is conducted. The results show that the proposed framework addresses the missing modality problem.
The main contributions to the literature are as follows:
• A novel computer vision application framework, termed the AVTNet, for visible camera, thermal camera, and audio-based person recognition. The AVTNet addresses the missing modality problem.
• We introduce a tailored loss function strategy, termed the missing modality loss, which learns the individual modal-specific embeddings while accounting for missing modalities.
The remainder of the paper is structured as follows. The literature
is reviewed in Section 2. The proposed framework is presented in
Section 3, and the experimental results are presented in Section 4.
Finally, we summarize the research in Section 5.
2 LITERATURE REVIEW
In the literature, the thermal-visible [2, 7, 8, 16, 18] (TVPR) and audio-visible [3, 4, 9, 12, 17, 19, 22] (AVPR) person recognition approaches report enhanced recognition accuracy owing to the complementary
arXiv:2210.10972v2 [cs.MM] 22 Oct 2022