
A Multimodal Sensor Fusion Framework Robust to Missing
Modalities for Person Recognition
Vijay John
vijay.john@riken.jp
Guardian Robot Project, RIKEN
Japan
Yasutomo Kawanishi
yasutomo.kawanishi@riken.jp
Guardian Robot Project, RIKEN
Japan
ABSTRACT
Utilizing the complementary characteristics of the audio, visible-camera,
and thermal-camera sensors, the robustness of person recognition can be
enhanced. Existing multimodal person recognition frameworks
are primarily formulated assuming that multimodal data is always
available. In this paper, we propose a novel trimodal sensor fusion
framework using the audio, visible, and thermal camera, which
addresses the missing modality problem. In the framework, a novel
deep latent embedding framework, termed the AVTNet, is proposed
to learn multiple latent embeddings. Also, a novel loss function,
termed missing modality loss, accounts for possible missing modal-
ities based on the triplet loss calculation while learning the indi-
vidual latent embeddings. Additionally, a joint latent embedding
utilizing the trimodal data is learnt using the multi-head attention transformer, which assigns attention weights to the different
modalities. The different latent embeddings are subsequently used
to train a deep neural network. The proposed framework is vali-
dated on the Speaking Faces dataset. A comparative analysis with
baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy while accounting
for missing modalities.
KEYWORDS
missing modality loss, multimodal transformer, person recognition
ACM Reference Format:
Vijay John and Yasutomo Kawanishi. 2022. A Multimodal Sensor Fusion
Framework Robust to Missing Modalities for Person Recognition. In ACM
Multimedia Asia (MMAsia ’22), December 13–16, 2022, Tokyo, Japan. ACM,
New York, NY, USA, 5 pages. https://doi.org/10.1145/3551626.3564965
1 INTRODUCTION
Audio-visible person recognition (AVPR) [5, 11, 15, 19] and thermal-visible person recognition (TVPR) [2, 7, 8, 16, 18] report enhanced
recognition accuracy owing to the complementary characteristics
of the different sensors. The performance of AVPR and TVPR
can be further enhanced using an audio-visible-thermal camera
person recognition (AVTPR) framework by utilizing the sensor
characteristics of all three sensors.
Existing works in audio-video and thermal-video fusion are
formulated with the assumption that all the input modalities are
present during training and inference. However, in real-world ap-
plications, modalities may be missing due to conditions such as
sensor malfunction or failure (Fig. 1(a)). Under these circumstances,
the performance of existing deep fusion frameworks is affected.
In this paper, we propose an AVTPR framework that addresses
the missing modality problem. The proposed AVTPR framework
consists of a deep latent embedding learning framework, termed
AVTNet, which simultaneously performs multimodal sensor fusion
while learning multiple latent embeddings. To account for the miss-
ing modalities, the AVTNet utilizes a loss function strategy, termed
missing modality loss, and transformers [20] to learn multiple latent embeddings.
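To illustrate the idea, a minimal sketch follows, assuming a standard triplet formulation in PyTorch; the function name, margin value, and mask convention are our illustrative assumptions, not the paper's exact formulation. Triplets for which a modality is unobserved are simply masked out of the loss:

```python
import torch
import torch.nn.functional as F


def missing_modality_triplet_loss(anchor, positive, negative, present, margin=0.2):
    """Illustrative triplet loss that skips triplets with a missing modality.

    anchor, positive, negative: (B, D) embeddings from one modality branch.
    present: (B,) boolean mask; True where the modality is observed for
    the whole triplet. All names here are illustrative assumptions.
    """
    d_ap = F.pairwise_distance(anchor, positive)  # anchor-positive distances
    d_an = F.pairwise_distance(anchor, negative)  # anchor-negative distances
    hinge = F.relu(d_ap - d_an + margin)          # standard triplet hinge
    # Mask out triplets whose modality is missing so they contribute
    # neither loss nor gradient; average over observed triplets only.
    hinge = hinge * present.float()
    return hinge.sum() / present.float().sum().clamp(min=1.0)


# Example: a batch of 4 triplets from one modality branch, one with the
# modality missing.
emb = lambda: torch.randn(4, 128)
mask = torch.tensor([True, True, False, True])
loss = missing_modality_triplet_loss(emb(), emb(), emb(), mask)
```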
The AVTNet learns four embeddings from the multimodal data,
represented by three modal-specific embeddings and one joint multimodal embedding. The missing modality loss is used to learn the
three modal-specific embeddings. The joint multimodal embedding
learns the joint latent representation of the visible, thermal, and au-
dio features using the multi-head attention-based transformer [20].
By utilizing the attention mechanism, the attention weights account
for any missing modality while learning the joint embedding. The
four learnt embeddings are used to train a deep learning-based
person recognition model.
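As an illustrative sketch of such attention-based fusion (the class name, feature dimensions, and masking convention below are our assumptions, not the paper's implementation), the three modality features can be treated as a three-token sequence and fused with multi-head self-attention, with a padding mask diverting attention away from missing modalities:

```python
import torch
import torch.nn as nn


class TrimodalAttentionFusion(nn.Module):
    """Illustrative joint-embedding fusion over three modality tokens.

    Visible, thermal, and audio features are projected to a shared width,
    stacked as a 3-token sequence, and fused with multi-head self-attention;
    the attention weights act as per-modality importance weights.
    """

    def __init__(self, dims=(512, 512, 128), d_model=256, n_heads=4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feats, present):
        # feats: [visible, thermal, audio], each (B, d_i); present: (B, 3)
        # boolean, True where observed (assume >= 1 modality per sample).
        tokens = torch.stack(
            [proj(f) for proj, f in zip(self.proj, feats)], dim=1)
        # key_padding_mask marks positions to IGNORE, hence the inversion;
        # attention then renormalizes over the observed modalities.
        fused, weights = self.attn(tokens, tokens, tokens,
                                   key_padding_mask=~present)
        # Mean-pool the observed tokens into one joint embedding.
        m = present.unsqueeze(-1).float()
        joint = (fused * m).sum(1) / m.sum(1).clamp(min=1.0)
        return joint, weights  # (B, d_model), (B, 3, 3)
```

Because the softmax renormalizes over the unmasked positions, the weight of an absent modality is redistributed to the remaining ones, matching the intuition that the attention weights account for any missing modality.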
The proposed framework is validated on the Speaking Faces pub-
lic dataset [1]. A comparative analysis is performed with baseline
algorithms, and a detailed ablation study is conducted. The results
show that the proposed framework addresses the missing modality
problem.
The main contributions to literature are as follows:
• A novel computer vision application framework, termed the AVTNet, for visible camera, thermal camera, and audio-based person recognition. The AVTNet addresses the missing modality problem.
• We introduce a tailored loss function strategy, termed the missing modality loss, which learns the individual modal-specific embeddings while accounting for missing modalities.
The remainder of the paper is structured as follows. The literature
is reviewed in Section 2. The proposed framework is presented in
Section 3, and the experimental results are presented in Section 4.
Finally, we summarize the research in Section 5.
2 LITERATURE REVIEW
In the literature, thermal-visible [2, 7, 8, 16, 18] (TVPR) and audio-visible [3, 4, 9, 12, 17, 19, 22] (AVPR) person recognition approaches report enhanced recognition accuracy owing to the complementary