[8], extracted from video frames of walking subjects. The
skeleton-based approaches [1], [17], [26], [25] use the keypoints of the joints, i.e., a skeleton representation, and extract features from it. Another line of work fuses features from silhouettes and skeletons [27], [23].
B. Double Helical Signature
Representing human gait as a Double Helical Signature (DHS) [24], also known as a type of “Frieze pattern” [22], was proposed for gait sequence analysis at a time when limited computational power encouraged researchers to seek efficient representations for recognition. As shown in Fig. 2, a DHS can be generated by extracting a horizontal slice at the subject's knee from every frame of a video and stacking the slices to form a time-width image. Prior works [24], [22], [19], [20] have shown
that DHS can efficiently encode parameters such as step size
and gait/walking rate and also determine if the subject is
carrying objects.
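To make the construction concrete, the following is a minimal sketch of the slice-and-stack step, assuming the silhouette frames are available as a NumPy array and that the knee row index is already known (in practice it would be tracked per frame; a fixed index is used here purely for illustration):

```python
import numpy as np

def double_helical_signature(silhouettes: np.ndarray, knee_row: int) -> np.ndarray:
    """Build a DHS-style time-width image from binary silhouette frames.

    silhouettes: array of shape (T, H, W) with values in {0, 1}.
    knee_row:    row index of the subject's knee (assumed fixed here).
    Returns an array of shape (T, W): one horizontal slice per frame,
    stacked over time, whose bands trace the legs' periodic crossings.
    """
    return np.stack([frame[knee_row, :] for frame in silhouettes], axis=0)
```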
C. Video-based Body Recognition
Owing to the availability of rich spatial and temporal information, together with adequate computational resources, videos can be more effective for whole-body recognition. Accurately capturing temporal information is a primary challenge, and there are two main approaches. The first, temporal attention [7], [38], [35], estimates the effectiveness of each frame and discards low-quality ones. The second uses self-attention or Graph Convolutional Networks (GCNs) [16], [30], [3] to exploit temporal relations across frames.
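As a rough illustration of the first family of methods, the sketch below implements a simple form of frame-level temporal attention; the linear scoring head is a placeholder and does not reproduce any of the cited architectures:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Weight per-frame features by a learned effectiveness score."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame quality score (placeholder)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, dim) per-frame features. Softmax over time assigns small
        # weights to low-quality frames, effectively discarding them from
        # the aggregated video-level feature.
        w = torch.softmax(self.score(x), dim=0)  # (T, 1)
        return (w * x).sum(dim=0)                # (dim,)
```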
D. Remote Biometric Recognition in the Wild
Remote biometrics in the wild has been studied for over fifteen years, with some of the earliest efforts based on gait and faces. As discussed before, the gait recognition works reported in [8], [12], [13], [21] and the video-based recognition works reported in [36], [37] are examples of such early efforts, although their images and videos were collected under varying degrees of complexity.
More recently, face recognition at distances of 300-1000 meters has been addressed in [15], [14], [31], [32]. For example, [15], [14] propose a generative single-frame restoration algorithm that disentangles the blur and deformation caused by atmospheric turbulence and reconstructs a restored image. Extensive experiments demonstrate the effectiveness of the proposed restoration algorithm, which achieves satisfactory results at 300 and 650 meters, although low face-detection accuracy at 1000 meters affects the overall performance. Similar
efforts using generative models have been reported in [31],
[32]. Many datasets, such as CASIA-B and CASIA-E, have been collected, and numerous traditional and deep learning-based algorithms have been evaluated on them; however, these datasets vary in how unconstrained the collected videos are. Recently, the IARPA BRIAR program has collected datasets of faces and whole bodies at distances up to 500 meters. We present results of whole-body recognition based on silhouettes, gait, and RGB using the BRIAR dataset.
III. METHOD
A. Problem Formulation
Suppose the target video $V = [f_1, f_2, \ldots, f_T] \in \mathcal{F} := \mathbb{R}^{H \times W \times T \times 3}$ consists of $T$ video frames $f_i$, where $H$ and $W$ are the height and width of each frame. We assume that $V$ contains a corresponding subject, labeled as $y \in \mathcal{Y}$, where $y \in \{1, 2, \ldots, |\mathcal{Y}|\}$ and $\mathcal{Y}$ is the set of subject labels in the dataset. To represent gait, we obtain the human silhouette sequence $S = [s_1, s_2, \ldots, s_T] \in \mathcal{S} := \{0,1\}^{H \times W \times T \times 1}$, which consists of binary masks obtained by image segmentation. Let $F_\theta$ be a parameterized model that maps any video in $\mathcal{F}$ to a feature vector $X = F_\theta(V)$. An accurate whole-body recognition system maps two videos $V_1$ and $V_2$ with the same identity to features that are close in the feature space, i.e., $D(F_\theta(V_1), F_\theta(V_2))$ is small, where $D$ is an appropriate distance function, e.g., the Euclidean distance. This can be done using either RGB frames, silhouette sequences, or both. For body recognition using RGB frames, the videos are first preprocessed by masking out the background based on $S$ to prevent overfitting, i.e., $V = V_{\mathrm{orig}} \odot S$, where $V_{\mathrm{orig}}$ is the original video and $\odot$ denotes the Hadamard product.
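A minimal sketch of these two steps, assuming the frames and masks are stored as tensors with the shapes given above (the function names are ours, for illustration only):

```python
import torch

def mask_background(v_orig: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """V = V_orig ⊙ S: zero out background pixels with the Hadamard product.

    v_orig: RGB video of shape (H, W, T, 3).
    s:      binary silhouette masks of shape (H, W, T, 1); the singleton
            channel axis broadcasts over the three RGB channels.
    """
    return v_orig * s

def identity_distance(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """D(F_theta(V1), F_theta(V2)) with D chosen as the Euclidean distance;
    a small value suggests the two videos depict the same subject."""
    return torch.linalg.vector_norm(x1 - x2)
```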
In the following sections, we describe DME, which uses RGB and silhouettes together for more robust performance, and GaitPattern, a gait recognition method that leverages the efficient Double Helical Signature representation [24].
B. Dual-Modal Ensemble
The Dual-Modal Ensemble uses two CNN architectures to extract features from the individual modalities. As shown in Fig. 1, this process can be expressed as:
$$X_f = F_f(V), \qquad X_s = F_s(S), \tag{1}$$
where $F_f$ and $F_s$ are the feature extraction CNNs for the respective modalities. In general, $F_f$ and $F_s$ can be any well-performing CNN architecture; we elaborate on the network designs in Sections III-C and III-D. Using separate
feature extraction blocks for each modality allows for better
flexibility, as compared to combining the inputs and feature
extraction stages in a monolithic framework. Specifically, if one input modality suffers from noticeably degraded quality, DME can still use the other to perform recognition.
We highlight this flexibility in Section IV-C.
After $X_f$ and $X_s$ are extracted, they pass through their respective MLP networks to produce the video identification embeddings $l_f$ and $l_s$, defined as
$$l_f = M_f(X_f), \qquad l_s = M_s(X_s), \tag{2}$$
where $M_f$ and $M_s$ are the MLP networks.
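Putting Eqs. (1) and (2) together, the following is a minimal sketch of the two-branch pipeline; the backbones and the two-layer MLP heads are placeholders, as the actual architectures are described in Sections III-C and III-D:

```python
import torch
import torch.nn as nn

class DualModalEnsemble(nn.Module):
    """Two independent branches: (F_f, M_f) for RGB, (F_s, M_s) for silhouettes."""

    def __init__(self, f_f: nn.Module, f_s: nn.Module, feat_dim: int, emb_dim: int):
        super().__init__()
        self.f_f, self.f_s = f_f, f_s  # modality-specific feature CNNs, Eq. (1)
        # Placeholder two-layer MLP heads standing in for M_f and M_s, Eq. (2).
        self.m_f = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))
        self.m_s = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))

    def forward(self, v: torch.Tensor, s: torch.Tensor):
        x_f, x_s = self.f_f(v), self.f_s(s)      # X_f = F_f(V), X_s = F_s(S)
        l_f, l_s = self.m_f(x_f), self.m_s(x_s)  # identification embeddings
        # Because the branches share no parameters, either embedding can be
        # used alone when the other modality is degraded (Section IV-C).
        return l_f, l_s
```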
For identification, a triplet loss [9] is used to maximize the distance between embeddings of different subjects and minimize the distance between embeddings of the same subject. Specifically: