Multi-Modal Human Authentication Using Silhouettes, Gait and RGB
Yuxiang Guo, Cheng Peng, Chun Pong Lau, Rama Chellappa
Johns Hopkins University, Baltimore, MD, USA
*Equal contribution.
Abstract: Whole-body-based human authentication is a
promising approach for remote biometrics scenarios. Current
literature focuses on either body recognition based on RGB
images or gait recognition based on body shapes and walking
patterns; both have their advantages and drawbacks. In this
work, we propose Dual-Modal Ensemble (DME), which com-
bines both RGB and silhouette data to achieve more robust
performances for indoor and outdoor whole-body-based recog-
nition. Within DME, we propose GaitPattern, which is inspired
by the double helical gait pattern used in traditional gait
analysis. The GaitPattern contributes to robust identification
performance over a large range of viewing angles. Extensive
experimental results on the CASIA-B dataset demonstrate that
the proposed method outperforms state-of-the-art recognition
systems. We also provide experimental results using the newly
collected BRIAR dataset.
I. INTRODUCTION
Body recognition from videos is an important yet chal-
lenging computer vision task. The objective is to determine
whether the subjects in different videos have the same
identity. Similar to face recognition, body recognition has ap-
plications ranging from surveillance to intelligent transporta-
tion. In particular, body recognition is more robust than face
recognition in many unconstrained situations, where faces are
acquired in non-cooperative conditions. Gait recognition [6],
[18], [5], which recognizes people based on their walking
patterns over time, similarly performs recognition based on
the body, but typically uses human silhouettes as input. In
comparison, body recognition methods exploit silhouettes as
well as RGB data.
There are advantages and disadvantages when body or
gait is used for remote identification separately. Specifically,
RGB videos contain rich information that can lead to signif-
icantly better performance; however, such information can
be ineffective in situations involving clothing change and
image quality degradation. The silhouette modality allows a
recognition network to focus purely on the subject’s body
shape and gait; as such it is less susceptible to clothing
changes. This robustness comes at the cost of generally
lower performance compared to methods that exploit RGB
data. Motivated by this observation, we propose Dual-Modal
Ensemble (DME), which performs learning in both RGB
and silhouette domains and leverages the model ensemble to
extract the most robust features for unconstrained situations.
DME is also flexible: either modality can be used on its own
when imaging conditions are ill-suited for the other.
Furthermore, within DME, we address several issues in
the gait recognition space. Due to the high computational
cost of processing video data through neural networks,
state-of-the-art (SoTA) methods rely on temporal pooling
operations to reduce the dimension from 3D to 2D. As such,
spatial information plays a major role in these recognition
algorithms. It has been observed across the gait recognition
literature that such systems tend to generalize poorly over
different view angles [6], [18], [5]. An efficient gait rep-
resentation in the temporal domain is thus highly desired
and can complement conventional gait recognition systems.
Such a temporal representation can provide knowledge of the
camera’s viewing angle, assuming the cameras are stationary
and the subject is walking in a normal pattern.
Inspired by traditional gait recognition approaches that
produce such a temporal representation through a Double He-
lical Signature (DHS) [24], we propose GaitPattern, which is
learned using a deep neural network and incorporated into
the recognition system. We empirically find that introducing
GaitPattern significantly improves performance at
challenging view angles.
In summary, this paper makes the following contributions:
• We propose Dual-Modal Ensemble, which incorporates
both gait and RGB modalities to perform body recognition;
such a design allows for both flexibility and superior
performance in indoor and outdoor scenes.
• We propose GaitPattern, which efficiently incorporates
temporal information into gait recognition and leads to
performance improvements in indoor scenes, especially
for challenging viewing angles.
• We perform extensive evaluations on both CASIA-B
and the Biometric Recognition and Identification at Altitude
and Range (BRIAR) dataset, an unconstrained face and
body recognition dataset.
II. RELATED WORK
A. Gait Recognition
Gait recognition aims to identify human subjects by their
walking pattern. In general, gait representation can be done
in 2D and 3D. There are many works [34], [4], [2] that
use 3D gait representations, which are obtained from multi-
camera setups. While 3D gait representation contains rich
information and provides compelling performances, a multi-
camera setup is often impractical for general applications.
Moreover, such a 3D approach is computationally costlier
than 2D approaches. There are typically three approaches
to obtain 2D gait representation. One approach uses gait
silhouettes [6], [18], [5], [10], [29], [11], [12], [13], [21],
[8], extracted from video frames of walking subjects. The
skeleton-based approaches [1], [17], [26], [25] use the key-
points of the joints, i.e. a skeleton representation, and extract
features from the skeleton. Another approach fuses features
from silhouettes and skeleton [27], [23].
B. Double Helical Signature
Human gait represented as a Double Helical Signature
(DHS) [24], also known as a type of “Frieze pattern” [22],
has been proposed and applied for gait sequence analysis
in the past, when limited computational power encouraged
researchers to find efficient representations to perform recog-
nition. As shown in Fig. 2, DHS can be generated by
extracting the horizontal slices of the subject’s knee from
every frame in a video and stacking the slices to form a time-
width image. Prior works [24], [22], [19], [20] have shown
that DHS can efficiently encode parameters such as step size
and gait/walking rate and also determine if the subject is
carrying objects.
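To make the construction concrete, the sketch below builds a DHS-style time-width image from a binary silhouette stack; the fixed knee-row index and the input shapes are illustrative assumptions rather than the exact procedure of [24]:

```python
import numpy as np

def double_helical_signature(silhouettes: np.ndarray, row: int) -> np.ndarray:
    """Stack one horizontal slice per frame into a time-width image.

    silhouettes: binary array of shape (T, H, W), one mask per frame.
    row: the image row to sample, e.g. approximately knee height.
    Returns an array of shape (T, W) whose braided bright bands trace
    the legs crossing over time (the "double helix").
    """
    assert silhouettes.ndim == 3, "expected a (T, H, W) silhouette stack"
    return silhouettes[:, row, :]

# Illustrative usage: a (T, H, W) stack with the knee row at ~75% of height.
sils = np.zeros((60, 128, 88), dtype=np.uint8)  # placeholder silhouettes
dhs = double_helical_signature(sils, row=int(0.75 * 128))
print(dhs.shape)  # (60, 88): time on one axis, body width on the other
```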
C. Video-based Body Recognition
Due to the availability of rich spatial and temporal in-
formation and computational resources, videos can be more
effective for whole body recognition. Accurate capture of
temporal information is a primary challenge. There are
mainly two methods to capture the temporal relations. Tem-
poral attention [7], [38], [35] is one method, which aims
to determine the effectiveness of each frame and discards
the low-quality ones. The other method is self-attention or
Graph Convolutional Networks (GCNs) [16], [30], [3], which
exploits the temporal relations.
D. Remote Biometric Recognition in the Wild
Remote biometrics in the wild has been studied for over
fifteen years. Some of the early efforts in remote biometrics
recognition were based on gait and faces. As discussed
before, the gait recognition works reported in [8], [12], [13],
[21] and video-based recognition works reported in [36],
[37] are some examples of early works on remote biometric
recognition efforts, although the images and videos were
collected at varying degrees of complexity. More recently,
face recognition at distances of 300-1000 meters has been ad-
dressed in [15], [14], [31], [32]. For example, [15], [14] pro-
pose a generative single frame restoration algorithm which
disentangles the blur and deformation due to atmospheric
turbulence and reconstructs a restored image. Extensive
experiments demonstrate the effectiveness of the proposed
restoration algorithm, which achieves satisfactory results at
300 and 650 meters, but the low accuracy of face detection
at 1000 meters affects the overall performance. Similar
efforts using generative models have been reported in [31],
[32]. While many datasets such as CASIA-B and CASIA-E
have been collected and many traditional and deep learning-
based algorithms have been evaluated on these datasets,
these datasets also vary in terms of how unconstrained the
collected videos are. Recently, the IARPA BRIAR program
has collected datasets of faces and whole bodies at distances
up to 500 meters. We present the results of whole-body
recognition based on silhouettes, gait and RGB using the
BRIAR dataset.
III. METHOD
A. Problem Formulation
Suppose the target video $V = [f_1, f_2, \ldots, f_T] \in \mathcal{F} := \mathbb{R}^{H \times W \times T \times 3}$ consists of $T$ video frames $f_i$, where $H$ and $W$ are the height and width of each frame. In $V$, we assume there is a corresponding subject, labeled as $y \in \mathcal{Y}$, where $y \in \{1, 2, \ldots, |\mathcal{Y}|\}$ and $\mathcal{Y}$ is the set of identities in the dataset. To represent gait, we obtain the human silhouette sequence $S = [s_1, s_2, \ldots, s_T] \in \mathcal{S} := \{0, 1\}^{H \times W \times T \times 1}$, which consists of binary masks obtained by image segmentation. Let $F_\theta$ be a parameterized model that maps any video in $\mathcal{F}$ to a feature vector $X = F_\theta(V)$. An accurate whole-body recognition system maps two videos $V_1$ and $V_2$ with the same identity to features that are close in the feature space, i.e., $D(F_\theta(V_1), F_\theta(V_2))$ is small, where $D$ is an appropriate distance function, e.g., the Euclidean distance. This can be done using either RGB frames, silhouette sequences, or both. For body recognition using RGB frames, the videos are first preprocessed by masking out the background based on $S$ to prevent overfitting, i.e., $V = V_{\mathrm{orig}} \odot S$, where $V_{\mathrm{orig}}$ is the original video and $\odot$ denotes the Hadamard product.
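As a minimal sketch of this masking step, assuming $V_{\mathrm{orig}}$ and $S$ are stored as NumPy arrays of shapes (T, H, W, 3) and (T, H, W, 1):

```python
import numpy as np

def mask_background(video: np.ndarray, silhouettes: np.ndarray) -> np.ndarray:
    """Zero out background pixels: V = V_orig ⊙ S (Hadamard product).

    video: array of shape (T, H, W, 3).
    silhouettes: binary {0, 1} array of shape (T, H, W, 1).
    The single mask channel broadcasts across the three RGB channels.
    """
    return video * silhouettes

# Illustrative usage with random placeholder data.
V_orig = np.random.rand(30, 128, 88, 3)
S = (np.random.rand(30, 128, 88, 1) > 0.5).astype(V_orig.dtype)
V = mask_background(V_orig, S)  # subject pixels kept, background zeroed
```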
In the following sections, we describe DME, which uses
RGB and silhouette together for more robust performances,
and GaitPattern, a gait recognition method that leverages the
efficient Double Helical Signature representation [24].
B. Dual-Modal Ensemble
Dual-Modal Ensemble uses two CNN architectures to
extract features from the individual modalities. As shown
in Fig. 1, this process can be expressed as:
$$X_f = F_f(V), \quad X_s = F_s(S), \tag{1}$$
where $F_f$ and $F_s$ are the feature extraction CNNs for the
respective modalities. Generally, $F_f$ and $F_s$ can be any well-
performing CNN architecture, and we elaborate on the details
of the network designs in Sections III-C and III-D. Using separate
feature extraction blocks for each modality allows for better
flexibility, as compared to combining the inputs and feature
extraction stages in a monolithic framework. Specifically,
if one input mode suffers from obviously degraded quality,
DME can still use the other mode to perform recognition.
We highlight this flexibility in Section IV-C.
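A minimal sketch of the two-branch structure in Eq. (1) follows; the `rgb_cnn` and `sil_cnn` backbones are placeholders, since the paper allows any well-performing CNN for $F_f$ and $F_s$:

```python
import torch
import torch.nn as nn

class DualModalFeatures(nn.Module):
    """Sketch of the two feature-extraction branches of Eq. (1).

    `rgb_cnn` and `sil_cnn` are placeholder backbones, not the
    paper's prescribed architectures.
    """
    def __init__(self, rgb_cnn: nn.Module, sil_cnn: nn.Module):
        super().__init__()
        self.rgb_cnn = rgb_cnn  # F_f: consumes the masked RGB video V
        self.sil_cnn = sil_cnn  # F_s: consumes the silhouette sequence S

    def forward(self, video: torch.Tensor, silhouettes: torch.Tensor):
        x_f = self.rgb_cnn(video)        # X_f = F_f(V)
        x_s = self.sil_cnn(silhouettes)  # X_s = F_s(S)
        # Either feature can be used alone when the other modality degrades.
        return x_f, x_s
```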
After $X_f$ and $X_s$ are extracted, they pass through the
respective MLP networks to produce the video identification
embeddings $l_f$ and $l_s$, which are defined as
$$l_f = M_f(X_f), \quad l_s = M_s(X_s), \tag{2}$$
where $M_f$ and $M_s$ are the MLP networks. For identification,
a triplet loss [9] is used to maximize the distance
between embeddings from different subjects and to minimize
the distance between embeddings from the same subject.
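As a sketch, a standard triplet margin formulation consistent with [9] can be written over these embeddings as follows; the margin value of 0.2 is an illustrative assumption, not the paper's setting:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Standard triplet loss over identification embeddings.

    Pulls embeddings of the same subject (anchor, positive) together and
    pushes embeddings of different subjects (anchor, negative) at least
    `margin` apart.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # same identity
    d_neg = F.pairwise_distance(anchor, negative)  # different identity
    return F.relu(d_pos - d_neg + margin).mean()
```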