[8], extracted from video frames of walking subjects. The
skeleton-based approaches [1], [17], [26], [25] use the keypoints of the joints, i.e., a skeleton representation, and extract features from it. Another line of work fuses features from silhouettes and skeletons [27], [23].
B. Double Helical Signature
Representing human gait as a Double Helical Signature (DHS) [24], also known as a type of “Frieze pattern” [22], was proposed for gait sequence analysis at a time when limited computational power encouraged researchers to seek efficient representations for recognition. As shown in Fig. 2, a DHS can be generated by extracting a horizontal slice at the subject's knee from every frame of a video and stacking the slices to form a time-width image. Prior works [24], [22], [19], [20] have shown
that DHS can efficiently encode parameters such as step size
and gait/walking rate and also determine if the subject is
carrying objects.
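To make the construction concrete, the following is a minimal sketch of the slice-and-stack step, assuming the silhouette frames are available as a NumPy array and that the knee row index is already known (in practice it would be tracked per frame; a fixed index is used here purely for illustration):

```python
import numpy as np

def double_helical_signature(silhouettes: np.ndarray, knee_row: int) -> np.ndarray:
    """Build a DHS-style time-width image from binary silhouette frames.

    silhouettes: array of shape (T, H, W) with values in {0, 1}.
    knee_row:    row index of the subject's knee (assumed fixed here).
    Returns an array of shape (T, W): one horizontal slice per frame,
    stacked over time, whose bands trace the legs' periodic crossings.
    """
    return np.stack([frame[knee_row, :] for frame in silhouettes], axis=0)
```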
C. Video-based Body Recognition
Owing to the availability of rich spatial and temporal information, together with adequate computational resources, videos can be more effective for whole-body recognition. Accurately capturing temporal information is a primary challenge, and there are two main approaches. The first, temporal attention [7], [38], [35], estimates the effectiveness of each frame and discards low-quality ones. The second uses self-attention or Graph Convolutional Networks (GCNs) [16], [30], [3] to exploit temporal relations across frames.
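As a rough illustration of the first family of methods, the sketch below implements a simple form of frame-level temporal attention; the linear scoring head is a placeholder and does not reproduce any of the cited architectures:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Weight per-frame features by a learned effectiveness score."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame quality score (placeholder)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, dim) per-frame features. Softmax over time assigns small
        # weights to low-quality frames, effectively discarding them from
        # the aggregated video-level feature.
        w = torch.softmax(self.score(x), dim=0)  # (T, 1)
        return (w * x).sum(dim=0)                # (dim,)
```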
D. Remote Biometric Recognition in the Wild
Remote biometrics in the wild has been studied for over fifteen years, with some of the earliest efforts based on gait and faces. As discussed before, the gait recognition works reported in [8], [12], [13], [21] and the video-based recognition works reported in [36], [37] are examples of such early efforts, although their images and videos were collected under varying degrees of complexity.
More recently, face recognition at distances of 300-1000 meters has been addressed in [15], [14], [31], [32]. For example, [15], [14] propose a generative single-frame restoration algorithm that disentangles the blur and deformation caused by atmospheric turbulence and reconstructs a restored image. Extensive experiments demonstrate the effectiveness of the proposed restoration algorithm, which achieves satisfactory results at 300 and 650 meters, although low face-detection accuracy at 1000 meters affects the overall performance. Similar
efforts using generative models have been reported in [31],
[32]. Many datasets, such as CASIA-B and CASIA-E, have been collected, and numerous traditional and deep learning-based algorithms have been evaluated on them; however, these datasets vary in how unconstrained the collected videos are. Recently, the IARPA BRIAR program has collected datasets of faces and whole bodies at distances up to 500 meters. We present results of whole-body recognition based on silhouettes, gait, and RGB using the BRIAR dataset.
III. METHOD
A. Problem Formulation
Suppose the target video $V = [f_1, f_2, \ldots, f_T] \in \mathcal{F} := \mathbb{R}^{H \times W \times T \times 3}$ consists of $T$ video frames $f_i$, where $H$ and $W$ are the height and width of each frame. We assume that $V$ contains a corresponding subject, labeled as $y \in \mathcal{Y}$, where $y \in \{1, 2, \ldots, |\mathcal{Y}|\}$ and $\mathcal{Y}$ is the set of subject labels in the dataset. To represent gait, we obtain the human silhouette sequence $S = [s_1, s_2, \ldots, s_T] \in \mathcal{S} := \{0,1\}^{H \times W \times T \times 1}$, which consists of binary masks obtained by image segmentation. Let $F_\theta$ be a parameterized model that maps any video in $\mathcal{F}$ to a feature vector $X = F_\theta(V)$. An accurate whole-body recognition system maps two videos $V_1$ and $V_2$ with the same identity to features that are close in the feature space, i.e., $D(F_\theta(V_1), F_\theta(V_2))$ is small, where $D$ is an appropriate distance function, e.g., the Euclidean distance. This can be done using either RGB frames, silhouette sequences, or both. For body recognition using RGB frames, the videos are first preprocessed by masking out the background based on $S$ to prevent overfitting, i.e., $V = V_{\mathrm{orig}} \odot S$, where $V_{\mathrm{orig}}$ is the original video and $\odot$ denotes the Hadamard product.
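A minimal sketch of these two steps, assuming the frames and masks are stored as tensors with the shapes given above (the function names are ours, for illustration only):

```python
import torch

def mask_background(v_orig: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """V = V_orig ⊙ S: zero out background pixels with the Hadamard product.

    v_orig: RGB video of shape (H, W, T, 3).
    s:      binary silhouette masks of shape (H, W, T, 1); the singleton
            channel axis broadcasts over the three RGB channels.
    """
    return v_orig * s

def identity_distance(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """D(F_theta(V1), F_theta(V2)) with D chosen as the Euclidean distance;
    a small value suggests the two videos depict the same subject."""
    return torch.linalg.vector_norm(x1 - x2)
```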
In the following sections, we describe DME, which uses RGB and silhouettes together for more robust performance, and GaitPattern, a gait recognition method that leverages the efficient Double Helical Signature representation [24].
B. Dual-Modal Ensemble
The Dual-Modal Ensemble uses two CNN architectures to extract features from the individual modalities. As shown in Fig. 1, this process can be expressed as:
$$X_f = F_f(V), \qquad X_s = F_s(S), \tag{1}$$
where $F_f$ and $F_s$ are the feature extraction CNNs for the respective modalities. In general, $F_f$ and $F_s$ can be any well-performing CNN architecture; we elaborate on the network designs in Sections III-C and III-D. Using separate
feature extraction blocks for each modality allows for better
flexibility, as compared to combining the inputs and feature
extraction stages in a monolithic framework. Specifically, if one input modality suffers from noticeably degraded quality, DME can still use the other to perform recognition.
We highlight this flexibility in Section IV-C.
After $X_f$ and $X_s$ are extracted, they pass through their respective MLP networks to produce the video identification embeddings $l_f$ and $l_s$, defined as
$$l_f = M_f(X_f), \qquad l_s = M_s(X_s), \tag{2}$$
where $M_f$ and $M_s$ are the MLP networks.
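Putting Eqs. (1) and (2) together, the following is a minimal sketch of the two-branch pipeline; the backbones and the two-layer MLP heads are placeholders, as the actual architectures are described in Sections III-C and III-D:

```python
import torch
import torch.nn as nn

class DualModalEnsemble(nn.Module):
    """Two independent branches: (F_f, M_f) for RGB, (F_s, M_s) for silhouettes."""

    def __init__(self, f_f: nn.Module, f_s: nn.Module, feat_dim: int, emb_dim: int):
        super().__init__()
        self.f_f, self.f_s = f_f, f_s  # modality-specific feature CNNs, Eq. (1)
        # Placeholder two-layer MLP heads standing in for M_f and M_s, Eq. (2).
        self.m_f = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))
        self.m_s = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))

    def forward(self, v: torch.Tensor, s: torch.Tensor):
        x_f, x_s = self.f_f(v), self.f_s(s)      # X_f = F_f(V), X_s = F_s(S)
        l_f, l_s = self.m_f(x_f), self.m_s(x_s)  # identification embeddings
        # Because the branches share no parameters, either embedding can be
        # used alone when the other modality is degraded (Section IV-C).
        return l_f, l_s
```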
For identification, a triplet loss [9] is used to maximize the distance between embeddings of different subjects and minimize the distance between embeddings of the same subject. Specifically: