
mAP compared to training on a typical PRID dataset [29].
The main contributions of this work are:
• We introduce a new PRID dataset, FRIDA, for indoor
person re-identification using time-synchronized over-
head fisheye cameras. This is the first overhead fisheye
dataset for PRID and will be made publicly available.
• We evaluate the performance of 10 state-of-the-art
PRID methods on FRIDA using two metrics. We com-
pare the performance of 6 of those algorithms, when
training on FRIDA against training on the non-fisheye
Market-1501 dataset [29].
2. Related Work
2.1. Datasets
There exist several datasets for person re-identification
using side-mounted rectilinear-lens cameras. Table 1 lists
key statistics of the most common ones: VIPeR [11],
PRID 2011 [14], Airport [15], CUHK03 [18], GRID [20],
MSMT17 [25], Market-1501 [29] and iLIDS [30], but more
details can be found in [27]. All these datasets have been
designed with the goal of matching the image of a person
from the query set to an image from the gallery set, and the
query and gallery sets consist of images captured by differ-
ent cameras. Moreover, different cameras have no field-of-view overlap, so query and gallery images of the same identity have been captured at different time instants. Finally, in
most of these datasets there are, typically, multiple gallery
images having the same ID as the query image.
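The protocol described above can be sketched as follows. This is a minimal illustration with synthetic features, not any dataset's actual evaluation code; all feature values, identities, and camera IDs are made up, and the same-camera exclusion rule follows the convention described in the text:

```python
import numpy as np

# Toy query/gallery matching: each image has a feature vector, an
# identity label, and a camera ID; gallery candidates captured by the
# query's own camera are excluded, as in the standard PRID protocol.
rng = np.random.default_rng(0)
gallery_feats = rng.normal(size=(5, 8))
gallery_ids = np.array([1, 1, 2, 3, 3])
gallery_cams = np.array([0, 1, 1, 0, 1])

# A query of identity 1, seen by camera 0; its features resemble
# gallery image 1 (same person, different camera).
query_feat = gallery_feats[1] + 0.01 * rng.normal(size=8)
query_id, query_cam = 1, 0

# Euclidean distance from the query to every gallery image.
dists = np.linalg.norm(gallery_feats - query_feat, axis=1)

# Exclude gallery images from the query's own camera.
dists[gallery_cams == query_cam] = np.inf

best = int(np.argmin(dists))
print(gallery_ids[best] == query_id)  # rank-1 correct for this query
```

Rank-1 accuracy and mAP are then computed by aggregating such per-query rankings over the whole query set.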
While there exist people-focused datasets captured by
overhead fisheye-lens cameras (PIROPO [7], BOMNI [8],
MW [21], HABBOF [17], CEPDOF [9], WEPDTOF [24]),
they have been developed with the goal of people detection
and, in some cases, tracking. However, each of these datasets consists of frames from a single camera, which severely limits the variability of body appearance, unlike in FRIDA.
Table 1. Commonly-used image datasets for person re-identification. (BBox = bounding box)

Dataset             Year   # BBoxes   # Cameras   Frame Resol.
VIPeR [11]          2007      1,264       2       Fixed
iLIDS [30]          2009        476       2       Variable
GRID [20]           2009      1,275       8       Variable
PRID 2011 [14]      2011     24,541       2       Fixed
CUHK03 [18]         2014     13,164       2       Variable
Market-1501 [29]    2015     32,668       6       Fixed
Airport [15]        2017     39,902       6       Fixed
MSMT17 [25]         2018    126,441      15       Variable
FRIDA               2022    242,809       3       Fixed
2.2. Algorithms
Person re-identification using rectilinear-lens cameras is
a well-studied problem. Early approaches were model-based [12, 10, 19, 16] and used hand-crafted features. Recent approaches use deep learning [22, 6, 5, 4, 28, 31, 26, 27] and outperform the traditional methods.
Sun et al. proposed PCB [22], in which feature vectors are uniformly partitioned in an intermediate layer to obtain part-informed features. This structure allows the network to focus separately on different parts of an image and to extract local information for each part. Zheng et al. proposed a network called Pyramid [28], which focuses not only on part-informed local features but also on global features and gradual cues. Pyramid achieves this through a
coarse-to-fine model, which performs image matching by
leveraging information from different spatial scales. Chen et al. proposed an attention-based network called ABD-Net [6], which, rather than focusing on a small portion of an image, attends to wider regions by means of diverse attention maps. This is accomplished by combining two separate modules: one module focuses on the context-wise relevance of pixels, while the other focuses on the spatial relevance of
these pixels. Zhu et al. proposed a network called VA-
reID [31] that allows matching of people regardless of the
viewpoint from which they were captured. Instead of cre-
ating a separate space for each viewpoint (i.e., front, side,
back), they create a unified hyperspace which accommo-
dates viewpoints in-between the main viewpoints (e.g., side-front, side-back, etc.). Recently, Wieczorek et al. proposed
a CTL model (Centroid Triplet Loss model) [26], which ex-
tends the triplet loss. When working with triplet loss, it is
typical to choose one positive sample and one negative sam-
ple for an anchor. However, in the CTL model, instead of
choosing a single sample, a centroid is computed over a set
of samples which significantly improves performance.
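The centroid idea behind CTL can be sketched in a few lines. This is a hedged illustration of the general mechanism only, with synthetic embeddings and an arbitrary margin value, not the authors' implementation:

```python
import numpy as np

# Centroid-based triplet comparison: instead of one positive and one
# negative sample per anchor, compare the anchor to the mean embedding
# (centroid) of each identity's samples.
rng = np.random.default_rng(2)
anchor = rng.normal(size=16)
pos_set = anchor + 0.1 * rng.normal(size=(4, 16))  # same identity
neg_set = rng.normal(size=(4, 16))                 # different identity

pos_centroid = pos_set.mean(axis=0)
neg_centroid = neg_set.mean(axis=0)

# Standard triplet-loss form, applied to centroids rather than samples.
margin = 0.3  # arbitrary value for illustration
d_pos = np.linalg.norm(anchor - pos_centroid)
d_neg = np.linalg.norm(anchor - neg_centroid)
loss = max(0.0, d_pos - d_neg + margin)
```

Averaging over a set of samples makes the target less sensitive to any single outlier sample, which is the intuition behind the reported performance gain.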
The methods above have been designed for and tested on
images from rectilinear-lens cameras. Very few PRID meth-
ods have been developed for overhead fisheye cameras. An
early approach, proposed by Barman et al. [1], matches im-
ages of people who appear at the same radial distance from a
camera (similar viewpoints). This is restrictive and leads to sub-par performance, since people often appear at different
distances from FOV centers in different cameras. Another
algorithm proposed by Blott et al. [2] applies tracking to
extract front-, back- and side-view images of a person. A
person-descriptor is built by fusing features extracted from
individual views. The algorithm does not perform PRID for
each pose/viewpoint. Moreover, there is no guarantee that
a person will appear at all 3 viewpoints during a recording,
thus limiting performance. Recently, Bone et al. [3] pro-
posed a PRID method for fisheye-lens cameras with over-
lapping FOVs. This approach leverages locations of peo-
ple in images instead of their appearance. Using a cali-
brated fisheye-lens model this method maps pixel-location
of a person in a query image to a pixel-location in a gallery
image. The mapped query-person location is compared to