
PS-ARM: An End-to-End Attention-aware
Relation Mixer Network for Person Search
Mustansar Fiaz1[0000-0003-2289-2284], Hisham Cholakkal1, Sanath Narayan2,
Rao Muhammad Anwer1, and Fahad Shahbaz Khan1
1Department of Computer Vision, Mohamed bin Zayed University of Artificial
Intelligence, Abu Dhabi, UAE.
(mustansar.fiaz, hisham.cholakkal, rao.anwer, fahad.khan)@mbzuai.ac.ae
2Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.
Abstract. Person search is a challenging problem with various real-world
applications that aims at joint person detection and re-identification
of a query person from uncropped gallery images. Although previous
studies focus on learning rich feature information, it is still hard to
retrieve the query person due to appearance deformations and background
distractors. In this paper, we propose a novel
attention-aware relation mixer (ARM) module for person search, which
exploits the global relation between different local regions within the RoI of
a person and makes it robust against various appearance deformations
and occlusion. The proposed ARM is composed of a relation mixer block
and a spatio-channel attention layer. The relation mixer block introduces
a spatially attended spatial mixing and a channel-wise attended channel
mixing for effectively capturing discriminative relation features within
an RoI. These discriminative relation features are further enriched by
introducing a spatio-channel attention where the foreground and back-
ground discriminability is empowered in a joint spatio-channel space.
Our ARM module is generic and does not rely on fine-grained
supervision or topological assumptions, so it can be easily integrated into
any Faster R-CNN based person search method. Comprehensive experiments
are performed on two challenging benchmark datasets: CUHK-SYSU
and PRW. Our PS-ARM achieves state-of-the-art performance on
both datasets. On the challenging PRW dataset, our PS-ARM achieves
an absolute gain of 5% in the mAP score over SeqNet, while operat-
ing at a comparable speed. The source code and pre-trained models are
available at (this https URL).
Keywords: Person Search · Transformer · Spatial attention · Channel attention.
arXiv:2210.03433v1 [cs.CV] 7 Oct 2022
1 Introduction
Person search is a challenging computer vision problem where the task is to find a
target query person in a gallery of whole-scene images. Person search methods
need to perform pedestrian detection [42,27,29] on the uncropped gallery
images and re-identification (re-id) [44,25,8] of the detected pedestrians. In
addition to addressing the challenges associated with these individual sub-tasks,
both tasks need to be optimized simultaneously within person search. Despite
its numerous real-world applications, person search is highly challenging due
to the diverse nature of the person detection and re-id sub-tasks within the person
search problem.
Person search approaches can be broadly grouped into two-step [45,4,17] and
one-step methods [37,39,6]. In two-step approaches, person detection and re-id
are performed separately in two different steps. In the first step, a detection
network such as Faster R-CNN is employed to detect pedestrians. In the
second step, the detected persons are cropped from the input image and resized,
then passed to another, independent network for re-identification of
the cropped pedestrians. Although two-step methods provide promising results,
they are computationally expensive. Unlike two-step methods, one-step
methods employ a unified framework where the backbone network is shared
between the detection and re-identification of persons. For a given uncropped image,
one-step methods predict the box coordinates and re-id features for all persons
in that image. One-step person search approaches such as [6,24] generally extend
the Faster R-CNN object detection framework by introducing an additional
branch that produces re-id feature embeddings, and the whole network is jointly
trained end-to-end. Such methods often struggle when the target person in the
gallery images has large appearance deformations such as pose variation, occlusion,
and overlapping background distractions within the region of interest (RoI)
of a target person (see Figure 1).
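The retrieval step shared by both families of methods can be illustrated with a minimal sketch: once every detected person in the gallery has a re-id embedding, the detections are ranked by their similarity to the query embedding. The use of cosine similarity, the function names and the toy embeddings below are assumptions made for illustration only; they are not taken from the paper or from any particular implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query_embedding, detections):
    """Return (score, detection) pairs sorted from best to worst match.

    Each detection is a dict with hypothetical keys 'image', 'box' and 'embedding'.
    """
    scored = [(cosine_similarity(query_embedding, d["embedding"]), d) for d in detections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy data: three detected persons from two gallery images (made-up values).
query = [1.0, 0.0, 1.0]
detections = [
    {"image": "gallery_02.jpg", "box": (40, 10, 90, 200), "embedding": [0.0, 1.0, 0.0]},
    {"image": "gallery_01.jpg", "box": (12, 30, 70, 210), "embedding": [0.9, 0.1, 1.1]},
    {"image": "gallery_02.jpg", "box": (150, 20, 210, 220), "embedding": [-1.0, 0.2, 0.5]},
]

ranked = rank_gallery(query, detections)
for score, det in ranked:
    print(f"{det['image']} {det['box']} score={score:.3f}")
```

In a one-step method the embeddings would come from the re-id branch of the shared network, whereas in a two-step method they would come from a separate re-id network applied to the cropped detections; the ranking itself is the same in both cases.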
1.1 Motivation
To motivate our approach, we first distinguish two desirable characteristics to
be considered when designing a Faster R-CNN based person search framework
that is robust to appearance deformations (e.g., pose variations, occlusions) and
background distractions occurring in the query person (see Figure 1).
Discriminative Relation Features through Local Information Mixing: The posi-
tion of different local person regions within an RoI can vary in case of appearance
deformations such as pose variations and occlusions. This is likely to deteriorate
the quality of re-id features, leading to inaccurate person matching. Therefore, a
dedicated mechanism is desired that generates discriminative relation features by
globally mixing relevant information from different local regions within an RoI.
To ensure a straightforward integration into existing person search pipelines,
such a mechanism is further expected to learn discriminative relation features
without requiring fine-level region supervision or topological body approxima-
tions.
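As a rough illustration of what "globally mixing information from different local regions" can mean, the sketch below applies a fixed mixing matrix across the spatial positions of an RoI feature map, in the spirit of token mixing in MLP-style mixers. This is not the ARM module itself: the function name, the three-region layout and the hand-picked weights are assumptions, and in a real network the mixing weights would be learned.

```python
def mix_regions(region_features, mixing_weights):
    """Mix information across local regions.

    region_features: N x C list (N local regions, C channels each).
    mixing_weights:  N x N list; row i gives how much each input region
                     contributes to output region i.
    Returns an N x C list in which every region is a weighted sum of all regions.
    """
    n = len(region_features)
    c = len(region_features[0])
    mixed = []
    for i in range(n):
        row = [
            sum(mixing_weights[i][j] * region_features[j][k] for j in range(n))
            for k in range(c)
        ]
        mixed.append(row)
    return mixed


# Toy RoI with three local regions (e.g. upper, middle, lower) and two channels.
features = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]
# Hypothetical weights: each region keeps half of itself and borrows from the others.
weights = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]

mixed = mix_regions(features, weights)
print(mixed)
```

Because every output region is computed from all input regions, a feature that shifts to a different position under a pose change still contributes to the mixed representation, and no part-level labels or body-topology priors are needed to set this up.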
[Figure 1. Left: qualitative results in three columns (Query, w/o ARM, w/ ARM). Right: plot of accuracy (AP) against speed (fps) for the methods tabulated below.]

Method            mAP    Speed (fps)
NAE [6]           43.3   12.0
NAE+ [6]          44.0   10.2
ACCE [7]          46.2    6.2
AlignPS [38]      45.9   16.3
OIM [37]          21.3    8.5
SeqNet [24]       47.6   11.6
Ours (PS-ARM)     52.6   10.4

Fig. 1: On the left: Qualitative comparison showing different query examples
and their corresponding top-1 matching results obtained with and without our
ARM module in the same base framework. Here, true and false matching results
are marked in green and red, respectively. These examples depict appearance
deformations and distracting backgrounds in the gallery images for the query
person. Our ARM module, which explicitly captures discriminative relation
features, better handles the appearance deformations in these examples. On the
right: Accuracy (AP) vs. speed (frames per second) comparison with state-of-
the-art person search methods on the PRW test set. All methods are reported with
a ResNet50 backbone and speed is computed on a V100 GPU. Our approach
(PS-ARM) achieves an absolute mAP gain of 5% over SeqNet while operating
at a comparable speed.

Foreground-Background Discriminability for Accurate Local Information Mixing:
The quality of the aforementioned relation features relies on the assumption
that the RoI region only contains foreground (person) information. However, in
real-world scenarios the RoI regions are likely to contain unwanted background
information due to less accurate bounding-box locations. Therefore, discriminability
of the foreground from the background is essential for accurate local
information mixing to obtain discriminative relation features. Further, such
FG/BG discrimination is also expected to improve the detection performance.
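The effect of foreground/background discrimination on the features that enter the mixing step can be sketched with a simple spatial gate: each local region receives a weight in (0, 1) and its features are scaled by that weight. This is only a minimal illustration of attention-based gating, not the spatio-channel attention layer proposed in the paper; the function names and the foreground logits are hypothetical, and in practice such scores would be predicted by the network.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_spatial_attention(region_features, foreground_logits):
    """Scale each region's features by a foreground weight.

    region_features:   N x C list of per-region feature vectors.
    foreground_logits: N scores; larger means more likely to be the person.
    Returns (attended N x C features, list of N weights).
    """
    weights = [sigmoid(z) for z in foreground_logits]
    attended = [[w * v for v in feats] for w, feats in zip(weights, region_features)]
    return attended, weights


# Toy RoI: two regions on the person and one region covering background clutter.
features = [
    [1.0, 2.0],   # person
    [0.5, 1.5],   # person
    [3.0, 3.0],   # background distractor with large activations
]
logits = [4.0, 4.0, -4.0]  # made-up foreground scores

attended, weights = apply_spatial_attention(features, logits)
for w, row in zip(weights, attended):
    print(f"weight={w:.3f} features={[round(v, 3) for v in row]}")
```

With the background region scaled close to zero, a subsequent mixing step draws almost all of its information from the person regions, even when the bounding box is loose.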
1.2 Contribution
We propose a novel end-to-end one-step person search method with the following
contributions. We propose an attention-aware relation mixer
(ARM) module that strives to capture the global relation between different local
person regions through global mixing of local information while simultaneously
suppressing background distractions within an RoI. Our ARM module comprises
a relation mixer block and a spatio-channel attention layer. The relation
mixer block captures discriminative relation features through a spatially
attended spatial mixing and a channel-wise attended channel mixing. These
discriminative relation features are further enriched by the spatio-channel attention
layer performing foreground/background discrimination in a joint spatio-channel
space. Our ARM module is generic and can be easily integrated into any
Faster R-CNN based person search method. Comprehensive experiments are
performed on two challenging benchmark datasets: CUHK-SYSU [37] and PRW [46].
On both datasets, our PS-ARM performs favourably against state-of-the-art
approaches. On the challenging PRW benchmark, our PS-ARM achieves a mAP
score of 52.6%, an absolute gain of 5% over SeqNet, while operating at a
comparable speed (see Figure 1).
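The reported gain can be checked directly from the PRW numbers listed with Figure 1. The short script below recomputes the absolute mAP difference and the relative speed of PS-ARM with respect to each compared method; the values are copied from the figure, while the variable names are ours.

```python
# (mAP, speed in fps) on the PRW test set, as listed with Figure 1.
prw_results = {
    "NAE [6]": (43.3, 12.0),
    "NAE+ [6]": (44.0, 10.2),
    "ACCE [7]": (46.2, 6.2),
    "AlignPS [38]": (45.9, 16.3),
    "OIM [37]": (21.3, 8.5),
    "SeqNet [24]": (47.6, 11.6),
}
ours_map, ours_fps = 52.6, 10.4

# Absolute mAP gain of PS-ARM over each method, in percentage points.
gains = {name: round(ours_map - m, 1) for name, (m, _) in prw_results.items()}
# Speed of PS-ARM relative to each method (1.0 means equally fast).
speed_ratios = {name: ours_fps / fps for name, (_, fps) in prw_results.items()}

for name in prw_results:
    print(f"{name:14s} gain={gains[name]:+5.1f} mAP  relative speed={speed_ratios[name]:.2f}x")
```

Against SeqNet the difference is 52.6 - 47.6 = 5.0 mAP points at roughly 90% of its frame rate (10.4 vs. 11.6 fps), which is the sense in which the speed is described as comparable.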
2 Related Work
Person search is a challenging computer vision problem with numerous real-world
applications. As mentioned earlier, existing person search methods can be broadly
classified into two-step and one-step methods. Most existing two-step person search
approaches address this problem by first detecting the pedestrians, followed by
cropping and resizing to a fixed resolution before passing them to the re-id network
that identifies the cropped pedestrians [46,5,18,11,23]. These methods generally
employ two different backbone networks for detection and re-identification.
On the other hand, several one-step person search methods employ feature
pooling strategies such as RoIPooling or RoIAlign pooling to obtain a scale-
invariant representation for the re-id sub-task. The two-step method of [5]
learns robust person features by exploiting person foreground maps obtained from
a pre-trained segmentation network. Han et al. [18] introduced a bounding box
refinement mechanism for person localization. Dong et al. [11] utilized the similarity
between the query and query-like features to reduce the number of proposals for
re-identification. Zheng et al. [46] introduced the challenging PRW dataset. A
multi-scale feature pyramid was introduced in [23] for improving person search
under scale variations. Wang et al. [35] proposed a method to address the
inconsistency between the detection and re-id sub-tasks.
Most one-step person search methods [37,36,26,2,39,10,6,28,16,24] are
built on the Faster R-CNN object detector [31]. These methods generally
introduce an additional branch to Faster R-CNN and jointly address the
detection and re-id sub-tasks. One of the earliest Faster R-CNN based one-step
person search approaches is [37], which proposed an online instance matching (OIM) loss.
Xiao et al. [36] introduced a center loss to explore intra-class compactness. For
generating person proposals, Liu et al. [26] introduced a mechanism to iteratively
shrink the search area based on query guidance. Similarly, Chang et al. [2]
used reinforcement learning to address the person search problem. Yan et al.
[39] exploited complementary cues based on a graph learning framework. Dong et
al. [10] proposed a Siamese-based Bi-directional Interaction Network (BINet) to
mitigate redundant context information outside the bounding boxes. In contrast,
Chen et al. [6] proposed Norm-Aware Embedding (NAE) to alleviate the conflict
between person localization and re-identification by computing the magnitude and
angle of the embedded features, respectively.
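The magnitude/angle split mentioned for NAE can be made concrete with a small sketch, reading the sentence above as: the magnitude of an embedding serves person localization, and its direction serves re-identification. The code below only shows the decomposition of a vector into its norm and its unit-length direction; how NAE maps the norm to a detection score is not described here and is left out. The function name and the toy vectors are assumptions.

```python
import math


def decompose_embedding(embedding):
    """Split a vector into (norm, unit direction).

    Returns (0.0, zero vector) for an all-zero input to avoid division by zero.
    """
    norm = math.sqrt(sum(v * v for v in embedding))
    if norm == 0.0:
        return 0.0, [0.0] * len(embedding)
    return norm, [v / norm for v in embedding]


# Two toy embeddings that point in the same direction but differ in length.
norm_a, dir_a = decompose_embedding([3.0, 4.0])
norm_b, dir_b = decompose_embedding([6.0, 8.0])

print("norms:", norm_a, norm_b)
print("directions:", dir_a, dir_b)
```

The two vectors share the same direction, so they would match each other in an angle-based re-id comparison, while their different norms remain available as a separate signal for the detection side.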
Chen et al. [3] developed a Hierarchical Online Instance Matching loss to
guide feature learning by exploiting the hierarchical relationship between
detection and re-identification. Munjal et al. [28] proposed a query-guided
proposal network (QGPN) to learn a query-guided re-identification score.
Li et al. [24] proposed a Sequential End-to-end Network (SeqNet) to refine the