
PS-ARM: An End-to-End Attention-aware Relation Mixer Network for Person Search

Mustansar Fiaz1[0000−0003−2289−2284], Hisham Cholakkal1, Sanath Narayan2, Rao Muhammad Anwer1, and Fahad Shahbaz Khan1

1Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE.

(mustansar.fiaz, hisham.cholakkal, rao.anwer, fahad.khan)@mbzuai.ac.ae

2Inception Institute of Artificial Intelligence, Abu Dhabi, UAE.

Abstract. Person search is a challenging problem with various real-world applications that aims at joint person detection and re-identification of a query person from uncropped gallery images. Although previous studies focus on learning rich feature information, it is still hard to retrieve the query person due to appearance deformations and background distractors. In this paper, we propose a novel attention-aware relation mixer (ARM) module for person search, which exploits the global relation between different local regions within the RoI of a person and makes it robust against various appearance deformations and occlusion. The proposed ARM is composed of a relation mixer block and a spatio-channel attention layer. The relation mixer block introduces a spatially attended spatial mixing and a channel-wise attended channel mixing for effectively capturing discriminative relation features within an RoI. These discriminative relation features are further enriched by introducing a spatio-channel attention where the foreground and background discriminability is empowered in a joint spatio-channel space. Our ARM module is generic and does not rely on fine-grained supervision or topological assumptions, and can hence be easily integrated into any Faster R-CNN based person search method. Comprehensive experiments are performed on two challenging benchmark datasets: CUHK-SYSU and PRW. Our PS-ARM achieves state-of-the-art performance on both datasets. On the challenging PRW dataset, our PS-ARM achieves an absolute gain of 5% in the mAP score over SeqNet, while operating at a comparable speed. The source code and pre-trained models are available at (this https URL).

Keywords: Person Search · Transformer · Spatial attention · Channel attention.

arXiv:2210.03433v1 [cs.CV] 7 Oct 2022

1 Introduction

Person search is a challenging computer vision problem where the task is to find a target query person in a gallery of whole scene images. Person search methods need to perform pedestrian detection [42,27,29] on the uncropped gallery images and re-identification (re-id) [44,25,8] of the detected pedestrians. In addition to addressing the challenges associated with these individual sub-tasks, both tasks need to be simultaneously optimized within person search. Despite numerous real-world applications, person search is highly challenging due to the diverse nature of the person detection and re-id sub-tasks within the person search problem.

Person search approaches can be broadly grouped into two-step [45,4,17] and one-step methods [37,39,6]. In two-step approaches, person detection and re-id are performed separately in two different steps. In the first step, a detection network such as Faster R-CNN is employed to detect pedestrians. In the second step, the detected persons are first cropped and resized from the input image, then passed to another independent network for re-identification of the cropped pedestrians. Although two-step methods provide promising results, they are computationally expensive. Unlike two-step methods, one-step methods employ a unified framework where the backbone network is shared for the detection and identification of persons. For a given uncropped image, one-step methods predict the box coordinates and re-id features for all persons in that image. One-step person search approaches such as [6,24] generally extend the Faster R-CNN object detection framework by introducing an additional branch to produce re-id feature embeddings, and the whole network is jointly trained end-to-end. Such methods often struggle when the target person in the gallery images has large appearance deformations such as pose variation, occlusion, and overlapping background distractions within the region of interest (RoI) of a target person (see Figure 1).
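To make the retrieval step concrete, the following is a minimal sketch of how the per-person outputs of a one-step method (a box plus a re-id feature for each detection) might be matched against a query. It assumes cosine similarity as the matching score and uses made-up 3-dimensional features; the function names, the dictionary layout, and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query_feat, detections):
    """Rank detections (dicts with 'image', 'box', 'feat') by similarity to the query.

    Returns a list of (score, image, box) tuples, best match first.
    """
    scored = [
        (cosine_similarity(query_feat, det["feat"]), det["image"], det["box"])
        for det in detections
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical toy data: one query feature and three gallery detections.
query = [1.0, 0.0, 1.0]
gallery = [
    {"image": "g1.jpg", "box": (10, 20, 50, 120), "feat": [0.9, 0.1, 1.1]},
    {"image": "g1.jpg", "box": (80, 15, 40, 110), "feat": [-1.0, 0.5, 0.0]},
    {"image": "g2.jpg", "box": (5, 5, 45, 100), "feat": [0.0, 1.0, 0.0]},
]

ranking = rank_gallery(query, gallery)
top1 = ranking[0]  # the detection whose feature is closest to the query
```

In this toy setting the first detection in g1.jpg is returned as the top-1 match. When appearance deformations or background clutter corrupt the RoI features, the ranking degrades, which is the failure mode discussed above.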

1.1 Motivation

To motivate our approach, we first distinguish two desirable characteristics to be considered when designing a Faster R-CNN based person search framework that is robust to appearance deformations (e.g. pose variations, occlusions) and background distractions occurring in the query person (see Figure 1).

Discriminative Relation Features through Local Information Mixing: The position of different local person regions within an RoI can vary in case of appearance deformations such as pose variations and occlusions. This is likely to deteriorate the quality of re-id features, leading to inaccurate person matching. Therefore, a dedicated mechanism is desired that generates discriminative relation features by globally mixing relevant information from different local regions within an RoI. To ensure a straightforward integration into existing person search pipelines, such a mechanism is further expected to learn discriminative relation features without requiring fine-level region supervision or topological body approximations.

Foreground-Background Discriminability for Accurate Local Information Mixing: The quality of the aforementioned relation features relies on the assumption that the RoI region only contains foreground (person) information. However, in real-world scenarios the RoI regions are likely to contain unwanted background information due to less accurate bounding-box locations. Therefore, discriminability of the foreground from the background is essential for accurate local information mixing to obtain discriminative relation features. Further, such a FG/BG discrimination is expected to also improve the detection performance.

[Figure 1. Left panel columns: Query, w/o ARM, w/ ARM. Right panel: plot of Accuracy (AP) against Speed (fps), with the values tabulated below.]

Method           mAP    Speed (fps)
NAE [6]          43.3   12.0
NAE+ [6]         44.0   10.2
ACCE [7]         46.2    6.2
AlignPS [38]     45.9   16.3
OIM [37]         21.3    8.5
SeqNet [24]      47.6   11.6
Ours (PS-ARM)    52.6   10.4

Fig. 1: On the left: Qualitative comparison showing different query examples and their corresponding top-1 matching results obtained with and without our ARM module in the same base framework. Here, true and false matching results are marked in green and red, respectively. These examples depict appearance deformations and distracting backgrounds in the gallery images for the query person. Our ARM module, which explicitly captures discriminative relation features, better handles the appearance deformations in these examples. On the right: Accuracy (AP) vs. speed (frames per second) comparison with state-of-the-art person search methods on the PRW test set. All methods are reported with a ResNet50 backbone, and speed is computed on a V100 GPU. Our approach (PS-ARM) achieves an absolute mAP gain of 5% over SeqNet while operating at a comparable speed.
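The accuracy and speed figures reported with Fig. 1 can be checked with a few lines of arithmetic. The sketch below copies the mAP and fps values from the Fig. 1 table into a dictionary and computes the absolute mAP gain and the speed ratio of PS-ARM relative to SeqNet; the dictionary layout and helper names are illustrative choices only.

```python
# (mAP in %, speed in fps) on the PRW test set, copied from the Fig. 1 table.
prw_results = {
    "NAE": (43.3, 12.0),
    "NAE+": (44.0, 10.2),
    "ACCE": (46.2, 6.2),
    "AlignPS": (45.9, 16.3),
    "OIM": (21.3, 8.5),
    "SeqNet": (47.6, 11.6),
    "PS-ARM": (52.6, 10.4),
}


def absolute_map_gain(method, baseline, results=prw_results):
    """Difference in mAP (percentage points), rounded to one decimal."""
    return round(results[method][0] - results[baseline][0], 1)


def speed_ratio(method, baseline, results=prw_results):
    """Speed of `method` as a fraction of the speed of `baseline`."""
    return results[method][1] / results[baseline][1]


gain = absolute_map_gain("PS-ARM", "SeqNet")   # percentage points of mAP
ratio = speed_ratio("PS-ARM", "SeqNet")        # fraction of SeqNet's fps
best_map = max(prw_results, key=lambda m: prw_results[m][0])
fastest = max(prw_results, key=lambda m: prw_results[m][1])
```

With these numbers the gain is 5.0 mAP points, and PS-ARM runs at roughly 90% of SeqNet's frame rate (10.4 vs. 11.6 fps), which is the sense in which the speed is described as comparable.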

1.2 Contribution

We propose a novel end-to-end one-step person search method with the following contributions. We introduce an attention-aware relation mixer (ARM) module that strives to capture the global relation between different local person regions through global mixing of local information while simultaneously suppressing background distractions within an RoI. Our ARM module comprises a relation mixer block and a spatio-channel attention layer. The relation mixer block captures discriminative relation features through a spatially attended spatial mixing and a channel-wise attended channel mixing. These discriminative relation features are further enriched by the spatio-channel attention layer, which performs foreground/background discrimination in a joint spatio-channel space. Comprehensive experiments are performed on two challenging benchmark datasets: CUHK-SYSU [37] and PRW [46]. On both datasets, our PS-ARM performs favourably against state-of-the-art approaches. Our ARM module is generic and can be easily integrated into any Faster R-CNN based person search method. On the challenging PRW dataset, our PS-ARM achieves a mAP score of 52.6%, an absolute gain of 5% over SeqNet, while operating at a comparable speed (see Figure 1).
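As a rough illustration of the idea of attended spatial and channel mixing, the sketch below applies an MLP-Mixer-style pair of linear mixing steps to an RoI feature map flattened to N spatial locations by C channels, with each step preceded by a simple sigmoid gate, followed by a joint spatio-channel reweighting. This section does not specify the actual ARM layers, so everything here is an assumption made for illustration: the mean-pooled sigmoid gates, the single linear mixing matrices, the residual connections, and all function names. It is not the authors' implementation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a, b):
    """Element-wise sum of two equally shaped matrices (used for residuals)."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def spatial_gate(x):
    """Assumed gate: one weight per location, sigmoid of the mean over channels."""
    return [sigmoid(sum(row) / len(row)) for row in x]


def channel_gate(x):
    """Assumed gate: one weight per channel, sigmoid of the mean over locations."""
    return [sigmoid(sum(col) / len(col)) for col in zip(*x)]


def relation_mixer(x, w_spatial, w_channel):
    """x: N x C features, w_spatial: N x N, w_channel: C x C (hypothetical weights)."""
    # Spatially attended spatial mixing: gate each location, then mix across locations.
    s = spatial_gate(x)
    gated = [[s[i] * v for v in row] for i, row in enumerate(x)]
    x = add(x, matmul(w_spatial, gated))
    # Channel-wise attended channel mixing: gate each channel, then mix across channels.
    c = channel_gate(x)
    gated = [[c[j] * v for j, v in enumerate(row)] for row in x]
    return add(x, matmul(gated, w_channel))


def spatio_channel_attention(x):
    """Joint reweighting: every entry is scaled by its location gate and channel gate."""
    s, c = spatial_gate(x), channel_gate(x)
    return [[x[i][j] * s[i] * c[j] for j in range(len(c))] for i in range(len(s))]


# Toy usage: a 2x2 RoI (N = 4 locations) with C = 3 channels and random weights.
random.seed(0)
N, C = 4, 3
roi = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(N)]
w_s = [[random.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(N)]
w_c = [[random.uniform(-0.1, 0.1) for _ in range(C)] for _ in range(C)]
mixed = relation_mixer(roi, w_s, w_c)
out = spatio_channel_attention(mixed)
```

The spatial step lets every location receive information from every other location in the RoI, which is the property that the global mixing of local information relies on. The gates give the block a way to down-weight locations or channels dominated by background before they are mixed.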

2 Related Work

Person search is a challenging computer vision problem with numerous real-world applications. As mentioned earlier, existing person search methods can be broadly classified into two-step and one-step methods. Most existing two-step person search approaches address this problem by first detecting the pedestrians, followed by cropping and resizing into a fixed resolution before passing them to the re-id network that identifies the cropped pedestrians [46,5,18,11,23]. These methods generally employ two different backbone networks for detection and re-identification. On the other hand, several one-step person search methods employ feature pooling strategies such as RoIPooling or RoIAlign to obtain a scale-invariant representation for the re-id sub-task. The work in [5] proposed a two-step method to learn robust person features by exploiting person foreground maps using a pre-trained segmentation network. Han et al. [18] introduced a bounding box refinement mechanism for person localization. Dong et al. [11] utilized the similarity between the query and query-like features to reduce the number of proposals for re-identification. Zheng et al. [46] introduced the challenging PRW dataset. A multi-scale feature pyramid was introduced in [23] for improving person search under scale variations. Wang et al. [35] proposed a method to address the inconsistency between the detection and re-id sub-tasks.

Most one-step person search methods [37,36,26,2,39,10,6,28,16,24] are developed based on the Faster R-CNN object detector [31]. These methods generally introduce an additional branch to Faster R-CNN and jointly address the detection and re-id sub-tasks. One of the earliest Faster R-CNN based one-step person search approaches is [37], which proposed an online instance matching (OIM) loss. Xiao et al. [36] introduced a center loss to explore intra-class compactness. For generating person proposals, Liu et al. [26] introduced a mechanism to iteratively shrink the search area based on query guidance. Similarly, Chang et al. [2] used reinforcement learning to address the person search problem. Chang et al. [39] exploited complementary cues based on a graph learning framework. Dong et al. [10] proposed a Siamese based Bi-directional Interaction Network (BINet) to mitigate redundant context information outside the bounding boxes. In contrast, Chen et al. [6] proposed Norm Aware Embedding (NAE) to alleviate the conflict between person localization and re-identification by computing the magnitude and angle of the embedded features, respectively.
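The separation of an embedding into a magnitude and an angle can be sketched in a few lines. The example below only shows the decomposition itself: the norm as a scalar that could serve the localization side, and the unit-length direction that could serve matching. How the magnitude is turned into a detection score is not described in this section, so no such mapping is included; the function names are illustrative.

```python
import math


def decompose(embedding):
    """Split a feature vector into (norm, unit-length direction)."""
    norm = math.sqrt(sum(v * v for v in embedding))
    if norm == 0.0:
        return 0.0, list(embedding)
    return norm, [v / norm for v in embedding]


def angular_similarity(a, b):
    """Cosine of the angle between two embeddings; independent of their norms."""
    _, ua = decompose(a)
    _, ub = decompose(b)
    return sum(x * y for x, y in zip(ua, ub))


# Two toy embeddings that point the same way but differ in magnitude.
norm_a, dir_a = decompose([3.0, 4.0])
norm_b, dir_b = decompose([6.0, 8.0])
same_direction = angular_similarity([3.0, 4.0], [6.0, 8.0])
```

Here the two vectors have norms 5 and 10 but identical directions, so their angular similarity is 1 (up to floating-point error). Changing the magnitude leaves the matching score untouched, which is why the two quantities can serve different sub-tasks.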

Chen et al. [3] developed a Hierarchical Online Instance Matching loss to guide the feature learning by exploiting the hierarchical relationship between detection and re-identification. A query-guided proposal network (QGPN) was proposed by Munjal et al. [28] to learn a query-guided re-identification score. Li et al. [24] proposed a Sequential End-to-end Network (SeqNet) to refine the
