Beta R-CNN: Looking into Pedestrian Detection from Another Perspective

Zixuan Xu∗

Peking University

zixuanxu@pku.edu.cn

Banghuai Li∗

Megvii Research

libanghuai@megvii.com

Ye Yuan

Megvii Research

yuanye@megvii.com

Anhong Dang

Peking University

ahdang@pku.edu.cn

Abstract

Significant progress has recently been made in pedestrian detection, but achieving high performance in occluded and crowded scenes remains challenging. This can be attributed largely to the widely used representation of pedestrians, i.e., the 2D axis-aligned bounding box, which describes only the approximate location and size of the object. A bounding box models the object as a uniform distribution within its boundary, which makes pedestrians hard to distinguish in occluded and crowded scenes because of the noise it includes. To address this problem, we propose a novel representation based on the 2D beta distribution, named Beta Representation. It depicts a pedestrian by explicitly constructing the relationship between the full-body and visible boxes, and emphasizes the center of visual mass by assigning different probability values to pixels. As a result, Beta Representation, together with a new NMS strategy named BetaNMS, is much better at distinguishing highly overlapped instances in crowded scenes. Furthermore, to fully exploit Beta Representation, we propose a novel pipeline, Beta R-CNN, equipped with BetaHead and BetaMask, leading to high detection performance in occluded and crowded scenes.

1 Introduction

Pedestrian detection is a critical research topic in computer vision, with real-world applications such as autonomous vehicles, intelligent video surveillance, and robotics. During the last decade, with the rise of deep convolutional neural networks (CNNs), great progress has been achieved in pedestrian detection. However, accurately distinguishing pedestrians in occluded and crowded scenes remains challenging.

∗These authors contributed equally.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:2210.12758v1 [cs.CV] 23 Oct 2022

Figure 1: Beta distributions have flexible shapes with different peaks and FWHMs. (Plot omitted; the curves are labeled 5, 2; 2, 2; 2, 3; and 2, 4, presumably (α, β) pairs, with the FWHM marked.)

Figure 2: Beta Representation samples and comparisons between IoU and KL divergence. (Image omitted; panels: BBox representation, 2-value Mask, Beta Representation; legend: full-body box, visible box; annotated pairs: fIoU:0.74, vIoU:0.21, KL:9.95; fIoU:0.68, vIoU:0.31, KL:10.34; fIoU:0.61, vIoU:0.45, KL:8.28; fIoU:0.84, vIoU:0.19, KL:12.47.)

Although many methods have been proposed to handle occlusion and crowding, performance is still limited by the pedestrian representation, i.e., the 2D bounding box. The axis-aligned minimum bounding box is widely used to explicitly define a distinct object by its approximate location and size. Although the box representation has advantages, such as being parameterization- and annotation-friendly as the identity of an object, it has non-negligible drawbacks that limit pedestrian detection performance, especially in occluded and crowded scenes. First, the bounding box can be regarded as modeling the object as a uniform distribution within the box, which goes against our intuitive perception: given an occluded pedestrian, what attracts our attention should be the visible part rather than the occluded noise. Second, with the box representation, intersection over union (IoU) serves as the metric to measure the difference between objects, which makes it difficult to distinguish highly overlapped instances in crowded scenes. As shown in Fig. 2, even if the detector succeeds in identifying different human instances in a crowded scene, the highly overlapped detections may still be suppressed by the non-maximum suppression (NMS) post-processing. Last, the full-body and visible boxes treat a distinct person as two separate parts, which ignores their inner relationship as a whole and makes model optimization difficult.
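To make the NMS failure mode described above concrete, the following minimal sketch implements IoU and standard greedy IoU-based NMS in Python. The box coordinates, scores, and the 0.5 threshold are made-up values for illustration, not taken from the paper's experiments, and the helper names `iou` and `greedy_nms` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], thresh: float = 0.5) -> List[int]:
    """Return indices of boxes kept by standard greedy IoU-based NMS."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep


# Two hypothetical pedestrians standing side by side, plus one far away
# (assumed coordinates and scores, for illustration only).
person_a = (100.0, 50.0, 200.0, 350.0)
person_b = (120.0, 50.0, 220.0, 350.0)
far_away = (400.0, 60.0, 480.0, 330.0)

overlap = iou(person_a, person_b)
kept = greedy_nms([person_a, person_b, far_away], [0.95, 0.90, 0.80])
print(f"IoU(a, b) = {overlap:.3f}, kept indices = {kept}")
```

With these assumed values, the two side-by-side boxes have an IoU of 2/3, so the lower-scoring one is discarded even though it corresponds to a distinct person.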

To eliminate the weaknesses of the box representation while preserving its advantages, we propose a novel representation for pedestrians based on the 2D beta distribution, named Beta Representation. In probability theory, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], as depicted in Fig. 1. By assigning different values to α and β, we can control the shape of the beta distribution, especially the peak and the full width at half maximum (FWHM), which makes it naturally suitable for representing pedestrians with unpredictable visible patterns. We treat each pedestrian as a 2D beta distribution on the image and generate eight new parameters as the Beta Representation. As illustrated in Fig. 2, the boundary of the 2D beta distribution is consistent with the full-body box, while the peak and the FWHM depend on the relation between the visible part and the full-body box. Compared with paired boxes, i.e., full-body and visible boxes, the 2D beta distribution treats each pedestrian as an integrated whole while emphasizing the object's center of visual mass.
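As a rough illustration of how the two shape parameters move the peak and change the FWHM, the sketch below evaluates a one-dimensional beta density with the Python standard library only. It assumes that the curve labels in Fig. 1 are (α, β) pairs; the function names and the grid-based FWHM estimate are illustrative choices, not the paper's implementation.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Beta density for x strictly inside (0, 1); returns 0 elsewhere."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    # Work in log space to avoid overflow in the gamma functions.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def beta_mode(a: float, b: float) -> float:
    """Location of the peak; this closed form holds only for a, b > 1."""
    return (a - 1) / (a + b - 2)


def beta_fwhm(a: float, b: float, steps: int = 10_000) -> float:
    """Approximate FWHM on a uniform grid (assumes a single interior peak)."""
    half = beta_pdf(beta_mode(a, b), a, b) / 2
    xs = [i / steps for i in range(1, steps)]
    above = [x for x in xs if beta_pdf(x, a, b) >= half]
    return above[-1] - above[0]


# Parameter pairs assumed to match the curve labels in Fig. 1.
for a, b in [(5, 2), (2, 2), (2, 3), (2, 4)]:
    print(f"alpha={a}, beta={b}: peak at {beta_mode(a, b):.3f}, "
          f"FWHM ~ {beta_fwhm(a, b):.3f}")
```

Increasing α relative to β shifts the peak toward 1, while larger values of both parameters narrow the curve, which is the flexibility the representation relies on.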

Besides, instead of IoU, Kullback-Leibler (KL) divergence is adopted as a new metric to measure the distance between two objects, and the resulting beta-distribution-based NMS strategy is named BetaNMS. Fig. 2 illustrates that while the bounding boxes are too close to distinguish (fIoU > 0.5, vIoU > 0.3), the 2D beta distributions still remain highly distinguishable from each other (KL > 7), leading to better performance in distinguishing highly overlapped instances.
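The following sketch shows how KL divergence separates two beta distributions, under strong simplifying assumptions: it is one-dimensional, both distributions share the support [0, 1], and the integral is approximated with the midpoint rule. The 2D formulation that BetaNMS actually uses, including how it treats distributions defined over different boxes, is not given in this section, so the code below should not be read as the paper's metric. The function names and parameter values are hypothetical.

```python
import math
from typing import Tuple

Params = Tuple[float, float]  # (alpha, beta) shape parameters


def beta_logpdf(x: float, a: float, b: float) -> float:
    """Log-density of the beta distribution for x strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def kl_beta(p: Params, q: Params, steps: int = 20_000) -> float:
    """KL(P || Q) for two 1D beta distributions via midpoint-rule integration."""
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) / steps  # midpoint of the i-th cell, never 0 or 1
        lp = beta_logpdf(x, *p)
        lq = beta_logpdf(x, *q)
        total += math.exp(lp) * (lp - lq)
    return total / steps


# Assumed example: one distribution peaks near 0.8, the other near 0.25.
right_heavy = (5.0, 2.0)
left_heavy = (2.0, 4.0)

print(f"KL(right || left)  = {kl_beta(right_heavy, left_heavy):.4f}")
print(f"KL(left  || right) = {kl_beta(left_heavy, right_heavy):.4f}")
print(f"KL(right || right) = {kl_beta(right_heavy, right_heavy):.4f}")
```

Note that KL divergence is asymmetric, so a practical suppression rule has to fix a direction or symmetrize the two values; which choice the paper makes is outside this excerpt.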

Moreover, to fully exploit Beta Representation in pedestrian detection, we design a novel pedestrian detector named Beta R-CNN, equipped with two key modules, i.e., BetaHead and BetaMask. BetaHead regresses the eight beta parameters and the class score, while BetaMask serves as an attention mechanism that modulates the extracted features with beta-distribution-based masks. Experiments on the extremely crowded benchmark CrowdHuman [ ] and on CityPersons [ ] show that our approach outperforms state-of-the-art results, which strongly validates the superiority of our method.

2 Related Work

Pedestrian Detection. Pedestrian detection can be viewed as object detection for a specific category. With the development of deep learning, CNN-based detectors can be roughly divided into two categories: two-stage approaches [ ], which comprise separate proposal generation followed by a classification and regression module that refines the proposals; and one-stage approaches [ – ], which perform localization and classification simultaneously on the feature maps without a separate proposal generation module. Most existing pedestrian detection methods adopt either the one-stage or the two-stage strategy as their model architecture.

2 fIoU and vIoU are the IoUs calculated on the full-body and visible boxes, respectively.

Occlusion Handling. In pedestrian detection, occlusion leads to pedestrians being misclassified. A common strategy is the part-based approach [ – ], which ensembles a series of body-part detectors to localize partially occluded pedestrians. Some methods also train different models for the most frequent occlusion patterns [ ] or model different occlusion patterns in a joint framework [ ], but they are designed for specific occlusion patterns and do not generalize well to other occluded scenes. Besides, attention mechanisms have been applied to handle different occlusion patterns [ ]. MGAN [ ] introduces a mask-guided attention network, which emphasizes visible pedestrian regions while suppressing the occluded parts by modulating the extracted features. Moreover, a few recent works [ ] have used annotations of the visible box as extra supervision to improve pedestrian detection performance.

Crowdedness Handling. In crowded scenes, besides the misclassification issue, crowdedness makes it difficult to distinguish highly overlapped pedestrians. A few previous works propose new loss functions to address crowded detections. For example, OR-CNN [ ] proposes an aggregation loss that forces proposals to be close to their corresponding objects and minimizes the internal region distances of proposals associated with the same object. RepLoss [ ] proposes Repulsion Loss, which adds an extra penalty to proposals intertwined with multiple ground truths. Moreover, some advanced NMS strategies [ – ] have been proposed to alleviate the crowding issue to some extent, but they still take IoU as the metric to measure the difference between detected objects, which limits their ability to identify highly overlapped instances among crowded boxes.

Object Representation. Object representation is a primary topic in computer vision, and there are many representations for objects in 2D images, such as 2D bounding boxes [ ], polygons [ ], splines [ ], and pixels [ ]. Each has strengths and weaknesses from the practical perspective of a specific application, offering different annotation costs, information density, and levels of fidelity. A distribution-based representation has also been tried in [ ], which uses the bivariate normal distribution to represent objects. However, when it is derived from bounding boxes rather than segmentation, the mean and variance of the bivariate normal distribution are still consistent with the box center and scale. Besides, its performance is considerably poorer than that of other methods.

In this paper, Beta Representation provides a more detailed representation for occluded pedestrians, along with a new metric to substitute for IoU and a new detector, Beta R-CNN, thereby alleviating the occlusion and crowd issues to a great extent.

3 Method

In this section, we first introduce the parameterized Beta Representation for pedestrians. Then, to fully exploit the Beta Representation, a novel pipeline, Beta R-CNN, is proposed. Moreover, a specific NMS strategy based on the beta distribution and KL divergence, i.e., BetaNMS, is analyzed in detail.

3.1 Beta Representation

3.1.1 Beta Distribution

In probability theory and mathematical statistics, the beta distribution is a family of one-dimensional continuous probability distributions defined on the interval [0, 1], parameterized by two positive shape parameters α and β. For 0 ≤ x ≤ 1 and shape parameters α, β > 0, the probability density function (PDF) of the beta distribution is a power function of the variable x and its reflection (1 − x), as follows:

$$\mathrm{Be}(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \cdot x^{\alpha-1}(1-x)^{\beta-1} = \frac{1}{B(\alpha, \beta)} \cdot x^{\alpha-1}(1-x)^{\beta-1}, \tag{1}$$

where Γ(z) is the gamma function and B(α, β) is a normalization factor that ensures the total probability is 1. Some beta distribution samples are shown in Fig. 1. According to the above definition, the mean
