
proposal generation module. Most existing pedestrian detection methods employ either the single-
stage or two-stage strategy as their model architectures.
Occlusion Handling.
In pedestrian detection, occlusion leads to misclassified pedestrians. A common strategy is the
part-based approach [8–11], which ensembles a series of body-part detectors to localize partially
occluded pedestrians. Some methods also train different models for the most frequent occlusion
patterns [12, 13] or model different occlusion patterns in a joint framework [14, 15], but they are
all designed for specific occlusion patterns and do not generalize well to other occluded scenes.
Besides, attention mechanisms have been applied to handle different occlusion patterns [9, 16].
MGAN [16] introduces a mask-guided attention network, which emphasizes visible pedestrian regions
while suppressing the occluded parts by modulating extracted features. Moreover, a few recent
works [17, 18] have used annotations of the visible box as extra supervision to improve pedestrian
detection performance.
Crowdedness Handling.
In crowded scenes, besides the misclassification issue, crowdedness makes it difficult to
distinguish highly overlapped pedestrians. A few previous works propose new loss functions to
address the problem of crowded detections. For example, OR-CNN [8] proposes an aggregation loss
that enforces proposals to be close to the corresponding objects and minimizes the internal region
distances of proposals associated with the same object. RepLoss [19] proposes the Repulsion Loss,
which adds an extra penalty to proposals intertwined with multiple ground truths. Moreover, some
advanced NMS strategies [20–23, 18] have been proposed to alleviate the crowding issue to some
extent, but they still take IoU as the metric to measure the difference between detected objects,
which limits their ability to identify highly overlapped instances among crowded boxes.
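To make the IoU limitation concrete, the following is a minimal sketch of plain greedy IoU-based NMS, not the procedure of any cited method. The box coordinates, scores, and the threshold of 0.5 are hypothetical values chosen for illustration. Two distinct pedestrians standing close together overlap enough that the lower-scored one is suppressed.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order, dropping any box whose IoU
    with an already kept box exceeds the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical scene: boxes 0 and 1 are two different people side by side,
# box 2 is an isolated pedestrian.
boxes = [(0, 0, 10, 30), (3, 0, 13, 30), (50, 0, 60, 30)]
scores = [0.9, 0.8, 0.7]

overlap = iou(boxes[0], boxes[1])      # 210 / 390, about 0.54
kept = greedy_nms(boxes, scores, 0.5)  # box 1 is suppressed
```

Raising the threshold keeps the second pedestrian but also lets more duplicate detections through, which is the trade-off that motivates replacing IoU with a different measure.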
Object Representation.
In computer vision, object representation is a primary topic, and there are many representations
for objects in 2D images, such as 2D bounding boxes [4], polygons [24], splines [25], and
pixels [26]. Each has strengths and weaknesses from a specific application’s practical perspective,
differing in annotation cost, information density, and level of fidelity. A distribution-based
representation has also been tried in [27], which uses the bivariate normal distribution to
represent objects. However, when it is derived from bounding boxes rather than segmentation, the
mean and variance of the bivariate normal distribution are still consistent with the center and
scale of the box. Besides, its performance is considerably poorer than that of other methods.
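The point about bounding-box-derived Gaussians can be illustrated with a small sketch. The conversion used in [27] is not given here, so the code assumes simple moment matching against a uniform distribution over the box; the function name `box_to_gaussian` is hypothetical. Under that assumption the mean is exactly the box center and each variance depends only on the box width or height, so the Gaussian carries no information beyond center and scale.

```python
def box_to_gaussian(box):
    """Moment-match an axis-aligned Gaussian to a box (x1, y1, x2, y2).

    Assumption: mass is spread uniformly over the box, so each axis has
    the variance of a uniform distribution, side_length**2 / 12.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    mean = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)  # box center
    var = (w * w / 12.0, h * h / 12.0)         # determined by box scale only
    return mean, var


mean, var = box_to_gaussian((0, 0, 12, 24))
```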
In this paper, Beta Representation provides a more detailed representation for occluded pedestrians,
along with a new metric to substitute for IoU and a new detector Beta R-CNN, thereby alleviating the
occlusion and crowd issues to a great extent.
3 Method
In this section, we first introduce the parameterized Beta Representation for pedestrians. Then to
fully exploit the Beta Representation, a novel pipeline Beta R-CNN is proposed. Moreover, a specific
NMS strategy based on beta distribution and KL divergence, i.e., BetaNMS, is analyzed in detail.
3.1 Beta Representation
3.1.1 Beta Distribution
In probability theory and mathematical statistics, the beta distribution is a family of
one-dimensional continuous probability distributions defined on the interval [0, 1], parameterized
by two positive shape parameters α and β. For 0 ≤ x ≤ 1 and shape parameters α, β > 0, the
probability density function (PDF) of the beta distribution is a power function of the variable x
and its reflection (1 − x), as follows:
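$$
\mathrm{Be}(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \cdot x^{\alpha-1}(1-x)^{\beta-1} = \frac{1}{B(\alpha, \beta)} \cdot x^{\alpha-1}(1-x)^{\beta-1}, \tag{1}
$$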
where Γ(z) is the gamma function and B(α, β) is a normalization factor that ensures the total
probability is 1. Some beta distribution samples are shown in Fig. 1. According to the above
definition, the mean