BoxMask: Revisiting Bounding Box Supervision for Video Object Detection
Khurram Azeem Hashmi Alain Pagani Didier Stricker
Muhammad Zeshan Afzal
DFKI - German Research Center for Artificial Intelligence
firstname[0] firstname[1].lastname@dfki.de
Abstract
We present a new, simple yet effective approach to uplift video object detection. We observe that prior works operate on instance-level feature aggregation that inherently neglects the refined pixel-level representation, resulting in confusion among objects sharing similar appearance or motion characteristics. To address this limitation, we propose BoxMask, which effectively learns discriminative representations by incorporating class-aware pixel-level information. We simply consider bounding box-level annotations as a coarse mask for each object to supervise our method. The proposed module can be effortlessly integrated into any region-based detector to boost detection. Extensive experiments on the ImageNet VID and EPIC KITCHENS datasets demonstrate consistent and significant improvements when we plug our BoxMask module into numerous recent state-of-the-art methods.
1. Introduction
With the recent advancements in deep convolutional neural networks [32, 61, 56], object detection in still images has made remarkable progress [23, 47, 44, 52, 21]. The naive idea of applying image-based detectors to each frame for Video Object Detection (VOD) often underperforms, owing to deteriorated object appearance caused by motion blur, rare poses, and partial occlusions in videos. Therefore, exploiting the temporal information encoded in videos [67, 68, 58, 24] has become the de facto choice for tackling these challenges.

Earlier video object detection techniques utilizing temporal information mainly operate under two paradigms. The first category of methods applies post-processing on temporal information to make still-image object detection results [30, 36, 35, 3] more consistent and stable. Alternatively, the second group leverages feature aggregation of temporal information [67, 8, 58, 63, 11, 24]. Although these region-based state-of-the-art systems have greatly boosted the performance of VOD, they struggle to differentiate confusing objects with similar appearance or uniform motion attributes.
We observe that most previous approaches [67, 58, 24, 11] operate on instance-level feature aggregation that inherently neglects the refined pixel-level representation, resulting in acceptable localization but inferior classification. As illustrated in the first two rows of Figure 1, although the object detector exploits spatio-temporal context from support frames ($t-s$ and $t+s$) to refine proposal features, it produces false positives by classifying background as a Bear and misclassifies a Watercraft as a Car in the target frame $t$. To overcome this hurdle, we design a novel module called BoxMask that exploits class-aware pixel-level temporal information to boost VOD. Inspired by [31] in still images, BoxMask predicts a class-aware segmentation mask for each region of interest along with the conventional classification and localization. Since this paper deals with the problem of object detection in videos, we investigate bounding box-level annotations to generate coarse masks that supervise our BoxMask network (a minimal sketch of this box-to-mask conversion follows the contribution list below). The advantages of adopting our BoxMask head are twofold. First, the class-aware pixel-level features reduce hard false positives between objects with low spatial and temporal inter-class variance. Second, since the size of the predicted mask is identical to the target region, fine-grained pixel-level learning assists the detector in precise localization. We summarize the main contributions of this paper as follows:
• We observe that object misclassification is the crucial obstacle that limits the upper bound of existing video object detection methods. We further revisit the idea of leveraging bounding box annotations to supervise both regression and mask prediction (see Figure 1).
• We propose BoxMask, an extremely simple yet effective module that learns additional discriminative representations by incorporating class-aware pixel-level information to boost VOD.
• Our BoxMask is a plug-and-play module and can be integrated into any region-based detection method.
• With our novel class-aware pixel-level learning introduced in recent state-of-the-art methods, we achieve absolute gains of 1.8% and 2.1% in mAP on the ImageNet VID and EPIC KITCHENS benchmarks, respectively.
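As a minimal sketch of the box-to-mask conversion mentioned above, the following hypothetical PyTorch snippet rasterizes box annotations into a coarse class-aware mask; the function name, tensor shapes, and overlap handling are illustrative assumptions, not the exact target-generation code.

```python
import torch

def boxes_to_coarse_masks(boxes, labels, image_size):
    """Rasterize box annotations into a coarse class-aware mask.

    Every pixel inside a ground-truth box receives that box's class
    label; all remaining pixels stay background (class 0).

    boxes:      (N, 4) tensor of (x1, y1, x2, y2) pixel coordinates
    labels:     (N,) tensor of class indices in [1, num_classes]
    image_size: (height, width) of the frame
    """
    h, w = image_size
    mask = torch.zeros((h, w), dtype=torch.long)  # 0 = background
    for box, cls in zip(boxes.round().long(), labels):
        x1, y1, x2, y2 = box.tolist()
        mask[y1:y2, x1:x2] = cls  # fill the whole box with the class id
    return mask

# Example: two objects on a 480x640 frame; later boxes overwrite
# earlier ones where they overlap (one possible tie-breaking choice).
boxes = torch.tensor([[50.0, 60.0, 200.0, 220.0],
                      [150.0, 100.0, 400.0, 300.0]])
labels = torch.tensor([3, 7])
coarse_mask = boxes_to_coarse_masks(boxes, labels, (480, 640))
```

In practice, such a mask would typically be cropped and resized to each RoI to match the resolution of the predicted mask, but the principle is the same: box interiors become class-labeled foreground that supervises pixel-level learning.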
Figure 1. Motivation. Despite leveraging spatio-temporal information from support frames $t-s$ and $t+s$, modern VOD methods often misclassify objects with similar appearance and uniform motion characteristics. For instance, a moving object in the background is categorized as a bear in (a), while a Watercraft is mistaken for a Car in (b). To address this, we devise a simple BoxMask module that learns pixel-level features by introducing crucial discriminative cues to boost detection among confused object categories. Note that with fine-grained pixel-level learning, our BoxMask removes the misclassification of background in (a) and correctly categorizes the Watercraft in (b). Best viewed on screen.
2. Related Work
Object Detection in Images. The existing methods in image-based object detection can be mainly divided into single-stage detectors [42, 44, 45, 46, 9, 21] and multi-stage or region-based detectors [47, 6, 7, 26, 34]. Mask R-CNN [31] replaces RoI Pooling with RoIAlign and introduces an extra instance segmentation head that not only improves instance segmentation but also advances object detection. Cheng et al. [12] blame the weak classification head for inferior detections and propose ensembling the classification scores of Faster R-CNN [47] and R-CNN [23] as a remedy. IoU-Net [33] proposes a separate confidence mechanism for localization. Double-Head R-CNN [59] disentangles the detection head by treating classification with a fully connected head and regression with a convolutional head. Along this direction, the seminal work [51] incorporates TSD into a region-based detector [47] to learn different features for classification and regression, adding separate losses to the overall loss function to optimize detection. Similar to these works in still images [31, 59, 51, 33], we observe that the naive sibling head in the region-based detector [47] confuses objects with similar motion characteristics and leads to sub-optimal video object detection.
Box-supervised Semantic and Instance Segmentation in Images. There has been an increasing trend of exploiting bounding box annotations to enhance weakly supervised instance and semantic segmentation approaches in still images [14, 39, 37, 41, 4]. The main reason is that bounding boxes contain knowledge about the precise location of each object, and they are approximately 35 times faster to annotate than per-pixel labeling [19, 2]. In a similar direction, our work exploits box-level annotations to generate coarse masks, eventually boosting video object detection.
Object Detection in Videos. Prior methods for video object detection follow two directions. One direction exploits the redundancy in video frames by incorporating optical flow [68, 65], scale-time lattices [8], reinforcement learning capabilities [63], and heatmaps [62] to reduce the cost of feature extraction by propagating keyframe features to other frames in a video. Another line of work leverages the temporal information encoded in videos to boost VOD, and our work follows this trend.
Figure 2. Architectural overview of modern region-based VOD methods and our proposed modules, highlighted in magenta. Alongside spatio-temporal features, our method introduces important class-aware pixel-level features, which effectively tackle object confusion and boost performance in modern region-based video object detection methods.
Existing techniques exploit temporal information in two ways. The first is to refine the detection results with post-processing methods [30, 36, 35]. Although these approaches improve the performance of VOD, they rely heavily on an image-based detector trained with no knowledge of temporal information. On the contrary, the second direction is to capitalize on temporal information during the training stage [67, 57, 20, 5, 25, 50, 58, 17, 68, 65, 8, 60, 16, 53, 10, 11, 24, 64]. Some of these methods utilize optical flow [18] to warp and aggregate features across frames [67, 57, 35]. Despite the improvement, the optical flow-based methods fail in the case of occlusions. Most existing region-based VOD methods [67, 58, 24] tackle the inherent challenges by aggregating temporal features. However, they mainly rely on instance-level feature aggregation, which pays less attention to the content of object proposals, resulting in confusion between objects with similar appearance and motion characteristics. Very recently, TransVOD [64] introduced a transformer-based VOD method by extending Deformable DETR [66] with a temporal transformer to aggregate object queries from different video frames.
Tackling Object Confusion in Videos. Han et al. [29] are the first to highlight object confusion as the main problem in VOD. They propose exploiting inter-video and intra-video proposal relations to tackle object confusion. Other seminal works [27, 28] attempt to solve this problem by devising better feature aggregation schemes that enhance the target frame's feature representation. Despite the gratifying improvement in detection, these approaches rely on a region-based detector that focuses more on discriminating between background and foreground regions than on differentiating between various foreground regions [12]. Moreover, these methods operate on complex pipelines to produce impressive results. Alternatively, we design a simple but effective BoxMask module that achieves similar performance when integrated into recent region-based VOD methods.
3. Method
This section first provides an overview of modern region-based detectors in VOD, diving into the inherent misclassification problem in Section 3.1. We then explain the proposed BoxMask module and its learning mechanism in Sections 3.2 and 3.3, respectively.
3.1. Revisiting Region-based Detectors in VOD
Figure 2 depicts an overview of region-based detectors in VOD. First, a backbone network extracts spatial features from the target frame (the actual frame on which detection needs to be executed) and support frames (other video frames that assist detection on the target frame). Subsequently, a Region Proposal Network (RPN) [47] predicts object proposals for each frame and aims to minimize the regression loss $\mathcal{L}_{reg}$ and classification loss $\mathcal{L}_{cls}$ defined as:
$$\mathcal{L}_{rpn} = \mathcal{L}_{cls}(p, p^{*}) + p^{*} \cdot \mathcal{L}_{reg}(t, t^{*}) \qquad (1)$$
where $p$ is the estimated probability of a proposal being an object and $p^{*}$ is 1 or 0 depending on the label of the anchor box. The term $t$ denotes the coordinates of the predicted object proposal, and $t^{*}$ is the ground truth. Note that the classification loss $\mathcal{L}_{cls}$ in Equation 1 only focuses on improving the objectness of proposals rather than object classification.
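As a concrete illustration, Equation 1 can be written as the following PyTorch sketch. The losses follow the standard Faster R-CNN convention (binary cross-entropy for objectness, smooth-L1 for regression); the tensor shapes and the normalization over positive anchors are assumptions for clarity, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def rpn_loss(obj_logits, box_deltas, obj_targets, box_targets):
    """Sketch of Equation 1: L_rpn = L_cls(p, p*) + p* . L_reg(t, t*).

    obj_logits:  (A,) predicted objectness score per anchor (p)
    obj_targets: (A,) float, 1.0 for positive anchors, 0.0 otherwise (p*)
    box_deltas:  (A, 4) predicted box parameterization (t)
    box_targets: (A, 4) ground-truth parameterization (t*)
    """
    # Objectness term: is each anchor an object at all?
    cls_loss = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
    # p* gates the regression term: only positive anchors contribute.
    per_anchor = F.smooth_l1_loss(box_deltas, box_targets,
                                  reduction="none").sum(dim=1)
    reg_loss = (obj_targets * per_anchor).sum() / obj_targets.sum().clamp(min=1)
    return cls_loss + reg_loss
```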
In the second stage, feature aggregation is performed between object proposal features of the target frame and support frames in a video. These aggregated features are pooled by an RoIAlign operator and propagated to the detection head, which is designed to optimize multi-class classification and regression. For training, the detection loss is given by:
$$\mathcal{L}_{det} = \mathcal{L}_{cls}(p_{c}, y) + \mathcal{L}_{reg}(t, t^{*}) \qquad (2)$$

where $p_{c}$ represents the predicted class distribution and $y$ is the class label of an object in the target frame. For comprehensive details about the parameterization of the RPN and the detection head, we refer readers to [47].
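Equation 2 admits an analogous sketch, again with assumed shapes and the common convention that class index 0 denotes background; the BoxMask head introduced in Section 3.2 would contribute an additional mask loss on top of these two terms.

```python
import torch
import torch.nn.functional as F

def detection_loss(class_logits, box_deltas, class_targets, box_targets):
    """Sketch of Equation 2: L_det = L_cls(p_c, y) + L_reg(t, t*).

    class_logits:  (R, C+1) per-RoI scores over C classes + background
    class_targets: (R,) ground-truth class indices (y), 0 = background
    box_deltas:    (R, 4) predicted box parameterization (t)
    box_targets:   (R, 4) ground-truth parameterization (t*)
    """
    # Multi-class term: unlike the RPN, this decides *which* object.
    cls_loss = F.cross_entropy(class_logits, class_targets)
    fg = class_targets > 0  # regress only on foreground RoIs
    if fg.any():
        reg_loss = F.smooth_l1_loss(box_deltas[fg], box_targets[fg])
    else:
        reg_loss = box_deltas.sum() * 0.0  # no foreground in this batch
    return cls_loss + reg_loss
```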