BoxMask: Revisiting Bounding Box Supervision for Video Object Detection

Khurram Azeem Hashmi Alain Pagani Didier Stricker

Muhammamd Zeshan Afzal

DFKI - German Research Center for Artificial Intelligence

firstname[0] firstname[1].lastname@dfki.de

Abstract

We present a new, simple yet effective approach to improve video object detection. We observe that prior works operate on instance-level feature aggregation that inherently neglects the refined pixel-level representation, resulting in confusion among objects sharing similar appearance or motion characteristics. To address this limitation, we propose BoxMask, which effectively learns discriminative representations by incorporating class-aware pixel-level information. We simply consider bounding box-level annotations as a coarse mask for each object to supervise our method. The proposed module can be effortlessly integrated into any region-based detector to boost detection. Extensive experiments on the ImageNet VID and EPIC KITCHENS datasets demonstrate consistent and significant improvement when we plug our BoxMask module into numerous recent state-of-the-art methods.

1. Introduction

With the recent advancements in deep convolutional neural networks [32, 61, 56], object detection in still images has made remarkable progress [23, 47, 44, 52, 21]. The naive idea of applying image-based detectors to each frame to perform Video Object Detection (VOD) often underperforms, owing to deteriorated object appearance caused by motion blur, rare poses, and partial occlusions in videos. Therefore, exploiting the temporal information encoded in videos [67, 68, 58, 24] has become the de facto choice for tackling these challenges.

Earlier video object detection techniques utilizing temporal information mainly operate under two paradigms. The first category of methods applies post-processing on temporal information to make still-image object detection results [30, 36, 35, 3] more consistent and stable. Alternatively, the second group leverages feature aggregation of temporal information [67, 8, 58, 63, 11, 24]. Although these region-based state-of-the-art systems have greatly boosted the performance of VOD, they struggle to differentiate confusing objects with similar appearances or uniform motion attributes.

We observe that most of the previous approaches [67, 58, 24, 11] operate on instance-level feature aggregation that inherently neglects the refined pixel-level representation, resulting in acceptable localization but inferior classification. As illustrated in the first two rows of Figure 1, although the object detector exploits spatio-temporal context from support frames (t−s and t+s) to refine proposal features, it produces false positives by classifying background as a Bear and misclassifies a Watercraft as a Car at the target frame t. To overcome this hurdle, we design a novel module called BoxMask that exploits class-aware pixel-level temporal information to boost VOD. Inspired by [31] in still images, BoxMask predicts a class-aware segmentation mask for each region of interest along with the conventional classification and localization. Since this paper deals with the problem of object detection in videos, we investigate bounding box-level annotations to generate coarse masks that supervise our BoxMask network. The advantages of adopting our BoxMask head are twofold. First, the class-aware pixel-level features reduce the hard false positives between objects with low spatial and temporal inter-class variance. Second, since the size of the predicted mask is identical to the target region, fine-grained pixel-level learning assists the detector in precise localization. We summarize the main contributions of this paper as follows:

• We observe that object misclassification is the crucial obstacle that limits the upper bound of existing video object detection methods. We further revisit the idea of leveraging bounding box annotations to supervise both regression and mask prediction (see Figure 1).

• We propose BoxMask, an extremely simple yet effective module that learns additional discriminative representations by incorporating class-aware pixel-level information to boost VOD.

• Our BoxMask is a plug-and-play module and can be integrated into any region-based detection method. With our novel class-aware pixel-level learning introduced into recent state-of-the-art methods, we achieve an absolute gain of 1.8% in mAP and 2.1% in mAP on the ImageNet VID and EPIC KITCHENS benchmarks, respectively.

arXiv:2210.06008v1 [cs.CV] 12 Oct 2022

[Figure 1 image: detections on the support frames (t−s, t+s) and the target frame (t) for two examples, (a) bear and (b) watercraft. Top row: modern region-based VOD methods (object confusion and imprecise localization). Bottom row: after incorporation of the BoxMask head (reduced false positives, confident classification, and precise localization).]

Figure 1. Motivation. Despite leveraging spatio-temporal information from support frames t−s and t+s, modern VOD methods often misclassify objects with similar appearance and uniform motion characteristics. For instance, a moving object in the background is categorized as a bear in (a), while a Watercraft is mistaken for a Car in (b). To address this, we devise a simple BoxMask module that learns pixel-level features by introducing crucial discriminative cues to boost detection among confused object categories. Note that with fine-grained pixel-level learning, our BoxMask removes the misclassification of background in (a) and correctly categorizes the Watercraft in (b). Best viewed on screen.

2. Related Work

Object Detection in Images. Existing methods in image-based object detection can be mainly divided into single-stage detectors [42, 44, 45, 46, 9, 21] and multi-stage or region-based detectors [47, 6, 7, 26, 34]. Mask R-CNN [31] replaces RoI Pooling with RoIAlign and introduces an extra instance segmentation head that not only improves instance segmentation but also advances object detection. Cheng et al. [12] blame the weak classification head for inferior detections and propose ensembling the classification scores of Faster R-CNN [47] and R-CNN [23] as a remedy. IoU-Net [33] proposes a separate confidence mechanism for localization. Double-Head R-CNN [59] disentangles the detection head by handling classification with a fully connected head and regression with a convolution head. Along the same direction, the seminal work [51] incorporates TSD into a region-based detector [47], which learns different features for classification and regression. Later, separate losses are added to the overall loss function to optimize detection. Similar to these works in still images [31, 59, 51, 33], we observe that a naive sibling head in the region-based detector [47] confuses objects with similar motion characteristics and leads to sub-optimal video object detection.

Box-supervised Semantic and Instance Segmentation in Images. There has been an increasing trend of exploiting bounding box annotations to enhance weakly supervised instance and semantic segmentation approaches in still images [14, 39, 37, 41, 4]. The main reason is that bounding boxes contain knowledge about the precise location of each object, and they are approximately 35 times faster to annotate than per-pixel labeling [19, 2]. In a similar direction, our work exploits box-level annotations to generate coarse masks, eventually boosting video object detection.
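To make the idea of box-derived coarse masks concrete, the following is a minimal sketch that rasterizes ground-truth boxes into one binary mask per class. It is an illustration only: the function name `boxes_to_coarse_masks`, the `(x1, y1, x2, y2)` pixel convention, and the image-level (rather than per-RoI) mask layout are assumptions, not details taken from the paper.

```python
import numpy as np


def boxes_to_coarse_masks(boxes, labels, height, width, num_classes):
    """Rasterize boxes into a (num_classes, height, width) binary mask stack.

    boxes:  iterable of (x1, y1, x2, y2) in pixel coordinates (assumed convention).
    labels: integer class ids in [0, num_classes), one per box.
    Every pixel inside a box is marked as foreground for that box's class,
    so the mask is "coarse": it also covers background pixels within the box.
    """
    masks = np.zeros((num_classes, height, width), dtype=np.uint8)
    for (x1, y1, x2, y2), cls in zip(boxes, labels):
        # Clip to the image and convert to integer pixel indices.
        x1c = int(np.clip(np.floor(x1), 0, width))
        y1c = int(np.clip(np.floor(y1), 0, height))
        x2c = int(np.clip(np.ceil(x2), 0, width))
        y2c = int(np.clip(np.ceil(y2), 0, height))
        masks[cls, y1c:y2c, x1c:x2c] = 1
    return masks


# Hypothetical example: one box of class 2 in a 5x6 image with 3 classes.
example = boxes_to_coarse_masks([(1, 1, 3, 4)], [2], height=5, width=6, num_classes=3)
print(example[2])
```

Keeping one channel per class is what makes such a target class-aware; a detector that predicts masks per region of interest would presumably crop and resize the target to the RoI, which this sketch does not attempt.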

Object Detection in Videos. Prior methods for video object detection follow two directions. One direction exploits the redundancy in video frames by incorporating optical flow [68, 65], scale-time lattice [8], reinforcement learning capabilities [63], and heatmaps [62] to reduce the cost of the feature extraction process by propagating keyframe features to other frames in videos. Another line of work leverages temporal information encoded in videos to boost VOD, and our work follows this trend. Existing techniques exploit temporal information in two ways. The first way is to refine the detection results with post-processing methods [30, 36, 35]. Although these approaches improve the performance of VOD, they heavily rely on an image-based detector trained with no knowledge of temporal information. In contrast, the second direction capitalizes on temporal information during the training stage [67, 57, 20, 5, 25, 50, 58, 17, 68, 65, 8, 60, 16, 53, 10, 11, 24, 64]. Some of these methods utilize optical flow [18] to warp and aggregate features across frames [67, 57, 35]. Despite the improvement, optical flow-based methods fail in the case of occlusions. Most existing region-based VOD methods [67, 58, 24] tackle the inherent challenges by aggregating temporal features. However, they mainly rely on instance-level feature aggregation, which pays less attention to the content of object proposals, resulting in confusion between objects with similar appearance and motion characteristics. Very recently, TransVOD [64] introduced a transformer-based VOD method that extends Deformable DETR [66] with a temporal transformer to aggregate object queries from different video frames.

[Figure 2 image: block diagram. Support frames and the target frame pass through the backbone network and RPN to produce RoI proposals; RoI feature extraction (temporal RoI feature extraction for support frames) feeds semantic feature aggregation for detection, followed by the classification, regression, and BoxMask prediction heads and the loss. The legend marks the module introduced in recent VOD methods, proposals from support frames, proposals from the target frame, and the support frame feature flow.]

Figure 2. Architectural overview of modern region-based VOD methods, with our proposed modules highlighted in magenta. Alongside spatio-temporal features, our method introduces important class-aware pixel-level features, which effectively tackle object confusion to boost performance in modern region-based video object detection methods.

Tackling Object Confusion in Videos. Han et al. [29] are the first to highlight object confusion as the main problem in VOD. They propose exploiting inter-video and intra-video proposal relations to tackle object confusion. Other seminal works [27, 28] attempt to solve this problem by devising better feature aggregation schemes that enhance the target frame feature representation. Despite the gratifying improvement in detection, these approaches rely on a region-based detector that focuses more on discriminating between background and foreground regions than on differentiating between various foreground regions [12]. Moreover, these methods rely on complex pipelines to produce impressive results. Alternatively, we design a simple but effective BoxMask module that achieves similar performance when integrated into recent region-based VOD methods.

3. Method

This section first gives an overview of modern region-based detectors in VOD and examines their inherent misclassification problem in Section 3.1. We then explain the proposed BoxMask module and its learning mechanism in Sections 3.2 and 3.3, respectively.

3.1. Revisiting Region-based Detectors in VOD

Figure 2 depicts an overview of region-based detectors in VOD. First, a backbone network extracts spatial features from the target frame (the actual frame on which detection needs to be executed) and support frames (other video frames that assist the detection on a target frame). Subsequently, a Region Proposal Network (RPN) [47] predicts object proposals for each frame and aims to minimize the regression loss $L_{reg}$ and classification loss $L_{cls}$, defined as:

$$L_{rpn} = L_{cls}(p, p^*) + p^* \cdot L_{reg}(t, t^*) \tag{1}$$

where $p$ is the estimated probability of a proposal being an object and $p^*$ is 1 or 0 depending on the label of the anchor box. The term $t$ denotes the coordinates of the predicted object proposal, and $t^*$ is the ground truth. Note that the classification loss $L_{cls}$ in Equation 1 only focuses on improving the objectness of proposals rather than object classification.
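As a rough illustration of Equation 1, the sketch below evaluates the RPN loss for a handful of anchors. The text does not state the concrete loss forms here, so several choices are assumptions borrowed from common Faster R-CNN practice: binary cross-entropy for $L_{cls}$, smooth L1 for $L_{reg}$, and a plain mean over anchors instead of any normalization or balancing weights. The function names are hypothetical.

```python
import numpy as np


def binary_cross_entropy(p, p_star, eps=1e-7):
    """Per-anchor objectness loss (assumed form of L_cls in Equation 1)."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))


def smooth_l1(t, t_star, beta=1.0):
    """Per-anchor box loss summed over 4 coordinates (assumed form of L_reg)."""
    diff = np.abs(t - t_star)
    per_coord = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_coord.sum(axis=-1)


def rpn_loss(p, p_star, t, t_star):
    """L_rpn = L_cls(p, p*) + p* . L_reg(t, t*), averaged over anchors.

    The factor p* switches the regression term off for negative anchors,
    so only anchors labelled as objects contribute to box regression.
    """
    cls_term = binary_cross_entropy(p, p_star)
    reg_term = p_star * smooth_l1(t, t_star)
    return float((cls_term + reg_term).mean())


# Two anchors: one positive with a perfect box, one negative with a bad box.
p = np.array([0.5, 0.5])
p_star = np.array([1.0, 0.0])
t = np.array([[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]])
t_star = np.zeros((2, 4))
print(rpn_loss(p, p_star, t, t_star))
```

In this example the badly regressed second anchor adds nothing to the regression term because its label $p^*$ is 0, which mirrors the observation that the RPN objective concerns objectness and box quality for positive anchors only, not the object class.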

In the second stage, feature aggregation is performed between object proposal features of the target frame and support frames in a video. These aggregated features are pooled by an RoIAlign pooling operator and propagated to the detection head, which is designed to optimize multi-class classification and regression. For training, the detection loss is given by:

$$L_{det} = L_{cls}(p_c, y) + L_{reg}(t, t^*) \tag{2}$$

where $p_c$ represents the predicted class distribution and $y$ is the class label of an object in a target frame. For comprehensive details about the parameterization of RPN and
