
BoxMask: Revisiting Bounding Box Supervision for Video Object Detection
Khurram Azeem Hashmi Alain Pagani Didier Stricker
Muhammad Zeshan Afzal
DFKI - German Research Center for Artificial Intelligence
firstname[0] firstname[1].lastname@dfki.de
Abstract
We present a new, simple yet effective approach to improve
video object detection. We observe that prior works operate
on instance-level feature aggregation that inherently
neglects the refined pixel-level representation, resulting in
confusion among objects sharing similar appearance or
motion characteristics. To address this limitation, we pro-
pose BoxMask, which effectively learns discriminative rep-
resentations by incorporating class-aware pixel-level infor-
mation. We simply consider bounding box-level annotations
as a coarse mask for each object to supervise our method.
The proposed module can be effortlessly integrated into any
region-based detector to boost detection. Extensive exper-
iments on ImageNet VID and EPIC KITCHENS datasets
demonstrate consistent and significant improvement when
we plug our BoxMask module into numerous recent state-
of-the-art methods.
1. Introduction
With the recent advancements in deep convolutional neu-
ral networks [32, 61, 56], object detection in still images
has made remarkable progress [23, 47, 44, 52, 21]. The
naive idea of applying image-based detectors on each frame
to perform Video Object Detection (VOD) often under-
performs, owing to the deteriorated object appearance due
to motion blur, rare poses, and partial occlusions in videos.
Therefore, exploiting the encoded temporal information in
videos [67, 68, 58, 24] has become a de facto choice to
tackle these challenges.
Earlier video object detection techniques utilizing tem-
poral information mainly operate under two paradigms. The
first category of methods applies post-processing on tem-
poral information to make still image object detection re-
sults [30, 36, 35, 3] more consistent and stable. Alterna-
tively, the second group leverages the feature aggregation
of temporal information [67, 8, 58, 63, 11, 24]. Although these
region-based state-of-the-art systems have greatly boosted
the performance of VOD, they struggle to differentiate
confusable objects with similar appearances or uniform
motion attributes.
We observe that most of the previous approaches [67, 58,
24, 11] operate on instance-level feature aggregation that
inherently neglects the refined pixel-level representation,
resulting in acceptable localization but inferior classifica-
tion. As illustrated in the first two rows of Figure 1, al-
though the object detector exploits spatio-temporal context
from support frames (t−s and t+s) to refine proposal
features, it produces false positives by classifying background
as a Bear and misclassifies a Watercraft as a Car
at the target frame t. To overcome this hurdle, we design
a novel module called BoxMask that exploits class-aware
pixel-level temporal information to boost VOD. Inspired
by [31] in still images, the BoxMask predicts a class-aware
segmentation mask for each region of interest along with
the conventional classification and localization. Since this
paper deals with the problem of object detection in videos,
we investigate bounding box-level annotation to generate
coarse masks which supervise our BoxMask network. The
advantages of adopting our BoxMask head are twofold.
First, the class-aware pixel-level features reduce the hard
false positives between objects with low spatial and tempo-
ral inter-class variance. Second, since the size of the pre-
dicted mask is identical to the target region, fine-grained
pixel-level learning assists the detector in precise localization.
We summarize the main contributions of this paper as
follows:
• We observe that object misclassification is the crucial
obstacle that limits the upper bound of existing video
object detection methods. We further revisit the idea of
leveraging bounding box annotations to supervise both
regression and mask prediction (see Figure 1).
• We propose BoxMask, an extremely simple yet effec-
tive module that learns additional discriminative repre-
sentations by incorporating class-aware pixel-level in-
formation to boost VOD.
• Our BoxMask is a plug-and-play module and can
be integrated into any region-based detection method.
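To make the idea of treating bounding box annotations as coarse masks concrete, the following is a minimal sketch and not the implementation used in this paper. It rasterizes labelled boxes into a class-aware mask in which every pixel inside a box carries that box's class id and all other pixels are background. The helper name boxes_to_coarse_mask, the (x1, y1, x2, y2) pixel convention with exclusive x2 and y2, the background id 0, and the rule that smaller boxes are painted last where boxes overlap are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed convention: (x1, y1, x2, y2) in pixels, with x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def boxes_to_coarse_mask(
    boxes: Sequence[Box],
    labels: Sequence[int],
    height: int,
    width: int,
    background: int = 0,
) -> List[List[int]]:
    """Rasterize labelled boxes into a class-aware coarse mask.

    Hypothetical helper: each pixel inside a box receives that box's class
    id; pixels outside every box keep the background id.
    """
    if len(boxes) != len(labels):
        raise ValueError("boxes and labels must have the same length")

    mask = [[background] * width for _ in range(height)]

    # Assumption: paint larger boxes first so that smaller boxes remain
    # visible where two boxes overlap.
    order = sorted(
        range(len(boxes)),
        key=lambda i: (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1]),
        reverse=True,
    )
    for i in order:
        x1, y1, x2, y2 = boxes[i]
        # Clip the box to the image bounds before filling.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        for y in range(y1, y2):
            row = mask[y]
            for x in range(x1, x2):
                row[x] = labels[i]
    return mask


if __name__ == "__main__":
    # Two overlapping boxes with illustrative class ids 2 and 7.
    demo = boxes_to_coarse_mask(
        [(1, 1, 5, 4), (3, 2, 6, 5)], [2, 7], height=6, width=8
    )
    for row in demo:
        print(row)
```

A mask of this kind could serve as the supervision target for a mask head in place of a true instance segmentation. How overlapping boxes are resolved, and whether the target is built per image or per region of interest, are design choices this sketch does not settle.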
arXiv:2210.06008v1 [cs.CV] 12 Oct 2022