To this end, we focus on investigating how
to utilize these image-level annotated data to improve the
performance of long-tailed object detection.
In this paper, we propose a novel long-tailed object detection
framework named CLIS (multi-task Collaborative Learning
with Image-level Supervision), which incorporates additional
image-level supervision into the learning of object detectors
in a multi-task collaborative way. As demonstrated in Fig. 1,
there are two main tasks in our framework: an object detec-
tion task (consisting of an instance-classification task and a
localization task) and an image-classification task. They are
responsible for the two types of supervision, respectively.
Since the major cause of the performance degradation for
long-tailed object detectors is the inaccurate prediction of
the instance-classification task [12], CLIS mainly focuses on
improving the performance for this task with the help of extra
knowledge collaboratively learned by the image-classification
task.
To achieve this goal, three key components are designed
in our framework. First, we adopt task-specialized
sub-networks to learn task-specific representations. This
disentangles the features of the localization task from those of
the two classification tasks, giving them a clear division of
labor. Second, a siamese sub-network is
introduced for the image-classification task, which brings its
knowledge to the instance-classification task by parameter
sharing. This siamese structure enriches the feature representations
of the instance-classification task, which in turn enhances the
long-tailed object detection ability. Finally, because the two
classification tasks receive data from two different types of
supervision, there is a feature gap between them during
knowledge sharing, which prevents image-level supervision from
being fully exploited in our framework. To address this problem,
we propose a contrastive learning regularization method to
bridge the feature gap between the two classification tasks,
keeping them consistent through a contrastive loss. Through the
synergy of these components, CLIS can collaboratively
learn knowledge across multiple tasks, taking full advantage of
additional data to improve detection performance.
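To make the idea of bridging the feature gap concrete, the following is a minimal sketch of a cross-task contrastive regularizer. It is not the exact loss used by CLIS: it assumes an InfoNCE-style formulation in which each instance-level feature is pulled toward image-level features of the same category and pushed away from the others. The function name `cross_task_contrastive_loss`, the `temperature` parameter, and the plain-list feature format are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_task_contrastive_loss(inst_feats, inst_labels,
                                img_feats, img_labels, temperature=0.1):
    """Assumed InfoNCE-style loss between the two classification tasks.

    inst_feats / img_feats: lists of feature vectors from the
    instance-classification and image-classification branches.
    Positives for an instance feature are image-level features that
    carry the same category label.
    """
    inst = [l2_normalize(f) for f in inst_feats]
    img = [l2_normalize(f) for f in img_feats]
    losses = []
    for feat, label in zip(inst, inst_labels):
        logits = [dot(feat, g) / temperature for g in img]
        positives = [i for i, y in enumerate(img_labels) if y == label]
        if not positives:
            continue  # no image-level sample of this category in the batch
        # Numerically stable log-sum-exp over all image-level features.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Mean negative log-probability of the positives.
        losses.append(-sum(logits[i] - log_denom for i in positives)
                      / len(positives))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch the loss is small when same-category features from the two branches already point in the same direction, and large when they do not, which is the consistency property the regularization is meant to encourage.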
Extensive experiments are conducted to demonstrate the
effectiveness of our proposed method. On the challenging
LVISv1.0 [5] benchmark with the image-level supervision
from the ImageNet-22k [15] dataset, our approach achieves
an overall AP of 31.1, bringing significant improvement for
rare categories and establishing a new state-of-the-art. Experi-
mental results for other tasks, e.g. instance segmentation, also
demonstrate the generalization ability of our method. Moreover,
although it is trained with additional data, our proposed
framework introduces negligible computational cost during
inference, making it a practical method for realistic long-tailed
scenarios.
II. RELATED WORK
A. Long-tailed Object Detection
Long-tailed object detection is a challenging vision task
receiving growing attention today. Common solutions for this
task are data re-sampling [5], [7], [12] and cost-sensitive
learning [8]–[11], [16]–[18], which re-balance the contribution
of different categories or instances to achieve balanced
training. Decoupled training methods [12], [13], [19],
[20] decouple the learning of representation and classifier
into two separate stages to address the classifier imbalance
problem. Other methods explore incremental learning [21],
causal inference [22], and so on. Nevertheless, all these
methods address the long-tailed problem using training data with
only instance-level supervision, and the scarcity of instances in
rare categories prevents the model from learning discriminative
features for classification. In contrast, our method makes use
of extra image-level annotations to improve the classification
ability of the long-tailed object detectors.
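As an illustration of the data re-sampling family mentioned above, the sketch below computes repeat factors in the style of repeat factor sampling, a scheme commonly used with LVIS. That this is the specific variant meant by the citations is an assumption. Each category c receives a factor max(1, sqrt(t / f_c)), where f_c is the fraction of training images containing c and t is a threshold; an image is then repeated according to the largest factor among its categories. The function name `repeat_factors` and the input format (one set of category names per image) are hypothetical.

```python
import math
from collections import Counter


def repeat_factors(image_categories, threshold=0.001):
    """Return one repeat factor per image.

    image_categories: list with one set of category ids per image.
    threshold: the frequency t below which a category gets oversampled.
    """
    num_images = len(image_categories)
    # Count how many images contain each category.
    counts = Counter(c for cats in image_categories for c in set(cats))
    # Category-level factor: max(1, sqrt(t / f_c)).
    category_factor = {
        c: max(1.0, math.sqrt(threshold / (n / num_images)))
        for c, n in counts.items()
    }
    # Image-level factor: the largest factor among the image's categories
    # (1.0 for images without annotations).
    return [
        max((category_factor[c] for c in cats), default=1.0)
        for cats in image_categories
    ]
```

Images containing only frequent categories keep a factor of 1, while an image containing a rare category is repeated more often, so rare categories are seen more frequently during training.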
B. Object Detection with Image-Level Supervision
Many works adopt image-level supervision
in the object detection task. Weakly-supervised object detection
(WSOD) [23]–[29] trains object detectors from images
with only image-level supervision, formulating the task as
a multiple instance learning (MIL) problem. Due to the lack
of location information, the accuracy of these methods is far
behind that of supervised object detectors, especially in some
complex scenes. Semi-supervised object detection [30]–[33]
trains on instance-level supervised data together with unlabeled
images. Semi-supervised WSOD methods [34]–[40] learn detectors
with additional image-level supervision,
a setting similar to that of our method. Among them,
DLWL [38] and MosaicOS [39] improve the performance
of low-shot categories with image-level supervision either
by a linear program constraint or a multi-stage self-training
framework. However, all these methods use the image-level
annotations as weak supervision to generate boxes
for the detection task, so they depend heavily on the
accuracy of the pseudo-label generation algorithm and may
introduce too much noise. Besides, the recently proposed
method Detic [40] trains the classifier of the detector on
data from both types of supervision, which can
be viewed as a multi-task training process. However, it does
not take into account the feature entanglement of different
tasks, let alone bridge the feature gaps among them. In this
paper, we treat the learning of the two types of supervision
as multiple tasks with a clear division of labor. Based on our proposed
framework, different tasks can be trained collaboratively,
taking full advantage of image-level supervision to improve
the long-tailed object detection performance.
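For readers unfamiliar with the MIL formulation of WSOD mentioned above, the following sketch shows one common way to turn per-proposal scores into an image-level prediction that can be trained with image-level labels alone. It follows the two-stream aggregation popularized by WSDDN (a softmax over classes multiplied by a softmax over proposals, then summed over proposals); whether each cited WSOD method uses exactly this form is an assumption. The function names `mil_image_scores` and `image_level_bce` are hypothetical.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_image_scores(cls_logits, det_logits):
    """Aggregate proposal scores into one score per class for the image.

    cls_logits, det_logits: [num_proposals][num_classes] nested lists.
    """
    num_props = len(cls_logits)
    num_classes = len(cls_logits[0])
    # Classification stream: which class does each proposal look like?
    cls_probs = [softmax(row) for row in cls_logits]
    # Detection stream: which proposal best explains each class?
    det_probs = [softmax([det_logits[p][c] for p in range(num_props)])
                 for c in range(num_classes)]
    # Image-level score: sum over proposals of the element-wise product.
    return [sum(cls_probs[p][c] * det_probs[c][p] for p in range(num_props))
            for c in range(num_classes)]


def image_level_bce(scores, labels, eps=1e-7):
    """Binary cross-entropy between image-level scores and 0/1 labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(scores)
```

Because no box annotation enters the loss, the localization quality depends entirely on which proposals the detection stream happens to select, which is consistent with the accuracy gap to fully supervised detectors noted above.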
C. Multi-Task Learning and Collaborative Learning
Multi-task learning approaches [41]–[45] jointly predict
multiple outputs for a series of tasks through a shared feature
encoder or representation. They aim to improve the performance
of all tasks by sharing knowledge among them.
However, in this work, our framework mainly focuses on
learning the tasks related to long-tailed object detection, while
the task for image-level supervision is utilized to bring its
knowledge to improve the performance of detectors. Col-
laborative learning methods [46]–[49] are usually applied