Improving Long-tailed Object Detection with
Image-Level Supervision by Multi-Task
Collaborative Learning
Bo Li, Yongqiang Yao, Jingru Tan, Xin Lu, Fengwei Yu, Ye Luo, and Jianwei Lu
Abstract—Data in real-world object detection often exhibits
a long-tailed distribution. Existing solutions tackle this problem
by mitigating the competition between the head and tail
categories. However, due to the scarcity of training samples,
tail categories are still unable to learn discriminative representations.
Bringing more data into the training may alleviate
the problem, but collecting instance-level annotations is an
excruciating task. In contrast, image-level annotations are easily
accessible but not fully exploited. In this paper, we propose a
novel framework CLIS (multi-task Collaborative Learning with
Image-level Supervision), which leverages image-level supervision
to enhance the detection ability in a multi-task collaborative
way. Specifically, there are an object detection task (consisting
of an instance-classification task and a localization task) and
an image-classification task in our framework, responsible for
utilizing the two types of supervision. Different tasks are trained
collaboratively through three key designs: (1) task-specialized
sub-networks that learn specific representations of different tasks
without feature entanglement; (2) a siamese sub-network for
the image-classification task that shares its knowledge with the
instance-classification task, resulting in feature enrichment of
detectors; and (3) a contrastive learning regularization that maintains
representation consistency, bridging feature gaps between different
types of supervision. Extensive experiments are conducted on the
challenging LVIS dataset. Without sophisticated loss engineering,
CLIS achieves an overall AP of 31.1 with a 10.1-point improvement
on tail categories, establishing a new state of the art. Code will
be available at https://github.com/waveboo/CLIS.
Index Terms—Long-tailed Object Detection, Multi-Task
Collaborative Learning, Image-Level Supervision
I. INTRODUCTION
GENERAL object detection [1], [2] has achieved great
progress thanks to deep neural networks. However, these
methods are mainly evaluated on balanced datasets (e.g., PASCAL
VOC [3] and MS COCO [4]), in which the instance
numbers of all categories are close. When it comes to a
more realistic scenario (e.g., LVIS [5]), the categories usually
follow a long-tailed distribution, where a few head categories
contain plenty of instances while most tail categories are
instance-scarce. In practice, tail categories often show poor
Bo Li and Ye Luo are with Tongji University, Shanghai, China. E-mail:
1911030@tongji.edu.cn, yeluo@tongji.edu.cn.
Jingru Tan is with Shanghai Jiao Tong University, Shanghai, China. E-mail:
tanjingru120@gmail.com.
Yongqiang Yao, Xin Lu, and Fengwei Yu are with SenseTime Research,
Shanghai, China. E-mail: soundbupt@gmail.com, luxin@sensetime.com,
yufengwei@sensetime.com
Jianwei Lu is with Shanghai University of Traditional Chinese Medicine,
Shanghai, China. E-mail: jwlu33@shutcm.edu.cn.
Corresponding author. Equal Contribution.
[Figure 1 panels: "Object Detection on Instance-Level" (box1: [x_1, y_1, w_1, h_1, c_1], box2: [x_2, y_2, w_2, h_2, c_2]) and "Image Classification on Image-Level" (four images, each with a class label). Task blocks: Localization, Instance Classification, Image Classification; the two classification tasks are linked by Collaborative Learning.]
Fig. 1. An overview of our framework. Our framework involves three tasks,
namely the localization task, the instance-classification task, and the image-
classification task. Two classification tasks are learned collaboratively for
improving the detection ability.
performance [5], [6]. The main difficulty lies in two aspects:
On the one hand, tail categories are easily overwhelmed by the
dominant head categories due to extreme imbalance. On the
other hand, deep learning methods are data-hungry, while the
number of instances for tail categories may not be sufficient
to learn good feature representations.
Most existing solutions try to address the long-tailed prob-
lem from the first perspective. They re-balance the contri-
bution of different categories by data re-sampling [5], [7],
cost-sensitive learning [8]–[11], decoupled training [12], [13],
and so on. However, all these methods investigate the long-
tailed problem under limited bounding-box annotations. The
performance improvement mainly comes from a seesaw game
that decreases the score ranks of the head categories and
increases them for tail categories [14]. Tail categories are
still unable to learn discriminative feature representations. If
we train detectors with only limited tail category annotations,
generalization and performance cannot be guaranteed.
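As an illustration of the re-balancing family discussed above, the following is a minimal sketch of repeat factor sampling, the re-sampling scheme introduced with LVIS [5]. Each category c gets a repeat factor max(1, sqrt(t / f_c)), where f_c is the fraction of training images containing c, and each image is repeated according to the largest factor among its categories. The function name `repeat_factors`, the input layout, and the default threshold value are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def repeat_factors(image_categories, threshold=0.001):
    """Return one repeat factor per image (hypothetical helper).

    image_categories: list of category lists, one list per training image.
    threshold: frequency t below which a category starts being over-sampled.
    """
    num_images = len(image_categories)
    # Count, for every category, the number of images that contain it.
    image_counts = Counter(c for cats in image_categories for c in set(cats))
    # Category-level factor: max(1, sqrt(t / f_c)).
    category_repeat = {
        c: max(1.0, math.sqrt(threshold / (n / num_images)))
        for c, n in image_counts.items()
    }
    # Image-level factor: the largest factor among the image's categories.
    return [
        max((category_repeat[c] for c in set(cats)), default=1.0)
        for cats in image_categories
    ]


# 100 images: "person" appears in all of them, "walrus" in only one.
images = [["person"]] * 99 + [["person", "walrus"]]
factors = repeat_factors(images, threshold=0.04)
```

Note that this only changes how often existing boxes are seen. It adds no new visual evidence for the tail categories, which is the limitation the paragraph above points out.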
Different from the rebalance-based methods, we hope to
solve the long-tailed problem from the second perspective,
exploiting more training data to alleviate the instance-scarce
problem for better feature representations. However, collecting
images with instance-level supervision (i.e., bounding-box
annotations) is a daunting task that requires a lot of effort
and resources. In contrast, images with only image-level
supervision (i.e., category labels) can be easily collected from
existing datasets (e.g., ImageNet [15]) or Internet search engines.
arXiv:2210.05568v1 [cs.CV] 11 Oct 2022
To this end, we focus on investigating how
to utilize such image-level annotated data to improve the
performance of long-tailed object detection.
In this paper, we propose a novel long-tailed object detection
framework named CLIS (multi-task Collaborative Learning
with Image-level Supervision), which incorporates additional
image-level supervision into the learning of object detectors
in a multi-task collaborative way. As demonstrated in Fig. 1,
there are two main tasks in our framework: an object detec-
tion task (consisting of an instance-classification task and a
localization task) and an image-classification task. They are
responsible for the two types of supervision, respectively.
Since the major cause of the performance degradation for
long-tailed object detectors is the inaccurate prediction of
the instance-classification task [12], CLIS mainly focuses on
improving the performance for this task with the help of extra
knowledge collaboratively learned by the image-classification
task.
To achieve this goal, we design three key components
in our framework. First, we adopt task-specialized
sub-networks to learn specific representations for
different tasks. This disentangles the features of the localization
task and the two classification tasks, giving them
a clear division of labor. Second, a siamese sub-network is
introduced for the image-classification task, which brings its
knowledge to the instance-classification task through parameter
sharing. This siamese structure enriches the feature representations
of the instance-classification task, which in turn enhances
long-tailed object detection. Finally, because the two
classification tasks receive data with two different types of
supervision, a feature gap arises between them during
knowledge sharing, preventing image-level supervision from
being fully exploited in our framework. To address this problem,
we propose a contrastive learning regularization method that
bridges the feature gap between the two classification tasks,
keeping them consistent through a contrastive loss. Through the
synergy of these components, CLIS can collaboratively
learn knowledge across multiple tasks, taking full advantage of
additional data to improve detection performance.
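To make the second and third components more concrete, here is a minimal, framework-free sketch of the two ideas: a classifier whose parameters are shared by the instance-classification and image-classification branches, and a contrastive term that pulls same-category features from the two branches together. This section does not give the exact loss used in CLIS, so the InfoNCE-style formulation, the temperature value, and all names (`SharedClassifier`, `contrastive_consistency_loss`) are assumptions for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [a / norm for a in v]


class SharedClassifier:
    """One weight matrix reused by both classification branches
    (a stand-in for siamese parameter sharing; hypothetical class)."""

    def __init__(self, weights):
        self.weights = weights  # one weight row per category

    def logits(self, feature):
        return [dot(row, feature) for row in self.weights]


def contrastive_consistency_loss(instance_feats, instance_labels,
                                 image_feats, image_labels, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated loss).

    Each instance-level feature is pulled toward image-level features of
    the same category and pushed away from those of other categories.
    """
    image_feats = [l2_normalize(f) for f in image_feats]
    losses = []
    for feat, label in zip(instance_feats, instance_labels):
        feat = l2_normalize(feat)
        sims = [dot(feat, g) / temperature for g in image_feats]
        positives = [s for s, l in zip(sims, image_labels) if l == label]
        if not positives:
            continue  # no image-level feature of this category in the batch
        max_sim = max(sims)  # subtract the max for numerical stability
        log_denominator = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims))
        losses.extend(log_denominator - s for s in positives)
    return sum(losses) / len(losses) if losses else 0.0


# The same classifier scores features from either branch.
clf = SharedClassifier([[1, 0], [0, 1]])
instance_scores = clf.logits([1, 0])

inst = [[1, 0], [0, 1]]
aligned = contrastive_consistency_loss(inst, [0, 1], [[1, 0], [0, 1]], [0, 1])
swapped = contrastive_consistency_loss(inst, [0, 1], [[0, 1], [1, 0]], [0, 1])
```

In this toy setup the loss is close to zero when the two branches agree on each category and large when their features are swapped, which is the kind of consistency pressure the regularization is meant to apply.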
Extensive experiments are conducted to demonstrate the
effectiveness of our proposed method. On the challenging
LVISv1.0 [5] benchmark with the image-level supervision
from the ImageNet-22k [15] dataset, our approach achieves
an overall AP of 31.1, bringing significant improvement for
rare categories and establishing a new state of the art. Experimental
results for other tasks, e.g., instance segmentation, also
demonstrate the generalization ability of our method. Meanwhile,
although it is trained with additional data, our proposed
framework introduces negligible computational cost during
inference, making it a practical method for realistic long-tailed
scenarios.
II. RELATED WORK
A. Long-tailed Object Detection
Long-tailed object detection is a challenging vision task
receiving growing attention today. General solutions for this
task are data re-sampling [5], [7], [12] and cost-sensitive
learning [8]–[11], [16]–[18] that re-balance the contribution
of different categories or instances to achieve a balanced
training status. Decoupled training methods [12], [13], [19],
[20] decouple the learning of representation and classifier
into two separated stages to address the classifier imbalance
problem. There are also other methods based on
incremental learning [21], causal inference [22],
and so on. Nevertheless, all these methods try to solve the
long-tailed problem using training data with only instance-level
supervision, and the scarcity of instances for rare
categories prevents the model from learning discriminative
features for classification. In contrast, our method makes use
of extra image-level annotations to improve the classification
ability of the long-tailed object detectors.
B. Object Detection with Image-Level Supervision
There are plenty of works that adopt image-level supervision
in the object detection task. Weakly-supervised object detection
(WSOD) [23]–[29] trains object detectors from images
with only image-level supervision, formulating the task as
a multiple instance learning (MIL) problem. Due to the lack
of location information, the accuracy of these methods is far
behind that of supervised object detectors, especially in some
complex scenes. Semi-supervised object detection [30]–[33]
trains on instance-level supervised data together with unlabeled
images. Semi-supervised WSOD methods [34]–[40]
learn detectors with additional image-level supervision,
a setting similar to that of our method. Among them,
DLWL [38] and MosaicOS [39] improve the performance
of low-shot categories with image-level supervision either
by a linear program constraint or a multi-stage self-training
framework. However, all these methods use the image-level
annotations as weak supervision to generate boxes
for the detection task, so they depend heavily on the
accuracy of the pseudo-label generation algorithm and may
introduce considerable noise. Besides, the recently proposed
method Detic [40] trains the classifier of the detector on
data from the two types of supervision, which can
be viewed as a multi-task training process. However, it does
not take into account the feature entanglement of different
tasks, let alone bridge the feature gaps among them. In this
paper, we treat learning from the two types of supervision
as multiple tasks with a clear division of labor. Based on our proposed
framework, different tasks could be trained collaboratively,
taking full advantage of image-level supervision to improve
the long-tailed object detection performance.
C. Multi-Task Learning and Collaborative Learning
Multi-task learning approaches [41]–[45] learn to predict
multiple outputs for a series of tasks jointly by a shared feature
encoder/representation. They aim to improve the performance
of all tasks by knowledge sharing between different tasks.
However, in this work, our framework mainly focuses on
learning the tasks related to long-tailed object detection, while
the task for image-level supervision is utilized to bring its
knowledge to improve the performance of detectors. Col-
laborative learning methods [46]–[49] are usually applied