Improving Long-tailed Object Detection with Image-Level Supervision by Multi-Task Collaborative Learning

Bo Li†, Yongqiang Yao†, Jingru Tan∗, Xin Lu, Fengwei Yu, Ye Luo, and Jianwei Lu

Abstract—Data in real-world object detection often exhibits a long-tailed distribution. Existing solutions tackle this problem by mitigating the competition between head and tail categories. However, due to the scarcity of training samples, tail categories are still unable to learn discriminative representations. Bringing more data into training may alleviate the problem, but collecting instance-level annotations is an excruciating task. In contrast, image-level annotations are easily accessible but not fully exploited. In this paper, we propose a novel framework, CLIS (multi-task Collaborative Learning with Image-level Supervision), which leverages image-level supervision to enhance detection ability in a multi-task collaborative way. Specifically, our framework contains an object detection task (consisting of an instance-classification task and a localization task) and an image-classification task, which are responsible for utilizing the two types of supervision. The tasks are trained collaboratively through three key designs: (1) task-specialized sub-networks that learn task-specific representations without feature entanglement; (2) a siamese sub-network for the image-classification task that shares its knowledge with the instance-classification task, enriching the features of detectors; and (3) a contrastive learning regularization that maintains representation consistency, bridging the feature gap between the different types of supervision. Extensive experiments are conducted on the challenging LVIS dataset. Without sophisticated loss engineering, CLIS achieves an overall AP of 31.1 with a 10.1-point improvement on tail categories, establishing a new state of the art. Code will be available at https://github.com/waveboo/CLIS.

Index Terms—Long-tailed Object Detection, Multi-Task Collaborative Learning, Image-Level Supervision

I. INTRODUCTION

GENERAL object detection [1], [2] has achieved great progress thanks to deep neural networks. However, these methods are mainly studied on balanced datasets (e.g., PASCAL VOC [3] and MS COCO [4]), in which the instance numbers of all categories are close. In a more realistic scenario (e.g., LVIS [5]), the categories usually follow a long-tailed distribution, where a few head categories contain plenty of instances while most tail categories are instance-scarce. In practice, tail categories often show poor performance [5], [6]. The main difficulty lies in two aspects. On the one hand, tail categories are easily overwhelmed by the dominant head categories due to extreme imbalance. On the other hand, deep learning methods are data-hungry, while the number of instances for tail categories may not be sufficient to learn good feature representations.

Bo Li and Ye Luo are with Tongji University, Shanghai, China. E-mail: 1911030@tongji.edu.cn, yeluo@tongji.edu.cn.
Jingru Tan is with Shanghai Jiao Tong University, Shanghai, China. E-mail: tanjingru120@gmail.com.
Yongqiang Yao, Xin Lu, and Fengwei Yu are with SenseTime Research, Shanghai, China. E-mail: soundbupt@gmail.com, luxin@sensetime.com, yufengwei@sensetime.com.
Jianwei Lu is with Shanghai University of Traditional Chinese Medicine, Shanghai, China. E-mail: jwlu33@shutcm.edu.cn.
∗Corresponding author. †Equal contribution.

[Figure 1. Panel labels: "Object Detection on Instance-Level" (box annotations), "Image Classification on Image-Level" (class labels), "Localization", "Instance Classification", "Image Classification", "Collaborative Learning".]

Fig. 1. An overview of our framework. Our framework involves three tasks, namely the localization task, the instance-classification task, and the image-classification task. The two classification tasks are learned collaboratively to improve the detection ability.

Most existing solutions try to address the long-tailed problem from the first perspective. They re-balance the contribution of different categories by data re-sampling [5], [7], cost-sensitive learning [8]–[11], decoupled training [12], [13], and so on. However, all these methods investigate the long-tailed problem under limited bounding-box annotations. The performance improvement mainly comes from a seesaw game that decreases the score ranks of head categories and increases those of tail categories [14]. Tail categories are still unable to learn discriminative feature representations. If detectors are trained with only limited tail-category annotations, generalization and performance cannot be guaranteed.

Different from the rebalance-based methods, we aim to solve the long-tailed problem from the second perspective, exploiting more training data to alleviate instance scarcity and obtain better feature representations. However, collecting images with instance-level supervision (i.e., bounding-box annotations) is a daunting task that requires a lot of effort and resources. In contrast, images with only image-level supervision (i.e., category labels) can be easily collected from existing datasets (e.g., ImageNet [15]) or Internet search engines. To this end, we focus on investigating how to utilize such image-level annotated data to improve the performance of long-tailed object detection.

arXiv:2210.05568v1 [cs.CV] 11 Oct 2022

In this paper, we propose a novel long-tailed object detection framework named CLIS (multi-task Collaborative Learning with Image-level Supervision), which incorporates additional image-level supervision into the learning of object detectors in a multi-task collaborative way. As illustrated in Fig. 1, there are two main tasks in our framework: an object detection task (consisting of an instance-classification task and a localization task) and an image-classification task. They are responsible for the instance-level and image-level supervision, respectively. Since the major cause of performance degradation in long-tailed object detectors is inaccurate prediction in the instance-classification task [12], CLIS mainly focuses on improving this task with the help of extra knowledge collaboratively learned by the image-classification task.

To achieve this goal, we design three key components. First, we adopt task-specialized sub-networks to learn task-specific representations. This disentangles the features of the localization task and the two classification tasks, giving them a clear division of labor. Second, we introduce a siamese sub-network for the image-classification task, which brings its knowledge to the instance-classification task through parameter sharing. This siamese structure enriches the feature representations of the instance-classification task, which in turn enhances long-tailed object detection. Finally, because the two classification tasks receive data from two different types of supervision, a feature gap arises between them during knowledge sharing, preventing image-level supervision from being fully exploited in our framework. To address this problem, we propose a contrastive learning regularization that bridges the feature gap between the two classification tasks, keeping them consistent through a contrastive loss. Through the synergy of these components, CLIS can collaboratively learn knowledge across multiple tasks, taking full advantage of additional data to improve detection performance.
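As a rough illustration of the contrastive regularization described above, the following minimal sketch computes an InfoNCE-style loss between features from the instance-classification branch and features from the image-classification branch. The InfoNCE form, the cosine similarity, the temperature value, and the names `cosine` and `contrastive_loss` are assumptions made for illustration; the text above only states that a contrastive loss keeps the two classification tasks consistent.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(instance_feats, image_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    instance_feats[i] and image_feats[i] are treated as a positive pair,
    e.g. the same category seen under the two types of supervision.
    All other image_feats act as negatives for instance_feats[i].
    """
    total = 0.0
    for i, feat in enumerate(instance_feats):
        logits = [cosine(feat, other) / temperature for other in image_feats]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / len(instance_feats)


if __name__ == "__main__":
    instance = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(instance, [[1.0, 0.0], [0.0, 1.0]]))  # consistent features
    print(contrastive_loss(instance, [[0.0, 1.0], [1.0, 0.0]]))  # inconsistent features
```

In this sketch the loss is small when paired features from the two branches point in the same direction and large when they do not, which is the sense in which such a regularizer would narrow a feature gap.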

Extensive experiments are conducted to demonstrate the effectiveness of our proposed method. On the challenging LVIS v1.0 [5] benchmark with image-level supervision from the ImageNet-22k [15] dataset, our approach achieves an overall AP of 31.1, bringing significant improvement for rare categories and establishing a new state of the art. Experimental results for other tasks, e.g., instance segmentation, also demonstrate the generalization ability of our method. Meanwhile, although it is trained with additional data, our framework introduces negligible computational cost during inference, making it a practical method for realistic long-tailed scenarios.

II. RELATED WORK

A. Long-tailed Object Detection

Long-tailed object detection is a challenging vision task that is receiving growing attention. General solutions for this task are data re-sampling [5], [7], [12] and cost-sensitive learning [8]–[11], [16]–[18], which re-balance the contribution of different categories or instances to achieve a balanced training status. Decoupled training methods [12], [13], [19], [20] decouple the learning of the representation and the classifier into two separate stages to address the classifier imbalance problem. Besides, many other methods focus on incremental learning [21], causal inference [22], and so on. Nevertheless, all these methods try to solve the long-tailed problem given training data with only instance-level supervision, where the scarce instances of rare categories prevent the model from learning discriminative features for classification. In contrast, our method makes use of extra image-level annotations to improve the classification ability of long-tailed object detectors.
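To make the data re-sampling idea above concrete, here is a minimal sketch of repeat factor sampling, the re-sampling scheme introduced with the LVIS dataset [5]. Each category c gets a repeat factor max(1, sqrt(t / f(c))), where f(c) is the fraction of training images containing c, and each image takes the largest factor among its categories. The function name `repeat_factors`, the input layout, and the toy data are illustrative assumptions.

```python
import math
from collections import Counter


def repeat_factors(image_categories, threshold=0.001):
    """Per-image repeat factors for repeat factor sampling.

    image_categories: one list of category names per training image.
    threshold: the frequency t below which a category is oversampled.
    """
    num_images = len(image_categories)
    # Count, for every category, how many images contain it at least once.
    image_counts = Counter(c for cats in image_categories for c in set(cats))
    category_factor = {
        c: max(1.0, math.sqrt(threshold / (count / num_images)))
        for c, count in image_counts.items()
    }
    # An image is repeated according to its rarest category.
    return [
        max((category_factor[c] for c in cats), default=1.0)
        for cats in image_categories
    ]


if __name__ == "__main__":
    # Toy data: "dog" appears in every image, "okapi" in only one.
    data = [["dog"]] * 99 + [["dog", "okapi"]]
    factors = repeat_factors(data, threshold=0.25)
    print(factors[0], factors[-1])
```

Images containing only frequent categories keep a factor of 1, while images containing a rare category are repeated more often within an epoch. Such re-sampling changes how often existing annotations are seen but adds no new tail-category data, which is the limitation the paragraph above points out.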

B. Object Detection with Image-Level Supervision

Many works adopt image-level supervision in the object detection task. Weakly-supervised object detection (WSOD) [23]–[29] trains object detectors from images with only image-level supervision, formulating the task as a multiple instance learning (MIL) problem. Due to the lack of location information, the accuracy of these methods is far behind that of supervised object detectors, especially in complex scenes. Semi-supervised object detection [30]–[33] trains on instance-level supervised data together with unlabeled images. Semi-supervised WSOD methods [34]–[40] learn detectors with additional image-level supervision, a setting similar to ours. Among them, DLWL [38] and MosaicOS [39] improve the performance of low-shot categories with image-level supervision, either through a linear program constraint or a multi-stage self-training framework. However, all these methods use the image-level annotations as weak supervision to generate boxes for the detection task, so they depend heavily on the accuracy of the pseudo-label generation algorithm and may introduce too much noise. Besides, the recently proposed method Detic [40] trains the classifier of the detector on data from the two types of supervision, which can be viewed as a multi-task training process. However, it does not take into account the feature entanglement of different tasks, let alone bridge the feature gaps among them. In this paper, we treat learning from the two types of supervision as multiple tasks with a clear division. Based on our proposed framework, the different tasks can be trained collaboratively, taking full advantage of image-level supervision to improve long-tailed object detection performance.
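The multiple instance learning view of WSOD mentioned above can be sketched as follows: an image is a bag of region proposals, per-proposal class scores are aggregated into one image-level score per class, and that score is trained against the image label. The max aggregation, the binary cross-entropy loss, and the names `image_level_scores` and `mil_bce_loss` are assumptions chosen for simplicity; individual WSOD methods use other aggregation rules.

```python
import math


def image_level_scores(proposal_scores):
    """Aggregate per-proposal class scores into image-level scores.

    proposal_scores: one list of per-class probabilities per proposal.
    Returns, for each class, the maximum score over all proposals.
    """
    return [max(column) for column in zip(*proposal_scores)]


def mil_bce_loss(proposal_scores, image_labels, eps=1e-7):
    """Binary cross-entropy between aggregated scores and image labels.

    image_labels: 1 if the class is present somewhere in the image, else 0.
    """
    scores = image_level_scores(proposal_scores)
    loss = 0.0
    for score, label in zip(scores, image_labels):
        score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
        loss -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return loss / len(image_labels)


if __name__ == "__main__":
    # Two proposals, two classes.
    proposals = [[0.9, 0.1], [0.2, 0.3]]
    print(image_level_scores(proposals))
    print(mil_bce_loss(proposals, [1, 0]))  # labels agree with the scores
    print(mil_bce_loss(proposals, [0, 1]))  # labels disagree with the scores
```

Because the label only constrains the aggregated score, nothing in this objective says which proposal is correct, which reflects the missing location information that the paragraph above identifies as the weakness of purely weakly-supervised detectors.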

C. Multi-Task Learning and Collaborative Learning

Multi-task learning approaches [41]–[45] learn to predict multiple outputs for a series of tasks jointly through a shared feature encoder/representation. They aim to improve the performance of all tasks by sharing knowledge between different tasks. In this work, however, our framework mainly focuses on learning the tasks related to long-tailed object detection, while the task for image-level supervision is used to contribute its knowledge to improve the performance of detectors. Collaborative learning methods [46]–[49] are usually applied
