Fast Hierarchical Learning for Few-Shot Object Detection


Yihang She, Goutam Bhat, Martin Danelljan, Fisher Yu

Computer Vision Lab, ETH Zurich

Abstract— Transfer learning based approaches have recently

achieved promising results on the few-shot detection task. These approaches, however, suffer from the “catastrophic forgetting” issue due to finetuning of the base detector, leading to sub-optimal performance on the base classes. Furthermore, the slow convergence rate of stochastic gradient descent (SGD) results in high latency and consequently restricts real-time applications. We tackle both issues in this work. We pose few-shot detection as a hierarchical learning problem, where the novel classes are treated as the child classes of existing base classes and the background class. The detection heads for the novel classes are then trained using a specialized optimization strategy, leading to significantly lower training times compared to SGD. Our approach obtains competitive novel class performance on the few-shot MS-COCO benchmark, while completely retaining the performance of the initial model on the base classes. We further demonstrate the application of our approach to a new class-refined few-shot detection task.

I. INTRODUCTION

Few-shot object detection [14], [36], [33], [11] is an

important computer vision problem with practical applica-

tions in robotics. For instance, it can be used to deploy

an autonomous agent in a new environment with unseen

objects, without having to collect a large amount of training

data. Alternatively, a user may want a robot to detect new

objects by showing just a few examples. Few-shot object

detection is an especially challenging problem since a model

should learn to both classify an object and localize it using

sparse data. This is further complicated in the generalized

few-shot detection case [33], where the model should retain

the ability to detect a set of pre-learned base classes, while

learning to detect novel classes.

One of the popular paradigms for the few-shot detection

task is the use of transfer learning. These approaches [33],

[17] aim to exploit general object detection knowledge learnt

over a large dataset containing annotation for a set of base

classes. Here, a detection model is ﬁrst trained on the data-

abundant base classes. The ﬁnal few layers of this model

are then ﬁnetuned to jointly detect both the base and novel

classes, using a few-shot dataset. While achieving promising

results, especially on the novel classes, the transfer learning

based methods suffer from two key limitations. Firstly, the

ﬁnetuning of the base model on the few-shot dataset leads

to a signiﬁcant drop in the base class performance. This

issue, termed as “catastrophic forgetting”, is undesirable in

practical applications where we may want a robot to detect

new classes on the fly, while not forgetting old knowledge. Secondly, the base model is finetuned using stochastic gradient descent, which takes a long time to converge. This prohibits the use of the method in real-time applications.

[Figure 1: three panels, “Base Training”, “Transfer Learning” and “Our Approach”. Legend: pretrained on base data; trained on novel data. Example labels: Animal, Car, Other (base); Cat, Dog, Fox, Apple, Sofa (novel).]

Fig. 1: Transfer learning based approaches finetune a base detector to jointly detect both base and novel classes. However, this results in a drop in the base class performance of the detector due to “catastrophic forgetting”. Our approach instead detects novel classes in a hierarchical manner. This preserves the base class performance, while also enabling the detection of child classes of existing base classes in a few-shot manner.

In this work, we propose a novel few-shot detection

approach to address the aforementioned issues. Our approach

is based on the idea of posing few-shot detection as a

hierarchical learning problem. We consider a general few-

shot learning setting where we may wish to extend a detector

to detect novel classes which are either a child class of

an existing base class, or completely unrelated to the base

classes. For example, given a model which detects “animal”

and “car” classes, we may wish to additionally detect the

animal types, e.g. “cat”, “dog”, or “fox”, as well as unrelated

novel classes “apple” and “sofa”. To achieve this, we build

a class hierarchy wherein the original base classes constitute

a set of super-classes, which are then sub-divided into novel

child classes. Speciﬁcally, the novel classes which are unre-

lated to any of the base classes are set as descendants of the

“other/background” super-class. With such a hierarchy, we

can first apply the base detector to detect the leaf base class objects (“car”), as well as the candidates for the base super-classes (“animal” and “other”). These candidates are then processed by separately trained novel predictors to detect the novel classes. See Figure 1 for an illustration.

arXiv:2210.05008v1 [cs.CV] 10 Oct 2022

Our hierarchical approach decouples the weights of the

novel class predictors from the base detector. As a result,

our approach retains the performance of the pre-trained

detector on the base classes by design, addressing the “catas-

trophic forgetting” issue. Furthermore, we also introduce

a specialized optimization strategy, based on Newton’s

method, to speed up the learning of the novel predictors.

By exploiting second-order information, our approach can

adapt to detect the novel classes using only 30 update steps.

Consequently, our approach obtains a speed-up of over 10× in

computation time, compared to our transfer learning based

baseline TFA [33].
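The decoupling described above can be illustrated with a minimal routing sketch. It is not the authors' implementation: the `HIERARCHY` dictionary, the `route_detection` function, and the extra "none of the children" row in each novel head are assumptions made for illustration, and the region feature is taken as given. The point it shows is that a leaf base class such as "car" never passes through a novel predictor, so its prediction cannot be changed by novel-class training.

```python
import numpy as np

# Hypothetical hierarchy from the running example: super-class -> novel children.
# "car" is a leaf base class and therefore has no entry.
HIERARCHY = {
    "animal": ["cat", "dog", "fox"],
    "other": ["apple", "sofa"],
}


def route_detection(base_label, feature, novel_heads):
    """Return the final label for one candidate region.

    base_label  : label predicted by the frozen base detector
    feature     : 1-D feature vector of the region (assumed given)
    novel_heads : dict mapping a super-class to a weight matrix of shape
                  (num_children + 1, feature_dim); the extra last row is an
                  assumed "none of the children" score that keeps the base label
    """
    children = HIERARCHY.get(base_label)
    if children is None:
        # Leaf base class: the base prediction is returned untouched.
        return base_label
    scores = novel_heads[base_label] @ feature
    best = int(np.argmax(scores))
    return children[best] if best < len(children) else base_label
```

Because the novel heads are stored separately from the base predictor, adding or retraining a head only changes the outputs for candidates of its own super-class.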

Our contributions can thus be summarized as follows:

• We propose a simple yet effective hierarchical detection approach which completely alleviates “catastrophic forgetting” on the base classes, while obtaining competitive results on the novel classes.

• We present an optimization strategy based on Newton’s method which achieves much faster convergence than traditional gradient descent.

• We introduce a new class-refined few-shot detection task where a method should also be able to learn fine-grained classification for existing base classes.

II. RELATED WORK

Few-Shot Object Detection: The existing literature mainly adopts two paradigms to tackle the few-shot object detection problem: meta-learning-based approaches [14], [23], [10], [36] and transfer-learning-based approaches [8], [33], [17], [11], [24]. Meta-learning-based approaches transfer meta-learned task-level knowledge to the detection task

with limited training data. MetaYOLO [14] meta learned

a feature learner module to extract the generic features of

novel objects and a reweighting module to make predictions

provided these features. Fan et al. [10] proposed Attention-

RPN and Multi-Relation Detector to learn a metric space

to measure the similarity of object pairs for detection. Meta-

DETR [36] meta learned an encoder-decoder transformer for

the few-shot detection.

Among transfer-learning-based approaches, LSTD [8] is one of

the early works that adapted the detector learned on data-

abundant objects to the target domain of few-shot novel

objects. Wang et al. [33] proposed the two-stage ﬁne-tuning

approach TFA. In the ﬁrst stage, a base predictor was

trained for data-abundant base objects. The ﬁnal layers of

the detector were then tuned in the second stage, on a

balanced few-shot dataset containing both base and novel

classes. This tuning-based approach is simple yet effective,

and outperformed previous methods using meta-learning.

Compared to TFA, LEAST [17] ﬁne-tuned more layers on

novel classes, leading to a better novel class performance,

albeit with a deterioration on the base class performance.

To mitigate this catastrophic forgetting, they further applied

knowledge distillation and the clustered exemplars of base

objects. DeFRCN [24] ﬁne-tuned the entire detector of Faster

R-CNN by jointly training it with two auxiliary modules

to improve novel class performance. Fan et al. [11] pro-

posed Retentive R-CNN, which inherited the tuning approach

of TFA with an auxiliary consistency loss to distill the

knowledge of the base detector. Retentive R-CNN achieved

competitive performance on novel classes, while maintaining

the performance of the pre-trained detector on base classes.

In this paper, we propose an alternate hierarchical detection

approach which can achieve similar results to Retentive R-CNN, while being much simpler and more general.

Incremental Learning and Reﬁned Classiﬁcation: Incre-

mental learning aims to incrementally learn new knowledge

from a stream of data while preserving its previous knowl-

edge [18], [27], [13], [25], [30], [34], [35], [7]. A real-world

scenario which is often neglected is that over time, humans

learn not only new entities, but also reﬁned granularity of

previously learned entities. Abdelsalam et al. [1] propose the

Incremental Implicitly-Refined Classification (IIRC) setup

related to this scenario. Here, each class has two granularity

levels of labels to simulate the process of incremental learn-

ing from coarse-grained categories to ﬁne-grained categories.

Following the IIRC setup, Wang et al. [32] proposed HCV

to learn the ﬁne-grained categories while retaining previous

knowledge. HCV aims to identify hierarchical relationship

between classes and exploit this knowledge for the IIRC task.

Hierarchy for few-shot learning: Li et al. [16] perform

large-scale few-shot learning by using class hierarchy which

encodes semantic relations between base and novel classes.

The prior knowledge from class hierarchy is used to learn

transferrable visual features. Liu et al. [21] use class hi-

erarchy to perform coarse-to-ﬁne classiﬁcation. In contrast

to these works, we show that the idea of hierarchy can be

effectively used to address the “catastrophic forgetting” issue

in few-shot detection.

Optimization Methods for Few-Shot Learning: Bertinetto

et al. [4] noted that updating only the parameters sensitive

to speciﬁc classes for few-shot classiﬁcation task leads to a

shallow learning problem. This enables developing adapta-

tion strategies that are more efﬁcient than standard gradient

descent. Consequently, they proposed ridge and sigmoid

regression based classiﬁers with closed-form solutions to

achieve fast convergence for the meta-learning-based few-

shot classiﬁcation. Lee et al. [15] meta-learn representations

for few-shot classification using discriminative linear classifiers. Several works have utilized the steepest-descent optimization strategy to train shallow learners for the few-shot learning problems arising in object tracking [9], [29], [5], video object segmentation [6] and classification [31]. A few

works [2], [3] have employed conjugate gradient (CG) as a

black box optimization tool for object detection. In this work,

we develop a specialized optimization strategy based on CG

to perform efﬁcient few-shot detection. By running extensive

experiments, we show that our optimization approach obtains

similar performance to SGD while being much faster.

III. METHOD

In this work, we propose a few-shot detection approach

that can learn to efﬁciently detect novel classes, while fully

retaining the performance of the original detector on the

base classes. This is achieved by i) introducing a hierarchical

detection approach which preserves the performance on the

base classes by design, while obtaining competitive results on

the novel classes, and ii) utilizing a specialized optimization

approach which leads to faster model adaptation on novel

classes. Our approach is detailed in subsequent sections.

A. Problem statement

We tackle the generalized few-shot learning setting employed in previous works [33], [11]. Here, a method is given a large base dataset $D_b$ containing annotated samples for a set of base object classes $C_b$, which can be used to learn a base detection model $M_b$. Next, given a small dataset $D_n$ for a set of novel classes $C_n$, the goal is to adapt the base model $M_b$ to detect the novel classes $C_n$, in addition to the original base classes $C_b$. The novel dataset $D_n$ is assumed to contain only $K$ examples ($K < 30$) per class. Furthermore, the method is only allowed to access a small $K$-shot subset $D'_b$ of the base dataset $D_b$ when adapting the base model to detect the novel classes. Thus, the method should be able to easily adapt to detect the novel classes $C_n$ using a small dataset $D_n \cup D'_b$, while still retaining the ability to detect the base classes $C_b$. Furthermore, for practical applications, e.g. in robotics, the adaptation to novel classes is expected to be fast in order to ensure real-time performance.
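As a rough illustration of the data setup, the adaptation set can be assembled by drawing at most K annotated instances per class from both the novel and the base data. The sketch below is an assumption-laden simplification: annotations are reduced to hypothetical `(image_id, class_name)` pairs, and the function names `sample_k_shot` and `build_adaptation_set` are made up for this example rather than taken from any benchmark code.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, classes, k, seed=0):
    """Pick at most k annotated instances per class.

    annotations : list of (image_id, class_name) pairs, a simplified stand-in
                  for real detection annotations
    classes     : collection of class names to keep
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, class_name in annotations:
        if class_name in classes:
            by_class[class_name].append((image_id, class_name))
    subset = []
    for class_name in sorted(by_class):
        items = by_class[class_name]
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset


def build_adaptation_set(base_annotations, novel_annotations,
                         base_classes, novel_classes, k):
    """Return the novel K-shot data followed by the K-shot base subset."""
    return (sample_k_shot(novel_annotations, novel_classes, k)
            + sample_k_shot(base_annotations, base_classes, k))
```

Real detection images usually contain several instances of several classes, so an actual K-shot split needs more care than this per-instance sampling suggests.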

B. Motivation

We base our approach on the recently introduced Two-stage Fine-tuning Approach (TFA) [33]. TFA is a transfer learning approach for few-shot object detection which has obtained promising results. TFA employs Faster R-CNN [26] as the detector architecture. Faster R-CNN consists of a convolutional neural network (CNN) module for extracting generic image features, a Region Proposal Network (RPN) to generate proposals for potential objects, a Region of Interest (ROI) feature extractor to compute features from the sampled proposals, and a predictor head to output the detections, given the ROI features. The predictor $P = \{C, R\}$ consists of two separate linear layers: a classifier $C$ to predict the object class for each proposal and a bounding box regressor $R$ to localize each proposal.
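The predictor structure described above, two independent linear layers applied to the same ROI features, can be sketched in a few lines of NumPy. The class name `LinearPredictor`, the initialization scales, the extra background score, and the choice of class-specific box offsets are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np


class LinearPredictor:
    """Predictor P = {C, R}: two independent linear layers on ROI features."""

    def __init__(self, feature_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Classifier C: one score per class plus one for background (assumed).
        self.w_cls = rng.normal(0.0, 0.01, (feature_dim, num_classes + 1))
        self.b_cls = np.zeros(num_classes + 1)
        # Regressor R: four box offsets per class (class-specific boxes assumed).
        self.w_reg = rng.normal(0.0, 0.001, (feature_dim, 4 * num_classes))
        self.b_reg = np.zeros(4 * num_classes)

    def __call__(self, roi_features):
        """Map ROI features of shape (N, feature_dim) to logits and box deltas."""
        logits = roi_features @ self.w_cls + self.b_cls
        deltas = roi_features @ self.w_reg + self.b_reg
        return logits, deltas
```

Since both layers are linear in the ROI features, training only this head is a shallow learning problem, which is what makes specialized optimizers attractive here.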

TFA proposes to first train a Faster R-CNN model on the data-abundant $D_b$ to obtain a base detector $M_b$, with predictor $P_b$. Next, when provided the novel dataset $D_n$ in the second stage, TFA extends the predictor $P_b$ to also output detections for the novel classes. This extended predictor, denoted as $P_n$, is then fine-tuned on the combined dataset $D_n \cup D'_b$ by minimizing a loss $L = L_{cls} + L_{loc}$ using the stochastic gradient descent optimizer. Here, $L_{cls}$ is the cross-entropy loss for the classifier $C$, while $L_{loc}$ is the smooth L1 loss for the box regressor $R$. We refer to the detector using the finetuned predictor $P_n$ as $M_n$.
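For readers who want the loss spelled out, the following sketch computes the sum of a cross-entropy term and a smooth L1 term with NumPy. It is a generic textbook version under stated assumptions: the threshold `beta=1.0`, the mean reduction, and the equal weighting of the two terms are choices made here, and a real detector would typically restrict the box loss to foreground proposals.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean cross-entropy over proposals; logits (N, num_classes), labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1: quadratic for |x| < beta, linear otherwise."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()


def detection_loss(logits, labels, box_pred, box_target):
    """Total loss: classification term plus localization term."""
    return cross_entropy(logits, labels) + smooth_l1(box_pred, box_target)
```

With uniform logits over four classes, the classification term equals log 4, which is a convenient sanity check when wiring up such a loss.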

       Base    1-shot   2-shot   3-shot   5-shot   10-shot
bAP    39.2    34.1     34.7     34.7     34.7     35.0

TABLE I: Average precision of TFA [33] on the base classes (bAP) for different numbers of shots on the MS-COCO dataset [20]. TFA suffers from a significant drop in bAP, compared to the pre-trained base model, due to “catastrophic forgetting”.

The two-stage training strategy allows TFA to leverage the strong backbone feature extractor and the RPN modules trained on the larger base dataset $D_b$ to obtain improved performance on the data-scarce novel categories $C_n$. However, it suffers from two significant issues which limit its applicability in practice.

Catastrophic forgetting: In the second stage, TFA finetunes the predictor on a small balanced dataset containing both novel and base categories to obtain the detector $M_n$. This finetuning can lead to a significant drop in the base class performance, compared to the base detector $M_b$ which was trained on a much larger dataset $D_b$. This “catastrophic forgetting” problem is illustrated in Tab. I. Compared to the pretrained base detector $M_b$, the finetuned detector $M_n$ obtains a much lower average precision score on the base classes (bAP), even in the 10-shot case (35.0, compared to 39.2 for $M_b$). This is undesirable in cases where the performance on base classes is as important as the performance on novel classes.
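To make the size of the drop concrete, the short calculation below turns the Table I numbers into absolute and relative losses in bAP. The values are copied from the table; the function name `bap_drop` is only a label chosen for this example.

```python
# Values copied from Table I (TFA on MS-COCO base classes).
BASE_BAP = 39.2
FINETUNED_BAP = {1: 34.1, 2: 34.7, 3: 34.7, 5: 34.7, 10: 35.0}


def bap_drop(shots):
    """Absolute and relative (percent) bAP drop after K-shot finetuning."""
    absolute = BASE_BAP - FINETUNED_BAP[shots]
    relative = 100.0 * absolute / BASE_BAP
    return round(absolute, 1), round(relative, 1)
```

For the 10-shot case this gives a drop of 4.2 points, or roughly 10.7% of the base model's bAP; for 1-shot it is 5.1 points, about 13.0%.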

Slow convergence: TFA uses stochastic gradient descent (SGD) to finetune the predictor $P_n$ in the second stage. While each SGD iteration is computationally cheap, SGD suffers from slow convergence. Thus, a large number of SGD iterations is required to adapt the base detector $M_b$ to the novel classes, leading to long computation times.
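The gap between first-order and second-order updates can be seen on a toy problem. The sketch below compares plain (non-stochastic) gradient descent with a single Newton step on an ill-conditioned quadratic; it is not the detection loss and not the authors' optimizer, and the matrix `A`, the step size, and the 30-step budget are arbitrary choices made for illustration.

```python
import numpy as np


def gradient_descent(A, b, steps, lr):
    """Minimize 0.5 * w^T A w - b^T w with fixed-step gradient descent."""
    w = np.zeros_like(b)
    for _ in range(steps):
        w = w - lr * (A @ w - b)
    return w


def newton_step(A, b):
    """One Newton step from w = 0; the Hessian of the quadratic is A."""
    w = np.zeros_like(b)
    grad = A @ w - b
    return w - np.linalg.solve(A, grad)


A = np.diag([1.0, 100.0])   # ill-conditioned toy Hessian (condition number 100)
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)  # exact minimizer

# lr = 1 / (largest eigenvalue) keeps gradient descent stable but slow.
err_gd = np.linalg.norm(gradient_descent(A, b, steps=30, lr=0.01) - w_star)
err_newton = np.linalg.norm(newton_step(A, b) - w_star)
```

On this quadratic the Newton step lands on the minimizer immediately, while gradient descent is still far from it after 30 steps because the step size is dictated by the steepest direction. Real losses are not quadratic, so several second-order steps are needed in practice.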

In this work, we address the aforementioned issues with

TFA by proposing a novel few-shot detection framework.

C. Hierarchical Detection Approach

Here, we present our Hierarchical Detection Approach (HDA) for generalized few-shot detection (see Fig. 2). We note that the base model $M_b$ is pre-trained on a large dataset $D_b$ containing abundant examples of the base classes $C_b$. Thus, the detector $M_b$ should already achieve high detection performance on the base classes. Finetuning it further on the much smaller dataset $D_n \cup D'_b$, as in TFA, is likely to only reduce the base class performance due to overfitting. Furthermore, the base dataset $D_b$ also contains a large number of background objects not belonging to $C_b$. Consequently, the base detector $M_b$ should be able to classify most of the unseen object classes, including the novel classes, as background. Under these settings, we can pose generalized few-shot detection as a hierarchical detection problem, as described next.

Similar to TFA, we first train a Faster R-CNN base detector $M_b$ to detect the base classes $C_b$ using the large-scale base dataset $D_b$. Next, instead of finetuning the predictor $P_b$ in order to adapt the model for novel classes, we employ an alternate hierarchical approach. We first apply the detector to generate object proposals and use the base predictor $P_b$ to obtain the classification scores and refined boxes for each
