Fast Hierarchical Learning for Few-Shot Object Detection

Yihang She, Goutam Bhat, Martin Danelljan, Fisher Yu
Computer Vision Lab, ETH Zurich
Abstract: Transfer-learning-based approaches have recently
achieved promising results on the few-shot detection task. These
approaches, however, suffer from the “catastrophic forgetting” issue
due to finetuning of the base detector, leading to sub-optimal
performance on the base classes. Furthermore, the slow convergence
rate of stochastic gradient descent (SGD) results in high latency
and consequently restricts real-time applications. We tackle the
aforementioned issues in this work. We pose few-shot detection
as a hierarchical learning problem, where the novel classes are
treated as the child classes of existing base classes and the
background class. The detection heads for the novel classes are
then trained using a specialized optimization strategy, leading
to significantly lower training times compared to SGD. Our
approach obtains competitive novel class performance on the
few-shot MS-COCO benchmark, while completely retaining the
performance of the initial model on the base classes. We further
demonstrate the application of our approach to a new class-
refined few-shot detection task.
I. INTRODUCTION
Few-shot object detection [14], [36], [33], [11] is an
important computer vision problem with practical applica-
tions in robotics. For instance, it can be used to deploy
an autonomous agent in a new environment with unseen
objects, without having to collect a large amount of training
data. Alternatively, a user may want a robot to detect new
objects by showing just a few examples. Few-shot object
detection is an especially challenging problem since a model
should learn to both classify an object and localize it using
sparse data. This is further complicated in the generalized
few-shot detection case [33], where the model should retain
the ability to detect a set of pre-learned base classes, while
learning to detect novel classes.
One of the popular paradigms for the few-shot detection
task is the use of transfer learning. These approaches [33],
[17] aim to exploit general object detection knowledge learnt
over a large dataset containing annotation for a set of base
classes. Here, a detection model is first trained on the data-
abundant base classes. The final few layers of this model
are then finetuned to jointly detect both the base and novel
classes, using a few-shot dataset. While achieving promising
results, especially on the novel classes, the transfer learning
based methods suffer from two key limitations. Firstly, the
finetuning of the base model on the few-shot dataset leads
to a significant drop in the base class performance. This
issue, termed “catastrophic forgetting”, is undesirable in
practical applications where we may want a robot to detect
new classes on the fly, while not forgetting the old knowledge.
Secondly, the base model is finetuned using stochastic
gradient descent, which takes a long time to converge. This
prohibits the use of the method in real-time applications.
[Figure 1 (diagram). Panels: Base Training, Transfer Learning, Our Approach. Legend: pretrained on base data, trained on novel data. Example class labels: Animal, Car, Other, Cat, Dog, Fox, Apple, Sofa.]
Fig. 1: Transfer learning based approaches finetune a base de-
tector to jointly detect both base and novel classes. However,
this results in a drop in base class performance of the detector
due to “catastrophic forgetting”. Our approach instead detects
novel classes in a hierarchical manner. This preserves the
base class performance, while also enabling detecting child
classes of existing base classes in a few-shot manner.
In this work, we propose a novel few-shot detection
approach to address the aforementioned issues. Our approach
is based on the idea of posing few-shot detection as a
hierarchical learning problem. We consider a general few-
shot learning setting where we may wish to extend a detector
to detect novel classes which are either a child class of
an existing base class, or completely unrelated to the base
classes. For example, given a model which detects “animal”
and “car” classes, we may wish to additionally detect the
animal types, e.g. “cat”, “dog”, or “fox”, as well as unrelated
novel classes “apple” and “sofa”. To achieve this, we build
a class hierarchy wherein the original base classes constitute
a set of super-classes, which are then sub-divided into novel
child classes. Specifically, the novel classes which are unre-
lated to any of the base classes are set as descendants of the
“other/background” super-class. With such a hierarchy, we
can first apply the base detector to detect the leaf base class
objects (“car”), as well as the candidates for the base super-
classes (“animal” and “other”). These candidates are then
processed by separately trained novel predictors to detect
the novel classes. See Figure 1 for an illustration.
arXiv:2210.05008v1 [cs.CV] 10 Oct 2022
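The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Detection` type, the `SUPER_CLASSES` table (copied from the animal/car example in the text), and the convention that a novel predictor returns `(label, score)` or `(None, 0.0)` are all hypothetical. Keeping the coarse label when no child class fires is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    label: str
    score: float


# Hypothetical hierarchy, taken from the example in the text.
SUPER_CLASSES = {"animal": ["cat", "dog", "fox"], "other": ["apple", "sofa"]}

# Assumed interface: a novel predictor maps a candidate to (child label or None, score).
NovelPredictor = Callable[[Detection], Tuple[Optional[str], float]]


def hierarchical_detect(
    base_dets: List[Detection], novel_predictors: Dict[str, NovelPredictor]
) -> List[Detection]:
    results = []
    for det in base_dets:
        if det.label not in SUPER_CLASSES:
            # Leaf base class (e.g. "car"): the base detector's output is kept untouched.
            results.append(det)
            continue
        # Super-class candidate: hand it to the separately trained novel predictor.
        child, score = novel_predictors[det.label](det)
        if child is not None:
            results.append(Detection(det.box, child, score))
        elif det.label != "other":
            # Assumption: with no child match, keep the coarse base label;
            # unmatched "other" candidates are treated as background and dropped.
            results.append(det)
    return results
```

Because leaf base detections pass through unchanged, nothing learned for the novel classes can alter the base detector's output on them, which is the property the text relies on.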
Our hierarchical approach decouples the weights of the
novel class predictors from the base detector. As a result,
our approach retains the performance of the pre-trained
detector on the base classes by design, addressing the “catastrophic
forgetting” issue. Furthermore, we also introduce
a specialized optimization strategy, based on Newton’s
method, to speed up the learning of the novel predictors.
By exploiting second-order information, our approach can
adapt to detect the novel classes using only 30 update steps.
Consequently, our approach obtains a speed-up of over 10× in
computation time, compared to our transfer learning based
baseline TFA [33].
Our contributions can thus be summarized as follows:
• We propose a simple yet effective hierarchical detection
approach which completely alleviates “catastrophic
forgetting” on base classes, while obtaining competitive
results on the novel classes.
• We present a Newton’s-method-based optimization
strategy which achieves much faster convergence than
traditional gradient descent.
• We introduce a new class-refined few-shot detection
task where a method should also be able to learn fine-grained
classification for existing base classes.
II. RELATED WORK
Few-Shot Object Detection: The existing literature mainly
adopts two paradigms to tackle the few-shot object detection
problem: meta-learning-based approaches [14], [23], [10], [36]
and transfer-learning-based approaches [8], [33], [17], [11],
[24]. Meta-learning-based approaches transfer
meta-learned task-level knowledge to the detection task
with limited training data. MetaYOLO [14] meta-learned
a feature learner module to extract generic features of
novel objects and a reweighting module to make predictions
from these features. Fan et al. [10] proposed Attention-RPN
and Multi-Relation Detector to learn a metric space
for measuring the similarity of object pairs for detection.
Meta-DETR [36] meta-learned an encoder-decoder transformer for
few-shot detection.
Among transfer-learning-based approaches, LSTD [8] is one of
the early works that adapted the detector learned on data-
abundant objects to the target domain of few-shot novel
objects. Wang et al. [33] proposed the two-stage fine-tuning
approach TFA. In the first stage, a base predictor was
trained for data-abundant base objects. The final layers of
the detector were then tuned in the second stage, on a
balanced few-shot dataset containing both base and novel
classes. This tuning-based approach is simple yet effective,
and outperformed previous methods using meta-learning.
Compared to TFA, LEAST [17] fine-tuned more layers on
novel classes, leading to a better novel class performance,
albeit with a deterioration on the base class performance.
To mitigate this catastrophic forgetting, they further applied
knowledge distillation and the clustered exemplars of base
objects. DeFRCN [24] fine-tuned the entire detector of Faster
R-CNN by jointly training it with two auxiliary modules
to improve novel class performance. Fan et al. [11] pro-
posed Retentive R-CNN, which inherited the tuning approach
of TFA with an auxiliary consistency loss to distill the
knowledge of the base detector. Retentive R-CNN achieved
competitive performance on novel classes, while maintaining
the performance of the pre-trained detector on base classes.
In this paper, we propose an alternate hierarchical detection
approach which can achieve similar results to Retentive R-CNN,
while being much simpler and more general.
Incremental Learning and Refined Classification: Incre-
mental learning aims to incrementally learn new knowledge
from a stream of data while preserving its previous knowl-
edge [18], [27], [13], [25], [30], [34], [35], [7]. A real-world
scenario which is often neglected is that over time, humans
learn not only new entities, but also refined granularity of
previously learned entities. Abdelsalam et al. [1] propose the
Incremental Implicitly-Refined Classification (IIRC) setup
related to this scenario. Here, each class has two granularity
levels of labels to simulate the process of incremental learn-
ing from coarse-grained categories to fine-grained categories.
Following the IIRC setup, Wang et al. [32] proposed HCV
to learn the fine-grained categories while retaining previous
knowledge. HCV aims to identify the hierarchical relationships
between classes and exploit this knowledge for the IIRC task.
Hierarchy for Few-Shot Learning: Li et al. [16] perform
large-scale few-shot learning by using a class hierarchy which
encodes semantic relations between base and novel classes.
The prior knowledge from the class hierarchy is used to learn
transferable visual features. Liu et al. [21] use a class
hierarchy to perform coarse-to-fine classification. In contrast
to these works, we show that the idea of hierarchy can be
effectively used to address the “catastrophic forgetting” issue
in few-shot detection.
Optimization Methods for Few-Shot Learning: Bertinetto
et al. [4] noted that updating only the parameters sensitive
to specific classes for the few-shot classification task leads to a
shallow learning problem. This enables developing adapta-
tion strategies that are more efficient than standard gradient
descent. Consequently, they proposed ridge and sigmoid
regression based classifiers with closed-form solutions to
achieve fast convergence for the meta-learning-based few-
shot classification. Lee et al. [15] meta-learn representations
for few-shot classification using discriminative linear classifiers.
Several works have utilized the steepest-descent optimization
strategy to train shallow learners for tackling few-shot
learning problems arising in object tracking [9], [29], [5],
video object segmentation [6] and classification [31]. A few
works [2], [3] have employed conjugate gradient (CG) as a
black box optimization tool for object detection. In this work,
we develop a specialized optimization strategy based on CG
to perform efficient few-shot detection. By running extensive
experiments, we show that our optimization approach obtains
similar performance to SGD while being much faster.
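To make the contrast between second-order and first-order updates concrete, the following minimal sketch fits a shallow L2-regularized logistic classifier with Newton steps whose linear systems are solved by conjugate gradient, and compares it with plain gradient descent under the same step budget. It is a generic textbook Newton-CG on toy data, not the authors' optimizer: the data, the regularization weight `LAM`, the learning rate, and the backtracking line search are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss(w, X, y, lam):
    # Mean logistic loss (labels in {0, 1}) plus an L2 penalty.
    total = 0.0
    for x, t in zip(X, y):
        z = dot(w, x)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - t * z
    return total / len(X) + 0.5 * lam * dot(w, w)


def grad(w, X, y, lam):
    g = [lam * wi for wi in w]
    for x, t in zip(X, y):
        r = (sigmoid(dot(w, x)) - t) / len(X)
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g


def hess_vec(w, X, lam, v):
    # Hessian-vector product; the Hessian itself is never formed.
    hv = [lam * vi for vi in v]
    for x in X:
        p = sigmoid(dot(w, x))
        c = p * (1.0 - p) * dot(x, v) / len(X)
        for j, xj in enumerate(x):
            hv[j] += c * xj
    return hv


def conjugate_gradient(matvec, b, iters):
    # Solves A x = b for a symmetric positive definite A given only x -> A x.
    x, r, p = [0.0] * len(b), list(b), list(b)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < 1e-20:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def newton_cg(X, y, lam, steps):
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = grad(w, X, y, lam)
        d = conjugate_gradient(lambda v: hess_vec(w, X, lam, v), [-gi for gi in g], len(w))
        # Backtracking (Armijo) line search so that each step decreases the loss.
        t, f0, slope = 1.0, loss(w, X, y, lam), dot(g, d)
        while t > 1e-8 and loss([wi + t * di for wi, di in zip(w, d)], X, y, lam) > f0 + 1e-4 * t * slope:
            t *= 0.5
        w = [wi + t * di for wi, di in zip(w, d)]
    return w


def gradient_descent(X, y, lam, steps, lr):
    w = [0.0] * len(X[0])
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, X, y, lam))]
    return w


# Toy, non-separable data: a bias feature plus two inputs (purely illustrative).
X = [[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -0.5], [1.0, -2.0, 0.3],
     [1.0, 0.5, -1.5], [1.0, -0.3, 1.2], [1.0, 0.2, 0.1], [1.0, -0.2, 0.0]]
Y = [1, 1, 0, 0, 1, 0, 0, 1]
LAM = 0.1

print("Newton-CG loss:", loss(newton_cg(X, Y, LAM, 30), X, Y, LAM))
print("GD loss:       ", loss(gradient_descent(X, Y, LAM, 30, 0.5), X, Y, LAM))
```

The Hessian is accessed only through Hessian-vector products, which is what keeps a CG-based Newton step affordable when the learner is shallow.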
III. METHOD
In this work, we propose a few-shot detection approach
that can learn to efficiently detect novel classes, while fully
retaining the performance of the original detector on the
base classes. This is achieved by i) introducing a hierarchical
detection approach which preserves the performance on the
base classes by design, while obtaining competitive results on
the novel classes, and ii) utilizing a specialized optimization
approach which leads to faster model adaptation on novel
classes. Our approach is detailed in subsequent sections.
A. Problem statement
We tackle the generalized few-shot learning setting employed
in previous works [33], [11]. Here, a method is given
a large base dataset $D_b$ containing annotated samples for a
set of base object classes $C_b$, which can be used to learn a
base detection model $M_b$. Next, given a small dataset $D_n$
for a set of novel classes $C_n$, the goal is to adapt the base
model $M_b$ to detect the novel classes $C_n$, in addition to the
original base classes $C_b$. The novel dataset $D_n$ is assumed to
contain only $K$ examples ($K < 30$) per class. Furthermore,
the method is only allowed to access a small $K$-shot subset
$D'_b$ of the base dataset $D_b$ when adapting the base model
to detect the novel classes. Thus the method should be able
to easily adapt to detect the novel classes $C_n$ using a small
dataset $D_n \cup D'_b$, while still retaining the ability to detect
the base classes $C_b$. Furthermore, for practical applications
in e.g. robotics, the adaptation to novel classes is expected
to be fast in order to ensure real-time performance.
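As a rough illustration of the data setup, the sketch below draws a K-shot subset (at most K annotated instances per class) and combines the base subset with the novel set for adaptation. The annotation format, a list of `(image_id, class_name)` pairs, and the function name `sample_k_shot` are assumptions made for this example; real few-shot detection benchmarks use fixed, published splits rather than random draws.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, k, seed=0):
    """Return at most k annotated instances per class.

    `annotations` is assumed to be a list of (image_id, class_name) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann[1]].append(ann)
    subset = []
    for cls in sorted(by_class):  # sorted for a reproducible order
        items = by_class[cls]
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset


# Hypothetical toy annotations for one base class and one novel class.
base_annotations = [(f"img{i}", "car") for i in range(100)]
novel_annotations = [(f"img{i}", "sofa") for i in range(100, 105)]

k = 5
# The adaptation set: K-shot base subset together with the K-shot novel set.
adaptation_set = sample_k_shot(base_annotations, k) + sample_k_shot(novel_annotations, k)
print(len(adaptation_set))
```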
B. Motivation
We base our approach on the recently introduced Two-stage
Fine-tuning Approach (TFA) [33]. TFA is a transfer
learning approach for few-shot object detection which
has obtained promising results. TFA employs Faster-RCNN [26]
as the detector architecture. Faster-RCNN consists
of a convolutional neural network (CNN) module for extracting
generic image features, a Region Proposal Network
(RPN) to generate proposals for potential objects, a Region
of Interest (ROI) feature extractor to compute features from
the sampled proposals, and a predictor head to output the
detections, given the ROI features. The predictor $P = \{C, R\}$
consists of two separate linear layers: a classifier $C$ to predict
the object class for each proposal and a bounding box regressor
$R$ to localize each proposal.
TFA proposes to first train a Faster-RCNN model on
the data-abundant $D_b$ to obtain a base detector $M_b$, with
predictor $P_b$. Next, when provided the novel dataset $D_n$ in
the second stage, TFA extends the predictor $P_b$ to also output
detections for the novel classes. This extended predictor,
denoted as $P_n$, is then fine-tuned on the combined dataset
$D_n \cup D'_b$ by minimizing a loss $L = L_{cls} + L_{loc}$ using the
stochastic gradient descent optimizer. Here, $L_{cls}$ is the cross
entropy loss for the classifier $C$ while $L_{loc}$ is the smooth L1
loss for the box regressor $R$. We refer to the detector using the
finetuned predictor $P_n$ as $M_n$.
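The fine-tuning objective described above, a cross-entropy classification term plus a smooth L1 localization term, can be written out for a single proposal as follows. This is a minimal sketch: the threshold `beta=1.0`, the summation over box coordinates, and the absence of any per-batch normalization or foreground masking are assumptions, since the text does not specify them.

```python
import math


def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed with the log-sum-exp trick.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def smooth_l1(pred, target, beta=1.0):
    # Quadratic for small errors, linear for large ones (beta is an assumed threshold).
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def detection_loss(logits, cls_target, box_pred, box_target):
    # L = L_cls + L_loc for one proposal.
    return cross_entropy(logits, cls_target) + smooth_l1(box_pred, box_target)


# Hypothetical proposal: three class logits and four box regression offsets.
print(detection_loss([2.0, 0.5, -1.0], 0, [0.1, 0.2, 0.0, 1.5], [0.0, 0.0, 0.0, 0.0]))
```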
       Base   1-shot   2-shot   3-shot   5-shot   10-shot
bAP    39.2   34.1     34.7     34.7     34.7     35.0
TABLE I: Average precision of TFA [33] on base classes
(bAP) over different shots on MS-COCO dataset [20]. TFA
suffers from a significant drop in bAP, compared to the pre-
trained base model, due to “catastrophic forgetting”.
The two-stage training strategy allows TFA to leverage
the strong backbone feature extractor and the RPN modules
trained on the larger base dataset $D_b$ to obtain improved
performance on the data-scarce novel categories $C_n$. However,
it suffers from two significant issues which limit its
applicability in practice.
Catastrophic forgetting: In the second stage, TFA finetunes
the predictor on a small balanced dataset containing both
novel and base categories to obtain the detector $M_n$. This
finetuning can lead to a significant drop in the base class
performance, compared to the base detector $M_b$ which was
trained on a much larger dataset $D_b$. This “catastrophic
forgetting” problem is illustrated in Tab. I. Compared to
the pretrained base detector $M_b$, the finetuned detector $M_n$
obtains a much lower average precision score on the base
classes (bAP), even in the 10-shot case (39.2 vs 35.0). This
is undesirable in cases where the performance on base classes
is as important as the performance on novel classes.
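The size of the drop can be read directly off Table I. The short computation below uses only the bAP values reported there; the function name `bap_drops` is introduced purely for illustration. The absolute drop ranges from 4.2 AP (10-shot) to 5.1 AP (1-shot), which is roughly 11 to 13 percent of the base model's 39.2 bAP.

```python
BASE_BAP = 39.2  # pretrained base detector, from Table I
TFA_BAP = {1: 34.1, 2: 34.7, 3: 34.7, 5: 34.7, 10: 35.0}  # shots -> bAP after finetuning


def bap_drops(base, finetuned):
    # Maps shots -> (absolute drop in AP, relative drop in percent).
    return {k: (base - v, 100.0 * (base - v) / base) for k, v in finetuned.items()}


for shots, (absolute, relative) in bap_drops(BASE_BAP, TFA_BAP).items():
    print(f"{shots:>2}-shot: -{absolute:.1f} AP ({relative:.1f}% relative)")
```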
Slow convergence: TFA uses Stochastic Gradient Descent
(SGD) to finetune the predictor $P_n$ in the second stage.
While SGD is computationally cheap per iteration, it
suffers from slow convergence. Thus, a large number of SGD
iterations is required to adapt the base detector $M_b$ for novel
classes, leading to high computation times.
In this work, we address the aforementioned issues with
TFA by proposing a novel few-shot detection framework.
C. Hierarchical Detection Approach
Here, we present our Hierarchical Detection Approach
(HDA) for generalized few-shot detection (see Fig. 2). We
note that the base model $M_b$ is pre-trained on a large
dataset $D_b$ containing abundant examples of base classes $C_b$.
Thus, the detector $M_b$ should already achieve high detection
performance on the base classes. Finetuning it further on a
smaller subset $D_n \cup D'_b$, as in TFA, is likely to only reduce the
base class performance due to overfitting. Furthermore, the
base dataset $D_b$ also contains a large number of background
objects not belonging to $C_b$. Consequently, the base detector
$M_b$ should be able to classify most of the unseen object
classes, including the novel classes, as background. Under
these settings, we can pose generalized few-shot detection
as a hierarchical detection problem, as described next.
Similar to TFA, we first train a Faster-RCNN base detector
$M_b$ to detect the base classes $C_b$ using the large-scale base
dataset $D_b$. Next, instead of finetuning the predictor $P_b$ in
order to adapt the model for novel classes, we employ an
alternate hierarchical approach. We first apply the detector
to generate object proposals and use the base predictor $P_b$
to obtain the classification scores and refined boxes for each