arXiv:2210.04845v2 [cs.CV] 20 Aug 2023
FS-DETR: Few-Shot DEtection TRansformer with prompting and without
re-training
Adrian Bulat1,2, Ricardo Guerrero1, Brais Martinez1, Georgios Tzimiropoulos1,3
1Samsung AI Cambridge 2Technical University of Iasi 3Queen Mary University of London
Abstract
This paper is on Few-Shot Object Detection (FSOD),
where given a few templates (examples) depicting a novel
class (not seen during training), the goal is to detect all
of its occurrences within a set of images. From a practi-
cal perspective, an FSOD system must fulfil the following
desiderata: (a) it must be used as is, without requiring any
fine-tuning at test time, (b) it must be able to process an arbi-
trary number of novel objects concurrently while supporting
an arbitrary number of examples from each class and (c) it
must achieve accuracy comparable to a closed system. To-
wards satisfying (a)-(c), in this work, we make the following
contributions: We introduce, for the first time, a simple, yet
powerful, few-shot detection transformer (FS-DETR) based
on visual prompting that can address both desiderata (a) and
(b). Our system builds upon the DETR framework, extend-
ing it based on two key ideas: (1) feed the provided visual
templates of the novel classes as visual prompts during test
time, and (2) “stamp” these prompts with pseudo-class em-
beddings (akin to soft prompting), which are then predicted
at the output of the decoder. Importantly, we show that
our system is not only more flexible than existing methods,
but also, it makes a step towards satisfying desideratum (c).
Specifically, it is significantly more accurate than all methods that do not require fine-tuning, and even matches or outperforms the current state-of-the-art fine-tuning-based methods on the most well-established benchmarks (PASCAL VOC & MSCOCO).
1. Introduction
Thanks to the advent of deep learning, object detection has witnessed tremendous progress over recent years. However, the standard setting of training and testing on a closed
set of classes has important limitations. Firstly, it is infeasible to annotate all objects of relevance in-the-wild; thus, current systems are trained on only a small subset, and it does not seem straightforward to significantly scale up this coverage. Secondly, human perception operates mostly under
the open set recognition/detection setting. Humans can de-
tect/track new unseen objects on the fly, typically using a
single template, without requiring any “re-training” or “fine-tuning” of their “detection” skills, arguably a consequence of previously learned prior representations, an aspect we seek to exploit here too. Finally, important applications in robotics,
where agents may interact with previously unseen objects,
might require their subsequent detection on the fly without
any re-training. Few-Shot Object Detection (FSOD) refers
to the problem of detecting a novel class not seen during
training and, hence, can potentially address many of the
aforementioned challenges.
There are still important desiderata that current FSOD systems must address in order to be practical and flexible
to use: (a) They must be usable as is, without requiring any re-training (e.g. fine-tuning) at test time, a crucial requirement for autonomous exploration [26]. However, many existing
state-of-the-art FSOD systems (e.g. [42, 50, 36]) rely on
re-training with the few available examples of the unseen
classes. While such systems are still useful, the require-
ment for re-training makes them significantly more difficult
to deploy on the fly and in real-time or on devices with
limited capabilities for training. (b) They must be able to
handle an arbitrary number of novel objects (and moreover
an arbitrary number of examples per novel class) simulta-
neously during test time, in a single forward pass without
requiring batching. This is akin to how closed systems work,
which are able to detect multiple objects concurrently. How-
ever, to our knowledge there is no FSOD system possessing
this property without requiring re-training. (c) They must
attain classification accuracy that is comparable to that of
closed systems. However, existing FSOD systems are far
from achieving such high accuracy, especially for difficult
datasets like MSCOCO.
This work aims to significantly advance the state-of-the-
art in all three above-mentioned challenges. To this end,
and building upon the DETR [3] framework, we propose a system, called Few-Shot DEtection TRansformer (FS-DETR),
capable of detecting multiple novel classes at once, support-
ing a variable number of examples per class, and importantly,
without any extra re-training. In our system, the visual template(s) (i.e. prompts) from the new class(es) are used, during test time, in two ways: (1) in FS-DETR’s encoder to
filter the backbone’s image features via cross-attention, and
more importantly, (2) as visual prompts in FS-DETR’s de-
coder, “stamped” with special pseudo-class encodings and
prepended to the learnable object queries. The pseudo-class
encodings are used as pseudo-classes which a classification
head attached to the object queries is trained to predict via
a Cross-Entropy loss. Finally, the outputs of the decoder are the predicted pseudo-classes and regressed bounding boxes. When combined, the two components allow the creation of an FSOD model that can localise, within one forward pass, multiple objects at once, each with an arbitrary number of examples, without re-training.
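To make the “stamping” idea concrete, here is a minimal NumPy sketch (not the authors’ code): template features are pooled per class, “stamped” with a class-agnostic pseudo-class embedding, and prepended to the object queries. The mean-pooling of the k examples per class, and all names and sizes, are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_pseudo = 256, 100, 20

# class-agnostic pseudo-class embeddings ("stamps"); learned in the real model
pseudo_class_emb = rng.normal(size=(num_pseudo, d))
# DETR-style learnable object queries (zeros here as placeholders)
object_queries = np.zeros((num_queries, d))

def build_decoder_input(template_feats):
    """template_feats: one (k_i, d) array of template features per novel class."""
    prompts = []
    for cls_idx, feats in enumerate(template_feats):
        # aggregate the k examples of the class (mean-pooling is an assumption),
        # then "stamp" the result with the cls_idx-th pseudo-class embedding
        prompts.append(feats.mean(axis=0) + pseudo_class_emb[cls_idx])
    # prepend the stamped prompts to the object queries
    return np.concatenate([np.stack(prompts), object_queries], axis=0)

# two novel classes, with 3 and 1 template examples respectively
x = build_decoder_input([rng.normal(size=(3, d)), rng.normal(size=(1, d))])
print(x.shape)  # (102, 256)
```

At the decoder output, a classification head is then trained to predict, for each object query, which pseudo-class (if any) its box belongs to.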
Contrary to prior work (e.g. TSF [23] and AirDet [26]), FS-DETR, akin to soft-prompting [21], “instructs” the model
in the input space regarding the visual appearance of the
searched object(s). The network is then capable of predict-
ing for each prompt (i.e. visual template) all the locations at
which it is present in the image, if any. This is achieved with-
out any additional modules or carefully engineered structures
and feature filtering mechanisms (e.g. TSF [23], AirDet [26]).
Instead, we directly append the prompts to the object queries
of the decoder.
In summary, our main contributions are:
1. We propose a fine-tuning-free Few-Shot DEtection TRansformer (FS-DETR) which is capable of detecting multiple novel objects at once and can support an arbitrary number of samples per class in an efficient manner via soft visual prompting.
2. We show that all these features can be enabled by ex-
tending DETR based on two key ideas: (1) feed the
provided visual templates of novel classes as visual
prompts during test time, and (2) “stamp” these prompts
with (class agnostic) pseudo-class embeddings, which
are then predicted at the output of the decoder along
with bounding boxes (akin to soft-prompting).
3. We also propose a simple and efficient yet powerful
pipeline consisting of unsupervised pre-training fol-
lowed by prompt-like base class training.
4. In addition to being more flexible, our system matches or outperforms state-of-the-art results on the standard
FSOD setting on PASCAL VOC and MSCOCO. Specifically, FS-DETR outperforms the re-training-free methods of [14, 26] and most re-training-based methods in extreme few-shot settings (k = 1, 2), while remaining competitive for more shots.
2. Related work
DEtection TRansformer (DETR) approaches: After revolutionizing NLP [46, 37], Transformer-based architectures have started making a significant impact on computer vision problems [6, 32]. In object detection, methods are typically grouped into two-stage (proposal-based) [39, 17, 2] and single-stage (proposal-free) [28, 31, 44, 58, 24] methods.
In this field, a recent breakthrough is the DEtection TRansformer (DETR) [3], which is a single-stage approach that
treats the task as a direct set prediction without requiring
hand-crafted components, like non-maximum suppression
or anchor generation. Specifically, DETR is trained in an
end-to-end manner using a set loss function which performs
bipartite matching between the predicted and the ground-truth bounding boxes. Because DETR has slow training convergence, several methods have been proposed to improve it [34, 60, 5]. Conditional DETR [34] learns a conditional spatial query from the decoder embeddings, which is used
in the decoder for cross-attention with the image features.
Deformable DETR [60] proposes deformable attention, in
which attention is performed only over a small set of key
sampling points around a reference point. Unsupervised pre-training of DETR [5] (UP-DETR) improves DETR’s convergence: randomly cropped patches are summed to the object queries, and the model is then trained to detect them in the original image. A follow-up work, DETReg [1], replaces the
random crops with proposals generated by Selective Search.
While our approach is agnostic to the exact variant of DETR, we opted to build FS-DETR upon Conditional DETR due to its fast training convergence. Beyond this, the above-mentioned works address closed-set recognition; while UP-DETR’s unsupervised pre-training could potentially be used for few-shot detection, the experimental setting presented in their work does not match the standard few-shot detection settings, and no training code is provided. We re-implemented UP-DETR [5] for few-shot detection and found that our method outperforms it, which is expected since their goal is unsupervised pre-training rather than FSOD.
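As background, the bipartite matching that DETR’s set loss performs between predictions and ground-truth boxes can be sketched with SciPy’s Hungarian solver. The cost matrix below is illustrative, standing in for DETR’s actual weighted sum of classification, L1 and GIoU costs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# rows: 4 predicted boxes, cols: 2 ground-truth boxes; lower = better match
cost = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.6],
    [0.5, 0.4],
])
pred_idx, gt_idx = linear_sum_assignment(cost)
# minimal-cost pairing: prediction 0 <-> gt 1, prediction 1 <-> gt 0;
# the two unmatched predictions would be supervised as "no object"
print(list(zip(pred_idx, gt_idx)))
```

Each matched prediction is supervised by its assigned ground-truth box, which removes the need for hand-crafted components such as non-maximum suppression.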
Few-Shot Object Detection (FSOD) methods can be categorised into re-training-based and without-re-training methods. Re-training-based methods assume that, during test time,
but before deployment, the provided samples of the novel
categories can be used to fine-tune the model. This setting is
restrictive, as it requires training before deployment. In contrast, without-re-training methods can be deployed directly, on the fly, for the detection of novel classes.
Re-training-based approaches can be divided into meta-learning and fine-tuning approaches. Meta-learning-based approaches attempt to transfer knowledge from the base classes to the novel classes through meta-learning [12, 13, 53, 49, 25, 52]. Fine-tuning-based methods follow the
standard pre-train and fine-tune pipeline. They have been
shown to significantly outperform meta-learning approaches.
TFA [48] proposes fine-tuning the final classification layer of a Faster R-CNN model (first trained on base classes) with a balanced subset that also contains the examples of the novel classes. SRR-FSD [59] proposes to construct a semantic space using word embeddings, and then to train an FSOD model by
projecting and aligning object visual features with their corresponding text embeddings. CME [27] proposes to learn a
feature embedding space where the margins between novel
classes are maximised. Retentive R-CNN [11] addresses the problem of learning an FSOD model without catastrophic forgetting (i.e. without compromising base-class accuracy). FSCE [42]
aims to decrease instance similarity between objects belong-
ing to different categories by adding a secondary branch to
the primary RoI head, which is trained via supervised contrastive learning. The method of [57] proposes a hallucinator
network to generate examples which can help the classifier
learn a better decision boundary for the novel classes. FSOD-UP [50] proposes to construct universal prototypes capturing
invariant object characteristics which, via fine-tuning, are
adapted to the novel categories. DeFRCN [36] proposes to
perform stop-gradient between the RPN and the backbone,
and scale-gradient between RCNN and the backbone.
More recently, FSODMC [10] proposes to address base
class bias via novel class fine-tuning while calibrating the
RPN, detector and backbone components to preserve well-learned prior knowledge. KFSOD [56] improves upon [9]
by replacing the class-specific average-pooling of features
with kernel-pooled representations that are meant to capture
non-linear patterns. TENET [55] extends KFSOD with a
multi-head attention transformer block on 2nd-, 3rd- and
4th-order pooling. FCT [16] extends [14] by incorporat-
ing a cross-transformer into both the feature backbone and
detection head to encourage query-support multi-level inter-
actions. Their approach is based on two-stage Faster-RCNN
trained with a binary cross-entropy loss, i.e. it is entirely
different from our architecture and training objective based
on pseudo-class prediction. Meta-DETR [54] proposes a cor-
relation aggregation module, which is then placed before a
standard DETR encoder-decoder, that filters the query image
tokens using the support images and tasks. In contrast, we
model the interactions directly via a novel visual template
prompting formulation, without any additional modules and
can process an arbitrary number of examples per object, and an arbitrary number of objects, within the same forward pass. Moreover, their method
requires fine-tuning for FSOD deployment, while ours does not require any re-training. TSF [23] proposes a transformer plug-in module for modelling interactions between the input features f and a set of learnable parameters θ representing base-class information (i.e. prototypes). In contrast to [23], our approach
does not learn any type of base class prototypes and is fully
dynamic (interactions between data and data as opposed to
data and prototypes).
Without re-training approaches are primarily based on metric learning [47, 41]. A standard approach is [19], which uses
cross-attention between the backbone’s and the query’s fea-
tures to refine the proposal generation, then re-uses the query
to re-weight the RoI features channel-wise (in a squeeze-and-excitation manner) for novel class classification. A similar approach for proposal generation is described in [9],
where the squeeze-and-excitation module is replaced with a
multi-relation network. QA-FewDet [14] extends [19, 9] by
modelling class-class, class-proposal and proposal-proposal
relationships using various GCNs. Finally, the concurrent
work of AirDet [26] attempts to learn a set of prototypes
and a cross-scale support guided proposal network, with the
association and regression performed at the end of the model
via a detection head. To our knowledge, AirDet represents
the state-of-the-art FSOD without re-training. We show that
the proposed FS-DETR outperforms it by a large margin.
Relation to our work: Our method is the first to perform re-training-free visual prompting for few-shot object detection. Different to many other works (e.g. TSF [23], AirDet [26]), FS-DETR does not perform visual prompting through learned class-related prototypes (i.e. soft-prompt-like embeddings). We emphasize that the pseudo-class embeddings in FS-DETR are class-agnostic. Finally, there are methods that are trained via metric learning [9, 14, 16] with a binary cross-entropy loss. In contrast, FS-DETR is trained to predict pseudo-classes using cross-entropy (in a class-agnostic way), which is a more powerful training objective.
3. Method
Given a dataset where each image is annotated with a set of bounding boxes representing the instantiations of C known base classes, our goal is to train a model capable of localizing objects belonging to novel classes, i.e. classes unseen during training, using up to k examples per novel class. In practice, we partition the available datasets into two disjoint sets, one containing the C_novel classes used for testing, and another with the C_base classes used for training (i.e. C = C_novel ∪ C_base and C_novel ∩ C_base = ∅).
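The disjoint base/novel partition described above can be expressed as a quick sanity check; the class names below are placeholders, not the actual benchmark splits:

```python
all_classes = {"dog", "cat", "car", "bus", "chair", "sofa"}   # C (placeholders)
novel_classes = {"bus", "sofa"}                               # C_novel: held out for testing
base_classes = all_classes - novel_classes                    # C_base: used for training

assert base_classes | novel_classes == all_classes  # C = C_base ∪ C_novel
assert base_classes & novel_classes == set()        # C_base ∩ C_novel = ∅
```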
3.1. Overview of FS-DETR
We build the proposed Few-Shot DEtection TRansformer (FS-DETR) upon DETR’s architecture¹. FS-DETR’s architecture consists of: (1) a CNN backbone used to extract visual features from the target image and the templates, (2) a transformer encoder that performs self-attention on the image tokens and cross-attention between the templates and the image tokens, and (3) a transformer decoder that processes object queries and templates to make predictions for pseudo-classes (see also below) and bounding boxes. Contrary to the related works of [9, 14, 15], our system processes an arbitrary number of templates (i.e. new classes) jointly,

¹ We note that, in practice, due to its superior convergence properties, we used Conditional DETR as the basis of our implementation, but for simplicity of exposition we will describe the original DETR architecture.
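A highly simplified sketch of this three-stage pipeline, where single-head attention stands in for the full transformer stacks and all names, shapes and feature values are illustrative assumptions rather than the authors’ implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    w = q @ k.T / np.sqrt(d)
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

# (1) backbone features for the image and the templates (random placeholders)
img_tokens = rng.normal(size=(49, d))   # e.g. a 7x7 feature map, flattened
templates = rng.normal(size=(2, d))     # one pooled feature per novel-class template

# (2) encoder: self-attention on image tokens, plus cross-attention that
#     filters the image tokens using the templates
img_tokens = attention(img_tokens, img_tokens, img_tokens)
img_tokens = img_tokens + attention(img_tokens, templates, templates)

# (3) decoder: prompts (templates "stamped" with pseudo-class embeddings)
#     prepended to the object queries, cross-attending to the encoded image
pseudo_stamps = rng.normal(size=(2, d))
queries = np.concatenate([templates + pseudo_stamps, np.zeros((100, d))], axis=0)
decoded = attention(queries, img_tokens, img_tokens)
print(decoded.shape)  # (102, 256)
```

In the real model, classification and box-regression heads on `decoded` would then output the pseudo-class and bounding box for each query.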