
a balanced subset that also contains examples of the novel classes. SRR-FSD [59] proposes to construct a semantic space using word embeddings, and then train a FSOD by projecting and aligning object visual features with their corresponding text embeddings. CME [27] proposes to learn a feature embedding space where the margins between novel classes are maximised. Retentive R-CNN [11] addresses the problem of learning a FSOD without catastrophic forgetting (i.e. without compromising base class accuracy). FSCE [42] aims to decrease instance similarity between objects belonging to different categories by adding a secondary branch to the primary RoI head, which is trained via supervised contrastive learning. The method of [57] proposes a hallucinator network to generate examples which can help the classifier learn a better decision boundary for the novel classes. FSOD-UP [50] proposes to construct universal prototypes capturing invariant object characteristics which, via fine-tuning, are adapted to the novel categories. DeFRCN [36] proposes to perform stop-gradient between the RPN and the backbone, and scale-gradient between the RCNN and the backbone.
More recently, FSODMC [10] proposes to address base class bias via novel class fine-tuning while calibrating the RPN, detector and backbone components to preserve well-learned prior knowledge. KFSOD [56] improves upon [9] by replacing the class-specific average-pooling of features with kernel-pooled representations that are meant to capture non-linear patterns. TENET [55] extends KFSOD with a multi-head attention transformer block on 2nd-, 3rd- and 4th-order pooling. FCT [16] extends [14] by incorporating a cross-transformer into both the feature backbone and detection head to encourage query-support multi-level interactions. Their approach is based on a two-stage Faster-RCNN trained with a binary cross-entropy loss, i.e. it is entirely different from our architecture and training objective based on pseudo-class prediction. Meta-DETR [54] proposes a correlation aggregation module, placed before a standard DETR encoder-decoder, that filters the query image tokens using the support images and tasks. In contrast, we model the interactions directly via a novel visual template prompting formulation, without any additional modules, and we can process an arbitrary number of examples per object, and an arbitrary number of objects, within the same forward pass. Moreover, their method requires fine-tuning for FSOD deployment, while ours doesn't require any retraining. TSF [23] proposes a transformer plug-in module for modelling interactions between the input features f and a set of learnable parameters θ representing base class information (i.e. prototypes). In contrast to [23], our approach does not learn any type of base class prototypes and is fully dynamic (interactions between data and data as opposed to data and prototypes).
Without re-training: Approaches in this category are primarily based on metric learning [47, 41]. A standard approach is [19], which uses cross-attention between the backbone's and the query's features to refine the proposal generation, then re-uses the query to re-weight the RoI features channel-wise (in a squeeze-and-excitation manner) for novel class classification. A similar approach for proposal generation is described in [9], where the squeeze-and-excitation module is replaced with a multi-relation network. QA-FewDet [14] extends [19, 9] by modelling class-class, class-proposal and proposal-proposal relationships using various GCNs. Finally, the concurrent work of AirDet [26] attempts to learn a set of prototypes and a cross-scale support-guided proposal network, with the association and regression performed at the end of the model via a detection head. To our knowledge, AirDet represents the state-of-the-art FSOD without re-training. We show that the proposed FS-DETR outperforms it by a large margin.
Relation to our work: Our method is the first to perform re-training-free visual prompting for few-shot object detection. Unlike many other works (e.g. TSF [23], AirDet [26]), FS-DETR does not perform visual prompting via learned class-related prototypes (i.e. soft-prompt-like representations). We emphasize that the pseudo-class embeddings in FS-DETR are class-agnostic. Finally, there are methods which are trained with metric learning [9, 14, 16] using a binary cross-entropy loss. In contrast, FS-DETR is trained to predict pseudo-classes using a cross-entropy loss (in a class-agnostic way), which is a more powerful training objective.
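To make this distinction concrete, the following is a minimal sketch of what class-agnostic pseudo-class supervision could look like. The tensor shapes, the background slot, the Hungarian-matching assumption and the per-iteration reshuffling of pseudo-class indices are our illustrative assumptions for exposition, not the exact implementation.

# A minimal sketch (our assumptions, not the paper's code): class-agnostic
# pseudo-class supervision with softmax cross-entropy over k pseudo-classes
# plus a background slot, as opposed to a per-class binary cross-entropy.
import torch
import torch.nn.functional as F

def pseudo_class_loss(query_logits, matched_targets):
    # query_logits: [num_queries, k + 1] scores (index 0 = background).
    # matched_targets: pseudo-class index assigned to each query, e.g. by
    # Hungarian matching of predicted boxes to ground-truth boxes.
    return F.cross_entropy(query_logits, matched_targets)

k = 3                             # number of templates (pseudo-classes) in the batch
perm = torch.randperm(k) + 1      # categories -> pseudo-class ids, reshuffled each iteration
logits = torch.randn(10, k + 1)   # scores for 10 object queries
targets = torch.cat([torch.zeros(4, dtype=torch.long),    # unmatched queries -> background
                     perm[torch.randint(0, k, (6,))]])    # matched queries -> shuffled ids
loss = pseudo_class_loss(logits, targets)

Because the pseudo-class indices carry no semantics and can change at every iteration, the model cannot memorise class identities and must instead rely on the visual templates (prompts).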
3. Method
Given a dataset where each image is annotated with a set of bounding boxes representing the instantiations of C known base classes, our goal is to train a model capable of localizing objects belonging to novel classes, i.e. unseen during training, using up to k examples per novel class. In practice, we partition the available datasets into two disjoint sets, one containing C_novel classes for testing, and another with C_base classes for training (i.e. C = C_novel ∪ C_base and C_novel ∩ C_base = ∅).
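This episodic setup can be summarised with the short sketch below; the class names and the sampling routine are illustrative assumptions for exposition only, not the exact protocol of any particular benchmark.

# A minimal sketch of the base/novel split and k-shot support sampling
# described above (illustrative class names; not an exact benchmark protocol).
import random
from collections import defaultdict

classes = ["person", "car", "dog", "boat", "bird", "sofa"]   # hypothetical label set C
novel = {"boat", "bird", "sofa"}                             # C_novel, held out for testing
base = set(classes) - novel                                  # C_base, used for training

def sample_support(annotations, k):
    # annotations: list of (image_id, class_name, bbox) tuples.
    # Returns up to k template examples for each novel class.
    per_class = defaultdict(list)
    for image_id, cls, bbox in annotations:
        if cls in novel:
            per_class[cls].append((image_id, bbox))
    return {cls: random.sample(objs, min(k, len(objs))) for cls, objs in per_class.items()}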
3.1. Overview of FS-DETR
We build the proposed Few-Shot DEtection TRansformer (FS-DETR) upon DETR's architecture¹. FS-DETR's architecture consists of: (1) the CNN backbone used to extract visual features from the target image and the templates, (2) a transformer encoder that performs self-attention on the image tokens and cross-attention between the templates and the image tokens, and (3) a transformer decoder that processes object queries and templates to make predictions for pseudo-classes (see also below) and bounding boxes.
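A schematic sketch of how these three components could be wired together is given below; the module sizes, the pooling of template features and the way templates are injected as extra tokens are our own simplifications of the description above, not the exact FS-DETR implementation.

# A simplified sketch of the three components described above (our own
# schematic, not the exact implementation): a CNN backbone, an encoder
# mixing image tokens with template tokens, and a decoder that predicts
# pseudo-classes and boxes from object queries.
import torch
import torch.nn as nn

class FSDetrSketch(nn.Module):
    def __init__(self, d=256, num_queries=100, k_max=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, d, 7, stride=16, padding=3), nn.ReLU())
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, 8, batch_first=True), 3)
        self.decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 8, batch_first=True), 3)
        self.queries = nn.Embedding(num_queries, d)
        self.cls_head = nn.Linear(d, k_max + 1)   # k pseudo-classes + background
        self.box_head = nn.Linear(d, 4)           # normalised (cx, cy, w, h)

    def forward(self, image, templates):
        # image: [B, 3, H, W]; templates: [B, k, 3, h, w] cropped few-shot examples.
        B, k = templates.shape[:2]
        img_tok = self.backbone(image).flatten(2).transpose(1, 2)                      # [B, HW', d]
        tpl_tok = self.backbone(templates.flatten(0, 1)).mean((2, 3)).view(B, k, -1)   # [B, k, d]
        # (2) Encoder: self-attention over image tokens plus attention to the template
        # tokens, approximated here by jointly encoding the concatenated sequence.
        memory = self.encoder(torch.cat([tpl_tok, img_tok], dim=1))
        # (3) Decoder: object queries attend to the encoded memory.
        hs = self.decoder(self.queries.weight.unsqueeze(0).expand(B, -1, -1), memory)
        return self.cls_head(hs), self.box_head(hs).sigmoid()

model = FSDetrSketch()
logits, boxes = model(torch.randn(1, 3, 256, 256), torch.randn(1, 5, 3, 64, 64))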
Contrary to the related works of [9, 14, 15], our system processes an arbitrary number of templates (i.e. new classes) jointly,
¹ We note that, in practice, due to its superior convergence properties, we used the Conditional DETR as the basis of our implementation, but for simplicity of exposition we will use the original DETR architecture.