
a balanced subset that also contains examples of the novel classes. SRR-FSD [59] proposes to construct a semantic space using word embeddings, and then train a FSOD by projecting and aligning object visual features with their corresponding text embeddings. CME [27] proposes to learn a feature embedding space where the margins between novel classes are maximised. Retentive R-CNN [11] addresses the problem of learning a FSOD without catastrophic forgetting (i.e. without compromising base class accuracy). FSCE [42] aims to decrease instance similarity between objects belonging to different categories by adding a secondary branch to the primary RoI head, which is trained via supervised contrastive learning. The method of [57] proposes a hallucinator network to generate examples which can help the classifier learn a better decision boundary for the novel classes. FSOD-UP [50] proposes to construct universal prototypes capturing invariant object characteristics which, via fine-tuning, are adapted to the novel categories. DeFRCN [36] proposes to perform stop-gradient between the RPN and the backbone, and scale-gradient between the RCNN and the backbone.
More recently, FSODMC [10] proposes to address base class bias via novel class fine-tuning while calibrating the RPN, detector and backbone components to preserve well-learned prior knowledge. KFSOD [56] improves upon [9] by replacing the class-specific average-pooling of features with kernel-pooled representations that are meant to capture non-linear patterns. TENET [55] extends KFSOD with a multi-head attention transformer block on 2nd-, 3rd- and 4th-order pooling. FCT [16] extends [14] by incorporating a cross-transformer into both the feature backbone and detection head to encourage query-support multi-level interactions. Their approach is based on a two-stage Faster-RCNN trained with a binary cross-entropy loss, i.e. it is entirely different from our architecture and training objective based on pseudo-class prediction. Meta-DETR [54] proposes a correlation aggregation module, placed before a standard DETR encoder-decoder, that filters the query image tokens using the support images and tasks. In contrast, we model the interactions directly via a novel visual template prompting formulation, without any additional modules, and we can process an arbitrary number of examples per object, and an arbitrary number of objects, within the same forward pass. Moreover, their method requires fine-tuning for FSOD deployment, while ours doesn't require any retraining. TSF [23] proposes a transformer plug-in module for modelling interactions between the input features f and a set of learnable parameters θ representing base class information (i.e. prototypes). In contrast to [23], our approach does not learn any type of base class prototypes and is fully dynamic (interactions between data and data as opposed to data and prototypes).
Without re-training: Approaches in this category are primarily based on metric learning [47, 41]. A standard approach is [19], which uses cross-attention between the backbone's and the query's features to refine the proposal generation, then re-uses the query to re-weight the RoI features channel-wise (in a squeeze-and-excitation manner) for novel class classification. A similar approach for proposal generation is described in [9], where the squeeze-and-excitation module is replaced with a multi-relation network. QA-FewDet [14] extends [19, 9] by modelling class-class, class-proposal and proposal-proposal relationships using various GCNs. Finally, the concurrent work of AirDet [26] attempts to learn a set of prototypes and a cross-scale support-guided proposal network, with the association and regression performed at the end of the model via a detection head. To our knowledge, AirDet represents the state-of-the-art FSOD without re-training. We show that the proposed FS-DETR outperforms it by a large margin.
Relation to our work: Our method is the first to perform re-training-free visual prompting for few-shot object detection. Unlike many other works (e.g. TSF [23], AirDet [26]), FS-DETR does not perform visual prompting via learned class-related prototypes (i.e. soft-prompt-like representations). We emphasize that the pseudo-class embeddings in FS-DETR are class-agnostic. Finally, there are methods which are trained with metric learning [9, 14, 16] using a binary cross-entropy loss. In contrast, FS-DETR is trained to predict pseudo-classes using a cross-entropy loss (in a class-agnostic way), which is a more powerful training objective.
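To make this distinction concrete, the following is a minimal sketch of what class-agnostic pseudo-class supervision could look like. The tensor shapes, the background slot, the Hungarian-matching assumption and the per-iteration reshuffling of pseudo-class indices are our illustrative assumptions for exposition, not the exact implementation.

# A minimal sketch (our assumptions, not the paper's code): class-agnostic
# pseudo-class supervision with softmax cross-entropy over k pseudo-classes
# plus a background slot, as opposed to a per-class binary cross-entropy.
import torch
import torch.nn.functional as F

def pseudo_class_loss(query_logits, matched_targets):
    # query_logits: [num_queries, k + 1] scores (index 0 = background).
    # matched_targets: pseudo-class index assigned to each query, e.g. by
    # Hungarian matching of predicted boxes to ground-truth boxes.
    return F.cross_entropy(query_logits, matched_targets)

k = 3                             # number of templates (pseudo-classes) in the batch
perm = torch.randperm(k) + 1      # categories -> pseudo-class ids, reshuffled each iteration
logits = torch.randn(10, k + 1)   # scores for 10 object queries
targets = torch.cat([torch.zeros(4, dtype=torch.long),    # unmatched queries -> background
                     perm[torch.randint(0, k, (6,))]])    # matched queries -> shuffled ids
loss = pseudo_class_loss(logits, targets)

Because the pseudo-class indices carry no semantics and can change at every iteration, the model cannot memorise class identities and must instead rely on the visual templates (prompts).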
3. Method
Given a dataset where each image is annotated with a set of bounding boxes representing the instantiations of C known base classes, our goal is to train a model capable of localizing objects belonging to novel classes, i.e. unseen during training, using up to k examples per novel class. In practice, we partition the available datasets into two disjoint sets, one containing C_novel classes for testing, and another with C_base classes for training (i.e. C = C_novel ∪ C_base and C_novel ∩ C_base = ∅).
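This episodic setup can be summarised with the short sketch below; the class names and the sampling routine are illustrative assumptions for exposition only, not the exact protocol of any particular benchmark.

# A minimal sketch of the base/novel split and k-shot support sampling
# described above (illustrative class names; not an exact benchmark protocol).
import random
from collections import defaultdict

classes = ["person", "car", "dog", "boat", "bird", "sofa"]   # hypothetical label set C
novel = {"boat", "bird", "sofa"}                             # C_novel, held out for testing
base = set(classes) - novel                                  # C_base, used for training

def sample_support(annotations, k):
    # annotations: list of (image_id, class_name, bbox) tuples.
    # Returns up to k template examples for each novel class.
    per_class = defaultdict(list)
    for image_id, cls, bbox in annotations:
        if cls in novel:
            per_class[cls].append((image_id, bbox))
    return {cls: random.sample(objs, min(k, len(objs))) for cls, objs in per_class.items()}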
3.1. Overview of FS-DETR
We build the proposed Few-Shot DEtection TRansformer (FS-DETR) upon DETR's architecture¹. FS-DETR's architecture consists of: (1) the CNN backbone used to extract visual features from the target image and the templates, (2) a transformer encoder that performs self-attention on the image tokens and cross-attention between the templates and the image tokens, and (3) a transformer decoder that processes object queries and templates to make predictions for pseudo-classes (see also below) and bounding boxes.
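A schematic sketch of how these three components could be wired together is given below; the module sizes, the pooling of template features and the way templates are injected as extra tokens are our own simplifications of the description above, not the exact FS-DETR implementation.

# A simplified sketch of the three components described above (our own
# schematic, not the exact implementation): a CNN backbone, an encoder
# mixing image tokens with template tokens, and a decoder that predicts
# pseudo-classes and boxes from object queries.
import torch
import torch.nn as nn

class FSDetrSketch(nn.Module):
    def __init__(self, d=256, num_queries=100, k_max=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, d, 7, stride=16, padding=3), nn.ReLU())
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, 8, batch_first=True), 3)
        self.decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 8, batch_first=True), 3)
        self.queries = nn.Embedding(num_queries, d)
        self.cls_head = nn.Linear(d, k_max + 1)   # k pseudo-classes + background
        self.box_head = nn.Linear(d, 4)           # normalised (cx, cy, w, h)

    def forward(self, image, templates):
        # image: [B, 3, H, W]; templates: [B, k, 3, h, w] cropped few-shot examples.
        B, k = templates.shape[:2]
        img_tok = self.backbone(image).flatten(2).transpose(1, 2)                      # [B, HW', d]
        tpl_tok = self.backbone(templates.flatten(0, 1)).mean((2, 3)).view(B, k, -1)   # [B, k, d]
        # (2) Encoder: self-attention over image tokens plus attention to the template
        # tokens, approximated here by jointly encoding the concatenated sequence.
        memory = self.encoder(torch.cat([tpl_tok, img_tok], dim=1))
        # (3) Decoder: object queries attend to the encoded memory.
        hs = self.decoder(self.queries.weight.unsqueeze(0).expand(B, -1, -1), memory)
        return self.cls_head(hs), self.box_head(hs).sigmoid()

model = FSDetrSketch()
logits, boxes = model(torch.randn(1, 3, 256, 256), torch.randn(1, 5, 3, 64, 64))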
Contrary to the related works of [9, 14, 15], our system processes an arbitrary number of templates (i.e. new classes) jointly,
¹ We note that, in practice, due to its superior convergence properties, we used the Conditional DETR as the basis of our implementation, but for simplicity of exposition we will use the original DETR architecture.