
MM ’22, October 10–14, 2022, Lisboa, Portugal Lu Zhang et al.
[Figure 1 placeholder. Panel (a): hierarchical taxonomy with the levels root, order, family,
genus and species. Panel (b): a partial hierarchical taxonomy T (nodes a and b with children
a.1 to a.3 and b.1 to b.3) next to the feature space, shown for Level 1, Level 2 and Level 3.]
Figure 1: (a) The hierarchical taxonomy of our HiFSOD-Bird dataset; (b) Illustration of the
proposed hierarchical contrastive learning, which constrains the feature space such that the
distribution of object features is consistent with the hierarchical taxonomy.
than FSOD, especially in scenarios where the number of object categories is huge and
existing FSOD methods are neither efficient nor effective. To address the Hi-FSOD problem,
we tackle two major subproblems:
On the one hand, we construct the first high-quality and large-scale Hi-FSOD benchmark
dataset of wild birds, called HiFSOD-Bird. Although there are already some wildlife datasets
for computer vision (CV) tasks [30, 37, 38, 45], most of them are for classification tasks
and only a few are dedicated to object detection tasks. Moreover, few of them have a strictly
hierarchical organization of categories. Existing FSOD methods perform training and testing
on the modified COCO [21] and VOC [6] datasets, whose label structures are flat and contain
only 80 and 20 categories, respectively, and which are thus unsuitable for the Hi-FSOD task.
Our HiFSOD-Bird dataset contains 1,432 categories and 176,350 bird images in total, with
high-quality annotated bounding boxes. All categories are organized into a 4-level
hierarchical taxonomy: from top to bottom, order, family, genus and species, as shown in
Fig. 1(a). It consists of 32 orders, 132 families, 572 genera and 1,432 species, covering
more than 90% of the world's water birds and part of the forest birds. The bounding boxes and
class labels of each image are manually annotated and carefully double-checked. Moreover,
each category of birds comes with a textual description, so the dataset can also be used for
the zero-shot object detection task. The HiFSOD-Bird dataset is also of great significance to
the monitoring and protection of endangered birds, since samples of endangered birds are
difficult to acquire and the domain knowledge comes mainly from expert annotations.
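To make the 4-level label structure concrete, the following is a minimal sketch of how one species label expands into one label per level. The child-to-parent map, the function names and the example taxa are assumptions chosen for illustration (the taxa are well-known birds, not entries confirmed to be in HiFSOD-Bird), and the dataset's actual annotation format may differ.

```python
from typing import Dict, List, Optional

# Level names from top to bottom, as listed in the text.
LEVELS = ["order", "family", "genus", "species"]

# Child -> parent map. Illustrative taxa only; they are not claimed to be
# part of HiFSOD-Bird.
PARENT: Dict[str, Optional[str]] = {
    "Anseriformes": None,            # order
    "Anatidae": "Anseriformes",      # family
    "Anas": "Anatidae",              # genus
    "Cygnus": "Anatidae",            # genus
    "Anas platyrhynchos": "Anas",    # species
    "Cygnus olor": "Cygnus",         # species
}


def path_to_root(node: str) -> List[str]:
    """Return the labels from the top level down to `node`."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path[::-1]


def labels_per_level(species: str) -> Dict[str, str]:
    """Expand a species label into one label per taxonomy level."""
    path = path_to_root(species)
    if len(path) != len(LEVELS):
        raise ValueError(f"{species!r} is not a leaf of the 4-level taxonomy")
    return dict(zip(LEVELS, path))


def deepest_shared_level(species_a: str, species_b: str) -> Optional[str]:
    """Return the deepest level at which two species share a label."""
    labels_a = labels_per_level(species_a)
    labels_b = labels_per_level(species_b)
    shared: Optional[str] = None
    for level in LEVELS:
        if labels_a[level] != labels_b[level]:
            break
        shared = level
    return shared
```

In this sketch, two species from different genera of the same family share labels down to the family level, which is the kind of relation a flat label set such as that of COCO or VOC cannot express.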
On the other hand, we develop the first Hi-FSOD method, HiCLPL, a two-stage method with
Hierarchical Contrastive Learning and Probabilistic Loss. Here, hierarchical contrastive
learning (HiCL) is used to constrain the feature space so that the feature distribution of
objects is consistent with the hierarchical category structure, and the probabilistic loss is
designed to enable the child nodes to correct the classification errors of their parent nodes.
Fig. 1(b) illustrates the HiCL mechanism. We use memories to hold the prototypes of the
classes in the hierarchical tree. Then, a hierarchical contrastive loss is designed to control
the distance between box features and memories at different levels. Finally, we use an
exponential moving average to update the parameters of the memories. HiCL can boost the
generalization power of the model. Meanwhile, we found that in top-down hierarchical
classification, if a non-leaf node misclassifies an instance, the classifications of that
instance at its descendant nodes are useless. Therefore, we design a probabilistic loss such
that the child nodes can learn to identify and correct the samples misclassified by their
parent nodes.
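The memory mechanism described above can be sketched as follows. This is not the HiCLPL formulation, which is not given in this section: the cosine similarity, the InfoNCE-style form of the per-level loss, the temperature, the momentum value and the per-level weights are all assumptions made for illustration, as are the function names.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def ema_update(memory: Vector, feature: Vector, momentum: float = 0.9) -> Vector:
    """Exponential moving average: momentum * memory + (1 - momentum) * feature."""
    mixed = [momentum * m + (1.0 - momentum) * f for m, f in zip(memory, feature)]
    return l2_normalize(mixed)


def level_contrastive_loss(
    feature: Vector,
    memories: Dict[str, Vector],
    target: str,
    temperature: float = 0.1,
) -> float:
    """Assumed InfoNCE-style loss at one level: pull the box feature towards the
    memory of its class and push it away from the other memories of that level."""
    f = l2_normalize(feature)
    logits = {
        name: sum(a * b for a, b in zip(f, l2_normalize(m))) / temperature
        for name, m in memories.items()
    }
    # Numerically stable log-sum-exp over all memories of this level.
    max_logit = max(logits.values())
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits.values())
    )
    return log_denominator - logits[target]


def hierarchical_contrastive_loss(
    feature: Vector,
    memories_per_level: List[Dict[str, Vector]],
    targets: List[str],
    weights: Optional[List[float]] = None,
) -> float:
    """Weighted sum of the per-level losses along the path of one instance."""
    weights = weights or [1.0] * len(targets)
    return sum(
        w * level_contrastive_loss(feature, memories, target)
        for w, memories, target in zip(weights, memories_per_level, targets)
    )
```

Under these assumptions, a box feature that lies close to the memories on its own path (for example node a and then a.1 in Fig. 1(b)) receives a lower loss than the same feature labeled with a path through node b, which is the sense in which the loss ties the feature space to the taxonomy.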
In summary, the contributions of this paper are as follows: 1) We propose a new problem,
hierarchical few-shot object detection (Hi-FSOD), which extends the existing FSOD problem
and is therefore more challenging and has wider applications. 2) We establish the first
large-scale and high-quality benchmark dataset, HiFSOD-Bird, specifically for the Hi-FSOD
problem. 3) We develop the first Hi-FSOD method, HiCLPL, which uses hierarchical contrastive
learning to constrain the feature space and a probabilistic loss to correct the
classification errors of parent nodes. 4) We conduct extensive experiments on HiFSOD-Bird to
evaluate HiCLPL. Experimental results show that HiCLPL outperforms the existing FSOD methods.
2 RELATED WORK
2.1 Few-shot Object Detection
Existing few-shot object detection (FSOD) methods roughly fall into two types:
meta-learning based and fine-tuning based. Meta-learning based methods [7, 17, 36, 39, 40]
learn meta knowledge from base classes to facilitate model training for novel classes. Among
them, FSRW [17] utilizes a feature re-weighting strategy to construct a one-stage object
detector. Attention-RPN [7] integrates support information into the RPN in order to pay more
attention to the foreground objects relevant to the support classes. Meta-DETR [39] exploits
inter-class correlation to apply the detection transformer [44] to the FSOD task. We proposed
a support-query mutual guidance strategy that can generate more support-relevant candidate
regions, together with a hybrid loss to enhance the metric space [40]. Fine-tuning based
methods [24, 27, 31, 35, 42] formulate the FSOD problem in a transfer learning setting.
TFA [31] is the first work to propose a two-stage fine-tuning strategy. It first trains the
entire model on the base classes, and then fine-tunes the final classifier on a balanced
dataset containing base and novel data. Experiments show that this fine-tuning method is
simple yet very effective. Following TFA, a number of methods have been developed.
DeFRCN [24] adopts multi-stage and multi-task decoupling to improve performance. FSCE [27]
uses a contrastive learning strategy to increase the intra-class similarity and the
inter-class distinction of box features. Nevertheless, existing methods do not consider
scenarios where object classes form a hierarchical taxonomy, so they cannot be directly used
to effectively handle the problem proposed in this paper.
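As a rough illustration of the two-stage fine-tuning strategy described above, the sketch below builds a class-balanced K-shot set and marks which parameters would be updated in each stage. It is not TFA's released code: the annotation format, the stage names and the `classifier.` parameter prefix are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical annotation format: (image_id, class_name) per annotated instance.
Annotation = Tuple[str, str]


def build_balanced_kshot_set(
    annotations: Sequence[Annotation], k: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Keep at most k annotated instances per class, base and novel alike."""
    by_class: Dict[str, List[str]] = defaultdict(list)
    for image_id, class_name in annotations:
        by_class[class_name].append(image_id)

    rng = random.Random(seed)  # fixed seed so the sampled shots are reproducible
    balanced: Dict[str, List[str]] = {}
    for class_name in sorted(by_class):
        items = by_class[class_name]
        balanced[class_name] = sorted(rng.sample(items, min(k, len(items))))
    return balanced


def trainable_in_stage(param_name: str, stage: str) -> bool:
    """Decide whether a parameter is updated in the given stage.

    Stage 1 trains the entire model on the base classes; stage 2 fine-tunes
    only the final classifier. The "classifier." prefix is a hypothetical
    naming convention.
    """
    if stage == "base_training":
        return True
    if stage == "fine_tuning":
        return param_name.startswith("classifier.")
    raise ValueError(f"unknown stage: {stage!r}")
```

Sampling the same number of shots for base and novel classes is what makes the fine-tuning set balanced, so abundant base-class data does not dominate the few novel-class examples.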
Different from the works above, we address a new problem: hierarchical few-shot object
detection (Hi-FSOD). To this end, we build a large-scale and high-quality benchmark dataset
and develop an effective method.