In what follows, we exclusively work and compare with two-stage heads. Two-stage heads can be
further subdivided into region-based and query-based two-stage heads, depending on how the second
stage of the head is implemented.
Region-based two-stage heads pool a rectangular grid of features from the backbone feature maps
and further process these using convolutional neural networks (CNNs) in order to obtain their final
object predictions. Region-based heads are found in many well-established two-stage object detectors
such as Faster R-CNN [26] and Cascade R-CNN [1].
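As a rough illustration of the pooling step described above, the following sketch samples a fixed-size grid of features inside a box from a single-channel feature map. The function name `pool_grid`, the grid size and the nearest-neighbour sampling rule are assumptions made for brevity; real region-based heads operate on multi-channel feature maps and use operations such as RoI pooling or RoIAlign.

```python
def pool_grid(feature_map, box, grid_size=2):
    """Sample a grid_size x grid_size grid of features inside `box`.

    feature_map: 2D list indexed as feature_map[y][x] (single channel).
    box: (x0, y0, x1, y1) in feature-map coordinates.
    Sampling is nearest-neighbour at each grid-cell centre (an assumption).
    """
    x0, y0, x1, y1 = box
    height, width = len(feature_map), len(feature_map[0])
    cell_w = (x1 - x0) / grid_size
    cell_h = (y1 - y0) / grid_size
    pooled = []
    for row in range(grid_size):
        pooled_row = []
        for col in range(grid_size):
            # Centre of the current grid cell, clamped to the map bounds.
            cx = x0 + (col + 0.5) * cell_w
            cy = y0 + (row + 0.5) * cell_h
            ix = min(max(int(cx), 0), width - 1)
            iy = min(max(int(cy), 0), height - 1)
            pooled_row.append(feature_map[iy][ix])
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 feature map whose value encodes its position as 10 * y + x.
feature_map = [[10 * y + x for x in range(4)] for y in range(4)]
grid = pool_grid(feature_map, box=(0, 0, 4, 4), grid_size=2)
```

The pooled grid would then be passed to a CNN in the second stage, which is not shown here.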
Query-based two-stage heads were recently introduced in the two-stage variant of Deformable
DETR [39]. While region-based heads extract a grid of features from the backbone feature maps
per detection, only a single feature called the query is selected in query-based heads. These queries
are then further processed using operations commonly found in a transformer decoder [31], namely
cross-attention, self-attention, and feedforward operations. Here the cross-attention operation is
defined between the backbone feature maps and each query, the self-attention operation is applied
between the queries from different predictions, and the feedforward operations are performed on each
query individually.
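The three operations can be sketched as follows. This is a minimal illustration under several assumptions: plain scaled dot-product attention stands in for the deformable attention of Deformable DETR, no learned projections or normalization layers are used, the feedforward step is a placeholder ReLU, and the operations are applied in the order in which they are listed above. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]


def feedforward(query):
    """Placeholder per-query feedforward: a ReLU without learned weights."""
    return [max(0.0, x) for x in query]


def decoder_layer(queries, features):
    # Cross-attention: each query attends to the backbone features.
    queries = [add(q, attend(q, features, features)) for q in queries]
    # Self-attention: each query attends to all queries (itself included).
    queries = [add(q, attend(q, queries, queries)) for q in queries]
    # Feedforward: applied to each query independently.
    return [add(q, feedforward(q)) for q in queries]


# Three flattened backbone features and two queries, all of dimension 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
updated = decoder_layer(queries, features)
```

Note that only the self-attention step lets information flow between queries; the other two steps treat each query separately.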
In this paper, we propose a new query-based two-stage head, called the FQDet (Fast-converging
Query-based Detector) head. FQDet was obtained by combining the query-based paradigm from two-
stage Deformable DETR [39] with classical object detection techniques such as anchor generation
and static (i.e. non-Hungarian) matching.
When evaluated on the 2017 COCO object detection benchmark [17] after training for 12 epochs, our
FQDet head with ResNet-50+TPN backbone achieves 45.4 AP, outperforming other high-performing
two-stage heads such as Faster R-CNN [26], Cascade R-CNN [1], two-stage Deformable DETR [39]
and Sparse R-CNN [28] when using the same backbone.
2 Related work
DETR and its variants.
With its unique way of approaching the object detection task, DETR [2]
quickly gained a lot of popularity. Since then, many variants such as SMCA [9], Conditional DETR [22],
Deformable DETR [39], Anchor DETR [32], DAB-DETR [18] and DN-DETR [14] have been
proposed to address the two main shortcomings of DETR, namely its slow convergence speed and its
poor performance on small objects. In doing so, Deformable DETR [39] introduces the query-based
two-stage head, a new type of two-stage head now also used in other works such as DINO [34].
Anchors.
Our FQDet head differs from other query-based two-stage heads by introducing anchors
within the query-based two-stage paradigm. Faster R-CNN [26] was one of the first works to use
anchors. Anchors are axis-aligned boxes of different sizes and aspect ratios that are attached to
backbone feature locations and are refined to yield the final bounding box predictions. Anchors are
found in both one-stage detectors (SSD [20], YOLO [25], RetinaNet [16]) and two-stage detectors
(Faster R-CNN [26], Cascade R-CNN [1]). Over the years, many anchor-free detectors have also
been proposed such as CornerNet [13], FCOS [30], CenterNet [37], FoveaBox [12], DETR [2] and
Deformable DETR [39].
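To make the notion of anchors concrete, the sketch below attaches one axis-aligned box per size and aspect ratio to every location of a feature map. The function name `generate_anchors`, the stride, and the chosen sizes and aspect ratios are illustrative assumptions, not the settings used by FQDet or by any specific detector.

```python
import math


def generate_anchors(map_height, map_width, stride, sizes, aspect_ratios):
    """Return axis-aligned anchors as (x0, y0, x1, y1) in image coordinates.

    One anchor is created per (feature location, size, aspect ratio).
    An aspect ratio is height / width; every anchor has area size ** 2.
    """
    anchors = []
    for iy in range(map_height):
        for ix in range(map_width):
            # Centre of the feature location in image coordinates.
            cx = (ix + 0.5) * stride
            cy = (iy + 0.5) * stride
            for size in sizes:
                for ratio in aspect_ratios:
                    w = size / math.sqrt(ratio)
                    h = size * math.sqrt(ratio)
                    anchors.append(
                        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                    )
    return anchors


# 2x2 feature map with stride 8, two sizes and three aspect ratios.
anchors = generate_anchors(
    2, 2, stride=8, sizes=[16, 32], aspect_ratios=[0.5, 1.0, 2.0]
)
```

In an anchor-based head, each of these boxes would subsequently be refined by predicted offsets to yield a final bounding box.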
Matching.
Our FQDet head additionally differs from other query-based two-stage heads by using
a static top-k matching scheme as opposed to the dynamic Hungarian matching scheme. Matching
is the process responsible for assigning ground-truth labels to the different model outputs during
training. Most detectors have a static matching scheme, meaning that the matching scheme stays
the same throughout the whole training process. Static matching schemes are common when using
anchors, where label assignment is done based on the IoU overlap between anchors and ground-truth
boxes (e.g. Faster R-CNN [26], RetinaNet [16] and Cascade R-CNN [1]). Some works make use of a
dynamic matching scheme, meaning that the matching scheme changes as training progresses. In
Dynamic R-CNN [35], the minimum IoU thresholds are increased throughout the training process.
In DETR [2], a dynamic Hungarian matching scheme is used, matching every ground-truth to a
single prediction (one-to-one matching). In OTA [10], the one-to-one Hungarian matching scheme
from DETR is extended to one-to-many Hungarian matching, where a single ground-truth can be
assigned to multiple predictions.
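The following sketch illustrates one plausible form of static, IoU-based top-k matching: each ground-truth box is assigned to the k anchors with which it overlaps most. The exact rule used by FQDet is not specified in this section, so the function `top_k_match`, the tie-breaking rule and the exclusion of zero-overlap anchors are all assumptions. The rule is static because it depends only on the anchors and ground-truth boxes, not on the state of training.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top_k_match(anchors, gt_boxes, k):
    """Assign each ground-truth box to its k highest-IoU anchors.

    Returns a dict mapping anchor index -> ground-truth index.
    Assumptions: anchors with zero overlap are never matched, and an
    anchor selected by several ground truths keeps the highest-IoU one.
    """
    best = {}  # anchor index -> (IoU, ground-truth index)
    for g, gt in enumerate(gt_boxes):
        scored = [(iou(anchor, gt), i) for i, anchor in enumerate(anchors)]
        scored.sort(key=lambda pair: -pair[0])  # highest IoU first
        for value, i in scored[:k]:
            if value <= 0.0:
                continue
            if i not in best or value > best[i][0]:
                best[i] = (value, g)
    return {i: g for i, (_, g) in best.items()}


anchors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30), (0, 0, 12, 12)]
gt_boxes = [(0, 0, 10, 10), (21, 21, 31, 31)]
matches = top_k_match(anchors, gt_boxes, k=2)
```

Here the first ground-truth box is matched to two anchors (one-to-many), whereas one-to-one Hungarian matching as in DETR would assign it to a single prediction.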