FQDet: Fast-converging Query-based Detector
Cédric Picron
ESAT-PSI, KU Leuven
cedric.picron@esat.kuleuven.be
Punarjay Chakravarty
Ford Greenfield Labs, Palo Alto
pchakra5@ford.com
Tinne Tuytelaars
ESAT-PSI, KU Leuven
tinne.tuytelaars@esat.kuleuven.be
Abstract
Recently, two-stage Deformable DETR introduced the query-based two-stage head,
a new type of two-stage head different from the region-based two-stage heads of
classical detectors such as Faster R-CNN. In query-based two-stage heads, the second
stage selects a single feature per detection, called the query, which is processed by a
transformer, as opposed to pooling a rectangular grid of features processed by CNNs
as in region-based detectors. In this work, we improve the query-based head by
improving the prior of the cross-attention operation with anchors, significantly
speeding up convergence while increasing performance. Additionally, we
empirically show that by improving the cross-attention prior, the auxiliary losses and
iterative bounding box mechanisms typically used by DETR-based detectors are
no longer needed. By combining the best of both the classical and the DETR-based
detectors, our FQDet head peaks at 45.4 AP on the 2017 COCO validation set
when using a ResNet-50+TPN backbone, after training for only 12 epochs using
the 1x schedule. We outperform other high-performing two-stage heads such as
Cascade R-CNN, while using the same backbone and being computationally
cheaper. Additionally, when using the large ResNeXt-101-DCN+TPN backbone
and multi-scale testing, our FQDet head achieves 52.9 AP on the 2017 COCO
test-dev set after only 12 epochs of training. Code is released at
https://github.com/CedricPicron/FQDet.
1 Introduction
Deep neural networks designed to solve the object detection task are commonly subdivided
into a backbone and a head. The backbone takes in an image and outputs a set of
feature maps. These feature maps are typically of different resolutions, with a factor of two separating
consecutive maps in width and height, hence forming what is known as a feature pyramid [15]. A
common choice for the backbone is a ResNet-50 [11] network in combination with a pyramid
network (PN) such as FPN [15], PANet [19], BiFPN [29] or TPN [23].
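As a concrete illustration of the factor-of-two relation between consecutive pyramid maps, the following sketch computes the spatial sizes of pyramid levels for a square input image. The level range P3-P7 is a common convention (used e.g. by RetinaNet), chosen here for illustration only.

```python
# Sketch: spatial sizes of a feature pyramid for a square input image.
# Level l has stride 2**l, so consecutive levels differ by a factor of
# two in both width and height.

def pyramid_shapes(image_size, levels=(3, 4, 5, 6, 7)):
    """Return {level: (height, width)} for a square input image."""
    return {l: (image_size // 2**l, image_size // 2**l) for l in levels}

shapes = pyramid_shapes(512)
# P3 is 64x64, P4 is 32x32, ..., P7 is 4x4
```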
The object detection head takes in the set of feature maps from the backbone and
outputs object detection predictions, each consisting of an axis-aligned box
together with a corresponding class label and confidence score. Object detection heads are commonly
subdivided into one-stage and two-stage heads. One-stage heads make an object detection prediction
for every feature in the input feature maps. Two-stage heads first evaluate for every feature
whether it is related to an object, and then use these results to focus further processing on object-related
parts of the feature maps.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02318v2 [cs.CV] 28 Oct 2022
In what follows, we exclusively work and compare with two-stage heads. Two-stage heads can further
be subdivided into region-based and query-based heads, depending on how the second stage
is implemented.
Region-based two-stage heads pool a rectangular grid of features from the backbone feature maps
and further process these using convolutional neural networks (CNNs) in order to obtain their final
object predictions. Region-based heads are found in many well-established two-stage object detectors
such as Faster R-CNN [26] and Cascade R-CNN [1].
Query-based two-stage heads were recently introduced in the two-stage variant of Deformable
DETR [39]. While region-based heads extract a grid of features from the backbone feature maps
per detection, only a single feature called the query is selected in query-based heads. These queries
are then further processed using operations commonly found in a transformer decoder [31], namely
cross-attention, self-attention, and feedforward operations. Here the cross-attention operation is
defined between the backbone feature maps and each query, the self-attention operation is applied
between the queries from different predictions, and the feedforward operations are performed on each
query individually.
In this paper, we propose a new query-based two-stage head, called the FQDet (Fast-converging
Query-based Detector) head. FQDet was obtained by combining the query-based paradigm from two-
stage Deformable DETR [39] with classical object detection techniques such as anchor generation
and static (i.e. non-Hungarian) matching.
When evaluated on the 2017 COCO object detection benchmark [17] after training for 12 epochs, our
FQDet head with a ResNet-50+TPN backbone achieves 45.4 AP, outperforming other high-performing
two-stage heads such as Faster R-CNN [26], Cascade R-CNN [1], two-stage Deformable DETR [39]
and Sparse R-CNN [28] when using the same backbone.
2 Related work
DETR and its variants.
With its unique way of approaching the object detection task, DETR [2]
quickly gained a lot of popularity. Since then, many variants such as SMCA [9], Conditional DETR [22],
Deformable DETR [39], Anchor DETR [32], DAB-DETR [18] and DN-DETR [14] have been
proposed to address the two main shortcomings of DETR, namely its slow convergence and its
poor performance on small objects. In doing so, Deformable DETR [39] introduced the query-based
two-stage head, a new type of two-stage head now also used in other works such as DINO [34].
Anchors.
Our FQDet head differs from other query-based two-stage heads by introducing anchors
within the query-based two-stage paradigm. Faster R-CNN [26] was one of the first works to use
anchors. Anchors are axis-aligned boxes of different sizes and aspect ratios that are attached to
backbone feature locations and are refined to yield the final bounding box predictions. Anchors are
found in both one-stage detectors (SSD [20], YOLO [25], RetinaNet [16]) and two-stage detectors
(Faster R-CNN [26], Cascade R-CNN [1]). Over the years, many anchor-free detectors have also
been proposed such as CornerNet [13], FCOS [30], CenterNet [37], FoveaBox [12], DETR [2] and
Deformable DETR [39].
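To make the anchor concept concrete, here is a minimal sketch of anchor generation over a single feature map: at each feature location, boxes of several sizes and aspect ratios are placed. The scales and ratios below are illustrative defaults, not the configuration of any particular detector above.

```python
# Sketch of anchor generation: at each feature-map location, attach boxes
# of different scales and aspect ratios (ratio = width / height).

def make_anchors(fmap_h, fmap_w, stride, scales=(1.0,), ratios=(0.5, 1.0, 2.0)):
    """Return a list of anchors (cx, cy, w, h) in image coordinates."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            # Anchor centers sit at the centers of the feature-map cells.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = stride * s * r ** 0.5
                    h = stride * s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(2, 2, stride=32)
# 2*2 locations * 1 scale * 3 ratios = 12 anchors
```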
Matching.
Our FQDet head additionally differs from other query-based two-stage heads by using
a static top-k matching scheme as opposed to the dynamic Hungarian matching scheme. Matching
is the process responsible for assigning ground-truth labels to the different model outputs during
training. Most detectors use a static matching scheme, meaning that the matching scheme stays
the same throughout the whole training process. Static matching schemes are common when using
anchors, where label assignment is based on the IoU overlap between anchors and ground-truth
boxes (e.g. Faster R-CNN [26], RetinaNet [16] and Cascade R-CNN [1]). Some works use a
dynamic matching scheme, meaning that the matching scheme changes as training progresses. In
Dynamic R-CNN [35], the minimum IoU thresholds are increased throughout the training process.
In DETR [2], a dynamic Hungarian matching scheme is used, matching every ground-truth to a
single prediction (one-to-one matching). In OTA [10], the one-to-one Hungarian matching scheme
from DETR is extended to one-to-many Hungarian matching, where a single ground-truth can be
assigned to multiple predictions.
3 Method
3.1 Overview and motivation
Overview.
Below, we give an overview of the main improvements of our FQDet head over two-
stage Deformable DETR (see Appendix A for more information about DETR and Deformable
DETR):
1. We use multiple anchors of different sizes and aspect ratios [16], improving the cross-attention prior.
2. We encode the bounding boxes relative to these anchors, similar to Faster R-CNN [26].
3. We only use an L1 loss for bounding box regression [26] without GIoU loss [27].
4. We remove the auxiliary decoder losses and predictions as used in DETR [2].
5. We do not perform iterative bounding box regression as in Deformable DETR [39].
6. We replace Hungarian matching [2, 39] with static top-k matching (ours).
Motivation.
In what follows, we motivate each of the above FQDet design choices. The impact of
these design choices on the performance will be analyzed in the ablation studies of Subsection 4.2.
(Item 1) We first motivate the use of anchors in our FQDet head. To understand this, we must take a
look at the cross-attention operation from Deformable DETR. This operation updates a query based
on features sampled from the backbone feature maps, where the sample coordinates are regressed
from the query itself. These sample coordinates are defined w.r.t. a reference frame based on the
query box prior, where the origin of the reference frame is placed at the center of the box prior and
where the unit lengths correspond to half the width and height of the box prior. By using anchors of
various sizes and aspect ratios in our FQDet head, we significantly improve these cross-attention box
priors, resulting in more accurate and robust sampling, and eventually in better performance.
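The reference frame described above can be sketched as follows: normalized offsets (regressed from the query) are mapped to image coordinates via the query's box prior, with the origin at the box center and unit lengths equal to half the box width and height. The helper name and the offset values are illustrative.

```python
# Illustrative sketch of the cross-attention sampling reference frame:
# offset (0, 0) maps to the box center, offset (1, -1) to the right edge
# at the top edge of the box prior.

def offsets_to_points(box, offsets):
    """box = (cx, cy, w, h); offsets = list of (dx, dy) in box-prior units."""
    cx, cy, w, h = box
    return [(cx + dx * w / 2, cy + dy * h / 2) for dx, dy in offsets]

pts = offsets_to_points((100.0, 50.0, 40.0, 20.0), [(0.0, 0.0), (1.0, -1.0)])
# pts[0] is the box center; pts[1] lies half a width right, half a height up
```

A better box prior (e.g. an anchor matching the object's size and aspect ratio) thus directly yields a better-placed sampling grid for the same regressed offsets.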
Items 2-5 from the above list in fact all follow from the use of anchors. (Item 2) Given that we
use anchors, it is logical to also encode our bounding box predictions relative to these anchors, as
e.g. done in Faster R-CNN [26]. (Item 3) Since we encode our boxes relative to the anchors, we
only use an L1 loss (without GIoU loss) for the bounding box regression, as commonly used for
this box encoding [26]. (Item 4) Since we improved our box priors with the introduction of anchors,
there is no need to further improve these by introducing intermediate box predictions supervised
by auxiliary losses as in Deformable DETR [39]. In our setting, updating the box priors with these
intermediate box predictions even hurts performance, as a fixed sample reference frame across
decoder layers is preferable to a changing one. (Item 5) Given that we do not make
intermediate box predictions, there is also no need for the iterative bounding box regression used in
Deformable DETR [39].
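The anchor-relative box encoding of Item 2 can be sketched with the standard Faster R-CNN parameterization: center offsets normalized by the anchor size, and log-ratios of width and height. This is an illustrative reimplementation, not the paper's code.

```python
import math

# Faster R-CNN-style box encoding relative to an anchor.
# All boxes are (cx, cy, w, h).

def encode(box, anchor):
    """Encode a box as deltas (tx, ty, tw, th) w.r.t. an anchor."""
    return ((box[0] - anchor[0]) / anchor[2],
            (box[1] - anchor[1]) / anchor[3],
            math.log(box[2] / anchor[2]),
            math.log(box[3] / anchor[3]))

def decode(deltas, anchor):
    """Invert encode(): recover the box from deltas and the anchor."""
    tx, ty, tw, th = deltas
    return (anchor[0] + tx * anchor[2], anchor[1] + ty * anchor[3],
            anchor[2] * math.exp(tw), anchor[3] * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
box = (55.0, 45.0, 30.0, 20.0)
deltas = encode(box, anchor)      # (0.25, -0.125, log 1.5, log 0.5)
decoded = decode(deltas, anchor)  # recovers box (round-trip)
```

An L1 loss on these deltas is well scaled because the center offsets and log-size ratios are already normalized by the anchor, which is why no additional GIoU loss is needed (Item 3).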
(Item 6) Our final design choice involves replacing Hungarian matching with our static top-k
matching. At the beginning of training, Hungarian matching might assign a ground-truth detection
to a query that sampled from a region very different from the location of the ground-truth
detection. Hungarian matching hence produces many low-quality matches early in
training, significantly slowing down convergence. Instead, we propose a static (i.e. non-Hungarian)
top-k matching scheme that matches each ground-truth detection with its top-k anchors.
The queries corresponding to these anchors are then automatically matched as well. This matching scheme
guarantees that each matched query indeed processed features in the neighborhood of its matched
ground-truth detection.
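A minimal sketch of static top-k matching, assuming IoU is used to rank anchors per ground-truth box; this is an illustrative reimplementation, not the paper's exact scheme.

```python
# Static top-k matching sketch: each ground-truth box is matched to its
# k anchors with highest IoU. Boxes are in (x1, y1, x2, y2) format.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def topk_match(gt_boxes, anchors, k=2):
    """Return {gt_index: [indices of the k highest-IoU anchors]}."""
    matches = {}
    for g, gt in enumerate(gt_boxes):
        ranked = sorted(range(len(anchors)),
                        key=lambda i: iou(gt, anchors[i]), reverse=True)
        matches[g] = ranked[:k]
    return matches

anchors = [(0, 0, 10, 10), (0, 0, 8, 8), (20, 20, 30, 30)]
matches = topk_match([(0, 0, 9, 9)], anchors, k=2)
# the two overlapping anchors are matched; the distant one is not
```

Because the assignment depends only on anchors and ground-truth boxes, it is fixed throughout training, unlike Hungarian matching, which depends on the evolving predictions.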
3.2 FQDet in detail
An architectural overview of our FQDet head is displayed in Figure 1. The head consists of two
stages: a first stage applied to all input features and a second stage applied only to those features (i.e.
queries) selected from Stage 1. We now discuss both stages in more depth.
Stage 1.
In Stage 1, the goal of our FQDet head is to determine which input features belong to an
object, such that these can be selected for Stage 2. At its input, the head receives a set of backbone
feature maps in the form of a feature pyramid. To each of those backbone feature locations we