FQDet: Fast-converging Query-based Detector

Cédric Picron

ESAT-PSI, KU Leuven

cedric.picron@esat.kuleuven.be

Punarjay Chakravarty

Ford Greenfield Labs, Palo Alto

pchakra5@ford.com

Tinne Tuytelaars

ESAT-PSI, KU Leuven

tinne.tuytelaars@esat.kuleuven.be

Abstract

Recently, two-stage Deformable DETR introduced the query-based two-stage head,

a new type of two-stage head different from the region-based two-stage heads of

classical detectors such as Faster R-CNN. In query-based two-stage heads, the second

stage selects one feature per detection processed by a transformer, called the

query, as opposed to pooling a rectangular grid of features processed by CNNs

as in region-based detectors. In this work, we improve the query-based head by

improving the prior of the cross-attention operation with anchors, significantly

speeding up the convergence while increasing its performance. Additionally, we

empirically show that by improving the cross-attention prior, auxiliary losses and

iterative bounding box mechanisms typically used by DETR-based detectors are

no longer needed. By combining the best of both the classical and the DETR-based

detectors, our FQDet head peaks at 45.4 AP on the 2017 COCO validation set

when using a ResNet-50+TPN backbone, after training for only 12 epochs using

the 1x schedule. We outperform other high-performing two-stage heads such as

Cascade R-CNN, while using the same backbone and while being computationally

cheaper. Additionally, when using the large ResNeXt-101-DCN+TPN backbone

and multi-scale testing, our FQDet head achieves 52.9 AP on the 2017 COCO

test-dev set after only 12 epochs of training. Code is released at

https://github.com/CedricPicron/FQDet.

1 Introduction

Deep neural networks designed for solving the object detection task are commonly subdivided

into a backbone and a head. The backbone is defined as taking in an image and outputting a set of

feature maps. These feature maps are typically of different resolutions, with a factor two separating

consecutive maps in width and height, hence forming what is known as a feature pyramid [15]. A

common choice for the backbone is using a ResNet-50 [11] network in combination with a pyramid

network (PN) such as FPN [15], PANet [19], BiFPN [29] or TPN [23].
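As a rough illustration of the factor-two relation between consecutive maps, the sketch below lists the spatial sizes of a feature pyramid for a given image size. The function name `pyramid_shapes`, the strides (8 to 128) and the use of ceiling division are assumptions made for illustration; they are not values taken from the paper.

```python
import math


def pyramid_shapes(height, width, strides=(8, 16, 32, 64, 128)):
    """Return the (height, width) of each feature map for the given strides.

    The strides are illustrative assumptions. Each stride doubles the previous
    one, so consecutive maps differ by a factor of two in width and height
    (up to rounding).
    """
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


shapes = pyramid_shapes(800, 1024)
for level, (h, w) in enumerate(shapes):
    print(f"level {level}: {h} x {w}")
```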

The object detection head is defined as taking in a set of feature maps from the backbone, and

outputting object detection predictions, with each prediction consisting of an axis-aligned box

together with a corresponding class label and confidence score. Object detection heads are commonly

subdivided into one-stage and two-stage heads. One-stage heads make an object detection prediction

for every feature from their input set of feature maps. Two-stage heads first evaluate for every feature

whether it is related to an object, and then use these results to focus the processing on object-related

parts of the feature maps.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.02318v2 [cs.CV] 28 Oct 2022

In what follows, we exclusively work and compare with two-stage heads. Two-stage heads can further

be subdivided into region-based and query-based two-stage heads, based on how the second stage of

the two-stage head is implemented.

Region-based two-stage heads pool a rectangular grid of features from the backbone feature maps

and further process these using convolutional neural networks (CNNs) in order to obtain their final

object predictions. Region-based heads are found in many well-established two-stage object detectors

such as Faster R-CNN [26] and Cascade R-CNN [1].

Query-based two-stage heads were recently introduced in the two-stage variant of Deformable

DETR [39]. While region-based heads extract a grid of features from the backbone feature maps

per detection, only a single feature called the query is selected in query-based heads. These queries

are then further processed using operations commonly found in a transformer decoder [31], namely

cross-attention, self-attention, and feedforward operations. Here the cross-attention operation is

defined between the backbone feature maps and each query, the self-attention operation is applied

between the queries from different predictions, and the feedforward operations are performed on each

query individually.

In this paper, we propose a new query-based two-stage head, called the FQDet (Fast-converging

Query-based Detector) head. FQDet was obtained by combining the query-based paradigm from two-

stage Deformable DETR [39] with classical object detection techniques such as anchor generation

and static (i.e. non-Hungarian) matching.

When evaluated on the 2017 COCO object detection benchmark [17] after training for 12 epochs, our

FQDet head with ResNet-50+TPN backbone achieves 45.4 AP, outperforming other high-performing

two-stage heads such as Faster R-CNN [26], Cascade R-CNN [1], two-stage Deformable DETR [39]

and Sparse R-CNN [28] when using the same backbone.

2 Related work

DETR and its variants.

With its unique way of approaching the object detection task, DETR [2]

quickly gained a lot of popularity. Since then, many variants such as SMCA [9], Conditional DETR [22],

Deformable DETR [39], Anchor DETR [32], DAB-DETR [18] and DN-DETR [14] have been

proposed to address the two main shortcomings of DETR, namely its slow convergence speed and its

poor performance on small objects. In doing so, Deformable DETR [39] introduces the query-based

two-stage head, a new type of two-stage head now also used in other works such as DINO [34].

Anchors.

Our FQDet head differs from other query-based two-stage heads by introducing anchors

within the query-based two-stage paradigm. Faster R-CNN [26] was one of the first works to use

anchors. Anchors are axis-aligned boxes of different sizes and aspect ratios that are attached to

backbone feature locations and are refined to yield the final bounding box predictions. Anchors are

found in both one-stage detectors (SSD [20], YOLO [25], RetinaNet [16]) and two-stage detectors

(Faster R-CNN [26], Cascade R-CNN [1]). Over the years, many anchor-free detectors have also

been proposed such as CornerNet [13], FCOS [30], CenterNet [37], FoveaBox [12], DETR [2] and

Deformable DETR [39].
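To make the notion of anchors concrete, the following minimal sketch attaches axis-aligned boxes of several aspect ratios to every location of a single feature map. The function name `make_anchors`, the default size of 4 strides and the ratios 0.5, 1 and 2 are assumptions chosen for illustration, not the settings used by FQDet.

```python
import math


def make_anchors(feat_h, feat_w, stride, sizes=(4.0,), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes in image coordinates.

    One anchor is created per (location, size, ratio). `sizes` are expressed
    in units of the stride and `ratios` are height/width ratios. All default
    values are illustrative assumptions.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of the feature cell in image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                area = (size * stride) ** 2
                for ratio in ratios:
                    # Keep the area fixed while varying the aspect ratio.
                    w = math.sqrt(area / ratio)
                    h = w * ratio
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_h=2, feat_w=3, stride=8)
print(len(anchors))  # 2 * 3 locations, 1 size and 3 ratios each
```

Each feature location thus carries several box priors, which a detection head can later refine.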

Matching.

Our FQDet head additionally differs from other query-based two-stage heads by using

a static top-k matching scheme as opposed to the dynamic Hungarian matching scheme. Matching

is the process responsible for assigning ground-truth labels to the different model outputs during

training. Most detectors have a static matching scheme, meaning that the matching scheme stays

the same throughout the whole training process. Static matching schemes are common when using

anchors, where label assignment is done based on the IoU overlap between anchors and ground-truth

boxes (e.g. Faster R-CNN [26], RetinaNet [16] and Cascade R-CNN [1]). Some works make use of a

dynamic matching scheme, meaning that the matching scheme changes as training progresses. In

Dynamic R-CNN [35], the minimum IoU thresholds are increased throughout the training process.

In DETR [2], a dynamic Hungarian matching scheme is used, matching every ground-truth to a

single prediction (one-to-one matching). In OTA [10], the one-to-one Hungarian matching scheme

from DETR is extended to one-to-many Hungarian matching, where a single ground-truth could be

assigned to multiple predictions.
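As an illustration of a static, IoU-based matching scheme of the kind used with anchors, the sketch below labels each anchor with the index of its best-overlapping ground-truth box when the IoU reaches a fixed threshold. The helper names `box_iou` and `threshold_match` and the threshold of 0.5 are assumptions for this example; individual detectors use their own thresholds and additional rules.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def threshold_match(anchors, gt_boxes, pos_thr=0.5):
    """Assign each anchor the index of its best ground-truth box, or -1.

    The rule is static: the (assumed) threshold `pos_thr` stays the same
    throughout training.
    """
    labels = []
    for anchor in anchors:
        best_idx, best_iou = -1, 0.0
        for gt_idx, gt in enumerate(gt_boxes):
            iou = box_iou(anchor, gt)
            if iou > best_iou:
                best_idx, best_iou = gt_idx, iou
        labels.append(best_idx if best_iou >= pos_thr else -1)
    return labels


anchors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
gt_boxes = [(0, 0, 10, 12)]
labels = threshold_match(anchors, gt_boxes)
print(labels)
```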

3 Method

3.1 Overview and motivation

Overview.

Below, we give an overview of the main improvements of our FQDet head over two-

stage Deformable DETR (see Appendix A for more information about DETR and Deformable

DETR):

1. We use multiple anchors of different sizes and aspect ratios [16], improving the cross-attention prior.

2. We encode the bounding boxes relative to these anchors, similar to Faster R-CNN [26].

3. We only use an L1 loss for bounding box regression [26] without GIoU loss [27].

4. We remove the auxiliary decoder losses and predictions as used in DETR [2].

5. We do not perform iterative bounding box regression as in Deformable DETR [39].

6. We replace Hungarian matching [2, 39] with static top-k matching (ours).

Motivation.

In what follows, we motivate each of the above FQDet design choices. The impact of

these design choices on the performance will be analyzed in the ablation studies of Subsection 4.2.

(Item 1) We first motivate the use of anchors in our FQDet head. To understand this, we must take a

look at the cross-attention operation from Deformable DETR. This operation updates a query based

on features sampled from the backbone feature maps, where the sample coordinates are regressed

from the query itself. These sample coordinates are defined w.r.t. a reference frame based on the

query box prior, where the origin of the reference frame is placed at the center of the box prior and

where the unit lengths correspond to half the width and height of the box prior. By using anchors of

various sizes and aspect ratios in our FQDet head, we significantly improve these cross-attention box

priors resulting in more accurate and robust sampling, and eventually in better performance.
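The reference frame described above can be sketched as a simple coordinate transform: offsets expressed in box-prior units are mapped to image coordinates using the center and the half width and half height of the box prior. The function name `to_image_coords` and the (cx, cy, w, h) box format are assumptions for illustration; in the actual operation the offsets are regressed from the query.

```python
def to_image_coords(box_prior, offsets):
    """Map sample offsets from the box-prior frame to image coordinates.

    box_prior: (cx, cy, w, h) in image coordinates (assumed format).
    offsets:   list of (dx, dy) in box-prior units, where the origin is the
               box center and one unit equals half the box width or height.
    """
    cx, cy, w, h = box_prior
    return [(cx + dx * w / 2, cy + dy * h / 2) for dx, dy in offsets]


# An offset of (1, 1) lands on the bottom-right corner of the box prior.
points = to_image_coords((100.0, 50.0, 40.0, 20.0), [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)])
print(points)
```

With this convention, a better-fitting box prior directly rescales where the same regressed offsets end up sampling.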

Items 2-5 from the above list now in fact all follow from the use of anchors. (Item 2) Given that we

use anchors, it is logical to also encode our bounding box predictions relative to these anchors as

e.g. done in Faster R-CNN [26]. (Item 3) Since we encode our boxes relative to the anchors, we

only use an L1 loss (without GIoU loss) for the bounding box regression, as commonly used for

this box encoding [26]. (Item 4) Since we improved our box priors with the introduction of anchors,

there is no need to further improve these by introducing intermediate box predictions supervised

by auxiliary losses as in Deformable DETR [39]. In our setting, updating the box priors with these

intermediate box predictions even hurts the performance, as a fixed sample reference frame for the

different decoder layers is to be preferred over a changing one. (Item 5) Given that we do not make

intermediate box predictions, there is also no need for iterative bounding box regression as used in

Deformable DETR [39].
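Items 2 and 3 can be sketched as follows, assuming the standard Faster R-CNN box parameterization (center offsets normalized by the anchor size, and log-scale width and height ratios) together with a plain L1 loss on the encoded values. The function names and the (cx, cy, w, h) box format are assumptions for this example, and any normalization constants a real implementation may apply are omitted.

```python
import math


def encode(box, anchor):
    """Encode a (cx, cy, w, h) box relative to a (cx, cy, w, h) anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Invert `encode`: recover the (cx, cy, w, h) box from its deltas."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


def l1_loss(pred_deltas, target_deltas):
    """Sum of absolute differences between predicted and target deltas."""
    return sum(abs(p - t) for p, t in zip(pred_deltas, target_deltas))


anchor = (50.0, 50.0, 20.0, 40.0)
gt_box = (55.0, 46.0, 20.0, 40.0)
target = encode(gt_box, anchor)
loss = l1_loss((0.0, 0.0, 0.0, 0.0), target)
print(target, loss)
```

Predicting all-zero deltas corresponds to outputting the anchor itself, so the loss here measures how far the anchor is from the ground-truth box in the encoded space.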

(Item 6) Our final design choice involves replacing the Hungarian matching with our static top-k

matching. At the beginning of training, Hungarian matching might assign a ground-truth detection

to a query which sampled from a very different region compared to the location of the ground-

truth detection. Hungarian matching hence produces many low-quality matches at the beginning of

training, significantly slowing down convergence. Instead, we propose our static (i.e. non-Hungarian)

top-k matching scheme consisting of matching each ground-truth detection with its top-k anchors.

Queries corresponding to these anchors are then also automatically matched. This matching scheme

guarantees that each matched query indeed processed features in the neighborhood of the matched

ground-truth detection.
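A minimal sketch of such a top-k matching is given below. It assumes that anchors are ranked by their IoU with each ground-truth box, which is an assumption of this example rather than a detail stated in the text above; the names `box_iou` and `topk_match` and the value k=2 are likewise illustrative. How conflicts are resolved when one anchor is selected by several ground-truth boxes is left open here.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def topk_match(anchors, gt_boxes, k=2):
    """Match every ground-truth box with its k highest-IoU anchors.

    Returns a set of (anchor_index, gt_index) pairs. Ranking by IoU is an
    assumption of this sketch.
    """
    matches = set()
    for gt_idx, gt in enumerate(gt_boxes):
        ranked = sorted(
            range(len(anchors)),
            key=lambda i: box_iou(anchors[i], gt),
            reverse=True,
        )
        for anchor_idx in ranked[:k]:
            matches.add((anchor_idx, gt_idx))
    return matches


anchors = [(0, 0, 10, 10), (2, 2, 12, 12), (20, 20, 30, 30), (1, 1, 11, 11)]
gt_boxes = [(0, 0, 10, 10)]
matches = topk_match(anchors, gt_boxes, k=2)
print(sorted(matches))
```

Because the anchors do not change during training, the resulting assignment is the same at every iteration, unlike Hungarian matching on model predictions.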

3.2 FQDet in detail

An architectural overview of our FQDet head is displayed in Figure 1. The head consists of two

stages: a ﬁrst stage applied to all input features and a second stage applied only to those features (i.e.

queries) selected from Stage 1. We now discuss both stages in more depth.

Stage 1.

In Stage 1, the goal of our FQDet head is to determine which input features belong to an

object, such that these can be selected for Stage 2. At its input, the head receives a set of backbone

feature maps in the form of a feature pyramid. To each of those backbone feature locations we
