most discriminative object regions (e.g., the “body of an
airplane” and the “motorcycle pedals”). To solve this problem, vision transformer-based object detection methods [24], [23], [25], [26] have recently been proposed and have flourished. These methods first divide the input image into image patches, and then use multi-head attention-based feature interaction among the patches to obtain global long-range dependencies. As expected, the feature pyramid has also been employed in vision transformers, e.g., PVT [26] and Swin Transformer [25]. Although these methods can address the limited receptive fields and local contextual information of CNNs, an obvious drawback is their large computational complexity. For example, Swin-B [25] requires almost 3× the FLOPs (i.e., 47.0G vs. 16.0G) of the performance-comparable CNN model RegNetY [27] at an input size of 224 × 224.
Besides, as shown in Figure 1(b), since vision transformer-based methods learn in an omnidirectional and unbiased pattern, they easily ignore some corner regions (e.g., the “airplane engine”, the “motorcycle wheel” and the “bat”) that are important for dense prediction tasks. These drawbacks become more obvious on large-scale input images. To this end, we raise a question: is it necessary to use transformer encoders at all layers? To answer this question, we start with an analysis of shallow features.
Studies of advanced methods [28], [29], [30] show that shallow features mainly contain general object feature patterns, e.g., texture, colour and orientation, which are often not global. In contrast, deep features reflect object-specific information, which usually requires global context [31], [32]. Therefore, we argue that the transformer encoder is unnecessary at all layers.
In this work, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized regulation scheme. Specifically, based on a visual feature pyramid extracted from the CNN backbone, we first propose an explicit visual center scheme, where a lightweight MLP architecture is used to capture the long-range dependencies and a parallel learnable visual center mechanism is used to aggregate the local key regions of the input image.
Considering that the deepest features usually contain the most abstract feature representations, which are scarce in the shallow features [33], we then build on the proposed scheme and introduce a globally centralized regulation for the extracted feature pyramid in a top-down manner, where the spatial explicit visual center obtained from the deepest features is used to regulate all the frontal shallow features simultaneously. As shown in Figure 1(c), compared to existing feature pyramids, CFP not only captures the global long-range dependencies, but also efficiently obtains an all-round yet discriminative feature representation.
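To make this design concrete, the following is a minimal PyTorch sketch of the explicit visual center: a lightweight MLP branch for long-range dependencies runs in parallel with a learnable set of visual centers that aggregates local key regions. The module names, channel sizes, codebook-style aggregation, and fusion by concatenation are illustrative assumptions on our part, not the exact CFP architecture.

```python
# Minimal sketch of the explicit visual center (EVC) idea described above.
# Names, channel sizes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class LightweightMLP(nn.Module):
    """MLP branch: models long-range dependencies across spatial positions."""

    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.fc1 = nn.Linear(channels, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, channels)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = tokens + self.fc2(self.act(self.fc1(self.norm(tokens))))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class LearnableVisualCenter(nn.Module):
    """Parallel branch: a learnable codebook of 'centers'; each spatial
    position is softly assigned to the centers to aggregate key regions."""

    def __init__(self, channels: int, num_centers: int = 64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, channels))

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        feats = x.flatten(2).transpose(1, 2)    # (B, N, C), N = H*W
        assign = (feats @ self.centers.t()).softmax(dim=-1)  # (B, N, K)
        agg = assign @ self.centers             # center info per position
        out = feats + agg                       # residual aggregation
        return out.transpose(1, 2).reshape(b, c, h, w)


class ExplicitVisualCenter(nn.Module):
    """EVC: run both branches in parallel and fuse their outputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.mlp = LightweightMLP(channels)
        self.lvc = LearnableVisualCenter(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.mlp(x), self.lvc(x)], dim=1))
```

Running the two branches on the same input in parallel mirrors the description above; other fusion choices (e.g., element-wise addition) would fit the description equally well.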
To demonstrate its superiority, extensive experiments are carried out on the challenging MS-COCO dataset [34]. The results validate that our proposed CFP achieves consistent performance gains on the state-of-the-art YOLOv5 [35] and YOLOX [36] object detection baselines.
Our contributions are summarized as follows: 1) We propose a spatial explicit visual center scheme, which consists of a lightweight MLP for capturing the global long-range dependencies and a learnable visual center for aggregating the local key regions. 2) We propose a globally centralized regulation for the commonly-used feature pyramid in a top-down manner, as sketched below. 3) CFP achieves consistent performance gains on strong object detection baselines.
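A minimal sketch of contribution 2), assuming the ExplicitVisualCenter module from the earlier sketch: the deepest pyramid level is processed by the EVC, and its output is upsampled and fused into every shallower level, so a single centralized representation regulates the whole pyramid top-down. The concat-plus-1×1-convolution fusion and nearest-neighbor upsampling are our assumptions.

```python
# GCR sketch, reusing ExplicitVisualCenter from the EVC sketch above.
# Fusion and upsampling choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GloballyCentralizedRegulation(nn.Module):
    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        self.evc = ExplicitVisualCenter(channels)  # from the sketch above
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, 1) for _ in range(num_levels - 1)
        )

    def forward(self, feats):             # feats: shallow -> deep, common width
        center = self.evc(feats[-1])      # EVC only on the deepest level
        outs = [center]
        for conv, f in zip(self.fuse, reversed(feats[:-1])):
            up = F.interpolate(center, size=f.shape[-2:], mode="nearest")
            outs.append(conv(torch.cat([f, up], dim=1)))
        return outs[::-1]                 # restore shallow -> deep order
```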
II. RELATED WORK
A. Feature Pyramid in Computer Vision
The feature pyramid is a fundamental neck network in modern recognition systems that can be used effectively and efficiently to detect objects at different scales. SSD [6] is one of the first approaches to use a pyramidal feature hierarchy, which captures multi-scale feature information through network layers of different spatial sizes and thus improves recognition accuracy. FPN [17] mainly relies on the bottom-up in-network feature hierarchy and builds a top-down path with lateral connections to produce multi-scale, semantically strong feature maps.
Building on this, PANet [16] further proposes an additional bottom-up pathway on top of FPN to share feature information among inter-layer features, such that the high-level features can also obtain sufficient details from the low-level features. With the help of neural architecture search, NAS-FPN [13] uses a spatial search strategy to connect layers across a feature pyramid and obtains extensible feature information. M2Det [37] extracts multi-stage and multi-scale features by constructing a multi-stage feature pyramid to achieve cross-level and cross-layer feature fusion.
In general, 1) the feature pyramid can deal with multi-scale variation in object recognition without increasing the computational overhead; and 2) the extracted features yield multi-scale representations, including some high-resolution features. However, these methods mainly focus on inter-layer feature interactions. In this work, we instead propose an intra-layer feature regulation for feature pyramids, which makes up for the shortcomings of current methods in this regard.
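As a point of reference for the designs discussed above, here is a minimal sketch of the FPN-style top-down pathway with lateral connections, assuming PyTorch; the channel widths and number of levels are illustrative.

```python
# Minimal sketch of an FPN-style top-down pathway with lateral connections.
# Channel sizes and the number of levels are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every backbone stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels
        )
        # 3x3 convs smooth the merged maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels
        )

    def forward(self, feats):                  # feats: shallow -> deep
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the deeper map and add it to the lateral below.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        return [s(l) for s, l in zip(self.smooth, laterals)]
```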
B. Visual Attention Learning
CNNs [38] focus more on the representation learning of local regions. However, such local representations cannot satisfy the need of modern recognition systems for global context and long-range dependencies. To this end, the attention mechanism [20] was proposed, which focuses on deciding where to pay more attention in an image. For example, the non-local operation [19] uses a non-local neural network to directly capture long-range dependencies, demonstrating the significance of non-local modeling for video classification, object detection, and segmentation.
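A minimal sketch of a non-local block in its embedded-Gaussian form follows: every position attends to every other position, directly modeling long-range dependencies. The 1×1 projections with channel reduction follow the common non-local design, but the exact sizes here are illustrative assumptions.

```python
# Minimal sketch of a non-local block (embedded-Gaussian form).
# Channel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    def __init__(self, channels: int, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)    # query embedding
        self.phi = nn.Conv2d(channels, inner, 1)      # key embedding
        self.g = nn.Conv2d(channels, inner, 1)        # value embedding
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, N, C')
        k = self.phi(x).flatten(2)                    # (B, C', N)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, N, C')
        attn = torch.softmax(q @ k, dim=-1)           # (B, N, N) pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection
```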
However, the intrinsically local nature of CNN representations remains unresolved, i.e., CNN features can only capture limited contextual information. To address this problem, the Transformer [20], which mainly benefits from the multi-head attention mechanism, has recently caused a great sensation and achieved great success in the field of computer vision, such as image recognition [24], [39], [23], [40], [25]. For example, the representative ViT divides the image into a patch sequence with position encodings, and then uses cascaded transformer blocks to extract the parameterized