UNDER SUBMISSION 1
Centralized Feature Pyramid for Object Detection
Yu Quan, Dong Zhang, Member, IEEE, Liyan Zhang, Jinhui Tang, Senior Member, IEEE
Abstract—The visual feature pyramid has shown its superiority in both effectiveness and efficiency in a wide range of applications. However, existing methods concentrate excessively on inter-layer feature interactions while ignoring intra-layer feature regulation, which has been empirically proven beneficial. Although some methods try to learn a compact intra-layer feature representation with the help of the attention mechanism or the vision transformer, they overlook the corner regions that are important for dense prediction tasks. To address this problem, in this paper we propose a Centralized Feature Pyramid (CFP) for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatial explicit visual center scheme, where a lightweight MLP is used to capture the global long-range dependencies and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly-used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate the frontal shallow features. Compared to existing feature pyramids, CFP not only has the ability to capture the global long-range dependencies, but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO dataset validate that our proposed CFP achieves consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines. The code has been released at: CFPNet.
Index Terms—Feature pyramid, visual center, object detection,
attention learning mechanism, long-range dependencies.
I. INTRODUCTION
Object detection is one of the most fundamental yet challenging research tasks in the computer vision community. It aims to predict, for each object in the input image, a unique bounding box that contains not only the location but also the category information [1]. In the past few years, this task has been extensively developed and applied to a wide range of applications, e.g., autonomous driving [2] and computer-aided diagnosis [3].
The successful object detection methods are mainly based
on the Convolutional Neural Network (CNN) as the backbone
followed by a two-stage (e.g., Fast/Faster R-CNN [4], [5])
or single-stage (e.g., SSD [6] and YOLO [7]) framework.
However, due to the uncertainty of object sizes, a single feature scale cannot meet the requirements of high-accuracy recognition performance. To this end, methods (e.g., SSD [6] and
FFP [8]) based on the in-network feature pyramid are proposed
Y. Quan, D. Zhang and J. Tang are with the School of Computer Science and
Engineering, Nanjing University of Science and Technology, Nanjing 210094,
China. E-mail: {quanyu, dongzhang, jinhuitang}@njust.edu.cn.
L. Zhang is with the College of Computer Science and Technology, Nanjing
University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern
Analysis and Machine Intelligence, Collaborative Innovation Center of Novel
Software Technology and Industrialization, Nanjing 211106, China. E-mail:
zhangliyan@nuaa.edu.cn.
Corresponding author: Liyan Zhang.
(a) Inputs (b) CNN (c) Transformer (d) Ours
Fig. 1. Visualizations of image feature evolution for vision recognition tasks. For the input images in (a), a CNN model in (b) only locates the most discriminative regions; although the progressive model in (c) can see wider with the help of the attention mechanism [20] or transformer [23], it usually ignores the corner cues that are important for dense prediction tasks; our model in (d) can not only see wider but is also more well-rounded, by attaching centralized constraints on features with advanced long-range dependencies, which makes it more suitable for dense prediction tasks. Best viewed in color.
and achieve satisfactory results both effectively and efficiently. The unifying principle behind these methods is to assign each object, according to its size, a region of interest with the appropriate contextual information, enabling objects of different sizes to be recognized in different feature layers.
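This size-to-layer assignment can be illustrated with the RoI-to-level heuristic from the original FPN paper; the clamp range and level bounds below are illustrative choices of this sketch, not values from this work:

```python
import math

def assign_pyramid_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
    """FPN-style heuristic mapping an RoI of size w x h to a pyramid level.

    Larger objects are routed to coarser (deeper) levels, smaller ones to
    finer levels. k0 and the canonical size 224 follow the FPN paper; the
    clamp range [k_min, k_max] is an illustrative choice.
    """
    k = math.floor(k0 + math.log2(math.sqrt(box_w * box_h) / canonical))
    return max(k_min, min(k_max, k))

print(assign_pyramid_level(224, 224))   # canonical-sized object -> level 4
print(assign_pyramid_level(112, 112))   # half-sized object -> level 3
```

An object of the canonical ImageNet size lands on level k0; halving its side length drops it one level down the pyramid.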
Feature interactions among pixels or objects are impor-
tant [9]. We consider that effective feature interaction can
make image features see wider and obtain richer representa-
tions, so that the object detection model can learn an implicit
relation (i.e., the favorable co-occurrence features [10], [11])
between pixels/objects, which has been empirically proved to
be beneficial to the visual recognition tasks [12], [13], [14],
[15], [16], [17], [18]. For example, FPN [17] proposes a top-
down inter-layer feature interaction mechanism, which enables
shallow features to obtain the global contextual information
and semantic representations of deep features. NAS-FPN [13]
tries to learn the network structure of the feature pyramid
part via a network architecture search strategy, and obtains a
scalable feature representation. Besides the inter-layer interac-
tions, inspired by the non-local/self-attention mechanism [19],
[20], the finer intra-layer interaction methods for spatial feature
regulation are also applied to object detection task, e.g., non-
local features [21] and GCNet [22]. Based on the above two
interaction mechanisms, FPT [15] further proposes a cross-layer (inter-layer) and cross-space (intra-layer) feature regulation method, and has achieved remarkable performance.
Despite this initial success in object detection, the above methods are based on CNN backbones, which suffer from inherently limited receptive fields. As shown in Figure 1(b), the standard CNN backbone features can only locate the
arXiv:2210.02093v1 [cs.CV] 5 Oct 2022
most discriminative object regions (e.g., the “body of an airplane” and the “motorcycle pedals”). To solve this problem, vision transformer-based object detection methods [24], [23], [25], [26] have recently been proposed and have flourished.
These methods first divide the input image into patches, and then use multi-head attention-based feature interaction among the patches to obtain global long-range dependencies. As expected, the feature pyramid has also been employed in vision transformers, e.g., PVT [26] and the Swin Transformer [25]. Although these methods can address the limited receptive fields and local contextual information of CNNs, an obvious drawback is their large computational complexity. For example, Swin-B [25] has almost 3× the FLOPs (i.e., 47.0G vs. 16.0G) of a performance-comparable CNN model, RegNetY [27], at an input size of 224×224. Besides, as shown in Figure 1(c), since vision transformer-based methods learn in an omnidirectional and unbiased pattern, they easily ignore some corner regions (e.g., the “airplane engine”, the “motorcycle wheel” and the “bat”) that are important for dense prediction tasks. These drawbacks are more obvious on large-scale input images. To this end, we raise a question: is it necessary to use transformer encoders in all layers?
To answer this question, we start from an analysis of shallow features. Studies of advanced methods [28], [29], [30] show that shallow features mainly contain general object feature patterns, e.g., texture, colour and orientation, which are often not global. In contrast, deep features reflect object-specific information, which usually requires global context [31], [32]. Therefore, we argue that the transformer encoder is unnecessary in all layers.
In this work, we propose a Centralized Feature Pyramid
(CFP) network for object detection, which is based on a
globally explicit centralized regulation scheme. Specifically, based on a visual feature pyramid extracted from the CNN
backbone, we first propose an explicit visual center scheme,
where a lightweight MLP architecture is used to capture the
long-range dependencies and a parallel learnable visual center
mechanism is used to aggregate the local key regions of the
input images. Considering the fact that the deepest features
usually contain the most abstract feature representations scarce
in the shallow features [33], based on the proposed regulation
scheme, we then propose a globally centralized regulation for
the extracted feature pyramid in a top-down manner, where
the spatial explicit visual center obtained from the deepest
features is used to regulate all the frontal shallow features
simultaneously. Compared to the existing feature pyramids, as shown in Figure 1(d), CFP not only has the ability to capture the global long-range dependencies, but also efficiently obtains an all-round yet discriminative feature representation.
To demonstrate the superiority, extensive experiments are
carried out on the challenging MS-COCO dataset [34]. Results
validate that our proposed CFP can achieve the consistent
performance gains on the state-of-the-art YOLOv5 [35] and
YOLOX [36] object detection baselines.
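As a rough illustration of the two parallel branches described above (not the paper's actual implementation), the following NumPy sketch pairs a spatial token-mixing MLP with a soft-assignment codebook standing in for the learnable visual center; all weight shapes, the codebook size, and the fusion by concatenation are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_branch(x, hidden=8):
    """Token-mixing MLP over spatial positions (long-range dependencies).
    x: (C, N) with N = H*W. Weight shapes are illustrative."""
    C, N = x.shape
    w1 = rng.standard_normal((N, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, N)) * 0.02
    h = np.maximum(x @ w1, 0.0)          # (C, hidden), ReLU
    return x + h @ w2                    # residual add, (C, N)

def visual_center_branch(x, num_centers=4):
    """Codebook sketch: softly assign each spatial position to K centers
    and aggregate, loosely mirroring a learnable visual center."""
    C, N = x.shape
    centers = rng.standard_normal((num_centers, C)) * 0.02
    dist = ((x.T[:, None, :] - centers[None]) ** 2).sum(-1)        # (N, K)
    assign = np.exp(-dist) / np.exp(-dist).sum(1, keepdims=True)   # soft assignment
    agg = assign @ centers                                         # (N, C)
    return x + agg.T                                               # residual add, (C, N)

x = rng.standard_normal((16, 49))        # C=16 channels, 7x7 map flattened
out = np.concatenate([mlp_branch(x), visual_center_branch(x)], axis=0)
print(out.shape)  # (32, 49)
```

Both branches see the entire spatial extent: the MLP mixes all positions through its weights, while the codebook compares every position against shared global centers.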
Our contributions are summarized as follows: 1) We propose a spatial explicit visual center scheme, which consists of a lightweight MLP for capturing the global long-range dependencies and a learnable visual center for aggregating the local key regions. 2) We propose a globally centralized regulation for the commonly-used feature pyramid in a top-down manner. 3) CFP achieves consistent performance gains on strong object detection baselines.
II. RELATED WORK
A. Feature Pyramid in Computer Vision
Feature pyramid is a fundamental neck network in modern
recognition systems that can be effectively and efficiently used
to detect objects with different scales. SSD [6] is one of
the first approaches that uses a pyramidal feature hierarchy
representation, which captures multi-scale feature information
through network of different spatial sizes, thus the model
recognition accuracy is improved. FPN [17] hierarchically
mainly relies on the bottom-up in-network feature pyramid,
which builds a top-down path with lateral connections from
multi-scale high-level semantic feature maps. Based on which,
PANet [16] further proposed an additional bottom-up pathway
based on FPN to share feature information between the inter-
layer features, such that the high-level features can also obtain
sufficient details in low-level features. Under the help of
the neural architecture search, NAS-FPN [13] uses spatial
search strategy to connect across layers via a feature pyramid
and obtains the extensible feature information. M2Det [37]
extracted multi-stage and multi-scale features by construct-
ing multi-stage feature pyramid to achieve cross-level and
cross-layer feature fusion. In general, 1) the feature pyramid
can deal with the problem of multi-scale change in object
recognition without increasing the computational overhead;
2) the extracted features can generate multi-scale feature
representations including some high resolution features. In this
work, we propose an intra-layer feature regulation from the
perspective of inter-layer feature interactions and intra-layer
feature regulations of feature pyramids, which makes up for
the shortcomings of current methods in this regard.
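The top-down fusion these pyramids share can be sketched as follows; lateral 1×1 convolutions and learned weights are omitted for brevity, so equal channel counts across levels are an assumption of this toy version:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling for a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_pathway(features):
    """Minimal FPN-style top-down fusion: the deepest map is upsampled
    and added to the next shallower map via a lateral connection.
    `features` is ordered shallow -> deep."""
    outputs = [features[-1]]
    for feat in reversed(features[:-1]):
        outputs.append(feat + upsample2x(outputs[-1]))
    return outputs[::-1]   # back to shallow -> deep order

c3 = np.zeros((8, 32, 32))
c4 = np.zeros((8, 16, 16))
c5 = np.zeros((8, 8, 8))
p3, p4, p5 = top_down_pathway([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (8, 32, 32) (8, 16, 16) (8, 8, 8)
```

Each output level thus mixes its own resolution-specific detail with semantic context propagated down from all deeper levels.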
B. Visual Attention Learning
CNNs [38] focus more on representation learning of local regions. However, this local representation does not satisfy the requirements for global context and long-range dependencies of modern recognition systems. To this end, the attention learning mechanism [20] was proposed, which focuses on deciding where to direct more attention in an image. For example, the non-local operation [19] uses a non-local neural network to directly capture long-range dependencies, demonstrating the significance of non-local modeling for video classification, object detection and segmentation. However, the inherently local representation of CNNs is not resolved, i.e., CNN features can only capture limited contextual information. To address this problem, the Transformer [20], which mainly benefits from the multi-head attention mechanism, has recently caused a great sensation and achieved great success in the field of computer vision, such as image recognition [24], [39], [23], [40], [25]. For example, the representative ViT [24] divides the image into a sequence with position encoding, and then uses cascaded transformer blocks to extract the parameterized
[Figure 2: architecture diagram. The input image is processed by a backbone network; intra-layer features pass through the Explicit Visual Center (EVC), which comprises a lightweight MLP and an LVC branch; a top-down path with 2× upsampling, 2× downsampling, feature-map concatenation and Conv-BN-SiLU blocks follows; the head network outputs classification (Cls., H×W×C) and regression (Reg., H×W×4) predictions, e.g., “airplane: 98.7%”.]
Fig. 2. An illustration of the overall architecture, which mainly consists of four components: the input image, a backbone network for feature extraction, the centralized feature pyramid, which is based on a commonly-used vision feature pyramid following [36], and the object detection head network, which includes a classification (i.e., Cls.) loss and a regression (i.e., Reg.) loss. C denotes the number of classes of the used dataset. Our contribution lies in that we propose an intra-layer feature regulation method in a feature pyramid, and a top-down global centralized regulation.
vector as visual representations. On this basis, many excellent models [39], [41], [42] have been proposed through further improvement, and have achieved good performance in various computer vision tasks. Nevertheless, transformer-based image recognition models still have the disadvantage of being computationally intensive and complex.
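The multi-head attention at the heart of these models reduces, per head, to scaled dot-product attention; the toy single-head NumPy version below (random weights and illustrative dimensions, not any particular model's configuration) shows how every output token mixes information from all tokens in a single step:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over N tokens of dim d.
    Each output token is a weighted mix of *all* input tokens, which is
    how attention captures long-range dependencies in one step."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (N, N) pairwise affinities
    return softmax(scores) @ v                # (N, d) mixed representations

rng = np.random.default_rng(0)
x = rng.standard_normal((49, 16))             # 7x7 patches, 16-dim embeddings
w = [rng.standard_normal((16, 16)) * 0.1 for _ in range(3)]
y = self_attention(x, *w)
print(y.shape)  # (49, 16)
```

The (N, N) score matrix is also the source of the quadratic cost in the number of tokens discussed above.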
C. MLP in Computer Vision
In order to alleviate the shortcomings of complex transformer models [43], [44], [23], [45], recent works [46], [47], [48], [49] show that replacing the attention-based modules in a transformer model with MLPs still performs well. The reason for this phenomenon is that both the MLP (e.g., a two-layer fully-connected network) and the attention mechanism are global information processing modules. On the one hand, the introduction of MLP-Mixer [46] into vision alleviates changes to the data layout. On the other hand, MLP-Mixer can better establish long-range/global and spatial relationships among features through the interaction between spatial and channel feature information. Although MLP-style models perform well on computer vision tasks, they still fall short in capturing fine-grained feature representations and in reaching higher recognition accuracy in object detection. Nevertheless, the MLP is playing an increasingly important role in the field of computer vision, and has the advantage of a simpler network structure than the transformer. In our work, we also use an MLP to capture the global contextual information and long-range dependencies of the input images. Our contribution lies in centralizing the captured information using the proposed spatial explicit visual center scheme.
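A single MLP-Mixer-style layer, alternating the spatial (token) and channel mixing described above, can be sketched as follows; the weights are randomly initialised here and the hidden sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixer_layer(x, hidden_tokens=8, hidden_channels=8):
    """One MLP-Mixer-style layer: token mixing operates across spatial
    positions, channel mixing across feature channels; both use a
    two-layer MLP (ReLU) with a residual add. x: (tokens, channels)."""
    n, c = x.shape
    # token mixing: transpose so the MLP acts along the token axis
    w1 = rng.standard_normal((n, hidden_tokens)) * 0.02
    w2 = rng.standard_normal((hidden_tokens, n)) * 0.02
    x = x + (np.maximum(x.T @ w1, 0.0) @ w2).T
    # channel mixing: MLP along the channel axis
    w3 = rng.standard_normal((c, hidden_channels)) * 0.02
    w4 = rng.standard_normal((hidden_channels, c)) * 0.02
    return x + np.maximum(x @ w3, 0.0) @ w4

x = rng.standard_normal((49, 16))   # 49 patches, 16 channels
print(mixer_layer(x).shape)  # (49, 16)
```

Unlike attention, the token-mixing weights are fixed per layer rather than computed per input, which is what makes the structure simpler and cheaper.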
D. Object Detection
Object detection is a fundamental computer vision task, which aims to recognize objects or instances of interest in a given image and provide a comprehensive scene description, including the object category and location. With
the unprecedented development of CNN [38] in the recent
years, plenty of object detection models achieve remarkable
progress. The existing methods can be divided into two types: two-stage and single-stage. Two-stage object detectors [50], [4], [5], [51], [52] usually first use an RPN to generate a collection of region proposals, and then use a learning module to extract region features of these proposals and complete the classification and regression process. However, storing and repetitively extracting the features of each region proposal is not only computationally expensive, but also makes it impossible to capture global feature representations. To this end, single-stage detectors [7], [6], [53], [54] directly perform prediction and region classification by generating bounding boxes. The existing single-stage methods have a
global concept in the design of feature extraction, and use
the backbone network to extract feature maps of the entire
image to predict each bounding box. In this paper, we also
choose the single-stage object detectors (i.e., YOLOv5 [35]
and YOLOX [36]) as our baseline models. Our focus is to
enhance the representation of the feature pyramid used for
these detectors.
III. OUR APPROACH
In this section, we introduce the implementation details of the proposed Centralized Feature Pyramid (CFP). We first give an overview of the CFP architecture in Section III-A. Then, we present the implementation details of the explicit visual