most discriminative object regions (e.g., the “body of an
airplane” and the “motorcycle pedals”). To solve this problem, vision transformer-based object detection methods [24], [23], [25], [26] have recently been proposed and have flourished. These methods first divide the input image into image patches, and then use multi-head attention-based feature interaction among the patches to obtain global long-range dependencies. As expected, the feature pyramid has also been employed in vision transformers, e.g., PVT [26] and Swin Transformer [25]. Although these methods can address the limited receptive fields and local contextual information of CNNs, an obvious drawback is their large computational complexity. For example, Swin-B [25] requires almost 3× the FLOPs (i.e., 47.0G vs. 16.0G) of the performance-comparable CNN model RegNetY [27] at an input size of 224 × 224.
Besides, as shown in Figure 1(b), since vision transformer-based methods learn in an omnidirectional and unbiased pattern, they easily ignore some corner regions (e.g., the “airplane engine”, the “motorcycle wheel” and the “bat”) that are important for dense prediction tasks. These drawbacks become more obvious on large-scale input images. To this end, we raise a question: is it necessary to use transformer encoders at all layers? To answer this question, we start with an analysis of shallow features.
Studies of advanced methods [28], [29], [30] show that shallow features mainly contain general object feature patterns, e.g., texture, colour and orientation, which are often not global. In contrast, deep features reflect object-specific information, which usually requires global context [31], [32]. Therefore, we argue that the transformer encoder is unnecessary at all layers.
In this work, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized regulation scheme. Specifically, based on a visual feature pyramid extracted from the CNN backbone, we first propose an explicit visual center scheme, where a lightweight MLP architecture is used to capture the long-range dependencies and a parallel learnable visual center mechanism is used to aggregate the local key regions of the input image.
Considering that the deepest features usually contain the most abstract feature representations, which are scarce in the shallow features [33], we then build on the proposed scheme and introduce a globally centralized regulation for the extracted feature pyramid in a top-down manner, where the spatial explicit visual center obtained from the deepest features is used to regulate all the frontal shallow features simultaneously. As shown in Figure 1(c), compared to existing feature pyramids, CFP not only captures the global long-range dependencies, but also efficiently obtains an all-round yet discriminative feature representation.
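To make this design concrete, the following is a minimal PyTorch sketch of the explicit visual center: a lightweight MLP branch for long-range dependencies runs in parallel with a learnable set of visual centers that aggregates local key regions. The module names, channel sizes, codebook-style aggregation, and fusion by concatenation are illustrative assumptions on our part, not the exact CFP architecture.

```python
# Minimal sketch of the explicit visual center (EVC) idea described above.
# Names, channel sizes, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class LightweightMLP(nn.Module):
    """MLP branch: models long-range dependencies across spatial positions."""

    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.fc1 = nn.Linear(channels, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, channels)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = tokens + self.fc2(self.act(self.fc1(self.norm(tokens))))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class LearnableVisualCenter(nn.Module):
    """Parallel branch: a learnable codebook of 'centers'; each spatial
    position is softly assigned to the centers to aggregate key regions."""

    def __init__(self, channels: int, num_centers: int = 64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, channels))

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        feats = x.flatten(2).transpose(1, 2)    # (B, N, C), N = H*W
        assign = (feats @ self.centers.t()).softmax(dim=-1)  # (B, N, K)
        agg = assign @ self.centers             # center info per position
        out = feats + agg                       # residual aggregation
        return out.transpose(1, 2).reshape(b, c, h, w)


class ExplicitVisualCenter(nn.Module):
    """EVC: run both branches in parallel and fuse their outputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.mlp = LightweightMLP(channels)
        self.lvc = LearnableVisualCenter(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.mlp(x), self.lvc(x)], dim=1))
```

Running the two branches on the same input in parallel mirrors the description above; other fusion choices (e.g., element-wise addition) would fit the description equally well.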
To demonstrate its superiority, extensive experiments are carried out on the challenging MS-COCO dataset [34]. The results validate that our proposed CFP achieves consistent performance gains on the state-of-the-art YOLOv5 [35] and YOLOX [36] object detection baselines.
Our contributions are summarized as follows: 1) We propose a spatial explicit visual center scheme, which consists of a lightweight MLP for capturing the global long-range dependencies and a learnable visual center for aggregating the local key regions. 2) We propose a globally centralized regulation for the commonly-used feature pyramid in a top-down manner, as sketched below. 3) CFP achieves consistent performance gains on strong object detection baselines.
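A minimal sketch of contribution 2), assuming the ExplicitVisualCenter module from the earlier sketch: the deepest pyramid level is processed by the EVC, and its output is upsampled and fused into every shallower level, so a single centralized representation regulates the whole pyramid top-down. The concat-plus-1×1-convolution fusion and nearest-neighbor upsampling are our assumptions.

```python
# GCR sketch, reusing ExplicitVisualCenter from the EVC sketch above.
# Fusion and upsampling choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GloballyCentralizedRegulation(nn.Module):
    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        self.evc = ExplicitVisualCenter(channels)  # from the sketch above
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, 1) for _ in range(num_levels - 1)
        )

    def forward(self, feats):             # feats: shallow -> deep, common width
        center = self.evc(feats[-1])      # EVC only on the deepest level
        outs = [center]
        for conv, f in zip(self.fuse, reversed(feats[:-1])):
            up = F.interpolate(center, size=f.shape[-2:], mode="nearest")
            outs.append(conv(torch.cat([f, up], dim=1)))
        return outs[::-1]                 # restore shallow -> deep order
```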
II. RELATED WORK
A. Feature Pyramid in Computer Vision
The feature pyramid is a fundamental neck network in modern recognition systems that can be used effectively and efficiently to detect objects at different scales. SSD [6] is one of the first approaches to use a pyramidal feature hierarchy, which captures multi-scale feature information through network layers of different spatial sizes and thus improves recognition accuracy. FPN [17] mainly relies on the bottom-up in-network feature hierarchy and builds a top-down path with lateral connections to produce multi-scale, semantically strong feature maps.
Building on this, PANet [16] further proposes an additional bottom-up pathway on top of FPN to share feature information among inter-layer features, such that the high-level features can also obtain sufficient details from the low-level features. With the help of neural architecture search, NAS-FPN [13] uses a spatial search strategy to connect layers across a feature pyramid and obtains extensible feature information. M2Det [37] extracts multi-stage and multi-scale features by constructing a multi-stage feature pyramid to achieve cross-level and cross-layer feature fusion.
In general, 1) the feature pyramid can deal with multi-scale variation in object recognition without increasing the computational overhead; and 2) the extracted features yield multi-scale representations, including some high-resolution features. However, these methods mainly focus on inter-layer feature interactions. In this work, we instead propose an intra-layer feature regulation for feature pyramids, which makes up for the shortcomings of current methods in this regard.
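As a point of reference for the designs discussed above, here is a minimal sketch of the FPN-style top-down pathway with lateral connections, assuming PyTorch; the channel widths and number of levels are illustrative.

```python
# Minimal sketch of an FPN-style top-down pathway with lateral connections.
# Channel sizes and the number of levels are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project every backbone stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels
        )
        # 3x3 convs smooth the merged maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels
        )

    def forward(self, feats):                  # feats: shallow -> deep
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the deeper map and add it to the lateral below.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        return [s(l) for s, l in zip(self.smooth, laterals)]
```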
B. Visual Attention Learning
CNNs [38] focus more on the representation learning of local regions. However, such local representations cannot satisfy the need of modern recognition systems for global context and long-range dependencies. To this end, the attention mechanism [20] was proposed, which focuses on deciding where to pay more attention in an image. For example, the non-local operation [19] uses a non-local neural network to directly capture long-range dependencies, demonstrating the significance of non-local modeling for video classification, object detection, and segmentation.
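A minimal sketch of a non-local block in its embedded-Gaussian form follows: every position attends to every other position, directly modeling long-range dependencies. The 1×1 projections with channel reduction follow the common non-local design, but the exact sizes here are illustrative assumptions.

```python
# Minimal sketch of a non-local block (embedded-Gaussian form).
# Channel sizes are illustrative assumptions.
import torch
import torch.nn as nn


class NonLocalBlock(nn.Module):
    def __init__(self, channels: int, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)    # query embedding
        self.phi = nn.Conv2d(channels, inner, 1)      # key embedding
        self.g = nn.Conv2d(channels, inner, 1)        # value embedding
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (B, N, C')
        k = self.phi(x).flatten(2)                    # (B, C', N)
        v = self.g(x).flatten(2).transpose(1, 2)      # (B, N, C')
        attn = torch.softmax(q @ k, dim=-1)           # (B, N, N) pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                        # residual connection
```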
However, the intrinsically local nature of CNN representations remains unresolved, i.e., CNN features can only capture limited contextual information. To address this problem, the Transformer [20], which mainly benefits from the multi-head attention mechanism, has recently caused a great sensation and achieved great success in the field of computer vision, such as image recognition [24], [39], [23], [40], [25]. For example, the representative ViT divides the image into a patch sequence with position encodings, and then uses cascaded transformer blocks to extract the parameterized