
TOWARDS LIGHT WEIGHT OBJECT DETECTION SYSTEM
Dharma KC1,*, Venkata Ravi Kiran Dayana2, Meng-Lin Wu2,
Venkateswara Rao Cherukuri†, Hau Hwang2
1Department of Computer Science, University of Arizona
2Qualcomm Technologies, Inc.
*Work performed during internship at Qualcomm Technologies, Inc.
†Work performed while at Qualcomm Technologies, Inc.
ABSTRACT
Transformers are a popular choice for classification tasks and as backbones for object detection tasks. However, their high latency makes them difficult to adopt in lightweight object detection systems. We present an approximation of the self-attention layers used in the transformer architecture. This approximation reduces the latency of the classification system while incurring minimal loss in accuracy. We also present a method that uses a transformer encoder layer for multi-resolution feature fusion. This feature fusion improves the accuracy of a state-of-the-art lightweight object detection system without significantly increasing the number of parameters. Finally, we provide an abstraction of the transformer architecture, called the Generalized Transformer (gFormer), that can guide the design of novel transformer-like architectures.
Index Terms—Vision transformer, self-attention, object
detection, deep neural networks
1. INTRODUCTION
Convolutional neural networks (CNNs) [1] have been widely used as backbones for object detection systems. MobileNets [2] use depthwise separable convolutions to build lightweight CNNs. MobileNetV2 further improves on MobileNets with inverted residuals and linear bottlenecks. It also introduced efficient ways of applying depthwise separable convolutions to the heads of the Single Shot MultiBox Detector (SSD) [3], resulting in the lightweight object detection system SSDLite. Recently, Vision Transformers (ViTs) [4] have been gaining popularity due to their ability to extract global information. However, they lack the spatial inductive biases present in CNNs. MobileViT [5] presented a hybrid architecture based on CNNs and ViTs that leverages the inductive biases of CNNs while also capturing global information through ViTs. MobileViT achieves impressive performance on the ImageNet-1k classification dataset [6], but its main disadvantage is high latency.
In this work, we propose Convolution as Transformer
(CAT): a module that approximates the self-attention layer
in transformers. CAT has low latency and thus can be used in lightweight systems for image classification and object detection. We replace the expensive transformer blocks used in MobileViT with our CAT blocks, and we show that they are competitive with self-attention modules on image classification tasks. Moreover, CAT blocks have complexity O(n×d), unlike self-attention, which has complexity O(n²×d), where n is the sequence length and d is the feature vector size.
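To make the complexity gap concrete, the following PyTorch sketch contrasts standard self-attention with a convolutional token mixer. The depthwise 1D convolution is an illustrative stand-in with O(n×d) cost; it is not necessarily the exact CAT design, which this excerpt does not specify.

```python
import torch
import torch.nn as nn

n, d = 256, 96  # illustrative sequence length and feature size
x = torch.randn(1, n, d)

# Standard self-attention: forming the n x n attention matrix costs O(n^2 x d).
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
y_attn, _ = attn(x, x, x)

# Convolutional token mixing: a depthwise 1D convolution touches each token
# a constant number of times, so its cost is O(n x d).
# NOTE: an illustrative approximation, not the exact CAT module.
conv_mixer = nn.Conv1d(d, d, kernel_size=3, padding=1, groups=d)
y_conv = conv_mixer(x.transpose(1, 2)).transpose(1, 2)

assert y_attn.shape == y_conv.shape == x.shape  # same interface, lower cost
```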
Existing lightweight systems for object detection [5, 3] mainly consist of a backbone that extracts features from images, followed by heads that extract features at multiple output resolutions. Predictions of object labels and locations are made directly from these multi-scale features. It is therefore challenging to learn the relationships between these features from multiple scales, which carry different semantic information.
To overcome this, we propose the module Transformer Encoder as Feature Fusion (TAFF): a single-layer transformer encoder [7] that fuses features from multiple resolutions. We show empirically that the feature fusion performed by TAFF improves the accuracy of state-of-the-art object detection models like MobileViT [8].
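A minimal sketch of how such a fusion layer could look is shown below, assuming each scale is first projected to a common width, flattened into tokens, jointly encoded by one nn.TransformerEncoderLayer, and split back per scale; the paper's exact tokenization and projections may differ.

```python
import torch
import torch.nn as nn

class TAFF(nn.Module):
    """Sketch of single-layer transformer-encoder feature fusion.

    Assumption: multi-scale maps are projected to a shared width d,
    flattened into tokens, encoded jointly, and split back per scale.
    """
    def __init__(self, in_channels, d=128, num_heads=4):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d, nhead=num_heads, batch_first=True)

    def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i)
        tokens, shapes = [], []
        for f, p in zip(feats, self.proj):
            f = p(f)                                     # (B, d, H_i, W_i)
            shapes.append(f.shape[-2:])
            tokens.append(f.flatten(2).transpose(1, 2))  # (B, H_i*W_i, d)
        fused = self.encoder(torch.cat(tokens, dim=1))   # attention across scales
        outs, i = [], 0
        for h, w in shapes:
            t = fused[:, i:i + h * w].transpose(1, 2)    # (B, d, H_i*W_i)
            outs.append(t.reshape(t.shape[0], t.shape[1], h, w))
            i += h * w
        return outs

# Usage: fuse three SSD-style head resolutions (hypothetical sizes).
feats = [torch.randn(1, c, s, s) for c, s in [(64, 32), (96, 16), (128, 8)]]
fused = TAFF([64, 96, 128])(feats)
```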
Finally, we propose the Generalized Transformer (gFormer): a general abstract architecture that binds multiple variations of attention and transformer mechanisms under a common umbrella. From this perspective, MetaFormer [9], the Transformer [7], Squeeze-and-Excitation Networks [10, 11], and our CAT block are all variations of gFormer.
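One way to read this abstraction (our interpretation of the description above, not a definition taken from the paper) is as a residual block parameterized by an arbitrary token mixer; swapping the mixer recovers the different variants.

```python
import torch
import torch.nn as nn

class GFormerBlock(nn.Module):
    """Sketch of the gFormer abstraction: a residual block parameterized
    by an arbitrary token-mixing module. Plugging in self-attention gives
    a Transformer layer; a cheap convolutional mixer gives a CAT-like
    block; a channel gate gives an SE-like block. Interface is assumed."""
    def __init__(self, dim, mixer):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = mixer
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                   # x: (B, n, dim)
        x = x + self.mixer(self.norm1(x))   # token mixing (attention, conv, ...)
        return x + self.mlp(self.norm2(x))  # channel mixing

class ConvMixer(nn.Module):
    """Depthwise-conv token mixer: a CAT-like instantiation (assumed form)."""
    def __init__(self, dim, k=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)
    def forward(self, x):                   # (B, n, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

block = GFormerBlock(96, ConvMixer(96))
y = block(torch.randn(2, 64, 96))           # (B, n, dim) -> same shape
```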
2. SYSTEM
2.1. Convolution as Transformer (CAT)
The baseline for this architecture is the MobileViT architecture [8], which uses MobileNetV2 blocks along with MobileViT blocks that contain transformer layers for extracting global information. We refer to [8] for the full architecture and only show the MobileViT block in Fig. 1.
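For orientation, here is a simplified sketch of the MobileViT block referenced in Fig. 1: local convolutions, a transformer for global mixing, and fusion with the input. For brevity we flatten all spatial positions into one token sequence, whereas the actual block in [8] unfolds the map into patches before applying attention.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Simplified sketch of the MobileViT block: local conv encoding,
    a transformer over the flattened feature map for global mixing,
    then fusion with the input. Not the reference implementation."""
    def __init__(self, c, d=96, depth=2):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, d, 1))
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(d, c, 1)
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, x):
        B, _, H, W = x.shape
        y = self.local(x)                               # (B, d, H, W)
        t = y.flatten(2).transpose(1, 2)                # (B, H*W, d)
        t = self.global_rep(t)                          # global mixing
        y = self.proj(t.transpose(1, 2).reshape(B, -1, H, W))
        return self.fuse(torch.cat([x, y], dim=1))      # local + global

out = MobileViTBlockSketch(32)(torch.randn(1, 32, 32, 32))
```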
The MobileViT architecture extracts global information with transformers. The major disadvantage of this method is its high latency, caused by the self-attention layers used inside the transformers-as-convolutions block. We hypothesize, and show empirically, that we can extract the