TOWARDS LIGHT WEIGHT OBJECT DETECTION SYSTEM
Dharma KC1,*, Venkata Ravi Kiran Dayana2, Meng-Lin Wu2
Venkateswara Rao Cherukuri†, Hau Hwang2
1Department of Computer Science, University of Arizona
2Qualcomm Technologies, Inc.
ABSTRACT
Transformers are a popular choice for classification tasks and
as backbones for object detection tasks. However, their high
latency brings challenges in their adaptation to light weight
object detection systems. We present an approximation of the
self-attention layers used in the transformer architecture. This
approximation reduces the latency of the classification system
while incurring minimal loss in accuracy. We also present
a method that uses a transformer encoder layer for multi-
resolution feature fusion. This feature fusion improves the
accuracy of the state-of-the-art light weight object detection
system without significantly increasing the number of param-
eters. Finally, we provide an abstraction for the transformer
architecture called Generalized Transformer (gFormer) that
can guide the design of novel transformer-like architectures.
Index Terms— Vision transformer, self-attention, object detection, deep neural networks
1. INTRODUCTION
Convolutional neural networks (CNNs) [1] have been widely
used as backbones for object detection systems. MobileNets [2]
use depthwise separable convolutions to develop light weight
CNNs. MobileNetV2 further improves MobileNets using in-
verted residuals and linear bottlenecks. It also introduced ef-
ficient ways of applying depthwise separable convolutions to
the heads of Single Shot MultiBox Detector (SSD) [3], which
resulted in the light weight object detection system, SSDLite.
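The efficiency of the depthwise separable convolutions used by MobileNets and SSDLite comes from factoring a standard convolution into a per-channel spatial step and a 1×1 pointwise step. The module below is an illustrative PyTorch sketch of that factorization, not the authors' exact head design:

```python
import torch
import torch.nn as nn

def depthwise_separable(cin, cout, k=3):
    # Depthwise separable convolution as used in MobileNets: a per-channel
    # spatial convolution (groups=cin) followed by a 1x1 pointwise one.
    # Parameter count drops from cin*cout*k*k to cin*k*k + cin*cout.
    return nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin, bias=False),
        nn.Conv2d(cin, cout, 1, bias=False),
    )
```

For cin = 32, cout = 64, k = 3, this gives 2,336 parameters versus 18,432 for a standard 3×3 convolution.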
Recently, Vision Transformers (ViTs) [4] have been gaining popularity due to their ability to extract global information. However,
they lack the spatial inductive biases present in CNNs. Mo-
bileViT [5] presented a hybrid architecture based on CNNs
and ViTs that leverages the inductive biases of CNNs and
also includes global information through ViTs. MobileViT
achieves impressive performance on the ImageNet-1k classification dataset [6], but its main drawback is high latency.
In this work, we propose Convolution as Transformer
(CAT): a module that approximates the self-attention layer
*Work performed during internship at Qualcomm Technologies, Inc.
†Work performed while at Qualcomm Technologies, Inc.
in transformers. CAT has low latency and thus can be used
in light weight systems for image classification and object
detection. We replace expensive transformer blocks used
in MobileViT with our CAT blocks, and we show that they
are competitive with the self-attention modules for image
classification tasks. Moreover, CAT blocks have complexity
O(n × d), unlike self-attention, which has complexity O(n^2 × d), where n is the sequence length and d is the feature vector size.
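The complexity gap can be made concrete with toy token mixers in NumPy. The convolutional mixer below is a hypothetical stand-in for the idea, not the exact CAT block: with a fixed kernel size, mixing tokens by convolution scales linearly in n, while self-attention builds an n × n score matrix.

```python
import numpy as np

def self_attention(x):
    # Single-head self-attention over a sequence x of shape (n, d).
    # The score matrix is n x n, so the cost grows as O(n^2 * d).
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x                               # (n, d)

def conv_token_mixer(x, kernel):
    # Hypothetical CAT-style stand-in: mix tokens with a small 1-D
    # convolution along the sequence axis, applied per channel.
    # With a fixed kernel size k, the cost is O(n * d * k) = O(n * d).
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(k):
        out += kernel[i] * xp[i:i + x.shape[0]]
    return out
```

Both mixers map an (n, d) sequence to an (n, d) sequence, so one can be swapped for the other inside a block.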
Existing light weight systems for object detection [5, 3]
mainly consist of a backbone to extract features from images,
followed by heads to extract features from multiple output
resolutions. Predictions of object labels and locations are made directly from these multi-scale features. It is therefore challenging to learn the relationships between features from multiple scales, which carry different semantic information.
To overcome this, we propose the module Transformer
Encoder as Feature Fusion (TAFF): a single layered trans-
former encoder [7] which fuses features from multiple resolu-
tions at different scales. We show empirically that the feature
fusion performed by TAFF improves the accuracy of state-of-
the-art object detection models like MobileViT [8].
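As a rough sketch of the idea (not the exact TAFF configuration; the channel count, head count, and token layout here are assumptions), the multi-scale maps can be flattened into one token sequence, passed through a single transformer encoder layer so that tokens from different scales attend to each other, and reshaped back:

```python
import torch
import torch.nn as nn

class TAFFFusion(nn.Module):
    # Illustrative TAFF-style fusion: one transformer encoder layer
    # over tokens pooled from every scale of the detection head.
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)

    def forward(self, feats):
        # feats: list of (B, C, H_i, W_i) maps at different scales.
        tokens, shapes = [], []
        for f in feats:
            shapes.append(f.shape[-2:])
            tokens.append(f.flatten(2).transpose(1, 2))  # (B, H_i*W_i, C)
        x = torch.cat(tokens, dim=1)   # all scales in one sequence
        x = self.encoder(x)            # cross-scale self-attention
        # Split the fused tokens back into per-scale maps.
        out, idx = [], 0
        for (h, w), f in zip(shapes, feats):
            n = h * w
            out.append(x[:, idx:idx + n].transpose(1, 2).reshape_as(f))
            idx += n
        return out
```

The fused maps keep their original shapes, so the downstream SSD-style prediction heads need no changes.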
Finally, we propose Generalized TransFormer (gFormer):
a general abstract architecture that binds multiple variations
of attention and transformer mechanisms under a common
umbrella. From this perspective, MetaFormer [9], Trans-
former [7], Squeeze and Excitation Networks [10, 11], and
our CAT block are all variations of gFormer.
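One way to read the gFormer abstraction (the skeleton below is an assumption modeled on MetaFormer's token-mixer view; the paper's precise definition may differ) is as a residual block parameterized by an arbitrary token mixer and channel transform:

```python
import numpy as np

def gformer_block(x, token_mixer, channel_mlp):
    # Hypothetical gFormer skeleton:
    #   x = x + TokenMixer(Norm(x));  x = x + ChannelMLP(Norm(x))
    # Plugging in self-attention recovers the standard transformer block;
    # plugging in a convolution recovers a CAT-like block.
    def norm(z):  # per-token layer normalization
        mu = z.mean(axis=-1, keepdims=True)
        sd = z.std(axis=-1, keepdims=True) + 1e-6
        return (z - mu) / sd
    x = x + token_mixer(norm(x))
    x = x + channel_mlp(norm(x))
    return x
```

Under this reading, the architectures listed above differ only in which functions they substitute for the two slots.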
2. SYSTEM
2.1. Convolution as Transformer (CAT)
The baseline for this architecture is the MobileViT architec-
ture [8] that uses MobileNetV2 blocks along with MobileViT
blocks that contain transformer layers for extracting global in-
formation. We refer to [8] for the full architecture and only
show the MobileViT block in Fig. 1.
The MobileViT architecture extracts global information with transformers. Its major disadvantage is high latency, caused by the self-attention layers inside the MobileViT (transformers-as-convolutions) block. We hypothesize and prove empirically that we can extract the
arXiv:2210.03861v1 [cs.CV] 8 Oct 2022