
TOWARDS LIGHT WEIGHT OBJECT DETECTION SYSTEM
Dharma KC1,*, Venkata Ravi Kiran Dayana2, Meng-Lin Wu2,
Venkateswara Rao Cherukuri†, Hau Hwang2
1Department of Computer Science, University of Arizona
2Qualcomm Technologies, Inc.
*Work performed during internship at Qualcomm Technologies, Inc.
†Work performed while at Qualcomm Technologies, Inc.
ABSTRACT
Transformers are a popular choice for classification tasks and as backbones for object detection tasks. However, their high latency makes them difficult to adopt in lightweight object detection systems. We present an approximation of the self-attention layers used in the transformer architecture. This approximation reduces the latency of the classification system while incurring minimal loss in accuracy. We also present a method that uses a transformer encoder layer for multi-resolution feature fusion. This feature fusion improves the accuracy of a state-of-the-art lightweight object detection system without significantly increasing the number of parameters. Finally, we provide an abstraction of the transformer architecture, called the Generalized Transformer (gFormer), that can guide the design of novel transformer-like architectures.
Index Terms—Vision transformer, self-attention, object
detection, deep neural networks
1. INTRODUCTION
Convolutional neural networks (CNNs) [1] have been widely used as backbones for object detection systems. MobileNets [2] use depthwise separable convolutions to build lightweight CNNs. MobileNetV2 further improves on MobileNets with inverted residuals and linear bottlenecks. It also introduced efficient ways of applying depthwise separable convolutions to the heads of the Single Shot MultiBox Detector (SSD) [3], resulting in the lightweight object detection system SSDLite. Recently, Vision Transformers (ViTs) [4] have been gaining popularity due to their ability to extract global information. However, they lack the spatial inductive biases present in CNNs. MobileViT [5] presented a hybrid architecture based on CNNs and ViTs that leverages the inductive biases of CNNs while also capturing global information through ViTs. MobileViT achieves impressive performance on the ImageNet-1k classification dataset [6], but its main disadvantage is high latency.
In this work, we propose Convolution as Transformer
(CAT): a module that approximates the self-attention layer
in transformers. CAT has low latency and thus can be used in lightweight systems for image classification and object detection. We replace the expensive transformer blocks used in MobileViT with our CAT blocks, and we show that they are competitive with self-attention modules on image classification tasks. Moreover, CAT blocks have complexity O(n×d), unlike self-attention, which has complexity O(n²×d), where n is the sequence length and d is the feature vector size.
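To make the complexity gap concrete, the following PyTorch sketch contrasts standard self-attention with a convolutional token mixer. The depthwise 1D convolution is an illustrative stand-in with O(n×d) cost; it is not necessarily the exact CAT design, which this excerpt does not specify.

```python
import torch
import torch.nn as nn

n, d = 256, 96  # illustrative sequence length and feature size
x = torch.randn(1, n, d)

# Standard self-attention: forming the n x n attention matrix costs O(n^2 x d).
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
y_attn, _ = attn(x, x, x)

# Convolutional token mixing: a depthwise 1D convolution touches each token
# a constant number of times, so its cost is O(n x d).
# NOTE: an illustrative approximation, not the exact CAT module.
conv_mixer = nn.Conv1d(d, d, kernel_size=3, padding=1, groups=d)
y_conv = conv_mixer(x.transpose(1, 2)).transpose(1, 2)

assert y_attn.shape == y_conv.shape == x.shape  # same interface, lower cost
```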
Existing lightweight systems for object detection [5, 3] mainly consist of a backbone that extracts features from images, followed by heads that extract features at multiple output resolutions. Predictions of object labels and locations are made directly from these multi-scale features. It is therefore challenging to learn the relationships between these features from multiple scales, which carry different semantic information.
To overcome this, we propose the module Transformer Encoder as Feature Fusion (TAFF): a single-layer transformer encoder [7] that fuses features from multiple resolutions. We show empirically that the feature fusion performed by TAFF improves the accuracy of state-of-the-art object detection models like MobileViT [8].
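A minimal sketch of how such a fusion layer could look is shown below, assuming each scale is first projected to a common width, flattened into tokens, jointly encoded by one nn.TransformerEncoderLayer, and split back per scale; the paper's exact tokenization and projections may differ.

```python
import torch
import torch.nn as nn

class TAFF(nn.Module):
    """Sketch of single-layer transformer-encoder feature fusion.

    Assumption: multi-scale maps are projected to a shared width d,
    flattened into tokens, encoded jointly, and split back per scale.
    """
    def __init__(self, in_channels, d=128, num_heads=4):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d, nhead=num_heads, batch_first=True)

    def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i)
        tokens, shapes = [], []
        for f, p in zip(feats, self.proj):
            f = p(f)                                     # (B, d, H_i, W_i)
            shapes.append(f.shape[-2:])
            tokens.append(f.flatten(2).transpose(1, 2))  # (B, H_i*W_i, d)
        fused = self.encoder(torch.cat(tokens, dim=1))   # attention across scales
        outs, i = [], 0
        for h, w in shapes:
            t = fused[:, i:i + h * w].transpose(1, 2)    # (B, d, H_i*W_i)
            outs.append(t.reshape(t.shape[0], t.shape[1], h, w))
            i += h * w
        return outs

# Usage: fuse three SSD-style head resolutions (hypothetical sizes).
feats = [torch.randn(1, c, s, s) for c, s in [(64, 32), (96, 16), (128, 8)]]
fused = TAFF([64, 96, 128])(feats)
```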
Finally, we propose the Generalized Transformer (gFormer): a general abstract architecture that binds multiple variations of attention and transformer mechanisms under a common umbrella. From this perspective, MetaFormer [9], the Transformer [7], Squeeze-and-Excitation Networks [10, 11], and our CAT block are all variations of gFormer.
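One way to read this abstraction (our interpretation of the description above, not a definition taken from the paper) is as a residual block parameterized by an arbitrary token mixer; swapping the mixer recovers the different variants.

```python
import torch
import torch.nn as nn

class GFormerBlock(nn.Module):
    """Sketch of the gFormer abstraction: a residual block parameterized
    by an arbitrary token-mixing module. Plugging in self-attention gives
    a Transformer layer; a cheap convolutional mixer gives a CAT-like
    block; a channel gate gives an SE-like block. Interface is assumed."""
    def __init__(self, dim, mixer):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = mixer
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                   # x: (B, n, dim)
        x = x + self.mixer(self.norm1(x))   # token mixing (attention, conv, ...)
        return x + self.mlp(self.norm2(x))  # channel mixing

class ConvMixer(nn.Module):
    """Depthwise-conv token mixer: a CAT-like instantiation (assumed form)."""
    def __init__(self, dim, k=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)
    def forward(self, x):                   # (B, n, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

block = GFormerBlock(96, ConvMixer(96))
y = block(torch.randn(2, 64, 96))           # (B, n, dim) -> same shape
```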
2. SYSTEM
2.1. Convolution as Transformer (CAT)
The baseline for this architecture is the MobileViT architecture [8], which uses MobileNetV2 blocks along with MobileViT blocks that contain transformer layers for extracting global information. We refer to [8] for the full architecture and only show the MobileViT block in Fig. 1.
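For orientation, here is a simplified sketch of the MobileViT block referenced in Fig. 1: local convolutions, a transformer for global mixing, and fusion with the input. For brevity we flatten all spatial positions into one token sequence, whereas the actual block in [8] unfolds the map into patches before applying attention.

```python
import torch
import torch.nn as nn

class MobileViTBlockSketch(nn.Module):
    """Simplified sketch of the MobileViT block: local conv encoding,
    a transformer over the flattened feature map for global mixing,
    then fusion with the input. Not the reference implementation."""
    def __init__(self, c, d=96, depth=2):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, d, 1))
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Conv2d(d, c, 1)
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, x):
        B, _, H, W = x.shape
        y = self.local(x)                               # (B, d, H, W)
        t = y.flatten(2).transpose(1, 2)                # (B, H*W, d)
        t = self.global_rep(t)                          # global mixing
        y = self.proj(t.transpose(1, 2).reshape(B, -1, H, W))
        return self.fuse(torch.cat([x, y], dim=1))      # local + global

out = MobileViTBlockSketch(32)(torch.randn(1, 32, 32, 32))
```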
The MobileViT architecture extracts global information with transformers. The major disadvantage of this method is its high latency, caused by the self-attention layers used inside the transformers-as-convolutions block. We hypothesize, and show empirically, that we can extract the