Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets
Xiangyu Chen1, Qinghao Hu2, Kaidong Li1, Cuncong Zhong1, Guanghui Wang3∗
1Department of EECS, University of Kansas, KS, USA
2Institute of Automation, Chinese Academy of Sciences, China
3Department of CS, Toronto Metropolitan University, Toronto, ON, Canada
xychen@ku.edu, wangcs@ryerson.ca (*corresponding author)
Abstract
Vision Transformers have demonstrated competitive performance on computer vision tasks, benefiting from their ability to capture long-range dependencies with multi-head self-attention modules and multi-layer perceptrons. However, computing global attention also brings a disadvantage compared with convolutional neural networks: it requires far more data and computation to converge, which makes it difficult to generalize well on small datasets, a common situation in practical applications. Previous works either focus on transferring knowledge from large datasets or adjust the architecture for small datasets. After carefully examining the self-attention modules, we discover that the number of trivial attention weights is far greater than that of the important ones, and that, due to their sheer quantity, the accumulated trivial weights dominate the attention in Vision Transformers, which the attention mechanism itself does not handle. This obscures useful non-trivial attention and harms performance when the trivial attention contains more noise, e.g., in the shallow layers of some backbones. To solve this issue, we propose to divide attention weights into trivial and non-trivial ones by thresholds, and then to Suppress Accumulated Trivial Attention (SATA) weights with the proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets show that our suppression method boosts the accuracy of Vision Transformers by up to 2.3%.
Code is available at https://github.com/xiangyu8/SATA.
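To make the idea above concrete, the following is a minimal sketch of splitting an attention map into trivial and non-trivial weights by a threshold and shrinking the accumulated trivial mass. The threshold rule and the scaling factor `alpha` are illustrative placeholders and are not the paper's exact TWIST transformation, which is defined later in the paper.

```python
# Minimal sketch (not the exact TWIST): threshold each row of an attention
# map into trivial / non-trivial weights, suppress the trivial ones, and
# re-normalize so each row remains a valid attention distribution.
import numpy as np

def suppress_trivial_attention(attn, tau=None, alpha=0.1):
    """attn: (N, N) row-stochastic attention map for one head."""
    if tau is None:
        tau = 1.0 / attn.shape[-1]              # placeholder: uniform-attention level
    trivial = attn < tau                        # mask of trivial weights
    out = attn.copy()
    out[trivial] *= alpha                       # suppress accumulated trivial mass
    out /= out.sum(axis=-1, keepdims=True)      # re-normalize rows
    return out

attn = np.random.dirichlet(np.ones(16), size=16)        # toy 16-token attention map
print(suppress_trivial_attention(attn).sum(axis=-1))    # rows still sum to 1
```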
1. Introduction
Convolutional Neural Networks (CNNs) have dominated
computer vision tasks for the past decade, especially with
the emergence of ResNet [16]. The convolution operation, the
core of a CNN, takes all the pixels in its receptive field as
input and outputs one value. When the layers go deep, the
stacked locality becomes non-local, as the receptive field of
each layer is built on the convolution results of the previous
layer. The advantage of convolution is its power to extract
local features, which enables fast convergence and makes it
a good fit for data-efficient tasks in particular. Different
from CNNs, Vision Transformer (ViT) [11] and its variants
[6, 10, 12, 27, 30, 33] consider the similarities between
each image patch embedding and all other patch embeddings.
This global attention boosts their potential for feature
extraction; however, it requires a large amount of data to
train the model, which limits their application to small datasets.
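The sketch below illustrates the global attention just described: every patch embedding is compared with every other patch embedding via a single-head scaled dot-product, unlike a convolution, which only aggregates over a local receptive field. The shapes and random projections are assumptions chosen purely for illustration.

```python
# Minimal single-head global attention over patch embeddings (illustrative).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d = 196, 64                            # e.g. 14x14 patches, 64-dim embeddings
X = np.random.randn(N, d)                 # patch embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))      # (N, N): each patch attends to all patches
out = attn @ V                            # globally aggregated features
```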
On the one hand, CNNs have demonstrated superior performance
to ViT in terms of accuracy, computation, and convergence
speed on data-efficient tasks, e.g., ResNet-50 for image
classification [2, 3, 25, 23] and object detection [43],
and ResNet-12 for few-shot learning [5]. However, improving
their performance amounts to finding more inductive biases
to include, which is tedious. Their local attention also sets
a lower performance ceiling by eliminating much of the
necessary non-local attention, in contrast to Vision
Transformers. On the other hand, the stronger feature
extraction ability of Vision Transformers can make up for
the lack of data on small datasets. As a result, Vision
Transformers offer a promising direction for these tasks.
To adapt Vision Transformers to data-efficient tasks,
some researchers focus on transfer learning [20, 35, 39],
semi-supervised learning, and unsupervised learning to
leverage large datasets. Others are dedicated to self-supervised
learning or other modalities to exploit the inherent structural
information of the images themselves [4]. For supervised
learning, one path is to integrate convolution operations into
Vision Transformers to increase their locality. Another approach
is to improve efficiency by revising the structure of Vision
Transformers themselves [31]. The proposed method belongs
to the second category.
The main transformer block consists of a multi-head self-attention
(MHSA) module and a multi-layer perceptron (MLP) layer,
along with layer normalization, where the MHSA module is key
to enriching each sequence by including long-range dependencies
with all other sequences, i.e., attention. Intuitively, this
attention module is expected