Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets
Xiangyu Chen1, Qinghao Hu2, Kaidong Li1, Cuncong Zhong1, Guanghui Wang3
1Department of EECS, University of Kansas, KS, USA
2Institute of Automation, Chinese Academy of Sciences, China
3Department of CS, Toronto Metropolitan University, Toronto, ON, Canada
xychen@ku.edu, wangcs@ryerson.ca (*corresponding author)
Abstract
Vision Transformers have demonstrated competitive performance on computer vision tasks, benefiting from their ability to capture long-range dependencies with multi-head self-attention modules and multi-layer perceptrons. However, calculating global attention brings a disadvantage compared with convolutional neural networks: it requires much more data and computation to converge, which makes it difficult to generalize well on small datasets, a common situation in practical applications. Previous works either focus on transferring knowledge from large datasets or adjust the structure for small datasets. After carefully examining the self-attention modules, we discover that the number of trivial attention weights is far greater than the number of important ones, and that the accumulated trivial weights dominate the attention in Vision Transformers due to their sheer quantity, which the attention mechanism itself does not handle. This overwhelms the useful non-trivial attention and harms performance when the trivial attention includes more noise, e.g. in the shallow layers of some backbones. To solve this issue, we propose to divide attention weights into trivial and non-trivial ones by a threshold, and then to Suppress Accumulated Trivial Attention (SATA) weights with the proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets show that our suppression method boosts the accuracy of Vision Transformers by up to 2.3%. Code is available at https://github.com/xiangyu8/SATA.
1. Introduction
Convolutional Neural Networks (CNNs) have dominated computer vision tasks for the past decade, especially since the emergence of ResNet [16]. The convolution operation, the core of a CNN, takes all the pixels in its receptive field as input and outputs one value. As the layers go deeper, the stacked locality becomes non-local, since the receptive field of each layer is built on the convolution results of the previous layer. The advantage of convolution is its power to extract local features, which makes it converge fast and fit well, especially on data-efficient tasks. Different from CNNs, the Vision Transformer (ViT) [11] and its variants [6, 10, 12, 27, 30, 33] consider the similarities between each image patch embedding and all other patch embeddings. This global attention boosts their potential for feature extraction; however, it requires a large amount of data to train the model and limits its application to small datasets.
On the one hand, CNNs have demonstrated superior performance to ViT in terms of accuracy, computation, and convergence speed on data-efficient tasks, e.g. ResNet-50 for image classification [2, 3, 25, 23] and object detection [43], and ResNet-12 for few-shot learning [5]. However, improving their performance amounts to finding more inductive biases to include, which is tedious. Their local attention also sets a lower performance ceiling by eliminating much of the necessary non-local attention, in contrast to Vision Transformers. On the other hand, the stronger feature extraction ability of Vision Transformers can make up for the lack of data on small datasets. As a result, Vision Transformers are a promising direction for these tasks.
To adapt Vision Transformers to data-efficient tasks, some researchers focus on transfer learning [20, 35, 39], semi-supervised learning, and unsupervised learning to leverage large datasets. Others resort to self-supervised learning or other modalities to mine the inherent structural information of the images themselves [4]. For supervised learning, one path is to integrate convolution operations into Vision Transformers to increase their locality. Another approach is to increase efficiency by revising the structure of the Vision Transformer itself [31]. The proposed method belongs to the second category.
The main Transformer block consists of a multi-head self-attention (MHSA) module and a multi-layer perceptron (MLP) layer, along with layer normalization, where the MHSA module is key to enriching each sequence by including long-range dependencies with all other sequences, i.e. attention.
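For reference, a minimal PyTorch sketch of the standard single-head formulation of this module is given below; the embedding size, token count, and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(z, w_q, w_k, w_v):
    """z: (N, d) patch embeddings; w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v        # queries, keys, values
    scores = q @ k.t() / (k.shape[-1] ** 0.5)  # (N, N) pairwise similarities
    a = F.softmax(scores, dim=-1)              # attention map A; each row sums to 1
    return a, a @ v                            # A and the enriched sequences SA(z)

torch.manual_seed(0)
z = torch.randn(65, 64)                        # e.g., 64 patch tokens + 1 class token
w_q, w_k, w_v = (torch.randn(64, 64) * 0.05 for _ in range(3))
A, SA = self_attention(z, w_q, w_k, w_v)
print(A.shape, SA.shape)                       # (65, 65) and (65, 64)
```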
Figure 1: Our proposed SATA strategy. (a) The multi-head self-attention module in Vision Transformers. Each row in A represents the attention weights corresponding to all sequences in V. (b) A closer look at how the first sequence SA[0] is obtained after applying attention, with the threshold set to 0.05. The blue part denotes larger attention weights and the purple part denotes trivial ones. There are up to 62 trivial attention weights, which together account for 0.69 of the entire attention in SA[0], compared with 0.31 from the similar sequences. (c) The distribution of attention weights. (d) The accumulated attention within each bin. (e) The result of suppressing trivial weights with our approach. Even though a single trivial attention weight contributes little, the accumulated trivial attention still dominates, which is harmful when the attention contains much noise, as in the shallow layers of some backbones.
Intuitively, this attention module is expected to have larger coefficients for sequences with higher similarity and smaller values for the less similar ones, as with the example A[0] in Figure 1(a). In this way, every sequence can be enhanced by other similar sequences. However, this only considers the individual similarities, not their accumulation. Taking a closer look at how each weighted sequence in Figure 1(a) is obtained, we find that each sequence is weighted by its attention coefficient and the results are then summed into one sequence, as shown in Figure 1(b). This is problematic when the sequence length is large and the less similar sequences are noise. When the similarities of all the less similar sequences are added up, the accumulated sum can be even greater than the largest single similarity, as in Figure 1(d), because the trivial attention coefficients are small in value but large in number. This means the accumulated trivial attention dominates the attention, which introduces much noise into the convergence of the Transformer. As a result, the trivial attention hinders the training of the Transformer on small datasets. To solve this problem rooted in Vision Transformers and make them easier to deploy on small datasets, we propose to suppress all trivial attention, and hence the accumulated trivial attention, to make the sequences with higher similarity dominant again.
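The accumulation effect can be checked numerically. Below is a minimal sketch that splits one row of an attention map at a threshold (0.05, the value used in Figure 1) and compares the accumulated trivial mass with the largest single weight; the synthetic attention row and the helper name are assumptions for illustration.

```python
import torch

def trivial_mass(attn_row, threshold=0.05):
    """Split one row of a row-stochastic attention map at `threshold`."""
    trivial = attn_row[attn_row < threshold]
    return {
        "num_trivial": trivial.numel(),       # how many trivial weights
        "trivial_sum": trivial.sum().item(),  # their accumulated attention
        "max_weight": attn_row.max().item(),  # largest single similarity
    }

torch.manual_seed(0)
row = torch.softmax(torch.randn(65), dim=0)   # a synthetic attention row A[0]
print(trivial_mass(row))
# With 65 tokens, most weights fall below the threshold, and their sum is
# typically several times the largest single weight, echoing the 0.69 vs. 0.31
# split reported in Figure 1(b).
```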
The contributions of this paper are summarized below.

• We find that the accumulated trivial attention inherently dominates the MHSA module in Vision Transformers and introduces much noise in the shallow layers. To cure this problem, we propose Suppressing Accumulated Trivial Attention (SATA), which first separates out the trivial attention and then suppresses it.

• We propose the Trivial WeIghts Suppression Transformation (TWIST) to control the accumulated trivial attention. The proposed transformation is proven to suppress the trivial weights to a fraction of the maximum attention (a hedged code sketch follows this list).

• Extensive experiments on CIFAR-100 and Tiny-ImageNet demonstrate an accuracy gain of up to 2.3% from the proposed method.
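To make the idea concrete, here is a hedged sketch of the suppression step: attention weights below a threshold are treated as trivial and scaled down before the rows are renormalized. The threshold, scaling factor, and renormalization are placeholders for illustration only; the actual TWIST transformation is defined later in the paper and is not reproduced here.

```python
import torch

def suppress_trivial(attn, threshold=0.05, scale=0.1):
    """attn: (N, N) row-stochastic attention map; returns a suppressed map."""
    trivial_mask = attn < threshold                             # trivial weights per row
    suppressed = torch.where(trivial_mask, attn * scale, attn)  # placeholder for TWIST
    return suppressed / suppressed.sum(dim=-1, keepdim=True)    # rows sum to 1 again

torch.manual_seed(0)
attn = torch.softmax(torch.randn(65, 65), dim=-1)
out = suppress_trivial(attn)
print(attn[0].max().item(), out[0].max().item())
# After suppression, the non-trivial weights dominate each row again,
# as sketched in Figure 1(e).
```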
2. Related Work
The Vision Transformer has become a powerful counterpart of the CNN in computer vision tasks since its introduction in 2020 [11], benefiting from its power to capture long-term dependencies. This ability stems from the inherent structure of ViT, including the MHSA attention module