Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets
Xiangyu Chen1, Qinghao Hu2, Kaidong Li1, Cuncong Zhong1, Guanghui Wang3∗
1Department of EECS, University of Kansas, KS, USA
2Institute of Automation, Chinese Academy of Sciences, China
3Department of CS, Toronto Metropolitan University, Toronto, ON, Canada
xychen@ku.edu, wangcs@ryerson.ca (*corresponding author)
Abstract
Vision Transformers have demonstrated competitive performance on computer vision tasks, benefiting from their ability to capture long-range dependencies with multi-head self-attention modules and multi-layer perceptrons. However, computing global attention also brings a disadvantage compared with convolutional neural networks: it requires far more data and computation to converge, which makes it difficult to generalize well on small datasets, a common situation in practical applications. Previous works either focus on transferring knowledge from large datasets or adjust the architecture for small datasets. After carefully examining the self-attention modules, we discover that the number of trivial attention weights is far greater than that of the important ones, and that, due to their sheer quantity, the accumulated trivial weights dominate the attention in Vision Transformers, which the attention mechanism itself does not handle. This obscures useful non-trivial attention and harms performance when the trivial attention contains more noise, e.g., in the shallow layers of some backbones. To solve this issue, we propose to divide attention weights into trivial and non-trivial ones by thresholds, and then to Suppress Accumulated Trivial Attention (SATA) weights with the proposed Trivial WeIghts Suppression Transformation (TWIST) to reduce attention noise. Extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets show that our suppression method boosts the accuracy of Vision Transformers by up to 2.3%.
Code is available at https://github.com/xiangyu8/SATA.
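To make the idea above concrete, the following is a minimal sketch of splitting an attention map into trivial and non-trivial weights by a threshold and shrinking the accumulated trivial mass. The threshold rule and the scaling factor `alpha` are illustrative placeholders and are not the paper's exact TWIST transformation, which is defined later in the paper.

```python
# Minimal sketch (not the exact TWIST): threshold each row of an attention
# map into trivial / non-trivial weights, suppress the trivial ones, and
# re-normalize so each row remains a valid attention distribution.
import numpy as np

def suppress_trivial_attention(attn, tau=None, alpha=0.1):
    """attn: (N, N) row-stochastic attention map for one head."""
    if tau is None:
        tau = 1.0 / attn.shape[-1]              # placeholder: uniform-attention level
    trivial = attn < tau                        # mask of trivial weights
    out = attn.copy()
    out[trivial] *= alpha                       # suppress accumulated trivial mass
    out /= out.sum(axis=-1, keepdims=True)      # re-normalize rows
    return out

attn = np.random.dirichlet(np.ones(16), size=16)        # toy 16-token attention map
print(suppress_trivial_attention(attn).sum(axis=-1))    # rows still sum to 1
```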
1. Introduction
Convolutional Neural Networks (CNNs) have dominated
computer vision tasks for the past decade, especially with
the emergence of ResNet [16]. The convolution operation, the
core of a CNN, takes all the pixels in its receptive field as
input and outputs one value. When the layers go deep, the
stacked locality becomes non-local, as the receptive field of
each layer is built on the convolution results of the previous
layer. The advantage of convolution is its power to extract
local features, which enables fast convergence and makes it
a good fit for data-efficient tasks in particular. Different
from CNNs, Vision Transformer (ViT) [11] and its variants
[6, 10, 12, 27, 30, 33] consider the similarities between
each image patch embedding and all other patch embeddings.
This global attention boosts their potential for feature
extraction; however, it requires a large amount of data to
train the model, which limits their application to small datasets.
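The sketch below illustrates the global attention just described: every patch embedding is compared with every other patch embedding via a single-head scaled dot-product, unlike a convolution, which only aggregates over a local receptive field. The shapes and random projections are assumptions chosen purely for illustration.

```python
# Minimal single-head global attention over patch embeddings (illustrative).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d = 196, 64                            # e.g. 14x14 patches, 64-dim embeddings
X = np.random.randn(N, d)                 # patch embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))      # (N, N): each patch attends to all patches
out = attn @ V                            # globally aggregated features
```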
On the one hand, CNNs have demonstrated superior performance
to ViT in terms of accuracy, computation, and convergence
speed on data-efficient tasks, e.g., ResNet-50 for image
classification [2, 3, 25, 23] and object detection [43],
and ResNet-12 for few-shot learning [5]. However, improving
their performance amounts to finding more inductive biases
to include, which is tedious. Their local attention also sets
a lower performance ceiling by eliminating much of the
necessary non-local attention, in contrast to Vision
Transformers. On the other hand, the stronger feature
extraction ability of Vision Transformers can make up for
the lack of data on small datasets. As a result, Vision
Transformers offer a promising direction for these tasks.
To adapt Vision Transformers to data-efficient tasks,
some researchers focus on transfer learning [20, 35, 39],
semi-supervised learning, and unsupervised learning to
leverage large datasets. Others are dedicated to self-supervised
learning or other modalities to exploit the inherent structural
information of the images themselves [4]. For supervised
learning, one path is to integrate convolution operations into
Vision Transformers to increase their locality. Another approach
is to improve efficiency by revising the structure of Vision
Transformers themselves [31]. The proposed method belongs
to the second category.
The main transformer block consists of a multi-head self-attention
(MHSA) module and a multi-layer perceptron (MLP) layer,
along with layer normalization, where the MHSA module is key
to enriching each sequence by including long-range dependencies
with all other sequences, i.e., attention. Intuitively, this
attention module is expected