
SAIT: SPARSE VISION TRANSFORMERS THROUGH
ADAPTIVE TOKEN PRUNING
Ling Li, David Thorsley, and Joseph Hassoun
Samsung Semiconductor Inc.
{ling.li, d.thorsley, j.hassoun}@samsung.com
ABSTRACT
While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performance. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities. Thus, a single model offers a range of accuracy and throughput tradeoffs for different applications. Furthermore, we introduce adaptive token pruning to optimize the patch token sparsity based on the input image. In addition, we investigate knowledge distillation to enhance the token selection capability of early transformer modules. The Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces the computational complexity (FLOPs) by 39%-43% and increases the throughput by 67%-91% with less than 0.5% accuracy loss for various vision transformer models. Meanwhile, the same model also provides a zero-accuracy-drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models.
1 INTRODUCTION
Even though Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Tan
& Le, 2019) have fueled rapid development in the computer vision field, emerging studies on vision
transformers show encouraging results, with some surpassing CNNs in a wide range of tasks such as
classification (Dosovitskiy et al., 2021; Touvron et al., 2021a; Liu et al., 2021a), semantic segmentation (Cheng et al., 2021; Zheng et al., 2021), and object detection (Carion et al., 2020; Li et al.,
2022). To improve model efficiency, especially on edge devices, model compression techniques
such as pruning (Han et al., 2015), quantization (Gong et al., 2014), and knowledge distillation
(Hinton et al., 2015) have been widely used in CNNs. However, model acceleration and compression for vision transformers remain relatively underexplored. Additionally, these typical compression techniques, which usually incur some accuracy loss, are not ideal for accuracy-sensitive applications.
For efficient and hardware-friendly model acceleration, we leverage the intrinsic structure of vision
transformers, where input images are transformed into patch tokens before further processing. Patch
tokens allow the vision transformer to process the entire image; however, the computation for the
all-to-all attention is expensive. Token pruning is an effective approach to reduce computation and
save memory. The essential number of patch tokens varies with the input image, since background patches often contribute little to correct classification: 'easy' inputs require fewer patches, while 'difficult' inputs need more. Figure 1 shows that for 'easy' inputs, such as the bird image on the upper left, around 20% of the patches are sufficient for detection, whereas 'difficult' inputs, such as the fishes on the lower right, require a 53% token density. To save more computation based on the specifics of each input image, we propose an adaptive token pruning strategy
that dynamically adjusts the number of preserved tokens. This approach evaluates the importance
(in probability) of each token based on the attention scores of the early transformer modules. Instead
of selecting a fixed number of tokens, we accumulate a varying number of the most important tokens
up to a probability mass threshold. As a result, this approach introduces no extra parameters and only negligible computational overhead, and thus its efficiency compares favorably with some prior works (Wang et al., 2021b;
Rao et al., 2021).
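To make the accumulation step concrete, the sketch below shows one way such a cumulative probability-mass selection could be implemented in PyTorch. It assumes the importance score is the class-token attention averaged over heads, and the function and argument names (adaptive_token_select, threshold) are illustrative rather than taken from the SaiT implementation.

```python
import torch

def adaptive_token_select(attn, threshold=0.9):
    # attn: attention weights from an early transformer block,
    #       shape (batch, heads, tokens, tokens); token 0 is the class token.
    # threshold: fraction of the total importance mass to retain (assumed name).

    # Importance of each patch token: class-token attention averaged over heads
    # (an assumption for this sketch; SaiT derives importance from early-layer
    # attention scores).
    importance = attn[:, :, 0, 1:].mean(dim=1)                 # (batch, num_patches)
    importance = importance / importance.sum(dim=-1, keepdim=True)

    # Sort tokens by importance and accumulate probability mass.
    sorted_scores, order = importance.sort(dim=-1, descending=True)
    cum_mass = sorted_scores.cumsum(dim=-1)

    # Keep the smallest prefix of sorted tokens whose mass reaches the threshold:
    # a token is kept if the mass accumulated before it is still below the threshold,
    # so at least one token is always preserved.
    keep_sorted = (cum_mass - sorted_scores) < threshold

    # Scatter the decision back to the original token order.
    keep_mask = torch.zeros_like(keep_sorted)
    keep_mask.scatter_(dim=1, index=order, src=keep_sorted)
    return keep_mask                                           # (batch, num_patches) bool
```

Because the number of kept tokens depends on how the importance mass is distributed, the mask naturally preserves fewer tokens for 'easy' images with peaked attention and more for 'difficult' images with spread-out attention.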