SAIT: SPARSE VISION TRANSFORMERS THROUGH
ADAPTIVE TOKEN PRUNING
Ling Li, David Thorsley, and Joseph Hassoun
Samsung Semiconductor Inc.
{ling.li, d.thorsley, j.hassoun}@samsung.com
ABSTRACT
While vision transformers have achieved impressive results, accelerating these models effectively and efficiently can further boost their performance. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities. Thus, one model offers a range of accuracy and throughput tradeoffs for different applications. We also introduce adaptive token pruning to optimize the patch token sparsity based on the input image. In addition, we investigate knowledge distillation to enhance the token selection capability of early transformer modules. The Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces the computation complexity (FLOPs) by 39%-43% and increases the throughput by 67%-91% with less than 0.5% accuracy loss for various vision transformer models. Meanwhile, the same model also provides a zero-accuracy-drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models.
1 INTRODUCTION
Even though Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Tan
& Le, 2019) have fueled rapid development in the computer vision field, emerging studies on vision
transformers show encouraging results, with some surpassing CNNs in a wide range of tasks such as
classification (Dosovitskiy et al., 2021; Touvron et al., 2021a; Liu et al., 2021a), semantic segmen-
tation (Cheng et al., 2021; Zheng et al., 2021) , and object detection (Carion et al., 2020; Li et al.,
2022). To improve model efficiency, especially on edge devices, model compression techniques
such as pruning (Han et al., 2015), quantization (Gong et al., 2014), and knowledge distillation
(Hinton et al., 2015) have been widely used in CNNs. However, model acceleration/compression
in vision transformers is still less explored. Additionally, these typical compression techniques –
which usually lead to some accuracy loss – are not ideal for accuracy-sensitive applications.
For efficient and hardware-friendly model acceleration, we leverage the intrinsic structure of vision
transformers, where input images are transformed into patch tokens before further processing. Patch
tokens allow the vision transformer to process the entire image; however, the computation for the
all-to-all attention is expensive. Token pruning is an effective approach to reduce computation and
save memory. The essential number of patch tokens varies depending on the input image, since
background patches often contribute little to correct classification. Some 'easy' inputs require fewer patches, while 'difficult' inputs need more. Figure 1 shows that for 'easy' inputs, such as the bird image on the upper left, around 20% of the patches are sufficient for detection, whereas 'difficult' inputs, such as the fishes on the lower right, require 53% token density. To save more
computation based on the specifics of input images, we propose an adaptive token pruning strategy
that dynamically adjusts the number of preserved tokens. This approach evaluates the importance of each token, expressed as a probability, based on the attention scores of the early transformer modules. Instead of selecting a fixed number of tokens, we accumulate a varying number of the most important tokens until their probability mass reaches a threshold. This introduces no extra parameters and negligible computation, so its efficiency compares favorably with prior works (Wang et al., 2021b; Rao et al., 2021).
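To make the selection rule concrete, the following is a minimal PyTorch sketch of this cumulative-probability selection; it is an illustrative reconstruction rather than the authors' implementation, and the helper name `select_tokens` and the example threshold value are our own choices.

```python
# Minimal sketch of adaptive token selection by probability mass (illustrative,
# not the authors' code). Given per-image token importance probabilities, keep
# the smallest set of most-important tokens whose mass reaches the threshold.
import torch


def select_tokens(probs: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """probs: (batch, num_tokens), non-negative and summing to 1 per image.
    Returns a boolean mask of the same shape marking the kept tokens."""
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep a token if the mass accumulated *before* it is still below the
    # threshold; the most important token is therefore always kept.
    keep_sorted = torch.cat(
        [torch.ones_like(cumulative[:, :1], dtype=torch.bool),
         cumulative[:, :-1] < threshold],
        dim=-1,
    )
    mask = torch.zeros_like(keep_sorted)
    mask.scatter_(dim=-1, index=order, src=keep_sorted)  # undo the sort
    return mask


# Example: two images with 6 patch tokens each keep different token counts.
probs = torch.tensor([[0.40, 0.30, 0.15, 0.08, 0.05, 0.02],
                      [0.20, 0.20, 0.18, 0.16, 0.14, 0.12]])
print(select_tokens(probs).sum(dim=-1))  # tensor([4, 6])
```

Because the kept-token count follows the shape of the probability distribution, a peaked distribution (an 'easy' image) keeps few tokens while a flat one (a 'difficult' image) keeps many, with no learned parameters involved.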
In addition, we formulate a dense/sparse training framework to obtain a unified model which can
flexibly adjust the tradeoff between accuracy and throughput on demand. Different computation
savings are achieved by merely modifying the token density in the later transformer modules. Because the same weights are shared throughout the transformer model, fully dense patch tokens preserve accuracy but provide no acceleration, while sparse tokens offer varying levels of acceleration in return for some accuracy drop. Therefore, different applications can share the same model and weights regardless of
accuracy and throughput requirements. Consequently, this approach saves training cost and memory
footprint by training and storing a single unified model instead of a series of different models.
To demonstrate the effectiveness of this approach, we deploy our proposed training framework and
sparsification schemes on top of DeiT (Touvron et al., 2021a) and LVViT (Jiang et al., 2021). The
resulting unified model, Sparse adaptive image Transformer (SaiT), offers different levels of sparsi-
fication and achieves 39%-43% floating point operation (FLOP) reduction and 74%-91% throughput
gain with less than 0.5% accuracy loss. In summary, we present three major contributions of this
work: 1) We formulate a dense/sparse training framework to obtain a unified model offering a range
of accuracy and throughput tradeoffs; 2) We propose an adaptive pruning strategy to flexibly and
dynamically adjust the token sparsity based on the input images; 3) We introduce knowledge distil-
lation to improve the accuracy of early transformer modules in learning the token importance.
Figure 1: Visualization of adaptive pruning based on results from SaiT-S. Each original image is shown next to its sparsification result. Depending on the difficulty of the input, the density of essential patch tokens varies from 21% to 53%.
2 RELATED WORK
Vision Transformers ViT (Dosovitskiy et al., 2021) is the first pure transformer-based model for image classification with results competitive with CNNs. However, training ViT requires a large private dataset, JFT-300M (Sun et al., 2017). To address this issue, DeiT (Touvron et al., 2021a) de-
velops a training schedule with data augmentation, regularization, and knowledge distillation. Many
subsequent variants (Touvron et al., 2021b; Liu et al., 2021a; Chu et al., 2021; Jiang et al., 2021) of
ViT and DeiT achieve promising performances, with some even surpassing CNN counterparts (He
et al., 2016; Tan & Le, 2019; Radosavovic et al., 2020). Moreover, self-supervised vision trans-
formers, such as DINO (Caron et al., 2021) and MAE (He et al., 2021), not only achieve impressive
classification accuracy but also obtain useful feature representations for downstream tasks such as
object detection and segmentation.
Efficient Vision Transformers To improve model efficiency and save computation, Wu et al.
(2021b) and Jaegle et al. (2021) introduce new attention modules, while other works (Li et al.,
2021b; Srinivas et al., 2021; Graham et al., 2021; Li et al., 2021a; Wu et al., 2021a; Xu et al.,
2021; Guo et al., 2021) incorporate some convolutional layers into vision transformers. Following
conventional model compression approaches, Liu et al. (2021b) apply post-training quantization in
vision transformers, Chen et al. (2021b) study the sparsity via sparse subnetwork training, and Rao
et al. (2021) implement a lightweight prediction module for hierarchical token pruning. Wang et al.
(2021b) dynamically adjust the number of patch tokens by cascading multiple transformers with confidence estimation modules. Liang et al. (2022) exploit the classification token to select attentive tokens without introducing extra parameters and fuse inattentive tokens through multiple
stages for efficient inference.
Compared to prior works, our token pruning strategy is more efficient by leveraging the attention
scores for one-stage adaptive pruning, with no extra parameters and negligible computation. Be-
sides, the unified model from our dense/sparse training framework offers a range of accuracy and
throughput tradeoffs for different applications.
3 SAIT
The proposed dense/sparse training framework and adaptive token pruning apply to general vision
transformer architectures. To illustrate the ideas, we use DeiT (Touvron et al., 2021a) and LVViT
(Jiang et al., 2021) as examples in this work. Both DeiT and LVViT use an embedding module to convert an input image into N patch tokens. These N patch tokens, along with the classification token (CLS), go through a series of transformer modules/layers. The feature representation from
the last transformer layer is used for the final classification. The key to the proposed approach
is to enable early layers to effectively capture the importance of each token, in order to reduce
computation in the later layers with sparse tokens.
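The sketch below illustrates this layout in PyTorch, reusing the `select_tokens` helper sketched earlier. It is a simplified stand-in, not the SaiT implementation: the block design, the choice of CLS-to-patch attention as the importance score, and the single-image forward pass (batched inference would need padding, since each image keeps a different number of tokens) are all our own simplifications.

```python
# Simplified skeleton (illustrative assumptions throughout): early blocks run on
# all tokens, token importance is read from the attention at the pruning
# location, and only the selected tokens pass through the later blocks.
import torch
import torch.nn as nn


class Block(nn.Module):
    """Pre-norm transformer block that also returns its attention weights."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, attn_w = self.attn(h, h, h, need_weights=True)  # head-averaged
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_w  # attn_w: (batch, tokens, tokens)


class SparseViTSketch(nn.Module):
    def __init__(self, dim=192, heads=3, depth=12, prune_at=4, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
        self.prune_at = prune_at          # index of the pruning location l_P
        self.head = nn.Linear(dim, num_classes)

    def forward(self, image, threshold=None):
        """image: (1, 3, H, W); threshold=None skips sparsification entirely."""
        tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (1, N, dim)
        x = torch.cat([self.cls_token, tokens], dim=1)               # prepend CLS
        for i, block in enumerate(self.blocks):
            x, attn = block(x)
            if threshold is not None and i == self.prune_at:
                # One simple importance score: attention paid by CLS to each
                # patch token, renormalized to a probability distribution.
                tis = attn[:, 0, 1:]
                tis = tis / tis.sum(dim=-1, keepdim=True)
                keep = select_tokens(tis, threshold)[0].nonzero().squeeze(-1)
                x = torch.cat([x[:, :1], x[:, 1:][:, keep]], dim=1)
        return self.head(x[:, 0])  # classify from the CLS token
```

For example, `SparseViTSketch()(torch.randn(1, 3, 224, 224), threshold=0.7)` runs the sparse path, while passing `threshold=None` keeps all N tokens and reproduces the dense model with the same weights.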
3.1 DENSE/SPARSE TRAINING FRAMEWORK
Figure 2: An overview of the dense/sparse training framework used in SaiT. Early layers (blue) learn
the importance of each token. The attention scores at the pruning location are used to extract token importance scores (TIS) and the token mask. Later layers are trained alternately with fully dense tokens (green) and sparse tokens (orange). Optionally, knowledge distillation from a teacher model enhances the TIS learning ability of the early layers.
The overview of the dense/sparse training framework is in Figure 2 and Algorithm 1. Given an
architecture with L total transformer layers, the early layers (l_0 to l_{P-1}) learn to identify the importance of each patch token. At the designated pruning location (l_P), token importance scores (TIS) are extracted based on the attention scores, which are used for token selection and knowledge distillation. Later layers (l_{P+1} to l_{L-1}) are trained on alternate epochs with N fully dense patch tokens (without pruning) and N′ sparse patch tokens (after pruning).
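A hedged sketch of this alternating schedule is given below; the function and argument names (`train_dense_sparse`, the epoch-parity rule, the example threshold) are our own placeholders, and the model is assumed to expose an optional pruning threshold as in the earlier skeleton.

```python
# Illustrative alternating dense/sparse schedule (placeholder names, not the
# authors' training code): even epochs keep all N tokens, odd epochs train the
# later layers on the pruned token set, with one shared set of weights.
import torch
import torch.nn.functional as F


def train_dense_sparse(model, loader, optimizer, epochs, threshold=0.7):
    model.train()
    for epoch in range(epochs):
        prune = (epoch % 2 == 1)  # alternate dense and sparse epochs
        for images, labels in loader:
            logits = model(images, threshold=threshold if prune else None)
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because the transformer weights do not depend on the token count, both passes update the same parameters, which is what lets a single trained model serve either the dense or the sparse path at inference time.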
Dense/sparse alternate training This training schedule enables weight sharing between fully dense
(unpruned) and sparse (pruned) patch tokens at layers l_{P+1} to l_{L-1}, since the weights of the transformer blocks are independent of the number of patch tokens. Moreover, it improves the accuracy of the later layers when processing sparse tokens, as shown in the ablation study (Section 4.5). In addition, training with this framework preserves the model accuracy when the sparsification step is skipped. This is different from