
SAIT: SPARSE VISION TRANSFORMERS THROUGH
ADAPTIVE TOKEN PRUNING
Ling Li, David Thorsley, and Joseph Hassoun
Samsung Semiconductor Inc.
{ling.li, d.thorsley, j.hassoun}@samsung.com
ABSTRACT
While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performance. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities. Thus, a single model offers a range of accuracy and throughput tradeoffs for different applications. Furthermore, we introduce adaptive token pruning to optimize the patch token sparsity based on the input image. In addition, we investigate knowledge distillation to enhance the token selection capability of early transformer modules. The Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces the computational complexity (FLOPs) by 39%-43% and increases the throughput by 67%-91% with less than 0.5% accuracy loss for various vision transformer models. Meanwhile, the same model also provides a zero-accuracy-drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models.
1 INTRODUCTION
Even though Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; He et al., 2016; Tan
& Le, 2019) have fueled rapid development in the computer vision field, emerging studies on vision
transformers show encouraging results, with some surpassing CNNs in a wide range of tasks such as
classification (Dosovitskiy et al., 2021; Touvron et al., 2021a; Liu et al., 2021a), semantic segmentation (Cheng et al., 2021; Zheng et al., 2021), and object detection (Carion et al., 2020; Li et al.,
2022). To improve model efficiency, especially on edge devices, model compression techniques
such as pruning (Han et al., 2015), quantization (Gong et al., 2014), and knowledge distillation
(Hinton et al., 2015) have been widely used in CNNs. However, model acceleration and compression for vision transformers remain relatively underexplored. Additionally, these typical compression techniques, which usually incur some accuracy loss, are not ideal for accuracy-sensitive applications.
For efficient and hardware-friendly model acceleration, we leverage the intrinsic structure of vision
transformers, where input images are transformed into patch tokens before further processing. Patch
tokens allow the vision transformer to process the entire image; however, the computation for the
all-to-all attention is expensive. Token pruning is an effective approach to reduce computation and
save memory. The essential number of patch tokens varies with the input image, since background patches often contribute little to correct classification: 'easy' inputs require fewer patches, while 'difficult' inputs need more. Figure 1 shows that for 'easy' inputs, such as the bird image on the upper left, around 20% of the patches are sufficient for detection, whereas 'difficult' inputs, such as the fishes on the lower right, require a 53% token density. To save more computation based on the specifics of each input image, we propose an adaptive token pruning strategy
that dynamically adjusts the number of preserved tokens. This approach evaluates the importance
(in probability) of each token based on the attention scores of the early transformer modules. Instead
of selecting a fixed number of tokens, we accumulate a varying number of the most important tokens
up to a probability mass threshold. As a result, this approach introduces no extra parameters and only negligible computational overhead, and thus its efficiency compares favorably with some prior works (Wang et al., 2021b;
Rao et al., 2021).
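To make the accumulation step concrete, the sketch below shows one way such a cumulative probability-mass selection could be implemented in PyTorch. It assumes the importance score is the class-token attention averaged over heads, and the function and argument names (adaptive_token_select, threshold) are illustrative rather than taken from the SaiT implementation.

```python
import torch

def adaptive_token_select(attn, threshold=0.9):
    # attn: attention weights from an early transformer block,
    #       shape (batch, heads, tokens, tokens); token 0 is the class token.
    # threshold: fraction of the total importance mass to retain (assumed name).

    # Importance of each patch token: class-token attention averaged over heads
    # (an assumption for this sketch; SaiT derives importance from early-layer
    # attention scores).
    importance = attn[:, :, 0, 1:].mean(dim=1)                 # (batch, num_patches)
    importance = importance / importance.sum(dim=-1, keepdim=True)

    # Sort tokens by importance and accumulate probability mass.
    sorted_scores, order = importance.sort(dim=-1, descending=True)
    cum_mass = sorted_scores.cumsum(dim=-1)

    # Keep the smallest prefix of sorted tokens whose mass reaches the threshold:
    # a token is kept if the mass accumulated before it is still below the threshold,
    # so at least one token is always preserved.
    keep_sorted = (cum_mass - sorted_scores) < threshold

    # Scatter the decision back to the original token order.
    keep_mask = torch.zeros_like(keep_sorted)
    keep_mask.scatter_(dim=1, index=order, src=keep_sorted)
    return keep_mask                                           # (batch, num_patches) bool
```

Because the number of kept tokens depends on how the importance mass is distributed, the mask naturally preserves fewer tokens for 'easy' images with peaked attention and more for 'difficult' images with spread-out attention.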