vision transformers have dominated various leaderboards, including but not limited to image classification [74, 14, 17, 84], object detection [86, 46, 38], semantic segmentation [32, 11, 3], pose estimation [77], image generation [85], and depth estimation [42].
Although vision transformers have achieved more accurate predictions on many vision tasks, large-scale vision transformers still suffer from heavy computational overhead, particularly when processing high-resolution inputs [20, 46]. This limits their deployment in resource-constrained applications and has motivated efforts to re-design lightweight vision transformer architectures [10, 51, 87]. In addition, several recent efforts have investigated how to decrease model complexity and accelerate vision transformers, especially for image classification, introducing various advanced acceleration approaches. Dynamic ViT [55] and EViT [44], for example, propose two different dynamic token sparsification frameworks that progressively reduce redundant tokens, selecting the most informative tokens either according to the scores predicted by an extra trained prediction module or according to their relevance to the [class] token. TokenLearner [58] learns to spatially attend over a subset of tokens and generates a set of clustered tokens adaptive to the input for video understanding tasks. Most of these token reduction approaches are carefully designed for image classification tasks and require fine-tuning or retraining. They may therefore be unsuitable for the more challenging dense prediction tasks that need to process high-resolution input images, e.g., 1024 × 1024, which incur heavy computation and GPU memory costs. We also demonstrate in the supplemental material the superiority of our method over several representative methods on dense prediction tasks.
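To illustrate the general flavor of such score-based token sparsification (a simplified sketch rather than the exact Dynamic ViT or EViT implementation), the following PyTorch snippet keeps the top-k image tokens ranked by the attention they receive from the [class] token; the keep ratio and the function name are illustrative assumptions.

```python
import torch

def select_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.5):
    """Keep the image tokens that receive the highest [class]-token attention.

    tokens:   (B, N, C) image token features (excluding the [class] token).
    cls_attn: (B, N) attention weights from the [class] token, averaged over heads.
    """
    B, N, C = tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # indices of the most informative tokens according to the [class] token
    topk = cls_attn.topk(num_keep, dim=1).indices              # (B, num_keep)
    return tokens.gather(1, topk.unsqueeze(-1).expand(-1, -1, C))
```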
Rather than proposing a new lightweight architecture for dense prediction or a token reduction scheme for image classification only, we focus on how to expedite well-trained large-scale vision transformers and use them for various dense prediction tasks without fine-tuning or re-training. Motivated by two key observations, namely that (i) the intermediate token representations of a well-trained vision transformer carry a heavy amount of local spatial redundancy and (ii) dense prediction tasks require high-resolution representations, we propose a simple yet effective scheme to convert the “high-resolution” path of the vision transformer into a “high-to-low-to-high resolution” path via two non-parametric layers: a token clustering layer and a token reconstruction layer. Our method can produce a wide range of more efficient models without requiring further fine-tuning or re-training. We apply our approach to expedite two mainstream vision transformer architectures, i.e., ViTs and Swin Transformers, on five challenging dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation. We achieve encouraging results across several evaluated benchmarks, and Figure 1 illustrates representative results on both semantic segmentation and depth estimation tasks.
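As a rough illustration of this high-to-low-to-high idea (a minimal sketch, not the exact token clustering and token reconstruction layers evaluated in our experiments), the PyTorch snippet below clusters the high-resolution tokens with a few naive k-means iterations before the remaining transformer blocks, and later reconstructs high-resolution tokens by similarity-weighted interpolation against the refined cluster tokens; the uniform initialization and the softmax temperature `tau` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cluster_tokens(tokens, num_clusters, iters=5):
    """Non-parametric token clustering via a few naive k-means steps.

    tokens: (N, C) high-resolution token features of a single image.
    Returns the (K, C) cluster tokens and the (N,) hard assignments.
    """
    N, _ = tokens.shape
    # initialize cluster centers by uniformly subsampling the tokens
    init_idx = torch.linspace(0, N - 1, num_clusters).long()
    centers = tokens[init_idx].clone()
    for _ in range(iters):
        dist = torch.cdist(tokens, centers)        # (N, K) pairwise distances
        assign = dist.argmin(dim=1)                # (N,) nearest-center assignment
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():                         # leave empty clusters unchanged
                centers[k] = tokens[mask].mean(dim=0)
    return centers, assign

def reconstruct_tokens(orig_tokens, refined_centers, tau=0.1):
    """Non-parametric token reconstruction: recover high-resolution tokens by
    similarity-weighted interpolation of the refined low-resolution tokens.

    orig_tokens:     (N, C) tokens saved before clustering (used as queries).
    refined_centers: (K, C) cluster tokens after the remaining transformer blocks.
    """
    sim = orig_tokens @ refined_centers.t()        # (N, K) dot-product similarity
    weights = F.softmax(sim / tau, dim=1)
    return weights @ refined_centers               # (N, C) reconstructed tokens
```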
2 Related work
Pruning Convolutional Neural Networks.
Convolutional neural network pruning [2, 30, 71] is a task that involves removing redundant parameters to reduce model complexity without a significant performance drop. Pruning methods typically entail three steps: (i) training a large, over-parameterized model to convergence, (ii) pruning the trained model according to a certain criterion, and (iii) fine-tuning the pruned model to regain the lost performance [49]. The key idea is to design an importance score function that is capable of identifying the less informative parameters. We follow [7] and categorize existing methods into two main paths: (i) unstructured pruning (also named weight pruning) and (ii) structured pruning. Unstructured pruning methods use the absolute value of each weight, or the product of each weight and its gradient, to estimate the importance scores. Structured pruning methods, such as layer-level pruning [72], filter-level pruning [48, 79], and image-level pruning [25, 65], remove model sub-structures. Recent studies [5, 6, 31] further extend these pruning methods to vision transformers. Unlike previous pruning methods, we explore how to expedite vision transformers for dense prediction tasks by carefully reducing and increasing the number of tokens without removing or modifying the parameters.
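As a toy illustration of the magnitude-based criterion used by many unstructured pruning methods (independent of any particular work cited above), the sketch below zeroes out the fraction of weights in a linear layer with the smallest absolute values; the sparsity level is an illustrative assumption.

```python
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of weights with the smallest |w|, in place."""
    with torch.no_grad():
        w = linear.weight
        k = int(w.numel() * sparsity)
        if k == 0:
            return
        # the k-th smallest absolute value serves as the pruning threshold
        threshold = w.abs().flatten().kthvalue(k).values
        mask = (w.abs() > threshold).to(w.dtype)
        w.mul_(mask)

layer = nn.Linear(768, 768)
magnitude_prune_(layer, sparsity=0.5)  # roughly half of the weights become zero
```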
Efficient Vision Transformer.
The success of vision transformers has incentivised many recent efforts [58, 62, 50, 29, 73, 55, 37, 24, 69, 9, 35, 44, 57] to exploit the spatial redundancies of intermediate token representations. For example, TokenLearner [58] learns to attend over a subset of tokens and generates a set of clustered tokens adaptive to the input. They empirically show that very few clustered tokens are sufficient for video understanding tasks. Token Pooling [50] exploits a nonuniform data-aware down-sampling operator based on K-Means or K-medoids to cluster similar