Expediting Large-Scale Vision Transformer
for Dense Prediction without Fine-tuning
Weicong Liang1, Yuhui Yuan4, Henghui Ding3, Xiao Luo2,
Weihong Lin4, Ding Jia1, Zheng Zhang4, Chao Zhang1, Han Hu4

1 Key Laboratory of Machine Perception (MOE), School of Intelligence Science and Technology, Peking University
2 School of Mathematical Sciences, Peking University
3 ETH Zurich
4 Microsoft Research Asia
Abstract
Vision transformers have recently achieved competitive results across various
vision tasks but still suffer from heavy computation costs when processing a
large number of tokens. Many advanced approaches have been developed to
reduce the total number of tokens in large-scale vision transformers, especially for
image classification tasks. Typically, they select a small group of essential tokens
according to their relevance with the [class] token, then fine-tune the weights of
the vision transformer. Such fine-tuning is less practical for dense prediction due
to the much heavier computation and GPU memory cost than image classification.
In this paper, we focus on a more challenging problem, i.e., accelerating large-scale
vision transformers for dense prediction without any additional re-training or fine-
tuning. In response to the fact that high-resolution representations are necessary
for dense prediction, we present two non-parametric operators, a token clustering
layer to decrease the number of tokens and a token reconstruction layer to increase
the number of tokens. The following steps are performed to achieve this: (i) we
use the token clustering layer to cluster the neighboring tokens together, resulting
in low-resolution representations that maintain the spatial structures; (ii) we apply
the following transformer layers only to these low-resolution representations or
clustered tokens; and (iii) we use the token reconstruction layer to re-create the
high-resolution representations from the refined low-resolution representations.
The results obtained by our method are promising on five dense prediction tasks,
including object detection, semantic segmentation, panoptic segmentation, instance
segmentation, and depth estimation. Accordingly, our method accelerates the FPS of "Segmenter+ViT-L/16" by 40% and saves 30% of its GFLOPs while maintaining 99.5% of the performance on ADE20K without fine-tuning the official weights.
1 Introduction
Transformer [67] has made significant progress across various challenging vision tasks since pioneering efforts such as DETR [4], Vision Transformer (ViT) [18], and Swin Transformer [47]. By removing the local inductive bias [19] from convolutional neural networks [28, 64, 60], vision transformers armed with global self-attention show superiority in scalability for large-scale models and billion-scale datasets [18, 84, 61], self-supervised learning [27, 76, 1], connecting vision and language [53, 34], etc.
Equal contribution. ✉ yuhui.yuan@microsoft.com
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
We can find from recent developments of SOTA approaches that vision transformers have dominated various leader-boards, including but not limited to image classification [74, 14, 17, 84], object detection [86, 46, 38], semantic segmentation [32, 11, 3], pose estimation [77], image generation [85], and depth estimation [42].
Although vision transformers have achieved more accurate predictions in many vision tasks, large-scale vision transformers are still burdened with heavy computational overhead, particularly when processing high-resolution inputs [20, 46]. This limits their deployment in resource-constrained settings and has attracted efforts on re-designing light-weight vision transformer architectures [10, 51, 87]. In addition, several recent efforts have investigated how to decrease the model complexity and accelerate vision transformers, especially for image classification, and have introduced various advanced approaches. Dynamic ViT [55] and EViT [44], for example, propose two different dynamic token sparsification frameworks that progressively reduce the redundant tokens and select the most informative ones according to the scores predicted by an extra trained prediction module or to their relevance with the [class] token. TokenLearner [58] learns to spatially attend over a subset of tokens and generates a set of clustered tokens adaptive to the input for video understanding tasks. Most of these token reduction approaches are carefully designed for image classification and require fine-tuning or retraining. They might not be suitable for more challenging dense prediction tasks, which need to process high-resolution input images, e.g., 1024×1024, and therefore incur heavy computation and GPU memory costs. We also demonstrate in the supplemental material the superiority of our method over several representative methods on dense prediction tasks.
Rather than proposing a new lightweight architecture for dense prediction or a token reduction scheme restricted to image classification, we focus on how to expedite well-trained large-scale vision transformers and use them for various dense prediction tasks without fine-tuning or re-training. Motivated by two key observations, namely that (i) the intermediate token representations of a well-trained vision transformer carry a heavy amount of local spatial redundancy and (ii) dense prediction tasks require high-resolution representations, we propose a simple yet effective scheme to convert the "high-resolution" path of the vision transformer into a "high-to-low-to-high resolution" path via two non-parametric layers: a token clustering layer and a token reconstruction layer. Our method can produce a wide range of more efficient models without requiring further fine-tuning or re-training. We apply our approach to expedite two mainstream vision transformer architectures, i.e., ViTs and Swin Transformers, on five challenging dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation. We have achieved encouraging results across several evaluated benchmarks, and Figure 1 illustrates some representative results on both semantic segmentation and depth estimation tasks.
2 Related work
Pruning Convolutional Neural Networks.
Convolutional neural network pruning [2, 30, 71] involves removing redundant parameters to reduce the model complexity without a significant performance drop. Pruning methods typically entail three steps: (i) training a large, over-parameterized model to convergence, (ii) pruning the trained large model according to a certain criterion, and (iii) fine-tuning the pruned model to regain the lost performance [49]. The key idea is to design an importance score function that is capable of pruning the less informative parameters. We follow [7] to categorize the existing methods into two main paths: (i) unstructured pruning (also named weight pruning) and (ii) structured pruning. Unstructured pruning methods explore the absolute value of each weight, or the product of each weight and its gradient, to estimate the importance scores. Structured pruning methods, such as layer-level pruning [72], filter-level pruning [48, 79], and image-level pruning [25, 65], remove model sub-structures. Recent studies [5, 6, 31] further extend these pruning methods to vision transformers. Unlike the previous pruning methods, we explore how to expedite vision transformers for dense prediction tasks by carefully reducing and increasing the number of tokens without removing or modifying the parameters.
Efficient Vision Transformer.
The success of vision transformers has incentivised many recent efforts [58, 62, 50, 29, 73, 55, 37, 24, 69, 9, 35, 44, 57] to exploit the spatial redundancies of intermediate token representations. For example, TokenLearner [58] learns to attend over a subset of tokens and generates a set of clustered tokens adaptive to the input. They empirically show that very few clustered tokens are sufficient for video understanding tasks. Token Pooling [50] exploits a nonuniform data-aware down-sampling operator based on K-Means or K-medoids to cluster similar tokens together to reduce the number of tokens while minimizing the reconstruction error.
Figure 1: Illustrating the improvements of our approach: we report the results of applying our approach to Segmenter [63] for semantic segmentation (panels (a)–(c): mIoU, FPS, and GFLOPs on ADE20K, PASCAL-Context, and Cityscapes) and to DPT [54] for depth estimation (panels (d)–(f): RMSE, FPS, and GFLOPs on KITTI and NYUv2) in the first and second rows, respectively. Without any fine-tuning, our proposed method reduces the GFLOPs and accelerates the FPS significantly with a slight performance drop on both dense prediction tasks. ↑ and ↓ denote higher is better and lower is better, respectively. Refer to Section 4 for more details.
Dynamic ViT [55] observes that accurate image recognition with vision transformers mainly depends on a subset of the most informative tokens, and hence develops a dynamic token sparsification framework for pruning redundant tokens dynamically based on the input. EViT (expediting vision transformers) [44] proposes to calculate the attentiveness of the [class] token with respect to each token and to identify the top-k attentive tokens according to the attentiveness score. Patch Merger [56] uses a learnable attention matrix to merge and combine the redundant tokens, therefore creating a much more practical and cheaper model with only a slight performance drop. Refer to [66] for more details on efficient transformer architecture designs, such as Performer [12] and Reformer [36]. In contrast to these methods, which require either retraining the modified transformer architectures from scratch or fine-tuning the pre-trained weights, our approach can reuse the once-trained weights for free and produce lightweight models with a modest performance drop.
Vision Transformer for Dense Prediction.
In the wake of the success of the representative pyramid vision transformers [47, 70] for object detection and semantic segmentation, more and more efforts have explored different advanced vision transformer architecture designs [39, 8, 40, 41, 21, 81, 78, 88, 23, 43, 75, 80, 16, 83, 82, 26] suitable for various dense prediction tasks. For example, MViT [40] focuses more on multi-scale representation learning, while HRFormer [81] examines the benefits of combining multi-scale representation learning and high-resolution representation learning. Instead of designing a novel vision transformer architecture for dense prediction, we focus on how to accelerate a well-trained vision transformer while maintaining the prediction performance as much as possible.
Our approach.
The contribution of our work lies in two main aspects: (i) we are the first to study how to accelerate state-of-the-art large-scale vision transformers for dense prediction tasks without fine-tuning (e.g., "Mask2Former + Swin-L" and "SwinV2-L + HTC++"); besides, our approach also achieves a much better accuracy and speed-up trade-off when compared to the very recent ACT [1], which is based on a clustering attention scheme; (ii) our token clustering and reconstruction layers are capable of maintaining the semantic information encoded in the original high-resolution representations, which is the most important factor in avoiding fine-tuning. We design an effective combination of a token clustering function and a token reconstruction function to maximize the cosine similarity between the reconstructed high-resolution feature maps and the original ones without fine-tuning. The design of our token reconstruction layer is the key and is essentially non-trivial. We also show in the supplementary material that our token reconstruction layer can be used to adapt the very recent EViT [44] and DynamicViT [55] for dense prediction tasks.
Figure 2: (a) Plain high-resolution vision transformer with L layers. (b) U-shaped high-to-low-to-high-resolution vision transformer with α, β, and γ layers, respectively (L = α + β + γ). (c) Illustrating the details of applying our approach to plain ViTs: we insert a token clustering layer and a token reconstruction layer into a trained vision transformer in order to decrease and then increase the spatial resolution, respectively. The weights of the original transformer modules are trained once based on the configuration of (a). The token clustering layer and the token reconstruction layer are non-parametric, thus do not require any fine-tuning and can be included directly during evaluation.
3 Our Approach
Preliminary.
The conventional Vision Transformer [18] first reshapes the input image $\mathbf{X} \in \mathbb{R}^{H \times W \times 3}$ into a sequence of flattened patches $\mathbf{X}_p \in \mathbb{R}^{N \times (P^2 \cdot 3)}$, where $(P, P)$ is the resolution of each patch, $(H, W)$ is the resolution of the input image, and $N = HW/P^2$ is the number of resulting patches or tokens, i.e., the input sequence length. The Vision Transformer consists of alternating layers of multi-head self-attention (MHSA) and feed-forward network (FFN), accompanied by layer normalization (LN) and residual connections:

$\mathbf{Z}'_l = \mathrm{MHSA}(\mathrm{LN}(\mathbf{Z}_{l-1})) + \mathbf{Z}_{l-1}, \qquad \mathbf{Z}_l = \mathrm{FFN}(\mathrm{LN}(\mathbf{Z}'_l)) + \mathbf{Z}'_l,$    (1)

where $l \in \{1, \dots, L\}$ is the layer index, $\mathbf{Z}_l \in \mathbb{R}^{N \times C}$, and $\mathbf{Z}_0$ is computed from $\mathbf{X}_p$. The computation cost $\mathcal{O}(LNC(N + C))$ mainly depends on the number of layers L, the number of tokens N, and the channel dimension C.
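For concreteness, the block structure in Equation 1 can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the implementation used in the paper; the class name `ViTBlock` and the arguments `dim`, `num_heads`, and `mlp_ratio` are our own illustrative choices.

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """One pre-norm transformer layer: MHSA + FFN, each with a residual connection (Eq. 1)."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, N, C) token representations from the previous layer.
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Z'_l
        z = z + self.ffn(self.ln2(z))                       # Z_l
        return z
```

The quadratic dependence of MHSA on the number of tokens N is exactly what the token clustering layer introduced below attacks by shrinking N for the middle layers.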
Despite the great success of transformers, their computation cost increases significantly when handling high-resolution representations, which are critical for dense prediction tasks. This paper attempts to
resolve this issue by reducing the computation complexity during the inference stage, and presents a
very simple solution for generating a large number of efficient vision transformer models directly
from a single trained vision transformer, requiring no further training or fine-tuning.
We demonstrate how our approach could be applied to the existing standard Vision Transformer in
Figure 2. The original Vision Transformer is modified using two non-parametric operations, namely
a token clustering layer and a token reconstruction layer. The proposed token clustering layer is
utilized to convert the high-resolution representations to low-resolution representations by clustering
the locally semantically similar tokens. Then, we apply the following transformer layers on the
low-resolution representations, which greatly accelerates the inference speed and saves computation
resources. Last, a token reconstruction layer is proposed to reconstruct the feature representations
back to high-resolution.
Token Clustering Layer.
We construct the token clustering layer following the improved SLIC scheme [33], which performs local k-means clustering as follows:

- Initial superpixel centers: We apply adaptive average pooling (AAP) over the high-resolution representations from the α-th layer to compute the h × w initial cluster center representations:

$\mathbf{S}_\alpha = \mathrm{AAP}(\mathbf{Z}_\alpha, (h \times w)),$    (2)

where $\mathbf{S}_\alpha \in \mathbb{R}^{hw \times C}$, $\mathbf{Z}_\alpha \in \mathbb{R}^{N \times C}$, and $hw \ll N$.
- Iterative local clustering: (i) Expectation step: compute the normalized similarity between each pixel p and the surrounding superpixels i (we only consider the λ neighboring positions); (ii) Maximization step: compute the new superpixel centers:

$Q_{p,i} = \dfrac{\exp\!\left(-\|\mathbf{Z}_{\alpha,p} - \mathbf{S}_{\alpha,i}\|^2/\tau\right)}{\sum_{j=1}^{\lambda} \exp\!\left(-\|\mathbf{Z}_{\alpha,p} - \mathbf{S}_{\alpha,j}\|^2/\tau\right)}, \qquad \mathbf{S}_{\alpha,i} = \sum_{p=1}^{N} Q_{p,i}\,\mathbf{Z}_{\alpha,p},$    (3)

where we iterate the above Expectation and Maximization steps for κ times, τ is a temperature hyper-parameter, and i ∈ {1, 2, ..., λ}. We apply the following β transformer layers on Sα instead of Zα, thus resulting in Sα+β and decreasing the computation cost significantly.
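As an illustration, the token clustering layer can be sketched as follows in PyTorch. The sketch makes two simplifying assumptions relative to Equation 3: it compares every token against every cluster center instead of only the λ neighboring centers, and it normalizes the M-step update by the total assignment weight per center so that the centers stay on the same scale as the tokens. The function name `token_clustering` and its defaults (`num_iters`, `tau`) are illustrative, not the paper's official code.

```python
import torch
import torch.nn.functional as F

def token_clustering(z: torch.Tensor, src_grid: tuple, dst_grid: tuple,
                     num_iters: int = 5, tau: float = 1.0) -> torch.Tensor:
    """Cluster N = H*W tokens into h*w cluster tokens (a global variant of Eqs. 2-3).

    z:        (B, N, C) tokens produced by the alpha-th transformer layer.
    src_grid: (H, W), the high-resolution token grid with H*W = N.
    dst_grid: (h, w), the target cluster grid with h*w << N.
    """
    B, N, C = z.shape
    H, W = src_grid
    h, w = dst_grid
    # Initial superpixel centers via adaptive average pooling (Eq. 2).
    z_map = z.transpose(1, 2).reshape(B, C, H, W)
    s = F.adaptive_avg_pool2d(z_map, (h, w)).flatten(2).transpose(1, 2)   # (B, h*w, C)

    for _ in range(num_iters):
        # E-step: soft assignment of each token to the centers (Eq. 3 restricts
        # this to the lambda neighboring centers; here we use all of them).
        d2 = torch.cdist(z, s).pow(2)                  # (B, N, h*w) squared L2 distances
        q = F.softmax(-d2 / tau, dim=-1)               # row-normalized assignments Q
        # M-step: recompute centers as weighted averages of the assigned tokens.
        s = (q.transpose(1, 2) @ z) / q.sum(dim=1).clamp_min(1e-6).unsqueeze(-1)
    return s
```

Restricting the comparison to neighboring centers, as the paper does, mainly changes which entries of the distance matrix need to be computed; the overall E/M structure stays the same.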
Token Reconstruction Layer.
We implement the token reconstruction layer by exploiting the relations between the high-resolution representations and the low-resolution clustered representations:

$\mathbf{Z}_{\alpha+\beta,p} = \sum_{\mathbf{S}_{\alpha,i} \in k\text{-NN}(\mathbf{Z}_{\alpha,p})} \dfrac{\exp\!\left(-\|\mathbf{Z}_{\alpha,p} - \mathbf{S}_{\alpha,i}\|^2/\tau\right)}{\sum_{\mathbf{S}_{\alpha,j} \in k\text{-NN}(\mathbf{Z}_{\alpha,p})} \exp\!\left(-\|\mathbf{Z}_{\alpha,p} - \mathbf{S}_{\alpha,j}\|^2/\tau\right)}\,\mathbf{S}_{\alpha+\beta,i},$    (4)

where τ is the same temperature hyper-parameter as in Equation 3, and k-NN(Zα,p) represents the set of the k nearest, i.e., most similar, superpixel representations for Zα,p. We empirically find that choosing the same neighboring positions as in Equation 3 achieves performance close to the k-NN scheme while being easier to implement.
In summary, we estimate the semantic relations between the high-resolution tokens and the clustered tokens based on the representations before refinement by the following β transformer layers, and then reconstruct the high-resolution representations from the refined low-resolution clustered representations accordingly.
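A matching sketch of the token reconstruction layer is given below. Consistent with the summary above, the weights are computed from the pre-refinement features Zα and pre-refinement centers Sα, and are then applied to the refined cluster tokens Sα+β. As in the previous sketch, we use all centers rather than only the k nearest ones, and the function name is illustrative.

```python
import torch

def token_reconstruction(z_alpha: torch.Tensor, s_alpha: torch.Tensor,
                         s_refined: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Reconstruct high-resolution tokens from refined cluster tokens (a global variant of Eq. 4).

    z_alpha:   (B, N, C) high-resolution tokens before clustering (Z_alpha).
    s_alpha:   (B, M, C) cluster centers before refinement (S_alpha).
    s_refined: (B, M, C) cluster tokens after the beta transformer layers (S_alpha+beta).
    """
    # Semantic relations are estimated from the representations *before* refinement ...
    d2 = torch.cdist(z_alpha, s_alpha).pow(2)      # (B, N, M)
    w = torch.softmax(-d2 / tau, dim=-1)           # per-token weights over the centers
    # ... and used to mix the *refined* low-resolution tokens back to high resolution.
    return w @ s_refined                           # (B, N, C)
```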
Finally, we apply the remaining γ transformer layers to the reconstructed high-resolution features, and the task-specific head on the refined high-resolution features to predict the target results, such as semantic segmentation maps or monocular depth maps.
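Putting the two layers together, the expedited forward pass has the high-to-low-to-high structure of Figure 2(c). The sketch below assumes `blocks` is the list of the L pre-trained transformer layers and `head` is the task-specific decoder; both keep their original weights, and only the two non-parametric layers from the sketches above are inserted.

```python
def expedited_forward(z0, blocks, head, alpha, beta, src_grid, dst_grid):
    """High-to-low-to-high inference with a trained ViT, without touching its weights."""
    z = z0
    for blk in blocks[:alpha]:                      # alpha layers at high resolution
        z = blk(z)
    s = token_clustering(z, src_grid, dst_grid)     # non-parametric: N -> h*w tokens
    s_refined = s
    for blk in blocks[alpha:alpha + beta]:          # beta layers on the clustered tokens
        s_refined = blk(s_refined)
    z = token_reconstruction(z, s, s_refined)       # non-parametric: h*w -> N tokens
    for blk in blocks[alpha + beta:]:               # remaining gamma layers
        z = blk(z)
    return head(z)                                  # e.g., segmentation or depth head
```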
Extension to Swin Transformer.
We further introduce a window token clustering layer and a window token reconstruction layer, which are suitable for Swin Transformer [46, 47]. Figure 3 illustrates an example usage of the proposed window token clustering layer and window token reconstruction layer. We first cluster the K × K window tokens into k × k window tokens and then reconstruct the K × K window tokens according to the refined k × k window tokens. We apply the Swin Transformer layers equipped with the smaller window size k × k on the clustered representations, where we need to bilinearly interpolate the pre-trained relative position embedding table from (2K−1) × (2K−1) to (2k−1) × (2k−1) when processing the clustered representations. In summary, we can improve the efficiency of Swin Transformer by injecting the window token clustering layer and the window token reconstruction layer into the backbone seamlessly without fine-tuning the model weights.
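One practical detail of the Swin extension is resizing the relative position bias table when the window shrinks from K × K to k × k. A rough sketch of such a bilinear resize is shown below; it assumes the table is stored as a ((2K−1)·(2K−1), num_heads) tensor, as in common Swin implementations, which may differ from the exact layout in any particular codebase.

```python
import torch
import torch.nn.functional as F

def resize_relative_position_bias(table: torch.Tensor, K: int, k: int) -> torch.Tensor:
    """Bilinearly resize a relative position bias table from (2K-1)^2 entries to (2k-1)^2 entries."""
    num_heads = table.shape[1]
    # ((2K-1)*(2K-1), heads) -> (1, heads, 2K-1, 2K-1), so it can be treated as an image.
    t = table.reshape(2 * K - 1, 2 * K - 1, num_heads).permute(2, 0, 1).unsqueeze(0)
    t = F.interpolate(t, size=(2 * k - 1, 2 * k - 1), mode="bilinear", align_corners=False)
    # Back to ((2k-1)*(2k-1), heads).
    return t.squeeze(0).permute(1, 2, 0).reshape((2 * k - 1) ** 2, num_heads)
```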
Why can our approach avoid fine-tuning?
The reasons include the following two aspects: (i) our token clustering/reconstruction layers are non-parametric, thus avoiding retraining any additional parameters; (ii) the reconstructed high-resolution representations maintain high semantic similarity with the original high-resolution representations. We take Segmenter+ViT-L/16 (on ADE20K, α = 10) as an example and analyze the semantic similarity between the reconstructed high-resolution features (with our approach) and the original high-resolution features (with the original ViT-L/16) in Table 1. Accordingly, we can see that the cosine similarities between the reconstructed and the original high-resolution features are consistently high across different transformer layers. In other words, our approach well maintains the semantic information carried in the original high-resolution feature maps and is thus capable of avoiding fine-tuning.
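The similarity analysis can be reproduced in spirit with a token-wise cosine similarity between the original and the reconstructed feature maps at a given layer; a minimal helper might look like this (the function name is ours):

```python
import torch.nn.functional as F

def mean_token_cosine_similarity(z_original, z_reconstructed):
    """Average cosine similarity over tokens between two (B, N, C) feature maps."""
    return F.cosine_similarity(z_original, z_reconstructed, dim=-1).mean()
```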