vision transformers have dominated various leaderboards, including but not limited to image classification [74, 14, 17, 84], object detection [86, 46, 38], semantic segmentation [32, 11, 3], pose estimation [77], image generation [85], and depth estimation [42].
Although vision transformers have achieved more accurate predictions on many vision tasks, large-scale vision transformers still suffer from heavy computational overhead, particularly when processing high-resolution inputs [20, 46]. This limits their deployment in resource-constrained applications and has motivated efforts to re-design lightweight vision transformer architectures [10, 51, 87]. In addition, several recent efforts have investigated how to decrease model complexity and accelerate vision transformers, especially for image classification, introducing various advanced acceleration approaches. Dynamic ViT [55] and EViT [44], for example, propose two different dynamic token sparsification frameworks that progressively reduce redundant tokens, selecting the most informative tokens either according to the scores predicted by an extra trained prediction module or according to their relevance to the [class] token. TokenLearner [58] learns to spatially attend over a subset of tokens and generates a set of clustered tokens adaptive to the input for video understanding tasks. Most of these token reduction approaches are carefully designed for image classification tasks and require fine-tuning or retraining. They may therefore be unsuitable for the more challenging dense prediction tasks that need to process high-resolution input images, e.g., 1024 × 1024, which incur heavy computation and GPU memory costs. We also demonstrate in the supplemental material the superiority of our method over several representative methods on dense prediction tasks.
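To illustrate the general flavor of such score-based token sparsification (a simplified sketch rather than the exact Dynamic ViT or EViT implementation), the following PyTorch snippet keeps the top-k image tokens ranked by the attention they receive from the [class] token; the keep ratio and the function name are illustrative assumptions.

```python
import torch

def select_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.5):
    """Keep the image tokens that receive the highest [class]-token attention.

    tokens:   (B, N, C) image token features (excluding the [class] token).
    cls_attn: (B, N) attention weights from the [class] token, averaged over heads.
    """
    B, N, C = tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # indices of the most informative tokens according to the [class] token
    topk = cls_attn.topk(num_keep, dim=1).indices              # (B, num_keep)
    return tokens.gather(1, topk.unsqueeze(-1).expand(-1, -1, C))
```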
Rather than proposing a new lightweight architecture for dense prediction or a token reduction scheme for image classification only, we focus on how to expedite well-trained large-scale vision transformers and use them for various dense prediction tasks without fine-tuning or re-training. Motivated by two key observations, namely that (i) the intermediate token representations of a well-trained vision transformer carry a heavy amount of local spatial redundancy and (ii) dense prediction tasks require high-resolution representations, we propose a simple yet effective scheme to convert the “high-resolution” path of the vision transformer into a “high-to-low-to-high resolution” path via two non-parametric layers: a token clustering layer and a token reconstruction layer. Our method can produce a wide range of more efficient models without requiring further fine-tuning or re-training. We apply our approach to expedite two mainstream vision transformer architectures, i.e., ViTs and Swin Transformers, on five challenging dense prediction tasks, including object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation. We achieve encouraging results across several evaluated benchmarks, and Figure 1 illustrates representative results on both semantic segmentation and depth estimation tasks.
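As a rough illustration of this high-to-low-to-high idea (a minimal sketch, not the exact token clustering and token reconstruction layers evaluated in our experiments), the PyTorch snippet below clusters the high-resolution tokens with a few naive k-means iterations before the remaining transformer blocks, and later reconstructs high-resolution tokens by similarity-weighted interpolation against the refined cluster tokens; the uniform initialization and the softmax temperature `tau` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cluster_tokens(tokens, num_clusters, iters=5):
    """Non-parametric token clustering via a few naive k-means steps.

    tokens: (N, C) high-resolution token features of a single image.
    Returns the (K, C) cluster tokens and the (N,) hard assignments.
    """
    N, _ = tokens.shape
    # initialize cluster centers by uniformly subsampling the tokens
    init_idx = torch.linspace(0, N - 1, num_clusters).long()
    centers = tokens[init_idx].clone()
    for _ in range(iters):
        dist = torch.cdist(tokens, centers)        # (N, K) pairwise distances
        assign = dist.argmin(dim=1)                # (N,) nearest-center assignment
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():                         # leave empty clusters unchanged
                centers[k] = tokens[mask].mean(dim=0)
    return centers, assign

def reconstruct_tokens(orig_tokens, refined_centers, tau=0.1):
    """Non-parametric token reconstruction: recover high-resolution tokens by
    similarity-weighted interpolation of the refined low-resolution tokens.

    orig_tokens:     (N, C) tokens saved before clustering (used as queries).
    refined_centers: (K, C) cluster tokens after the remaining transformer blocks.
    """
    sim = orig_tokens @ refined_centers.t()        # (N, K) dot-product similarity
    weights = F.softmax(sim / tau, dim=1)
    return weights @ refined_centers               # (N, C) reconstructed tokens
```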
2 Related work
Pruning Convolutional Neural Networks.
Convolutional neural network pruning [2, 30, 71] is a task that involves removing redundant parameters to reduce model complexity without a significant performance drop. Pruning methods typically entail three steps: (i) training a large, over-parameterized model to convergence, (ii) pruning the trained model according to a certain criterion, and (iii) fine-tuning the pruned model to regain the lost performance [49]. The key idea is to design an importance score function that is capable of identifying the less informative parameters. We follow [7] and categorize existing methods into two main paths: (i) unstructured pruning (also named weight pruning) and (ii) structured pruning. Unstructured pruning methods use the absolute value of each weight, or the product of each weight and its gradient, to estimate the importance scores. Structured pruning methods, such as layer-level pruning [72], filter-level pruning [48, 79], and image-level pruning [25, 65], remove model sub-structures. Recent studies [5, 6, 31] further extend these pruning methods to vision transformers. Unlike previous pruning methods, we explore how to expedite vision transformers for dense prediction tasks by carefully reducing and increasing the number of tokens without removing or modifying the parameters.
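As a toy illustration of the magnitude-based criterion used by many unstructured pruning methods (independent of any particular work cited above), the sketch below zeroes out the fraction of weights in a linear layer with the smallest absolute values; the sparsity level is an illustrative assumption.

```python
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of weights with the smallest |w|, in place."""
    with torch.no_grad():
        w = linear.weight
        k = int(w.numel() * sparsity)
        if k == 0:
            return
        # the k-th smallest absolute value serves as the pruning threshold
        threshold = w.abs().flatten().kthvalue(k).values
        mask = (w.abs() > threshold).to(w.dtype)
        w.mul_(mask)

layer = nn.Linear(768, 768)
magnitude_prune_(layer, sparsity=0.5)  # roughly half of the weights become zero
```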
Efficient Vision Transformer.
The success of vision transformers has incentivised many recent efforts [58, 62, 50, 29, 73, 55, 37, 24, 69, 9, 35, 44, 57] to exploit the spatial redundancies of intermediate token representations. For example, TokenLearner [58] learns to attend over a subset of tokens and generates a set of clustered tokens adaptive to the input. They empirically show that very few clustered tokens are sufficient for video understanding tasks. Token Pooling [50] exploits a nonuniform data-aware down-sampling operator based on K-Means or K-medoids to cluster similar