Q-ViT: Accurate and Fully Quantized Low-bit Vision
Transformer
Yanjing Li1, Sheng Xu1, Baochang Zhang1,2, Xianbin Cao1, Peng Gao3, Guodong Guo4,5
1Beihang University, Beijing, P.R.China
2Zhongguancun Laboratory, Beijing, P.R.China
3Shanghai Artificial Intelligence Laboratory, Shanghai, P.R.China
4Institute of Deep Learning, Baidu Research, Beijing, P.R.China
5National Engineering Laboratory for Deep Learning Technology and Application,
Beijing, P.R.China
{yanjingli, shengxu, bczhang, xbcao}@buaa.edu.cn
Abstract
Large pre-trained vision transformers (ViTs) have demonstrated remarkable performance on various visual tasks, but suffer from expensive computational and memory costs when deployed on resource-constrained devices. Among the powerful compression approaches, quantization drastically reduces the computation and memory consumption by using low-bit parameters and bit-wise operations. However, low-bit ViTs remain largely unexplored and usually suffer from a significant performance drop compared with their real-valued counterparts. In this work, through extensive empirical analysis, we first identify that the bottleneck behind the severe performance drop comes from the information distortion of the low-bit quantized self-attention map. We then develop an information rectification module (IRM) and a distribution guided distillation (DGD) scheme for fully quantized vision transformers (Q-ViT) to effectively eliminate such distortion, leading to fully quantized ViTs. We evaluate our methods on the popular DeiT and Swin backbones. Extensive experimental results show that our method achieves much better performance than the prior arts. For example, our Q-ViT can theoretically accelerate ViT-S by 6.14× and achieves about 80.9% Top-1 accuracy, even surpassing the full-precision counterpart by 1.0% on the ImageNet dataset. Our code and models are available at https://github.com/YanjingLi0202/Q-ViT.
1 Introduction
Inspired by the success in natural language processing (NLP), transformer-based models have shown great power in various computer vision (CV) tasks, such as image classification [4] and object detection [2]. Pre-trained with large-scale data, these models usually have a tremendous number of parameters. For example, the ViT-H model has 632M parameters, which take up 2528MB of memory, and requires 162G FLOPs, making it both memory and computation expensive during inference. This limits the deployment of such models on resource-limited platforms. Therefore, compressed transformers are urgently needed for real applications.
Substantial efforts have been made to compress and accelerate neural networks for efficient online inference. Methods include compact network design [10], network pruning [9], low-rank decomposition [3], quantization [21, 30, 32], and knowledge distillation [24, 31]. Quantization is particularly suitable for deployment on AI chips because it reduces the bit-width of network parameters and activations for efficient inference.
Equal contribution. Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
[Figure 1 panels: (a) Full-Precision and (b) Fully quantized ViT (baseline), each showing the query histograms of Block.0, Block.3, and Block.6. Full-precision statistics: µ = 0.0841, σ = 1.2124; µ = −0.0176, σ = 1.0753; µ = −0.0361, σ = 1.2748. Quantized baseline statistics: µ = 0.0765, σ = 1.6533; µ = −0.0110, σ = 1.3183; µ = −0.0323, σ = 1.2268.]
Figure 1: The histogram of query values q (blue shadow) along with the PDF curve (red line) of the Gaussian distribution N(µ, σ²) [20], for 3 selected layers in DeiT-T and the 4-bit fully quantized DeiT-T (baseline). µ and σ² are the statistical mean and variance of the values.
Prior post-training quantization (PTQ) methods [18, 14] for ViTs directly compute quantized parameters from pre-trained full-precision models, which constrains the model performance to a sub-optimal level without fine-tuning. Furthermore, quantizing these models to ultra-low bits (e.g., 4 bits or lower) with PTQ is ineffective and suffers from a significant performance reduction.
Differently, quantization-aware training (QAT) [16] methods perform quantization during back-propagation and generally achieve a much smaller performance drop at a higher compression rate. QAT has been shown to be effective for CNN models [17] on CV tasks. However, QAT methods remain largely unexplored for low-bit quantization of vision transformers. Therefore, we first build a fully quantized ViT baseline, a straightforward yet effective solution based on common techniques. Our study discovers that the performance drop of the fully quantized ViT lies in the information distortion of the attention mechanism in the forward process, and in the ineffective optimization for eliminating the distribution difference through distillation in the backward propagation. First, the attention mechanism of ViT aims at modeling long-distance dependencies [27, 4]. However, our analysis shows that direct quantization leads to information distortion, i.e., a significant distribution variation of the query module between the quantized ViT and its full-precision counterpart. For example, as shown in Fig. 1, the variance difference is 0.4409 (1.2124 vs. 1.6533) for the first block. This inevitably deteriorates the capability of the attention module to capture the global dependencies of the input. Second, the distillation for the fully quantized ViT baseline utilizes a distillation token (following [25]) to directly supervise the classification output of the quantized ViT. However, we find that such simple supervision is too coarse-grained to bridge the large gap between the quantized attention scores and their full-precision counterparts.
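To make this analysis concrete, the sketch below shows one way to gather the query statistics visualized in Fig. 1 with a forward hook. This is not the paper's code: the model handles, module path, and input batch are hypothetical placeholders, and the reported statistics come from a trained full-precision DeiT-T and a trained 4-bit baseline rather than from this exact snippet.

```python
import torch

def query_stats(model: torch.nn.Module, images: torch.Tensor, layer_name: str):
    """Collect mean/std of one block's query activations via a forward hook."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["q"] = output.detach()

    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        model(images)
    handle.remove()
    q = captured["q"]
    return q.mean().item(), q.std().item()

# Hypothetical usage: `model_fp` is a full-precision DeiT-T, `model_q` its trained
# 4-bit fully quantized baseline, and "blocks.0.attn.q" a stand-in module path.
# mu_fp, sigma_fp = query_stats(model_fp, images, "blocks.0.attn.q")
# mu_q,  sigma_q  = query_stats(model_q,  images, "blocks.0.attn.q")
```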
To address the aforementioned issues, a fully quantized ViT (Q-ViT) is developed by retaining the distribution of the quantized attention modules as that of the full-precision counterparts (see the overview in Fig. 2). Accordingly, in the forward process we rectify the distorted distribution of the quantized attention modules through an Information Rectification Module (IRM) based on information entropy maximization, while in the backward process we present a Distribution Guided Distillation (DGD) scheme that eliminates the distribution variation through an attention similarity loss between the quantized ViT and its full-precision counterpart. The contributions of our work include:
• We propose an Information Rectification Module (IRM) based on information theory to address the information distortion problem. IRM drives the quantized representations in the attention module toward maximized information entropy, allowing the quantized model to restore the representation of the input images (a rough sketch follows this list).
• We develop a Distribution Guided Distillation (DGD) scheme to eliminate the distribution mismatch in distillation. DGD takes appropriate activations and utilizes knowledge from their similarity matrices in distillation to perform the optimization accurately (see the sketch after Fig. 2).
• Our Q-ViT, for the first time, explores a promising way towards accurate and low-bit ViTs. Extensive experiments on the ImageNet benchmark show that Q-ViT outperforms the baseline by a large margin and achieves performance comparable to the full-precision counterparts.
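The exact formulation of IRM appears later in the paper; as a rough, non-authoritative illustration of entropy-maximizing rectification, the sketch below re-normalizes the query distribution with learnable scale and shift before quantization. The class name and the specific parameterization are our assumptions, not the authors' definition.

```python
import torch
import torch.nn as nn

class InformationRectification(nn.Module):
    """IRM-style sketch: re-center and re-scale the query (or key) distribution with
    learnable parameters before low-bit quantization, so that the quantized values
    retain higher information entropy than the distorted baseline distribution."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(1))   # learnable shift
        self.eps = eps

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        mu, sigma = q.mean(), q.std()
        q_hat = (q - mu) / (sigma + self.eps)      # normalize the distorted statistics
        return q_hat * self.gamma + self.beta      # re-shape toward a quantizer-friendly distribution
```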
The Gaussian distribution hypothesis is supported by [20]
[Figure 2 diagram: a quantized transformer block (Patch Embedding → MHSA → Add & Norm → MLP → Add & Norm, repeated ×L, followed by a Classifier), in which the input, query q, key k, value v, and attention score A pass through quantizers Q(·); the Information Rectification Module (IRM) acts on the query/key path before quantization, and the Distribution Guided Distillation (DGD) compares query/key similarity matrices G of the quantized model with the corresponding teacher activations.]
Figure 2: Overview of Q-ViT, applying Information Rectification Module (IRM) for maximizing
representation information and Distribution Guided Distillation (DGD) for accurate optimization.
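Fig. 2 indicates that DGD supervises the quantized model through similarity matrices built from attention activations rather than through the classification output alone. The following is a hedged sketch of such a similarity-matrix distillation loss; the exact matrices and weighting used by Q-ViT are defined later in the paper, and the function names here are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(x: torch.Tensor) -> torch.Tensor:
    """Row-normalized Gram matrix over the token dimension; x: (batch, heads, tokens, dim)."""
    g = x @ x.transpose(-2, -1)                    # (batch, heads, tokens, tokens)
    return F.normalize(g, dim=-1)

def dgd_loss(q_s, k_s, q_t, k_t) -> torch.Tensor:
    """Align the query/key similarity matrices of the quantized student (q_s, k_s)
    with those of the full-precision teacher (q_t, k_t)."""
    loss_q = F.mse_loss(similarity_matrix(q_s), similarity_matrix(q_t))
    loss_k = F.mse_loss(similarity_matrix(k_s), similarity_matrix(k_t))
    return loss_q + loss_k
```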
2 Related Work
Vision transformer.
Motivated by the great success of the Transformer in natural language processing, researchers have tried to apply the Transformer architecture to computer vision tasks. Unlike mainstream CNN-based models, the Transformer captures long-distance visual relations through its self-attention module and provides a paradigm without image-specific inductive bias. ViT [4] views 16×16 image patches as a token sequence and predicts classification via a unique class token, showing promising results. Subsequently, many works, such as DeiT [25] and PVT [28], achieve further improvements over ViT, making it more efficient and applicable to downstream tasks. CONTAINER [7] fully utilizes a hybrid ViT to aggregate dynamic and static information, exploring a new framework for visual tasks. However, the strong performance of these vision transformers comes from a large number of parameters and a high computational overhead, limiting their adoption. Therefore, designing smaller and faster vision transformers has become a new trend. DynamicViT [23] presents a dynamic token sparsification framework to prune redundant tokens progressively and dynamically, achieving a competitive complexity-accuracy trade-off. Evo-ViT [33] proposes a slow-fast updating mechanism that guarantees information flow and spatial structure, trimming down both the training and inference complexity. While the above works focus on efficient model design, this paper pursues compression and acceleration along the track of quantization.
Quantization.
Quantized neural networks (QNNs) usually have low-bit (1∼4-bit) weights and activations to accelerate model inference and save memory. Specifically, ternary weights are introduced to reduce the quantization error in TWN [13]. DoReFa-Net [35] exploits convolution kernels with low bit-width parameters and gradients to accelerate both training and inference. TTQ [36] uses two full-precision scaling coefficients to quantize the weights to ternary values. [37] presented a 2∼4-bit quantization scheme using a two-stage approach to alternately quantize the weights and activations, which provides an optimal trade-off among memory, efficiency, and performance. [11] parameterizes the quantization intervals and obtains their optimal values by directly minimizing the task loss of the network, and also alleviates the accuracy degeneration under further bit-width reduction. [29] introduces transfer learning into network quantization to obtain an accurate low-precision model by utilizing the Kullback-Leibler (KL) divergence. [6] enables accurate approximation for tensor values that have bell-shaped distributions with long tails, and finds the entire range by minimizing the quantization error. In our Q-ViT, we aim to implement an accurate, fully quantized vision transformer under the QAT paradigm.
3 Baseline of Fully Quantized ViT
First of all, we build a baseline to study fully quantized ViTs, since no such baseline has been reported in previous works. A straightforward solution is to quantize the representations (weights and activations) of the ViT architecture in the forward propagation and to apply distillation to the optimization in the backward propagation.
Quantized ViT architecture.
We briefly review the technique of neural network quantization. We first introduce a general asymmetric quantization scheme for activations and a symmetric quantization scheme for weights.
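Before the detailed formulation, the sketch below writes out the two standard schemes just named: asymmetric uniform quantization for activations and symmetric uniform quantization for weights. The per-tensor scale and zero-point choices are generic textbook assumptions rather than the baseline's exact formulas, and the straight-through gradient estimator used in QAT is omitted for brevity.

```python
import torch

def quantize_activation_asymmetric(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Asymmetric uniform quantization: map [x_min, x_max] onto the integer grid [0, 2^b - 1]."""
    q_max = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / q_max
    zero_point = torch.round(-x.min() / scale)
    x_int = torch.clamp(torch.round(x / scale) + zero_point, 0, q_max)
    return (x_int - zero_point) * scale            # de-quantized value used in the forward pass

def quantize_weight_symmetric(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization: map [-max|w|, max|w|] onto [-(2^(b-1) - 1), 2^(b-1) - 1]."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / q_max
    return torch.clamp(torch.round(w / scale), -q_max, q_max) * scale
```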