
[Figure 1 panels omitted: (a) Full-Precision, (b) Fully quantized ViT (baseline). Annotated statistics for Block.0.query / Block.3.query / Block.6.query — (a): µ = 0.0841, σ = 1.2124; µ = −0.0176, σ = 1.0753; µ = −0.0361, σ = 1.2748. (b): µ = 0.0765, σ = 1.6533; µ = −0.0110, σ = 1.3183; µ = −0.0323, σ = 1.2268.]
Figure 1: The histogram of query values q (blue shadow) along with the PDF curve (red line) of the Gaussian distribution N(µ, σ²) [20], for 3 selected layers in DeiT-T and 4-bit fully quantized DeiT-T (baseline). µ and σ² are the statistical mean and variance of the values.
directly compute quantized parameters from pre-trained full-precision models, which constrains the model performance to a suboptimal level without fine-tuning. Furthermore, quantizing these models to ultra-low bit-widths (e.g., 4 bits or lower) with PTQ methods is ineffective and suffers from a significant performance drop.
In contrast, quantization-aware training (QAT) [16] methods perform quantization during back-propagation and generally achieve a much smaller performance drop at a higher compression rate. QAT has been shown to be effective for CNN models on CV tasks [17]. However, QAT methods remain largely
unexplored for low-bit quantization of vision transformers. Therefore, we first build a fully quantized
ViT baseline, a straightforward yet effective solution based on common techniques. Our study
finds that the performance drop of the fully quantized ViT stems from the information distortion in the attention mechanism during the forward process, and from the ineffective optimization for eliminating the distribution difference through distillation in the backward propagation. First, the attention mechanism of ViT aims to model long-distance dependencies [27, 4]. However, our analysis shows that direct quantization leads to information distortion, i.e., a significant distribution variation of the query module between the quantized ViT and its full-precision counterpart. For example, as shown in Fig. 1, the variance difference is 0.4409 (1.2124 vs. 1.6533) for the first block. This inevitably deteriorates the capability of the attention module to capture global dependencies in the input. Second, the distillation for the fully quantized ViT baseline uses a distillation token (following [25]) to directly supervise the classification output of the quantized ViT. However, we find that such supervision is too coarse-grained to be effective, given the large gap between the quantized attention scores and their full-precision counterparts.
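To make the measurement behind Fig. 1 concrete, below is a minimal PyTorch sketch of comparing the mean and variance of a query tensor before and after a naive symmetric uniform 4-bit quantizer. The quantizer, its max-based scale, and the synthetic tensor are illustrative assumptions (the statistics in Fig. 1 are measured from the full network's forward pass, not from quantizing one tensor in isolation); `uniform_quantize` is a hypothetical helper, not the baseline's actual quantizer.

```python
# Illustrative sketch (assumption): a naive symmetric uniform quantizer used
# only to show how the query statistics reported in Fig. 1 can be compared.
import torch

def uniform_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization to n_bits (illustrative, not the baseline's scheme)."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4 bits
    scale = x.abs().max() / qmax          # simple max-based scale (assumption)
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
# Synthetic "query" activations matching the Block.0 statistics of Fig. 1(a).
q_full = torch.randn(197, 64) * 1.2124 + 0.0841
q_quant = uniform_quantize(q_full, n_bits=4)

print(f"full-precision: mu={q_full.mean().item():.4f}, var={q_full.var().item():.4f}")
print(f"4-bit query:    mu={q_quant.mean().item():.4f}, var={q_quant.var().item():.4f}")
```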
To address the aforementioned issues, we develop a fully quantized ViT (Q-ViT) that retains the distribution of the quantized attention modules as in their full-precision counterparts (see the overview in Fig. 2). Specifically, in the forward process we rectify the distorted distribution over the quantized attention modules through an Information Rectification Module (IRM) based on information entropy maximization, while in the backward process we present a Distribution Guided Distillation (DGD) scheme that eliminates the distribution variation through an attention similarity loss between the quantized ViT and its full-precision counterpart (an illustrative sketch of both components is given after the contribution list). The contributions of our work include:
• We propose an Information Rectification Module (IRM) based on information theory to address the information distortion problem. IRM yields quantized representations in the attention module with maximized information entropy, allowing the quantized model to better restore the representation of the input images.
• We develop a Distribution Guided Distillation (DGD) scheme to eliminate the distribution mismatch in distillation. DGD selects appropriate activations and utilizes knowledge from their similarity matrices during distillation to guide the optimization accurately.
• Our Q-ViT, for the first time, explores a promising path towards accurate, low-bit ViTs. Extensive experiments on the ImageNet benchmark show that Q-ViT outperforms the baseline by a large margin and achieves performance comparable to the full-precision counterparts.
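As referenced above, the following is a hedged sketch of the two components under stated assumptions: IRM is sketched as a learnable per-head shift and scale applied to queries/keys before quantization (one simple way to keep the quantized values well spread, i.e., with high information entropy), and DGD as an MSE loss between normalized query-key similarity matrices of the quantized model and its full-precision teacher. The exact Q-ViT parameterization and loss may differ; all names here are illustrative.

```python
# Hedged sketch of IRM and DGD (assumptions: the learnable shift/scale form of
# IRM and the MSE similarity loss of DGD are illustrative, not the exact
# Q-ViT formulations).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationRectificationModule(nn.Module):
    """IRM sketch: standardize query/key activations per head and re-shape them
    with a learnable scale/shift before quantization, so the quantized values
    stay well spread across the quantization bins (high information entropy)."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_heads, 1, 1))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_heads, 1, 1))   # learnable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, tokens, head_dim)
        mu = x.mean(dim=(-2, -1), keepdim=True)
        std = x.std(dim=(-2, -1), keepdim=True) + 1e-6
        return (x - mu) / std * self.gamma + self.beta

def dgd_loss(q_s, k_s, q_t, k_t) -> torch.Tensor:
    """DGD sketch: match normalized query-key similarity matrices between the
    quantized student (s) and the full-precision teacher (t)."""
    sim_s = F.normalize(q_s @ k_s.transpose(-2, -1), dim=-1)
    sim_t = F.normalize(q_t @ k_t.transpose(-2, -1), dim=-1)
    return F.mse_loss(sim_s, sim_t)
```

In a QAT pipeline, such a module would sit right before the query/key quantizers, and the similarity loss would be added to the overall distillation objective.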
The Gaussian distribution hypothesis is supported by [20].