
[Figure 1 panels omitted: (a) Full-Precision, (b) Fully quantized ViT (baseline). Annotated statistics for Block.0.query / Block.3.query / Block.6.query — (a): µ = 0.0841, σ = 1.2124; µ = −0.0176, σ = 1.0753; µ = −0.0361, σ = 1.2748. (b): µ = 0.0765, σ = 1.6533; µ = −0.0110, σ = 1.3183; µ = −0.0323, σ = 1.2268.]
Figure 1: The histogram of query values q (blue shadow) along with the PDF curve (red line) of the Gaussian distribution N(µ, σ²) [20], for 3 selected layers in DeiT-T and 4-bit fully quantized DeiT-T (baseline). µ and σ² are the statistical mean and variance of the values.
directly compute quantized parameters from pre-trained full-precision models, which constrains the model performance to a suboptimal level without fine-tuning. Furthermore, quantizing these models to ultra-low bit-widths (e.g., 4 bits or lower) with PTQ methods is ineffective and suffers from a significant performance drop.
In contrast, quantization-aware training (QAT) [16] methods perform quantization during back-propagation and generally achieve a much smaller performance drop at a higher compression rate. QAT has been shown to be effective for CNN models on CV tasks [17]. However, QAT methods remain largely
unexplored for low-bit quantization of vision transformers. Therefore, we first build a fully quantized
ViT baseline, a straightforward yet effective solution based on common techniques. Our study
finds that the performance drop of the fully quantized ViT stems from the information distortion in the attention mechanism during the forward process, and from the ineffective optimization for eliminating the distribution difference through distillation in the backward propagation. First, the attention mechanism of ViT aims to model long-distance dependencies [27, 4]. However, our analysis shows that direct quantization leads to information distortion, i.e., a significant distribution variation of the query module between the quantized ViT and its full-precision counterpart. For example, as shown in Fig. 1, the variance difference is 0.4409 (1.2124 vs. 1.6533) for the first block. This inevitably deteriorates the capability of the attention module to capture global dependencies in the input. Second, the distillation for the fully quantized ViT baseline uses a distillation token (following [25]) to directly supervise the classification output of the quantized ViT. However, we find that such supervision is too coarse-grained to be effective, given the large gap between the quantized attention scores and their full-precision counterparts.
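To make the measurement behind Fig. 1 concrete, below is a minimal PyTorch sketch of comparing the mean and variance of a query tensor before and after a naive symmetric uniform 4-bit quantizer. The quantizer, its max-based scale, and the synthetic tensor are illustrative assumptions (the statistics in Fig. 1 are measured from the full network's forward pass, not from quantizing one tensor in isolation); `uniform_quantize` is a hypothetical helper, not the baseline's actual quantizer.

```python
# Illustrative sketch (assumption): a naive symmetric uniform quantizer used
# only to show how the query statistics reported in Fig. 1 can be compared.
import torch

def uniform_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization to n_bits (illustrative, not the baseline's scheme)."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4 bits
    scale = x.abs().max() / qmax          # simple max-based scale (assumption)
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
# Synthetic "query" activations matching the Block.0 statistics of Fig. 1(a).
q_full = torch.randn(197, 64) * 1.2124 + 0.0841
q_quant = uniform_quantize(q_full, n_bits=4)

print(f"full-precision: mu={q_full.mean().item():.4f}, var={q_full.var().item():.4f}")
print(f"4-bit query:    mu={q_quant.mean().item():.4f}, var={q_quant.var().item():.4f}")
```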
To address the aforementioned issues, we develop a fully quantized ViT (Q-ViT) that retains the distribution of the quantized attention modules as in their full-precision counterparts (see the overview in Fig. 2). Specifically, in the forward process we rectify the distorted distribution over the quantized attention modules through an Information Rectification Module (IRM) based on information entropy maximization, while in the backward process we present a Distribution Guided Distillation (DGD) scheme that eliminates the distribution variation through an attention similarity loss between the quantized ViT and its full-precision counterpart (an illustrative sketch of both components is given after the contribution list). The contributions of our work include:
• We propose an Information Rectification Module (IRM) based on information theory to address the information distortion problem. IRM yields quantized representations in the attention module with maximized information entropy, allowing the quantized model to better restore the representation of the input images.
• We develop a Distribution Guided Distillation (DGD) scheme to eliminate the distribution mismatch in distillation. DGD selects appropriate activations and utilizes knowledge from their similarity matrices during distillation to guide the optimization accurately.
• Our Q-ViT, for the first time, explores a promising path towards accurate, low-bit ViTs. Extensive experiments on the ImageNet benchmark show that Q-ViT outperforms the baseline by a large margin and achieves performance comparable to the full-precision counterparts.
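As referenced above, the following is a hedged sketch of the two components under stated assumptions: IRM is sketched as a learnable per-head shift and scale applied to queries/keys before quantization (one simple way to keep the quantized values well spread, i.e., with high information entropy), and DGD as an MSE loss between normalized query-key similarity matrices of the quantized model and its full-precision teacher. The exact Q-ViT parameterization and loss may differ; all names here are illustrative.

```python
# Hedged sketch of IRM and DGD (assumptions: the learnable shift/scale form of
# IRM and the MSE similarity loss of DGD are illustrative, not the exact
# Q-ViT formulations).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationRectificationModule(nn.Module):
    """IRM sketch: standardize query/key activations per head and re-shape them
    with a learnable scale/shift before quantization, so the quantized values
    stay well spread across the quantization bins (high information entropy)."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_heads, 1, 1))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_heads, 1, 1))   # learnable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, tokens, head_dim)
        mu = x.mean(dim=(-2, -1), keepdim=True)
        std = x.std(dim=(-2, -1), keepdim=True) + 1e-6
        return (x - mu) / std * self.gamma + self.beta

def dgd_loss(q_s, k_s, q_t, k_t) -> torch.Tensor:
    """DGD sketch: match normalized query-key similarity matrices between the
    quantized student (s) and the full-precision teacher (t)."""
    sim_s = F.normalize(q_s @ k_s.transpose(-2, -1), dim=-1)
    sim_t = F.normalize(q_t @ k_t.transpose(-2, -1), dim=-1)
    return F.mse_loss(sim_s, sim_t)
```

In a QAT pipeline, such a module would sit right before the query/key quantizers, and the similarity loss would be added to the overall distillation objective.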
The Gaussian distribution hypothesis is supported by [20].