AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Se Jung Kwon1, Jeonghoon Kim1, Jeongin Bae1,4†, Kang Min Yoo1,2,3, Jin-Hwa Kim2,3, Baeseong Park1, Byeongwook Kim1, Jung-Woo Ha2, Nako Sung1 and Dongsoo Lee1

1NAVER CLOVA  2NAVER AI Lab  3SNU AIIS  4KAIST

Corresponding author: sejung.kwon@navercorp.com
†Work done while at NAVER CLOVA

arXiv:2210.03858v1 [cs.LG] 8 Oct 2022
Abstract

There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10x compression ratio under 4-bit quantization and a >1,000x reduction in the number of trainable parameters.
1 Introduction
Self-supervised learning facilitates the increased number of parameters used to construct pre-trained language models (PLMs) (e.g., Brown et al. (2020); Devlin et al. (2019)). We expect the model scaling of PLMs to continue, especially for the Transformers (Vaswani et al., 2017), because their general capability follows a power law in parameter size, exhibiting "the high-level predictability and appearance of useful capabilities" (Ganguli et al., 2022). Therefore, Transformer-based PLMs have been studied with great enthusiasm for various applications, including natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Smith et al., 2022; Rae et al., 2021; Hoffmann et al., 2022a; Chowdhery et al., 2022; Kim et al., 2021a), automatic speech recognition (Baevski et al., 2020), and computer vision (He et al., 2022; Xie et al., 2022).
Despite the impressive zero- or few-shot learning performance of PLMs, additional adaptation steps (e.g., fine-tuning on a target task) are required to further enhance performance on downstream tasks. Since each downstream task needs to load/store independent adaptation outcomes, adapting PLMs with a limited number of trainable parameters is crucial for efficient deployment when we aim to deploy multiple instances for distinct tasks (Li et al., 2018). Thus, various parameter-efficient adaptation techniques have been proposed, such as adapter modules (Houlsby et al., 2019), low-rank adaptation (Hu et al., 2022), prefix-tuning (Li and Liang, 2021), prompt tuning (Liu et al., 2021a; Gao et al., 2020), and p-tuning (Liu et al., 2021b).
Although trainable parameters can be significantly reduced by parameter-efficient adaptation schemes, we notice that the memory footprints for inference are not reduced compared to those of PLMs.¹ To enable efficient deployment of multiple downstream tasks, we incorporate model compression into parameter-efficient adaptation. We argue that previous model compression techniques were not practical solutions in terms of parameter efficiency for adaptation. For example, Quantization-Aware Training (QAT) (Jacob et al., 2018; Esser et al., 2020) can perform full fine-tuning coupled with model compression; however, each task then requires dedicated memory storage as large as that of a compressed PLM. Our key observation for achieving compression-aware parameter-efficient adaptation is that, once a PLM is quantized, only a small amount of quantization-related parameters needs to be fine-tuned for each target task. As a result, both the overall memory footprint and the number of trainable parameters for adaptation can be substantially reduced.

¹In practice, the adaptation is usually implemented by adding small additional parameters to PLMs.

Figure 1: Approaches to satisfying both parameter-efficient adaptation and parameter quantization (A: a large pre-trained LM, B: a quantized LM, C: a fine-tuned LM, D: a fine-tuned and quantized LM). Our proposed AlphaTuning technique achieves 1) performance competitive with fine-tuned LMs (i.e., A→C) at a remarkably reduced parameter size, and 2) significantly better scores than quantized LMs obtained through A→C→D.
Figure 1 illustratively compares two different approaches that enable both model compression and parameter-efficient adaptation. Fine-tuned and quantized LMs can be obtained through A→C→D or A→B→D, as shown in Figure 1. In the case of A→C→D, we may have a large number of trainable parameters, and/or PTQ may degrade performance on downstream tasks. To address such issues, we investigate the A→B→D scheme, called "AlphaTuning" in this work. Specifically, we factorize the parameters of large PLMs into binary values and scaling factors. Then, AlphaTuning conducts the adaptation by training only the scaling factors, which occupy a small portion of the quantization format, while freezing the binary values. Note that, to conduct A→B, we consider post-training quantization (PTQ) (Zhao et al., 2019; Hubara et al., 2020; Li et al., 2020a), because QAT demands significant computational overhead for training from scratch with the whole dataset.
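To make the adaptation step concrete, the sketch below shows one way the trainable set could be restricted in a PyTorch training script: every tensor except the scaling factors is frozen, so the optimizer only ever updates the α values. The naming convention (attributes ending in ".alpha") is an assumption for illustration, not the authors' released code, and whether biases are also tuned is left open here.

```python
import torch

def collect_alphatuning_params(model):
    """Freeze everything except BCQ scaling factors (assumed to be named '*.alpha')."""
    trainable = []
    for name, param in model.named_parameters():
        if name.endswith(".alpha"):
            param.requires_grad_(True)      # scaling factors: fine-tuned per task
            trainable.append(param)
        else:
            # any other parameters (binary codes if stored as parameters,
            # embeddings, biases, ...) stay frozen and shared across tasks
            param.requires_grad_(False)
    return trainable

# Usage sketch: the optimizer sees only the scaling factors of the quantized PLM.
# alphas = collect_alphatuning_params(quantized_model)
# optimizer = torch.optim.AdamW(alphas, lr=1e-4)
```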
In this paper, our contributions are as follows:

• To the best of our knowledge, this work is the first successful compression-aware parameter-efficient adaptation method.

• We report that once PLMs are quantized by PTQ, training only the scaling factors (less than 0.1% of the total parameter size) for each task is enough for successful adaptation.

• Across various LMs and tasks, we demonstrate that AlphaTuning can achieve high scores even under 4-bit quantization.
2 Recent Work

Large-Scale Language Models and Quantization. Pre-trained Transformer-based language models (Devlin et al., 2019; Radford et al., 2019) have shaped the way we design and deploy NLP models. In recent years, the explosion in availability of large-scale (i.e., larger than ten-billion-parameter) language models (Brown et al., 2020; Black et al., 2021; Chowdhery et al., 2022; Zhang et al., 2022a; Hoffmann et al., 2022b) has paved the way for a new era in the NLP scene, in which few-shot learning and parameter-efficient adaptation for downstream tasks will be more important (He et al., 2021). Quantization (which we discuss in detail in the next section) is an effective approach to fundamentally overcoming the space and time complexities of large-scale language models (Zafrir et al., 2019; Bondarenko et al., 2021), but existing methods are applicable only to limited domains and offer limited task adaptability in the quantized state.
Parameter-Efficient Adaptation of LMs. Adapting language models efficiently to task- and domain-specific data has been at the center of the community's interest since the emergence of large-scale language models. One promising approach is in-context learning (ICL) (Brown et al., 2020), in which the language model learns and predicts from the given prompt patterns. As the technique elicits reasonable few-shot performance from large-scale language models without parameter tuning, a plethora of works (Zhao et al., 2021; Lu et al., 2022; Reynolds and McDonell, 2021; Min et al., 2022) have investigated the underlying mechanism and proposed various methods to further exploit this approach. Another class of techniques adopts external or partially internal parameters, such as continuous prompt embeddings, to enable parameter-efficient LM adaptation, based on the intuition that specific prompt prefixes may better elicit certain LM behaviors. Earlier works explored the discrete prompt token space (Shin et al., 2020), but later work showed that optimizing over the continuous word embedding space yields better results (Liu et al., 2021b; Li and Liang, 2021; Gu et al., 2022), even performing on par with full fine-tuning (Lester et al., 2021; Vu et al., 2022). Another similar line of work introduces new parameters within the Transformer blocks or partially trains existing parameters (Houlsby et al., 2019; Zhang et al., 2020; Karimi Mahabadi et al., 2021; Hu et al., 2022). Finally, some works have suggested unifying all existing approaches related to parameter-efficient fine-tuning (He et al., 2021; Zhang et al., 2022b).
3 Quantization for AlphaTuning

Enterprise-scale LMs, such as the 175B-parameter GPT-3, face prohibitive costs for massive deployment, mainly resulting from their huge parameter size. To facilitate cost-effective LMs by alleviating memory requirements without noticeable performance degradation, we can consider compression techniques such as quantization (Jacob et al., 2018), pruning (Frankle et al., 2020a), and low-rank approximation (N. Sainath et al., 2013). Memory reduction by model compression is also useful for reducing latency, because memory-bound operations dominate the overall performance of LMs at small batch sizes (Park et al., 2022). In addition, model compression can reduce the number of GPUs required for inference, because GPUs have highly limited memory capacity (Shoeybi et al., 2019; Narayanan et al., 2021). In this work, we choose quantization as a practical compression technique because of its high compression ratio, simple representation format, and ability to accelerate memory-bound workloads (Chung et al., 2020).
Let us discuss our quantization strategy for LMs (see more details in Appendix C). We choose non-uniform quantization, since uniform quantization demands aggressive activation quantization (to exploit integer arithmetic units), which is challenged by the highly non-linear operations (such as softmax and layer normalization) of the Transformers (Bondarenko et al., 2021). Even though uniform quantization can mitigate performance degradation through frequent activation quantization/dequantization procedures (Bhandare et al., 2019) or additional high-precision units (Kim et al., 2021b), such techniques are slow and/or expensive. Among the various non-uniform quantization formats, we choose binary-coding quantization (BCQ) (Guo et al., 2017; Xu et al., 2018), which is extended from binary neural networks (Rastegari et al., 2016), because of its high compression ratio (Chung et al., 2020) and efficient computations (Xu et al., 2018; Jeon et al., 2020).

Figure 2: BCQ examples with g = 4 and different q values. As q increases, the MSE between the original weight and the quantized weight decreases (in the illustrated example, MSE ≈ 0.855, 0.550, and 0.171 for q = 1, 2, and 3, respectively).
BCQ Format. Given a full-precision weight vector $w \in \mathbb{R}^g$, the BCQ format approximates $w$ as $w \approx \sum_{i=1}^{q} \alpha_i b_i$, where $q$ is the number of quantization bits, $\alpha_i \in \mathbb{R}$ is a scaling factor shared by $g$ weights, and $b_i \in \{-1,+1\}^g$ is a binary vector. Note that $g$ denotes the group size, i.e., the number of weights sharing a common scaling factor; thus, $g$ is a hyper-parameter for quantization. When $q = 1$, $\alpha$ and $b$ can be determined analytically so as to minimize the mean squared error (MSE). If $q > 1$, however, $\alpha_i$ and $b_i$ need to be obtained by heuristic methods such as greedy approximation (Guo et al., 2017) or the iterative fine-tuning method (Xu et al., 2018).

For a weight matrix $W \in \mathbb{R}^{h_{out} \times h_{in}}$, row-wise quantization (i.e., $g = h_{in}$) is a popular choice² (Jeon et al., 2020; Xu et al., 2018) and can be expressed as follows:

$$ W \approx \sum_{i=1}^{q} \mathrm{diag}(\alpha_i) \cdot B_i, \qquad (1) $$

where $\alpha_i \in \mathbb{R}^{h_{out}}$, $B_i \in \{-1,+1\}^{h_{out} \times h_{in}}$, and $\mathrm{diag}(\cdot)$ denotes the function that maps a vector to a matrix that is zero except for the vector elements on its diagonal. A linear operation $Y = X \cdot W^\top$ can then be approximated as follows:

$$ Y = X \cdot W^\top \approx X \cdot \Big( \sum_{i=1}^{q} \mathrm{diag}(\alpha_i) \cdot B_i \Big)^{\top} = \sum_{i=1}^{q} \big( X \cdot B_i^\top \big) \cdot \mathrm{diag}(\alpha_i), \qquad (2) $$

where $X \in \mathbb{R}^{n_b \times h_{in}}$ and $Y \in \mathbb{R}^{n_b \times h_{out}}$. Note that even though $X$ is not quantized above, most of the complicated floating-point operations are removed owing to the binary values in $B$. Since the computational advantages of BCQ have already been introduced in the literature (Hubara et al., 2016; Jeon et al., 2020), we do not quantize activations in this work, so as to improve quantization quality.

²On/off-chip memory bandwidth can be maximized by contiguous memory allocation if row-wise quantization is adopted. Additionally, for large LLMs (i.e., with a large $h_{in}$), the amount of $\alpha$ becomes almost negligible (the $\alpha$ size is $32/h_{in}$ of the $B$ size, even assuming 32 bits to represent each $\alpha$).

Table 1: BCQ scheme for q-bit quantization applied to the linear layers of the Transformers, with example BCQ formats for the GPT-2 medium model (hidden size h = 1024). Row-wise quantization is performed when g = h_in. A lower g results in a slightly increased weight size after quantization.

Layer       | W Shape (h_out, h_in) | W Size (FP32) | g    | α Shape | B ∈ {-1,+1} Shape | Quantized W Size (MB): q=1 / q=2 / q=3
ATT_qkv     | (3h, h)               | 12.58 MB      | h    | (q, 3h) | (q, 3h, h)        | 0.41 / 0.81 / 1.22
ATT_output  | (h, h)                | 4.19 MB       | h    | (q, h)  | (q, h, h)         | 0.14 / 0.27 / 0.41
FFN_h_4h    | (4h, h)               | 16.78 MB      | h    | (q, 4h) | (q, 4h, h)        | 0.54 / 1.08 / 1.62
FFN_4h_h    | (h, 4h)               | 16.78 MB      | 4h   | (q, h)  | (q, h, 4h)        | 0.52 / 1.06 / 1.56
            |                       |               | h    | (q, 4h) | (q, 4h, h)        | 0.54 / 1.08 / 1.62
            |                       |               | 0.5h | (q, 8h) | (q, 8h, h)        | 0.56 / 1.11 / 1.67

Figure 3: (Left) Quantized Transformer structure in which parameters are categorized into frozen ones (binary weights B in {-1,+1}, fixed and shared across tasks) and trainable ones (scaling factors α). (Right) Overview of the AlphaTuning process, which trains only the scaling factors for adaptation to each downstream dataset.
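As a quick sanity check of Eqs. (1) and (2), the short NumPy sketch below (with illustrative, made-up shapes; it is not the authors' kernel implementation) reconstructs a row-wise BCQ weight and confirms that accumulating the binary partial products X·B_i^T scaled by α_i reproduces the matmul with the dequantized weight.

```python
import numpy as np

rng = np.random.default_rng(0)
q, h_out, h_in, n_b = 3, 8, 16, 4                      # illustrative sizes only

alpha = rng.random((q, h_out))                         # scaling factors per bit and output row
B = rng.choice([-1.0, 1.0], size=(q, h_out, h_in))     # binary codes in {-1, +1}
X = rng.standard_normal((n_b, h_in))

# Eq. (1): dequantized weight W' = sum_i diag(alpha_i) @ B_i
W_hat = sum(np.diag(alpha[i]) @ B[i] for i in range(q))

# Eq. (2): Y = sum_i (X @ B_i^T) @ diag(alpha_i), without materializing W'
Y = sum((X @ B[i].T) * alpha[i] for i in range(q))

assert np.allclose(Y, X @ W_hat.T)                     # both paths agree
```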
Figure 2 describes row-wise BCQ examples based on greedy approximation (Guo et al., 2017) as q varies. Note that increasing q and/or decreasing g can reduce the MSE after quantization, at the cost of a lower compression ratio.
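The greedy approximation mentioned above admits a very small sketch: at each bit, the remaining residual is quantized with the analytic 1-bit solution (b = sign of the residual, α = mean absolute value of the residual), so the MSE decreases monotonically as q grows, consistent with Figure 2. The NumPy function below is an illustrative reimplementation under this assumption, not the exact code of Guo et al. (2017).

```python
import numpy as np

def greedy_bcq(w, q):
    """Greedily quantize a weight group w (shape (g,)) into q (alpha, b) pairs."""
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the current residual
        a = np.abs(residual).mean()              # MSE-optimal scale for this binary vector
        alphas.append(a)
        bits.append(b)
        residual -= a * b                        # the next bit quantizes what is left
    w_hat = sum(a * b for a, b in zip(alphas, bits))
    return np.array(alphas), np.stack(bits), w_hat

w = np.array([2.66, 1.05, -0.07, 0.65])          # first row of the example weight in Figure 2 (g = 4)
for q in (1, 2, 3):
    _, _, w_hat = greedy_bcq(w, q)
    print(q, np.mean((w - w_hat) ** 2))          # per-row MSE shrinks as q grows
```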
Transformer Quantization. Table 1 presents our BCQ scheme applied to the linear layers of the Transformers, with BCQ formats illustrated for the medium-sized GPT-2 model (which has a hidden size h of 1024). Note that if g is large enough that each scaling factor is shared by many weights, the amount of scaling factors is negligible compared to that of B. Hence, in Table 1, 1-bit quantization attains an almost 32x compression ratio compared to the FP32 format, while a lower g slightly increases the storage overhead induced by the additional scaling factors.
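The quantized sizes in Table 1 follow from simple bit counting: q·h_out·h_in binary bits plus one 32-bit scaling factor per group of g weights per bit. A small sanity check (our own arithmetic, assuming FP32 scaling factors as in footnote 2):

```python
def bcq_size_mb(h_out, h_in, q, g):
    """Storage (MB) of a q-bit BCQ linear layer with group size g and FP32 scaling factors."""
    binary_bits = q * h_out * h_in               # one {-1,+1} bit per weight, per quantization bit
    alpha_bits = q * h_out * (h_in // g) * 32    # one FP32 scaling factor per group, per bit
    return (binary_bits + alpha_bits) / 8 / 1e6

h = 1024  # GPT-2 medium hidden size
print(bcq_size_mb(3 * h, h, q=1, g=h))   # ATT_qkv,    q=1 -> ~0.41 MB (Table 1)
print(bcq_size_mb(h, h, q=3, g=h))       # ATT_output, q=3 -> ~0.41 MB
print(bcq_size_mb(4 * h, h, q=2, g=h))   # FFN_h_4h,   q=2 -> ~1.08 MB
# For reference, ATT_output in FP32 is 4.19 MB, so q=1 gives roughly a 31x reduction.
```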
4 AlphaTuning: Efficient Fine-Tuning of Quantized Models

4.1 AlphaTuning Principles

The key idea of AlphaTuning is to identify the parameters with the greatest expressive power, so as to minimize the number of trainable parameters after PTQ. Note that training affine parameters (which transform the activations through operations such as scaling, shifting, and rotating) reportedly achieves reasonably high accuracy even when all the other parameters are fixed to be random (Frankle et al.,