AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

Se Jung Kwon1, Jeonghoon Kim1, Jeongin Bae1,4†, Kang Min Yoo1,2,3, Jin-Hwa Kim2,3, Baeseong Park1, Byeongwook Kim1, Jung-Woo Ha2, Nako Sung1 and Dongsoo Lee1

1NAVER CLOVA  2NAVER AI Lab  3SNU AIIS  4KAIST

Corresponding author: sejung.kwon@navercorp.com
†Work done while at NAVER CLOVA

arXiv:2210.03858v1 [cs.LG] 8 Oct 2022
Abstract

There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10x compression ratio under 4-bit quantization and a >1,000x reduction in the number of trainable parameters.
1 Introduction
Self-supervised learning facilitates the increased number of parameters used to construct pre-trained language models (PLMs) (e.g., Brown et al. (2020); Devlin et al. (2019)). We expect the model scaling of PLMs to continue, especially for the Transformers (Vaswani et al., 2017), because their general capability follows a power law in parameter size, exhibiting "the high-level predictability and appearance of useful capabilities" (Ganguli et al., 2022). Therefore, Transformer-based PLMs have been studied with great enthusiasm for various applications, including natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Smith et al., 2022; Rae et al., 2021; Hoffmann et al., 2022a; Chowdhery et al., 2022; Kim et al., 2021a), automatic speech recognition (Baevski et al., 2020), and computer vision (He et al., 2022; Xie et al., 2022).
Despite the impressive zero- or few-shot learning performance of PLMs, additional adaptation steps (e.g., fine-tuning on a target task) are required to further enhance performance on downstream tasks. Since each downstream task needs to load/store independent adaptation outcomes, adapting PLMs with a limited number of trainable parameters is crucial for efficient deployment when we aim to deploy multiple instances for distinct tasks (Li et al., 2018). Thus, various parameter-efficient adaptation techniques have been proposed, such as adapter modules (Houlsby et al., 2019), low-rank adaptation (Hu et al., 2022), prefix-tuning (Li and Liang, 2021), prompt tuning (Liu et al., 2021a; Gao et al., 2020), and p-tuning (Liu et al., 2021b).
Although trainable parameters can be significantly reduced by parameter-efficient adaptation schemes, we notice that the memory footprints for inference are not reduced compared to those of PLMs.¹ To enable efficient deployment of multiple downstream tasks, we incorporate model compression into parameter-efficient adaptation. We argue that previous model compression techniques were not practical solutions in terms of parameter efficiency for adaptation. For example, Quantization-Aware Training (QAT) (Jacob et al., 2018; Esser et al., 2020) can perform full fine-tuning coupled with model compression; however, each task then requires dedicated memory storage as large as that of a compressed PLM. Our key observation for achieving compression-aware parameter-efficient adaptation is that, once a PLM is quantized, only a small amount of quantization-related parameters needs to be fine-tuned for each target task. As a result, both the overall memory footprint and the number of trainable parameters for adaptation can be substantially reduced.

¹In practice, the adaptation is usually implemented by adding small additional parameters to PLMs.

Figure 1: Approaches to satisfying both parameter-efficient adaptation and parameter quantization (A: a large pre-trained LM, B: a quantized LM, C: a fine-tuned LM, D: a fine-tuned and quantized LM). Our proposed AlphaTuning technique achieves 1) performance competitive with fine-tuned LMs (i.e., A→C) at a remarkably reduced parameter size, and 2) significantly better scores than quantized LMs obtained through A→C→D.
Figure 1 illustratively compares two different approaches that enable both model compression and parameter-efficient adaptation. Fine-tuned and quantized LMs can be obtained through A→C→D or A→B→D, as shown in Figure 1. In the case of A→C→D, we may have a large number of trainable parameters, and/or PTQ may degrade performance on downstream tasks. To address such issues, we investigate the A→B→D scheme, called "AlphaTuning" in this work. Specifically, we factorize the parameters of large PLMs into binary values and scaling factors. Then, AlphaTuning conducts the adaptation by training only the scaling factors, which occupy a small portion of the quantization format, while freezing the binary values. Note that, to conduct A→B, we consider post-training quantization (PTQ) (Zhao et al., 2019; Hubara et al., 2020; Li et al., 2020a), because QAT demands significant computational overhead for training from scratch with the whole dataset.
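To make the adaptation step concrete, the sketch below shows one way the trainable set could be restricted in a PyTorch training script: every tensor except the scaling factors is frozen, so the optimizer only ever updates the α values. The naming convention (attributes ending in ".alpha") is an assumption for illustration, not the authors' released code, and whether biases are also tuned is left open here.

```python
import torch

def collect_alphatuning_params(model):
    """Freeze everything except BCQ scaling factors (assumed to be named '*.alpha')."""
    trainable = []
    for name, param in model.named_parameters():
        if name.endswith(".alpha"):
            param.requires_grad_(True)      # scaling factors: fine-tuned per task
            trainable.append(param)
        else:
            # any other parameters (binary codes if stored as parameters,
            # embeddings, biases, ...) stay frozen and shared across tasks
            param.requires_grad_(False)
    return trainable

# Usage sketch: the optimizer sees only the scaling factors of the quantized PLM.
# alphas = collect_alphatuning_params(quantized_model)
# optimizer = torch.optim.AdamW(alphas, lr=1e-4)
```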
In this paper, our contributions are as follows:

• To the best of our knowledge, this work is the first successful compression-aware parameter-efficient adaptation method.

• We report that once PLMs are quantized by PTQ, training only the scaling factors (less than 0.1% of the total parameter size) for each task is enough for successful adaptation.

• Across various LMs and tasks, we demonstrate that AlphaTuning can achieve high scores even under 4-bit quantization.
2 Recent Work

Large-Scale Language Models and Quantization. Pre-trained Transformer-based language models (Devlin et al., 2019; Radford et al., 2019) have shaped the way we design and deploy NLP models. In recent years, the explosion in availability of large-scale (i.e., larger than ten-billion-parameter) language models (Brown et al., 2020; Black et al., 2021; Chowdhery et al., 2022; Zhang et al., 2022a; Hoffmann et al., 2022b) has paved the way for a new era in the NLP scene, in which few-shot learning and parameter-efficient adaptation for downstream tasks will be more important (He et al., 2021). Quantization (which we discuss in detail in the next section) is an effective approach to fundamentally overcoming the space and time complexities of large-scale language models (Zafrir et al., 2019; Bondarenko et al., 2021), but existing methods are applicable only to limited domains and offer limited task adaptability in the quantized state.
Parameter-Efficient Adaptation of LMs. Adapting language models efficiently to task- and domain-specific data has been at the center of the community's interest since the emergence of large-scale language models. One promising approach is in-context learning (ICL) (Brown et al., 2020), in which the language model learns and predicts from the given prompt patterns. As the technique elicits reasonable few-shot performance from large-scale language models without parameter tuning, a plethora of works (Zhao et al., 2021; Lu et al., 2022; Reynolds and McDonell, 2021; Min et al., 2022) have investigated the underlying mechanism and proposed various methods to further exploit this approach. Another class of techniques adopts external or partially internal parameters, such as continuous prompt embeddings, to enable parameter-efficient LM adaptation, based on the intuition that specific prompt prefixes may better elicit certain LM behaviors. Earlier works explored the discrete prompt token space (Shin et al., 2020), but later work showed that optimizing over the continuous word embedding space yields better results (Liu et al., 2021b; Li and Liang, 2021; Gu et al., 2022), even performing on par with full fine-tuning (Lester et al., 2021; Vu et al., 2022). Another similar line of work introduces new parameters within the Transformer blocks or partially trains existing parameters (Houlsby et al., 2019; Zhang et al., 2020; Karimi Mahabadi et al., 2021; Hu et al., 2022). Finally, some works have suggested unifying all existing approaches related to parameter-efficient fine-tuning (He et al., 2021; Zhang et al., 2022b).
3 Quantization for AlphaTuning

Enterprise-scale LMs, such as the 175B-parameter GPT-3, face prohibitive costs for massive deployment, mainly resulting from their huge parameter size. To facilitate cost-effective LMs by alleviating memory requirements without noticeable performance degradation, we can consider compression techniques such as quantization (Jacob et al., 2018), pruning (Frankle et al., 2020a), and low-rank approximation (N. Sainath et al., 2013). Memory reduction by model compression is also useful for reducing latency, because memory-bound operations dominate the overall performance of LMs at small batch sizes (Park et al., 2022). In addition, model compression can reduce the number of GPUs required for inference, because GPUs have highly limited memory capacity (Shoeybi et al., 2019; Narayanan et al., 2021). In this work, we choose quantization as a practical compression technique because of its high compression ratio, simple representation format, and ability to accelerate memory-bound workloads (Chung et al., 2020).
Let us discuss our quantization strategy for LMs (see more details in Appendix C). We choose non-uniform quantization, since uniform quantization demands aggressive activation quantization (to exploit integer arithmetic units), which is challenged by the highly non-linear operations (such as softmax and layer normalization) of the Transformers (Bondarenko et al., 2021). Even though uniform quantization can mitigate performance degradation through frequent activation quantization/dequantization procedures (Bhandare et al., 2019) or additional high-precision units (Kim et al., 2021b), such techniques are slow and/or expensive. Among the various non-uniform quantization formats, we choose binary-coding quantization (BCQ) (Guo et al., 2017; Xu et al., 2018), which is extended from binary neural networks (Rastegari et al., 2016), because of its high compression ratio (Chung et al., 2020) and efficient computations (Xu et al., 2018; Jeon et al., 2020).

Figure 2: BCQ examples with g = 4 and different q values. As q increases, the MSE between the original weight and the quantized weight decreases (in the illustrated example, MSE ≈ 0.855, 0.550, and 0.171 for q = 1, 2, and 3, respectively).
BCQ Format. Given a full-precision weight vector $w \in \mathbb{R}^g$, the BCQ format approximates $w$ as $w \approx \sum_{i=1}^{q} \alpha_i b_i$, where $q$ is the number of quantization bits, $\alpha_i \in \mathbb{R}$ is a scaling factor shared by $g$ weights, and $b_i \in \{-1,+1\}^g$ is a binary vector. Note that $g$ denotes the group size, i.e., the number of weights sharing a common scaling factor; thus, $g$ is a hyper-parameter for quantization. When $q = 1$, $\alpha$ and $b$ can be determined analytically so as to minimize the mean squared error (MSE). If $q > 1$, however, $\alpha_i$ and $b_i$ need to be obtained by heuristic methods such as greedy approximation (Guo et al., 2017) or the iterative fine-tuning method (Xu et al., 2018).

For a weight matrix $W \in \mathbb{R}^{h_{out} \times h_{in}}$, row-wise quantization (i.e., $g = h_{in}$) is a popular choice² (Jeon et al., 2020; Xu et al., 2018) and can be expressed as follows:

$$ W \approx \sum_{i=1}^{q} \mathrm{diag}(\alpha_i) \cdot B_i, \qquad (1) $$

where $\alpha_i \in \mathbb{R}^{h_{out}}$, $B_i \in \{-1,+1\}^{h_{out} \times h_{in}}$, and $\mathrm{diag}(\cdot)$ denotes the function that maps a vector to a matrix that is zero except for the vector elements on its diagonal. A linear operation $Y = X \cdot W^\top$ can then be approximated as follows:

$$ Y = X \cdot W^\top \approx X \cdot \Big( \sum_{i=1}^{q} \mathrm{diag}(\alpha_i) \cdot B_i \Big)^{\top} = \sum_{i=1}^{q} \big( X \cdot B_i^\top \big) \cdot \mathrm{diag}(\alpha_i), \qquad (2) $$

where $X \in \mathbb{R}^{n_b \times h_{in}}$ and $Y \in \mathbb{R}^{n_b \times h_{out}}$. Note that even though $X$ is not quantized above, most of the complicated floating-point operations are removed owing to the binary values in $B$. Since the computational advantages of BCQ have already been introduced in the literature (Hubara et al., 2016; Jeon et al., 2020), we do not quantize activations in this work, so as to improve quantization quality.

²On/off-chip memory bandwidth can be maximized by contiguous memory allocation if row-wise quantization is adopted. Additionally, for large LLMs (i.e., with a large $h_{in}$), the amount of $\alpha$ becomes almost negligible (the $\alpha$ size is $32/h_{in}$ of the $B$ size, even assuming 32 bits to represent each $\alpha$).

Table 1: BCQ scheme for q-bit quantization applied to the linear layers of the Transformers, with example BCQ formats for the GPT-2 medium model (hidden size h = 1024). Row-wise quantization is performed when g = h_in. A lower g results in a slightly increased weight size after quantization.

Layer       | W Shape (h_out, h_in) | W Size (FP32) | g    | α Shape | B ∈ {-1,+1} Shape | Quantized W Size (MB): q=1 / q=2 / q=3
ATT_qkv     | (3h, h)               | 12.58 MB      | h    | (q, 3h) | (q, 3h, h)        | 0.41 / 0.81 / 1.22
ATT_output  | (h, h)                | 4.19 MB       | h    | (q, h)  | (q, h, h)         | 0.14 / 0.27 / 0.41
FFN_h_4h    | (4h, h)               | 16.78 MB      | h    | (q, 4h) | (q, 4h, h)        | 0.54 / 1.08 / 1.62
FFN_4h_h    | (h, 4h)               | 16.78 MB      | 4h   | (q, h)  | (q, h, 4h)        | 0.52 / 1.06 / 1.56
            |                       |               | h    | (q, 4h) | (q, 4h, h)        | 0.54 / 1.08 / 1.62
            |                       |               | 0.5h | (q, 8h) | (q, 8h, h)        | 0.56 / 1.11 / 1.67

Figure 3: (Left) Quantized Transformer structure in which parameters are categorized into frozen ones (binary weights B in {-1,+1}, fixed and shared across tasks) and trainable ones (scaling factors α). (Right) Overview of the AlphaTuning process, which trains only the scaling factors for adaptation to each downstream dataset.
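As a quick sanity check of Eqs. (1) and (2), the short NumPy sketch below (with illustrative, made-up shapes; it is not the authors' kernel implementation) reconstructs a row-wise BCQ weight and confirms that accumulating the binary partial products X·B_i^T scaled by α_i reproduces the matmul with the dequantized weight.

```python
import numpy as np

rng = np.random.default_rng(0)
q, h_out, h_in, n_b = 3, 8, 16, 4                      # illustrative sizes only

alpha = rng.random((q, h_out))                         # scaling factors per bit and output row
B = rng.choice([-1.0, 1.0], size=(q, h_out, h_in))     # binary codes in {-1, +1}
X = rng.standard_normal((n_b, h_in))

# Eq. (1): dequantized weight W' = sum_i diag(alpha_i) @ B_i
W_hat = sum(np.diag(alpha[i]) @ B[i] for i in range(q))

# Eq. (2): Y = sum_i (X @ B_i^T) @ diag(alpha_i), without materializing W'
Y = sum((X @ B[i].T) * alpha[i] for i in range(q))

assert np.allclose(Y, X @ W_hat.T)                     # both paths agree
```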
Figure 2 describes row-wise BCQ examples based on greedy approximation (Guo et al., 2017) as q varies. Note that increasing q and/or decreasing g can reduce the MSE after quantization, at the cost of a lower compression ratio.
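The greedy approximation mentioned above admits a very small sketch: at each bit, the remaining residual is quantized with the analytic 1-bit solution (b = sign of the residual, α = mean absolute value of the residual), so the MSE decreases monotonically as q grows, consistent with Figure 2. The NumPy function below is an illustrative reimplementation under this assumption, not the exact code of Guo et al. (2017).

```python
import numpy as np

def greedy_bcq(w, q):
    """Greedily quantize a weight group w (shape (g,)) into q (alpha, b) pairs."""
    residual = w.astype(np.float64).copy()
    alphas, bits = [], []
    for _ in range(q):
        b = np.where(residual >= 0, 1.0, -1.0)   # sign of the current residual
        a = np.abs(residual).mean()              # MSE-optimal scale for this binary vector
        alphas.append(a)
        bits.append(b)
        residual -= a * b                        # the next bit quantizes what is left
    w_hat = sum(a * b for a, b in zip(alphas, bits))
    return np.array(alphas), np.stack(bits), w_hat

w = np.array([2.66, 1.05, -0.07, 0.65])          # first row of the example weight in Figure 2 (g = 4)
for q in (1, 2, 3):
    _, _, w_hat = greedy_bcq(w, q)
    print(q, np.mean((w - w_hat) ** 2))          # per-row MSE shrinks as q grows
```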
Transformer Quantization. Table 1 presents our BCQ scheme applied to the linear layers of the Transformers, with BCQ formats illustrated for the medium-sized GPT-2 model (which has a hidden size h of 1024). Note that if g is large enough that each scaling factor is shared by many weights, the amount of scaling factors is negligible compared to that of B. Hence, in Table 1, 1-bit quantization attains an almost 32x compression ratio compared to the FP32 format, while a lower g slightly increases the storage overhead induced by the additional scaling factors.
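The quantized sizes in Table 1 follow from simple bit counting: q·h_out·h_in binary bits plus one 32-bit scaling factor per group of g weights per bit. A small sanity check (our own arithmetic, assuming FP32 scaling factors as in footnote 2):

```python
def bcq_size_mb(h_out, h_in, q, g):
    """Storage (MB) of a q-bit BCQ linear layer with group size g and FP32 scaling factors."""
    binary_bits = q * h_out * h_in               # one {-1,+1} bit per weight, per quantization bit
    alpha_bits = q * h_out * (h_in // g) * 32    # one FP32 scaling factor per group, per bit
    return (binary_bits + alpha_bits) / 8 / 1e6

h = 1024  # GPT-2 medium hidden size
print(bcq_size_mb(3 * h, h, q=1, g=h))   # ATT_qkv,    q=1 -> ~0.41 MB (Table 1)
print(bcq_size_mb(h, h, q=3, g=h))       # ATT_output, q=3 -> ~0.41 MB
print(bcq_size_mb(4 * h, h, q=2, g=h))   # FFN_h_4h,   q=2 -> ~1.08 MB
# For reference, ATT_output in FP32 is 4.19 MB, so q=1 gives roughly a 31x reduction.
```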
4 AlphaTuning: Efficient Fine-Tuning of Quantized Models

4.1 AlphaTuning Principles

The key idea of AlphaTuning is to identify the parameters with the greatest expressive power, so as to minimize the number of trainable parameters after PTQ. Note that training affine parameters (which transform the activations through operations such as scaling, shifting, and rotating) reportedly achieves reasonably high accuracy even when all the other parameters are fixed to be random (Frankle et al.,