
AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation
of Large-Scale Pre-Trained Language Models
Se Jung Kwon1∗, Jeonghoon Kim1, Jeongin Bae1,4†, Kang Min Yoo1,2,3, Jin-Hwa Kim2,3,
Baeseong Park1, Byeongwook Kim1, Jung-Woo Ha2, Nako Sung1, and Dongsoo Lee1
1NAVER CLOVA  2NAVER AI Lab  3SNU AIIS  4KAIST
∗Corresponding author: sejung.kwon@navercorp.com
†Work done while at NAVER CLOVA
Abstract
There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression have not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10× compression ratio under 4-bit quantization and a >1,000× reduction in the number of trainable parameters.
1 Introduction
Self-supervised learning has facilitated the growth in the number of parameters used to construct pre-trained language models (PLMs) (e.g., Brown et al. (2020); Devlin et al. (2019)). We expect the model scaling of PLMs to continue, especially for Transformers (Vaswani et al., 2017), because their general capability follows a power law in parameter size, exhibiting "the high-level predictability and appearance of useful capabilities" (Ganguli et al., 2022). Therefore, Transformer-based PLMs have been studied with great enthusiasm for various applications, including natural language processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020; Smith et al., 2022; Rae et al., 2021; Hoffmann et al., 2022a; Chowdhery et al., 2022; Kim et al., 2021a), automatic speech recognition (Baevski et al., 2020), and computer vision (He et al., 2022; Xie et al., 2022).
Despite the impressive zero-shot or few-shot learning performance of PLMs, additional adaptation steps (e.g., fine-tuning on a target task) are required to further enhance performance on downstream tasks. Since each downstream task needs to load/store its own adaptation outcome, adapting PLMs with a limited number of trainable parameters is crucial for efficient deployment when multiple instances of distinct tasks are to be served (Li et al., 2018). Thus, various parameter-efficient adaptation techniques, such as adapter modules (Houlsby et al., 2019), low-rank adaptation (Hu et al., 2022), prefix-tuning (Li and Liang, 2021), prompt tuning (Liu et al., 2021a; Gao et al., 2020), and p-tuning (Liu et al., 2021b), have been proposed.
Although trainable parameters can be significantly reduced by parameter-efficient adaptation schemes, we notice that the memory footprints for inference are not reduced compared to those of PLMs.¹ To enable efficient deployments of multiple downstream tasks, we incorporate model compression and parameter-efficient adaptation. We argue that previous model compression techniques were not practical solutions in terms of parameter efficiency for adaptations. For example, Quantization-Aware Training (QAT) (Jacob et al., 2018; Esser et al., 2020) can perform full fine-tuning coupled with model compression; however, each task needs dedicated memory storage as large as that of a compressed PLM. Our key observation to achieve a compression-aware parameter-efficient

¹In practice, the adaptation is usually implemented by adding small additional parameters to PLMs.