DyLoRA: Parameter-Efficient Tuning of Pretrained Models using
Dynamic Search-Free Low Rank Adaptation
Mojtaba Valipour (1,2), Mehdi Rezagholizadeh (2), Ivan Kobyzev (2), Ali Ghodsi (1)
{mojtaba.valipour, ali.ghodsi}@uwaterloo.ca, {mehdi.rezagholizadeh, ivan.kobyzev}@huawei.com
1: University of Waterloo, 2: Huawei Noah’s Ark Lab
Abstract
With the ever-growing size of pretrained models (PMs), fine-tuning them has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pretrained weights of the model frozen and just introduce some learnable truncated SVD modules (so-called LoRA blocks) to the model. While LoRA blocks are parameter-efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, then we need to re-train them from scratch); second, optimizing their rank requires an exhaustive search and effort. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) technique to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting the representation learned by the adapter module at different ranks during training. We evaluate our solution on different natural language understanding (GLUE benchmark) and language generation tasks (E2E, DART and WebNLG) using different pretrained models such as RoBERTa and GPT with different sizes. Our results show that we can train dynamic search-free models with DyLoRA at least 4 to 7 times (depending on the task) faster than LoRA without significantly compromising performance. Moreover, our models can perform consistently well on a much larger range of ranks compared to LoRA.
1 github.com/huawei-noah/KD-NLP/tree/main/DyLoRA
1 Introduction
Pre-training/fine-tuning has become a popular paradigm for solving many tasks in natural language processing (NLP) (Devlin et al., 2018; Liu et al., 2019; Brown et al., 2020) and Computer Vision (Simonyan and Zisserman, 2014; He et al., 2016; Howard et al., 2019; Bochkovskiy et al., 2020; Chen et al., 2020; Dosovitskiy et al., 2020). Pretrained models (PMs) such as pretrained language models (PLMs) (Devlin et al., 2018; Brown et al., 2020) and pretrained visual-language models (Lu et al., 2019; Li et al., 2019; Su et al., 2019; Xia et al., 2021) have advanced a lot in recent years. With the ever-growing size of these pretrained models, fine-tuning them on downstream tasks becomes more expensive. Moreover, as the ratio of the number of parameters of models with respect to the labeled data increases, the fine-tuning process will be more prone to overfitting (Karimi Mahabadi et al., 2021). There are two categories of solutions: first, model compression (Jafari et al., 2021; Chen et al., 2021); second, parameter-efficient tuning (PET) (Houlsby et al., 2019a; Karimi Mahabadi et al., 2021; Mao et al., 2021).
There are many different model compression techniques in the literature for Transformer-based models, such as matrix factorization (Noach and Goldberg, 2020; Tahaei et al., 2021), pruning (Wang et al., 2019), quantization (Tao et al., 2022; Prato et al., 2020), and knowledge distillation (Hinton et al., 2015; Li et al., 2021; Jafari et al., 2021; Passban et al., 2021; Rashid et al., 2021). There are also different types of PET techniques in the literature, such as low-rank adapters (Wang et al., 2020; Karimi Mahabadi et al., 2021; Houlsby et al., 2019b; Hu et al., 2021b) and prompt-based techniques (Lester et al., 2021).
Although model compression solutions are well-established in the literature, applying them to large language models can be very costly, because compression techniques usually need to train (or fine-tune) the original large model. A case in point is knowledge distillation, which relies on fine-tuning a large teacher model or even pre-training the student model, as suggested in (Jiao et al., 2019). Moreover, using compression techniques usually leads to degrading the model performance. PETs can be alternatives to the compression methods, especially when we would like to use the full capacity of the large pretrained models with light training efforts (such as the language-model-as-a-service scenario (Sun et al., 2022)). Among PET techniques, low-rank adapters have received much attention because, in contrast to prompt-tuning techniques, low-rank adapters do not add to the sequence length, get trained faster, and perform better (Karimi Mahabadi et al., 2021). Even though there are several low-rank adaptation techniques in the literature, such as Adapter (Houlsby et al., 2019b), Compacter (Karimi Mahabadi et al., 2021), and LoRA (Hu et al., 2021b), they all suffer from two major common problems: first, it is not clear how to select the size of their rank (while their performance is very sensitive to this rank selection); second, their training is static, which means that if a low-rank model is trained for a particular rank size, it will not work well at other rank values (i.e., for any other rank value we need to train a separate model).

Figure 1: DyLoRA: the overall diagram of our proposed method. In each iteration, we sample from a pre-defined random distribution, which helps us to truncate the up-projection and down-projection matrices in the LoRA (Hu et al., 2021a) objective. [The figure shows the frozen pretrained weights alongside the DyLoRA parameter updates in the forward pass.]
This paper proposes a dynamic low-rank adapter technique (DyLoRA) to address these two problems. Without loss of generality, we focus on LoRA (Hu et al., 2021a) and train LoRA blocks for a range of ranks instead of a single rank by sorting out the representation learned at different ranks during training. While our model is more flexible, it can outperform LoRA in a much wider range of ranks without adding to the training time. Moreover, our technique does not need extra training for searching across ranks. We summarize our contributions in the following:
• Dynamic LoRA: On top of LoRA, we developed a new algorithm (DyLoRA) that makes it dynamic at inference time without incurring extra costs.

• Search-free LoRA: We demonstrate that by making a negligible compromise in performance, it is possible to avoid the costly search process of choosing the optimal rank for LoRA.
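To make the rank-sampling idea above concrete, the following is a minimal PyTorch sketch of a DyLoRA-style linear layer. It is illustrative only, not the authors' released implementation: at every training step a rank b is sampled and only the first b rows/columns of the down-/up-projection matrices are used, so after training the same module can serve any rank up to the maximum. The uniform rank distribution, the alpha/b scaling, and all names are assumptions made for this example.

```python
import torch
import torch.nn as nn

class DyLoRALinear(nn.Module):
    """Sketch of a LoRA layer whose rank is sampled per step (DyLoRA-style)."""

    def __init__(self, in_features, out_features, max_rank=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (in practice, loaded from the PLM).
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Trainable LoRA factors covering ranks 1..max_rank.
        self.lora_down = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.lora_up = nn.Parameter(torch.zeros(out_features, max_rank))
        self.max_rank = max_rank
        self.alpha = alpha

    def forward(self, x, rank=None):
        # Training: sample b from a pre-defined distribution (uniform here).
        # Inference: any rank 1..max_rank can be requested explicitly.
        b = rank if rank is not None else torch.randint(1, self.max_rank + 1, (1,)).item()
        down_b = self.lora_down[:b, :]   # truncate down-projection to b rows
        up_b = self.lora_up[:, :b]       # truncate up-projection to b columns
        scale = self.alpha / b
        return x @ self.weight.t() + scale * (x @ down_b.t()) @ up_b.t()

layer = DyLoRALinear(768, 768, max_rank=8)
x = torch.randn(4, 768)
y_train = layer(x)          # a random rank b is sampled for this step
y_rank2 = layer(x, rank=2)  # the same module queried at rank 2 after training
```

Because the lower-indexed rows/columns are used by every sampled rank, they end up carrying the most important information, which is what makes a single trained module usable at many ranks.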
2 Related Work
This section reviews low-rank adaptation techniques for parameter-efficient tuning and potential existing solutions to make these techniques dynamic and search-free.
It has been shown in (Aghajanyan et al., 2020) that for classification tasks such as natural language understanding (NLU), PLMs have a low intrinsic dimension. This observation motivates the use of low-rank adapters for parameter-efficient tuning. There are several low-rank adapters in the literature, such as LoRA (Hu et al., 2021b), Adapter (Houlsby et al., 2019b), Compacter (Karimi Mahabadi et al., 2021), and Parallel Adapter (PA) (He et al., 2021). LoRA is a low-rank up-projection/down-projection transformation without any non-linearity, applied in parallel to the key and value attention matrices. The main benefit of LoRA is that the adapter module, after training, can be integrated into the original weight matrices of the model, which in turn can lead to a very efficient inference time. Adapters also have a low-rank up-projection/down-projection transformation, but with an intermediate non-linearity. The Adapter module is applied in series with the feed-forward network (FFN). Having the adapter module in line with other blocks in the model can increase the inference time of the model. PA is a faster version of the Adapter, which can be applied in parallel with the FFN block. Compacter is a more memory-efficient version of the Adapter, which deploys the sum of Kronecker products to reconstruct each up-projection and down-projection matrix. All these low-rank adapters suffer from two major issues: first, finding the best rank requires heavy exhaustive training and search; second, the tuned adapter module works well only with a particular rank.
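As a concrete illustration of why LoRA incurs no inference overhead (unlike the in-series Adapter, whose non-linearity prevents such merging), here is a small sketch of folding a trained low-rank update into the frozen weight. The function name and the alpha/rank scaling convention are assumptions made for this example, not a specific library API.

```python
import torch

def merge_lora(weight, lora_up, lora_down, alpha, rank):
    """Return W + (alpha / rank) * B @ A, with B the up- and A the down-projection."""
    return weight + (alpha / rank) * lora_up @ lora_down

d, r = 768, 8
W = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01   # trained down-projection
B = torch.randn(d, r) * 0.01   # trained up-projection
W_merged = merge_lora(W, B, A, alpha=16, rank=r)

x = torch.randn(2, d)
out_adapter = x @ W.t() + (16 / r) * (x @ A.t()) @ B.t()  # adapter kept separate
out_merged = x @ W_merged.t()                             # adapter folded into W
print(torch.allclose(out_adapter, out_merged, atol=1e-4))  # True
```

After merging, the adapted layer is a plain dense layer, so serving cost is identical to the original model; the in-series Adapter cannot be collapsed this way because of its intermediate non-linearity.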
While there have been some efforts in the literature towards dynamic networks, such as DynaBERT (Hou et al., 2020) and GradMax (Evci et al., 2022), to the best of our knowledge, this problem is still open for factorized networks and low-rank adapters. DRONE (Chen et al., 2021) proposes a technique for data-aware low-rank model compression; however, their approach is not search-free, and it is not dynamic either. DynaBERT introduces a two-stage method to train width- and depth-wise dynamic networks. However, DynaBERT requires a teacher model fine-tuned on the task to train its sub-networks, which makes it unsuitable for PET techniques. GradMax is a technique that gradually adds neurons to a network without touching the already trained neurons, but it is unclear how GradMax could be deployed to alleviate the rank-search problem in low-rank adapters. Wang et al. (2019) propose a structured pruning technique called factorized low-rank pruning (FLOP). FLOP decomposes the weight matrices of a network into a sum of rank-1 components, which are regularized during training to gain sparsity. It is worth mentioning that FLOP aims at compressing the main model, and even if it can be used for finding a good rank in the low-rank representation of full-weight matrices, the final low-rank model will not be dynamic (i.e., it is trained well only for one rank and not a range of ranks, the same as LoRA). In this paper, we propose a new methodology for training low-rank modules for multiple ranks simultaneously rather than training a single-rank adapter at a time (without changing the training budget). Inspired by the idea of nested dropout (Rippel et al., 2014), we pursue ordering the representations of the bottleneck at the low-rank adapter modules with a new recipe. To the best of our knowledge, this is the first time that the concept of ordering representations has been deployed in training PLMs.
3 Background
3.1 Nested Dropout
Inspired by dropout (Hinton et al., 2012), nested dropout (Rippel et al., 2014) is a stochastic regularization technique that targets enforcing ordered representations in training auto-encoders. Nested dropout adds an implicit bias (which does not exist in dropout) to favor order in training. For example, in dropout we can randomly drop any nodes or units in the network, but in nested dropout, if we randomly select the $k$-th unit, then we keep all the units indexed from $1$ to $k$ and drop the units with indices larger than $k$. Therefore, nested dropout tends toward accommodating more important information in lower indices while learning representations.
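A tiny sketch of the difference just described (illustrative, not from the paper): standard dropout keeps an arbitrary subset of units, whereas nested dropout samples an index k and keeps only the contiguous prefix of units 1..k, which is what induces an ordering. The uniform choice of k is an assumption here.

```python
import torch

K = 8
x = torch.randn(K)

# Standard dropout: each unit is kept or dropped independently.
keep_mask = torch.rand(K) > 0.5
dropout_x = x * keep_mask

# Nested dropout: sample k, keep units 1..k, zero out units k+1..K.
k = torch.randint(1, K + 1, (1,)).item()
nested_x = x.clone()
nested_x[k:] = 0.0
print(k, nested_x)
```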
Following the notations of (Rippel et al., 2014), nested dropout assumes an auto-encoder mapping of $N$ training examples $\{y_i\}_{i=1}^{N} \subset Y$, $Y \subset \mathbb{R}^{D}$, to their corresponding representations $\{x_i\}_{i=1}^{N} \subset X$, $X \subset \mathbb{R}^{K}$, using the function $f_\theta: Y \to X$ with parameters $\theta$; these representations are then decoded using another function $g_\psi: X \to Y$ with parameters $\psi$ to reconstruct the inputs. The reconstruction loss can be defined as follows:

$$C(\theta, \psi) = \sum_{i=1}^{N} \lVert y_i - g_\psi(f_\theta(y_i)) \rVert^{2}. \quad (1)$$
Suppose we want to randomly drop some units in our representation vector $x$. In this regard, we sample a random variable $b \sim p_B(\cdot)$, $b \in \{1, 2, \dots, K\}$, from a pre-defined categorical distribution $p_B(\cdot)$ and truncate the functions $f_\theta$ and $g_\psi$ to keep their corresponding units indexed from $1$ to $b$ and drop the units indexed from $b+1$ to $K$. Let us define the $b$-truncated version of the vector $x$ as $x_b$ and the $b$-truncated versions of the functions $f_\theta$ and $g_\psi$ as $f_\theta^{b}$ and $g_\psi^{b}$, respectively. In this case, the reconstruction loss is redefined for the $b$-truncated model as follows:

$$C(\theta, \psi) = \mathbb{E}_{p_B}\left[C_b(\theta, \psi)\right] = \sum_{b=1}^{K} p_B(b)\, C_b(\theta, \psi), \quad \text{where} \quad C_b(\theta, \psi) = \sum_{i=1}^{N} \lVert y_i - g_\psi^{b}(f_\theta^{b}(y_i)) \rVert^{2}. \quad (2)$$
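As a sanity check of Eq. (2), here is a minimal sketch of computing the $b$-truncated reconstruction loss $C_b$ for a linear auto-encoder and optimizing it with a sampled $b$ per step. The uniform choice of $p_B$, the linear modules, and all names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

D, K, N = 16, 8, 32
f = nn.Linear(D, K, bias=False)  # encoder f_theta: Y -> X
g = nn.Linear(K, D, bias=False)  # decoder g_psi: X -> Y
y = torch.randn(N, D)            # training examples

def truncated_loss(b):
    # b-truncated model: keep only the first b units of the representation.
    x_b = f(y)[:, :b]                   # f_theta^b
    y_hat = x_b @ g.weight[:, :b].t()   # g_psi^b applied to the truncated code
    return ((y - y_hat) ** 2).sum()     # C_b(theta, psi)

# Per-step objective: sample b ~ p_B (uniform here) and minimise C_b;
# over many steps this optimises sum_b p_B(b) C_b(theta, psi) in expectation.
b = torch.randint(1, K + 1, (1,)).item()
loss = truncated_loss(b)
loss.backward()  # gradients flow only through the first b units used this step
```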