DyLoRA: Parameter-Efficient Tuning of Pretrained Models using
Dynamic Search-Free Low Rank Adaptation
Mojtaba Valipour (1,2), Mehdi Rezagholizadeh (2), Ivan Kobyzev (2), Ali Ghodsi (1)
{mojtaba.valipour, ali.ghodsi}@uwaterloo.ca, {mehdi.rezagholizadeh, ivan.kobyzev}@huawei.com
1: University of Waterloo, 2: Huawei Noah’s Ark Lab
Abstract
With the ever-growing size of pretrained models (PMs), fine-tuning them has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pretrained weights of the model frozen and just introduce some learnable truncated SVD modules (so-called LoRA blocks) to the model. While LoRA blocks are parameter-efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, then we need to re-train them from scratch); second, optimizing their rank requires an exhaustive search and effort. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) technique to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting the representation learned by the adapter module at different ranks during training. We evaluate our solution on different natural language understanding (GLUE benchmark) and language generation tasks (E2E, DART and WebNLG) using different pretrained models such as RoBERTa and GPT with different sizes. Our results show that we can train dynamic search-free models with DyLoRA at least 4 to 7 times (depending on the task) faster than LoRA without significantly compromising performance. Moreover, our models can perform consistently well on a much larger range of ranks compared to LoRA.
1 github.com/huawei-noah/KD-NLP/tree/main/DyLoRA
1 Introduction
Pre-training/fine-tuning has become a popular paradigm for solving many tasks in natural language processing (NLP) (Devlin et al., 2018; Liu et al., 2019; Brown et al., 2020) and Computer Vision (Simonyan and Zisserman, 2014; He et al., 2016; Howard et al., 2019; Bochkovskiy et al., 2020; Chen et al., 2020; Dosovitskiy et al., 2020). Pretrained models (PMs) such as pretrained language models (PLMs) (Devlin et al., 2018; Brown et al., 2020) and pretrained visual-language models (Lu et al., 2019; Li et al., 2019; Su et al., 2019; Xia et al., 2021) have advanced a lot in recent years. With the ever-growing size of these pretrained models, fine-tuning them on downstream tasks becomes more expensive. Moreover, as the ratio of the number of parameters of models with respect to the labeled data increases, the fine-tuning process will be more prone to overfitting (Karimi Mahabadi et al., 2021). There are two categories of solutions: first, model compression (Jafari et al., 2021; Chen et al., 2021); second, parameter-efficient tuning (PET) (Houlsby et al., 2019a; Karimi Mahabadi et al., 2021; Mao et al., 2021).
There are many different model compression techniques in the literature for Transformer-based models, such as matrix factorization (Noach and Goldberg, 2020; Tahaei et al., 2021), pruning (Wang et al., 2019), quantization (Tao et al., 2022; Prato et al., 2020), and knowledge distillation (Hinton et al., 2015; Li et al., 2021; Jafari et al., 2021; Passban et al., 2021; Rashid et al., 2021). There are also different types of PET techniques in the literature, such as low-rank adapters (Wang et al., 2020; Karimi Mahabadi et al., 2021; Houlsby et al., 2019b; Hu et al., 2021b) and prompt-based techniques (Lester et al., 2021).
Although model compression solutions are well-established in the literature, applying them to large language models can be very costly, because compression techniques usually need to train (or fine-tune) the original large model. A case in point is knowledge distillation, which relies on fine-tuning a large teacher model or even pre-training the student model, as suggested in (Jiao et al., 2019). Moreover, using compression techniques usually leads to degrading the model performance. PETs can be alternatives to the compression methods, especially when we would like to use the full capacity of the large pretrained models with light training efforts (such as the language-model-as-a-service scenario (Sun et al., 2022)). Among PET techniques, low-rank adapters have received much attention because, in contrast to prompt-tuning techniques, low-rank adapters do not add to the sequence length, get trained faster, and perform better (Karimi Mahabadi et al., 2021). Even though there are several low-rank adaptation techniques in the literature, such as Adapter (Houlsby et al., 2019b), Compacter (Karimi Mahabadi et al., 2021), and LoRA (Hu et al., 2021b), they all suffer from two major common problems: first, it is not clear how to select the size of their rank (while their performance is very sensitive to this rank selection); second, their training is static, which means that if a low-rank model is trained for a particular rank size, it will not work well at other rank values (i.e., for any other rank value we need to train a separate model).

Figure 1: DyLoRA: the overall diagram of our proposed method. In each iteration, we sample from a pre-defined random distribution, which helps us to truncate the up-projection and down-projection matrices in the LoRA (Hu et al., 2021a) objective. [The figure shows the frozen pretrained weights alongside the DyLoRA parameter updates in the forward pass.]
This paper proposes a dynamic low-rank adapter technique (DyLoRA) to address these two problems. Without loss of generality, we focus on LoRA (Hu et al., 2021a) and train LoRA blocks for a range of ranks instead of a single rank by sorting out the representation learned at different ranks during training. While our model is more flexible, it can outperform LoRA in a much wider range of ranks without adding to the training time. Moreover, our technique does not need extra training for searching across ranks. We summarize our contributions in the following:
• Dynamic LoRA: On top of LoRA, we developed a new algorithm (DyLoRA) that makes it dynamic at inference time without incurring extra costs.

• Search-free LoRA: We demonstrate that by making a negligible compromise in performance, it is possible to avoid the costly search process of choosing the optimal rank for LoRA.
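To make the rank-sampling idea above concrete, the following is a minimal PyTorch sketch of a DyLoRA-style linear layer. It is illustrative only, not the authors' released implementation: at every training step a rank b is sampled and only the first b rows/columns of the down-/up-projection matrices are used, so after training the same module can serve any rank up to the maximum. The uniform rank distribution, the alpha/b scaling, and all names are assumptions made for this example.

```python
import torch
import torch.nn as nn

class DyLoRALinear(nn.Module):
    """Sketch of a LoRA layer whose rank is sampled per step (DyLoRA-style)."""

    def __init__(self, in_features, out_features, max_rank=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (in practice, loaded from the PLM).
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Trainable LoRA factors covering ranks 1..max_rank.
        self.lora_down = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.lora_up = nn.Parameter(torch.zeros(out_features, max_rank))
        self.max_rank = max_rank
        self.alpha = alpha

    def forward(self, x, rank=None):
        # Training: sample b from a pre-defined distribution (uniform here).
        # Inference: any rank 1..max_rank can be requested explicitly.
        b = rank if rank is not None else torch.randint(1, self.max_rank + 1, (1,)).item()
        down_b = self.lora_down[:b, :]   # truncate down-projection to b rows
        up_b = self.lora_up[:, :b]       # truncate up-projection to b columns
        scale = self.alpha / b
        return x @ self.weight.t() + scale * (x @ down_b.t()) @ up_b.t()

layer = DyLoRALinear(768, 768, max_rank=8)
x = torch.randn(4, 768)
y_train = layer(x)          # a random rank b is sampled for this step
y_rank2 = layer(x, rank=2)  # the same module queried at rank 2 after training
```

Because the lower-indexed rows/columns are used by every sampled rank, they end up carrying the most important information, which is what makes a single trained module usable at many ranks.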
2 Related Work
This section reviews low-rank adaptation techniques for parameter-efficient tuning and potential existing solutions to make these techniques dynamic and search-free.
It has been shown in (Aghajanyan et al., 2020) that for classification tasks such as natural language understanding (NLU), PLMs have a low intrinsic dimension. This observation motivates the use of low-rank adapters for parameter-efficient tuning. There are several low-rank adapters in the literature, such as LoRA (Hu et al., 2021b), Adapter (Houlsby et al., 2019b), Compacter (Karimi Mahabadi et al., 2021), and Parallel Adapter (PA) (He et al., 2021). LoRA is a low-rank up-projection/down-projection transformation without any non-linearity, applied in parallel to the key and value attention matrices. The main benefit of LoRA is that the adapter module, after training, can be integrated into the original weight matrices of the model, which in turn can lead to a very efficient inference time. Adapters also have a low-rank up-projection/down-projection transformation, but with an intermediate non-linearity. The Adapter module is applied in series with the feed-forward network (FFN). Having the adapter module in line with other blocks in the model can increase the inference time of the model. PA is a faster version of the Adapter, which can be applied in parallel with the FFN block. Compacter is a more memory-efficient version of the Adapter, which deploys the sum of Kronecker products to reconstruct each up-projection and down-projection matrix. All these low-rank adapters suffer from two major issues: first, finding the best rank requires heavy exhaustive training and search; second, the tuned adapter module works well only with a particular rank.
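As a concrete illustration of why LoRA incurs no inference overhead (unlike the in-series Adapter, whose non-linearity prevents such merging), here is a small sketch of folding a trained low-rank update into the frozen weight. The function name and the alpha/rank scaling convention are assumptions made for this example, not a specific library API.

```python
import torch

def merge_lora(weight, lora_up, lora_down, alpha, rank):
    """Return W + (alpha / rank) * B @ A, with B the up- and A the down-projection."""
    return weight + (alpha / rank) * lora_up @ lora_down

d, r = 768, 8
W = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01   # trained down-projection
B = torch.randn(d, r) * 0.01   # trained up-projection
W_merged = merge_lora(W, B, A, alpha=16, rank=r)

x = torch.randn(2, d)
out_adapter = x @ W.t() + (16 / r) * (x @ A.t()) @ B.t()  # adapter kept separate
out_merged = x @ W_merged.t()                             # adapter folded into W
print(torch.allclose(out_adapter, out_merged, atol=1e-4))  # True
```

After merging, the adapted layer is a plain dense layer, so serving cost is identical to the original model; the in-series Adapter cannot be collapsed this way because of its intermediate non-linearity.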
While there have been some efforts in the literature towards dynamic networks, such as DynaBERT (Hou et al., 2020) and GradMax (Evci et al., 2022), to the best of our knowledge, this problem is still open for factorized networks and low-rank adapters. DRONE (Chen et al., 2021) proposes a technique for data-aware low-rank model compression; however, their approach is not search-free, and it is not dynamic either. DynaBERT introduces a two-stage method to train width- and depth-wise dynamic networks. However, DynaBERT requires a teacher model fine-tuned on the task to train its sub-networks, which makes it unsuitable for PET techniques. GradMax is a technique that gradually adds neurons to a network without touching the already trained neurons, but it is unclear how GradMax could be deployed to alleviate the rank-search problem in low-rank adapters. Wang et al. (2019) propose a structured pruning technique called factorized low-rank pruning (FLOP). FLOP decomposes the weight matrices of a network into a sum of rank-1 components, which are regularized during training to gain sparsity. It is worth mentioning that FLOP aims at compressing the main model, and even if it can be used for finding a good rank in the low-rank representation of full-weight matrices, the final low-rank model will not be dynamic (i.e., it is trained well only for one rank and not a range of ranks, the same as LoRA). In this paper, we propose a new methodology for training low-rank modules for multiple ranks simultaneously rather than training a single-rank adapter at a time (without changing the training budget). Inspired by the idea of nested dropout (Rippel et al., 2014), we pursue ordering the representations of the bottleneck at the low-rank adapter modules with a new recipe. To the best of our knowledge, this is the first time that the concept of ordering representations has been deployed in training PLMs.
3 Background
3.1 Nested Dropout
Inspired by dropout (Hinton et al., 2012), nested dropout (Rippel et al., 2014) is a stochastic regularization technique that targets enforcing ordered representations in training auto-encoders. Nested dropout adds an implicit bias (which does not exist in dropout) to favor order in training. For example, in dropout we can randomly drop any nodes or units in the network, but in nested dropout, if we randomly select the $k$-th unit, then we keep all the units indexed from $1$ to $k$ and drop the units with indices larger than $k$. Therefore, nested dropout tends toward accommodating more important information in lower indices while learning representations.
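A tiny sketch of the difference just described (illustrative, not from the paper): standard dropout keeps an arbitrary subset of units, whereas nested dropout samples an index k and keeps only the contiguous prefix of units 1..k, which is what induces an ordering. The uniform choice of k is an assumption here.

```python
import torch

K = 8
x = torch.randn(K)

# Standard dropout: each unit is kept or dropped independently.
keep_mask = torch.rand(K) > 0.5
dropout_x = x * keep_mask

# Nested dropout: sample k, keep units 1..k, zero out units k+1..K.
k = torch.randint(1, K + 1, (1,)).item()
nested_x = x.clone()
nested_x[k:] = 0.0
print(k, nested_x)
```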
Following the notations of (Rippel et al., 2014), nested dropout assumes an auto-encoder mapping of $N$ training examples $\{y_i\}_{i=1}^{N} \subset Y$, $Y \subset \mathbb{R}^{D}$, to their corresponding representations $\{x_i\}_{i=1}^{N} \subset X$, $X \subset \mathbb{R}^{K}$, using the function $f_\theta: Y \to X$ with parameters $\theta$; these representations are then decoded using another function $g_\psi: X \to Y$ with parameters $\psi$ to reconstruct the inputs. The reconstruction loss can be defined as follows:

$$C(\theta, \psi) = \sum_{i=1}^{N} \lVert y_i - g_\psi(f_\theta(y_i)) \rVert^{2}. \quad (1)$$
Suppose we want to randomly drop some units in our representation vector $x$. In this regard, we sample a random variable $b \sim p_B(\cdot)$, $b \in \{1, 2, \dots, K\}$, from a pre-defined categorical distribution $p_B(\cdot)$ and truncate the functions $f_\theta$ and $g_\psi$ to keep their corresponding units indexed from $1$ to $b$ and drop the units indexed from $b+1$ to $K$. Let us define the $b$-truncated version of the vector $x$ as $x_b$ and the $b$-truncated versions of the functions $f_\theta$ and $g_\psi$ as $f_\theta^{b}$ and $g_\psi^{b}$, respectively. In this case, the reconstruction loss is redefined for the $b$-truncated model as follows:

$$C(\theta, \psi) = \mathbb{E}_{p_B}\left[C_b(\theta, \psi)\right] = \sum_{b=1}^{K} p_B(b)\, C_b(\theta, \psi), \quad \text{where} \quad C_b(\theta, \psi) = \sum_{i=1}^{N} \lVert y_i - g_\psi^{b}(f_\theta^{b}(y_i)) \rVert^{2}. \quad (2)$$
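As a sanity check of Eq. (2), here is a minimal sketch of computing the $b$-truncated reconstruction loss $C_b$ for a linear auto-encoder and optimizing it with a sampled $b$ per step. The uniform choice of $p_B$, the linear modules, and all names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

D, K, N = 16, 8, 32
f = nn.Linear(D, K, bias=False)  # encoder f_theta: Y -> X
g = nn.Linear(K, D, bias=False)  # decoder g_psi: X -> Y
y = torch.randn(N, D)            # training examples

def truncated_loss(b):
    # b-truncated model: keep only the first b units of the representation.
    x_b = f(y)[:, :b]                   # f_theta^b
    y_hat = x_b @ g.weight[:, :b].t()   # g_psi^b applied to the truncated code
    return ((y - y_hat) ** 2).sum()     # C_b(theta, psi)

# Per-step objective: sample b ~ p_B (uniform here) and minimise C_b;
# over many steps this optimises sum_b p_B(b) C_b(theta, psi) in expectation.
b = torch.randint(1, K + 1, (1,)).item()
loss = truncated_loss(b)
loss.backward()  # gradients flow only through the first b units used this step
```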