Different Tunes Played with Equal Skill:
Exploring a Unified Optimization Subspace for Delta Tuning
Jing Yi1, Weize Chen1, Yujia Qin1, Yankai Lin2,3, Ning Ding1, Xu Han1,
Zhiyuan Liu1,4,5, Maosong Sun1,4,5, Jie Zhou6
1NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing
3Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing
4International Innovation Center of Tsinghua University, Shanghai
5Quan Cheng Laboratory 6Pattern Recognition Center, WeChat AI, Tencent Inc.
{yi-j20, chenwz21, qyj20}@mails.tsinghua.edu.cn
{liuzy,sms}@tsinghua.edu.cn
Abstract
Delta tuning (DET, also known as parameter-efficient tuning) is deemed the new paradigm for using pre-trained language models (PLMs). Up to now, various DETs with distinct design elements have been proposed, achieving performance on par with fine-tuning. However, the mechanisms behind this success are still under-explored, especially the connections among various DETs. To fathom the mystery, we hypothesize that the adaptations of different DETs could all be reparameterized as low-dimensional optimizations in a unified optimization subspace, which could be found by jointly decomposing independent solutions of different DETs. We then explore the connections among different DETs by conducting optimization within the subspace. In experiments, we find that, for a certain DET, conducting optimization simply in the subspace achieves performance comparable to that in its original space, and the solution found in the subspace can be transferred to another DET and achieve non-trivial performance. We also visualize the performance landscape of the subspace and find that there exists a substantial region where different DETs all perform well. Finally, we extend our analysis and show the strong connections between fine-tuning and DETs. The code is publicly available at https://github.com/thunlp/Unified-DeltaTuning.
1 Introduction
Serving as the critical backbone for NLP, pre-trained language models (PLMs) achieve superior performance when adapted to downstream
tasks (Han et al., 2021). Conventionally, the dominant way for such an adaptation is fine-tuning, which requires updating and storing all the parameters in PLMs. Consequently, with ever-larger PLMs continually being proposed (Raffel et al., 2019; Brown et al., 2020), fine-tuning becomes extremely computationally expensive. As an alternative, various delta tuning algorithms (DETs) have sprung up, which freeze most of the parameters and only optimize minimal adaptive parameters (Ding et al., 2022). Up to now, various DETs have been proposed, including introducing extra tunable neuron modules (Houlsby et al., 2019a), specifying partial parameters to be tunable (Ben Zaken et al., 2021), and re-parameterizing part of existing modules in PLMs (Hu et al., 2021b), etc. DETs extensively reduce the number of tunable parameters and still achieve downstream performance comparable to fine-tuning.
Despite the success of DETs, the mechanism behind it remains unclear. An essential question is: how do PLM adaptations using different DETs relate to each other? To answer this question, a direct exploration of the connections among different DETs is needed, but this runs into a problem: due to the versatile designs of DETs, the parameter spaces of various DETs are inherently different. To address this issue and investigate the above research question, we hypothesize that the adaptations of different DETs could be re-parameterized as low-dimensional optimizations in a unified optimization subspace. In this sense, optimizing various DETs can all be viewed as finding optimal solutions within the same subspace. Our hypothesis is inspired by recent findings that, despite owning huge amounts of parameters, PLMs have an extremely low intrinsic dimension (Aghajanyan et al., 2021; Qin et al., 2021). In this regard, optimizing a certain DET, which is typically a high-dimensional optimization problem, could be equivalently re-parameterized as a low-dimensional optimization problem while still achieving non-trivial performance.
To find evidence for our hypothesis, we design an analysis pipeline as follows: we first independently obtain solutions for different DETs on a set of tasks. Then we learn to project these solutions into a desired subspace. Meanwhile, we also define a mapping from the subspace back to each DET's original space. We contend that if the found subspace is indeed shared among various DETs, then two conditions should be satisfied: (1) the optimizations of different DETs could be equivalently conducted in the found subspace and achieve non-trivial performance, and (2) the local optima of various DETs have a substantial intersection in the subspace, which means the solution obtained in the subspace using a certain DET could be directly transferred to other DETs. If both conditions hold for the found subspace, then we can validate the existence of the unified optimization subspace for DETs.
We conduct experiments on a series of representative NLP tasks, and demonstrate that in the found subspace:
• Solutions are transferable. The solution of a DET in the found subspace not only achieves comparable performance to that in its original DET space, but can also be directly transferred to another DET, achieving non-trivial performance.
• Local optima of DETs greatly overlap. When visualizing the performance landscape, we find that there exists a substantial region where different DETs all perform well, indicating the close connections among different DETs.
• Fine-tuning has strong connections with DETs. We extend the above analysis to fine-tuning and show the strong connections between fine-tuning and DETs.
In general, our study is the first work to reveal the connections among different DETs and fine-tuning from the perspective of subspace optimization, and it uncovers the underlying mechanism of PLMs' downstream adaptation. We believe many applications, such as the ensemble and transfer among various DETs, can be well empowered by the unified optimization subspace. Our findings can be of interest to researchers who are working on designing better DETs, and may provide guidance for using DETs in many real-world scenarios.
2 Background
Delta Tuning.
DET has been regarded as the new paradigm for PLM adaptation. By training lightweight parameters, DET yields a compact and extensible model, and can achieve performance comparable to full-parameter fine-tuning. Up to now, various DET designs have sprung up. For instance, some introduce additional tunable modules after the feed-forward and attention modules in a PLM (Houlsby et al., 2019a; Pfeiffer et al., 2021); others prepend tunable prompt tokens to each attention layer (Li and Liang, 2021a) or only to the embedding layer (Lester et al., 2021). Another line of work re-parameterizes existing modules with low-rank decompositions (Hu et al., 2021b). Recently, researchers have demonstrated that existing DET algorithms can be combined and achieve better performance (He et al., 2021; Mao et al., 2021).
To fathom the mechanisms behind DET, He et al. (2021) pioneered the exploration of the connections among different DETs. They formalize various DETs as different ways to compute modifications to the hidden states, and thereby unify different DETs in terms of formulas. However, unification at the level of formulas does not reveal the essence of DETs' success, nor does it indicate that their internal mechanisms are unified. Our paper differs from theirs in that we explore whether DETs can be unified in terms of internal mechanisms through the lens of optimization. Specifically, we investigate whether the optimization of different DETs can be unified in a certain subspace.
Intrinsic Dimension.
Intrinsic dimension (Li et al., 2018) estimates the minimum number of tunable parameters needed to reach a satisfying performance for neural networks. Instead of training networks in their native parameter space, they linearly re-parameterize all the tunable parameters $\theta_0$ in a randomly oriented subspace: $\theta \leftarrow \theta_0 + \mathrm{Proj}(\theta_I)$, where $\mathrm{Proj}: \mathbb{R}^{|\theta_I|} \rightarrow \mathbb{R}^{|\theta_0|}$ denotes a random projection ($|\theta_I| \ll |\theta_0|$). During optimization, only the low-dimensional vector $\theta_I$ is tuned. Considering that $|\theta_0|$ could be extremely large, making computation of the projection intractable, Aghajanyan et al. (2021) reduce the computational complexity using the Fastfood transformation (Le et al., 2013). In experiments, they find that for PLMs, a low-dimensional (e.g., $|\theta_I| \sim 10^3$) re-parameterization could achieve over 85% of the performance of fine-tuning (where $|\theta_0|$ exceeds millions or even billions). Further, Qin et al. (2021) extend the tuning method from fine-tuning to prompt tuning (Lester et al., 2021). They demonstrate that the projection $\mathrm{Proj}$ can be trained in order to approximate a better optimization subspace. Based on previous explorations of the intrinsic subspace, we aim to validate the existence of a unified subspace for various tuning methods.
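To make this re-parameterization concrete, the following sketch (our own illustration, not code from the papers cited above) tunes only a low-dimensional vector $\theta_I$ and maps it back to the full parameter space through a frozen random projection; a dense Gaussian matrix stands in for the Fastfood transform, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class IntrinsicReparam(nn.Module):
    """Re-parameterize a flat parameter vector as theta_0 + Proj(theta_I).

    Only the low-dimensional theta_I is trainable; theta_0 and the random
    projection are frozen. A dense Gaussian matrix is used here for clarity
    instead of the Fastfood transform.
    """

    def __init__(self, theta_0: torch.Tensor, intrinsic_dim: int = 1000):
        super().__init__()
        self.register_buffer("theta_0", theta_0.detach().clone())       # frozen pre-trained weights
        proj = torch.randn(intrinsic_dim, theta_0.numel()) / intrinsic_dim ** 0.5
        self.register_buffer("proj", proj)                               # frozen random projection
        self.theta_I = nn.Parameter(torch.zeros(intrinsic_dim))          # the only trainable vector

    def forward(self) -> torch.Tensor:
        # theta = theta_0 + Proj(theta_I), mapping R^{|theta_I|} -> R^{|theta_0|}
        return self.theta_0 + self.theta_I @ self.proj


# Hypothetical usage: the reconstructed flat vector would be scattered back into
# the model's weight tensors before each forward pass (omitted here for brevity).
flat_weights = torch.randn(10_000)            # stand-in for the flattened PLM parameters
reparam = IntrinsicReparam(flat_weights, intrinsic_dim=1000)
print(reparam().shape)                        # torch.Size([10000])
```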
3 Preliminary
Following He et al. (2021), we investigate three representative DET algorithms to validate our hypothesis: Adapter (Houlsby et al., 2019a), Prefix-tuning (Li and Liang, 2021a), and LoRA (Hu et al., 2021b). We first recap the Transformer layer (Vaswani et al., 2017), and then give a brief review of the three DETs.
Transformer layer.
PLMs generally have multiple Transformer layers, each consisting of a multi-head attention (MHA) module and a feed-forward network (FFN). MHA is composed of $N_h$ attention heads, each containing a query / key / value weight matrix $W_q^{(i)}$ / $W_k^{(i)}$ / $W_v^{(i)} \in \mathbb{R}^{d \times d_h}$, where $d$ denotes the model dimension and $d_h = d / N_h$. Given a sequence of $n$ vectors $X \in \mathbb{R}^{n \times d}$, MHA parameterizes them into queries ($Q^{(i)}$), keys ($K^{(i)}$) and values ($V^{(i)}$) as follows:

$$Q^{(i)} = X W_q^{(i)}, \quad K^{(i)} = X W_k^{(i)}, \quad V^{(i)} = X W_v^{(i)}.$$

Each ($Q^{(i)}$, $K^{(i)}$, $V^{(i)}$) triple is then fed into a self-attention function to obtain the $i$-th head's representation $H_i$. All head representations are then concatenated and combined using an output weight matrix $W_o \in \mathbb{R}^{d \times d}$:

$$H_i = \mathrm{softmax}\!\left(\frac{Q^{(i)} (K^{(i)})^\top}{\sqrt{d_h}}\right) V^{(i)}, \qquad H = \mathrm{concat}(H_1, \ldots, H_{N_h})\, W_o.$$

The FFN module is a two-layer MLP:

$$\mathrm{FFN}(H) = \sigma(H W_1 + b_1) W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d \times d_m}$, $b_1 \in \mathbb{R}^{d_m}$, $W_2 \in \mathbb{R}^{d_m \times d}$, and $b_2 \in \mathbb{R}^{d}$; $d_m$ is often chosen larger than $d$.
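As a concrete reference for the notation above, here is a minimal PyTorch sketch of the MHA and FFN computations for a single layer; variable names are ours, and layer normalization, dropout, residual connections, and masking are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class MiniTransformerLayer(nn.Module):
    """Bare-bones MHA + FFN matching the notation above (no LayerNorm/dropout/mask)."""

    def __init__(self, d: int = 768, n_heads: int = 12, d_m: int = 3072):
        super().__init__()
        assert d % n_heads == 0
        self.n_heads, self.d_h = n_heads, d // n_heads
        self.W_q = nn.Linear(d, d, bias=False)   # stacks the per-head W_q^(i)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, d_m), nn.ReLU(), nn.Linear(d_m, d))

    def forward(self, X: torch.Tensor) -> torch.Tensor:   # X: (n, d)
        n, d = X.shape
        # Q^(i) = X W_q^(i), etc., computed for all heads at once, shaped (heads, n, d_h)
        Q, K, V = (proj(X).view(n, self.n_heads, self.d_h).transpose(0, 1)
                   for proj in (self.W_q, self.W_k, self.W_v))
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_h), dim=-1)
        H = (attn @ V).transpose(0, 1).reshape(n, d)       # concat(H_1, ..., H_Nh)
        H = self.W_o(H)
        return self.ffn(H)                                 # FFN(H) = sigma(H W_1 + b_1) W_2 + b_2
```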
Adapter.
Adapter (Houlsby et al., 2019a) plugs light-weight feed-forward networks into Transformer layers (after the MHA module and the FFN module). Each adapter layer typically consists of a down-projection matrix $W_{\text{down}} \in \mathbb{R}^{d \times r_A}$, a non-linear activation function $f(\cdot)$, and an up-projection matrix $W_{\text{up}} \in \mathbb{R}^{r_A \times d}$, where $r_A$ denotes the bottleneck dimension. Denoting the input as $X \in \mathbb{R}^{n \times d}$, the adapter applies a residual connection as follows:

$$X \leftarrow X + f(X W_{\text{down}}) W_{\text{up}}.$$
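A minimal sketch of this adapter computation is shown below; the class name, the choice of ReLU for $f(\cdot)$, and the default bottleneck size are illustrative assumptions rather than the exact configuration of Houlsby et al. (2019a).

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """X <- X + f(X W_down) W_up, inserted after the MHA and FFN modules."""

    def __init__(self, d: int = 768, r_A: int = 16):
        super().__init__()
        self.W_down = nn.Linear(d, r_A)   # down-projection, d -> r_A
        self.f = nn.ReLU()                # non-linear activation f(.)
        self.W_up = nn.Linear(r_A, d)     # up-projection, r_A -> d

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return X + self.W_up(self.f(self.W_down(X)))   # residual connection
```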
Prefix-tuning.
Prefix-tuning (Li and Liang, 2021a) extends the keys $K^{(i)}$ / values $V^{(i)}$ in every MHA module by prepending learnable prefix vectors $P_K^{(i)}$ / $P_V^{(i)} \in \mathbb{R}^{m \times d_h}$ before them, where $m$ denotes the number of virtual tokens. The output of an attention head $H_i$ can be re-formulated as:

$$H'_i = \mathrm{ATT}(Q^{(i)}, [P_K^{(i)}; K^{(i)}], [P_V^{(i)}; V^{(i)}]),$$

where $[\cdot\,;\cdot]$ denotes concatenation.
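The sketch below illustrates how the prefixes enter a single attention head, assuming one isolated head with its own projections; the module name, the prefix length, and the initialization scale are our own illustrative choices.

```python
import math
import torch
import torch.nn as nn

class PrefixAttentionHead(nn.Module):
    """One attention head whose keys/values are extended with m learnable prefix vectors."""

    def __init__(self, d: int = 768, d_h: int = 64, m: int = 20):
        super().__init__()
        self.W_q = nn.Linear(d, d_h, bias=False)
        self.W_k = nn.Linear(d, d_h, bias=False)
        self.W_v = nn.Linear(d, d_h, bias=False)
        self.P_K = nn.Parameter(torch.randn(m, d_h) * 0.02)   # learnable key prefixes
        self.P_V = nn.Parameter(torch.randn(m, d_h) * 0.02)   # learnable value prefixes

    def forward(self, X: torch.Tensor) -> torch.Tensor:       # X: (n, d)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        K = torch.cat([self.P_K, K], dim=0)                   # [P_K; K^(i)]
        V = torch.cat([self.P_V, V], dim=0)                   # [P_V; V^(i)]
        attn = torch.softmax(Q @ K.T / math.sqrt(Q.shape[-1]), dim=-1)
        return attn @ V                                        # H'_i
```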
LoRA.
LoRA (Hu et al., 2021b) re-parameterizes the weight update $\Delta W$ of a weight matrix $W$ in the MHA module with a low-rank decomposition, i.e., $\Delta W = W_A W_B$, where $W_A \in \mathbb{R}^{d \times r_L}$ and $W_B \in \mathbb{R}^{r_L \times d}$ are two learnable low-rank matrices, with $r_L$ typically being a small integer. For an input $X \in \mathbb{R}^{n \times d}$, LoRA is formulated as:

$$X \leftarrow X + s \cdot X W_A W_B,$$

where $s \geq 1$ is a scaling hyper-parameter.
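A common way to realize this formulation is to wrap a frozen linear layer with the trainable low-rank factors, as in the sketch below; the wrapper class and initialization choices are ours and only approximate the reference implementation of Hu et al. (2021b).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update s * X W_A W_B."""

    def __init__(self, base: nn.Linear, r_L: int = 8, s: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                    # freeze the pre-trained W
        d_in, d_out = base.in_features, base.out_features
        self.W_A = nn.Parameter(torch.randn(d_in, r_L) * 0.02)    # d x r_L
        self.W_B = nn.Parameter(torch.zeros(r_L, d_out))          # r_L x d, zero-initialized
        self.s = s

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # frozen X W (plus bias) combined with the scaled low-rank update
        return self.base(X) + self.s * (X @ self.W_A @ self.W_B)
```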
4 Analysis Pipeline
As mentioned before, we consider three representative DETs: Adapter ($t_A$), Prefix-tuning ($t_P$), and LoRA ($t_L$). Each DET $t$ defines a set of tunable parameters $\theta_t$. To adapt a PLM to a specific downstream task $\mathcal{T}_i$, we optimize $\theta^i_t$ to minimize the loss function $\mathcal{L}^i_{\text{task}}(\theta^i_t \,|\, \theta_0)$ defined by $\mathcal{T}_i$, where $\theta_0$ denotes the pre-trained weights. To verify our hypothesis that there exists a unified optimization subspace in which all DETs can achieve non-trivial performance, we propose a three-stage analysis pipeline (visualized in Figure 1): the first stage approximates the desired subspace, so that in the second stage the optimizations for different DETs can all be conducted in this subspace; this makes it possible to explore the connections among different DETs in the third stage.
Following Qin et al. (2021), to validate the generality of the found subspace and avoid information leakage, we approximate the subspace with a series of training tasks $\mathcal{T}_{\text{train}}$, and conduct subsequent subspace optimization on unseen tasks $\mathcal{T}_{\text{test}}$.

Figure 1: Illustration of our analysis pipeline, consisting of (1) subspace approximation, which jointly decomposes DET solutions into a shared subspace, (2) subspace optimization, which finds subspace solutions for a specific DET, and (3) subspace solution transfer, which transfers the subspace solution from a source DET to other DETs.
Subspace Approximation.
To approximate the desired subspace, we decompose and then reconstruct independent DET solutions on $\mathcal{T}_{\text{train}}$. We first train DETs in their original space, and for each task $\mathcal{T}_i \in \mathcal{T}_{\text{train}}$, we obtain three independent solutions: $\theta^i_{t_A}$, $\theta^i_{t_P}$, and $\theta^i_{t_L}$. Then we assign a down-projection $\mathrm{Proj}^{\text{down}}_t: \mathbb{R}^{|\theta^i_t|} \rightarrow \mathbb{R}^{y}$ and an up-projection $\mathrm{Proj}^{\text{up}}_t: \mathbb{R}^{y} \rightarrow \mathbb{R}^{|\theta^i_t|}$ to each DET $t$, where $y$ is the dimension of the intrinsic subspace. In practice, both the down-projection and the up-projection are MLP layers. Each down-projection decomposes a DET solution into a low-dimensional intrinsic vector $I^i_t \in \mathbb{R}^{y}$:

$$I^i_t = \mathrm{Proj}^{\text{down}}_t(\theta^i_t).$$

The three intrinsic vectors $I^i_{t_A}$, $I^i_{t_P}$, $I^i_{t_L}$ represent different local minima of $\mathcal{T}_i$ in the same subspace. Ideally, if the three DETs can be unified in the subspace, then each vector $I^i_t$ could be used to reconstruct any DET solution ($\theta^i_{t_A}$, $\theta^i_{t_P}$, or $\theta^i_{t_L}$). Therefore, to approximate such a subspace, we efficiently facilitate the interaction among different DETs by dynamically sampling two random ratios $\alpha \in [0, 1]$, $\beta \in [0, 1 - \alpha]$, and computing an interpolation of the three intrinsic vectors of $\mathcal{T}_i$:

$$I^i_{\alpha;\beta} = \alpha \cdot I^i_{t_A} + \beta \cdot I^i_{t_P} + (1 - \alpha - \beta) \cdot I^i_{t_L}.$$

The interpolation is mapped by each up-projection $\mathrm{Proj}^{\text{up}}_t$ to reconstruct the task solution for each DET by minimizing the following loss function:

$$\mathcal{L}^i_{\text{dist}}(\hat{\theta}^i_t) = \|\hat{\theta}^i_t - \theta^i_t\|_2, \qquad \hat{\theta}^i_t = \mathrm{Proj}^{\text{up}}_t(I^i_{\alpha;\beta}).$$

To properly guide the reconstructed $\hat{\theta}^i_t$ to solve task $\mathcal{T}_i$, we also incorporate the original task loss $\mathcal{L}^i_{\text{task}}$. The overall training objective can be formulated as follows:

$$\mathcal{L}_{\text{pet}} = \sum_{i=1}^{|\mathcal{T}_{\text{train}}|} \sum_{t \in \{t_A, t_P, t_L\}} \Big[ \mathcal{L}^i_{\text{dist}}(\hat{\theta}^i_t) + \mathcal{L}^i_{\text{task}}(\hat{\theta}^i_t \,|\, \theta_0) \Big].$$

During this stage, only the down-projections and up-projections are optimized; all other parameters are kept frozen. When this stage finishes, the two projections can be seen as mappings between the unified subspace and each DET's original space.
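The sketch below schematically combines the down-projections, the random interpolation, and the reconstruction loss of this stage for a single task; the MLP architectures, dimensions, and the way DET parameters are flattened are placeholder assumptions, the task loss $\mathcal{L}^i_{\text{task}}$ is indicated only as a comment, and the released repository above remains the authoritative implementation.

```python
import torch
import torch.nn as nn

DETS = ["adapter", "prefix", "lora"]

class SubspaceProjections(nn.Module):
    """Per-DET down-/up-projections between flattened DET parameters and a shared y-dim subspace."""

    def __init__(self, det_dims: dict, y: int = 100, hidden: int = 256):
        super().__init__()
        self.down = nn.ModuleDict({t: nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, y))
                                   for t, d in det_dims.items()})
        self.up = nn.ModuleDict({t: nn.Sequential(nn.Linear(y, hidden), nn.Tanh(), nn.Linear(hidden, d))
                                 for t, d in det_dims.items()})

def approximation_step(proj: SubspaceProjections, solutions: dict) -> torch.Tensor:
    """One training step on a single task, given flattened solutions {det_name: theta_t^i}."""
    intrinsic = {t: proj.down[t](solutions[t]) for t in DETS}        # I_t^i = Proj_down(theta_t^i)
    alpha = torch.rand(1).item()
    beta = torch.rand(1).item() * (1.0 - alpha)                       # alpha in [0,1], beta in [0,1-alpha]
    I = alpha * intrinsic["adapter"] + beta * intrinsic["prefix"] + (1 - alpha - beta) * intrinsic["lora"]
    loss = 0.0
    for t in DETS:
        theta_hat = proj.up[t](I)                                     # reconstructed DET solution
        loss = loss + torch.norm(theta_hat - solutions[t], p=2)       # L_dist; L_task(theta_hat) would be added here
    return loss

# Hypothetical usage with toy dimensions for the three DETs:
dims = {"adapter": 1200, "prefix": 800, "lora": 1000}
proj = SubspaceProjections(dims, y=100)
solutions = {t: torch.randn(d) for t, d in dims.items()}
approximation_step(proj, solutions).backward()   # only the projections receive gradients
```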