Different Tunes Played with Equal Skill:
Exploring a Unified Optimization Subspace for Delta Tuning
Jing Yi1, Weize Chen1, Yujia Qin1, Yankai Lin2,3, Ning Ding1, Xu Han1,
Zhiyuan Liu1,4,5, Maosong Sun1,4,5, Jie Zhou6
1NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing
2Gaoling School of Artificial Intelligence, Renmin University of China, Beijing
3Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing
4International Innovation Center of Tsinghua University, Shanghai
5Quan Cheng Laboratory 6Pattern Recognition Center, WeChat AI, Tencent Inc.
{yi-j20, chenwz21, qyj20}@mails.tsinghua.edu.cn
{liuzy,sms}@tsinghua.edu.cn
Abstract
Delta tuning (DET, also known as parameter-efficient tuning) is deemed the new paradigm for using pre-trained language models (PLMs). Up to now, various DETs with distinct design elements have been proposed, achieving performance on par with fine-tuning. However, the mechanisms behind this success are still under-explored, especially the connections among various DETs. To fathom the mystery, we hypothesize that the adaptations of different DETs could all be reparameterized as low-dimensional optimizations in a unified optimization subspace, which could be found by jointly decomposing independent solutions of different DETs. We then explore the connections among different DETs by conducting optimization within the subspace. In experiments, we find that, for a certain DET, conducting optimization simply in the subspace achieves performance comparable to that in its original space, and the solution found in the subspace can be transferred to another DET and achieve non-trivial performance. We also visualize the performance landscape of the subspace and find that there exists a substantial region where different DETs all perform well. Finally, we extend our analysis and show the strong connections between fine-tuning and DETs. The code is publicly available at https://github.com/thunlp/Unified-DeltaTuning.
1 Introduction
Serving as the critical backbone for NLP, pre-trained language models (PLMs) achieve superior performance when adapted to downstream
tasks (Han et al., 2021). Conventionally, the dominant way for such an adaptation is fine-tuning, which requires updating and storing all the parameters in PLMs. Consequently, with ever-larger PLMs continually being proposed (Raffel et al., 2019; Brown et al., 2020), fine-tuning becomes extremely computationally expensive. As an alternative, various delta tuning algorithms (DETs) have sprung up, which freeze most of the parameters and only optimize minimal adaptive parameters (Ding et al., 2022). Up to now, various DETs have been proposed, including introducing extra tunable neuron modules (Houlsby et al., 2019a), specifying partial parameters to be tunable (Ben Zaken et al., 2021), and re-parameterizing part of existing modules in PLMs (Hu et al., 2021b), etc. DETs extensively reduce the number of tunable parameters and still achieve downstream performance comparable to fine-tuning.
Despite the success of DETs, the mechanism behind it remains unclear. An essential question is: how do PLM adaptations using different DETs relate to each other? To answer this question, a direct exploration of the connections among different DETs is needed, but this runs into a problem: due to the versatile designs of DETs, the parameter spaces of various DETs are inherently different. To address this issue and investigate the above research question, we hypothesize that the adaptations of different DETs could be re-parameterized as low-dimensional optimizations in a unified optimization subspace. In this sense, optimizing various DETs can all be viewed as finding optimal solutions within the same subspace. Our hypothesis is inspired by recent findings that, despite owning huge amounts of parameters, PLMs have an extremely low intrinsic dimension (Aghajanyan et al., 2021; Qin et al., 2021). In this regard, optimizing a certain DET, which is typically a high-dimensional optimization problem, could be equivalently re-parameterized as a low-dimensional optimization problem while still achieving non-trivial performance.
To find evidence for our hypothesis, we design an analysis pipeline as follows: we first independently obtain solutions for different DETs on a set of tasks. Then we learn to project these solutions into a desired subspace. Meanwhile, we also define a mapping from the subspace back to each DET's original space. We contend that if the found subspace is indeed shared among various DETs, then two conditions should be satisfied: (1) the optimizations of different DETs could be equivalently conducted in the found subspace and achieve non-trivial performance, and (2) the local optima of various DETs have a substantial intersection in the subspace, which means the solution obtained in the subspace using a certain DET could be directly transferred to other DETs. If both conditions hold for the found subspace, then we can validate the existence of the unified optimization subspace for DETs.
We conduct experiments on a series of representative NLP tasks, and demonstrate that in the found subspace:
• Solutions are transferable. The solution of a DET in the found subspace not only achieves comparable performance to that in its original DET space, but can also be directly transferred to another DET, achieving non-trivial performance.
• Local optima of DETs greatly overlap. When visualizing the performance landscape, we find that there exists a substantial region where different DETs all perform well, indicating the close connections among different DETs.
• Fine-tuning has strong connections with DETs. We extend the above analysis to fine-tuning and show the strong connections between fine-tuning and DETs.
In general, our study is the first work to reveal the connections among different DETs and fine-tuning from the perspective of subspace optimization, and it uncovers the underlying mechanism of PLMs' downstream adaptation. We believe many applications, such as the ensemble and transfer among various DETs, can be well empowered by the unified optimization subspace. Our findings can be of interest to researchers who are working on designing better DETs, and may provide guidance for using DETs in many real-world scenarios.
2 Background
Delta Tuning.
DET has been regarded as the new paradigm for PLM adaptation. By training lightweight parameters, DET yields a compact and extensible model, and can achieve performance comparable to full-parameter fine-tuning. Up to now, various DET designs have sprung up. For instance, some introduce additional tunable modules after the feed-forward and attention modules in a PLM (Houlsby et al., 2019a; Pfeiffer et al., 2021); others prepend tunable prompt tokens to each attention layer (Li and Liang, 2021a) or only to the embedding layer (Lester et al., 2021). Another line of work re-parameterizes existing modules with low-rank decompositions (Hu et al., 2021b). Recently, researchers have demonstrated that existing DET algorithms can be combined and achieve better performance (He et al., 2021; Mao et al., 2021).
To fathom the mechanisms behind DET, He et al. (2021) pioneered the exploration of the connections among different DETs. They formalize various DETs as different ways to compute modifications to the hidden states, and thereby unify different DETs in terms of formulas. However, unification at the level of formulas does not reveal the essence of DETs' success, nor does it indicate that their internal mechanisms are unified. Our paper differs from theirs in that we explore whether DETs can be unified in terms of internal mechanisms through the lens of optimization. Specifically, we investigate whether the optimization of different DETs can be unified in a certain subspace.
Intrinsic Dimension.
Intrinsic dimension (Li et al., 2018) estimates the minimum number of tunable parameters needed to reach a satisfying performance for neural networks. Instead of training networks in their native parameter space, they linearly re-parameterize all the tunable parameters $\theta_0$ in a randomly oriented subspace: $\theta \leftarrow \theta_0 + \mathrm{Proj}(\theta_I)$, where $\mathrm{Proj}: \mathbb{R}^{|\theta_I|} \rightarrow \mathbb{R}^{|\theta_0|}$ denotes a random projection ($|\theta_I| \ll |\theta_0|$). During optimization, only the low-dimensional vector $\theta_I$ is tuned. Considering that $|\theta_0|$ could be extremely large, making computation of the projection intractable, Aghajanyan et al. (2021) reduce the computational complexity using the Fastfood transformation (Le et al., 2013). In experiments, they find that for PLMs, a low-dimensional (e.g., $|\theta_I| \sim 10^3$) re-parameterization could achieve over 85% of the performance of fine-tuning (where $|\theta_0|$ exceeds millions or even billions). Further, Qin et al. (2021) extend the tuning method from fine-tuning to prompt tuning (Lester et al., 2021). They demonstrate that the projection $\mathrm{Proj}$ can be trained in order to approximate a better optimization subspace. Based on previous explorations of the intrinsic subspace, we aim to validate the existence of a unified subspace for various tuning methods.
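To make this re-parameterization concrete, the following sketch (our own illustration, not code from the papers cited above) tunes only a low-dimensional vector $\theta_I$ and maps it back to the full parameter space through a frozen random projection; a dense Gaussian matrix stands in for the Fastfood transform, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class IntrinsicReparam(nn.Module):
    """Re-parameterize a flat parameter vector as theta_0 + Proj(theta_I).

    Only the low-dimensional theta_I is trainable; theta_0 and the random
    projection are frozen. A dense Gaussian matrix is used here for clarity
    instead of the Fastfood transform.
    """

    def __init__(self, theta_0: torch.Tensor, intrinsic_dim: int = 1000):
        super().__init__()
        self.register_buffer("theta_0", theta_0.detach().clone())       # frozen pre-trained weights
        proj = torch.randn(intrinsic_dim, theta_0.numel()) / intrinsic_dim ** 0.5
        self.register_buffer("proj", proj)                               # frozen random projection
        self.theta_I = nn.Parameter(torch.zeros(intrinsic_dim))          # the only trainable vector

    def forward(self) -> torch.Tensor:
        # theta = theta_0 + Proj(theta_I), mapping R^{|theta_I|} -> R^{|theta_0|}
        return self.theta_0 + self.theta_I @ self.proj


# Hypothetical usage: the reconstructed flat vector would be scattered back into
# the model's weight tensors before each forward pass (omitted here for brevity).
flat_weights = torch.randn(10_000)            # stand-in for the flattened PLM parameters
reparam = IntrinsicReparam(flat_weights, intrinsic_dim=1000)
print(reparam().shape)                        # torch.Size([10000])
```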
3 Preliminary
Following He et al. (2021), we investigate three representative DET algorithms to validate our hypothesis: Adapter (Houlsby et al., 2019a), Prefix-tuning (Li and Liang, 2021a), and LoRA (Hu et al., 2021b). We first recap the Transformer layer (Vaswani et al., 2017), and then give a brief review of the three DETs.
Transformer layer.
PLMs generally have multiple Transformer layers, each consisting of a multi-head attention (MHA) module and a feed-forward network (FFN). MHA is composed of $N_h$ attention heads, each containing a query / key / value weight matrix $W_q^{(i)}$ / $W_k^{(i)}$ / $W_v^{(i)} \in \mathbb{R}^{d \times d_h}$, where $d$ denotes the model dimension and $d_h = d / N_h$. Given a sequence of $n$ vectors $X \in \mathbb{R}^{n \times d}$, MHA parameterizes them into queries ($Q^{(i)}$), keys ($K^{(i)}$) and values ($V^{(i)}$) as follows:

$$Q^{(i)} = X W_q^{(i)}, \quad K^{(i)} = X W_k^{(i)}, \quad V^{(i)} = X W_v^{(i)}.$$

Each ($Q^{(i)}$, $K^{(i)}$, $V^{(i)}$) triple is then fed into a self-attention function to obtain the $i$-th head's representation $H_i$. All head representations are then concatenated and combined using an output weight matrix $W_o \in \mathbb{R}^{d \times d}$:

$$H_i = \mathrm{softmax}\!\left(\frac{Q^{(i)} (K^{(i)})^\top}{\sqrt{d_h}}\right) V^{(i)}, \qquad H = \mathrm{concat}(H_1, \ldots, H_{N_h})\, W_o.$$

The FFN module is a two-layer MLP:

$$\mathrm{FFN}(H) = \sigma(H W_1 + b_1) W_2 + b_2,$$

where $W_1 \in \mathbb{R}^{d \times d_m}$, $b_1 \in \mathbb{R}^{d_m}$, $W_2 \in \mathbb{R}^{d_m \times d}$, and $b_2 \in \mathbb{R}^{d}$; $d_m$ is often chosen larger than $d$.
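As a concrete reference for the notation above, here is a minimal PyTorch sketch of the MHA and FFN computations for a single layer; variable names are ours, and layer normalization, dropout, residual connections, and masking are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class MiniTransformerLayer(nn.Module):
    """Bare-bones MHA + FFN matching the notation above (no LayerNorm/dropout/mask)."""

    def __init__(self, d: int = 768, n_heads: int = 12, d_m: int = 3072):
        super().__init__()
        assert d % n_heads == 0
        self.n_heads, self.d_h = n_heads, d // n_heads
        self.W_q = nn.Linear(d, d, bias=False)   # stacks the per-head W_q^(i)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, d_m), nn.ReLU(), nn.Linear(d_m, d))

    def forward(self, X: torch.Tensor) -> torch.Tensor:   # X: (n, d)
        n, d = X.shape
        # Q^(i) = X W_q^(i), etc., computed for all heads at once, shaped (heads, n, d_h)
        Q, K, V = (proj(X).view(n, self.n_heads, self.d_h).transpose(0, 1)
                   for proj in (self.W_q, self.W_k, self.W_v))
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d_h), dim=-1)
        H = (attn @ V).transpose(0, 1).reshape(n, d)       # concat(H_1, ..., H_Nh)
        H = self.W_o(H)
        return self.ffn(H)                                 # FFN(H) = sigma(H W_1 + b_1) W_2 + b_2
```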
Adapter.
Adapter (Houlsby et al., 2019a) plugs light-weight feed-forward networks into Transformer layers (after the MHA module and the FFN module). Each adapter layer typically consists of a down-projection matrix $W_{\text{down}} \in \mathbb{R}^{d \times r_A}$, a non-linear activation function $f(\cdot)$, and an up-projection matrix $W_{\text{up}} \in \mathbb{R}^{r_A \times d}$, where $r_A$ denotes the bottleneck dimension. Denoting the input as $X \in \mathbb{R}^{n \times d}$, the adapter applies a residual connection as follows:

$$X \leftarrow X + f(X W_{\text{down}}) W_{\text{up}}.$$
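A minimal sketch of this adapter computation is shown below; the class name, the choice of ReLU for $f(\cdot)$, and the default bottleneck size are illustrative assumptions rather than the exact configuration of Houlsby et al. (2019a).

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """X <- X + f(X W_down) W_up, inserted after the MHA and FFN modules."""

    def __init__(self, d: int = 768, r_A: int = 16):
        super().__init__()
        self.W_down = nn.Linear(d, r_A)   # down-projection, d -> r_A
        self.f = nn.ReLU()                # non-linear activation f(.)
        self.W_up = nn.Linear(r_A, d)     # up-projection, r_A -> d

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return X + self.W_up(self.f(self.W_down(X)))   # residual connection
```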
Prefix-tuning.
Prefix-tuning (Li and Liang, 2021a) extends the keys $K^{(i)}$ / values $V^{(i)}$ in every MHA module by prepending learnable prefix vectors $P_K^{(i)}$ / $P_V^{(i)} \in \mathbb{R}^{m \times d_h}$ before them, where $m$ denotes the number of virtual tokens. The output of an attention head $H_i$ can be re-formulated as:

$$H'_i = \mathrm{ATT}(Q^{(i)}, [P_K^{(i)}; K^{(i)}], [P_V^{(i)}; V^{(i)}]),$$

where $[\cdot\,;\cdot]$ denotes concatenation.
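The sketch below illustrates how the prefixes enter a single attention head, assuming one isolated head with its own projections; the module name, the prefix length, and the initialization scale are our own illustrative choices.

```python
import math
import torch
import torch.nn as nn

class PrefixAttentionHead(nn.Module):
    """One attention head whose keys/values are extended with m learnable prefix vectors."""

    def __init__(self, d: int = 768, d_h: int = 64, m: int = 20):
        super().__init__()
        self.W_q = nn.Linear(d, d_h, bias=False)
        self.W_k = nn.Linear(d, d_h, bias=False)
        self.W_v = nn.Linear(d, d_h, bias=False)
        self.P_K = nn.Parameter(torch.randn(m, d_h) * 0.02)   # learnable key prefixes
        self.P_V = nn.Parameter(torch.randn(m, d_h) * 0.02)   # learnable value prefixes

    def forward(self, X: torch.Tensor) -> torch.Tensor:       # X: (n, d)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        K = torch.cat([self.P_K, K], dim=0)                   # [P_K; K^(i)]
        V = torch.cat([self.P_V, V], dim=0)                   # [P_V; V^(i)]
        attn = torch.softmax(Q @ K.T / math.sqrt(Q.shape[-1]), dim=-1)
        return attn @ V                                        # H'_i
```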
LoRA.
LoRA (Hu et al., 2021b) re-parameterizes the weight update $\Delta W$ of a weight matrix $W$ in the MHA module with a low-rank decomposition, i.e., $\Delta W = W_A W_B$, where $W_A \in \mathbb{R}^{d \times r_L}$ and $W_B \in \mathbb{R}^{r_L \times d}$ are two learnable low-rank matrices, with $r_L$ typically being a small integer. For an input $X \in \mathbb{R}^{n \times d}$, LoRA is formulated as:

$$X \leftarrow X + s \cdot X W_A W_B,$$

where $s \geq 1$ is a scaling hyper-parameter.
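A common way to realize this formulation is to wrap a frozen linear layer with the trainable low-rank factors, as in the sketch below; the wrapper class and initialization choices are ours and only approximate the reference implementation of Hu et al. (2021b).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update s * X W_A W_B."""

    def __init__(self, base: nn.Linear, r_L: int = 8, s: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                    # freeze the pre-trained W
        d_in, d_out = base.in_features, base.out_features
        self.W_A = nn.Parameter(torch.randn(d_in, r_L) * 0.02)    # d x r_L
        self.W_B = nn.Parameter(torch.zeros(r_L, d_out))          # r_L x d, zero-initialized
        self.s = s

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # frozen X W (plus bias) combined with the scaled low-rank update
        return self.base(X) + self.s * (X @ self.W_A @ self.W_B)
```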
4 Analysis Pipeline
As mentioned before, we consider three representative DETs: Adapter ($t_A$), Prefix-tuning ($t_P$), and LoRA ($t_L$). Each DET $t$ defines a set of tunable parameters $\theta_t$. To adapt a PLM to a specific downstream task $\mathcal{T}_i$, we optimize $\theta^i_t$ to minimize the loss function $\mathcal{L}^i_{\text{task}}(\theta^i_t \,|\, \theta_0)$ defined by $\mathcal{T}_i$, where $\theta_0$ denotes the pre-trained weights. To verify our hypothesis that there exists a unified optimization subspace in which all DETs can achieve non-trivial performance, we propose a three-stage analysis pipeline (visualized in Figure 1): the first stage approximates the desired subspace, so that in the second stage the optimizations for different DETs can all be conducted in this subspace; this makes it possible to explore the connections among different DETs in the third stage.
Following Qin et al. (2021), to validate the generality of the found subspace and avoid information leakage, we approximate the subspace with a series of training tasks $\mathcal{T}_{\text{train}}$, and conduct subsequent subspace optimization on unseen tasks $\mathcal{T}_{\text{test}}$.

Figure 1: Illustration of our analysis pipeline, consisting of (1) subspace approximation, which jointly decomposes DET solutions into a shared subspace, (2) subspace optimization, which finds subspace solutions for a specific DET, and (3) subspace solution transfer, which transfers the subspace solution from a source DET to other DETs.
Subspace Approximation.
To approximate the desired subspace, we decompose and then reconstruct independent DET solutions on $\mathcal{T}_{\text{train}}$. We first train DETs in their original space, and for each task $\mathcal{T}_i \in \mathcal{T}_{\text{train}}$, we obtain three independent solutions: $\theta^i_{t_A}$, $\theta^i_{t_P}$, and $\theta^i_{t_L}$. Then we assign a down-projection $\mathrm{Proj}^{\text{down}}_t: \mathbb{R}^{|\theta^i_t|} \rightarrow \mathbb{R}^{y}$ and an up-projection $\mathrm{Proj}^{\text{up}}_t: \mathbb{R}^{y} \rightarrow \mathbb{R}^{|\theta^i_t|}$ to each DET $t$, where $y$ is the dimension of the intrinsic subspace. In practice, both the down-projection and the up-projection are MLP layers. Each down-projection decomposes a DET solution into a low-dimensional intrinsic vector $I^i_t \in \mathbb{R}^{y}$:

$$I^i_t = \mathrm{Proj}^{\text{down}}_t(\theta^i_t).$$

The three intrinsic vectors $I^i_{t_A}$, $I^i_{t_P}$, $I^i_{t_L}$ represent different local minima of $\mathcal{T}_i$ in the same subspace. Ideally, if the three DETs can be unified in the subspace, then each vector $I^i_t$ could be used to reconstruct any DET solution ($\theta^i_{t_A}$, $\theta^i_{t_P}$, or $\theta^i_{t_L}$). Therefore, to approximate such a subspace, we efficiently facilitate the interaction among different DETs by dynamically sampling two random ratios $\alpha \in [0, 1]$, $\beta \in [0, 1 - \alpha]$, and computing an interpolation of the three intrinsic vectors of $\mathcal{T}_i$:

$$I^i_{\alpha;\beta} = \alpha \cdot I^i_{t_A} + \beta \cdot I^i_{t_P} + (1 - \alpha - \beta) \cdot I^i_{t_L}.$$

The interpolation is mapped by each up-projection $\mathrm{Proj}^{\text{up}}_t$ to reconstruct the task solution for each DET by minimizing the following loss function:

$$\mathcal{L}^i_{\text{dist}}(\hat{\theta}^i_t) = \|\hat{\theta}^i_t - \theta^i_t\|_2, \qquad \hat{\theta}^i_t = \mathrm{Proj}^{\text{up}}_t(I^i_{\alpha;\beta}).$$

To properly guide the reconstructed $\hat{\theta}^i_t$ to solve task $\mathcal{T}_i$, we also incorporate the original task loss $\mathcal{L}^i_{\text{task}}$. The overall training objective can be formulated as follows:

$$\mathcal{L}_{\text{pet}} = \sum_{i=1}^{|\mathcal{T}_{\text{train}}|} \sum_{t \in \{t_A, t_P, t_L\}} \Big[ \mathcal{L}^i_{\text{dist}}(\hat{\theta}^i_t) + \mathcal{L}^i_{\text{task}}(\hat{\theta}^i_t \,|\, \theta_0) \Big].$$

During this stage, only the down-projections and up-projections are optimized; all other parameters are kept frozen. When this stage finishes, the two projections can be seen as mappings between the unified subspace and each DET's original space.
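The sketch below schematically combines the down-projections, the random interpolation, and the reconstruction loss of this stage for a single task; the MLP architectures, dimensions, and the way DET parameters are flattened are placeholder assumptions, the task loss $\mathcal{L}^i_{\text{task}}$ is indicated only as a comment, and the released repository above remains the authoritative implementation.

```python
import torch
import torch.nn as nn

DETS = ["adapter", "prefix", "lora"]

class SubspaceProjections(nn.Module):
    """Per-DET down-/up-projections between flattened DET parameters and a shared y-dim subspace."""

    def __init__(self, det_dims: dict, y: int = 100, hidden: int = 256):
        super().__init__()
        self.down = nn.ModuleDict({t: nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, y))
                                   for t, d in det_dims.items()})
        self.up = nn.ModuleDict({t: nn.Sequential(nn.Linear(y, hidden), nn.Tanh(), nn.Linear(hidden, d))
                                 for t, d in det_dims.items()})

def approximation_step(proj: SubspaceProjections, solutions: dict) -> torch.Tensor:
    """One training step on a single task, given flattened solutions {det_name: theta_t^i}."""
    intrinsic = {t: proj.down[t](solutions[t]) for t in DETS}        # I_t^i = Proj_down(theta_t^i)
    alpha = torch.rand(1).item()
    beta = torch.rand(1).item() * (1.0 - alpha)                       # alpha in [0,1], beta in [0,1-alpha]
    I = alpha * intrinsic["adapter"] + beta * intrinsic["prefix"] + (1 - alpha - beta) * intrinsic["lora"]
    loss = 0.0
    for t in DETS:
        theta_hat = proj.up[t](I)                                     # reconstructed DET solution
        loss = loss + torch.norm(theta_hat - solutions[t], p=2)       # L_dist; L_task(theta_hat) would be added here
    return loss

# Hypothetical usage with toy dimensions for the three DETs:
dims = {"adapter": 1200, "prefix": 800, "lora": 1000}
proj = SubspaceProjections(dims, y=100)
solutions = {t: torch.randn(d) for t, d in dims.items()}
approximation_step(proj, solutions).backward()   # only the projections receive gradients
```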