Inducer-tuning: Connecting Prefix-tuning and Adapter-tuning
Yifan Chen1, Devamanyu Hazarika2, Mahdi Namazifar2,
Yang Liu2, Di Jin2, Dilek Hakkani-Tur2
1University of Illinois Urbana-Champaign   2Amazon Alexa AI

Equal contribution. This work was performed while the first author was interning at Amazon Alexa AI.
Correspondence to: Di Jin <djinamzn@amazon.com>
Abstract
Prefix-tuning, or more generally continuous prompt tuning, has become an essential paradigm of parameter-efficient transfer learning. Using a large pre-trained language model (PLM), prefix-tuning can obtain strong performance by training only a small portion of parameters. In this paper, we propose to understand and further develop prefix-tuning through the kernel lens. Specifically, we make an analogy between prefixes and inducing variables in kernel methods and hypothesize that prefixes serving as inducing variables would improve their overall mechanism. From the kernel estimator perspective, we suggest a new variant of prefix-tuning, inducer-tuning, which shares the exact mechanism as prefix-tuning while leveraging the residual form found in adapter-tuning; this mitigates the initialization issue in prefix-tuning. Through comprehensive empirical experiments on natural language understanding and generation tasks, we demonstrate that inducer-tuning can close the performance gap between prefix-tuning and fine-tuning.
1 Introduction
Transfer learning from large pre-trained language models (PLMs) has been the de-facto method to tackle downstream natural language processing (NLP) tasks, with proven performance and scalability (Peters et al., 2018). Among all the adaptation techniques, fine-tuning (Howard and Ruder, 2018; Kale and Rastogi, 2020) is predominant for PLMs; it maintains the models' architecture while updating all the parameters within. Though powerful, fine-tuning is considered parameter-inefficient since it results in separate copies of model parameters for each task/client after training.
With the sizes of PLMs increasing to hundreds of millions (Brown et al., 2020) or even up to trillions (Fedus et al., 2021) of parameters, the trend motivates a range of parameter-efficient adaptation techniques, including adapter-tuning and prompting, as promising lightweight alternatives to fine-tuning that reduce computational consumption and storage space. Adapter-tuning inserts bottlenecked Multi-Layer Perceptron (MLP) modules between the pre-trained layers of PLMs and tunes only these new parameters for task adaptation (Houlsby et al., 2019; Pfeiffer et al., 2020a). Prompting, instead, aims to adapt the general-purpose PLMs through prompts, whose effectiveness has been shown on a frozen GPT-3 model (Brown et al., 2020).
An implicit drawback of prompt-based adaptation is the difficulty of searching for a proper prompt. To avoid manually designing the prompts, Shin et al. (2020) propose a search algorithm to find effective prompts over the discrete space of vocabularies; prefix-tuning (Li and Liang, 2021) and other concurrent methods (Lester et al., 2021; Liu et al., 2021b,a) further extend the discrete search to continuous prompts, attaining performance close to fine-tuning on some tasks. Despite these efforts, there is still a performance gap between "prefix-tuning" and "fine-tuning" on many tasks, especially when the model size is small (Lester et al., 2021; He et al., 2021a). In addition, the mechanism of prefix-tuning is still poorly understood and under-explored. Prefix-tuning is also similar to adapter-tuning, since they both insert additional modules into each transformer layer (classical prompt-based methods (Lester et al., 2021; Liu et al., 2021b) only add prompts to the embedding layer).
Scrutinizing the evolution of prompt-based methods, we can observe that they have gradually deviated from the concept of "prompts". Compared to manually designed prompts, the discrete search usually results in counter-intuitive prompt tokens, which vaguely match the topic but are not as sensible as the manual ones; continuous prompt tuning even breaks the limit of the existing vocabulary. All these pieces imply that the mechanism behind prompt-based tuning might be more complicated than guiding the output through hint prompts. To open the black box of "prompts", in this work we propose to consider the prompts (either hard or soft) as "inducing variables" in kernel methods (Titsias, 2009). This analogy is justified by the close connection between attention modules in PLMs and kernel estimators (Choromanski et al., 2020; Chen et al., 2021; Tsai et al., 2019). This kernel perspective explains the potential mechanism of prefix-tuning and motivates a new method, inducer-tuning. Specifically, inducer-tuning freezes all the original parameters in the PLMs, as other prompt-based methods do; when computing the attention output for a certain input token in each layer, inducer-tuning utilizes a point close to the query vector as the "inducer". This unique "soft prompt" eases the search for appropriate prompts and builds a new connection between "prompting" and "adapter-tuning".
In summary, the contribution of this work is three-fold:
1. We explain the underlying mechanism of prefix-tuning as that of inducing variables in kernel learning.
2. We propose a new parameter-efficient adaptation technique, inducer-tuning, to further improve prefix-tuning.
3. Through comprehensive empirical studies, we verify that our proposed method can close the gap between "prefix-tuning" and "fine-tuning" on relatively small PLMs, and provide a tighter lower bound on the potential of continuous prompt tuning.
2 Related Work
In this section, we briefly introduce the classical
form of adapter-tuning and mainly focus on the
different variants of prompting.
Adapter-tuning. Compared to fine-tuning all the parameters in the PLMs, Houlsby et al. (2019) and Pfeiffer et al. (2020a) propose to modulate the output of a transformer layer through inserting additional small-bottleneck MLP layers (adapters) (Houlsby et al., 2019):

    Adapter(h) = h + ReLU(h W_1) W_2,    (1)

where h is the dimension-d hidden state in the transformer and W_1, W_2 are d-by-r and r-by-d projection matrices (we ignore layer normalization and bias terms here for brevity). Adapters have a residual form similar to a skip connection, while only W_1 and W_2 will be trained, greatly decreasing the number of tunable parameters. Up to now, the adapter-based method has been widely used for multiple NLP tasks (Stickland and Murray, 2019; Pfeiffer et al., 2020a; Wang et al., 2020; Pfeiffer et al., 2020b; Üstün et al., 2020; Vidoni et al., 2020; Pfeiffer et al., 2021; He et al., 2021b; Xu et al., 2021; Rücklé et al., 2020; Karimi Mahabadi et al., 2021), and adapters are also intrinsically connected to many other parameter-efficient adaptation techniques, as detailed in He et al. (2021a).
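To make the residual form in Equation (1) concrete, the following is a minimal NumPy sketch of an adapter module; the hidden size, bottleneck size, and zero initialization of W_2 are illustrative choices rather than the exact configuration of Houlsby et al. (2019).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Adapter:
    """Bottleneck adapter implementing Equation (1): Adapter(h) = h + ReLU(h W1) W2."""
    def __init__(self, d, r, seed=0):
        rng = np.random.default_rng(seed)
        # Only W1 (d x r) and W2 (r x d) are trainable; the PLM itself stays frozen.
        self.W1 = rng.normal(scale=0.02, size=(d, r))
        self.W2 = np.zeros((r, d))  # zero init: the adapter starts as an identity map

    def __call__(self, h):
        # h: (n, d) hidden states coming out of a frozen transformer sub-layer
        return h + relu(h @ self.W1) @ self.W2

# Toy usage: with W2 = 0, the adapted output equals the frozen output at initialization.
h = np.random.default_rng(1).normal(size=(4, 16))  # n = 4 tokens, d = 16
assert np.allclose(Adapter(d=16, r=4)(h), h)
```

Zero-initializing the output projection keeps the adapted model identical to the frozen PLM before training, which is one way to see why the residual form eases initialization, as discussed later in this section.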
Prompting. Prompting prepends task-specific instructions to the task input and was originally demonstrated in Brown et al. (2020). As manual prompts rely on trial and error, Jiang et al. (2020) and Shin et al. (2020) suggest search algorithms to specify the prompts among all the tokens in the vocabulary. Prompt-tuning (Lester et al., 2021) and P-tuning (Liu et al., 2021b) remove the vocabulary restriction on prompts by using trainable "soft prompts". The prompts in the aforementioned methods are only inserted into the bottom embedding layer of PLMs, while prefix-tuning (Li and Liang, 2021; Liu et al., 2021a) adds soft prompts to all the transformer layers to further increase the capacity of prompting.
Though these methods are effective, proper initialization of the soft prompts remains challenging. To mitigate the issue, Li and Liang (2021) use an extra MLP to re-parameterize the prompts in each layer, thus adding more parameters that need training; SPoT (Vu et al., 2021) suggests pre-training the soft prompts on a wide range of NLP tasks, which requires additional computational resources. In contrast, though adapters have a similar expression form to prefix-tuning (He et al., 2021a), adapter-tuning only requires regular initialization. We speculate that the residual form of adapters mitigates the initialization issue, since the output of each layer in the new model is centered around the output of the frozen PLM, and the residual form aids gradient back-propagation as in skip connections. We rely on this intuition and utilize the above-mentioned advantages of adapters to guide the design of our proposed inducer-tuning.
3 Preliminaries: Transformer Layers
Before discussing the mechanism of prompt-tuning,
we introduce the structure of transformer layers and
necessary notations in this section.
A general transformer-based PLM is mainly composed of L stacked layers. Each layer contains a multi-headed self-attention and a fully connected feed-forward network (FFN) sub-layer, both followed by an "Add & Norm" module (Vaswani et al., 2017); for simplicity, we omit the cross-attention module in transformer-based encoder-decoder models. Hereon, we shall focus on the structure of the attention sub-layer, since prefix-tuning directly works on this sub-layer.
Passing a length-n input sequence X ∈ R^{n×N_h p} to an attention sub-layer (assuming N_h heads and dimension size p for each head), we first perform linear transforms on the input X and obtain the query matrix (Q), the key matrix (K), and the value matrix (V) as

    Q / K / V = X W_{[q/k/v]} + 1 b_{[q/k/v]}^T,    (2)

where Q, K, V ∈ R^{n×N_h p} are the query/key/value matrices, W_{[q/k/v]} ∈ R^{N_h p × N_h p} are the weight matrices, and b_{[q/k/v]} ∈ R^{N_h p} are the bias terms in the corresponding transformations (to ease the notation, we adopt the practical setting where X, Q, K, and V have the same shape).
To increase the model capacity, the three components Q, K, V are respectively divided into N_h blocks, contributing to the attention output in each head of the multi-headed self-attention module. For instance, we represent Q as Q = [Q^{(1)}, ..., Q^{(N_h)}], where each block Q^{(h)} = X W_q^{(h)} + 1 (b_q^{(h)})^T is an n-by-p matrix, and W_q^{(h)}, b_q^{(h)} are the corresponding parts of W_q, b_q. The attention output for the h-th head is

    L^{(h)} V^{(h)} := softmax(Q^{(h)} (K^{(h)})^T / √p) V^{(h)} = (D^{(h)})^{-1} M^{(h)} V^{(h)},    (3)

where M^{(h)} := exp(Q^{(h)} (K^{(h)})^T / √p) and D^{(h)} is a diagonal matrix in which D^{(h)}_{ii} is the sum of the i-th row of M^{(h)}, serving as the normalization procedure in softmax. The attention outputs of the heads are then concatenated as L := (L^{(1)} V^{(1)}, ..., L^{(N_h)} V^{(N_h)}).
After concatenating the heads, there is a linear transform applied to the output:

    L W_o + 1 b_o^T,    (4)

where W_o and b_o are sized similarly to the other matrices in Equation (2). This is the overall output of the attention sub-layer, which we shall revisit in § 4.4.
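As a reference for the notation above, here is a small NumPy sketch that follows Equations (2) to (4) step by step; the shapes mirror the definitions (n tokens, N_h heads, head size p), and the randomly drawn parameters are placeholders rather than pre-trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_sublayer(X, Wq, Wk, Wv, Wo, bq, bk, bv, bo, Nh, p):
    # Equation (2): Q/K/V = X W_{q/k/v} + 1 b^T_{q/k/v}
    Q, K, V = X @ Wq + bq, X @ Wk + bk, X @ Wv + bv
    heads = []
    for h in range(Nh):
        cols = slice(h * p, (h + 1) * p)              # the h-th block of columns
        Qh, Kh, Vh = Q[:, cols], K[:, cols], V[:, cols]
        # Equation (3): row-normalized exp(Q^(h) (K^(h))^T / sqrt(p)) applied to V^(h)
        heads.append(softmax(Qh @ Kh.T / np.sqrt(p)) @ Vh)
    L = np.concatenate(heads, axis=1)                  # concatenate the head outputs
    return L @ Wo + bo                                 # Equation (4)

# Toy shapes: n = 5 tokens, Nh = 2 heads, p = 4 dimensions per head.
n, Nh, p = 5, 2, 4
d = Nh * p
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) for _ in range(4)]
bs = [rng.normal(size=d) for _ in range(4)]
print(attention_sublayer(X, *Ws, *bs, Nh=Nh, p=p).shape)  # (5, 8)
```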
4 Parameter-Efficient Inducer-Tuning
We describe the motivation and the mechanism of inducer-tuning in this section. We first revisit the connection between self-attention and kernel estimators in § 4.1, which interprets attention from another perspective by considering the query, key, and value matrices as three separate sets of vectors rather than related representations of the same input sequence. This special perspective motivates and justifies the inducer-tuning we propose in § 4.3.
4.1 Attention as Kernel Estimators
Traditionally, the attention operation (Equation (3)) is viewed as a transformation g(·) of the input sequence X. However, in prefix-tuning, the parameters within PLMs are frozen, which implies that given the input X, the representations Q, K, and V are invariant (our discussion is for a single attention head, and we omit the superscript (h) for brevity). This observation allows us to re-interpret attention as a kernel estimator f(·) with Q as its input. Specifically, we denote the attention operation on the i-th input vector X_i as f(Q_i) := g(X_i). This attention representation can be seen as modifying the input query vector Q_i to f(Q_i) via the supporting points {K_j}_{j=1}^{n} (Choromanski et al., 2020; Peng et al., 2020; Chen et al., 2021), which can be considered a Nadaraya–Watson kernel estimator (Wasserman, 2006, Definition 5.39):

    row-normalize(κ(Q, K)) V,

where κ(·,·) is a kernel function. (Refer to Appendix C for more details on this claim.)
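As a quick numerical check of this reading (not code from the paper), the snippet below computes single-head softmax attention and a Nadaraya–Watson estimator with the exponential kernel κ(q, k) = exp(q·k / √p); the two outputs coincide, so f(Q_i) is exactly a kernel-weighted average of the value vectors with K as the supporting points.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
Q, K, V = rng.normal(size=(n, p)), rng.normal(size=(n, p)), rng.normal(size=(n, p))

# Single-head softmax attention as in Equation (3).
scores = np.exp(Q @ K.T / np.sqrt(p))
attn_out = (scores / scores.sum(axis=1, keepdims=True)) @ V

# Nadaraya-Watson estimator: row-normalize(kappa(Q, K)) V with an exponential kernel.
kappa = lambda q, k: np.exp(q @ k / np.sqrt(p))
nw_out = np.empty_like(attn_out)
for i in range(n):
    w = np.array([kappa(Q[i], K[j]) for j in range(n)])   # kernel weights for query Q_i
    nw_out[i] = (w / w.sum()) @ V                          # weighted average of the values

assert np.allclose(attn_out, nw_out)
```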
4.2 Prefix-Tuning and Inducing Variables
Prefix-tuning (Li and Liang, 2021) alters the attention output in each layer. Concretely, it prepends length-l prefix vectors P_k, P_v ∈ R^{l×p} to K and V, respectively; for a certain query token Q_i (the i-th row of the query matrix Q), its attention output f(Q_i) := Attn(Q_i, K, V) is updated to a weighted sum of f(Q_i) and Attn(Q_i, P_k, P_v) (He et al., 2021a, Equation (7)).
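To make the quoted decomposition tangible, the sketch below prepends hypothetical prefix vectors P_k, P_v to a single attention head and verifies numerically that the new output is a convex combination of f(Q_i) and Attn(Q_i, P_k, P_v), with the mixing weight given by the share of unnormalized attention mass falling on the prefix; this mirrors He et al. (2021a, Equation (7)) but is our own illustrative reconstruction, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l, p = 6, 3, 4                         # n tokens, prefix length l, head size p
Q, K, V = rng.normal(size=(n, p)), rng.normal(size=(n, p)), rng.normal(size=(n, p))
Pk, Pv = rng.normal(size=(l, p)), rng.normal(size=(l, p))     # hypothetical prefixes

def attn(q, keys, values):
    # Single-query softmax attention with the scaling of Equation (3).
    w = np.exp(q @ keys.T / np.sqrt(p))
    return (w / w.sum()) @ values

qi = Q[0]
orig = attn(qi, K, V)                                   # f(Q_i)
prefix_only = attn(qi, Pk, Pv)                          # Attn(Q_i, P_k, P_v)
with_prefix = attn(qi, np.vstack([Pk, K]), np.vstack([Pv, V]))

# lam: fraction of the unnormalized attention mass that the prefix receives.
mass_prefix = np.exp(qi @ Pk.T / np.sqrt(p)).sum()
mass_orig = np.exp(qi @ K.T / np.sqrt(p)).sum()
lam = mass_prefix / (mass_prefix + mass_orig)

assert np.allclose(with_prefix, (1 - lam) * orig + lam * prefix_only)
```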
Remark. From the kernel estimator perspective, the two categories of virtual tokens play different roles. The virtual key vectors P_k apply to the empirical kernel matrix and can alter the attention scores (and thus the weight given to Attn(Q_i, P_k, P_v)), whereas P_v takes effect in the value part. It might not be optimal for prefix-tuning to model the two categories of virtual tokens similarly. In § 4.3 we will show how inducer-tuning addresses the two parts through different residual forms.
We suggest that the mechanism of prefix-tuning can be further understood through the concept of inducing variables in the kernel learning literature (Titsias, 2009). Many computational methods in kernel learning utilize a small set of support points (inducing variables) to improve inference performance (Musco and Musco, 2017; Chen and Yang, 2021). Snelson and Ghahramani (2005) specifically consider the inducing variables as auxiliary pseudo-inputs and infer them through continuous optimization, which is similar to prefix-tuning. We emphasize that, at first sight, the main characteristic of inducing-point methods is representing a vast number of training examples through a small number of points so as to reduce the computational cost; here, however, we instead aim to leverage the mechanism of inducing variables to steer the estimation well: our goal is to strengthen prefix-tuning by making the prefixes better modulate the attention output. We introduce and analyze this mechanism as follows.
Mechanism for well-steering inference outputs in inducing-point methods. Conceptually, inducing variables help the inference because they can represent the distribution of the query inputs and steer the kernel methods without changing the kernel in use. In particular, we consider the distribution pattern of the unconstrained inducing points X_M (Snelson and Ghahramani, 2005, Figure 1). We observe that most of them are close to the testing examples X, and in the new estimation (Snelson and Ghahramani, 2005, Equation (8)) the inducers X_M receive large weights through the weight-assignment mechanism of kernel methods (recall that kernel methods assign sample weights just as attention does (Choromanski et al., 2020; Chen et al., 2021; Tsai et al., 2019); inducing variables close to the query automatically receive more attention), and thus effectively modulate the output.
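A toy example of this weight-assignment intuition (only an illustration with a Gaussian kernel, not the sparse-GP construction of Snelson and Ghahramani (2005)): an inducing point placed near the query captures almost all of the normalized kernel weight, so it can shift the resulting estimate, whereas distant support points barely matter.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
q = rng.normal(size=p)                          # a query point
supports = rng.normal(size=(10, p)) * 3.0       # support points far from the query
inducer = q + 0.1 * rng.normal(size=p)          # an inducing point close to the query

def normalized_weights(query, points):
    # Gaussian-kernel weights, row-normalized as in a Nadaraya-Watson estimator.
    w = np.exp(-np.sum((points - query) ** 2, axis=1))
    return w / w.sum()

w = normalized_weights(q, np.vstack([supports, inducer[None, :]]))
print(f"weight on the inducer: {w[-1]:.3f}; largest weight on a distant support: {w[:-1].max():.2e}")
```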
From this mechanism, we draw an inductive bias, "the prefix should be close to the query" (which is not enforced in prefix-tuning), and accordingly propose inducer-tuning. We remark that since we are not pursuing the original goal of inducing variables, namely reducing the computational cost, it is natural that the concrete design in the next subsection differs from the usual form of inducing points, i.e., a small number of samples.
We speculate that prefix-tuning partially benefits from the above mechanism as well, and some indirect evidence supports this view. As discussed in previous studies, to realize the full potential of prompting, manually designed prompts are expected to be related to the topic of the input sequence (Brown et al., 2020), i.e., close to the query; even soft prompts are recommended to be initialized with tokens relevant to the specific task (Li and Liang, 2021), which also requires the prompts to be close to the query to provide effective adaptation. With this belief, we propose inducer-tuning to further exploit the mechanism of inducing variables and improve upon prefix-tuning.
4.3 Method
Inducer-tuning follows the same design principle as prefix-tuning: it modulates the attention output by inserting virtual tokens (vectors). However, unlike prefix-tuning, our virtual tokens are not shared among the input sequences. Inducer-tuning also incorporates the benefits of residual forms to ease initialization and to remove the re-parametrization trick in prefix-tuning. Specifically, we suggest the following modifications:
1. The "inducers" are adaptive to and customized for each input token, to strengthen the expressiveness of the new attention output.
2. We model the virtual vectors in a residual form, as in an adapter, which makes the final attention output take a residual form as well.
We now discuss the intuitions behind these modifications in detail.
Adaptive inducers. There is an important difference between language models and kernel methods that makes fixed prefixes less effective than inducing variables in kernel methods: in language models, the distribution of the input queries keeps changing, and for some inputs the fixed prefixes fail to qualify as "inducing variables". Even worse, for a long input there probably exist query vectors that are far (in ℓ2 distance) from all the virtual vectors in the fixed prefixes, and these prefixes are thus unable to modulate the attention output well. The phenomenon that prefix-tuning performs relatively poorly on tasks with longer inputs can be observed in our experiments (§ 6).

To alleviate the above issue, we propose adaptive modeling of the virtual key vectors. For a query Q_i, we suggest taking a vector close to Q_i itself as the corresponding virtual key vector (the length of