Inducer-tuning: Connecting Prefix-tuning and Adapter-tuning
Yifan Chen1, Devamanyu Hazarika2, Mahdi Namazifar2,
Yang Liu2, Di Jin2, Dilek Hakkani-Tur2
1University of Illinois Urbana-Champaign   2Amazon Alexa AI

Equal contribution. This work was performed while the first author was interning at Amazon Alexa AI.
Correspondence to: Di Jin <djinamzn@amazon.com>
Abstract
Prefix-tuning, or more generally continuous prompt tuning, has become an essential paradigm of parameter-efficient transfer learning. Using a large pre-trained language model (PLM), prefix-tuning can obtain strong performance by training only a small portion of parameters. In this paper, we propose to understand and further develop prefix-tuning through the kernel lens. Specifically, we make an analogy between prefixes and inducing variables in kernel methods and hypothesize that prefixes serving as inducing variables would improve their overall mechanism. From the kernel estimator perspective, we suggest a new variant of prefix-tuning, inducer-tuning, which shares the exact mechanism as prefix-tuning while leveraging the residual form found in adapter-tuning; this mitigates the initialization issue in prefix-tuning. Through comprehensive empirical experiments on natural language understanding and generation tasks, we demonstrate that inducer-tuning can close the performance gap between prefix-tuning and fine-tuning.
1 Introduction
Transfer learning from large pre-trained language models (PLMs) has been the de-facto method to tackle downstream natural language processing (NLP) tasks, with proven performance and scalability (Peters et al., 2018). Among all the adaptation techniques, fine-tuning (Howard and Ruder, 2018; Kale and Rastogi, 2020) is predominant for PLMs; it maintains the models' architecture while updating all the parameters within. Though powerful, fine-tuning is considered parameter-inefficient since it results in separate copies of model parameters for each task/client after training.
With the sizes of PLMs increasing to hundreds of millions (Brown et al., 2020) or even up to trillions (Fedus et al., 2021) of parameters, the trend motivates a range of parameter-efficient adaptation techniques, including adapter-tuning and prompting, as promising lightweight alternatives to fine-tuning that reduce computational consumption and storage space. Adapter-tuning inserts bottlenecked Multi-Layer Perceptron (MLP) modules between the pre-trained layers of PLMs and tunes only these new parameters for task adaptation (Houlsby et al., 2019; Pfeiffer et al., 2020a). Prompting, instead, aims to adapt the general-purpose PLMs through prompts, whose effectiveness has been shown on a frozen GPT-3 model (Brown et al., 2020).
An implicit drawback of prompt-based adaptation is the difficulty of searching for a proper prompt. To avoid manually designing the prompts, Shin et al. (2020) propose a search algorithm to find effective prompts over the discrete space of vocabularies; prefix-tuning (Li and Liang, 2021) and other concurrent methods (Lester et al., 2021; Liu et al., 2021b,a) further extend the discrete search to continuous prompts, attaining performance close to fine-tuning on some tasks. Despite these efforts, there is still a performance gap between "prefix-tuning" and "fine-tuning" on many tasks, especially when the model size is small (Lester et al., 2021; He et al., 2021a). In addition, the mechanism of prefix-tuning is still poorly understood and under-explored. Prefix-tuning is also similar to adapter-tuning, since they both insert additional modules into each transformer layer (classical prompt-based methods (Lester et al., 2021; Liu et al., 2021b) only add prompts to the embedding layer).
Scrutinizing the evolution of prompt-based methods, we can observe that they have gradually deviated from the concept of "prompts". Compared to manually designed prompts, the discrete search usually results in counter-intuitive prompt tokens, which vaguely match the topic but are not as sensible as the manual ones; continuous prompt tuning even breaks the limit of the existing vocabulary. All these pieces imply that the mechanism behind prompt-based tuning might be more complicated than guiding the output through hint prompts. To open the black box of "prompts", in this work we propose to consider the prompts (either hard or soft) as "inducing variables" in kernel methods (Titsias, 2009). This analogy is justified by the close connection between attention modules in PLMs and kernel estimators (Choromanski et al., 2020; Chen et al., 2021; Tsai et al., 2019). This kernel perspective explains the potential mechanism of prefix-tuning and motivates a new method, inducer-tuning. Specifically, inducer-tuning freezes all the original parameters in the PLMs, as other prompt-based methods do; when computing the attention output for a certain input token in each layer, inducer-tuning utilizes a point close to the query vector as the "inducer". This unique "soft prompt" eases the search for appropriate prompts and builds a new connection between "prompting" and "adapter-tuning".
In summary, the contribution of this work is three-fold:
1. We explain the underlying mechanism of prefix-tuning as that of inducing variables in kernel learning.
2. We propose a new parameter-efficient adaptation technique, inducer-tuning, to further improve prefix-tuning.
3. Through comprehensive empirical studies, we verify that our proposed method can close the gap between "prefix-tuning" and "fine-tuning" on relatively small PLMs, and provide a tighter lower bound on the potential of continuous prompt tuning.
2 Related Work
In this section, we briefly introduce the classical
form of adapter-tuning and mainly focus on the
different variants of prompting.
Adapter-tuning. Compared to fine-tuning all the parameters in the PLMs, Houlsby et al. (2019) and Pfeiffer et al. (2020a) propose to modulate the output of a transformer layer through inserting additional small-bottleneck MLP layers (adapters) (Houlsby et al., 2019):

    Adapter(h) = h + ReLU(h W_1) W_2,    (1)

where h is the dimension-d hidden state in the transformer and W_1, W_2 are d-by-r and r-by-d projection matrices (we ignore layer normalization and bias terms here for brevity). Adapters have a residual form similar to a skip connection, while only W_1 and W_2 will be trained, greatly decreasing the number of tunable parameters. Up to now, the adapter-based method has been widely used for multiple NLP tasks (Stickland and Murray, 2019; Pfeiffer et al., 2020a; Wang et al., 2020; Pfeiffer et al., 2020b; Üstün et al., 2020; Vidoni et al., 2020; Pfeiffer et al., 2021; He et al., 2021b; Xu et al., 2021; Rücklé et al., 2020; Karimi Mahabadi et al., 2021), and adapters are also intrinsically connected to many other parameter-efficient adaptation techniques, as detailed in He et al. (2021a).
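To make the residual form in Equation (1) concrete, the following is a minimal NumPy sketch of an adapter module; the hidden size, bottleneck size, and zero initialization of W_2 are illustrative choices rather than the exact configuration of Houlsby et al. (2019).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Adapter:
    """Bottleneck adapter implementing Equation (1): Adapter(h) = h + ReLU(h W1) W2."""
    def __init__(self, d, r, seed=0):
        rng = np.random.default_rng(seed)
        # Only W1 (d x r) and W2 (r x d) are trainable; the PLM itself stays frozen.
        self.W1 = rng.normal(scale=0.02, size=(d, r))
        self.W2 = np.zeros((r, d))  # zero init: the adapter starts as an identity map

    def __call__(self, h):
        # h: (n, d) hidden states coming out of a frozen transformer sub-layer
        return h + relu(h @ self.W1) @ self.W2

# Toy usage: with W2 = 0, the adapted output equals the frozen output at initialization.
h = np.random.default_rng(1).normal(size=(4, 16))  # n = 4 tokens, d = 16
assert np.allclose(Adapter(d=16, r=4)(h), h)
```

Zero-initializing the output projection keeps the adapted model identical to the frozen PLM before training, which is one way to see why the residual form eases initialization, as discussed later in this section.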
Prompting. Prompting prepends task-specific instructions to the task input and was originally demonstrated in Brown et al. (2020). As manual prompts rely on trial and error, Jiang et al. (2020) and Shin et al. (2020) suggest search algorithms to specify the prompts among all the tokens in the vocabulary. Prompt-tuning (Lester et al., 2021) and P-tuning (Liu et al., 2021b) remove the vocabulary restriction on prompts by using trainable "soft prompts". The prompts in the aforementioned methods are only inserted into the bottom embedding layer of PLMs, while prefix-tuning (Li and Liang, 2021; Liu et al., 2021a) adds soft prompts to all the transformer layers to further increase the capacity of prompting.
Though these methods are effective, proper initialization of the soft prompts remains challenging. To mitigate the issue, Li and Liang (2021) use an extra MLP to re-parameterize the prompts in each layer, thus adding more parameters that need training; SPoT (Vu et al., 2021) suggests pre-training the soft prompts on a wide range of NLP tasks, which requires additional computational resources. In contrast, though adapters have a similar expression form to prefix-tuning (He et al., 2021a), adapter-tuning only requires regular initialization. We speculate that the residual form of adapters mitigates the initialization issue, since the output of each layer in the new model is centered around the output of the frozen PLM, and the residual form aids gradient back-propagation as in skip connections. We rely on this intuition and utilize the above-mentioned advantages of adapters to guide the design of our proposed inducer-tuning.
3 Preliminaries: Transformer Layers
Before discussing the mechanism of prompt-tuning,
we introduce the structure of transformer layers and
necessary notations in this section.
A general transformer-based PLM is mainly composed of L stacked layers. Each layer contains a multi-headed self-attention and a fully connected feed-forward network (FFN) sub-layer, both followed by an "Add & Norm" module (Vaswani et al., 2017); for simplicity, we omit the cross-attention module in transformer-based encoder-decoder models. Hereon, we shall focus on the structure of the attention sub-layer, since prefix-tuning directly works on this sub-layer.
Passing a length-n input sequence X ∈ R^{n×N_h p} to an attention sub-layer (assuming N_h heads and dimension size p for each head), we first perform linear transforms on the input X and obtain the query matrix (Q), the key matrix (K), and the value matrix (V) as

    Q / K / V = X W_{[q/k/v]} + 1 b_{[q/k/v]}^T,    (2)

where Q, K, V ∈ R^{n×N_h p} are the query/key/value matrices, W_{[q/k/v]} ∈ R^{N_h p × N_h p} are the weight matrices, and b_{[q/k/v]} ∈ R^{N_h p} are the bias terms in the corresponding transformations (to ease the notation, we adopt the practical setting where X, Q, K, and V have the same shape).
To increase the model capacity, the three components Q, K, V are respectively divided into N_h blocks, contributing to the attention output in each head of the multi-headed self-attention module. For instance, we represent Q as Q = [Q^{(1)}, ..., Q^{(N_h)}], where each block Q^{(h)} = X W_q^{(h)} + 1 (b_q^{(h)})^T is an n-by-p matrix, and W_q^{(h)}, b_q^{(h)} are the corresponding parts of W_q, b_q. The attention output for the h-th head is

    L^{(h)} V^{(h)} := softmax(Q^{(h)} (K^{(h)})^T / √p) V^{(h)} = (D^{(h)})^{-1} M^{(h)} V^{(h)},    (3)

where M^{(h)} := exp(Q^{(h)} (K^{(h)})^T / √p) and D^{(h)} is a diagonal matrix in which D^{(h)}_{ii} is the sum of the i-th row of M^{(h)}, serving as the normalization procedure in softmax. The attention outputs of the heads are then concatenated as L := (L^{(1)} V^{(1)}, ..., L^{(N_h)} V^{(N_h)}).
After concatenating the heads, there is a linear transform applied to the output:

    L W_o + 1 b_o^T,    (4)

where W_o and b_o are sized similarly to the other matrices in Equation (2). This is the overall output of the attention sub-layer, which we shall revisit in § 4.4.
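As a reference for the notation above, here is a small NumPy sketch that follows Equations (2) to (4) step by step; the shapes mirror the definitions (n tokens, N_h heads, head size p), and the randomly drawn parameters are placeholders rather than pre-trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_sublayer(X, Wq, Wk, Wv, Wo, bq, bk, bv, bo, Nh, p):
    # Equation (2): Q/K/V = X W_{q/k/v} + 1 b^T_{q/k/v}
    Q, K, V = X @ Wq + bq, X @ Wk + bk, X @ Wv + bv
    heads = []
    for h in range(Nh):
        cols = slice(h * p, (h + 1) * p)              # the h-th block of columns
        Qh, Kh, Vh = Q[:, cols], K[:, cols], V[:, cols]
        # Equation (3): row-normalized exp(Q^(h) (K^(h))^T / sqrt(p)) applied to V^(h)
        heads.append(softmax(Qh @ Kh.T / np.sqrt(p)) @ Vh)
    L = np.concatenate(heads, axis=1)                  # concatenate the head outputs
    return L @ Wo + bo                                 # Equation (4)

# Toy shapes: n = 5 tokens, Nh = 2 heads, p = 4 dimensions per head.
n, Nh, p = 5, 2, 4
d = Nh * p
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Ws = [rng.normal(size=(d, d)) for _ in range(4)]
bs = [rng.normal(size=d) for _ in range(4)]
print(attention_sublayer(X, *Ws, *bs, Nh=Nh, p=p).shape)  # (5, 8)
```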
4 Parameter-Efficient Inducer-Tuning
We describe the motivation and the mechanism of inducer-tuning in this section. We first revisit the connection between self-attention and kernel estimators in § 4.1, which interprets attention from another perspective by considering the query, key, and value matrices as three separate sets of vectors rather than related representations of the same input sequence. This special perspective motivates and justifies the inducer-tuning we propose in § 4.3.
4.1 Attention as Kernel Estimators
Traditionally, the attention operation (Equation (3)) is viewed as a transformation g(·) of the input sequence X. However, in prefix-tuning, the parameters within PLMs are frozen, which implies that given the input X, the representations Q, K, and V are invariant (our discussion is for a single attention head, and we omit the superscript (h) for brevity). This observation allows us to re-interpret attention as a kernel estimator f(·) with Q as its input. Specifically, we denote the attention operation on the i-th input vector X_i as f(Q_i) := g(X_i). This attention representation can be seen as modifying the input query vector Q_i to f(Q_i) via the supporting points {K_j}_{j=1}^{n} (Choromanski et al., 2020; Peng et al., 2020; Chen et al., 2021), which can be considered a Nadaraya–Watson kernel estimator (Wasserman, 2006, Definition 5.39):

    row-normalize(κ(Q, K)) V,

where κ(·,·) is a kernel function. (Refer to Appendix C for more details on this claim.)
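As a quick numerical check of this reading (not code from the paper), the snippet below computes single-head softmax attention and a Nadaraya–Watson estimator with the exponential kernel κ(q, k) = exp(q·k / √p); the two outputs coincide, so f(Q_i) is exactly a kernel-weighted average of the value vectors with K as the supporting points.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
Q, K, V = rng.normal(size=(n, p)), rng.normal(size=(n, p)), rng.normal(size=(n, p))

# Single-head softmax attention as in Equation (3).
scores = np.exp(Q @ K.T / np.sqrt(p))
attn_out = (scores / scores.sum(axis=1, keepdims=True)) @ V

# Nadaraya-Watson estimator: row-normalize(kappa(Q, K)) V with an exponential kernel.
kappa = lambda q, k: np.exp(q @ k / np.sqrt(p))
nw_out = np.empty_like(attn_out)
for i in range(n):
    w = np.array([kappa(Q[i], K[j]) for j in range(n)])   # kernel weights for query Q_i
    nw_out[i] = (w / w.sum()) @ V                          # weighted average of the values

assert np.allclose(attn_out, nw_out)
```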
4.2 Prefix-Tuning and Inducing Variables
Prefix-tuning (Li and Liang, 2021) alters the attention output in each layer. Concretely, it prepends length-l prefix vectors P_k, P_v ∈ R^{l×p} to K and V, respectively; for a certain query token Q_i (the i-th row of the query matrix Q), its attention output f(Q_i) := Attn(Q_i, K, V) is updated to a weighted sum of f(Q_i) and Attn(Q_i, P_k, P_v) (He et al., 2021a, Equation (7)).
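To make the quoted decomposition tangible, the sketch below prepends hypothetical prefix vectors P_k, P_v to a single attention head and verifies numerically that the new output is a convex combination of f(Q_i) and Attn(Q_i, P_k, P_v), with the mixing weight given by the share of unnormalized attention mass falling on the prefix; this mirrors He et al. (2021a, Equation (7)) but is our own illustrative reconstruction, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l, p = 6, 3, 4                         # n tokens, prefix length l, head size p
Q, K, V = rng.normal(size=(n, p)), rng.normal(size=(n, p)), rng.normal(size=(n, p))
Pk, Pv = rng.normal(size=(l, p)), rng.normal(size=(l, p))     # hypothetical prefixes

def attn(q, keys, values):
    # Single-query softmax attention with the scaling of Equation (3).
    w = np.exp(q @ keys.T / np.sqrt(p))
    return (w / w.sum()) @ values

qi = Q[0]
orig = attn(qi, K, V)                                   # f(Q_i)
prefix_only = attn(qi, Pk, Pv)                          # Attn(Q_i, P_k, P_v)
with_prefix = attn(qi, np.vstack([Pk, K]), np.vstack([Pv, V]))

# lam: fraction of the unnormalized attention mass that the prefix receives.
mass_prefix = np.exp(qi @ Pk.T / np.sqrt(p)).sum()
mass_orig = np.exp(qi @ K.T / np.sqrt(p)).sum()
lam = mass_prefix / (mass_prefix + mass_orig)

assert np.allclose(with_prefix, (1 - lam) * orig + lam * prefix_only)
```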
Remark. From the kernel estimator perspective, the two categories of virtual tokens play different roles. The virtual key vectors P_k apply to the empirical kernel matrix and can alter the attention scores (and thus the weight given to Attn(Q_i, P_k, P_v)), whereas P_v takes effect in the value part. It might not be optimal for prefix-tuning to model the two categories of virtual tokens similarly. In § 4.3 we will show how inducer-tuning addresses the two parts through different residual forms.
We suggest that the mechanism of prefix-tuning can be further understood through the concept of inducing variables in the kernel learning literature (Titsias, 2009). Many computational methods in kernel learning utilize a small set of support points (inducing variables) to improve inference performance (Musco and Musco, 2017; Chen and Yang, 2021). Snelson and Ghahramani (2005) specifically consider the inducing variables as auxiliary pseudo-inputs and infer them through continuous optimization, which is similar to prefix-tuning. We emphasize that, at first sight, the main characteristic of inducing-point methods is representing a vast number of training examples through a small number of points so as to reduce the computational cost; here, however, we instead aim to leverage the mechanism of inducing variables to steer the estimation well: our goal is to strengthen prefix-tuning by making the prefixes better modulate the attention output. We introduce and analyze this mechanism as follows.
Mechanism for well-steering inference outputs in inducing-point methods. Conceptually, inducing variables help the inference because they can represent the distribution of the query inputs and steer the kernel methods without changing the kernel in use. In particular, we consider the distribution pattern of the unconstrained inducing points X_M (Snelson and Ghahramani, 2005, Figure 1). We observe that most of them are close to the testing examples X, and in the new estimation (Snelson and Ghahramani, 2005, Equation (8)) the inducers X_M receive large weights through the weight-assignment mechanism of kernel methods (recall that kernel methods assign sample weights just as attention does (Choromanski et al., 2020; Chen et al., 2021; Tsai et al., 2019); inducing variables close to the query automatically receive more attention), and thus effectively modulate the output.
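A toy example of this weight-assignment intuition (only an illustration with a Gaussian kernel, not the sparse-GP construction of Snelson and Ghahramani (2005)): an inducing point placed near the query captures almost all of the normalized kernel weight, so it can shift the resulting estimate, whereas distant support points barely matter.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
q = rng.normal(size=p)                          # a query point
supports = rng.normal(size=(10, p)) * 3.0       # support points far from the query
inducer = q + 0.1 * rng.normal(size=p)          # an inducing point close to the query

def normalized_weights(query, points):
    # Gaussian-kernel weights, row-normalized as in a Nadaraya-Watson estimator.
    w = np.exp(-np.sum((points - query) ** 2, axis=1))
    return w / w.sum()

w = normalized_weights(q, np.vstack([supports, inducer[None, :]]))
print(f"weight on the inducer: {w[-1]:.3f}; largest weight on a distant support: {w[:-1].max():.2e}")
```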
From this mechanism, we draw an inductive bias, "the prefix should be close to the query" (which is not enforced in prefix-tuning), and accordingly propose inducer-tuning. We remark that since we are not pursuing the original goal of inducing variables, namely reducing the computational cost, it is natural that the concrete design in the next subsection differs from the usual form of inducing points, i.e., a small number of samples.
We speculate that prefix-tuning partially benefits from the above mechanism as well, and some indirect evidence supports this view. As discussed in previous studies, to realize the full potential of prompting, manually designed prompts are expected to be related to the topic of the input sequence (Brown et al., 2020), i.e., close to the query; even soft prompts are recommended to be initialized with tokens relevant to the specific task (Li and Liang, 2021), which also requires the prompts to be close to the query to provide effective adaptation. With this belief, we propose inducer-tuning to further exploit the mechanism of inducing variables and improve upon prefix-tuning.
4.3 Method
Inducer-tuning follows the same design principle as prefix-tuning: it modulates the attention output by inserting virtual tokens (vectors). However, unlike prefix-tuning, our virtual tokens are not shared among the input sequences. Inducer-tuning also incorporates the benefits of residual forms to ease initialization and to remove the re-parametrization trick in prefix-tuning. Specifically, we suggest the following modifications:
1. The "inducers" are adaptive to and customized for each input token, to strengthen the expressiveness of the new attention output.
2. We model the virtual vectors in a residual form, as in an adapter, which makes the final attention output take a residual form as well.
We now discuss the intuitions behind these modifications in detail.
Adaptive inducers. There is an important difference between language models and kernel methods that makes fixed prefixes less effective than inducing variables in kernel methods: in language models, the distribution of the input queries keeps changing, and for some inputs the fixed prefixes fail to qualify as "inducing variables". Even worse, for a long input there probably exist query vectors that are far (in ℓ2 distance) from all the virtual vectors in the fixed prefixes, and these prefixes are thus unable to modulate the attention output well. The phenomenon that prefix-tuning performs relatively poorly on tasks with longer inputs can be observed in our experiments (§ 6).

To alleviate the above issue, we propose adaptive modeling of the virtual key vectors. For a query Q_i, we suggest taking a vector close to Q_i itself as the corresponding virtual key vector (the length of