Prompt-based tuning. Another main research line of PETuning is prompt-based tuning, which inserts additional soft prompts into the hidden states instead of injecting new neural modules into PTMs. Prompt tuning (Lester et al., 2021) and P-tuning (Liu et al., 2021) insert a soft prompt into the word embeddings only, and can achieve competitive results when applied to supersized PTMs. Prefix-tuning (Li and Liang, 2021) and P-tuning v2 (Liu et al., 2022b) insert prompts into every hidden layer of the PTM. BBT (Sun et al., 2022b) optimizes the inserted prompt with derivative-free optimization.
Some prompt-based tuning methods, such as prompt tuning and BBT, formulate downstream tasks as pre-training tasks (e.g., masked language modeling) to close the gap between pre-training and downstream training (Sun et al., 2022a). There are also prompt-based methods with instance-aware prompts. IDPG (Wu et al., 2022) uses a prompt generator with parameterized hypercomplex multiplication (Zhang et al., 2021) to generate a soft prompt for every instance. Context-tuning (Tang et al., 2022) uses a BERT model (Devlin et al., 2019) as the prompt generator and focuses on NLG tasks. IPL (Jin et al., 2022) first calculates relevance scores between prompt tokens and inputs, then uses the scores to re-weight the original prompt tokens; however, it tunes all parameters of the PTM. All of the above instance-aware prompt methods share the same weakness: they must encode the inputs with an extra encoder, which slows down training and increases inference latency.
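To make the shared idea concrete, the following is a minimal sketch of an instance-aware prompt generator that maps a pooled input representation to a per-instance soft prompt. The bottleneck design and all names here are illustrative assumptions; IDPG itself uses parameterized hypercomplex multiplication layers rather than the plain linear bottleneck shown.

```python
import torch
import torch.nn as nn

class InstanceAwarePromptGenerator(nn.Module):
    """Hypothetical sketch: pooled input representation -> per-instance
    soft prompt. IDPG uses PHM layers instead of this plain bottleneck."""

    def __init__(self, hidden_size: int, prompt_len: int, bottleneck: int = 64):
        super().__init__()
        self.prompt_len = prompt_len
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, prompt_len * hidden_size)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden) -> soft prompt: (batch, prompt_len, hidden)
        out = self.up(torch.tanh(self.down(pooled)))
        return out.view(-1, self.prompt_len, pooled.size(-1))
```

The extra forward pass needed to produce `pooled` for every input is exactly the overhead criticized above.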
There are also other popular PETuning methods, such as BitFit (Zaken et al., 2022), which only tunes the bias terms, and LoRA (Hu et al., 2022), which optimizes low-rank decomposition matrices of the weights within self-attention layers.
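As a rough illustration of the LoRA idea applied to a single linear projection, the sketch below keeps the pre-trained weight frozen and trains only the low-rank factors; the rank and scaling values are illustrative, not LoRA's prescribed defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-augmented linear layer: the frozen weight W is
    adapted only through the trainable low-rank product B @ A."""

    def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)  # frozen pre-trained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_dim, r))        # trainable, zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha / r) * B A x; the update starts at zero because B is zero-init.
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```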
3 Problem Formulation
Given a PTM $\mathcal{M}$, in the setting of model tuning, we first reformulate single-sentence inputs as $E(\texttt{[CLS]}\,\langle S_1\rangle\,\texttt{[SEP]})$ and sentence-pair inputs as $E(\texttt{[CLS]}\,\langle S_1\rangle\,\texttt{[SEP]}\,\langle S_2\rangle\,\texttt{[SEP]})$, where $E$ is the embedding layer of $\mathcal{M}$. The final hidden state of the [CLS] token is used to predict the label. In the setting of prompt tuning, we insert a randomly initialized soft prompt $p$ into the word embeddings, and also modify the original inputs using different manual templates with a [MASK] token for different tasks. For example, a single-sentence input from a sentiment analysis task is transformed into $\mathrm{concat}(p,\, E(\texttt{[CLS]}\,\langle S_1\rangle\,\text{It was}\,\texttt{[MASK]}.\,\texttt{[SEP]}))$. Then, we map the original labels $\mathcal{Y}$ to words in the vocabulary $\mathcal{V}$ of $\mathcal{M}$, which formulates downstream tasks as a language modeling task to close the gap between pre-training and downstream training. The final hidden state of the [MASK] token is used to predict the label.
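The following sketch makes this input construction concrete for a single sentiment-analysis instance. The vocabulary size, hidden width, prompt length, and verbalizer token ids are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes; E stands in for the frozen embedding layer of M.
vocab_size, hidden, prompt_len = 50265, 1024, 20
E = nn.Embedding(vocab_size, hidden)
p = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)  # trainable soft prompt

def build_prompt_tuning_input(token_ids: torch.Tensor) -> torch.Tensor:
    """concat(p, E([CLS] <S1> It was [MASK]. [SEP])) for one instance.
    `token_ids` is assumed to already contain the template tokens
    ([CLS], "It was", [MASK], [SEP]) around the sentence S1."""
    word_embeds = E(token_ids)                 # (seq_len, hidden)
    return torch.cat([p, word_embeds], dim=0)  # (prompt_len + seq_len, hidden)

# Verbalizer: map the original labels Y to words in the vocabulary V
# (these token ids are made up for illustration). The MLM head's logits
# at the [MASK] position are then compared over these ids only.
verbalizer = {"positive": 372, "negative": 6587}
```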
In the setting of our proposed method LPT, we use a prompt generator (PG) to generate an independent prompt $p$ for every input. In addition, the layer that the prompt is inserted into is an intermediate layer of the PTM instead of the word embeddings; we refer to this layer as the prompt layer (PL).
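A minimal sketch of this late insertion follows, assuming the PG consumes a mean-pooled view of the hidden states entering the PL; the stand-in Transformer blocks and the trivial generator are for illustration only, and the paper's actual PG design is described later.

```python
import torch
import torch.nn as nn

def forward_with_late_prompt(layers, embeds, prompt_generator, prompt_layer):
    """Run frozen Transformer blocks, splicing a generated prompt in at
    the intermediate prompt layer (PL) instead of at the embeddings.
    Attention-mask bookkeeping is omitted for brevity."""
    h = embeds                                   # (batch, seq, hidden)
    for i, layer in enumerate(layers):
        if i == prompt_layer:
            # Generate an instance-dependent prompt from the hidden
            # states entering the PL (mean pooling is an assumption),
            # then prepend it to the sequence.
            p = prompt_generator(h.mean(dim=1))  # (batch, prompt_len, hidden)
            h = torch.cat([p, h], dim=1)
        h = layer(h)
    return h

# Toy usage: stand-in blocks mapping (batch, seq, hidden) -> same shape.
layers = [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
          for _ in range(6)]
gen = lambda pooled: pooled.unsqueeze(1).repeat(1, 4, 1)  # trivial stand-in PG
out = forward_with_late_prompt(layers, torch.randn(2, 10, 64), gen, prompt_layer=3)
```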
4 Why Does Prompt Tuning Perform Poorly?
The workflow of prompt tuning is to make the inserted soft prompt carry task-related information through downstream training. In the inference phase, this prompt can interact with test inputs during layer-upon-layer propagation so that the hidden representations of these inputs also contain task-related information. There are strong interactions between the prompt and the text inputs because prompt tuning inserts the prompt into the word embeddings. However, there is a long propagation path from the label signals to the prompt. Therefore, we speculate that the poor performance of prompt tuning is due to this long propagation path of task-related information, which causes much of the task-related information to be lost during propagation through the frozen model and thus hurts performance. To verify this conjecture, we conduct pilot experiments on the TREC (Voorhees and Tice, 2000) and RTE (Dagan et al., 2005) datasets using RoBERTa$_{\text{LARGE}}$ (Liu et al., 2019).
Does shortening the propagation distance improve performance? We start with a simple experimental setting where the soft prompt is inserted into different layers of RoBERTa$_{\text{LARGE}}$, and we observe how performance changes as the prompt layer changes. As shown in the left plots of Figure 2, performance first increases and then decreases as the prompt layer rises, and is highest when the prompt layer is in the range of 12 to 14. In addition, we also explore the convergence rates at different prompt layers. For simplicity, we only consider three prompt layers: 1, 13, and 24. The middle plots in Figure 2 show that the model converges fastest when the prompt layer is 13. This trend is consistent with the performance trend shown in the left plots. We can