XPROMPT: Exploring the Extreme of Prompt Tuning
Fang Ma, Chen Zhang, Lei Ren, Jingang Wang, Qifan Wang, Wei Wu, Xiaojun Quan, Dawei Song
Beijing Institute of Technology {mfang,czhang,dwsong}@bit.edu.cn
Meituan NLP {wangjingang02,wuwei30}@meituan.com,renlei_work@163.com
Meta AI wqfcr@fb.com
Sun Yat-Sen University quanxj3@mail.sysu.edu.cn
Abstract
Prompt tuning learns soft prompts to condition frozen Pre-trained Language Models (PLMs) for performing downstream tasks in a parameter-efficient manner. While prompt tuning has gradually reached the performance level of fine-tuning as the model scale increases, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate and small scales (typically less than 11B parameters). In this paper, we empirically show that the trained prompt tokens can have a negative impact on a downstream task and thus degrade its performance. To bridge the gap, we propose a novel PROMPT tuning model with an eXtremely small scale (XPROMPT) under the regime of the lottery ticket hypothesis. Specifically, XPROMPT eliminates the negative prompt tokens at different granularity levels through hierarchical structured pruning, yielding a more parameter-efficient prompt with competitive performance. Comprehensive experiments are carried out on SuperGLUE tasks, and the extensive results indicate that XPROMPT is able to close the performance gap at smaller model scales.
1 Introduction
Pre-trained Language Models (PLMs) have been widely applied and achieved remarkable success in various NLP tasks (Devlin et al., 2019; Raffel et al., 2020; Zhou et al., 2020) under the pre-train-then-fine-tune paradigm (Liu et al., 2019). Despite its compelling performance, fine-tuning is parameter-inefficient for large-scale PLMs, because the memory footprint is proportional to the number of trainable parameters whose gradients and optimizer states need to be stored (Guo et al., 2021).

Dawei Song and Jingang Wang are the corresponding authors.
Figure 1: XPROMPT outperforms the vanilla Prompt-Tuning (Lester et al., 2021) and can significantly improve over Prompt-Tuning across tasks and model scales. It is worth noting that there is a small performance gap between prompt tuning and fine-tuning on T5-XXL (11B) due to different hyperparameter settings and initialization. Similar observations have been found in Figures 3-a and 3-b of Lester et al. (2021).
Recently, Prompt-Tuning (Lester et al., 2021; Liu et al., 2021b) has been proposed to address this issue by prepending a soft prompt to the input and updating only the parameters of the prompt tokens during tuning. Prompt-Tuning provides a parameter-efficient alternative to fine-tuning, since the soft prompt is tens of thousands of times smaller. It is also conceptually simpler and more flexible than other parameter-efficient tuning methods such as Adapters, which require intrusive modifications to transformer layers (Houlsby et al., 2019; Guo et al., 2021). Using fewer tunable parameters, prompt tuning achieves performance competitive with fine-tuning as the model scale increases. However, there is still a large performance gap between prompt tuning and fine-tuning for models of smaller scales (as shown in Figure 1).
This paper aims to fill the gap from the perspective of the lottery ticket hypothesis (LTH) (Frankle and Carbin, 2019). We are motivated by the observation that, on a specific task, not all prompt tokens contribute equally to task performance, and certain prompt tokens may even bring a negative impact; Figure 2 provides a preliminary result of this observation.
Figure 2: The performance comparison of Prompt-Tuning, Negative Prompt Masking, and Random Prompt Masking with T5-XL (3B) on three SuperGLUE tasks. Prompt-Tuning uses all prompt tokens. Negative Prompt Masking masks selected (negative) prompt tokens with low importance scores. Random Prompt Masking randomly masks the same number of tokens as in Negative Prompt Masking.
These negative prompt tokens can be circumvented under the regime of LTH. Essentially, LTH states that an over-parameterized network contains a sub-network that, when initialized and trained in isolation, can match or exceed the test accuracy of the original network after training for at most the same number of iterations. Such a sub-network is called a lottery ticket, and the collection of these tickets is referred to as the winning tickets in PLMs (Liang et al., 2021). In prompt tuning, the winning tickets are the collection of positive prompt tokens that can achieve the same performance as the entire collection of prompts, while the losing tickets are the collection of negative prompt tokens.

Therefore, the key is to identify the winning tickets and eliminate the losing ones in the collection of trained prompt tokens. In particular, we propose to eliminate the losing tickets through hierarchical structured pruning, which first removes negative tokens at the token level and then prunes the remaining ones at a finer granularity, i.e., the piece level, for a better trade-off between effectiveness and efficiency. In line with LTH, weight rewinding (Renda et al., 2020) is adopted to retrain the identified positive soft prompts. With the elimination of negative prompt tokens, a more parameter-efficient PROMPT of an eXtremely small scale (XPROMPT) is obtained.
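To make the winning-versus-losing-ticket view concrete, below is a minimal PyTorch sketch of the kind of masking used in the preliminary experiment of Figure 2: prompt tokens with the lowest importance scores are treated as losing tickets and zeroed out. The function name, the keep_ratio parameter, and the way the importance scores are supplied are illustrative assumptions; the paper's actual scoring and pruning procedure is described in Section 4.

```python
import torch

def mask_negative_prompts(prompt_embeds: torch.Tensor,
                          importance: torch.Tensor,
                          keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the prompt tokens with the lowest importance scores.

    prompt_embeds: (m, e) trained soft prompt embeddings.
    importance:    (m,) one score per prompt token; how these scores
                   are computed is an assumption here (the paper
                   defines its own scoring in Section 4).
    keep_ratio:    fraction of tokens kept as "winning tickets".
    """
    m = prompt_embeds.size(0)
    k = max(1, int(m * keep_ratio))
    winners = torch.topk(importance, k).indices   # highest-scoring tokens
    mask = torch.zeros(m, 1, dtype=prompt_embeds.dtype)
    mask[winners] = 1.0
    return prompt_embeds * mask                   # losing tickets -> zero
```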
To verify the effectiveness of XPROMPT, we conduct an extensive set of experiments on SuperGLUE (Wang et al., 2019) in both high-resource and low-resource scenarios. As shown in Figure 1 and Table 1, the results demonstrate that XPROMPT significantly improves over prompt-tuning methods across tasks and model scales. For models of moderate scales, XPROMPT closes the gap and achieves performance comparable to fine-tuning. For models of large scales, XPROMPT also leads to large performance gains over Prompt-Tuning, and even exceeds fine-tuning for most tasks.
2 Related Work
2.1 Pre-trained Language Models
Pre-trained Language Models (PLMs) have achieved remarkable success in various NLP tasks (Zhou et al., 2020; Raffel et al., 2020; Brown et al., 2020). BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are two pioneers that learn contextual representations with masked language modeling (MLM) and next sentence prediction pre-training tasks. Recently, a series of large-scale PLMs have emerged with different pre-training designs, such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), ELECTRA (Clark et al., 2020), XLNet (Yang et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). However, with the exploding number of parameters, fine-tuning these models becomes parameter-inefficient and computationally expensive due to the maintenance of all parameters in the PLMs. Moreover, one has to fine-tune different models for different tasks and store them separately, which is resource-intensive.
2.2 Prompt Learning in NLP
With the development of GPT-3 (Brown et al., 2020), prompt learning has drawn much attention in the NLP community (Liu et al., 2021a; Ding et al., 2022); it enables efficient learning by adding a number of prompt tokens to the input. Prompt learning has been proven effective in various downstream tasks (Davison et al., 2019; Gong and Eldardiry, 2021; Radford et al., 2019; Wang et al., 2021; Khashabi et al., 2020). Recently, prompts have been extended from discrete tokens (tokens in the vocabulary) to continuous tokens (trainable embeddings), i.e., soft prompts (Li and Liang, 2021; Zhong et al., 2021; Qin and Eisner, 2021). For example, Lester et al. (2021) propose a parameter-efficient prompt tuning approach that tunes only the soft prompts while keeping all parameters of the PLM fixed. Prompt tuning has achieved great success and can reach the performance of fine-tuning with large PLMs. However, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate scales.
Figure 3: The illustration of our proposed XPROMPT approach, which consists of three stages: Prompt-Tuning, Hierarchical Structured Pruning, and Rewinding. In all stages, the parameters of T5 (encoder-decoder) are frozen; only the parameters of the prompts are tuned. The prompts trained in one stage are fed into the next stage as the initialization prompts, and the change of color represents whether the prompt parameters are tuned or pruned. The pruning stage applies a token-level mask and a piece-level mask to the soft prompt.
More recently, Vu et al. (2021) propose a prompt-based transfer learning approach, SPOT, which improves prompt tuning by learning a prompt on source tasks and then applying it to initialize the prompt for a target task. Most recently, He et al. (2022) propose HyperPrompt, which uses hypernetworks to generate hyper-prompts and obtains superior performance. However, it needs to tune all parameters, showing that tuning only the task-conditioned parameters is not enough to achieve results competitive with full model fine-tuning in multi-task learning.
2.3 Lottery Ticket Hypothesis
The lottery ticket hypothesis (Frankle and Carbin, 2019) finds that an over-parameterized network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations. Such a subnetwork is called a lottery ticket. In NLP, the collection of lottery tickets is referred to as the winning tickets in highly over-parameterized models, e.g., PLMs (Liang et al., 2021). Such winning tickets have demonstrated their ability to transfer across tasks and datasets (Morcos et al., 2019; Yu et al., 2020; Desai et al., 2019). Recently, Chen et al. (2021) have shown the existence of winning tickets in PLMs, and Liang et al. (2021) observe that the generalization performance of the winning tickets can even exceed that of the full model.
3 Preliminary
Built upon the text-to-text approach of T5 (Raffel et al., 2020), prompt tuning formulates all tasks as text generation by prepending $m$ additional tunable soft prompt tokens to the input and updating only the parameters of these inserted soft prompt tokens. Specifically, given a series of $n$ input tokens $X = \{x_1, x_2, \dots, x_n\}$, T5 first generates the token embeddings $X_e \in \mathbb{R}^{n \times e}$, where $e$ is the dimension of the embedding space. It also generates soft prompt embeddings $P_e = \{p_1, p_2, \dots, p_m\} \in \mathbb{R}^{m \times e}$, where $m$ is the length of the soft prompt. The soft prompts are then prepended to the input sequence as $[P_e; X_e] \in \mathbb{R}^{(m+n) \times e}$. The goal of prompt tuning is to maximize the likelihood of the labels $Y$ by optimizing only over $P_e$:

$$\arg\max_{P_e} \log p(Y \mid [P_e; X_e]) \tag{1}$$
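As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of prompt tuning with the Hugging Face transformers library: every T5 parameter is frozen and only the soft prompt $P_e$ receives gradients. The checkpoint name, prompt length, and learning rate are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
for param in model.parameters():        # freeze every T5 parameter
    param.requires_grad = False

m, e = 20, model.config.d_model         # prompt length m, embedding size e
prompt = nn.Parameter(torch.randn(m, e) * 0.5)   # P_e in R^{m x e}

def prompt_tuning_loss(text: str, label: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt")
    x_e = model.get_input_embeddings()(enc.input_ids)      # X_e: (1, n, e)
    inputs = torch.cat([prompt.unsqueeze(0), x_e], dim=1)  # [P_e; X_e]
    attn = torch.cat([torch.ones(1, m, dtype=enc.attention_mask.dtype),
                      enc.attention_mask], dim=1)
    labels = tokenizer(label, return_tensors="pt").input_ids
    # model(...).loss is the negative log-likelihood of the labels, so
    # minimizing it maximizes log p(Y | [P_e; X_e]) over P_e alone.
    return model(inputs_embeds=inputs, attention_mask=attn,
                 labels=labels).loss

optimizer = torch.optim.Adam([prompt], lr=0.3)  # only P_e is optimized
loss = prompt_tuning_loss("rte premise: ... hypothesis: ...", "entailment")
loss.backward()
optimizer.step()
```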
Prompt tuning becomes more effective as the model scale increases. However, there is still a significant performance gap between prompt tuning and fine-tuning, especially for models of small and moderate scales. Our hypothesis is that not all soft prompt tokens contribute equally to the performance after training on the target task; certain soft prompt tokens may even have negative impacts on the task. Therefore, drawing on the lottery ticket hypothesis, we propose XPROMPT with hierarchical structured pruning to identify the optimal soft prompts and bridge the performance gap.
4 XPROMPT
The overall process of XPROMPT is illustrated in Figure 3 and consists of three main stages: Prompt-Tuning, Hierarchical Structured Pruning, and Rewinding. Specifically, prompt tuning learns an initial set of values for all soft prompt tokens on the target task. During hierarchical structured pruning, token-level and piece-level pruning are then performed to separate the positive prompt tokens from the negative ones.
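The sketch below shows how the two pruning granularities could compose when applied to a trained soft prompt, assuming each token embedding splits into q equal-sized pieces. The mask shapes and the even split are illustrative assumptions based on the description above; how the masks themselves are chosen is specified by the paper's pruning procedure.

```python
import torch

def hierarchical_mask(prompt_embeds: torch.Tensor,
                      token_mask: torch.Tensor,
                      piece_mask: torch.Tensor) -> torch.Tensor:
    """Apply token-level and piece-level pruning masks to soft prompts.

    prompt_embeds: (m, e) soft prompt embeddings after prompt tuning.
    token_mask:    (m,)   0/1 flag per prompt token (token level).
    piece_mask:    (m, q) 0/1 flag per piece, each piece covering
                   e // q consecutive embedding dimensions (piece level).
    """
    m, e = prompt_embeds.shape
    q = piece_mask.size(1)
    assert e % q == 0, "assumes the embedding splits into q equal pieces"
    # Broadcast each piece flag across the dimensions it covers: (m, e).
    fine = piece_mask.repeat_interleave(e // q, dim=1)
    # A piece survives only if its whole token also survives.
    return prompt_embeds * token_mask.unsqueeze(1) * fine
```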