XPROMPT: Exploring the Extreme of Prompt Tuning
Fang Ma, Chen Zhang, Lei Ren, Jingang Wang, Qifan Wang, Wei Wu, Xiaojun Quan, Dawei Song
Beijing Institute of Technology {mfang,czhang,dwsong}@bit.edu.cn
Meituan NLP {wangjingang02,wuwei30}@meituan.com,renlei_work@163.com
Meta AI wqfcr@fb.com
Sun Yat-Sen University quanxj3@mail.sysu.edu.cn
Abstract
Prompt tuning learns soft prompts to condition frozen Pre-trained Language Models (PLMs) for performing downstream tasks in a parameter-efficient manner. While prompt tuning has gradually reached the performance level of fine-tuning as the model scale increases, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate and small scales (typically less than 11B parameters). In this paper, we empirically show that the trained prompt tokens can have a negative impact on a downstream task and thus degrade its performance. To bridge the gap, we propose a novel PROMPT tuning model with an eXtremely small scale (XPROMPT) under the regime of the lottery ticket hypothesis. Specifically, XPROMPT eliminates the negative prompt tokens at different granularity levels through hierarchical structured pruning, yielding a more parameter-efficient prompt with competitive performance. Comprehensive experiments are carried out on SuperGLUE tasks, and the extensive results indicate that XPROMPT is able to close the performance gap at smaller model scales.
1 Introduction
Pre-trained Language Models (PLMs) have been widely applied and achieved remarkable success in various NLP tasks (Devlin et al., 2019; Raffel et al., 2020; Zhou et al., 2020) under the pre-train-then-fine-tune paradigm (Liu et al., 2019). Despite its compelling performance, fine-tuning is parameter-inefficient for large-scale PLMs, because the memory footprint is proportional to the number of trainable parameters whose gradients and optimizer states need to be stored (Guo et al., 2021).

Dawei Song and Jingang Wang are the corresponding authors.
Figure 1: XPROMPT outperforms the vanilla Prompt-Tuning (Lester et al., 2021) and can significantly improve over Prompt-Tuning across tasks and model scales. It is worth noting that there is a small performance gap between prompt tuning and fine-tuning on T5-XXL (11B) due to different hyperparameter settings and initialization. Similar observations have been found in Figures 3-a and 3-b of Lester et al. (2021).
Recently, Prompt-Tuning (Lester et al., 2021; Liu et al., 2021b) has been proposed to address this issue by prepending a soft prompt to the input and updating only the parameters of the prompt tokens during tuning. Prompt-Tuning provides a parameter-efficient alternative to fine-tuning, since the soft prompt is tens of thousands of times smaller. It is also conceptually simpler and more flexible than other parameter-efficient tuning methods such as Adapters, which require intrusive modifications to transformer layers (Houlsby et al., 2019; Guo et al., 2021). Using fewer tunable parameters, prompt tuning achieves performance competitive with fine-tuning as the model scale increases. However, there is still a large performance gap between prompt tuning and fine-tuning for models of smaller scales (as shown in Figure 1).
This paper aims to fill the gap from the perspective of the lottery ticket hypothesis (LTH) (Frankle and Carbin, 2019). We are motivated by the observation that, on a specific task, not all prompt tokens contribute equally to task performance, and certain prompt tokens may even bring a negative impact; Figure 2 provides a preliminary result of this observation.
Figure 2: The performance comparison of Prompt-Tuning, Negative Prompt Masking, and Random Prompt Masking with T5-XL (3B) on three SuperGLUE tasks. Prompt-Tuning uses all prompt tokens. Negative Prompt Masking masks selected (negative) prompt tokens with low importance scores. Random Prompt Masking randomly masks the same number of tokens as in Negative Prompt Masking.
These negative prompt tokens can be circumvented under the regime of LTH. Essentially, LTH states that an over-parameterized network contains a sub-network that, when initialized and trained in isolation, can match or exceed the test accuracy of the original network after training for at most the same number of iterations. Such a sub-network is called a lottery ticket, and the collection of these tickets is referred to as the winning tickets in PLMs (Liang et al., 2021). In prompt tuning, the winning tickets are the collection of positive prompt tokens that can achieve the same performance as the entire collection of prompts, while the losing tickets are the collection of negative prompt tokens.

Therefore, the key is to identify the winning tickets and eliminate the losing ones in the collection of trained prompt tokens. In particular, we propose to eliminate the losing tickets through hierarchical structured pruning, which first removes negative tokens at the token level and then prunes the remaining ones at a finer granularity, i.e., the piece level, for a better trade-off between effectiveness and efficiency. In line with LTH, weight rewinding (Renda et al., 2020) is adopted to retrain the identified positive soft prompts. With the elimination of negative prompt tokens, a more parameter-efficient PROMPT of an eXtremely small scale (XPROMPT) is obtained.
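To make the winning-versus-losing-ticket view concrete, below is a minimal PyTorch sketch of the kind of masking used in the preliminary experiment of Figure 2: prompt tokens with the lowest importance scores are treated as losing tickets and zeroed out. The function name, the keep_ratio parameter, and the way the importance scores are supplied are illustrative assumptions; the paper's actual scoring and pruning procedure is described in Section 4.

```python
import torch

def mask_negative_prompts(prompt_embeds: torch.Tensor,
                          importance: torch.Tensor,
                          keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the prompt tokens with the lowest importance scores.

    prompt_embeds: (m, e) trained soft prompt embeddings.
    importance:    (m,) one score per prompt token; how these scores
                   are computed is an assumption here (the paper
                   defines its own scoring in Section 4).
    keep_ratio:    fraction of tokens kept as "winning tickets".
    """
    m = prompt_embeds.size(0)
    k = max(1, int(m * keep_ratio))
    winners = torch.topk(importance, k).indices   # highest-scoring tokens
    mask = torch.zeros(m, 1, dtype=prompt_embeds.dtype)
    mask[winners] = 1.0
    return prompt_embeds * mask                   # losing tickets -> zero
```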
To verify the effectiveness of XPROMPT, we conduct an extensive set of experiments on SuperGLUE (Wang et al., 2019) in both high-resource and low-resource scenarios. As shown in Figure 1 and Table 1, the results demonstrate that XPROMPT significantly improves over prompt-tuning methods across tasks and model scales. For models of moderate scales, XPROMPT closes the gap and achieves performance comparable to fine-tuning. For models of large scales, XPROMPT also leads to large performance gains over Prompt-Tuning, and even exceeds fine-tuning for most tasks.
2 Related Work
2.1 Pre-trained Language Models
Pre-trained Language Models (PLMs) have achieved remarkable success in various NLP tasks (Zhou et al., 2020; Raffel et al., 2020; Brown et al., 2020). BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are two pioneers that learn contextual representations with masked language modeling (MLM) and next sentence prediction pre-training tasks. Recently, a series of large-scale PLMs have emerged with different pre-training designs, such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), ELECTRA (Clark et al., 2020), XLNet (Yang et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). However, with the exploding number of parameters, fine-tuning these models becomes parameter-inefficient and computationally expensive due to the maintenance of all parameters in the PLMs. Moreover, one has to fine-tune different models for different tasks and store them separately, which is resource-intensive.
2.2 Prompt Learning in NLP
With the development of GPT-3 (Brown et al., 2020), prompt learning has drawn much attention in the NLP community (Liu et al., 2021a; Ding et al., 2022); it enables efficient learning by adding a number of prompt tokens to the input. Prompt learning has been proven effective in various downstream tasks (Davison et al., 2019; Gong and Eldardiry, 2021; Radford et al., 2019; Wang et al., 2021; Khashabi et al., 2020). Recently, prompts have been extended from discrete tokens (tokens in the vocabulary) to continuous tokens (trainable embeddings), i.e., soft prompts (Li and Liang, 2021; Zhong et al., 2021; Qin and Eisner, 2021). For example, Lester et al. (2021) propose a parameter-efficient prompt tuning approach that tunes only the soft prompts while keeping all parameters of the PLM fixed. Prompt tuning has achieved great success and can reach the performance of fine-tuning with large PLMs. However, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate scales.
Figure 3: The illustration of our proposed XPROMPT approach, which consists of three stages: Prompt-Tuning, Hierarchical Structured Pruning, and Rewinding. In all stages, the parameters of T5 (encoder-decoder) are frozen; only the parameters of the prompts are tuned. The prompts trained in one stage are fed into the next stage as the initialization prompts, and the change of color represents whether the prompt parameters are tuned or pruned. The pruning stage applies a token-level mask and a piece-level mask to the soft prompt.
More recently, Vu et al. (2021) propose a prompt-based transfer learning approach, SPOT, which improves prompt tuning by learning a prompt on source tasks and then applying it to initialize the prompt for a target task. Most recently, He et al. (2022) propose HyperPrompt, which uses hypernetworks to generate hyper-prompts and obtains superior performance. However, it needs to tune all parameters, showing that tuning only the task-conditioned parameters is not enough to achieve results competitive with full model fine-tuning in multi-task learning.
2.3 Lottery Ticket Hypothesis
The lottery ticket hypothesis (Frankle and Carbin, 2019) finds that an over-parameterized network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations. Such a subnetwork is called a lottery ticket. In NLP, the collection of lottery tickets is referred to as the winning tickets in highly over-parameterized models, e.g., PLMs (Liang et al., 2021). Such winning tickets have demonstrated their ability to transfer across tasks and datasets (Morcos et al., 2019; Yu et al., 2020; Desai et al., 2019). Recently, Chen et al. (2021) have shown the existence of winning tickets in PLMs, and Liang et al. (2021) observe that the generalization performance of the winning tickets can even exceed that of the full model.
3 Preliminary
Built upon the text-to-text approach of T5 (Raffel et al., 2020), prompt tuning formulates all tasks as text generation by prepending $m$ additional tunable soft prompt tokens to the input and updating only the parameters of these inserted soft prompt tokens. Specifically, given a series of $n$ input tokens $X = \{x_1, x_2, \dots, x_n\}$, T5 first generates the token embeddings $X_e \in \mathbb{R}^{n \times e}$, where $e$ is the dimension of the embedding space. It also generates soft prompt embeddings $P_e = \{p_1, p_2, \dots, p_m\} \in \mathbb{R}^{m \times e}$, where $m$ is the length of the soft prompt. The soft prompts are then prepended to the input sequence as $[P_e; X_e] \in \mathbb{R}^{(m+n) \times e}$. The goal of prompt tuning is to maximize the likelihood of the labels $Y$ by optimizing only over $P_e$:

$$\arg\max_{P_e} \log p(Y \mid [P_e; X_e]) \tag{1}$$
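As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of prompt tuning with the Hugging Face transformers library: every T5 parameter is frozen and only the soft prompt $P_e$ receives gradients. The checkpoint name, prompt length, and learning rate are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
for param in model.parameters():        # freeze every T5 parameter
    param.requires_grad = False

m, e = 20, model.config.d_model         # prompt length m, embedding size e
prompt = nn.Parameter(torch.randn(m, e) * 0.5)   # P_e in R^{m x e}

def prompt_tuning_loss(text: str, label: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt")
    x_e = model.get_input_embeddings()(enc.input_ids)      # X_e: (1, n, e)
    inputs = torch.cat([prompt.unsqueeze(0), x_e], dim=1)  # [P_e; X_e]
    attn = torch.cat([torch.ones(1, m, dtype=enc.attention_mask.dtype),
                      enc.attention_mask], dim=1)
    labels = tokenizer(label, return_tensors="pt").input_ids
    # model(...).loss is the negative log-likelihood of the labels, so
    # minimizing it maximizes log p(Y | [P_e; X_e]) over P_e alone.
    return model(inputs_embeds=inputs, attention_mask=attn,
                 labels=labels).loss

optimizer = torch.optim.Adam([prompt], lr=0.3)  # only P_e is optimized
loss = prompt_tuning_loss("rte premise: ... hypothesis: ...", "entailment")
loss.backward()
optimizer.step()
```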
Prompt tuning becomes more effective as the model scale increases. However, there is still a significant performance gap between prompt tuning and fine-tuning, especially for models of small and moderate scales. Our hypothesis is that not all soft prompt tokens contribute equally to the performance after training on the target task; certain soft prompt tokens may even have negative impacts on the task. Therefore, drawing on the lottery ticket hypothesis, we propose XPROMPT with hierarchical structured pruning to identify the optimal soft prompts and bridge the performance gap.
4 XPROMPT
The overall process of XPROMPT is illustrated in Figure 3 and consists of three main stages: Prompt-Tuning, Hierarchical Structured Pruning, and Rewinding. Specifically, prompt tuning learns an initial set of values for all soft prompt tokens on the target task. During hierarchical structured pruning, token-level and piece-level pruning are then performed to separate the positive prompt tokens from the negative ones.
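The sketch below shows how the two pruning granularities could compose when applied to a trained soft prompt, assuming each token embedding splits into q equal-sized pieces. The mask shapes and the even split are illustrative assumptions based on the description above; how the masks themselves are chosen is specified by the paper's pruning procedure.

```python
import torch

def hierarchical_mask(prompt_embeds: torch.Tensor,
                      token_mask: torch.Tensor,
                      piece_mask: torch.Tensor) -> torch.Tensor:
    """Apply token-level and piece-level pruning masks to soft prompts.

    prompt_embeds: (m, e) soft prompt embeddings after prompt tuning.
    token_mask:    (m,)   0/1 flag per prompt token (token level).
    piece_mask:    (m, q) 0/1 flag per piece, each piece covering
                   e // q consecutive embedding dimensions (piece level).
    """
    m, e = prompt_embeds.shape
    q = piece_mask.size(1)
    assert e % q == 0, "assumes the embedding splits into q equal pieces"
    # Broadcast each piece flag across the dimensions it covers: (m, e).
    fine = piece_mask.repeat_interleave(e // q, dim=1)
    # A piece survives only if its whole token also survives.
    return prompt_embeds * token_mask.unsqueeze(1) * fine
```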