Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts
Xiangyang Liu1,2 Tianxiang Sun1,2 Xuanjing Huang1,2 Xipeng Qiu1,2
1School of Computer Science, Fudan University
2Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
{xiangyangliu20,txsun19,xjhuang,xpqiu}@fudan.edu.cn
Abstract
Prompt tuning is a parameter-efficient tuning (PETuning) method for utilizing pre-trained models (PTMs) that simply prepends a soft prompt to the input and only optimizes the prompt to adapt PTMs to downstream tasks. Although it is parameter- and deployment-efficient, its performance still lags behind other state-of-the-art PETuning methods. Besides, the training cost of prompt tuning is not significantly reduced due to the back-propagation through the entire model. Through empirical analyses, we shed some light on the lagging performance of prompt tuning and recognize a trade-off between the propagation distance from label signals to the inserted prompt and the influence of the prompt on model outputs. Further, we present Late Prompt Tuning (LPT), which inserts a late prompt into an intermediate layer of the PTM instead of the input layer or all layers. The late prompt is obtained by a neural prompt generator conditioned on the hidden states before the prompt insertion layer and is therefore instance-dependent. Through extensive experiments across various tasks and PTMs, we show that LPT can achieve competitive performance to full model tuning and other PETuning methods under both full-data and few-shot scenarios while possessing faster training speed and lower memory cost.
1 Introduction
Pre-trained models (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Raffel et al., 2020; Lewis et al., 2020; Liu et al., 2022a; Qiu et al., 2020; Lin et al., 2021) have advanced the state of the art on most NLP tasks. Model tuning (or fine-tuning) is a popular method for utilizing PTMs on downstream tasks that needs to tune all parameters of the PTM for every task. Despite its strong performance, it leads to prohibitive adaptation costs, especially for supersized PTMs (Brown et al., 2020; Wang et al., 2021a). Parameter-efficient tuning (PETuning) is a new tuning paradigm that can adapt PTMs to downstream tasks by only tuning a very small number of internal or additional parameters.

Figure 1: Overall comparison between LPT and baselines with only 100 training samples for each task. All methods are evaluated on 10 text classification tasks using RoBERTa-large. The radius of every circle indicates training speed (tokens per millisecond). LPT w/ NPG and LPT w/o PG represent LPT with a naive prompt generator and without a prompt generator, respectively. Details can be found in Section 5.
Prompt tuning (Lester et al., 2021) is a simple and popular PETuning method that prepends a sequence of soft prompt tokens to the input and only optimizes the prompt to adapt PTMs to downstream tasks. It has an absolute advantage in parameter efficiency and facilitates mixed-task inference, which makes the deployment of PTMs convenient. However, compared with other advanced PETuning methods, e.g., Adapter (Houlsby et al., 2019; Mahabadi et al., 2021), LoRA (Hu et al., 2022), and BitFit (Zaken et al., 2022), prompt tuning suffers from lower performance and a slower convergence rate. Compared with full model tuning, although the number of trainable parameters in prompt tuning is reduced by 17,000× (from 355M to 21K on RoBERTa-large), the training speed only increases by 1.5×, and the memory cost only reduces by 29.8% (see Section 6.5 for details).
P-tuning v2 (Liu et al., 2022b) improves the performance of prompt tuning by inserting soft prompts into every hidden layer of PTMs, but it is difficult to optimize and needs more training steps to attain competitive performance.
In this paper, we explore why prompt tuning performs poorly and find that there is a trade-off between the propagation distance from label signals to the inserted prompt and the influence of the prompt on model outputs. The key to prompt tuning is to make the soft prompt carry task-related information through downstream training. The trained prompt can then interact with text inputs during the model forward pass to obtain text representations with task-related information. Since the prompt is inserted into the input in prompt tuning, it has a strong ability to influence the outputs of the PTM through sufficient interactions with text inputs. However, there is a long propagation path from label signals to the prompt. This leads us to ask: does this long propagation path cause a lot of task-related information to be lost during propagation and thus hurt performance? To verify the impact of the propagation distance on performance, we conduct pilot experiments in Section 4 that shorten it and find that performance first increases and then decreases as the propagation distance is shortened. This finding inspires us to present the late prompt, i.e., inserting the prompt into an intermediate hidden layer of the PTM. The late prompt not only receives more task-related information at each update, due to the shorter propagation path, but also maintains an adequate ability to influence the outputs of the PTM. Despite the higher performance and faster convergence rate of the late prompt compared with prompt tuning, the hidden states produced by the PTM before the prompt insertion layer are underutilized. To further improve performance and take full advantage of these contextual hidden representations, we introduce a prompt generator that generates the soft prompt (termed the instance-aware prompt) for each instance using the corresponding hidden states.
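For concreteness, one plausible way to instantiate such a prompt generator is sketched below: mean-pool the hidden states produced below the prompt layer and map the result to a fixed number of soft prompt vectors through a small bottleneck MLP. The architecture, the class name, and the sizes here are assumptions for illustration; the generator actually used by LPT is described in Section 5.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Illustrative instance-aware prompt generator (an assumption, not the
    paper's exact architecture): mean-pool the hidden states produced below
    the prompt layer, then map them to `prompt_len` soft prompt vectors
    through a small bottleneck MLP."""

    def __init__(self, hidden_size: int, prompt_len: int, bottleneck: int = 64):
        super().__init__()
        self.prompt_len = prompt_len
        self.hidden_size = hidden_size
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, prompt_len * hidden_size)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        prompt = self.up(torch.tanh(self.down(pooled)))            # (batch, prompt_len * hidden)
        return prompt.view(-1, self.prompt_len, self.hidden_size)
```

Because the generated prompt is a function of each input's own hidden states, it is instance-dependent, and the generator is the only module that needs to be trained.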
Based on the late and instance-aware prompt, we present Late Prompt Tuning (LPT) to improve prompt tuning. Since the soft prompt is inserted into an intermediate layer of the PTM, we do not need to compute gradients for the model parameters below the prompt insertion layer, which speeds up the training process and reduces memory cost. Extensive experimental results show that LPT outperforms most prompt-based tuning methods and can be comparable with adapter-based tuning methods and even full model tuning. Especially in the few-shot scenario with only 100 training samples, LPT outperforms prompt tuning by 12.4 points and model tuning by 5.0 points in average performance across ten text classification tasks. Besides, compared with model tuning on RoBERTa-large, LPT is 2.0× faster in training speed and reduces memory cost by 56.6%. Figure 1 shows an overall comparison between LPT and its counterparts. To sum up, the key contributions of this paper are:
• We explore why prompt tuning performs poorly, find that it is due to the long propagation path from label signals to the input prompt, and present a simple variant named late prompt tuning to address the issue.

• Combining the late and instance-aware prompts, we present LPT, which not only attains comparable performance with adapter-based tuning methods and even model tuning but also greatly reduces training costs.

• We verify the versatility of LPT in the full-data and few-shot scenarios across 10 text classification tasks and 3 PTMs. Code is publicly available at https://github.com/xyltt/LPT.
2 Related Work
Adapter-based tuning. One research line of PETuning is adapter-based tuning (Ding et al., 2022), which inserts adapter modules between model layers and optimizes only these adapters during downstream training for model adaptation. Adapter (Houlsby et al., 2019) inserts adapter modules with a bottleneck architecture between every two consecutive Transformer (Vaswani et al., 2017) sub-layers. AdapterDrop (Rücklé et al., 2021) investigates efficiency by removing adapters from lower layers. Compacter (Mahabadi et al., 2021) uses low-rank optimization and parameterized hypercomplex multiplication (Zhang et al., 2021) to compress adapters. Adapter-based tuning methods achieve comparable results with model tuning when training data is sufficient but do not work well in the few-shot scenario (Wang et al., 2022).
Prompt-based tuning. Another main research line of PETuning is prompt-based tuning, which inserts additional soft prompts into the hidden states instead of injecting new neural modules into PTMs. Prompt tuning (Lester et al., 2021) and P-tuning (Liu et al., 2021) insert a soft prompt into the word embeddings only and can achieve competitive results when applied to supersized PTMs. Prefix-tuning (Li and Liang, 2021) and P-tuning v2 (Liu et al., 2022b) insert prompts into every hidden layer of the PTM. BBT (Sun et al., 2022b) optimizes the inserted prompt with derivative-free optimization. Some prompt-based tuning methods, like prompt tuning and BBT, formulate downstream tasks as pre-training tasks (e.g., masked language modeling) to close the gap between pre-training and downstream training (Sun et al., 2022a). There are also some prompt-based methods with instance-aware prompts. IDPG (Wu et al., 2022) uses a prompt generator with parameterized hypercomplex multiplication (Zhang et al., 2021) to generate a soft prompt for every instance. Context-tuning (Tang et al., 2022) uses a BERT model (Devlin et al., 2019) as the prompt generator and focuses on NLG tasks. IPL (Jin et al., 2022) first calculates relevance scores between prompt tokens and inputs, then uses the scores to re-weight the original prompt tokens, but it tunes all parameters of the PTM. All of the above instance-aware prompt methods share the same weakness: they need to encode the inputs with an extra encoder, which slows down training and increases inference latency.
There are also some other popular PETuning methods, such as BitFit (Zaken et al., 2022), which only tunes the bias terms, and LoRA (Hu et al., 2022), which optimizes low-rank decomposition matrices of the weights within self-attention layers.
3 Problem Formulation
Given a PTM M, in the setting of model tuning, we first reformulate single-sentence inputs as E([CLS] ⟨S1⟩ [SEP]) and sentence-pair inputs as E([CLS] ⟨S1⟩ [SEP] ⟨S2⟩ [SEP]), where E is the embedding layer of M. The final hidden state of the [CLS] token is used to predict the label. In the setting of prompt tuning, we insert a randomly initialized soft prompt p into the word embeddings and also modify the original inputs using different manual templates with a [MASK] token for different tasks. For example, a single-sentence input from a sentiment analysis task will be transformed into concat(p, E([CLS] ⟨S1⟩ It was [MASK]. [SEP])). Then, we map the original labels Y to some words in the vocabulary V of M, which formulates downstream tasks as a language modeling task to close the gap between pre-training and downstream training. The final hidden state of the [MASK] token is used to predict the label.
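As a concrete illustration of this setup, the sketch below prepends a trainable soft prompt to the word embeddings and reads the prediction off the [MASK] position using Hugging Face Transformers; the sentiment template ("It was [MASK].") follows the example above, while the verbalizer words and the prompt length are illustrative assumptions rather than the paper's configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
model.requires_grad_(False)                          # the PTM stays frozen

prompt_len, hidden = 20, model.config.hidden_size
soft_prompt = torch.nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)  # p: the only trainable parameters

sentence = "The movie was a pleasant surprise from start to finish."
text = f"{sentence} It was {tokenizer.mask_token}."  # manual template with a [MASK] slot
enc = tokenizer(text, return_tensors="pt")

# concat(p, E([CLS] <S1> It was [MASK]. [SEP])): prepend the soft prompt to the word embeddings.
word_embeds = model.get_input_embeddings()(enc.input_ids)                 # (1, L, hidden)
inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), word_embeds], dim=1)
attention_mask = torch.cat(
    [torch.ones(1, prompt_len, dtype=enc.attention_mask.dtype), enc.attention_mask], dim=1
)
logits = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask).logits

# Verbalizer: map the labels Y to words in the vocabulary V and score them at the [MASK] position.
verbalizer = {"positive": " great", "negative": " terrible"}              # illustrative label words
mask_pos = prompt_len + (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
scores = {
    label: logits[0, mask_pos, tokenizer.encode(word, add_special_tokens=False)[0]].item()
    for label, word in verbalizer.items()
}
print(max(scores, key=scores.get))                   # predicted label for this input
```

During downstream training, only soft_prompt would be updated (e.g., by a cross-entropy loss over the verbalizer words at the [MASK] position); the PTM and its MLM head remain frozen.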
In the setting of our proposed method LPT, we use a prompt generator (PG) to generate an independent prompt p for every input. In addition, the layer that the prompt is inserted into is an intermediate layer of the PTM rather than the word embeddings, and we refer to this layer as the prompt layer (PL).
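The following self-contained sketch walks through this forward pass with Hugging Face Transformers: the layers below the prompt layer run without gradients, a prompt generator conditioned on their hidden states produces the late prompt, and the remaining layers process the prompt together with the input. The prompt layer index, the single-linear-layer generator, and the mean-pooling step are assumptions made for illustration, not the configuration used in the paper (which is detailed in Section 5).

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
model.requires_grad_(False)                                   # the PTM stays frozen

prompt_len, hidden = 20, model.config.hidden_size
prompt_layer = 13                                             # PL: an intermediate layer (illustrative)
generator = nn.Linear(hidden, prompt_len * hidden)            # PG: the only trainable module here

text = f"The movie was a pleasant surprise. It was {tokenizer.mask_token}."
enc = tokenizer(text, return_tensors="pt")
ext_mask = model.get_extended_attention_mask(enc.attention_mask, enc.input_ids.shape)

# 1) Run the layers below the prompt layer without gradients; they are frozen
#    and see no prompt, so no activations need to be kept for backpropagation.
with torch.no_grad():
    h = model.roberta.embeddings(input_ids=enc.input_ids)
    for layer in model.roberta.encoder.layer[:prompt_layer]:
        h = layer(h, attention_mask=ext_mask)[0]

# 2) Generate the instance-aware prompt p = PG(h) from the hidden states at PL
#    (mean pooling + a linear map here) and insert it in front of them.
mask = enc.attention_mask.unsqueeze(-1).to(h.dtype)
pooled = (h * mask).sum(dim=1) / mask.sum(dim=1)
p = generator(pooled).view(-1, prompt_len, hidden)
h = torch.cat([p, h], dim=1)
full_mask = torch.cat([torch.ones(1, prompt_len, dtype=enc.attention_mask.dtype),
                       enc.attention_mask], dim=1)
ext_mask = model.get_extended_attention_mask(full_mask, h.shape[:2])

# 3) Run the remaining layers; gradients flow through them only into the generator.
for layer in model.roberta.encoder.layer[prompt_layer:]:
    h = layer(h, attention_mask=ext_mask)[0]

# 4) Predict the label word at the [MASK] position with the frozen MLM head.
mask_pos = prompt_len + (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
logits = model.lm_head(h)[0, mask_pos]
```

Because the lower layers run under torch.no_grad(), neither their gradients nor the activations needed for backpropagation through them are kept, which is where the training-speed and memory savings described above come from.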
4 Why Does Prompt Tuning Perform Poorly?
The workflow of prompt tuning is to make the inserted soft prompt carry task-related information through downstream training. In the inference phase, this prompt can interact with test inputs during layer-upon-layer propagation so that the hidden representations of these inputs also contain task-related information. There are strong interactions between the prompt and text inputs because prompt tuning inserts the prompt into the word embeddings. However, there is a long propagation path from label signals to the prompt. Therefore, we speculate that the poor performance of prompt tuning is due to this long propagation path, which causes a lot of task-related information to be lost during propagation in the frozen model and thus hurts performance. To verify this conjecture, we conduct pilot experiments on the TREC (Voorhees and Tice, 2000) and RTE (Dagan et al., 2005) datasets using RoBERTa-large (Liu et al., 2019).
Does shortening the propagation distance improve performance? We start with a simple experimental setting where the soft prompt is inserted into different layers of RoBERTa-large, and we look at how performance changes as the prompt layer changes. As shown in the left plots of Figure 2, the performance first increases and then decreases as the prompt layer rises, and the highest performance is obtained when the prompt layer is in the range of 12 to 14. In addition, we also explore the convergence rates at different prompt layers. For simplicity, we only consider three prompt layers: 1, 13, and 24. The middle plots in Figure 2 show that the model has the fastest convergence rate when the prompt layer is 13. This trend is consistent with the performance trend shown in the left plots.