Prompt-based tuning. Another main research line of PETuning is prompt-based tuning, which inserts additional soft prompts into the hidden states instead of injecting new neural modules into PTMs. Prompt tuning (Lester et al., 2021) and P-tuning (Liu et al., 2021) insert a soft prompt into the word embeddings only, and can achieve competitive results when applied to supersized PTMs. Prefix-tuning (Li and Liang, 2021) and P-tuning v2 (Liu et al., 2022b) insert prompts into every hidden layer of the PTM. BBT (Sun et al., 2022b) optimizes the inserted prompt with derivative-free optimization.
Some prompt-based tuning methods, such as prompt tuning and BBT, formulate downstream tasks as pre-training tasks (e.g., masked language modeling) to close the gap between pre-training and downstream training (Sun et al., 2022a). There are also prompt-based methods with instance-aware prompts. IDPG (Wu et al., 2022) uses a prompt generator with parameterized hypercomplex multiplication (Zhang et al., 2021) to generate a soft prompt for every instance. Context-tuning (Tang et al., 2022) uses a BERT model (Devlin et al., 2019) as the prompt generator and focuses on NLG tasks. IPL (Jin et al., 2022) first calculates relevance scores between prompt tokens and inputs, then uses the scores to re-weight the original prompt tokens; however, it tunes all parameters of the PTM. All of the above instance-aware prompt methods share the same weakness: they must encode the inputs with an extra encoder, which slows down training and increases inference latency.
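To make the shared idea concrete, the following is a minimal sketch of an instance-aware prompt generator that maps a pooled input representation to a per-instance soft prompt. The bottleneck design and all names here are illustrative assumptions; IDPG itself uses parameterized hypercomplex multiplication layers rather than the plain linear bottleneck shown.

```python
import torch
import torch.nn as nn

class InstanceAwarePromptGenerator(nn.Module):
    """Hypothetical sketch: pooled input representation -> per-instance
    soft prompt. IDPG uses PHM layers instead of this plain bottleneck."""

    def __init__(self, hidden_size: int, prompt_len: int, bottleneck: int = 64):
        super().__init__()
        self.prompt_len = prompt_len
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, prompt_len * hidden_size)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden) -> soft prompt: (batch, prompt_len, hidden)
        out = self.up(torch.tanh(self.down(pooled)))
        return out.view(-1, self.prompt_len, pooled.size(-1))
```

The extra forward pass needed to produce `pooled` for every input is exactly the overhead criticized above.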
There are also other popular PETuning methods, such as BitFit (Zaken et al., 2022), which only tunes the bias terms, and LoRA (Hu et al., 2022), which optimizes low-rank decomposition matrices of the weights within self-attention layers.
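As a rough illustration of the LoRA idea applied to a single linear projection, the sketch below keeps the pre-trained weight frozen and trains only the low-rank factors; the rank and scaling values are illustrative, not LoRA's prescribed defaults.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-augmented linear layer: the frozen weight W is
    adapted only through the trainable low-rank product B @ A."""

    def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)  # frozen pre-trained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(out_dim, r))        # trainable, zero-init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha / r) * B A x; the update starts at zero because B is zero-init.
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```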
3 Problem Formulation
Given a PTM $\mathcal{M}$, in the setting of model tuning, we first reformulate single-sentence inputs as $E(\texttt{[CLS]}\,\langle S_1\rangle\,\texttt{[SEP]})$ and sentence-pair inputs as $E(\texttt{[CLS]}\,\langle S_1\rangle\,\texttt{[SEP]}\,\langle S_2\rangle\,\texttt{[SEP]})$, where $E$ is the embedding layer of $\mathcal{M}$. The final hidden state of the [CLS] token is used to predict the label. In the setting of prompt tuning, we insert a randomly initialized soft prompt $p$ into the word embeddings, and also modify the original inputs using different manual templates with a [MASK] token for different tasks. For example, a single-sentence input from a sentiment analysis task is transformed into $\mathrm{concat}(p,\, E(\texttt{[CLS]}\,\langle S_1\rangle\,\text{It was}\,\texttt{[MASK]}.\,\texttt{[SEP]}))$. Then, we map the original labels $\mathcal{Y}$ to words in the vocabulary $\mathcal{V}$ of $\mathcal{M}$, which formulates downstream tasks as a language modeling task to close the gap between pre-training and downstream training. The final hidden state of the [MASK] token is used to predict the label.
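The following sketch makes this input construction concrete for a single sentiment-analysis instance. The vocabulary size, hidden width, prompt length, and verbalizer token ids are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes; E stands in for the frozen embedding layer of M.
vocab_size, hidden, prompt_len = 50265, 1024, 20
E = nn.Embedding(vocab_size, hidden)
p = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)  # trainable soft prompt

def build_prompt_tuning_input(token_ids: torch.Tensor) -> torch.Tensor:
    """concat(p, E([CLS] <S1> It was [MASK]. [SEP])) for one instance.
    `token_ids` is assumed to already contain the template tokens
    ([CLS], "It was", [MASK], [SEP]) around the sentence S1."""
    word_embeds = E(token_ids)                 # (seq_len, hidden)
    return torch.cat([p, word_embeds], dim=0)  # (prompt_len + seq_len, hidden)

# Verbalizer: map the original labels Y to words in the vocabulary V
# (these token ids are made up for illustration). The MLM head's logits
# at the [MASK] position are then compared over these ids only.
verbalizer = {"positive": 372, "negative": 6587}
```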
In the setting of our proposed method LPT, we use a prompt generator (PG) to generate an independent prompt $p$ for every input. In addition, the layer that the prompt is inserted into is an intermediate layer of the PTM instead of the word embeddings; we refer to this layer as the prompt layer (PL).
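A minimal sketch of this late insertion follows, assuming the PG consumes a mean-pooled view of the hidden states entering the PL; the stand-in Transformer blocks and the trivial generator are for illustration only, and the paper's actual PG design is described later.

```python
import torch
import torch.nn as nn

def forward_with_late_prompt(layers, embeds, prompt_generator, prompt_layer):
    """Run frozen Transformer blocks, splicing a generated prompt in at
    the intermediate prompt layer (PL) instead of at the embeddings.
    Attention-mask bookkeeping is omitted for brevity."""
    h = embeds                                   # (batch, seq, hidden)
    for i, layer in enumerate(layers):
        if i == prompt_layer:
            # Generate an instance-dependent prompt from the hidden
            # states entering the PL (mean pooling is an assumption),
            # then prepend it to the sequence.
            p = prompt_generator(h.mean(dim=1))  # (batch, prompt_len, hidden)
            h = torch.cat([p, h], dim=1)
        h = layer(h)
    return h

# Toy usage: stand-in blocks mapping (batch, seq, hidden) -> same shape.
layers = [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
          for _ in range(6)]
gen = lambda pooled: pooled.unsqueeze(1).repeat(1, 4, 1)  # trivial stand-in PG
out = forward_with_late_prompt(layers, torch.randn(2, 10, 64), gen, prompt_layer=3)
```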
4 Why Does Prompt Tuning Perform Poorly?
The workflow of prompt tuning is to make the inserted soft prompt carry task-related information through downstream training. In the inference phase, this prompt can interact with test inputs during layer-upon-layer propagation so that the hidden representations of these inputs also contain task-related information. There are strong interactions between the prompt and the text inputs because prompt tuning inserts the prompt into the word embeddings. However, there is a long propagation path from the label signals to the prompt. Therefore, we speculate that the poor performance of prompt tuning is due to this long propagation path of task-related information, which causes much of the task-related information to be lost during propagation through the frozen model and thus hurts performance. To verify this conjecture, we conduct pilot experiments on the TREC (Voorhees and Tice, 2000) and RTE (Dagan et al., 2005) datasets using RoBERTa$_{\text{LARGE}}$ (Liu et al., 2019).
Does shortening the propagation distance improve performance? We start with a simple experimental setting where the soft prompt is inserted into different layers of RoBERTa$_{\text{LARGE}}$, and we observe how performance changes as the prompt layer changes. As shown in the left plots of Figure 2, performance first increases and then decreases as the prompt layer rises, and is highest when the prompt layer is in the range of 12 to 14. In addition, we also explore the convergence rates at different prompt layers. For simplicity, we only consider three prompt layers: 1, 13, and 24. The middle plots in Figure 2 show that the model converges fastest when the prompt layer is 13. This trend is consistent with the performance trend shown in the left plots. We can