
Figure 2: Performance comparison of Prompt-Tuning, Negative Prompt Masking, and Random Prompt Masking with T5-XL (3B) on three SuperGLUE tasks. Prompt-Tuning uses all prompt tokens. Negative Prompt Masking masks selected (negative) prompt tokens with low importance scores. Random Prompt Masking randomly masks the same number of tokens as Negative Prompt Masking.
certain prompt tokens may even have a negative impact. Figure 2 provides a preliminary result supporting this observation. These negative prompt tokens can be circumvented under the regime of LTH. Essentially, LTH states that an over-parameterized network contains a sub-network that, when initialized and trained in isolation, can match or exceed the test accuracy of the original network after training for at most the same number of iterations. Such a sub-network is called a lottery ticket, and the collection of such tickets is referred to as the winning tickets in PLMs (Liang et al., 2021). In prompt-tuning, the winning tickets are the collection of positive prompt tokens that can achieve the same performance as the entire collection of prompt tokens, while the losing tickets are the collection of negative prompt tokens.
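As an illustration, identifying the winning tickets among prompt tokens amounts to keeping the tokens with the highest importance scores and masking the rest. The following is a minimal sketch; the scoring values and the keep ratio are hypothetical placeholders, not the paper's exact procedure:

```python
def select_winning_tickets(importance_scores, keep_ratio=0.5):
    """Return a 0/1 mask over prompt tokens: 1 = winning ticket (kept),
    0 = losing ticket (a negative prompt token to be masked)."""
    n = len(importance_scores)
    k = max(1, int(n * keep_ratio))
    # Indices of the k highest-scoring prompt tokens.
    winners = set(sorted(range(n), key=lambda i: importance_scores[i],
                         reverse=True)[:k])
    return [1 if i in winners else 0 for i in range(n)]

# Hypothetical per-token importance scores for a 6-token soft prompt.
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3]
print(select_winning_tickets(scores, keep_ratio=0.5))  # [1, 0, 1, 0, 1, 0]
```

The Negative Prompt Masking setting in Figure 2 corresponds to zeroing out the tokens where this mask is 0.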
Therefore, the key is to identify the winning tickets and eliminate the losing ones from the collection of trained prompt tokens. In particular, we propose to eliminate the losing tickets through hierarchical structured pruning, which first removes negative tokens at the token level and then prunes the remaining ones at a finer granularity, i.e., the piece level, for a better trade-off between effectiveness and efficiency. In line with LTH, weight rewinding (Renda et al., 2020) is adopted to retrain the identified positive soft prompts. With the negative prompt tokens eliminated, a more parameter-efficient PROMPT of an eXtremely small scale (XPROMPT) is obtained.
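The hierarchical structured pruning above can be sketched as two successive top-k selections, first over whole tokens and then over pieces within the surviving tokens. This is a toy illustration under assumed inputs (the scoring values, keep ratios, and zero-masking are illustrative, and the weight-rewinding retraining step is omitted):

```python
def topk_indices(scores, keep_ratio):
    """Indices of the highest-scoring entries to keep."""
    k = max(1, int(len(scores) * keep_ratio))
    return set(sorted(range(len(scores)), key=lambda i: scores[i],
                      reverse=True)[:k])

def hierarchical_prune(prompt, token_scores, piece_scores,
                       token_keep=0.5, piece_keep=0.5):
    """Token-level pruning first, then piece-level pruning inside the
    surviving tokens; losing entries are masked out (zeroed).
    `prompt` is a list of tokens, each token a list of embedding pieces."""
    kept_tokens = topk_indices(token_scores, token_keep)
    pruned = []
    for t, token in enumerate(prompt):
        if t not in kept_tokens:              # losing ticket: drop whole token
            pruned.append([0.0] * len(token))
            continue
        kept_pieces = topk_indices(piece_scores[t], piece_keep)
        pruned.append([v if p in kept_pieces else 0.0
                       for p, v in enumerate(token)])
    return pruned

# Two prompt tokens, each split into two embedding pieces.
prompt = [[1.0, 2.0], [3.0, 4.0]]
out = hierarchical_prune(prompt,
                         token_scores=[0.9, 0.1],          # token 1 loses
                         piece_scores=[[0.2, 0.8], [0.5, 0.5]])
print(out)  # [[0.0, 2.0], [0.0, 0.0]]
```

Pruning at both granularities lets the method discard a whole negative token cheaply while still recovering useful capacity inside tokens that are only partially negative.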
To verify the effectiveness of XPROMPT, we
conduct an extensive set of experiments on Super-
GLUE (Wang et al.,2019) in both high-resource
and low-resource scenarios. As shown in Figure 1
and Table 1, the results demonstrate that XPROMPT significantly improves over prompt-tuning across tasks and model scales. For models of moderate scale, XPROMPT closes the gap and achieves performance comparable to fine-tuning. For models of large scale, XPROMPT also yields large performance gains over Prompt-Tuning, and even exceeds fine-tuning on most tasks.
2 Related Work
2.1 Pre-trained Language Models
Pre-trained Language Models (PLMs) have
achieved remarkable success in various NLP tasks
(Zhou et al.,2020;Raffel et al.,2020;Brown et al.,
2020). BERT (Devlin et al.,2019) and RoBERTa
(Liu et al.,2019) are two pioneers that learn contex-
tual representations with masked language model
(MLM) and next sentence prediction pre-training
tasks. Recently, a series of large-scale PLMs have emerged with different pre-training designs, such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), ELECTRA (Clark et al., 2020), XLNet (Yang et al., 2019), BART (Lewis et al., 2020), and T5 (Raffel et al., 2020). However, with the exploding number of parameters, fine-tuning such models becomes parameter-inefficient and computationally expensive, since all parameters in the PLM must be updated and maintained. Moreover, one has to fine-tune a separate model for each task and store them all separately, which is resource-intensive.
2.2 Prompt Learning in NLP
With the development of GPT-3 (Brown et al., 2020), prompt learning has drawn much attention in the NLP community (Liu et al., 2021a; Ding et al., 2022); it enables efficient learning by adding a number of prompt tokens to the input.
Prompt learning has been proven to be effective
in various downstream tasks (Davison et al.,2019;
Gong and Eldardiry,2021;Radford et al.,2019;
Wang et al.,2021;Khashabi et al.,2020). Recently,
prompt has been extended from discrete tokens
(tokens in the vocabularies) to continuous tokens
(trainable embeddings), i.e., soft prompt (Li and
Liang,2021;Zhong et al.,2021;Qin and Eisner,
2021). For example, Lester et al. (2021) propose a parameter-efficient prompt tuning approach that tunes only the soft prompts while freezing all parameters of the PLM. Prompt tuning achieves great success and shows that it can match the performance of fine-tuning with large PLMs. However, there is