Continued Pretraining for Better Zero- and Few-Shot Promptability

Zhaofeng Wu¹  Robert L. Logan IV²  Pete Walsh³  Akshita Bhagia³  Dirk Groeneveld³  Sameer Singh³⁴  Iz Beltagy³

¹MIT  ²Dataminr Inc.  ³Allen Institute for Artificial Intelligence  ⁴University of California, Irvine

zfw@csail.mit.edu  rlogan@dataminr.com  {petew,akshitab,dirkg,beltagy}@allenai.org  sameer@uci.edu
Abstract

Recently introduced language model prompting methods can achieve high accuracy in zero- and few-shot settings while requiring few to no learned task-specific parameters. Nevertheless, these methods still often trail behind full model finetuning. In this work, we investigate if a dedicated continued pretraining stage could improve “promptability”, i.e., zero-shot performance with natural language prompts or few-shot performance with prompt tuning. We reveal settings where existing continued pretraining methods lack promptability. We also identify current methodological gaps, which we fill with thorough large-scale experiments. We demonstrate that a simple recipe, continued pretraining that incorporates a trainable prompt during multi-task learning, leads to improved promptability in both zero- and few-shot settings compared to existing methods, up to 31% relative. On the other hand, we find that continued pretraining using MAML-style meta-learning, a method that directly optimizes few-shot promptability, yields subpar performance. We validate our findings with two prompt tuning methods, and, based on our results, we provide concrete recommendations to optimize promptability for different use cases.
1 Introduction

Conditioning language models (LMs) on manually-written or learned continuous prompts allows them to solve tasks with high accuracy and minimal parameter overhead (Brown et al., 2020; Li and Liang, 2021; Lester et al., 2021, i.a.). However, prompting performance often still lags behind traditional full finetuning. Natural language prompts usually underperform trained models even when manually curated (Brown et al., 2020; Sanh et al., 2022). Similarly, while learned prompts yield higher accuracy, they do not work as well when the training data is scarce (Gu et al., 2022), when the model is small or moderately sized (Lester et al., 2021), and when the tasks are difficult (He et al., 2022).

This work was done when Zhaofeng Wu was at AI2, and Robert Logan was at UCI. We release our code and models at https://github.com/allenai/better-promptability.
To reduce the gap between prompt and full model tuning, past work has shown that continued pretraining on data that resembles the downstream prompting setup induces better “promptability”, i.e., zero-shot performance with natural language (NL) prompts and few-shot performance of prompt tuning (Sanh et al., 2022; Gu et al., 2022). However, in this paper, we identify several shortcomings of these methods. First, continued pretraining on NL prompts (Sanh et al., 2022) sometimes causes performance degradation with prompt tuning. Second, continued pretraining approaches that learn only a universal prompt initialization (Gu et al., 2022; Vu et al., 2022) bring only marginal improvement on the P3 datasets (Bach et al., 2022).
To further improve zero-shot and few-shot promptability, we investigate gaps in existing methods with different parameter configurations and training procedures. First, we explore the effect of incorporating a learned continuous prompt into multi-task learning (MTL), and find it to significantly improve zero- and few-shot promptability across the board. In addition, we explore MAML-style meta-learning (Finn et al., 2017; Nichol et al., 2018) as an alternative to the standard continued pretraining paradigm, but find that it underperforms simple MTL, despite its previous success on few-shot learning tasks (Li et al., 2017; Gu et al., 2018; Qian and Yu, 2019, i.a.). We perform an analysis of this phenomenon and present several explanations.
Through large-scale experiments, each involving continued pretraining on over 9B tokens (§A), we make several contributions: (1) we thoroughly evaluate continued pretraining methods, both existing and our proposed ones, in many setups; (2) we demonstrate that a simple continued pretraining recipe improves over existing methods by up to 31%; (3) we show that MAML-style meta-learning underperforms multi-task learning and provide explanations; (4) we provide concrete recommendations to improve promptability in various use cases.
2 Prompting

We review two types of prompting that we use: natural language (NL) prompting and prompt tuning.

Traditionally, NLP tasks are solved by task-specific models that predict label $y \in \mathcal{Y}$ from input $x \in \mathcal{X}$. We can consider LMs as functions that score any source and target text pair, $\mathrm{LM}: \mathcal{V}^* \times \mathcal{V}^* \to \mathbb{R}$ with vocabulary $\mathcal{V}$.¹ Past work found that large LMs can be repurposed to solve many tasks by casting $x, y$ into a text format using a template function $f: \mathcal{X} \cup \mathcal{Y} \to \mathcal{V}^*$ and taking as prediction $\arg\max_{y' \in \mathcal{Y}} \mathrm{LM}(f(x), f(y'))$.

¹We focus on encoder-decoder LMs based on T5 (Raffel et al., 2020). Past work considers them to work better than decoder-only LMs for prompting (Sanh et al., 2022).
NL prompts, or instructions, are manually constructed $f(\cdot)$. Without task-specific training, they have been successfully used to elicit predictions from LMs to perform tasks with high accuracy (Brown et al., 2020; Logan IV et al., 2022).
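As a concrete illustration of this scoring procedure, the sketch below templatizes an input, scores each verbalized answer choice with an LM-adapted T5 checkpoint, and predicts the argmax. The template, example, and checkpoint name are our own illustrative choices, not the paper's P3 templates.

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# LM-adapted T5 (Lester et al., 2021); any seq2seq LM with this interface works.
tokenizer = AutoTokenizer.from_pretrained("google/t5-large-lm-adapt")
model = T5ForConditionalGeneration.from_pretrained("google/t5-large-lm-adapt")
model.eval()

def score(source: str, target: str) -> float:
    """Log-probability the LM assigns to `target` given `source` (higher is better)."""
    enc = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**enc, labels=labels)
    # `out.loss` is the mean per-token negative log-likelihood of `labels`.
    return -out.loss.item() * labels.shape[1]

# f(x): a hand-written NL template turning the raw input into source text.
x = "The movie was a complete waste of two hours."
source = f"Review: {x}\nIs this review positive or negative?"
choices = ["positive", "negative"]  # f(y) for each y in Y

prediction = max(choices, key=lambda y: score(source, y))
print(prediction)  # expected: "negative"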
Sharing the motivation, prompt tuning learns a continuous prompt to condition the model. It takes the source text embedded by the LM input embeddings, $s \in \mathbb{R}^{N \times d}$ with length $N$ and dimension $d$, and prepends learnable embeddings $E \in \mathbb{R}^{L \times d}$, where $L$ is a hyperparameter, to obtain a new $(L+N)$-length embedded sequence. We consider hybrid prompt tuning, where $s$ is the embedding of the templatized $f(x)$, i.e., prompt tuning is always performed in addition to NL templates. This has been widely adopted due to demonstrated better performance (Gu et al., 2022; Min et al., 2022). We also study a variant of prompt tuning, sometimes called prefix tuning (Li and Liang, 2021), where the learnable vectors are added not only to the input but to all transformer layers. See Lester et al. (2021) and Li and Liang (2021) for more details on these methods. Following the terminology of Liu et al. (2022b), we refer to the input-level method as shallow prompt tuning and the layer-specific method as deep prompt tuning.
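The following PyTorch sketch shows shallow prompt tuning under these definitions: it freezes the LM, learns $E \in \mathbb{R}^{L \times d}$, and prepends $E$ to the embedded source $s$ before each forward pass. The wrapper class, its initialization scheme, and the hyperparameter values are our assumptions for illustration, not the released implementation; deep prompt tuning would instead inject learnable vectors into every transformer layer.

import torch
import torch.nn as nn

class ShallowPromptTuner(nn.Module):
    """Wraps a HuggingFace-style encoder-decoder LM; only `self.prompt` is trained."""

    def __init__(self, model, prompt_length: int = 100):
        super().__init__()
        self.model = model
        for p in self.model.parameters():  # freeze the LM; only E is learnable
            p.requires_grad = False
        # E in R^{L x d}, initialized from random vocabulary embeddings
        # (a common choice; Lester et al., 2021).
        vocab = model.get_input_embeddings().weight
        idx = torch.randint(0, vocab.shape[0], (prompt_length,))
        self.prompt = nn.Parameter(vocab[idx].detach().clone())

    def forward(self, input_ids, attention_mask, labels):
        # s in R^{B x N x d}: embeddings of the templatized source f(x).
        s = self.model.get_input_embeddings()(input_ids)
        e = self.prompt.unsqueeze(0).expand(s.shape[0], -1, -1)
        inputs_embeds = torch.cat([e, s], dim=1)  # (L + N)-length sequence
        prompt_mask = torch.ones(
            s.shape[0], self.prompt.shape[0],
            dtype=attention_mask.dtype, device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.model(inputs_embeds=inputs_embeds,
                          attention_mask=attention_mask, labels=labels)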
3 Improving Promptability

In this section, we describe existing methods to improve promptability and a new paradigm that combines their advantages.

While prompt tuning sometimes performs close to full model finetuning (Lester et al., 2021; Liu et al., 2022b), there is often still a substantial gap, such as with limited training data (Gu et al., 2022), non-gigantic models (Lester et al., 2021), or challenging tasks (He et al., 2022). We therefore study ways to improve LMs’ “promptability”. We focus on a low resource setup and consider zero-shot NL prompts and few-shot learned prompts (which, again, are in conjunction with NL prompts; §2). For the former, better promptability increases the performance when LMs face textual prompts of new tasks. For the latter, it more effectively leverages limited training examples for higher accuracy.
We investigate if promptability can improve with a continued pretraining stage after LM pretraining (or LM adaptation for LM-adapted T5 (Lester et al., 2021)) and before task-specific finetuning. The model is trained on a collection of tasks that have NL prompts and evaluated on unseen tasks. The methods that we explore below differ in how the continued pretraining stage is performed. We use the notation MTL-T_P_ to abbreviate those methods that are based on multi-task learning, where the blanks _ specify different configurations of the transformer (T) and the prompt (P) components during MTL. Architecturally, a method may continue to pretrain only the T5 model without prompt parameters, in which case we use P✗ to denote the lack of them; otherwise, both transformer and prompt parameters exist during MTL. We use 🔥 and ❄ to denote if the corresponding component is trained or frozen in MTL, respectively. This notation describes the continued pretraining stage only: in the final finetuning stage, all methods include both the transformer and prompt components, but only the latter is updated.
Continued pretraining has been studied in limited settings. Sanh et al. (2022) proposed T0 by multi-task training a T5 model (Raffel et al., 2020) as continued pretraining. They updated T5 parameters through learning on continued pretraining tasks, not including a prompt component, and showed that this training improves zero-shot NL promptability. Following our nomenclature, we refer to this paradigm as MTL-T🔥P✗. Additionally, Gu et al. (2022) employed a similar stage, incorporating and multi-task training a shallow prompt as continued pretraining, while freezing the transformer parameters in this stage. They showed that this strategy helps few-shot promptability during finetuning. We refer to this paradigm as MTL-T❄P🔥.
In this work, we study the gains of the previous two continued pretraining approaches, as well as a model that synthesizes them, MTL-T🔥P🔥, which we are the first to propose. For few-shot downstream tuning, the learned prompt can act as a good initialization compared to MTL-T🔥P✗. In the zero-shot setup, prior work has discovered that including certain text in a prompt, such as “Let’s think step by step,” can adjust the reasoning of LMs to yield substantially improved performance across tasks (Kojima et al., 2022; Askell et al., 2021). The learned prompt here could function analogously. Compared to MTL-T❄P🔥, on the other hand, the additional capacity brought by more updatable parameters could further boost model performance.
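A minimal sketch of this MTL-T🔥P🔥 recipe is given below, assuming a model object that already carries a shallow prompt (e.g., the wrapper sketched in §2) and returns a HuggingFace-style loss; the data loader stands in for a P3-like multi-task mixture of templatized examples, and the learning rate is illustrative (see §C). It differs from MTL-T❄P🔥 only in that the transformer parameters also receive gradients.

from torch.optim import AdamW

def continued_pretrain(model_with_prompt, prompted_multitask_loader,
                       lr=1e-4, epochs=1):
    # MTL-T🔥P🔥: both the transformer and the prompt are updated during MTL.
    for p in model_with_prompt.parameters():
        p.requires_grad = True
    optimizer = AdamW(model_with_prompt.parameters(), lr=lr)
    model_with_prompt.train()
    for _ in range(epochs):  # the paper's continued pretraining runs one epoch
        for batch in prompted_multitask_loader:  # examples already templatized
            loss = model_with_prompt(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()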
MAML-style meta-learning (Finn et al., 2017) directly optimizes for the downstream updates and can outperform MTL for full model finetuning (Dou et al., 2019; Bansal et al., 2020a). Yet, it similarly remains unexplored for prompting. We examine first-order MAML (FOMAML; Finn et al., 2017), performing $T$ steps of prompt tuning in the inner loop and updating all parameters in the outer loop. We also evaluate a version of Reptile (Nichol et al., 2018) adapted for our setting that performs $T$ steps of prompt tuning followed by one step of full model tuning, and use the resulting Reptile gradient for model updates. They have the same architecture as MTL-T🔥P🔥 and all parameters are trainable too. We provide a detailed description and theoretical discussion of these processes in §B. See the original papers for more details.
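To make the FOMAML variant concrete, the schematic below (our simplification, not the released implementation) runs $T$ inner steps of prompt-only tuning on a support batch, then takes the query-batch gradient at the adapted parameters and copies it onto the original parameters for the outer update; first-order means no gradients flow through the inner-loop updates. It assumes all parameters are trainable and that the model exposes a `prompt` parameter as in the earlier sketch.

import copy
from torch.optim import SGD

def fomaml_step(model, support_batch, query_batch,
                inner_steps=3, inner_lr=1e-3):
    # Inner loop on a copy so prompt-tuning updates do not overwrite the
    # meta-parameters directly (a deepcopy per step is wasteful but simple).
    fast = copy.deepcopy(model)
    inner_opt = SGD([fast.prompt], lr=inner_lr)  # T steps of prompt tuning only
    for _ in range(inner_steps):
        fast(**support_batch).loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Outer loop: gradient of the query loss at the adapted parameters,
    # copied back onto the original model (all parameters are updated).
    fast.zero_grad()
    query_loss = fast(**query_batch).loss
    query_loss.backward()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = fp.grad.clone() if fp.grad is not None else None
    return query_loss.item()  # caller then runs its meta-optimizer step

The Reptile variant described above would instead take $T$ prompt-tuning steps plus one full-model step on the copy and use the resulting parameter difference as the update direction; §B gives the exact formulation.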
4 Experimental Setup

We use P3, a collection of NL-templatized examples for a variety of datasets, for training and evaluation using the standard splits in Sanh et al. (2022). Not only is there no dataset overlap between training and evaluation, but no task overlap either (e.g., sentiment vs. QA), making it challenging. We report dataset statistics in §A. We perform continued pretraining for one epoch over all training datasets. Each dataset has multiple templates, each evaluated with accuracy. As different datasets have different numbers of answer choices and hence different baseline accuracy, we report Average Relative Gain (ARG; Ye et al., 2021) as a single summary metric by averaging across all templates the relative accuracy improvement over a random baseline. We perform significance testing using bootstrap with 1,000 iterations, in each iteration randomly sampling evaluation examples and comparing the two models in question. §D reports per-dataset results.
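Both evaluation procedures can be written down compactly; the sketch below is our reading of the described metric and test, not the paper's evaluation code. ARG averages, over templates, each template's relative accuracy gain over its random baseline (one over the number of answer choices); the bootstrap repeatedly resamples evaluation examples and compares the two models on each resample.

import random
from statistics import mean

def average_relative_gain(template_accuracies, random_baselines):
    # One entry per evaluation template; baseline = 1 / (#answer choices).
    gains = [(acc - base) / base
             for acc, base in zip(template_accuracies, random_baselines)]
    return mean(gains)

def bootstrap_compare(correct_a, correct_b, iterations=1000, seed=0):
    # correct_a / correct_b: per-example 0/1 correctness of two models on the
    # same evaluation examples; returns how often model A wins on a resample.
    rng = random.Random(seed)
    n, wins = len(correct_a), 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        if mean(correct_a[i] for i in idx) > mean(correct_b[i] for i in idx):
            wins += 1
    return wins / iterations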
Following Sanh et al. (2022), we initialize the continued pretraining stage from T5 finetuned with an LM objective (Lester et al., 2021), making it more amenable to prompting. We experiment with two sizes: T5-Large with 770M parameters and T5-XL with 3B parameters. We retrain T0 (Sanh et al., 2022), i.e. MTL-T🔥P✗, to eliminate confounding factors in the training procedure. We also reproduce Gu et al. (2022)'s experiment in our setup, i.e. MTL-T❄P🔥, pretraining a shallow prompt with other parameters frozen. During few-shot finetuning, we train on the same 16 examples for 100 epochs. §C reports additional hyperparameters.
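For the few-shot stage, a minimal sketch of the finetuning loop is below: only the prompt parameters are updated, on the same 16 examples for 100 epochs. Which parameters count as the prompt (here, anything whose name contains "prompt") and the learning rate are our illustrative assumptions; §C lists the actual hyperparameters.

from torch.optim import AdamW

def fewshot_prompt_tune(model_with_prompt, sixteen_example_loader,
                        lr=0.3, epochs=100):
    for name, p in model_with_prompt.named_parameters():
        p.requires_grad = "prompt" in name  # finetune only the prompt
    prompt_params = [p for p in model_with_prompt.parameters() if p.requires_grad]
    optimizer = AdamW(prompt_params, lr=lr)
    model_with_prompt.train()
    for _ in range(epochs):
        for batch in sixteen_example_loader:
            loss = model_with_prompt(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()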
5 Results

Table 1 reports our results. From No Cont. Pretraining, we find that continued pretraining is crucial for prompt tuning with low resources—without it, only few-shot deep prompt tuning yields slightly above-random performance. These results contradict previous findings that few-shot prompt tuning works well without this stage (Min et al., 2022). We believe this is due to the challenging nature of the P3 evaluation datasets, compared to the simple sentence classification tasks previously investigated. This is consistent with what He et al. (2022) observed in the full-data setting where deep prompt tuning performs sub-optimally on difficult tasks.
Existing methods for continued pretraining have their drawbacks. In contrast to Gu et al. (2022), we found that MTL-T❄P🔥 with a shallow prompt does not perform substantially above random. We attribute this to (1) their simpler evaluation tasks which, unlike ours, have decent prompt-tuned performance without continued pretraining; and (2) their hand-designed pretraining tasks that match their evaluation tasks, while P3 conversely avoids training-evaluation task overlap, requiring generalizability. Vu et al. (2022) also found MTL-T❄P🔥 to be effective, though with high resources. We also compare with T0, i.e. MTL-T🔥P✗, where both the official model and our reproduction suffer from degraded performance when few-shot shallow prompt tuned (compared to 0-shot), likely because the prompt added during finetuning is intrusive, and the limited gradient updates are not sufficient to re-