
Continued Pretraining for Better Zero- and Few-Shot Promptability
Zhaofeng Wu¹   Robert L. Logan IV²   Pete Walsh³   Akshita Bhagia³
Dirk Groeneveld³   Sameer Singh⁴   Iz Beltagy³

¹MIT   ²Dataminr Inc.   ³Allen Institute for Artificial Intelligence   ⁴University of California, Irvine

zfw@csail.mit.edu   rlogan@dataminr.com   sameer@uci.edu
{petew,akshitab,dirkg,beltagy}@allenai.org

This work was done when Zhaofeng Wu was at AI2, and Robert Logan was at UCI.
We release our code and models at https://github.com/allenai/better-promptability.
Abstract

Recently introduced language model prompting methods can achieve high accuracy in zero- and few-shot settings while requiring few to no learned task-specific parameters. Nevertheless, these methods still often trail behind full model finetuning. In this work, we investigate if a dedicated continued pretraining stage could improve “promptability”, i.e., zero-shot performance with natural language prompts or few-shot performance with prompt tuning. We reveal settings where existing continued pretraining methods lack promptability. We also identify current methodological gaps, which we fill with thorough large-scale experiments. We demonstrate that a simple recipe, continued pretraining that incorporates a trainable prompt during multi-task learning, leads to improved promptability in both zero- and few-shot settings compared to existing methods, up to 31% relative. On the other hand, we find that continued pretraining using MAML-style meta-learning, a method that directly optimizes few-shot promptability, yields subpar performance. We validate our findings with two prompt tuning methods, and, based on our results, we provide concrete recommendations to optimize promptability for different use cases.
1 Introduction

Conditioning language models (LMs) on manually-written or learned continuous prompts allows them to solve tasks with high accuracy and minimal parameter overhead (Brown et al., 2020; Li and Liang, 2021; Lester et al., 2021, i.a.). However, prompting performance often still lags behind traditional full finetuning. Natural language prompts usually underperform trained models even when manually curated (Brown et al., 2020; Sanh et al., 2022). Similarly, while learned prompts yield higher accuracy, they do not work as well when the training data is scarce (Gu et al., 2022), when the model is small or moderately sized (Lester et al., 2021), and when the tasks are difficult (He et al., 2022).
To reduce the gap between prompt and full model tuning, past work has shown that continued pretraining on data that resembles the downstream prompting setup induces better “promptability”, i.e., zero-shot performance with natural language (NL) prompts and few-shot performance of prompt tuning (Sanh et al., 2022; Gu et al., 2022). However, in this paper, we identify several shortcomings of these methods. First, continued pretraining on NL prompts (Sanh et al., 2022) sometimes causes performance degradation with prompt tuning. Second, continued pretraining approaches that learn only a universal prompt initialization (Gu et al., 2022; Vu et al., 2022) bring only marginal improvement on the P3 datasets (Bach et al., 2022).
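
For concreteness, the following sketch shows the prompt tuning setup that these continued pretraining methods build on: a frozen LM is conditioned on a small matrix of trainable soft-prompt embeddings prepended to the input token embeddings. It is a minimal PyTorch sketch assuming a T5-style encoder-decoder from Hugging Face Transformers; the checkpoint name, prompt length, and learning rate are illustrative assumptions, not the configuration used in the paper.

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
for p in model.parameters():
    p.requires_grad = False  # the LM itself stays frozen during prompt tuning

prompt_len = 20
embed_dim = model.config.d_model
# The only trainable parameters: a (prompt_len x embed_dim) soft prompt.
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(prompt_len, embed_dim))
optimizer = torch.optim.AdamW([soft_prompt], lr=0.3)

def forward_with_prompt(input_ids, attention_mask, labels):
    """Prepend the soft prompt to the input embeddings and run the LM."""
    token_embeds = model.get_input_embeddings()(input_ids)               # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)  # (B, P, D)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)             # (B, P+T, D)
    prompt_mask = attention_mask.new_ones(input_ids.size(0), prompt_len)
    full_mask = torch.cat([prompt_mask, attention_mask], dim=1)
    return model(inputs_embeds=inputs_embeds, attention_mask=full_mask, labels=labels)

In few-shot prompt tuning only soft_prompt receives gradients, and a universal initialization for it is precisely what approaches in the style of Gu et al. (2022) and Vu et al. (2022) learn during continued pretraining.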
To further improve zero-shot and few-shot promptability, we investigate gaps in existing methods with different parameter configurations and training procedures. First, we explore the effect of incorporating a learned continuous prompt into multi-task learning (MTL), and find it to significantly improve zero- and few-shot promptability across the board. In addition, we explore MAML-style meta-learning (Finn et al., 2017; Nichol et al., 2018) as an alternative to the standard continued pretraining paradigm, but find that it underperforms simple MTL, despite its previous success on few-shot learning tasks (Li et al., 2017; Gu et al., 2018; Qian and Yu, 2019, i.a.). We perform an analysis of this phenomenon and present several explanations.
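
As a concrete illustration of the MTL recipe, the sketch below continues training on a mixture of prompted tasks while the shared soft prompt from the previous sketch is trained jointly with the LM. The task-sampling scheme, optimizer settings, and the choice to update all LM parameters are illustrative assumptions rather than the paper's exact recipe.

import random

for p in model.parameters():
    p.requires_grad = True  # unfreeze the LM for the continued pretraining stage

mtl_optimizer = torch.optim.AdamW([soft_prompt, *model.parameters()], lr=1e-4)

def multitask_continued_pretraining(task_loaders, num_steps):
    """Jointly train the LM and the shared soft prompt on a multi-task mixture."""
    iterators = {name: iter(loader) for name, loader in task_loaders.items()}
    for _ in range(num_steps):
        name = random.choice(list(task_loaders))  # sample one task per step
        try:
            batch = next(iterators[name])
        except StopIteration:  # restart an exhausted task stream
            iterators[name] = iter(task_loaders[name])
            batch = next(iterators[name])
        loss = forward_with_prompt(
            batch["input_ids"], batch["attention_mask"], batch["labels"]
        ).loss
        loss.backward()
        mtl_optimizer.step()
        mtl_optimizer.zero_grad()

A MAML-style alternative would instead adapt the prompt on a task's support batch in an inner loop and backpropagate the post-adaptation query loss into the shared initialization; as noted above, we find that this direct optimization of few-shot promptability underperforms the simple MTL recipe.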
Through large-scale experiments, each involving continued pretraining on over 9B tokens (§A), we make several contributions: (1) we thoroughly evaluate continued pretraining methods, both existing and our proposed ones, in many setups; (2) we demonstrate that a simple continued pretraining recipe improves over existing methods by up to