Continued Pretraining for Better Zero- and Few-Shot Promptability

Zhaofeng Wu¹  Robert L. Logan IV²  Pete Walsh³  Akshita Bhagia³  Dirk Groeneveld³  Sameer Singh³⁴  Iz Beltagy³

¹MIT  ²Dataminr Inc.  ³Allen Institute for Artificial Intelligence  ⁴University of California, Irvine

zfw@csail.mit.edu  rlogan@dataminr.com  {petew,akshitab,dirkg,beltagy}@allenai.org  sameer@uci.edu
Abstract

Recently introduced language model prompting methods can achieve high accuracy in zero- and few-shot settings while requiring few to no learned task-specific parameters. Nevertheless, these methods still often trail behind full model finetuning. In this work, we investigate if a dedicated continued pretraining stage could improve “promptability”, i.e., zero-shot performance with natural language prompts or few-shot performance with prompt tuning. We reveal settings where existing continued pretraining methods lack promptability. We also identify current methodological gaps, which we fill with thorough large-scale experiments. We demonstrate that a simple recipe, continued pretraining that incorporates a trainable prompt during multi-task learning, leads to improved promptability in both zero- and few-shot settings compared to existing methods, up to 31% relative. On the other hand, we find that continued pretraining using MAML-style meta-learning, a method that directly optimizes few-shot promptability, yields subpar performance. We validate our findings with two prompt tuning methods, and, based on our results, we provide concrete recommendations to optimize promptability for different use cases.
1 Introduction

Conditioning language models (LMs) on manually-written or learned continuous prompts allows them to solve tasks with high accuracy and minimal parameter overhead (Brown et al., 2020; Li and Liang, 2021; Lester et al., 2021, i.a.). However, prompting performance often still lags behind traditional full finetuning. Natural language prompts usually underperform trained models even when manually curated (Brown et al., 2020; Sanh et al., 2022). Similarly, while learned prompts yield higher accuracy, they do not work as well when the training data is scarce (Gu et al., 2022), when the model is small or moderately sized (Lester et al., 2021), and when the tasks are difficult (He et al., 2022).

This work was done when Zhaofeng Wu was at AI2, and Robert Logan was at UCI. We release our code and models at https://github.com/allenai/better-promptability.
To reduce the gap between prompt and full model tuning, past work has shown that continued pretraining on data that resembles the downstream prompting setup induces better “promptability”, i.e., zero-shot performance with natural language (NL) prompts and few-shot performance of prompt tuning (Sanh et al., 2022; Gu et al., 2022). However, in this paper, we identify several shortcomings of these methods. First, continued pretraining on NL prompts (Sanh et al., 2022) sometimes causes performance degradation with prompt tuning. Second, continued pretraining approaches that learn only a universal prompt initialization (Gu et al., 2022; Vu et al., 2022) bring only marginal improvement on the P3 datasets (Bach et al., 2022).
To further improve zero-shot and few-shot promptability, we investigate gaps in existing methods with different parameter configurations and training procedures. First, we explore the effect of incorporating a learned continuous prompt into multi-task learning (MTL), and find it to significantly improve zero- and few-shot promptability across the board. In addition, we explore MAML-style meta-learning (Finn et al., 2017; Nichol et al., 2018) as an alternative to the standard continued pretraining paradigm, but find that it underperforms simple MTL, despite its previous success on few-shot learning tasks (Li et al., 2017; Gu et al., 2018; Qian and Yu, 2019, i.a.). We perform an analysis of this phenomenon and present several explanations.
Through large-scale experiments, each involving continued pretraining on over 9B tokens (§A), we make several contributions: (1) we thoroughly evaluate continued pretraining methods, both existing and our proposed ones, in many setups; (2) we demonstrate that a simple continued pretraining recipe improves over existing methods by up to 31%; (3) we show that MAML-style meta-learning underperforms multi-task learning and provide explanations; (4) we provide concrete recommendations to improve promptability in various use cases.
2 Prompting

We review two types of prompting that we use: natural language (NL) prompting and prompt tuning.

Traditionally, NLP tasks are solved by task-specific models that predict label $y \in \mathcal{Y}$ from input $x \in \mathcal{X}$. We can consider LMs as functions that score any source and target text pair, $\mathrm{LM}: \mathcal{V}^* \times \mathcal{V}^* \to \mathbb{R}$ with vocabulary $\mathcal{V}$.¹ Past work found that large LMs can be repurposed to solve many tasks by casting $x, y$ into a text format using a template function $f: \mathcal{X} \cup \mathcal{Y} \to \mathcal{V}^*$ and taking as prediction $\arg\max_{y' \in \mathcal{Y}} \mathrm{LM}(f(x), f(y'))$.

¹We focus on encoder-decoder LMs based on T5 (Raffel et al., 2020). Past work considers them to work better than decoder-only LMs for prompting (Sanh et al., 2022).
NL prompts, or instructions, are manually constructed $f(\cdot)$. Without task-specific training, they have been successfully used to elicit predictions from LMs to perform tasks with high accuracy (Brown et al., 2020; Logan IV et al., 2022).
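As a concrete illustration of this scoring procedure, the sketch below templatizes an input, scores each verbalized answer choice with an LM-adapted T5 checkpoint, and predicts the argmax. The template, example, and checkpoint name are our own illustrative choices, not the paper's P3 templates.

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# LM-adapted T5 (Lester et al., 2021); any seq2seq LM with this interface works.
tokenizer = AutoTokenizer.from_pretrained("google/t5-large-lm-adapt")
model = T5ForConditionalGeneration.from_pretrained("google/t5-large-lm-adapt")
model.eval()

def score(source: str, target: str) -> float:
    """Log-probability the LM assigns to `target` given `source` (higher is better)."""
    enc = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**enc, labels=labels)
    # `out.loss` is the mean per-token negative log-likelihood of `labels`.
    return -out.loss.item() * labels.shape[1]

# f(x): a hand-written NL template turning the raw input into source text.
x = "The movie was a complete waste of two hours."
source = f"Review: {x}\nIs this review positive or negative?"
choices = ["positive", "negative"]  # f(y) for each y in Y

prediction = max(choices, key=lambda y: score(source, y))
print(prediction)  # expected: "negative"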
Sharing the motivation, prompt tuning learns a continuous prompt to condition the model. It takes the source text embedded by the LM input embeddings, $s \in \mathbb{R}^{N \times d}$ with length $N$ and dimension $d$, and prepends learnable embeddings $E \in \mathbb{R}^{L \times d}$, where $L$ is a hyperparameter, to obtain a new $(L+N)$-length embedded sequence. We consider hybrid prompt tuning, where $s$ is the embedding of the templatized $f(x)$, i.e., prompt tuning is always performed in addition to NL templates. This has been widely adopted due to demonstrated better performance (Gu et al., 2022; Min et al., 2022). We also study a variant of prompt tuning, sometimes called prefix tuning (Li and Liang, 2021), where the learnable vectors are added not only to the input but to all transformer layers. See Lester et al. (2021) and Li and Liang (2021) for more details on these methods. Following the terminology of Liu et al. (2022b), we refer to the input-level method as shallow prompt tuning and the layer-specific method as deep prompt tuning.
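The following PyTorch sketch shows shallow prompt tuning under these definitions: it freezes the LM, learns $E \in \mathbb{R}^{L \times d}$, and prepends $E$ to the embedded source $s$ before each forward pass. The wrapper class, its initialization scheme, and the hyperparameter values are our assumptions for illustration, not the released implementation; deep prompt tuning would instead inject learnable vectors into every transformer layer.

import torch
import torch.nn as nn

class ShallowPromptTuner(nn.Module):
    """Wraps a HuggingFace-style encoder-decoder LM; only `self.prompt` is trained."""

    def __init__(self, model, prompt_length: int = 100):
        super().__init__()
        self.model = model
        for p in self.model.parameters():  # freeze the LM; only E is learnable
            p.requires_grad = False
        # E in R^{L x d}, initialized from random vocabulary embeddings
        # (a common choice; Lester et al., 2021).
        vocab = model.get_input_embeddings().weight
        idx = torch.randint(0, vocab.shape[0], (prompt_length,))
        self.prompt = nn.Parameter(vocab[idx].detach().clone())

    def forward(self, input_ids, attention_mask, labels):
        # s in R^{B x N x d}: embeddings of the templatized source f(x).
        s = self.model.get_input_embeddings()(input_ids)
        e = self.prompt.unsqueeze(0).expand(s.shape[0], -1, -1)
        inputs_embeds = torch.cat([e, s], dim=1)  # (L + N)-length sequence
        prompt_mask = torch.ones(
            s.shape[0], self.prompt.shape[0],
            dtype=attention_mask.dtype, device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.model(inputs_embeds=inputs_embeds,
                          attention_mask=attention_mask, labels=labels)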
3 Improving Promptability

In this section, we describe existing methods to improve promptability and a new paradigm that combines their advantages.

While prompt tuning sometimes performs close to full model finetuning (Lester et al., 2021; Liu et al., 2022b), there is often still a substantial gap, such as with limited training data (Gu et al., 2022), non-gigantic models (Lester et al., 2021), or challenging tasks (He et al., 2022). We therefore study ways to improve LMs’ “promptability”. We focus on a low resource setup and consider zero-shot NL prompts and few-shot learned prompts (which, again, are in conjunction with NL prompts; §2). For the former, better promptability increases the performance when LMs face textual prompts of new tasks. For the latter, it more effectively leverages limited training examples for higher accuracy.
We investigate if promptability can improve with a continued pretraining stage after LM pretraining (or LM adaptation for LM-adapted T5 (Lester et al., 2021)) and before task-specific finetuning. The model is trained on a collection of tasks that have NL prompts and evaluated on unseen tasks. The methods that we explore below differ in how the continued pretraining stage is performed. We use the notation MTL-T_P_ to abbreviate those methods that are based on multi-task learning, where the blanks _ specify different configurations of the transformer (T) and the prompt (P) components during MTL. Architecturally, a method may continue to pretrain only the T5 model without prompt parameters, in which case we use P✗ to denote the lack of them; otherwise, both transformer and prompt parameters exist during MTL. We use 🔥 and ❄ to denote if the corresponding component is trained or frozen in MTL, respectively. This notation describes the continued pretraining stage only: in the final finetuning stage, all methods include both the transformer and prompt components, but only the latter is updated.
Continued pretraining has been studied in limited settings. Sanh et al. (2022) proposed T0 by multi-task training a T5 model (Raffel et al., 2020) as continued pretraining. They updated T5 parameters through learning on continued pretraining tasks, not including a prompt component, and showed that this training improves zero-shot NL promptability. Following our nomenclature, we refer to this paradigm as MTL-T🔥P✗. Additionally, Gu et al. (2022) employed a similar stage, incorporating and multi-task training a shallow prompt as continued pretraining, while freezing the transformer parameters in this stage. They showed that this strategy helps few-shot promptability during finetuning. We refer to this paradigm as MTL-T❄P🔥.
In this work, we study the gains of the previous two continued pretraining approaches, as well as a model that synthesizes them, MTL-T🔥P🔥, which we are the first to propose. For few-shot downstream tuning, the learned prompt can act as a good initialization compared to MTL-T🔥P✗. In the zero-shot setup, prior work has discovered that including certain text in a prompt, such as “Let’s think step by step,” can adjust the reasoning of LMs to yield substantially improved performance across tasks (Kojima et al., 2022; Askell et al., 2021). The learned prompt here could function analogously. Compared to MTL-T❄P🔥, on the other hand, the additional capacity brought by more updatable parameters could further boost model performance.
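A minimal sketch of this MTL-T🔥P🔥 recipe is given below, assuming a model object that already carries a shallow prompt (e.g., the wrapper sketched in §2) and returns a HuggingFace-style loss; the data loader stands in for a P3-like multi-task mixture of templatized examples, and the learning rate is illustrative (see §C). It differs from MTL-T❄P🔥 only in that the transformer parameters also receive gradients.

from torch.optim import AdamW

def continued_pretrain(model_with_prompt, prompted_multitask_loader,
                       lr=1e-4, epochs=1):
    # MTL-T🔥P🔥: both the transformer and the prompt are updated during MTL.
    for p in model_with_prompt.parameters():
        p.requires_grad = True
    optimizer = AdamW(model_with_prompt.parameters(), lr=lr)
    model_with_prompt.train()
    for _ in range(epochs):  # the paper's continued pretraining runs one epoch
        for batch in prompted_multitask_loader:  # examples already templatized
            loss = model_with_prompt(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()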
MAML-style meta-learning (Finn et al., 2017) directly optimizes for the downstream updates and can outperform MTL for full model finetuning (Dou et al., 2019; Bansal et al., 2020a). Yet, it similarly remains unexplored for prompting. We examine first-order MAML (FOMAML; Finn et al., 2017), performing $T$ steps of prompt tuning in the inner loop and updating all parameters in the outer loop. We also evaluate a version of Reptile (Nichol et al., 2018) adapted for our setting that performs $T$ steps of prompt tuning followed by one step of full model tuning, and use the resulting Reptile gradient for model updates. They have the same architecture as MTL-T🔥P🔥 and all parameters are trainable too. We provide a detailed description and theoretical discussion of these processes in §B. See the original papers for more details.
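To make the FOMAML variant concrete, the schematic below (our simplification, not the released implementation) runs $T$ inner steps of prompt-only tuning on a support batch, then takes the query-batch gradient at the adapted parameters and copies it onto the original parameters for the outer update; first-order means no gradients flow through the inner-loop updates. It assumes all parameters are trainable and that the model exposes a `prompt` parameter as in the earlier sketch.

import copy
from torch.optim import SGD

def fomaml_step(model, support_batch, query_batch,
                inner_steps=3, inner_lr=1e-3):
    # Inner loop on a copy so prompt-tuning updates do not overwrite the
    # meta-parameters directly (a deepcopy per step is wasteful but simple).
    fast = copy.deepcopy(model)
    inner_opt = SGD([fast.prompt], lr=inner_lr)  # T steps of prompt tuning only
    for _ in range(inner_steps):
        fast(**support_batch).loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Outer loop: gradient of the query loss at the adapted parameters,
    # copied back onto the original model (all parameters are updated).
    fast.zero_grad()
    query_loss = fast(**query_batch).loss
    query_loss.backward()
    for p, fp in zip(model.parameters(), fast.parameters()):
        p.grad = fp.grad.clone() if fp.grad is not None else None
    return query_loss.item()  # caller then runs its meta-optimizer step

The Reptile variant described above would instead take $T$ prompt-tuning steps plus one full-model step on the copy and use the resulting parameter difference as the update direction; §B gives the exact formulation.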
4 Experimental Setup

We use P3, a collection of NL-templatized examples for a variety of datasets, for training and evaluation using the standard splits in Sanh et al. (2022). Not only is there no dataset overlap between training and evaluation, but no task overlap either (e.g., sentiment vs. QA), making it challenging. We report dataset statistics in §A. We perform continued pretraining for one epoch over all training datasets. Each dataset has multiple templates, each evaluated with accuracy. As different datasets have different numbers of answer choices and hence different baseline accuracy, we report Average Relative Gain (ARG; Ye et al., 2021) as a single summary metric by averaging across all templates the relative accuracy improvement over a random baseline. We perform significance testing using bootstrap with 1,000 iterations, in each iteration randomly sampling evaluation examples and comparing the two models in question. §D reports per-dataset results.
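Both evaluation procedures can be written down compactly; the sketch below is our reading of the described metric and test, not the paper's evaluation code. ARG averages, over templates, each template's relative accuracy gain over its random baseline (one over the number of answer choices); the bootstrap repeatedly resamples evaluation examples and compares the two models on each resample.

import random
from statistics import mean

def average_relative_gain(template_accuracies, random_baselines):
    # One entry per evaluation template; baseline = 1 / (#answer choices).
    gains = [(acc - base) / base
             for acc, base in zip(template_accuracies, random_baselines)]
    return mean(gains)

def bootstrap_compare(correct_a, correct_b, iterations=1000, seed=0):
    # correct_a / correct_b: per-example 0/1 correctness of two models on the
    # same evaluation examples; returns how often model A wins on a resample.
    rng = random.Random(seed)
    n, wins = len(correct_a), 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        if mean(correct_a[i] for i in idx) > mean(correct_b[i] for i in idx):
            wins += 1
    return wins / iterations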
Following Sanh et al. (2022), we initialize the continued pretraining stage from T5 finetuned with an LM objective (Lester et al., 2021), making it more amenable to prompting. We experiment with two sizes: T5-Large with 770M parameters and T5-XL with 3B parameters. We retrain T0 (Sanh et al., 2022), i.e. MTL-T🔥P✗, to eliminate confounding factors in the training procedure. We also reproduce Gu et al. (2022)'s experiment in our setup, i.e. MTL-T❄P🔥, pretraining a shallow prompt with other parameters frozen. During few-shot finetuning, we train on the same 16 examples for 100 epochs. §C reports additional hyperparameters.
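For the few-shot stage, a minimal sketch of the finetuning loop is below: only the prompt parameters are updated, on the same 16 examples for 100 epochs. Which parameters count as the prompt (here, anything whose name contains "prompt") and the learning rate are our illustrative assumptions; §C lists the actual hyperparameters.

from torch.optim import AdamW

def fewshot_prompt_tune(model_with_prompt, sixteen_example_loader,
                        lr=0.3, epochs=100):
    for name, p in model_with_prompt.named_parameters():
        p.requires_grad = "prompt" in name  # finetune only the prompt
    prompt_params = [p for p in model_with_prompt.parameters() if p.requires_grad]
    optimizer = AdamW(prompt_params, lr=lr)
    model_with_prompt.train()
    for _ in range(epochs):
        for batch in sixteen_example_loader:
            loss = model_with_prompt(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()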
5 Results

Table 1 reports our results. From No Cont. Pretraining, we find that continued pretraining is crucial for prompt tuning with low resources—without it, only few-shot deep prompt tuning yields slightly above-random performance. These results contradict previous findings that few-shot prompt tuning works well without this stage (Min et al., 2022). We believe this is due to the challenging nature of the P3 evaluation datasets, compared to the simple sentence classification tasks previously investigated. This is consistent with what He et al. (2022) observed in the full-data setting where deep prompt tuning performs sub-optimally on difficult tasks.
Existing methods for continued pretraining have their drawbacks. In contrast to Gu et al. (2022), we found that MTL-T❄P🔥 with a shallow prompt does not perform substantially above random. We attribute this to (1) their simpler evaluation tasks which, unlike ours, have decent prompt-tuned performance without continued pretraining; and (2) their hand-designed pretraining tasks that match their evaluation tasks, while P3 conversely avoids training-evaluation task overlap, requiring generalizability. Vu et al. (2022) also found MTL-T❄P🔥 to be effective, though with high resources. We also compare with T0, i.e. MTL-T🔥P✗, where both the official model and our reproduction suffer from degraded performance when few-shot shallow prompt tuned (compared to 0-shot), likely because the prompt added during finetuning is intrusive, and the limited gradient updates are not sufficient to re-