
In this paper, we propose a progressive zero-shot dataset generation framework (Figure 1b), called PROGEN. In a nutshell, PROGEN learns a model for a downstream task by alternating between two phases: using PLMs to create labeled examples guided by feedback from the current task-specific model, and training a task-specific model on the generated labeled examples. To compute reliable feedback signals, we employ the influence function (IF; Koh and Liang, 2017) to quantify the contribution of each training point to the loss. Since zero-shot learning assumes no human-annotated data, we integrate a noise-resistant objective into the calculation of the IF so that it can tolerate the noise in the synthetic dataset. To incorporate the feedback into PLMs, we rank the training samples by their influence scores and formulate the most influential ones as in-context examples (Brown et al., 2020) to steer subsequent generation.
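To make the alternation concrete, the following is a minimal sketch of the loop; `generate_with_prompt`, `train_task_model`, and `influence_scores` are hypothetical helper names standing in for the components described in §3, not the authors' implementation.

```python
import random

# Minimal sketch of the PROGEN loop. The three helpers below are
# hypothetical placeholders for the components described in Section 3.
def progen(plm, labels, rounds=3, per_round=1000, k_feedback=8):
    dataset, in_context = [], []
    model = None
    for _ in range(rounds):
        # Phase 1: the PLM creates labeled examples, steered by the most
        # influential samples from the previous round as in-context examples.
        for _ in range(per_round):
            y = random.choice(labels)                     # uniform class label
            x = generate_with_prompt(plm, y, in_context)  # x ~ P(. | T(y), feedback)
            dataset.append((x, y))

        # Phase 2: train the task-specific model on the synthetic data.
        model = train_task_model(dataset)

        # Feedback: score every sample with the (noise-resistant) influence
        # function and keep the top-k as the next round's in-context examples.
        scores = influence_scores(model, dataset)
        ranked = sorted(zip(scores, dataset), key=lambda t: t[0], reverse=True)
        in_context = [pair for _, pair in ranked[:k_feedback]]
    return model, dataset
```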
Overall, PROGEN has the following advantages: 1) the quality estimation phase requires no human annotations, and thus works in a purely zero-shot learning setting; 2) unlike most controllable generation methods, which tune or require access to the PLM's parameters (Keskar et al., 2019; Dathathri et al., 2020; Liu et al., 2021, inter alia), the in-context feedback phase does not modify any parameters of the PLM and incurs minimal disturbance to its generation procedure. Our main contributions are threefold:
• We propose a progressive framework for zero-shot dataset generation that produces higher-quality datasets (§3);
• We propose a noise-resistant influence function to estimate the quality of each sample without any human annotations (§3.1; the classical influence function it builds on is recalled after this list), and a learning-free controllable generation method via in-context feedback (§3.2);
• Across multiple text classification datasets, we show that our framework outperforms various prompt-based methods, and that it matches the zero-shot performance of methods without in-context feedback using only 1% of the synthetic dataset size (§4).
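For reference, the classical influence function (Koh and Liang, 2017) that our noise-resistant variant builds on measures how upweighting a training point $z$ changes the loss at a test point $z_{\text{test}}$, given a model with parameters $\hat{\theta}$ trained on $n$ points:

$$\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}), \quad H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta}).$$

§3.1 replaces the plain loss $L$ with a noise-resistant objective so that these scores remain reliable on a noisy synthetic dataset.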
Our code can be found at https://github.com/HKUNLP/ProGen.
2 Background
In this section, we briefly review the baseline approaches to zero-shot dataset generation and how the synthesized dataset can be used for zero-shot learning on downstream tasks.
Zero-shot Dataset Generation
Taking text classification as an example, vanilla zero-shot dataset generation methods (Meng et al., 2022; Ye et al., 2022) aim to generate a synthetic dataset $\mathcal{D} = \{(x, y)\}$ with the help of a PLM $\mathcal{P}$. They first sample a class label $y$ from a uniform distribution:

$$y \sim \mathcal{U}(y_1, y_2, \ldots, y_k), \tag{1}$$

where $k$ is the number of classes. They then wrap $y$ into a label-descriptive prompt $\mathcal{T}(y)$ to steer the generation of $x$:

$$x \sim \mathcal{P}(\cdot \mid \mathcal{T}(y)). \tag{2}$$

Since the parameters of $\mathcal{P}$ are frozen and deterministic decoding would produce the same $x$ for each $y$, different sampling algorithms (e.g., top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020)) can be adopted to increase the diversity of the generated dataset. The synthetic dataset $\mathcal{D}$ is constructed by pairing each generated $x$ with its $y$.
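As a concrete illustration, the following is a minimal sketch of this procedure with Hugging Face `transformers`; the model, prompt template, and sampling hyperparameters are illustrative assumptions, not the exact setup of the cited works.

```python
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
plm = AutoModelForCausalLM.from_pretrained("gpt2-large")  # frozen P

labels = ["negative", "positive"]              # k = 2 classes (illustrative)

def T(y):                                      # label-descriptive prompt T(y)
    return f'The movie review in {y} sentiment is: "'

dataset = []
for _ in range(100):
    y = random.choice(labels)                  # Eq. (1): y ~ U(y_1, ..., y_k)
    inputs = tokenizer(T(y), return_tensors="pt")
    out = plm.generate(                        # Eq. (2): x ~ P(. | T(y))
        **inputs,
        do_sample=True, top_k=50, top_p=0.9,   # top-k / nucleus for diversity
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated continuation as x.
    x = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                         skip_special_tokens=True)
    dataset.append((x, y))                     # pair generated x with y
```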
Dataset-generation-based Zero-shot Learning
The vast linguistic (Jawahar et al., 2019; Goldberg, 2019; Tenney et al., 2019) and factual (Petroni et al., 2019; Jiang et al., 2020b) knowledge encoded in PLMs' parameters is key to the success of conventional prompt-based zero-shot learning (PROMPTING) (Brown et al., 2020). However, PROMPTING fails to fully exploit the capacity of PLMs and relies heavily on gigantic PLMs at inference time. This motivates another line of work (Meng et al., 2022; Ye et al., 2022) that explores a more flexible and efficient way of conducting zero-shot learning based on dataset generation.
Given a synthetic dataset generated as above, a task-specific model is trained; this model can adopt any task-specific inductive bias and has an order of magnitude fewer parameters than the PLM. The performance of the final task-specific model is largely determined by the quality of the synthetic dataset, and a low-quality dataset degrades the final zero-shot performance. This motivates us to explore methods that improve the dataset quality.
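As a minimal illustration of this pipeline, the sketch below fits a deliberately small classifier on the synthetic pairs; the architecture, the `encode` tokenizer, and the label mapping are illustrative assumptions, not the models used in the cited works.

```python
import torch
import torch.nn as nn

# Sketch: training a compact task-specific model on the synthetic dataset.
class BagOfEmbeddings(nn.Module):
    """Tiny classifier: mean-pool word embeddings, then a linear head."""
    def __init__(self, vocab_size, num_classes, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)   # default mode: mean
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids, offsets):
        return self.head(self.emb(token_ids, offsets))

label_ids = {"negative": 0, "positive": 1}            # illustrative mapping
model = BagOfEmbeddings(vocab_size=30000, num_classes=len(label_ids))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for x, y in dataset:                                  # pairs from Eqs. (1)-(2)
    ids = torch.tensor(encode(x))                     # hypothetical tokenizer
    logits = model(ids, offsets=torch.tensor([0]))    # single-example "batch"
    loss = loss_fn(logits, torch.tensor([label_ids[y]]))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Any similarly compact architecture with a task-appropriate inductive bias fits this slot; the point is that inference no longer requires the PLM.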