PROGEN: Progressive Zero-shot Dataset Generation
via In-context Feedback
Jiacheng Ye♠♦, Jiahui Gao, Jiangtao Feng, Zhiyong Wu,
Tao Yu♠♥, Lingpeng Kong♠♦
Shanghai AI Laboratory   University of Washington
The University of Hong Kong
{carsonye, sumiler}@connect.hku.hk,
{fengjiangtao, wuzhiyong}@pjlab.org.cn, {tyu, lpk}@cs.hku.hk
Abstract
Recently, dataset-generation-based zero-shot learning has shown promising results by training a task-specific model with a dataset synthesized from large pre-trained language models (PLMs). The final task-specific model often achieves comparable or even better performance than PLMs under the zero-shot setting, with orders of magnitude fewer parameters. However, synthetic datasets have their drawbacks: they have long suffered from low quality (e.g., low informativeness and redundancy). This explains why massive synthetic data does not lead to better performance, a gain we would expect with human-labeled data. To improve the quality of dataset synthesis, we propose a progressive zero-shot dataset generation framework, PROGEN, which leverages feedback from the task-specific model to guide the generation of new training data via in-context examples. Extensive experiments on five text classification datasets demonstrate the effectiveness of the proposed approach. We also show that PROGEN achieves on-par or superior performance with only 1% of the synthetic dataset size, compared to baseline methods without in-context feedback.
1 Introduction
Dataset generation with pre-trained language models (PLMs) has attracted enormous interest recently due to the superior generative capacity of PLMs. Given task-specific supervision, recent work (Anaby-Tavor et al., 2020; Puri et al., 2020; Kumar et al., 2020; Lee et al., 2021, inter alia) manages to fine-tune PLMs to synthesize high-quality datasets for downstream applications. Nevertheless, obtaining task supervision from human experts can be expensive or even unrealistic. Recent attempts (Schick and Schütze, 2021; Wang et al., 2021; Meng et al., 2022, inter alia) turn
Work done while interning at Shanghai AI Lab.
[Figure 1 contrasts the two pipelines: (a) Zero-shot Dataset Generation (ZEROGEN): PLMs + Prompt → Synthetic Dataset; (b) Progressive Zero-shot Dataset Generation (PROGEN): PLMs + Prompt → Synthetic Dataset, with Feedback flowing back into the Prompt.]
Figure 1: Comparison of vanilla zero-shot dataset generation (ZEROGEN) and progressive zero-shot dataset generation (PROGEN). In progressive zero-shot dataset generation, we split the whole dataset generation process into multiple phases. In each phase, the generation is steered by feedback from the previously generated dataset, so as to synthesize a dataset of higher quality.
their eyes to unsupervised dataset generation. Among them, ZEROGEN (Ye et al., 2022) proposes to first convert task descriptions into carefully designed prompts (Petroni et al., 2019; Brown et al., 2020), and then use these prompts to steer the PLMs to synthesize training data for the final task model. This approach allows highly efficient inference, as the final task model has orders of magnitude fewer parameters than PLMs, yet achieves comparable or even better performance than PLMs under the zero-shot setting.
The major drawback of synthetic datasets, however, is that they often suffer from low quality (e.g., low informativeness, redundancy). Although we can generate as much data as computational resources allow, the massive generated data does not automatically translate into better performance, unlike in the human-labeling scenario.
arXiv:2210.12329v1 [cs.CL] 22 Oct 2022

To address this problem, we propose a progressive zero-shot dataset generation framework (Figure 1b), called PROGEN. In
a nutshell, PROGEN learns a model for a downstream task by alternating between two phases: using PLMs to create labeled examples, leveraging feedback from the current task-specific model; and training a task-specific model on the generated labeled examples. To compute reliable signals as feedback, we employ the influence function (Koh and Liang, 2017; IF) to quantify each training point's contribution to the loss. In the context of zero-shot learning, where no human-annotated data is assumed, we integrate a noise-resistant objective into the calculation of the IF so that it can tolerate the noise in the synthetic dataset. To incorporate feedback into PLMs, we sort the training samples by their quantified influence scores, and formulate the most influential ones as in-context examples (Brown et al., 2020) to steer the generation. Overall, PROGEN has the following advantages: 1) the quality estimation phase requires no human annotations, and thus works in a purely zero-shot learning setting; 2) unlike most controllable generation methods that tune or require access to PLMs (Keskar et al., 2019; Dathathri et al., 2020; Liu et al., 2021, inter alia), the in-context feedback phase does not modify parameters in the PLM and incurs minimal disturbance to its generation procedure. Our main contributions are threefold:
• We propose a progressive framework for zero-shot dataset generation that produces higher-quality datasets (§3);
• We propose a noise-resistant influence function to estimate the quality of each sample without any human annotations (§3.1), and a learning-free controllable generation method via in-context feedback (§3.2);
• Across multiple text classification datasets, we show that our framework outperforms various prompt-based methods, and achieves on-par zero-shot performance with only 1% of the synthetic dataset size compared to methods without in-context feedback (§4).
Our code can be found at https://github.com/HKUNLP/ProGen.
2 Background
In this section, we briefly review the baseline
approaches of zero-shot dataset generation and how
the synthesized dataset can be used for zero-shot
learning on downstream tasks.
Zero-shot Dataset Generation   Taking the text classification task as an example, vanilla zero-shot dataset generation methods (Meng et al., 2022; Ye et al., 2022) aim to generate a synthetic dataset $\mathcal{D} = \{(x, y)\}$ with the help of a PLM $\mathcal{P}$. They first sample a class label $y$ from a uniform distribution:

    $y \sim U(y_1, y_2, \ldots, y_k),$    (1)

where $k$ is the number of classes. They then wrap $y$ up into a label-descriptive prompt $\mathcal{T}(y)$ to steer the generation of $x$:

    $x \sim \mathcal{P}(\cdot \mid \mathcal{T}(y)).$    (2)
Since the parameters of $\mathcal{P}$ are frozen and greedy decoding would yield the same $x$ for every $y$, different sampling algorithms (e.g., top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2020)) can be adopted to increase the diversity of the generated dataset. A synthetic dataset $\mathcal{D}$ is constructed after pairing each generated $x$ with its $y$.
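The sampling-then-prompting procedure above can be sketched as follows. This is a minimal illustration of Eqs. (1)-(2), not the authors' released code: the label set, the template string, and the `generate_fn` interface are all assumptions, and the stand-in generator below replaces an actual PLM call.

```python
import random

# Illustrative label set and label-descriptive prompt template T(y).
LABELS = ["positive", "negative"]
TEMPLATE = 'The movie review in {label} sentiment is: "'

def sample_example(generate_fn, rng=random):
    """One step of vanilla zero-shot dataset generation."""
    y = rng.choice(LABELS)              # y ~ U(y_1, ..., y_k)   (Eq. 1)
    prompt = TEMPLATE.format(label=y)   # wrap y into T(y)
    x = generate_fn(prompt)             # x ~ P(. | T(y))        (Eq. 2)
    return x, y

# Stand-in for the PLM; a real implementation would decode with top-k or
# nucleus sampling to keep the generated dataset diverse.
def toy_generate(prompt):
    return "A deep and meaningful movie."

dataset = [sample_example(toy_generate) for _ in range(4)]
```

In practice `generate_fn` would wrap a frozen PLM's stochastic decoding loop; everything else in the pipeline stays unchanged.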
Dataset-generation-based Zero-shot Learning   The vast linguistic (Jawahar et al., 2019; Goldberg, 2019; Tenney et al., 2019) and factual (Petroni et al., 2019; Jiang et al., 2020b) knowledge encoded in PLMs' parameters is the key to the success of conventional prompt-based zero-shot learning (PROMPTING) (Brown et al., 2020). However, PROMPTING fails to fully exert the capacity of PLMs and heavily relies on gigantic PLMs during inference. This motivates another line of work (Meng et al., 2022; Ye et al., 2022) to explore a more flexible and efficient way of conducting zero-shot learning based on dataset generation. Given a synthetic dataset generated as above, a task-specific model is trained, allowing any task-specific inductive bias and an order of magnitude fewer parameters than PLMs. The performance of the final task-specific model is largely determined by the quality of the synthetic dataset, and a low-quality dataset degrades the final zero-shot performance. This motivates us to explore methods that improve dataset quality.
[Figure 2 depicts the PROGEN loop: Synthetic Dataset → quality estimation via a trained TAM → Influential Subset → feedback as in-context examples → updated prompt → updated dataset. The prompt fed to the PLM stacks the influential samples as completed demonstrations, followed by an open-ended prompt, e.g.:
The movie review in positive sentiment is: "Caramel is charming, funny, honest, an epic love story that you want to see more of."
The movie review in positive sentiment is: "A thoughtful ending that both builds the story's tension and completes the arc of its first act."
The movie review in positive sentiment is: "A deep and meaningful movie."
The movie review in positive sentiment is: "]
Figure 2: Framework of PROGEN for progressive zero-shot dataset generation. To update the prompt, we first
train a task-specific model (TAM) with the synthetic dataset, and then employ the noise-robust influence function
to measure the quality of each data point. Finally, the most influential subset is selected, which acts as feedback
via in-context learning. The whole framework works with a black-box PLM and requires no human annotations.
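Assembling the in-context feedback prompt can be sketched as below. The helper name and template string are illustrative assumptions; the paper only specifies that influential samples are rendered as demonstrations ahead of an open prompt.

```python
# Sketch of turning the influential subset into an in-context prompt.
def build_feedback_prompt(influential, target_label,
                          template='The movie review in {label} sentiment is: "'):
    # Render each influential (x, y) pair as a completed demonstration.
    lines = [template.format(label=y) + x + '"' for x, y in influential]
    # End with an open prompt for the target label; the PLM's continuation
    # of this final line becomes the newly generated x.
    lines.append(template.format(label=target_label))
    return "\n".join(lines)

demo = build_feedback_prompt(
    [("A deep and meaningful movie.", "positive")], "positive")
```

Because the feedback enters purely through the prompt text, the PLM itself stays a black box, which is exactly the learning-free property claimed in §1.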
3 PROGEN
We now describe our framework for progressive zero-shot dataset generation via in-context feedback (PROGEN), as shown in Figure 2. We follow ZEROGEN (Ye et al., 2022) to build the backbone of our framework. Concretely, we first train a task-specific model (TAM) on the partially generated dataset. Then, assuming no access to human annotations, we estimate the influence of each sample via the noise-robust influence function.
Finally, with those identified most influential
samples, we explore the use of in-context learning
to shift the generation distribution towards that of
influential samples, so that the system generates
more related samples. The whole framework
progressively constructs the synthetic dataset and
enhances the performance of the final task-specific
model.
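The alternation described above can be summarized in a short sketch. The three callables are placeholder interfaces assumed for illustration (they are not the authors' API): `generate_batch` wraps prompted PLM generation, `train_tam` trains the task-specific model, and `influence_scores` is the noise-robust quality estimator of §3.1.

```python
# Minimal sketch of the PROGEN loop with assumed interfaces.
def progen(generate_batch, train_tam, influence_scores,
           num_phases=3, batch_size=100, k_feedback=8):
    dataset, feedback, tam = [], [], None
    for _ in range(num_phases):
        # Phase A: synthesize new examples, steered by in-context feedback.
        dataset.extend(generate_batch(feedback, batch_size))
        # Phase B: train the task-specific model on all data so far.
        tam = train_tam(dataset)
        # Rank samples by influence score; the most influential ones become
        # the in-context examples for the next generation phase.
        scores = influence_scores(tam, dataset)
        order = sorted(range(len(dataset)), key=lambda i: scores[i],
                       reverse=True)
        feedback = [dataset[i] for i in order[:k_feedback]]
    return dataset, tam
```

The loop keeps all previously generated data, so each phase both grows the dataset and refreshes the feedback that steers the next one.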
3.1 Annotation-free Quality Estimation
There are many factors in measuring the quality of a dataset, e.g., diversity, annotation correctness, and spurious biases (Mishra et al., 2020; Wiegreffe and Marasovic, 2021). However, these factors are often subjective, making it unrealistic to compute them all automatically. Our solution is
to infer the quality of the individual samples in
synthetic datasets using the performance of the
final task-specific model trained on the dataset as
the surrogate. Concretely, we propose to apply the influence function (Koh and Liang, 2017) to the task-specific model to obtain sample-level influence scores with respect to the loss on a validation set.
However, a clean validation set, which is crucial for
producing reliable influence scores, is inaccessible
in the zero-shot learning setting. Thus, we use a
synthetic validation set and harness the influence
function with a noise-robust objective to handle the
potential noise in the synthetic validation set.
Formally, the influence function measures the change in the model's loss on a test data point $z_{\text{test}} = (x, y)$ if we up-weight the loss of a training data point $z$ by $\epsilon$:

$$
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
\overset{\text{def}}{=} \left.\frac{d\,L(z_{\text{test}}, \hat{\theta}_{\epsilon,z})}{d\epsilon}\right|_{\epsilon=0}
= \left.\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0}
= -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}),
\qquad (3)
$$

where $\hat{\theta}_{\epsilon,z} \overset{\text{def}}{=} \arg\min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$ is the parameter vector obtained if $z$ were up-weighted by some small $\epsilon$, and $H_{\hat{\theta}}$ is the Hessian. Our noise-robust validation-set-level influence function is defined as:

$$
\mathcal{I}_{\text{up,loss}}(z, \mathcal{D}_{\text{val}}) = -\nabla_\theta L'(\mathcal{D}_{\text{val}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}),
\qquad (4)
$$
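Equations (3)-(4) can be made concrete on a model where the gradient and Hessian have closed forms. The sketch below uses ordinary least squares purely for illustration; for a neural TAM, the paper's setting would instead rely on Hessian-vector-product approximations, and $L'$ would be the noise-robust validation objective rather than the plain squared error used here.

```python
import numpy as np

def influence_scores(X_tr, y_tr, X_val, y_val, damping=1e-3):
    """Validation-set influence of each training point, as in Eq. (4),
    for a least-squares model (illustrative only)."""
    n, d = X_tr.shape
    # theta-hat: minimizer of the mean squared-error training loss.
    theta = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
    # Hessian of the mean training loss: H = X^T X / n (damped for stability).
    H = X_tr.T @ X_tr / n + damping * np.eye(d)
    # Gradient of the mean validation loss at theta-hat.
    g_val = X_val.T @ (X_val @ theta - y_val) / len(y_val)
    # Per-sample training gradients: row i is grad_theta L(z_i, theta-hat).
    G = X_tr * (X_tr @ theta - y_tr)[:, None]
    # Eq. (4): I(z_i, D_val) = -g_val^T H^{-1} grad_theta L(z_i, theta-hat).
    return -(G @ np.linalg.solve(H, g_val))
```

A single linear solve against $H$ prices all $n$ training points at once, which is why influence-based ranking is tractable as an inner step of the generation loop.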