Don’t Prompt, Search!
Mining-based Zero-Shot Learning with Language Models
Mozes van de Kar¹  Mengzhou Xia²  Danqi Chen²  Mikel Artetxe³
¹University of Amsterdam  ²Princeton University  ³Meta AI
mozesvandekar@gmail.com {mengzhou,danqic}@cs.princeton.edu
artetxe@meta.com
Abstract
Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead of prompting language models, we use regular expressions to mine labeled examples[1] from unlabeled corpora, which can optionally be filtered through prompting, and used to finetune a pretrained model. Our method is more flexible and interpretable than prompting, and outperforms it on a wide range of tasks when using comparable templates. Our results suggest that the success of prompting can partly be explained by the model being exposed to similar examples during pretraining, which can be directly retrieved through regular expressions.
1 Introduction
Recent work has obtained strong zero-shot results by prompting language models (Brown et al., 2020; Chowdhery et al., 2022). As formalized by Schick and Schütze (2021a), the core idea is to reformulate text classification as language modeling using a pattern and a verbalizer. Given the input space $\mathcal{X}$, the output space $\mathcal{C}$ and the space of possible strings $\mathcal{V}$, the pattern $t: \mathcal{X} \rightarrow \mathcal{V}$ maps each input into a string with a masked span, whereas the verbalizer $v: \mathcal{C} \rightarrow \mathcal{V}$ maps each label into a string. A language model can then be used for zero-shot classification by picking the most likely completion for the masked text, $\arg\max_{c \in \mathcal{C}} p(v(c) \mid t(x))$.[2]

[1] We use 'labeled examples' throughout the paper to denote the examples that match regex-based patterns of different labels. They are weakly supervised and can be noisy.

[2] We focus on masked language models, and allow multi-token verbalizers through autoregressive decoding (see §3). Left-to-right language models also fit the framework by placing the mask at the end or scoring the full populated prompt.
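To make this formulation concrete, the sketch below shows zero-shot classification by masked infilling with the HuggingFace transformers library, using the sentiment pattern and single-token verbalizers from Figure 1. It is an illustrative simplification under stated assumptions (model choice, tokenization details), not our exact implementation, which also handles the multi-token case described in §3.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Verbalizer v: one single-token string per class; the leading space matters for RoBERTa's BPE.
verbalizers = {"positive": " good", "negative": " bad"}

def classify(text: str) -> str:
    # Pattern t(x): "{INPUT} It was <mask>."
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    log_probs = logits[0, mask_pos].log_softmax(dim=-1)
    # arg max_c p(v(c) | t(x)), restricted to the verbalizer tokens
    scores = {
        label: log_probs[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]].item()
        for label, word in verbalizers.items()
    }
    return max(scores, key=scores.get)

print(classify("It was too scary."))  # expected to favor "negative"
```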
Figure 1: Proposed method. 1) We mine labeled examples from a text corpus with regex-based patterns. 2) Optionally, we filter examples for which zero-shot prompting predicts a different label. 3) We finetune a pretrained language model with a classification head.
In few-shot settings, better results can be obtained by prepending a few labeled examples (Brown et al., 2020), or using them in some form of fine-tuning (Schick and Schütze, 2021a; Gao et al., 2021). However, prompting is known to be sensitive to the choice of the pattern and the verbalizer, yet practitioners are blind when designing them in true zero-shot settings (Jiang et al., 2020; Perez et al., 2021). Connected to that, subtle phenomena like surface form competition (Holtzman et al., 2021) have a large impact on performance. Recent work has tried to mitigate these issues through calibration (Zhao et al., 2021), prompt combination (Schick and Schütze, 2021a; Lester et al., 2021; Zhou et al., 2022) or automatic prompt generation (Shin et al., 2020; Gao et al., 2021). At the same time, there is still not a principled understanding of how language models become few-shot learners, with recent work analyzing the role of the pretraining data (Chan et al., 2022) or the input-output mapping of in-context demonstrations (Min et al., 2022).
Task          Prompting pattern                        Mining pattern
Sentiment     {INPUT}. It was {VERBALIZER}.            (is|was) {VERBALIZER}*. {INPUT}
Topic class.  {INPUT}. It is about {VERBALIZER}.       {VERBALIZER}*. {INPUT}
NLI           {INPUT:HYP} {VERBALIZER}, {INPUT:PREM}   {INPUT:HYP} {VERBALIZER}, {INPUT:PREM}

Table 1: Patterns. {VERBALIZER} is replaced with the verbalizers in Table 2. For mining, *. captures everything up to a sentence boundary, and {INPUT}, {INPUT:HYP} and {INPUT:PREM} capture a single sentence.
Task    Label   Verbalizers
Sent.   Pos.    good, great, awesome, incredible
        Neg.    bad, awful, terrible, horrible
NLI     Ent.    Yes, Therefore, Thus, Accordingly, Hence, For this reason
        Con.    No, However, But, On the contrary, In contrast
        Neu.    Maybe, Also, Furthermore, Secondly, Additionally, Moreover, In addition

Table 2: Verbalizers for sentiment classification and NLI. See Table 9 for the verbalizers used in topic classification. When using a single verbalizer, we choose the first one in each list (good, bad, Yes, No, Maybe). Multi-token verbalizers are the multi-word entries. Ent./Con./Neu.: entailment, contradiction, neutral.
In this paper, we propose an alternative approach to zero-shot learning that is more flexible and interpretable than prompting, while obtaining stronger results in our experiments. Similar to prompting, our method requires a pretrained language model, a pattern, and a verbalizer, in addition to an unlabeled corpus (e.g., the one used for pretraining). As illustrated in Figure 1, our approach works by using the pattern and verbalizer to mine labeled examples from the corpus through regular expressions, and leveraging them as supervision to finetune the pretrained language model. This allows us to naturally combine multiple patterns and verbalizers for each task, while providing a signal to interactively design them by looking at the mined examples. In addition, we show that better results are obtained by filtering the mined examples through prompting.

Experiments in sentiment analysis, topic classification and natural language inference (NLI) confirm the effectiveness of our approach, which outperforms prompting by a large margin when using the exact same verbalizers and comparable patterns. Our results offer a new perspective on how language models can perform downstream tasks in a zero-shot fashion, showing that similar examples often exist in the pretraining corpus and can be directly retrieved through simple extraction patterns.
2 Proposed Method
As shown in Figure 1, our method has three steps:
Mine.
We first use the pattern and a set of verbalizers to extract labeled examples from the corpus. To that end, we define patterns that are filled with verbalizers and expanded into regular expressions. For instance, the pattern and verbalizer in Figure 1 would extract every sentence following “is good.” or “was good.” as an example of the positive class, and every sentence following “is bad.” or “was bad.” as an example of the negative class. In practice, the patterns that we define are comparable to the ones used for prompting, and the verbalizers are exactly the same (see Tables 1 and 2). Appendix A gives more details on how we expand patterns into regular expressions. While prior work in prompting typically uses a single verbalizer per class, our approach allows us to naturally combine examples mined through multiple verbalizers in a single dataset. To mitigate class imbalance and keep the mined dataset to a reasonable size, we mine a maximum of 40k examples per class after balancing across the different verbalizers.
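As an illustration, the sketch below shows one way the mining step could be implemented for the sentiment pattern. The regular expression is a simplified stand-in for the expansion rules described in Appendix A, and the toy corpus lines are made up for the example.

```python
import re
from collections import defaultdict

verbalizers = {
    "positive": ["good", "great", "awesome", "incredible"],
    "negative": ["bad", "awful", "terrible", "horrible"],
}

def build_regex(words):
    # Expand "(is|was) {VERBALIZER}*." and capture the following sentence as {INPUT}.
    alternation = "|".join(words)
    return re.compile(
        rf"\b(?:is|was) (?:{alternation})[^.!?]*[.!?]\s+([A-Z][^.!?]*[.!?])"
    )

patterns = {label: build_regex(words) for label, words in verbalizers.items()}

def mine(corpus_lines, max_per_class=40_000):
    mined = defaultdict(list)
    for line in corpus_lines:
        for label, pattern in patterns.items():
            for match in pattern.finditer(line):
                if len(mined[label]) < max_per_class:
                    mined[label].append(match.group(1))  # the captured {INPUT} sentence
    return mined

corpus = [
    "I liked it. The film is good. That is why I recommend it.",
    "It was bad overall. It was not funny at all.",
]
print(mine(corpus))  # {'positive': ['That is why...'], 'negative': ['It was not funny at all.']}
```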
Filter.
As an optional second step, we explore automatically removing noisy examples from the mined data. To that end, we classify the mined examples using zero-shot prompting, and remove examples for which the predicted and the mined label do not match. This filtering step is reliant on the performance of prompting, so we only remove the 10% of mismatching examples for which zero-shot prompting is most confident.
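The sketch below illustrates this filtering step under simplifying assumptions: `score_labels` is a hypothetical helper (e.g., the prompting scorer sketched in §1) returning per-label log-probabilities for a text, and measuring confidence as the margin between the predicted and the mined label is an illustrative choice rather than our exact criterion.

```python
def filter_mined(mined, score_labels, drop_fraction=0.10):
    """Drop the most confidently mismatching 10% of mined examples; keep everything else."""
    mismatches, kept = [], {label: [] for label in mined}
    for label, texts in mined.items():
        for text in texts:
            scores = score_labels(text)              # {label: log-probability}
            predicted = max(scores, key=scores.get)
            if predicted == label:
                kept[label].append(text)
            else:
                # Confidence of the disagreement: margin of predicted label over mined label.
                margin = scores[predicted] - scores[label]
                mismatches.append((margin, label, text))
    mismatches.sort(reverse=True)                    # most confident disagreements first
    n_drop = int(drop_fraction * len(mismatches))
    for _, label, text in mismatches[n_drop:]:       # keep the remaining mismatches
        kept[label].append(text)
    return kept
```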
Finetune.
Finally, we use the mined dataset to finetune a pretrained language model in the standard supervised fashion (Devlin et al., 2019), learning a new classification head.
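A minimal sketch of this fine-tuning step with the transformers library is given below; the optimizer, learning rate, and batching are illustrative simplifications (our actual settings are described in §3), and the input is assumed to be an iterator of (texts, labels) batches.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# A fresh classification head is created on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune(batches, max_steps=5_000):
    """Standard supervised fine-tuning on batches of (list of texts, list of label ids)."""
    model.train()
    for step, (texts, labels) in enumerate(batches):
        if step >= max_steps:
            break
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        loss = model(**inputs, labels=torch.tensor(labels)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```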
                             Sentiment analysis              Topic class.        NLI
                             amz   imd   mr    sst   ylp     agn   dbp   yah     mnl   qnl   rte   snl    avg
Full-shot  Fine-tuning       97.1  95.7  88.8  94.4  95.0    95.1  99.3  76.8    78.5  92.6  67.6  90.5   89.3
Zero-shot  Prompting         81.5  78.4  71.1  77.4  81.9    34.0  36.4  28.2    47.1  50.8  52.3  39.6   56.6
           w/ multi verb.    83.5  81.8  78.3  81.9  83.1    54.6  51.1  34.1    46.5  58.2  61.4  44.1   63.2
           Proposed method   92.0  86.7  80.5  85.6  92.0    79.2  80.4  56.1    50.4  53.2  62.6  46.0   72.0

Table 3: Main results (accuracy). All systems are based on RoBERTa-base, and all zero-shot systems use comparable patterns (see Table 1). We report average accuracy across 3 runs for all systems except prompting. w/ multi verb.: prompting with different sets of verbalizers (Table 9), averaging the probabilities.
3 Experimental Settings
Tasks.
We evaluate on three types of tasks: binary sentiment analysis on Amazon (Zhang et al., 2015), IMDb (Maas et al., 2011), MR (Pang and Lee, 2005), SST-2 (Socher et al., 2013) and Yelp (Zhang et al., 2015); topic classification on AG News (Zhang et al., 2015), DBPedia (Zhang et al., 2015) and Yahoo Topics[3] (Zhang et al., 2015); and NLI on MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and SNLI (Bowman et al., 2015). We report accuracy on the test set when available, falling back to the validation set for SST-2, MNLI, RTE and QNLI. For all systems involving fine-tuning, we report the average across 3 runs with different random seeds. We ran all development experiments on SST-2 and AG News, without any exhaustive hyperparameter exploration, and evaluated the remaining tasks blindly.

[3] The Yahoo Answers dataset was downloaded by, and access was limited to, the University of Amsterdam, where all experiments were carried out.
Approaches.
We compare the following methods in our experiments, using RoBERTa-base (Liu et al., 2019) as the pretrained model in all cases:
Full-shot fine-tuning: We finetune RoBERTa on the original training set, adding a new classification head. We train for 3 epochs with a batch size of 32. All the other hyperparameters follow Liu et al. (2019). Refer to Appendix B for more details.
Zero-shot prompting: Standard prompting, as described in §1. Multi-token verbalizer probabilities are calculated autoregressively, picking the most likely token at each step (Schick and Schütze, 2021c). We report results using both a single verbalizer per class, as is common in prior work, and multiple verbalizers per class, which is more comparable to our approach. For the latter, we combine the probabilities of each verbalizer by averaging.[4]

Verbalizers             Prompting   Mining
good / bad              78.1        72.0
great / awful           82.3        82.1
awesome / terrible      82.3        83.9
incredible / horrible   83.1        87.3
combined                81.7        85.4

Table 4: Average sentiment accuracy using different verbalizers. We report mining results without filtering. More detailed results are provided in Table 12.
Zero-shot mining: Our proposed method, described in §2. For the mining step, we use the first 100 shards from the C4 corpus (Raffel et al., 2020), which cover 9.8% of the data. For the filtering step, we use single-verbalizer prompting to filter 10% of the mislabeled examples. For the fine-tuning step, we use the same settings as in the full-shot setup, except that we train for 5,000 steps with a dropout probability of 0.4.[5] To mitigate class imbalance, we form batches by first sampling the class for each instance from the uniform distribution, and then picking a random example from the mined data belonging to that class.
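The class-balanced sampling described above might be implemented as in the following sketch, where `mined` maps each class label to its list of mined texts; the function and batch size are illustrative, and its output matches the (texts, labels) batches assumed by the fine-tuning sketch in §2.

```python
import random

def balanced_batch(mined, batch_size=32):
    """Sample a batch by drawing each instance's class uniformly, then a random example of it."""
    classes = sorted(mined.keys())
    texts, labels = [], []
    for _ in range(batch_size):
        label = random.choice(classes)                # uniform over classes
        texts.append(random.choice(mined[label]))     # random mined example of that class
        labels.append(classes.index(label))           # integer label id
    return texts, labels
```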
Patterns and verbalizers.
We use comparable
patterns for prompting and mining with the exact
[4] We also tried summing or taking the maximum, which obtained similar results, as shown in Appendix C.

[5] During development, we found that high dropout and early stopping help mitigate the overfitting caused by the misalignment between the mined and the true distribution. However, evaluation on all tasks shows mixed results. We stick to the original setup with high dropout to be faithful to the rigorous zero-shot scenario, and report additional results with standard dropout in Appendix C.