Don’t Prompt, Search!
Mining-based Zero-Shot Learning with Language Models
Mozes van de Kar¹  Mengzhou Xia²  Danqi Chen²  Mikel Artetxe³
¹University of Amsterdam  ²Princeton University  ³Meta AI
mozesvandekar@gmail.com {mengzhou,danqic}@cs.princeton.edu
artetxe@meta.com
Abstract
Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead of prompting language models, we use regular expressions to mine labeled examples[1] from unlabeled corpora, which can optionally be filtered through prompting, and used to finetune a pretrained model. Our method is more flexible and interpretable than prompting, and outperforms it on a wide range of tasks when using comparable templates. Our results suggest that the success of prompting can partly be explained by the model being exposed to similar examples during pretraining, which can be directly retrieved through regular expressions.
1 Introduction
Recent work has obtained strong zero-shot results by prompting language models (Brown et al., 2020; Chowdhery et al., 2022). As formalized by Schick and Schütze (2021a), the core idea is to reformulate text classification as language modeling using a pattern and a verbalizer. Given the input space $\mathcal{X}$, the output space $\mathcal{C}$ and the space of possible strings $\mathcal{V}$, the pattern $t: \mathcal{X} \rightarrow \mathcal{V}$ maps each input into a string with a masked span, whereas the verbalizer $v: \mathcal{C} \rightarrow \mathcal{V}$ maps each label into a string. A language model can then be used for zero-shot classification by picking the most likely completion for the masked text, $\arg\max_{c \in \mathcal{C}} p(v(c) \mid t(x))$.[2]

[1] We use 'labeled examples' throughout the paper to denote the examples that match regex-based patterns of different labels. They are weakly supervised and can be noisy.

[2] We focus on masked language models, and allow multi-token verbalizers through autoregressive decoding (see §3). Left-to-right language models also fit the framework by placing the mask at the end or scoring the full populated prompt.
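To make this formulation concrete, the sketch below shows zero-shot classification by masked infilling with the HuggingFace transformers library, using the sentiment pattern and single-token verbalizers from Figure 1. It is an illustrative simplification under stated assumptions (model choice, tokenization details), not our exact implementation, which also handles the multi-token case described in §3.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Verbalizer v: one single-token string per class; the leading space matters for RoBERTa's BPE.
verbalizers = {"positive": " good", "negative": " bad"}

def classify(text: str) -> str:
    # Pattern t(x): "{INPUT} It was <mask>."
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    log_probs = logits[0, mask_pos].log_softmax(dim=-1)
    # arg max_c p(v(c) | t(x)), restricted to the verbalizer tokens
    scores = {
        label: log_probs[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]].item()
        for label, word in verbalizers.items()
    }
    return max(scores, key=scores.get)

print(classify("It was too scary."))  # expected to favor "negative"
```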
Figure 1: Proposed method. 1) We mine labeled examples from a text corpus with regex-based patterns. 2) Optionally, we filter examples for which zero-shot prompting predicts a different label. 3) We finetune a pretrained language model with a classification head.
In few-shot settings, better results can be obtained by prepending a few labeled examples (Brown et al., 2020), or using them in some form of fine-tuning (Schick and Schütze, 2021a; Gao et al., 2021). However, prompting is known to be sensitive to the choice of the pattern and the verbalizer, yet practitioners are blind when designing them in true zero-shot settings (Jiang et al., 2020; Perez et al., 2021). Connected to that, subtle phenomena like surface form competition (Holtzman et al., 2021) have a large impact on performance. Recent work has tried to mitigate these issues through calibration (Zhao et al., 2021), prompt combination (Schick and Schütze, 2021a; Lester et al., 2021; Zhou et al., 2022) or automatic prompt generation (Shin et al., 2020; Gao et al., 2021). At the same time, there is still not a principled understanding of how language models become few-shot learners, with recent work analyzing the role of the pretraining data (Chan et al., 2022) or the input-output mapping of in-context demonstrations (Min et al., 2022).
Task          Prompting pattern                        Mining pattern
Sentiment     {INPUT}. It was {VERBALIZER}.            (is|was) {VERBALIZER}*. {INPUT}
Topic class.  {INPUT}. It is about {VERBALIZER}.       {VERBALIZER}*. {INPUT}
NLI           {INPUT:HYP} {VERBALIZER}, {INPUT:PREM}   {INPUT:HYP} {VERBALIZER}, {INPUT:PREM}

Table 1: Patterns. {VERBALIZER} is replaced with the verbalizers in Table 2. For mining, *. captures everything up to a sentence boundary, and {INPUT}, {INPUT:HYP} and {INPUT:PREM} capture a single sentence.
Task    Label   Verbalizers
Sent.   Pos.    good, great, awesome, incredible
        Neg.    bad, awful, terrible, horrible
NLI     Ent.    Yes, Therefore, Thus, Accordingly, Hence, For this reason
        Con.    No, However, But, On the contrary, In contrast
        Neu.    Maybe, Also, Furthermore, Secondly, Additionally, Moreover, In addition

Table 2: Verbalizers for sentiment classification and NLI. See Table 9 for the verbalizers used in topic classification. When using a single verbalizer, we choose the first one in each list (good, bad, Yes, No, Maybe). Multi-token verbalizers are the multi-word entries. Ent./Con./Neu.: entailment, contradiction, neutral.
In this paper, we propose an alternative approach to zero-shot learning that is more flexible and interpretable than prompting, while obtaining stronger results in our experiments. Similar to prompting, our method requires a pretrained language model, a pattern, and a verbalizer, in addition to an unlabeled corpus (e.g., the one used for pretraining). As illustrated in Figure 1, our approach works by using the pattern and verbalizer to mine labeled examples from the corpus through regular expressions, and leveraging them as supervision to finetune the pretrained language model. This allows us to naturally combine multiple patterns and verbalizers for each task, while providing a signal to interactively design them by looking at the mined examples. In addition, we show that better results are obtained by filtering the mined examples through prompting.

Experiments in sentiment analysis, topic classification and natural language inference (NLI) confirm the effectiveness of our approach, which outperforms prompting by a large margin when using the exact same verbalizers and comparable patterns. Our results offer a new perspective on how language models can perform downstream tasks in a zero-shot fashion, showing that similar examples often exist in the pretraining corpus and can be directly retrieved through simple extraction patterns.
2 Proposed Method
As shown in Figure 1, our method has three steps:
Mine.
We first use the pattern and a set of verbalizers to extract labeled examples from the corpus. To that end, we define patterns that are filled with verbalizers and expanded into regular expressions. For instance, the pattern and verbalizer in Figure 1 would extract every sentence following “is good.” or “was good.” as an example of the positive class, and every sentence following “is bad.” or “was bad.” as an example of the negative class. In practice, the patterns that we define are comparable to the ones used for prompting, and the verbalizers are exactly the same (see Tables 1 and 2). Appendix A gives more details on how we expand patterns into regular expressions. While prior work in prompting typically uses a single verbalizer per class, our approach allows us to naturally combine examples mined through multiple verbalizers in a single dataset. To mitigate class imbalance and keep the mined dataset to a reasonable size, we mine a maximum of 40k examples per class after balancing across the different verbalizers.
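As an illustration, the sketch below shows one way the mining step could be implemented for the sentiment pattern. The regular expression is a simplified stand-in for the expansion rules described in Appendix A, and the toy corpus lines are made up for the example.

```python
import re
from collections import defaultdict

verbalizers = {
    "positive": ["good", "great", "awesome", "incredible"],
    "negative": ["bad", "awful", "terrible", "horrible"],
}

def build_regex(words):
    # Expand "(is|was) {VERBALIZER}*." and capture the following sentence as {INPUT}.
    alternation = "|".join(words)
    return re.compile(
        rf"\b(?:is|was) (?:{alternation})[^.!?]*[.!?]\s+([A-Z][^.!?]*[.!?])"
    )

patterns = {label: build_regex(words) for label, words in verbalizers.items()}

def mine(corpus_lines, max_per_class=40_000):
    mined = defaultdict(list)
    for line in corpus_lines:
        for label, pattern in patterns.items():
            for match in pattern.finditer(line):
                if len(mined[label]) < max_per_class:
                    mined[label].append(match.group(1))  # the captured {INPUT} sentence
    return mined

corpus = [
    "I liked it. The film is good. That is why I recommend it.",
    "It was bad overall. It was not funny at all.",
]
print(mine(corpus))  # {'positive': ['That is why...'], 'negative': ['It was not funny at all.']}
```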
Filter.
As an optional second step, we explore automatically removing noisy examples from the mined data. To that end, we classify the mined examples using zero-shot prompting, and remove examples for which the predicted and the mined label do not match. This filtering step is reliant on the performance of prompting, so we only remove the 10% of mismatching examples for which zero-shot prompting is most confident.
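The sketch below illustrates this filtering step under simplifying assumptions: `score_labels` is a hypothetical helper (e.g., the prompting scorer sketched in §1) returning per-label log-probabilities for a text, and measuring confidence as the margin between the predicted and the mined label is an illustrative choice rather than our exact criterion.

```python
def filter_mined(mined, score_labels, drop_fraction=0.10):
    """Drop the most confidently mismatching 10% of mined examples; keep everything else."""
    mismatches, kept = [], {label: [] for label in mined}
    for label, texts in mined.items():
        for text in texts:
            scores = score_labels(text)              # {label: log-probability}
            predicted = max(scores, key=scores.get)
            if predicted == label:
                kept[label].append(text)
            else:
                # Confidence of the disagreement: margin of predicted label over mined label.
                margin = scores[predicted] - scores[label]
                mismatches.append((margin, label, text))
    mismatches.sort(reverse=True)                    # most confident disagreements first
    n_drop = int(drop_fraction * len(mismatches))
    for _, label, text in mismatches[n_drop:]:       # keep the remaining mismatches
        kept[label].append(text)
    return kept
```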
Finetune.
Finally, we use the mined dataset to finetune a pretrained language model in the standard supervised fashion (Devlin et al., 2019), learning a new classification head.
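A minimal sketch of this fine-tuning step with the transformers library is given below; the optimizer, learning rate, and batching are illustrative simplifications (our actual settings are described in §3), and the input is assumed to be an iterator of (texts, labels) batches.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# A fresh classification head is created on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune(batches, max_steps=5_000):
    """Standard supervised fine-tuning on batches of (list of texts, list of label ids)."""
    model.train()
    for step, (texts, labels) in enumerate(batches):
        if step >= max_steps:
            break
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        loss = model(**inputs, labels=torch.tensor(labels)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```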
                             Sentiment analysis              Topic class.        NLI
                             amz   imd   mr    sst   ylp     agn   dbp   yah     mnl   qnl   rte   snl    avg
Full-shot  Fine-tuning       97.1  95.7  88.8  94.4  95.0    95.1  99.3  76.8    78.5  92.6  67.6  90.5   89.3
Zero-shot  Prompting         81.5  78.4  71.1  77.4  81.9    34.0  36.4  28.2    47.1  50.8  52.3  39.6   56.6
           w/ multi verb.    83.5  81.8  78.3  81.9  83.1    54.6  51.1  34.1    46.5  58.2  61.4  44.1   63.2
           Proposed method   92.0  86.7  80.5  85.6  92.0    79.2  80.4  56.1    50.4  53.2  62.6  46.0   72.0

Table 3: Main results (accuracy). All systems are based on RoBERTa-base, and all zero-shot systems use comparable patterns (see Table 1). We report average accuracy across 3 runs for all systems except prompting. w/ multi verb.: prompting with different sets of verbalizers (Table 9), averaging the probabilities.
3 Experimental Settings
Tasks.
We evaluate on three types of tasks: binary sentiment analysis on Amazon (Zhang et al., 2015), IMDb (Maas et al., 2011), MR (Pang and Lee, 2005), SST-2 (Socher et al., 2013) and Yelp (Zhang et al., 2015); topic classification on AG News (Zhang et al., 2015), DBPedia (Zhang et al., 2015) and Yahoo Topics[3] (Zhang et al., 2015); and NLI on MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and SNLI (Bowman et al., 2015). We report accuracy on the test set when available, falling back to the validation set for SST-2, MNLI, RTE and QNLI. For all systems involving fine-tuning, we report the average across 3 runs with different random seeds. We ran all development experiments on SST-2 and AG News, without any exhaustive hyperparameter exploration, and evaluated the remaining tasks blindly.

[3] The Yahoo Answers dataset was downloaded by, and access was limited to, the University of Amsterdam, where all experiments were carried out.
Approaches.
We compare the following methods in our experiments, using RoBERTa-base (Liu et al., 2019) as the pretrained model in all cases:
Full-shot fine-tuning: We finetune RoBERTa on the original training set, adding a new classification head. We train for 3 epochs with a batch size of 32. All the other hyperparameters follow Liu et al. (2019). Refer to Appendix B for more details.
Zero-shot prompting: Standard prompting, as described in §1. Multi-token verbalizer probabilities are calculated autoregressively, picking the most likely token at each step (Schick and Schütze, 2021c). We report results using both a single verbalizer per class, as is common in prior work, and multiple verbalizers per class, which is more comparable to our approach. For the latter, we combine the probabilities of each verbalizer by averaging.[4]

Verbalizers             Prompting   Mining
good / bad              78.1        72.0
great / awful           82.3        82.1
awesome / terrible      82.3        83.9
incredible / horrible   83.1        87.3
combined                81.7        85.4

Table 4: Average sentiment accuracy using different verbalizers. We report mining results without filtering. More detailed results are provided in Table 12.
Zero-shot mining: Our proposed method, described in §2. For the mining step, we use the first 100 shards from the C4 corpus (Raffel et al., 2020), which cover 9.8% of the data. For the filtering step, we use single-verbalizer prompting to filter 10% of the mislabeled examples. For the fine-tuning step, we use the same settings as in the full-shot setup, except that we train for 5,000 steps with a dropout probability of 0.4.[5] To mitigate class imbalance, we form batches by first sampling the class for each instance from the uniform distribution, and then picking a random example from the mined data belonging to that class.
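The class-balanced sampling described above might be implemented as in the following sketch, where `mined` maps each class label to its list of mined texts; the function and batch size are illustrative, and its output matches the (texts, labels) batches assumed by the fine-tuning sketch in §2.

```python
import random

def balanced_batch(mined, batch_size=32):
    """Sample a batch by drawing each instance's class uniformly, then a random example of it."""
    classes = sorted(mined.keys())
    texts, labels = [], []
    for _ in range(batch_size):
        label = random.choice(classes)                # uniform over classes
        texts.append(random.choice(mined[label]))     # random mined example of that class
        labels.append(classes.index(label))           # integer label id
    return texts, labels
```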
Patterns and verbalizers.
We use comparable
patterns for prompting and mining with the exact
[4] We also tried summing or taking the maximum, which obtained similar results, as shown in Appendix C.

[5] During development, we found that high dropout and early stopping help mitigate the overfitting caused by the misalignment between the mined and the true distribution. However, evaluation on all tasks shows mixed results. We stick to the original setup with high dropout to be faithful to the rigorous zero-shot scenario, and report additional results with standard dropout in Appendix C.