
LLM to iteratively propose and evaluate different candidate
explanations.
For evaluation, we curate a diverse collection of datasets written in natural language (Table 1) and measure iPrompt’s ability to accurately explain a ground-truth pattern. We find that iPrompt outperforms baseline methods in accurately finding a correct description; moreover, the generated descriptions are interpretable, allowing human auditing and enabling strong generalization when used as a prompt in a new setting (i.e. when used for a different LLM). On real-world sentiment classification datasets, iPrompt even produces prompts that match or improve upon human-written prompts for GPT-3, while only using smaller, locally run language models. Finally, we find that iPrompt is able to extract information from real-world scientific datasets.
2. Related work
Prompting and autoprompting. With the advent of large-scale models, prompting (i.e. finding the right prompt to use to query an LLM for a given task) has exploded as an area of inquiry, often yielding impressive improvements in performance (Brown et al., 2020; Petroni et al., 2019; Liu et al., 2021a) and spurring a line of work aiming to make prompting easier (Strobelt et al., 2022; Lu et al., 2022; Bach et al., 2022; Logan IV et al., 2022). Recently, autoprompting (i.e. automatically searching for a prompt or prompt embedding via optimization) has emerged, with methods such as prefix-tuning (Li & Liang, 2021), P-tuning (Liu et al., 2021b), prompt-tuning with rules (Han et al., 2021), knowledgeable prompt tuning (Hu et al., 2021), and many more (Liu et al., 2021a). These strategies use gradient descent to find a set of “adapter” parameters that maximize model performance, but do not require that the new parameters map back to tokens in discrete space, rendering them uninterpretable.
A few methods tackle the more difficult problem of searching for prompts that can be expressed in natural language tokens. RLPrompt (Deng et al., 2022) searches for such a prompt using reinforcement learning, and one recent work (Honovich et al., 2022) queries an LLM to produce a prompt. AutoPrompt (Shin et al., 2020) performs autoprompting via input gradients (see Sec. 3). Similarly, adversarial triggers (Wallace et al., 2019) use autoprompting to identify adversarial inputs which can be used to change a model’s prediction. These methods effectively alter a model’s predictions, but do not constrain the discovered prompts to be semantically meaningful, resulting in prompts that are difficult to interpret (Webson & Pavlick, 2021). Another related work directly finetunes an LLM to describe the difference between two datasets (Zhong et al., 2022). Concurrent work proposes a method for natural language prompting similar to the one here, with a focus on improving prediction performance rather than on explaining data patterns (Zhou et al., 2022).
Problems related to dataset explanation. The problem statement presented in this work closely resembles the widely studied problems of symbolic regression (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009), program synthesis (Gulwani et al., 2017; Manna & Waldinger, 1980), text/table summarization (Kryściński et al., 2019; Liu et al., 2018), and pattern discovery in data mining (Hand, 2007). iPrompt can be viewed as an algorithm for symbolic regression, in which the set of allowable symbols consists of semantically meaningful natural language strings. One recent work proposes the task of inferring prompts that improve supervised prediction (Honovich et al., 2022), which we generalize here to diverse use cases for dataset explanation.
Alternative methods for neural-network interpretation. A popular method for interpreting neural networks is to inspect an LLM’s individual predictions via feature importances (Lundberg et al., 2019; Ribeiro et al., 2016), feature-interaction importances (Singh et al., 2019; Tsang et al., 2017), extractive rationales (Zaidan & Eisner, 2008; Sha et al., 2021), or natural language explanations for individual predictions (Hendricks et al., 2016; Camburu et al., 2018). These works can provide meaningful insights for individual predictions, but it is difficult to parse them into an understanding of an entire dataset. Alternatively, one can investigate an LLM’s learned representations via probing (Conneau et al., 2018; Liu & Avci, 2019) or by directly analyzing a model’s internal weights and activations (Wang et al., 2021; Olah et al., 2018; Meng et al., 2022). However, these approaches are limited in their ability to generate previously unknown descriptions of data. A different approach involves distilling information into a transparent model (Tan et al., 2018; Ha et al., 2021; Singh & Gao, 2022) or simply using a transparent model in the first place (Breiman et al., 1984; Tan et al., 2022; Singh et al., 2021; Agarwal et al., 2022).
3. Methods: Defining the task and approach
3.1. Task: Dataset Explanation
Given a dataset composed of input-output string pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, the goal is to produce a “semantically meaningful” natural language string that explains the relationship between $x$ and $y$. We require that the string consist of human-understandable text rather than a sequence of incongruous tokens. For example, in the dataset shown in Fig. 1, given samples of data performing addition, our task is to recover text synonymous with “Add the inputs.” This
dataset explanation can then be used for various downstream
tasks, such as prompting a different LLM.
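For concreteness, the snippet below sketches this setup: a toy addition dataset of $(x, y)$ string pairs and a scoring function that ranks candidate explanations by how well each one works as a prompt for a small causal LLM. This is an illustrative sketch rather than the iPrompt implementation; the “gpt2” model, the Input/Output template, and the average log-likelihood score are assumptions made for the example.

```python
# Minimal sketch (not the paper's implementation) of the dataset-explanation task:
# given (x, y) string pairs, score a candidate natural language explanation by how
# well it works as a prompt for a small causal LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Toy dataset mirroring the addition example in Fig. 1.
dataset = [("2 5", "7"), ("3 1", "4"), ("6 2", "8")]

def prompt_score(explanation: str) -> float:
    """Average log-likelihood of each y when the explanation prefixes x."""
    total = 0.0
    for x, y in dataset:
        prompt = f"{explanation}\nInput: {x}\nOutput:"
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + " " + y, return_tensors="pt").input_ids
        # Mask the prompt tokens so the loss is computed only on the y tokens.
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100
        with torch.no_grad():
            loss = model(full_ids, labels=labels).loss  # mean NLL over y tokens
        total += -loss.item()
    return total / len(dataset)

# Candidate explanations; with a sufficiently capable model, the correct
# description should tend to score higher.
for candidate in ["Add the inputs.", "Subtract the inputs."]:
    print(candidate, round(prompt_score(candidate), 3))
```

Masking the prompt tokens ensures the score reflects only how well the explanation helps predict $y$; the explanation string itself remains plain natural language, so it can later be reused verbatim as a prompt for a different LLM.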