iPrompt: Explaining Data Patterns in Natural Language
via Interpretable Autoprompting
Chandan Singh*1, John X. Morris*2, Jyoti Aneja1, Alexander M. Rush2, Jianfeng Gao1
Abstract

Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. We explore whether we can leverage this ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural language string explaining the data. iPrompt iteratively generates explanations with an LLM and reranks them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural language understanding, show that iPrompt can yield meaningful insights by accurately finding dataset explanations that are human-interpretable. On two of four classification datasets, iPrompt discovers a prompt that outperforms human-written prompts on GPT-3, despite only querying the relatively small GPT-J model. Finally, experiments with scientific datasets show the potential for iPrompt to aid in scientific discovery.[1]
1. Introduction
Large language models (LLMs) have attained an extraordi-
nary ability to harness natural language for solving diverse
problems (Devlin et al., 2018), often without the need for
finetuning (Brown et al., 2020; Sanh et al., 2021). Moreover,
LLMs have demonstrated the capacity to excel at real-world
problems, such as mathematics (Lewkowycz et al., 2022),
scientific question answering (Sadat & Caragea, 2022), gen-
eral processing of scientific text (Beltagy et al., 2019), pre-
dicting brain responses (Schrimpf et al., 2021), and classify-
ing proteins and chemical compounds (Taylor et al., 2022).
*Equal contribution. 1Microsoft Research. 2Cornell University. Correspondence to: Jianfeng Gao <jfgao@microsoft.com>.

[1] All code for using the methods and data here is made available on Github.
Figure 1. Interpretable autoprompting (iPrompt) inverts the standard prediction problem to instead find a natural language explanation of the data using a fixed, pre-trained large language model. Given a dataset (e.g. Input: 3 1, Output: 4; Input: 4 7, Output: 11; Input: 5 9, Output: 14), the LLM produces a natural-language explanation (Add the inputs), with uses including (i) explaining a known dataset, (ii) recovering a transferable prompt, and (iii) proposing novel descriptions.
In this work, we probe whether we can leverage the learned skills of an LLM to discover and explain patterns in a dataset. To do so, we invert the typical problem of fitting an LLM to data and instead ask whether we can use a fixed LLM to produce a natural language string explaining dataset patterns.

Our approach to this problem centers around prompting. Prompting has emerged as an effective method for adapting LLMs to new datasets (Liu et al., 2021a); a prompt string is combined with each example in a dataset before querying an LLM for an answer. While prompts were initially constructed manually, recent work has shown success in autoprompting, automatically finding a prompt via optimization (Shin et al., 2020; Li & Liang, 2021; Deng et al., 2022). However, previous work on learning natural language prompts does not produce prompts that are meaningful to humans.
Our approach, interpretable autoprompting (iPrompt), extends autoprompting to generate a semantically meaningful natural language prompt that explains a key characteristic of the data (see Fig. 1). For example, given a dataset of examples of addition (e.g. inputs "2 5" with output "7" through inputs "3 1" with output "4"), iPrompt yields the natural language explanation Add the inputs. By changing the input form of the data, we can generate explanations that accomplish different tasks from the example, such as: (i) recovering a dataset explanation, (ii) generating a prompt transferable between LLMs, and (iii) proposing novel descriptions. iPrompt works by using a pre-trained LLM to iteratively propose and evaluate different candidate explanations.
For evaluation, we curate a diverse collection of datasets written in natural language (Table 1) and measure iPrompt's ability to accurately explain a ground-truth pattern. We find that iPrompt outperforms baseline methods in accurately finding a correct description; moreover, the generated descriptions are interpretable, allowing human auditing and enabling strong generalization when used as a prompt in a new setting (i.e. when used for a different LLM). On real-world sentiment classification datasets, iPrompt even produces prompts that match or improve upon human-written prompts for GPT-3, while only using smaller, locally-run language models. Finally, we find that iPrompt is able to extract information from real-world scientific datasets.
2. Related work
Prompting and autoprompting. With the advent of large-scale models, prompting (i.e. finding the right prompt to use to query an LLM for a given task) has exploded as an area of inquiry, often yielding impressive improvements in performance (Brown et al., 2020; Petroni et al., 2019; Liu et al., 2021a) and spurring a line of work aiming to make prompting easier (Strobelt et al., 2022; Lu et al., 2022; Bach et al., 2022; Logan IV et al., 2022). Recently, autoprompting (i.e. automatically searching for a prompt or prompt embedding via optimization) has emerged, with methods such as prefix-tuning (Li & Liang, 2021), P-tuning (Liu et al., 2021b), prompt-tuning with rules (Han et al., 2021), knowledgeable prompt tuning (Hu et al., 2021), and many more (Liu et al., 2021a). These strategies use gradient descent to find a set of "adapter" parameters that maximize model performance, but do not require that the new parameters map back to tokens in discrete space, rendering them uninterpretable.
A few methods tackle the more difficult problem of searching for prompts that can be expressed in natural language tokens. RLPrompt (Deng et al., 2022) searches for such a prompt using reinforcement learning, and one recent work (Honovich et al., 2022) queries an LLM to produce a prompt. AutoPrompt (Shin et al., 2020) performs autoprompting via input gradients (see Sec. 3). Similarly, adversarial triggers (Wallace et al., 2019) use autoprompting to identify adversarial inputs which can be used to change a model's prediction. These methods effectively alter a model's predictions, but do not constrain the discovered prompts to be semantically meaningful, resulting in prompts that are difficult to interpret (Webson & Pavlick, 2021). Another related work directly finetunes an LLM to describe the difference between two datasets (Zhong et al., 2022). Concurrent work proposes a method for natural language prompting similar to the one here, with a focus on improving prediction performance rather than on explaining data patterns (Zhou et al., 2022).
Problems related to dataset explanation. The problem statement presented in this work closely resembles the widely studied problems of symbolic regression (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009), program synthesis (Gulwani et al., 2017; Manna & Waldinger, 1980), text/table summarization (Kryściński et al., 2019; Liu et al., 2018), and pattern discovery in data mining (Hand, 2007). iPrompt can be viewed as an algorithm for symbolic regression, in which the set of allowable symbols consists of semantically meaningful natural language strings. One recent work proposes the task of inferring prompts that improve supervised prediction (Honovich et al., 2022), which we generalize here to diverse use cases for dataset explanation.
Alternative methods for neural-network interpretation. A popular method for interpreting neural networks is to inspect an LLM's individual predictions via feature importances (Lundberg et al., 2019; Ribeiro et al., 2016), feature-interaction importances (Singh et al., 2019; Tsang et al., 2017), extractive rationales (Zaidan & Eisner, 2008; Sha et al., 2021), or natural language explanations for individual predictions (Hendricks et al., 2016; Camburu et al., 2018). These works can provide meaningful insights for individual predictions, but it is difficult to parse them into an understanding of an entire dataset. Alternatively, one can investigate what an LLM's learned representations encode via probing (Conneau et al., 2018; Liu & Avci, 2019) or by directly analyzing a model's internal weights and activations (Wang et al., 2021; Olah et al., 2018; Meng et al., 2022). However, these approaches are limited in their ability to generate previously unknown descriptions of data. A different approach involves distilling information into a transparent model (Tan et al., 2018; Ha et al., 2021; Singh & Gao, 2022) or simply using a transparent model in the first place (Breiman et al., 1984; Tan et al., 2022; Singh et al., 2021; Agarwal et al., 2022).
3. Methods: Defining the task and approach
3.1. Task: Dataset Explanation
Given a dataset comprised of input-output string pairs $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, the goal is to produce a "semantically meaningful" natural language string that explains the relationship between $x$ and $y$. We require that the string consist of human-understandable text rather than a sequence of incongruous tokens. For example, in the dataset shown in Fig. 1, given samples of data performing addition, our task is to recover text synonymous to Add the inputs. This dataset explanation can then be used for various downstream tasks, such as prompting a different LLM.
Table 1. Dataset explanation tasks. Each collection contains # different tasks. Roman numerals correspond to the use cases in Fig. 1. For full details on each dataset, see Appendix A.2.

Collection            | #  | Description                 | Use cases
1) Synthetic math     | 10 | Mathematical functions      | (i), (ii)
2) Allen NLI          | 10 | Language tasks              | (i), (ii)
3) Instr. induction   | 20 | Language tasks              | (i), (ii)
4) Sentiment          | 4  | Sentiment classification    | (i), (ii)
5) Proteins/chemicals | 3  | Protein/chemical properties | (iii)
6) Language fMRI      | 20 | Excitation of fMRI voxel    | (iii)
Datasets. Table 1 shows the collections of datasets we study: (1) Synthetic math – datasets that require inferring an underlying mathematical function from numeric inputs and outputs; (2) Allen NLI (ANLI) and (3) Instruction induction (Honovich et al., 2022) – diverse language tasks (Wang et al., 2022) with easily verifiable descriptions (e.g. Find a country's capital); (4) Sentiment – a collection of sentiment classification datasets in different domains. For collections (1-3), there is a ground-truth prompt available for evaluation. For example, when adding two numbers (Fig. 1), the rule checks whether a description contains any of the keywords add, sum, or +. We also study scientific datasets on (5) proteins/chemicals and (6) fMRI, with full details given in Sec. 6.
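For concreteness, the following is a minimal sketch of how such a keyword-based correctness check might be implemented; the keyword set is taken from the addition example above, and the function name is our own.

```python
# Minimal sketch of the keyword rule described above: a generated explanation counts as
# "correct" if it contains any of the problem-specific ground-truth keywords.
def is_correct_explanation(explanation: str, keywords: set[str]) -> bool:
    """Return True if the explanation mentions any ground-truth keyword (case-insensitive)."""
    text = explanation.lower()
    return any(k.lower() in text for k in keywords)

if __name__ == "__main__":
    addition_keywords = {"add", "sum", "+"}  # keywords for the add-two-numbers dataset
    print(is_correct_explanation("Add the inputs", addition_keywords))            # True
    print(is_correct_explanation("Return the first number", addition_keywords))   # False
```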
3.2. Approach: iPrompt
We now detail approaches for the general problem of autoprompting before introducing iPrompt, our method for interpretable autoprompting. We specify autoprompting as a discrete search problem. Given a dataset of $n$ input-output pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and a pre-trained LLM $f$ that returns the log-probability of a given string, autoprompting finds a natural language explanation $\hat{s}$ maximizing:

$$\hat{s} = \underset{s \in \mathcal{S}}{\arg\max} \; \sum_{i=1}^{n} f\big(\text{render}(s, x_i, y_i)\big) \qquad (1)$$

The render function is a problem-specific function that renders a natural language string from the prompt $s$ and each example in the dataset $(x_i, y_i)$. We use $\mathcal{S}$ to indicate the set of fluent strings, under some notion of syntactic fluency. This constraint is used to ensure prompts are readable, and potentially generalize to downstream LLMs. Solving this search problem exactly is intractable.
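To make the objective concrete, here is a minimal sketch (not the authors' implementation) of evaluating Eq. (1) for a candidate explanation with a HuggingFace causal language model; gpt2 stands in for GPT-J, and the render template is an illustrative assumption rather than the paper's exact format.

```python
# Sketch of scoring a candidate explanation s under Eq. (1): the sum over examples of the
# log-probability the LM assigns to the rendered string. "gpt2" is a small stand-in for
# GPT-J; the render template below is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def render(prompt: str, x: str, y: str) -> str:
    return f"{prompt}\nInput: {x} Output: {y}"

@torch.no_grad()
def string_log_prob(text: str) -> float:
    """Total log-probability of a string under the causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits                              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predict token t+1 from its prefix
    return log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

def objective(prompt: str, data: list[tuple[str, str]]) -> float:
    # Eq. (1); in practice one may restrict the sum to the output tokens y_i only.
    return sum(string_log_prob(render(prompt, x, y)) for x, y in data)

data = [("3 1", "4"), ("4 7", "11"), ("5 9", "14")]
for candidate in ["Add the inputs.", "Subtract the inputs.", "Return the first input."]:
    print(candidate, round(objective(candidate, data), 2))
```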
A core assumption of this objective is that semantically accurate prompts lead a model to assign higher probability to the correct output. To check this assumption, we analyze four datasets from the inverse synthetic math collection that share common structure for the inputs and prompts. Each dataset admits a prompt of the form Return the ___ of the inputs., after which the model is given two input numbers and queried for the output.

Figure 2. Prompt-based reranking depends on model size. Large models (GPT-J 6B and GPT-3) align prompts correctly to tasks. The model is given the prompt Return the ___ of the inputs., where the blank is filled in with the shown prompt keyword, before querying the output given two input numbers in a string. Darker indicates higher accuracy, and high accuracy along the diagonal indicates that the correct prompt induces the highest accuracy.

Fig. 2 shows the accuracy of different models at performing these tasks across different input prompts.[2] For small models, the prompts are unsuccessful, but for large models (GPT-J 6B and GPT-3), the model is accurate if and only if given the correct prompt.[3] This result suggests that, at least for large models, the search for a prompt that maximizes performance correlates well with the underlying task. We will see in Fig. 4 that dataset explanation depends on this ability.
Baseline: AutoPrompt. AutoPrompt (Shin et al., 2020) targets the objective posed in Eq. (1) using a gradient-based local search. AutoPrompt searches for $\hat{s}$ following the gradients of the objective Eq. (1) with respect to individual tokens in $\hat{s}$. It discretely changes individual words in $\hat{s}$ and then checks whether or not the newly updated $\hat{s}$ improves the objective score. The use of gradients allows AutoPrompt to find an effective prompt $\hat{s}$, but makes it difficult to find answers that satisfy the fluency constraint $\mathcal{S}$.
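For intuition, the following is a hedged, simplified sketch of the gradient-guided token swap at the heart of AutoPrompt-style search, written against the HuggingFace transformers API with gpt2 as a stand-in model. The render template, prompt string, flipped position, and single-example loss are illustrative assumptions; AutoPrompt itself batches over the dataset and optimizes trigger tokens rather than a full sentence.

```python
# Illustrative sketch (not the authors' code) of a HotFlip-style token swap: the gradient of
# the loss w.r.t. one prompt token's embedding ranks candidate replacements, and a swap is
# kept only if it actually lowers the loss on the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
embeddings = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

@torch.no_grad()
def example_loss(prompt_ids: torch.Tensor, x: str, y: str) -> float:
    """Negative log-likelihood of the example, given the prompt tokens as a prefix."""
    example_ids = tokenizer(f"\nInput: {x} Output: {y}", return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, example_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    return model(input_ids, labels=labels).loss.item()

def candidate_tokens(prompt_ids: torch.Tensor, position: int, x: str, y: str, k: int = 10):
    """Rank replacement tokens for one prompt position by predicted first-order loss decrease."""
    example_ids = tokenizer(f"\nInput: {x} Output: {y}", return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, example_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    inputs_embeds = embeddings[input_ids].detach().clone().requires_grad_(True)
    model(inputs_embeds=inputs_embeds, labels=labels).loss.backward()
    grad = inputs_embeds.grad[0, position]       # d(loss)/d(embedding of the chosen token)
    scores = -(embeddings.detach() @ grad)       # larger score => larger predicted loss drop
    return scores.topk(k).indices

# Greedy swap-and-check at a single (arbitrary) prompt position.
prompt_ids = tokenizer("Compute the output from the inputs.", return_tensors="pt").input_ids
x, y = "3 1", "4"
position = 0  # which prompt token to try replacing (illustrative choice)
best_loss = example_loss(prompt_ids, x, y)
for token_id in candidate_tokens(prompt_ids, position, x, y):
    trial = prompt_ids.clone()
    trial[0, position] = token_id
    trial_loss = example_loss(trial, x, y)
    if trial_loss < best_loss:  # keep the swap only if the objective improves
        prompt_ids, best_loss = trial, trial_loss
print(tokenizer.decode(prompt_ids[0]), best_loss)
```

Note that nothing in this search encourages the swapped-in tokens to form fluent text, which is the interpretability gap iPrompt targets.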
[2] The accuracy is normalized for each task using softmax in order to visualize the effect of differing prompts.
[3] For details on each model, see Table A3.
Figure 3. Overview of iPrompt. iPrompt first proposes candidate prompts, then ranks them based on their performance as a prompt, then truncates and regenerates them. This entire process is repeated until performance stops improving.
Baseline: Zero-shot suffix decoding. LLMs themselves can be directly used to predict prompt strings. Following Honovich et al., we give the model a prompt string which contains data examples followed by a template, e.g. "In: 2 5" ($x_i$), "Out: 7." ($y_i$), "To compute the output from the input," (template), and sample the output to recover a prompt $\hat{s}$ using nucleus sampling.[4]
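As a concrete illustration, here is a hedged sketch of this baseline with the HuggingFace generate API: the data examples and the template from the text form the context, and several candidate prompts are sampled as continuations via nucleus sampling. gpt2 stands in for GPT-J, and the sampling hyperparameters are illustrative.

```python
# Sketch of zero-shot suffix decoding: show the LM the examples plus a template, then sample
# candidate prompts as continuations with nucleus (top-p) sampling. "gpt2" is a stand-in.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

data = [("3 1", "4"), ("4 7", "11"), ("5 9", "14")]
prefix = "".join(f"In: {x} Out: {y}. " for x, y in data)
prefix += "To compute the output from the input,"   # template from the text

input_ids = tokenizer(prefix, return_tensors="pt").input_ids
out = model.generate(
    input_ids,
    do_sample=True,           # nucleus sampling
    top_p=0.95,
    max_new_tokens=8,         # candidate prompts are kept short
    num_return_sequences=4,   # sample several candidates
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o[input_ids.shape[1]:], skip_special_tokens=True) for o in out]
print(candidates)
```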
Proposed method: iPrompt. iPrompt (Fig. 3) is an iterative local search algorithm that alternates between three steps: (i) proposing candidate prompts, (ii) reranking candidate prompts, and (iii) exploration.

(i) Proposal: Candidate prompts are generated by extending the zero-shot LLM generation. Given a data instance as a prefix, we sample a number of candidate prompts. The maximum length of each candidate is pre-specified and fixed. For example, in the add-two-numbers task (Fig. 3), we may generate four candidates: {Combine the numbers, Return the output, Sum in order, Compute the output}.
(ii) Reranking: Given candidates, the objective Eq. (1) is evaluated for each candidate prompt $s$. The top few candidates which maximize the objective are kept, e.g. narrowing down the candidates to {Combine the numbers, Sum in order}.

[4] We also consider averaging the model's output logits across all examples in the dataset before decoding the output, but find that it does not improve performance (see Appendix A.4).
(iii) Iterate with exploration: Each of the top candidates from reranking is truncated at a random position. These truncated candidates are used as a prefix when generating new candidate prompts via suffix decoding. For example, we may randomly select the start of the previous candidates and fill in the endings: {Combine the, Sum} → {Combine the numbers, Combine both arguments, Sum the numbers, Sum all inputs}.

The algorithm is repeated until identifying a suitably strong $\hat{s}$, e.g. Sum the numbers. Steps (i) and (iii) ensure that prompts remain fluent, while step (ii) improves the score of the prompts on the objective. Computationally, iPrompt only requires running inference on the pre-trained LLM, yielding a significantly lower memory requirement than methods such as AutoPrompt, which require access to the LLM's gradients.
4. Experimental Setup
We consider two sets of experiments. First, in Sec. 5, we explore iPrompt's ability to rediscover a correct and fluent prompt on a variety of simple instruction datasets (Table 1, top) with known answers. These experiments test the ability of the model to recover a known prompt while also remaining fluent in a way that generalizes to human readers and to other language models. In Sec. 6, we apply iPrompt to scientific datasets (Table 1, bottom).
Language Models. For the main set of experiments, we always generate prompts using GPT-J, a 6 billion parameter model (Wang & Komatsuzaki, 2021). We restrict prompts to {6, 12} tokens for sentiment classification and 6 tokens for the remaining data collections in Table 1. For generalization experiments, alternative models are tested with the generated prompts, including OPT and GPT-3 (Zhang et al., 2022; Brown et al., 2020). See Appendix A.4 for a full discussion of experimental details and Appendix A.3 for experiments on more models (e.g. Galactica (Taylor et al., 2022)) and more datasets.
Evaluation metrics. We consider two types of evaluation: closeness to ground truth and accuracy as a prompt. To measure closeness we use three metrics: (1) Correct – whether the generated explanation contains one of a set of problem-specific keywords. (2) MRR – mean reciprocal rank, measuring the rank of the first task-correct prompt. Given a set of datasets $D = \{D_1, \ldots, D_N\}$, we compute $\mathrm{MRR} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{1}{\mathrm{rank}_i}$, where $\mathrm{rank}_i$ is the one-indexed rank of the first correct explanation. (3) Human – the human evaluation score between the top-generated explanation and a pre-specified ground-truth explanation, when instructed "You are given a groundtruth description along with a generated one. On a scale of 1 (worst) to 5 (best), how interpretable and accurate is the generated description?"[5] The mean human evaluation score (ranging from 1 to 5) is normalized.

Table 2. Performance for dataset explanation. Datasets from Table 1 (1-3). Accuracy measured via (1) human evaluation (H, normalized %), (2) mean reciprocal rank across the collection (M), and (3) 1-best correctness (C, %). For all metrics, higher is better.

          | iPrompt        | AutoPrompt     | Suffix
          | H / M / C      | H / M / C      | H / M / C
Math      | 60 / 0.69 / 60 | 25 / 0.14 / 13 | 20 / 0.08 / 03
ANLI      | 56 / 0.41 / 37 | 21 / 0.07 / 07 | 25 / 0.06 / 01
Induction | 42 / 0.35 / 28 | 21 / 0.09 / 08 | 23 / 0.04 / 01
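As a small worked illustration of the MRR metric defined above (with made-up ranks):

```python
# Worked illustration of the MRR metric; the ranks here are made up for the example.
def mean_reciprocal_rank(first_correct_ranks: list[float]) -> float:
    """Average of 1/rank_i, where rank_i is the 1-indexed rank of the first correct
    explanation for dataset i (use float('inf') if no candidate is correct)."""
    return sum(1.0 / r for r in first_correct_ranks) / len(first_correct_ranks)

# Four datasets: correct explanation ranked 1st, 2nd, 1st, and never found.
print(mean_reciprocal_rank([1, 2, 1, float("inf")]))  # (1 + 0.5 + 1 + 0) / 4 = 0.625
```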
To measure generalization ability, we evaluate explanations based on accuracy as a prompt for other models. Accuracy is computed following (Brown et al., 2020; Raffel et al., 2020): using exact matching with beam search, a beam width of 4, and a length penalty of $\alpha = 0.6$.
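Below is a hedged sketch of this accuracy computation with the HuggingFace generate API, using the stated beam width and length penalty; gpt2 stands in for the evaluated models, and the render template and output post-processing are illustrative assumptions.

```python
# Sketch of accuracy-as-a-prompt: generate with beam search (beam width 4, length penalty 0.6)
# and count exact matches with the target output. "gpt2" is a stand-in model; the template and
# the newline-based truncation of the prediction are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def prompt_accuracy(prompt: str, data: list[tuple[str, str]]) -> float:
    correct = 0
    for x, y in data:
        ids = tokenizer(f"{prompt}\nInput: {x} Output:", return_tensors="pt").input_ids
        out = model.generate(ids, num_beams=4, length_penalty=0.6, max_new_tokens=8,
                             pad_token_id=tokenizer.eos_token_id)
        pred = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        pred = pred.strip().split("\n")[0].strip()   # keep only the first generated line
        correct += int(pred == y)                    # exact string match
    return correct / len(data)

print(prompt_accuracy("Add the inputs.", [("3 1", "4"), ("4 7", "11"), ("5 9", "14")]))
```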
For sentiment evaluation, we learn a prompt within the template Input: "${input}" ${prompt}.[6] We use positive and negative as positive and negative labels and require the LLM to rank the two options. Human-written prompts are adapted to this template from open-source prompts available through PromptSource (Bach et al., 2022).
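A hedged sketch of this label-ranking step follows, assuming a HuggingFace causal LM (gpt2 as a stand-in) and an illustrative learned prompt; the template matches the one above, and the example review is made up.

```python
# Sketch of sentiment evaluation: place the prompt after the input per the template
# Input: "${input}" ${prompt}, then rank the label words "positive" and "negative" by
# their log-probability as the continuation. The prompt string here is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def label_log_prob(text: str, label: str) -> float:
    """Log-probability of the label tokens when appended to the rendered text."""
    prefix_ids = tokenizer(text, return_tensors="pt").input_ids
    label_ids = tokenizer(" " + label, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, label_ids], dim=1)
    logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, -label_ids.shape[1]:].sum().item()

def classify(review: str, prompt: str) -> str:
    text = f'Input: "{review}" {prompt}'   # template: Input: "${input}" ${prompt}
    scores = {lab: label_log_prob(text, lab) for lab in ("positive", "negative")}
    return max(scores, key=scores.get)

print(classify("A wonderful, heartfelt film.", "Overall, the sentiment of the review is"))
```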
5. Results and Analysis
5.1. Dataset explanation recovery
Table 2 compares prompting methods across three diverse data collections. The Human evaluation scores are much higher for iPrompt than the baselines, suggesting that it finds prompts which are both accurate and human-interpretable. Similarly, the MRR and Correct scores show that iPrompt considerably improves in finding accurate explanations. See all generated explanations in Appendix A.3.
To assess the best-case absolute accuracy of the approach, we note it is impossible for the approach to recover the prompt if the underlying LLM cannot solve the task. Fig. 4 plots the prompt recovery performance (MRR) against the underlying LLM's accuracy (when using the ground-truth prompt) for each dataset. When the model can solve the task, iPrompt does well on recovery. However, for many tasks the model has low accuracy even with the correct prompt, putting a ceiling on the performance of iPrompt.
[5] Human evaluation scores are averaged over 4 PhD students in machine learning not affiliated with the study.
[6] In initial experiments, we find that performance drops significantly when learning a prompt that comes before the input.
Figure 4. Comparison of model accuracy with the correct prompt and iPrompt's ability to find the correct prompt across each individual task (single-task MRR). Prompt recovery ability is dependent on the model's ability to perform the task.
Table 3. Generalization accuracy (zero-shot) with the prompts generated with GPT-J as the LLM, evaluated across different models.

Collection | Model       | Correct Prompt | iPrompt | AutoPrompt | No prompt
Math       | GPT-J 6.7B* | 54.0           | 51.5    | 41.6       | 16.3
Math       | OPT 6.7B    | 12.7           | 19.3    | 18.9       | 8.4
Math       | GPT 20B     | 76.1           | 54.4    | 23.2       | 8.5
Math       | GPT-3 175B  | 76.0           | 62.1    | 40.8       | 28.4
ANLI       | GPT-J 6.7B* | 9.0            | 4.7     | 1.9        | 2.0
ANLI       | OPT 6.7B    | 10.7           | 6.7     | 4.7        | 7.9
ANLI       | GPT 20B     | 31.0           | 14.2    | 5.6        | 4.0
ANLI       | GPT-3 175B  | 37.6           | 11.7    | 2.7        | 7.7
5.2. Generalization accuracy of prompts
Do prompts generated for a specific LLM still work when applied to a different model? Table 3 shows the generalization accuracy when testing the prompts generated using GPT-J (Table 5) on different LLMs. The prompts maintain effectiveness across most models. For the Math datasets, the iPrompt prompts elicit improvement over the baselines and approach the accuracy of the correct prompt. For the ANLI datasets, all prompts induce poor performance. Notably, the gap between iPrompt and AutoPrompt is larger for larger models (i.e. GPT 20B and GPT-3); this suggests that, by generating fluent prompts, iPrompt generates more generalizable descriptions.
Table 4 shows results on the sentiment analysis datasets. As prompts for GPT-J, iPrompt outperforms not only AutoPrompt, but also the manually-written prompt on all four datasets. Interestingly, the average performance of human-written prompts on GPT-J is very low, unlike the prompts generated by iPrompt. This indicates that models at 6B parameter scale may be brittle to the choice of prompt, even among a set of reasonable options, and iPrompt (and to an extent, AutoPrompt) is able to discover how to phrase prompts so that models of this scale can complete the task.