
referred to as in-context learning), is something that large language models are particularly skilled
at (Shin et al., 2020; Liu et al., 2021). Among the wide spectrum of language understanding tasks,
we are particularly interested in multi-step reasoning for two reasons: (1) multi-step reasoning is a
task where large models substantially outperform smaller models (Wei et al., 2022b), whereas the
gains from large models on tasks like sentiment classification can be very limited (Shin et al., 2020);
(2) multi-step reasoning is where few-shot prompting starts to outperform full-training-set fine-tuning,
even when fine-tuning is conducted on the same large model (Lewkowycz et al., 2022). This work
takes an important step forward in multi-step reasoning by showing the critical role of prompt
complexity.
Chain-of-Thoughts Reasoning A prominent work demonstrating the multi-step reasoning ability of
language models is chain-of-thoughts prompting (Fig. 1A), proposed by Wei et al. (2022b). They
show that this reasoning ability is elicited only by chain-of-thoughts prompting, not by standard
prompting where an answer directly follows a question without intermediate reasoning steps. Further
works show that CoT can be improved by self-consistency (Wang et al., 2022b), pretraining the model
on LaTeX-formatted data (Lewkowycz et al., 2022), context selection (Creswell et al., 2022), or even
adding certain magic phrases like “Let’s think step by step” (Kojima et al., 2022). The original CoT
paper (Wei et al., 2022b) uses 8 manually written examples as the prompt, which are reused by most
follow-up works. Our work sits in the context of CoT reasoning and proposes a new complexity-based
prompt selection scheme that substantially outperforms the original CoT.
Example Selection for Prompting Designing prompts can be challenging because of their instability:
multiple works have shown that performance is sensitive to changes in the prompt, task, dataset, and
model (Zhao et al., 2021; Lu et al., 2022; Su et al., 2022). Despite works on automatic prompt
searching (which is more suitable for smaller models, e.g., Shin et al., 2020; Li & Liang, 2021),
prompt engineering for large models currently remains a community-wide collective trial-and-error
effort (there is even a prompt marketplace named PromptBase). The difficulty is that it is extremely
hard to extract, from empirical observations, generalizable regularities that can form effective
selection criteria. One notable exception is similarity-based prompt selection, which retrieves the
most similar training instances as the prompt for a given test case (Rubin et al., 2022), as sketched
after this paragraph. Yet for CoT prompting, retrieving different prompts for different test cases
requires reasoning chain annotations for the whole training set, which compromises the advantage of
being few-shot. Given this background, our core contribution is identifying complexity as an effective
and robust selection criterion that, in many cases, outperforms existing prompt selection schemes
while being annotation-efficient.
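To make the contrast concrete, the following is a minimal sketch of similarity-based exemplar
retrieval; the bag-of-words cosine similarity used here is an illustrative stand-in for the learned
retriever of Rubin et al. (2022), not their actual method. Note that every question in the retrieval
pool must already carry a reasoning chain annotation before it can serve as a CoT prompt.

# Minimal sketch of similarity-based prompt selection (illustrative only).
# A bag-of-words cosine similarity stands in for a learned retriever; the
# annotation cost arises because every retrievable training question needs
# an annotated reasoning chain.
from collections import Counter
from math import sqrt
from typing import List, Tuple

def bow(text: str) -> Counter:
    """Bag-of-words term counts for a question."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_prompt(test_question: str,
                    annotated_pool: List[Tuple[str, str]],  # (question, reasoning chain)
                    k: int = 8) -> List[Tuple[str, str]]:
    """Return the k annotated training instances most similar to the test question."""
    return sorted(annotated_pool,
                  key=lambda qc: cosine(bow(test_question), bow(qc[0])),
                  reverse=True)[:k]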
Relation to Classical Semantic Parsing The procedure of chain-of-thoughts prompting is
conceptually similar to classical semantic parsing, where one generates a logical form and then
executes it over a knowledge base to reach a final answer (Liang, 2016; Cheng et al., 2019). The
practice of sampling then voting is also similar to marginalizing out semantic parses (Yin et al., 2018).
Further works connect in-context learning to classical Bayesian inference (Wei et al., 2021; Xie et al.,
2022). From our perspective, we tend to view chains of thought as flexible, language-model-styled
“logical forms” that are “executed” by the language model itself. We leave further study connecting
classical parsing and CoT to future work.
3 COMPLEXITY-BASED PROMPTING
We study multi-step reasoning tasks and use math word problems, i.e., mathematical problems
expressed in natural language, as our testbed. The task, measured by solve rate (accuracy), is to
predict the answer (typically a number) to a given math word problem via intermediate reasoning
steps. We follow the chain-of-thoughts prompting framework and compare all prompting schemes
using GPT-3 text-davinci-002 and Codex code-davinci-002. An example problem, as well
as the chain-of-thoughts workflow, is shown in Fig. 1A. The input is a stack of a few (often 8) CoT
cases followed by a test question; the language model then continues the text, generating an output
CoT for the test question. Our goal is to improve reasoning accuracy by identifying and exploiting
more effective input and output reasoning chains.
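To make this workflow concrete, the following is a minimal sketch under simple assumptions; the
exemplar format, the answer-extraction heuristic, and the generate wrapper around the language
model call (e.g., text-davinci-002) are illustrative choices, not the exact implementation used
in our experiments.

# Minimal sketch of the chain-of-thoughts prompting workflow (illustrative only).
import re
from typing import Callable, List, Tuple

def build_cot_prompt(exemplars: List[Tuple[str, str]], test_question: str) -> str:
    """Stack (question, reasoning chain) exemplars, then append the test question."""
    blocks = [f"Question: {q}\n{chain}" for q, chain in exemplars]
    blocks.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(blocks)

def extract_answer(output_chain: str) -> str:
    """Heuristically take the last number in the generated chain as the prediction."""
    numbers = re.findall(r"-?\d+\.?\d*", output_chain.replace(",", ""))
    return numbers[-1] if numbers else ""

def solve(generate: Callable[[str], str],
          exemplars: List[Tuple[str, str]],
          test_question: str) -> str:
    """generate wraps the language model call; solve rate is the fraction of correct answers."""
    prompt = build_cot_prompt(exemplars, test_question)
    output_chain = generate(prompt)  # the model continues with a reasoning chain for the test question
    return extract_answer(output_chain)

Here, generate could, for instance, wrap a single greedy-decoded completion request to the model;
any such wrapper that maps a prompt string to the generated continuation fits this sketch.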