Preprint
COMPLEXITY-BASED PROMPTING FOR MULTI-STEP
REASONING
Yao Fu¹*, Hao Peng², Ashish Sabharwal², Peter Clark², Tushar Khot²
¹University of Edinburgh  ²Allen Institute for AI
yao.fu@ed.ac.uk, haop@allenai.org, ashishs@allenai.org, peterc@allenai.org, tushark@allenai.org
ABSTRACT
We study the task of prompting large-scale language models to perform multi-
step reasoning. Existing work shows that when prompted with a chain of
thoughts (CoT), sequences of short sentences describing intermediate reasoning
steps towards a final answer, large language models can generate new reasoning
chains and predict answers for new inputs. A central question is which reasoning
examples make the most effective prompts. In this work, we propose complexity-
based prompting, a simple and effective example selection scheme for multi-step
reasoning. We show that prompts with higher reasoning complexity, i.e., chains
with more reasoning steps, achieve substantially better performance on multi-
step reasoning tasks over strong baselines. We further extend our complexity-
based criteria from prompting (selecting inputs) to decoding (selecting outputs),
where we sample multiple reasoning chains from the model, then choose the
majority of generated answers from complex reasoning chains (over simple
chains). When used to prompt GPT-3 and Codex, our approach substantially
improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA)
performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and
two BigBenchHard tasks (Date Understanding and Penguins), with an average
+5.3 and up to +18 accuracy improvements. Compared with existing example
selection schemes like manual tuning or retrieval-based selection, selection based
on reasoning complexity is intuitive, easy to implement, and annotation-efficient.
Further results demonstrate the robustness of performance gains from complex
prompts under format perturbation and distribution shift.
1 INTRODUCTION
We consider the problem of prompting large language models for multi-step reasoning. Recent
breakthroughs (Wei et al., 2022b; Wang et al., 2022b) show that language models, when large
enough (>100B parameters), exhibit the emergent ability (Wei et al., 2022a) of performing complex
multi-step reasoning when provided with only a few reasoning examples. In the regime of large
models, prompting achieves comparable or even better performance than full training set finetuning
while being substantially more sample-efficient (Wei et al., 2022b; Kojima et al., 2022; Lewkowycz
et al., 2022). In particular, Wei et al. (2022b) show that chain-of-thoughts (CoT) prompts, sequences
of short sentences describing intermediate reasoning steps towards final answers (Fig. 1A), can elicit
strong reasoning capabilities from large language models for complex tasks such as math problems.
This work studies example selection in chain-of-thoughts multi-step reasoning. Example selection
is a central problem in the prompting literature (Liu et al., 2022; Rubin et al., 2022; Su et al., 2022;
Lazaridou et al., 2022). It asks what instances make the best prompts for solving the tasks of interest.
For CoT prompting, example selection is further related to annotation efficiency, as CoT requires
manually-annotated reasoning chains. For datasets where reasoning annotations are easy to obtain,
one may want to know which annotated chains make the best prompt; if the annotations are hard to
obtain, one may identify the best cases to annotate, rather than annotating the entire dataset.
We propose complexity-based prompting, a new example selection scheme for chain-of-thoughts
multi-step reasoning. Existing sample selection methods are usually based on manual trial and error (Wei
*Work done during internship at Allen Institute for AI.
[Figure 1 image, three panels. A: workflow of chain-of-thoughts prompting. B: example complex chain with 9 reasoning steps. C: complexity-based consistency (majority voting over complex chains). See caption below.]
Figure 1: A: Chains of thoughts (in blue) are intermediate reasoning steps towards a final answer.
The input of CoT prompting is a stack of a few (often 8) CoT cases before a test question. Then the
language model continues generating an output CoT for the test question. B: Chains of higher
reasoning complexity are chains with more reasoning steps (9 steps in this case, vs. only 2 steps in
subfigure A). C: During decoding, we sample N reasoning chains from the language model (N = 5
here), and take the majority answer over the K (K = 3 here) most complex generated chains.
et al., 2022b), heuristic rules (Wallace et al., 2019), optimization and search (Shin et al., 2020), or
retrieval from a large training set (Rubin et al., 2022). Different from these schemes, complexity-
based prompting chooses examples with complex reasoning chains, i.e., chains with more reasoning
steps, as the prompt. Fig. 1A shows a simple example with 2 reasoning steps, whereas the example in
subfigure B is a complex case with 9 reasoning steps. As we will show in the experiments (§4.2),
the reasoning performance of GPT-3 175B (Brown et al., 2020) clearly improves with the increased
input prompt complexity, where complex prompts achieve better performance than simple prompts.
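To make the selection rule concrete, the following is a minimal sketch (not the authors' code) of choosing prompt examples by complexity, assuming each annotated example stores its reasoning chain with one step per line; the Example dataclass and the step-counting heuristic are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    chain: str   # annotated reasoning chain, one step per line (assumption)
    answer: str

def num_steps(example: Example) -> int:
    # Proxy for reasoning complexity: count the non-empty lines in the chain.
    return sum(1 for line in example.chain.splitlines() if line.strip())

def select_complex_prompt(examples: list[Example], k: int = 8) -> str:
    # Keep the k examples with the most reasoning steps and stack them into
    # a single prompt, each followed by its reasoning chain and final answer.
    chosen = sorted(examples, key=num_steps, reverse=True)[:k]
    blocks = [
        f"Question: {ex.question}\n{ex.chain}\nThe answer is {ex.answer}"
        for ex in chosen
    ]
    return "\n\n".join(blocks)
```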
We further extend the complexity-based selection criteria from the input space (the prompts) to the
output space (reasoning chains generated by the language model). Our extension is based on the idea
of self-consistency (Wang et al., 2022b;a), which samples multiple reasoning chains (instead of
using greedy decoding) from the model, possibly leading to different answers, and then takes the
majority of the generated answers. Here we propose complexity-based consistency, where instead of
taking a majority vote among all generated chains, we vote over the top K most complex chains, as shown
in Fig. 1C. In §4.2, we will show that complexity-based consistency leads to further performance
gains, on top of the existing gain from complexity-based prompting.
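As an illustration, here is a minimal sketch of this voting rule (again not the authors' code); it assumes each sampled output is a string whose reasoning steps are newline-separated and that the caller supplies an answer extractor.

```python
from collections import Counter
from typing import Callable

def complexity_based_vote(
    sampled_chains: list[str],
    extract_answer: Callable[[str], str],
    k: int = 3,
) -> str:
    # Proxy for complexity: number of non-empty lines in a generated chain.
    def n_steps(chain: str) -> int:
        return sum(1 for line in chain.splitlines() if line.strip())

    # Keep only the K most complex chains, then take a majority vote over
    # their final answers (Fig. 1C uses N = 5 sampled chains and K = 3).
    top_k = sorted(sampled_chains, key=n_steps, reverse=True)[:k]
    votes = Counter(extract_answer(chain) for chain in top_k)
    return votes.most_common(1)[0][0]
```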
Putting everything together, our methods achieve new state-of-the-art performance on three math
benchmarks (GSM8K, MultiArith, and MathQA) and two BigBenchHard tasks (Date Understanding
and Penguins) with substantial performance gains over Wei et al. (2022b). We show that, compared
with existing sample selection schemes, complexity-based prompting achieves better performance
in most cases (see §4.2). Furthermore, performance gains from complex samples are consistent
across different prompt distributions (in-distribution, transfer, and noisily labeled; see §4.2) and are
also consistent with regard to alternative proxies for complexity (e.g., question or formula lengths,
see §4.3) when the dataset does not contain annotated reasoning chains. A careful analysis shows
that the number of reasoning steps is the most prominent factor, over confounders like prompt
lengths or the number of input cases (§4.3). We hope this work will open new research possibilities
in in-context learning, large language models, and multi-step reasoning.
2 RELATED WORK
Emergent Abilities and Multi-Step Reasoning With the recent trend in scaling language
models (Brown et al., 2020; Chowdhery et al., 2022), a central question is what unique abilities
emerge as models become large (Kaplan et al., 2020; Wei et al., 2022a). Generally, the ability to
follow the format of given prompts (typically few-shot) and thus solve the corresponding tasks (also
referred to as in-context learning), is something that large language models are particularly skilled
at (Shin et al., 2020; Liu et al., 2021). Among the wide spectrum of language understanding tasks,
we are particularly interested in multi-step reasoning for two reasons: (1) multi-step reasoning is a
task where large models substantially outperform smaller models (Wei et al., 2022b), whereas
performance gains on tasks like sentiment classification can be very limited even with large
models (Shin et al., 2020); (2) multi-step reasoning is where few-shot prompting starts to
outperform full-training-set fine-tuning, even when fine-tuning is conducted on the same large
model (Lewkowycz et al., 2022). This work takes an important step forward in multi-step reasoning
by showing the critical role of prompt complexity.
Chain-of-Thoughts Reasoning A prominent work demonstrating the multi-step reasoning of
language models is chain-of-thoughts prompting (Fig. 1A), proposed by Wei et al. (2022b). They
show that the reasoning ability can be elicited by chain-of-thoughts prompting, but not by standard
prompting where an answer directly follows a question without intermediate reasoning steps. Further
works show that CoT can be improved by self-consistency (Wang et al., 2022b), pretraining the model
with LaTeX-formatted data (Lewkowycz et al., 2022), context selection (Creswell et al., 2022), or even
adding certain magic phrases like “Let’s think step by step” (Kojima et al., 2022). The original CoT
paper (Wei et al., 2022b) uses 8 manually written examples as the prompt, which are reused by most
follow-up works. Our work sits in the context of CoT reasoning and proposes a new complexity-based
prompt selection scheme that substantially outperforms the original CoT prompts.
Example Selection for Prompting Designing prompts can be challenging due to instability:
multiple works have shown that performance is sensitive to prompt, task, dataset, and model
changes (Zhao et al., 2021; Lu et al., 2022; Su et al., 2022). Despite works on automatic prompt
searching (which is more suitable for smaller models, e.g., Shin et al., 2020; Li & Liang, 2021),
prompt engineering for large models currently remains a community-wide collective trial-and-error
effort (there is even a prompt marketplace named PromptBase). The difficulty is that it is extremely
hard to extract, from empirical observations, generalizable regularities that can form effective selection
criteria. One notable exception is similarity-based prompt selection, which retrieves the most similar
training instances as the prompt for a given test case (Rubin et al., 2022). Yet for CoT prompting,
retrieving different prompts for different test cases requires reasoning chain annotations for the
whole training set, which compromises the advantage of being few-shot. Given this background,
our core contribution is identifying complexity as an effective and robust selection criterion that, in
many cases, outperforms existing prompt selection schemes while being annotation-efficient.
Relation to Classical Semantic Parsing The procedure of chain of thoughts prompting is
conceptually similar to classical semantic parsing, where one generates a logical form and then executes
it against a knowledge base to reach a final answer (Liang, 2016; Cheng et al., 2019). The practice
of sampling then voting is also similar to marginalizing out semantic parses (Yin et al., 2018).
Further works link in-context learning to classical Bayesian inference (Wei et al., 2021; Xie
et al., 2022). From our perspective, we tend to view chains of thoughts as flexible,
language-model-styled “logical forms” that are “executed” by the language
model itself. We leave further study on connecting classical parsing and CoT to future work.
3 COMPLEXITY-BASED PROMPTING
We study multi-step reasoning tasks and use math word problems, i.e., mathematical problems
expressed in natural language, as our testbed. The task, measured by solve rate (accuracy),
is to predict the answer (typically a number) to a given math word problem via intermediate steps.
We follow the chain-of-thoughts prompting framework and compare all prompting schemes using
GPT-3 text-davinci-002 and Codex code-davinci-002. An example problem, as well
as the chain-of-thoughts workflow, is shown in Fig. 1A. The input is a stack of a few (often 8) CoT
cases followed by a test question; the language model then continues by generating an output CoT for
the test question. Our goal is to improve the reasoning accuracy by identifying and exploiting more
effective input and output reasoning chains.
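For concreteness, a hedged sketch of this workflow is shown below; lm_generate is a placeholder for whichever completion API is used (e.g., for text-davinci-002 or code-davinci-002) and is an assumption, as is the regular expression used to read off the final number.

```python
import re
from typing import Optional

def build_cot_input(cot_prompt: str, test_question: str) -> str:
    # Stack the (often 8) CoT cases before the test question; the model is
    # expected to continue with a new reasoning chain and a final answer.
    return f"{cot_prompt}\n\nQuestion: {test_question}\n"

def parse_answer(generation: str) -> Optional[str]:
    # Read the number after "The answer is", following the prompt format.
    match = re.search(r"The answer is\s*(-?[\d.,]+)", generation)
    return match.group(1).rstrip(".") if match else None

# Hypothetical usage, assuming an lm_generate(prompt, n, temperature) helper
# that returns n sampled completions from the language model:
#   completions = lm_generate(build_cot_input(prompt, question), n=5, temperature=0.7)
#   answers = [parse_answer(c) for c in completions]
```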