AUTOMATIC CHAIN OF THOUGHT PROMPTING
IN LARGE LANGUAGE MODELS
Zhuosheng Zhang†, Aston Zhang‡, Mu Li‡, Alex Smola‡
†Shanghai Jiao Tong University, ‡Amazon Web Services
ABSTRACT
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning
steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting.
CoT prompting has two major paradigms. One leverages a simple prompt like “Let’s think step by
step” to facilitate step-by-step thinking before answering a question. The other uses a few manual
demonstrations one by one, each composed of a question and a reasoning chain that leads to an
answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific
demonstrations one by one. We show that such manual efforts may be eliminated by leveraging
LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one
by one, i.e., let’s think not just step by step, but also one by one. However, these generated chains
often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for
automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-
CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations.
On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the
performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot.
1 Introduction
Large language models (LLMs) [Brown et al., 2020, Thoppilan et al., 2022, Rae et al., 2021, Chowdhery et al., 2022]
have performed impressively on complex reasoning tasks by decomposing multi-step problems into intermediate
steps before producing the answer. This reasoning process is elicited by a very recent technique: chain-of-thought
(CoT) prompting [Wei et al., 2022a].
CoT prompting can be categorized into two major paradigms. One adds a single prompt like “Let’s think step by step” after the test question to facilitate the reasoning chains in LLMs [Kojima et al., 2022]. Since this prompting paradigm is task-agnostic and does not need input-output demonstrations, it is called Zero-Shot-CoT (left of Figure 1). With Zero-Shot-CoT, LLMs have been shown to be decent zero-shot reasoners. The other paradigm is few-shot prompting with manual reasoning demonstrations one by one [Wei et al., 2022a]. Each demonstration has a question and a reasoning chain. A reasoning chain is composed of a rationale (a series of intermediate reasoning steps) and an expected answer. With all the demonstrations being manually designed, this paradigm is referred to as Manual-CoT (right of Figure 1).
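To make the two paradigms concrete, the sketch below builds both prompt formats in Python. It is a minimal illustration rather than the authors' released code: llm(prompt) is a hypothetical text-completion wrapper (e.g., around text-davinci-002), and Zero-Shot-CoT is shown with its two stages of rationale generation and answer extraction as in Figure 1.

# Minimal sketch of the two CoT prompting paradigms.
# Assumption: llm(prompt) is a hypothetical wrapper that returns the completion
# of a large language model (e.g., text-davinci-002); it is not defined here.

def zero_shot_cot(llm, question: str) -> str:
    """Zero-Shot-CoT: a single task-agnostic prompt, used in two stages."""
    # Stage 1: rationale generation.
    rationale = llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: answer extraction, conditioned on the generated rationale.
    return llm(
        f"Q: {question}\nA: Let's think step by step. {rationale}\n"
        "Therefore, the answer (arabic numerals) is"
    )

def manual_cot(llm, demos: list[tuple[str, str]], question: str) -> str:
    """Manual-CoT: hand-crafted (question, reasoning chain) demonstrations."""
    prompt = "".join(f"Q: {q}\nA: {chain}\n\n" for q, chain in demos)
    return llm(prompt + f"Q: {question}\nA:")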
In practice, Manual-CoT has obtained stronger performance than Zero-Shot-CoT [Wei et al., 2022a, Kojima et al., 2022]. However, this superior performance hinges on hand-crafting effective demonstrations, which involves nontrivial effort in designing both the questions and their reasoning chains. Moreover, the human effort grows with task-specific demonstrations: different tasks, such as arithmetic [Roy and Roth, 2015] and commonsense reasoning [Talmor et al., 2019], call for different kinds of demonstrations.
To eliminate such manual designs, we advocate another paradigm, Auto-CoT, that automatically constructs demonstrations with questions and reasoning chains. Specifically, Auto-CoT leverages LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one, i.e., let’s think not just step by step, but also one by one.
Work done during an internship at Amazon Web Services. Correspondence to Zhuosheng Zhang <zhangzs@sjtu.edu.cn> and Aston Zhang <astonz@amazon.com>.
arXiv:2210.03493v1 [cs.CL] 7 Oct 2022
Figure 1: Zero-Shot-CoT [Kojima et al., 2022] (using the “Let’s think step by step” prompt) and Manual-CoT [Wei et al., 2022a] (using manually designed demonstrations one by one) with example inputs and outputs of an LLM.
However, the generated chains may contain mistakes, and we find that this challenge cannot be effectively addressed by simple solutions. For example, given a test question of a dataset, retrieving semantically similar questions and invoking Zero-Shot-CoT to generate reasoning chains will fail. Although LLMs are decent zero-shot reasoners, they are not perfect: Zero-Shot-CoT can still make mistakes in reasoning chains.
To mitigate the effect of reasoning chain mistakes from Zero-Shot-CoT, our analysis shows that diversity of
demonstration questions is the key. Based on this insight, we propose an Auto-CoT method to automatically construct
demonstrations. Auto-CoT consists of two main steps. First, partition questions of a given dataset into a few clusters.
Second, select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with
simple heuristics.
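As a rough illustration of these two steps (not the authors' released implementation), the following Python sketch clusters questions with Sentence-BERT embeddings and k-means, picks a representative question per cluster, and labels it with a Zero-Shot-CoT chain. The llm() callable and the all-MiniLM-L6-v2 encoder checkpoint are assumptions, and the paper's simple selection heuristics are omitted.

# Sketch of the two Auto-CoT steps: (1) cluster questions, (2) build one
# demonstration per cluster with Zero-Shot-CoT. Assumes llm(prompt) -> str.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def auto_cot_demos(llm, questions, k=8):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder choice
    emb = encoder.encode(questions)
    km = KMeans(n_clusters=k, random_state=0).fit(emb)

    demos = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # Representative question: the one closest to the cluster centre.
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        q = questions[idx[np.argmin(dists)]]
        chain = llm(f"Q: {q}\nA: Let's think step by step.")   # Zero-Shot-CoT
        demos.append((q, "Let's think step by step. " + chain.strip()))
    return demos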
We evaluate Auto-CoT on ten benchmark reasoning tasks including: (i) arithmetic reasoning (MultiArith [Roy and Roth,
2015], GSM8K [Cobbe et al., 2021], AQUA-RAT [Ling et al., 2017], SVAMP [Patel et al., 2021]); (ii) commonsense
reasoning (CSQA [Talmor et al., 2019], StrategyQA [Geva et al., 2021]); (iii) symbolic reasoning (Last Letter
Concatenation, Coin Flip) [Wei et al., 2022a]. Experimental results show that with GPT-3, Auto-CoT consistently
matches or exceeds the performance of Manual-CoT that requires manual designs. This indicates that LLMs can
perform CoT reasoning by automatically constructing demonstrations.
2 Related Work
This section reviews two lines of research that form the basis of this work: chain-of-thought (CoT) prompting for
multi-step reasoning and in-context learning for inducing LLMs to learn from demonstrations.
2.1 Chain-of-thought Prompting
CoT prompting is a gradient-free technique of inducing LLMs to produce intermediate reasoning steps that lead to the
final answer. Wei et al. [2022a] formally studied the topic of CoT prompting in language models. This technique elicits
LLMs to generate a coherent series of intermediate reasoning steps that lead to the final answer to a question. Studies
have shown that LLMs can perform CoT reasoning with zero-shot prompting (Zero-Shot-CoT) [Kojima et al., 2022] or
manually written few-shot demonstrations (Manual-CoT) [Wei et al., 2022a].
Zero-Shot-CoT. Kojima et al. [2022] showed that LLMs are decent zero-shot reasoners whose generated rationales
have already reflected the CoT reasoning. This finding inspires our work to leverage the self-generated rationales for
demonstrations. Generating rationales by LLMs was shown to be practical in a recent work [Zelikman et al., 2022]. In their work, an LLM is prompted to generate rationales, and those rationales that lead to the correct answer are selected.
The selection requires a training dataset of questions with annotated answers. In contrast, our work considers a more
challenging scenario where only a set of test questions are given (without a training dataset), following CoT prompting
studies by Wei et al. [2022a] and Kojima et al. [2022].
Manual-CoT. Manual-CoT achieves stronger performance by eliciting the CoT reasoning ability with effective
manual demonstrations. The demonstrations for the reasoning process are manually designed. However, the human
efforts in designs of both questions and their reasoning chains are nontrivial. Instead of addressing this limitation, recent
studies mainly focus on hand-crafting more complex demonstrations or leveraging ensemble-like methods. One trend is
problem decomposition. In least-to-most prompting [Zhou et al., 2022], complex problems are reduced to sub-problems,
and then the sub-problems are solved sequentially. The other trend is to vote over multiple reasoning paths for a test
question. Wang et al. [2022a] introduced a self-consistency decoding strategy to sample multiple outputs of LLMs and
then took a majority over the final answers. Wang et al. [2022b] and Li et al. [2022] introduced randomness in the input
space to produce more diverse outputs for voting. They used manually designed demonstrations as the seed set and generated additional rationales: leaving one question out of the seed set, they used the remaining demonstrations to have the LLM generate rationales for that question. Unlike the aforementioned research lines that rely on manually designed demonstrations, our work aims to eliminate manual designs while achieving competitive performance.
2.2 In-Context Learning
CoT prompting is closely related to in-context learning (ICL) [Radford et al., 2019, Brown et al., 2020]. ICL enables
LLMs to perform a target task by feeding a few prompted examples as part of the input. Without gradient update, ICL
allows a single model to perform various tasks universally. There are various research lines to improve the performance
of ICL: (i) retrieving demonstrations related to the test instance, where the popular practice is to dynamically retrieve related training examples for a given test input [Rubin et al., 2022, Su et al., 2022]; (ii) augmenting with fine-grained
information, such as incorporating task instruction [Mishra et al., 2022, Wei et al., 2022b, Sanh et al., 2022]; (iii)
manipulating output probabilities of LLMs instead of directly computing the likelihood of target labels [Holtzman et al.,
2021, Zhao et al., 2021, Min et al., 2022a].
Despite the success of ICL, studies [Liu et al., 2022a, Lu et al., 2022] have shown that the strength of ICL may vary
widely depending on the choice of in-context demonstrations [Liu et al., 2022b]. In detail, the formatting of the
prompt, such as wording or order of demonstrations, may lead to performance fluctuations [Webson and Pavlick,
2022, Zhao et al., 2021]. A recent work [Min et al., 2022b] even questioned the necessity of ground-truth input-
output mapping: using incorrect labels in the examples only marginally lowers the performance. However, the
existing analysis of ICL is mainly based on standard classification and multi-choice datasets that only have simple <input→output> mappings. We discover that those findings may not be applicable to the CoT prompting scenario with more complex <input→rationale→output> mappings. For example, mistakes in either the <input→rationale> mapping or the <rationale→output> mapping lead to a dramatic performance drop (Appendix A.1).
3 Challenge of Auto-CoT
As just discussed, the performance of ICL hinges on hand-crafted demonstrations. As reported in Manual-CoT [Wei
et al., 2022a], using demonstrations written by different annotators brings up to 28.2% accuracy disparity in a symbolic
reasoning task, while changing the order of demonstrations results in less than 2% changes in most tasks. This suggests
that the key challenge of Auto-CoT lies in automatically constructing demonstrations with good questions and their
reasoning chains.
Recall that Manual-CoT hand-crafts a few (e.g., 8) questions in demonstrations. With similarity-based retrieval methods
being widely adopted for prompting LLMs [Rubin et al., 2022, Su et al., 2022], a promising candidate solution is to
sample demonstration questions using similarity-based retrieval. We follow the more challenging assumption in CoT
studies [Wei et al., 2022a, Kojima et al., 2022] that only a set of test questions are given (without a training dataset).
Following Liu et al. [2022a], we use Sentence-BERT [Reimers and Gurevych, 2019] to encode questions. For each question q^test in a test dataset, we sample demonstration questions q_i^demo (i = 1, ..., k) from the rest of the questions. We design a Retrieval-Q-CoT method to retrieve the top-k (e.g., k = 8) similar questions based on cosine similarity. To compare with this similarity-based method, we also test a relatively more diversity-based method: Random-Q-CoT, which randomly samples k other test questions for each test question.

Both Retrieval-Q-CoT and Random-Q-CoT invoke Zero-Shot-CoT [Kojima et al., 2022] to generate the reasoning chain c_i^demo (rationale and answer) for each sampled question q_i^demo, as LLMs are decent zero-shot reasoners [Kojima et al., 2022]. We use GPT-3 [Brown et al., 2020] with 175B parameters (text-davinci-002) for the LLM unless otherwise stated. On a high level, both Retrieval-Q-CoT and Random-Q-CoT take the concatenation of [q_i^demo, c_i^demo] pairs (i = 1, ..., k) and q^test as input to predict the reasoning chain for q^test, which contains the answer in the end (like the right of Figure 1).
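The sketch below illustrates how such a comparison can be set up. It again assumes a hypothetical llm() completion wrapper and an assumed Sentence-BERT checkpoint; it retrieves the top-k questions by cosine similarity for Retrieval-Q-CoT, samples k questions uniformly for Random-Q-CoT, labels them with Zero-Shot-CoT, and concatenates the resulting demonstrations with the test question.

# Sketch of Retrieval-Q-CoT vs. Random-Q-CoT demonstration construction.
# Assumptions: llm(prompt) -> str is a hypothetical completion wrapper and
# "all-MiniLM-L6-v2" stands in for the Sentence-BERT encoder.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_prompt(llm, q_test, test_questions, k=8, mode="retrieval"):
    pool = [q for q in test_questions if q != q_test]
    if mode == "retrieval":                      # Retrieval-Q-CoT
        emb = encoder.encode([q_test] + pool)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = emb[1:] @ emb[0]                  # cosine similarity to q_test
        sampled = [pool[i] for i in np.argsort(-sims)[:k]]
    else:                                        # Random-Q-CoT
        sampled = random.sample(pool, k)

    demos = []
    for q in sampled:
        # Zero-Shot-CoT supplies the reasoning chain (rationale + answer).
        chain = llm(f"Q: {q}\nA: Let's think step by step.")
        demos.append((q, "Let's think step by step. " + chain.strip()))

    # Concatenate the k demonstrations with the test question, as in Manual-CoT.
    return "".join(f"Q: {q}\nA: {c}\n\n" for q, c in demos) + f"Q: {q_test}\nA:"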
Table 1: Accuracy (%) of different sampling methods. Symbol † indicates using training sets with annotated reasoning chains.

Method           MultiArith  GSM8K  AQuA
Zero-Shot-CoT    78.7        40.7   33.5
Manual-CoT       91.7        46.9   35.8
Random-Q-CoT     86.2        47.6†  36.2†
Retrieval-Q-CoT  82.8        48.0†  39.7†
To our surprise, Retrieval-Q-CoT underperforms Random-Q-CoT
on the arithmetic dataset MultiArith [Roy and Roth, 2015] (Table
1). Note that the retrieval methods were originally proposed for tasks with annotated labels [Rubin et al., 2022, Su et al., 2022]; invoking Zero-Shot-CoT, however, does not guarantee entirely correct reasoning chains. Thus, we hypothesize that the inferior performance of Retrieval-Q-CoT is caused by incorrect reasoning chains from Zero-Shot-CoT. To test this hypothesis, we experiment with Retrieval-Q-CoT on two other datasets, GSM8K [Cobbe et al., 2021] and AQuA [Ling et al., 2017], that have training sets with annotated reasoning chains. The results are shown with † in Table 1. Under the setting
with annotated reasoning chains, Retrieval-Q-CoT even outperforms
Manual-CoT. The result indicates that Retrieval-Q-CoT is effective
when human annotations are available.
Although human annotations are useful, such manual efforts are nontrivial. Without annotations, however, automatically generating reasoning chains via Zero-Shot-CoT underperforms Manual-CoT, especially when the challenge of question sampling is not addressed. To design a more effective Auto-CoT, we need to understand its challenges better.
3.1 Retrieval-Q-CoT Fails due to Misleading by Similarity
Since Retrieval-Q-CoT uses a few prompting demonstrations like in Manual-CoT, Retrieval-Q-CoT is expected to
perform competitively as well. However, reasoning chains (both rationales and answers) in Retrieval-Q-CoT are
generated by Zero-Shot-CoT: they may have mistakes that lead to wrong answers. Let us simply call demonstrations with wrong answers wrong demonstrations. Intuitively, after questions similar to a test question are retrieved, wrong
demonstrations caused by Zero-Shot-CoT may mislead the same LLM to reason similarly with a wrong answer (e.g.,
replicating mistakes) for the test question. We refer to this phenomenon as misleading by similarity. We will investigate
whether misleading by similarity contributes to the inferior performance of Retrieval-Q-CoT.
Figure 2: Unresolving rate (%) of Retrieval-Q-CoT and Random-Q-CoT.
To begin with, we invoke Zero-Shot-CoT on all the 600 questions from the MultiArith dataset. Among them, we collect the 128 questions (denoted as Q) where Zero-Shot-CoT generates wrong answers (error rate: 21.3% = 128/600). As mentioned above, with extra demonstrations, Retrieval-Q-CoT and Random-Q-CoT are expected to perform more competitively than Zero-Shot-CoT. Among the questions in Q where Zero-Shot-CoT fails, we call those where Retrieval-Q-CoT or Random-Q-CoT still fails their unresolved questions. We divide the number of unresolved questions by 128 (the size of Q) to calculate the unresolving rate. A higher unresolving rate means that a method is more likely to repeat the same kinds of mistakes as Zero-Shot-CoT.

Figure 2 shows that the unresolving rate of Retrieval-Q-CoT (46.9%) is much higher than that of Random-Q-CoT (25.8%). It indicates that, with similar questions being sampled for test questions, Retrieval-Q-CoT is negatively affected by misleading by similarity.
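In code, the unresolving rate is a simple set computation; the sketch below (with made-up variable names) restates it.

def unresolving_rate(zero_shot_wrong: set, method_wrong: set) -> float:
    """Share of questions in Q (wrong under Zero-Shot-CoT) that a method
    (Retrieval-Q-CoT or Random-Q-CoT) still answers incorrectly."""
    unresolved = zero_shot_wrong & method_wrong
    return 100.0 * len(unresolved) / len(zero_shot_wrong)

# Example with the MultiArith numbers above: |Q| = 128; if a method leaves
# 60 of them wrong, its unresolving rate is 100 * 60 / 128 = 46.9%.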
To show that unresolved questions of Retrieval-Q-CoT tend to be similar, we present a case study in Table 2. In the left
part, the retrieved demonstration questions are similar to the test question and ask “how long will it take him to cook the rest?” The reasoning chains generated by Zero-Shot-CoT produce answers regarding the total cooking time of all the potatoes instead of “the rest”. Following the demonstrations, Retrieval-Q-CoT also fails by misunderstanding the meaning of “the rest”. In contrast, Random-Q-CoT correctly understands “the rest” without making similar mistakes in its demonstrations, thanks
to relatively more diverse (random) demonstrations.
3.2 Errors Frequently Fall into the Same Cluster
Motivated by the observations in Table 2, we use k-means to partition all the 600 test questions into k = 8 clusters, where each cluster contains similar questions.²

²We use Sentence-BERT [Reimers and Gurevych, 2019] to encode questions and apply k-means for clustering.
Table 2: Examples of Retrieval-Q-CoT and Random-Q-CoT for constructing demonstrations and prediction. The upper part contains the automatically constructed demonstrations, and the lower part contains test questions along with predicted reasoning chains (✗: wrong answer; ✓: correct answer).

Retrieval-Q-CoT (constructed demonstrations):
Q: A chef needs to cook 9 potatoes. He has already cooked 7. If each potato takes 3 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 7 potatoes. That means it has taken him 7 * 3 minutes to cook those 7 potatoes. That means it will take him 3 more minutes to cook each of the remaining 2 potatoes. That means it will take him a total of 7 * 3 + 2 * 3 = 21 minutes to cook all 9 potatoes. The answer is 21. ✗
Q: A chef needs to cook 13 potatoes. He has already cooked 5. If each potato takes 6 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 5 potatoes. That means it has taken him 30 minutes so far. There are 13 potatoes in total. 5 have been cooked so far. That means there are 8 left to cook. Each potato takes 6 minutes to cook. That means it will take the chef 48 minutes to cook the 8 remaining potatoes. In total, it will take the chef 78 minutes to cook all 13 potatoes. The answer is 78. ✗

Retrieval-Q-CoT (test question and predicted reasoning chain):
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 8 potatoes. That means it has taken him 8 * 9 minutes to cook those 8 potatoes. That means it will take him 9 minutes to cook each of the remaining 7 potatoes. That means it will take him a total of 8 * 9 + 7 * 9 = 135 minutes to cook all 15 potatoes. The answer is 135. ✗

Random-Q-CoT (constructed demonstrations):
Q: Wendy uploaded 79 pictures to Facebook. She put 44 pics into one album and put the rest into 5 different albums. How many pictures were in each album?
A: Let’s think step by step. Wendy uploaded 79 pictures to Facebook. She put 44 pictures into one album. That means she put the rest of the pictures into 5 different albums. That means each album would have (79 - 44) / 5 = 7 pictures in each album. The answer is 7. ✓
Q: At a restaurant each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?
A: Let’s think step by step. There are 8 kids, so that means there are 7 adults. Each adult meal costs $5, so that means that the 7 adults will cost $5 * 7 = $35. Now we just need to add the two numbers together. $35 + 8 = $43. So it would cost $43 for the group to eat. The answer is 43. ✗

Random-Q-CoT (test question and predicted reasoning chain):
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef needs to cook 15 potatoes. He has already cooked 8. That means he has to cook 15 - 8 = 7 more potatoes. Each potato takes 9 minutes to cook. That means it will take him 9 * 7 = 63 minutes to cook the rest of the potatoes. The answer is 63. ✓
With these clusters and the reasoning chains generated by Zero-Shot-CoT (in Section 3.1), we now examine whether certain clusters contain questions where Zero-Shot-CoT frequently fails. Thus, we calculate the error rate (questions with wrong Zero-Shot-CoT answers / total questions) for each cluster.
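A sketch of this per-cluster error-rate computation, assuming the cluster labels come from the k-means step in footnote 2 and is_wrong is a boolean array marking questions that Zero-Shot-CoT answers incorrectly:

import numpy as np

def cluster_error_rates(labels: np.ndarray, is_wrong: np.ndarray, k: int = 8) -> dict:
    """Error rate per cluster: wrong Zero-Shot-CoT answers / questions in the cluster."""
    return {c: 100.0 * is_wrong[labels == c].mean() for c in range(k)}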
Figure 3: Clusters of similar questions (Zero-Shot-CoT error rate, %, per cluster).
As shown in Figure 3, there exists a cluster (Cluster 2) with frequent Zero-Shot-CoT errors (52.3%). The phenomenon could be generic, as Zero-Shot-CoT may lack some skills needed to solve common problems in target tasks.³ For convenience of description, let us call the cluster with the highest error rate the frequent-error cluster (e.g., Cluster 2 in Figure 3). Therefore, the imperfect nature of reasoning chains generated in a zero-shot fashion poses the risk of retrieving multiple similar questions from a frequent-error cluster when using similarity-based methods. For a test question in the frequent-error cluster, Retrieval-Q-CoT more easily constructs demonstrations with multiple similar mistakes. As a result, Retrieval-Q-CoT often makes mistakes similar to those of Zero-Shot-CoT, as reflected by its higher unresolving rate in Figure 2.
3.3 Diversity May Mitigate Misleading by Similarity
The analysis so far compellingly shows that LLMs are still not perfect zero-shot reasoners; thus, we aim to mitigate the
effect of their Zero-Shot-CoT errors, especially to mitigate misleading by similarity in the design of Auto-CoT.
As we will show later (Section 5.5), presenting a small portion of mistakes (e.g., 1 or 2 wrong demonstrations out of 8) would not harm the overall reasoning performance for test questions. Suppose that the questions of all the wrong demonstrations fall into the same frequent-error cluster; then sampling one question from every different cluster will lead to a higher than 7/8 = 87.5% chance of constructing all 8 correct demonstrations. Since different clusters reflect diverse semantics of the questions, this clustering-based sampling method can be considered diversity-based, in sharp contrast to similarity-based Retrieval-Q-CoT. On one hand, sampling questions with diversity may mitigate

³We observe similar phenomena when changing the cluster number or using other datasets (Appendix A.2).