AUTOMATIC CHAIN OF THOUGHT PROMPTING
IN LARGE LANGUAGE MODELS
Zhuosheng Zhang†, Aston Zhang‡, Mu Li‡, Alex Smola‡
†Shanghai Jiao Tong University, ‡Amazon Web Services
ABSTRACT
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning
steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting.
CoT prompting has two major paradigms. One leverages a simple prompt like “Let’s think step by
step” to facilitate step-by-step thinking before answering a question. The other uses a few manual
demonstrations one by one, each composed of a question and a reasoning chain that leads to an
answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific
demonstrations one by one. We show that such manual efforts may be eliminated by leveraging
LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one
by one, i.e., let’s think not just step by step, but also one by one. However, these generated chains
often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for
automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-
CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations.
On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the
performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot.
1 Introduction
Large language models (LLMs) [Brown et al., 2020, Thoppilan et al., 2022, Rae et al., 2021, Chowdhery et al., 2022]
have performed impressively on complex reasoning tasks by decomposing multi-step problems into intermediate
steps before producing the answer. This reasoning process is elicited by a very recent technique: chain-of-thought
(CoT) prompting [Wei et al., 2022a].
CoT prompting can be categorized into two major paradigms. One adds a single prompt like “Let’s think step by step” after the test question to facilitate the reasoning chains in LLMs [Kojima et al., 2022]. Since this prompting paradigm is task-agnostic and does not need input-output demonstrations, it is called Zero-Shot-CoT (left of Figure 1). With Zero-Shot-CoT, LLMs have been shown to be decent zero-shot reasoners. The other paradigm is few-shot prompting with manual reasoning demonstrations one by one [Wei et al., 2022a]. Each demonstration has a question and a reasoning chain. A reasoning chain is composed of a rationale (a series of intermediate reasoning steps) and an expected answer. With all the demonstrations being manually designed, this paradigm is referred to as Manual-CoT (right of Figure 1).
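To make the two paradigms concrete, the sketch below builds both prompt formats in Python. It is a minimal illustration rather than the authors' released code: llm(prompt) is a hypothetical text-completion wrapper (e.g., around text-davinci-002), and Zero-Shot-CoT is shown with its two stages of rationale generation and answer extraction as in Figure 1.

# Minimal sketch of the two CoT prompting paradigms.
# Assumption: llm(prompt) is a hypothetical wrapper that returns the completion
# of a large language model (e.g., text-davinci-002); it is not defined here.

def zero_shot_cot(llm, question: str) -> str:
    """Zero-Shot-CoT: a single task-agnostic prompt, used in two stages."""
    # Stage 1: rationale generation.
    rationale = llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: answer extraction, conditioned on the generated rationale.
    return llm(
        f"Q: {question}\nA: Let's think step by step. {rationale}\n"
        "Therefore, the answer (arabic numerals) is"
    )

def manual_cot(llm, demos: list[tuple[str, str]], question: str) -> str:
    """Manual-CoT: hand-crafted (question, reasoning chain) demonstrations."""
    prompt = "".join(f"Q: {q}\nA: {chain}\n\n" for q, chain in demos)
    return llm(prompt + f"Q: {question}\nA:")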
In practice, Manual-CoT has obtained stronger performance than Zero-Shot-CoT [Wei et al., 2022a, Kojima et al., 2022]. However, this superior performance hinges on hand-crafting effective demonstrations, which involves nontrivial effort in designing both the questions and their reasoning chains. Moreover, the human effort grows with task-specific demonstrations: different tasks, such as arithmetic [Roy and Roth, 2015] and commonsense reasoning [Talmor et al., 2019], call for different kinds of demonstrations.
To eliminate such manual designs, we advocate another paradigm, Auto-CoT, that automatically constructs demonstrations with questions and reasoning chains. Specifically, Auto-CoT leverages LLMs with the “Let’s think step by step” prompt to generate reasoning chains for demonstrations one by one, i.e., let’s think not just step by step, but also one by one.
Work done during an internship at Amazon Web Services. Correspondence to Zhuosheng Zhang <zhangzs@sjtu.edu.cn> and Aston Zhang <astonz@amazon.com>.
arXiv:2210.03493v1 [cs.CL] 7 Oct 2022
Figure 1: Zero-Shot-CoT [Kojima et al., 2022] (using the “Let’s think step by step” prompt) and Manual-CoT [Wei et al., 2022a] (using manually designed demonstrations one by one) with example inputs and outputs of an LLM.
However, the generated chains may contain mistakes, and we find that this challenge cannot be effectively addressed by simple solutions. For example, given a test question of a dataset, retrieving semantically similar questions and invoking Zero-Shot-CoT to generate reasoning chains will fail. Although LLMs are decent zero-shot reasoners, they are not perfect: Zero-Shot-CoT can still make mistakes in reasoning chains.
To mitigate the effect of reasoning chain mistakes from Zero-Shot-CoT, our analysis shows that diversity of
demonstration questions is the key. Based on this insight, we propose an Auto-CoT method to automatically construct
demonstrations. Auto-CoT consists of two main steps. First, partition questions of a given dataset into a few clusters.
Second, select a representative question from each cluster and generate its reasoning chain using Zero-Shot-CoT with
simple heuristics.
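As a rough illustration of these two steps (not the authors' released implementation), the following Python sketch clusters questions with Sentence-BERT embeddings and k-means, picks a representative question per cluster, and labels it with a Zero-Shot-CoT chain. The llm() callable and the all-MiniLM-L6-v2 encoder checkpoint are assumptions, and the paper's simple selection heuristics are omitted.

# Sketch of the two Auto-CoT steps: (1) cluster questions, (2) build one
# demonstration per cluster with Zero-Shot-CoT. Assumes llm(prompt) -> str.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def auto_cot_demos(llm, questions, k=8):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder choice
    emb = encoder.encode(questions)
    km = KMeans(n_clusters=k, random_state=0).fit(emb)

    demos = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # Representative question: the one closest to the cluster centre.
        dists = np.linalg.norm(emb[idx] - km.cluster_centers_[c], axis=1)
        q = questions[idx[np.argmin(dists)]]
        chain = llm(f"Q: {q}\nA: Let's think step by step.")   # Zero-Shot-CoT
        demos.append((q, "Let's think step by step. " + chain.strip()))
    return demos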
We evaluate Auto-CoT on ten benchmark reasoning tasks including: (i) arithmetic reasoning (MultiArith [Roy and Roth,
2015], GSM8K [Cobbe et al., 2021], AQUA-RAT [Ling et al., 2017], SVAMP [Patel et al., 2021]); (ii) commonsense
reasoning (CSQA [Talmor et al., 2019], StrategyQA [Geva et al., 2021]); (iii) symbolic reasoning (Last Letter
Concatenation, Coin Flip) [Wei et al., 2022a]. Experimental results show that with GPT-3, Auto-CoT consistently
matches or exceeds the performance of Manual-CoT that requires manual designs. This indicates that LLMs can
perform CoT reasoning by automatically constructing demonstrations.
2 Related Work
This section reviews two lines of research that form the basis of this work: chain-of-thought (CoT) prompting for
multi-step reasoning and in-context learning for inducing LLMs to learn from demonstrations.
2.1 Chain-of-thought Prompting
CoT prompting is a gradient-free technique of inducing LLMs to produce intermediate reasoning steps that lead to the
final answer. Wei et al. [2022a] formally studied the topic of CoT prompting in language models. This technique elicits
LLMs to generate a coherent series of intermediate reasoning steps that lead to the final answer to a question. Studies
have shown that LLMs can perform CoT reasoning with zero-shot prompting (Zero-Shot-CoT) [Kojima et al., 2022] or
manually written few-shot demonstrations (Manual-CoT) [Wei et al., 2022a].
Zero-Shot-CoT. Kojima et al. [2022] showed that LLMs are decent zero-shot reasoners whose generated rationales
have already reflected the CoT reasoning. This finding inspires our work to leverage the self-generated rationales for
demonstrations. Generating rationales by LLMs was shown to be practical in a recent work [Zelikman et al., 2022]. In their work, an LLM is prompted to generate rationales, and those rationales that lead to the correct answer are selected.
The selection requires a training dataset of questions with annotated answers. In contrast, our work considers a more
challenging scenario where only a set of test questions are given (without a training dataset), following CoT prompting
studies by Wei et al. [2022a] and Kojima et al. [2022].
Manual-CoT. Manual-CoT achieves stronger performance by eliciting the CoT reasoning ability with effective
manual demonstrations. The demonstrations for the reasoning process are manually designed. However, the human
efforts in designs of both questions and their reasoning chains are nontrivial. Instead of addressing this limitation, recent
studies mainly focus on hand-crafting more complex demonstrations or leveraging ensemble-like methods. One trend is
problem decomposition. In least-to-most prompting [Zhou et al., 2022], complex problems are reduced to sub-problems,
and then the sub-problems are solved sequentially. The other trend is to vote over multiple reasoning paths for a test
question. Wang et al. [2022a] introduced a self-consistency decoding strategy to sample multiple outputs of LLMs and
then took a majority over the final answers. Wang et al. [2022b] and Li et al. [2022] introduced randomness in the input
space to produce more diverse outputs for voting. They used manually designed demonstrations as the seed set and generated additional rationales: leaving one question out of the seed set, they used the remaining demonstrations to have the LLM generate rationales for that question. Unlike the aforementioned research lines that rely on manually designed demonstrations, our work aims to eliminate manual designs while achieving competitive performance.
2.2 In-Context Learning
CoT prompting is closely related to in-context learning (ICL) [Radford et al., 2019, Brown et al., 2020]. ICL enables
LLMs to perform a target task by feeding a few prompted examples as part of the input. Without gradient update, ICL
allows a single model to perform various tasks universally. There are various research lines to improve the performance
of ICL: (i) retrieving demonstrations related to the test instance, where the popular practice is to dynamically retrieve related training examples for a given test input [Rubin et al., 2022, Su et al., 2022]; (ii) augmenting with fine-grained
information, such as incorporating task instruction [Mishra et al., 2022, Wei et al., 2022b, Sanh et al., 2022]; (iii)
manipulating output probabilities of LLMs instead of directly computing the likelihood of target labels [Holtzman et al.,
2021, Zhao et al., 2021, Min et al., 2022a].
Despite the success of ICL, studies [Liu et al., 2022a, Lu et al., 2022] have shown that the strength of ICL may vary
widely depending on the choice of in-context demonstrations [Liu et al., 2022b]. In detail, the formatting of the
prompt, such as wording or order of demonstrations, may lead to performance fluctuations [Webson and Pavlick,
2022, Zhao et al., 2021]. A recent work [Min et al., 2022b] even questioned the necessity of ground-truth input-
output mapping: using incorrect labels in the examples only marginally lowers the performance. However, the
existing analysis of ICL is mainly based on standard classification and multi-choice datasets that only have simple <input→output> mappings. We discover that those findings may not be applicable to the CoT prompting scenario with more complex <input→rationale→output> mappings. For example, mistakes in either the <input→rationale> mapping or the <rationale→output> mapping lead to a dramatic performance drop (Appendix A.1).
3 Challenge of Auto-CoT
As just discussed, the performance of ICL hinges on hand-crafted demonstrations. As reported in Manual-CoT [Wei
et al., 2022a], using demonstrations written by different annotators brings up to 28.2% accuracy disparity in a symbolic
reasoning task, while changing the order of demonstrations results in less than 2% changes in most tasks. This suggests
that the key challenge of Auto-CoT lies in automatically constructing demonstrations with good questions and their
reasoning chains.
Recall that Manual-CoT hand-crafts a few (e.g., 8) questions in demonstrations. With similarity-based retrieval methods
being widely adopted for prompting LLMs [Rubin et al., 2022, Su et al., 2022], a promising candidate solution is to
sample demonstration questions using similarity-based retrieval. We follow the more challenging assumption in CoT
studies [Wei et al., 2022a, Kojima et al., 2022] that only a set of test questions are given (without a training dataset).
Following Liu et al. [2022a], we use Sentence-BERT [Reimers and Gurevych, 2019] to encode questions. For each question q^test in a test dataset, we sample demonstration questions q_i^demo (i = 1, ..., k) from the rest of the questions. We design a Retrieval-Q-CoT method to retrieve the top-k (e.g., k = 8) similar questions based on cosine similarity. To compare with this similarity-based method, we also test a relatively more diversity-based method: Random-Q-CoT, which randomly samples k other test questions for each test question.

Both Retrieval-Q-CoT and Random-Q-CoT invoke Zero-Shot-CoT [Kojima et al., 2022] to generate the reasoning chain c_i^demo (rationale and answer) for each sampled question q_i^demo, as LLMs are decent zero-shot reasoners [Kojima et al., 2022]. We use GPT-3 [Brown et al., 2020] with 175B parameters (text-davinci-002) for the LLM unless otherwise stated. On a high level, both Retrieval-Q-CoT and Random-Q-CoT take the concatenation of [q_i^demo, c_i^demo] pairs (i = 1, ..., k) and q^test as input to predict the reasoning chain for q^test, which contains the answer in the end (like the right of Figure 1).
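The sketch below illustrates how such a comparison can be set up. It again assumes a hypothetical llm() completion wrapper and an assumed Sentence-BERT checkpoint; it retrieves the top-k questions by cosine similarity for Retrieval-Q-CoT, samples k questions uniformly for Random-Q-CoT, labels them with Zero-Shot-CoT, and concatenates the resulting demonstrations with the test question.

# Sketch of Retrieval-Q-CoT vs. Random-Q-CoT demonstration construction.
# Assumptions: llm(prompt) -> str is a hypothetical completion wrapper and
# "all-MiniLM-L6-v2" stands in for the Sentence-BERT encoder.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_prompt(llm, q_test, test_questions, k=8, mode="retrieval"):
    pool = [q for q in test_questions if q != q_test]
    if mode == "retrieval":                      # Retrieval-Q-CoT
        emb = encoder.encode([q_test] + pool)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = emb[1:] @ emb[0]                  # cosine similarity to q_test
        sampled = [pool[i] for i in np.argsort(-sims)[:k]]
    else:                                        # Random-Q-CoT
        sampled = random.sample(pool, k)

    demos = []
    for q in sampled:
        # Zero-Shot-CoT supplies the reasoning chain (rationale + answer).
        chain = llm(f"Q: {q}\nA: Let's think step by step.")
        demos.append((q, "Let's think step by step. " + chain.strip()))

    # Concatenate the k demonstrations with the test question, as in Manual-CoT.
    return "".join(f"Q: {q}\nA: {c}\n\n" for q, c in demos) + f"Q: {q_test}\nA:"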
Table 1: Accuracy (%) of different sampling methods. Symbol † indicates using training sets with annotated reasoning chains.

Method           MultiArith  GSM8K  AQuA
Zero-Shot-CoT    78.7        40.7   33.5
Manual-CoT       91.7        46.9   35.8
Random-Q-CoT     86.2        47.6†  36.2†
Retrieval-Q-CoT  82.8        48.0†  39.7†
To our surprise, Retrieval-Q-CoT underperforms Random-Q-CoT
on the arithmetic dataset MultiArith [Roy and Roth, 2015] (Table
1). Note that the retrieval methods were originally proposed for tasks with annotated labels [Rubin et al., 2022, Su et al., 2022]; invoking Zero-Shot-CoT, however, does not guarantee entirely correct reasoning chains. Thus, we hypothesize that the inferior performance of Retrieval-Q-CoT is caused by incorrect reasoning chains from Zero-Shot-CoT. To test this hypothesis, we experiment with Retrieval-Q-CoT on two other datasets, GSM8K [Cobbe et al., 2021] and AQuA [Ling et al., 2017], that have training sets with annotated reasoning chains. The results are shown with † in Table 1. Under the setting
with annotated reasoning chains, Retrieval-Q-CoT even outperforms
Manual-CoT. The result indicates that Retrieval-Q-CoT is effective
when human annotations are available.
Although human annotations are useful, such manual efforts are nontrivial. Without annotations, however, automatically generating reasoning chains via Zero-Shot-CoT underperforms Manual-CoT, especially when the challenge of question sampling is not addressed. To design a more effective Auto-CoT, we need to understand its challenges better.
3.1 Retrieval-Q-CoT Fails due to Misleading by Similarity
Since Retrieval-Q-CoT uses a few prompting demonstrations like in Manual-CoT, Retrieval-Q-CoT is expected to
perform competitively as well. However, reasoning chains (both rationales and answers) in Retrieval-Q-CoT are
generated by Zero-Shot-CoT: they may have mistakes that lead to wrong answers. Let us simply call demonstrations with wrong answers wrong demonstrations. Intuitively, after questions similar to a test question are retrieved, wrong
demonstrations caused by Zero-Shot-CoT may mislead the same LLM to reason similarly with a wrong answer (e.g.,
replicating mistakes) for the test question. We refer to this phenomenon as misleading by similarity. We will investigate
whether misleading by similarity contributes to the inferior performance of Retrieval-Q-CoT.
Figure 2: Unresolving rate (%) of Retrieval-Q-CoT and Random-Q-CoT.
To begin with, we invoke Zero-Shot-CoT on all the 600 questions from the MultiArith dataset. Among them, we collect the 128 questions (denoted as Q) where Zero-Shot-CoT generates wrong answers (error rate: 21.3% = 128/600). As mentioned above, with extra demonstrations, Retrieval-Q-CoT and Random-Q-CoT are expected to perform more competitively than Zero-Shot-CoT. Among the questions in Q where Zero-Shot-CoT fails, we call those where Retrieval-Q-CoT or Random-Q-CoT still fails their unresolved questions. We divide the number of unresolved questions by 128 (the size of Q) to calculate the unresolving rate. A higher unresolving rate means that a method is more likely to repeat the same kinds of mistakes as Zero-Shot-CoT.

Figure 2 shows that the unresolving rate of Retrieval-Q-CoT (46.9%) is much higher than that of Random-Q-CoT (25.8%). It indicates that, with similar questions being sampled for test questions, Retrieval-Q-CoT is negatively affected by misleading by similarity.
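In code, the unresolving rate is a simple set computation; the sketch below (with made-up variable names) restates it.

def unresolving_rate(zero_shot_wrong: set, method_wrong: set) -> float:
    """Share of questions in Q (wrong under Zero-Shot-CoT) that a method
    (Retrieval-Q-CoT or Random-Q-CoT) still answers incorrectly."""
    unresolved = zero_shot_wrong & method_wrong
    return 100.0 * len(unresolved) / len(zero_shot_wrong)

# Example with the MultiArith numbers above: |Q| = 128; if a method leaves
# 60 of them wrong, its unresolving rate is 100 * 60 / 128 = 46.9%.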
To show that unresolved questions of Retrieval-Q-CoT tend to be similar, we present a case study in Table 2. In the left
part, the retrieved demonstration questions are similar to the test question and ask “how long will it take him to cook the rest?” The reasoning chains generated by Zero-Shot-CoT produce answers regarding the total cooking time of all the potatoes instead of “the rest”. Following the demonstrations, Retrieval-Q-CoT also fails by misunderstanding the meaning of “the rest”. In contrast, Random-Q-CoT correctly understands “the rest” without making similar mistakes in its demonstrations, thanks
to relatively more diverse (random) demonstrations.
3.2 Errors Frequently Fall into the Same Cluster
Motivated by the observations in Table 2, we use k-means to partition all the 600 test questions into k = 8 clusters, where each cluster contains similar questions.²

²We use Sentence-BERT [Reimers and Gurevych, 2019] to encode questions and apply k-means for clustering.
Table 2: Examples of Retrieval-Q-CoT and Random-Q-CoT for constructing demonstrations and prediction. The upper part contains the automatically constructed demonstrations, and the lower part contains test questions along with predicted reasoning chains (✗: wrong answer; ✓: correct answer).

Retrieval-Q-CoT (constructed demonstrations):
Q: A chef needs to cook 9 potatoes. He has already cooked 7. If each potato takes 3 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 7 potatoes. That means it has taken him 7 * 3 minutes to cook those 7 potatoes. That means it will take him 3 more minutes to cook each of the remaining 2 potatoes. That means it will take him a total of 7 * 3 + 2 * 3 = 21 minutes to cook all 9 potatoes. The answer is 21. ✗
Q: A chef needs to cook 13 potatoes. He has already cooked 5. If each potato takes 6 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 5 potatoes. That means it has taken him 30 minutes so far. There are 13 potatoes in total. 5 have been cooked so far. That means there are 8 left to cook. Each potato takes 6 minutes to cook. That means it will take the chef 48 minutes to cook the 8 remaining potatoes. In total, it will take the chef 78 minutes to cook all 13 potatoes. The answer is 78. ✗

Retrieval-Q-CoT (test question and predicted reasoning chain):
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 8 potatoes. That means it has taken him 8 * 9 minutes to cook those 8 potatoes. That means it will take him 9 minutes to cook each of the remaining 7 potatoes. That means it will take him a total of 8 * 9 + 7 * 9 = 135 minutes to cook all 15 potatoes. The answer is 135. ✗

Random-Q-CoT (constructed demonstrations):
Q: Wendy uploaded 79 pictures to Facebook. She put 44 pics into one album and put the rest into 5 different albums. How many pictures were in each album?
A: Let’s think step by step. Wendy uploaded 79 pictures to Facebook. She put 44 pictures into one album. That means she put the rest of the pictures into 5 different albums. That means each album would have (79 - 44) / 5 = 7 pictures in each album. The answer is 7. ✓
Q: At a restaurant each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?
A: Let’s think step by step. There are 8 kids, so that means there are 7 adults. Each adult meal costs $5, so that means that the 7 adults will cost $5 * 7 = $35. Now we just need to add the two numbers together. $35 + 8 = $43. So it would cost $43 for the group to eat. The answer is 43. ✗

Random-Q-CoT (test question and predicted reasoning chain):
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef needs to cook 15 potatoes. He has already cooked 8. That means he has to cook 15 - 8 = 7 more potatoes. Each potato takes 9 minutes to cook. That means it will take him 9 * 7 = 63 minutes to cook the rest of the potatoes. The answer is 63. ✓
With these clusters and the reasoning chains generated by Zero-Shot-CoT (in Section 3.1), we now examine whether certain clusters contain questions where Zero-Shot-CoT frequently fails. Thus, we calculate the error rate (questions with wrong Zero-Shot-CoT answers / total questions) for each cluster.
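A sketch of this per-cluster error-rate computation, assuming the cluster labels come from the k-means step in footnote 2 and is_wrong is a boolean array marking questions that Zero-Shot-CoT answers incorrectly:

import numpy as np

def cluster_error_rates(labels: np.ndarray, is_wrong: np.ndarray, k: int = 8) -> dict:
    """Error rate per cluster: wrong Zero-Shot-CoT answers / questions in the cluster."""
    return {c: 100.0 * is_wrong[labels == c].mean() for c in range(k)}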
Figure 3: Clusters of similar questions (Zero-Shot-CoT error rate, %, per cluster).
As shown in Figure 3, there exists a cluster (Cluster 2) with frequent Zero-Shot-CoT errors (52.3%). The phenomenon could be generic, as Zero-Shot-CoT may lack some skills needed to solve common problems in target tasks.³ For convenience of description, let us call the cluster with the highest error rate the frequent-error cluster (e.g., Cluster 2 in Figure 3). Therefore, the imperfect nature of reasoning chains generated in a zero-shot fashion poses the risk of retrieving multiple similar questions from a frequent-error cluster when using similarity-based methods. For a test question in the frequent-error cluster, Retrieval-Q-CoT more easily constructs demonstrations with multiple similar mistakes. As a result, Retrieval-Q-CoT often makes mistakes similar to those of Zero-Shot-CoT, as reflected by its higher unresolving rate in Figure 2.
3.3 Diversity May Mitigate Misleading by Similarity
The analysis so far compellingly shows that LLMs are still not perfect zero-shot reasoners; thus, we aim to mitigate the
effect of their Zero-Shot-CoT errors, especially to mitigate misleading by similarity in the design of Auto-CoT.
As we will show later (Section 5.5), presenting a small portion of mistakes (e.g., 1 or 2 wrong demonstrations out of 8) would not harm the overall reasoning performance for test questions. Suppose that the questions of all the wrong demonstrations fall into the same frequent-error cluster; then sampling one question from every different cluster will lead to a higher than 7/8 = 87.5% chance of constructing all 8 correct demonstrations. Since different clusters reflect diverse semantics of the questions, this clustering-based sampling method can be considered diversity-based, in sharp contrast to similarity-based Retrieval-Q-CoT. On one hand, sampling questions with diversity may mitigate

³We observe similar phenomena when changing the cluster number or using other datasets (Appendix A.2).