
Table 2: Examples of Retrieval-Q-CoT and Random-Q-CoT for constructing demonstrations and prediction. The
upper part contains the automatically constructed demonstrations, and the lower part contains test questions along with
predicted reasoning chains.
Demonstration 1

Retrieval-Q-CoT
Q: A chef needs to cook 9 potatoes. He has already cooked 7. If each potato takes 3 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 7 potatoes. That means it has taken him 7 * 3 minutes to cook those 7 potatoes. That means it will take him 3 more minutes to cook each of the remaining 2 potatoes. That means it will take him a total of 7 * 3 + 2 * 3 = 21 minutes to cook all 9 potatoes. The answer is 21. ✗

Random-Q-CoT
Q: Wendy uploaded 79 pictures to Facebook. She put 44 pics into one album and put the rest into 5 different albums. How many pictures were in each album?
A: Let’s think step by step. Wendy uploaded 79 pictures to Facebook. She put 44 pictures into one album. That means she put the rest of the pictures into 5 different albums. That means each album would have (79-44)/5 = 15 pictures in each album. The answer is 15. ✓

Demonstration 2

Retrieval-Q-CoT
Q: A chef needs to cook 13 potatoes. He has already cooked 5. If each potato takes 6 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 5 potatoes. That means it has taken him 30 minutes so far. There are 13 potatoes in total. 5 have been cooked so far. That means there are 8 left to cook. Each potato takes 6 minutes to cook. That means it will take the chef 48 minutes to cook the 8 remaining potatoes. In total, it will take the chef 78 minutes to cook all 13 potatoes. The answer is 78. ✗

Random-Q-CoT
Q: At a restaurant each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?
A: Let’s think step by step. There are 8 kids, so that means there are 7 adults. Each adult meal costs $5, so that means that the 7 adults will cost $5*7=$35. Now we just need to add the two numbers together. $35+8=43$. So it would cost $43 for the group to eat. The answer is 43. ✗

Test question

Retrieval-Q-CoT
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef has already cooked 8 potatoes. That means it has taken him 8 * 9 minutes to cook those 8 potatoes. That means it will take him 9 minutes to cook each of the remaining 7 potatoes. That means it will take him a total of 8 * 9 + 7 * 9 = 135 minutes to cook all 15 potatoes. The answer is 135. ✗

Random-Q-CoT
Q: A chef needs to cook 15 potatoes. He has already cooked 8. If each potato takes 9 minutes to cook, how long will it take him to cook the rest?
A: Let’s think step by step. The chef needs to cook 15 potatoes. He has already cooked 8. That means he has to cook 15-8=7 more potatoes. Each potato takes 9 minutes to cook. That means it will take him 9*7=63 minutes to cook the rest of the potatoes. The answer is 63. ✓

(in Section 3.1), now we are curious if certain clusters contain questions where Zero-Shot-CoT frequently fails. Thus,
we calculate the error rate (questions with wrong Zero-Shot-CoT answers / total questions) for each cluster.
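
The per-cluster error rate defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `cluster_error_rates` and the toy inputs are assumptions made for this example, not the authors' code or data.

```python
from collections import defaultdict


def cluster_error_rates(cluster_ids, is_correct):
    """Return {cluster_id: error rate in percent}.

    cluster_ids[i] is the cluster assigned to question i, and
    is_correct[i] says whether the Zero-Shot-CoT answer to question i
    was right. Both inputs are hypothetical stand-ins for real data.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for cid, ok in zip(cluster_ids, is_correct):
        totals[cid] += 1
        if not ok:
            wrong[cid] += 1
    # error rate = wrong answers / total questions, per cluster
    return {cid: 100.0 * wrong[cid] / totals[cid] for cid in totals}


# Toy data (made up for illustration): 8 questions in 3 clusters.
cluster_ids = [1, 1, 2, 2, 2, 2, 3, 3]
is_correct = [True, True, False, False, True, False, True, False]
rates = cluster_error_rates(cluster_ids, is_correct)
```

With this toy input, cluster 2 has 3 wrong answers out of 4 questions, so its error rate is 75%.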
Figure 3: Clusters of similar questions. [Bar chart; x-axis: cluster index (1 to 8); y-axis: error rate (%).]
As shown in Figure 3, one cluster (Cluster 2) has frequent Zero-Shot-CoT errors (52.3%). The phenomenon could be generic, as Zero-Shot-CoT may lack some of the skills needed to solve certain common problems in target tasks.³ For convenience, we call the cluster with the highest error rate the frequent-error cluster (e.g., Cluster 2 in Figure 3). Because reasoning chains generated in a zero-shot fashion are imperfect, similarity-based methods risk retrieving multiple similar questions from within a frequent-error cluster. For a test question in the frequent-error cluster, Retrieval-Q-CoT more easily constructs demonstrations that share similar mistakes. As a result, Retrieval-Q-CoT often makes mistakes similar to those of Zero-Shot-CoT, as reflected by its higher unresolving rate in Figure 2.
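
To make the risk concrete, the following minimal sketch retrieves the top-k most similar questions by cosine similarity and counts how many of the retrieved demonstrations carry a wrong reasoning chain. The two-dimensional vectors, the `pool` entries, and the function names are hypothetical and chosen only for illustration; real question encodings would be higher-dimensional sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, pool, k):
    """Return the k pool items most similar to query_vec."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return ranked[:k]


# Hypothetical pool: cluster 2 plays the role of the frequent-error cluster.
pool = [
    {"id": "a", "vec": [1.0, 0.1], "cluster": 2, "chain_correct": False},
    {"id": "b", "vec": [0.9, 0.2], "cluster": 2, "chain_correct": False},
    {"id": "c", "vec": [0.8, 0.3], "cluster": 2, "chain_correct": True},
    {"id": "d", "vec": [0.1, 1.0], "cluster": 5, "chain_correct": True},
    {"id": "e", "vec": [0.2, 0.9], "cluster": 5, "chain_correct": True},
]

# A test question whose encoding lies inside the frequent-error cluster.
query = [1.0, 0.15]
demos = retrieve_top_k(query, pool, k=2)
num_wrong = sum(not d["chain_correct"] for d in demos)
```

In this toy setup both retrieved demonstrations come from cluster 2 and both have wrong chains, which mirrors the failure mode described above: similarity-based selection concentrates demonstrations in the cluster where zero-shot chains are least reliable.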
3.3 Diversity May Mitigate Misleading by Similarity
The analysis so far shows that LLMs are still not perfect zero-shot reasoners; thus, in the design of Auto-CoT we aim to mitigate the effect of their Zero-Shot-CoT errors, especially the misleading effect of similarity.
As we will show later (Section 5.5), presenting a small portion of mistakes (e.g., 1 or 2 wrong demonstrations out of 8) would not harm the overall reasoning performance for test questions. Suppose that the questions of all the wrong demonstrations fall into the same frequent-error cluster; then sampling one question from each cluster yields a higher than 7/8 = 87.5% chance of constructing all 8 correct demonstrations. Since different clusters reflect diverse semantics of the questions, this clustering-based sampling method can be considered diversity-based, in sharp contrast to the similarity-based Retrieval-Q-CoT. On one hand, sampling questions with diversity may mitigate
³ We observe similar phenomena when changing the number of clusters or using other datasets (Appendix A.2).