
referred to as in-context learning), is something that large language models are particularly skilled
at (Shin et al., 2020; Liu et al., 2021). Among the wide spectrum of language understanding tasks,
we are particularly interested in multi-step reasoning for two reasons: (1) multi-step reasoning is a
task where large models substantially outperform smaller models (Wei et al., 2022b), whereas the
gains from large models on tasks like sentiment classification can be very limited (Shin et al., 2020);
(2) multi-step reasoning is where few-shot prompting starts to outperform full-training-set fine-tuning,
even when fine-tuning is conducted on the same large model (Lewkowycz et al., 2022). This work
takes an important step forward in multi-step reasoning by showing the critical role of prompt
complexity.
Chain-of-Thoughts Reasoning A prominent work demonstrating the multi-step reasoning ability of
language models is chain-of-thoughts prompting (Fig. 1A), proposed by Wei et al. (2022b). They
show that this reasoning ability is elicited only by chain-of-thoughts prompting, not by standard
prompting where an answer directly follows a question without intermediate reasoning steps. Further
works show that CoT can be improved by self-consistency (Wang et al., 2022b), pretraining the model
on LaTeX-formatted data (Lewkowycz et al., 2022), context selection (Creswell et al., 2022), or even
adding certain magic phrases like “Let’s think step by step” (Kojima et al., 2022). The original CoT
paper (Wei et al., 2022b) uses 8 manually written examples as the prompt, which are reused by most
follow-up works. Our work sits in the context of CoT reasoning and proposes a new complexity-based
prompt selection scheme that substantially outperforms the original CoT.
Example Selection for Prompting Designing prompts can be challenging because of their instability:
multiple works have shown that performance is sensitive to changes in the prompt, task, dataset, and
model (Zhao et al., 2021; Lu et al., 2022; Su et al., 2022). Despite works on automatic prompt
searching (which is more suitable for smaller models, e.g., Shin et al., 2020; Li & Liang, 2021),
prompt engineering for large models currently remains a community-wide collective trial-and-error
effort (there is even a prompt marketplace named PromptBase). The difficulty is that it is extremely
hard to extract, from empirical observations, generalizable regularities that can form effective
selection criteria. One notable exception is similarity-based prompt selection, which retrieves the
most similar training instances as the prompt for a given test case (Rubin et al., 2022), as sketched
after this paragraph. Yet for CoT prompting, retrieving different prompts for different test cases
requires reasoning chain annotations for the whole training set, which compromises the advantage of
being few-shot. Given this background, our core contribution is identifying complexity as an effective
and robust selection criterion that, in many cases, outperforms existing prompt selection schemes
while being annotation-efficient.
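To make the contrast concrete, the following is a minimal sketch of similarity-based exemplar
retrieval; the bag-of-words cosine similarity used here is an illustrative stand-in for the learned
retriever of Rubin et al. (2022), not their actual method. Note that every question in the retrieval
pool must already carry a reasoning chain annotation before it can serve as a CoT prompt.

# Minimal sketch of similarity-based prompt selection (illustrative only).
# A bag-of-words cosine similarity stands in for a learned retriever; the
# annotation cost arises because every retrievable training question needs
# an annotated reasoning chain.
from collections import Counter
from math import sqrt
from typing import List, Tuple

def bow(text: str) -> Counter:
    """Bag-of-words term counts for a question."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_prompt(test_question: str,
                    annotated_pool: List[Tuple[str, str]],  # (question, reasoning chain)
                    k: int = 8) -> List[Tuple[str, str]]:
    """Return the k annotated training instances most similar to the test question."""
    return sorted(annotated_pool,
                  key=lambda qc: cosine(bow(test_question), bow(qc[0])),
                  reverse=True)[:k]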
Relation to Classical Semantic Parsing The procedure of chain-of-thoughts prompting is
conceptually similar to classical semantic parsing, where one generates a logical form and then
executes it over a knowledge base to reach a final answer (Liang, 2016; Cheng et al., 2019). The
practice of sampling then voting is also similar to marginalizing out semantic parses (Yin et al., 2018).
Further works connect in-context learning to classical Bayesian inference (Wei et al., 2021; Xie et al.,
2022). From our perspective, we tend to view chains of thought as flexible, language-model-styled
“logical forms” that are “executed” by the language model itself. We leave further study connecting
classical parsing and CoT to future work.
3 COMPLEXITY-BASED PROMPTING
We study multi-step reasoning tasks and use math word problems, i.e., mathematical problems
expressed in natural language, as our testbed. The task, measured by solve rate (accuracy), is to
predict the answer (typically a number) to a given math word problem via intermediate reasoning
steps. We follow the chain-of-thoughts prompting framework and compare all prompting schemes
using GPT-3 text-davinci-002 and Codex code-davinci-002. An example problem, as well
as the chain-of-thoughts workflow, is shown in Fig. 1A. The input is a stack of a few (often 8) CoT
cases followed by a test question; the language model then continues the text, generating an output
CoT for the test question. Our goal is to improve reasoning accuracy by identifying and exploiting
more effective input and output reasoning chains.
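To make this workflow concrete, the following is a minimal sketch under simple assumptions; the
exemplar format, the answer-extraction heuristic, and the generate wrapper around the language
model call (e.g., text-davinci-002) are illustrative choices, not the exact implementation used
in our experiments.

# Minimal sketch of the chain-of-thoughts prompting workflow (illustrative only).
import re
from typing import Callable, List, Tuple

def build_cot_prompt(exemplars: List[Tuple[str, str]], test_question: str) -> str:
    """Stack (question, reasoning chain) exemplars, then append the test question."""
    blocks = [f"Question: {q}\n{chain}" for q, chain in exemplars]
    blocks.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(blocks)

def extract_answer(output_chain: str) -> str:
    """Heuristically take the last number in the generated chain as the prediction."""
    numbers = re.findall(r"-?\d+\.?\d*", output_chain.replace(",", ""))
    return numbers[-1] if numbers else ""

def solve(generate: Callable[[str], str],
          exemplars: List[Tuple[str, str]],
          test_question: str) -> str:
    """generate wraps the language model call; solve rate is the fraction of correct answers."""
    prompt = build_cot_prompt(exemplars, test_question)
    output_chain = generate(prompt)  # the model continues with a reasoning chain for the test question
    return extract_answer(output_chain)

Here, generate could, for instance, wrap a single greedy-decoded completion request to the model;
any such wrapper that maps a prompt string to the generated continuation fits this sketch.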