(Lampinen et al., 2022) and in distillation (Pruthi et al., 2022). In this paper, we focus more on
the unsupervised learning setting, where we do not assume we have a rationale-augmented training
dataset available, since human-annotated rationales can be expensive.
Few-shot explanations improve reasoning in LLMs. Recently, substantial progress has been made towards improving LLMs’ reasoning abilities via prompting or in-context learning. Wei et al.
(2022b) propose Chain-of-Thought prompting, which prompts the language model to generate a se-
ries of natural-language-based intermediate steps, and show it can help language models better solve
complex and multi-step reasoning tasks. Wang et al. (2022b) improve Chain-of-Thought prompting
by sampling multiple diverse reasoning paths and selecting the most consistent answer via majority
voting. Kojima et al. (2022) propose to prompt the language model with “Let’s think step by step”
to generate reasoning in a zero-shot fashion. Zhou et al. (2022a) further decompose the questions
into multiple sub-questions, and ask the language model to solve each sub-question sequentially.
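To make these prompting styles concrete, the sketch below assembles a few-shot CoT prompt and a zero-shot CoT prompt; the exemplar text and function names are illustrative stand-ins rather than the exact prompts used in the cited papers:

```python
# Illustrative CoT exemplar in the style of Wei et al. (2022b); the content is
# a stand-in, not an exemplar taken verbatim from the cited work.
FEW_SHOT_COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def few_shot_cot_prompt(question: str) -> str:
    # Few-shot CoT: prepend worked examples whose answers spell out the
    # intermediate reasoning steps before the final answer.
    return f"{FEW_SHOT_COT_EXEMPLAR}\nQ: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot CoT (Kojima et al., 2022): no exemplars, just a trigger phrase
    # that elicits step-by-step reasoning.
    return f"Q: {question}\nA: Let's think step by step."
```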
Refining explanations. More recent work proposes to further refine the generated reasoning paths
as some of them could be unreliable. For example, Ye & Durrett (2022) calibrate model predictions based on the reliability of the explanations, while Jung et al. (2022) show that inducing a tree of explanations and inferring the satisfiability of each explanation can further help judge the correctness of explanations. Li et al. (2022b) show that sampling a diverse set of prompts from the training data and using a voting verifier can improve a model’s reasoning performance. Zelikman et al. (2022) propose to improve rationale generation by providing ground-truth answers as hints when predicted answers are incorrect. Our work is orthogonal to these lines of work, as we utilize refined explanations
from Wang et al. (2022b) for fine-tuning the model for self-improvement, and could readily incor-
porate these other refinement techniques for generating higher-quality self-training data. Our work
is similar to Zelikman et al. (2022) in that both propose to fine-tune a model on self-generated CoT data, but our method does not require ground-truth labels and shows stronger empirical results with multi-task generalization.
Self-training models. One related line of work is self-training (see a survey from Amini et al.
(2022)). The key idea is to assign pseudo labels from a learned classifier to unlabeled data, and use these pseudo-labeled examples to further improve training of the original model (e.g., RoyChowdhury et al., 2019; Xie et al., 2020; He et al., 2020; Chen et al., 2021). Different from such prior work, our
proposed self-improvement framework uses CoT prompting plus self-consistency to obtain high-
confidence solutions on a large set of unlabeled data to augment the fine-tuning process.
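As a minimal sketch of this pseudo-labeling idea (assuming a scikit-learn-style classifier; the confidence threshold and single-round loop are illustrative choices, not the procedure of any specific cited work):

```python
import numpy as np

def self_training_round(model, X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """One round of classic self-training: pseudo-label confident unlabeled
    examples with the current model, then retrain on the enlarged set.

    Assumes a scikit-learn-style classifier exposing fit/predict_proba/classes_;
    the 0.9 confidence threshold is an illustrative choice.
    """
    probs = model.predict_proba(X_unlabeled)            # model beliefs on unlabeled data
    confidence = probs.max(axis=1)                      # highest class probability per example
    pseudo_labels = model.classes_[probs.argmax(axis=1)]  # predicted (pseudo) labels
    keep = confidence >= threshold                      # keep only high-confidence predictions

    X_aug = np.concatenate([X_labeled, X_unlabeled[keep]])
    y_aug = np.concatenate([y_labeled, pseudo_labels[keep]])
    model.fit(X_aug, y_aug)                             # retrain on labeled + pseudo-labeled data
    return model
```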
Distillation and dark knowledge. Our method also tangentially relates to the rich literature on distillation (Ba & Caruana, 2014; Hinton et al., 2015), where a student network imitates a teacher
network’s classifier predictions on input examples. A key detail is to learn from soft targets instead
of hard predicted labels, as softmax outputs with a high temperature reveal more detailed relative
class likelihoods, colloquially known as dark knowledge (Hinton et al., 2015; Korattikara Balan
et al., 2015). Recent studies (Zelikman et al., 2022; Snell et al., 2022; Eisenstein et al., 2022) show
that dark knowledge within LLMs can be retrieved with more computation at inference time, such
as adding informative instructions to the input sequence and generating CoT outputs (Wei et al., 2022b; Kojima et al., 2022). In our work, we explicitly show that imperfect CoT reasoning (which may lead to incorrect answers) can be used directly for self-improving language models, as evidenced by our experiments in Sections 5.2 and 5.3.
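For reference, a minimal PyTorch sketch of the soft-target objective described above (the temperature value is an illustrative choice; this is not the training loss used in our method):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target distillation loss in the spirit of Hinton et al. (2015).

    A high temperature softens both distributions so the student also learns
    the teacher's relative class likelihoods ("dark knowledge"), not just the
    argmax label.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened teacher and student distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
```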
3 METHOD
An overview of our method is illustrated in Fig. 1: we are given a pre-trained Large Language Model (LLM) $M$ and a question-only training dataset $D_{\text{train}} = \{x_i\}_{i=1}^{D}$ with few-shot Chain-of-Thought (CoT) examples (Wei et al., 2022b). We apply multiple-path decoding with a sampling temperature $T > 0$ to generate $m$ reasoning paths and answers $\{r_{i1}, r_{i2}, \ldots, r_{im}\}$ for each question $x_i$ in $D_{\text{train}}$, and use majority voting (self-consistency) to select the most consistent, highest-confidence answer (Wang et al., 2022b).
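A minimal sketch of this sampling-and-voting step, together with the path filtering described next, is shown below; `sample_cot_paths` and `extract_answer` are hypothetical helpers standing in for sampled CoT decoding and final-answer parsing, and the values of $m$ and $T$ are illustrative:

```python
from collections import Counter

def build_self_training_examples(model, questions, cot_prompt, m=32, temperature=0.7):
    """Illustrative sketch: sample m CoT paths per question, majority-vote the
    answer (self-consistency), and keep the paths that agree with the vote.

    `sample_cot_paths(model, prompt, n, temperature)` and `extract_answer(path)`
    are hypothetical helpers for sampled decoding and answer parsing.
    """
    examples = []
    for question in questions:
        prompt = cot_prompt + "\nQ: " + question + "\nA:"
        paths = sample_cot_paths(model, prompt, n=m, temperature=temperature)
        answers = [extract_answer(p) for p in paths]

        # Self-consistency: the most frequent final answer is treated as the
        # high-confidence (pseudo-label) answer for this question.
        voted_answer, _ = Counter(answers).most_common(1)[0]

        # Keep only reasoning paths whose answer matches the voted answer;
        # these become (question, reasoning, answer) fine-tuning examples.
        for path, answer in zip(paths, answers):
            if answer == voted_answer:
                examples.append({"question": question, "reasoning": path, "answer": answer})
    return examples
```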
We then keep all reasoning paths that lead to the most consistent answer, apply mixed formats of prompts and answers for augmentation, and fine-tune the model on these self-generated reasoning-answer data. We consider our approach as making the