An important difference between our ThinkSum and existing chain-of-thought-like prompt engineering methods (Wei et al., 2022; Kojima et al., 2022) is that our reasoning step is not reduced to a generation problem for the LLM, but is performed as probabilistic inference external to the LLM. This reduces vulnerability to features of the prompt, such as accidental distraction of the LLM by spurious patterns (see Fig. 1, middle). Instead, we engineer
the slow thinking process to make parallel calls
to the LLM to query for intermediate information,
then possibly perform programmatic recombina-
tion of strings (Think). The final reasoning step
– in which likelihoods obtained from the LLM for
the recombinations derived from earlier steps of
the reasoning process are combined to make the
final prediction – is left to classical probabilistic
reasoning (Sum). In a sense, Sum replaces the
self-attention mechanism over linear text, which is
used as the sole ‘reasoning’ mechanism in chain-of-
thought-like approaches that expect the intermedi-
ate ‘thoughts’ to take the form of generated tokens
intervening between the input and output.
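To make the division of labor concrete, the following minimal Python sketch shows one way the Think/Sum pattern might be realized. The names think_sum, recombine, and loglik are our illustrative assumptions, not the paper's API; loglik stands in for any backend call that returns log P(continuation | prompt) under the LLM.

import math
from typing import Callable, Iterable, List

def think_sum(question: str,
              answers: Iterable[str],
              recombine: Callable[[str], List[str]],
              loglik: Callable[[str, str], float]) -> str:
    # Think: derive a set of prompts from the question, e.g. by slot
    # substitution or by parsing LLM-generated lists (parallel calls).
    prompts = recombine(question)
    best_answer, best_score = None, -math.inf
    for answer in answers:
        # Sum: combine the per-prompt likelihoods by classical
        # probabilistic reasoning; here, a uniform mixture.
        score = sum(math.exp(loglik(p, answer)) for p in prompts) / len(prompts)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer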
Imposing an alternative reasoning system over an associative “knee-jerk reaction” system parallels models of human cognitive processes (Tversky and Kahneman, 1974; Kahneman, 2011)
that separate System 1 (fast thinking) and System
2 (slow thinking). System 2 acts as a ‘controller’
that can prime System 1 to appropriately bias its
fast thinking. In the context of reasoning with deep learning models, System 2 has been interpreted as operating with sparse concepts that can be described in language (Bengio, 2017; Goyal and Bengio, 2020). Through repeated use, the functions of System 2 become compressed into System 1 intuitions, in the same way that iterative ‘reasoning’ procedures beyond the reach of smaller LLMs become zero-shot generation capabilities of large LLMs. As is the case with humans, there
is always the next frontier of problems where a
trained model with remarkable ‘intuition’ needs to
be slowed down. The main claim of this paper is
that more is possible with LLMs of existing scale
when they are used in concert with a wise controller
that allows for probabilistic inference.
2 ThinkSum
2.1 How to Think
Here we list examples of the “fast thinking” that precedes the summarization stage.
Elementary string manipulations. Standard
ways to turn a question into a prompt that can be
given to a LLM for generation or scoring involve
choices (e.g., of the prompt format) that can be
seen as being made by a controlling agent. The
default approach to multiple-choice questions is
to write them as Cloze tasks. However, there are nontrivial operations used in inference procedures that sometimes work better, such as the two below (a code sketch follows the list):
• Order inversion: Exchanging the order of the question and answers, as in Min et al. (2022).
• Premise erasure: Deleting a part of the question. Removing a premise with which the answer is expected to have high mutual information is a step in inference procedures that aim to correct for bias towards answers with high unconditional likelihood (Zhao et al., 2021; Holtzman et al., 2021; Malkin et al., 2022).
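Both manipulations are purely programmatic. The sketch below uses hypothetical prompt formats of our own choosing; the returned (prompt, continuation) pairs would then be scored by the LLM.

from typing import Tuple

def order_inversion(question: str, answer: str) -> Tuple[str, str]:
    # Score the question conditioned on the answer rather than the
    # answer conditioned on the question (cf. Min et al., 2022).
    return f"Answer: {answer}\nQuestion:", f" {question}"

def erase_premise(question: str, premise: str) -> str:
    # Remove a premise so the remaining prompt approximates the
    # answer's unconditional likelihood, correcting for answer bias.
    return question.replace(premise, "").strip()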
Substitution and normalization. An example is shown in Fig. 1. Elements from a set may be substituted in place of ‘slot’ words in a prompt, such as ‘cat’ substituted for ‘binne’ in the prompt “A binne bam is a place for”. This operation can be combined with syntax-normalization steps that are reliably achieved by standard NLP tools, such as ensuring subject-verb agreement.
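A sketch of slot substitution, under the simplifying assumption that the slot is a unique whole word in the template; any agreement fixes would be delegated to standard NLP tooling afterwards.

from typing import List

def substitute(template: str, slot: str, fillers: List[str]) -> List[str]:
    # e.g. substitute("A binne bam is a place for", "binne", ["cat"])
    #      -> ["A cat bam is a place for"]
    # Subject-verb agreement etc. can then be normalized with standard
    # NLP tools before the prompts are sent to the LLM.
    return [template.replace(slot, filler) for filler in fillers]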
Example and list generation. A LLM can be
prompted to generate or score lists of words or
phrases. We suggest and experiment with three instances of this (a code sketch follows the list):
• Example generation: In Fig. 1, the LLM is prompted to turn a definition or characterizing property, such as ‘simple dwelling’, into a list of examples. This can be achieved with a prompt such as “A bam is a simple dwelling. Examples: 1.”. The generated completion can be parsed into a set to be used later in the inference procedure.
• List extension: A similar approach can also be used to hallucinate additional possible answers to questions, as we will show in some of the experiments.
• List of words: Similar prompts provide an even simpler Think method that we use for scoring – but not generation – in several tasks. Prompting a LLM with “List of words: 𝐴, 𝐵”, where 𝐴 and 𝐵 are words or phrases, and computing the likelihood of 𝐵 conditioned on “List of words: 𝐴,” yields a good measure of the semantic relatedness of 𝐴 and 𝐵.
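The sketch below illustrates two of these operations: parsing a generated numbered list into a set, and using the conditional likelihood under a “List of words” prompt as a relatedness score. The loglik wrapper is the same hypothetical scoring call assumed earlier, returning log P(continuation | prompt).

import re
from typing import Callable, Set

def parse_numbered_list(completion: str) -> Set[str]:
    # Split a completion that continues "... Examples: 1." (e.g.
    # " hut 2. cabin 3. cottage") into {"hut", "cabin", "cottage"}.
    items = re.split(r"\s*\d+\.\s*", completion)
    return {item.strip(" .,\n") for item in items if item.strip(" .,\n")}

def relatedness(a: str, b: str,
                loglik: Callable[[str, str], float]) -> float:
    # log P(B | "List of words: A,") as a semantic-relatedness score.
    return loglik(f"List of words: {a},", f" {b}")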