ThinkSum: Probabilistic reasoning over sets using large language models
Batu Ozturkler
Stanford University
Stanford, California, USA
ozt@stanford.edu
Nikolay Malkin
Mila, Université de Montréal
Montréal, Québec, Canada
nikolay.malkin@mila.quebec
Zhen Wang
Ohio State University
Columbus, Ohio, USA
wang.9215@osu.edu
Nebojsa Jojic
Microsoft Research
Redmond, Washington, USA
jojic@microsoft.com
Abstract
Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the more advanced LLMs fail in scenarios that require reasoning over multiple objects or facts and making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner. In the first stage (Think: retrieval of associations), a LLM is queried in parallel over a set of phrases extracted from the prompt or an auxiliary model call. In the second stage (Sum: probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks, achieving improvements over the state of the art using GPT-family models on thirteen difficult tasks, often with far smaller model variants. We also compare and contrast ThinkSum with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. Our results suggest that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, ThinkSum is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs. Overall, our proposed paradigm represents a promising approach for enhancing the reasoning capabilities of LLMs.
1 Introduction
Large language models (LLMs; Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022) can recall a broad range of basic facts, recognize and mimic various forms in language, and efficiently extrapolate analogies in structure and meaning. These abilities allow LLMs to excel in zero-shot and few-shot tasks formulated as the generation or selection of a likely completion to a prompt. This formulation requires LLMs to perform fast associative thinking, in which each token of text in the sequence making up the answer is generated or scored in one pass through the model and, other than that, no intermediate information is created or retained. This fast thinking is made possible by the compression of information that is repeated in a variety of ways in large training datasets, within the LLM's weights.

However, it is increasingly evident that when reasoning, or slow thinking, is required, failure modes of LLMs are revealed. In our usage, reasoning refers to the sequential manipulation of concepts that can be expressed in language. Tasks that require iterative retrieval of rarely stated knowledge, uncertainties over multiple objects or facts, or multiple steps of deduction are difficult even for the most advanced LLMs (Suzgun et al., 2022). In a recently designed suite of evaluations, BIG-bench (Srivastava et al., 2022), some of the tasks where the gap between machine and human performance is large involve inference sequences with nested counterfactuals (LOGICAL DEDUCTION), concepts introduced through definitions (CONCEPTUAL COMBINATIONS), etc. (see Fig. B.1). These are tasks where a human solver's intuitive feeling of '(in)coherence' is insufficient to produce the right answer, and a sequence of thoughts, along with the use of intermediate results, may be necessary to arrive at the solution, particularly when working memory is insufficient.
We show several tasks in BIG-bench that can be addressed by a two-component mechanism, which we name ThinkSum (named by analogy with other algorithms with 'expand' and 'aggregate' steps, such as MapReduce in distributed computing and sum-product in graphical models):
[Figure 1 contents]
Premise: "A binne is any furry four-legged creature, and a bam is a simple dwelling."
Direct prompting: "A binne bam is a place for" scores people (55%), animals (44%), birds (0.87%), researchers (0.022%).
Chain of thought / auxiliary knowledge: the premise plus "Examples of binnes: cat, mink, ferret, guinea pig, rabbit. Examples of bams: hut, cabin, cottage, shelter, shack."; "A binne bam is a place for" then scores people (51%), animals (48%), birds (0.76%), researchers (0.011%).
ThinkSum: Think (auxiliary LM calls to define sets) yields binne = {cat, mink, ferret, guinea pig, rabbit} and bam = {hut, cabin, cottage, shelter, shack}; Sum (aggregate LM likelihoods over substitutions such as "A cat cottage is a place for", "A rabbit cabin is a place for", "A mink shelter is a place for", ...) scores animals (65%), people (34%), birds (1.5%), researchers (0.056%).
Figure 1: An example adapted from the CONCEPTUAL COMBINATIONS (INVENTED WORDS) task, in which models must select the most likely completion of a phrase that includes nonce words whose definitions are given. Top: Direct prompting evaluates completion likelihoods normalized over the four answer choices ('people', 'animals', 'birds', 'researchers'). Middle: Chain-of-thought-like or auxiliary knowledge approaches would query a LLM or knowledge base for additional context. This example shows the brittleness of entrusting all 'reasoning' to self-attention in linear text, especially in smaller models, which have a stronger recency bias (Malkin et al., 2022): if we simply list generated examples as the additional context in the prompt, the recency bias causes the LLM to still give a higher probability to 'people' than to 'animals', simply because 'bam' (simple dwelling) examples are given after the 'binne' examples. Bottom: Our ThinkSum approach to this task queries a LLM (GPT-2 XL) to produce sets of examples defining the nonce words, then marginalizes over substitutions of these examples into the target phrase.
Think (fast thinking / association / knowledge retrieval step): creating an association of text spans with sets of strings. This process may involve generation from a language model, as is the case in Fig. 1, where the novel word 'binne' is associated with the set of strings {'cat', 'mink', ...} by prompting GPT-3 with the definition and asking for examples. Alternatively, it may consist solely of a scoring mechanism, resulting in the formation of a matrix of probabilities on which probabilistic inference is performed.

Sum (slow thinking / Summarization / reasoning step): probabilistic inference that aggregates generated strings or probabilities to produce the final answer. Summarization typically involves, and often entirely consists of, summing of probabilities of strings (computed in the Think step), as in Fig. 1, where the final word is assumed to be sampled from a mixture of possible substitutions of 'binne' and 'bam' words into the input.
We discuss different ways to Think and to Sum in §2, but we start with one example, illustrated in Fig. 1 (bottom), motivated by the CONCEPTUAL COMBINATIONS (INVENTED WORDS) task in BIG-bench. In this task, the LLM is provided with the definitions of two invented words and asked to infer the most plausible sentence that uses a combination of the invented words. As the words are not common or consistently used in the training set, the LLM needs to understand and combine the definitions of the invented words to reason about the meaning of the combination. The LLM is queried to produce example instances of the invented words with the help of the definitions. These example instances can be substituted into the query in place of the invented words. By mapping individual spans of the text of interest to sets, we arrive at a mixture model (in this example, a mixture with 25 components for 5 possible replacements of each word), which can be used in the same manner as the original LLM, either to score text or to generate it token by token. When we score all candidate completions using this mixture model and normalize over the four choices, the correct answer, that 'binne bams' are for animals and not people, becomes the most likely.
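As a concrete illustration of this mixture model, here is a minimal sketch (not code released with the paper); it assumes a caller-supplied function `logprob(text)` returning the LLM log-likelihood of a full string, and it ignores the article and syntax normalization discussed later in §2.1.

```python
import itertools
import math
from typing import Callable

def mixture_answer_scores(
    template: str,                    # e.g. "A {binne} {bam} is a place for {answer}"
    binnes: list[str],                # e.g. ["cat", "mink", "ferret", "guinea pig", "rabbit"]
    bams: list[str],                  # e.g. ["hut", "cabin", "cottage", "shelter", "shack"]
    answers: list[str],               # e.g. ["people", "animals", "birds", "researchers"]
    logprob: Callable[[str], float],  # assumed interface: log p_LLM(text)
) -> dict[str, float]:
    """Score each answer under a uniform mixture over all substitutions of the
    nonce words (25 components for 5 x 5 example words), as in Fig. 1 (bottom)."""
    scores = {}
    for answer in answers:
        component_logps = [
            logprob(template.format(binne=b, bam=d, answer=answer))
            for b, d in itertools.product(binnes, bams)
        ]
        # log of the average likelihood over all mixture components
        scores[answer] = math.log(
            sum(math.exp(lp) for lp in component_logps) / len(component_logps)
        )
    return scores

# The final prediction normalizes these scores over the four answer choices and
# takes the argmax, e.g. max(scores, key=scores.get).
```

In practice one would average the component likelihoods with a log-sum-exp for numerical stability; the plain sum above is kept for readability.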
An important difference between our ThinkSum and existing chain-of-thought-like prompt engineering methods (Wei et al., 2022; Kojima et al., 2022) is that our reasoning step is not reduced to a generation problem for the LLM, but is performed as a probabilistic inference external to the LLM. This reduces vulnerability to features of the prompt, such as accidental distraction of the LLM by spurious patterns (see Fig. 1, middle). Instead, we engineer the slow thinking process to make parallel calls to the LLM to query for intermediate information, then possibly perform programmatic recombination of strings (Think). The final reasoning step, in which likelihoods obtained from the LLM for the recombinations derived from earlier steps of the reasoning process are combined to make the final prediction, is left to classical probabilistic reasoning (Sum). In a sense, Sum replaces the self-attention mechanism over linear text, which is used as the sole 'reasoning' mechanism in chain-of-thought-like approaches that expect the intermediate 'thoughts' to take the form of generated tokens intervening between the input and output.
Imposing an alternative reasoning system over an associative 'knee-jerk reaction' system has an analogy with models of human cognitive processes (Tversky and Kahneman, 1974; Kahneman, 2011) that separate System 1 (fast thinking) and System 2 (slow thinking). System 2 acts as a 'controller' that can prime System 1 to appropriately bias its fast thinking. In the context of reasoning with deep learning models, System 2 has been interpreted as operating with sparse concepts that can be described in language (Bengio, 2017; Goyal and Bengio, 2020). Through repeated usage, the functions of System 2 become compressed into System 1 intuitions, in the same manner that iterative 'reasoning' functions of which smaller LLMs are not capable become zero-shot generation capacities for large LLMs. As is the case with humans, there is always the next frontier of problems where a trained model with remarkable 'intuition' needs to be slowed down. The main claim of this paper is that more is possible with LLMs of existing scale when they are used in concert with a wise controller that allows for probabilistic inference.
2 ThinkSum
2.1 How to Think
Here we list examples of the "fast thinking" that precedes the summarization stage.

Elementary string manipulations. Standard ways to turn a question into a prompt that can be given to a LLM for generation or scoring involve choices (e.g., of the prompt format) that can be seen as being made by a controlling agent. The default approach to multiple-choice questions is to write them as Cloze tasks. However, there are nontrivial operations used in inference procedures that sometimes work better, such as the two below (a sketch of both appears after the list):

Order inversion: Exchanging the order of the question and answers, as in Min et al. (2022).

Premise erasure: Deleting a part of the question. Removing a premise with which the answer is expected to have high mutual information is a step in inference procedures that aim to correct for bias towards answers with high unconditional likelihood (Zhao et al., 2021; Holtzman et al., 2021; Malkin et al., 2022).
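A toy sketch of these two string manipulations follows; the function names and the explicit split of a prompt into a premise and a question are our own illustration, not the paper's implementation.

```python
def order_inversion(question: str, answer: str) -> str:
    """Score the question conditioned on the answer instead of the answer
    conditioned on the question (cf. Min et al., 2022)."""
    return f"{answer}\n{question}"

def premise_erasure(prompt: str, premise: str) -> str:
    """Delete a premise so the remaining prompt approximates the answer's
    unconditional likelihood; used later for calibration in the Sum stage."""
    return prompt.replace(premise, "").strip()
```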
Substitution and normalization. An example is shown in Fig. 1. Elements from a set may be substituted in place of 'slot' words in a prompt, such as 'cat' substituted for 'binne' in the prompt "A binne bam is a place for". This operation can be combined with syntax-normalization steps that are reliably achieved by standard NLP tools, such as ensuring subject-verb agreement.
Example and list generation. A LLM can be prompted to generate or score lists of words or phrases. We suggest and experiment with three instances of this (a code sketch of the first and third follows the list):

Example generation: In Fig. 1, the LLM is prompted to turn a definition or characterizing property, such as 'simple dwelling', into a list of examples. This can be achieved with a prompt such as "A bam is a simple dwelling. Examples: 1.". The generated completion can be parsed into a set to be used later in the inference procedure.

List extension: A similar approach can also be used to hallucinate additional possible answers to questions, as we will show in some of the experiments.

List of words: Similar prompts provide an even simpler Think method that we use for scoring, but not generation, in several tasks. Just prompting a LLM with "List of words: $A$, $B$", where $A$ and $B$ are words or phrases, and computing the likelihood of $B$ conditioned on "List of words: $A$," is a good measure of semantic relatedness of $A$ and $B$.
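A sketch of the Example generation and List of words prompts; the parsing heuristics and the `generate`/`logprob` interfaces are assumptions on our part.

```python
import re
from typing import Callable

def generate_examples(definition: str, generate: Callable[[str], str]) -> list[str]:
    """Example generation: prompt with a definition plus 'Examples: 1.' and
    parse the numbered-list completion into a set of example strings."""
    prompt = f"{definition}\nExamples: 1."
    completion = "1. " + generate(prompt)           # re-attach the seeded "1."
    items = re.split(r"\s*\d+\.\s*", completion)    # split on "1.", "2.", ...
    return [item.strip(" ,.\n") for item in items if item.strip()]

def relatedness(a: str, b: str, logprob: Callable[[str, str], float]) -> float:
    """List of words: log p_LLM(B | "List of words: A,") as a semantic
    relatedness score between A and B."""
    return logprob(f"List of words: {a},", f" {b}")
```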
Fact generation. This way of Thinking associates an input word with a set of phrases in a similar manner to generating examples from a definition. It can be achieved with prompts such as "List facts about cats. 1.". The generated facts are good targets for substitutions of other concepts ('dogs', 'galaxies') in place of the concept ('cats') about which facts are generated. A variation on this asks the LLM to generate differences between two concepts, as shown in Fig. 2 (right).
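As a small illustrative sketch, generated facts about one concept can be turned into templates into which other concepts are substituted; the helper below is hypothetical and not from the paper.

```python
import re

def fact_templates(concept: str, facts: list[str]) -> list[str]:
    """Replace the source concept in each generated fact with a slot, so that
    other concepts ('dogs', 'galaxies') can be scored in its place."""
    pattern = re.compile(re.escape(concept), flags=re.IGNORECASE)
    return [pattern.sub("{}", fact) for fact in facts]

# fact_templates("cats", ["Cats have four legs.", "Cats like to nap."])
# -> ["{} have four legs.", "{} like to nap."]
```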
Translation. The LLM can be prompted to convert between different forms of representing the same concept as a sequence of tokens. We use two basic examples of this in experiments:

Translation between languages by prompting the LLM in formats such as "French: J'adore les chats noirs. English:". A very similar approach can be used to convert non-alphabetic symbols, such as emoji, into words with similar meanings.

Converting text to formal (symbolic) structures, like turning a word problem into a collection of mathematical equations.
2.2 How to Sum
Elementary inference. As above, we begin by listing existing standard ways of turning LLM outputs into answers, which we see as trivial cases of aggregation (Sum). A sketch of the second item follows the list.

Majority/minority vote (argmax/argmin): a component of most answer selection procedures.

Ratio of likelihoods: Likelihoods from different variants of the same prompt can be combined by considering their ratio or a more general log-linear or other mixture. For example, this can be done to correct the likelihood of an answer conditioned on a question by its unconditional likelihood, in combination with the Premise erasure operation described above.
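A minimal sketch of the Ratio of likelihoods used together with Premise erasure for calibration; the log-probability interface is assumed, as above.

```python
from typing import Callable

def calibrated_score(
    prompt: str,
    prompt_without_premise: str,
    answer: str,
    logprob: Callable[[str, str], float],  # assumed: log p_LLM(continuation | prompt)
) -> float:
    """Log-ratio of the answer's conditional likelihood to its likelihood under
    the premise-erased prompt (cf. Zhao et al., 2021; Holtzman et al., 2021;
    Malkin et al., 2022)."""
    return logprob(prompt, answer) - logprob(prompt_without_premise, answer)
```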
Mixture (average) aggregation. A collection of prompts can be treated as the components of a mixture model over completions. An example is shown in Fig. 1, where substitutions of a set of words yield 25 different prompts. Likelihoods of the completion over these 25 prompts are averaged.
Product aggregation. We use products of likelihoods in two different ways (a sketch of the second follows):

In a similar way as mixtures, but when the more natural probabilistic model has all elements of a set (of prompts) generating the answer, such as when a description or definition must be satisfied by all concepts in a set.

In a task where we are to determine whether a statement $S$ or its negation $\neg S$ is true, we can compute the likelihood of both $S$ and $\neg S$ being true (as a posterior over the tokens 'True' and 'False' in an appropriate prompt), then compare $p(\mathrm{True} \mid S)\,p(\mathrm{False} \mid \neg S)$ ($S$ is true and $\neg S$ is false) with $p(\mathrm{False} \mid S)\,p(\mathrm{True} \mid \neg S)$ ($S$ is false and $\neg S$ is true).
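A sketch of this second use, assuming a function `p_token(prompt, token)` that returns the normalized probability of 'True' or 'False' as the next token; the True/False prompt wrapper below is our own illustration, not the prompt used in the paper.

```python
from typing import Callable

def statement_is_true(
    statement: str,
    negation: str,
    p_token: Callable[[str, str], float],  # assumed: p(token | prompt)
) -> bool:
    """Product aggregation: compare p(True|S) * p(False|~S)
    against p(False|S) * p(True|~S)."""
    def wrap(s: str) -> str:
        return f"{s}\nTrue or False?\nAnswer:"   # hypothetical prompt format
    score_true = p_token(wrap(statement), "True") * p_token(wrap(negation), "False")
    score_false = p_token(wrap(statement), "False") * p_token(wrap(negation), "True")
    return score_true > score_false
```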
3 Experiments
In this section, we perform case studies on three tasks from the BIG-bench suite to demonstrate the possibilities of the inference approaches discussed in §2. We also experiment with ten other tasks from BIG-bench; the best results are summarized in Table 1, and the methods, grouped by the style of Thinking and Summing, are described in the Appendix (§A).

All details of the tasks can be found in the Appendix (§C). Comparisons to direct prompting and algorithms that append retrieved or generated tokens to the prompt are given in §3.4.
3.1 Conceptual combinations: Invented words

In INVENTED WORDS, two nonce words $x_1, x_2$ are defined and the correct statement must be chosen out of a set of statements $S = \{s_j\}$ that begin with (possibly inflected forms of) "$x_1\,x_2$" (Fig. 1).

We use an Example generation prompt to obtain a set of example words fitting the definitions of $x_1$ and $x_2$. We thus obtain sets $S_1$ and $S_2$ of words that can be substituted for $x_1$ and $x_2$, respectively.
We treat each statement $s_j$ as a template into which words $w_1 \in S_1$ and $w_2 \in S_2$ can be substituted by replacing $x_i$ with $w_i$ and normalizing the syntax to ensure subject-verb agreement. Denoting by $s_j[w_1, w_2]$ such a substitution, we form a vector of probabilities $p_j$ by scoring the Substitution of each possible pair of words into each statement, performing Mixture aggregation, and considering the Ratio of likelihoods with the template without substitution:
$$p_j = \frac{\frac{1}{|S_1|\,|S_2|} \sum_{w_1 \in S_1,\, w_2 \in S_2} p_{\mathrm{LLM}}(s_j[w_1, w_2])}{p_{\mathrm{LLM}}(s_j)}.$$
The statement $s_j$ with the highest likelihood under this normalized mixture, $\arg\max_j p_j$, is selected.
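A direct implementation sketch of this score, assuming a `prob(text)` function returning $p_{\mathrm{LLM}}$ of a full string and omitting the subject-verb agreement normalization; the slot-based template format is our own convention.

```python
import itertools
from typing import Callable

def invented_words_scores(
    templates: list[str],            # statements with slots, e.g. "A {x1} {x2} is a place for animals."
    x1: str, x2: str,                # the nonce words themselves, e.g. "binne", "bam"
    s1: list[str], s2: list[str],    # example words generated for x1 and x2
    prob: Callable[[str], float],    # assumed: p_LLM(text)
) -> list[float]:
    """p_j = mean over (w1, w2) of p_LLM(s_j[w1, w2]), divided by p_LLM(s_j)."""
    scores = []
    for t in templates:
        mixture = sum(
            prob(t.format(x1=w1, x2=w2)) for w1, w2 in itertools.product(s1, s2)
        ) / (len(s1) * len(s2))
        baseline = prob(t.format(x1=x1, x2=x2))  # the original statement, unsubstituted
        scores.append(mixture / baseline)
    return scores

# The answer is the statement with the largest score:
# templates[scores.index(max(scores))].
```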
3.2 Odd one out
We examine possible Think and Sum approaches in depth on the ODD ONE OUT task, in which the word in a set $W = \{w_i\}$ that is least semantically related to the others must be chosen (e.g., "Pick the odd word out: glass, head, arm, leg, hand, foot").
Task (section)                        | Avg. H | GPT-3 (davinci) n=0 | n=1  | n=2  | n=3  | ThinkSum GPT-3 | ThinkSum InstructGPT | ThinkSum GPT-2 XL
INVENTED WORDS (§3.1)                 | N/A    | 0.29 | 0.14 | 0.14 | 0.21 | 0.64 | 0.71 | 0.29
ODD ONE OUT (§3.2)                    | 0.80   | 0.27 | 0.20 | 0.23 | 0.23 | 0.80 | 0.84 | 0.71
FIVE OBJECTS (§3.3)                   | N/A    | 0.23 | 0.29 | 0.28 | 0.32 | 0.77 | –    | –
SPORTS UNDERSTANDING (§A.1)           | 0.71   | 0.50 | 0.50 | 0.50 | 0.50 | 0.71 | 0.74 | 0.54
KNOWN UNKNOWNS (§A.1)                 | 0.80   | 0.61 | 0.52 | 0.48 | 0.50 | 0.54 | 0.76 | –
MISCONCEPTIONS RUSSIAN (§A.2)         | 0.65   | 0.33 | 0.33 | 0.41 | 0.35 | 0.70 | 0.61 | –
EMOJI MOVIE (§A.2)                    | 0.93   | 0.12 | 0.18 | 0.12 | 0.19 | 0.80 | 0.75 | –
PARSINLU READING COMPREHENSION (§A.2) | 0.02   | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | –    | –
PHRASE RELATEDNESS (§A.3)             | 0.74   | 0.37 | 0.42 | 0.52 | 0.59 | 0.85 | 0.87 | 0.79
CODENAMES (§A.3)                      | 0.18   | 0.01 | 0.11 | 0.16 | 0.19 | 0.37 | 0.41 | 0.36
NOVEL CONCEPTS (§A.4)                 | 0.67   | 0.47 | 0.47 | 0.56 | 0.56 | 0.72 | 0.75 | 0.50
CODE LINE DESCRIPTION (§A.4)          | 0.60   | 0.32 | 0.32 | 0.28 | 0.32 | 0.83 | 0.90 | 0.77
LANGUAGE IDENTIFICATION (§A.5)        | 0.16   | 0.16 | 0.12 | 0.13 | 0.11 | 0.57 | –    | 0.30

Table 1: Standard metric (BLEU for CODENAMES, accuracy for other tasks) for GPT-3 175B (davinci) with $n$-shot direct prompting and for ThinkSum with GPT-3 175B (davinci), InstructGPT, and GPT-2 XL on BIG-bench tasks. 'Avg. H' denotes average human performance. A '–' indicates that the model and task combination was not evaluated because the model does not reliably execute the appropriate Think prompt. We did not evaluate InstructGPT on LANGUAGE IDENTIFICATION due to the large dataset size and API quota.
Figure 2: ODD ONE OUT. Left: Performance of GPT-3 ($n$-shot, $n = 0, 1, 2, 3$), auxiliary knowledge, and ThinkSum with various model sizes. Middle: Auxiliary knowledge vs. ThinkSum with varying number of differences. Right: Prompt used to generate knowledge statements.
List of words. We form a semantic relatedness matrix $P_{ij}$ by querying the LLM with a List of words Think prompt for each pair of indices $i, j$:
$$P_{ij} = p_{\mathrm{LLM}}(w_j \mid \text{``List of words: } w_i\text{,''}).$$
This matrix is aggregated by averaging over $j$ (in the log domain) and selecting the $i$ with the lowest average, i.e., the least likelihood of being generated by a product mixture of all words in the set: $i^\ast = \arg\min_i \prod_j P_{ij}$. This is a case of Product aggregation.
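A sketch of this pipeline; the log-probability interface is assumed, and since the treatment of the diagonal terms $P_{ii}$ is not specified above, they are skipped here as our own choice.

```python
from typing import Callable

def odd_one_out(
    words: list[str],
    logprob: Callable[[str, str], float],  # assumed: log p_LLM(continuation | prompt)
) -> str:
    """Form P_ij = p_LLM(w_j | "List of words: w_i,") and return the word whose
    row has the smallest log-domain sum, i.e. the least related word."""
    def row_score(i: int) -> float:
        return sum(
            logprob(f"List of words: {words[i]},", f" {words[j]}")
            for j in range(len(words)) if j != i   # diagonal skipped (our choice)
        )
    best_i = min(range(len(words)), key=row_score)
    return words[best_i]

# e.g. odd_one_out(["glass", "head", "arm", "leg", "hand", "foot"], logprob)
```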
Because this approach is the most successful with all model sizes we experimented with, its performance is reported in Table 1. Remarkably, near-average-human accuracy is maintained for all model sizes from GPT-2 Small to the largest GPT-3 model (Fig. 2, left).
Fact generation. As an alternative approach, we use a Fact generation prompt. An effective way to mine facts for semantic relatedness tasks is to consider two items in the same context in order to get relevant facts regarding how items are related to each other (prompt in Fig. 2, right). The demonstration used in the prompt ensures that the LLM generates statements in an expected format, which can be parsed and used for probability computation later. Using this prompt, we obtain a collection of statements $S = \{s_i\}$ about items $w_j$. We treat each generated $s_i$ as a template into which different words $w$ can be substituted and denote by $s_i[w]$ the Substitution of word $w$ into template $s_i$. We then form a $|S| \times |W|$ matrix $P_{ij}$, defined