ThinkSum: Probabilistic reasoning over sets using large language models
Batu Ozturkler
Stanford University
Stanford, California, USA
ozt@stanford.edu
Nikolay Malkin
Mila, Université de Montréal
Montréal, Québec, Canada
nikolay.malkin@mila.quebec
Zhen Wang
Ohio State University
Columbus, Ohio, USA
wang.9215@osu.edu
Nebojsa Jojic
Microsoft Research
Redmond, Washington, USA
jojic@microsoft.com
Abstract
Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the more advanced LLMs fail in scenarios that require reasoning over multiple objects or facts and making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner. In the first stage (Think: retrieval of associations), a LLM is queried in parallel over a set of phrases extracted from the prompt or an auxiliary model call. In the second stage (Sum: probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks, achieving improvements over the state of the art using GPT-family models on thirteen difficult tasks, often with far smaller model variants. We also compare and contrast ThinkSum with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. Our results suggest that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, ThinkSum is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs. Overall, our proposed paradigm represents a promising approach for enhancing the reasoning capabilities of LLMs.
1 Introduction
Large language models (LLMs; Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022) can recall a broad range of basic facts, recognize and mimic various forms in language, and efficiently extrapolate analogies in structure and meaning. These abilities allow LLMs to excel in zero-shot and few-shot tasks formulated as the generation or selection of a likely completion to a prompt. This formulation requires LLMs to perform fast associative thinking, in which each token of text in the sequence making up the answer is generated or scored in one pass through the model and, other than that, no intermediate information is created or retained. This fast thinking is made possible by the compression of information that is repeated in a variety of ways in large training datasets, within the LLM's weights.

However, it is increasingly evident that when reasoning, or slow thinking, is required, failure modes of LLMs are revealed. In our usage, reasoning refers to the sequential manipulation of concepts that can be expressed in language. Tasks that require iterative retrieval of rarely stated knowledge, uncertainties over multiple objects or facts, or multiple steps of deduction are difficult even for the most advanced LLMs (Suzgun et al., 2022). In a recently designed suite of evaluations, BIG-bench (Srivastava et al., 2022), some of the tasks where the gap between machine and human performance is large involve inference sequences with nested counterfactuals (LOGICAL DEDUCTION), concepts introduced through definitions (CONCEPTUAL COMBINATIONS), etc. (see Fig. B.1). These are tasks where a human solver's intuitive feeling of '(in)coherence' is insufficient to produce the right answer, and a sequence of thoughts, along with the use of intermediate results, may be necessary to arrive at the solution, particularly when working memory is insufficient.
We show several tasks in BIG-bench that can be addressed by a two-component mechanism, which we name ThinkSum (named by analogy with other algorithms with 'expand' and 'aggregate' steps, such as MapReduce in distributed computing and sum-product in graphical models):
[Figure 1 contents]
Premise: "A binne is any furry four-legged creature, and a bam is a simple dwelling."
Direct prompting: "A binne bam is a place for" scores people (55%), animals (44%), birds (0.87%), researchers (0.022%).
Chain of thought / auxiliary knowledge: the premise plus "Examples of binnes: cat, mink, ferret, guinea pig, rabbit. Examples of bams: hut, cabin, cottage, shelter, shack."; "A binne bam is a place for" then scores people (51%), animals (48%), birds (0.76%), researchers (0.011%).
ThinkSum: Think (auxiliary LM calls to define sets) yields binne = {cat, mink, ferret, guinea pig, rabbit} and bam = {hut, cabin, cottage, shelter, shack}; Sum (aggregate LM likelihoods over substitutions such as "A cat cottage is a place for", "A rabbit cabin is a place for", "A mink shelter is a place for", ...) scores animals (65%), people (34%), birds (1.5%), researchers (0.056%).
Figure 1: An example adapted from the CONCEPTUAL COMBINATIONS (INVENTED WORDS) task, in which models must select the most likely completion of a phrase that includes nonce words whose definitions are given. Top: Direct prompting evaluates completion likelihoods normalized over the four answer choices ('people', 'animals', 'birds', 'researchers'). Middle: Chain-of-thought-like or auxiliary knowledge approaches would query a LLM or knowledge base for additional context. This example shows the brittleness of entrusting all 'reasoning' to self-attention in linear text, especially in smaller models, which have a stronger recency bias (Malkin et al., 2022): if we simply list generated examples as the additional context in the prompt, the recency bias causes the LLM to still give a higher probability to 'people' than to 'animals', simply because 'bam' (simple dwelling) examples are given after the 'binne' examples. Bottom: Our ThinkSum approach to this task queries a LLM (GPT-2 XL) to produce sets of examples defining the nonce words, then marginalizes over substitutions of these examples into the target phrase.
Think (fast thinking / association / knowledge retrieval step): creating an association of text spans with sets of strings. This process may involve generation from a language model, as is the case in Fig. 1, where the novel word 'binne' is associated with the set of strings {'cat', 'mink', ...} by prompting GPT-3 with the definition and asking for examples. Alternatively, it may consist solely of a scoring mechanism, resulting in the formation of a matrix of probabilities on which probabilistic inference is performed.

Sum (slow thinking / Summarization / reasoning step): probabilistic inference that aggregates generated strings or probabilities to produce the final answer. Summarization typically involves, and often entirely consists of, summing of probabilities of strings (computed in the Think step), as in Fig. 1, where the final word is assumed to be sampled from a mixture of possible substitutions of 'binne' and 'bam' words into the input.
We discuss different ways to Think and to Sum in §2, but we start with one example, illustrated in Fig. 1 (bottom), motivated by the CONCEPTUAL COMBINATIONS (INVENTED WORDS) task in BIG-bench. In this task, the LLM is provided with the definitions of two invented words and asked to infer the most plausible sentence that uses a combination of the invented words. As the words are not common or consistently used in the training set, the LLM needs to understand and combine the definitions of the invented words to reason about the meaning of the combination. The LLM is queried to produce example instances of the invented words with the help of the definitions. These example instances can be substituted into the query in place of the invented words. By mapping individual spans of the text of interest to sets, we arrive at a mixture model (in this example, a mixture with 25 components for 5 possible replacements of each word), which can be used in the same manner as the original LLM, either to score text or to generate it token by token. When we score all candidate completions using this mixture model and normalize over the four choices, the correct answer, that 'binne bams' are for animals and not people, becomes the most likely.
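As a concrete illustration of this mixture model, here is a minimal sketch (not code released with the paper); it assumes a caller-supplied function `logprob(text)` returning the LLM log-likelihood of a full string, and it ignores the article and syntax normalization discussed later in §2.1.

```python
import itertools
import math
from typing import Callable

def mixture_answer_scores(
    template: str,                    # e.g. "A {binne} {bam} is a place for {answer}"
    binnes: list[str],                # e.g. ["cat", "mink", "ferret", "guinea pig", "rabbit"]
    bams: list[str],                  # e.g. ["hut", "cabin", "cottage", "shelter", "shack"]
    answers: list[str],               # e.g. ["people", "animals", "birds", "researchers"]
    logprob: Callable[[str], float],  # assumed interface: log p_LLM(text)
) -> dict[str, float]:
    """Score each answer under a uniform mixture over all substitutions of the
    nonce words (25 components for 5 x 5 example words), as in Fig. 1 (bottom)."""
    scores = {}
    for answer in answers:
        component_logps = [
            logprob(template.format(binne=b, bam=d, answer=answer))
            for b, d in itertools.product(binnes, bams)
        ]
        # log of the average likelihood over all mixture components
        scores[answer] = math.log(
            sum(math.exp(lp) for lp in component_logps) / len(component_logps)
        )
    return scores

# The final prediction normalizes these scores over the four answer choices and
# takes the argmax, e.g. max(scores, key=scores.get).
```

In practice one would average the component likelihoods with a log-sum-exp for numerical stability; the plain sum above is kept for readability.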
An important difference between our ThinkSum and existing chain-of-thought-like prompt engineering methods (Wei et al., 2022; Kojima et al., 2022) is that our reasoning step is not reduced to a generation problem for the LLM, but is performed as a probabilistic inference external to the LLM. This reduces vulnerability to features of the prompt, such as accidental distraction of the LLM by spurious patterns (see Fig. 1, middle). Instead, we engineer the slow thinking process to make parallel calls to the LLM to query for intermediate information, then possibly perform programmatic recombination of strings (Think). The final reasoning step, in which likelihoods obtained from the LLM for the recombinations derived from earlier steps of the reasoning process are combined to make the final prediction, is left to classical probabilistic reasoning (Sum). In a sense, Sum replaces the self-attention mechanism over linear text, which is used as the sole 'reasoning' mechanism in chain-of-thought-like approaches that expect the intermediate 'thoughts' to take the form of generated tokens intervening between the input and output.
Imposing an alternative reasoning system over an associative 'knee-jerk reaction' system has an analogy with models of human cognitive processes (Tversky and Kahneman, 1974; Kahneman, 2011) that separate System 1 (fast thinking) and System 2 (slow thinking). System 2 acts as a 'controller' that can prime System 1 to appropriately bias its fast thinking. In the context of reasoning with deep learning models, System 2 has been interpreted as operating with sparse concepts that can be described in language (Bengio, 2017; Goyal and Bengio, 2020). Through repeated usage, the functions of System 2 become compressed into System 1 intuitions, in the same manner that iterative 'reasoning' functions of which smaller LLMs are not capable become zero-shot generation capacities for large LLMs. As is the case with humans, there is always the next frontier of problems where a trained model with remarkable 'intuition' needs to be slowed down. The main claim of this paper is that more is possible with LLMs of existing scale when they are used in concert with a wise controller that allows for probabilistic inference.
2 ThinkSum
2.1 How to Think
Here we list examples of the "fast thinking" that precedes the summarization stage.

Elementary string manipulations. Standard ways to turn a question into a prompt that can be given to a LLM for generation or scoring involve choices (e.g., of the prompt format) that can be seen as being made by a controlling agent. The default approach to multiple-choice questions is to write them as Cloze tasks. However, there are nontrivial operations used in inference procedures that sometimes work better, such as the two below (a sketch of both appears after the list):

Order inversion: Exchanging the order of the question and answers, as in Min et al. (2022).

Premise erasure: Deleting a part of the question. Removing a premise with which the answer is expected to have high mutual information is a step in inference procedures that aim to correct for bias towards answers with high unconditional likelihood (Zhao et al., 2021; Holtzman et al., 2021; Malkin et al., 2022).
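A toy sketch of these two string manipulations follows; the function names and the explicit split of a prompt into a premise and a question are our own illustration, not the paper's implementation.

```python
def order_inversion(question: str, answer: str) -> str:
    """Score the question conditioned on the answer instead of the answer
    conditioned on the question (cf. Min et al., 2022)."""
    return f"{answer}\n{question}"

def premise_erasure(prompt: str, premise: str) -> str:
    """Delete a premise so the remaining prompt approximates the answer's
    unconditional likelihood; used later for calibration in the Sum stage."""
    return prompt.replace(premise, "").strip()
```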
Substitution and normalization. An example is shown in Fig. 1. Elements from a set may be substituted in place of 'slot' words in a prompt, such as 'cat' substituted for 'binne' in the prompt "A binne bam is a place for". This operation can be combined with syntax-normalization steps that are reliably achieved by standard NLP tools, such as ensuring subject-verb agreement.
Example and list generation. A LLM can be prompted to generate or score lists of words or phrases. We suggest and experiment with three instances of this (a code sketch of the first and third follows the list):

Example generation: In Fig. 1, the LLM is prompted to turn a definition or characterizing property, such as 'simple dwelling', into a list of examples. This can be achieved with a prompt such as "A bam is a simple dwelling. Examples: 1.". The generated completion can be parsed into a set to be used later in the inference procedure.

List extension: A similar approach can also be used to hallucinate additional possible answers to questions, as we will show in some of the experiments.

List of words: Similar prompts provide an even simpler Think method that we use for scoring, but not generation, in several tasks. Just prompting a LLM with "List of words: $A$, $B$", where $A$ and $B$ are words or phrases, and computing the likelihood of $B$ conditioned on "List of words: $A$," is a good measure of semantic relatedness of $A$ and $B$.
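A sketch of the Example generation and List of words prompts; the parsing heuristics and the `generate`/`logprob` interfaces are assumptions on our part.

```python
import re
from typing import Callable

def generate_examples(definition: str, generate: Callable[[str], str]) -> list[str]:
    """Example generation: prompt with a definition plus 'Examples: 1.' and
    parse the numbered-list completion into a set of example strings."""
    prompt = f"{definition}\nExamples: 1."
    completion = "1. " + generate(prompt)           # re-attach the seeded "1."
    items = re.split(r"\s*\d+\.\s*", completion)    # split on "1.", "2.", ...
    return [item.strip(" ,.\n") for item in items if item.strip()]

def relatedness(a: str, b: str, logprob: Callable[[str, str], float]) -> float:
    """List of words: log p_LLM(B | "List of words: A,") as a semantic
    relatedness score between A and B."""
    return logprob(f"List of words: {a},", f" {b}")
```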
Fact generation. This way of Thinking associates an input word with a set of phrases in a similar manner to generating examples from a definition. It can be achieved with prompts such as "List facts about cats. 1.". The generated facts are good targets for substitutions of other concepts ('dogs', 'galaxies') in place of the concept ('cats') about which facts are generated. A variation on this asks the LLM to generate differences between two concepts, as shown in Fig. 2 (right).
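As a small illustrative sketch, generated facts about one concept can be turned into templates into which other concepts are substituted; the helper below is hypothetical and not from the paper.

```python
import re

def fact_templates(concept: str, facts: list[str]) -> list[str]:
    """Replace the source concept in each generated fact with a slot, so that
    other concepts ('dogs', 'galaxies') can be scored in its place."""
    pattern = re.compile(re.escape(concept), flags=re.IGNORECASE)
    return [pattern.sub("{}", fact) for fact in facts]

# fact_templates("cats", ["Cats have four legs.", "Cats like to nap."])
# -> ["{} have four legs.", "{} like to nap."]
```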
Translation. The LLM can be prompted to convert between different forms of representing the same concept as a sequence of tokens. We use two basic examples of this in experiments:

Translation between languages by prompting the LLM in formats such as "French: J'adore les chats noirs. English:". A very similar approach can be used to convert non-alphabetic symbols, such as emoji, into words with similar meanings.

Converting text to formal (symbolic) structures, like turning a word problem into a collection of mathematical equations.
2.2 How to Sum
Elementary inference. As above, we begin by listing existing standard ways of turning LLM outputs into answers, which we see as trivial cases of aggregation (Sum). A sketch of the second item follows the list.

Majority/minority vote (argmax/argmin): a component of most answer selection procedures.

Ratio of likelihoods: Likelihoods from different variants of the same prompt can be combined by considering their ratio or a more general log-linear or other mixture. For example, this can be done to correct the likelihood of an answer conditioned on a question by its unconditional likelihood, in combination with the Premise erasure operation described above.
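A minimal sketch of the Ratio of likelihoods used together with Premise erasure for calibration; the log-probability interface is assumed, as above.

```python
from typing import Callable

def calibrated_score(
    prompt: str,
    prompt_without_premise: str,
    answer: str,
    logprob: Callable[[str, str], float],  # assumed: log p_LLM(continuation | prompt)
) -> float:
    """Log-ratio of the answer's conditional likelihood to its likelihood under
    the premise-erased prompt (cf. Zhao et al., 2021; Holtzman et al., 2021;
    Malkin et al., 2022)."""
    return logprob(prompt, answer) - logprob(prompt_without_premise, answer)
```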
Mixture (average) aggregation. A collection of prompts can be treated as the components of a mixture model over completions. An example is shown in Fig. 1, where substitutions of a set of words yield 25 different prompts. Likelihoods of the completion over these 25 prompts are averaged.
Product aggregation. We use products of likelihoods in two different ways (a sketch of the second follows):

In a similar way as mixtures, but when the more natural probabilistic model has all elements of a set (of prompts) generating the answer, such as when a description or definition must be satisfied by all concepts in a set.

In a task where we are to determine whether a statement $S$ or its negation $\neg S$ is true, we can compute the likelihood of both $S$ and $\neg S$ being true (as a posterior over the tokens 'True' and 'False' in an appropriate prompt), then compare $p(\mathrm{True} \mid S)\,p(\mathrm{False} \mid \neg S)$ ($S$ is true and $\neg S$ is false) with $p(\mathrm{False} \mid S)\,p(\mathrm{True} \mid \neg S)$ ($S$ is false and $\neg S$ is true).
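A sketch of this second use, assuming a function `p_token(prompt, token)` that returns the normalized probability of 'True' or 'False' as the next token; the True/False prompt wrapper below is our own illustration, not the prompt used in the paper.

```python
from typing import Callable

def statement_is_true(
    statement: str,
    negation: str,
    p_token: Callable[[str, str], float],  # assumed: p(token | prompt)
) -> bool:
    """Product aggregation: compare p(True|S) * p(False|~S)
    against p(False|S) * p(True|~S)."""
    def wrap(s: str) -> str:
        return f"{s}\nTrue or False?\nAnswer:"   # hypothetical prompt format
    score_true = p_token(wrap(statement), "True") * p_token(wrap(negation), "False")
    score_false = p_token(wrap(statement), "False") * p_token(wrap(negation), "True")
    return score_true > score_false
```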
3 Experiments
In this section, we perform case studies on three tasks from the BIG-bench suite to demonstrate the possibilities of the inference approaches discussed in §2. We also experiment with ten other tasks from BIG-bench; the best results are summarized in Table 1, and the methods, grouped by the style of Thinking and Summing, are described in the Appendix (§A).

All details of the tasks can be found in the Appendix (§C). Comparisons to direct prompting and algorithms that append retrieved or generated tokens to the prompt are given in §3.4.
3.1 Conceptual combinations: Invented words

In INVENTED WORDS, two nonce words $x_1, x_2$ are defined and the correct statement must be chosen out of a set of statements $S = \{s_j\}$ that begin with (possibly inflected forms of) "$x_1\,x_2$" (Fig. 1).

We use an Example generation prompt to obtain a set of example words fitting the definitions of $x_1$ and $x_2$. We thus obtain sets $S_1$ and $S_2$ of words that can be substituted for $x_1$ and $x_2$, respectively.
We treat each statement $s_j$ as a template into which words $w_1 \in S_1$ and $w_2 \in S_2$ can be substituted by replacing $x_i$ with $w_i$ and normalizing the syntax to ensure subject-verb agreement. Denoting by $s_j[w_1, w_2]$ such a substitution, we form a vector of probabilities $p_j$ by scoring the Substitution of each possible pair of words into each statement, performing Mixture aggregation, and considering the Ratio of likelihoods with the template without substitution:
$$p_j = \frac{\frac{1}{|S_1|\,|S_2|} \sum_{w_1 \in S_1,\, w_2 \in S_2} p_{\mathrm{LLM}}(s_j[w_1, w_2])}{p_{\mathrm{LLM}}(s_j)}.$$
The statement $s_j$ with the highest likelihood under this normalized mixture, $\arg\max_j p_j$, is selected.
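A direct implementation sketch of this score, assuming a `prob(text)` function returning $p_{\mathrm{LLM}}$ of a full string and omitting the subject-verb agreement normalization; the slot-based template format is our own convention.

```python
import itertools
from typing import Callable

def invented_words_scores(
    templates: list[str],            # statements with slots, e.g. "A {x1} {x2} is a place for animals."
    x1: str, x2: str,                # the nonce words themselves, e.g. "binne", "bam"
    s1: list[str], s2: list[str],    # example words generated for x1 and x2
    prob: Callable[[str], float],    # assumed: p_LLM(text)
) -> list[float]:
    """p_j = mean over (w1, w2) of p_LLM(s_j[w1, w2]), divided by p_LLM(s_j)."""
    scores = []
    for t in templates:
        mixture = sum(
            prob(t.format(x1=w1, x2=w2)) for w1, w2 in itertools.product(s1, s2)
        ) / (len(s1) * len(s2))
        baseline = prob(t.format(x1=x1, x2=x2))  # the original statement, unsubstituted
        scores.append(mixture / baseline)
    return scores

# The answer is the statement with the largest score:
# templates[scores.index(max(scores))].
```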
3.2 Odd one out
We examine possible Think and Sum approaches in depth on the ODD ONE OUT task, in which the word in a set $W = \{w_i\}$ that is least semantically related to the others must be chosen (e.g., "Pick the odd word out: glass, head, arm, leg, hand, foot").
Task (section)                        | Avg. H | GPT-3 (davinci) n=0 | n=1  | n=2  | n=3  | ThinkSum GPT-3 | ThinkSum InstructGPT | ThinkSum GPT-2 XL
INVENTED WORDS (§3.1)                 | N/A    | 0.29 | 0.14 | 0.14 | 0.21 | 0.64 | 0.71 | 0.29
ODD ONE OUT (§3.2)                    | 0.80   | 0.27 | 0.20 | 0.23 | 0.23 | 0.80 | 0.84 | 0.71
FIVE OBJECTS (§3.3)                   | N/A    | 0.23 | 0.29 | 0.28 | 0.32 | 0.77 | –    | –
SPORTS UNDERSTANDING (§A.1)           | 0.71   | 0.50 | 0.50 | 0.50 | 0.50 | 0.71 | 0.74 | 0.54
KNOWN UNKNOWNS (§A.1)                 | 0.80   | 0.61 | 0.52 | 0.48 | 0.50 | 0.54 | 0.76 | –
MISCONCEPTIONS RUSSIAN (§A.2)         | 0.65   | 0.33 | 0.33 | 0.41 | 0.35 | 0.70 | 0.61 | –
EMOJI MOVIE (§A.2)                    | 0.93   | 0.12 | 0.18 | 0.12 | 0.19 | 0.80 | 0.75 | –
PARSINLU READING COMPREHENSION (§A.2) | 0.02   | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | –    | –
PHRASE RELATEDNESS (§A.3)             | 0.74   | 0.37 | 0.42 | 0.52 | 0.59 | 0.85 | 0.87 | 0.79
CODENAMES (§A.3)                      | 0.18   | 0.01 | 0.11 | 0.16 | 0.19 | 0.37 | 0.41 | 0.36
NOVEL CONCEPTS (§A.4)                 | 0.67   | 0.47 | 0.47 | 0.56 | 0.56 | 0.72 | 0.75 | 0.50
CODE LINE DESCRIPTION (§A.4)          | 0.60   | 0.32 | 0.32 | 0.28 | 0.32 | 0.83 | 0.90 | 0.77
LANGUAGE IDENTIFICATION (§A.5)        | 0.16   | 0.16 | 0.12 | 0.13 | 0.11 | 0.57 | –    | 0.30

Table 1: Standard metric (BLEU for CODENAMES, accuracy for other tasks) for GPT-3 175B (davinci) with $n$-shot direct prompting and for ThinkSum with GPT-3 175B (davinci), InstructGPT, and GPT-2 XL on BIG-bench tasks. 'Avg. H' denotes average human performance. A '–' indicates that the model and task combination was not evaluated because the model does not reliably execute the appropriate Think prompt. We did not evaluate InstructGPT on LANGUAGE IDENTIFICATION due to the large dataset size and API quota.
Figure 2: ODD ONE OUT. Left: Performance of GPT-3 ($n$-shot, $n = 0, 1, 2, 3$), auxiliary knowledge, and ThinkSum with various model sizes. Middle: Auxiliary knowledge vs. ThinkSum with varying number of differences. Right: Prompt used to generate knowledge statements.
List of words. We form a semantic relatedness matrix $P_{ij}$ by querying the LLM with a List of words Think prompt for each pair of indices $i, j$:
$$P_{ij} = p_{\mathrm{LLM}}(w_j \mid \text{``List of words: } w_i\text{,''}).$$
This matrix is aggregated by averaging over $j$ (in the log domain) and selecting the $i$ with the lowest average, i.e., the least likelihood of being generated by a product mixture of all words in the set: $i^\ast = \arg\min_i \prod_j P_{ij}$. This is a case of Product aggregation.
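A sketch of this pipeline; the log-probability interface is assumed, and since the treatment of the diagonal terms $P_{ii}$ is not specified above, they are skipped here as our own choice.

```python
from typing import Callable

def odd_one_out(
    words: list[str],
    logprob: Callable[[str, str], float],  # assumed: log p_LLM(continuation | prompt)
) -> str:
    """Form P_ij = p_LLM(w_j | "List of words: w_i,") and return the word whose
    row has the smallest log-domain sum, i.e. the least related word."""
    def row_score(i: int) -> float:
        return sum(
            logprob(f"List of words: {words[i]},", f" {words[j]}")
            for j in range(len(words)) if j != i   # diagonal skipped (our choice)
        )
    best_i = min(range(len(words)), key=row_score)
    return words[best_i]

# e.g. odd_one_out(["glass", "head", "arm", "leg", "hand", "foot"], logprob)
```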
Because this approach is the most successful with all model sizes we experimented with, its performance is reported in Table 1. Remarkably, near-average-human accuracy is maintained for all model sizes from GPT-2 Small to the largest GPT-3 model (Fig. 2, left).
Fact generation. As an alternative approach, we use a Fact generation prompt. An effective way to mine facts for semantic relatedness tasks is to consider two items in the same context in order to get relevant facts regarding how items are related to each other (prompt in Fig. 2, right). The demonstration used in the prompt ensures that the LLM generates statements in an expected format, which can be parsed and used for probability computation later. Using this prompt, we obtain a collection of statements $S = \{s_i\}$ about items $w_j$. We treat each generated $s_i$ as a template into which different words $w$ can be substituted and denote by $s_i[w]$ the Substitution of word $w$ into template $s_i$. We then form a $|S| \times |W|$ matrix $P_{ij}$, defined