where she last saw it instead of where it was moved
to (Fig. 1; bottom).
While humans develop it naturally, TOM and
social intelligence remain elusive goals for modern
AI systems (Choi, 2022), including large neural
language models (LLMs). With advances in scaling
model and dataset sizes, these LLMs have proven
impressively capable of generating human-like language
in conversational, summarization, and sentence-continuation
settings, often from only zero to a few examples
(Brown et al., 2020; Clark et al., 2021; Chowdhery et al., 2022).
However, increasing scrutiny has shed light on the shortcomings
of these LLMs, showing that they often fall prey to spurious
correlational patterns rather than displaying higher-order
reasoning (Elkins and Chun, 2020; Dale, 2021; Marcus, 2022).
In line with EMNLP 2022’s theme, we examine
the open research question of whether and to what
extent LLMs—which are the backbone of most
modern NLP systems—exhibit social intelligence
and TOM abilities. Using some of the largest English
models in existence (GPT-3; Brown et al., 2020),
we demonstrate that out-of-the-box LLMs struggle
at two types of reasoning abilities that are requisite
for Theory of Mind (shown in Fig. 1). We argue that
these reasoning abilities are necessary but not sufficient
for Theory of Mind, and that the performance of larger
models likely provides an upper bound on what equivalent
but smaller models are capable of.
We first assess whether LLMs can reason about
social commonsense and emotional intelligence
with respect to social interactions (§3), using the
SOCIALIQA benchmark (Sap et al., 2019b) illustrated
in Fig. 1 (top). Results show our best-performing
few-shot GPT-3 setup achieving only 55% accuracy,
lagging >30% behind human performance.
Furthermore, social reasoning about the protagonists
of situations is easier for GPT-3 (5–15% absolute
difference) compared to reasoning about other
secondary participants.
Second, we measure LLMs’ ability to understand
other people’s mental states and realities in
short stories (§4). We use the TOMI QA benchmark
(illustrated in Fig. 1, bottom; Le et al., 2019),
which was inspired by the classic Sally-Anne False
Belief Theory of Mind test (Baron-Cohen et al., 1985).
Here, our results show that GPT-3 models
peak at 60% accuracy on questions about participants’
mental states, compared to 90–100% on
factual questions.
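To make this probing setup concrete, the sketch below shows one common way to run such few-shot multiple-choice evaluations: a handful of labeled examples are concatenated into a prompt, the model’s continuation for each held-out item is parsed into an answer letter, and accuracy is computed against the gold labels. The `complete` stub, the `format_item` and `few_shot_accuracy` helpers, and the prompt template are illustrative assumptions for exposition, not the exact prompts or API calls used in our experiments.

```python
# Minimal sketch of few-shot multiple-choice probing of an LLM.
# `complete` is a hypothetical stand-in for an LLM completion call
# (e.g., a GPT-3-style API); the item fields and prompt template
# below are illustrative, not the exact setup used in our experiments.

from typing import Callable, Dict, List

def format_item(item: Dict, include_answer: bool) -> str:
    # Render one QA item as "Context / Question / (a)..(c) / Answer:".
    choices = "\n".join(
        f"({letter}) {text}" for letter, text in zip("abc", item["choices"])
    )
    answer = f" ({item['answer']})" if include_answer else ""
    return (
        f"Context: {item['context']}\n"
        f"Question: {item['question']}\n"
        f"{choices}\n"
        f"Answer:{answer}"
    )

def few_shot_accuracy(
    complete: Callable[[str], str],  # hypothetical LLM call: prompt -> continuation
    train_items: List[Dict],         # the k labeled in-context examples
    test_items: List[Dict],
) -> float:
    # Build the shared few-shot header once, then append each test item.
    header = "\n\n".join(format_item(x, include_answer=True) for x in train_items)
    correct = 0
    for item in test_items:
        prompt = f"{header}\n\n{format_item(item, include_answer=False)}"
        # Expect the model to continue with something like " (b)".
        prediction = complete(prompt).strip().lstrip("(")[:1].lower()
        correct += prediction == item["answer"]
    return correct / len(test_items)
```

An alternative, also common in practice, is to score each answer choice by its likelihood under the model rather than parsing generated text; either strategy fits the few-shot setup described above.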
Our novel insights show that reasoning about
social situations and false beliefs still presents a
significant challenge for large language models,
despite their seemingly impressive performance on
tasks that could require social intelligence (e.g.,
story generation, dialogues). In §5, we first examine
these shortcomings; drawing on theories of the
pragmatics of language, we speculate that the type
of texts in LLMs’ training datasets could substantially
limit their ability to learn social intelligence. Then,
we outline possible future directions towards socially
aware LLMs, reflecting on the feasibility of interactional
data selection, person-centric inductive biases, and
interaction-based language learning. Our findings suggest
that simply increasing the scale of LLMs is likely not
the most effective way to create socially aware AI systems,
challenging a prevalent narrative in AI research
(Narang and Chowdhery, 2022).
2 Theory of Mind & Large LMs
Why do LLMs need Theory of Mind?
Social intelligence, Theory of Mind, and commonsense
reasoning have been longstanding but elusive goals of
artificial intelligence (Gunning, 2018; Choi, 2022).
These reasoning abilities are becoming increasingly
necessary as AI assistants are used in situations that
require social intelligence and Theory of Mind to
operate effectively (Wang et al., 2007; Dhelim et al., 2021;
Langley et al., 2022). For example, new technologies
are emerging in which AI is used to interact with and
adapt to users (Bickmore and Picard, 2005; Jaques, 2019),
e.g., voice assistants and tutoring systems, or in which
AI helps enhance communication between multiple users,
e.g., email autocomplete (Chen et al., 2019), AI-assisted
counseling (Kearns et al., 2020; Allen, 2020; Sharma et al.,
2021), or facilitated discussion (Rosé et al., 2014).
As we move beyond just asking single-turn questions
to social and interactive AI assistants, higher-order
reasoning becomes necessary (McDonald and Pearson, 2019).
For example, AI systems should be capable of more nuanced
understanding, such as ensuring an alarm is on if someone
has a job interview the next morning (Dhelim et al., 2021),
knowing to call for help when an elderly person falls
(Pollack, 2005), inferring personality and intentions in
dialogues (Mairesse et al., 2007; Wang et al., 2019),
reasoning about public commitments (Asher and Lascarides,
2013), predicting