Neural Theory-of-Mind?
On the Limits of Social Intelligence in Large LMs
Maarten Sap, Ronan Le Bras, Daniel Fried, Yejin Choi
Allen Institute for AI, Seattle, WA, USA
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA
Paul G. Allen School of Computer Science, University of Washington, Seattle, WA, USA
maartensap@cmu.edu
Abstract
Social intelligence and Theory of Mind (TOM), i.e., the ability to reason about the different mental states, intents, and reactions of all people involved, allow humans to effectively navigate and understand everyday social interactions. As NLP systems are used in increasingly complex social situations, their ability to grasp social dynamics becomes crucial.

In this work, we examine the open question of social intelligence and Theory of Mind in modern NLP systems from an empirical and theory-based perspective. We show that one of today's largest language models (GPT-3; Brown et al., 2020) lacks this kind of social intelligence out of the box, using two tasks: SOCIALIQA (Sap et al., 2019b), which measures models' ability to understand intents and reactions of participants of social interactions, and TOMI (Le et al., 2019), which measures whether models can infer mental states and realities of participants of situations.

Our results show that models struggle substantially at these Theory of Mind tasks, with well-below-human accuracies of 55% and 60% on SOCIALIQA and TOMI, respectively. To conclude, we draw on theories from pragmatics to contextualize this shortcoming of large language models, by examining the limitations stemming from their data, neural architecture, and training paradigms. Challenging the prevalent narrative that only scale is needed, we posit that person-centric NLP approaches might be more effective towards neural Theory of Mind.
1 Introduction
With the growing prevalence of AI and NLP systems in everyday social interactions, the need for AI systems with social intelligence and Theory of Mind (TOM), i.e., the ability to infer and reason about the intents, feelings, and mental states of others, becomes increasingly evident (Pereira et al., 2016; Langley et al., 2022). For humans, Theory of Mind is a crucial component that enables us to interact and communicate effectively with each other (Premack and Woodruff, 1978; Apperly, 2010). It allows us, for example, to infer that someone likely feels boastful instead of ashamed after winning a wrestling match (Fig. 1; top). In addition, TOM also enables us to reason about people's mental realities, e.g., if someone was out of the room while a pen was moved, she will likely search for the pen where she last saw it instead of where it was moved to (Fig. 1; bottom).

[Figure 1: Measuring Neural Theory of Mind. Top: social commonsense and emotional intelligence (SOCIALIQA), e.g., "Although Taylor was older and stronger, they lost to Alex in the wrestling match. How would Alex feel as a result?" (ashamed / boastful). Bottom: reasoning about mental states and realities (TOMI), e.g., "James and Abby are in the bedroom. Abby put the pen in the desk drawer. Abby leaves the bedroom. James moves the pen into the bag. Where does James think Abby will look for the pen?" (drawer / bag). Caption: Theory of Mind is the ability of humans to reason about the intents, reactions, and mental states of others. We assess these abilities in LLMs through two question-answering tasks that measure social commonsense and emotional intelligence (SOCIALIQA; top) and reasoning about people's mental states and realities (TOMI; bottom), finding that GPT-3 struggles on both tasks. We discuss why that may be, drawing from theories of the pragmatics of language.]
While humans develop it naturally, TOM and social intelligence remain elusive goals for modern AI systems (Choi, 2022), including large neural language models (LLMs). With advances in scaling the sizes of models and datasets, these LLMs have proven very impressive at generating human-like language for conversational, summarization, or sentence continuation settings, often with zero to few examples to learn from (Brown et al., 2020; Clark et al., 2021; Chowdhery et al., 2022). However, increasing scrutiny has shed light on the shortcomings of these LLMs, showing that they often fall prey to spurious correlational patterns instead of displaying higher-order reasoning (Elkins and Chun, 2020; Dale, 2021; Marcus, 2022).
In line with EMNLP 2022's theme, we examine the open research question of whether and how much LLMs, which are the backbone of most modern NLP systems, exhibit social intelligence and TOM abilities. Using some of the largest English models in existence (GPT-3; Brown et al., 2020), we demonstrate that out-of-the-box LLMs struggle at two types of reasoning abilities that are prerequisites for Theory of Mind (shown in Fig. 1). We argue that these reasoning abilities are necessary but not sufficient for Theory of Mind, and that results from these larger models likely provide an upper bound on what equivalent-but-smaller models are capable of.
We first assess whether LLMs can reason about social commonsense and emotional intelligence with respect to social interactions (§3), using the SOCIALIQA benchmark (Sap et al., 2019b) illustrated in Fig. 1 (top). Results show our best performing few-shot GPT-3 setup achieving only 55% accuracy, lagging >30% behind human performance. Furthermore, social reasoning about the protagonists of situations is easier for GPT-3 (5-15% absolute difference) compared to reasoning about other secondary participants.
Second, we measure LLMs' ability to understand other people's mental states and realities in short stories (§4). We use the TOMI QA benchmark (illustrated in Fig. 1; bottom; Le et al., 2019), which was inspired by the classic Sally-Anne False Belief Theory of Mind test (Baron-Cohen et al., 1985). Here, our results show that GPT-3 models peak at 60% accuracy on questions about participants' mental states, compared to 90-100% on factual questions.
Our novel insights show that reasoning about social situations and false beliefs still presents a significant challenge for large language models, despite their seemingly impressive performance on tasks that could require social intelligence (e.g., story generation, dialogues). In §5, we first examine these shortcomings; drawing on theories of the pragmatics of language, we speculate that the type of texts in LLMs' training datasets could substantially limit learning social intelligence. Then, we outline some possible future directions towards socially aware LLMs, reflecting on the feasibility of interactional data selection, person-centric inductive biases, and interaction-based language learning. Our findings suggest that only increasing the scale of LLMs is likely not the most effective way to create socially aware AI systems, challenging a prevalent narrative in AI research (Narang and Chowdhery, 2022).
2 Theory of Mind & Large LMs
Why do LLMs need Theory of Mind? Social intelligence, Theory of Mind, and commonsense reasoning have been a longstanding but elusive goal of artificial intelligence for decades (Gunning, 2018; Choi, 2022). These reasoning abilities are becoming increasingly necessary as AI assistants are used in situations that require social intelligence and Theory of Mind in order to operate effectively (Wang et al., 2007; Dhelim et al., 2021; Langley et al., 2022). For example, new technologies are emerging where AI is used to interact and adapt to users (Bickmore and Picard, 2005; Jaques, 2019), e.g., voice assistants and tutoring systems; or where AI helps enhance communication between multiple users, e.g., email autocomplete (Chen et al., 2019), AI-assisted counseling (Kearns et al., 2020; Allen, 2020; Sharma et al., 2021), or facilitated discussion (Rosé et al., 2014).

As we move beyond just asking single-turn questions to social and interactive AI assistants, higher-order reasoning becomes necessary (McDonald and Pearson, 2019). For example, AI systems should be capable of more nuanced understanding, such as ensuring an alarm is on if someone has a job interview the next morning (Dhelim et al., 2021), knowing to call for help when an elderly person falls (Pollack, 2005), inferring personality and intentions in dialogues (Mairesse et al., 2007; Wang et al., 2019), reasoning about public commitments (Asher and Lascarides, 2013), predicting emotional and affective states (Litman and Forbes-Riley, 2004; Jaques et al., 2020), and incorporating empathy, interlocutor perspective, and social intelligence (Kearns et al., 2020; Sharma et al., 2021).

[Figure 2: Accuracy on the SOCIALIQA dev. set, broken down by LLM model type and size (random baseline, GPT-3-ADA, GPT-3-CURIE, GPT-3-DAVINCI, and human performance), as well as number of few-shot examples (k).]
What is Theory of Mind? Theory of Mind (TOM) describes the ability that we, as humans, have to ascribe and infer the mental states of others, and to predict which likely actions they are going to take (Apperly, 2010).[1] This ability is closely related to (interpersonal) social intelligence (Ganaie and Mudasir, 2015), which allows us to navigate and understand social situations ranging from simple everyday interactions to complex negotiations (Gardner et al., 1995).

Interestingly, the development of Theory of Mind and language seems to happen around similar ages in children (Sperber and Wilson, 1986; Wellman, 1992; Miller, 2006; Tauzin and Gergely, 2018).[2] Theories of the pragmatics of language and communication can frame our understanding of this link (Rubio-Fernandez, 2021), positing that one needs to reason about an interlocutor's mental state (TOM) to effectively communicate and understand language (Grice, 1975; Fernández, 2013; Goodman and Frank, 2016; Enrici et al., 2019).[3]
[1] While Theory of Mind is well developed in most adults (Ganaie and Mudasir, 2015), reasoning and inference capabilities can be influenced by age, culture, neurodiversity, or developmental disorders (Korkmaz, 2011).

[2] The direction of the TOM-language association is still debated (de Villiers, 2007). Some researchers believe language development enables TOM-like abilities (Pyers and Senghas, 2009; Rubio-Fernandez, 2021). On the other hand, some argue that language develops after TOM, since preverbal infants could already possess some level of TOM-like abilities (Onishi and Baillargeon, 2005; Southgate and Vernetti, 2014; Poulin-Dubois and Yott, 2018).

[3] Most cognitive studies on this subject focus on the English language, which is not representative of the wide variation of language structures, and thus limits the cognitive conclusions one can draw about the link between language and Theory of Mind (Blasi et al., 2022).
[Figure 3: Comparing the accuracy of GPT-3-DAVINCI (35-shot) on SOCIALIQA when the reasoning is about the main agent of the situation versus others, across the Effect, React, and Want reasoning dimensions.]
3 SOCIALIQA: Do LLMs have Social
Intelligence and Social Commonsense?
A crucial component of Theory of Mind is the ability to reason about the intents and reactions of participants of social interactions. To measure this, we use the dev. set of the SOCIALIQA benchmark (Sap et al., 2019b), which was designed to probe social and emotional intelligence in various everyday situations. This benchmark covers questions about nine social reasoning dimensions, drawn from the ATOMIC knowledge graph (Sap et al., 2019a).

SOCIALIQA instances consist of a context, a question, and three answer choices, written in English. Each question relates to a specific reasoning dimension from ATOMIC: six dimensions focus on the pre- and post-conditions of the agent or protagonist of the situation (e.g., needs, intents, reactions, next actions), and three dimensions focus on the post-conditions of other participants involved in the situation (reaction, next action, effect). In total, there are 1954 three-way QA tuples; see Tab. 1 for examples, and Tab. 3 in Appendix A for per-dimension counts.
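To make this task format concrete, the sketch below shows one way a SOCIALIQA instance and its ATOMIC dimension could be represented in code; the class, field names, and the dimension-to-focus mapping are illustrative assumptions rather than the benchmark's actual schema.

```python
# Illustrative sketch only: field names and the dimension-to-focus mapping are
# assumptions for exposition, not the official SOCIALIQA/ATOMIC schema.
from dataclasses import dataclass

# Six ATOMIC dimensions concern the agent; three concern other participants.
AGENT_DIMS = {"xIntent", "xNeed", "xAttr", "xEffect", "xReact", "xWant"}
OTHER_DIMS = {"oEffect", "oReact", "oWant"}

@dataclass
class SocialIQaInstance:
    context: str    # e.g., "Kai gave Ash some bread so they could make a sandwich."
    question: str   # e.g., "How would Kai feel afterwards?"
    choices: tuple  # exactly three answer strings
    dimension: str  # one of the nine ATOMIC dimensions above

    @property
    def focus(self) -> str:
        """Whether the question asks about the situation's agent or about others."""
        return "Agent" if self.dimension in AGENT_DIMS else "Others"
```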
3.1 Probing LLMs with SOCIALIQA

To probe our language models, we use a k-shot language probing setup, following Brown et al. (2020). We select the answer that has the highest likelihood under the language model conditioned on the context and question, as described in Appendix C. To test the limits of what the models can do, we select k examples that have the same ATOMIC reasoning dimension as the question at hand, varying k from 0 to 35 in increments of 5. We use three GPT-3 model sizes: GPT-3-ADA (smallest), and GPT-3-CURIE and GPT-3-DAVINCI (the two largest).
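As a rough illustration of this likelihood-based selection, the sketch below scores each answer choice by the total log-probability the model assigns to it given the few-shot examples, context, and question, and picks the highest-scoring one. It uses GPT-2 from Hugging Face Transformers as a freely available stand-in for GPT-3, and the prompt format and helper names are assumptions, not the paper's exact setup from Appendix C.

```python
# Minimal sketch of k-shot likelihood probing with a public causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of token log-probabilities of `answer` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the answer tokens; assumes the prompt's tokenization is a
    # prefix of the full sequence's (usually true when the answer is appended
    # with a leading space under GPT-2's BPE).
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()

def pick_answer(few_shot_examples, context, question, choices):
    """Return the answer choice with the highest likelihood under the LM."""
    prompt = "\n\n".join(list(few_shot_examples) + [f"{context} {question}"])
    scores = [answer_logprob(prompt, choice) for choice in choices]
    return choices[scores.index(max(scores))]
```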
(a) [Agent] Remy was working late in his office trying to catch up. He had a big stack of papers. What does Remy need to do before this?
    Choices: Needed to be behind | Be more efficient | Finish his work

(b) [Agent] Casey wrapped Sasha's hands around him because they are in a romantic relationship. How would you describe Casey?
    Choices: Very loving towards Sasha | Wanted | Being kept warm by Sasha

(c) [Agent] Tracy held a baby for 9 months and then gave birth to Addison. What will happen to Tracy?
    Choices: Throw her baby at the wall | Cry | Take care of her baby

(d) [Agent] Kai gave Ash some bread so they could make a sandwich. How would Kai feel afterwards?
    Choices: Glad they helped | Good they get something to eat | Appreciative

(e) [Others] Aubrey was making extra money by babysitting Tracy's kids for the summer. What will Tracy want to do next?
    Choices: Save up for a vacation | Let Aubrey know that they are appreciated | Pay off her college tuition

(f) [Others] The people bullied Sasha all her life. But Sasha got revenge on the people. What will the people want to do next?
    Choices: Do whatever Sasha says | Get even | Flee from Sasha

(g) [Others] After everyone finished their food they were going to go to a party, so Kai decided to finish his food first. What will others want to do next?
    Choices: Eat their food quickly | Throw their food away | Go back for a second serving

(h) [Agent] Aubrey fed Tracy's kids lunch today when Tracy had to go to work. What will happen to Aubrey?
    Choices: Be grateful | Get paid by Tracy | Get yelled at by Tracy

(i) [Others] Sasha was the most popular girl in school when she accepted Jordan's invitation to go on a date. What will Jordan want to do next?
    Choices: Plan a best friends outing with Sasha | Plan a romantic evening with Sasha | Go on a date with Valerie

Table 1: Examples of SOCIALIQA questions, which person the questions focus on (Agent, Others), and their three answer choices; the paper additionally marks the human gold answer and the GPT-3-DAVINCI prediction for each example.
3.2 SOCIALIQA Results
Shown in Fig. 2, GPT-3 models perform substantially worse than humans (>30% less) on SOCIALIQA,[4] and also worse than models finetuned on the SOCIALIQA training set (>20%; Lourie et al., 2021).[5] Although it is not surprising that GPT-3-DAVINCI reaches higher accuracies than GPT-3-ADA and GPT-3-CURIE, the gains are small, which suggests that increasing model size might not be enough to reach human-level accuracy. These findings are in line with recent BIG-Bench results on SOCIALIQA with the BIG-G (128B parameters; Srivastava et al., 2022) and PaLM (353B parameters; Chowdhery et al., 2022) LLMs, which lag behind humans with 45% and 73% accuracy, respectively (see Fig. 7 in Appendix A.2).

[4] We find similar results when using INSTRUCTGPT (Ouyang et al., 2022) instead of GPT-3-DAVINCI.

[5] Lourie et al. (2021) achieves 83% on the test set, as shown on the AI2 SOCIALIQA leaderboard.
Focusing on GPT-3-DAVINCI, while increasing the number of examples k improves performance, the differences are marginal after k=10 examples (only a 1% increase from 10 to 35 examples). This suggests that performance either plateaus or follows a logarithmic relationship with an increasing number of conditioning examples.
Finally, we examine the differences in GPT-3-DAVINCI with respect to which participant is the focus. Shown in Fig. 3, we find that GPT-3-DAVINCI performs consistently better on agent-centric questions compared to other-oriented questions. Shown in the example predictions in Tab. 1, GPT-3-DAVINCI often confuses which participant is being asked about. In example (e), after Aubrey babysat for Tracy, GPT-3-DAVINCI fails to predict that Tracy will likely want to "let Aubrey know they are appreciated," and instead mistakenly predicts that Tracy will want to "save up for vacation," which is what Aubrey would likely do. GPT-3-DAVINCI displays a similar participant confusion in example (f) in Tab. 1.
[Figure 4: Accuracy on the TOMI dev. set MIND questions for varying sizes of GPT-3 (random baseline, ADA, CURIE, DAVINCI) and varying numbers of examples (k).]
4 TOMI: Can LLMs Reason about
Mental States and Realities?
Another key component of Theory of Mind is the ability to reason about the mental states and realities of others, recognizing that they may be different from our own mental states. As a measure of this ability in humans, psychologists developed the Sally-Anne false-belief test (Wimmer and Perner, 1983), in which two people (Sally and Anne) are together in a room with a ball, a basket, and a box, and while Sally is away, Anne moves the ball from the basket to the box. When asked where Sally will look for her ball, Theory of Mind allows us to infer that Sally will look in the basket (where she left the ball), instead of in the box (where the ball is, unbeknownst to Sally).
To measure the false-belief abilities of LLMs, we use the TOMI QA dataset of English Sally-Anne-like stories and questions (Le et al., 2019).[6] TOMI stories were created using a stochastic rule-based algorithm that samples two participants, an object of interest, and a set of locations or containers, and weaves together a story that involves an object being moved (see Tab. 2). All questions have two possible answers: the original object location and the final object location.

[6] TOMI is a more challenging version of the rule-based datasets by Nematzadeh et al. (2018) and Grant et al. (2017), as it contains randomly inserted distractor actions that prevent trivial reverse engineering.
[Figure 5: Accuracy of GPT-3-DAVINCI on TOMI by number of examples (k) and by reasoning type (FACT vs. MIND; MIND-TB vs. MIND-FB).]
We investigate how LLMs answer the TOMI story-question pairs, distinguishing between questions about factual object locations (FACT) and questions about where participants think objects are located (i.e., their mental states; MIND). The FACT questions ask either about the object's original (FACT-MEM) or final (FACT-REAL) location. The MIND questions cover first-order beliefs (e.g., "where will Abby look for the object?"; MIND-1st) and second-order beliefs (e.g., "where does James think that Abby will look for the object?"; MIND-2nd). We further distinguish the MIND questions between true belief (TB) and false belief (FB), i.e., stories where a participant was present or absent when an object was moved, respectively.

Importantly, answering the MIND questions requires Theory of Mind and reasoning about the realities and mental states of participants, regardless of the true- or false-belief setting, whereas FACT questions do not require such TOM. There are a total of 1861 two-way QA pairs in our TOMI probe set, with 519 FACT and 1342 MIND questions (see Tab. 4 in Appendix B for more detailed counts).
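To ground these question types, here is a toy TOMI-style item in the format one might store it in, based on the false-belief example from Fig. 1; the dictionary keys are illustrative assumptions rather than the dataset's actual fields.

```python
# Toy second-order false-belief item, mirroring the Fig. 1 example.
# Keys such as "reasoning_type" and "belief" are illustrative assumptions.
tomi_item = {
    "story": [
        "James and Abby are in the bedroom.",
        "Abby put the pen in the desk drawer.",
        "Abby leaves the bedroom.",
        "James moves the pen into the bag.",
    ],
    "question": "Where does James think that Abby will look for the pen?",
    "choices": ["drawer", "bag"],  # original vs. final object location
    "reasoning_type": "MIND-2nd",  # second-order belief question
    "belief": "false",             # Abby was absent when the pen was moved
    "answer": "drawer",            # Abby's belief is outdated, and James knows it
}
```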
4.1 Probing LLMs with TOMI
We use the k-shot probing setup to test this TOM component in LLMs, with k ∈ {2, 4, 8, 16, 24}. We select k examples of the same reasoning type (i.e., FACT-MEM, MIND-1st, etc.), ensuring a 50-50 split between true- and false-belief examples for the MIND questions. As before, we test GPT-3-ADA, GPT-3-CURIE, and GPT-3-DAVINCI.
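A minimal sketch of this example-selection step is shown below, assuming each candidate example carries its reasoning type and a true/false-belief label; the field names and sampling details are assumptions for illustration, not the paper's released code.

```python
# Illustrative sketch: pick k few-shot examples of the same reasoning type,
# with a 50-50 true-/false-belief split for MIND questions. Field names
# ("reasoning_type", "belief") are assumptions, not the ToMi release's schema.
import random

def select_few_shot(pool: list, query_type: str, k: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    same_type = [ex for ex in pool if ex["reasoning_type"] == query_type]
    if query_type.startswith("MIND"):
        true_belief = [ex for ex in same_type if ex["belief"] == "true"]
        false_belief = [ex for ex in same_type if ex["belief"] == "false"]
        picked = rng.sample(true_belief, k // 2) + rng.sample(false_belief, k - k // 2)
    else:
        picked = rng.sample(same_type, k)
    rng.shuffle(picked)  # avoid a fixed true-then-false ordering in the prompt
    return picked
```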
4.2 TOMI Results
Shown in Fig. 4, our results indicate that GPT-3 models struggle substantially with the TOMI questions related to mental states (MIND), reaching 60% accuracy in the best setup. As expected, the best performance is reached with GPT-3-DAVINCI compared to smaller models, which do not surpass