where she last saw it instead of where it was moved
to (Fig. 1; bottom).
While humans develop it naturally, TOM and
social intelligence remain elusive goals for modern
AI systems (Choi, 2022), including large neural
language models (LLMs). With advances in scaling
model and dataset sizes, these LLMs have proven
impressively capable of generating human-like language
in conversational, summarization, and sentence-continuation
settings, often from only zero to a few examples
(Brown et al., 2020; Clark et al., 2021; Chowdhery et al., 2022).
However, increasing scrutiny has shed light on the shortcomings
of these LLMs, showing that they often fall prey to spurious
correlational patterns rather than displaying higher-order
reasoning (Elkins and Chun, 2020; Dale, 2021; Marcus, 2022).
In line with EMNLP 2022’s theme, we examine
the open research question of whether and to what
extent LLMs—which are the backbone of most
modern NLP systems—exhibit social intelligence
and TOM abilities. Using some of the largest English
models in existence (GPT-3; Brown et al., 2020),
we demonstrate that out-of-the-box LLMs struggle
at two types of reasoning abilities that are requisite
for Theory of Mind (shown in Fig. 1). We argue that
these reasoning abilities are necessary but not sufficient
for Theory of Mind, and that the performance of larger
models likely provides an upper bound on what equivalent
but smaller models are capable of.
We first assess whether LLMs can reason about
social commonsense and emotional intelligence
with respect to social interactions (§3), using the
SOCIALIQA benchmark (Sap et al., 2019b) illustrated
in Fig. 1 (top). Results show our best-performing
few-shot GPT-3 setup achieving only 55% accuracy,
lagging >30% behind human performance.
Furthermore, social reasoning about the protagonists
of situations is easier for GPT-3 (5–15% absolute
difference) compared to reasoning about other
secondary participants.
Second, we measure LLMs’ ability to understand
other people’s mental states and realities in
short stories (§4). We use the TOMI QA benchmark
(illustrated in Fig. 1, bottom; Le et al., 2019),
which was inspired by the classic Sally-Anne False
Belief Theory of Mind test (Baron-Cohen et al., 1985).
Here, our results show that GPT-3 models
peak at 60% accuracy on questions about participants’
mental states, compared to 90–100% on
factual questions.
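To make this probing setup concrete, the sketch below shows one common way to run such few-shot multiple-choice evaluations: a handful of labeled examples are concatenated into a prompt, the model’s continuation for each held-out item is parsed into an answer letter, and accuracy is computed against the gold labels. The `complete` stub, the `format_item` and `few_shot_accuracy` helpers, and the prompt template are illustrative assumptions for exposition, not the exact prompts or API calls used in our experiments.

```python
# Minimal sketch of few-shot multiple-choice probing of an LLM.
# `complete` is a hypothetical stand-in for an LLM completion call
# (e.g., a GPT-3-style API); the item fields and prompt template
# below are illustrative, not the exact setup used in our experiments.

from typing import Callable, Dict, List

def format_item(item: Dict, include_answer: bool) -> str:
    # Render one QA item as "Context / Question / (a)..(c) / Answer:".
    choices = "\n".join(
        f"({letter}) {text}" for letter, text in zip("abc", item["choices"])
    )
    answer = f" ({item['answer']})" if include_answer else ""
    return (
        f"Context: {item['context']}\n"
        f"Question: {item['question']}\n"
        f"{choices}\n"
        f"Answer:{answer}"
    )

def few_shot_accuracy(
    complete: Callable[[str], str],  # hypothetical LLM call: prompt -> continuation
    train_items: List[Dict],         # the k labeled in-context examples
    test_items: List[Dict],
) -> float:
    # Build the shared few-shot header once, then append each test item.
    header = "\n\n".join(format_item(x, include_answer=True) for x in train_items)
    correct = 0
    for item in test_items:
        prompt = f"{header}\n\n{format_item(item, include_answer=False)}"
        # Expect the model to continue with something like " (b)".
        prediction = complete(prompt).strip().lstrip("(")[:1].lower()
        correct += prediction == item["answer"]
    return correct / len(test_items)
```

An alternative, also common in practice, is to score each answer choice by its likelihood under the model rather than parsing generated text; either strategy fits the few-shot setup described above.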
Our novel insights show that reasoning about
social situations and false beliefs still presents a
significant challenge for large language models,
despite their seemingly impressive performance on
tasks that could require social intelligence (e.g.,
story generation, dialogues). In §5, we first examine
these shortcomings; drawing on theories of the
pragmatics of language, we speculate that the type
of texts in LLMs’ training datasets could substantially
limit their ability to learn social intelligence. Then,
we outline possible future directions towards socially
aware LLMs, reflecting on the feasibility of interactional
data selection, person-centric inductive biases, and
interaction-based language learning. Our findings suggest
that simply increasing the scale of LLMs is likely not
the most effective way to create socially aware AI systems,
challenging a prevalent narrative in AI research
(Narang and Chowdhery, 2022).
2 Theory of Mind & Large LMs
Why do LLMs need Theory of Mind?
Social intelligence, Theory of Mind, and commonsense
reasoning have been longstanding but elusive goals of
artificial intelligence (Gunning, 2018; Choi, 2022).
These reasoning abilities are becoming increasingly
necessary as AI assistants are used in situations that
require social intelligence and Theory of Mind to
operate effectively (Wang et al., 2007; Dhelim et al., 2021;
Langley et al., 2022). For example, new technologies
are emerging in which AI is used to interact with and
adapt to users (Bickmore and Picard, 2005; Jaques, 2019),
e.g., voice assistants and tutoring systems, or in which
AI helps enhance communication between multiple users,
e.g., email autocomplete (Chen et al., 2019), AI-assisted
counseling (Kearns et al., 2020; Allen, 2020; Sharma et al.,
2021), or facilitated discussion (Rosé et al., 2014).
As we move beyond just asking single-turn questions
to social and interactive AI assistants, higher-order
reasoning becomes necessary (McDonald and Pearson, 2019).
For example, AI systems should be capable of more nuanced
understanding, such as ensuring an alarm is on if someone
has a job interview the next morning (Dhelim et al., 2021),
knowing to call for help when an elderly person falls
(Pollack, 2005), inferring personality and intentions in
dialogues (Mairesse et al., 2007; Wang et al., 2019),
reasoning about public commitments (Asher and Lascarides,
2013), predicting