Robot Learning Theory of Mind through Self-Observation: Exploiting
the Intentions-Beliefs Synergy
Francesca Bianco1 and Dimitri Ognibene2,1,∗
Abstract— In complex environments, where the human sen-
sory system reaches its limits, our behaviour is strongly driven
by our beliefs about the state of the world around us. Accessing
others’ beliefs, intentions, or mental states in general, could
thus allow for more effective social interactions in natural
contexts. Yet these variables are not directly observable. Theory
of Mind (TOM), the ability to attribute beliefs, intentions, or
mental states in general to other agents, is a crucial feature of
human social interaction and has become of interest to the
robotics community. Recently, new models that are able to
learn TOM have been introduced. In this paper, we show the
synergy between learning to predict low-level mental states,
such as intentions and goals, and attributing high-level ones,
such as beliefs. Assuming that belief attribution can be learnt
by observing one’s own decision and belief estimation processes in
partially observable environments, and using a simple feed-
forward deep learning model, we show that, when learning
to predict others’ intentions and actions, faster and more
accurate predictions are acquired if belief attribution is
learnt simultaneously with action and intention prediction.
We show that the learning performance improves even when
observing agents with a different decision process, and that it is
higher when observing belief-driven chunks of behaviour. We
propose that our architectural approach can be relevant for the
design of future adaptive social robots that should be able to
autonomously understand and assist human partners in novel
natural environments and tasks.
I. INTRODUCTION
Due to recent technological developments, the interactions
between AI and humans have become pervasive and hetero-
geneous, extending from voice assistants or recommender
systems supporting the online experience of millions of users
to autonomous cars. Principled models to represent
human collaborators’ needs are being adopted [1], while
robotic perception in complex environments is becoming
more flexible and adaptive [2]–[5]; even in social contexts,
robot sensory limits are starting to be actively managed [6],
[7]. However, robots and intelligent systems still have a
limited understanding of how sensory limits affect human
partners’ behaviour and lead them to rely on internal beliefs
about the state of the world. This strongly impacts human-
robot mutual understanding [8] and calls for an effort to
transfer the advances in robot perception management to
methods that better cope with human collaborators’ perceptual
limits [9]–[11].
*This work was not supported by any organization
1University of Essex, Colchester, UK
2Università degli Studi di Milano Bicocca, Milano, Italy
∗email: dimitri.ognibene@unimib.it
The possibility of introducing in robots and AI systems
a Theory of Mind (TOM) [12], the ability to attribute
beliefs, intentions, or mental states in general to other agents,
has recently raised hopes to further improve robots’ social
skills [13]–[16]. While some studies have explored human
partners’ tendency to attribute mental states to robots [17]–
[20], the expected practical impact of TOM led to a diverse
set of TOM implementations on robots. Several implementa-
tions relied on hardwired agents and task models that could
be applied to infer mental states in settings known at design
time [21]–[25]. A step forward is presented in [26] with an
algorithm to understand unknown agents relying upon Belief-
Desire-Intention models of previously met agents.
Recently, following the seminal work in [27], several models
have introduced deep learning based TOM implementations
[28]–[33]. This novel approach, learning both belief and in-
tention attribution, should allow improved and adaptive
human-robot collaboration in complex environments
through a better understanding of humans’ mental states. In
this paper, we explore if the data-driven approach proposed
in [27] and related works leads to improved predictions
of the partner’s intentions, which is often the mental state
with the highest impact on the interaction performance.
The prediction of partners’ intentions, even within a system
producing predictions of several other unobservable mental
states, such as beliefs, will still rely only on the processing of
observable behavioural inputs, i.e. state-action trajectories.
In a purely supervised learning setting, such as that proposed
in [27], it is not immediately clear why performing an additional
set of predictions, increasing the demands on the social
perception system, should result in higher accuracy for the
prediction of others’ intentions. This approach introduces
additional complexity and noise that may hinder performance
(see [34]). Moreover, deep learning models such as those proposed
in [27] are usually data hungry, which may further limit the
value of the approach. These factors may be some of the
reasons for the long time required for the full development
of TOM in infants [12], [35].
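To make this multi-task supervised setting concrete, the following Python (PyTorch) sketch illustrates one possible joint predictor: a shared feed-forward trunk over an encoded state-action trajectory feeding separate intention and belief heads, trained with a summed cross-entropy loss. All names, layer sizes, and label spaces here are hypothetical illustrations, not the exact architecture of [27] or of our experiments.

# Minimal sketch of joint intention-and-belief prediction (assumed PyTorch
# implementation; dimensions and label spaces are hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointToMPredictor(nn.Module):
    def __init__(self, traj_dim=64, hidden_dim=128, n_intentions=4, n_beliefs=4):
        super().__init__()
        # Shared feed-forward trunk over the observed state-action trajectory encoding.
        self.trunk = nn.Sequential(
            nn.Linear(traj_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Separate output heads: the belief head provides an auxiliary supervised signal.
        self.intention_head = nn.Linear(hidden_dim, n_intentions)
        self.belief_head = nn.Linear(hidden_dim, n_beliefs)

    def forward(self, traj):
        h = self.trunk(traj)
        return self.intention_head(h), self.belief_head(h)

def joint_loss(model, traj, intention_target, belief_target, belief_weight=1.0):
    # Joint objective: intention loss plus weighted belief loss.
    intention_logits, belief_logits = model(traj)
    return (F.cross_entropy(intention_logits, intention_target)
            + belief_weight * F.cross_entropy(belief_logits, belief_target))

# Usage sketch on random data.
model = JointToMPredictor()
traj = torch.randn(32, 64)               # batch of trajectory encodings
intentions = torch.randint(0, 4, (32,))  # intention labels
beliefs = torch.randint(0, 4, (32,))     # belief labels (e.g. believed target location)
loss = joint_loss(model, traj, intentions, beliefs)
loss.backward()

Setting belief_weight to zero recovers an intention-only baseline, which is the kind of comparison needed to assess whether the additional belief supervision helps or hinders intention prediction.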
While all these considerations sound technically valid,
our results with simplified versions of the architecture pro-
posed in [27] show that the original hypothesis may be
true: learning is faster and more accurate if it takes place
simultaneously for the prediction of both intentions and
beliefs together. Our results also show that the impact of
learning belief attribution on intention prediction is stronger
in conditions of strong partial observability, e.g. when the
observed agent does not yet know where its target is.
We found that, when the system learns to predict intentions
and beliefs at the same time, it can better disambiguate
and discard unrelated objects that are or have been in the