
not only through robot kinesthetic teaching, but also by learning from HHI, both from idealized data (Motion Capture) and noisy RGB-D skeleton tracking, by directly transferring the generated trajectories to a humanoid robot [17], [33], without requiring additional demonstrations or fine-tuning.
II. RELATED WORK
Early approaches for learning HRI modeled the interaction as a joint distribution with a Gaussian Mixture Model (GMM) learned over demonstrated trajectories of a human and a robot in a collaborative task [7]. The correlations between the human and the robot degrees of freedom (DoFs) can then be leveraged to generate the robot's trajectory given observations of the human, as in the sketch below. This method was further extended with HSMMs having explicit duration constraints for learning both proactive and reactive controllers [35]. Along similar lines of leveraging Gaussian approximations for LfD, Movement Primitives [31], [36], which learn a distribution over underlying weight vectors obtained via linear regression, were extended to HRI by similarly learning a joint distribution over the weights of both interacting agents [2], [26]. The versatility of interaction primitives is further evidenced by their adaptability to different intention predictions [22] and speeds [27], and by their ability to learn multiple interactive tasks seamlessly, either using a GMM as the underlying distribution [16] or in an incremental manner [28], [23].
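For concreteness, the following is a minimal sketch of the Gaussian conditioning underlying such approaches, shown for a single joint Gaussian rather than a full GMM or HSMM; the dimensions and covariance values are hypothetical, and this is not the implementation of any of the cited works:

import numpy as np

# Joint Gaussian over stacked [human; robot] DoFs (2 + 2 dims, hypothetical).
mu = np.zeros(4)
Sigma = np.array([
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.7],
    [0.8, 0.3, 1.0, 0.2],
    [0.1, 0.7, 0.2, 1.0],
])  # full covariance: the off-diagonal blocks couple human and robot DoFs

h, r = slice(0, 2), slice(2, 4)   # human / robot index blocks
x_h = np.array([0.5, -0.3])       # an observed human configuration

# Gaussian conditioning: p(robot | human) = N(mu_cond, Sigma_cond)
K = Sigma[r, h] @ np.linalg.inv(Sigma[h, h])
mu_cond = mu[r] + K @ (x_h - mu[h])
Sigma_cond = Sigma[r, r] - K @ Sigma[h, r]

print(mu_cond)     # predicted robot DoFs given the human observation
print(Sigma_cond)  # remaining uncertainty over the robot DoFs

# If Sigma were block-diagonal, Sigma[r, h] would be zero, K would vanish,
# and the prediction would collapse to the unconditioned mean mu[r]; the
# cross-agent correlations are what make the inference informative.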
Deep LfD techniques have grown in popularity for learning latent trajectory dynamics from demonstrations, wherein an autoencoding approach, such as a VAE, encodes the demonstrations into a latent space over which a latent dynamics model is trained. In their simplest form, the latent dynamics can be modeled either with linear Gaussian models [20] or with a Kalman filter [3]. Other approaches learn stable dynamical systems, like Dynamic Movement Primitives [36], over VAE latent spaces [4], [10], [11], [9]. Instead of learning a feedforward dynamics model, Dermy et al. [13] model the entire trajectory's dynamics at once using Probabilistic Movement Primitives [31], achieving better results than [9]. When large datasets are available, recurrent networks are powerful tools for approximating latent dynamics [12], [18], especially in the case of learning common dynamics models in HRI [5].
A major advantage of most of the aforementioned LfD approaches, beyond their sample efficiency in terms of demonstrations, is that they can be explicitly conditioned at desired time steps, unlike neural network-based approaches.
Most deep LfD approaches fit complete trajectories, capturing neither the multimodality of HRI tasks nor the subsequent dynamics between different modes of interaction. Instead of fitting a single distribution over demonstrated trajectories, HSMMs break such complex trajectories down into multiple modes and learn the sequencing between hidden states, as shown in [29], where HSMMs were used as latent priors for a VAE. However, [29] does not consider the interdependence between dimensions, but models each dimension individually, which is unfavorable when learning interaction dynamics. Such issues can be circumvented by using a diagonal cross-covariance structure (as in [3]), but this would only learn dependencies between individual dimensions. In contrast, we learn full covariance matrices in our HSMMs to better facilitate the learning of interaction dynamics.
III. PRELIMINARIES
In this section, we briefly introduce preliminary concepts,
namely, VAEs (Sec. III-A) and HSMMs (Sec. III-B), that we
deem useful for discussing our proposed method.
A. Variational Autoencoders
Variational Autoencoders (VAEs) [21], [34] are a class of neural network architectures that learn to reconstruct their inputs in an unsupervised, probabilistic manner. An encoder maps an input $x$ to a latent representation $z$ at the bottleneck, from which a decoder reconstructs the original input. A prior distribution is enforced over the latent space, usually given by a normal distribution $p(z) = \mathcal{N}(z; 0, I)$. The goal is to approximate the true posterior $p(z|x)$ with a neural network $q(z|x)$, which is trained by minimizing the Kullback-Leibler (KL) divergence between the two.
$$D_{KL}(q(z|x)\,\|\,p(z|x)) = \mathbb{E}_q\!\left[\log\frac{q(z|x)}{p(x,z)}\right] + \log p(x) \tag{1}$$
This can be rewritten as
$$\log p(x) = D_{KL}(q(z|x)\,\|\,p(z|x)) + \mathbb{E}_q\!\left[\log\frac{p(x,z)}{q(z|x)}\right] \tag{2}$$
The KL divergence is always non-negative; therefore, the second term in (2) acts as a lower bound on $\log p(x)$. Maximizing it effectively maximizes the log-likelihood of the data distribution, or evidence, and it is hence called the Evidence Lower Bound (ELBO), which can be written as
$$\mathbb{E}_q\!\left[\log\frac{p(x,z)}{q(z|x)}\right] = \mathbb{E}_q[\log p(x|z)] - D_{KL}(q(z|x)\,\|\,p(z)) \tag{3}$$
The first term corresponds to the reconstruction of the input via samples decoded from the posterior distribution. The second term is the KL divergence between the approximate posterior and the prior, which acts as a regularization term on the posterior. Further information about variational inference can be found in [21], [34].
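As an illustration, the following is a minimal sketch of a VAE trained with the negative of the ELBO in (3), written in PyTorch; the architecture, layer sizes, and data are hypothetical and not taken from our method:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=12, z_dim=3, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden, z_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def neg_elbo(x, x_rec, mu, log_var):
    # Reconstruction term E_q[log p(x|z)], realized here as a Gaussian
    # log-likelihood with unit variance (squared error up to constants)
    rec = ((x - x_rec) ** 2).sum(-1)
    # Closed-form D_KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(-1)
    return (rec + kl).mean()

vae = VAE()
x = torch.randn(8, 12)                  # a dummy batch of observations
x_rec, mu, log_var = vae(x)
loss = neg_elbo(x, x_rec, mu, log_var)  # minimizing this maximizes the ELBO
loss.backward()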
B. Hidden Semi-Markov Models
HSMMs are a special class of Hidden Markov Models (HMMs) in which the Markov property is relaxed, i.e., the current state depends not just on the previous state, but also on the duration for which the system remains in a state. In an HMM, a sequence of observations $z_{1:T}$ is modeled by a sequence of hidden latent states, each taking one of $K$ values, that emit the observations with some probability. Specifically, an HMM can be described by its initial state distribution $\pi_i$ over the states $i \in \{1, 2, \dots, K\}$ and the state transition probabilities $T_{i,j}$, denoting the probability of transitioning from state $i$ to state $j$. In our case, each state is characterized by a Normal distribution with mean $\mu_i$ and covariance $\Sigma_i$, which define the emission probabilities of the observations, $\mathcal{N}(z_t; \mu_i, \Sigma_i)$. This, in essence, is similar to learning a GMM over the observations and learning the