MILD: Multimodal Interactive Latent Dynamics
for Learning Human-Robot Interaction
Vignesh Prasad1,2, Dorothea Koert1,5, Ruth Stock-Homburg2, Jan Peters1,3,4,5, Georgia Chalvatzaki1,4
Abstract—Modeling interaction dynamics to generate robot trajectories that enable a robot to adapt and react to a human's actions and intentions is critical for efficient and effective collaborative Human-Robot Interactions (HRI). Learning from Demonstration (LfD) methods from Human-Human Interactions (HHI) have shown promising results, especially when coupled with representation learning techniques. However, such methods for learning HRI either do not scale well to high-dimensional data or cannot accurately adapt to the changing via-poses of the interacting partner. We propose Multimodal Interactive Latent Dynamics (MILD), a method that couples deep representation learning and probabilistic machine learning to address the problem of two-party physical HRIs. We learn the interaction dynamics from demonstrations, using Hidden Semi-Markov Models (HSMMs) to model the joint distribution of the interacting agents in the latent space of a Variational Autoencoder (VAE). Our experimental evaluations for learning HRI from HHI demonstrations show that MILD effectively captures the multimodality in the latent representations of HRI tasks, allowing us to decode the varying dynamics occurring in such tasks. Compared to related work, MILD generates more accurate trajectories for the controlled agent (robot) when conditioned on the observed agent's (human) trajectory. Notably, MILD can learn directly from camera-based pose estimations to generate trajectories, which we then map to a humanoid robot without the need for any additional training.
Supplementary Material: https://bit.ly/MILD-HRI
I. INTRODUCTION
Observing human actions and interacting synchronously is an essential characteristic of a social robot in HRI scenarios [1]. Key components for learning coordinated HRI policies are having a good spatio-temporal representation and jointly modeling the interaction dynamics of the agents. In this regard, the paradigm of LfD shows promising results [2], [8], [7], especially when using only a handful of trajectories.
1Department of Computer Science, TU Darmstadt, Germany.
2Chair for Marketing and Human Resource Management, Department of
Law and Economics, TU Darmstadt, Germany.
3German Research Center for AI (DFKI), Research Department: Systems
AI for Robot Learning.
4Hessian.AI
5Centre for Cognitive Science, TU Darmstadt, Germany.
This work was supported by the German Research Foundation (DFG) Project "Social Robots at the Customer Interface" (Grant No.: STO 477/14-1), the DFG Emmy Noether Programme (CH 2676/1-1), the German Federal Ministry of Education and Research (BMBF) Project "IKIDA" (Grant No.: 01IS20045), the RoboTrust project of the Centre Responsible Digitality (ZEVEDI) Hessen, Germany, the Funding Association for Market-Oriented Management, Marketing, and Human Resource Management (Förderverein für Marktorientierte Unternehmensführung, Marketing und Personalmanagement e.V.), and the Leap in Time Foundation (Leap in Time Stiftung). The authors thank the NHR Centers NHR4CES at TU Darmstadt for access to the Lichtenberg high-performance computer (Project No. 1694) for running the experiments in this work.
[Figure: human observations are encoded into the latent space, where HSMM conditioning and reconstruction produce the controlled agent's motion.]
Fig. 1: Overview of our approach, MILD. We train VAEs to reconstruct the observations of the interacting agents $(x^1_{1:t}, x^2_{1:t})$ with an HSMM prior to learn a joint distribution over the latent-space trajectories $(z^1_{1:t}, z^2_{1:t})$ of the interacting agents. During test time, the observed agent's latent trajectory conditions the HSMM to infer the controlled agent's latent trajectory $p(z^2_t \mid z^1_{1:t})$, which is decoded to generate the agent's real-world trajectory $\hat{x}^2_t$.
Such LfD approaches learn joint distributions over human and robot trajectories that can be conditioned on the observed actions of the human; however, they scale poorly with higher dimensions. In such cases, deep LfD approaches perform well for learning latent-space dynamics with high-dimensional data [20], [11], [9], [13], [29], but they are usually not scalable to different interactive scenarios, as they do not model the inherent multimodality and uncertainty of HRI tasks.
To tackle these challenges when learning HRI policies, we introduce MILD, a method that effectively couples benefits from deep LfD methods with probabilistic machine learning. MILD uses HSMMs as a temporally coherent latent-space prior of a VAE, which we learn from HHI data. Specifically, we model the prior as a joint distribution over the trajectories of both interacting agents, making full use of the power of HSMMs for learning both trajectory and interaction dynamics. Our approach successfully captures the multimodality of the latent interactive trajectories thanks to the modularity of HSMMs, enabling better modeling of dynamics as compared to using an uninformed, stationary prior [5]. During testing, we can generate latent trajectories by conditioning the HSMM on the human observations using Gaussian Mixture Regression [7], and decode them to obtain the robot's control trajectories (Fig. 1).
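To make the conditioning step concrete, the following is a minimal sketch of Gaussian Mixture Regression over a joint latent distribution, assuming trained mixture components over the stacked latents of both agents. The function and variable names are illustrative, and the time-varying HSMM state probabilities are simplified here to fixed mixing weights; this is a sketch under those assumptions, not the actual MILD implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmr_condition(z1, means, covs, weights):
    """Infer the controlled agent's latent state z2 given the observed
    agent's latent state z1, via Gaussian Mixture Regression.

    means:   (K, d1 + d2) component means over the stacked latents [z1; z2]
    covs:    (K, d1 + d2, d1 + d2) full covariance matrices
    weights: (K,) mixing weights (in MILD, these would come from the
             HSMM forward variable rather than being fixed)
    """
    d1 = z1.shape[0]
    K = means.shape[0]

    # Responsibility of each component for the observed latent z1
    resp = np.array([
        weights[k] * multivariate_normal.pdf(z1, means[k, :d1], covs[k, :d1, :d1])
        for k in range(K)
    ])
    resp /= resp.sum()

    # Blend the per-component conditional means:
    #   mu2 + Sigma21 @ Sigma11^{-1} @ (z1 - mu1)
    z2 = np.zeros(means.shape[1] - d1)
    for k in range(K):
        mu1, mu2 = means[k, :d1], means[k, d1:]
        s11, s21 = covs[k, :d1, :d1], covs[k, d1:, :d1]
        z2 += resp[k] * (mu2 + s21 @ np.linalg.solve(s11, z1 - mu1))
    return z2
```

In MILD, the weights would instead be supplied by the HSMM's forward variable at each time step, so that the conditioning respects both the current observation and the learned state sequencing.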
Our experimental evaluations on different test scenarios show the efficacy of MILD in capturing coherent latent dynamics, both in predicting HHI and in generating controls for HRI on different robots, compared to the state-of-the-art method that implicitly learns shared representations [5]. MILD learns to generate effective robot trajectories for HRI, not only through robot kinesthetic teaching, but also by learning from HHI, both from idealized data (motion capture) and noisy RGB-D skeleton tracking, by directly transferring the generated trajectories to a humanoid robot [17], [33], without requiring additional demonstrations or fine-tuning.
II. RELATED WORK
Early approaches for learning HRI modeled them as a
joint distribution with a Gaussian Mixture Model (GMM)
learned over demonstrated trajectories of a human and a
robot in a collaborative task [7]. The correlations between
the human and the robot degrees of freedom (DoFs) can
then be leveraged to generate the robot’s trajectory given
observations of the human. This method was further extended
with HSMMs with explicit duration constraints for learning
both proactive and reactive controllers [35]. Along similar
lines of leveraging Gaussian approximations for LfD, Move-
ment primitives [31], [36], which learn a distribution over un-
derlying weight vectors obtained via linear regression, were
extended for HRI by similarly learning a joint distribution
over the weights of both interacting agents [2], [26]. The
versatility of interaction primitives can additionally be noted
by their ability to be adapted for different intention predic-
tions [22], speeds [27], or for learning multiple interactive
tasks seamlessly by either using a GMM as the underlying
distribution [16] or in an incremental manner [28], [23].
Deep LfD techniques have grown in popularity for learning latent trajectory dynamics from demonstrations, wherein an autoencoding approach, like VAEs, is used to encode latent trajectories over which a latent dynamics model is trained. In their simplest form, the latent dynamics can be modeled either with linear Gaussian models [20] or a Kalman filter [3]. Other approaches learn stable dynamical systems, like Dynamic Movement Primitives [36], over VAE latent spaces [4], [10], [11], [9]. Instead of learning a feedforward dynamics model, Dermy et al. [13] model the entire trajectory's dynamics at once using Probabilistic Movement Primitives [31], achieving better results than [9]. When large datasets are available, recurrent networks are powerful tools for approximating latent dynamics [12], [18], especially in the case of learning common dynamics models in HRIs [5]. A major advantage of most of the aforementioned LfD approaches, other than their sample efficiency in terms of demonstrations, is that they can be explicitly conditioned at desired time steps, unlike neural network-based approaches.
Most deep LfD approaches fit complete trajectories, capturing neither the multimodality in HRI tasks nor the subsequent dynamics between different modes of interaction. Instead of fitting a single distribution over demonstrated trajectories, HSMMs break down such complex trajectories into multiple modes and learn the sequencing between hidden states, as shown in [29], where HSMMs were used as latent priors for a VAE. However, [29] does not look at the interdependence between dimensions, but models each dimension individually, which is not favorable when learning interaction dynamics. Such issues can be circumvented by using a diagonal cross-covariance structure (as in [3]), but this would only learn dependencies between individual dimensions. In contrast, we learn full covariance matrices in our HSMMs to better facilitate the learning of interaction dynamics.
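To make the role of the full covariance explicit, consider a single Gaussian component fit over the stacked latents of both agents,

$$\begin{pmatrix} z^1 \\ z^2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu^1 \\ \mu^2 \end{pmatrix}, \begin{pmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{pmatrix} \right).$$

The standard Gaussian conditioning identity then gives

$$p(z^2 \mid z^1) = \mathcal{N}\!\left( \mu^2 + \Sigma^{21}(\Sigma^{11})^{-1}(z^1 - \mu^1),\; \Sigma^{22} - \Sigma^{21}(\Sigma^{11})^{-1}\Sigma^{12} \right).$$

If the cross-covariance $\Sigma^{21}$ is zero, as in a per-dimension or diagonal model, the conditional collapses to the marginal $\mathcal{N}(\mu^2, \Sigma^{22})$ and the observed agent carries no information about the controlled agent; it is the off-diagonal blocks of the full covariance that encode the interaction.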
III. PRELIMINARIES
In this section, we briefly introduce preliminary concepts,
namely, VAEs (Sec. III-A) and HSMMs (Sec. III-B), that we
deem useful for discussing our proposed method.
A. Variational Autoencoders
Variational Autoencoders (VAEs) [21], [34] are a class of neural network architectures that learn the identity function in an unsupervised, probabilistic way. An encoder maps the input $x$ to a latent representation $z$ at the bottleneck, from which a decoder reconstructs the original input. A prior distribution is enforced over the latent space, usually given by a normal distribution $p(z) = \mathcal{N}(z; 0, I)$. The goal is to approximate the true posterior $p(z|x)$ with a neural network $q(z|x)$, which is trained by minimizing the Kullback-Leibler (KL) divergence between them:

$$D_{KL}(q(z|x) \,\|\, p(z|x)) = \mathbb{E}_q\left[\log \frac{q(z|x)}{p(x,z)}\right] + \log p(x) \tag{1}$$
This can be rewritten as

$$\log p(x) = D_{KL}(q(z|x) \,\|\, p(z|x)) + \mathbb{E}_q\left[\log \frac{p(x,z)}{q(z|x)}\right] \tag{2}$$
Since the KL divergence is always non-negative, the second term in (2) acts as a lower bound on the log-likelihood. Maximizing it effectively maximizes the log-likelihood of the data distribution, or evidence, and it is hence called the Evidence Lower Bound (ELBO), which can be written as

$$\mathbb{E}_q\left[\log \frac{p(x,z)}{q(z|x)}\right] = \mathbb{E}_q[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z)) \tag{3}$$
The first term corresponds to the reconstruction of the input via samples decoded from the approximate posterior. The second term is the KL divergence between the approximate posterior and the prior, which acts as a regularization term on the posterior. Further information about variational inference can be found in [21], [34].
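As a minimal sketch of how the objective in (3) is typically optimized in practice (the architecture, dimensionalities, and Gaussian decoder assumption below are placeholders, not the networks used in MILD):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=12, z_dim=5, h_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z ~ q(z|x) differentiably
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_hat, mu, logvar):
    # Reconstruction term E_q[log p(x|z)]: a Gaussian likelihood
    # with unit variance, up to additive constants
    recon = ((x - x_hat) ** 2).sum(-1)
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)
    return (recon + kl).mean()
```

A training step then reduces to `loss = neg_elbo(x, *vae(x))` followed by a standard optimizer update; MILD replaces the standard-normal prior in the KL term with the HSMM prior.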
B. Hidden Semi-Markov Models
HSMMs are a special class of Hidden Markov Models
(HMMs), where the Markov property is relaxed, i.e., the cur-
rent state depends on not just the previous state, but also the
duration for which the system remains in a state. In an HMM,
a sequence of observations z1:Tis modeled as a sequence of
Khidden latent states that emit the observations with some
probability. Specifically, an HMM can be described by its
initial state distribution πiover the states i∈ {1,2. . . K},
the state transition probabilities Ti,j denoting the probability
of transitioning from state ito state j. In our case, each state
is characterized by a Normal distribution with mean µiand
covariance Σi, which characterize the emission probabilities
of the observations N(zt;µi,Σi). This, in essence, is similar
to learning a GMM over the observations together with the transition probabilities between the mixture components.
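For illustration, a minimal sketch of the forward recursion for such a model with Gaussian emissions is given below; the names are illustrative, and the explicit duration model that distinguishes an HSMM from a plain HMM is omitted for brevity:

```python
import numpy as np
from scipy.stats import multivariate_normal

def filtered_state_probs(z, pi, T, mus, sigmas):
    """Forward recursion for an HMM with Gaussian emissions.

    Returns alpha with alpha[t, i] = p(state_t = i | z_{1:t}), given the
    initial distribution pi (K,), the transition matrix T (K, K) with
    T[i, j] = p(j | i), and per-state emission parameters mus, sigmas.
    """
    n_steps, K = len(z), len(pi)
    emit = np.array([[multivariate_normal.pdf(z[t], mus[i], sigmas[i])
                      for i in range(K)] for t in range(n_steps)])
    alpha = np.zeros((n_steps, K))
    alpha[0] = pi * emit[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n_steps):
        alpha[t] = (alpha[t - 1] @ T) * emit[t]  # predict, then correct
        alpha[t] /= alpha[t].sum()               # normalize to a posterior
    return alpha
```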