
not only through robot kinesthetic teaching, but also by learning from HHI, both from idealized data (Motion Capture) and noisy RGB-D skeleton tracking, by directly transferring the generated trajectories to a humanoid robot [17], [33], without requiring additional demonstrations or fine-tuning.
II. RELATED WORK
Early approaches for learning HRI modeled the interaction as a joint distribution with a Gaussian Mixture Model (GMM) learned over demonstrated trajectories of a human and a robot in a collaborative task [7]. The correlations between the human and the robot degrees of freedom (DoFs) can then be leveraged to generate the robot's trajectory given observations of the human, as in the sketch below. This method was further extended with HSMMs having explicit duration constraints for learning both proactive and reactive controllers [35]. Along similar lines of leveraging Gaussian approximations for LfD, Movement Primitives [31], [36], which learn a distribution over underlying weight vectors obtained via linear regression, were extended to HRI by similarly learning a joint distribution over the weights of both interacting agents [2], [26]. The versatility of interaction primitives is further evidenced by their adaptability to different intention predictions [22] and speeds [27], and by their ability to learn multiple interactive tasks seamlessly, either using a GMM as the underlying distribution [16] or in an incremental manner [28], [23].
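For concreteness, the following is a minimal sketch of the Gaussian conditioning underlying such approaches, shown for a single joint Gaussian rather than a full GMM or HSMM; the dimensions and covariance values are hypothetical, and this is not the implementation of any of the cited works:

import numpy as np

# Joint Gaussian over stacked [human; robot] DoFs (2 + 2 dims, hypothetical).
mu = np.zeros(4)
Sigma = np.array([
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.7],
    [0.8, 0.3, 1.0, 0.2],
    [0.1, 0.7, 0.2, 1.0],
])  # full covariance: the off-diagonal blocks couple human and robot DoFs

h, r = slice(0, 2), slice(2, 4)   # human / robot index blocks
x_h = np.array([0.5, -0.3])       # an observed human configuration

# Gaussian conditioning: p(robot | human) = N(mu_cond, Sigma_cond)
K = Sigma[r, h] @ np.linalg.inv(Sigma[h, h])
mu_cond = mu[r] + K @ (x_h - mu[h])
Sigma_cond = Sigma[r, r] - K @ Sigma[h, r]

print(mu_cond)     # predicted robot DoFs given the human observation
print(Sigma_cond)  # remaining uncertainty over the robot DoFs

# If Sigma were block-diagonal, Sigma[r, h] would be zero, K would vanish,
# and the prediction would collapse to the unconditioned mean mu[r]; the
# cross-agent correlations are what make the inference informative.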
Deep LfD techniques have grown in popularity for learning latent trajectory dynamics from demonstrations, wherein an autoencoding approach, such as a VAE, encodes the demonstrations into a latent space over which a latent dynamics model is trained. In their simplest form, the latent dynamics can be modeled either with linear Gaussian models [20] or with a Kalman filter [3]. Other approaches learn stable dynamical systems, like Dynamic Movement Primitives [36], over VAE latent spaces [4], [10], [11], [9]. Instead of learning a feedforward dynamics model, Dermy et al. [13] model the entire trajectory's dynamics at once using Probabilistic Movement Primitives [31], achieving better results than [9]. When large datasets are available, recurrent networks are powerful tools for approximating latent dynamics [12], [18], especially in the case of learning common dynamics models in HRI [5].
A major advantage of most of the aforementioned LfD approaches, beyond their sample efficiency in terms of demonstrations, is that they can be explicitly conditioned at desired time steps, unlike neural network-based approaches.
Most deep LfD approaches fit complete trajectories, capturing neither the multimodality of HRI tasks nor the subsequent dynamics between different modes of interaction. Instead of fitting a single distribution over demonstrated trajectories, HSMMs break such complex trajectories down into multiple modes and learn the sequencing between hidden states, as shown in [29], where HSMMs were used as latent priors for a VAE. However, [29] does not consider the interdependence between dimensions, but models each dimension individually, which is unfavorable when learning interaction dynamics. Such issues can be circumvented by using a diagonal cross-covariance structure (as in [3]), but this would only learn dependencies between individual dimensions. In contrast, we learn full covariance matrices in our HSMMs to better facilitate the learning of interaction dynamics.
III. PRELIMINARIES
In this section, we briefly introduce preliminary concepts,
namely, VAEs (Sec. III-A) and HSMMs (Sec. III-B), that we
deem useful for discussing our proposed method.
A. Variational Autoencoders
Variational Autoencoders (VAEs) [21], [34] are a class of neural network architectures that learn to reconstruct their inputs in an unsupervised, probabilistic manner. An encoder maps an input $x$ to a latent representation $z$ at the bottleneck, from which a decoder reconstructs the original input. A prior distribution is enforced over the latent space, usually given by a normal distribution $p(z) = \mathcal{N}(z; 0, I)$. The goal is to approximate the true posterior $p(z|x)$ with a neural network $q(z|x)$, which is trained by minimizing the Kullback-Leibler (KL) divergence between the two.
$$D_{KL}(q(z|x)\,\|\,p(z|x)) = \mathbb{E}_q\!\left[\log\frac{q(z|x)}{p(x,z)}\right] + \log p(x) \tag{1}$$
This can be rewritten as
$$\log p(x) = D_{KL}(q(z|x)\,\|\,p(z|x)) + \mathbb{E}_q\!\left[\log\frac{p(x,z)}{q(z|x)}\right] \tag{2}$$
The KL divergence is always non-negative; therefore, the second term in (2) acts as a lower bound on $\log p(x)$. Maximizing it effectively maximizes the log-likelihood of the data distribution, or evidence, and it is hence called the Evidence Lower Bound (ELBO), which can be written as
$$\mathbb{E}_q\!\left[\log\frac{p(x,z)}{q(z|x)}\right] = \mathbb{E}_q[\log p(x|z)] - D_{KL}(q(z|x)\,\|\,p(z)) \tag{3}$$
The first term corresponds to the reconstruction of the input via samples decoded from the posterior distribution. The second term is the KL divergence between the approximate posterior and the prior, which acts as a regularization term on the posterior. Further information about variational inference can be found in [21], [34].
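As an illustration, the following is a minimal sketch of a VAE trained with the negative of the ELBO in (3), written in PyTorch; the architecture, layer sizes, and data are hypothetical and not taken from our method:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=12, z_dim=3, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden, z_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def neg_elbo(x, x_rec, mu, log_var):
    # Reconstruction term E_q[log p(x|z)], realized here as a Gaussian
    # log-likelihood with unit variance (squared error up to constants)
    rec = ((x - x_rec) ** 2).sum(-1)
    # Closed-form D_KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(-1)
    return (rec + kl).mean()

vae = VAE()
x = torch.randn(8, 12)                  # a dummy batch of observations
x_rec, mu, log_var = vae(x)
loss = neg_elbo(x, x_rec, mu, log_var)  # minimizing this maximizes the ELBO
loss.backward()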
B. Hidden Semi-Markov Models
HSMMs are a special class of Hidden Markov Models (HMMs) in which the Markov property is relaxed, i.e., the current state depends not just on the previous state, but also on the duration for which the system remains in a state. In an HMM, a sequence of observations $z_{1:T}$ is modeled by a sequence of hidden latent states, each taking one of $K$ values, that emit the observations with some probability. Specifically, an HMM can be described by its initial state distribution $\pi_i$ over the states $i \in \{1, 2, \dots, K\}$ and the state transition probabilities $T_{i,j}$, denoting the probability of transitioning from state $i$ to state $j$. In our case, each state is characterized by a Normal distribution with mean $\mu_i$ and covariance $\Sigma_i$, which define the emission probabilities of the observations, $\mathcal{N}(z_t; \mu_i, \Sigma_i)$. This, in essence, is similar to learning a GMM over the observations and learning the