Stone, 2017, Torabi et al., 2018, Guo et al., 2019], and reward shaping with a pretrained expert [Merel
et al., 2017].
LDD achieves top performance on all environments in terms of task completion and reward. In addition to comparing LDD to other methods, we ablate LDD to quantify the effect of language observations in dynamics modeling and the importance of dynamics modeling with expert demonstrations. On two environments where we can control for the presence of language descriptions (NetHack game messages and Touchdown panorama captions), we show that language descriptions improve sample efficiency and generalization. Finally, across all environments, we find that dynamics modeling with expert demonstrations is more effective than with non-expert rollouts.
2 Related Work
Learning by observing language.
Recent work studies generalization to language instructions and manuals that specify new tasks and environments. These settings range from photorealistic/3D navigation [Anderson et al., 2018, Chen et al., 2018, Ku et al., 2020, Shridhar et al., 2020] to multi-hop reference games [Narasimhan et al., 2015, Zhong et al., 2020, Hanjie et al., 2021]. We use a collection of these tasks to evaluate LDD. There is also work where understanding language is not necessary to achieve the task, but its inclusion (e.g. via captions or scene descriptions) makes learning more efficient. Shridhar et al. [2021] show that one can quickly learn a policy in a simulated kitchen environment described in text, then transfer this policy to the 3D visual environment. Zhong et al. [2021] similarly transform photorealistic navigation to a symbolic form via image segmentation, then learn a policy that transfers to the original photorealistic setting. In work concurrent to ours, Tam et al. [2022] generate oracle captions of observations for simulated robotic control and city navigation, which improve policy learning. LDD is complementary to these approaches: in addition to incorporating language descriptions as features, we show that learning a dynamics model from unlabeled demonstrations with language descriptions improves sample efficiency and results in better policies.
Imitation learning from observations.
There is prior work on model-free as well as model-based imitation learning from observations. Model-free methods encourage the imitator to produce state distributions similar to those produced by the demonstrator, for example via generative adversarial learning [Merel et al., 2017] and reward shaping [Kimura et al., 2018]. In contrast, LDD only requires intermediate representations extracted from an expert dynamics model on states encountered by the learner, which are cheaper to compute than rollouts from an expert policy. Model-based approaches learn dynamics models that predict state transitions given the current state and an action. Hanna and Stone [2017] learn an inverse model that maps state transitions to actions, which is then used to annotate unlabeled trajectories for imitation learning. Edwards et al. [2019] learn a forward dynamics model that predicts future states given state and latent-action pairs. In contrast, LDD does not assume priors over the action space distribution; for instance, on ALFWorld, our method works even though it is impossible to enumerate the action space. In our experiments, we extend model-free reward shaping and model-based inverse dynamics modeling to account for language descriptions and compare LDD to these methods.
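As a concrete illustration of the model-based baseline above, the following is a minimal sketch, not the implementation of Hanna and Stone [2017], of how an inverse dynamics model can annotate action-free demonstrations for behavioural cloning. The network shapes, dimensions, and helper names are assumptions made for illustration, and the sketch presumes a small discrete action space, which is precisely the assumption LDD avoids.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; a discrete, enumerable action space is assumed.
STATE_DIM, NUM_ACTIONS = 128, 8

# Inverse dynamics model p(a | s_t, s_{t+1}), trained on the agent's own
# labeled transitions, then used to pseudo-label expert state sequences.
inverse_model = nn.Sequential(
    nn.Linear(2 * STATE_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS)
)
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS)
)
ce = nn.CrossEntropyLoss()

def train_inverse_model(transitions, optimizer):
    """transitions: iterable of (s_t, a_t, s_t1) gathered by the agent itself."""
    for s_t, a_t, s_t1 in transitions:
        logits = inverse_model(torch.cat([s_t, s_t1], dim=-1))
        loss = ce(logits, a_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def behavioural_cloning_on_demos(demo_states, optimizer):
    """demo_states: expert state sequences without action labels."""
    for states in demo_states:  # states: (T, STATE_DIM) tensor
        with torch.no_grad():
            # Label each expert transition with the inferred action.
            pseudo_actions = inverse_model(
                torch.cat([states[:-1], states[1:]], dim=-1)
            ).argmax(dim=-1)
        loss = ce(policy(states[:-1]), pseudo_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```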
Representation learning in RL.
In representation learning for RL, the agent learns representations of the environment using rewards and objectives based on the difference between the current state and prior states [Strehl and Littman, 2008], raw visual observations [Jaderberg et al., 2017], learned agent representations [Raileanu and Rocktäschel, 2020], and random network observations [Burda et al., 2019]. In intrinsic exploration methods [Raileanu and Rocktäschel, 2020, Burda et al., 2019], the training objective encourages dissimilarity (e.g. in observation or state space) to prior agent experience so that the agent discovers novel states. Unlike intrinsic exploration, the distillation objective in Language Dynamics Distillation encourages similarity to expert behaviour rather than dissimilarity to prior agent experience. In reconstruction-based representation learning methods [Strehl and Littman, 2008, Jaderberg et al., 2017], the training objective encourages the agent to learn intermediate representations that also capture the dynamics and structure of the environment by reconstructing the observations (e.g. predicting which objects are in the scene). Language Dynamics Distillation is similar to reconstruction methods for representation learning; however, unlike the latter, the dynamics model in LDD is trained on trajectories obtained from an expert policy rather than from the agent policy. Language Dynamics Distillation is therefore complementary to both intrinsic exploration and reconstruction-based representation learning.
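To make the contrast concrete, the following is a minimal sketch, not the exact LDD objective, of a distillation-style auxiliary loss that pulls the learner's intermediate representations toward those of a frozen dynamics model pretrained on expert demonstrations, next to an RND-style novelty bonus [Burda et al., 2019] that rewards dissimilarity to previously fit states. Encoder architectures and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, HIDDEN = 128, 256

# Frozen stand-in for a dynamics model pretrained on expert demonstrations.
expert_encoder = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
                               nn.Linear(HIDDEN, HIDDEN))
for p in expert_encoder.parameters():
    p.requires_grad_(False)

# Learner's encoder, updated during policy learning.
agent_encoder = nn.Sequential(nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
                              nn.Linear(HIDDEN, HIDDEN))

def distillation_loss(states):
    """Encourage similarity to expert representations on states the learner visits."""
    with torch.no_grad():
        target = expert_encoder(states)
    return F.mse_loss(agent_encoder(states), target)

# RND-style intrinsic bonus: a fixed random network and a trained predictor;
# large prediction error marks states unlike prior agent experience.
random_target = nn.Linear(STATE_DIM, HIDDEN)
for p in random_target.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(STATE_DIM, HIDDEN)

def novelty_bonus(states):
    """Reward dissimilarity to prior experience (higher = more novel)."""
    with torch.no_grad():
        target = random_target(states)
    return (predictor(states) - target).pow(2).mean(dim=-1)
```

The two terms point in opposite directions: the distillation loss is minimized when the learner matches the expert model's representations, whereas the novelty bonus is largest on states the predictor has not yet fit.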