the agent might take, allowing it to leverage information gained through off-policy exploration, and
to simulate the effect of taking actions in states that may be otherwise hard to reach. Some recent
RL approaches using world models [8] have been shown to improve sample efficiency over the state
of the art on Atari benchmarks. In this project, we investigate the performance of one world model
approach on a continuous control task in a complex environment with a large state space: a robot
manipulating a cloth to fold it in simulation.
We found that learning state representations improved learning performance, but the environment
dynamics predictor of world models did not improve performance compared to our ablation
experiments. However, we note that the approach we used is not the current state of the art,
and our experimentation was limited by computing constraints. We also show that using structured
data as states, i.e., keypoints instead of RGB images, increases both performance and sample
efficiency more than learned state representations do. This suggests that the sample efficiency of
world models can be increased with the use of a structured feature space, but more progress is needed
before world models can be effectively used to solve real-world robotics tasks.
Related Work
Ha and Schmidhuber [6] demonstrate the effectiveness of a learned generative world model on a
simulated racecar task and Doom, learning to encode a pixel image of the state as a latent represen-
tation from which the next state can be reconstructed, and learning a separate policy model to select
the best action given the encoded state. Hafner et al. [7] apply world models (WMs) to the DeepMind
Control Suite [16] with a more complex policy model. Hafner et al. [8] discretize the latent space of
WMs to improve their prediction performance; their model outperforms state-of-the-art RL models
and human players on the Atari benchmark. Kipf et al. [11] learn structured world models, although
they do not use them for exploration. Finally, world models have been used for exploration via
planning by Sekar et al. [14]. Our project expands on this literature by introducing built-in structure
in the state space and better exploration strategies to make WMs a better fit for robotics. Our
approach is directly informed by that of Ha and Schmidhuber [6], though we adjust our model
training approach based on what has been shown to be effective in later literature [8].
Problem Statement
We assume a task in an environment that can be modeled as an MDP, with a state $s$ from a state space
$S$, where the agent selects an action $a$ from its action space $A$ at each timestep, transitioning to a
state $s'$ and receiving a reward $r$. The agent's goal is to select actions that maximize the cumulative
discounted reward given the agent’s starting state. Given a dataset collected in the environment with
some exploration policy, consisting of state transitions, actions taken, and rewards, we aim to learn
a model capturing that MDP's transition function: $(s_t, a_t) \rightarrow s_{t+1}$. At test time, the agent uses this
model to select the action $a_t$ with the highest predicted Q-value. The task we attempt to learn, cloth folding, is
a continuous control task with a large and complex state space, due to the infinitely many possible
configurations of the cloth. Learning a policy directly in this environment, without prior structure,
knowledge of environment dynamics, or bias towards meaningful state features or regions of the
state space, would be very sample intensive.
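For concreteness, this setup can be summarized in standard RL notation (the discount factor $\gamma$, the learned transition model $\hat{f}_\theta$, and the value estimate $\hat{Q}$ are symbols we introduce here for illustration):
$$
\pi^\ast = \arg\max_\pi \, \mathbb{E}_\pi\!\Big[\textstyle\sum_{t \ge 0} \gamma^t r_t \,\Big|\, s_0\Big],
\qquad
\hat{f}_\theta(s_t, a_t) \approx s_{t+1},
\qquad
a_t = \arg\max_{a \in A} \hat{Q}(s_t, a).
$$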
Instead of mapping the observed state of the real world directly to an action, we want to find a
compact state representation that captures temporal and spatial environment dynamics. The idea is
that this reduced representation retains the information relevant and useful to the task, allowing for
more efficient learning due to its smaller feature space. These representations are used
to learn a transition model that captures environment dynamics. A controller is then trained to select
optimal actions using this state representation and knowledge of possible state transitions.
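As a concrete illustration of this three-part design, the following is a minimal PyTorch sketch of an encoder, a latent transition model, and a Q-value controller. All module names, layer sizes, and losses here are illustrative assumptions, not the exact architecture or objectives used in our experiments.

```python
# Illustrative sketch only: names, sizes, and losses are assumptions,
# not the architecture used in our experiments.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 32, 4  # assumed sizes, for illustration

class ObsEncoder(nn.Module):
    """Maps a raw observation (e.g. flattened image or keypoints) to a compact latent state."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent state and action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

class Controller(nn.Module):
    """Scores latent state-action pairs; the agent takes the highest-scoring action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

def dynamics_loss(encoder, dynamics, obs, action, next_obs):
    """Fit the transition model: predict the (detached) next latent from (z_t, a_t)."""
    z, z_next = encoder(obs), encoder(next_obs).detach()
    return nn.functional.mse_loss(dynamics(z, action), z_next)

def select_action(encoder, controller, obs, candidate_actions):
    """Test-time action selection: score each candidate action, return the best one."""
    z = encoder(obs).expand(len(candidate_actions), -1)
    return candidate_actions[controller(z, candidate_actions).argmax()]
```

Training the dynamics model on latent rather than raw states is one design choice consistent with the compact-representation motivation above; predicting raw observations directly would be another option.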