Sample Efficient Robot Learning with Structured World Models

Tuluhan Akbulut * Max Merlin * Shane Parr * Benedict Quartey * Skye Thompson *

*All authors contributed equally to this work.

December 16, 2021
Abstract:
Reinforcement learning has been demonstrated as a flexible and effective approach for learning a range of continuous control tasks, such as those used by robots to manipulate objects in their environment. But in robotics particularly, real-world rollouts are costly, and sample efficiency can be a major limiting factor when learning a new skill. In game environments, the use of world models has been shown to improve sample efficiency while still achieving good performance, especially when images or other rich observations are provided. In this project, we explore the use of a world model in a deformable robotic manipulation task, evaluating its effect on sample efficiency when learning to fold a cloth in simulation. We compare RGB image observations with a feature space that leverages built-in structure (keypoints representing the cloth configuration), a common approach in robot skill learning, and evaluate the impact on task performance and learning efficiency with and without the world model. Our experiments showed that the use of keypoints increased the performance of the best model on the task by 50%, and in general, the use of a learned or constructed reduced feature space improved task performance and sample efficiency. The use of a state transition predictor (MDN-RNN) in our world models did not have a notable effect on task performance.
Introduction
Reinforcement Learning (RL) is a policy-learning approach that incentivizes an agent to produce desired behavior through feedback (reward) from the environment, conditioned on the actions taken by said agent [15]. This formulation has appealing flexibility and the ability to generalize to a wide class of problems, from structured games to continuous control. However, one limitation of RL algorithms in complex environments, where the state and action spaces are large and reward may be sparse, is their sample inefficiency. Capturing sufficient information about an environment to learn a useful policy can require taking hundreds of thousands of actions and recording the results. This is particularly problematic when ‘samples’ are hard to come by - for instance, in real-world environments such as robotics, where taking such actions is costly [3].
World Models (WMs) attempt to improve the sample efficiency of RL approaches by learning a model that captures the temporal and spatial dynamics of the environment [6], so that an agent can learn a policy by leveraging that model to reduce the number of interactions it needs to take in the real world. A world model learns the dynamics of an environment independent of the specific policy
the agent might take, allowing it to leverage information gained through off-policy exploration and to simulate the effect of taking actions in states that may otherwise be hard to reach. Some recent RL approaches using world models [8] have been shown to improve sample efficiency over the state of the art on Atari benchmarks. In this project, we investigate the performance of one world-model approach on a continuous control task in a complex environment with a large state space: a robot manipulating a cloth to fold it in simulation.
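To make the mechanism concrete, the sketch below illustrates the general recipe rather than this paper's exact implementation, and the module and function names are hypothetical: a one-step dynamics model is fit to transitions gathered by any exploration policy, and the agent then improves its policy on imagined rollouts from that model instead of on additional real-world interactions.

```python
import torch
import torch.nn as nn

# Hypothetical one-step dynamics model: predicts the next state from (state, action).
class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def fit_dynamics(model, states, actions, next_states, epochs=50, lr=1e-3):
    """Supervised fit on transitions collected by any exploration policy (off-policy data)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = nn.functional.mse_loss(pred, next_states)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def imagined_rollout(model, policy, start_state, horizon=10):
    """Roll the learned model forward instead of the real environment."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        state = model(state, action)
        trajectory.append((state, action))
    return trajectory
```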
We found that learned state representations improved learning performance, but the environment dynamics predictor of the world model did not improve performance in our ablation experiments. However, we should note that the approach we used is not the current state of the art, and our experimentation was limited by computing constraints. Secondly, we show that using structured data as states, i.e., keypoints instead of RGB images, increases both performance and sample efficiency more than learned state representations do. This suggests that it is possible to increase the sample efficiency of world models through the use of a structured feature space - but more progress is needed before world models can be effectively used to solve real-world robotics tasks.
Related Work
Ha and Schmidhuber [6] demonstrate the effectiveness of a learned generative world model on a simulated car-racing task and Doom, learning to encode a pixel image of the state as a latent representation from which the next state can be reconstructed, and learning a separate policy model to select the best action given the encoded state. Hafner et al. [7] evaluate WMs in the DeepMind Control Suite [16] with a more complex policy model. Hafner et al. [8] discretize the latent space of WMs to increase their prediction performance; their model outperforms state-of-the-art RL models and human players on the Atari benchmark. In Kipf et al. [11], structured world models are learned, although they are not used for exploration. Finally, world models have been used for exploration via planning by Sekar et al. [14]. Our project expands on this literature by introducing built-in structure to the state space and better exploration strategies, to make WMs a better fit for robotics. Our approach is directly informed by that of Ha and Schmidhuber, though we adjust our model training approach based on what has been shown to be effective in later literature [8].
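For reference, the observation encoder in Ha and Schmidhuber's setup is a convolutional variational autoencoder that compresses each frame into a small latent vector. The sketch below is a minimal version of such an encoder, assuming 64x64 RGB inputs and layer sizes chosen for illustration rather than taken from any particular implementation.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Compresses a 64x64 RGB observation into a small latent vector z."""
    def __init__(self, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64x64 -> 31x31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31x31 -> 14x14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14x14 -> 6x6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # 6x6 -> 2x2
            nn.Flatten(),
        )
        self.mu = nn.Linear(256 * 2 * 2, z_dim)
        self.logvar = nn.Linear(256 * 2 * 2, z_dim)

    def encode(self, obs):
        h = self.enc(obs)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```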
Problem Statement
We assume a task in an environment that can be modeled as an MDP, with a state s from a state space S, where the agent selects an action a from its action space A at each timestep, transitioning to a state s' and receiving a reward r. The agent's goal is to select actions that maximize the cumulative discounted reward given the agent's starting state. Given a dataset collected in the environment with some exploration policy, consisting of state transitions, actions taken, and rewards, we aim to learn a model capturing that MDP's transition function: (s_t, a_t) → s_{t+1}. At test time, the agent uses this model to select the action a_t with the highest predicted Q-value. The task we attempt to learn, cloth folding, is a continuous control task with a large and complex state space, due to the infinitely many possible configurations of the cloth. Learning a policy directly in this environment, without prior structure, knowledge of environment dynamics, or a bias towards meaningful state features or regions of the state space, would be very sample intensive.
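As a minimal sketch of this test-time rule (with an illustrative q_model callable standing in for whatever value estimate is learned), the agent scores a set of sampled candidate actions and executes the one with the highest predicted Q-value; sampling is used because the continuous action space cannot be enumerated exactly.

```python
import torch

def select_action(q_model, state, candidate_actions):
    """Pick the action with the highest predicted Q-value for the current state.

    q_model: any callable mapping (state, action) -> scalar Q estimate.
    candidate_actions: a tensor of sampled actions, since the continuous
    action space cannot be enumerated.
    """
    with torch.no_grad():
        q_values = torch.stack([q_model(state, a) for a in candidate_actions])
    return candidate_actions[q_values.argmax()]
```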
Instead of mapping the observed state of the real world directly to an action, we want to find a
compact state representation that captures temporal and spatial environment dynamics. The idea is
that this reduced representation includes relevant and useful information for the task, and allows for
more efficient learning due to its smaller space of relevant features. These representations are used
to learn a transition model that captures environment dynamics. A controller is then trained to select
optimal actions using this state representation and knowledge of possible state transitions.
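A rough sketch of this pipeline, in the spirit of Ha and Schmidhuber's world model and with illustrative dimensions and names: a compact state (a learned latent code or cloth keypoints), an MDN-RNN that predicts a distribution over the next compact state given the current state and action, and a small controller that maps the compact state to an action.

```python
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    """Transition predictor: given (z_t, a_t), outputs a Gaussian mixture over z_{t+1}."""
    def __init__(self, z_dim, action_dim, hidden=256, n_mixtures=5):
        super().__init__()
        self.rnn = nn.LSTM(z_dim + action_dim, hidden, batch_first=True)
        self.z_dim, self.k = z_dim, n_mixtures
        # Mixture weights, plus means and log-std-devs for each latent dimension.
        self.head = nn.Linear(hidden, n_mixtures * (1 + 2 * z_dim))

    def forward(self, z, a, hidden=None):
        out, hidden = self.rnn(torch.cat([z, a], dim=-1), hidden)
        params = self.head(out)
        logpi, mu, logsigma = torch.split(
            params, [self.k, self.k * self.z_dim, self.k * self.z_dim], dim=-1)
        return logpi.log_softmax(-1), mu, logsigma, hidden

class Controller(nn.Module):
    """Maps the compact state (latent code or keypoints, optionally with the RNN
    hidden state) to a continuous action."""
    def __init__(self, feat_dim, action_dim):
        super().__init__()
        self.fc = nn.Linear(feat_dim, action_dim)

    def forward(self, features):
        return torch.tanh(self.fc(features))
```

A mixture-density output is used here because deformable-object dynamics can be multimodal: from the same grasp and pull, the cloth may settle into several distinct configurations, which a single Gaussian prediction would average over.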