the agent might take, allowing it to leverage information gained through off-policy exploration, and
to simulate the effect of taking actions in states that may be otherwise hard to reach. Some recent
RL approaches using world models [8] have been shown to improve sample efficiency over the state
of the art on Atari benchmarks. In this project, we investigate the performance of one world model
approach on a continuous control task in a complex environment with a large state space: a robot
manipulating a cloth to fold it in simulation.
We found that learning state representations improved learning performance, but the environment
dynamics predictor of world models did not improve performance compared to our ablation
experiments. However, we note that the approach we used is not the current state of the art,
and our experimentation was limited by computing constraints. We also show that using structured
data as states, i.e., keypoints instead of RGB images, increases both performance and sample
efficiency more than learned state representations do. This suggests that the sample efficiency of
world models can be increased with the use of a structured feature space, but more progress is needed
before world models can be effectively used to solve real-world robotics tasks.
Related Work
Ha and Schmidhuber [6] demonstrate the effectiveness of a learned generative world model on a
simulated racecar task and Doom, learning to encode a pixel image of the state as a latent represen-
tation from which the next state can be reconstructed, and learning a separate policy model to select
the best action given the encoded state. Hafner et al. [7] apply world models (WMs) to the DeepMind
Control Suite [16] with a more complex policy model. Hafner et al. [8] discretize the latent space of
WMs to improve their prediction performance; their model outperforms state-of-the-art RL models
and human players on the Atari benchmark. Kipf et al. [11] learn structured world models, although
they do not use them for exploration. Finally, world models have been used for exploration via
planning by Sekar et al. [14]. Our project expands on this literature by introducing built-in structure
in the state space and better exploration strategies to make WMs a better fit for robotics. Our
approach is directly informed by that of Ha and Schmidhuber [6], though we adjust our model
training approach based on what has been shown to be effective in later literature [8].
Problem Statement
We assume a task in an environment that can be modeled as an MDP, with a state $s$ from a state space
$S$, where the agent selects an action $a$ from its action space $A$ at each timestep, transitioning to a
state $s'$ and receiving a reward $r$. The agent's goal is to select actions that maximize the cumulative
discounted reward given the agent’s starting state. Given a dataset collected in the environment with
some exploration policy, consisting of state transitions, actions taken, and rewards, we aim to learn
a model capturing that MDP's transition function: $(s_t, a_t) \rightarrow s_{t+1}$. At test time, the agent uses this
model to select the action $a_t$ with the highest predicted Q-value. The task we attempt to learn, cloth folding, is
a continuous control task with a large and complex state space, due to the infinitely many possible
configurations of the cloth. Learning a policy directly in this environment, without prior structure,
knowledge of environment dynamics, or bias towards meaningful state features or regions of the
state space, would be very sample intensive.
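For concreteness, this setup can be summarized in standard RL notation (the discount factor $\gamma$, the learned transition model $\hat{f}_\theta$, and the value estimate $\hat{Q}$ are symbols we introduce here for illustration):
$$
\pi^\ast = \arg\max_\pi \, \mathbb{E}_\pi\!\Big[\textstyle\sum_{t \ge 0} \gamma^t r_t \,\Big|\, s_0\Big],
\qquad
\hat{f}_\theta(s_t, a_t) \approx s_{t+1},
\qquad
a_t = \arg\max_{a \in A} \hat{Q}(s_t, a).
$$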
Instead of mapping the observed state of the real world directly to an action, we want to find a
compact state representation that captures temporal and spatial environment dynamics. The idea is
that this reduced representation retains the information relevant and useful to the task, allowing for
more efficient learning due to its smaller feature space. These representations are used
to learn a transition model that captures environment dynamics. A controller is then trained to select
optimal actions using this state representation and knowledge of possible state transitions.
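As a concrete illustration of this three-part design, the following is a minimal PyTorch sketch of an encoder, a latent transition model, and a Q-value controller. All module names, layer sizes, and losses here are illustrative assumptions, not the exact architecture or objectives used in our experiments.

```python
# Illustrative sketch only: names, sizes, and losses are assumptions,
# not the architecture used in our experiments.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 32, 4  # assumed sizes, for illustration

class ObsEncoder(nn.Module):
    """Maps a raw observation (e.g. flattened image or keypoints) to a compact latent state."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent state and action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

class Controller(nn.Module):
    """Scores latent state-action pairs; the agent takes the highest-scoring action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

def dynamics_loss(encoder, dynamics, obs, action, next_obs):
    """Fit the transition model: predict the (detached) next latent from (z_t, a_t)."""
    z, z_next = encoder(obs), encoder(next_obs).detach()
    return nn.functional.mse_loss(dynamics(z, action), z_next)

def select_action(encoder, controller, obs, candidate_actions):
    """Test-time action selection: score each candidate action, return the best one."""
    z = encoder(obs).expand(len(candidate_actions), -1)
    return candidate_actions[controller(z, candidate_actions).argmax()]
```

Training the dynamics model on latent rather than raw states is one design choice consistent with the compact-representation motivation above; predicting raw observations directly would be another option.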