Real World Offline Reinforcement Learning
with Realistic Data Source
Gaoyue Zhou*1, Liyiming Ke*2, Siddhartha Srinivasa2, Abhinav Gupta1,
Aravind Rajeswaran3, and Vikash Kumar3
1Carnegie Mellon University
2University of Washington
3Meta AI
*Equal contribution. Work completed during internship at FAIR-MetaAI.
Abstract:
Offline reinforcement learning (ORL) holds great promise for robot learning due
to its ability to learn from arbitrary pre-generated experience. However, current
ORL benchmarks are almost entirely in simulation and utilize contrived datasets
like replay buffers of online RL agents or sub-optimal trajectories, and thus hold
limited relevance for real-world robotics. In this work (Real-ORL), we posit that
data collected from safe operations of closely related tasks are more practical data
sources for real-world robot learning. Under these settings, we perform an exten-
sive (6500+ trajectories collected over 800+ robot hours and 270+ hours of human
labor) empirical study evaluating generalization and transfer capabilities of repre-
sentative ORL methods on four real-world tabletop manipulation tasks. Our study
finds that ORL and imitation learning prefer different action spaces, and that ORL
algorithms can generalize by leveraging offline heterogeneous data sources and
outperform imitation learning. We release our dataset and implementations at
URL: https://sites.google.com/view/real-orl
1 Introduction
Figure 1: Realistic data sources for offline RL algorithms in real world tasks. The figure contrasts simulation data sources, such as half-trained policies (unrealistic), random policies (unsafe), and expert-plus-noise policies (unsafe), with real-world, multi-task data sources that feed offline-RL datasets, offline-RL algorithms, and imitation learning.
Despite rapid advances, the applicability of Deep Reinforcement Learning (DRL) algorithms [1,
2,3,4,5,6,7,8] to real-world robotics tasks is limited due to sample inefficiency and safety
considerations. The emerging field of offline reinforcement learning (ORL) [9,10] has the
potential to overcome these limitations by learning only from logged or pre-generated offline
datasets, thereby circumventing safety and exploration challenges. This makes ORL well suited
for applications with large datasets (e.g. recommendation systems) or those where online inter-
actions are scarce and expensive (e.g. robotics). However, comprehensive benchmarking and
empirical evaluation of ORL algorithms is significantly lagging behind the burst of algorithmic
progress [11,12,13,14,15,16,17,18,19,20,21]. Widely used ORL benchmarks [22,23] are
entirely in simulation and use contrived data collection protocols that do not capture fundamental
considerations of physical robots. In this work (Real-ORL), we aim to bridge this gap by outlining
practical offline dataset collection protocols that are representative of real-world robot settings. Our
work also performs a comprehensive empirical study spanning 6500+ trajectories collected over
800+ robot hours and 270+ hours of human labor, to benchmark and analyze three representative
ORL algorithms thoroughly. We will release all the datasets, code, and hardware hooks from this
paper.
Figure 2: Canonical tasks for tabletop manipulation.
In principle, ORL can consume and train policies from arbitrary datasets. This has prompted the
development of simulated ORL benchmarks [22,23,24] that utilize data sources like expert policies
trained with online RL, exploratory policies, or even replay buffers of online RL agents. However,
simulated datasets may fail to capture the challenges of the real world: hardware noise coupled with
varying reset conditions leads to covariate shift and violates the i.i.d. assumption about state distri-
butions between train and test time. Further, such datasets are not feasible to collect on physical robots and
defeat the core motivation of ORL in robotics – to avoid the use of online RL due to poor sample
efficiency and safety! Recent works [24,25] suggest that dataset composition and distribution dra-
matically affect the relative performance of algorithms. Against this backdrop, we consider the pertinent
question:
What is a practical instantiation of the ORL setting for physical robots, and can existing ORL algo-
rithms learn successful policies in such a setting?
In this work, we envision practical scenarios to apply ORL for real-world robotics. Towards this
end, our first insight is that real-world offline datasets are likely to come from well-behaved policies
that abide by safety and monetary constraints, in sharp contrast to simulator data collected from
exploratory or partially trained policies, as used in simulated benchmarks [22,23,24]. Such trajec-
tories can be collected by user demonstrations or through hand-scripted policies that are partially
successful but safe. It is more realistic to collect large volumes of data for real robots using multiple
successful policies designed under expert supervision for specific tasks than using policies that are
unsuccessful or lack safety guarantees. Secondly, the goal of any learning (including ORL) is
broad generalization and transfer. It is therefore critical to study whether a learning algorithm can
leverage task-agnostic datasets, or datasets intended for a source task, to make progress on a new
target task. In this work, we collect offline datasets consistent with these principles and evaluate
representative ORL algorithms on a set of canonical table-top tasks as illustrated in Figure 2.
Evaluation studies on physical robots are sparse in the field due to time and resource constraints,
but they are vital to furthering our understanding. Our real-robot results corroborate and validate
intuitions from simulated benchmarks [26] but also enable novel discoveries. We find that (1) even
for scenarios with sufficiently high-quality data, some ORL algorithms could outperform behavior
cloning (BC) [27] on specific tasks; (2) for scenarios that require generalization or transfer to new
tasks with low data support, ORL agents generally outperform BC; and (3) in cases with overlapping
data support, ORL algorithms can leverage additional heterogeneous task-agnostic data to improve
their own performance, and in some cases even surpass the best in-domain agent.
Our empirical evaluation is unique in that it focuses on ORL algorithms' ability to leverage more realistic,
multi-task data sources, spans several algorithm-agnostic tasks, trains various ORL algorithms
under the same settings, and evaluates them directly in the real world. In summary, we believe
Real-ORL establishes the effectiveness of offline RL algorithms in leveraging out-of-domain, high-
quality, heterogeneous data for generalization and transfer in robot learning, which is representative
of real-world applications.
2 Preliminaries and Related Work
Offline RL. We consider the ORL framework, which models the environment as a Markov Decision
Process (MDP): $\mathcal{M} = \langle S, A, R, T, \rho_0, H \rangle$, where $S \subseteq \mathbb{R}^n$ is the state space, $A \subseteq \mathbb{R}^m$ is the
action space, $R: S \times A \rightarrow \mathbb{R}$ is the reward function, $T: S \times A \times S \rightarrow \mathbb{R}_+$ is the (stochastic)
transition dynamics, $\rho_0: S \rightarrow \mathbb{R}_+$ is the initial state distribution, and $H$ is the maximum trajectory
horizon. In the ORL setting, we assume access to the reward function $R$ and a pre-generated dataset
of the form $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$, where each $\tau_i = (s_0, a_0, s_1, a_1, \ldots, s_H)$ is a trajectory collected
using a behavioral policy or a mix of policies $\pi_b: S \times A \rightarrow \mathbb{R}_+$.
The goal in ORL is to use the offline dataset $\mathcal{D}$ to learn a near-optimal policy,
$$\pi^* := \arg\max_{\pi} \; \mathbb{E}_{\mathcal{M}} \left[ \sum_{t=0}^{H} r(s_t, a_t) \right].$$
In the general case, the optimal policy $\pi^*$ may not be learnable using $\mathcal{D}$ due to a lack of sufficient
exploration in the dataset. In this case, we would seek the best policy learnable from the dataset, or,
at the very least, a policy that improves upon the behavioral policy.
Offline RL Algorithms. Recent years have seen tremendous interest in offline RL and the de-
velopment of new ORL algorithms. Most of these algorithms incorporate some form of regular-
ization or conservatism. This can take many forms, such as regularized policy gradient or actor-
critic algorithms [14,15,19,28,29,30], approximate dynamic programming [11,13,17,18], and
model-based RL [12,31,32,33]. We select a representative ORL algorithm from each category:
AWAC [19], IQL [18], and MOReL [12]. In this work, we do not propose new algorithms for offline
RL; rather we study a spectrum of representative ORL algorithms and evaluate their assumptions
and effectiveness on a physical robot under realistic usage scenarios.
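To illustrate the kind of conservatism these methods add, the sketch below shows an advantage-weighted actor update in the spirit of AWAC [19]; the network interfaces (policy.sample, policy.log_prob, q_net) are assumed, and this is a simplified sketch rather than the exact implementation we benchmark.

# Advantage-weighted actor update in the spirit of AWAC (simplified sketch;
# assumed interfaces, not the benchmarked implementation). Logged actions are
# re-weighted by exp(advantage / temperature), which keeps the learned policy
# close to well-supported, high-advantage behavior in the offline dataset.
import torch

def awac_actor_loss(policy, q_net, states, actions, temperature=1.0):
    with torch.no_grad():
        pi_actions = policy.sample(states)                   # a' ~ pi(.|s), used as a baseline
        advantage = q_net(states, actions) - q_net(states, pi_actions)
        weights = torch.clamp(torch.exp(advantage / temperature), max=100.0)
    log_prob = policy.log_prob(states, actions)              # log pi(a|s) of the logged actions
    return -(weights * log_prob).mean()                      # minimized by the actor optimizer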
Offline RL Benchmarks and Evaluation. In conjunction with algorithmic advances, offline RL
benchmarks have also been proposed. However, they are predominantly constructed in simula-
tion [22,23,34] using datasets with idealistic coverage, i.i.d. samples, and synchronous execu-
tion. Most of these assumptions are invalid in the real world, which is stochastic and has operational
delays. Prior works investigating offline RL in these settings on physical robots are limited. For
instance, Kostrikov et al. [18] did not provide a real-robot evaluation of IQL, which we conduct in
this work; Chebotar et al. [35] and Kalashnikov et al. [36] evaluate performance on a specialized Arm-
Farm; Rafailov et al. [37] evaluate on a single drawer-closing task; Singh et al. [17] and Kumar et al.
[38] each evaluate only one algorithm (COG and CQL, respectively). Mandlekar et al. [39] evaluate BCQ
and CQL alongside BC on three real robotics tasks, but their evaluations consider only the in-domain
setting: the agents were trained only on data for the specific task, without access to
a larger pre-generated offline dataset. Thus, it is unclear whether insights from simulated benchmarks or
limited hardware evaluations generalize broadly. Our work aims to bridge this gap by empirically
studying representative offline RL algorithms on a suite of real-world robot learning tasks with an
emphasis on transfer learning and out-of-domain generalization. See Section 3 for a detailed discussion.
Imitation Learning (IL). IL [40] is an alternate approach to training control policies for robotics.
Unlike RL, which learns policies by optimizing rewards (or costs), IL (and inverse RL [41,42,
43]) learns by mimicking expert demonstrations and typically requires no reward function. IL has
been studied in both the offline setting [44,45], where the agent learns from a fixed set of expert
demonstrations, and the online setting [46,47], where the agent can perform additional environment
interactions. A combination of RL and IL has also been explored in prior work [48,49]. Our
offline dataset consists of trajectories from a heuristic hand-scripted policy collected under expert
supervision, which represents a dataset of reasonably high quality. As a result, we consider offline
IL, and behavior cloning in particular, as a baseline algorithm in our empirical evaluation.
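For reference, the behavior cloning baseline reduces to supervised regression of actions on states over the offline dataset; a minimal sketch with an assumed two-layer MLP policy (not our exact architecture or hyperparameters) is:

# Minimal behavior cloning sketch: regress logged actions on logged states.
# The MLP architecture and hyperparameters here are assumptions.
import torch
import torch.nn as nn

def behavior_cloning(states: torch.Tensor, actions: torch.Tensor,
                     epochs: int = 100, lr: float = 1e-3) -> nn.Module:
    policy = nn.Sequential(
        nn.Linear(states.shape[1], 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, actions.shape[1]),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((policy(states) - actions) ** 2).mean()      # mean-squared-error imitation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy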
3 Experiment Scope and Setup
To investigate the effectiveness of ORL algorithms on real-world robot learning tasks, we adhere
to a few guiding principles: (1) we make design choices representative of the wider community to the
extent possible; (2) we strive to be fair to all baselines by giving them their best chance and working
in consultation with their authors; and (3) we prioritize reproducibility and data sharing. We will
open-source our data and camera images, along with our training and evaluation codebase.
Hardware Setup. Hardware plays a seminal role in robotic capability. For reproducibility and
extensibility, we selected a hardware platform that is well-established, non-custom, and commonly
used in the field. After an exhaustive literature survey [50,51,52,53,54,55], we converged on a
table-top manipulation setup, shown in Figure 3. It consists of a table-mounted Franka Panda arm
that uses a RobotiQ parallel gripper as its end effector, accompanied by two Intel RealSense 435 RGBD
cameras. Our robot has 8 DOF, uses factory-supplied default controller gains, accepts position
commands at 15 Hz, and runs a low-level joint position controller at 1000 Hz. To perceive the object
to interact with, we extract the positions of the AprilTags attached to the object from RGB images. Our
robot state consists of joint positions, joint velocities, and the position of the object to interact with (if
applicable). Our policies compute actions (desired joint pose) from robot proprioception, tracked
object locations, and the desired goal location.
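To make the state and action conventions above concrete, the sketch below assembles the observation vector and streams joint-position commands at 15 Hz; get_joint_state, get_tag_position, and send_joint_position_command are hypothetical placeholders for the robot and camera interfaces, not an actual API.

# Hypothetical 15 Hz control loop matching the state/action conventions above.
# The three hardware helpers are placeholders; the 1 kHz joint controller runs
# on the robot side and tracks whatever desired joint pose we send.
import time
import numpy as np

CONTROL_HZ = 15

def control_loop(policy, goal_position, get_joint_state, get_tag_position,
                 send_joint_position_command, duration_s=10.0):
    period = 1.0 / CONTROL_HZ
    for _ in range(int(duration_s * CONTROL_HZ)):
        t_start = time.time()
        joint_pos, joint_vel = get_joint_state()             # robot proprioception
        object_pos = get_tag_position()                      # AprilTag-tracked object position
        obs = np.concatenate([joint_pos, joint_vel, object_pos, goal_position])
        desired_joint_pose = policy(obs)                     # action = desired joint pose
        send_joint_position_command(desired_joint_pose)
        time.sleep(max(0.0, period - (time.time() - t_start)))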
Figure 3: Our setup consists of a commonly used Franka arm, a RobotiQ parallel gripper, and two Intel RealSense 435 cameras.
Canonical Tasks. We consider four classic manipulation tasks common in the literature: reach, slide,
lift, and pick-n-place (PnP) (see Figure 2). reach requires the robot to move from a randomly
sampled configuration in the workspace to another configuration. The other three tasks involve a
heavy glass lid with a handle, which is initialized randomly on the table. slide requires the robot
to hold and move the lid along the table to a specified goal location. lift requires the robot to
grasp and lift the lid 10 cm off the table. PnP requires the robot to grasp, lift, move, and place the
lid at a designated goal position, i.e., the chopping board. The four tasks constitute a representative
range of common table-top manipulation challenges: reach focuses on free movement, while the
other three tasks involve intermittent interaction dynamics between the table, the lid, and the parallel
gripper. We model each canonical task as an MDP with a unique reward function. Details on our
tasks are in Appendix 8.1.
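As one example of what a task-specific reward can look like (the actual definitions are in Appendix 8.1; the shaping and success radius below are hypothetical):

# Hypothetical shaped reward for reach: negative end-effector-to-goal distance
# plus a bonus inside a success radius. The rewards actually used in the paper
# are defined in Appendix 8.1.
import numpy as np

def reach_reward(ee_pos, goal_pos, success_radius=0.05):
    dist = float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_pos)))
    return -dist + (1.0 if dist < success_radius else 0.0)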
Data Collection. We use a hand-designed, scripted policy developed under expert supervision to
collect (dominantly) successful trajectories for all our canonical tasks. To highlight ORL algorithms'
ability to overcome suboptimal datasets, previous works [22,34,39] have crippled expert policies
with noise, used half-trained RL policies, or collected human demonstrations of varying quality,
showcasing the performance gain over such compromised datasets. We posit that such data sources are not
representative of robotics domains, where noisy or random behaviors are unsafe and detrimental to
the hardware's stability. Instead of infusing noise or failure data points to serve as negative examples, we
believe that mixing data collected from various tasks offers a more realistic setting in which to apply
ORL on real robots, for three reasons: (1) collecting such “random/roaming/explorative” data on
a real robot autonomously would require comprehensive safety constraints, expert supervision, and
oversight; (2) engaging experts to record such random data in large quantities makes less sense than
utilizing them to collect meaningful trajectories on a real task; and (3) designing task-specific
strategies and stress-testing ORL's ability against such a strong dataset is more viable than using a
compromised dataset. In Real-ORL, we collected the offline dataset using heuristic strategies designed
with reasonable effort and, to avoid biases favoring any task or algorithm, froze the dataset ahead of time.
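One simple way to realize such mixed, multi-task datasets is to pool the per-task trajectory sets and recompute rewards with the target task's (known) reward function; the sketch below is illustrative (it reuses the hypothetical Trajectory container from Section 2) and is not necessarily our exact data pipeline.

# Illustrative pooling of per-task trajectory sets into one heterogeneous
# offline dataset for a target task, relabeling rewards with that task's known
# reward function. This mirrors the ORL assumption of reward access; it is not
# necessarily the paper's exact recipe.
import numpy as np

def build_offline_dataset(task_datasets, target_reward_fn):
    pooled = []
    for trajectories in task_datasets.values():              # e.g., {"reach": [...], "slide": [...]}
        for traj in trajectories:
            rewards = np.array([target_reward_fn(s, a)
                                for s, a in zip(traj.states[:-1], traj.actions)])
            pooled.append(Trajectory(traj.states, traj.actions, rewards))
    return pooled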
To create scripted policies for all tasks, we first decompose each task into simpler stages marked
by end-effector sub-goals. We leverage MuJoCo's IK solver to map these sub-goals into joint space.
The scripted policy takes tiny steps toward sub-goals until task-specific criteria are met. Our
heuristic policies did not reach the theoretical maximum scores due to controller noise.
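A simplified rendition of this scripted policy is sketched below; the sub-goal list, IK call, and stopping criterion are placeholders rather than the exact implementation.

# Simplified sketch of the scripted data-collection policy: each task stage is
# an end-effector sub-goal, an IK solver (placeholder for MuJoCo's) maps it to
# joint space, and the robot takes tiny steps toward it until a task-specific
# criterion is met. All helpers are hypothetical placeholders.
import numpy as np

def run_scripted_policy(subgoals, solve_ik, get_joint_positions,
                        send_joint_position_command, criterion_met,
                        step_size=0.02, max_steps=500):
    for ee_subgoal in subgoals:                              # e.g., pre-grasp, grasp, lift poses
        target_joints = np.asarray(solve_ik(ee_subgoal))     # map the sub-goal into joint space
        for _ in range(max_steps):                           # guard against unreachable criteria
            if criterion_met(ee_subgoal):
                break
            current = np.asarray(get_joint_positions())
            step = np.clip(target_joints - current, -step_size, step_size)
            send_joint_position_command(current + step)      # tiny step toward the sub-goal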