Real World Offline Reinforcement Learning
with Realistic Data Source
Gaoyue Zhou*1, Liyiming Ke*2, Siddhartha Srinivasa2, Abhinav Gupta1,
Aravind Rajeswaran3, and Vikash Kumar3
1Carnegie Mellon University
2University of Washington
3Meta AI
*Equal contribution. Work completed during internship at FAIR-MetaAI.
Abstract:
Offline reinforcement learning (ORL) holds great promise for robot learning due
to its ability to learn from arbitrary pre-generated experience. However, current
ORL benchmarks are almost entirely in simulation and utilize contrived datasets
like replay buffers of online RL agents or sub-optimal trajectories, and thus hold
limited relevance for real-world robotics. In this work (Real-ORL), we posit that
data collected from safe operations of closely related tasks are more practical data
sources for real-world robot learning. Under these settings, we perform an exten-
sive (6500+ trajectories collected over 800+ robot hours and 270+ hours of human
labor) empirical study evaluating generalization and transfer capabilities of repre-
sentative ORL methods on four real-world tabletop manipulation tasks. Our study
finds that ORL and imitation learning prefer different action spaces, and that ORL
algorithms can generalize by leveraging offline heterogeneous data sources and
outperform imitation learning. We release our dataset and implementations at
URL: https://sites.google.com/view/real-orl
1 Introduction
Figure 1: Realistic data sources for offline RL algorithms in real world tasks. The figure contrasts simulation data sources, such as half-trained policies (unrealistic), random policies (unsafe), and expert-plus-noise policies (unsafe), with real-world, multi-task data sources that feed offline-RL datasets, offline-RL algorithms, and imitation learning.
Despite rapid advances, the applicability of Deep Reinforcement Learning (DRL) algorithms [1,
2,3,4,5,6,7,8] to real-world robotics tasks is limited due to sample inefficiency and safety
considerations. The emerging field of offline reinforcement learning (ORL) [9,10] has the
potential to overcome these limitations by learning only from logged or pre-generated offline
datasets, thereby circumventing safety and exploration challenges. This makes ORL well suited
for applications with large datasets (e.g. recommendation systems) or those where online inter-
actions are scarce and expensive (e.g. robotics). However, comprehensive benchmarking and
empirical evaluation of ORL algorithms is significantly lagging behind the burst of algorithmic
progress [11,12,13,14,15,16,17,18,19,20,21]. Widely used ORL benchmarks [22,23] are
entirely in simulation and use contrived data collection protocols that do not capture fundamental
considerations of physical robots. In this work (Real-ORL), we aim to bridge this gap by outlining
practical offline dataset collection protocols that are representative of real-world robot settings. Our
work also performs a comprehensive empirical study spanning 6500+ trajectories collected over
800+ robot hours and 270+ hours of human labor, to benchmark and analyze three representative
ORL algorithms thoroughly. We will release all the datasets, code, and hardware hooks from this
paper.
Figure 2: Canonical tasks for tabletop manipulation.
In principle, ORL can consume and train policies from arbitrary datasets. This has prompted the
development of simulated ORL benchmarks [22,23,24] that utilize data sources like expert policies
trained with online RL, exploratory policies, or even replay buffers of online RL agents. However,
simulated datasets may fail to capture the challenges of the real world: hardware noise coupled with
varying reset conditions leads to covariate shift and violates the i.i.d. assumption about state distri-
butions between train and test time. Further, such datasets are not feasible to collect on physical robots and
defeat the core motivation of ORL in robotics – to avoid the use of online RL due to poor sample
efficiency and safety! Recent works [24,25] suggest that dataset composition and distribution dra-
matically affect the relative performance of algorithms. Against this backdrop, we consider the pertinent
question:
What is a practical instantiation of the ORL setting for physical robots, and can existing ORL algo-
rithms learn successful policies in such a setting?
In this work, we envision practical scenarios to apply ORL for real-world robotics. Towards this
end, our first insight is that real-world offline datasets are likely to come from well-behaved policies
that abide by safety and monetary constraints, in sharp contrast to simulator data collected from
exploratory or partially trained policies, as used in simulated benchmarks [22,23,24]. Such trajec-
tories can be collected by user demonstrations or through hand-scripted policies that are partially
successful but safe. It is more realistic to collect large volumes of data for real robots using multiple
successful policies designed under expert supervision for specific tasks than using policies that are
unsuccessful or lack safety guarantees. Secondly, the goal of any learning (including ORL) is
broad generalization and transfer. It is therefore critical to study whether a learning algorithm can
leverage task-agnostic datasets, or datasets intended for a source task, to make progress on a new
target task. In this work, we collect offline datasets consistent with these principles and evaluate
representative ORL algorithms on a set of canonical table-top tasks as illustrated in Figure 2.
Evaluation studies on physical robots are sparse in the field due to time and resource constraints,
but they are vital to furthering our understanding. Our real-robot results corroborate and validate
intuitions from simulated benchmarks [26] but also enable novel discoveries. We find that (1) even
for scenarios with sufficiently high-quality data, some ORL algorithms could outperform behavior
cloning (BC) [27] on specific tasks; (2) for scenarios that require generalization or transfer to new
tasks with low data support, ORL agents generally outperform BC; and (3) in cases with overlapping
data support, ORL algorithms can leverage additional heterogeneous task-agnostic data to improve
their own performance, and in some cases even surpass the best in-domain agent.
Our empirical evaluation is unique in that it focuses on ORL algorithms' ability to leverage more realistic,
multi-task data sources, spans several algorithm-agnostic tasks, trains various ORL algorithms
under the same settings, and evaluates them directly in the real world. In summary, we believe
Real-ORL establishes the effectiveness of offline RL algorithms in leveraging out-of-domain, high-
quality, heterogeneous data for generalization and transfer in robot learning, which is representative
of real-world applications.
2 Preliminaries and Related Work
Offline RL. We consider the ORL framework, which models the environment as a Markov Decision
Process (MDP): $\mathcal{M} = \langle S, A, R, T, \rho_0, H \rangle$, where $S \subseteq \mathbb{R}^n$ is the state space, $A \subseteq \mathbb{R}^m$ is the
action space, $R: S \times A \rightarrow \mathbb{R}$ is the reward function, $T: S \times A \times S \rightarrow \mathbb{R}_+$ is the (stochastic)
transition dynamics, $\rho_0: S \rightarrow \mathbb{R}_+$ is the initial state distribution, and $H$ is the maximum trajectory
horizon. In the ORL setting, we assume access to the reward function $R$ and a pre-generated dataset
of the form $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$, where each $\tau_i = (s_0, a_0, s_1, a_1, \ldots, s_H)$ is a trajectory collected
using a behavioral policy or a mix of policies $\pi_b: S \times A \rightarrow \mathbb{R}_+$.
The goal in ORL is to use the offline dataset $\mathcal{D}$ to learn a near-optimal policy,
$$\pi^* := \arg\max_{\pi} \; \mathbb{E}_{\mathcal{M}} \left[ \sum_{t=0}^{H} r(s_t, a_t) \right].$$
In the general case, the optimal policy $\pi^*$ may not be learnable using $\mathcal{D}$ due to a lack of sufficient
exploration in the dataset. In this case, we would seek the best policy learnable from the dataset, or,
at the very least, a policy that improves upon the behavioral policy.
Offline RL Algorithms. Recent years have seen tremendous interest in offline RL and the de-
velopment of new ORL algorithms. Most of these algorithms incorporate some form of regular-
ization or conservatism. This can take many forms, such as regularized policy gradient or actor-
critic algorithms [14,15,19,28,29,30], approximate dynamic programming [11,13,17,18], and
model-based RL [12,31,32,33]. We select a representative ORL algorithm from each category:
AWAC [19], IQL [18], and MOReL [12]. In this work, we do not propose new algorithms for offline
RL; rather we study a spectrum of representative ORL algorithms and evaluate their assumptions
and effectiveness on a physical robot under realistic usage scenarios.
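To illustrate the kind of conservatism these methods add, the sketch below shows an advantage-weighted actor update in the spirit of AWAC [19]; the network interfaces (policy.sample, policy.log_prob, q_net) are assumed, and this is a simplified sketch rather than the exact implementation we benchmark.

# Advantage-weighted actor update in the spirit of AWAC (simplified sketch;
# assumed interfaces, not the benchmarked implementation). Logged actions are
# re-weighted by exp(advantage / temperature), which keeps the learned policy
# close to well-supported, high-advantage behavior in the offline dataset.
import torch

def awac_actor_loss(policy, q_net, states, actions, temperature=1.0):
    with torch.no_grad():
        pi_actions = policy.sample(states)                   # a' ~ pi(.|s), used as a baseline
        advantage = q_net(states, actions) - q_net(states, pi_actions)
        weights = torch.clamp(torch.exp(advantage / temperature), max=100.0)
    log_prob = policy.log_prob(states, actions)              # log pi(a|s) of the logged actions
    return -(weights * log_prob).mean()                      # minimized by the actor optimizer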
Offline RL Benchmarks and Evaluation. In conjunction with algorithmic advances, offline RL
benchmarks have also been proposed. However, they are predominantly constructed in simula-
tion [22,23,34] using datasets with idealistic coverage, i.i.d. samples, and synchronous execu-
tion. Most of these assumptions are invalid in the real world, which is stochastic and has operational
delays. Prior works investigating offline RL in these settings on physical robots are limited. For
instance, Kostrikov et al. [18] did not provide a real-robot evaluation of IQL, which we conduct in
this work; Chebotar et al. [35] and Kalashnikov et al. [36] evaluate performance on a specialized Arm-
Farm; Rafailov et al. [37] evaluate on a single drawer-closing task; Singh et al. [17] and Kumar et al.
[38] each evaluate only one algorithm (COG and CQL, respectively). Mandlekar et al. [39] evaluate BCQ
and CQL alongside BC on three real robotics tasks, but their evaluations consider only the in-domain
setting: the agents were trained only on data for the specific task, without access to
a larger pre-generated offline dataset. Thus, it is unclear whether insights from simulated benchmarks or
limited hardware evaluations generalize broadly. Our work aims to bridge this gap by empirically
studying representative offline RL algorithms on a suite of real-world robot learning tasks with an
emphasis on transfer learning and out-of-domain generalization. See Section 3 for a detailed discussion.
Imitation Learning (IL). IL [40] is an alternate approach to training control policies for robotics.
Unlike RL, which learns policies by optimizing rewards (or costs), IL (and inverse RL [41,42,
43]) learns by mimicking expert demonstrations and typically requires no reward function. IL has
been studied in both the offline setting [44,45], where the agent learns from a fixed set of expert
demonstrations, and the online setting [46,47], where the agent can perform additional environment
interactions. A combination of RL and IL has also been explored in prior work [48,49]. Our
offline dataset consists of trajectories from a heuristic hand-scripted policy collected under expert
supervision, which represents a dataset of reasonably high quality. As a result, we consider offline
IL, and behavior cloning in particular, as a baseline algorithm in our empirical evaluation.
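For reference, the behavior cloning baseline reduces to supervised regression of actions on states over the offline dataset; a minimal sketch with an assumed two-layer MLP policy (not our exact architecture or hyperparameters) is:

# Minimal behavior cloning sketch: regress logged actions on logged states.
# The MLP architecture and hyperparameters here are assumptions.
import torch
import torch.nn as nn

def behavior_cloning(states: torch.Tensor, actions: torch.Tensor,
                     epochs: int = 100, lr: float = 1e-3) -> nn.Module:
    policy = nn.Sequential(
        nn.Linear(states.shape[1], 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, actions.shape[1]),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((policy(states) - actions) ** 2).mean()      # mean-squared-error imitation loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy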
3 Experiment Scope and Setup
To investigate the effectiveness of ORL algorithms on real-world robot learning tasks, we adhere
to a few guiding principles: (1) we make design choices representative of the wider community to the
extent possible; (2) we strive to be fair to all baselines by giving them their best chance and working
in consultation with their authors; and (3) we prioritize reproducibility and data sharing. We will
open-source our data and camera images, along with our training and evaluation codebase.
Hardware Setup. Hardware plays a seminal role in robotic capability. For reproducibility and
extensibility, we selected a hardware platform that is well-established, non-custom, and commonly
used in the field. After an exhaustive literature survey [50,51,52,53,54,55], we converged on a
table-top manipulation setup, shown in Figure 3. It consists of a table-mounted Franka Panda arm
that uses a RobotiQ parallel gripper as its end effector, accompanied by two Intel RealSense 435 RGBD
cameras. Our robot has 8 DOF, uses factory-supplied default controller gains, accepts position
commands at 15 Hz, and runs a low-level joint position controller at 1000 Hz. To perceive the object
to interact with, we extract the positions of the AprilTags attached to the object from RGB images. Our
robot state consists of joint positions, joint velocities, and the position of the object to interact with (if
applicable). Our policies compute actions (desired joint pose) from robot proprioception, tracked
object locations, and the desired goal location.
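To make the state and action conventions above concrete, the sketch below assembles the observation vector and streams joint-position commands at 15 Hz; get_joint_state, get_tag_position, and send_joint_position_command are hypothetical placeholders for the robot and camera interfaces, not an actual API.

# Hypothetical 15 Hz control loop matching the state/action conventions above.
# The three hardware helpers are placeholders; the 1 kHz joint controller runs
# on the robot side and tracks whatever desired joint pose we send.
import time
import numpy as np

CONTROL_HZ = 15

def control_loop(policy, goal_position, get_joint_state, get_tag_position,
                 send_joint_position_command, duration_s=10.0):
    period = 1.0 / CONTROL_HZ
    for _ in range(int(duration_s * CONTROL_HZ)):
        t_start = time.time()
        joint_pos, joint_vel = get_joint_state()             # robot proprioception
        object_pos = get_tag_position()                      # AprilTag-tracked object position
        obs = np.concatenate([joint_pos, joint_vel, object_pos, goal_position])
        desired_joint_pose = policy(obs)                     # action = desired joint pose
        send_joint_position_command(desired_joint_pose)
        time.sleep(max(0.0, period - (time.time() - t_start)))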
Figure 3: Our setup consists of a commonly used Franka arm, a RobotiQ parallel gripper, and two Intel RealSense 435 cameras.
Canonical Tasks. We consider four classic manipulation tasks common in the literature: reach, slide,
lift, and pick-n-place (PnP) (see Figure 2). reach requires the robot to move from a randomly
sampled configuration in the workspace to another configuration. The other three tasks involve a
heavy glass lid with a handle, which is initialized randomly on the table. slide requires the robot
to hold and move the lid along the table to a specified goal location. lift requires the robot to
grasp and lift the lid 10 cm off the table. PnP requires the robot to grasp, lift, move, and place the
lid at a designated goal position, i.e., the chopping board. The four tasks constitute a representative
range of common table-top manipulation challenges: reach focuses on free movement, while the
other three tasks involve intermittent interaction dynamics between the table, the lid, and the parallel
gripper. We model each canonical task as an MDP with a unique reward function. Details on our
tasks are in Appendix 8.1.
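As one example of what a task-specific reward can look like (the actual definitions are in Appendix 8.1; the shaping and success radius below are hypothetical):

# Hypothetical shaped reward for reach: negative end-effector-to-goal distance
# plus a bonus inside a success radius. The rewards actually used in the paper
# are defined in Appendix 8.1.
import numpy as np

def reach_reward(ee_pos, goal_pos, success_radius=0.05):
    dist = float(np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_pos)))
    return -dist + (1.0 if dist < success_radius else 0.0)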
Data Collection. We use a hand-designed, scripted policy developed under expert supervision to
collect (dominantly) successful trajectories for all our canonical tasks. To highlight ORL algorithms'
ability to overcome suboptimal datasets, previous works [22,34,39] have crippled expert policies
with noise, used half-trained RL policies, or collected human demonstrations of varying quality,
showcasing the performance gain over such compromised datasets. We posit that such data sources are not
representative of robotics domains, where noisy or random behaviors are unsafe and detrimental to
the hardware's stability. Instead of infusing noise or failure data points to serve as negative examples, we
believe that mixing data collected from various tasks offers a more realistic setting in which to apply
ORL on real robots, for three reasons: (1) collecting such “random/roaming/explorative” data on
a real robot autonomously would require comprehensive safety constraints, expert supervision, and
oversight; (2) engaging experts to record such random data in large quantities makes less sense than
utilizing them to collect meaningful trajectories on a real task; and (3) designing task-specific
strategies and stress-testing ORL's ability against such a strong dataset is more viable than using a
compromised dataset. In Real-ORL, we collected the offline dataset using heuristic strategies designed
with reasonable effort and, to avoid biases favoring any task or algorithm, froze the dataset ahead of time.
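One simple way to realize such mixed, multi-task datasets is to pool the per-task trajectory sets and recompute rewards with the target task's (known) reward function; the sketch below is illustrative (it reuses the hypothetical Trajectory container from Section 2) and is not necessarily our exact data pipeline.

# Illustrative pooling of per-task trajectory sets into one heterogeneous
# offline dataset for a target task, relabeling rewards with that task's known
# reward function. This mirrors the ORL assumption of reward access; it is not
# necessarily the paper's exact recipe.
import numpy as np

def build_offline_dataset(task_datasets, target_reward_fn):
    pooled = []
    for trajectories in task_datasets.values():              # e.g., {"reach": [...], "slide": [...]}
        for traj in trajectories:
            rewards = np.array([target_reward_fn(s, a)
                                for s, a in zip(traj.states[:-1], traj.actions)])
            pooled.append(Trajectory(traj.states, traj.actions, rewards))
    return pooled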
To create scripted policies for all tasks, we first decompose each task into simpler stages marked
by end-effector sub-goals. We leverage MuJoCo's IK solver to map these sub-goals into joint space.
The scripted policy takes tiny steps toward sub-goals until task-specific criteria are met. Our
heuristic policies did not reach the theoretical maximum scores due to controller noise.
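A simplified rendition of this scripted policy is sketched below; the sub-goal list, IK call, and stopping criterion are placeholders rather than the exact implementation.

# Simplified sketch of the scripted data-collection policy: each task stage is
# an end-effector sub-goal, an IK solver (placeholder for MuJoCo's) maps it to
# joint space, and the robot takes tiny steps toward it until a task-specific
# criterion is met. All helpers are hypothetical placeholders.
import numpy as np

def run_scripted_policy(subgoals, solve_ik, get_joint_positions,
                        send_joint_position_command, criterion_met,
                        step_size=0.02, max_steps=500):
    for ee_subgoal in subgoals:                              # e.g., pre-grasp, grasp, lift poses
        target_joints = np.asarray(solve_ik(ee_subgoal))     # map the sub-goal into joint space
        for _ in range(max_steps):                           # guard against unreachable criteria
            if criterion_met(ee_subgoal):
                break
            current = np.asarray(get_joint_positions())
            step = np.clip(target_joints - current, -step_size, step_size)
            send_joint_position_command(current + step)      # tiny step toward the sub-goal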