
3 Experiment Scope and Setup
To investigate the effectiveness of ORL algorithms on real-world robot learning tasks, we adhere
to a few guiding principles: (1) we make design choices that represent the wider community to the
extent possible; (2) we strive to be fair to all baselines by giving each its best chance, working in
consultation with their authors; and (3) we prioritize reproducibility and data sharing. We will
open-source our data and camera images, along with our training and evaluation codebase.
Hardware Setup. Hardware plays a central role in robotic capability. For reproducibility and
extensibility, we selected a hardware platform that is well-established, non-custom, and commonly
used in the field. After an exhaustive literature survey [50,51,52,53,54,55], we converged on a
table-top manipulation setup, shown in Figure 3. It consists of a table-mounted Franka Panda arm
with a Robotiq parallel gripper as its end effector, accompanied by two Intel RealSense D435 RGBD
cameras. Our robot has 8 DOF, uses factory-supplied default controller gains, accepts position
commands at 15 Hz, and runs a low-level joint position controller at 1000 Hz. To perceive the object
to interact with, we extract the positions of AprilTags attached to the object from the RGB images.
Our robot state consists of joint positions, joint velocities, and the positions of the object to interact
with (if applicable). Our policies compute actions (desired joint poses) from robot proprioception,
tracked object locations, and the desired goal location.
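To make the observation pipeline concrete, the snippet below is a minimal sketch of how tag-based object positions could be combined with proprioception and the goal into the policy input. The pupil_apriltags package and all function and field names here are our illustrative assumptions, not the exact implementation.

```python
import numpy as np
from pupil_apriltags import Detector  # assumed AprilTag library; any detector works

detector = Detector(families="tag36h11")

def object_position(gray_img, camera_params, tag_size):
    """Return the 3D position (camera frame) of the first detected AprilTag."""
    tags = detector.detect(gray_img, estimate_tag_pose=True,
                           camera_params=camera_params,  # (fx, fy, cx, cy)
                           tag_size=tag_size)
    return tags[0].pose_t.ravel() if tags else np.zeros(3)

def build_observation(joint_pos, joint_vel, obj_pos, goal_pos):
    """Concatenate proprioception, tracked object position, and the goal."""
    return np.concatenate([joint_pos, joint_vel, obj_pos, goal_pos])

# The policy maps this observation to a desired joint pose (the action):
# action = policy(build_observation(qpos, qvel, obj_pos, goal_pos))
```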
Figure 3: Our setup consists of a commonly used Franka arm, a Robotiq parallel gripper, and two Intel RealSense D435 cameras.
Canonical Tasks. We consider four classic manipulation tasks common in the literature: reach, slide,
lift, and pick-n-place (PnP) (see Figure 2). reach requires the robot to move from a randomly
sampled configuration in the workspace to another configuration. The other three tasks involve a heavy
glass lid with a handle, which is initialized randomly on the table. slide requires the robot to hold and
move the lid along the table to a specified goal location. lift requires the robot to grasp and lift the
lid 10 cm off the table. PnP requires the robot to grasp, lift, move, and place the lid at a designated
goal position, i.e., the chopping board. The four tasks constitute a representative range of common
table-top manipulation challenges: reach focuses on free movement, while the other three tasks involve
intermittent interaction dynamics between the table, the lid, and the parallel gripper. We model each
canonical task as an MDP with a unique reward function. Details on our tasks are in Appendix 8.1.
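As an illustration of the per-task MDPs, the sketch below shows what a shaped reward for reach and a sparse reward for lift could look like. These are examples only; the actual reward functions we use are specified in Appendix 8.1.

```python
import numpy as np

# Illustrative examples only; the actual per-task rewards are given in Appendix 8.1.
def reach_reward(ee_pos, goal_pos, tol=0.05):
    """Shaped reward for reach: negative distance to the goal plus a success bonus."""
    dist = np.linalg.norm(np.asarray(ee_pos) - np.asarray(goal_pos))
    return -dist + (1.0 if dist < tol else 0.0)

def lift_reward(lid_height, lift_target=0.10):
    """Sparse reward for lift: success once the lid is 10 cm above the table."""
    return 1.0 if lid_height >= lift_target else 0.0
```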
Data Collection. We use a hand-designed, scripted policy developed under expert supervision to
collect (dominantly) successful trajectories for all our canonical tasks. To highlight ORL algorithms'
ability to overcome suboptimal data, previous works [22,34,39] have crippled expert policies
with noise, used half-trained RL policies, or collected human demonstrations of varying quality,
demonstrating performance gains over such compromised datasets. We posit that such data sources are not
representative of robotics domains, where noisy or random behaviors are unsafe and detrimental to
hardware stability. Instead of infusing noise or failure data points to serve as negative examples, we
believe that mixing data collected from various tasks offers a more realistic setting in which to apply
ORL on real robots, for three reasons: (1) collecting such "random/roaming/explorative" data on
a real robot autonomously would require comprehensive safety constraints, expert supervision, and
oversight; (2) engaging experts to record such random data in large quantities makes less sense than
utilizing them to collect meaningful trajectories on a real task; and (3) designing task-specific
strategies and stress-testing ORL's ability against such strong datasets is more viable than using a
compromised dataset. In Real-ORL, we collected offline datasets using heuristic strategies designed
with reasonable effort and, to avoid biases favoring any task or algorithm, froze the dataset ahead of time.
To create scripted policies for all tasks, we first decompose each task into simpler stages marked
by end-effector sub-goals. We leverage MuJoCo's IK solver to map these sub-goals into joint space.
The scripted policy takes tiny steps toward each sub-goal until task-specific criteria are met.
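A minimal sketch of this scripted data-collection loop is shown below; the stage representation, step size, and the ik() helper (e.g., an IK call against the MuJoCo model) are hypothetical placeholders rather than our actual implementation.

```python
import numpy as np

# Hypothetical sketch of the scripted data-collection policy. Stage names,
# the step size, and the ik() helper are illustrative placeholders.
def run_scripted_policy(stages, ik, get_state, send_joint_command, step_size=0.02):
    """Walk through end-effector sub-goals, commanding small joint-space steps."""
    for ee_subgoal, criterion_met in stages:     # e.g. [("above_handle", is_above_handle), ...]
        q_target = ik(ee_subgoal)                # map the sub-goal into joint space
        while not criterion_met(get_state()):
            q_now = get_state()["joint_pos"]
            step = np.clip(q_target - q_now, -step_size, step_size)
            send_joint_command(q_now + step)     # tiny step toward the sub-goal
```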
Our heuristic policies did not reach the theoretical maximum possible scores due to controller noise