difficult to adapt to various tasks. Object colors and geometric
details are not included in the model, limiting its representation
capability [23].
By integrating differentiable physics-based simulation and
rendering, we propose a sensing-aware model-based reinforce-
ment learning system called SAM-RL. As shown in Fig. 1, we
apply SAM-RL on a robot system with two 7-DoF robotic
arms (Flexiv Rizon [25] and Franka Emika Panda), where the former carries an RGB-D camera and the latter performs the manipulation tasks. Our framework is sensing-aware, which
allows the robot to automatically select an informative camera
view to effectively monitor the manipulation process, provid-
ing the following benefits. First, the system no longer requires
obtaining a sequence of camera poses at each step, which
is extremely time-consuming. Second, compared with using
a fixed view, SAM-RL leverages varying camera views with
potentially fewer occlusions and offers better estimations of
environment states and object status (especially for deformable
bodies). The improved object-state estimation leads to more effective robotic actions for completing various tasks. Third, by comparing rendered and measured (i.e., real-world) images, discrepancies between simulation and reality are better revealed and then reduced automatically through gradient-based optimization and differentiable rendering.
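To make this concrete, the following is a minimal, self-contained sketch of the real-to-sim correction step described above. It is not the SAM-RL implementation: the differentiable renderer is replaced by a toy stand-in (toy_render) that draws a Gaussian blob at a 2-D object position, and the names used here (toy_render, est_xy, captured) are illustrative placeholders; SAM-RL instead differentiates through a full physics-based simulator and renderer.

import torch

def toy_render(obj_xy, size=64, sigma=5.0):
    """Toy differentiable 'renderer': a Gaussian blob centered at obj_xy (pixels)."""
    ys, xs = torch.meshgrid(
        torch.arange(size, dtype=torch.float32),
        torch.arange(size, dtype=torch.float32),
        indexing="ij",
    )
    d2 = (xs - obj_xy[0]) ** 2 + (ys - obj_xy[1]) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

# "Captured" image: the object is actually at (40, 25); the simulator's
# current estimate starts at (30, 20) and is corrected by gradient descent.
with torch.no_grad():
    captured = toy_render(torch.tensor([40.0, 25.0]))

est_xy = torch.tensor([30.0, 20.0], requires_grad=True)
opt = torch.optim.Adam([est_xy], lr=0.5)

for step in range(300):
    opt.zero_grad()
    rendered = toy_render(est_xy)
    loss = torch.mean((rendered - captured) ** 2)  # pixel-wise discrepancy
    loss.backward()                                # gradients flow through the renderer
    opt.step()

print(est_xy.detach())  # approaches the true position (40, 25)

The point of the sketch is only that the pixel-wise discrepancy between rendered and captured images is differentiable with respect to the simulated object state, so the state can be corrected by gradient-based optimization.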
In practice, we train the robot to learn three challeng-
ing manipulation skills: Peg-Insertion, Spatula-Flipping, and
Needle-Threading. Our experiments indicate that SAM-RL can
significantly reduce training time and improve success rates by
large margins compared to common model-free and model-
based deep reinforcement learning algorithms.
Our primary contributions include:
• proposing an active-sensing framework named SAM-RL that enables robots to select informative views for various manipulation tasks;
• introducing a model-based reinforcement learning algorithm to produce efficient policies;
• conducting extensive quantitative and qualitative evaluations to demonstrate the effectiveness of our approach;
• applying our framework to robotic assembly, tool manipulation, and deformable-object manipulation tasks in both simulation and real-world experiments.
II. RELATED WORK
We review related literature on key components in our
approach, including model-based reinforcement learning, next
best view, integration of differentiable physics-based simula-
tion and rendering, and robotic manipulation. We also describe how our approach differs from previous work.
A. Model-based Reinforcement Learning
MBRL is considered potentially more sample-efficient than model-free RL [4]. However, automatically and efficiently building an accurate model from raw sensory data remains challenging, which has prevented MBRL from being widely applied in the real world. For a broader review of MBRL, we refer readers to [26]. One line of work [5, 6, 7, 8] uses representation learning to obtain low-dimensional latent state and action representations from high-dimensional input data. However, the learned models may violate the underlying physical dynamics, and their quality can degrade significantly outside the training data distribution. Recently, Lv et al. [23] leveraged differentiable physics simulation and developed
a system to produce a URDF file to model the surrounding
environment based on an RGB-D camera. However, the RGB-
D camera poses used in [23] are predefined and cannot adapt
to different tasks. Our approach allows the robot to select the
most informative camera view to monitor the manipulation
process and update the environment model automatically.
B. Next Best View in Active Sensing
Next Best View (NBV) has been one of the core problems
in active sensing. It addresses how to obtain a sequence of sensor poses that increases information gain. The
information gain is explicitly defined to reflect the improved
perception for 3D reconstruction [27, 28, 29, 30], object
recognition [31, 32, 33, 34], 3D model completion [35], and
3D exploration [36, 37]. Unlike perception-related tasks, we
explore the NBV over a wide range of robotic manipulation
tasks. Information gain in robotic manipulation tasks is difficult to define explicitly and is only implicitly related to task performance. Moreover, in our system the environment changes after each robot interaction. We therefore integrate the information gain into the Q function so that it reflects how informative a viewpoint is for manipulation.
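As a minimal illustration of this idea (with hypothetical interfaces and dimensions, not the paper's exact formulation), the sketch below scores candidate camera poses with a learned Q network conditioned on the current latent state and selects the highest-scoring pose, so that how informative a view is gets measured by predicted task value rather than by an explicitly defined information gain.

import torch
import torch.nn as nn

class ViewQNet(nn.Module):
    """Q(state, view): predicted return of continuing the task after observing
    from `view` (a 6-D camera pose) given the latent environment state."""
    def __init__(self, state_dim=32, view_dim=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + view_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, view):
        return self.mlp(torch.cat([state, view], dim=-1)).squeeze(-1)

def select_next_view(q_net, state, candidate_views):
    """Pick the candidate camera pose with the highest predicted Q value."""
    with torch.no_grad():
        scores = q_net(state.expand(candidate_views.shape[0], -1), candidate_views)
    return candidate_views[torch.argmax(scores)]

# Usage with random placeholders for the latent state and the view candidates.
q_net = ViewQNet()
state = torch.randn(1, 32)        # latent state from the learned model
candidates = torch.randn(16, 6)   # sampled reachable camera poses
best_view = select_next_view(q_net, state, candidates)

Coupling view selection to the Q function in this way ties sensing directly to task performance instead of to a hand-crafted perception objective.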
C. Integration of Differentiable Physics-Based Simulation and
Rendering
Recently, great progress has been made in the field of
differentiable physics-based simulation and rendering. For a
broader review, please refer to [9, 10, 11, 12, 13, 14, 15, 16]
and [38, 39]. With the development of these techniques,
Jatavallabhula et al. [21] first proposed a pipeline to leverage
differentiable simulation and rendering for system identifi-
cation and visuomotor control. Ma et al. [22] introduced a
rendering-invariant state predictor network that maps images
into states that are agnostic to rendering parameters. By
comparing the state predictions obtained using rendered and
ground-truth images, the pipeline can backpropagate the gradi-
ent to update system parameters and actions. Sundaresan et al.
[40] proposed a real-to-sim parameter estimation approach
from point clouds for deformable objects. In contrast to these works, we use differentiable simulation and rendering to
find the next best view for various manipulation tasks and
update the object status in the model by comparing rendered
and captured images.
D. Manipulation
Our framework can be applied to improve performance on a range of manipulation tasks. We review the related work in these domains. 1) Peg-insertion. Peg insertion is a classic
robotic assembly task with rich literature [41, 42, 43]. For a