
if the vision pipeline detects contact between the receiver’s fingers and the object, and
the torque pipeline detects a pull, pull-up, and hold. Our initial experimental results
with human receivers are extremely positive, showing a 98% success rate in R2H tasks.
While joint force/torque sensors have been used in previous work on R2H handover
tasks [4, 5], and have been combined with a specialized simple optical sensor
designed specifically to detect object motion [6, 7]³, the key contributions of our
work are: i) we use joint torque sensor data in a novel way, i.e., we use a time series
of joint torques to detect the human receiver’s action/intention in R2H handover tasks;
ii) we use an eye-in-hand RGB-D camera to detect finger contacts with the object
in real time (30 fps); and iii) we combine i) and ii) in an algorithmic fusion approach to
make a robust RELEASE decision. Our preliminary real-world experiments with human
receivers show a 98% success rate. We also compare our method’s success rate with some
existing R2H systems [8–12] that have used success rate as an evaluation metric. Please
note that some other works evaluate R2H systems with human satisfaction surveys,
e.g., [7, 13], which is different from the success rate metric that we report.
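To make the fusion idea in contribution iii) concrete, the following is a minimal illustrative sketch, not the implementation evaluated in this paper. It assumes that the vision pipeline yields a boolean finger-contact flag per frame and that the torque pipeline yields one label per matching time window. The names `Frame`, `RECEIVER_ACTIONS`, `should_release`, and the debounce parameter `min_consecutive` are hypothetical. Treating any one of the pull, pull-up, or hold labels as sufficient is also an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical label set: receiver actions named in the text.
RECEIVER_ACTIONS = {"pull", "pull-up", "hold"}


@dataclass
class Frame:
    finger_contact: bool  # assumed output of the vision pipeline for one frame
    torque_label: str     # assumed output of the torque pipeline for the same window


def should_release(frames: Sequence[Frame], min_consecutive: int = 3) -> bool:
    """Return True once both pipelines agree for min_consecutive frames in a row.

    The consecutive-frame requirement is an assumed debounce step, added here
    only to illustrate how a single noisy detection could be filtered out.
    """
    streak = 0
    for frame in frames:
        if frame.finger_contact and frame.torque_label in RECEIVER_ACTIONS:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # any disagreement resets the agreement streak
    return False
```

In this sketch, neither modality can trigger RELEASE on its own: a contact flag with no receiver action, or a receiver action with no contact flag, resets the streak.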
The rest of this paper is organized as follows: Section 2 presents a comprehensive
literature review on vision-based and force/torque-based object handover. Section 3
presents the methodology and the algorithmic foundations of our work. Section 4 shows
the experimental results. Finally, Section 5 presents conclusions and future work.
2 Related Work
A key challenge in robot-to-human (R2H) object handover, unlike robot-to-robot (R2R)
handover, is that, as mentioned in the introduction, no real-time sensor data is
exchanged between the human receiver and the robot; the robot must rely on its own
onboard sensors. Our robot, a 7-DoF Gen 3 Kinova arm with a 3-fingered
mechanical hand (Schunk SDH), has two sensor modalities: i) joint torque sensors and
ii) an eye-in-hand RGB-D camera. We therefore focus on these two modalities
in our current work, and our literature review below covers R2H handover works
that use one or both of them to understand the intention of the human
receiver. The research community has attacked this challenge
using two main approaches: vision-based and force/torque-based. We first outline some
general vision-based approaches from the machine vision community and then present
the R2H literature.
2.1 Vision-based computations in general
In the machine vision community, vision data has been used in a variety of ways:
detecting human gaze, human body configuration, human hands, and objects.
A key requirement in R2H tasks is to accomplish this in real time. Real-time detection
of the human hand and the object is investigated in [14], [15], and [16]. Human body
tracking and the body’s pose with respect to the object are investigated in [17] and [18].
From the perception perspective, the Single Shot MultiBox Detector (SSD), a CNN-based
network architecture introduced by Liu et al. [3], is particularly appealing for
object detection in R2H tasks, and we adapt it for our application along with a bounding
box regression algorithm taken from Google’s Inception network. The SSD
network combined with bounding box regression is able to outperform Faster R-CNN
(a competing neural network-based architecture for object detection) in both accuracy
and speed, reaching 59+ fps. SSD is capable of detecting multiple objects. Unlike
R-CNN methods, it propagates the feature map in one forward pass throughout the
network. This is the main reason that SSD is able to operate in real time and handle
object overlap in the data points.
³ In fact, that work used a specialized simple optical sensor precisely because its authors
cite the unacceptable computation time that processing RGB images would require,
a problem that we solve by using the SSD network.
SSD uses a pre-trained network as a base network which is