
Some works focus primarily on manipulation. In [19], the authors investigate a bi-manual setup with a suction gripper in a setting similar to ours and train a fully connected network to predict which object to support with the non-suction gripper for safe extraction of the selected object. The network is trained in an environment similar to ours, i.e. a bookshelf. Other works address book grasping outside of the library environment. One example is [20], which combines suction with a two-finger gripper to grasp books in different configurations. To our knowledge, there is no readily available solution that automates book manipulation in a library environment similar to ours.
B. Perception (Shang-Ching Liu / Fabian)
The perception-related tasks aim to model the books currently present in the scene and, furthermore, to match each individual book against a known book database. We therefore review previous work for each of these parts.
For detection, YOLO [27] and Fast R-CNN [9] represent two main directions in the state of the art: YOLO is more efficient and outputs bounding boxes, while Fast R-CNN yields more precise detection results. A later evolution of YOLO, YOLOv5 [14], offers good documentation, a robust pipeline, and utilities such as Roboflow [28] for fine-tuning, which is why we choose it as our approach for book spine detection.
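As an illustrative sketch, a fine-tuned YOLOv5 model can be queried through the PyTorch Hub interface as shown below; the weight file name and confidence threshold are assumptions for the example, not our exact configuration:

    import torch

    # Load a YOLOv5 model fine-tuned on book spine images (hypothetical weight file).
    model = torch.hub.load("ultralytics/yolov5", "custom", path="book_spines.pt")
    model.conf = 0.4  # assumed confidence threshold for spine candidates

    # Run detection on a single shelf image.
    results = model("shelf_view.jpg")

    # Each detection row: x_min, y_min, x_max, y_max, confidence, class index.
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        print(f"spine candidate at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), conf={conf:.2f}")

The resulting bounding boxes serve as spine candidates that are passed on to the matching stage.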
For book matching, SIFT [23] can be used to find keypoints in the image, an HSV (hue, saturation, value) [36] histogram captures the color distribution, and fuzzywuzzy measures the text similarity between the detected text and the book titles in the database.
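A minimal sketch of how these three cues could be computed for a single detected spine crop is given below, assuming OpenCV and the fuzzywuzzy package; the function names and histogram bin sizes are illustrative, not our exact matching scheme:

    import cv2
    from fuzzywuzzy import fuzz

    def spine_features(bgr_crop):
        """Compute SIFT keypoints/descriptors and an HSV color histogram for a spine crop."""
        gray = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2GRAY)
        keypoints, descriptors = cv2.SIFT_create().detectAndCompute(gray, None)

        hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        return keypoints, descriptors, hist

    def title_similarity(detected_text, database_title):
        """Fuzzy text similarity (0-100) between recognized text and a known book title."""
        return fuzz.token_set_ratio(detected_text, database_title)

The SIFT descriptors and color histograms can then be compared against database entries, while the fuzzy score ranks candidate titles.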
Inventory management in a library is a tedious task that researchers have tried to automate over the past decade. Book spines standing on a shelf have been detected and recognized using different computer vision methods without the aid of special markers. A frequently used approach to detect book spines is edge detection followed by further line segment processing [3], [6], [22], [34]. Often, an orthographic representation of the book spines is required. To detect the spines independently of the viewpoint, Talker et al. used a constrained active contour model allowing the spines to be non-parallel to the image axes [35].
The detection part is crucial for finding book spine candidates, but recognizing them correctly plays an equally important role in inventory management. While many approaches focus on text recognition to identify the book spines [3], [6], [22], [34], Fowers et al. made use of a difference of Gaussians (DoG) over the YCbCr color space to extract features [7]. In combination with SIFT, this approach does not depend on an OCR engine (e.g. Tesseract [31]) while yielding robust performance.
Comparing the results of the mentioned works meaningfully is hard, as no standardized benchmark exists in the field of book spine recognition. However, in the domain of scene text recognition (STR), which can be utilized for text-based book spine recognition, Baek et al. developed a framework that allows the comparison of different model architectures [1]. Since deep learning methods have generally not been widely applied to book spine recognition, this work tries to incorporate such an STR model to perform text matching.
III. SYSTEM OVERVIEW (SHANG-CHING LIU)
The overall system can be separated into three parts: manipulation, the vision pipeline, and task planning, as shown in figure 1. The task planning module controls both the vision pipeline module and the manipulation module. The vision pipeline takes the scene from the RGBD camera (Azure Kinect) as input, matches the books in the scene to the book database, and finally creates a MoveIt planning scene for visualization. The manipulation part has a controller for the two arms of the robot (PR-2): one is equipped with a Shadow hand and the other with a two-finger gripper, as shown in figure 2j.
Fig. 1: System Overview
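To make the interface between the vision pipeline and manipulation concrete, the sketch below shows how a single detected book might be added to the MoveIt planning scene as a box-shaped collision object; the node name, frame, pose, and book dimensions are placeholder assumptions:

    import sys
    import rospy
    import moveit_commander
    from geometry_msgs.msg import PoseStamped

    # Standard MoveIt / ROS initialization boilerplate.
    moveit_commander.roscpp_initialize(sys.argv)
    rospy.init_node("book_scene_demo")
    scene = moveit_commander.PlanningSceneInterface()
    rospy.sleep(1.0)  # give the scene interface time to connect

    # Pose of one detected book in the robot's base frame (frame name and values are assumed).
    book_pose = PoseStamped()
    book_pose.header.frame_id = "base_link"
    book_pose.pose.position.x = 0.8
    book_pose.pose.position.y = 0.1
    book_pose.pose.position.z = 1.0
    book_pose.pose.orientation.w = 1.0

    # Insert the book as a box collision object (width x depth x height in meters, assumed size).
    scene.add_box("book_0", book_pose, size=(0.04, 0.15, 0.22))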
IV. PERCEPTION (FABIAN / SHANG-CHING LIU)
A. Preprocessing (Fabian)
The camera of the PR-2 is located on top of its head. This creates a perspective projection of the shelf (fig. 3, left; fig. 5, right), making book spine detection more challenging since the edges tend not to be aligned with the image axes. To automatically mitigate this problem without rearranging the hardware, a perspective transformation is applied to the image twice, as shown in fig. 3. For each shelf level, the corners (red/blue dots) are determined using the AprilTag's known pose along with offsets matching the shelf's dimensions. The 3D points are then projected onto the image and used as anchor points for the transformation. The result is two images in which the book spine edges are aligned with the image axes (fig. 3, right).
To project points from the corrected images back to the original image, the inverse of the transformation matrix is also computed and will be used later on.
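A minimal sketch of this rectification step, assuming the four projected shelf corners are already available in pixel coordinates, could look as follows; the corner ordering and output size are placeholder assumptions:

    import cv2
    import numpy as np

    def rectify_shelf_level(image, corners_px, out_size=(1200, 300)):
        """Warp one shelf level so that the book spine edges align with the image axes.

        corners_px: four projected shelf corners in pixel coordinates,
        ordered top-left, top-right, bottom-right, bottom-left (assumed order).
        """
        w, h = out_size
        src = np.float32(corners_px)
        dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

        # Forward transform: original image -> axis-aligned shelf-level image.
        M = cv2.getPerspectiveTransform(src, dst)
        rectified = cv2.warpPerspective(image, M, (w, h))

        # Inverse transform: map detections in the rectified image back to the original.
        M_inv = np.linalg.inv(M)
        return rectified, M_inv

Detections obtained in the rectified image can then be mapped back to the original view, e.g. with cv2.perspectiveTransform and M_inv.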