
Learning the Sequence of Packing Irregular Objects from Human
Demonstrations: Towards Autonomous Packing Robots
André Santos1, Nuno Ferreira Duarte1, Atabak Dehban1 and José Santos-Victor1
Abstract— We tackle the challenge of robotic bin packing with irregular objects, such as groceries. Given the diverse physical attributes of these objects and the complex constraints governing their placement and manipulation, pre-programmed strategies become unfeasible. Our approach is to learn directly from expert demonstrations in order to extract implicit task knowledge and strategies that ensure safe object positioning, efficient use of space, and the generation of human-like behaviors that enhance human-robot trust. We rely on human demonstrations to learn a Markov chain for predicting the object packing sequence for a given set of items, and then compare it with human performance. Our experimental results show that the model outperforms humans, generating sequence predictions that observers classify as human-like more frequently than human-generated sequences. The human demonstrations were collected with our proposed VR platform, “BoxED”, a box packaging environment that simulates real-world objects and scenarios for fast and streamlined data collection aimed at teaching robots. We collected data from 43 participants packing a total of 263 boxes with supermarket-like objects, yielding 4644 object manipulations. Our VR platform can be easily adapted to new scenarios and objects, and is publicly available, alongside our dataset, at https://github.com/andrejfsantos4/BoxED.
I. INTRODUCTION
The ceaseless digitalization of the modern world has steadily transformed many aspects of our daily lives, including the grocery shopping experience. In recent years, we have observed a pronounced migration from physical retail spaces to the online realm, with companies such as Ocado [1], a dedicated online grocery retailer, reporting 2.5 billion GBP of revenue in 2022. This trend brings with it the intricate logistical challenge of efficiently packing all orders into shipping containers.
Simultaneously, robots have been steadily integrating into both our workplaces and homes. The so-called “cobots” [2] have attracted considerable research and development attention, as illustrated by the introduction of Amazon's new Astro robot [3] and the Temi robot [4]. Besides the critical need to ensure safe operation while interacting with humans, robots should also behave in as human-like a manner as possible, so that their actions can be understood and anticipated by their interaction partners, providing users with an improved interaction experience.
*This work was supported by the Fundação para a Ciência e a Tecnologia (FCT) through the ISR/LARSyS Associated Laboratory UID/EEA/50009/2020, LA/P/0083/2020.
1All the authors are affiliated with the Institute for Systems and Robotics, Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal (e-mail: andrejfsantos@tecnico.ulisboa.pt, {nferreiraduarte, adehban, jasv}@isr.tecnico.ulisboa.pt).
Fig. 1. The demonstrator performs the box packing task in virtual reality.
The task that motivated our work lies at the intersection of these two rapidly expanding fields: a robot that packs groceries alongside, and learns from, a human collaborator. Several hardware solutions have been developed specifically to enhance human-robot collaboration, featuring special adaptations such as flexible joints. However, there is very little literature on how to pack heterogeneous objects such as those found in a supermarket, and even fewer publicly available datasets related to this task. Existing methods that address packing objects inside a container mostly consider objects with simple shapes (such as cuboids and spheres) and often deploy search-based algorithms to determine the best placement pose, requiring multiple hours to find a solution [5], [6]. Methods that predict a packing sequence given a set of objects (for instance, the objects in an online order) are equally sparse [7]. Our contributions are twofold:
1) A model trained on our dataset that predicts the correct packing sequence, extracting implicit task knowledge directly from humans. The generated sequences lead to safe and efficient packing and were classified as “human-like” by users (a minimal illustrative sketch follows this list).
2) A publicly available VR platform (BoxED) for fast and streamlined data collection, see Fig. 1. BoxED can be adapted to new box packaging scenarios with any object dataset, and it collects 6-DOF pick-and-place grasp poses, object trajectories, packing sequences, and the objects' poses inside the box. In addition to BoxED, we created the first publicly available collection of human demonstrations of packing groceries into a box.
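To make contribution 1 concrete, the following is a minimal sketch (in Python) of how a first-order Markov chain over object categories could be fitted to demonstrated packing sequences and then used to greedily order a new set of items. The category names, the greedy decoding scheme, and the helper names fit_transitions and predict_sequence are illustrative assumptions for exposition only, not our exact implementation, which is detailed in Section IV.

# Illustrative sketch only: a first-order Markov chain over object
# categories, fitted to demonstrated packing sequences. Names and the
# greedy decoding are assumptions, not the paper's exact method.
from collections import defaultdict

START = "<start>"  # virtual state preceding the first packed object

def fit_transitions(sequences):
    """Count category-to-category transitions across all demonstrations."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        prev = START
        for category in seq:
            counts[prev][category] += 1
            prev = category
    # Normalize counts into transition probabilities.
    return {prev: {cat: c / sum(nxt.values()) for cat, c in nxt.items()}
            for prev, nxt in counts.items()}

def predict_sequence(transitions, items):
    """Greedily order a multiset of items by transition probability."""
    remaining = list(items)
    sequence, prev = [], START
    while remaining:
        probs = transitions.get(prev, {})
        # Pick the remaining item the chain deems most likely to come
        # next; transitions never seen in the data fall back to 0.
        nxt = max(remaining, key=lambda cat: probs.get(cat, 0.0))
        sequence.append(nxt)
        remaining.remove(nxt)
        prev = nxt
    return sequence

# Hypothetical demonstrations in which rigid items are packed first.
demos = [["bottle", "can", "bread", "chips"],
         ["can", "bottle", "chips", "bread"]]
model = fit_transitions(demos)
print(predict_sequence(model, ["chips", "bottle", "bread", "can"]))

In this scheme, each demonstration contributes transition counts between consecutively packed categories; prediction then repeatedly selects the not-yet-packed item with the highest transition probability from the previously packed one.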
Section II addresses related research, Section III describes
the dataset, Section IV introduces our packing sequence
prediction model, and Section VI presents final remarks.
II. RELATED WORK
Learning from Demonstrations (LfD). LfD consists of learning how to execute new tasks and their constraints by observing demonstrations performed by an expert.