plex relations between objects in natural language, grounding referring expressions of objects, and generating placements that satisfy the relational constraints in the instructions.
Parsing-based methods for robot instruction following [3], [4], [6], [10], [11], [12], [13], [14] parse natural language into formal representations using hand-crafted rules and grammatical structures. These hand-crafted rules generalize well but are not robust to noisy language [15]. Among these studies, those focusing on placing [3], [4], [6] lack a decomposition mechanism for compositional instructions and assume perfect object grounding without referential ambiguity. Recently, [7], [8], [9] used sentence embeddings to learn language-conditioned policies for robot instruction following, but these approaches are not data-efficient and generalize poorly to unseen compositional instructions. We follow [16] and integrate parsing-based modules into a data-driven framework, which is robust, data-efficient, and generalizable for learning compositional instructions. We also use probabilistic techniques to handle the uncertainty of ambiguous instructions.
PARAGON has a GNN for relational reasoning and placement generation, which encodes a mean-field inference algorithm similar to [17]. Moreover, our GNN passes messages as sets of particles to capture complex, multimodal distributions. Approximating a distribution as a set of particles [18] provides strong expressiveness for such distributions and has been used in robot perception [19], [20], [21], recurrent neural networks [22], and graphical models [23], [24]. Our approach employs this idea in a GNN for particle-based message passing.
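To make the particle idea concrete, the following minimal NumPy sketch (illustrative only, not PARAGON's implementation) represents a bimodal 2D placement distribution as a set of weighted particles, estimates its mean, and resamples the particle set in proportion to the weights, as one might do when refreshing particles during inference.

```python
import numpy as np

def particle_mean(particles, weights):
    """Expectation of a distribution represented by weighted particles."""
    weights = weights / weights.sum()              # normalize the weights
    return (particles * weights[:, None]).sum(axis=0)

def resample(particles, weights, rng):
    """Multinomial resampling: draw particles in proportion to their weights."""
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

rng = np.random.default_rng(0)
# 100 particles approximating a bimodal 2D placement distribution on the table plane.
particles = np.concatenate([rng.normal([0.2, 0.5], 0.02, (50, 2)),
                            rng.normal([0.8, 0.5], 0.02, (50, 2))])
weights = np.ones(100)
print(particle_mean(particles, weights))           # roughly midway between the two modes
```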
III. OVERVIEW
We focus on grounding human language for tabletop object-placing tasks. In this task, scenes are composed of a finite set of 3D objects on a 2D tabletop. A human gives a natural language instruction $\ell \in \mathcal{L}$ to guide the robot to pick an object and put it at the desired position $x^*_{\mathrm{tgt}}$. The instruction is denoted as a sequence $\ell = \{\omega_l\}_{1 \leq l \leq L}$, where $\omega_l$ is a word; e.g., the instruction in Fig. 6 is $\{\omega_1 = \text{put}, \omega_2 = \text{a}, \dots\}$. An instruction should contain a target object expression (e.g., “a plate”) to specify the object to pick and express at least one spatial relation (e.g., “next to a silver mug”) describing the placement. The robot needs to infer the target object's placement distribution $p(x^*_{\mathrm{tgt}} \mid \ell, o)$ conditioned on the language instruction $\ell$ and the visual observation $o$.
The core idea of our proposed solution, PARAGON, is to leverage object-centric relations from linguistic and visual inputs to perform relational reasoning and placement generation, and to encode these procedures in neural networks for end-to-end training. The pipeline of PARAGON is shown in Fig. 2. PARAGON first uses a soft parsing module to convert the language input “softly” into a set of relations represented as triplets. A grounding module then aligns the objects mentioned in the triplets with objects in the visual scene. The triplets form a graph with the objects as nodes and the relations as edges. The resulting graph is fed into a GNN for relational reasoning and placement generation. The GNN encodes a mean-field inference algorithm for a conditional random field that captures the spatial relations in the triplets. PARAGON is trained end-to-end to achieve the best overall object-placing performance without annotated parsing or object-grounding labels.
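The sketch below shows one way these three stages could be composed in code; the names and signatures (paragon_forward, soft_parser, grounder, relation_gnn) are hypothetical placeholders for illustration, not PARAGON's actual interfaces.

```python
from typing import Any, Callable, List

def paragon_forward(instruction: str,
                    observation: Any,
                    soft_parser: Callable[[str], List[Any]],
                    grounder: Callable[[List[Any], Any], Any],
                    relation_gnn: Callable[[Any], Any]) -> Any:
    """Hypothetical end-to-end pass: soft parsing -> grounding -> relational reasoning."""
    triplets = soft_parser(instruction)        # language -> "soft" relational triplets
    graph = grounder(triplets, observation)    # align mentioned objects with scene objects
    placements = relation_gnn(graph)           # mean-field inference with particle messages
    return placements                          # approximates p(x*_tgt | instruction, observation)
```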
IV. SOFT PARSING
The soft parsing module extracts spatial relations from complex instructions for accurate placement generation. Its pipeline is shown in Fig. 4. Dependency trees capture the relations between words in natural language, which implicitly indicate the relations between the semantics those words express [25], [26]. Thus, we use a data-driven approach to exploit the underlying semantic relations in the dependency tree and extract relations represented as relational triplets. The module takes the linguistic input and outputs relational triplets whose components are represented as embeddings.
A. Preliminaries
Triplets: A triplet consists of two entities and their relation, representing a binary relation. Triplets provide a formal representation of knowledge expressed in natural language and are widely applied in scene graph parsing [25], relation extraction [27], and knowledge graphs [28]. The underlying assumption of representing natural language as triplets is that natural language rarely involves higher-order relations, as humans mostly use binary relations [29]. For spatial relations, two triplets can represent a ternary relation (e.g., “between A and B” can often be expressed as “to the right of A and to the left of B”). As such, it is sufficient to represent instructions as triplets for common object-placing purposes.
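As an illustration of the binary-relation assumption, the following sketch (hypothetical, not the paper's parser output) shows one possible decomposition of a ternary “between” relation into two binary triplets.

```python
from typing import List, NamedTuple

class Triplet(NamedTuple):
    subj: str  # entity to be placed
    rel: str   # spatial relation
    obj: str   # reference entity

# "Put the plate next to the silver mug." -> a single binary triplet.
simple: List[Triplet] = [Triplet("plate", "next to", "silver mug")]

# "Put the plate between the mug and the bowl." -> the ternary "between"
# relation decomposed into two binary triplets (one possible reading).
between: List[Triplet] = [
    Triplet("plate", "right of", "mug"),
    Triplet("plate", "left of", "bowl"),
]
```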
Dependency Tree: A dependency tree (shown in Fig. 3) is a universal structure that captures the grammatical relationships between the words of a sentence [30]. It uses part-of-speech tags to mark each word and dependency tags to mark the relations between pairs of words. A part-of-speech tag [31] categorizes a word according to its part of speech, based on the word's definition and context; e.g., in Fig. 3, “cup” is a Noun. Dependency tags mark the grammatical relation between two words, represented as Universal Dependency Relations [32]; e.g., in Fig. 3, the Noun “cup” is the “direct object (Dobj)” of the Verb “place”. These relations are universal. A well-formed dependency tree relies on grammatically correct instructions, whereas noisy language may yield imperfect dependency trees. Thus, we use a data-driven method to adapt to imperfect dependency trees.
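For reference, the snippet below uses spaCy, one widely available off-the-shelf dependency parser (the paper does not state which parser it uses), to inspect the part-of-speech and dependency tags discussed above.

```python
import spacy

# Requires a downloaded English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Place the cup next to the plate.")

for token in doc:
    # token.pos_ : part-of-speech tag, e.g. "cup" -> NOUN
    # token.dep_ : dependency tag, e.g. "cup" -> dobj (direct object of "Place")
    # token.head : the word this token grammatically depends on
    print(f"{token.text:>6}  pos={token.pos_:<5}  dep={token.dep_:<6}  head={token.head.text}")
```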
B. Method
To make parsing differentiable, the soft parsing module “softens” a triplet into attention over the words of the linguistic input, $\{a_{\gamma,l}\}_{1 \leq l \leq L}$ with $\sum_l a_{\gamma,l} = 1$ and $\gamma \in \{\mathrm{subj}, \mathrm{obj}, \mathrm{rel}\}$. We compute the embedding of each triplet component as the attention-weighted sum of the word embeddings, $\phi_\gamma = \sum_l a_{\gamma,l} f_{\mathrm{CLIP}}(\omega_l)$, where the word embeddings are computed by the pre-trained CLIP [33] encoder $f_{\mathrm{CLIP}}$. As the dependency tree examines the relations between words and could thus implicitly encode relational information, we use a GNN [34] to operate