Differentiable Parsing and Visual Grounding of Natural Language
Instructions for Object Placement
Zirui Zhao, Wee Sun Lee, and David Hsu
Abstract—We present a new method, PARsing And visual
GrOuNding (PARAGON), for grounding natural language in
object placement tasks. Natural language generally describes
objects and spatial relations with compositionality and ambi-
guity, two major obstacles to effective language grounding.
For compositionality, PARAGON parses a language instruction
into an object-centric graph representation to ground objects
individually. For ambiguity, PARAGON uses a novel particle-
based graph neural network to reason about object placements
with uncertainty. Essentially, PARAGON integrates a parsing
algorithm into a probabilistic, data-driven learning framework.
It is fully differentiable and trained end-to-end from data for
robustness against complex, ambiguous language input.
I. INTRODUCTION
Robot tasks, such as navigation, manipulation, and assembly,
often involve spatial relations among objects. To carry
out tasks instructed by humans, robots must understand
natural language instructions about objects and their spatial
relations. This work focuses specifically on the object place-
ment task with language instructions. Humans provide verbal
instructions to robots to pick up an object and place it in a
specific location. The robot must generate object placements
based on both language description and visual observation.
However, language expressions about spatial relations are
generally ambiguous and compositional, two major obstacles
to effective language grounding.
This study focuses on two types of ambiguity: positional
ambiguity and referential ambiguity. Positional
ambiguity occurs when humans describe directional relations
without specifying exact distances (e.g., “to the left side of
a plate”). Reference objects are often required to describe
spatial relations. When placing an object next to a reference
object without specifying its distances, it is hard to link the
reference expressions to the referred objects to learn visual
grounding. Referential ambiguity arises when descriptions
of objects are ambiguous, leading to a reference expression
being grounded to multiple semantically identical objects and
resulting in a multimodal distribution of correct placements.
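As a toy numerical illustration (ours, not from the paper), referential ambiguity makes the placement distribution multimodal: with two semantically identical mugs, valid placements cluster near either one, so a single-peak model would place its mode in empty space between them. The positions and noise scale below are arbitrary stand-ins.

```python
import numpy as np

# Toy illustration (not the paper's model): two semantically identical
# "silver mugs" at different positions make the correct-placement
# distribution bimodal. We represent it with samples (particles).
rng = np.random.default_rng(0)

mug_positions = np.array([[0.2, 0.5], [0.8, 0.5]])  # two identical mugs

# "next to the silver mug": sample placements near either mug.
modes = mug_positions[rng.integers(0, 2, size=1000)]
placements = modes + rng.normal(scale=0.03, size=(1000, 2))

# The sample mean falls between the mugs, where no valid placement
# exists -- a unimodal (e.g., Gaussian) model would peak there.
print(placements.mean(axis=0))
```

This is why the paper represents placements with particles rather than a single predicted point.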
The compositional structure of language-described spatial
relations stems from the visual scene’s and natural language’s
compositional nature. A complex scene includes multiple
basic objects, and to describe the desired state of a complex
scene, one can compose many simple sentences for referents
The authors are from National University of Singapore. Emails: {ziruiz,
leews, dyhsu}@comp.nus.edu.sg. This research is supported in part by
the National Research Foundation (NRF), Singapore and DSO National
Laboratories under the AI Singapore Program (AISG Award No. AISG2-RP-
2020-016) and the Agency for Science, Technology and Research, Singapore,
under the National Robotics Program (Grant No. 192 25 00054).
The data, code and appendix are available at https://bit.ly/ParaGonProj.
[Fig. 1 image: a tabletop scene with the instruction “Put a plate to the upper side of a knife and next to the silver mug.”]
Fig. 1. PARAGON takes as input a language instruction and an image of the
task environment. It outputs candidate placements for a target object. The
presence of multiple semantically identical objects (e.g., “silver mug”) and
omitted distance information cause difficulty for placement generation, and
the compositional instructions increase the data required for learning.
and their relations to form a complex language sentence (e.g.,
the instruction in Fig. 1). This characteristic increases the data
required for learning compositional language instructions.
To address the issues, we propose PARAGON, a PARsing
And visual GrOuNding method for language-conditioned ob-
ject placement. The core idea of PARAGON is to parse human
language into object-centric relations for visual grounding
and placement generation, and encode those procedures in
neural networks for end-to-end training. The parsing module
of PARAGON decomposes compositional instructions into
object-centric relations, enabling the grounding of objects
separately without compositionality issues. The relations can
be composed and encoded in graph neural networks (GNN)
for placement generation. This GNN uses particle-based
message-passing to model the uncertainty caused by am-
biguous instructions. All the modules are encoded into neural
networks and connected for end-to-end training, avoiding the
need for per-module training labels.
PARAGON essentially integrates parsing-based methods
into a probabilistic, data-driven framework, exhibiting ro-
bustness from its data-driven property, as well as generaliz-
ability and data efficiency due to its parsing-based nature. It
also adapts to the uncertainty of ambiguous instructions us-
ing particle-based probabilistic techniques. The experiments
demonstrate that PARAGON outperforms the state-of-the-art
method in language-conditioned object placement tasks in
the presence of ambiguity and compositionality.
II. RELATED WORK
Grounding human language in robot instruction-following
has been studied in recent decades [1], [2], [3], [4], [5],
[6], [7], [8], [9]. Our research focuses on object placement
instructed by human language. In contrast to picking [1], [2]
that needs only a discriminative model to ground objects
from reference expressions, placing [3], [4], [5], [6], [7]
requires a generative model conditioned on the relational
constraints of object placement. It requires capturing complex
relations between objects in natural language, grounding
reference expressions of objects, and generating placements
that satisfy the relational constraints in the instructions.
arXiv:2210.00215v4 [cs.RO] 13 Mar 2023
Parsing-based methods for robot instruction-following [3],
[4], [6], [10], [11], [12], [13], [14] parse natural language
into formal representations using hand-crafted rules and
grammatical structures. Those hand-crafted rules are gen-
eralizable but not robust to noisy language [15]. Among
these studies, those focusing on placing [3], [4], [6] lack
a decomposition mechanism for compositional instructions
and assume perfect object grounding without referential
ambiguity. Recently, [8], [9], [7] used sentence embeddings
to learn language-conditioned policies for robot instruction
following, which are data-inefficient and hard to generalize
to unseen compositional instructions. We follow [16] to inte-
grate parsing-based modules into a data-driven framework. It
is robust, data-efficient, and generalizable for learning com-
positional instructions. We also use probabilistic techniques
to adapt to the uncertainty of ambiguous instructions.
PARAGON has a GNN for relational reasoning and place-
ment generation, which encodes a mean-field inference al-
gorithm similar to [17]. Moreover, our GNN uses particles
for message passing to capture complex, multimodal
distributions: it approximates a distribution as a set of
particles [18], which provides strong expressiveness. This
idea has been used in robot perception [19], [20], [21],
recurrent neural networks [22], and graphical models [23],
[24]. Our approach employs it in a GNN for particle-based
message passing.
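A minimal sketch of the particle idea (our simplification, not PARAGON's exact network): a node's belief over a placement is a set of weighted particles, and one message-passing step reweights the particles by a pairwise potential and resamples. The potential below, favoring placements above a knife at an assumed position, is a made-up stand-in for a learned relation.

```python
import numpy as np

rng = np.random.default_rng(1)

# A belief over a 2-D placement is a set of N weighted particles.
N = 500
particles = rng.uniform(0.0, 1.0, size=(N, 2))  # initial guesses
weights = np.full(N, 1.0 / N)

# Hypothetical pairwise potential for "upper side of the knife" with an
# unspecified distance (positional ambiguity); knife position assumed.
knife = np.array([0.5, 0.3])

def potential(p):
    # Favors particles above the knife and roughly aligned with it.
    above = p[:, 1] > knife[1]
    align = np.exp(-((p[:, 0] - knife[0]) ** 2) / 0.02)
    return above * align + 1e-9

# One particle-based message-passing step: reweight, then resample.
weights = weights * potential(particles)
weights /= weights.sum()
idx = rng.choice(N, size=N, p=weights)
particles = particles[idx]

print(particles.mean(axis=0))  # mass concentrates above the knife
```

Because the belief stays a particle set throughout, multimodal distributions (e.g., two valid regions near two identical mugs) survive the update, unlike a single-Gaussian belief.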
III. OVERVIEW
We focus on grounding human language for tabletop
object placement tasks. In this task, scenes are composed of
a finite set of 3D objects on a 2D tabletop. Humans give a
natural language instruction $\ell \in \mathcal{L}$ to guide the robot to pick
an object and put it at the desired position $x_{\mathrm{tgt}}$. The human
instruction is denoted as a sequence $\ell = \{\omega_l\}_{1 \le l \le L}$, where $\omega_l$
is a word; e.g., the instruction in Fig. 1 is $\{\omega_1 = \mathrm{put}, \omega_2 = \mathrm{a}, \dots\}$.
An instruction should contain a target object expression
(e.g., “a plate”) to specify the object to pick, and express
at least one spatial relation (e.g., “next to a silver mug”)
to describe the placement. The robot needs to find the target
object's placement distribution $p(x_{\mathrm{tgt}} \mid \ell, o)$ conditioned on the
language instruction $\ell$ and visual observation $o$.
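The task setup can be mirrored in code as a small data structure; the identifiers below are ours, not the paper's, and the sampler is a dummy standing in for the learned placement distribution.

```python
# Toy rendering of the problem setup (identifiers are ours, not the
# paper's). An instruction is a word sequence {omega_l}; the robot must
# produce candidate placements x_tgt given (instruction, observation).
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PlacementTask:
    instruction: List[str]           # {omega_l}, l = 1..L
    observation: object              # e.g., an RGB image of the tabletop
    # stands in for p(x_tgt | instruction, observation):
    placement_sampler: Callable[[], Tuple[float, float]]

words = "put a plate to the upper side of a knife".split()
task = PlacementTask(words, observation=None,
                     placement_sampler=lambda: (0.5, 0.7))
print(task.instruction[0], len(task.instruction))
```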
The core idea in our proposed solution, PARAGON, is
to leverage object-centric relations from linguistic and vi-
sual inputs to perform relational reasoning and placement
generation, and encode those procedures in neural networks
for end-to-end training. The pipeline of PARAGON is shown in
Fig. 2. PARAGON first uses the soft parsing module to convert
language inputs “softly” into a set of relations, represented
as triplets. A grounding module then aligns the mentioned
objects in triplets with objects in the visual scenes. The
triplets can form a graph by taking the objects as the nodes
and relations as the edges. The resulting graph is fed into
a GNN for relational reasoning and generating placements.
The GNN encodes a mean-field inference algorithm for a
conditional random field depicting spatial relations in triplets.
PARAGON is trained end-to-end to achieve the best overall
performance for object placing, without annotated parsing
or object-grounding labels.
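The data flow through the three modules can be summarized as a short sketch; the function names are placeholders for the modules described above, not the released code's API, and the dummy lambdas exist only to show the interfaces.

```python
# Schematic of the PARAGON pipeline (function names are placeholders,
# not the released code's API).

def paragon(instruction, image, soft_parse, ground, gnn_infer):
    # 1. Soft parsing: instruction -> relational triplets
    #    (subject, relation, object), represented as embeddings.
    triplets = soft_parse(instruction)
    # 2. Grounding: align mentioned objects with objects in the image;
    #    objects become graph nodes, relations become edges.
    graph = ground(triplets, image)
    # 3. Particle-based GNN: mean-field inference over the graph,
    #    returning candidate placements for the target object.
    return gnn_infer(graph)

# Dummy modules, just to show the data flow end to end.
placements = paragon(
    "put a plate next to the mug", None,
    soft_parse=lambda s: [("plate", "next to", "mug")],
    ground=lambda t, img: {"nodes": ["plate", "mug"], "edges": t},
    gnn_infer=lambda g: [(0.5, 0.6)],
)
print(placements)
```

Because each stage is a differentiable network in the actual system, gradients flow from the placement loss back through grounding and parsing, which is what removes the need for per-module labels.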
IV. SOFT PARSING
The soft parsing module extracts spatial relations from
complex instructions for accurate placement generation. The
pipeline is shown in Fig. 4. Dependency trees capture the relations
between words in natural language, which implicitly indicate
the relations between the semantics those words express [25],
[26]. Thus, we use a data-driven approach to explore the
underlying semantic relations in the dependency tree for
extracting relations represented as relational triplets. It takes
linguistic input and outputs relational triplets, where the
triplets’ components are represented as embeddings.
A. Preliminaries
Triplets: A triplet consists of two entities and their rela-
tion, representing a binary relation. A triplet provides a formal
representation of knowledge expressed in natural language,
which is widely applied in scene graph parsing [25], relation
extraction [27], and knowledge graphs [28]. The underlying
assumption of representing natural language as triplets is that
natural language rarely has higher-order relations, as humans
mostly use binary relations in natural language [29]. For
spatial relations, two triplets can represent ternary relations
(e.g., “between A and B” can sometimes be expressed as “to the
right of A and to the left of B”). As such, it is sufficient to represent instructions
as triplets for common object-placing purposes.
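Concretely, the instruction in Fig. 1 decomposes into two binary triplets, one per spatial relation; the sketch below uses our own plain-Python representation (the paper represents triplet components as embeddings).

```python
from collections import namedtuple

# Relational triplets (subject, relation, object) for the Fig. 1
# instruction: "Put a plate to the upper side of a knife and next to
# the silver mug." The compositional sentence becomes binary relations.
Triplet = namedtuple("Triplet", ["subj", "rel", "obj"])

triplets = [
    Triplet("plate", "upper side of", "knife"),
    Triplet("plate", "next to", "silver mug"),
]

# A ternary relation like "between A and B" can likewise be expressed
# with two binary triplets:
between = [Triplet("plate", "right of", "A"),
           Triplet("plate", "left of", "B")]
print(len(triplets), triplets[0].subj)
```

Sharing the subject across triplets is what lets them later be composed into a graph: nodes are the mentioned objects, edges are the relations.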
Dependency Tree: A dependency tree (shown in Fig. 3) is
a universal structure that examines the relationships between
words in a phrase to determine its grammatical structure [30].
It uses part-of-speech tags to mark each word and depen-
dency tags to mark the relations between two words. A
part-of-speech tag [31] categorizes a word by its part of
speech, depending on the word's definition and context; e.g.,
in Fig. 3, “cup” is a Noun. Dependency tags mark the
grammatical relation between two words, represented as
Universal Dependency Relations [32]. For
example, in Fig. 3, the Noun “cup” is the “direct object
(Dobj)” of the Verb “place”. Those relations are universal.
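A dependency tree can be stored as a simple table of (word, POS tag, head, dependency tag) rows; the hand-annotated toy below mirrors the "place"/"cup" example (a real system would obtain the tree from a dependency parser, and the tag names here follow the paper's example rather than any particular parser's tag set).

```python
# A toy dependency tree for "place the cup", hand-annotated for
# illustration. Each word carries a part-of-speech tag and a
# dependency edge (head index, dependency tag) to its head word.
tree = [
    # (index, word,   POS tag, head index, dependency tag)
    (0, "place", "VERB", -1, "root"),
    (1, "the",   "DET",   2, "det"),
    (2, "cup",   "NOUN",  0, "dobj"),  # "cup" is the direct object
]

# Recover the direct object of the root verb by following the edges.
root = next(w for w in tree if w[4] == "root")
dobj = next(w for w in tree if w[3] == root[0] and w[4] == "dobj")
print(dobj[1])  # cup
```

Traversals like this are exactly the word-to-word relations the soft parsing module later exploits, except that PARAGON operates on them with a learned GNN instead of hand-written rules.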
A proper dependency tree relies on grammatically correct
instructions, whereas noisy language sentences may result
in imperfect dependency trees. Thus, we use the data-driven
method to adapt to imperfect dependency trees.
B. Method
To make the parsing differentiable, the soft parsing module
“softens” a triplet into attention over the words of the linguistic
input: $\{a_{\gamma,l}\}_{1 \le l \le L}$ with $\sum_{l} a_{\gamma,l} = 1$, for $\gamma \in \{\mathrm{subj}, \mathrm{obj}, \mathrm{rel}\}$. We
compute the embedding of each triplet component as the
attention-weighted sum of the word embeddings, $\phi_\gamma = \sum_{l} a_{\gamma,l} f_{\mathrm{CLIP}}(\omega_l)$,
where the word embeddings are computed with the pre-trained
CLIP [33] encoder $f_{\mathrm{CLIP}}$. As the dependency tree
examines the relations between words, which could implicitly
encode relational information, we use a GNN [34] to operate
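The attention-weighted pooling just described can be sketched numerically; in this toy version (ours, not the paper's code), random vectors stand in for CLIP word embeddings and random scores stand in for the attention logits the GNN would produce.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of "softened" triplet components: for each role gamma in
# {subj, rel, obj}, an attention vector over the L words sums to 1,
# and the role embedding phi_gamma is the attention-weighted sum of
# word embeddings. Random vectors stand in for CLIP embeddings.
L, d = 6, 8                          # 6 words, 8-dim embeddings
word_emb = rng.normal(size=(L, d))   # placeholder for f_CLIP(omega_l)

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = {g: rng.normal(size=L) for g in ("subj", "rel", "obj")}
attn = {g: softmax(s) for g, s in scores.items()}   # a_{gamma, l}
phi = {g: attn[g] @ word_emb for g in attn}         # phi_gamma

print(phi["subj"].shape)  # (8,)
```

Because the output is a weighted sum rather than a hard word selection, gradients from the downstream placement loss reach the attention scores, which is what makes the parse trainable end-to-end.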