plex relations between objects in natural language, grounding referring expressions of objects, and generating placements that satisfy the relational constraints in the instructions.
Parsing-based methods for robot instruction following [3], [4], [6], [10], [11], [12], [13], [14] parse natural language into formal representations using hand-crafted rules and grammatical structures. These hand-crafted rules generalize well but are not robust to noisy language [15]. Among these studies, those focusing on placing [3], [4], [6] lack a decomposition mechanism for compositional instructions and assume perfect object grounding without referential ambiguity. Recently, [7], [8], [9] used sentence embeddings to learn language-conditioned policies for robot instruction following, but these approaches are not data-efficient and generalize poorly to unseen compositional instructions. We follow [16] and integrate parsing-based modules into a data-driven framework, which is robust, data-efficient, and generalizable for learning compositional instructions. We also use probabilistic techniques to handle the uncertainty of ambiguous instructions.
PARAGON has a GNN for relational reasoning and placement generation, which encodes a mean-field inference algorithm similar to [17]. Moreover, our GNN passes messages as sets of particles to capture complex, multimodal distributions. Approximating a distribution as a set of particles [18] provides strong expressiveness for such distributions and has been used in robot perception [19], [20], [21], recurrent neural networks [22], and graphical models [23], [24]. Our approach employs this idea in a GNN for particle-based message passing.
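To make the particle idea concrete, the following minimal NumPy sketch (illustrative only, not PARAGON's implementation) represents a bimodal 2D placement distribution as a set of weighted particles, estimates its mean, and resamples the particle set in proportion to the weights, as one might do when refreshing particles during inference.

```python
import numpy as np

def particle_mean(particles, weights):
    """Expectation of a distribution represented by weighted particles."""
    weights = weights / weights.sum()              # normalize the weights
    return (particles * weights[:, None]).sum(axis=0)

def resample(particles, weights, rng):
    """Multinomial resampling: draw particles in proportion to their weights."""
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

rng = np.random.default_rng(0)
# 100 particles approximating a bimodal 2D placement distribution on the table plane.
particles = np.concatenate([rng.normal([0.2, 0.5], 0.02, (50, 2)),
                            rng.normal([0.8, 0.5], 0.02, (50, 2))])
weights = np.ones(100)
print(particle_mean(particles, weights))           # roughly midway between the two modes
```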
III. OVERVIEW
We focus on grounding human language for tabletop object-placing tasks. In this task, scenes are composed of a finite set of 3D objects on a 2D tabletop. A human gives a natural language instruction $\ell \in \mathcal{L}$ to guide the robot to pick an object and put it at the desired position $x^*_{\mathrm{tgt}}$. The instruction is denoted as a sequence $\ell = \{\omega_l\}_{1 \leq l \leq L}$, where $\omega_l$ is a word; e.g., the instruction in Fig. 6 is $\{\omega_1 = \text{put}, \omega_2 = \text{a}, \dots\}$. An instruction should contain a target object expression (e.g., “a plate”) to specify the object to pick and express at least one spatial relation (e.g., “next to a silver mug”) describing the placement. The robot needs to infer the target object's placement distribution $p(x^*_{\mathrm{tgt}} \mid \ell, o)$ conditioned on the language instruction $\ell$ and the visual observation $o$.
The core idea of our proposed solution, PARAGON, is to leverage object-centric relations from linguistic and visual inputs to perform relational reasoning and placement generation, and to encode these procedures in neural networks for end-to-end training. The pipeline of PARAGON is shown in Fig. 2. PARAGON first uses a soft parsing module to convert the language input “softly” into a set of relations represented as triplets. A grounding module then aligns the objects mentioned in the triplets with objects in the visual scene. The triplets form a graph with the objects as nodes and the relations as edges. The resulting graph is fed into a GNN for relational reasoning and placement generation. The GNN encodes a mean-field inference algorithm for a conditional random field that captures the spatial relations in the triplets. PARAGON is trained end-to-end to achieve the best overall object-placing performance without annotated parsing or object-grounding labels.
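The sketch below shows one way these three stages could be composed in code; the names and signatures (paragon_forward, soft_parser, grounder, relation_gnn) are hypothetical placeholders for illustration, not PARAGON's actual interfaces.

```python
from typing import Any, Callable, List

def paragon_forward(instruction: str,
                    observation: Any,
                    soft_parser: Callable[[str], List[Any]],
                    grounder: Callable[[List[Any], Any], Any],
                    relation_gnn: Callable[[Any], Any]) -> Any:
    """Hypothetical end-to-end pass: soft parsing -> grounding -> relational reasoning."""
    triplets = soft_parser(instruction)        # language -> "soft" relational triplets
    graph = grounder(triplets, observation)    # align mentioned objects with scene objects
    placements = relation_gnn(graph)           # mean-field inference with particle messages
    return placements                          # approximates p(x*_tgt | instruction, observation)
```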
IV. SOFT PARSING
The soft parsing module extracts spatial relations from complex instructions for accurate placement generation. Its pipeline is shown in Fig. 4. Dependency trees capture the relations between words in natural language, which implicitly indicate the relations between the semantics those words express [25], [26]. Thus, we use a data-driven approach to exploit the underlying semantic relations in the dependency tree and extract relations represented as relational triplets. The module takes the linguistic input and outputs relational triplets whose components are represented as embeddings.
A. Preliminaries
Triplets: A triplet consists of two entities and their relation, representing a binary relation. Triplets provide a formal representation of knowledge expressed in natural language and are widely applied in scene graph parsing [25], relation extraction [27], and knowledge graphs [28]. The underlying assumption of representing natural language as triplets is that natural language rarely involves higher-order relations, as humans mostly use binary relations [29]. For spatial relations, two triplets can represent a ternary relation (e.g., “between A and B” can often be expressed as “to the right of A and to the left of B”). As such, it is sufficient to represent instructions as triplets for common object-placing purposes.
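As an illustration of the binary-relation assumption, the following sketch (hypothetical, not the paper's parser output) shows one possible decomposition of a ternary “between” relation into two binary triplets.

```python
from typing import List, NamedTuple

class Triplet(NamedTuple):
    subj: str  # entity to be placed
    rel: str   # spatial relation
    obj: str   # reference entity

# "Put the plate next to the silver mug." -> a single binary triplet.
simple: List[Triplet] = [Triplet("plate", "next to", "silver mug")]

# "Put the plate between the mug and the bowl." -> the ternary "between"
# relation decomposed into two binary triplets (one possible reading).
between: List[Triplet] = [
    Triplet("plate", "right of", "mug"),
    Triplet("plate", "left of", "bowl"),
]
```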
Dependency Tree: A dependency tree (shown in Fig. 3) is a universal structure that captures the grammatical relationships between the words of a sentence [30]. It uses part-of-speech tags to mark each word and dependency tags to mark the relations between pairs of words. A part-of-speech tag [31] categorizes a word according to its part of speech, based on the word's definition and context; e.g., in Fig. 3, “cup” is a Noun. Dependency tags mark the grammatical relation between two words, represented as Universal Dependency Relations [32]; e.g., in Fig. 3, the Noun “cup” is the “direct object (Dobj)” of the Verb “place”. These relations are universal. A well-formed dependency tree relies on grammatically correct instructions, whereas noisy language may yield imperfect dependency trees. Thus, we use a data-driven method to adapt to imperfect dependency trees.
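For reference, the snippet below uses spaCy, one widely available off-the-shelf dependency parser (the paper does not state which parser it uses), to inspect the part-of-speech and dependency tags discussed above.

```python
import spacy

# Requires a downloaded English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Place the cup next to the plate.")

for token in doc:
    # token.pos_ : part-of-speech tag, e.g. "cup" -> NOUN
    # token.dep_ : dependency tag, e.g. "cup" -> dobj (direct object of "Place")
    # token.head : the word this token grammatically depends on
    print(f"{token.text:>6}  pos={token.pos_:<5}  dep={token.dep_:<6}  head={token.head.text}")
```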
B. Method
To make parsing differentiable, the soft parsing module “softens” a triplet into attention over the words of the linguistic input, $\{a_{\gamma,l}\}_{1 \leq l \leq L}$ with $\sum_l a_{\gamma,l} = 1$ and $\gamma \in \{\mathrm{subj}, \mathrm{obj}, \mathrm{rel}\}$. We compute the embedding of each triplet component as the attention-weighted sum of the word embeddings, $\phi_\gamma = \sum_l a_{\gamma,l} f_{\mathrm{CLIP}}(\omega_l)$, where the word embeddings are computed by the pre-trained CLIP [33] encoder $f_{\mathrm{CLIP}}$. As the dependency tree examines the relations between words and could thus implicitly encode relational information, we use a GNN [34] to operate