can be entirely self-supervised: the full pipeline, including
training the underlying image models, need not use any
explicit supervision. On the other hand, as we show in our
experiments, even without any explicit supervision, the spatial
database can naturally capture scene-specific information.
We demonstrate our method on tasks such as instance
segmentation and identification. Furthermore, we give qualitative
examples of image-view localization, where we need to find
the spatial coordinates corresponding to an image, and of
localizing text descriptions in space. Finally, we demonstrate
CLIP-Fields on a real robot by having the robot move to look
at various objects in 3D given natural language commands.
These experiments show how CLIP-Fields could be used to
power a range of real-world applications by capturing rich 3D
semantic information in an accessible way.
II. RELATED WORK
Vision-Language Navigation. Much recent progress on
vision-language navigation problems such as ALFRED [25] or
RXR [16] has used spatial representations or structured mem-
ory as a key component to solving the problem [19, 2, 33, 10].
HLSM [2] and FiLM [19] are built as the agent moves
through the environment, and rely on a fixed set of classes
and a discretization of the world that is inherently limiting.
By contrast, CLIP-Fields creates an embedding-dependent
implicit representation of a scene, removing dependency on a
fixed set of labels and hyperparameters related to environment
discretization. Other representations [33] do not allow for 3D
spatial queries, or rely on dense annotations or accurate object
detection and segmentation [10, 5, 1].
Concurrently with our work, NLMap-SayCan [4] and
VLMaps [13] proposed two approaches for real-world vision-
language navigation. NLMap-SayCan uses a 2D grid-based
map and a discrete set of objects predicted by a region-
proposal network [4], while CLIP-Fields can make predictions
at different granularities. VLMaps [13] uses a 2D grid-based
representation and operates on a specific, pre-selected set of
object classes. By contrast, CLIP-Fields can operate on 3D
data, allowing the agent to look up or down to find objects.
All three methods assume the environment has been explored,
but both [4] and [13] look at predicting action sequences,
while we focus on the problem of building an open-vocabulary,
queryable 3D scene representation.
Pretrained Representations. Effective use of pretrained
representations like CLIP [22] seems crucial to deploying
robots with semantic knowledge in the real world. Recent
works have shown that it is possible to use weakly-supervised
web image data for self-supervised learning of spatial representa-
tions. Our work is closely related to [3], where the authors
show that a web-trained detection model, along with spatial
consistency heuristics, can be used to annotate a 3D voxel
map. That voxel map can then be used to propagate labels
from one image to another. Other works, for example [8], use
models specifically trained on indoor semantic segmentation
to build semantic scene data structures.
Cohen et al. [7] looks at personalizing CLIP for specific
users and rare queries, but does not build 3D spatial representa-
tions conducive to robotics applications, and instead functions
on the level of individual images.
Implicit Representations. There is a recent trend towards
using NeRF-inspired representations as the spatial knowledge
base for robotic manipulation problems [27, 9], but so far this
has not been applied to open-vocabulary object search. As in
[36, 29, 32, 14, 31], we use a mapping (parameterized by a
neural network) that associates to an (x, y, z) point in space a
vector with semantic information. In those works, the labels
are given as explicit (but perhaps sparse) human annotations,
whereas, in this work, the annotations for the semantic vectors
are derived from weakly-supervised web image data.
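To make this concrete, below is a minimal, illustrative sketch of such a mapping (the Fourier positional encoding, layer sizes, and output dimension are assumptions for exposition, not the exact CLIP-Fields architecture): a small MLP that takes an (x, y, z) coordinate and outputs a semantic vector.

import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Map an (x, y, z) point in space to a semantic embedding vector."""

    def __init__(self, num_freqs=6, hidden=256, embed_dim=512):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs                # sin/cos Fourier features per axis
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),         # semantic vector (e.g. CLIP-sized)
        )

    def encode(self, xyz):
        # Fourier positional encoding of the raw coordinates.
        freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32, device=xyz.device)
        angles = xyz[..., None] * freqs           # (..., 3, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return feats.flatten(start_dim=-2)        # (..., 3 * 2 * num_freqs)

    def forward(self, xyz):
        return self.mlp(self.encode(xyz))

field = SemanticField()
points = torch.rand(1024, 3)                      # sampled (x, y, z) scene coordinates
semantic_vectors = field(points)                  # (1024, 512)

In practice, the output dimension is chosen to match the embedding space of the pretrained model whose features supervise the field, so that queried points can be compared directly against text or image embeddings.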
Language-based Robotics. Several works [26,30] have
shown how features from weakly-supervised web-image
trained models like CLIP [22] can be used for robotic scene
understanding. Most closely related to this work is [12], which
uses CLIP embeddings to label points in a single-view 3D
space via back-projection. In that work, text descriptions are
associated with locations in space in a two-step process. In
the first step, using a ViT-CLIP attention-based relevancy
extractor, a given text description is localized to a region of
an image; that region is then back-projected to locations in
space (via depth information). In the second step, a separately
trained model decoupled from the semantics converts the back-
projected points into an occupancy map. In contrast, in our
work, CLIP embeddings are used to directly train an implicit
map that outputs a semantic vector corresponding to each
point in space. One notable consequence is that our approach
integrates semantic information from multiple views into the
spatial memory; for example, in Figure 6 we see that more
views of the scene lead to better zero-shot detections.
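For concreteness, the back-projection step can be sketched as follows (a minimal illustration assuming a pinhole camera with known intrinsics fx, fy, cx, cy; this is not the exact pipeline of [12]):

import numpy as np

def back_project(pixels_uv, depth, fx, fy, cx, cy):
    """Lift pixel coordinates with depth values to 3D points in the camera frame.

    pixels_uv: (N, 2) array of (u, v) pixel coordinates
    depth:     (N,) array of depth values (in meters) at those pixels
    Returns an (N, 3) array of (x, y, z) camera-frame points.
    """
    u, v = pixels_uv[:, 0], pixels_uv[:, 1]
    x = (u - cx) * depth / fx      # inverse of the pinhole projection u = fx * x / z + cx
    y = (v - cy) * depth / fy      # inverse of the pinhole projection v = fy * y / z + cy
    return np.stack([x, y, depth], axis=-1)

The resulting camera-frame points can then be transformed into the world frame with the camera pose, which is how per-pixel semantic features become associated with locations in space.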
III. BACKGROUND
In this section, we provide descriptions of the recent ad-
vances in machine learning that make CLIP-Fields possible.
a) Contrastive Image-Language Pretraining: This pre-
training method, colloquially known as CLIP [22], is based
on training a pair of image and language embedding networks
such that an image and text strings describing that image
have similar embeddings. The CLIP model in [22] is trained
with a large corpus of paired image and text captions with
a contrastive loss objective predicting which caption goes
with which image. The resulting pair of models is able to
embed images and text into the same latent space, with a
meaningful cosine similarity metric between the embeddings.
We use CLIP models and embeddings heavily in this work
because they can work as a shared representation between an
object’s visual features and its possible language labels.
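The training objective itself can be illustrated with a short sketch of the symmetric contrastive loss (the temperature and embedding sizes below are placeholders, not the settings used in [22]):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors; row i of each forms a matching pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)        # unit-norm, so dot products
    text_emb = F.normalize(text_emb, dim=-1)          # are cosine similarities
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)    # match each image to its caption
    loss_texts = F.cross_entropy(logits.t(), targets) # match each caption to its image
    return 0.5 * (loss_images + loss_texts)

# Toy usage; in practice the inputs come from the CLIP image and text encoders.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))

At query time, the same normalized dot product serves as the similarity score between an embedded image (or scene feature) and an arbitrary text query.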
b) Open-label Object Detection and Image Segmenta-
tion: Traditionally, the objective of object detection and semantic
segmentation tasks has been to assign a label to each
detected object or pixel. Generally, these labels are chosen
from a set of predefined labels fixed during training or fine-tuning.
Recently, the advent of open-label models has taken