CLIP-Fields: Weakly Supervised Semantic Fields
for Robotic Memory
Nur Muhammad (Mahi) Shafiullah¹, Chris Paxton², Lerrel Pinto¹, Soumith Chintala², Arthur Szlam²
Abstract—We propose CLIP-Fields, an implicit scene model
that can be used for a variety of tasks, such as segmentation,
instance identification, semantic search over space, and view
localization. CLIP-Fields learns a mapping from spatial locations
to semantic embedding vectors. Importantly, we show that this
mapping can be trained with supervision coming only from web-
image and web-text trained models such as CLIP, Detic, and
Sentence-BERT; and thus uses no direct human supervision.
Compared to baselines like Mask-RCNN, our method performs better at few-shot instance identification and semantic segmentation on the HM3D dataset with only a fraction of the
examples. Finally, we show that using CLIP-Fields as a scene
memory, robots can perform semantic navigation in real-world
environments. Our code and demonstration videos are available
here: https://mahis.life/clip-fields
I. INTRODUCTION
In order to perform a variety of complex tasks in human
environments, robots often rely on a spatial semantic mem-
ory [2,19,11]. Ideally, this spatial memory should not be
restricted to particular labels or semantic concepts, should not rely on human annotation for each scene, and should be
easily learnable from commodity sensors like RGB-D cameras
and IMUs. However, existing representations are coarse, often
relying on a preset list of classes and capturing minimal
semantics [2,11]. As a solution, we propose CLIP-Fields,
which builds an implicit spatial semantic memory using web-
scale pretrained models as weak supervision.
Recently, representations of 3D scenes via neural implicit
mappings have become practical [29,28]. Neural Radiance
Fields (NeRFs) [18], and implicit neural representations more
generally [21] can serve as differentiable databases of spatio-
temporal information that can be used by robots for scene
understanding, SLAM, and planning [17,27,6,9,21].
Concurrently, web-scale weakly-supervised vision-language
models like CLIP [22] have shown the ability to capture
powerful semantic abstractions from individual 2D images.
These have proven useful for a range of robotics applications,
including object understanding [30] and multi-task learning
from demonstration [26]. Their applications have been limited,
however, by the fact that these trained representations assume
a single 2D image as input; it is an open question how to use
these together with 3D reasoning.
In this work, we introduce a method for building weakly
supervised semantic neural fields, called CLIP-Fields, which
combines the advantages of both of these lines of work.
CLIP-Fields is intended to serve as a queryable 3D scene representation, capable of acting as a spatial-semantic memory for a mobile robot. We show that CLIP-Fields is capable of open-vocabulary segmentation and object navigation in a 3D scene using only pretrained models as supervision.

Corresponding author, email: mahi@cs.nyu.edu. 1. New York University 2. FAIR Labs

Fig. 1: Our approach, CLIP-Fields, integrates multiple views of a scene and can capture 3D semantics from relatively few examples. This results in a scalable 3D semantic representation that can be used to infer information about the world from relatively few examples and functions as a 3D spatial memory for a mobile robot.
Our key idea is to build a mapping from locations in space
$g(x, y, z) : \mathbb{R}^3 \rightarrow \mathbb{R}^d$ that serves as a generic differentiable spatial database. This mapping is trained to predict features
from a set of off-the-shelf vision-language models trained on
web-scale data, which give us weak supervision. This map
is trained on RGB-D data using a contrastive loss which encourages the features predicted at specific spatial locations to match the web-model features observed there.
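As a concrete illustration, the following is a minimal sketch, not the exact training code of this work, of such a batch-wise contrastive objective in PyTorch: the field's predicted embeddings at a batch of sampled 3D points are pushed toward the web-model embeddings recorded at those points. The function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_field_loss(pred: torch.Tensor,
                           target: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss between field predictions and web-model targets.

    pred:   (B, d) embeddings predicted by the field g at B sampled points.
    target: (B, d) CLIP / Sentence-BERT embeddings observed at those points.
    """
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    # Similarity of every predicted embedding to every target in the batch.
    logits = pred @ target.t() / temperature          # (B, B)
    labels = torch.arange(pred.shape[0], device=pred.device)
    # Each point should match its own target more than any other in the batch.
    return F.cross_entropy(logits, labels)
```

In the full system the same idea applies separately to the semantic and visual targets described in Section IV.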
Thus, from the point of view of a robot using CLIP-Fields
as a spatial database for scene-understanding, training g itself
can be entirely self-supervised: the full pipeline, including
training the underlying image models, need not use any
explicit supervision. On the other hand, as we show in our
experiments, even without any explicit supervision, the spatial
database g can naturally capture scene-specific information.
We demonstrate our method on tasks such as instance
segmentation and identification. Furthermore, we give qual-
itative examples of image-view localization, where we need
to find the spatial coordinates corresponding to an image, and of localizing text descriptions in space. Finally, we demonstrate
CLIP-Fields on a real robot by having the robot move to look
at various objects in 3D given natural language commands.
These experiments show how CLIP-Fields could be used to
power a range of real-world applications by capturing rich 3D
semantic information in an accessible way.
II. RELATED WORK
Vision-Language Navigation. Much recent progress on
vision-language navigation problems such as ALFRED [25] or
RXR [16] has used spatial representations or structured mem-
ory as a key component to solving the problem [19,2,33,10].
HLSM [2] and FiLM [19] build their maps as the agent moves through the environment, and rely on a fixed set of classes
and a discretization of the world that is inherently limiting.
By contrast, CLIP-Fields creates an embedding-dependent
implicit representation of a scene, removing dependency on a
fixed set of labels and hyperparameters related to environment
discretization. Other representations [33] do not allow for 3D spatial queries, rely on dense annotations, or require accurate object detection and segmentation [10, 5, 1].
Concurrently with our work, NLMap-SayCan [4] and
VLMaps [13] proposed two approaches for real-world vision-
language navigation. NLMap-SayCan uses a 2D grid-based
map and a discrete set of objects predicted by a region-
proposal network [4], while CLIP-Fields can make predictions
at different granularities. VLMaps [13] use a 2D grid-based
representation and operate on a specific, pre-selected set of
object classes. By contrast, CLIP-Fields can operate on 3D
data, allowing the agent to look up or down to find objects.
All three methods assume the environment has been explored,
but both [4] and [13] look at predicting action sequences,
while we focus on the problem of building an open-vocabulary,
queryable 3D scene representation.
Pretrained Representations. Effective use of pretrained
representations like CLIP [22] seems crucial to deploying
robots with semantic knowledge in the real world. Recent
works have shown that it is possible to use supervised web
image data for self-supervised learning of spatial representa-
tions. Our work is closely related to [3], where the authors
show that a web-trained detection model, along with spatial
consistency heuristics, can be used to annotate a 3D voxel
map. That voxel map can then be used to propagate labels
from one image to another. Other works, for example [8], use
models specifically trained on indoor semantic segmentation
to build semantic scene data-structures.
Cohen et al. [7] looks at personalizing CLIP for specific
users and rare queries, but does not build 3D spatial representa-
tions conducive to robotics applications, and instead functions
on the level of individual images.
Implicit Representations. There is a recent trend towards
using NeRF-inspired representations as the spatial knowledge
base for robotic manipulation problems [27,9], but so far this
has not been applied to open-vocabulary object search. As in
[36,29,32,14,31], we use a mapping (parameterized by a
neural network) that associates to an (x, y, z) point in space a
vector with semantic information. In those works, the labels
are given as explicit (but perhaps sparse) human annotation,
whereas, in this work, the annotations for the semantic vectors are derived from weakly-supervised web image data.
Language-based Robotics. Several works [26,30] have
shown how features from weakly-supervised web-image
trained models like CLIP [22] can be used for robotic scene
understanding. Most closely related to this work is [12], which
uses CLIP embeddings to label points in a single-view 3D
space via back-projection. In that work, text descriptions are
associated with locations in space in a two-step process. In the first step, using a ViT-CLIP attention-based relevancy
extractor, a given text description is localized in a region on
an image; and that region is back-projected to locations in
space (via depth information). In the second step, a separately
trained model decoupled from the semantics converts the back-
projected points into an occupancy map. In contrast, in our
work, CLIP embeddings are used to directly train an implicit
map that outputs a semantic vector corresponding to each
point in space. One notable consequence is that our approach
integrates semantic information from multiple views into the
spatial memory; for example, in Figure 6 we see that more
views of the scene lead to better zero-shot detections.
III. BACKGROUND
In this section, we provide descriptions of the recent ad-
vances in machine learning that make CLIP-Fields possible.
a) Contrastive Image-Language Pretraining: This pre-
training method, colloquially known as CLIP [22], is based
on training a pair of image and language embedding networks
such that an image and text strings describing that image
have similar embeddings. The CLIP model in [22] is trained
with a large corpus of paired image and text captions with
a contrastive loss objective predicting which caption goes
with which image. The resultant pair of models are able to
embed images and texts into the same latent space with a
meaningful cosine similarity metric between the embeddings.
We use CLIP models and embeddings heavily in this work
because they can work as a shared representation between an
object’s visual features and its possible language labels.
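For illustration, the sketch below scores a single image against a few candidate captions with the publicly released CLIP package; the model name, file path, and captions are placeholders rather than choices made in this work.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("kitchen.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a coffee mug", "a potted plant", "a laptop"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)

# Cosine similarity between the image and each candidate caption.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.t()).squeeze(0)
print(similarity)  # higher value = better image-text match
```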
b) Open-label Object Detection and Image Segmenta-
tion: Traditionally, the objective of object detection and se-
mantic segmentation tasks has been to assign a label to each detected object or pixel. Generally, these labels are chosen
out of a set of predefined labels fixed during training or fine-
tuning. Recently, the advent of open-label models has taken this task a step further by allowing the user to define the set of labels at run time with no extra training or
fine-tuning. Such models instead generally predict a CLIP
embedding for each detected object or pixel, which is then
compared against the label-embeddings to assign labels. In
our work, we use Detic [37] pretrained on ImageNet-20k as
our open-label object detector. We take advantage of the fact
that besides the proposed labels, Detic also reports the CLIP
image embedding for each proposed region in the image.
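The label-assignment step itself reduces to a nearest-neighbor search in embedding space. The sketch below assumes the per-region CLIP image embeddings and the encoded label vocabulary are already available (we do not reproduce the Detic API here); the helper name and score threshold are illustrative.

```python
import torch
import torch.nn.functional as F

def assign_open_vocab_labels(region_embs: torch.Tensor,
                             label_embs: torch.Tensor,
                             labels: list[str],
                             min_score: float = 0.2):
    """Assign a free-form text label to each detected region.

    region_embs: (R, d) CLIP image embeddings, one per proposed region.
    label_embs:  (L, d) CLIP text embeddings of the user-chosen label set.
    """
    region_embs = F.normalize(region_embs, dim=-1)
    label_embs = F.normalize(label_embs, dim=-1)
    scores = region_embs @ label_embs.t()            # (R, L) cosine similarities
    best_score, best_idx = scores.max(dim=-1)
    # Regions whose best match is too weak are left unlabeled.
    return [(labels[i], s.item()) if s >= min_score else (None, s.item())
            for i, s in zip(best_idx, best_score)]
```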
c) Sentence Embedding Networks for Text Similarity:
CLIP models are pretrained with image-text pairs, but not with
image-image or text-text pairs. As a result, sometimes CLIP
embeddings can be ambiguous when comparing similarities between two images or two pieces of text. To improve CLIP-Fields’ performance on language queries, we also utilize a language model pretrained for semantic-similarity tasks such
as Sentence-BERT [23]. Such models are pretrained on a
large number of question-answer datasets. Thus, they are also
good candidates for generating embeddings that are relevant
to answering imperative queries.
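For illustration, the sketch below embeds an imperative query and a few candidate descriptions with the sentence-transformers library; the model name and example strings are placeholders, not necessarily the checkpoint used in this work.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

query = "Where did I leave my lunch?"
candidates = ["a lunch box on the kitchen counter",
              "a backpack by the front door",
              "a monitor on the office desk"]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the imperative query and each candidate description.
scores = util.cos_sim(query_emb, cand_embs)
print(scores)
```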
d) Neural Fields: Generally, Neural Fields refer to a
class of methods using coordinate-based neural networks
which parametrize physical properties of scenes or objects
across space and time [34]. Namely, they build a map from
space (and potentially time) coordinates to some physical
properties, such as RGB color and density in the case of
neural radiance fields [18], or a signed distance in the case
of instant signed distance fields [21]. While there are many
popular architectures for learning a neural field, in this paper
we used Instant-NGP [20], as in preliminary experiments we
found it to be an order of magnitude faster than the original
architecture in [18].
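For concreteness, the sketch below shows the general shape of such a coordinate network: 3D coordinates are lifted with a sinusoidal positional encoding and passed through a small MLP with two output heads, matching the semantic and visual outputs described in Section IV. It is a simplified stand-in: this work uses Instant-NGP's multi-resolution hash-grid encoding rather than Fourier features, and the class names, widths, and output dimensions below are illustrative.

```python
import math
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Simple sinusoidal positional encoding for (x, y, z) inputs."""
    def __init__(self, n_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * math.pi)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:        # (N, 3)
        angles = xyz[..., None] * self.freqs                      # (N, 3, F)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)     # (N, 3, 2F)
        return enc.flatten(-2)                                    # (N, 6F)

class SemanticField(nn.Module):
    """Maps a 3D point to a semantic and a visual embedding."""
    def __init__(self, sem_dim: int = 768, vis_dim: int = 512, n_freqs: int = 8):
        super().__init__()
        self.encode = FourierFeatures(n_freqs)
        self.trunk = nn.Sequential(
            nn.Linear(6 * n_freqs, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Output widths chosen to match common Sentence-BERT / CLIP sizes.
        self.sem_head = nn.Linear(256, sem_dim)
        self.vis_head = nn.Linear(256, vis_dim)

    def forward(self, xyz: torch.Tensor):
        h = self.trunk(self.encode(xyz))
        return self.sem_head(h), self.vis_head(h)
```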
Note that a major focus of our work is using models
pretrained on large datasets as-is – to make sure CLIP-Fields
can take advantage of the latest advances in the diverse fields
it draws from. At the same time, while in our setup we
haven’t found a need to fine-tune any of the pretrained models
mentioned here, we do not believe there is any barrier to doing so if necessary.
IV. APPROACH
In this section, we describe our concrete problem statement,
the components of our semantic scene model, and how those
components connect with each other.
A. Problem Statement
We aim to build a system that can connect points of a
3D scene with their visual and semantic meaning. Concretely,
we design CLIP-Fields to provide an interface with a pair
of scene-dependent implicit functions $f, h : \mathbb{R}^3 \rightarrow \mathbb{R}^n$ such that for the coordinates of any point $P$ in our scene, $f(P)$ is a vector representing its semantic features, and $h(P)$ is
another vector representing its visual features. For ease of
decoding, we constrain the output spaces of f, h to match the
embedding space of pre-trained language and vision-language
models, respectively. For the rest of this paper, we refer to such
functions as “spatial memory” or “geometric database” since
they connect the scene coordinates with scene information.
Given such a pair of functions, we can solve multiple
downstream problems in the following way:
Segmentation: For a pixel in a scene, find the corre-
sponding point $P_i$ in space. Use the alignment between a label embedding and $f(P_i)$ to find the label with the
highest probability for that pixel. Segment a scene image
by doing so for each pixel.
Object navigation: For a given semantic query $q_s$ (or a visual query $q_v$), find the associated embeddings from our pretrained models, $e_s$ (respectively, $e_v$), and find the point $P$ in space that maximizes $e_s \cdot f(P)$ (or $e_v \cdot h(P)$). Navigate to $P$ using a classic navigation stack.
View localization: Given a view $v$ from the scene, find the image embedding $e_v$ of $v$ using the same vision-language model. Find the set of points with the highest alignment $e_v \cdot h(P)$ in the scene. (A short code sketch of these query patterns is given below.)
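A minimal sketch of these query patterns follows, assuming the field has already been evaluated on the relevant points; the helper names are ours, and the dot-product scoring follows the description above rather than the exact decoding used in the experiments.

```python
import torch

def segment_pixels(point_sem_embs: torch.Tensor,
                   label_text_embs: torch.Tensor) -> torch.Tensor:
    """Per-pixel labels: argmax alignment between f(P_i) and each label embedding.

    point_sem_embs:  (N, d) f(P_i) for the 3D point behind each pixel.
    label_text_embs: (L, d) text embeddings of the candidate labels.
    Returns the index of the best label for each pixel.
    """
    scores = point_sem_embs @ label_text_embs.t()     # (N, L)
    return scores.argmax(dim=-1)

def best_point_for_query(points_xyz: torch.Tensor,
                         point_embs: torch.Tensor,
                         query_emb: torch.Tensor) -> torch.Tensor:
    """Object navigation / view localization: point maximizing e · f(P) (or e · h(P))."""
    scores = point_embs @ query_emb                   # (N,)
    return points_xyz[scores.argmax()]                # (3,) goal coordinate
```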
While such a pair of scene-dependent functions f, h would
be straightforward to construct if we were given a dataset
$\{(P, f(P), h(P)) \mid P \in \text{scene}\}$, to make it broadly applicable,
we create CLIP-Fields to be able to construct f, h from easily
collectable RGB-D videos and odometry data.
B. Dataset Creation
We assume that we have a series of RGB-D images of a
scene alongside odometry information, i.e. the approximate
6D camera poses while capturing the images. As described
in Section V-B, we captured such a dataset using accessible consumer
devices such as an iPhone Pro or iPads. To train our model,
we first preprocess this set of RGB-D frames into a scene
dataset (Fig. 2). We convert each of our depth images to
pointclouds in world coordinates using the camera’s intrinsic
and extrinsic matrices. Next, we label each of the points P
in the pointcloud with their possible representation vectors,
f(P), h(P). When no human annotations are available, we
use web-image trained object detection models on our RGB
images. We choose Detic [37] as our detection model since it
can perform object detection with an open label set. However,
this model can freely be swapped out for any other pretrained
detection or segmentation model. When available, we can also
use human annotations for semantic or instance segmentations.
In both cases, we derive a set of detected objects with
language labels in the image, along with their label masks
and confidence scores. We back-project the pixels included
in the label mask to world coordinates using our
point cloud. We label each back-projected point in the world
with the associated language label and label confidence score.
Additionally, we label each back-projected point with the CLIP
embedding of the view it was back-projected from, as well as the distance between the camera and the point in that particular view. Note that each point can appear multiple times in the
dataset from different training images.
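The back-projection step is standard pinhole geometry. The sketch below shows one way to lift a depth image into world coordinates given the intrinsics K and an odometry pose, under the usual convention that depth is measured along the camera z-axis; the variable names are ours.

```python
import numpy as np

def backproject_depth(depth: np.ndarray,
                      K: np.ndarray,
                      cam_to_world: np.ndarray) -> np.ndarray:
    """Lift a depth image (H, W), in meters, to an (H*W, 3) world-frame point cloud.

    K:            (3, 3) camera intrinsics.
    cam_to_world: (4, 4) homogeneous camera pose from odometry.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))               # pixel grid
    # Unproject to camera frame: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    points_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (H*W, 4)
    # Transform into world coordinates with the odometry pose.
    points_world = points_cam @ cam_to_world.T
    return points_world[:, :3]
```

Each back-projected point then inherits the label, confidence score, and view-level CLIP embedding of the pixel it came from, as described above.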
Thereby, we get a dataset with two sets of labels from
our collected RGB-D frames and odometry information. One
set of labels captures primarily semantic information, $D_{\text{label}} =$