can be entirely self-supervised: the full pipeline, including
training the underlying image models, need not use any
explicit supervision. On the other hand, as we show in our
experiments, even without any explicit supervision, the spatial
database can naturally capture scene-specific information.
We demonstrate our method on tasks such as instance
segmentation and identification. Furthermore, we give qualitative
examples of image-view localization, where we need to find
the spatial coordinates corresponding to an image, and of
localizing text descriptions in space. Finally, we demonstrate
CLIP-Fields on a real robot by having the robot move to look
at various objects in 3D given natural language commands.
These experiments show how CLIP-Fields could be used to
power a range of real-world applications by capturing rich 3D
semantic information in an accessible way.
II. RELATED WORK
Vision-Language Navigation. Much recent progress on
vision-language navigation problems such as ALFRED [25] or
RXR [16] has used spatial representations or structured mem-
ory as a key component to solving the problem [19, 2, 33, 10].
HLSM [2] and FiLM [19] are built as the agent moves
through the environment, and rely on a fixed set of classes
and a discretization of the world that is inherently limiting.
By contrast, CLIP-Fields creates an embedding-dependent
implicit representation of a scene, removing dependency on a
fixed set of labels and hyperparameters related to environment
discretization. Other representations [33] do not allow for 3D
spatial queries, or rely on dense annotations or accurate object
detection and segmentation [10, 5, 1].
Concurrently with our work, NLMap-SayCan [4] and
VLMaps [13] proposed two approaches for real-world vision-
language navigation. NLMap-SayCan uses a 2D grid-based
map and a discrete set of objects predicted by a region-
proposal network [4], while CLIP-Fields can make predictions
at different granularities. VLMaps [13] uses a 2D grid-based
representation and operates on a specific, pre-selected set of
object classes. By contrast, CLIP-Fields can operate on 3D
data, allowing the agent to look up or down to find objects.
All three methods assume the environment has been explored,
but both [4] and [13] look at predicting action sequences,
while we focus on the problem of building an open-vocabulary,
queryable 3D scene representation.
Pretrained Representations. Effective use of pretrained
representations like CLIP [22] seems crucial to deploying
robots with semantic knowledge in the real world. Recent
works have shown that it is possible to use weakly-supervised
web image data for self-supervised learning of spatial representa-
tions. Our work is closely related to [3], where the authors
show that a web-trained detection model, along with spatial
consistency heuristics, can be used to annotate a 3D voxel
map. That voxel map can then be used to propagate labels
from one image to another. Other works, for example [8], use
models specifically trained on indoor semantic segmentation
to build semantic scene data structures.
Cohen et al. [7] looks at personalizing CLIP for specific
users and rare queries, but does not build 3D spatial representa-
tions conducive to robotics applications, and instead functions
on the level of individual images.
Implicit Representations. There is a recent trend towards
using NeRF-inspired representations as the spatial knowledge
base for robotic manipulation problems [27, 9], but so far this
has not been applied to open-vocabulary object search. As in
[36, 29, 32, 14, 31], we use a mapping (parameterized by a
neural network) that associates to an (x, y, z) point in space a
vector with semantic information. In those works, the labels
are given as explicit (but perhaps sparse) human annotations,
whereas, in this work, the annotations for the semantic vectors
are derived from weakly-supervised web image data.
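To make this concrete, below is a minimal, illustrative sketch of such a mapping (the Fourier positional encoding, layer sizes, and output dimension are assumptions for exposition, not the exact CLIP-Fields architecture): a small MLP that takes an (x, y, z) coordinate and outputs a semantic vector.

import torch
import torch.nn as nn

class SemanticField(nn.Module):
    """Map an (x, y, z) point in space to a semantic embedding vector."""

    def __init__(self, num_freqs=6, hidden=256, embed_dim=512):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs                # sin/cos Fourier features per axis
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),         # semantic vector (e.g. CLIP-sized)
        )

    def encode(self, xyz):
        # Fourier positional encoding of the raw coordinates.
        freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32, device=xyz.device)
        angles = xyz[..., None] * freqs           # (..., 3, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return feats.flatten(start_dim=-2)        # (..., 3 * 2 * num_freqs)

    def forward(self, xyz):
        return self.mlp(self.encode(xyz))

field = SemanticField()
points = torch.rand(1024, 3)                      # sampled (x, y, z) scene coordinates
semantic_vectors = field(points)                  # (1024, 512)

In practice, the output dimension is chosen to match the embedding space of the pretrained model whose features supervise the field, so that queried points can be compared directly against text or image embeddings.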
Language-based Robotics. Several works [26,30] have
shown how features from weakly-supervised web-image
trained models like CLIP [22] can be used for robotic scene
understanding. Most closely related to this work is [12], which
uses CLIP embeddings to label points in a single-view 3D
space via back-projection. In that work, text descriptions are
associated with locations in space in a two-step process. In
the first step, using a ViT-CLIP attention-based relevancy
extractor, a given text description is localized to a region of
an image; that region is then back-projected to locations in
space (via depth information). In the second step, a separately
trained model decoupled from the semantics converts the back-
projected points into an occupancy map. In contrast, in our
work, CLIP embeddings are used to directly train an implicit
map that outputs a semantic vector corresponding to each
point in space. One notable consequence is that our approach
integrates semantic information from multiple views into the
spatial memory; for example, in Figure 6 we see that more
views of the scene lead to better zero-shot detections.
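For concreteness, the back-projection step can be sketched as follows (a minimal illustration assuming a pinhole camera with known intrinsics fx, fy, cx, cy; this is not the exact pipeline of [12]):

import numpy as np

def back_project(pixels_uv, depth, fx, fy, cx, cy):
    """Lift pixel coordinates with depth values to 3D points in the camera frame.

    pixels_uv: (N, 2) array of (u, v) pixel coordinates
    depth:     (N,) array of depth values (in meters) at those pixels
    Returns an (N, 3) array of (x, y, z) camera-frame points.
    """
    u, v = pixels_uv[:, 0], pixels_uv[:, 1]
    x = (u - cx) * depth / fx      # inverse of the pinhole projection u = fx * x / z + cx
    y = (v - cy) * depth / fy      # inverse of the pinhole projection v = fy * y / z + cy
    return np.stack([x, y, depth], axis=-1)

The resulting camera-frame points can then be transformed into the world frame with the camera pose, which is how per-pixel semantic features become associated with locations in space.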
III. BACKGROUND
In this section, we provide descriptions of the recent ad-
vances in machine learning that make CLIP-Fields possible.
a) Contrastive Image-Language Pretraining: This pre-
training method, colloquially known as CLIP [22], is based
on training a pair of image and language embedding networks
such that an image and text strings describing that image
have similar embeddings. The CLIP model in [22] is trained
with a large corpus of paired image and text captions with
a contrastive loss objective predicting which caption goes
with which image. The resulting pair of models is able to
embed images and text into the same latent space, with a
meaningful cosine similarity metric between the embeddings.
We use CLIP models and embeddings heavily in this work
because they can work as a shared representation between an
object’s visual features and its possible language labels.
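The training objective itself can be illustrated with a short sketch of the symmetric contrastive loss (the temperature and embedding sizes below are placeholders, not the settings used in [22]):

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors; row i of each forms a matching pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)        # unit-norm, so dot products
    text_emb = F.normalize(text_emb, dim=-1)          # are cosine similarities
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)    # match each image to its caption
    loss_texts = F.cross_entropy(logits.t(), targets) # match each caption to its image
    return 0.5 * (loss_images + loss_texts)

# Toy usage; in practice the inputs come from the CLIP image and text encoders.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))

At query time, the same normalized dot product serves as the similarity score between an embedded image (or scene feature) and an arbitrary text query.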
b) Open-label Object Detection and Image Segmenta-
tion: Traditionally, the objective of object detection and semantic
segmentation tasks has been to assign a label to each
detected object or pixel. Generally, these labels are chosen
from a set of predefined labels fixed during training or fine-tuning.
Recently, the advent of open-label models has taken