great progress in multimodal understanding tasks, benefiting
from powerful transformer networks and large-
scale pre-training. While it is tempting to leverage existing
frameworks from 2D visual grounding, it is important to
acknowledge that 3D visual grounding presents distinct
challenges, such as irregularity, noise, and missing data in
point clouds, as well as complex spatial relationships in 3D
space. Consequently, developing tailored approaches that
tackle these specific challenges is essential to make progress
in 3D visual grounding.
2.2 Visual Grounding on 3D Point Clouds
Visual grounding on 3D point clouds [1], [3] aims at lo-
calizing the corresponding 3D bounding box of a target
object given query sentences and unorganized point clouds.
Two public benchmark datasets, ScanRefer [1] and ReferIt3D [3],
have been proposed. Both adopt ScanNet [5], a 3D point cloud
dataset of indoor scenes, and augment it with language
annotations. More specifically, ScanRefer follows the
grounding-by-detection paradigm of 2D visual grounding, in
which only raw 3D point clouds and the query sentence are
given. Alternatively, ReferIt3D formulates 3D visual grounding
as a fine-grained identification problem, which assumes that
ground-truth object boxes are known during both training and
inference. In addition, Rel3D [2] focuses on grounding spatial
relations between objects rather than on localization, and
SUNRefer [4] is introduced for visual grounding on RGB-D
images. In this work, we focus on visual grounding on 3D
point clouds only.
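To make the two settings concrete, the following minimal sketch contrasts their inputs and expected outputs; the class and field names are hypothetical and do not correspond to the benchmarks' official interfaces.

# Hypothetical task signatures for the two 3D visual grounding settings;
# shapes and field names are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class ScanReferSample:      # grounding-by-detection setting
    points: np.ndarray      # (N, 6) raw point cloud: xyz + rgb
    query: str              # referring expression
    # expected output: a predicted 3D bounding box of the target object

@dataclass
class ReferIt3DSample:      # fine-grained identification setting
    points: np.ndarray      # (N, 6) raw point cloud: xyz + rgb
    gt_boxes: np.ndarray    # (M, 6) ground-truth boxes of all objects, given as input
    query: str              # referring expression
    # expected output: the index of the box in gt_boxes that matches the query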
To tackle this task, most existing methods [1],
[3], [9], [10], [11], [12], [18] follow a two-stage framework.
In the first stage, 3D object proposals are either taken
directly from the ground truth [3] or extracted by a 3D
object detector [21]. In the second stage, the proposals and
language features are aligned for semantic matching. Recent
work [12], [16], [18], [49] adopts the transformer framework
to model the relationship between proposals and language
via the attention mechanism [24]. Some methods [9], [10],
[11] utilize graph neural networks to
aggregate information. For instance, TGNN and InstanceRe-
fer [9], [10] treat the proposals as nodes and use language
to enhance them. FFL-3DOG [11] characterizes both the
proposals and language in two independent graphs and
fuses them in another graph. Recently, 3DJCG [13] proposes
a unified framework that jointly addresses the 3D captioning
and 3D grounding tasks and consists of shared task-agnostic
modules and task-specific modules. Similarly, D3Net [19]
introduces a unified network that can Detect, Describe and
Discriminate for both dense captioning and visual grounding
in point clouds. 3D-SPS [14] proposes a single-stage referred
point progressive selection method, which progressively
selects keypoints under the guidance of language and directly
locates the target without a separate detection stage. The
work in [17] proposes a multi-view transformer for 3D visual
grounding, which takes additional multi-view features as
input and designs the network to learn a more robust
multimodal representation. BUTD-DETR [15] proposes Bottom
Up Top Down DEtection TRansformers for visual grounding,
exploiting both language guidance and objectness guidance in
2D and 3D.
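As a rough illustration of the two-stage paradigm followed by most of these methods, the sketch below scores detected proposals against language features with a single cross-attention layer; the module name and dimensions are placeholders for exposition and do not reproduce any specific published architecture.

import torch
import torch.nn as nn

class TwoStageGroundingHead(nn.Module):
    # Minimal second-stage matcher: fuse proposal and language features with
    # cross-attention, then score each proposal against the query sentence.
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, proposal_feats, word_feats):
        # proposal_feats: (B, K, dim) features of K detected object proposals
        # word_feats:     (B, L, dim) token features of the query sentence
        fused, _ = self.cross_attn(proposal_feats, word_feats, word_feats)
        return self.score(fused).squeeze(-1)   # (B, K) matching logits

# The proposal with the highest logit is taken as the referred object.
head = TwoStageGroundingHead()
logits = head(torch.randn(2, 32, 256), torch.randn(2, 20, 256))
referred = logits.argmax(dim=-1)               # (B,)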
Overall, we find that existing methods for 3D visual
grounding usually perform visual-linguistic alignment at a
coarse granularity and overlook the spatial context, which
motivates us to design a hierarchical, coarse-to-fine,
multi-granularity point-language alignment framework.
2.3 3D Object Detection
Since end-to-end 3D visual grounding requires accurate
detection, we also revisit related work on 3D object
detection. To apply CNNs to 3D object detection, early works
project the point clouds to the bird's-eye view [50], [51],
[52] or frontal views [53], [54]. Voxel-based methods [55],
[56] voxelize the point clouds and apply 3D and 2D CNNs
sequentially, as in 2D object detection. Recently, a number
of point-based methods [21], [22], [57], [58] devote more
effort to tackling this task directly on raw
point clouds. Most of them group the points into object
candidates by using box proposals [57], [58] or voting [21],
and then extract object features from groups of points for the
subsequent detection. Group-Free [22] drops the grouping
used in voting [21] and instead uses k-Closest Points
Sampling for proposal generation. Meanwhile, it applies a
transformer to model the correlation between proposals and
key points, and it serves as the foundation of this paper.
More recently, RepSurf-U [59] presents representative
surfaces, a point cloud representation that explicitly
captures local structure. It provides a lightweight
plug-and-play backbone module for downstream tasks, including
object detection. Group-Free [22] equipped with RepSurf-U
outperforms prior methods. FCAF3D [60] is a recently proposed
fully convolutional anchor-free framework for object
detection in 3D indoor scenes. Different from voting-based
and transformer-based methods, it provides an effective and
scalable design that processes the voxel representation of
point clouds with sparse convolutions. FCAF3D achieves
state-of-the-art 3D object detection results on three
large-scale indoor 3D point cloud benchmark datasets [5],
[61], [62]. In this paper, we adopt a point-based detection
framework, but we believe that a voxel-based design is also a
strong baseline, which we leave to future work.
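For intuition on how such a transformer-based detector correlates object candidates with the scene, the snippet below refines a set of candidate features by self-attention followed by cross-attention to all point features, in the spirit of Group-Free [22]; it is a simplified sketch with assumed dimensions, not the published architecture.

import torch
import torch.nn as nn

class CandidateRefinementLayer(nn.Module):
    # One decoder-style layer: object candidates attend to each other and
    # then to the full set of per-point features from the backbone.
    def __init__(self, dim: int = 288, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, candidates, point_feats):
        # candidates:  (B, K, dim) features of sampled object candidates
        # point_feats: (B, N, dim) per-point features from the backbone
        x = self.norm1(candidates + self.self_attn(candidates, candidates, candidates)[0])
        x = self.norm2(x + self.cross_attn(x, point_feats, point_feats)[0])
        return self.norm3(x + self.ffn(x))     # refined candidates for the box prediction heads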
2.4 Vision-Language Transformer
Transformer-based networks demonstrate extraordinary ca-
pability in computer vision [63], [64], [65] and natural lan-
guage processing [24], [66], [67], and further bridge the gap
between them in the multimodal domain [31], [35], [36], [48],
[68], which also motivates us to adopt a transformer-based
framework for our work. Some works [48], [68] conduct
large-scale pre-training with vision-language transformers.
For example, CLIP [48] introduces a transformer-based con-
trastive language-image pre-training method for efficiently
learning visual concepts from natural language supervi-
sion. Then TCL [68] proposes triple contrastive learning for
vision-language representation pre-training. Besides regular
cross-modal alignment (CMA), TCL introduces an intra-
modal contrastive (IMC) objective to provide complemen-
tary benefits in representation learning. All the network
modules, including the vision encoder, text encoder, and
fusion encoder, are transformer-based designs. Pre-training