Learning Point-Language Hierarchical Alignment
for 3D Visual Grounding
Jiaming Chen†, Weixin Luo†, Ran Song, Xiaolin Wei, Lin Ma§, and Wei Zhang§
[Fig. 1 image: eight ScanNet scenes, each paired with a free-form query and the grounded target box, e.g., "The printer is on a white table. It is to the left of the bed.", "A red armchair. In front of it is a white cabinet.", and "There is a chair to the right of a bed. It is between the bed and another chair."]
Fig. 1: Demonstration of the proposed HAM framework on the ScanRefer benchmark. This example demonstrates
HAM’s ability to comprehend spatial relationships through free-form language and accurately localize targets in
irregular point clouds.
Abstract—3D visual grounding localizes target objects in point clouds using natural language. While recent studies have made substantial progress by using Transformers to align visual and linguistic information, most of them rely on coarse-grained attention mechanisms, which may limit their ability to comprehend lengthy and intricate language as well as complex spatial relationships. This paper presents a novel hierarchical alignment model (HAM) that learns multi-granularity visual and linguistic representations in an end-to-end manner. We extract key points and proposal points to model 3D contexts and instances, and propose a point-language alignment with context modulation (PLACM) mechanism, which learns to gradually align word-level and sentence-level linguistic embeddings with visual representations, while the modulation with the visual context captures latent informative relationships. To further capture both global and local relationships, we propose a spatially multi-granular modeling scheme that applies PLACM to both global and local fields. Experimental results demonstrate the superiority of HAM, with visualized results showing that it can dynamically model fine-grained visual and linguistic representations. HAM outperforms existing methods by a significant margin, achieves state-of-the-art performance on two publicly available datasets, and won the championship in the ECCV 2022 ScanRefer challenge. Code is available at https://github.com/PPjmchen/HAM.
Index Terms—Visual Grounding, Point Clouds, Transformer, Vision-Language
1 INTRODUCTION
3D object localization on point clouds using natural language, namely 3D visual grounding, has been a prevailing research topic in multimodal 3D understanding [1], [2], [3], [4] since the pioneering ScanRefer [1] and ReferIt3D [3] benchmarks were built on subsets of ScanNet [5]. Compared to the well-researched field of 2D visual grounding in images [6], [7], [8], 3D visual grounding poses two significant challenges. As shown in
Fig. 2, first, 3D visual grounding requires a profound understanding of the intricate spatial relationships between 3D objects, exemplified by the highlighted blue phrases "above the sink" and "to its right with a towel in the handle". Second, 3D visual grounding requires a comprehension of complex language queries, which may include such specifics as the observer's perspective, as in the green-highlighted phrase "when you are facing the kitchen sink". Additionally, issues such as sensor noise, occlusion, missing data in point clouds, visual ambiguity caused by numerous similar pieces of furniture, and other factors make it more challenging to develop accurate and robust algorithms for 3D visual grounding than for its 2D counterpart.

J. Chen and W. Luo are the co-first authors. L. Ma and W. Zhang are the corresponding authors. J. Chen, R. Song, and W. Zhang are with the School of Control Science and Engineering, Shandong University, China. E-mail: ppjmchen@gmail.com, davidzhang@sdu.edu.cn. W. Luo, X. Wei, and L. Ma are with Meituan, China. E-mail: forest.linma@gmail.com.
[Fig. 2 image: (a) a 2D visual grounding example with the query "A set of kitchen cabinets"; (b) a 3D visual grounding example with the query "These brown cabinets can be seen when you are facing the kitchen sink. There is a shorter brown cabinet above the sink and a larger one to its right with a towel in the handle."]
Fig. 2: Visualization of 2D and 3D visual grounding examples. While 2D visual grounding is typically performed on images, 3D visual grounding is more challenging, requiring a deeper understanding of intricate spatial relationships as well as the accompanying lengthy and complex language.

To tackle the two challenges, several state-of-the-art approaches [9], [10], [11], [12], [13], [14], [15], [16], [17],
[18], [19] have been proposed, leveraging 3D point cloud abstraction networks [20], 3D object detectors [21], [22], [23], and powerful transformer networks [24] for multimodal alignment. However, these methods often suffer from two common issues. First, they often perform multimodal learning at a coarse granularity for both language and point clouds, resulting in an insufficient understanding of lengthy and intricate language. Second, they tend to overlook the spatial context, which may lead to an incomplete grasp of complex spatial relationships. This limitation can hinder the model's ability to fully comprehend the contextual information within a scene, and addressing it is essential for enhancing visual grounding performance in 3D environments.
This paper proposes an end-to-end hierarchical alignment model (HAM) with multi-granularity representation learning to address the aforementioned issues. We use learnable point abstraction to gradually down-sample raw point clouds into a set of key points and further select a smaller set of proposal points to model 3D contexts and instances. The core of our method is the point-language alignment with context modulation (PLACM) mechanism, which builds upon the popular query-key-value attention [24]. This module uses proposals as queries and combines context representations with language embeddings at both word and sentence levels to conduct hierarchical, multi-granularity attention. To capture both global and local relationships, we further propose a spatially multi-granular modeling (SMGM) scheme that applies PLACM to both global and local fields. We pre-formulate the proposal points and key points via space partition and group them regionally to obtain spatially multi-granular representations.
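To make the attention flow concrete, the snippet below is a minimal PyTorch-style sketch of a PLACM-like block under assumed shapes and module choices: proposal features act as queries, keys and values concatenate key-point context features with word-level embeddings, and a sentence-level embedding gates the context. The class name, the sigmoid gating, and the feature dimension of 256 are illustrative assumptions, not the exact design described in Section 3.

# Minimal sketch of a PLACM-style attention block (illustrative, not the
# authors' implementation). Proposals are queries; keys/values concatenate
# sentence-modulated key-point context with word-level language embeddings.
import torch
import torch.nn as nn


class PLACMBlockSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.sent_gate = nn.Linear(dim, dim)   # sentence-level modulation of the context
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, proposals, context, words, sentence):
        # proposals: (B, P, C) proposal-point features used as queries
        # context:   (B, K, C) key-point context features
        # words:     (B, W, C) word-level language embeddings
        # sentence:  (B, C)    sentence-level language embedding
        gate = torch.sigmoid(self.sent_gate(sentence)).unsqueeze(1)  # (B, 1, C)
        modulated_context = context * gate                           # sentence-conditioned context
        kv = torch.cat([modulated_context, words], dim=1)            # fuse visual context and words
        attn_out, _ = self.cross_attn(proposals, kv, kv)
        x = self.norm1(proposals + attn_out)
        return self.norm2(x + self.ffn(x))


if __name__ == "__main__":
    blk = PLACMBlockSketch()
    out = blk(torch.randn(2, 64, 256),   # 64 proposal points
              torch.randn(2, 512, 256),  # 512 key points
              torch.randn(2, 30, 256),   # 30 word tokens
              torch.randn(2, 256))       # sentence embedding
    print(out.shape)                     # torch.Size([2, 64, 256])

Under SMGM, a block of this kind would be applied once over the whole scene and again within spatially partitioned regions of proposal and key points, yielding representations at multiple spatial granularities.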
Given the success of prompt engineering in NLP and other multimodal domains [6], [25], [26], [27], we hypothesize that such techniques are valuable tools for 3D visual grounding models, which rely heavily on natural language to identify and locate objects in point clouds. In this work, we extensively investigate existing prompt engineering techniques, which involve various strategies for modifying and reorganizing input text, and systematically incorporate them into our HAM framework to provide a diverse training set and enhance the efficiency and effectiveness of the training.
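As a rough illustration of what such text-level strategies can look like, the toy functions below apply sentence reordering and template prefixing to a query before tokenization. These specific operations are hypothetical examples for illustration only; the three cumulative strategies actually used in HAM are specified later in the paper.

# Hypothetical text-level prompt operations (illustrative only; not the
# concrete strategies adopted in HAM).
import random


def reorder_sentences(query: str) -> str:
    """Shuffle sentence order to diversify descriptions of the same target."""
    sentences = [s.strip() for s in query.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."


def add_template_prefix(query: str, template: str = "find the object: ") -> str:
    """Prepend a fixed instruction template to the free-form query."""
    return template + query


query = "The printer is on a white table. It is to the left of the bed."
print(add_template_prefix(reorder_sentences(query)))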
We evaluate the proposed HAM framework on two publicly available datasets [1], [3], and show that it performs excellently in target identification and localization. Furthermore, our approach won the championship in the ECCV 2022 ScanRefer Challenge (https://kaldir.vc.in.tum.de/scanrefer_benchmark). We also provide extensive ablation experiments that show the effectiveness of the proposed modules, and present various visualization results illustrating the discriminative visual and linguistic representations learned by our approach.
The main contributions of our work are summarized as follows:
• We design a hierarchical alignment model (HAM) for 3D visual grounding. It contains a point-language alignment with context modulation (PLACM) mechanism, which learns hierarchical alignment and extracts informative representations for both vision and language. A spatially multi-granular modeling (SMGM) strategy is conducted to extend PLACM to multi-granular spatial fields.
• We systematically analyze and incorporate three cumulative prompt engineering strategies, which enhance the performance and robustness of HAM and accelerate the training process.
• Comprehensive experiments on two public datasets demonstrate the improvement brought by the proposed HAM. Moreover, our approach won the championship in the ECCV 2022 ScanRefer Challenge.
We organize the rest of the manuscript as follows. Section 2 reviews related work, including 2D/3D visual grounding and 3D object detection. Section 3 introduces the proposed HAM and describes each module in detail. Section 4 presents experiments on two publicly available datasets with extensive ablation studies to validate the effectiveness of HAM. We summarize the paper in Section 5.
2 RELATED WORK
Computer vision and natural language processing, two crucial sub-fields of AI, have made remarkable progress recently. Linking the two modalities to realize multimodal artificial intelligence is becoming a new research trend, and a broad range of related tasks has emerged. Most vision-language tasks fall into several main research fields, e.g., vision-language understanding [28], [29], [30], [31], [32], generation [32], [33], [34], [35], [36], and vision-language and robotics [37], [38]. This work focuses on the emerging and challenging 3D visual grounding task, and we introduce the related work in detail below.
2.1 Visual Grounding on 2D Images
Visual grounding was first proposed on 2D images, aiming to localize the target object in an image given a sentence query [39], [40], [41], [42]. Most existing approaches adopt two-stage [8], [43], [44] or one-stage [45], [46], [47] frameworks, which are widely used in object detection. Recently, a set of methods [6], [48] has made great progress in multimodal understanding tasks, benefiting from powerful transformer networks and large-scale pre-training. While it is tempting to leverage existing frameworks from 2D visual grounding, it is important to acknowledge that 3D visual grounding presents distinct challenges, such as irregularity, noise, and missing data in point clouds, as well as complex spatial relationships in 3D space. Consequently, developing tailored approaches that tackle these specific challenges is essential to making progress in 3D visual grounding.
2.2 Visual Grounding on 3D Point Clouds
Visual grounding on 3D point clouds [1], [3] aims at localizing the 3D bounding box of a target object given a query sentence and unorganized point clouds. Two public benchmark datasets, ScanRefer [1] and ReferIt3D [3], have been proposed. Both adopt ScanNet [5], an indoor-scene 3D point cloud dataset, and augment it with language annotations. More specifically, ScanRefer follows the grounding-by-detection paradigm of 2D visual grounding, in which only raw 3D point clouds and the query sentence are given. Alternatively, ReferIt3D formulates 3D visual grounding as a fine-grained identification problem, which assumes the ground-truth object boxes are known during both training and inference. In addition, Rel3D [2] focuses on grounding spatial relations between objects rather than localization, and SUNRefer [4] is introduced for visual grounding on RGB-D images. In this work, we focus on visual grounding on 3D point clouds only.
To tackle this task, most of the existing methods [1], [3], [9], [10], [11], [12], [18] follow the two-stage framework. In the first stage, 3D object proposals are directly generated from the ground truth [3] or extracted by a 3D object detector [21]. In the second stage, the proposals and language features are aligned for semantic matching. Recent work [12], [16], [18], [49] adopts the popular transformer framework, powered by the attention mechanism [24], to model the relationship between proposals and language. Some methods [9], [10], [11] utilize graph neural networks to aggregate information. For instance, TGNN and InstanceRefer [9], [10] treat the proposals as nodes and use language to enhance them, while FFL-3DOG [11] characterizes both the proposals and language in two independent graphs and fuses them in another graph. Recently, 3DJCG [13] proposes a unified framework that jointly addresses the 3D captioning and 3D grounding tasks, consisting of shared task-agnostic modules and task-specific modules. Similarly, D3Net [19] introduces a unified network that can Detect, Describe and Discriminate for both dense captioning and visual grounding in point clouds. 3D-SPS [14] proposes a single-stage referred point progressive selection method, which progressively selects key points under the guidance of language and directly locates the target in a single-stage framework. [17] proposes a multi-view transformer for 3D visual grounding, which takes additional multi-view features as inputs and designs the network to learn a more robust multimodal representation. BUTD-DETR [15] proposes Bottom Up Top Down DEtection TRansformers for visual grounding using language guidance and objectness guidance in both 2D and 3D.
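To make the shared recipe explicit, the following is a schematic sketch of the grounding-by-detection paradigm described above, with toy stand-ins for the detector, text encoder, and proposal-language matcher; none of the module names correspond to a specific published method.

# Schematic two-stage grounding-by-detection pipeline with toy stand-in modules.
import torch
import torch.nn as nn


class ToyTwoStageGrounder(nn.Module):
    def __init__(self, dim: int = 128, num_proposals: int = 32):
        super().__init__()
        self.num_proposals = num_proposals
        # Stage 1 stand-in: map the point cloud to proposal boxes and features.
        self.point_encoder = nn.Linear(3, dim)
        self.box_head = nn.Linear(dim, 6)              # (cx, cy, cz, dx, dy, dz)
        # Stage 2 stand-in: encode the query and score proposals against it.
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)
        self.matcher = nn.Bilinear(dim, dim, 1)

    def forward(self, points, word_embs):
        feats = self.point_encoder(points)                   # (B, N, C)
        prop_feats = feats[:, : self.num_proposals]          # crude "proposals": first P points
        boxes = self.box_head(prop_feats)                    # (B, P, 6)
        _, h = self.text_encoder(word_embs)                  # (1, B, C)
        lang = h[-1].unsqueeze(1).expand_as(prop_feats)      # broadcast query feature to proposals
        scores = self.matcher(prop_feats, lang).squeeze(-1)  # (B, P) proposal-query matching scores
        best = scores.argmax(dim=-1)
        return boxes[torch.arange(boxes.size(0)), best]      # predicted target box per sample


if __name__ == "__main__":
    model = ToyTwoStageGrounder()
    box = model(torch.randn(2, 1024, 3), torch.randn(2, 20, 128))
    print(box.shape)  # torch.Size([2, 6])

The methods surveyed above differ mainly in how the two stages are instantiated and coupled, e.g., graph-based aggregation versus transformer attention in the matching step.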
Overall, we find that existing methods for 3D visual grounding usually perform visual-linguistic alignment at a coarse granularity and overlook the spatial context, which inspired this work to design a hierarchical, coarse-to-fine, multi-granularity point-language alignment framework.
2.3 3D Object Detection
Since we work on end-to-end 3D visual grounding, which requires accurate detection, we also revisit the related work on 3D object detection. To apply CNNs to 3D object detection, early works project the point clouds to bird's-eye views [50], [51], [52] or frontal views [53], [54]. Voxel-based methods [55], [56] conduct voxelization on point clouds and then apply 3D and 2D CNNs sequentially, as in 2D object detection. Recently, a number of point-based methods [21], [22], [57], [58] have tackled this task directly on raw point clouds. Most of them group the points into object candidates by using box proposals [57], [58] or voting [21], and then extract object features from groups of points for the subsequent detection. Group-Free [22] drops the grouping used in voting [21] and instead uses k-closest point sampling for proposal generation. At the same time, it applies a transformer to model the correlation between proposals and key points, which serves as the foundation of this paper.
More recently, RepSurf-U [59] presents representative surfaces, a point cloud representation that explicitly captures local structure, and provides a lightweight plug-and-play backbone module for downstream tasks including object detection. Group-Free [22] equipped with RepSurf-U outperforms prior methods. FCAF3D [60] is a recently proposed fully convolutional anchor-free framework for object detection in 3D indoor scenes. Different from voting-based and transformer-based methods, it provides an effective and scalable design that manipulates the voxel representation of point clouds with sparse convolutions. FCAF3D achieves state-of-the-art 3D object detection results on three large-scale indoor 3D point cloud benchmarks [5], [61], [62]. In this paper, we adopt a point-based detection framework, but we believe that a voxel-based design would also be a strong baseline, which we leave to future work.
2.4 Vision-Language Transformer
Transformer-based networks demonstrate extraordinary capability in computer vision [63], [64], [65] and natural language processing [24], [66], [67], and further bridge the gap between them in the multimodality field [31], [35], [36], [48], [68], which also motivates us to utilize a transformer-based framework for our work. Some works [48], [68] conduct large-scale pre-training using a vision-language transformer. For example, CLIP [48] introduces a transformer-based contrastive language-image pre-training method for efficiently learning visual concepts from natural language supervision. Then TCL [68] proposes triple contrastive learning for vision-language representation pre-training. Besides regular cross-modal alignment (CMA), TCL introduces an intra-modal contrastive (IMC) objective to provide complementary benefits in representation learning. All the network modules, including the vision encoder, text encoder, and fusion encoder, are transformer-based designs. Pre-training