great progress in multimodal understanding tasks, benefiting
from powerful transformer networks and large-
scale pre-training. While it is tempting to leverage existing
frameworks from 2D visual grounding, it is important to
acknowledge that 3D visual grounding presents distinct
challenges, such as irregularity, noise, and missing data in
point clouds, as well as complex spatial relationships in 3D
space. Consequently, developing tailored approaches that
tackle these specific challenges is essential to make progress
in 3D visual grounding.
2.2 Visual Grounding on 3D Point Clouds
Visual grounding on 3D point clouds [1], [3] aims at lo-
calizing the corresponding 3D bounding box of a target
object given query sentences and unorganized point clouds.
Two public benchmark datasets, ScanRefer [1] and ReferIt3D [3],
have been proposed. Both adopt ScanNet [5], a 3D point cloud
dataset of indoor scenes, and augment it with language
annotations. More specifically, ScanRefer follows the
grounding-by-detection paradigm of 2D visual grounding, in
which only raw 3D point clouds and the query sentence are
given. Alternatively, ReferIt3D formulates 3D visual grounding
as a fine-grained identification problem, which assumes that
ground-truth object boxes are known during both training and
inference. In addition, Rel3D [2] focuses on grounding spatial
relations between objects rather than on localization, and
SUNRefer [4] is introduced for visual grounding on RGB-D
images. In this work, we focus on visual grounding on 3D
point clouds only.
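To make the two settings concrete, the following minimal sketch contrasts their inputs and expected outputs; the class and field names are hypothetical and do not correspond to the benchmarks' official interfaces.

# Hypothetical task signatures for the two 3D visual grounding settings;
# shapes and field names are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class ScanReferSample:      # grounding-by-detection setting
    points: np.ndarray      # (N, 6) raw point cloud: xyz + rgb
    query: str              # referring expression
    # expected output: a predicted 3D bounding box of the target object

@dataclass
class ReferIt3DSample:      # fine-grained identification setting
    points: np.ndarray      # (N, 6) raw point cloud: xyz + rgb
    gt_boxes: np.ndarray    # (M, 6) ground-truth boxes of all objects, given as input
    query: str              # referring expression
    # expected output: the index of the box in gt_boxes that matches the query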
To tackle this task, most existing methods [1],
[3], [9], [10], [11], [12], [18] follow a two-stage framework.
In the first stage, 3D object proposals are either taken
directly from the ground truth [3] or extracted by a 3D
object detector [21]. In the second stage, the proposals and
language features are aligned for semantic matching. Recent
work [12], [16], [18], [49] adopts the transformer framework
to model the relationship between proposals and language
via the attention mechanism [24]. Some methods [9], [10],
[11] utilize graph neural networks to
aggregate information. For instance, TGNN and InstanceRe-
fer [9], [10] treat the proposals as nodes and use language
to enhance them. FFL-3DOG [11] characterizes both the
proposals and language in two independent graphs and
fuses them in another graph. Recently, 3DJCG [13] proposes
a unified framework that jointly addresses the 3D captioning
and 3D grounding tasks and consists of shared task-agnostic
modules and task-specific modules. Similarly, D3Net [19]
introduces a unified network that can Detect, Describe and
Discriminate for both dense captioning and visual grounding
in point clouds. 3D-SPS [14] proposes a single-stage referred
point progressive selection method, which progressively
selects keypoints under the guidance of language and directly
locates the target without a separate detection stage. The
work in [17] proposes a multi-view transformer for 3D visual
grounding, which takes additional multi-view features as
input and designs the network to learn a more robust
multimodal representation. BUTD-DETR [15] proposes Bottom
Up Top Down DEtection TRansformers for visual grounding,
exploiting both language guidance and objectness guidance in
2D and 3D.
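As a rough illustration of the two-stage paradigm followed by most of these methods, the sketch below scores detected proposals against language features with a single cross-attention layer; the module name and dimensions are placeholders for exposition and do not reproduce any specific published architecture.

import torch
import torch.nn as nn

class TwoStageGroundingHead(nn.Module):
    # Minimal second-stage matcher: fuse proposal and language features with
    # cross-attention, then score each proposal against the query sentence.
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, proposal_feats, word_feats):
        # proposal_feats: (B, K, dim) features of K detected object proposals
        # word_feats:     (B, L, dim) token features of the query sentence
        fused, _ = self.cross_attn(proposal_feats, word_feats, word_feats)
        return self.score(fused).squeeze(-1)   # (B, K) matching logits

# The proposal with the highest logit is taken as the referred object.
head = TwoStageGroundingHead()
logits = head(torch.randn(2, 32, 256), torch.randn(2, 20, 256))
referred = logits.argmax(dim=-1)               # (B,)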
Overall, we find that existing methods for 3D visual
grounding usually perform visual-linguistic alignment at a
coarse granularity and overlook the spatial context, which
motivates us to design a hierarchical, coarse-to-fine,
multi-granularity point-language alignment framework.
2.3 3D Object Detection
Since end-to-end 3D visual grounding requires accurate
detection, we also revisit related work on 3D object
detection. To apply CNNs to 3D object detection, early works
project the point clouds to the bird's-eye view [50], [51],
[52] or frontal views [53], [54]. Voxel-based methods [55],
[56] voxelize the point clouds and apply 3D and 2D CNNs
sequentially, as in 2D object detection. Recently, a number
of point-based methods [21], [22], [57], [58] devote more
effort to tackling this task directly on raw
point clouds. Most of them group the points into object
candidates by using box proposals [57], [58] or voting [21],
and then extract object features from groups of points for the
subsequent detection. Group-Free [22] drops the grouping
used in voting [21] and instead uses k-Closest Points
Sampling for proposal generation. Meanwhile, it applies a
transformer to model the correlation between proposals and
key points, and it serves as the foundation of this paper.
More recently, RepSurf-U [59] presents representative
surfaces, a point cloud representation that explicitly
captures local structure. It provides a lightweight
plug-and-play backbone module for downstream tasks, including
object detection. Group-Free [22] equipped with RepSurf-U
outperforms prior methods. FCAF3D [60] is a recently proposed
fully convolutional anchor-free framework for object
detection in 3D indoor scenes. Different from voting-based
and transformer-based methods, it provides an effective and
scalable design that processes the voxel representation of
point clouds with sparse convolutions. FCAF3D achieves
state-of-the-art 3D object detection results on three
large-scale indoor 3D point cloud benchmark datasets [5],
[61], [62]. In this paper, we adopt a point-based detection
framework, but we believe that a voxel-based design is also a
strong baseline, which we leave to future work.
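For intuition on how such a transformer-based detector correlates object candidates with the scene, the snippet below refines a set of candidate features by self-attention followed by cross-attention to all point features, in the spirit of Group-Free [22]; it is a simplified sketch with assumed dimensions, not the published architecture.

import torch
import torch.nn as nn

class CandidateRefinementLayer(nn.Module):
    # One decoder-style layer: object candidates attend to each other and
    # then to the full set of per-point features from the backbone.
    def __init__(self, dim: int = 288, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, candidates, point_feats):
        # candidates:  (B, K, dim) features of sampled object candidates
        # point_feats: (B, N, dim) per-point features from the backbone
        x = self.norm1(candidates + self.self_attn(candidates, candidates, candidates)[0])
        x = self.norm2(x + self.cross_attn(x, point_feats, point_feats)[0])
        return self.norm3(x + self.ffn(x))     # refined candidates for the box prediction heads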
2.4 Vision-Language Transformer
Transformer-based networks demonstrate extraordinary ca-
pability in computer vision [63], [64], [65] and natural lan-
guage processing [24], [66], [67], and further bridge the gap
between them in the multimodal domain [31], [35], [36], [48],
[68], which also motivates us to adopt a transformer-based
framework for our work. Some works [48], [68] conduct
large-scale pre-training with vision-language transformers.
For example, CLIP [48] introduces a transformer-based con-
trastive language-image pre-training method for efficiently
learning visual concepts from natural language supervi-
sion. Then TCL [68] proposes triple contrastive learning for
vision-language representation pre-training. Besides regular
cross-modal alignment (CMA), TCL introduces an intra-
modal contrastive (IMC) objective to provide complemen-
tary benefits in representation learning. All the network
modules, including the vision encoder, text encoder, and
fusion encoder, are transformer-based designs. Pre-training