simple and effective Transformer framework for 3D object
detection. BrT bridges the learning processes of images and
point clouds inside the Transformer. This approach takes the
sampled points and image patches as input. To protect the
self-learning process of each modality, direct attention between point tokens and image patch tokens is blocked, while the two modalities are correlated by object queries throughout the Transformer layers.
To strengthen the correlations of images and points, BrT
is also equipped with powerful bridging designs from two
perspectives. Firstly, we leverage conditional object queries
for images and points that are aware of the learned proposal
points. This design, together with aligned positional embeddings, informs the Transformer that the object queries of images and points are aligned. Secondly, apart from the perspective of object queries, we perform point-to-patch projections to explicitly leverage the spatial relationships of both modalities.
BrT avoids grouping errors owing to its natural ability to capture long-range dependencies and global contextual information, and instead of lifting image features to point clouds only at the beginning layer as in [20], BrT allows feature interactions to propagate fully through the whole network.
As an additional advantage, BrT can be extended to com-
bine point clouds with multi-view images.
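To make the modality separation described above concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a blocked attention mask: point tokens and image patch tokens cannot attend to each other directly, while object query tokens attend to, and are attended by, all tokens. The token counts, embedding dimension, and token ordering are illustrative assumptions.

import torch

def build_bridged_attention_mask(num_points, num_patches, num_queries):
    """Boolean mask of shape (T, T); True marks a pair that may NOT attend.

    Token order is assumed to be [point tokens | image patch tokens | object queries].
    """
    total = num_points + num_patches + num_queries
    mask = torch.zeros(total, total, dtype=torch.bool)
    pts = slice(0, num_points)
    img = slice(num_points, num_points + num_patches)
    # Block direct cross-modal attention between point and patch tokens.
    # Object queries (the trailing block) stay fully connected to both modalities,
    # so all cross-modal correlation flows through the queries.
    mask[pts, img] = True
    mask[img, pts] = True
    return mask

# Usage with a standard PyTorch Transformer layer (sizes are assumptions):
tokens = torch.randn(2, 1024 + 196 + 256, 288)   # (batch, tokens, d_model)
layer = torch.nn.TransformerEncoderLayer(d_model=288, nhead=8, batch_first=True)
out = layer(tokens, src_mask=build_bridged_attention_mask(1024, 196, 256))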
We evaluate BrT on both the SUN RGB-D and ScanNetV2 datasets, where BrT achieves remarkable improvements of 2.4% and 2.2%, respectively, over state-of-the-art methods.
To summarize, the contributions of our work are:
• We propose BrT, a novel framework for 3D object de-
tection that bridges the learning processes of images
and point clouds inside the Transformer.
• We propose to strengthen the correlation of images and
points from two bridging perspectives including condi-
tional object queries and the point-to-patch projection.
• BrT achieves state-of-the-art results on two benchmarks, which demonstrates the superiority of our design and its potential in multi-view scenarios.
2. Related Work
3D detection with point cloud. Processing point clouds with deep neural networks (DNNs) poses unique challenges [3,15,19,21,25,36]; a detailed discussion of these challenges can be found in [9]. Object detection in 3D space aims to localize 3D bounding boxes and recognize object classes. VoxelNet [38] proposes to divide a point cloud into equally spaced 3D voxels, and then transforms the points in each voxel into a unified feature representation. VoteNet [21]
reformulates Hough voting in the context of deep learn-
ing to generate better points for box proposals with group-
ing. Transformers have also been adapted to handle 3D points. 3DETR [19] introduces an end-to-end
Transformer with non-parametric queries and Fourier posi-
tional embeddings. Group-Free [15] adopts the attention
mechanism to learn the point features, which potentially
retains the information of all points to avoid the errors of
previous grouping strategies. Voxel Transformer [17] effec-
tively captures the long-range relationships between voxels.
3D detection with multimodal data. There are a few
works that use deep networks to combine point clouds and
images. MV3D [4] proposes an element-wise fusion of rep-
resentations from different domains, based on the rigid as-
sumption that all objects are on the same spatial plane and
can be pinpointed solely from a top-down view of the point
cloud. PointFusion [34] concatenates point cloud features
and image features at two different levels to learn their cor-
relations, which cannot guarantee the alignment of features. ImVoteNet [20] lifts crafted semantic and texture features to the 3D seed points for fusion. However, ImVoteNet is still negatively affected by grouping errors, and it combines features only at the beginning layer, which leads to highly restricted feature interactions. Different from the aforementioned methods, our BrT fully exploits the feature correlations between images and points, with additional bridging designs to strengthen them.
Transformer for 2D detection. Recently, Transformers have achieved cutting-edge performance in computer vision tasks [2,6,8,14,18,35,39]. For 2D object detection based
on images, DETR [2] enables the Transformer to learn relations between objects and the global image context and to directly output the final set of predictions, removing the need for non-maximum suppression and anchor generation. With the help of pre-training, YOLOS [8] proposes a pure sequence-to-sequence approach that achieves competitive performance for object detection, thereby also addressing the transferability of the Transformer from image recognition to object detection. Deformable DETR [39] is an efficient and fast-converging model whose attention modules attend to only a small set of tokens instead of the whole context. Conditional DETR [18] learns a conditional spatial query to accelerate the training process.
3. Method
In this section, we propose Bridged Transformer (BrT)
for 3D object detection with both vision and point cloud as
input. We describe the overall structure of BrT in Sec. 3.1,
followed by the design of building blocks in Sec. 3.2. We
consider two aspects to bridge the learning processes of vi-
sion and point cloud in Sec. 3.3 and Sec. 3.4, respectively.
3.1. Overall architecture
An overall architecture of our BrT is depicted in Fig. 1.
Suppose we are given N×3 points representing the 3D coordinates, and an H×W×3 image. Here, N is the number