Bridged Transformer for Vision and Point Cloud 3D Object Detection
Yikai Wang1, TengQi Ye2, Lele Cao1, Wenbing Huang3,
Fuchun Sun1B, Fengxiang He4, Dacheng Tao4
1 Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University
2 ByteDance Inc.
3 Institute for AI Industry Research (AIR), Tsinghua University
4 JD Explore Academy, JD.com
B Corresponding author: Fuchun Sun.
wangyk17@mails.tsinghua.edu.cn, yetengqi@gmail.com, caolele@gmail.com, hwenbing@126.com, fuchuns@tsinghua.edu.cn, fengxiang.f.he@gmail.com, dacheng.tao@gmail.com
Abstract
3D object detection is a crucial research topic in computer vision, and it usually takes 3D point clouds as input in conventional setups. Recently, there has been a trend of leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images, which often have richer color and less noise. However, the heterogeneous geometry of the 2D and 3D representations prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective; it learns to identify 3D and 2D object bounding boxes from both points and image patches. A key element of BrT lies in the utilization of object queries for bridging the 3D and 2D spaces, which unifies different sources of data representations in the Transformer. We adopt a form of feature aggregation realized by point-to-patch projections, which further strengthens the correlations between images and points. Moreover, BrT works seamlessly for fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on the SUN RGB-D and ScanNetV2 datasets.
1. Introduction
3D object detection, which aims at identifying and locating objects in 3D scenes, is drawing increasing attention and serves as a fundamental task for scene understanding. Many successful attempts [3,15,21,25] have been made using point cloud data as input. These attempts include converting the points to regular formats (e.g., 3D voxel grids [32], polygon meshes [11], multi-views [29]), or using 3D-specific operators (e.g., symmetric functions [23], voting [21]) to design grouping strategies for points. In addition, since Transformers are naturally permutation invariant and capable of capturing large-scale data correlations, they have lately been applied to 3D object detection and demonstrate superior performance [15,19]. Besides handling point cloud learning tasks, Transformers have swept across various 2D tasks, e.g., image classification [6,14], object detection [2,8,39], and semantic segmentation [33,37].
Deep multimodal learning, which leverages the advantages of multiple modalities, has shown its superiority in various applications [1,31]. Despite the success of Transformers in 2D and 3D single-modal object detection tasks, attempts to combine the advantages of both point clouds and images remain scarce. For 3D learning tasks, the point cloud provides essential geometrical cues, while rich color images can complement the point cloud by supplying the missing color information and correcting noise errors. As a result, the performance of 3D object detection could potentially be improved by involving 2D images. One intuitive method is to lift 3-dimensional RGB vectors from images to extend the point features. A CNN-based 3D detection model, ImVoteNet [20], points out the difficulty of mitigating 2D/3D discrepancies with this intuitive method; instead, ImVoteNet substitutes the RGB vectors with image features extracted by a pre-trained 2D detector. However, simultaneously relying on both the image voting and point cloud voting assumptions in [20] could accumulate the intrinsic grouping errors mentioned by [15]. To avoid the learning process of point clouds being impacted by middle-level 2D/3D feature interactions, [20] combines the multimodal features only at the first layer, which potentially prevents the network from fully exploiting their semantic correlations or mitigating multimodal discrepancies.
In this work, we propose Bridged Transformer (BrT) – a simple and effective Transformer framework for 3D object detection. BrT bridges the learning processes of images and point clouds inside the Transformer. It takes the sampled points and image patches as input. To protect the self-learning process of each modality, attention between point tokens and image patch tokens is blocked, while the two modalities are correlated by object queries throughout the Transformer layers. To strengthen the correlations of images and points, BrT is also equipped with powerful bridging designs from two perspectives. First, we leverage conditional object queries for images and points that are aware of the learned proposal points. This design, together with aligned positional embeddings, tells the Transformer that the object queries of images and points are aligned. Second, apart from the object-query perspective, we perform point-to-patch projections to explicitly leverage the spatial relationships of both modalities. BrT avoids grouping errors thanks to its natural ability to capture long-range dependencies and global contextual information, and instead of lifting image features to the point cloud at the beginning layer as in [20], BrT allows full propagation of feature interactions throughout the whole network. As an additional advantage, BrT can be extended to combine point clouds with multi-view images.
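To make the blocked-attention design concrete, below is a minimal sketch (our illustration, not the authors' released implementation) of how such a cross-modal attention mask could be constructed: point tokens and image patch tokens are not allowed to attend to each other directly, while the object-query tokens may attend to, and be attended by, every token. The function name and the token counts are placeholders.

```python
import numpy as np

def build_bridged_attention_mask(n_pnt: int, n_pat: int, n_query: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed) over the token
    sequence [point tokens | patch tokens | object-query tokens].

    Direct point-patch attention is blocked; object queries remain fully
    connected and thus bridge the two modalities. Illustrative sketch only.
    """
    n = n_pnt + n_pat + n_query
    mask = np.ones((n, n), dtype=bool)
    pnt = slice(0, n_pnt)
    pat = slice(n_pnt, n_pnt + n_pat)
    mask[pnt, pat] = False   # point tokens cannot attend to patch tokens
    mask[pat, pnt] = False   # patch tokens cannot attend to point tokens
    return mask

# Example: 4 point tokens, 3 patch tokens, 2 object queries.
m = build_bridged_attention_mask(4, 3, 2)
assert not m[0, 4] and not m[4, 0]   # point <-> patch blocked
assert m[0, 7] and m[7, 0]           # point <-> object query allowed
```

Such a mask would be applied in every self-attention layer, so each modality keeps its own learning path while information can still flow through the shared object queries.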
We evaluate BrT on both the SUN RGB-D and ScanNetV2 datasets, where BrT achieves remarkable improvements of 2.4% and 2.2%, respectively, over state-of-the-art methods.
To summarize, the contributions of our work are:
• We propose BrT, a novel framework for 3D object detection that bridges the learning processes of images and point clouds inside the Transformer.
• We propose to strengthen the correlation of images and points from two bridging perspectives, namely conditional object queries and the point-to-patch projection.
• BrT achieves state-of-the-art performance on two benchmarks, which demonstrates the superiority of our design as well as its potential in multi-view scenarios.
2. Related Work
3D detection with point cloud. There are unique challenges in processing point clouds with deep neural networks (DNNs) [3,15,19,21,25,36]; a detailed discussion of these challenges can be found in [9]. The goal of object detection in 3D space is to locate 3D bounding boxes and recognize the object classes. VoxelNet [38] proposes to divide a point cloud into equally spaced 3D voxels and then transforms the points in each voxel into a unified feature representation. VoteNet [21] reformulates Hough voting in the context of deep learning to generate better points for box proposals via grouping. Transformers have also been adapted to handle 3D points. 3DETR [19] introduces an end-to-end Transformer with non-parametric queries and Fourier positional embeddings. Group-Free [15] adopts the attention mechanism to learn point features, which retains the information of all points and avoids the errors of previous grouping strategies. Voxel Transformer [17] effectively captures long-range relationships between voxels.
3D detection with multimodal data. A few works use deep networks to combine point clouds and images. MV3D [4] proposes an element-wise fusion of representations from different domains, based on the rigid assumption that all objects lie on the same spatial plane and can be pinpointed solely from a top-down view of the point cloud. PointFusion [34] concatenates point cloud features and image features at two different levels to learn their correlations, which cannot guarantee the alignment of features. ImVoteNet [20] lifts crafted semantic and texture features to the 3D seed points for fusion. However, ImVoteNet is still negatively affected by grouping errors and by combining features only at the beginning layer, which leads to highly restricted feature interactions. Different from the aforementioned methods, our BrT fully exploits the feature correlation between images and points, with additional bridging processes to strengthen this correlation.
Transformer for 2D detection. Recently, Transformers have achieved cutting-edge performance in computer vision tasks [2,6,8,14,18,35,39]. For 2D object detection based on images, DETR [2] enables the Transformer to learn relations between the objects and the global image context to directly output the final set of predictions; it also removes the need for non-maximum suppression and anchor generation. With the help of pre-training, YOLOS [8] proposes a pure sequence-to-sequence approach that achieves competitive performance for object detection, thereby also demonstrating the transferability of Transformers from image recognition to object detection. Deformable DETR [39] is an efficient and fast-converging model whose attention modules only attend to a small set of tokens instead of the whole context. Conditional DETR [18] learns a conditional spatial query that aims to accelerate the training process.
3. Method
In this section, we propose Bridged Transformer (BrT)
for 3D object detection with both vision and point cloud as
input. We describe the overall structure of BrT in Sec. 3.1,
followed by the design of building blocks in Sec. 3.2. We
consider two aspects to bridge the learning processes of vi-
sion and point cloud in Sec. 3.3 and Sec. 3.4, respectively.
3.1. Overall architecture
An overall architecture of our BrT is depicted in Fig. 1.
Figure 1. Overall architecture of our Bridged Transformer (BrT) for 3D object detection based on point clouds and single-view/multi-view images. For each image view, we annotate its corresponding region on the point cloud for better readability.

Suppose we are given N × 3 points representing the 3D coordinates, and an H × W × 3 image. Here, N is the number of points; H and W are the height and width of the image, respectively. For simplicity, we first analyze one image per scene, since this matches the common scenario where the camera sensors capture (depth) points and an RGB image at the same time. Our method can nevertheless be extended to handle multiple images per scene with different views, as described in Sec. 3.5 and evaluated in our experiments.
Before feeding the point cloud data to the first Transformer stage (each stage contains a multi-head self-attention, an MLP, and two layer normalizations), we process the data with the method adopted in [21]. Specifically, we first sample Npnt × (3 + F) "seed points" from the total of N × 3 points using PointNet++. Here, Npnt denotes the number of sampled points, and the positive integers 3 and F represent the dimensions of the 3D Euclidean coordinate and the point feature, respectively.
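The subsampling idea behind the seed points can be illustrated with a rough sketch of farthest point sampling; note that BrT relies on PointNet++ for this step, which additionally produces the F-dimensional learned features, whereas the snippet below (our illustration, with hypothetical names) only shows how a coordinate subset could be selected.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedy farthest point sampling over an (N, 3) array of coordinates.

    Returns the indices of n_samples points. This only illustrates the
    subsampling idea; BrT uses PointNet++ to obtain Npnt seed points
    together with their F-dimensional features.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)           # arbitrary starting point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]  # offsets to the last pick
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))       # farthest from the current set
    return selected

# Example: sample 1024 seed coordinates from a cloud of 20000 points.
cloud = np.random.rand(20000, 3)
seeds = cloud[farthest_point_sampling(cloud, 1024)]   # shape (1024, 3)
```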
For processing the image data, we follow successful practices from vision Transformers. Concretely, each image is evenly partitioned into Npat patches before being embedded by a multi-layer perceptron (MLP). Together with the embedded image patches, the learned object queries are sent to the model, generating output embeddings that are used to predict box coordinates and class labels.
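A minimal sketch of the patch partitioning and embedding step is given below; the patch size, the hidden width, and the single linear projection (standing in for the MLP) are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def embed_patches(image: np.ndarray, patch: int, w_embed: np.ndarray) -> np.ndarray:
    """Split an (H, W, 3) image into non-overlapping patches and embed each
    flattened patch with one linear layer (stand-in for the MLP in BrT).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H/p, p, W/p, p, 3) -> (H/p, W/p, p, p, 3) -> (N_pat, p*p*3)
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * c))
    return patches @ w_embed                      # (N_pat, D) patch tokens

# Example: a 224x224 image, 16x16 patches, hidden dimension D = 256.
img = np.random.rand(224, 224, 3)
W = np.random.randn(16 * 16 * 3, 256) * 0.02
patch_tokens = embed_patches(img, 16, W)          # shape (196, 256)
```

With a 224 × 224 input and 16 × 16 patches, this yields Npat = 196 patch tokens.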
Moreover, we adopt 2K learnable object queries, of which K queries are for points and K are for image patches. In summary, we have Npnt + Npat basic tokens and 2K object query tokens. Suppose the hidden dimension is D. The token features fed to the l-th (l = 1, …, L) Transformer stage contain point tokens $p^l_{\mathrm{pnt}} \in \mathbb{R}^{N_{\mathrm{pnt}} \times D}$, patch tokens $p^l_{\mathrm{pat}} \in \mathbb{R}^{N_{\mathrm{pat}} \times D}$, object queries for points $o^l_{\mathrm{pnt}} \in \mathbb{R}^{K \times D}$, and object queries for patches $o^l_{\mathrm{pat}} \in \mathbb{R}^{K \times D}$.
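The resulting token bookkeeping can be summarized with a small shape check (the concrete values of Npnt, Npat, K, and D below are placeholders):

```python
import numpy as np

N_pnt, N_pat, K, D = 1024, 196, 16, 256

p_pnt = np.zeros((N_pnt, D))   # point tokens p^l_pnt
p_pat = np.zeros((N_pat, D))   # patch tokens p^l_pat
o_pnt = np.zeros((K, D))       # object queries for points o^l_pnt
o_pat = np.zeros((K, D))       # object queries for patches o^l_pat

# Each Transformer stage consumes all tokens as a single sequence of
# N_pnt + N_pat + 2K entries; a cross-modal mask such as the one sketched
# in the introduction decides which pairs of tokens may interact.
tokens = np.concatenate([p_pnt, p_pat, o_pnt, o_pat], axis=0)
assert tokens.shape == (N_pnt + N_pat + 2 * K, D)
```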
When given the camera intrinsic and extrinsic parameters, each 3D point can be projected onto the camera plane, i.e., the 3D coordinates can be correlated with 2D image pixels. We define the projection operator $\mathrm{proj}: \mathbb{R}^3 \to \mathbb{R}^2$ that maps a 3D point coordinate $k = [x, y, z]^\top$ to a 2D pixel coordinate $k' = [u, v]^\top$ on the corresponding image:

$$k' = \mathrm{proj}(k) = \Pi \begin{bmatrix} \frac{1}{4} & 0 & 0 \\ 0 & \frac{1}{4} & 0 \\ 0 & 0 & 1 \end{bmatrix} K R_t \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad (1)$$

where $K$ and $R_t$ are the intrinsic and extrinsic matrices, and $\Pi$ is a perspective mapping.
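As a numerical illustration of Eq. (1), the sketch below assumes K is a 3 × 3 intrinsic matrix and R_t a 3 × 4 extrinsic matrix, and realizes Π as the usual perspective division; the diag(1/4, 1/4, 1) factor from the equation is exposed as a scale argument. The function and variable names are ours.

```python
import numpy as np

def proj(k: np.ndarray, K: np.ndarray, Rt: np.ndarray, scale: float = 0.25) -> np.ndarray:
    """Project a 3D point k = [x, y, z] to a 2D pixel [u, v] as in Eq. (1).

    K: (3, 3) camera intrinsics, Rt: (3, 4) extrinsics [R | t].
    The scale factor mirrors the diag(1/4, 1/4, 1) matrix in the equation.
    """
    S = np.diag([scale, scale, 1.0])
    k_h = np.append(k, 1.0)        # homogeneous coordinates [x, y, z, 1]
    uvw = S @ K @ Rt @ k_h         # 3-vector before the perspective division
    return uvw[:2] / uvw[2]        # Pi: divide by depth to obtain [u, v]

# Example with a toy pinhole camera placed at the origin.
K_mat = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
Rt_mat = np.hstack([np.eye(3), np.zeros((3, 1))])
print(proj(np.array([0.2, -0.1, 2.0]), K_mat, Rt_mat))
```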
BrT has 2K outputs, which correspond to the 2K input object queries. An MLP is applied to the first K outputs to predict the coordinates of the 3D boxes and their class labels. For the remaining K outputs, we use a different MLP to predict the 2D coordinates of the bounding boxes and their associated classes. It is worth mentioning that we do not need extra labels for the 2D box coordinates, since they are obtained by first projecting the 3D box coordinate labels onto the 2D camera plane following Eq. (1), and then taking the axis-aligned 2D bounding boxes of the projected shapes.
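This label derivation can be sketched as follows; the axis-aligned corner enumeration is a simplification (3D boxes in indoor benchmarks typically also carry a heading angle), and the projection callback stands in for Eq. (1). All names here are our own.

```python
import itertools
from typing import Callable
import numpy as np

def box3d_to_box2d(center: np.ndarray, size: np.ndarray,
                   project: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Project the 8 corners of an axis-aligned 3D box through `project`
    (e.g. the proj() of Eq. (1)) and return the axis-aligned 2D box
    [u_min, v_min, u_max, v_max] enclosing the projected shape.
    """
    half = size / 2.0
    corners = [center + np.array(s) * half
               for s in itertools.product([-1.0, 1.0], repeat=3)]   # 8 corners
    uv = np.stack([project(c) for c in corners])                     # (8, 2)
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])

# Example with a toy pinhole projection standing in for Eq. (1).
pinhole = lambda p: np.array([500.0 * p[0] / p[2] + 320.0,
                              500.0 * p[1] / p[2] + 240.0])
print(box3d_to_box2d(np.array([0.0, 0.0, 3.0]), np.ones(3), pinhole))
```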
The optimization of BrT concerns minimizing a compound loss function that contains two parts: a regression loss for locating bounding boxes, and a classification loss for predicting the class of the associated box. The regression loss contains two components, $\mathcal{L}^{3D}_{obj}$ and $\mathcal{L}^{2D}_{obj}$, for the 3D and 2D cases respectively. Likewise, the classification loss has a 3D component $\mathcal{L}^{3D}_{cls}$ and a 2D component $\mathcal{L}^{2D}_{cls}$. As such, the overall loss function is formulated as

$$\mathcal{L} = \mathcal{L}^{3D}_{obj} + \alpha_1 \mathcal{L}^{3D}_{cls} + \alpha_2 \mathcal{L}^{2D}_{obj} + \alpha_3 \mathcal{L}^{2D}_{cls}, \qquad (2)$$
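Putting Eq. (2) into code form (the individual loss values and the weights α1–α3 below are placeholders; the paper's actual box regression and classification losses are defined by its detection heads):

```python
def total_loss(l3d_obj: float, l3d_cls: float, l2d_obj: float, l2d_cls: float,
               alpha1: float = 1.0, alpha2: float = 1.0, alpha3: float = 1.0) -> float:
    """Weighted sum of Eq. (2): L = L3D_obj + a1*L3D_cls + a2*L2D_obj + a3*L2D_cls.
    The default weights are placeholders, not the values used in the paper.
    """
    return l3d_obj + alpha1 * l3d_cls + alpha2 * l2d_obj + alpha3 * l2d_cls

# Example usage with dummy per-branch losses.
print(total_loss(0.8, 0.3, 0.5, 0.2, alpha1=0.5, alpha2=0.5, alpha3=0.5))
```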