simple and effective Transformer framework for 3D object
detection. BrT bridges the learning processes of images and
point clouds inside the Transformer. This approach takes the
sampled points and image patches as input. To protect the
self-learning process of each modality, direct attention between point tokens and image patch tokens is blocked, while the two modalities are correlated by object queries throughout the Transformer layers.
To strengthen the correlations of images and points, BrT
is also equipped with powerful bridging designs from two
perspectives. Firstly, we leverage conditional object queries
for images and points that are aware of the learned proposal
points. This design, together with aligned positional embeddings, informs the Transformer that the object queries of images and points are aligned. Secondly, apart from the perspective of object queries, we perform point-to-patch projections to explicitly leverage the spatial relationships of both modalities.
BrT avoids grouping errors owing to its natural ability to capture long-range dependencies and global contextual information, and instead of lifting image features to point clouds only at the beginning layer as in [20], BrT allows feature interactions to propagate fully through the whole network.
As an additional advantage, BrT can be extended to com-
bine point clouds with multi-view images.
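To make the modality separation described above concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a blocked attention mask: point tokens and image patch tokens cannot attend to each other directly, while object query tokens attend to, and are attended by, all tokens. The token counts, embedding dimension, and token ordering are illustrative assumptions.

import torch

def build_bridged_attention_mask(num_points, num_patches, num_queries):
    """Boolean mask of shape (T, T); True marks a pair that may NOT attend.

    Token order is assumed to be [point tokens | image patch tokens | object queries].
    """
    total = num_points + num_patches + num_queries
    mask = torch.zeros(total, total, dtype=torch.bool)
    pts = slice(0, num_points)
    img = slice(num_points, num_points + num_patches)
    # Block direct cross-modal attention between point and patch tokens.
    # Object queries (the trailing block) stay fully connected to both modalities,
    # so all cross-modal correlation flows through the queries.
    mask[pts, img] = True
    mask[img, pts] = True
    return mask

# Usage with a standard PyTorch Transformer layer (sizes are assumptions):
tokens = torch.randn(2, 1024 + 196 + 256, 288)   # (batch, tokens, d_model)
layer = torch.nn.TransformerEncoderLayer(d_model=288, nhead=8, batch_first=True)
out = layer(tokens, src_mask=build_bridged_attention_mask(1024, 196, 256))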
We evaluate BrT on both the SUN RGB-D and ScanNetV2 datasets, where BrT achieves remarkable improvements of 2.4% and 2.2%, respectively, over state-of-the-art methods.
To summarize, the contributions of our work are:
• We propose BrT, a novel framework for 3D object de-
tection that bridges the learning processes of images
and point clouds inside the Transformer.
• We propose to strengthen the correlation of images and
points from two bridging perspectives including condi-
tional object queries and the point-to-patch projection.
• BrT achieves state-of-the-art results on two benchmarks, which demonstrates the superiority of our design and its potential in multi-view scenarios.
2. Related Work
3D detection with point cloud. Processing point clouds with deep neural networks (DNNs) poses unique challenges [3,15,19,21,25,36]; a detailed discussion of these challenges can be found in [9]. Object detection in 3D space aims to localize 3D bounding boxes and recognize object classes. VoxelNet [38] proposes to divide a point cloud into equally spaced 3D voxels, and then transforms the points in each voxel into a unified feature representation. VoteNet [21]
reformulates Hough voting in the context of deep learn-
ing to generate better points for box proposals with group-
ing. Transformers have also been adapted to handle 3D points. 3DETR [19] introduces an end-to-end
Transformer with non-parametric queries and Fourier posi-
tional embeddings. Group-Free [15] adopts the attention
mechanism to learn the point features, which potentially
retains the information of all points to avoid the errors of
previous grouping strategies. Voxel Transformer [17] effec-
tively captures the long-range relationships between voxels.
3D detection with multimodal data. There are a few
works that use deep networks to combine point clouds and
images. MV3D [4] proposes an element-wise fusion of rep-
resentations from different domains, based on the rigid as-
sumption that all objects are on the same spatial plane and
can be pinpointed solely from a top-down view of the point
cloud. PointFusion [34] concatenates point cloud features
and image features at two different levels to learn their cor-
relations, which cannot guarantee the alignment of features. ImVoteNet [20] lifts crafted semantic and texture features to the 3D seed points for fusion. However, ImVoteNet is still negatively affected by grouping errors, and it combines features only at the beginning layer, which leads to highly restricted feature interactions. Different from the aforementioned methods, our BrT fully exploits the feature correlations between images and points, with additional bridging designs to strengthen them.
Transformer for 2D detection. Recently, Transformers have achieved cutting-edge performance in computer vision tasks [2,6,8,14,18,35,39]. For 2D object detection based
on images, DETR [2] enables the Transformer to learn relations between objects and the global image context and to directly output the final set of predictions, removing the need for non-maximum suppression and anchor generation. With the help of pre-training, YOLOS [8] proposes a pure sequence-to-sequence approach that achieves competitive performance for object detection, thereby also addressing the transferability of the Transformer from image recognition to object detection. Deformable DETR [39] is an efficient and fast-converging model whose attention modules attend to only a small set of tokens instead of the whole context. Conditional DETR [18] learns a conditional spatial query to accelerate the training process.
3. Method
In this section, we propose Bridged Transformer (BrT)
for 3D object detection with both vision and point cloud as
input. We describe the overall structure of BrT in Sec. 3.1,
followed by the design of building blocks in Sec. 3.2. We
consider two aspects to bridge the learning processes of vi-
sion and point cloud in Sec. 3.3 and Sec. 3.4, respectively.
3.1. Overall architecture
An overall architecture of our BrT is depicted in Fig. 1.
Suppose we are given N×3 points representing the 3D coordinates, and an H×W×3 image. Here, N is the number