tonomous driving benchmarks: nuScenes [1] and KITTI [14]. Our network achieves 61.3% mAP and 67.6% NDS on the nuScenes dataset, surpassing the state-of-the-art CenterPoint [51] and Object-DGCNN [45] by 3.3% mAP (and 2.1% NDS) and 2.6% mAP (and 1.6% NDS), respectively.
Our main contributions are as follows:
• We propose an end-to-end, single-stage LiDAR-based 3D Detection Transformer (Li3DeTr) for autonomous driving. Our method achieves 61.3% mAP and 67.6% NDS on the nuScenes [1] dataset, surpassing state-of-the-art LiDAR-based object detection approaches, and achieves performance on the KITTI [14] dataset (without NMS) competitive with other approaches (with NMS). Like DETR [2], our approach does not require NMS, which makes it straightforward to apply teacher-student knowledge distillation to improve accuracy.
• We introduce a novel Li3DeTr cross-attention block to link the global encoded LiDAR features to 3D object predictions by leveraging the learnt object queries; a simplified sketch of such a block follows this list. The attention mechanism in the encoder and decoder helps to detect large objects effectively, as shown in Table 3, and the ablation study in Table 6 validates the novel Li3DeTr cross-attention block.
• We release our code and models to facilitate further
research.
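As a rough illustration of the cross-attention block referenced in the second contribution, the following is a minimal PyTorch sketch in which learnt object queries cross-attend to flattened global BEV features. It is built from standard multi-head attention; the class name, layer sizes, and query count are illustrative assumptions, not the exact Li3DeTr block.

```python
import torch
import torch.nn as nn

class QueryCrossAttentionBlock(nn.Module):
    """Minimal sketch: object queries cross-attend to flattened global
    BEV features (hypothetical sizes, not the exact Li3DeTr block)."""

    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim))
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, queries, bev_feats):
        # queries: (B, num_queries, C); bev_feats: (B, H*W, C)
        attended, _ = self.cross_attn(query=queries, key=bev_feats, value=bev_feats)
        queries = self.norm1(queries + attended)
        queries = self.norm2(queries + self.ffn(queries))
        return queries  # refined queries, decoded into boxes by prediction heads

queries = torch.randn(2, 300, 256)          # 300 learnt object queries
bev_feats = torch.randn(2, 180 * 180, 256)  # flattened global BEV feature map
out = QueryCrossAttentionBlock()(queries, bev_feats)
print(out.shape)  # torch.Size([2, 300, 256])
```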
2. Related Work
LiDAR point cloud based 3D object detection approaches can be divided into two categories, point-based and grid-based, depending on the data representation used to predict the 3D bounding boxes.
Point-based methods [28, 35, 36, 49, 50] directly consume the sparse and unordered set of points to predict 3D bounding boxes. Point features are aggregated by multi-scale/multi-resolution grouping and set abstraction [29, 30]. PointRCNN [36] employs a two-stage pipeline for 3D object prediction. PV-RCNN [35] models a point-voxel set abstraction layer to leverage the advantages of both point-based and voxel-based methods. Frustum-PointNet [28] uses a 2D object detector to sample a frustum of points and applies PointNet [29] to predict 3D objects. Although point-based methods achieve large receptive fields through set abstraction layers, they are computationally expensive.
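To make the set abstraction idea concrete, below is a minimal PyTorch sketch of a single PointNet++-style set abstraction level: farthest point sampling picks centroids, neighbours are grouped around them (kNN stands in for a true ball query for brevity), and a shared MLP followed by max pooling produces one feature per group. All sizes and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def farthest_point_sampling(xyz, m):
    """Greedily pick m well-spread centroid indices from (N, 3) points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    idx[0] = torch.randint(n, (1,)).item()
    for i in range(1, m):
        # distance of every point to its nearest already-selected centroid
        dist = torch.minimum(dist, ((xyz - xyz[idx[i - 1]]) ** 2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def set_abstraction(xyz, feats, mlp, m=128, k=16):
    """One set abstraction level: sample m centroids, group k neighbours
    (kNN in place of a ball query), apply a shared MLP, max-pool per group."""
    centroids = xyz[farthest_point_sampling(xyz, m)]                      # (m, 3)
    nbr_idx = torch.cdist(centroids, xyz).topk(k, largest=False).indices  # (m, k)
    local_xyz = xyz[nbr_idx] - centroids[:, None, :]                      # centroid-relative coords
    grouped = torch.cat([local_xyz, feats[nbr_idx]], dim=-1)              # (m, k, 3 + C)
    return centroids, mlp(grouped).max(dim=1).values                      # (m, C_out)

mlp = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 128))
xyz = torch.randn(1024, 3)     # toy point cloud
feats = torch.randn(1024, 1)   # e.g. per-point intensity
centers, pooled = set_abstraction(xyz, feats, mlp)
print(centers.shape, pooled.shape)  # torch.Size([128, 3]) torch.Size([128, 128])
```

The max pooling over each group is what makes the aggregation invariant to the ordering of points, and stacking several such levels grows the receptive field, at the computational cost noted above.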
Grid-based methods. As LiDAR point clouds are sparse and unordered sets of points, many methods project the points onto regular grids such as voxels [47, 53], BEV pillars [19] or range projections [3, 13, 39]. The point clouds are discretized into 3D voxels [38, 53] and 3D CNNs are employed to extract voxel-wise features. However, 3D CNNs are computationally expensive and require large amounts of memory; to mitigate this problem, [5, 47] use sparse 3D CNNs [15] for efficient voxel processing. PointPillars [19] and PIXOR [48] project the LiDAR point cloud into a BEV map and employ 2D CNNs to reduce the computational cost; however, such a projection induces a loss of 3D information. To mitigate this issue, some methods [47, 51] compute voxel features using sparse convolutions, project the voxel features into BEV space, and finally predict the 3D bounding boxes in the BEV space. As this approach takes advantage of both the voxel and BEV representations, we test our network with the SECOND [47] and PointPillars [19] feature extraction networks. To achieve large receptive fields similar to point-based methods [29, 30], we model long-range interactions of local LiDAR features using a multi-scale deformable attention [55] block in our encoder to obtain global LiDAR features.
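As an illustration of the pillar-style BEV projection described above, the following is a minimal PyTorch sketch that scatters encoded per-pillar features back onto a dense BEV canvas, in the spirit of PointPillars; the grid size, channel count, and function name are assumptions for illustration.

```python
import torch

def scatter_pillars_to_bev(pillar_feats, coords, H=200, W=200):
    """Scatter encoded pillar features (P, C) onto a dense BEV canvas.

    coords holds the (row, col) BEV cell of each non-empty pillar; empty
    cells stay zero (collisions simply overwrite in this sketch)."""
    P, C = pillar_feats.shape
    canvas = torch.zeros(C, H * W, dtype=pillar_feats.dtype)
    flat_idx = coords[:, 0] * W + coords[:, 1]   # (P,) flattened cell index
    canvas[:, flat_idx] = pillar_feats.t()       # place each pillar's feature
    return canvas.view(C, H, W)                  # dense map, ready for a 2D CNN

# toy example: 5000 non-empty pillars with 64-dim encoded features
pillar_feats = torch.randn(5000, 64)
coords = torch.randint(0, 200, (5000, 2))        # (row, col) per pillar
bev = scatter_pillars_to_bev(pillar_feats, coords)
print(bev.shape)  # torch.Size([64, 200, 200])
```

The resulting (C, H, W) tensor is exactly what allows cheap 2D convolutions, while the per-pillar encoding before the scatter is what limits the 3D information loss relative to a raw height projection.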
Transformer-based methods. The detection heads of earlier approaches [7, 19, 35, 36, 50, 53] employ anchor boxes to predict objects; however, anchor boxes involve hand-crafted parameter tuning and are statistically derived from the dataset. To mitigate this issue, some approaches [5, 43, 48, 51] follow an anchor-free pipeline by computing per-pixel or per-pillar predictions, but they still rely on NMS to remove redundant boxes. DETR [2] is the first transformer architecture to formulate 2D detection as a direct set prediction problem, removing the need for NMS. Our network follows a similar formulation for 3D object detection.
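To make the set prediction formulation concrete, the sketch below shows the bipartite (Hungarian) matching at its core: each ground-truth box is matched to exactly one prediction, so duplicate predictions are penalised during training and NMS becomes unnecessary at inference. The centre-distance cost used here is a toy stand-in for the full classification-plus-box matching cost.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_centers, gt_centers):
    """One-to-one matching of predictions to ground truth.

    Uses centre distance as a toy cost; real set-prediction losses
    combine classification and box regression terms in the cost."""
    cost = torch.cdist(pred_centers, gt_centers)     # (num_pred, num_gt)
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, gt_idx  # matched pairs; unmatched preds -> "no object"

pred = torch.randn(300, 3)   # e.g. 300 query predictions (x, y, z centres)
gt = torch.randn(12, 3)      # 12 ground-truth objects in the scene
p, g = hungarian_match(pred, gt)
print(list(zip(p, g))[:3])   # each GT box is matched to exactly one query
```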
Some approaches [24, 27, 34] use transformers as feature extraction networks. 3DETR [25] is a fully
transformer-based architecture for 3D object detection using vanilla transformer [41] blocks with minimal modifications. 3DETR directly operates on and attends over points, whereas our approach voxelizes the points and attends over global BEV voxel features, which is computationally efficient for autonomous driving scenarios. 3DETR employs downsampling and set-aggregation operations [30] on the input points of indoor scenes because the computational complexity of self-attention increases quadratically, O(n²), with the number of input points. Moreover, 3DETR is effective on indoor datasets, where the points are dense and concentrated. Object-DGCNN [45] employs a graph-based model
for transformer-based 3D object detection for outdoor envi-
ronments. BoxeR [26] introduces a novel and simple Box-Attention that enables spatial interaction between grid features. BoxeR-2D enables end-to-end 2D object detection and segmentation, and can be extended to BoxeR-3D for end-to-end 3D object detection. VISTA [11] is a plug-and-play module that adaptively fuses multi-view features in a global spatial context and has been incorporated into [5, 51]. It introduces dual cross-view spatial attention to leverage the information in BEV and Range View (RV) features. We formulate our model with a voxel-BEV based CNN backbone architecture for local feature extraction and an attention-based
architecture for global feature extraction to increase the re-