Li3DeTr: A LiDAR based 3D Detection Transformer
Gopi Krishna Erabati and Helder Araujo
Institute of Systems and Robotics
University of Coimbra, Portugal
{gopi.erabati, helder}@isr.uc.pt
Abstract
Inspired by recent advances in vision transformers for
object detection, we propose Li3DeTr, an end-to-end LiDAR
based 3D Detection Transformer for autonomous driving,
that inputs LiDAR point clouds and regresses 3D bounding
boxes. The LiDAR local and global features are encoded
using sparse convolution and multi-scale deformable atten-
tion respectively. In the decoder head, firstly, in the novel
Li3DeTr cross-attention block, we link the LiDAR global
features to 3D predictions leveraging the sparse set of ob-
ject queries learnt from the data. Secondly, the object query
interactions are formulated using multi-head self-attention.
Finally, the decoder layer is repeated Ldec number of times
to refine the object queries. Inspired by DETR, we em-
ploy set-to-set loss to train the Li3DeTr network. Without
bells and whistles, the Li3DeTr network achieves 61.3%
mAP and 67.6% NDS on the nuScenes dataset, surpassing
state-of-the-art methods that use non-maximum suppression
(NMS), and it also achieves competitive performance on the
KITTI dataset. We also employ knowledge distillation (KD)
using a teacher and student model that slightly improves the
performance of our network.
1. Introduction
With the advent of deep learning networks for computer
vision [16, 37] and large-scale datasets [10], the research on
perception systems for scene understanding of autonomous
vehicles is growing rapidly. 3D object detection is one of
the key tasks in autonomous driving; it is the twofold process
of classifying and localizing the objects in the scene. LiDAR
is one of the most important sensors on autonomous vehicles,
as it provides precise 3D information about the scene. Although
there has been huge progress in 2D object detection
approaches [2, 12, 22, 32, 33, 40], CNN-based approaches do
not transfer directly to LiDAR point clouds due to their
sparse, unordered and irregular nature.
Earlier approaches for 3D object detection on LiDAR
data can be divided into two types: point-based and grid-
based methods. Point-based methods [27, 36, 49] are based
on point operations [29, 30] which detect the 3D objects
directly from the point clouds. Grid-based methods ei-
ther voxelize the points into volumetric grids or project
the points to Bird's Eye View (BEV) space. The advantage
of BEV projection is that it preserves Euclidean distance,
avoids overlap between objects, and keeps the object size
invariant to the distance from the ego vehicle, which is
significant for autonomous driving scenarios. Sparse CNN-based
voxel feature extraction [47] is advantageous, but it cannot
capture rich semantic information because of its limited
receptive field. We mitigate this issue by employing a multi-
scale deformable attention [55] encoder to capture global
LiDAR feature maps.
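For intuition, the core of a deformable attention layer can be sketched as follows: each query predicts a small set of sampling offsets around a reference location on the BEV map and aggregates the bilinearly sampled features with learnt attention weights. The snippet below is a minimal single-scale, single-head sketch in PyTorch; module and parameter names are illustrative and it does not reproduce the multi-scale implementation of [55].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Minimal single-scale, single-head sketch of deformable attention:
    each query samples a few locations around its reference point on a
    BEV feature map and combines them with learnt attention weights."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)  # (dx, dy) per sampling point
        self.weights = nn.Linear(dim, num_points)      # one weight per sampling point
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, bev_feats):
        # queries:    (B, N, C) query embeddings
        # ref_points: (B, N, 2) reference locations in [0, 1] (x, y)
        # bev_feats:  (B, C, H, W) BEV feature map acting as values
        B, N, _ = queries.shape
        value = self.value_proj(bev_feats)

        offsets = self.offsets(queries).view(B, N, self.num_points, 2)
        attn = self.weights(queries).softmax(-1)                    # (B, N, K)

        # sampling locations: reference point plus small learnt offsets
        loc = (ref_points.unsqueeze(2) + 0.05 * offsets.tanh()).clamp(0, 1)
        grid = 2.0 * loc - 1.0                                      # grid_sample expects [-1, 1]
        sampled = F.grid_sample(value, grid, align_corners=False)   # (B, C, N, K)

        out = (sampled * attn.unsqueeze(1)).sum(-1).transpose(1, 2) # (B, N, C)
        return self.out_proj(out)
```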
Earlier approaches for 3D object detection use either a
two-stage detection pipeline [7, 36], anchors [19, 53], or
anchor-free networks [42, 43, 51], but all of them employ
a post-processing step such as NMS to remove redun-
dant boxes. Inspired by Object-DGCNN [45], we formulate
the 3D object detection problem as a direct set prediction
problem to avoid NMS.
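Concretely, set prediction pairs each ground-truth box with exactly one query via bipartite (Hungarian) matching and treats unmatched queries as background, so duplicate detections are penalised during training instead of being suppressed at inference. The sketch below illustrates this idea for a single sample; the cost terms, weights and the "no object" convention are assumptions for illustration, not the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def set_to_set_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, num_classes):
    """Illustrative DETR-style set loss for one sample.
    pred_logits: (N, num_classes) class scores for N object queries
                 (the last class index is assumed to mean "no object")
    pred_boxes:  (N, 7) box parameters, e.g. (x, y, z, w, l, h, yaw)
    gt_labels:   (M,)   ground-truth class ids
    gt_boxes:    (M, 7) ground-truth boxes
    """
    prob = pred_logits.softmax(-1)
    # matching cost: negative class probability plus L1 box distance (assumed weights)
    cost = -prob[:, gt_labels] + torch.cdist(pred_boxes, gt_boxes, p=1)   # (N, M)

    # one-to-one assignment between queries and ground-truth objects
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows = torch.as_tensor(rows, device=pred_logits.device)
    cols = torch.as_tensor(cols, device=pred_logits.device)

    # unmatched queries are supervised towards the "no object" class
    targets = torch.full((pred_logits.shape[0],), num_classes - 1,
                         dtype=torch.long, device=pred_logits.device)
    targets[rows] = gt_labels[cols]

    loss_cls = F.cross_entropy(pred_logits, targets)
    loss_box = F.l1_loss(pred_boxes[rows], gt_boxes[cols])
    return loss_cls + loss_box
```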
We propose an end-to-end, single-stage LiDAR based
3D Detection Transformer (Li3DeTr) network to predict the
3D bounding boxes for autonomous driving. Firstly, voxel
features are extracted either with SECOND [47], leveraging
sparse convolutions [15] and a BEV transformation, or
with PointPillars [19]. Secondly, we employ an encoder
module with multi-scale deformable attention [55] to cap-
ture rich semantic features and long range dependencies in
BEV feature maps to generate LiDAR global features. The
LiDAR global features are passed to the decoder module.
Finally, we introduce a novel Li3DeTr cross-attention block
in the decoder to link the LiDAR global features to the 3D
object predictions leveraging the learnt object queries. The
object queries interact with each other in multi-head self-
attention block [41]. The object queries are iteratively re-
fined and 3D bounding box parameters are regressed in ev-
ery decoder layer. Inspired by DETR [2], we use set-to-set
loss to optimize our network during training.
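Read as pseudocode, the pipeline described above amounts to a backbone, an encoder, and a stack of query-refining decoder layers. The sketch below is only a structural outline under assumed module names (backbone, encoder, decoder_layer are placeholders), not the authors' implementation.

```python
import torch
import torch.nn as nn

class Li3DeTrSketch(nn.Module):
    """Structural sketch of the described pipeline (not the authors' code):
    voxel backbone -> BEV features -> deformable-attention encoder ->
    L_dec decoder layers that iteratively refine learnt object queries."""

    def __init__(self, backbone, encoder, decoder_layer, num_queries=300,
                 dim=256, num_dec_layers=6, num_classes=10):
        super().__init__()
        self.backbone = backbone      # e.g. SECOND or PointPillars feature extractor
        self.encoder = encoder        # multi-scale deformable attention over BEV maps
        self.layers = nn.ModuleList([decoder_layer(dim) for _ in range(num_dec_layers)])
        self.queries = nn.Embedding(num_queries, dim)   # learnt object queries
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 7)               # (x, y, z, w, l, h, yaw)

    def forward(self, point_clouds, batch_size):
        bev_feats = self.backbone(point_clouds)         # LiDAR local features in BEV
        global_feats = self.encoder(bev_feats)          # LiDAR global features
        q = self.queries.weight.unsqueeze(0).expand(batch_size, -1, -1)

        outputs = []
        for layer in self.layers:                       # iterative query refinement
            q = layer(q, global_feats)                  # self-attention + cross-attention
            outputs.append((self.cls_head(q), self.box_head(q)))
        return outputs   # per-layer predictions, all supervised with the set-to-set loss
```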
We conduct experiments on two publicly available au-
tonomous driving benchmarks, nuScenes [1] and KITTI
[14] dataset. Our network achieves 61.3% mAP and 67.6%
NDS on the nuScenes dataset surpassing the state-of-the-art
CenterPoint [51] and Object-DGCNN [45] by 3.3% mAP
(and 2.1% NDS) and 2.6% mAP (and 1.6% NDS) respec-
tively.
Our main contributions are as follows:
• We propose an end-to-end, single-stage LiDAR based
3D Detection Transformer (Li3DeTr) for autonomous
driving. Our method achieves 61.3% mAP and 67.6% NDS
on the nuScenes [1] dataset, surpassing state-of-the-art
LiDAR based object detection approaches, and achieves
competitive performance (without NMS) on the KITTI [14]
dataset compared to other approaches (with NMS). Similar
to DETR [2], our approach does not require NMS, hence it
is effective to apply knowledge distillation with a teacher
and student model to improve the accuracy.
• We introduce a novel Li3DeTr cross-attention block to
link the LiDAR global encoded features to 3D object
predictions leveraging the learnt object queries. The
attention mechanism in the encoder and decoder helps to
detect large-size objects effectively, as shown in Table 3.
The ablation study shown in Table 6 justifies our novel
Li3DeTr cross-attention block.
• We release our code and models to facilitate further
research.
2. Related Work
The LiDAR point cloud based 3D object detection ap-
proaches can be divided into two categories: point-based
and grid-based, depending on the type of data representa-
tion used to predict the 3D bounding boxes.
Point-based methods [28, 35, 36, 49, 50] directly use the
sparse and unordered set of points to predict 3D bound-
ing boxes. The point features are aggregated by multi-
scale/multi-resolution grouping and set abstraction [29, 30].
PointRCNN [36] employs a two-stage pipeline for 3D ob-
ject prediction. PV-RCNN [35] models a point-voxel set
abstraction layer to leverage the advantages of both point
and voxel based methods. Frustum-PointNet [28] uses 2D
object detection to sample a frustum of points and applies
PointNet [29] to predict 3D objects. Although point-based
methods achieve large receptive fields with the set abstraction
layer, they are computationally expensive.
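As a rough illustration of what a set-abstraction layer does, the snippet below samples a subset of centroids by farthest point sampling and groups neighbouring points around them; it is a naive reference version of the idea in [29, 30], not an efficient implementation.

```python
import torch

def farthest_point_sampling(xyz, num_samples):
    """Naive farthest point sampling (the sampling step of a set-abstraction
    layer); real implementations use dedicated CUDA kernels.
    xyz: (N, 3) point coordinates."""
    N = xyz.shape[0]
    chosen = torch.zeros(num_samples, dtype=torch.long)
    dist = torch.full((N,), float('inf'))
    farthest = 0
    for i in range(num_samples):
        chosen[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(-1)  # squared distance to the new centroid
        dist = torch.minimum(dist, d)             # distance to the nearest chosen point
        farthest = int(dist.argmax())             # next centroid: furthest remaining point
    return chosen

def ball_query_group(xyz, centroid_idx, radius=1.0, k=16):
    """Group the k nearest points within `radius` of each sampled centroid
    (points outside the ball may appear when a ball holds fewer than k points)."""
    centroids = xyz[centroid_idx]                 # (M, 3)
    d = torch.cdist(centroids, xyz)               # (M, N) pairwise distances
    d = torch.where(d <= radius, d, torch.full_like(d, float('inf')))
    return d.topk(k, largest=False).indices       # (M, k) neighbour indices
```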
Grid-based methods. As LiDAR point clouds are sparse and
unordered sets of points, many methods project the points
onto regular grids such as voxels [47, 53], BEV pillars [19]
or range projections [3, 13, 39]. The point
clouds are discretized into 3D voxels [38, 53] and 3D
CNNs are employed to extract voxel-wise features. However,
3D CNNs are computationally expensive and require large
memory; to mitigate this problem, [5, 47] use sparse 3D
CNNs [15] for efficient voxel processing. The LiDAR point
cloud is projected into a BEV map in PointPillars [19] and
PIXOR [48], and 2D CNNs are employed to reduce the
computational cost; however, such a projection induces 3D
information loss. To mitigate this issue, some methods
[47, 51] compute voxel features using sparse convolutions,
then project the voxel features into BEV space, and finally
predict the 3D bounding boxes in the BEV space. As this
approach takes advantage of
voxel and BEV space, we test our network with SECOND
[47] and PointPillars [19] feature extraction networks. In
order to achieve large receptive fields similar to point-based
methods [29, 30], we model long-range interactions of local
LiDAR features using a multi-scale deformable attention [55]
block in our encoder to obtain LiDAR global features.
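To make the voxel-to-BEV idea concrete, the toy function below buckets points into a ground-plane grid and scatters per-cell averaged features onto a dense BEV map; the voxel size and range are arbitrary example values, and real pipelines apply sparse convolutions before this scatter step.

```python
import torch

def voxelize_and_scatter_to_bev(points, feats, voxel_size=(0.2, 0.2),
                                pc_range=(-50.0, -50.0, 50.0, 50.0)):
    """Toy illustration of the grid-based idea: bucket points into a regular
    ground-plane grid and scatter per-cell mean features onto a dense BEV map.
    points: (N, 3) xyz coordinates, feats: (N, C) per-point features."""
    x_min, y_min, x_max, y_max = pc_range
    W = int((x_max - x_min) / voxel_size[0])
    H = int((y_max - y_min) / voxel_size[1])

    # keep only points inside the range
    mask = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
            (points[:, 1] >= y_min) & (points[:, 1] < y_max))
    points, feats = points[mask], feats[mask]

    # integer cell indices in the BEV plane (the z axis is collapsed)
    ix = ((points[:, 0] - x_min) / voxel_size[0]).long()
    iy = ((points[:, 1] - y_min) / voxel_size[1]).long()
    flat = iy * W + ix                                         # (N,) flattened cell index

    C = feats.shape[1]
    bev = torch.zeros(H * W, C, device=feats.device)
    cnt = torch.zeros(H * W, 1, device=feats.device)
    bev.index_add_(0, flat, feats)                             # sum features per cell
    cnt.index_add_(0, flat, torch.ones(len(flat), 1, device=feats.device))
    bev = bev / cnt.clamp(min=1)                               # mean feature per cell
    return bev.view(H, W, C).permute(2, 0, 1)                  # (C, H, W) BEV map
```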
Transformer-based methods. The object detection heads of
earlier approaches [7, 19, 35, 36, 50, 53] employ anchor
boxes to predict the objects; however, anchor boxes involve
hand-crafted parameter tuning and are statistically derived
from the dataset. To mitigate this issue, some approaches
[5, 43, 48, 51] follow an anchor-free pipeline by computing
per-pixel or per-pillar predictions, but these approaches
still use NMS to remove redundant boxes. DETR [2] is the
first transformer architecture that formulated the 2D
detection problem as direct set prediction to remove NMS.
Our network follows a similar formulation for 3D object
detection. Some approaches [24, 27, 34] use transformers in
their feature extraction networks. 3DETR [25] is a fully
transformer based architecture for 3D object detection
using the vanilla transformer [41] block with minimal modifica-
tions. 3DETR directly operates on and attends to points,
whereas our approach voxelizes the points and attends to
the global BEV voxel features, which is computationally more
efficient for autonomous driving scenarios. 3DETR employs a
downsampling and set-aggregation operation [30] on the input
points of indoor scenarios because the computational
complexity of self-attention increases quadratically (O(n²))
with the number of input points. Moreover, 3DETR is
effective on indoor datasets, where the points are dense and concen-
trated. Object-DGCNN [45] employs a graph-based model
for transformer-based 3D object detection for outdoor envi-
ronments. BoxeR [26] introduces a novel and simple Box-
Attention which enables spatial interaction between grid
features. BoxeR-2D enables end-to-end 2D object detection
and segmentation tasks, which can be extended to BoxeR-
3D for end-to-end 3D object detection. VISTA [11] is a
plug-and-play module, incorporated into [5, 51], that
adaptively fuses multi-view features in a global spatial
context. It intro-
duces dual cross-view spatial attention to leverage the in-
formation in BEV and Range View (RV) features. We for-
mulate our model with a voxel-BEV based CNN backbone
architecture for local feature extraction and an attention-based
architecture for global feature extraction to increase the
receptive field.