tonomous driving benchmarks: nuScenes [1] and KITTI [14]. Our network achieves 61.3% mAP and 67.6% NDS on the nuScenes dataset, surpassing the state-of-the-art CenterPoint [51] and Object-DGCNN [45] by 3.3% mAP (and 2.1% NDS) and 2.6% mAP (and 1.6% NDS), respectively.
Our main contributions are as follows:
• We propose an end-to-end, single-stage LiDAR-based 3D Detection Transformer (Li3DeTr) for autonomous driving. Our method achieves 61.3% mAP and 67.6% NDS on the nuScenes [1] dataset, surpassing state-of-the-art LiDAR-based object detection approaches, and achieves performance on the KITTI [14] dataset (without NMS) competitive with other approaches (with NMS). Like DETR [2], our approach does not require NMS, which makes it straightforward to apply teacher-student knowledge distillation to improve accuracy.
• We introduce a novel Li3DeTr cross-attention block to link the global encoded LiDAR features to 3D object predictions by leveraging the learnt object queries; a simplified sketch of such a block follows this list. The attention mechanism in the encoder and decoder helps to detect large objects effectively, as shown in Table 3, and the ablation study in Table 6 validates the novel Li3DeTr cross-attention block.
• We release our code and models to facilitate further
research.
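As a rough illustration of the cross-attention block referenced in the second contribution, the following is a minimal PyTorch sketch in which learnt object queries cross-attend to flattened global BEV features. It is built from standard multi-head attention; the class name, layer sizes, and query count are illustrative assumptions, not the exact Li3DeTr block.

```python
import torch
import torch.nn as nn

class QueryCrossAttentionBlock(nn.Module):
    """Minimal sketch: object queries cross-attend to flattened global
    BEV features (hypothetical sizes, not the exact Li3DeTr block)."""

    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim))
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, queries, bev_feats):
        # queries: (B, num_queries, C); bev_feats: (B, H*W, C)
        attended, _ = self.cross_attn(query=queries, key=bev_feats, value=bev_feats)
        queries = self.norm1(queries + attended)
        queries = self.norm2(queries + self.ffn(queries))
        return queries  # refined queries, decoded into boxes by prediction heads

queries = torch.randn(2, 300, 256)          # 300 learnt object queries
bev_feats = torch.randn(2, 180 * 180, 256)  # flattened global BEV feature map
out = QueryCrossAttentionBlock()(queries, bev_feats)
print(out.shape)  # torch.Size([2, 300, 256])
```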
2. Related Work
LiDAR point cloud based 3D object detection approaches can be divided into two categories, point-based and grid-based, depending on the data representation used to predict the 3D bounding boxes.
Point-based methods [28, 35, 36, 49, 50] directly consume the sparse and unordered set of points to predict 3D bounding boxes. Point features are aggregated by multi-scale/multi-resolution grouping and set abstraction [29, 30]. PointRCNN [36] employs a two-stage pipeline for 3D object prediction. PV-RCNN [35] models a point-voxel set abstraction layer to leverage the advantages of both point-based and voxel-based methods. Frustum-PointNet [28] uses a 2D object detector to sample a frustum of points and applies PointNet [29] to predict 3D objects. Although point-based methods achieve large receptive fields through set abstraction layers, they are computationally expensive.
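To make the set abstraction idea concrete, below is a minimal PyTorch sketch of a single PointNet++-style set abstraction level: farthest point sampling picks centroids, neighbours are grouped around them (kNN stands in for a true ball query for brevity), and a shared MLP followed by max pooling produces one feature per group. All sizes and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def farthest_point_sampling(xyz, m):
    """Greedily pick m well-spread centroid indices from (N, 3) points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    idx[0] = torch.randint(n, (1,)).item()
    for i in range(1, m):
        # distance of every point to its nearest already-selected centroid
        dist = torch.minimum(dist, ((xyz - xyz[idx[i - 1]]) ** 2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def set_abstraction(xyz, feats, mlp, m=128, k=16):
    """One set abstraction level: sample m centroids, group k neighbours
    (kNN in place of a ball query), apply a shared MLP, max-pool per group."""
    centroids = xyz[farthest_point_sampling(xyz, m)]                      # (m, 3)
    nbr_idx = torch.cdist(centroids, xyz).topk(k, largest=False).indices  # (m, k)
    local_xyz = xyz[nbr_idx] - centroids[:, None, :]                      # centroid-relative coords
    grouped = torch.cat([local_xyz, feats[nbr_idx]], dim=-1)              # (m, k, 3 + C)
    return centroids, mlp(grouped).max(dim=1).values                      # (m, C_out)

mlp = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 128))
xyz = torch.randn(1024, 3)     # toy point cloud
feats = torch.randn(1024, 1)   # e.g. per-point intensity
centers, pooled = set_abstraction(xyz, feats, mlp)
print(centers.shape, pooled.shape)  # torch.Size([128, 3]) torch.Size([128, 128])
```

The max pooling over each group is what makes the aggregation invariant to the ordering of points, and stacking several such levels grows the receptive field, at the computational cost noted above.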
Grid-based methods. As LiDAR point clouds are sparse and unordered sets of points, many methods project the points onto regular grids such as voxels [47, 53], BEV pillars [19] or range projections [3, 13, 39]. The point clouds are discretized into 3D voxels [38, 53] and 3D CNNs are employed to extract voxel-wise features. However, 3D CNNs are computationally expensive and require large amounts of memory; to mitigate this problem, [5, 47] use sparse 3D CNNs [15] for efficient voxel processing. PointPillars [19] and PIXOR [48] project the LiDAR point cloud into a BEV map and employ 2D CNNs to reduce the computational cost; however, such a projection induces a loss of 3D information. To mitigate this issue, some methods [47, 51] compute voxel features using sparse convolutions, project the voxel features into BEV space, and finally predict the 3D bounding boxes in the BEV space. As this approach takes advantage of both the voxel and BEV representations, we test our network with the SECOND [47] and PointPillars [19] feature extraction networks. To achieve large receptive fields similar to point-based methods [29, 30], we model long-range interactions of local LiDAR features using a multi-scale deformable attention [55] block in our encoder to obtain global LiDAR features.
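As an illustration of the pillar-style BEV projection described above, the following is a minimal PyTorch sketch that scatters encoded per-pillar features back onto a dense BEV canvas, in the spirit of PointPillars; the grid size, channel count, and function name are assumptions for illustration.

```python
import torch

def scatter_pillars_to_bev(pillar_feats, coords, H=200, W=200):
    """Scatter encoded pillar features (P, C) onto a dense BEV canvas.

    coords holds the (row, col) BEV cell of each non-empty pillar; empty
    cells stay zero (collisions simply overwrite in this sketch)."""
    P, C = pillar_feats.shape
    canvas = torch.zeros(C, H * W, dtype=pillar_feats.dtype)
    flat_idx = coords[:, 0] * W + coords[:, 1]   # (P,) flattened cell index
    canvas[:, flat_idx] = pillar_feats.t()       # place each pillar's feature
    return canvas.view(C, H, W)                  # dense map, ready for a 2D CNN

# toy example: 5000 non-empty pillars with 64-dim encoded features
pillar_feats = torch.randn(5000, 64)
coords = torch.randint(0, 200, (5000, 2))        # (row, col) per pillar
bev = scatter_pillars_to_bev(pillar_feats, coords)
print(bev.shape)  # torch.Size([64, 200, 200])
```

The resulting (C, H, W) tensor is exactly what allows cheap 2D convolutions, while the per-pillar encoding before the scatter is what limits the 3D information loss relative to a raw height projection.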
Transformer-based methods. The detection heads of earlier approaches [7, 19, 35, 36, 50, 53] employ anchor boxes to predict objects; however, anchor boxes involve hand-crafted parameter tuning and are statistically derived from the dataset. To mitigate this issue, some approaches [5, 43, 48, 51] follow an anchor-free pipeline by computing per-pixel or per-pillar predictions, but they still rely on NMS to remove redundant boxes. DETR [2] is the first transformer architecture to formulate 2D detection as a direct set prediction problem, removing the need for NMS. Our network follows a similar formulation for 3D object detection.
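To make the set prediction formulation concrete, the sketch below shows the bipartite (Hungarian) matching at its core: each ground-truth box is matched to exactly one prediction, so duplicate predictions are penalised during training and NMS becomes unnecessary at inference. The centre-distance cost used here is a toy stand-in for the full classification-plus-box matching cost.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_centers, gt_centers):
    """One-to-one matching of predictions to ground truth.

    Uses centre distance as a toy cost; real set-prediction losses
    combine classification and box regression terms in the cost."""
    cost = torch.cdist(pred_centers, gt_centers)     # (num_pred, num_gt)
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
    return pred_idx, gt_idx  # matched pairs; unmatched preds -> "no object"

pred = torch.randn(300, 3)   # e.g. 300 query predictions (x, y, z centres)
gt = torch.randn(12, 3)      # 12 ground-truth objects in the scene
p, g = hungarian_match(pred, gt)
print(list(zip(p, g))[:3])   # each GT box is matched to exactly one query
```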
Some approaches [24, 27, 34] use transformers as feature extraction networks. 3DETR [25] is a fully
transformer-based architecture for 3D object detection using vanilla transformer [41] blocks with minimal modifications. 3DETR directly operates on and attends over points, whereas our approach voxelizes the points and attends over global BEV voxel features, which is computationally efficient for autonomous driving scenarios. 3DETR employs downsampling and set-aggregation operations [30] on the input points of indoor scenes because the computational complexity of self-attention increases quadratically, O(n²), with the number of input points. Moreover, 3DETR is effective on indoor datasets, where the points are dense and concentrated. Object-DGCNN [45] employs a graph-based model
for transformer-based 3D object detection for outdoor envi-
ronments. BoxeR [26] introduces a novel and simple Box-Attention that enables spatial interaction between grid features. BoxeR-2D enables end-to-end 2D object detection and segmentation, and can be extended to BoxeR-3D for end-to-end 3D object detection. VISTA [11] is a plug-and-play module that adaptively fuses multi-view features in a global spatial context and has been incorporated into [5, 51]. It introduces dual cross-view spatial attention to leverage the information in BEV and Range View (RV) features. We formulate our model with a voxel-BEV based CNN backbone architecture for local feature extraction and an attention-based
architecture for global feature extraction to increase the re-