– Views: 2D images are captured by an egocentric photographer. A similar view exists in 3D, namely the perspective view, or range images. But when the scan is not egocentric, we have an unordered point set that can no longer be indexed by pixel coordinates. In addition, gravity makes the z-axis special, and a natural choice is often to view an object top-down. Each view has its own properties and (dis)advantages.
– Sparsity: Images are dense in the sense that each pixel has an RGB value between 0 and 255. But in 3D, range images may have pixels that correspond to infinite depth. Moreover, objects typically occupy a small percentage of the space, meaning that when a scene is voxelized, the number of non-empty voxels is small compared with the total number of voxels (see the sketch after this list).
– Neural operations: Owing to the variety of views and the sparsity, 2D convolution does not always apply, resulting in a more diverse set of neural operations.
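To make the sparsity point concrete, below is a minimal, self-contained sketch (illustrative only; the scene extents, resolution, and synthetic points are hypothetical, not our experimental settings) that voxelizes a point cloud and reports the fraction of non-empty voxels:

```python
# Illustrative sketch: measure how sparse a voxelized scene is.
import numpy as np

def nonempty_voxel_fraction(points, grid_min, grid_max, voxel_size):
    """Fraction of voxels that contain at least one point."""
    dims = np.ceil((grid_max - grid_min) / voxel_size).astype(int)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)  # keep in-grid points
    flat = np.ravel_multi_index(idx[inside].T, dims)    # linearize voxel ids
    return np.unique(flat).size / np.prod(dims)

# Hypothetical example: ~180k points in a 150m x 150m x 6m scene at 0.1m.
points = np.random.uniform([-75, -75, -2], [75, 75, 4], size=(180_000, 3))
frac = nonempty_voxel_fraction(points,
                               grid_min=np.array([-75.0, -75.0, -2.0]),
                               grid_max=np.array([75.0, 75.0, 4.0]),
                               voxel_size=0.1)
print(f"non-empty voxels: {frac:.6%}")  # far below 1% of all voxels
```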
Our first contribution in this paper is a unified framework that can interpret and organize the variety of neural architecture designs while adhering to the principles listed above. This framework allows us to put existing designs in perspective and enables us to explore new designs. The key idea is to factorize the entire neural network into a series of transforms and layers. The framework supports four views (point, voxel, pillar, perspective) and two formats (dense, sparse), as well as the transforms between them. It is also possible to merge features from different views, building parallelism into the sequential stages. But once a view-format combination is set, it restricts the types of layers that can be applied. When visualized, this framework is a trellis, and any neural architecture corresponds to a connected subset of this trellis. We provide several examples of how popular architectures can be refactored and reproduced under this framework, demonstrating its generality.
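As a minimal illustration (the node names and the legal-layer mapping below are a simplified subset for exposition, not our full specification), the trellis can be sketched as view-format nodes, where fixing a combination determines which layer types are legal:

```python
# Illustrative sketch of the trellis: nodes are view-format combinations,
# and an architecture is a connected sequence of stages over these nodes.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    view: str  # "point", "voxel", "pillar", or "perspective"
    fmt: str   # "dense" or "sparse"

# Once a view-format combination is fixed, only certain layers apply
# (a simplified subset of the possible pairings):
LEGAL_LAYERS = {
    Node("point", "sparse"): ("mlp",),
    Node("voxel", "sparse"): ("sparse_conv3d",),
    Node("voxel", "dense"): ("conv3d",),
    Node("pillar", "dense"): ("conv2d",),
    Node("perspective", "dense"): ("conv2d",),
}

def check(architecture):
    """Verify each stage uses a layer legal under its view-format node."""
    for node, layer in architecture:
        assert layer in LEGAL_LAYERS[node], f"{layer} illegal under {node}"

# Example: featurize raw points, then transform to pillars for 2D convolution.
check([
    (Node("point", "sparse"), "mlp"),
    (Node("pillar", "dense"), "conv2d"),
])
```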
A direct benefit of this framework is that it can easily materialize into a search space, which immediately enables NAS. NAS stands for neural architecture search [61], which tries to replace human labor and manual designs with machine computation and automatic discoveries. Despite its success on 2D architectures [46], its use in 3D has been limited. In this paper we conduct a principled NAS-for-3D exploration, not only considering the micro-level (such as the number of channels) but also embracing the macro-level (such as transforms between various views and formats).
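As a schematic example of what such micro- and macro-level mutations can look like (the concrete mutation set below is illustrative, not our exact search space):

```python
# Illustrative sketch: one mutation step over a staged architecture.
import random

def mutate(arch):
    """arch: list of stages, e.g. {"view": "pillar", "fmt": "dense", "channels": 64}."""
    arch = [dict(stage) for stage in arch]  # copy before mutating
    stage = random.choice(arch)
    if random.random() < 0.5:
        # Micro-level: perturb the number of channels.
        stage["channels"] = max(16, stage["channels"] + random.choice([-16, 16]))
    else:
        # Macro-level: switch the stage to another view-format combination,
        # implying a transform from the preceding stage's view and format.
        stage["view"], stage["fmt"] = random.choice([
            ("voxel", "sparse"), ("pillar", "dense"), ("perspective", "dense"),
        ])
    return arch
```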
We conduct our LidarNAS experiments on the 3D object detection task on the Waymo Open Dataset [44]. Using regularized evolution [36], our search finds LidarNASNet, which outperforms the state-of-the-art RSN model [45] on both the vehicle and the pedestrian classes. Beyond the superior accuracy and the competitive latency, the LidarNASNet architecture itself offers interesting observations. First, although the search / evolution was conducted separately for vehicles and pedestrians, the architectures found share essentially the same high-level design. Second, the modifications discovered by NAS coincidentally reflect ideas from human designs. We also analyze the hundreds of architectures sampled during the search and draw useful lessons that should inform future designs.
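For reference, a minimal sketch of the regularized evolution loop [36] as it could drive such a search (random_arch, mutate, and fitness below are toy stand-ins, not our actual search-space hooks or objective):

```python
# Illustrative sketch of regularized evolution: mutate the best of a random
# sample and always retire the oldest individual (aging).
import collections
import random

def random_arch():
    return [random.randint(16, 128) for _ in range(4)]   # toy: channel widths

def mutate(arch):
    child = list(arch)
    i = random.randrange(len(child))
    child[i] = max(16, child[i] + random.choice([-16, 16]))
    return child

def fitness(arch):
    return -abs(sum(arch) - 256)   # toy objective standing in for detection AP

def regularized_evolution(cycles=200, population_size=20, sample_size=5):
    population = collections.deque()
    history = []
    while len(population) < population_size:      # seed with random archs
        arch = random_arch()
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])  # best of the sample
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()                      # aging: drop the oldest
    return max(history, key=lambda p: p[1])

print(regularized_evolution())
```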