LidarNAS: Unifying and Searching Neural
Architectures for 3D Point Clouds
Chenxi Liu, Zhaoqi Leng, Pei Sun, Shuyang Cheng, Charles R. Qi, Yin Zhou,
Mingxing Tan, and Dragomir Anguelov
Waymo LLC
{cxliu, lengzhaoqi, peis, shuyangcheng, rqi, yinzhou, tanmingxing,
dragomir}@waymo.com
Abstract. Developing neural models that accurately understand objects in 3D point clouds is essential for the success of robotics and autonomous driving. However, arguably due to the higher-dimensional nature of the data (as compared to images), existing neural architectures exhibit a large variety in their designs, including but not limited to the views considered, the format of the neural features, and the neural operations used. This lack of a unified framework and interpretation makes it hard to put these designs in perspective and to systematically explore new ones. In this paper, we begin by proposing such a unified framework, whose key idea is to factorize the neural network into a series of view transforms and neural layers. We demonstrate that this modular framework can reproduce a variety of existing works while allowing a fair comparison of backbone designs. We then show how this framework can easily materialize into a concrete neural architecture search (NAS) space, allowing a principled NAS-for-3D exploration. In performing evolutionary NAS on the 3D object detection task on the Waymo Open Dataset, not only do we outperform the state-of-the-art models, but we also report the interesting finding that NAS tends to discover the same macro-level architecture concept for both the vehicle and pedestrian classes.
1 Introduction
Being able to recognize, segment, or detect objects in 3D is one of the fundamental goals of computer vision. In this paper we consider the point cloud input representation, motivated by the wide usage of RGBD cameras in robotics applications as well as LiDAR sensors in autonomous driving. There has been a lot of research in this area, including various deep learning based approaches.
But which neural architecture should you choose? PointNet [33]? VoxelNet [59]? PointPillars [18]? Range Sparse Net [45]? It is easy to get overwhelmed by the diverse set of concepts present in these names as well as the variety in the architectures themselves.
This level of variety at the macro-level is not observed in other areas, e.g.,
neural architectures developed for 2D images. The root cause is the higher-
dimensional nature of the data. There are three major reasons in particular:
- Views: 2D images are captured by an egocentric photographer. A similar view exists for 3D, namely the perspective view, or range images. But when the scan is not egocentric, we have an unordered point set that can no longer be indexed by pixel coordinates. In addition, gravity makes the z axis special, and oftentimes a natural choice is to view an object from top-down. Each view has its unique properties and (dis)advantages.
- Sparsity: Images are dense in the sense that each pixel has an RGB value between 0 and 255. But in 3D, range images may have pixels that correspond to infinite depth. Also, objects typically occupy a small percentage of the space, meaning that when a scene is voxelized, the number of non-empty voxels is typically small compared with the total number of voxels (see the sketch after this list).
- Neural operations: Due to views and sparsity, 2D convolution does not always apply, resulting in more diverse neural operations.
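To make the sparsity point concrete, here is a minimal sketch (ours, not from the paper; the function name and scene dimensions are illustrative) that voxelizes a synthetic point cloud and reports the fraction of non-empty voxels:

```python
import numpy as np

def nonempty_voxel_fraction(points, voxel_size, lo, hi):
    """Voxelize a point cloud and report how few voxels are occupied."""
    dims = np.ceil((hi - lo) / voxel_size).astype(np.int64)
    idx = np.floor((points - lo) / voxel_size).astype(np.int64)
    # Keep only points inside the grid, then count distinct occupied voxels.
    valid = np.all((idx >= 0) & (idx < dims), axis=1)
    flat = np.ravel_multi_index(tuple(idx[valid].T), dims)
    return np.unique(flat).size / np.prod(dims)

# ~180k points in a 150m x 150m x 6m scene with 0.1m voxels.
rng = np.random.default_rng(0)
lo, hi = np.array([-75., -75., -2.]), np.array([75., 75., 4.])
pts = rng.uniform(lo, hi, size=(180_000, 3))
print(f"{nonempty_voxel_fraction(pts, 0.1, lo, hi):.2e}")  # on the order of 1e-3
```

Even with uniformly scattered points, occupancy is on the order of one in a thousand; real LiDAR returns cluster on surfaces, so the effective occupancy is lower still.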
Our first contribution in this paper is a unified framework that can interpret and organize the variety of neural architecture designs, while adhering to the principles listed above. This framework allows us to put existing designs in perspective and enables us to explore new designs. The key idea is to factorize the entire neural network into a series of transforms and layers. The framework supports four views (point, voxel, pillar, perspective) and two formats (dense, sparse), as well as the transforms between them. It is also possible to merge features from different views, building parallelism into the sequential stages. But once a view-format combination is set, it restricts the types of layers that can be applied. When visualized, this framework is a trellis, and any neural architecture corresponds to a connected subset of this trellis. We provide several examples of how popular architectures can be refactored and reproduced under this framework, demonstrating its generality.
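As a minimal sketch of what this factorization could look like in code (ours, not the paper's implementation; the layer names and the allowed-layer table are illustrative assumptions), an architecture is a connected path through view-format nodes, and each node constrains the layers that may follow:

```python
from dataclasses import dataclass
from typing import List

VIEWS = ("point", "voxel", "pillar", "perspective")
FORMATS = ("dense", "sparse")

# Illustrative assumption: which layer families are legal at each
# view-format node of the trellis.
ALLOWED_LAYERS = {
    ("point", "sparse"): ["mlp", "ball_query_pool"],
    ("voxel", "sparse"): ["sparse_conv3d"],
    ("voxel", "dense"): ["conv3d"],
    ("pillar", "sparse"): ["sparse_conv2d"],
    ("pillar", "dense"): ["conv2d"],
    ("perspective", "dense"): ["conv2d"],
}

@dataclass
class Stage:
    view: str      # one of VIEWS
    fmt: str       # one of FORMATS
    layer: str     # neural layer applied after transforming into (view, fmt)
    channels: int  # a micro-level knob

def is_valid(arch: List[Stage]) -> bool:
    # A valid architecture is a connected subset of the trellis whose
    # layers are legal for their view-format combinations.
    return all(s.layer in ALLOWED_LAYERS.get((s.view, s.fmt), []) for s in arch)

# An RSN-like sequential design: perspective-view processing first,
# then sparse voxel processing.
rsn_like = [
    Stage("perspective", "dense", "conv2d", 64),
    Stage("voxel", "sparse", "sparse_conv3d", 128),
]
assert is_valid(rsn_like)
```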
A direct benefit of this framework is that it can easily materialize into a search space, which immediately unlocks and enables NAS. NAS stands for neural architecture search [61], which tries to replace human labor and manual designs with machine computation and automatic discovery. Despite its success on 2D architectures [46], its usage on 3D has been limited. In this paper we conduct a principled NAS-for-3D exploration, by not only considering the micro-level (such as the number of channels), but also embracing the macro-level (such as transforms between various views and formats).
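Continuing the sketch above, a mutation operator over this space can touch either level: the micro-level by rescaling channels, or the macro-level by re-routing a stage to a different view-format node (again, the specific probabilities and choices below are illustrative assumptions):

```python
import random

def mutate(arch: List[Stage]) -> List[Stage]:
    """One random mutation, covering both the micro- and the macro-level."""
    child = [Stage(s.view, s.fmt, s.layer, s.channels) for s in arch]
    i = random.randrange(len(child))
    if random.random() < 0.5:
        # Micro-level: halve or double the channel count.
        child[i].channels = max(16, int(child[i].channels * random.choice([0.5, 2.0])))
    else:
        # Macro-level: move the stage to another view-format node and
        # pick a layer that is legal there.
        (view, fmt), layers = random.choice(list(ALLOWED_LAYERS.items()))
        child[i] = Stage(view, fmt, random.choice(layers), child[i].channels)
    return child
```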
We conduct our LidarNAS experiments on the 3D object detection task on the Waymo Open Dataset [44]. Using regularized evolution [36], our search finds LidarNASNet, which outperforms the state-of-the-art RSN model [45] on both the vehicle and the pedestrian classes. In addition to the superior accuracy and the competitive latency, there are also interesting observations about the LidarNASNet architecture itself. First of all, though the search / evolution was conducted separately on vehicle and pedestrian, the found architectures share essentially the same high-level design concept. Second, the modifications discovered by NAS coincidentally reflect ideas from human designs. We also analyze the hundreds of architectures sampled in the process and draw useful lessons that should inform future designs.
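For reference, the skeleton of regularized evolution [36] is short: tournament selection plus aging, where each cycle discards the oldest individual rather than the worst. The sketch below is ours and reuses the `mutate` operator from above; in the actual experiments, the fitness of a candidate would come from training and evaluating the detector it encodes.

```python
import collections
import random

def regularized_evolution(seed, fitness, cycles=300, pop_size=50, sample_size=10):
    population = collections.deque()
    history = []
    for _ in range(pop_size):
        arch = mutate(seed)
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        # Tournament selection: the best of a random sample becomes the parent.
        parent = max(random.sample(list(population), sample_size), key=lambda p: p[1])
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        population.popleft()  # aging: remove the oldest, not the worst
        history.append(population[-1])
    return max(history, key=lambda p: p[1])
```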
To summarize, the main contributions of this paper are:
- A unified framework general enough to include a wide range of backbones for 3D data processing
- A search space and an algorithm challenging enough to cover both the micro-level and the macro-level
- A successful NAS experiment which leads to state-of-the-art performance on the Waymo Open Dataset
2 Related Work
2.1 Neural Architectures for 3D
We partition neural architectures for 3D into four categories, according to the
primary view(s) used. Since this paper studies backbone design for 3D object
detection, we will mostly cover detection but will also talk about segmentation
and classification.
The first category is top-down primary, which includes voxel and pillar. The main idea is to divide the 3D points into 3D voxels [11,8,52,59,51,9] or 2D pillars [18], which form a regular grid. The advantage is that voxelization enables locality, which in turn enables convolution operations. But the main limitation is memory consumption, which grows cubically with grid resolution (or quadratically, for pillars). This either limits the maximum detection range or sacrifices the voxelization granularity. Even when sparse operations are used, for egocentric scans, the point densities at long range and short range are different, posing challenges in learning.
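A back-of-the-envelope calculation (our numbers, for illustration) makes the voxel-versus-pillar trade-off concrete: halving the voxel size multiplies the 3D voxel count by 8 but the 2D pillar count only by 4, since pillars do not grid the height dimension.

```python
def num_voxels(extent_xyz, v):
    x, y, z = extent_xyz
    return (x / v) * (y / v) * (z / v)

def num_pillars(extent_xyz, v):
    x, y, _ = extent_xyz
    return (x / v) * (y / v)

scene = (150.0, 150.0, 6.0)  # meters; an illustrative driving scene
for v in (0.2, 0.1):
    print(f"v={v}m: voxels={num_voxels(scene, v):.2e}, "
          f"pillars={num_pillars(scene, v):.2e}")
# v=0.2m: voxels=1.69e+07, pillars=5.62e+05
# v=0.1m: voxels=1.35e+08, pillars=2.25e+06
```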
The second category is point primary, which treats the point cloud as an unorganized set. Originally developed for classification and segmentation [33,34], the idea can also be used for detection [32,30]. The advantage is that it is more memory-friendly than voxelization based approaches. However, its limitation is that the neural layers do not perform as well, possibly due to irregular coordinates. In addition, to achieve locality, nearest neighbor search is typically needed on the input, which can be expensive.
The third category is perspective primary, operating directly on the range image [29,5,6,12]. This is also very memory-friendly and can utilize the powerful 2D convolution layers that have been extensively researched. However, as the depth can change drastically for adjacent pixels, these methods exhibit more difficulty in localizing objects accurately, as well as in handling occlusions.
The fourth and final category is fusion methods, which use two or more of the representations discussed above. The fusion may be either sequential or parallel. For example, RSN [45] sequentially performs foreground segmentation on the perspective view and delivers detection output on the top-down view. PVCNN [26] and SPVCNN [47] fuse information from the point view and the voxel view in a parallel fashion. MVF [58] fuses features from the perspective view, point view, and pillar view, also in a parallel fashion. The hope is that fusion methods can combine the best of multiple worlds, which is why it is important to keep all options open when doing architecture exploration.
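As an illustration of the parallel flavor, a PVCNN-style fusion can be sketched in a few lines (ours, heavily simplified: real implementations trilinearly interpolate voxel features back to points rather than taking the nearest voxel):

```python
import numpy as np

def parallel_fusion(point_feats, points, voxel_feats, voxel_size, origin):
    """Concatenate each point's feature with the feature of the voxel it
    falls into (nearest-voxel lookup as a stand-in for interpolation)."""
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    gathered = voxel_feats[idx[:, 0], idx[:, 1], idx[:, 2]]  # (N, C_voxel)
    return np.concatenate([point_feats, gathered], axis=-1)  # (N, C_pt + C_voxel)

pts = np.random.uniform(0.0, 9.9, (1000, 3))
fused = parallel_fusion(np.random.randn(1000, 16), pts,
                        np.random.randn(10, 10, 10, 32), 1.0, np.zeros(3))
assert fused.shape == (1000, 48)
```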
2.2 Neural Architecture Search
Early works on neural architecture search primarily focused on the search algorithm. A variety of methods were introduced, including reinforcement learning [61,3], evolution [37,36], performance prediction [24], and weight-sharing [31,25]. Essentially, different methods make different approximations about the search process.
These search algorithm explorations started on image classification. The following phase consisted of extending to other tasks, such as semantic segmentation [7,23] and object detection [50,14]. For 3D tasks, NAS research has been done on medical imaging [60,17,2,49,54]. However, volumetric CT scans are different from point clouds, and as a result the search space is greatly simplified. There are also works on 3D shape classification [27,19], but their overall frameworks do not go beyond that set by [25]. [47,20] are closer to our work, in the sense that they use NAS to optimize for segmentation and detection on 3D scenes (KITTI [13]). But generalizing the terminology used in [23], we believe there is also a two-level hierarchy in 3D neural architecture designs, with the outer macro-level controlling the views of the data / features, and the inner micro-level being the specifics of the neural layers. Under this terminology, [47,20] keep the macro-level fixed, while our search covers both.
3 Unifying Neural Architectures for 3D
3.1 Philosophy
In order to offer a unified interpretation of the growing variety of neural networks for 3D, we need to pinpoint their high-level design principles. Fortunately, we find these underlying principles to be surprisingly congruent, and we characterize them as: finding some neighborhood of the 3D points and then aggregating information within it. The “aggregation” part is typically done through some form of convolution and / or pooling. The “neighborhood” part has different choices, contrasted in the code sketch after this list:
- PointNet [33]: the neighborhood alternates between the point itself (MLP) and all points (max-pooling)
- PointNet++ [34]: the neighborhood is a Euclidean ball with a certain radius
- VoxelNet [59]: 3D neighborhood measured by Manhattan distance of Cartesian coordinates (x, y, z)
- PointPillars [18]: 2D neighborhood measured by Manhattan distance of (part of) Cartesian coordinates (x, y)
- LaserNet [29]: 2D neighborhood measured by Manhattan distance of pixel coordinates (i, j)
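The sketch below is ours, and it loosely approximates the Manhattan-distance neighborhoods with shared-cell membership:

```python
import numpy as np

def ball_neighbors(points, query, radius):
    """PointNet++-style: all points within a Euclidean ball around the query."""
    return np.where(np.linalg.norm(points - query, axis=1) <= radius)[0]

def cell_neighbors(points, query, cell_size, use_z=True):
    """VoxelNet-style (use_z=True) or PointPillars-style (use_z=False):
    points in the same quantized Cartesian cell as the query. The LaserNet
    case is analogous, with pixel coordinates (i, j) in place of (x, y)."""
    d = 3 if use_z else 2
    cells = np.floor(points[:, :d] / cell_size)
    return np.where(np.all(cells == np.floor(query[:d] / cell_size), axis=1))[0]

pts = np.random.default_rng(1).uniform(-5.0, 5.0, (2048, 3))
q = pts[0]
print(len(ball_neighbors(pts, q, 0.5)), len(cell_neighbors(pts, q, 0.5, use_z=False)))
```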
These common “neighborhood” choices have typically been expressed through the views of the data / features: point, voxel, pillar, perspective. We point out that more views have been and will be proposed, which is why we feel the “neighborhood” interpretation is more generic. Notably, different data views can transform between each other back and forth. However, once the data