– Views: 2D images are captured by an egocentric photographer. A similar view exists in 3D, namely the perspective view, or range images. But when the scan is not egocentric, we have an unordered point set that can no longer be indexed by pixel coordinates. In addition, gravity makes the z-axis special, and a natural choice is often to view an object top-down. Each view has its own properties and (dis)advantages.
– Sparsity: Images are dense in the sense that each pixel has an RGB value between 0 and 255. But in 3D, range images may have pixels that correspond to infinite depth. Moreover, objects typically occupy a small percentage of the space, meaning that when a scene is voxelized, the number of non-empty voxels is small compared with the total number of voxels (see the sketch after this list).
– Neural operations: Owing to the variety of views and the sparsity, 2D convolution does not always apply, resulting in a more diverse set of neural operations.
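To make the sparsity point concrete, below is a minimal, self-contained sketch (illustrative only; the scene extents, resolution, and synthetic points are hypothetical, not our experimental settings) that voxelizes a point cloud and reports the fraction of non-empty voxels:

```python
# Illustrative sketch: measure how sparse a voxelized scene is.
import numpy as np

def nonempty_voxel_fraction(points, grid_min, grid_max, voxel_size):
    """Fraction of voxels that contain at least one point."""
    dims = np.ceil((grid_max - grid_min) / voxel_size).astype(int)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)  # keep in-grid points
    flat = np.ravel_multi_index(idx[inside].T, dims)    # linearize voxel ids
    return np.unique(flat).size / np.prod(dims)

# Hypothetical example: ~180k points in a 150m x 150m x 6m scene at 0.1m.
points = np.random.uniform([-75, -75, -2], [75, 75, 4], size=(180_000, 3))
frac = nonempty_voxel_fraction(points,
                               grid_min=np.array([-75.0, -75.0, -2.0]),
                               grid_max=np.array([75.0, 75.0, 4.0]),
                               voxel_size=0.1)
print(f"non-empty voxels: {frac:.6%}")  # far below 1% of all voxels
```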
Our first contribution in this paper is a unified framework that can interpret and organize the variety of neural architecture designs while adhering to the principles listed above. This framework allows us to put existing designs in perspective and enables us to explore new designs. The key idea is to factorize the entire neural network into a series of transforms and layers. The framework supports four views (point, voxel, pillar, perspective) and two formats (dense, sparse), as well as the transforms between them. It is also possible to merge features from different views, building parallelism into the sequential stages. But once a view-format combination is set, it restricts the types of layers that can be applied. When visualized, this framework is a trellis, and any neural architecture corresponds to a connected subset of this trellis. We provide several examples of how popular architectures can be refactored and reproduced under this framework, demonstrating its generality.
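As a minimal illustration (the node names and the legal-layer mapping below are a simplified subset for exposition, not our full specification), the trellis can be sketched as view-format nodes, where fixing a combination determines which layer types are legal:

```python
# Illustrative sketch of the trellis: nodes are view-format combinations,
# and an architecture is a connected sequence of stages over these nodes.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    view: str  # "point", "voxel", "pillar", or "perspective"
    fmt: str   # "dense" or "sparse"

# Once a view-format combination is fixed, only certain layers apply
# (a simplified subset of the possible pairings):
LEGAL_LAYERS = {
    Node("point", "sparse"): ("mlp",),
    Node("voxel", "sparse"): ("sparse_conv3d",),
    Node("voxel", "dense"): ("conv3d",),
    Node("pillar", "dense"): ("conv2d",),
    Node("perspective", "dense"): ("conv2d",),
}

def check(architecture):
    """Verify each stage uses a layer legal under its view-format node."""
    for node, layer in architecture:
        assert layer in LEGAL_LAYERS[node], f"{layer} illegal under {node}"

# Example: featurize raw points, then transform to pillars for 2D convolution.
check([
    (Node("point", "sparse"), "mlp"),
    (Node("pillar", "dense"), "conv2d"),
])
```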
A direct benefit of this framework is that it can easily materialize into a search space, which immediately enables NAS. NAS stands for neural architecture search [61], which tries to replace human labor and manual designs with machine computation and automatic discoveries. Despite its success on 2D architectures [46], its use in 3D has been limited. In this paper we conduct a principled NAS-for-3D exploration, not only considering the micro-level (such as the number of channels) but also embracing the macro-level (such as transforms between various views and formats).
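As a schematic example of what such micro- and macro-level mutations can look like (the concrete mutation set below is illustrative, not our exact search space):

```python
# Illustrative sketch: one mutation step over a staged architecture.
import random

def mutate(arch):
    """arch: list of stages, e.g. {"view": "pillar", "fmt": "dense", "channels": 64}."""
    arch = [dict(stage) for stage in arch]  # copy before mutating
    stage = random.choice(arch)
    if random.random() < 0.5:
        # Micro-level: perturb the number of channels.
        stage["channels"] = max(16, stage["channels"] + random.choice([-16, 16]))
    else:
        # Macro-level: switch the stage to another view-format combination,
        # implying a transform from the preceding stage's view and format.
        stage["view"], stage["fmt"] = random.choice([
            ("voxel", "sparse"), ("pillar", "dense"), ("perspective", "dense"),
        ])
    return arch
```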
We conduct our LidarNAS experiments on the 3D object detection task on the Waymo Open Dataset [44]. Using regularized evolution [36], our search finds LidarNASNet, which outperforms the state-of-the-art RSN model [45] on both the vehicle and the pedestrian classes. Beyond the superior accuracy and the competitive latency, the LidarNASNet architecture itself offers interesting observations. First, although the search / evolution was conducted separately for vehicles and pedestrians, the architectures found share essentially the same high-level design. Second, the modifications discovered by NAS coincidentally reflect ideas from human designs. We also analyze the hundreds of architectures sampled during the search and draw useful lessons that should inform future designs.
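For reference, a minimal sketch of the regularized evolution loop [36] as it could drive such a search (random_arch, mutate, and fitness below are toy stand-ins, not our actual search-space hooks or objective):

```python
# Illustrative sketch of regularized evolution: mutate the best of a random
# sample and always retire the oldest individual (aging).
import collections
import random

def random_arch():
    return [random.randint(16, 128) for _ in range(4)]   # toy: channel widths

def mutate(arch):
    child = list(arch)
    i = random.randrange(len(child))
    child[i] = max(16, child[i] + random.choice([-16, 16]))
    return child

def fitness(arch):
    return -abs(sum(arch) - 256)   # toy objective standing in for detection AP

def regularized_evolution(cycles=200, population_size=20, sample_size=5):
    population = collections.deque()
    history = []
    while len(population) < population_size:      # seed with random archs
        arch = random_arch()
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        sample = random.sample(list(population), sample_size)
        parent = max(sample, key=lambda p: p[1])  # best of the sample
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()                      # aging: drop the oldest
    return max(history, key=lambda p: p[1])

print(regularized_evolution())
```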