Analyzing Deep Learning Representations of Point Clouds for Real-Time In-Vehicle LiDAR Perception

Marc Uecker, FZI Research Center for Information Technology, uecker@fzi.de
Tobias Fleck, FZI Research Center for Information Technology, tobias.fleck@fzi.de
Marcel Pflugfelder, Karlsruhe Institute of Technology, marcel.pflugfelder@student.kit.edu
J. Marius Zöllner, Karlsruhe Institute of Technology and FZI Research Center for Information Technology, zoellner@fzi.de

Machine Learning for Autonomous Driving Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, USA. arXiv:2210.14612v3 [cs.CV] 15 May 2023
Abstract

LiDAR sensors are an integral part of modern autonomous vehicles as they provide an accurate, high-resolution 3D representation of the vehicle's surroundings. However, it is computationally difficult to make use of the ever-increasing amounts of data from multiple high-resolution LiDAR sensors. As frame-rates, point cloud sizes and sensor resolutions increase, real-time processing of these point clouds must still extract semantics from this increasingly precise picture of the vehicle's environment. One deciding factor of the run-time performance and accuracy of deep neural networks operating on these point clouds is the underlying data representation and the way it is computed. In this work, we examine the relationship between the computational representations used in neural networks and their performance characteristics. To this end, we propose a novel computational taxonomy of LiDAR point cloud representations used in modern deep neural networks for 3D point cloud processing. Using this taxonomy, we perform a structured analysis of different families of approaches. Thereby, we uncover common advantages and limitations in terms of computational efficiency, memory requirements, and representational capacity as measured by semantic segmentation performance. Finally, we provide some insights and guidance for future developments in neural point cloud processing methods.
1 Introduction
Thanks to a large amount of investment into research from many stakeholders, the field of autonomous driving is progressing rapidly [1]. One area where this is particularly evident is the field of LiDAR processing, which has recently been attracting increasing attention from the computer vision and deep learning community [2]. Meanwhile, the sensor hardware is also evolving. As major players in the industry drive demand for lower-cost, high-resolution sensors, they are becoming more affordable and increasingly widespread [3]. Following this trend, recent research vehicles and prototypes are often equipped with multiple high-resolution LiDAR sensors [4]. Modern sensors are capable of delivering millions of points per second, at frame-rates at or above 10 Hz for each sensor [5].
[Figure 1: A categorization of common families of approaches using our proposed taxonomy for neural LiDAR pointcloud representations. The tree branches along four layers: spatial structure (explicit vs. implicit); rasterization dimensionality (explicit 3D rasterization, explicit 2D rasterization, implicit 3D rasterization, or no rasterization / Bag-of-Points); coordinate system (voxel-based, cylinder-based, Bird's Eye View (BEV) projection, spherical range-image projection, sparse Cartesian voxel-based, sparse cylinder-based); and feature aggregation (voxel CNN, cylinder-based CNN, BEV CNN, range image CNN, sparse voxel CNN, sparse cylinder CNN, Bag-of-Points with neighbors, pointcloud transformers).]
These fast and high-resolution sensors produce large amounts of data, which must be processed in real-time to be useful for the autonomous driving functions [6]. For many perception tasks which require semantic or geometric reasoning, such as object detection and semantic segmentation, only deep learning methods provide state-of-the-art processing capabilities [7–10]. However, many deep learning approaches which could be used to process LiDAR point clouds of such scale do not fulfill the real-time inference latency requirements for in-vehicle deployment [8]. We conjecture that the most important design decisions for inference run-time performance hinge on the underlying learned data representation. Multiple papers categorize approaches as either point-based, projection-based or sometimes voxel-based to simplify comparison against state-of-the-art approaches [11–13]. However, this categorization does not capture the full diversity of design decisions made in the development of new architectures. We also found no substantive, objective analysis or comparison of the impact of these design decisions on run-time performance, as each paper focuses on the approach presented.
In this work, we present a taxonomy of different architecture designs, based on design decisions
regarding the point cloud data representation. We categorize approaches by their choice of explicit or
implicit spatial structure, by their choice of internal representation dimensionality, their choice of
coordinate space, and finally by their chosen method of feature aggregation.
This taxonomy is described in detail in section 2. Using the introduced taxonomy, we analyze the
impact of these design decisions on the run-time performance characteristics in section 3. Finally,
based on this analysis, we also provide insights and recommendations for future work in section 4.
2 Taxonomy of neural representations for LiDAR point clouds
In this section we describe our proposed taxonomy in detail. The taxonomy is centered around
the design decisions during development, which lead to a final representation of the point cloud
inside a deep neural network. In figure 1 we illustrate the categorization of common pointcloud
representations using our taxonomy. Notably, the categorization shown in figure 1 is not exhaustive,
as there are many possible combinations of choices between the presented design decisions.
Spatial structure
The first design decision we observe (fig. 1, first layer) is the choice between an explicit or implicit multi-dimensional spatial arrangement of data in memory. An explicit spatial structure directly encodes positional information in the memory layout of the represented data. Typically, a rasterized representation of a point cloud can be indexed by a point's coordinate to receive its feature vector. In comparison, an implicit spatial structure stores the points' feature vectors in a sparse representation. In this case, the points' coordinates and/or a separate indexing data structure are often stored to encode positional information and accessed to extract neighborhood relations [14–16]. Figure 2 visualizes the difference for an exemplary one-dimensional point cloud.
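To make this distinction concrete, the following minimal sketch (in Python with NumPy; the variable names are our own illustration, not taken from any cited implementation) stores the 1D point cloud of figure 2 both ways: the explicit layout encodes position directly in the memory index, exhibiting aliasing and wasted empty cells, while the implicit layout keeps a dense feature array plus a coordinate array that must be searched to recover neighborhoods.

```python
import numpy as np

# The seven 1D points from figure 2, with their labels as stand-in "features".
coords = np.array([0.75, 4.5, 7.25, 5.75, 5.25, 1.5, 2.0])
feats = np.array(["A", "B", "C", "D", "E", "F", "G"], dtype=object)

# Explicit spatial structure: position is encoded in the memory layout.
# Rasterize into 8 cells of width 1; lookup by coordinate is a direct index.
grid = np.full(8, "", dtype=object)
for c, f in zip(coords, feats):
    grid[int(c)] = f  # D (x=5.75) and E (x=5.25) alias into cell 5

print(grid)  # ['A' 'F' 'G' '' 'B' 'E' '' 'C'] -- empty cells waste memory

# Implicit spatial structure: features are stored densely; the coordinates
# themselves must be kept and scanned to extract neighborhood relations.
query, radius = 5.0, 1.0
neighbors = feats[np.abs(coords - query) < radius]
print(neighbors)  # ['B' 'D' 'E'] -- found by searching all coordinates
```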
Rasterization dimensionality
The second design decision we observe (fig. 1, second layer) is the dimensionality of the internal mathematical representation of the point cloud. The main varieties we observe are three-dimensional voxel representations, two-dimensional projections of 3D space, and one-dimensional unsorted set- or list-based representations [17, 13, 18]. We refer to the one-dimensional representations as "Bag-of-Points", as their order is typically irrelevant for the operations performed on them.
[Figure 2: A comparison of explicit and implicit data representations. The example shows a 1-dimensional pointcloud on the top, and an explicit and implicit memory layout at the bottom. An example of aliasing can be observed, as points E and D collide into the same memory cell. Wasted memory space can be seen as empty memory cells.]
[Figure 3: A comparison of 3D voxel representations, 2D image representations and unordered 1-dimensional Bag-of-Points representations. As the resolution r becomes more fine-grained, the sparsity of rasterized representations increases. However, a coarse resolution may cause multiple points to collide within a single representation cell (as seen in the top right of the 2D projection).]
The multi-dimensional representations perform a rasterization of the space into a finite number of grid cells, aligning each memory cell to a section of 3D or 2D space. The decision of representation dimensionality is orthogonal to the memory layout, as multi-dimensional rasterizations can also be sparsely stored in one-dimensional data structures [15]. Figure 3 illustrates different rasterization dimensionalities.
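As a concrete sketch of these three dimensionalities (our own minimal Python/NumPy illustration on a synthetic cloud; the resolution, grid sizes, and array names are assumptions, not taken from the paper), the same point set can be rasterized into a 3D voxel grid, projected into a 2D BEV grid, or kept as a 1D Bag-of-Points, and the same 3D rasterization can equally be stored sparsely:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 10.0, size=(1000, 3))  # synthetic (x, y, z) cloud
r = 1.0                                          # cell resolution (assumed)

# 3D voxel rasterization: one point count per (x, y, z) cell.
idx3 = np.floor(points / r).astype(int)          # each point's voxel index
voxels = np.zeros((10, 10, 10), dtype=int)
np.add.at(voxels, (idx3[:, 0], idx3[:, 1], idx3[:, 2]), 1)

# 2D bird's-eye-view rasterization: project along z, keep max height per cell.
bev = np.full((10, 10), -np.inf)
np.maximum.at(bev, (idx3[:, 0], idx3[:, 1]), points[:, 2])

# 1D Bag-of-Points: the (N, 3) array itself; point order is irrelevant.
bag = points

# Orthogonal memory-layout choice: the same 3D rasterization stored sparsely,
# as unique occupied voxel indices plus a per-voxel feature (here: counts).
occupied, counts = np.unique(idx3, axis=0, return_counts=True)
print(f"dense cells: {voxels.size}, occupied: {len(occupied)} "
      f"({100 * len(occupied) / voxels.size:.0f}% occupancy)")
```

Note how the sparse storage at the end illustrates the orthogonality mentioned above: the rasterization is three-dimensional, but the occupied cells live in a one-dimensional list.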
Coordinate system
The third design decision we observe (fig. 1, third layer) concerns the choice of coordinate system used for rasterizations of multi-dimensional spaces. Rasterization divides 2D or 3D space into chunks of finite size. This partition is typically performed along regular intervals across coordinate axes. Therefore, the coordinate axes chosen for this division also impact how the resulting representation partitions 3D space. Here, we mainly differentiate between Cartesian coordinate systems, which refer to absolute positions in 3D Euclidean space [12], and polar coordinate systems, which refer to locations by combinations of angles and distance measurements [19, 20]. Figure 4 (left) illustrates how different coordinate systems can lead to different rasterizations of two-dimensional space.
Spherical coordinates are an extension of polar coordinates which uses two angles and one distance measurement to index 3D space. Projecting spherical coordinates along the radial axis results in a range image 2D representation. There are also various coordinate systems which combine polar and Cartesian geometry for different axes. Figure 4 shows an example of this: Cylinder coordinates use a polar coordinate system for the X-Y plane and a Cartesian axis for the Z-direction. Similarly, some approaches use a polar coordinate system in a 2D Bird's eye view (BEV) projection, which projects points along a Cartesian z-axis [19]. For polar coordinate systems, the coordinate origin is typically chosen as the center of the LiDAR sensor in order to minimize aliasing [19, 20].
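The following sketch (our own Python/NumPy illustration; the function names and cell sizes are hypothetical, not from any cited approach) shows how the same sensor-centered Cartesian points map to cylinder cells and, via spherical coordinates, to range-image pixels:

```python
import numpy as np

def cylindrical_cells(points, d_phi, d_rho, d_z):
    """Rasterize sensor-centered points into cylinder cells: polar
    (azimuth phi, radius rho) in the x-y plane, Cartesian along z."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    phi = np.arctan2(y, x)            # azimuth angle in (-pi, pi]
    rho = np.hypot(x, y)              # distance within the x-y plane
    # Negative indices would be shifted by a constant offset in practice.
    return np.stack([np.floor(phi / d_phi),
                     np.floor(rho / d_rho),
                     np.floor(z / d_z)], axis=1).astype(int)

def range_image_pixels(points, d_phi, d_theta):
    """Project points with spherical coordinates along the radial axis,
    yielding 2D pixel indices of a range image plus the range itself."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)  # radial distance to the sensor
    phi = np.arctan2(y, x)              # azimuth -> image column
    theta = np.arcsin(z / r)            # elevation -> image row
    cols = np.floor(phi / d_phi).astype(int)
    rows = np.floor(theta / d_theta).astype(int)
    return rows, cols, r

pts = np.array([[5.0, 5.0, -1.0], [10.0, -2.0, 0.5]])
print(cylindrical_cells(pts, d_phi=np.radians(1.0), d_rho=0.5, d_z=0.25))
print(range_image_pixels(pts, d_phi=np.radians(0.2), d_theta=np.radians(0.4)))
```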
Feature aggregation
As a final design decision (fig. 1, fourth layer), we differentiate approaches by their choice of mathematical operation to be applied to compute the resulting point cloud representation. To compute a feature representation for a point in a 3D point cloud, almost all deep learning approaches aggregate information about its local or global neighborhood. This aggregation typically requires finding other points within the neighborhood of the point whose features are to be computed. Next, these features are aggregated using parametric or non-parametric mathematical operations. The decision for a rasterized representation often directly leads to the use of convolutions [13, 11], although other operations are certainly possible. For Bag-of-Points representations, the choice of the feature aggregation method becomes the main differentiating factor. Some approaches use convolutions in non-rasterized spaces [21, 16, 22], while others perform feature aggregation through different variants of weighted or non-weighted pooling of local neighbors [23, 14, 18]. For brevity, we group these into neighbor-based approaches in figure 1 (bottom right). A final group of approaches, point cloud transformers, exists for smaller point clouds but has not yet found application in the large-scale point clouds used in autonomous driving. These approaches use local or global (point cloud-wide) attention mechanisms to exchange features of points inside a point cloud [24].
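As an illustration of one such non-parametric, neighbor-based aggregation (a brute-force sketch in Python/NumPy under our own assumptions, not the method of any specific cited approach), each point's output feature below is max-pooled over the features of all points within a fixed radius:

```python
import numpy as np

def neighbor_maxpool(coords, feats, radius):
    """Non-parametric feature aggregation for a Bag-of-Points
    representation: each point's new feature is the element-wise
    maximum over the features of all points within `radius`.
    O(N^2) brute force; real systems use grids or KD-trees to
    find neighbors efficiently."""
    diffs = coords[:, None, :] - coords[None, :, :]    # (N, N, 3)
    within = np.linalg.norm(diffs, axis=-1) <= radius  # (N, N) mask
    out = np.empty_like(feats)
    for i in range(len(coords)):
        # Each point is its own neighbor, so the pool is never empty.
        out[i] = feats[within[i]].max(axis=0)
    return out

rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(200, 3))
feats = rng.normal(size=(200, 8))                      # per-point features
pooled = neighbor_maxpool(coords, feats, radius=1.5)
print(pooled.shape)                                    # (200, 8)
```

A parametric variant would weight or transform the neighbor features before pooling, and replacing the maximum with a learned, input-dependent weighting is essentially the attention mechanism used by the point cloud transformers described above.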