Analyzing Deep Learning Representations of Point Clouds for Real-Time In-Vehicle LiDAR Perception

Marc Uecker, FZI Research Center for Information Technology, uecker@fzi.de
Tobias Fleck, FZI Research Center for Information Technology, tobias.fleck@fzi.de
Marcel Pflugfelder, Karlsruhe Institute of Technology, marcel.pflugfelder@student.kit.edu
J. Marius Zöllner, Karlsruhe Institute of Technology and FZI Research Center for Information Technology, zoellner@fzi.de

Machine Learning for Autonomous Driving Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, USA. arXiv:2210.14612v3 [cs.CV] 15 May 2023
Abstract

LiDAR sensors are an integral part of modern autonomous vehicles as they provide an accurate, high-resolution 3D representation of the vehicle's surroundings. However, it is computationally difficult to make use of the ever-increasing amounts of data from multiple high-resolution LiDAR sensors. As frame-rates, point cloud sizes and sensor resolutions increase, real-time processing of these point clouds must still extract semantics from this increasingly precise picture of the vehicle's environment. One deciding factor of the run-time performance and accuracy of deep neural networks operating on these point clouds is the underlying data representation and the way it is computed. In this work, we examine the relationship between the computational representations used in neural networks and their performance characteristics. To this end, we propose a novel computational taxonomy of LiDAR point cloud representations used in modern deep neural networks for 3D point cloud processing. Using this taxonomy, we perform a structured analysis of different families of approaches. Thereby, we uncover common advantages and limitations in terms of computational efficiency, memory requirements, and representational capacity as measured by semantic segmentation performance. Finally, we provide some insights and guidance for future developments in neural point cloud processing methods.
1 Introduction
Thanks to a large amount of investment into research from many stakeholders, the field of autonomous driving is progressing rapidly [1]. One area where this is particularly evident is the field of LiDAR processing, which has recently been attracting increasing attention from the computer vision and deep learning community [2]. Meanwhile, the sensor hardware is also evolving. As major players in the industry drive demand for lower-cost, high-resolution sensors, they are becoming more affordable and increasingly widespread [3]. Following this trend, recent research vehicles and prototypes are often equipped with multiple high-resolution LiDAR sensors [4]. Modern sensors are capable of delivering millions of points per second, at frame-rates at or above 10 Hz for each sensor [5].
[Figure 1: A categorization of common families of approaches using our proposed taxonomy for neural LiDAR pointcloud representations. The tree branches along four layers: spatial structure (explicit vs. implicit); rasterization dimensionality (explicit 3D rasterization, explicit 2D rasterization, implicit 3D rasterization, or no rasterization / Bag-of-Points); coordinate system (voxel-based, cylinder-based, Bird's Eye View (BEV) projection, spherical range-image projection, sparse Cartesian voxel-based, sparse cylinder-based); and feature aggregation (voxel CNN, cylinder-based CNN, BEV CNN, range image CNN, sparse voxel CNN, sparse cylinder CNN, Bag-of-Points with neighbors, pointcloud transformers).]
These fast and high-resolution sensors produce large amounts of data, which must be processed in real-time to be useful for the autonomous driving functions [6]. For many perception tasks which require semantic or geometric reasoning, such as object detection and semantic segmentation, only deep learning methods provide state-of-the-art processing capabilities [7–10]. However, many deep learning approaches which could be used to process LiDAR point clouds of such scale do not fulfill the real-time inference latency requirements for in-vehicle deployment [8]. We conjecture that the most important design decisions for inference run-time performance hinge on the underlying learned data representation. Multiple papers categorize approaches as either point-based, projection-based or sometimes voxel-based to simplify comparison against state-of-the-art approaches [11–13]. However, this categorization does not capture the full diversity of design decisions made in the development of new architectures. We also found no substantive, objective analysis or comparison of the impact of these design decisions on run-time performance, as each paper focuses on the approach presented.
In this work, we present a taxonomy of different architecture designs, based on design decisions
regarding the point cloud data representation. We categorize approaches by their choice of explicit or
implicit spatial structure, by their choice of internal representation dimensionality, their choice of
coordinate space, and finally by their chosen method of feature aggregation.
This taxonomy is described in detail in section 2. Using the introduced taxonomy, we analyze the
impact of these design decisions on the run-time performance characteristics in section 3. Finally,
based on this analysis, we also provide insights and recommendations for future work in section 4.
2 Taxonomy of neural representations for LiDAR point clouds
In this section we describe our proposed taxonomy in detail. The taxonomy is centered around
the design decisions during development, which lead to a final representation of the point cloud
inside a deep neural network. In figure 1 we illustrate the categorization of common pointcloud
representations using our taxonomy. Notably, the categorization shown in figure 1 is not exhaustive,
as there are many possible combinations of choices between the presented design decisions.
Spatial structure
The first design decision we observe (fig. 1, first layer) is the choice between an explicit or implicit multi-dimensional spatial arrangement of data in memory. An explicit spatial structure directly encodes positional information in the memory layout of the represented data. Typically, a rasterized representation of a point cloud can be indexed by a point's coordinate to receive its feature vector. In comparison, an implicit spatial structure stores the points' feature vectors in a sparse representation. In this case, the points' coordinates and/or a separate indexing data structure are often stored to encode positional information and accessed to extract neighborhood relations [14–16]. Figure 2 visualizes the difference for an exemplary one-dimensional point cloud.
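To make this distinction concrete, the following minimal sketch (in Python with NumPy; the variable names are our own illustration, not taken from any cited implementation) stores the 1D point cloud of figure 2 both ways: the explicit layout encodes position directly in the memory index, exhibiting aliasing and wasted empty cells, while the implicit layout keeps a dense feature array plus a coordinate array that must be searched to recover neighborhoods.

```python
import numpy as np

# The seven 1D points from figure 2, with their labels as stand-in "features".
coords = np.array([0.75, 4.5, 7.25, 5.75, 5.25, 1.5, 2.0])
feats = np.array(["A", "B", "C", "D", "E", "F", "G"], dtype=object)

# Explicit spatial structure: position is encoded in the memory layout.
# Rasterize into 8 cells of width 1; lookup by coordinate is a direct index.
grid = np.full(8, "", dtype=object)
for c, f in zip(coords, feats):
    grid[int(c)] = f  # D (x=5.75) and E (x=5.25) alias into cell 5

print(grid)  # ['A' 'F' 'G' '' 'B' 'E' '' 'C'] -- empty cells waste memory

# Implicit spatial structure: features are stored densely; the coordinates
# themselves must be kept and scanned to extract neighborhood relations.
query, radius = 5.0, 1.0
neighbors = feats[np.abs(coords - query) < radius]
print(neighbors)  # ['B' 'D' 'E'] -- found by searching all coordinates
```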
Rasterization dimensionality
The second design decision we observe (fig. 1, second layer) is the dimensionality of the internal mathematical representation of the point cloud. The main varieties we observe are three-dimensional voxel representations, two-dimensional projections of 3D space, and one-dimensional unsorted set- or list-based representations [17, 13, 18]. We refer to the one-dimensional representations as "Bag-of-Points", as their order is typically irrelevant for the operations performed on them.
[Figure 2: A comparison of explicit and implicit data representations. The example shows a 1-dimensional pointcloud on the top, and an explicit and implicit memory layout at the bottom. An example of aliasing can be observed, as points E and D collide into the same memory cell. Wasted memory space can be seen as empty memory cells.]
[Figure 3: A comparison of 3D voxel representations, 2D image representations and unordered 1-dimensional Bag-of-Points representations. As the resolution r becomes more fine-grained, the sparsity of rasterized representations increases. However, a coarse resolution may cause multiple points to collide within a single representation cell (as seen in the top right of the 2D projection).]
The multi-dimensional representations perform a rasterization of the space into a finite number of grid cells, aligning each memory cell to a section of 3D or 2D space. The decision of representation dimensionality is orthogonal to the memory layout, as multi-dimensional rasterizations can also be sparsely stored in one-dimensional data structures [15]. Figure 3 illustrates different rasterization dimensionalities.
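As a concrete sketch of these three dimensionalities (our own minimal Python/NumPy illustration on a synthetic cloud; the resolution, grid sizes, and array names are assumptions, not taken from the paper), the same point set can be rasterized into a 3D voxel grid, projected into a 2D BEV grid, or kept as a 1D Bag-of-Points, and the same 3D rasterization can equally be stored sparsely:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 10.0, size=(1000, 3))  # synthetic (x, y, z) cloud
r = 1.0                                          # cell resolution (assumed)

# 3D voxel rasterization: one point count per (x, y, z) cell.
idx3 = np.floor(points / r).astype(int)          # each point's voxel index
voxels = np.zeros((10, 10, 10), dtype=int)
np.add.at(voxels, (idx3[:, 0], idx3[:, 1], idx3[:, 2]), 1)

# 2D bird's-eye-view rasterization: project along z, keep max height per cell.
bev = np.full((10, 10), -np.inf)
np.maximum.at(bev, (idx3[:, 0], idx3[:, 1]), points[:, 2])

# 1D Bag-of-Points: the (N, 3) array itself; point order is irrelevant.
bag = points

# Orthogonal memory-layout choice: the same 3D rasterization stored sparsely,
# as unique occupied voxel indices plus a per-voxel feature (here: counts).
occupied, counts = np.unique(idx3, axis=0, return_counts=True)
print(f"dense cells: {voxels.size}, occupied: {len(occupied)} "
      f"({100 * len(occupied) / voxels.size:.0f}% occupancy)")
```

Note how the sparse storage at the end illustrates the orthogonality mentioned above: the rasterization is three-dimensional, but the occupied cells live in a one-dimensional list.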
Coordinate system
The third design decision we observe (fig. 1, third layer) concerns the choice of coordinate system used for rasterizations of multi-dimensional spaces. Rasterization divides 2D or 3D space into chunks of finite size. This partition is typically performed along regular intervals across coordinate axes. Therefore, the coordinate axes chosen for this division also impact how the resulting representation partitions 3D space. Here, we mainly differentiate between Cartesian coordinate systems, which refer to absolute positions in 3D Euclidean space [12], and polar coordinate systems, which refer to locations by combinations of angles and distance measurements [19, 20]. Figure 4 (left) illustrates how different coordinate systems can lead to different rasterizations of two-dimensional space.
Spherical coordinates are an extension of polar coordinates which uses two angles and one distance measurement to index 3D space. Projecting spherical coordinates along the radial axis results in a range image 2D representation. There are also various coordinate systems which combine polar and Cartesian geometry for different axes. Figure 4 shows an example of this: Cylinder coordinates use a polar coordinate system for the X-Y plane and a Cartesian axis for the Z-direction. Similarly, some approaches use a polar coordinate system in a 2D Bird's eye view (BEV) projection, which projects points along a Cartesian z-axis [19]. For polar coordinate systems, the coordinate origin is typically chosen as the center of the LiDAR sensor in order to minimize aliasing [19, 20].
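The following sketch (our own Python/NumPy illustration; the function names and cell sizes are hypothetical, not from any cited approach) shows how the same sensor-centered Cartesian points map to cylinder cells and, via spherical coordinates, to range-image pixels:

```python
import numpy as np

def cylindrical_cells(points, d_phi, d_rho, d_z):
    """Rasterize sensor-centered points into cylinder cells: polar
    (azimuth phi, radius rho) in the x-y plane, Cartesian along z."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    phi = np.arctan2(y, x)            # azimuth angle in (-pi, pi]
    rho = np.hypot(x, y)              # distance within the x-y plane
    # Negative indices would be shifted by a constant offset in practice.
    return np.stack([np.floor(phi / d_phi),
                     np.floor(rho / d_rho),
                     np.floor(z / d_z)], axis=1).astype(int)

def range_image_pixels(points, d_phi, d_theta):
    """Project points with spherical coordinates along the radial axis,
    yielding 2D pixel indices of a range image plus the range itself."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)  # radial distance to the sensor
    phi = np.arctan2(y, x)              # azimuth -> image column
    theta = np.arcsin(z / r)            # elevation -> image row
    cols = np.floor(phi / d_phi).astype(int)
    rows = np.floor(theta / d_theta).astype(int)
    return rows, cols, r

pts = np.array([[5.0, 5.0, -1.0], [10.0, -2.0, 0.5]])
print(cylindrical_cells(pts, d_phi=np.radians(1.0), d_rho=0.5, d_z=0.25))
print(range_image_pixels(pts, d_phi=np.radians(0.2), d_theta=np.radians(0.4)))
```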
Feature aggregation
As a final design decision (fig. 1, fourth layer), we differentiate approaches by their choice of mathematical operation to be applied to compute the resulting point cloud representation. To compute a feature representation for a point in a 3D point cloud, almost all deep learning approaches aggregate information about its local or global neighborhood. This aggregation typically requires finding other points within the neighborhood of the point whose features are to be computed. Next, these features are aggregated using parametric or non-parametric mathematical operations. The decision for a rasterized representation often directly leads to the use of convolutions [13, 11], although other operations are certainly possible. For Bag-of-Points representations, the choice of the feature aggregation method becomes the main differentiating factor. Some approaches use convolutions in non-rasterized spaces [21, 16, 22], while others perform feature aggregation through different variants of weighted or non-weighted pooling of local neighbors [23, 14, 18]. For brevity, we group these into neighbor-based approaches in figure 1 (bottom right). A final group of approaches, point cloud transformers, exists for smaller point clouds but has not yet found application in the large-scale point clouds used in autonomous driving. These approaches use local or global (point cloud-wide) attention mechanisms to exchange features of points inside a point cloud [24].
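As an illustration of one such non-parametric, neighbor-based aggregation (a brute-force sketch in Python/NumPy under our own assumptions, not the method of any specific cited approach), each point's output feature below is max-pooled over the features of all points within a fixed radius:

```python
import numpy as np

def neighbor_maxpool(coords, feats, radius):
    """Non-parametric feature aggregation for a Bag-of-Points
    representation: each point's new feature is the element-wise
    maximum over the features of all points within `radius`.
    O(N^2) brute force; real systems use grids or KD-trees to
    find neighbors efficiently."""
    diffs = coords[:, None, :] - coords[None, :, :]    # (N, N, 3)
    within = np.linalg.norm(diffs, axis=-1) <= radius  # (N, N) mask
    out = np.empty_like(feats)
    for i in range(len(coords)):
        # Each point is its own neighbor, so the pool is never empty.
        out[i] = feats[within[i]].max(axis=0)
    return out

rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(200, 3))
feats = rng.normal(size=(200, 8))                      # per-point features
pooled = neighbor_maxpool(coords, feats, radius=1.5)
print(pooled.shape)                                    # (200, 8)
```

A parametric variant would weight or transform the neighbor features before pooling, and replacing the maximum with a learned, input-dependent weighting is essentially the attention mechanism used by the point cloud transformers described above.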