representation of the input. For example, in systems employing multiple LiDARs pointing in different directions, the scans are typically processed sequentially or by separate instances of the same network, one per sensor. However, jointly predicting the segmentation from all available sensors allows exploiting structural dependencies in overlapping areas and thus richer contextual information.
In this work, we propose a framework that takes LiDAR scans as input (cf. Figure 1), projects them onto a sphere, and utilizes a spherical Convolutional Neural Network (CNN) for the task of semantic segmentation. The projection of the LiDAR scans onto the sphere does not introduce any distortions and is independent of the utilized LiDAR, thus yielding an agnostic representation for various LiDAR systems with different vertical FoV. We adapt the structure of
common 2D encoder and decoder networks and support
simultaneous training on different datasets obtained with
varying LiDAR sensors and parameters without having to
adapt our configuration. Moreover, since our approach is
invariant to rotations due to the spherical representation, we
support arbitrarily rotated input pointclouds. In summary, the
key contributions of this paper are as follows:
• A spherical end-to-end pipeline for semantic segmentation supporting various input configurations.
• A spherical encoder-decoder structure including a spectral pooling and unpooling operation for SO(3) signals (see the sketch below).
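As a rough illustration of the spectral pooling and unpooling named in the second contribution: for an SO(3) signal, the Fourier coefficient at degree l is a (2l+1)x(2l+1) matrix per channel, so pooling can be realized by truncating the spectrum to a lower bandwidth and unpooling by zero-padding it back. The minimal sketch below assumes this per-degree list layout and uses hypothetical function names; it is not the paper's actual implementation.

```python
import numpy as np

def so3_spectral_pool(coeffs, out_bandwidth):
    """Low-pass an SO(3) signal in the spectral domain.

    `coeffs` holds one Fourier block per degree l = 0..B-1, each of
    shape (channels, 2l + 1, 2l + 1).  Pooling keeps only the degrees
    below `out_bandwidth`, i.e. it discards high frequencies instead
    of subsampling the signal on the spatial grid.
    """
    return coeffs[:out_bandwidth]

def so3_spectral_unpool(coeffs, out_bandwidth):
    """Inverse operation: zero-pad the spectrum up to `out_bandwidth`."""
    channels = coeffs[0].shape[0]
    return list(coeffs) + [
        np.zeros((channels, 2 * l + 1, 2 * l + 1), dtype=coeffs[0].dtype)
        for l in range(len(coeffs), out_bandwidth)
    ]
```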
II. RELATED WORK
Methods using a LiDAR have to deal with the inherent sparsity and irregularity of the data, in contrast to vision-based approaches. Moreover, LiDAR-based methods can choose among various input representations [1], including directly using the pointcloud [2], [3], voxel-based [4]–[6], or projection-based [7]–[10] representations. Which input representation yields the best performance for a specific task, however, remains an open research question.
Direct methods such as PointNet [2], [3] operate on the raw
unordered pointcloud and extract local contextual features
using point convolutions [11]. Voxel-based approaches [4], [5], [12] retain the full geometric structure of the environment and can readily accumulate multiple scans, either chronologically or from different sensors. SpSequenceNet [13] explicitly uses 4D pointclouds and considers the temporal information between consecutive scans. However, the computational complexity of voxel-based approaches is high due to their high-dimensional convolutions, and their accuracy and performance are directly linked to the chosen voxel size, which has motivated works that organize the pointclouds into efficient structures such as octrees or k-d trees [14]. Furthermore, instead of a Cartesian grid, PolarNet [15] discretizes the space using a polar grid and shows superior segmentation quality.
A different direction of research is offered by graph-based approaches [16], which can seamlessly model the irregular structure of pointclouds, although open questions regarding graph construction and network design remain to be addressed.
Projection-based methods transform the pointcloud into a different domain, most commonly 2D images, on which the majority of these methods [7]–[10] rely. Such projections are appealing as they enable leveraging the large body of research on image-based deep learning, but they generally still depend on the limited amount of labeled pointcloud data. Hence, the work of Wu et al. [17] tackles this deficiency by using domain adaptation between synthetic and real-world data.
The downsides of the projection onto the 2D domain
are: i) the lack of a detailed geometric understanding of
the environment and ii) the large FoV of LiDARs, which
produces significant distortions, decreasing the accuracy of
these methods. Hence, recent approaches have explored
using a combination of several representations [6], [18] and
convolutions [19]. Recent works [20], [21] additionally learn
and extract features from a Bird’s Eye View projection that
would otherwise be difficult to retain with a 2D projection.
In contrast to 2D image projections, projecting onto the sphere is a more suitable representation for such large FoV sensors. Recently, spherical CNNs [22]–[24] have shown great potential for, e.g., omnidirectional images [25], [26] and cortical surfaces [27], [28].
Moreover, Lohit et al. [25] propose a spherical encoder-decoder network design that achieves rotation invariance by performing a global average pooling of the encoded feature map. However, their work discards the rotation information of the input signals and thus needs a special loss that includes a spherical correlation to find the unknown rotation w.r.t. the ground-truth labels.
Considering the findings above, we propose a composition of spherical CNNs, based on the work of Cohen et al. [23], that semantically segments pointclouds from various LiDAR sensor configurations.
III. SPHERICAL SEMANTIC SEGMENTATION
This section describes the core modules of our spherical
semantic segmentation framework, which mainly operates in
three stages: i) feature projection, ii) semantic segmentation,
and iii) back-projection (cf. Figure 2).
Initially, we discuss the projection of LiDAR pointclouds
onto the unit sphere and the feature representation that serves
as input to the spherical CNN. Next, we describe the details of
our network design and architecture used to learn a semantic
segmentation of LiDAR scans.
A. Sensor Projection and Feature Representation
Initially, the input to our spherical segmentation network is a signal defined on the sphere $S^2 = \{ p \in \mathbb{R}^3 \mid \lVert p \rVert_2 = 1 \}$, with the parametrization proposed by Healy et al. [29], i.e.,
\[
\omega(\varphi, \theta) = [\cos\varphi \sin\theta, \ \sin\varphi \sin\theta, \ \cos\theta]^\top, \quad (1)
\]
where $\omega \in S^2$, and $\varphi \in [0, 2\pi]$ and $\theta \in [0, \pi]$ are the azimuthal and polar angle, respectively.
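As a concrete illustration, the following minimal sketch inverts Eq. (1) to map Cartesian LiDAR returns onto an equiangular 2B x 2B spherical grid. The grid resolution, the per-cell range/occupancy features, and the function name are illustrative assumptions, not necessarily the exact configuration used in this work.

```python
import numpy as np

def project_to_sphere(points, bandwidth=64):
    """Rasterize an (N, 3) LiDAR scan onto an equiangular spherical grid.

    Returns a (2, 2B, 2B) feature map holding the range of the closest
    return per cell and a binary occupancy mask, plus the grid indices
    of every input point for a later back-projection of the labels.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)

    # Invert Eq. (1): polar angle from z, azimuth from x and y.
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))  # [0, pi]
    phi = np.mod(np.arctan2(y, x), 2.0 * np.pi)                     # [0, 2*pi)

    # Equiangular grid with 2B samples along each angular axis.
    n = 2 * bandwidth
    i = np.clip((theta / np.pi * n).astype(int), 0, n - 1)          # rows
    j = np.clip((phi / (2.0 * np.pi) * n).astype(int), 0, n - 1)    # cols

    features = np.zeros((2, n, n), dtype=np.float32)
    order = np.argsort(-r)               # write farthest first, closest wins
    features[0, i[order], j[order]] = r[order]
    features[1, i, j] = 1.0              # occupancy mask
    return features, (i, j)
```

Per-point labels can then be recovered by indexing the network's per-cell predictions with the returned (i, j) grid indices.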
We then operate in an end-to-end fashion by transforming
the input modality (i.e., the pointcloud scan) into a spherical