
method that utilizes the complementary nature of camera and
LiDAR data to facilitate object detection at long ranges. In
particular, depth and intensity values from sparse LiDAR
returns are used to detect and generate location proposals
for the objects present in the environment. These location
proposals are then used by a PTZ camera system to perform a
directed search, adjusting its orientation and zoom level to
detect and classify objects in difficult environments at long
ranges. The performance and applicability of
the proposed method are thoroughly evaluated on data collected
by an ANYmal-C quadruped robot during field deployments
conducted in challenging underground settings, including the
SubT Challenge finals event consisting of an underground
urban environment, a cave network, and a tunnel system.
II. RELATED WORK
Visual cameras have been the preferred sensor choice for
object detection due to the rich scene information they provide,
including texture and context. Especially with the emergence of
Convolutional Neural Network (CNN) based object detection
approaches such as YOLO [8], SSD [9], and Faster R-CNN [10],
near human-level performance has been achieved. Moreover,
recent approaches such as Mask R-CNN [11] and DetectoRS
[12] are able to perform instance segmentation, in which each
pixel of the image is assigned a class label and an instance
label. However, due to the absence of depth information,
localizing the detected objects in the 3D environment remains a
challenge. This has motivated the teams participating in the
DARPA SubT challenge to use LiDAR scans for localizing
the detected objects. Team CERBERUS [13] made use of
a YOLO architecture trained to detect competition-specific
objects. The 3D location of a detected object in world
coordinates is obtained by projecting the bounding box into
the robot occupancy map built using the LiDAR scans. Other
teams also made use of similar approaches utilizing both
camera and LiDAR data [14], [15]. A common problem reported
by all teams was the reduced object detection range when using
only visual cameras, due to poor illumination in complex
underground environments.
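To make this common camera-LiDAR localization pattern concrete, the sketch below illustrates one way a 2D detection can be lifted into 3D: map points expressed in the camera frame are projected through known pinhole intrinsics, and the points falling inside the detection bounding box vote for the object position. This is a minimal illustration under assumed inputs; the function name localize_detection, the median-based depth heuristic, and all parameters are illustrative choices, not the implementation of [13].

```python
import numpy as np

def localize_detection(bbox, cloud_cam, K, min_points=5):
    """Estimate the 3D position of a 2D detection by projecting map
    points (camera frame) into the image and aggregating the points
    that fall inside the detection bounding box."""
    # Keep only points in front of the camera.
    pts = cloud_cam[cloud_cam[:, 2] > 0.0]

    # Pinhole projection: [u, v, 1]^T ~ K [x, y, z]^T.
    proj = (K @ pts.T).T
    uv = proj[:, :2] / proj[:, 2:3]

    u_min, v_min, u_max, v_max = bbox
    inside = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    hits = pts[inside]
    if len(hits) < min_points:
        return None  # insufficient LiDAR support for this detection

    # Median over the closest 30% of depths to reject background
    # points that also project into the box.
    depth_cut = np.percentile(hits[:, 2], 30)
    return np.median(hits[hits[:, 2] <= depth_cut], axis=0)
```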
LiDAR-based 3D object detection methods which make
use of CNNs and operate on point clouds (Point R-CNN
[16]) or voxel-based representations (Voxel R-CNN [17])
have also gained popularity. While these approaches are well
suited for detecting and localizing objects like vehicles and
pedestrians in a structured environment like that of a self-
driving vehicle, they are not well suited for detecting highly
specific objects in an unstructured environment as required
in our case. Thus, we propose to use LiDAR and a PTZ
camera in a coupled manner to improve the object detection
range. In particular, we propose to use LiDAR scans to
generate object proposals by performing clustering based on
LiDAR intensity and depth difference. These clusters are then
scanned by the PTZ camera and classified using a CNN-based
object detection model. Existing methods have performed
object segmentation and clustering using sparse LiDAR scans,
with a simple clustering approach based on the Euclidean
distance proposed in [18].

Fig. 2: An overview of the proposed method. Point clouds and odometry feed Point cloud Accumulation; the aggregated point cloud undergoes Ground Points Removal and Point cloud Projection into range, intensity, and surface-normal images; an Image Filter passes the filtered images to Object Segmentation (depth + intensity); the segmented object point clusters pass through the Object Cluster Filter (volume-based, surface-normals standard deviation, cluster points size) and Cluster Merge to form object proposals, whose cluster centers drive Waypoint Generation, sending waypoints to the PTZ camera controller.

The approach operates directly
on the 3D point clouds and introduces a radially bounded
nearest neighbor (RBNN) algorithm for clustering, which,
unlike k-nearest-neighbor clustering [19], is able to handle
outliers. This approach was further extended in [20] to work
in real-time on a continuous stream of data.
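The following compact sketch illustrates the RBNN idea: points closer than a fixed radius are linked, connected components of the resulting graph form clusters, and components below a minimum size are rejected as outliers. The radius and minimum cluster size are illustrative values, and the flood-fill formulation is a sketch rather than the original implementation of [18].

```python
import numpy as np
from scipy.spatial import cKDTree

def rbnn_cluster(points, radius=0.5, min_cluster_size=10):
    """Radially bounded nearest neighbor (RBNN) clustering of an
    (N, 3) point array; returns per-point labels, -1 for outliers."""
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    current = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        # Flood-fill the connected component containing point i.
        stack, labels[i] = [i], current
        while stack:
            j = stack.pop()
            for k in tree.query_ball_point(points[j], radius):
                if labels[k] == -1:
                    labels[k] = current
                    stack.append(k)
        current += 1
    # Components smaller than min_cluster_size are outliers.
    for c in range(current):
        idx = np.flatnonzero(labels == c)
        if idx.size < min_cluster_size:
            labels[idx] = -1
    return labels
```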
directly on unordered point clouds are relatively slow due to
the expensive nearest neighbor search queries. Thus, for speed-
up, approaches choose to operate on range images generated
from point clouds instead. Performing computations on range
images have the advantages of exploitable neighborhood
relations and the reduction of redundant points to a single
representative pixel in the image. In [21], the authors propose
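The sketch below shows the standard spherical projection underlying such range-image methods: each 3D point maps to an azimuth-elevation pixel and the closest return per pixel is kept, so redundant points collapse into one representative pixel. The vertical field of view and image resolution are assumed sensor parameters, not values from the cited works.

```python
import numpy as np

def to_range_image(points, h=64, w=1024, fov_up=15.0, fov_down=-15.0):
    """Project an (N, 3) point cloud into an (h, w) range image."""
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])            # azimuth
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-9))   # elevation

    up, down = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (yaw + np.pi) / (2.0 * np.pi)) * w).astype(int) % w
    v = np.clip(((up - pitch) / (up - down) * h).astype(int), 0, h - 1)

    img = np.full((h, w), np.inf)
    # Assign farther points first so the closest return wins per pixel.
    order = np.argsort(-r)
    img[v[order], u[order]] = r[order]
    return img
```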
In [21], the authors propose to use the depth angle for
clustering on range images. In another clustering approach,
Scan-Line-Run (SLR) [22], the authors modify the two-run
connected component labeling technique for binary images [23]
and apply it to cluster range images. In recent work [24], the
authors extend the depth-angle-based clustering approach of
[21] to make it robust to instance over-segmentation by
introducing additional sparse connections in the range image,
termed map connections.
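As an illustration of the depth-angle criterion of [21], the sketch below evaluates, for two neighboring range-image pixels with ranges d1 and d2 separated by the beam increment alpha, the angle beta of the line connecting the two measured points as seen from the sensor: a large beta indicates a continuous surface, a small beta a depth discontinuity. The threshold and beam spacing below are illustrative values.

```python
import numpy as np

def same_object(d1, d2, alpha, theta_deg=10.0):
    """Depth-angle criterion in the spirit of [21]: neighboring
    range-image pixels belong to the same cluster if the angle
    beta = atan2(d2 sin(alpha), d1 - d2 cos(alpha)), with d1 >= d2,
    exceeds a threshold theta."""
    d1, d2 = max(d1, d2), min(d1, d2)
    beta = np.arctan2(d2 * np.sin(alpha), d1 - d2 * np.cos(alpha))
    return beta > np.radians(theta_deg)

# Example with adjacent beams 0.4 degrees apart:
alpha = np.radians(0.4)
print(same_object(10.00, 10.02, alpha))  # smooth surface -> True
print(same_object(10.00, 14.00, alpha))  # depth jump     -> False
```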
III. PROPOSED METHOD
To aid the camera object detection and classification
process, especially in challenging and visually degraded
environments, this work proposes to utilize LiDAR data to
generate object proposals at longer distances. In addition to
depth data, auxiliary LiDAR information, such as intensity
returns, is used to distinguish and segment
objects from the environment. An overview of the proposed
approach is presented in Figure 2, with each component
detailed below:
A. Point cloud Accumulation
To facilitate object detection from sparse LiDAR scans
(Figure 3A), such as those obtained from low-cost LiDARs with