[Figure: bar charts of the class distribution (amount of bounding boxes per class) for nuScenes (Car, Pedestrian, Barrier, Truck, Traffic Cone, Trailer, Bus, Construction Vehicle, Motorcycle, Bicycle) and KITTI (Car, Pedestrian, Cyclist).]
Figure 1. Class distributions for two 3D object detection datasets:
nuScenes [1] (left) and KITTI [9] (right).
present ground truth points in a more natural position, e.g.,
on a sidewalk for pedestrians. Our experiments show that our
contextual GT sampling provides an extra performance gain,
especially for minor classes.
Our approach is closest to Zhu et al. [36] (CBGS)
in that they also use multiple detection heads and data aug-
mentation techniques, i.e., GT sampling [28]. However, our
work differs from theirs as follows: (i) we explore multi-
task learning techniques, including multiple detection heads
with loss balancing, to improve detection performance
across all categories. CBGS focused on a multi-headed ar-
chitecture with uniform scaling, which minimizes a uni-
formly weighted sum of the head losses and does not dy-
namically adjust the weights as we do. (ii) We propose
contextual GT sampling, which addresses the issues of con-
ventional GT sampling and yields better detection accu-
racy.
We summarize our contributions as follows:
• Inspired by multi-task learning, we propose a multi-
headed LiDAR-based 3D object detector where losses
for each head are balanced by dynamic weight average
(DWA).
• Combined with the multi-headed architecture, we pro-
pose contextual ground truth sampling, which improves
conventional ground truth (GT) sampling by leverag-
ing semantic scene information to place GT objects at
more realistic positions.
• We conduct various experiments to demonstrate the
effectiveness of our proposed approach with widely-
used public datasets: KITTI and nuScenes. Our exper-
iments show that multi-task learning techniques com-
bined with our contextual GT sampling significantly
improve the overall detection performance, especially
for minor classes.
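The dynamic weight average used for loss balancing can be sketched as follows. This is a minimal illustration of the common DWA formulation (per-head losses tracked per epoch, a temperature-softened softmax over descent rates), not our exact training code; all names are illustrative.

```python
import math

def dwa_weights(loss_history, temperature=2.0):
    """Dynamic weight average: weight each head's loss by its recent
    descent rate, softened by a temperature, so slowly improving heads
    receive larger weights."""
    num_heads = len(loss_history)
    # Need at least two past epochs to estimate a descent rate;
    # fall back to uniform weights otherwise.
    if any(len(h) < 2 for h in loss_history):
        return [1.0] * num_heads
    # Descent rate r_k = L_k(t-1) / L_k(t-2); a value near 1 means
    # that head is making slow progress.
    rates = [h[-1] / h[-2] for h in loss_history]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    # Scale so the weights sum to the number of heads.
    return [num_heads * e / total for e in exps]

# loss_history[k] holds per-epoch average losses for detection head k.
history = [[1.0, 0.5], [1.0, 0.9], [1.0, 0.7]]
weights = dwa_weights(history)
```

With this history, head 1 (whose loss barely decreased) receives the largest weight, nudging training toward the lagging head.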
2. Related Work
2.1. 3D Object Detection
A landmark work in LiDAR-based 3D object detec-
tion is VoxelNet [35], an end-to-end trainable model that
first voxelizes a point cloud and encodes each equally
spaced voxel as a descriptive volumetric representation. Given
these features, conventional 2D convolutions are used to
generate and regress region proposals. Yan et al. [28]
used sparse 3D convolutions to accelerate the heavy compu-
tations of earlier LiDAR-based works. PointPillars [15]
is another landmark work that speeds up the encoding of
the 3D volumetric representation by dividing the 3D space
into pillars (instead of voxels). More sophisticated archi-
tectures have also been used to achieve better detection results.
PointRCNN [22] used a two-stage architecture to refine
the initial 3D bounding box proposals. Part-A2 [23] fo-
cuses on leveraging intra-object parts for better results. PV-
RCNN [20] and PV-RCNN++ [21] simultaneously process
coarse-grained voxels and the raw point cloud. Recently,
CenterPoint [31] applied a key-point detector that predicts
the geometric center of objects. Similarly, Voxel RCNN [5]
used coarse voxel granularity to reduce the computation
cost, retaining the overall detection performance. In this
work, we focus on addressing the data imbalance problem
in LiDAR-based object detection. Thus, we do not claim a
novel 3D object detector; rather, we rely on the existing
landmark works PointPillars [15], PV-RCNN [20], and Voxel
RCNN [5] to demonstrate the effectiveness of our proposed
approach. Note that, in principle, our approach is applicable
to other detectors as well.
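As a rough illustration of the pillar idea, the sketch below groups points into an x-y grid while ignoring the z axis. The detection range and pillar size mimic common KITTI configurations but are assumptions here, and the real PointPillars encoder additionally builds learned per-pillar features.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              pillar_size=0.16):
    """Group LiDAR points (N x 4: x, y, z, intensity) into vertical
    pillars, i.e., x-y grid cells of unbounded height."""
    # Keep only points inside the detection range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    # Integer grid coordinates of each point's pillar.
    ix = ((pts[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = ((pts[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    # Map each occupied pillar to the indices of its points.
    pillars = {}
    for idx, key in enumerate(zip(ix.tolist(), iy.tolist())):
        pillars.setdefault(key, []).append(idx)
    return pillars

pts = np.array([[1.00, 0.00, -1.5, 0.3],
                [1.05, 0.05,  0.2, 0.1],   # same pillar as the first point
                [10.0, 5.00,  0.0, 0.9]])
pillars = pillarize(pts)
```

Only occupied pillars are stored, which is what makes the pillar encoding fast on sparse LiDAR sweeps.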
LiDAR Point Cloud Augmentation. Data augmentation has been
widely applied to LiDAR-based 3D object detection for var-
ious reasons: (i) improving point cloud quality by upsam-
pling a low-density point cloud [30,32] or by point cloud
completion for occluded regions [2,27,29,33]. (ii) Improv-
ing the robustness of object detection by global and local
augmentations. Choi et al. [4] randomly augmented sub-
partitions of GT objects (e.g., dropping points in a certain
sub-partition). Zheng et al. [34] divided each ground
truth object into six (inward facing) pyramids, then aug-
mented them with random dropout, swap, and sparsifying
operations. (iii) Improving generalization power by aug-
menting clear weather point clouds with adverse conditions
via physical modeling, such as fog [13] or snowfall [12].
(iv) Augmenting LiDAR-based features with other modal-
ities, such as images [25,26]. (v) Smoothing class dis-
tribution by sampling ground truth objects from the (of-
fline) database and introducing them to the current scene
(GT sampling, [28]). In this work, similar to (v), we focus
on smoothing the density of each class to address the data
imbalance problem (i.e., improving the detection accuracy
of rare objects while maintaining that of common objects).
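The class-balancing idea behind conventional GT sampling (v) can be sketched as follows. The box representation, per-class target counts, and the helper `gt_sample` are illustrative; the collision checks of the real pipeline and the contextual placement proposed in this work are omitted.

```python
import random

def gt_sample(scene_boxes, gt_database, targets):
    """Conventional GT sampling, sketched: for each class, draw
    ground-truth objects from an offline database and paste them into
    the current scene until a per-class target count is reached."""
    augmented = list(scene_boxes)
    for cls, target in targets.items():
        present = sum(1 for b in augmented if b["class"] == cls)
        needed = max(0, target - present)
        pool = [b for b in gt_database if b["class"] == cls]
        if pool and needed:
            # Sample without replacement from the offline database.
            augmented.extend(random.sample(pool, min(needed, len(pool))))
    return augmented

scene = [{"class": "Car"}, {"class": "Car"}]
db = [{"class": "Cyclist"}, {"class": "Cyclist"}, {"class": "Pedestrian"}]
out = gt_sample(scene, db, {"Car": 2, "Cyclist": 2, "Pedestrian": 1})
```

The scene already meets its Car quota, so only rare-class objects are pasted in, flattening the class distribution seen by the detector.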