D-Align: Dual Query Co-attention Network for 3D Object Detection
Based on Multi-frame Point Cloud Sequence
Junhyung Lee1, Junho Koh2, Youngwoo Lee2 and Jun Won Choi2
Abstract— LiDAR sensors are widely used for 3D object
detection in various mobile robotics applications. LiDAR
sensors continuously generate point cloud data in real time.
Conventional 3D object detectors detect objects using a set
of points acquired over a fixed duration. However, recent
studies have shown that the performance of object detection can
be further enhanced by utilizing spatio-temporal information
obtained from point cloud sequences. In this paper, we propose
a new 3D object detector, named D-Align, which can effectively
produce strong bird’s-eye-view (BEV) features by aligning and
aggregating the features obtained from a sequence of point sets.
The proposed method includes a novel dual-query co-attention
network that uses two types of queries, including target query
set (T-QS) and support query set (S-QS), to update the features
of target and support frames, respectively. D-Align aligns S-
QS to T-QS based on the temporal context features extracted
from the adjacent feature maps and then aggregates S-QS with
T-QS using a gated attention mechanism. The dual queries
are updated through multiple attention layers to progressively
enhance the target frame features used to produce the detection
results. Our experiments on the nuScenes dataset show that the
proposed D-Align method greatly improved the performance
of a single frame-based baseline method and significantly
outperformed the latest 3D object detectors.
I. INTRODUCTION
LiDAR is a widely used sensor modality for perception
tasks in various mobile robotics applications. LiDAR sensors
generate point cloud data corresponding to observations of
laser reflections from the surfaces of objects. Point cloud
data are particularly useful for 3D object detection tasks,
which involve estimating objects’ locations in 3D coordinate
systems. Recently, LiDAR-based 3D object detection has
advanced rapidly with the adoption of deep neural networks
to extract features from point cloud data.
Conventional 3D object detectors operate on a set of
LiDAR points, called point clouds, which are acquired by
a fixed number of consecutive laser scans. The geometrical
distribution of point clouds is used to detect objects in 3D
space. In practical applications, LiDAR sensors continuously
scan their surroundings to generate point cloud sequences in
real time. Therefore, the quality of point cloud data can be
improved by using more than one set of point clouds for 3D
object detection. One na¨
ıve approach is to merge multiple
consecutive sets of point clouds acquired in the fixed duration
and use the larger set of points as input to a 3D object
1Department of Future Mobility, Hanyang University, Seoul, 04763,
Korea
2Department of Electrical Engineering, Hanyang University, Seoul,
04763, Korea
{junhyunglee,jhkoh,youngwoolee}@spa.hanyang.ac.kr
junwchoi@hanyang.ac.kr
[Fig. 1(a): NDS [%] vs. size of merged point set (sweeps): 1 → 43.9, 5 → 57.0, 10 → 58.1, 15 → 58.6, 20 → 58.3. Fig. 1(b): the latest scanned point set merged over 1 sweep, 10 sweeps, and 20 sweeps.]
Fig. 1. Effects of using larger LiDAR point sets for 3D object detection:
(a) Evaluation results of the PointPillars [11] method on the nuScenes
[4] validation set. NDS denotes the nuScenes detection score. The NDS
performance improvement of the 3D object detector quickly diminished as
the size of the LiDAR point set increased. (b) Object motion leads to dynamic
changes in the distribution of point clouds as more LiDAR points are
merged.
detection model [4]. This effectively improves the density
of the point clouds and thus improves performance, as
shown in Fig. 1 (a). However, this performance improvement
quickly diminishes as more point sets are merged, because
the distribution of points changes across multiple scans.
Fig. 1 (b) shows that a moving object exhibits a dynamically
changing point distribution even after compensating for the
ego vehicle's motion over that duration. Therefore, more sophisticated 3D object
detection models that capture the dynamic temporal structure
of LiDAR points are needed to improve accuracy.
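This naïve merging strategy can be made concrete with a short sketch. The function below is a minimal illustration, assuming each sweep comes with a 4x4 sensor-to-global pose (nuScenes-style ego-motion compensation); the function name and argument layout are ours, not part of any released code.

```python
import numpy as np

def merge_sweeps(sweeps, poses, target_pose):
    """Naively merge several LiDAR sweeps into one point set.

    sweeps:      list of (M_i, 3) arrays of points, each in its sweep's sensor frame
    poses:       list of (4, 4) sensor-to-global transforms, one per sweep
    target_pose: (4, 4) sensor-to-global transform of the latest (target) sweep
    Returns an (M, 4) array: xyz in the target sensor frame plus a time-lag channel.
    """
    merged = []
    to_target = np.linalg.inv(target_pose)                # global -> target sensor frame
    for lag, (pts, pose) in enumerate(zip(sweeps, poses)):
        hom = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # homogeneous coords
        in_target = (to_target @ pose @ hom.T).T[:, :3]              # compensate ego motion
        time_lag = np.full((len(pts), 1), lag, dtype=np.float32)     # 0 for the latest sweep
        merged.append(np.concatenate([in_target, time_lag], axis=1))
    return np.concatenate(merged, axis=0)
```

Ego-motion compensation of this kind aligns the static background across sweeps, but a moving object still leaves a trail of points at its past positions, which is exactly the smearing illustrated in Fig. 1 (b).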
Several recent studies have explored methods to utilize
temporal information existing in long sequences of point
clouds for 3D object detection. In this study, we refer to
these detectors as 3D multi-frame object detectors (3D-
MOD). Various architectures have been proposed for 3D-
MOD [9], [18], [21], [22], [24]. These methods extracted
geometric features from each set of points (called a frame)
within which the point distribution did not change much and
modeled their temporal variations to improve performance.
In LSTM-TOD [9] and 3DVID [22], the multiple bird’s-eye-
view (BEV) feature maps extracted from the point cloud
sequence were combined using variants of recurrent
modules, i.e., ConvLSTM [17] and ConvGRU [2]. TCTR [24]
exploited temporal-channel relations over multiple feature
maps using an encoder-decoder structure [16]. 3D-MAN
[21] aggregated box-level features with a memory bank
containing sequential temporal view information.
In this paper, we present a new 3D-MOD method called
Dual-Query Align (D-Align), which can produce robust
spatio-temporal BEV representations using a multi-frame
point cloud sequence. We propose a novel dual query co-
attention network that employs two types of queries: target
query set (T-QS) and support query set (S-QS) to facilitate
the co-attention to both the target and support frame features.
T-QS and S-QS serve to carry the target frame features and
the support frame features, which are continuously enhanced
through multiple layers of attention.
In each attention layer, the dual queries are updated in two
steps. First, the inter-frame deformable alignment network
(IDANet) aligns S-QS to T-QS using deformable attention.
The deformable attention mechanism is applied to S-QS with
mask offsets and weights determined by multi-scale temporal
context features generated from two adjacent BEV feature
maps. This step updates S-QS first. Next, the inter-frame
gated aggregation network (IGANet) aggregates S-QS and
T-QS using a gated attention network [10]. The aggregated
query features finally update T-QS. After going through
multiple attention layers, D-Align produces the improved
BEV features of the target frame for 3D object detection.
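As a rough illustration of this two-step layer, the sketch below chains a hypothetical alignment module with a simple sigmoid-gated fusion. The module names (idanet, iganet), the per-element gate, and the BEV-shaped query tensors are all our assumptions and only approximate the IDANet/IGANet components described above, not the released implementation.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Minimal gated fusion of target and aligned support features.

    Illustrative stand-in for IGANet: a sigmoid gate decides, per element,
    how much support information flows into the target query set.
    """
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, t_qs, s_qs_aligned):
        fused = t_qs
        for s in s_qs_aligned:
            g = torch.sigmoid(self.gate(torch.cat([fused, s], dim=1)))
            fused = g * fused + (1.0 - g) * s
        return fused


class DualQueryLayer(nn.Module):
    """One DUCANet attention layer: align S-QS to T-QS, then aggregate into T-QS."""
    def __init__(self, idanet, iganet):
        super().__init__()
        self.idanet = idanet   # inter-frame deformable alignment (IDANet)
        self.iganet = iganet   # inter-frame gated aggregation (IGANet)

    def forward(self, t_qs, s_qs):
        # Step 1: align each support query set to the target query set; the
        # alignment module derives its temporal context from the two maps.
        s_qs = [self.idanet(t_qs, s) for s in s_qs]
        # Step 2: aggregate the aligned support queries into the target queries.
        t_qs = self.iganet(t_qs, s_qs)
        return t_qs, s_qs
```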
We evaluate the proposed D-Align on the widely used
public nuScenes dataset [4]. Our experimental results show
that the proposed method improves the 3D detection baseline
by significant margins and outperforms the latest LiDAR-
based state-of-the-art (SOTA) 3D object detectors.
The contributions of our paper are summarized as follows.
• We propose an enhanced 3D object detection architecture, D-Align, which exploits the temporal structure in point cloud sequences. We devise a novel dual query co-attention network that transforms the dual queries S-QS and T-QS through successive operations of feature alignment and feature aggregation. This co-attention mechanism allows attending to both the support and target frame features to gather useful spatio-temporal information from multiple frames of point data.
• We design a temporal context-guided deformable attention mechanism to achieve inter-frame feature alignment (see the sketch after this list). Our deformable attention mechanism differs from the original model proposed in [29] in that the attention mask is adjusted by the motion context obtained from two adjacent BEV feature maps. Our analysis shows that the use of such motion features contributed significantly to the overall detection performance.
• Our code will be publicly released.
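To make the second contribution more tangible, the following sketch shows one way such temporal context-guided deformable sampling could look on a single-scale BEV map: sampling offsets and weights are predicted from the concatenation of two adjacent BEV feature maps rather than from the query alone. The convolutional formulation, layer names, and shapes are illustrative assumptions; the actual IDANet operates on the dual query sets with multi-scale deformable attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDeformableAlign(nn.Module):
    """Sketch of temporal context-guided deformable sampling on a BEV map.

    Offsets and weights are predicted from two adjacent BEV feature maps
    (the temporal context), rather than from the query alone as in standard
    deformable attention. Layer names and shapes are illustrative assumptions.
    """
    def __init__(self, channels, num_points=4):
        super().__init__()
        self.num_points = num_points
        # temporal context = concatenation of the two adjacent BEV maps
        self.offset = nn.Conv2d(2 * channels, 2 * num_points, kernel_size=3, padding=1)
        self.weight = nn.Conv2d(2 * channels, num_points, kernel_size=3, padding=1)

    def forward(self, target_bev, support_bev):
        b, c, h, w = support_bev.shape
        ctx = torch.cat([target_bev, support_bev], dim=1)
        offsets = self.offset(ctx).view(b, self.num_points, 2, h, w)
        weights = self.weight(ctx).softmax(dim=1)               # (b, P, h, w)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=support_bev.device),
            torch.linspace(-1, 1, w, device=support_bev.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)

        aligned = 0
        for p in range(self.num_points):
            # Offsets are predicted in pixels; convert them to normalized units.
            off = offsets[:, p].permute(0, 2, 3, 1) / torch.tensor(
                [w / 2.0, h / 2.0], device=support_bev.device)
            sampled = F.grid_sample(support_bev, base + off, align_corners=True)
            aligned = aligned + weights[:, p:p + 1] * sampled
        return aligned
```

Driving the offsets and weights from both maps lets the sampling pattern follow the apparent object motion between adjacent frames, which is the intuition behind using motion context rather than the query features alone.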
II. RELATED WORK
3D object detection techniques based on a single point
set [7], [11], [19], [23], [26] have advanced rapidly since
deep neural networks have been adopted to encode irregular
and unordered point clouds. However, the performance of
these 3D object detectors remains limited because they do
not exploit temporal information in sequence data.
To date, several 3D-MOD methods have been proposed,
which used point cloud sequences to perform 3D object
detection [9], [21], [22], [24]. These methods explored
ways to represent time-varying features obtained from long
sequences of point clouds. In [9], [22], [24], LiDAR features
obtained from multiple point cloud frames were combined to
exploit temporal information in sequence data. LSTM-TOD
[9] produced a spatio-temporal representation of point cloud
sequences using a 3D sparse ConvLSTM [17] modified to
encode the point cloud sequence data. 3DVID [22] improved
the conventional ConvGRU by adopting a transformer at-
tention mechanism to exploit the spatio-temporal coherence
of point cloud sequences. TCTR [24] explored channel-wise
temporal correlations among consecutive frames and decoded
spatial information using a Transformer. 3D-MAN [21] stored
3D proposals and 3D spatial features obtained from a single-
frame 3D detector in a memory bank. Then, to integrate
the local features of objects extracted from each frame, the
method explored spatio-temporal relationships among the
proposals.
The proposed D-Align differs from these methods in that
a novel dual query co-attention architecture is introduced
to utilize spatio-temporal information obtained from point
cloud sequences. The proposed method effectively aligns and
aggregates multi-frame features by refining the dual query
sets through multiple attention layers.
III. PROPOSED METHOD
Fig. 2 shows the overall structure of D-Align. It consists
of three main blocks, including 1) BEV feature extractor,
2) dual-query co-attention network (DUCANet), and 3) 3D
object detection head. The set of points acquired by the
LiDAR sensor over a duration of $T$ seconds is called a frame.
Ego-motion compensation [4] is applied to the points within each frame.
D-Align takes the sequence of point clouds $\{P_n\}_{n=t-N+1}^{t}$ over $N$
successive frames as an input, where $P_n$ denotes the point set obtained
in the $n$-th frame. The frame $t$ is called a target frame because we aim
to detect objects for frame $t$. The remaining frames are called
support frames.
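For concreteness, a minimal and purely hypothetical sketch of this input layout is shown below: each frame is an ego-motion-compensated point array, the last frame in the sequence is the target frame, and the channel layout (x, y, z, intensity) is our assumption rather than a prescribed format.

```python
import numpy as np

# Hypothetical input layout: N frames, each an (M_n, 4) array of
# ego-motion-compensated points with channels (x, y, z, intensity).
N = 4
point_sequence = [np.random.rand(30000, 4).astype(np.float32) for _ in range(N)]

target_frame = point_sequence[-1]       # P_t: the frame for which detections are produced
support_frames = point_sequence[:-1]    # P_{t-N+1}, ..., P_{t-1}
```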
A. Overview
The BEV feature extractor produces the BEV feature maps
$\{F_{t-k}\}_{k=0}^{N-1}$ using a grid-based backbone network
for each frame [11], [19]. This backbone network produces
feature maps at $S$ scales, i.e., $F_n = \{F_n^s\}_{s=1}^{S}$. Next,
DUCANet produces the enhanced target frame BEV features
by applying the dual-query co-attention mechanism to the
multi-frame features $\{F_{t-k}\}_{k=0}^{N-1}$. DUCANet maintains two
query sets, T-QS and S-QS. T-QS serves to store the target frame features.