modeled their temporal variations to improve performance.
In LSTM-TOD [9] and 3DVID [22], multiple bird's-eye-view (BEV) feature maps extracted from the point cloud sequence were combined using variants of recurrent modules, i.e., ConvLSTM [17] and ConvGRU [2], respectively. TCTR [24]
exploited temporal-channel relations over multiple feature
maps using an encoder-decoder structure [16]. 3D-MAN
[21] aggregated box-level features with a memory bank containing sequential temporal view information.
In this paper, we present a new 3D-MOD method called
Dual-Query Align (D-Align), which can produce robust
spatio-temporal BEV representations using a multi-frame
point cloud sequence. We propose a novel dual-query co-attention network that employs two types of queries, a target query set (T-QS) and a support query set (S-QS), to facilitate co-attention to both the target and support frame features. T-QS and S-QS carry the target frame features and the support frame features, respectively, and are continuously enhanced through multiple layers of attention.
In each attention layer, the dual queries are updated in two
steps. First, the inter-frame deformable alignment network
(IDANet) aligns S-QS to T-QS using deformable attention.
The deformable attention mechanism is applied to S-QS with
mask offsets and weights determined by multi-scale temporal
context features generated from two adjacent BEV feature
maps. This step updates S-QS. Next, the inter-frame
gated aggregation network (IGANet) aggregates S-QS and
T-QS using a gated attention network [10]. The aggregated
query features finally update T-QS. After going through
multiple attention layers, D-Align produces the improved
BEV features of the target frame for 3D object detection.
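To make this two-step update concrete, the following PyTorch-style sketch outlines one layer of the dual-query update; the module names, tensor shapes, and the simplified attention and gating operations are illustrative assumptions rather than the actual IDANet/IGANet designs.

```python
import torch
import torch.nn as nn


class DualQueryLayer(nn.Module):
    """One layer of the dual-query update (illustrative sketch).

    Assumptions: T-QS and S-QS are flattened BEV grids of shape (B, HW, C),
    and C is divisible by the number of attention heads. The actual IDANet
    uses temporal context-guided deformable attention and IGANet uses a
    gated attention network; simple stand-ins are used here.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Temporal context from the pair of query sets (stand-in for the
        # paper's multi-scale temporal context features).
        self.context = nn.Linear(2 * channels, channels)
        # Stand-in for IDANet's deformable alignment attention.
        self.align = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Stand-in for IGANet's gated aggregation.
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, t_qs: torch.Tensor, s_qs: torch.Tensor):
        # Step 1 (IDANet): align S-QS to T-QS; this step updates S-QS.
        ctx = self.context(torch.cat([t_qs, s_qs], dim=-1))
        s_qs, _ = self.align(query=t_qs + ctx, key=s_qs, value=s_qs)

        # Step 2 (IGANet): aggregate S-QS and T-QS with a gate; the
        # aggregated features update T-QS.
        g = self.gate(torch.cat([t_qs, s_qs], dim=-1))
        t_qs = g * s_qs + (1.0 - g) * t_qs
        return t_qs, s_qs
```

Stacking several such layers and reading out the final T-QS as the enhanced target-frame BEV feature mirrors the multi-layer refinement described above.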
We evaluate the proposed D-Align on the widely used
public nuScenes dataset [4]. Our experimental results show
that the proposed method improves the 3D detection baseline
by significant margins and outperforms the latest LiDAR-
based state-of-the-art (SOTA) 3D object detectors.
The contributions of our paper are summarized as follows.
• We propose an enhanced 3D object detection architecture, D-Align, which exploits the temporal structure of point cloud sequences. We devise a novel dual-query co-attention network that transforms the dual queries S-QS and T-QS through successive operations of feature alignment and feature aggregation. This co-attention mechanism attends to both the support and target frame features to gather useful spatio-temporal information from multiple frames of point cloud data.
• We design a temporal context-guided deformable attention mechanism to achieve inter-frame feature alignment. Our deformable attention differs from the original model proposed in [29] in that the attention mask is adjusted by the motion context obtained from two adjacent BEV feature maps (see the sketch after this list). Our analysis shows that the use of such motion features contributes significantly to the overall detection performance.
• Our code will be publicly released.
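As a rough illustration of the second contribution, the sketch below predicts sampling offsets and attention weights from the motion context of two adjacent BEV feature maps and uses them to resample the support-frame features. The module name, the single-scale setting, and the normalized-offset convention are assumptions made for brevity; this is not the exact IDANet formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionGuidedDeformableSampling(nn.Module):
    """Single-scale sketch of temporal context-guided deformable sampling.

    Assumptions: `target` and `support` are BEV feature maps of shape
    (B, C, H, W); offsets are predicted in normalized [-1, 1] coordinates.
    This is an illustration, not the paper's multi-scale attention design.
    """

    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Motion context from the two adjacent BEV maps.
        self.context = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Per-location sampling offsets and attention weights.
        self.offsets = nn.Conv2d(channels, 2 * num_points, kernel_size=1)
        self.weights = nn.Conv2d(channels, num_points, kernel_size=1)

    def forward(self, target: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        b, _, h, w = target.shape
        ctx = self.context(torch.cat([target, support], dim=1))
        offs = self.offsets(ctx).view(b, self.num_points, 2, h, w)
        attn = self.weights(ctx).softmax(dim=1)          # (B, P, H, W)

        # Base sampling grid in normalized coordinates for grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=target.device),
            torch.linspace(-1.0, 1.0, w, device=target.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)

        out = torch.zeros_like(target)
        for p in range(self.num_points):
            grid = base + offs[:, p].permute(0, 2, 3, 1)   # (B, H, W, 2)
            sampled = F.grid_sample(support, grid, align_corners=True)
            out = out + attn[:, p : p + 1] * sampled       # weighted sum
        return out
```

In D-Align the deformable attention is applied to S-QS within each attention layer; the sketch operates directly on BEV maps only for simplicity.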
II. RELATED WORK
3D object detection techniques based on a single point
set [7], [11], [19], [23], [26] have advanced rapidly since
deep neural networks have been adopted to encode irregular
and unordered point clouds. However, the performance of
these 3D object detectors remains limited because they do
not exploit temporal information in sequence data.
To date, several 3D-MOD methods have been proposed,
which used point cloud sequences to perform 3D object
detection [9], [21], [22], [24]. These methods explored
ways to represent time-varying features obtained from long sequences of point clouds. In [9], [22], [24], LiDAR features
obtained from multiple point cloud frames were combined to
exploit temporal information in sequence data. LSTM-TOD
[9] produced a spatio-temporal representation of point cloud
sequences using a 3D sparse ConvLSTM [17] modified to encode point cloud data. 3DVID [22] improved
the conventional ConvGRU by adopting a transformer at-
tention mechanism to exploit the spatio-temporal coherence
of point cloud sequences. TCTR [24] explored channel-wise
temporal correlations among consecutive frames and decoded
spatial information using a Transformer. 3D-MAN [21] stored
3D proposals and 3D spatial features obtained from a single-
frame 3D detector in a memory bank. Then, to integrate
the local features of objects extracted from each frame, the
method explored spatio-temporal relationships among the
proposals.
The proposed D-Align differs from these methods in that
a novel dual-query co-attention architecture is introduced
to utilize spatio-temporal information obtained from point
cloud sequences. The proposed method effectively aligns and
aggregates multi-frame features by refining the dual query
sets through multiple attention layers.
III. PROPOSED METHOD
Fig. 2 shows the overall structure of D-Align. It consists of three main blocks: 1) a BEV feature extractor, 2) a dual-query co-attention network (DUCANet), and 3) a 3D object detection head. The set of points acquired by the
LiDAR sensor over the duration of $T$ seconds is called a frame. Ego-motion compensation [4] is applied to the points within each frame. D-Align takes the sequence of point clouds $\{P_n\}_{n=t-N+1}^{t}$ over $N$ successive frames as an input, where $P_n$ denotes the point set obtained in the $n$th frame. The frame $t$ is called the target frame because we aim to detect objects for the frame $t$. The remaining frames are called support frames.
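For clarity, a minimal sketch of this input convention is given below; the Frame container and its field names are hypothetical and only illustrate how the $N$ ego-motion-compensated frames could be organized before feature extraction.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Frame:
    """One LiDAR frame: points accumulated over T seconds, with ego-motion
    compensation applied (hypothetical container; field names are
    illustrative)."""
    points: np.ndarray   # (M, 4) array of x, y, z, intensity
    timestamp: float     # acquisition time of the frame


def build_input_sequence(frames: List[Frame], n: int) -> List[np.ndarray]:
    """Return the point sets {P_{t-N+1}, ..., P_t} of the N most recent
    frames; the last entry is the target frame, the rest are support frames."""
    assert len(frames) >= n, "need at least N frames"
    return [f.points for f in frames[-n:]]
```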
A. Overview
The BEV feature extractor produces the BEV feature maps $\{F_{t-k}\}_{k=0}^{N-1}$ by using a grid-based backbone network for each frame [11], [19]. This backbone network produces the feature maps of $S$ scales, i.e., $F_n = \{F_n^s\}_{s=1}^{S}$. Next, DUCANet produces the enhanced target-frame BEV features by applying the dual-query co-attention mechanism to the multi-frame features $\{F_{t-k}\}_{k=0}^{N-1}$. DUCANet maintains two query sets, T-QS and S-QS. T-QS serves to store the target