modeled their temporal variations to improve performance.
In LSTM-TOD [9] and 3DVID [22], multiple bird's-eye-view (BEV) feature maps extracted from the point cloud sequence were combined using variants of recurrent modules, i.e., ConvLSTM [17] and ConvGRU [2], respectively. TCTR [24]
exploited temporal-channel relations over multiple feature
maps using an encoder-decoder structure [16]. 3D-MAN
[21] aggregated box-level features with a memory bank containing sequential temporal view information.
In this paper, we present a new 3D-MOD method called
Dual-Query Align (D-Align), which can produce robust
spatio-temporal BEV representations using a multi-frame
point cloud sequence. We propose a novel dual-query co-attention network that employs two types of queries, a target query set (T-QS) and a support query set (S-QS), to facilitate co-attention to both the target and support frame features. T-QS and S-QS carry the target frame features and the support frame features, respectively, and are continuously enhanced through multiple layers of attention.
In each attention layer, the dual queries are updated in two
steps. First, the inter-frame deformable alignment network
(IDANet) aligns S-QS to T-QS using deformable attention.
The deformable attention mechanism is applied to S-QS with
mask offsets and weights determined by multi-scale temporal
context features generated from two adjacent BEV feature
maps. This step updates S-QS. Next, the inter-frame
gated aggregation network (IGANet) aggregates S-QS and
T-QS using a gated attention network [10]. The aggregated
query features finally update T-QS. After going through
multiple attention layers, D-Align produces the improved
BEV features of the target frame for 3D object detection.
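To make this two-step update concrete, the following PyTorch-style sketch outlines one layer of the dual-query update; the module names, tensor shapes, and the simplified attention and gating operations are illustrative assumptions rather than the actual IDANet/IGANet designs.

```python
import torch
import torch.nn as nn


class DualQueryLayer(nn.Module):
    """One layer of the dual-query update (illustrative sketch).

    Assumptions: T-QS and S-QS are flattened BEV grids of shape (B, HW, C),
    and C is divisible by the number of attention heads. The actual IDANet
    uses temporal context-guided deformable attention and IGANet uses a
    gated attention network; simple stand-ins are used here.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Temporal context from the pair of query sets (stand-in for the
        # paper's multi-scale temporal context features).
        self.context = nn.Linear(2 * channels, channels)
        # Stand-in for IDANet's deformable alignment attention.
        self.align = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Stand-in for IGANet's gated aggregation.
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, t_qs: torch.Tensor, s_qs: torch.Tensor):
        # Step 1 (IDANet): align S-QS to T-QS; this step updates S-QS.
        ctx = self.context(torch.cat([t_qs, s_qs], dim=-1))
        s_qs, _ = self.align(query=t_qs + ctx, key=s_qs, value=s_qs)

        # Step 2 (IGANet): aggregate S-QS and T-QS with a gate; the
        # aggregated features update T-QS.
        g = self.gate(torch.cat([t_qs, s_qs], dim=-1))
        t_qs = g * s_qs + (1.0 - g) * t_qs
        return t_qs, s_qs
```

Stacking several such layers and reading out the final T-QS as the enhanced target-frame BEV feature mirrors the multi-layer refinement described above.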
We evaluate the proposed D-Align on the widely used
public nuScenes dataset [4]. Our experimental results show
that the proposed method improves the 3D detection baseline
by significant margins and outperforms the latest LiDAR-
based state-of-the-art (SOTA) 3D object detectors.
The contributions of our paper are summarized as follows.
• We propose an enhanced 3D object detection architecture, D-Align, which exploits the temporal structure of point cloud sequences. We devise a novel dual-query co-attention network that transforms the dual queries S-QS and T-QS through successive operations of feature alignment and feature aggregation. This co-attention mechanism attends to both the support and target frame features to gather useful spatio-temporal information from multiple frames of point cloud data.
• We design a temporal context-guided deformable attention mechanism to achieve inter-frame feature alignment. Our deformable attention differs from the original model proposed in [29] in that the attention mask is adjusted by the motion context obtained from two adjacent BEV feature maps (see the sketch after this list). Our analysis shows that the use of such motion features contributes significantly to the overall detection performance.
• Our code will be publicly released.
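As a rough illustration of the second contribution, the sketch below predicts sampling offsets and attention weights from the motion context of two adjacent BEV feature maps and uses them to resample the support-frame features. The module name, the single-scale setting, and the normalized-offset convention are assumptions made for brevity; this is not the exact IDANet formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionGuidedDeformableSampling(nn.Module):
    """Single-scale sketch of temporal context-guided deformable sampling.

    Assumptions: `target` and `support` are BEV feature maps of shape
    (B, C, H, W); offsets are predicted in normalized [-1, 1] coordinates.
    This is an illustration, not the paper's multi-scale attention design.
    """

    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Motion context from the two adjacent BEV maps.
        self.context = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        # Per-location sampling offsets and attention weights.
        self.offsets = nn.Conv2d(channels, 2 * num_points, kernel_size=1)
        self.weights = nn.Conv2d(channels, num_points, kernel_size=1)

    def forward(self, target: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
        b, _, h, w = target.shape
        ctx = self.context(torch.cat([target, support], dim=1))
        offs = self.offsets(ctx).view(b, self.num_points, 2, h, w)
        attn = self.weights(ctx).softmax(dim=1)          # (B, P, H, W)

        # Base sampling grid in normalized coordinates for grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=target.device),
            torch.linspace(-1.0, 1.0, w, device=target.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)

        out = torch.zeros_like(target)
        for p in range(self.num_points):
            grid = base + offs[:, p].permute(0, 2, 3, 1)   # (B, H, W, 2)
            sampled = F.grid_sample(support, grid, align_corners=True)
            out = out + attn[:, p : p + 1] * sampled       # weighted sum
        return out
```

In D-Align the deformable attention is applied to S-QS within each attention layer; the sketch operates directly on BEV maps only for simplicity.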
II. RELATED WORK
3D object detection techniques based on a single point
set [7], [11], [19], [23], [26] have advanced rapidly since
deep neural networks have been adopted to encode irregular
and unordered point clouds. However, the performance of
these 3D object detectors remains limited because they do
not exploit temporal information in sequence data.
To date, several 3D-MOD methods have been proposed,
which used point cloud sequences to perform 3D object
detection [9], [21], [22], [24]. These methods explored
ways to represent time-varying features obtained from long sequences of point clouds. In [9], [22], [24], LiDAR features
obtained from multiple point cloud frames were combined to
exploit temporal information in sequence data. LSTM-TOD
[9] produced a spatio-temporal representation of point cloud
sequences using a 3D sparse ConvLSTM [17] modified to encode point cloud data. 3DVID [22] improved
the conventional ConvGRU by adopting a transformer at-
tention mechanism to exploit the spatio-temporal coherence
of point cloud sequences. TCTR [24] explored channel-wise
temporal correlations among consecutive frames and decoded
spatial information using a Transformer. 3D-MAN [21] stored
3D proposals and 3D spatial features obtained from a single-
frame 3D detector in a memory bank. Then, to integrate
the local features of objects extracted from each frame, the
method explored spatio-temporal relationships among the
proposals.
The proposed D-Align differs from these methods in that
a novel dual-query co-attention architecture is introduced
to utilize spatio-temporal information obtained from point
cloud sequences. The proposed method effectively aligns and
aggregates multi-frame features by refining the dual query
sets through multiple attention layers.
III. PROPOSED METHOD
Fig. 2 shows the overall structure of D-Align. It consists of three main blocks: 1) a BEV feature extractor, 2) a dual-query co-attention network (DUCANet), and 3) a 3D object detection head. The set of points acquired by the
LiDAR sensor over the duration of $T$ seconds is called a frame. Ego-motion compensation [4] is applied to the points within each frame. D-Align takes the sequence of point clouds $\{P_n\}_{n=t-N+1}^{t}$ over $N$ successive frames as an input, where $P_n$ denotes the point set obtained in the $n$th frame. The frame $t$ is called the target frame because we aim to detect objects for the frame $t$. The remaining frames are called support frames.
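For clarity, a minimal sketch of this input convention is given below; the Frame container and its field names are hypothetical and only illustrate how the $N$ ego-motion-compensated frames could be organized before feature extraction.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Frame:
    """One LiDAR frame: points accumulated over T seconds, with ego-motion
    compensation applied (hypothetical container; field names are
    illustrative)."""
    points: np.ndarray   # (M, 4) array of x, y, z, intensity
    timestamp: float     # acquisition time of the frame


def build_input_sequence(frames: List[Frame], n: int) -> List[np.ndarray]:
    """Return the point sets {P_{t-N+1}, ..., P_t} of the N most recent
    frames; the last entry is the target frame, the rest are support frames."""
    assert len(frames) >= n, "need at least N frames"
    return [f.points for f in frames[-n:]]
```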
A. Overview
The BEV feature extractor produces the BEV feature maps $\{F_{t-k}\}_{k=0}^{N-1}$ by using a grid-based backbone network for each frame [11], [19]. This backbone network produces the feature maps of $S$ scales, i.e., $F_n = \{F_n^s\}_{s=1}^{S}$. Next, DUCANet produces the enhanced target-frame BEV features by applying the dual-query co-attention mechanism to the multi-frame features $\{F_{t-k}\}_{k=0}^{N-1}$. DUCANet maintains two query sets, T-QS and S-QS. T-QS serves to store the target