LONGSHORTNET: EXPLORING TEMPORAL AND SEMANTIC FEATURES FUSION IN STREAMING PERCEPTION
Chenyang Li1, Zhi-Qi Cheng2, Jun-Yan He1, Pengyu Li1,
Bin Luo1†, Hanyuan Chen1, Yifeng Geng1, Jin-Peng Lan1, Xuansong Xie1
1DAMO Academy, Alibaba Group    2Carnegie Mellon University
Equal contribution of C. Li, Z.-Q. Cheng, and J.-Y. He (in no particular order). †Corresponding author.
ABSTRACT
Streaming perception is a fundamental task in autonomous driving that requires a careful balance between the latency and accuracy of the autopilot system. However, current methods for streaming perception are limited: they rely on only the current frame and the one adjacent to it to learn movement patterns, which restricts their ability to model complex scenes and often leads to poor detection results. To address this limitation, we propose LongShortNet, a novel dual-path network that captures long-term temporal motion and integrates it with short-term spatial semantics for real-time perception. To our knowledge, LongShortNet is the first work to extend long-term temporal modeling to streaming perception, enabling spatiotemporal feature fusion. We evaluate LongShortNet on the challenging Argoverse-HD dataset and demonstrate that it outperforms existing state-of-the-art methods with almost no additional computational cost. Code is available at https://github.com/LiChenyang-Github/LongShortNet.
Index Terms: Perception in autonomous driving
1. INTRODUCTION
Autonomous driving requires the real-time perception of streaming video to react to motion changes, such as overtaking and turning. Different from traditional Video Object Detection (VOD) methods that focus on detecting and tracking objects in video frames [1–15], Li et al. proposed a new autopilot perception task called streaming perception [16]. Streaming perception is a valuable tool for simulating realistic autonomous driving scenarios, offering the new metric of streaming Average Precision (sAP) to consistently evaluate accuracy and latency together [16]. Unlike offline VOD, streaming perception allows for real-time perception, opening up new possibilities for autonomous driving.
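To make the sAP protocol concrete, the following minimal Python sketch shows the matching rule as we understand it from [16]: each ground-truth timestamp is scored against the most recent prediction the detector has finished by that time, so processing latency manifests as evaluation against stale results. The function name and timings below are our own illustration, not the official evaluation code.

import bisect

def match_streaming_predictions(pred_done_times, gt_times):
    """For each ground-truth timestamp, find the most recent prediction
    whose processing finished by that timestamp; standard AP computed
    over these pairs yields sAP."""
    matches = []
    for t in gt_times:
        # Index of the last prediction completed at or before time t.
        i = bisect.bisect_right(pred_done_times, t) - 1
        matches.append(i if i >= 0 else None)
    return matches

# A detector with 60 ms latency on a 30 FPS stream (frames every ~33 ms):
pred_done = [0.060, 0.093, 0.126, 0.159]  # frame arrival + 60 ms each
gt = [0.000, 0.033, 0.066, 0.100, 0.133]
print(match_streaming_predictions(pred_done, gt))
# -> [None, None, 0, 1, 2]: every frame is scored against a stale result

On a 30 FPS stream, a 60 ms detector is thus always judged roughly two frames behind the world, which is why sAP penalizes latency as well as inaccuracy.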
Fig. 1: (a) Comparison between offline detection (VOD) and streaming perception, where the latter is real-time and can respond promptly to motion changes. (b) Timeline contrasting processing time (latency) and waiting time for non-real-time versus real-time methods.

To enhance comprehension, Fig. 1 provides a visual comparison between Video Object Detection (VOD) and streaming perception, using colored bounding boxes to contrast real-world scenarios. Previous research [17] indicates that VOD methods, such as those presented in [18–22], are vulnerable to errors due to offline detection delays. Although some VOD approaches have attempted to balance speed and accuracy
by utilizing offline methods [23], streaming perception approaches, such as the one proposed in [16], consider only the last two frames to minimize latency, disregarding long-term temporal motion. Consequently, these methods struggle to manage complicated motion and scene shifts, since they cannot jointly account for the short-term spatial and long-term temporal aspects of video streams.
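A small worked example shows why two frames are not enough: two positions determine only a constant-velocity extrapolation, whereas three or more also expose acceleration. The Python sketch below (our illustration, not code from [16] or [17]) extrapolates the next position of a decelerating object both ways.

def extrapolate_const_velocity(p_prev, p_curr):
    # Two frames only support a linear (constant-velocity) guess.
    return p_curr + (p_curr - p_prev)

def extrapolate_const_accel(p0, p1, p2):
    # Three frames expose acceleration through second differences.
    velocity = p2 - p1
    accel = (p2 - p1) - (p1 - p0)
    return p2 + velocity + accel

# A braking object: positions 0, 10, 16 px over three frames, so its
# velocity fell from 10 to 6 px/frame and the true next position is ~18 px.
print(extrapolate_const_velocity(10, 16))  # 22 -> overshoots the object
print(extrapolate_const_accel(0, 10, 16))  # 18 -> tracks the braking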
Going further, we illustrate significant perceptual challenges arising from the lack of spatial semantics and temporal motion. Fig. 2 displays a video stream captured by a car's front-view camera. The state of a detected object in the video stream is influenced by both its own motion and the camera's movement. Beyond ideal uniform linear motion, actual video streams involve various challenges, such as 1) non-uniform motion (e.g., a vehicle accelerating to overtake), 2) non-straight motion (e.g., the object or camera turning), 3) scene occlusion (e.g., occlusion by billboards and oncoming cars), and 4) small objects (e.g., cars and road signs in the distance). Undoubtedly, the motions and scenarios encountered in real autonomous driving are exceedingly complex and uncertain.
Faced with these concerns, StreamYOLO [17] disregards most of the semantics and motion in video streams and uses only the last two frames as input (i.e., the current and previous frames). Due to this lack of spatial semantic and temporal motion cues, StreamYOLO is unable to handle complex scenes involving non-uniform, non-straight, and occluded objects.
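By contrast, a dual-path design such as the one proposed here retains a longer frame history. As a rough schematic only, the Python sketch below buffers a short path (the latest frame, carrying up-to-date spatial semantics) alongside a long path (the last few frames, carrying longer-term temporal motion). The class and parameter names are our own, and LongShortNet itself fuses backbone features rather than raw frames; this sketch merely illustrates the input bookkeeping.

from collections import deque

class DualPathBuffer:
    """Illustrative input bookkeeping for a dual-path streaming detector:
    the short path sees the current frame (spatial semantics), while the
    long path sees up to `history` previous frames (temporal motion)."""

    def __init__(self, history=3):
        self.past = deque(maxlen=history)

    def step(self, frame):
        long_path = list(self.past)  # up to `history` earlier frames
        short_path = frame           # the current frame only
        self.past.append(frame)
        return short_path, long_path

buf = DualPathBuffer(history=3)
for t in range(5):
    short, long_ = buf.step(f"frame_{t}")
    print(short, long_)
# On the last step, frame_4 is paired with ['frame_1', 'frame_2', 'frame_3'].

Because the long path can reuse features already computed for earlier frames, extending the temporal window in this way adds almost no extra computation, consistent with the near-zero overhead claimed in the abstract.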