LONGSHORTNET: EXPLORING TEMPORAL AND SEMANTIC FEATURES FUSION IN STREAMING PERCEPTION
Chenyang Li1, Zhi-Qi Cheng2, Jun-Yan He1, Pengyu Li1,
Bin Luo1†, Hanyuan Chen1, Yifeng Geng1, Jin-Peng Lan1, Xuansong Xie1
1DAMO Academy, Alibaba Group    2Carnegie Mellon University
Equal contribution of C. Li, Z.-Q. Cheng, and J.-Y. He (in no particular order). †Corresponding author.
ABSTRACT
Streaming perception is a fundamental task in autonomous driving that requires a careful balance between the latency and accuracy of the autopilot system. However, current methods for streaming perception are limited: they rely on only the current frame and the one adjacent to it to learn movement patterns, which restricts their ability to model complex scenes and often leads to poor detection results. To address this limitation, we propose LongShortNet, a novel dual-path network that captures long-term temporal motion and integrates it with short-term spatial semantics for real-time perception. To our knowledge, LongShortNet is the first work to extend long-term temporal modeling to streaming perception, enabling spatiotemporal feature fusion. We evaluate LongShortNet on the challenging Argoverse-HD dataset and demonstrate that it outperforms existing state-of-the-art methods with almost no additional computational cost. Code is available at https://github.com/LiChenyang-Github/LongShortNet.
Index Terms: Perception in autonomous driving
1. INTRODUCTION
Autonomous driving requires the real-time perception of streaming video to react to motion changes, such as overtaking and turning. Different from traditional Video Object Detection (VOD) methods that focus on detecting and tracking objects in video frames [1–15], Li et al. proposed a new autopilot perception task called streaming perception [16]. Streaming perception is a valuable tool for simulating realistic autonomous driving scenarios, offering the new metric of streaming Average Precision (sAP) to consistently evaluate accuracy and latency together [16]. Unlike offline VOD, streaming perception allows for real-time perception, opening up new possibilities for autonomous driving.
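To make the sAP protocol concrete, the following minimal Python sketch shows the matching rule as we understand it from [16]: each ground-truth timestamp is scored against the most recent prediction the detector has finished by that time, so processing latency manifests as evaluation against stale results. The function name and timings below are our own illustration, not the official evaluation code.

import bisect

def match_streaming_predictions(pred_done_times, gt_times):
    """For each ground-truth timestamp, find the most recent prediction
    whose processing finished by that timestamp; standard AP computed
    over these pairs yields sAP."""
    matches = []
    for t in gt_times:
        # Index of the last prediction completed at or before time t.
        i = bisect.bisect_right(pred_done_times, t) - 1
        matches.append(i if i >= 0 else None)
    return matches

# A detector with 60 ms latency on a 30 FPS stream (frames every ~33 ms):
pred_done = [0.060, 0.093, 0.126, 0.159]  # frame arrival + 60 ms each
gt = [0.000, 0.033, 0.066, 0.100, 0.133]
print(match_streaming_predictions(pred_done, gt))
# -> [None, None, 0, 1, 2]: every frame is scored against a stale result

On a 30 FPS stream, a 60 ms detector is thus always judged roughly two frames behind the world, which is why sAP penalizes latency as well as inaccuracy.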
Fig. 1: (a) Comparison between offline detection (VOD) and streaming perception, where the latter is real-time and can respond promptly to motion changes. (b) Timeline contrasting processing time (latency) and waiting time for non-real-time versus real-time methods.

To enhance comprehension, Fig. 1 provides a visual comparison between Video Object Detection (VOD) and streaming perception, using colored bounding boxes to contrast real-world scenarios. Previous research [17] indicates that VOD methods, such as those presented in [18–22], are vulnerable to errors due to offline detection delays. Although some VOD approaches have attempted to balance speed and accuracy
by utilizing offline methods [23], streaming perception approaches, such as the one proposed in [16], consider only the last two frames to minimize latency, disregarding long-term temporal motion. Consequently, these methods struggle to manage complicated motion and scene shifts, since they cannot jointly account for the short-term spatial and long-term temporal aspects of video streams.
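A small worked example shows why two frames are not enough: two positions determine only a constant-velocity extrapolation, whereas three or more also expose acceleration. The Python sketch below (our illustration, not code from [16] or [17]) extrapolates the next position of a decelerating object both ways.

def extrapolate_const_velocity(p_prev, p_curr):
    # Two frames only support a linear (constant-velocity) guess.
    return p_curr + (p_curr - p_prev)

def extrapolate_const_accel(p0, p1, p2):
    # Three frames expose acceleration through second differences.
    velocity = p2 - p1
    accel = (p2 - p1) - (p1 - p0)
    return p2 + velocity + accel

# A braking object: positions 0, 10, 16 px over three frames, so its
# velocity fell from 10 to 6 px/frame and the true next position is ~18 px.
print(extrapolate_const_velocity(10, 16))  # 22 -> overshoots the object
print(extrapolate_const_accel(0, 10, 16))  # 18 -> tracks the braking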
Going further, we illustrate significant perceptual challenges arising from the lack of spatial semantics and temporal motion. Fig. 2 displays a video stream captured by a car's front-view camera. The state of a detected object in the video stream is influenced by both its own motion and the camera's movement. Beyond ideal uniform linear motion, actual video streams involve various challenges, such as 1) non-uniform motion (e.g., a vehicle accelerating to overtake), 2) non-straight motion (e.g., the object or camera turning), 3) scene occlusion (e.g., occlusion by billboards and oncoming cars), and 4) small objects (e.g., cars and road signs in the distance). Undoubtedly, the motions and scenarios encountered in real autonomous driving are exceedingly complex and uncertain.
Faced with these concerns, StreamYOLO [17] disregards most of the semantics and motion in video streams and uses only the last two frames as input (i.e., the current and previous frames). Due to this lack of spatial semantic and temporal motion cues, StreamYOLO is unable to handle complex scenes involving non-uniform, non-straight, and occluded objects.
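By contrast, a dual-path design such as the one proposed here retains a longer frame history. As a rough schematic only, the Python sketch below buffers a short path (the latest frame, carrying up-to-date spatial semantics) alongside a long path (the last few frames, carrying longer-term temporal motion). The class and parameter names are our own, and LongShortNet itself fuses backbone features rather than raw frames; this sketch merely illustrates the input bookkeeping.

from collections import deque

class DualPathBuffer:
    """Illustrative input bookkeeping for a dual-path streaming detector:
    the short path sees the current frame (spatial semantics), while the
    long path sees up to `history` previous frames (temporal motion)."""

    def __init__(self, history=3):
        self.past = deque(maxlen=history)

    def step(self, frame):
        long_path = list(self.past)  # up to `history` earlier frames
        short_path = frame           # the current frame only
        self.past.append(frame)
        return short_path, long_path

buf = DualPathBuffer(history=3)
for t in range(5):
    short, long_ = buf.step(f"frame_{t}")
    print(short, long_)
# On the last step, frame_4 is paired with ['frame_1', 'frame_2', 'frame_3'].

Because the long path can reuse features already computed for earlier frames, extending the temporal window in this way adds almost no extra computation, consistent with the near-zero overhead claimed in the abstract.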