PROCONTEXT: EXPLORING PROGRESSIVE CONTEXT TRANSFORMER FOR TRACKING
Jin-Peng Lan1∗, Zhi-Qi Cheng2∗, Jun-Yan He1†, Chenyang Li1,
Bin Luo1, Xu Bao1, Wangmeng Xiang1, Yifeng Geng1, Xuansong Xie1
1DAMO Academy, Alibaba Group    2Carnegie Mellon University
∗Equal contribution, alphabetically sorted    †Corresponding author
1The source code is at https://github.com/jp-lan/ProContEXT
ABSTRACT
Existing Visual Object Tracking (VOT) methods only take the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamp the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating multi-scale static and dynamic templates to progressively perform accurate tracking. It explores the complementarity between spatial and temporal contexts, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance1.
Index Terms—Context-aware transformer tracking
1. INTRODUCTION
Visual object tracking (VOT) is a crucial research topic due to its numerous applications, including autonomous driving, human-computer interaction, and video surveillance. In VOT, the goal is to predict the precise location of the target object in subsequent frames, given its location in the first frame, typically represented by a bounding box. However, due to challenges such as scale variation and deformation, tracking systems must dynamically learn object appearance changes to encode content information. Additionally, in fast-changing and crowded scenes, visual trackers must identify which object to track among multiple similar instances, making tracking particularly challenging.
Fig. 1: Fast-changing and crowded scenes are common in visual object tracking; exploiting the temporal and spatial context in video sequences is the cornerstone of accurate tracking. (Figure content: three frames I1, I70, and I100 with their cropped patches; labels mark the temporal context, spatial context, search area, target region, and extended regions.)

To address these challenges, we propose an intuitive solution. Fig. 1 shows three video frames in chronological order in the first row, with their cropped patches displayed below. The middle row of Fig. 1 demonstrates that object appearance can change significantly during tracking. The red dashed circles in the middle column mark the target object, whose more recent appearance is closer to the object in the search area, improving tracking performance. Furthermore, at the bottom of Fig. 1, we extend the template regions to include more background instances, marked by cyan dotted circles, which could assist trackers in distinguishing the target from similar instances. Thus, appearance changes over time and the instances surrounding the target are both crucial for visual object tracking; we refer to them as the temporal context and the spatial context, respectively.
Although context-free tracking methods have emerged, such as Siamese-based trackers (e.g., SiamFC [1], SiamRPN [2], and SiamRPN++ [3]) and transformer-based approaches (e.g., TransT [4] and OSTrack [5]), their performance suffers in rapidly changing scenarios due to the lack of contextual information. To address this, spatial context learning pipelines, such as TLD [6] and its extensions (e.g., LTCT [7], FuCoLoT [8], and LTT [9]), have been developed. Furthermore, dynamic template updating has been utilized for spatial context modeling in various tasks, including perception [10, 11], segmentation [12, 13], tracking [14–17], and density estimation [18, 19]. However, a comprehensive study of both temporal and spatial context in tracking tasks has yet to be conducted.
To solve these issues, we propose a novel visual object tracking method called Progressive Context Encoding Transformer Tracker (ProContEXT). ProContEXT encodes both temporal and spatial contexts through a template group composed of static and dynamic templates, providing a comprehensive and progressive context representation. The model leverages a context-aware self-attention module to learn rich and robust feature representations, while a tracking head
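As a rough illustration of the context encoding described above, the sketch below shows how tokens from a static first-frame template, a set of dynamic templates, and the search region can be concatenated and processed jointly by self-attention. It is a minimal, hypothetical example in PyTorch-style Python: the class name, layer choices, and tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of joint context encoding for tracking.
# Illustrative only: layer choices and shapes are assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn

class ContextEncoderSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, depth=4):
        super().__init__()
        # Plain transformer encoder layers stand in for the
        # context-aware self-attention module described above.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, static_tmpl, dynamic_tmpls, search):
        # static_tmpl:   (B, N_s, C) tokens from the first-frame template
        # dynamic_tmpls: (B, N_d, C) tokens from templates collected over time
        # search:        (B, N_x, C) tokens from the current search region
        tokens = torch.cat([static_tmpl, dynamic_tmpls, search], dim=1)
        for layer in self.layers:
            # Joint self-attention lets search-region tokens attend to all
            # template tokens, mixing spatial and temporal context.
            tokens = layer(tokens)
        # Only the search-region tokens would be passed to a tracking head.
        return tokens[:, -search.shape[1]:, :]

# Example usage with random token embeddings.
encoder = ContextEncoderSketch()
out = encoder(torch.randn(2, 64, 256),
              torch.randn(2, 128, 256),
              torch.randn(2, 256, 256))
print(out.shape)  # torch.Size([2, 256, 256])
```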