PROCONTEXT: EXPLORING PROGRESSIVE CONTEXT TRANSFORMER FOR TRACKING
Jin-Peng Lan1*, Zhi-Qi Cheng2*, Jun-Yan He1†, Chenyang Li1,
Bin Luo1, Xu Bao1, Wangmeng Xiang1, Yifeng Geng1, Xuansong Xie1
1DAMO Academy, Alibaba Group 2Carnegie Mellon University
ABSTRACT
Existing Visual Object Tracking (VOT) methods take only the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamp the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating multi-scale static and dynamic templates to progressively perform accurate tracking. It explores the complementarity between spatial and temporal contexts, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance¹.
Index Terms— Context-aware transformer tracking
1. INTRODUCTION
Visual object tracking (VOT) is a crucial research topic due to its numerous applications, including autonomous driving, human-computer interaction, and video surveillance. In VOT, the goal is to predict the precise location of the target object in subsequent frames, given its location in the first frame, typically represented by a bounding box. However, due to challenges such as scaling and deformation, tracking systems must dynamically learn changes in object appearance to encode content information. Additionally, in fast-changing and crowded scenes, visual trackers must identify which object to track among multiple similar instances, making tracking particularly challenging.
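The task protocol above can be summarized in a few lines of code. The following is a generic, illustrative Python interface (the names Tracker, init, track, and run_sequence are our own, not any benchmark's API): the tracker receives the first frame with a ground-truth box and must output one box per subsequent frame.

from typing import Iterable, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels

class Tracker:
    def init(self, frame: np.ndarray, box: Box) -> None:
        """Store the first-frame target region as the template."""
        self.template_box = box

    def track(self, frame: np.ndarray) -> Box:
        """Predict the target box in a new frame (trivial placeholder)."""
        return self.template_box  # a real tracker re-estimates this per frame

def run_sequence(tracker: Tracker, frames: Iterable[np.ndarray],
                 init_box: Box) -> List[Box]:
    frames = iter(frames)
    tracker.init(next(frames), init_box)       # first frame: ground truth given
    return [tracker.track(f) for f in frames]  # one predicted box per frame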
To address these challenges, we propose an intuitive solution. Fig. 1 shows three video frames in chronological order in the first row, with their cropped patches displayed below. The middle row of Fig. 1 demonstrates that object appearance can change significantly during tracking. The red dashed circles in the middle column mark the target object, which is more similar to the object in the search area; exploiting such later appearances improves tracking performance. Furthermore, at the bottom of Fig. 1, we extend the template regions to include more background instances, marked by cyan dotted circles, which helps trackers distinguish the target from similar instances. Thus, cues from past frames and from the surrounding region are both crucial for visual object tracking; we refer to them as the temporal context and the spatial context, respectively.

Fig. 1: Fast-changing and crowded scenes are widespread in visual object tracking. Exploiting the temporal and spatial context in video sequences is the cornerstone of accurate tracking. (Figure: frames I1, I70, and I100, with the search area, target region, and extended regions annotated as temporal and spatial context.)

*Equal contribution, alphabetically sorted.
†Corresponding author.
¹The source code is at https://github.com/jp-lan/ProContEXT
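To make the spatial-context idea concrete, the following minimal NumPy sketch crops a tight target template together with an extended template around the same box. Everything here (the function name crop_multiscale_templates, the scale factors, and the nearest-neighbour resize) is our own illustrative assumption, not the paper's preprocessing code.

import numpy as np

def crop_multiscale_templates(frame, box, scales=(1.0, 2.0), out_size=128):
    """Crop one square patch per scale factor, centered on box = (x, y, w, h).

    Scale 1.0 covers roughly the target itself; larger scales pull in the
    surrounding background (and its distractors) as spatial context.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    crops = []
    for s in scales:
        side = max(int(np.sqrt(w * h) * s), 1)  # square context window
        x0, y0 = int(round(cx - side / 2)), int(round(cy - side / 2))
        patch = np.zeros((side, side, 3), dtype=frame.dtype)  # zero-pad borders
        sy0, sx0 = max(y0, 0), max(x0, 0)
        sy1 = min(y0 + side, frame.shape[0])
        sx1 = min(x0 + side, frame.shape[1])
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = frame[sy0:sy1, sx0:sx1]
        idx = np.linspace(0, side - 1, out_size).astype(int)
        crops.append(patch[idx][:, idx])        # nearest-neighbour resize
    return crops

# Example: a tight and an extended template from a dummy 480x640 frame.
templates = crop_multiscale_templates(np.zeros((480, 640, 3), np.uint8),
                                      box=(300, 200, 40, 80))

Dynamic templates for the temporal context can be obtained by re-running the same cropping on later frames at the tracker's predicted boxes.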
Despite the emergence of context-free tracking methods, such as Siamese-based trackers (e.g., SiamFC [1], SiamRPN [2], and SiamRPN++ [3]) and transformer-based approaches (e.g., TransT [4] and OSTrack [5]), their performance suffers in rapidly changing scenarios due to the lack of contextual information. To address this, spatial context learning pipelines, such as TLD [6] and its extensions (e.g., LTCT [7], FuCoLoT [8], and LTT [9]), have been developed. Furthermore, dynamic template updating has been utilized in various tasks, including perception [10, 11], segmentation [12, 13], tracking [14–17], and density estimation [18, 19], for spatial context modeling. However, a comprehensive study of both temporal and spatial context in tracking tasks has yet to be conducted.
To solve these issues, we propose a novel visual object tracking method called the Progressive Context Encoding Transformer Tracker (ProContEXT). ProContEXT encodes both temporal and spatial contexts through a template group composed of static and dynamic templates, providing a comprehensive and progressive context representation. The model leverages a context-aware self-attention module to learn rich and robust feature representations, while a tracking head predicts the target's bounding box in the search region.
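As a rough illustration of how a template group and token pruning can be combined, the following PyTorch sketch runs standard multi-head self-attention over the concatenated template and search tokens, then keeps only the search tokens most attended by the template tokens. The class name, the use of nn.MultiheadAttention, the keep_ratio value, and the scoring rule are all our assumptions; the paper's context-aware self-attention and revised token pruning are not reproduced here.

import torch
from torch import nn

class JointAttentionWithPruning(nn.Module):
    def __init__(self, dim=256, num_heads=8, keep_ratio=0.7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, templates, search):
        """templates: (B, Nt, C) tokens from static + dynamic templates.
        search: (B, Ns, C) tokens from the search region."""
        x = torch.cat([templates, search], dim=1)         # (B, Nt+Ns, C)
        out, attn = self.attn(x, x, x, need_weights=True,
                              average_attn_weights=True)  # attn: (B, N, N)
        nt = templates.shape[1]
        tmpl_out, search_out = out[:, :nt], out[:, nt:]
        # Score each search token by how much the template tokens attend to
        # it, then keep only the top-k search tokens (the pruning step).
        scores = attn[:, :nt, nt:].mean(dim=1)            # (B, Ns)
        k = max(1, int(self.keep_ratio * search_out.shape[1]))
        keep = scores.topk(k, dim=1).indices.sort(dim=1).values
        idx = keep.unsqueeze(-1).expand(-1, -1, search_out.shape[-1])
        return tmpl_out, search_out.gather(1, idx)        # pruned: (B, k, C)

# Usage: combine a static first-frame template with a dynamic template and
# prune the search tokens before the next layer / the tracking head.
layer = JointAttentionWithPruning()
static_t = torch.randn(2, 64, 256)    # tokens from the first-frame template
dynamic_t = torch.randn(2, 64, 256)   # tokens from an updated template
search = torch.randn(2, 256, 256)     # search-region tokens
tmpl, search_kept = layer(torch.cat([static_t, dynamic_t], dim=1), search)

Sorting the kept indices preserves the spatial order of the surviving search tokens, which matters if a later stage reshapes them back toward a 2-D feature map.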