PROCONTEXT: EXPLORING PROGRESSIVE CONTEXT TRANSFORMER FOR TRACKING
Jin-Peng Lan1*, Zhi-Qi Cheng2*, Jun-Yan He1†, Chenyang Li1,
Bin Luo1, Xu Bao1, Wangmeng Xiang1, Yifeng Geng1, Xuansong Xie1
1DAMO Academy, Alibaba Group 2Carnegie Mellon University
ABSTRACT
Existing Visual Object Tracking (VOT) methods take only the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamp the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating multi-scale static and dynamic templates to progressively perform accurate tracking. It explores the complementarity between spatial and temporal contexts, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance¹.
Index Terms— Context-aware transformer tracking
1. INTRODUCTION
Visual object tracking (VOT) is a crucial research topic due to its numerous applications, including autonomous driving, human-computer interaction, and video surveillance. In VOT, the goal is to predict the precise location of the target object in subsequent frames, given its location in the first frame, typically represented by a bounding box. However, due to challenges such as scaling and deformation, tracking systems must dynamically learn changes in object appearance to encode content information. Additionally, in fast-changing and crowded scenes, visual trackers must identify which object to track among multiple similar instances, making tracking particularly challenging.
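The task protocol above can be summarized in a few lines of code. The following is a generic, illustrative Python interface (the names Tracker, init, track, and run_sequence are our own, not any benchmark's API): the tracker receives the first frame with a ground-truth box and must output one box per subsequent frame.

from typing import Iterable, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels

class Tracker:
    def init(self, frame: np.ndarray, box: Box) -> None:
        """Store the first-frame target region as the template."""
        self.template_box = box

    def track(self, frame: np.ndarray) -> Box:
        """Predict the target box in a new frame (trivial placeholder)."""
        return self.template_box  # a real tracker re-estimates this per frame

def run_sequence(tracker: Tracker, frames: Iterable[np.ndarray],
                 init_box: Box) -> List[Box]:
    frames = iter(frames)
    tracker.init(next(frames), init_box)       # first frame: ground truth given
    return [tracker.track(f) for f in frames]  # one predicted box per frame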
To address these challenges, we propose an intuitive solution. Fig. 1 shows three video frames in chronological order in the first row, with their cropped patches displayed below. The middle row of Fig. 1 demonstrates that object appearance can change significantly during tracking. The red dashed circles in the middle column mark the target object, which is more similar to the object in the search area; exploiting such later appearances improves tracking performance. Furthermore, at the bottom of Fig. 1, we extend the template regions to include more background instances, marked by cyan dotted circles, which helps trackers distinguish the target from similar instances. Thus, cues from past frames and from the surrounding region are both crucial for visual object tracking; we refer to them as the temporal context and the spatial context, respectively.

Fig. 1: Fast-changing and crowded scenes are widespread in visual object tracking. Exploiting the temporal and spatial context in video sequences is the cornerstone of accurate tracking. (Figure: frames I1, I70, and I100, with the search area, target region, and extended regions annotated as temporal and spatial context.)

*Equal contribution, alphabetically sorted.
†Corresponding author.
¹The source code is at https://github.com/jp-lan/ProContEXT
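To make the spatial-context idea concrete, the following minimal NumPy sketch crops a tight target template together with an extended template around the same box. Everything here (the function name crop_multiscale_templates, the scale factors, and the nearest-neighbour resize) is our own illustrative assumption, not the paper's preprocessing code.

import numpy as np

def crop_multiscale_templates(frame, box, scales=(1.0, 2.0), out_size=128):
    """Crop one square patch per scale factor, centered on box = (x, y, w, h).

    Scale 1.0 covers roughly the target itself; larger scales pull in the
    surrounding background (and its distractors) as spatial context.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    crops = []
    for s in scales:
        side = max(int(np.sqrt(w * h) * s), 1)  # square context window
        x0, y0 = int(round(cx - side / 2)), int(round(cy - side / 2))
        patch = np.zeros((side, side, 3), dtype=frame.dtype)  # zero-pad borders
        sy0, sx0 = max(y0, 0), max(x0, 0)
        sy1 = min(y0 + side, frame.shape[0])
        sx1 = min(x0 + side, frame.shape[1])
        patch[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = frame[sy0:sy1, sx0:sx1]
        idx = np.linspace(0, side - 1, out_size).astype(int)
        crops.append(patch[idx][:, idx])        # nearest-neighbour resize
    return crops

# Example: a tight and an extended template from a dummy 480x640 frame.
templates = crop_multiscale_templates(np.zeros((480, 640, 3), np.uint8),
                                      box=(300, 200, 40, 80))

Dynamic templates for the temporal context can be obtained by re-running the same cropping on later frames at the tracker's predicted boxes.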
Despite the emergence of context-free tracking methods, such as Siamese-based trackers (e.g., SiamFC [1], SiamRPN [2], and SiamRPN++ [3]) and transformer-based approaches (e.g., TransT [4] and OSTrack [5]), their performance suffers in rapidly changing scenarios due to the lack of contextual information. To address this, spatial context learning pipelines, such as TLD [6] and its extensions (e.g., LTCT [7], FuCoLoT [8], and LTT [9]), have been developed. Furthermore, dynamic template updating has been utilized in various tasks, including perception [10, 11], segmentation [12, 13], tracking [14–17], and density estimation [18, 19], for spatial context modeling. However, a comprehensive study of both temporal and spatial context in tracking tasks has yet to be conducted.
To solve these issues, we propose a novel visual object tracking method called the Progressive Context Encoding Transformer Tracker (ProContEXT). ProContEXT encodes both temporal and spatial contexts through a template group composed of static and dynamic templates, providing a comprehensive and progressive context representation. The model leverages a context-aware self-attention module to learn rich and robust feature representations, while a tracking head predicts the target's bounding box in the search region.
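As a rough illustration of how a template group and token pruning can be combined, the following PyTorch sketch runs standard multi-head self-attention over the concatenated template and search tokens, then keeps only the search tokens most attended by the template tokens. The class name, the use of nn.MultiheadAttention, the keep_ratio value, and the scoring rule are all our assumptions; the paper's context-aware self-attention and revised token pruning are not reproduced here.

import torch
from torch import nn

class JointAttentionWithPruning(nn.Module):
    def __init__(self, dim=256, num_heads=8, keep_ratio=0.7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, templates, search):
        """templates: (B, Nt, C) tokens from static + dynamic templates.
        search: (B, Ns, C) tokens from the search region."""
        x = torch.cat([templates, search], dim=1)         # (B, Nt+Ns, C)
        out, attn = self.attn(x, x, x, need_weights=True,
                              average_attn_weights=True)  # attn: (B, N, N)
        nt = templates.shape[1]
        tmpl_out, search_out = out[:, :nt], out[:, nt:]
        # Score each search token by how much the template tokens attend to
        # it, then keep only the top-k search tokens (the pruning step).
        scores = attn[:, :nt, nt:].mean(dim=1)            # (B, Ns)
        k = max(1, int(self.keep_ratio * search_out.shape[1]))
        keep = scores.topk(k, dim=1).indices.sort(dim=1).values
        idx = keep.unsqueeze(-1).expand(-1, -1, search_out.shape[-1])
        return tmpl_out, search_out.gather(1, idx)        # pruned: (B, k, C)

# Usage: combine a static first-frame template with a dynamic template and
# prune the search tokens before the next layer / the tracking head.
layer = JointAttentionWithPruning()
static_t = torch.randn(2, 64, 256)    # tokens from the first-frame template
dynamic_t = torch.randn(2, 64, 256)   # tokens from an updated template
search = torch.randn(2, 256, 256)     # search-region tokens
tmpl, search_kept = layer(torch.cat([static_t, dynamic_t], dim=1), search)

Sorting the kept indices preserves the spatial order of the surviving search tokens, which matters if a later stage reshapes them back toward a 2-D feature map.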