End-to-end Tracking with a Multi-query
Transformer
Bruno Korbar* and Andrew Zisserman
Visual Geometry Group
University of Oxford
*korbar@robots.ox.ac.uk
Abstract.
Multiple-object tracking (MOT) is a challenging task that requires
simultaneous reasoning about the location, appearance, and identity of
the objects in the scene over time. Our aim in this paper is to move beyond
tracking-by-detection approaches, which perform well on datasets where
the object classes are known, to class-agnostic tracking that also performs
well for unknown object classes. To this end, we make the following
three contributions: first, we introduce semantic detector queries that
enable an object to be localized by specifying its approximate position,
or its appearance, or both; second, we use these queries within an auto-
regressive framework for tracking, and propose a multi-query tracking
transformer (MQT ) model for simultaneous tracking and appearance-
based re-identification (reID) based on the transformer architecture with
deformable attention. This formulation allows the tracker to operate
in a class-agnostic manner, and the model can be trained end-to-end;
finally, we demonstrate that MQT performs competitively on standard
MOT benchmarks, outperforms all baselines on generalised-MOT, and
generalises well to much harder tracking problems such as tracking any
object on the TAO dataset.
1 Introduction
The objective of this paper is multi-object tracking (MOT) – the task of
determining the spatial location of multiple objects over time in a video. This is a
very well researched area and, broadly, two approaches are dominant. The first is
tracking-by-detection, where a strong object category detector is trained for the
object class of interest, for example a person or a car. This approach proceeds in
two steps: the detector is first applied independently on each frame, and in the
second step the tracking task reduces to the data association of grouping these
detections over time (over the frames in this case). Examples of this approach
include [5,12,50,59]. The second approach is class-agnostic tracking, where any
object can be tracked. The object of interest is specified by a bounding box or
segmentation in one frame, and the task is then to track that object through the
other frames. Examples of this approach include [8,16,26].
arXiv:2210.14601v1 [cs.CV] 26 Oct 2022
Fig. 1: An overview of the functionality of the multi-query tracking transformer (MQT ).
Each frame generates location and appearance embeddings of the target object. These
embeddings are used as queries for the subsequent frame. By propagating information
between frames in this simple manner the object is tracked over time through the video.
The tracking-by-detection approach generally outperforms class-agnostic models
at the moment, but the approaches often suffer from overly complex processing
pipelines (using multiple separately trained models for each step) and they rely
on prior knowledge of the object class of interest. More importantly, the detection
model and the data-association model are in tension: with one model trained to
tolerate object class variations (to better detect all instances of the same class),
whilst the other is trained to maximise discrimination of two instances of the
same class (to prevent identity switching). Such models are generally not trained
end-to-end. Lastly, such models are highly specific – the results of these trackers
often don’t generalise well to the more general tracking scenario [17].
In this paper, we present a class-agnostic tracker that can be trained end-to-
end, but that also builds on the lessons of a strong object category detector. To this
end we base the tracker on the DETR object category detector [11], using a
transformer-detector modified in such a way that it can attend to multiple
objects' locations and identities simultaneously. We introduce dual ‘object-specific
location’ and ‘identity’ encodings (dubbed semantic queries) which allow the
model to selectively focus on the location or appearance of objects we want to
track, irrespective of their classes. These object-specific embeddings enable the
model to be optimized jointly for track prediction and re-identification by training
in a class-agnostic manner. In this way we achieve a single-model class-agnostic
tracker that performs competitively on several MOT benchmarks [18,35], and
can outperform all previous work on the class-agnostic MOT task [3] where the
class prior is not known. Lastly, we show that the tracker trained in this way can
also generalise well to tracking tasks such as TAO [17], where the categories and
number of tracking targets are far more general than on MOT benchmarks.
To summarise, we make the following three contributions: First, we introduce
the concept of semantic detector-queries and show their effectiveness for multi-
object tracking. Second, we design a transformer-based class-agnostic tracking
model around semantic detector-queries that is capable of simultaneous detection
and re-identification of multiple objects in the scene. Finally, we achieve compet-
itive results on various MOT benchmarks [18,35] where object identity is used,
demonstrate state-of-the-art class-agnostic performance on generalised MOT [3],
and show the potential of the model to generalise to even harder tracking tasks
on TAO [17].
2 Related work
To put this work into context, we compare it to the modern tracking approaches
that use a similar tracking paradigm to ours. There are, of course, many other
tracking approaches (e.g. tracking-by-segmentation [7,36,49,52]) that are not
as closely related to our method.
Tracking by detection
approaches form trajectories by associating detections
over time [12,50,59]. A common way of representing the data-association
problem is to view it as a graph, where each detection is a node linked by
possible edges, formulating it as a maximum-flow problem [4] with distance-
based [39,57] or learned costs [29]. Alternative formulations use association
graphs [33], learned models based on motion models [27], or a completely learned
graph neural network [9]. A common issue with graph-based approaches is the
high optimization cost that doesn't necessarily translate to better performance.
Detections can also be associated by modelling motion directly [1,28]. Pre-deep-
learning approaches often rely on assumptions of constant motion [2,13] or
existing models of human behaviour [38,43,53], whilst more modern approaches
attempt to learn the motion models directly from the data [29]. Our model
doesn't model motion explicitly, although we do rely on the assumption of small
motion between frames to account for appearance similarity.
Tracking by appearance
methods use increasingly powerful image representations to track objects based
on the similarities produced by either Siamese networks [29,45], learned reID
features [41], or other alternative methods [12,14,37].
Tracking by regression
refines (instead of detecting) the bounding box of the
current frame by regressing the current bounding box given the bounding box
at the previous frame [5,9,19,59]. As these models usually lack information
about the object identity or relative track location, additional reID and motion
models [5,19,59] or graph methods [9] are necessary to achieve competitive
performance. Our model falls roughly in this category, although we show that it
can learn reID information directly from data.
Tracking with transformers
uses aspects of the transformer architecture [48], such as self-attention and
set-prediction [11,60]. The Trackformer, a transformer
tracker proposed by Meinhardt et al. [34], is the closest approach to ours, employing
largely the same architecture, but it uses class information for tracking and
does not employ semantic queries. The TransTrack model [46] operates in the same
way as [34] but with a different underlying backbone. MOTR [56] extends this
framework by adding a "query-interaction module" to reason about track-queries
over time. Yu et al. [55] leverage the importance of semantically-decoupled embed-
dings. They employ a "global context disentangling unit" to separate the final-
layer output of a backbone CNN directly into semantic embeddings; we, on the
other hand, do this in the transformer decoder. The TrackCenter model [51] introduces two
Fig. 2: We show the high-level overview of MQT on the left. The two distinct training
stages, with a single query during initialization and multiple queries during the tracking
stage, are shown on the right. Tracks are initialised either by using det (detection)
queries, or with existing detections projected into semantic queries (e.g. Q0_pos as
shown in the figure). Each query is then processed to obtain the decoder output (V0),
bounding-box prediction (b0) and appearance vector (a0). These are passed to the
following frame in the form of semantic queries, and their corresponding outputs
(V1|k, b1|k, a1|k respectively, with k = {pos, id, both, det}) are aggregated for each
object to obtain the final predictions (V1, b1, a1).
key improvements: pixel-level dense queries, and semantically-decoupled representation
learning via model separation. TransMOT [15] utilises transformers in
a different way, by introducing a spatio-temporal graph transformer for post-
detection data-association. MeMOT [10] introduces a memory module on top of the
transformer encoder to further boost performance. Note that none of these works
can be generalised to GMOT or TAO tasks, as they are tracking-by-detection
approaches and cannot be used for class-agnostic tracking. For more in-depth
comparison to most-similar works please refer to the supplementary material.
Class-agnostic tracking
leverages powerful appearance embeddings to track
objects based on the similarity of the embeddings. The method does not leverage class
information explicitly. These models often use a form of a Siamese architecture
to learn a patch-based matching function [8,16,26,29,30,45,47]. However,
even if the model is in principle capable of class-agnostic inference, models such
as [45] are not fully class-agnostic, as they require class information for successful
training of their tracker in the form of an object detection loss (that requires
ground-truth class information for every object in the training triplet). Our work
differs in that it does not require this explicit object class labelling.
3 Multi-query transformer for tracking
The goal of multi-object tracking is to obtain the trajectories of n objects over
a sequence of frames from a video. For example, given the initial set of object
locations (bounding boxes) in the first frame, the task is to predict a new set of
bounding boxes and associate them with the correct objects for every subsequent
frame, thus forming trajectories.
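This task setup can be sketched with a minimal trajectory data structure. The `Track` class and its field names below are our own illustration of the bookkeeping, not part of the model:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    """One object trajectory: a bounding box per frame the object was seen in."""
    object_id: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x, y, w, h)

    def add(self, frame, box):
        self.boxes[frame] = box

# Frame 0: one track is initialised per given initial bounding box.
init_boxes = {0: (10.0, 20.0, 50.0, 80.0), 1: (200.0, 40.0, 30.0, 60.0)}
tracks = {oid: Track(oid, {0: box}) for oid, box in init_boxes.items()}
```

Each subsequent frame then extends every track with one new associated box, so a trajectory is simply the per-frame history accumulated in `boxes`.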
We formulate an auto-regressive tracking process as illustrated in Figure 1. At
the current frame, the model produces three outputs for each object: (1) a
bounding box of the object’s location in the current frame, (2) an appearance
embedding of the visual appearance of the object given its location, and (3) a raw
transformer-decoder embedding. This information is then passed to the following
frame in the form of semantic queries to the decoder. For the following frame,
the model either looks for an object given its location, its appearance, or any
additional information carried over by the raw decoder output in queries. The
output embeddings are aggregated, and if the appearance output of an object at
frame k matches the known appearance of the track (usually the appearance
output at frame k-1, but it can be earlier when using reID from memory to
overcome occlusions), the location output is then added to the trajectory of the
object. This makes our model applicable in generalised tracking scenarios where
class information is not available.
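A minimal sketch of this per-frame update, assuming cosine similarity between appearance embeddings and a hypothetical acceptance threshold (the paper's actual matching criterion and threshold may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two appearance embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def step(track_app, pred_box, pred_app, trajectory, frame, thresh=0.5):
    """One auto-regressive update: accept the predicted box only if the
    predicted appearance matches the track's known appearance."""
    if cosine(track_app, pred_app) >= thresh:
        trajectory[frame] = pred_box
        return pred_app   # becomes the appearance query for the next frame
    return track_app      # keep the old appearance (e.g. under occlusion)

traj = {0: (10, 10, 40, 60)}
app = step([1.0, 0.0], (12, 11, 40, 60), [0.9, 0.1], traj, frame=1)
```

Returning the accepted appearance makes the process auto-regressive: the output at frame k becomes the query at frame k+1, while a rejected match leaves the stored appearance untouched so that reID from memory remains possible.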
At a high level, this work is performed by a transformer [48]. The current
image is processed by a convolutional neural network and fed into a transformer
encoder, whereas all semantic queries from the previous image are fed into the
transformer-decoder module – see Figure 2. This differs from traditional tracking-
by-detection approaches (e.g. Bergmann et al. [5]) where detection is separate
from data-association and where each step commonly uses separate embeddings.
Our method merges these two steps into one, and the initial object embedding is
disentangled within the transformer-decoder directly into embeddings used for
detection and data-association (reID).
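The disentangling of one shared embedding into detection and reID outputs can be sketched as two small heads reading the same decoder output. The dimensions, weights, and head names below are illustrative assumptions, not the paper's actual parameterisation:

```python
import math
import random

random.seed(0)
D = 16  # decoder embedding dimension (illustrative)

def linear(W, x):
    """Plain matrix-vector product standing in for a learned linear head."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Shared decoder output for one object query ...
decoder_out = [random.gauss(0, 1) for _ in range(D)]

# ... read by two small heads with (here, random) weights.
W_box = [[random.gauss(0, 1) for _ in range(D)] for _ in range(4)]
W_app = [[random.gauss(0, 1) for _ in range(D)] for _ in range(8)]

# Box head: a sigmoid keeps the normalised (cx, cy, w, h) inside [0, 1].
box = [1.0 / (1.0 + math.exp(-v)) for v in linear(W_box, decoder_out)]

# Appearance head: unit-normalised embedding, suitable for cosine-based reID.
raw = linear(W_app, decoder_out)
norm = math.sqrt(sum(v * v for v in raw))
appearance = [v / norm for v in raw]
```

The point of the design is that both heads are trained jointly from the same embedding, so detection and data-association need no separately trained models.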
The rest of this section outlines the main parts of the model and their application
for tracking. For more detailed information on architecture, hyperparameters and
implementation details please refer to the supplementary material.
3.1 Transformer-decoder queries
The key insight of our work is the fact that the queries passed to the decoder
part of a transformer can be customized for the tracking task. For example, if we
know the approximate bounding box of an object from a previous frame, then
a query can be formed from this bounding box and used to search for the new
position of the object in its vicinity (in a similar manner to the RoI pooling
module of a traditional two-stage detector that extracts the image embedding
corresponding to the input bounding box, and is then used for bounding box
regression and classification).
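As an illustrative stand-in for how a bounding box can become a decoder query, a fixed sinusoidal encoding of the normalised box coordinates is sketched below; the paper projects boxes into queries through a learned module, so the function names, dimensions, and temperature here are our own assumptions:

```python
import math

def sinusoidal(value, dim=8, temperature=100.0):
    """Fixed sinusoidal encoding of a scalar in [0, 1] (dim must be even)."""
    out = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        out.append(math.sin(value * math.pi / freq))
        out.append(math.cos(value * math.pi / freq))
    return out

def positional_query(box):
    """Embed a normalised (cx, cy, w, h) box into a single query vector."""
    return [v for coord in box for v in sinusoidal(coord)]

q = positional_query((0.5, 0.5, 0.2, 0.1))  # 4 coords x 8 dims = 32 values
```

Such a vector, handed to the decoder as a query, biases attention towards image regions near the previous box, in the spirit of the RoI-pooling analogy above.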
What if we wanted to update the appearance (or maybe find the location of an
object defined by its appearance)? We simply extend this approach by having a
query encode the object appearance.
These semantic queries are used in an auto-regressive manner for tracking, in
that the output of the decoder of one frame is used as the query input for the
subsequent frame. We also include another type of query that is not used auto-
regressively, but instead is applied independently on each frame. This second