bounding boxes and associate them with the correct objects in every subsequent
frame, thus forming trajectories.
We formulate an auto-regressive tracking process as illustrated in Figure 1. At
the current frame, the model produces three outputs for each object: (1) a
bounding box locating the object in the current frame, (2) an appearance
embedding capturing the object's visual appearance at that location, and (3) a raw
transformer-decoder embedding. This information is then passed to the following
frame in the form of semantic queries to the decoder. For the following frame,
the model either looks for an object given its location, its appearance, or any
additional information carried over by the raw decoder output in queries. The
output embeddings are aggregated, and if the appearance output of an object at
frame $k$ matches the known appearance of the track (usually the appearance
output at frame $k-1$, but possibly earlier when using reID from memory to
overcome occlusions), the location output is added to the object's trajectory.
This makes our model applicable in generalised tracking scenarios where
class information is not available.
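As a rough illustration, the sketch below shows how such an appearance-gated trajectory update could look; the `Track` container, the cosine-similarity test, and the threshold are our own assumptions for illustration, not the paper's exact procedure.

```python
import torch.nn.functional as F

# Hypothetical sketch of the appearance-gated trajectory update.
# Track, the cosine-similarity test, and the threshold are
# illustrative assumptions, not the paper's exact formulation.

class Track:
    def __init__(self, box, appearance, raw):
        self.boxes = [box]            # trajectory of bounding boxes
        self.appearance = appearance  # last confirmed appearance embedding
        self.raw = raw                # raw transformer-decoder embedding

def update_track(track, box_out, app_out, raw_out, threshold=0.5):
    """Extend the trajectory if the appearance output at frame k matches
    the track's known appearance (frame k-1, or an earlier one retrieved
    from memory when re-identifying after an occlusion)."""
    if F.cosine_similarity(app_out, track.appearance, dim=0) > threshold:
        track.boxes.append(box_out)   # location output joins the trajectory
        track.appearance = app_out    # carried to the next frame as queries
        track.raw = raw_out
        return True
    return False                      # no match: leave the track unchanged
```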
At a high level, this tracking process is performed by a transformer [48]. The current
image is processed by a convolutional neural network and fed into a transformer-
encoder, whereas all semantic queries from the previous image are fed into the
transformer-decoder module – see Figure 2. This differs from traditional tracking-
by-detection approaches (e.g. Bergmann et al. [5]), where detection is separate
from data-association and where each step commonly uses separate embeddings.
Our method merges these two steps into one, and the initial object embedding is
disentangled within the transformer-decoder directly into embeddings used for
detection and data-association (reID).
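A minimal sketch of this data flow, assuming a single convolution in place of the CNN backbone and standard PyTorch transformer layers; the layer sizes and the two output heads are illustrative stand-ins, not the paper's settings (see the supplementary material for those).

```python
import torch.nn as nn

class TrackingTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # stand-in for the convolutional backbone
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # the decoder output is disentangled into detection and reID embeddings
        self.box_head = nn.Linear(d_model, 4)         # bounding-box regression
        self.reid_head = nn.Linear(d_model, d_model)  # appearance embedding

    def forward(self, image, queries):
        # image: (B, 3, H, W); queries: (B, num_objects, d_model),
        # carried over from the previous frame
        features = self.backbone(image).flatten(2).transpose(1, 2)
        memory = self.encoder(features)
        raw = self.decoder(queries, memory)           # raw decoder embeddings
        return self.box_head(raw), self.reid_head(raw), raw
```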
The rest of this section outlines the main parts of the model and their application
to tracking. For further details on the architecture, hyperparameters, and
implementation, please refer to the supplementary material.
3.1 Transformer-decoder queries
The key insight of our work is that the queries passed to the decoder
part of a transformer can be customised for the tracking task. For example, if we
know the approximate bounding box of an object from a previous frame, then
a query can be formed from this bounding box and used to search for the new
position of the object in its vicinity (in a similar manner to the RoI pooling
module of a traditional two-stage detector, which extracts the image embedding
corresponding to an input bounding box that is then used for bounding-box
regression and classification).
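One plausible way to realise such a location query, sketched below, is to embed the normalised box coordinates with a small MLP; the MLP and its sizes are assumptions for illustration, not the paper's construction.

```python
import torch
import torch.nn as nn

# Assumed embedding of a previous-frame box into a decoder query.
box_to_query = nn.Sequential(
    nn.Linear(4, 256),
    nn.ReLU(),
    nn.Linear(256, 256),  # d_model of the transformer
)

prev_box = torch.tensor([0.4, 0.5, 0.1, 0.2])  # cx, cy, w, h in [0, 1]
location_query = box_to_query(prev_box)        # query searching near this box
```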
What if we wanted to update the appearance, or find the location of an
object defined by its appearance? We simply extend this approach by having a
query encode the object's appearance.
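Continuing the sketch above, an appearance query could reuse the appearance embedding produced at the previous frame, here passed through an assumed learned projection into the decoder's query space:

```python
import torch
import torch.nn as nn

appearance_proj = nn.Linear(256, 256)  # assumed projection, d_model = 256
prev_appearance = torch.randn(256)     # appearance output from frame k-1
appearance_query = appearance_proj(prev_appearance)
```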
These semantic queries are used in an auto-regressive manner for tracking, in
that the output of the decoder of one frame is used as the query input for the
subsequent frame. We also include another type of query that is not used auto-
regressively, but instead is applied independently on each frame. This second