bounding boxes and associate them with the correct objects in every subsequent
frame, thus forming trajectories.
We formulate an auto-regressive tracking process as illustrated in Figure 1. At
the current frame, the model produces three outputs for each object: (1) a
bounding box locating the object in the current frame, (2) an appearance
embedding capturing the object's visual appearance at that location, and (3) a raw
transformer-decoder embedding. This information is then passed to the following
frame in the form of semantic queries to the decoder. For the following frame,
the model either looks for an object given its location, its appearance, or any
additional information carried over by the raw decoder output in queries. The
output embeddings are aggregated, and if the appearance output of an object at
frame $k$ matches the known appearance of the track (usually the appearance
output at frame $k-1$, but possibly earlier when using reID from memory to
overcome occlusions), the location output is added to the object's trajectory.
This makes our model applicable in generalised tracking scenarios where
class information is not available.
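As a rough illustration, the sketch below shows how such an appearance-gated trajectory update could look; the `Track` container, the cosine-similarity test, and the threshold are our own assumptions for illustration, not the paper's exact procedure.

```python
import torch.nn.functional as F

# Hypothetical sketch of the appearance-gated trajectory update.
# Track, the cosine-similarity test, and the threshold are
# illustrative assumptions, not the paper's exact formulation.

class Track:
    def __init__(self, box, appearance, raw):
        self.boxes = [box]            # trajectory of bounding boxes
        self.appearance = appearance  # last confirmed appearance embedding
        self.raw = raw                # raw transformer-decoder embedding

def update_track(track, box_out, app_out, raw_out, threshold=0.5):
    """Extend the trajectory if the appearance output at frame k matches
    the track's known appearance (frame k-1, or an earlier one retrieved
    from memory when re-identifying after an occlusion)."""
    if F.cosine_similarity(app_out, track.appearance, dim=0) > threshold:
        track.boxes.append(box_out)   # location output joins the trajectory
        track.appearance = app_out    # carried to the next frame as queries
        track.raw = raw_out
        return True
    return False                      # no match: leave the track unchanged
```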
At a high level, this tracking process is performed by a transformer [48]. The current
image is processed by a convolutional neural network and fed into a transformer-
encoder, whereas all semantic queries from the previous image are fed into the
transformer-decoder module – see Figure 2. This differs from traditional tracking-
by-detection approaches (e.g. Bergmann et al. [5]), where detection is separate
from data-association and where each step commonly uses separate embeddings.
Our method merges these two steps into one, and the initial object embedding is
disentangled within the transformer-decoder directly into embeddings used for
detection and data-association (reID).
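A minimal sketch of this data flow, assuming a single convolution in place of the CNN backbone and standard PyTorch transformer layers; the layer sizes and the two output heads are illustrative stand-ins, not the paper's settings (see the supplementary material for those).

```python
import torch.nn as nn

class TrackingTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # stand-in for the convolutional backbone
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # the decoder output is disentangled into detection and reID embeddings
        self.box_head = nn.Linear(d_model, 4)         # bounding-box regression
        self.reid_head = nn.Linear(d_model, d_model)  # appearance embedding

    def forward(self, image, queries):
        # image: (B, 3, H, W); queries: (B, num_objects, d_model),
        # carried over from the previous frame
        features = self.backbone(image).flatten(2).transpose(1, 2)
        memory = self.encoder(features)
        raw = self.decoder(queries, memory)           # raw decoder embeddings
        return self.box_head(raw), self.reid_head(raw), raw
```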
The rest of this section outlines the main parts of the model and their application
to tracking. For further details on the architecture, hyperparameters, and
implementation, please refer to the supplementary material.
3.1 Transformer-decoder queries
The key insight of our work is that the queries passed to the decoder
part of a transformer can be customised for the tracking task. For example, if we
know the approximate bounding box of an object from a previous frame, then
a query can be formed from this bounding box and used to search for the new
position of the object in its vicinity (in a similar manner to the RoI pooling
module of a traditional two-stage detector, which extracts the image embedding
corresponding to an input bounding box that is then used for bounding-box
regression and classification).
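One plausible way to realise such a location query, sketched below, is to embed the normalised box coordinates with a small MLP; the MLP and its sizes are assumptions for illustration, not the paper's construction.

```python
import torch
import torch.nn as nn

# Assumed embedding of a previous-frame box into a decoder query.
box_to_query = nn.Sequential(
    nn.Linear(4, 256),
    nn.ReLU(),
    nn.Linear(256, 256),  # d_model of the transformer
)

prev_box = torch.tensor([0.4, 0.5, 0.1, 0.2])  # cx, cy, w, h in [0, 1]
location_query = box_to_query(prev_box)        # query searching near this box
```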
What if we wanted to update the appearance, or find the location of an
object defined by its appearance? We simply extend this approach by having a
query encode the object's appearance.
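Continuing the sketch above, an appearance query could reuse the appearance embedding produced at the previous frame, here passed through an assumed learned projection into the decoder's query space:

```python
import torch
import torch.nn as nn

appearance_proj = nn.Linear(256, 256)  # assumed projection, d_model = 256
prev_appearance = torch.randn(256)     # appearance output from frame k-1
appearance_query = appearance_proj(prev_appearance)
```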
These semantic queries are used in an auto-regressive manner for tracking, in
that the output of the decoder of one frame is used as the query input for the
subsequent frame. We also include another type of query that is not used auto-
regressively, but instead is applied independently on each frame. This second