In this paper, we address both these issues by training
networks to detect flows of people across images. Our
model directly leverages temporal consistency using self-
supervision across video frames. Furthermore, it can fuse
information from different cameras while retaining spatial
consistency for human position from different viewpoints.
As a result, we outperform state-of-the-art multi-view tech-
niques [24, 25, 12] on challenging datasets.
2. Related work
Early work on tracking objects in video sequences relies on
model evolution techniques, which focus on tracking a single
object using gating and Kalman filtering [44]. Because of
their recursive nature, these techniques are prone to errors
such as drift, which are difficult to recover from. They have
therefore largely been replaced by tracking-by-detection
techniques, which have proven effective for people tracking.
In this section, we first briefly review previous work on
tracking-by-detection and then discuss previous work on
modeling human motion.
2.1. Tracking-by-detection
Tracking-by-detection [2] aims to track objects in video
sequences by optimizing a global objective function over
many frames given frame-wise object detection information.
Such methods rely on Conditional Random Fields [33, 62, 43],
Belief Propagation [63, 13], Dynamic or Linear Programming
[5, 51], or Network Flow Programming [1, 15]. Some of these
algorithms use a graph formulation whose nodes are either all
the spatial locations where an object can be present
[18, 9, 8] or only those where a detector has fired
[27, 55, 52, 6].
Among these graph-based approaches, the K-Shortest Paths
(KSP) algorithm [9] works on the graph of all potential
locations over all time instants and finds the ground-plane
trajectories that yield the overall minimum cost. This
optimality comes at the cost of several strong assumptions
about human motion; in particular, it treats all motion
directions as equiprobable. Similar to the KSP algorithm, the
Successive Shortest Paths (SSP) approach [48] links
detections using sequential dynamic programming. [36] extends
this SSP approach with bounded memory and computation, which
enables tracking in even longer sequences. The memory
consumption is further reduced in [58] by exploiting the
special structure and properties of the graphs that arise in
multiple object tracking problems. More recent work [59]
proposes to learn a deep association metric on a large-scale
person re-identification dataset, which enables reliable
people tracking in long video sequences.
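
The sequential shortest-path linking described above can be
sketched as follows. This is a minimal illustration under
stated assumptions, not the algorithm of [48] or [36]: the
function name, the cost values, and the distance gate are
hypothetical, and the sketch discards used detections instead
of updating a residual graph, so it does not guarantee a
globally optimal solution.

```python
from math import hypot


def greedy_shortest_path_tracks(detections, det_cost=-1.0, entry_cost=0.6,
                                exit_cost=0.6, move_weight=0.1, max_dist=2.0):
    """Greedily extract tracks as negative-cost shortest paths.

    detections: list of frames, each a list of (x, y) ground positions.
    Returns a list of tracks, each a list of (frame_index, detection_index).
    All cost parameters are illustrative assumptions.
    """
    used = set()
    tracks = []
    while True:
        # best[(t, i)] = (cost of cheapest path ending at this detection, predecessor)
        best = {}
        for t, frame in enumerate(detections):
            for i, (x, y) in enumerate(frame):
                if (t, i) in used:
                    continue
                # Option 1: start a new track at this detection.
                cost, prev = entry_cost + det_cost, None
                # Option 2: extend a path ending in the previous frame.
                if t > 0:
                    for j, (px, py) in enumerate(detections[t - 1]):
                        dist = hypot(x - px, y - py)
                        if (t - 1, j) in best and dist <= max_dist:
                            cand = best[(t - 1, j)][0] + move_weight * dist + det_cost
                            if cand < cost:
                                cost, prev = cand, (t - 1, j)
                best[(t, i)] = (cost, prev)
        if not best:
            break
        end = min(best, key=lambda node: best[node][0])
        # Stop once the cheapest remaining path no longer lowers the total cost.
        if best[end][0] + exit_cost >= 0:
            break
        track, node = [], end
        while node is not None:
            track.append(node)
            used.add(node)
            node = best[node][1]
        tracks.append(track[::-1])
    return tracks
```

With these assumed costs, a detection that cannot be linked to
any neighbor has a positive total cost and is dropped, whereas
a chain of nearby detections accumulates negative cost and is
instantiated as a track.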
Occlusion makes it extremely challenging to achieve reliable
object tracking in long sequences. Some algorithms address
this by leveraging multiple viewpoints. Some approaches first
detect people in single images before reprojecting the
detections into a common reference frame and matching them
[60, 18]. [4] proposes to directly combine view aggregation
and prediction with a joint CNN/CRF. More recently, [25]
proposed to use spatial transformer networks [26] to project
feature representations onto the ground plane, resulting in
an end-to-end trainable multi-view detection model. [53]
proposed to combine multiple views using an approximation of
a 3D world coordinate system by projecting features onto
planes at different height levels. Finally, [24] proposed to
use multi-view data augmentation combined with a transformer
architecture to fuse ground-plane features from multiple
points of view, and obtains state-of-the-art results for
multiple object detection on the WILDTRACK dataset [12].
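
Reprojecting single-view detections into a common reference
frame can be illustrated with a planar homography. The sketch
below assumes that each camera comes with a known 3x3
image-to-ground homography and that each detection is
summarized by one foot point; the function names and the
matching radius are hypothetical and do not reproduce the
pipelines of [60, 18].

```python
from math import hypot


def project_to_ground(H, point):
    """Map an image point (u, v) to ground-plane coordinates with a 3x3 homography H."""
    u, v = point
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under H")
    return (x / w, y / w)


def match_across_views(feet_a, H_a, feet_b, H_b, max_dist=0.5):
    """Greedily pair detections from two cameras that land close together on the ground.

    max_dist is an assumed matching radius in ground-plane units.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    ground_a = [project_to_ground(H_a, p) for p in feet_a]
    ground_b = [project_to_ground(H_b, p) for p in feet_b]
    pairs, taken = [], set()
    for i, (xa, ya) in enumerate(ground_a):
        best_j, best_d = None, max_dist
        for j, (xb, yb) in enumerate(ground_b):
            dist = hypot(xa - xb, ya - yb)
            if j not in taken and dist <= best_d:
                best_j, best_d = j, dist
        if best_j is not None:
            taken.add(best_j)
            pairs.append((i, best_j))
    return pairs
```

Matching on the ground plane rather than in image space is what
keeps positions comparable across viewpoints. The greedy
nearest-neighbor pairing is only the simplest possible choice.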
2.2. Modeling human motion
Modeling human motion as flow when tracking people has been
a concern since long before the advent of deep learning
[47, 57, 11, 37, 14, 34, 20, 10, 45, 42, 3, 9]. For example,
in [9], people tracking is formulated as multi-target
tracking on a grid, which gives rise to a linear program that
can be solved efficiently using the K-Shortest Paths
algorithm [54]. The key to this formulation is to optimize
the people flows from one grid location to another, instead
of the actual number of people in each grid location. In
[48], a people conservation constraint is enforced and the
global solution is found by a greedy algorithm that
sequentially instantiates tracks using shortest path
computations on a flow network [65]. Such people conservation
constraints have since been combined with additional ones to
further boost performance. These include appearance
constraints [7, 16, 8] to prevent identity switches,
spatiotemporal constraints to force the trajectories of
different objects to be disjoint [22], and higher-order
constraints [11, 14]. More recent work extends this flow
formulation with deep learning [38, 39] to model crowds as
people flows, which contributes to reliable people counting
even in dense regions. However, none of these methods
leverages the people flow formulation to address tracking
problems with deep neural networks. These kinds of flow
constraints have therefore never been used in a deep people
tracking context.
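
To make the notion of people conservation concrete, the sketch
below represents the flow between two consecutive time steps
as a matrix whose entry (i, j) is the number of people moving
from grid location i to grid location j. It is an illustrative
toy under simplifying assumptions, not the formulation of [9]
or [38, 39]: the function names are hypothetical, any location
may send people to any other, and nobody enters or leaves the
scene.

```python
def counts_from_flows(flow):
    """Recover per-location counts from a flow matrix.

    flow[i][j] = number of people moving from location i at time t
    to location j at time t + 1.
    Returns (counts at time t, counts at time t + 1).
    """
    n = len(flow)
    leaving = [sum(flow[i][j] for j in range(n)) for i in range(n)]
    arriving = [sum(flow[i][j] for i in range(n)) for j in range(n)]
    return leaving, arriving


def is_conserved(flow_prev, flow_next, tol=1e-6):
    """Check that people arriving at each location at time t also leave it.

    flow_prev covers t - 1 -> t and flow_next covers t -> t + 1.
    Staying in place counts as a flow from a location to itself.
    """
    _, arriving = counts_from_flows(flow_prev)
    leaving, _ = counts_from_flows(flow_next)
    return all(abs(a - l) <= tol for a, l in zip(arriving, leaving))


# Three locations on a line, three people drifting to the right.
flow_01 = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
flow_12 = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
conserved = is_conserved(flow_01, flow_12)
```

Counts follow from the flows by summation, while the converse
does not hold, which is why optimizing flows rather than
counts lets the conservation constraint be expressed directly.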
3. Approach
Most recent approaches rely on the tracking-by-detection
paradigm. In its simplest form, the detection step is
disconnected from the association step. In this section, we
propose a novel method that brings these two steps closer
together. First, we introduce a detection network that
predicts people flow in a weakly supervised manner. Then, we
show how we modify existing association algorithms to
leverage the predicted flows and generate unambiguous tracks.