Multi-view Tracking Using Weakly Supervised Human Motion Prediction
Martin Engilberge
EPFL, Lausanne, Switzerland
martin.engilberge@epfl.ch
Weizhe Liu
Tencent XR Vision Labs
weizheliu@tencent.com
Pascal Fua
EPFL, Lausanne, Switzerland
pascal.fua@epfl.ch
Abstract
Multi-view approaches to people-tracking have the po-
tential to better handle occlusions than single-view ones
in crowded scenes. They often rely on the tracking-by-
detection paradigm, which involves detecting people first
and then connecting the detections. In this paper, we argue
that an even more effective approach is to predict people's
motion over time and to infer their presence in individual
frames from these predictions. This makes it possible to enforce
consistency both over time and across views of a single temporal frame. We
validate our approach on the PETS2009 and WILDTRACK
datasets and demonstrate that it outperforms state-of-the-
art methods.
1. Introduction
When it comes to tracking multiple people, tracking-
by-detection [2] has become a standard paradigm and has
proven effective for many applications such as surveillance
or sports player tracking. It involves first detecting the tar-
get objects in individual frames, associating these detec-
tions into short but reliable trajectories known as track-
lets, and then concatenating them into longer trajecto-
ries [40, 23, 29, 64, 31, 41, 50, 46, 30, 56, 19]. The grouping
of detections into full trajectories can also be formulated as
the search for multiple min-cost paths on a graph [9, 58].
More recently, tracking-by-regression [66, 61] has been ad-
vocated as a potential alternative. It readily enables tracking
while being end-to-end differentiable, unlike the detection-
based approaches.
However, these single-view tracking techniques can be
derailed by occlusions and are bound to fragment tracks
when detections are missed. Using multiple cameras is one
way to address this problem, especially in locations such as
sports arenas where such a setup can be installed once and
for all [8, 60, 12]. This can be highly effective but can still
fail when occlusions become severe. This is in part because
detection algorithms typically operate on single frames and
Project code at https://github.com/cvlab-epfl/MVFlow
Figure 1: Predicting human motion. Our model learns to de-
tect people by predicting human flows. It generates the proba-
bilities that a person moves from one location to one of its eight
neighbors or itself, depicted by the yellow grid in the top image.
The white triangles depict detections in the ground plane, while the
green ones denote the predicted locations at the next time step. The
bottom image is the top-view re-projection of the top one; the blue
arrows illustrate the motion predicted by our model.
On both images, the region of interest is overlaid in yellow. People
outside of that region are ignored.
fail to exploit the fact that we have videos that exhibit time
consistency. In other words, if someone is detected in one
frame, chances are they should be found at a neighboring
location in the next frame, as depicted by Fig. 1. Furthermore,
even though people's motion and scale are consistent
across views, that consistency is rarely enforced when fusing
results from different views.
arXiv:2210.10771v1 [cs.CV] 19 Oct 2022
In this paper, we address both these issues by training
networks to detect flows of people across images. Our
model directly leverages temporal consistency using self-
supervision across video frames. Furthermore, it can fuse
information from different cameras while retaining spatial
consistency for human position from different viewpoints.
As a result, we outperform state-of-the-art multi-view tech-
niques [24, 25, 12] on challenging datasets.
2. Related works
Early work on tracking objects in video sequences relied
on model-evolution techniques, which focus on tracking a
single object using gating and Kalman filtering [44]. Because
of their recursive nature, they are prone to errors such
as drift, which are difficult to recover from. These methods
have therefore largely been replaced by tracking-by-detection
techniques, which have proven effective at addressing people-tracking
problems. In this section, we first briefly introduce
previous work on tracking-by-detection and then discuss
previous work on modeling human motion.
2.1. Tracking-by-Detection.
Tracking-by-detection [2] aims to track objects in video
sequences by optimizing a global objective function over
many frames, given frame-wise object detections. These
methods rely on Conditional Random Fields [33, 62, 43],
Belief Propagation [63, 13], Dynamic or Linear Programming [5, 51],
or Network Flow Programming [1, 15]. Some
of these algorithms follow the graph formulation with nodes
as either all the spatial locations where an object can
be present [18, 9, 8] or only those where a detector has
fired [27, 55, 52, 6].
Among these graph-based approaches, the K-Shortest
Paths (KSP) algorithm [9] works on the graph of all po-
tential locations over all time instants, and finds the ground-
plane trajectories that yield the overall minimum cost. This
optimality is achieved at the cost of multiple strong assumptions
about human motion; in particular, it treats all motion
directions as equiprobable. Similar to the KSP algorithm, the
Successive Shortest Paths (SSP) approach [48] links detec-
tions using sequential dynamic programming. [36] extends
this SSP approach with bounded memory and computation
which enables tracking in even longer sequences. The mem-
ory consumption is further reduced in [58] by exploiting the
special structures and properties of the graphs formulated in
multiple objects tracking problems. More recent work [59]
proposes to learn a deep association metric on a large-scale
person re-identification dataset, which enables reliable people
tracking in long video sequences.
Occlusion makes it extremely challenging to achieve reliable
object tracking in long sequences. Some algorithms
address this by leveraging multiple viewpoints: some approaches
first detect people in single images before reprojecting
and matching the detections into a common reference
frame [60, 18]. [4] proposes to directly combine view aggregation
and prediction in a joint CNN/CRF. More recently,
[25] proposed to use spatial transformer networks [26] to
project feature representation in the ground plane resulting
in an end-to-end trainable multi-view detection model. [53]
proposed to combine multiple views using an approximation
of a 3D world coordinate system, projecting features
onto planes at different height levels. Finally, [24] proposed to
use multi-view data augmentation combined with a transformer
architecture to fuse ground-plane features from multiple
points of view, obtaining state-of-the-art results for
multiple-object detection on the WILDTRACK dataset [12].
2.2. Modeling human motion
Modeling human motion as flow when tracking people
has been a concern long before the advent of deep learn-
ing [47, 57, 11, 37, 14, 34, 20, 10, 45, 42, 3, 9]. For ex-
ample, in [9], people tracking is formulated as multi-target
tracking on a grid and gives rise to a linear program that
can be solved efficiently using the K-Shortest Path algo-
rithm [54]. The key to this formulation is to optimize the
people flows from one grid location to another, instead of
the actual number of people in each grid location. In [48],
a people conservation constraint is enforced and the global
solution is found by a greedy algorithm that sequentially in-
stantiates tracks using shortest path computations on a flow
network [65]. Such people conservation constraints have
since been combined with additional ones to further boost
performance. They include appearance constraints [7, 16, 8]
to prevent identity switches, spatiotemporal constraints to
force the trajectories of different objects to be disjoint [22],
and higher-order constraints [11, 14]. More recent work
extends this flow formulation with deep learning [38, 39],
modeling crowds as people flows, which enables reliable
people counting even in dense regions. However,
none of these methods leverage such a people-flow formulation
to address tracking problems with deep neural networks.
These kinds of flow constraints have therefore never
been used in a deep people-tracking context.
3. Approach
Most recent approaches rely on the tracking-by-detection
paradigm. In its simplest form, the detection step is disconnected
from the association step. In this section, we propose
a novel method to bring those two steps closer together. First, we
introduce a detection network predicting people flow in a
weakly supervised manner. Then we show how we modify
existing association algorithms to leverage predicted flows
to generate unambiguous tracks.
Figure 2: Grid flow representation. For each location i,
we predict the probability that a person moves from i
to one of its eight neighbors, or stays in i, at the next time step.
The detection probability at a given location at time t can be
computed by summing either the nine outgoing flows or the nine
flows reaching that location from t-1.
3.1. Formalism
Let us consider a multi-view video sequence $S = \{I^1, I^2, \ldots, I^{T-1}, I^T\}$ consisting of $T$ time steps. Each time step $I^t = \{I^t_1, \ldots, I^t_V\}$ consists of a set of synchronized frames taken by $V$ cameras with overlapping fields of view. For each camera, the calibration $C_v$ is known and contains both intrinsic and extrinsic parameters. Each frame $I^t_v \in (0, 255)^{W \times H \times 3}$ is a color image with spatial size $(W, H)$.

To combine multiple views, we choose to work in the common ground plane. For each frame we define $G^t_v = P(I^t_v, C_v)$ as the projection of frame $I^t_v$ onto the ground plane using the projection function $P$, producing $G^t_v \in (0, 255)^{w \times h \times 3}$ with $(w, h)$ the spatial size of the ground-plane image.

Finally, we adopt a grid-world formalism similar to previous work [9]. At each time step $t$ we discretize the physical ground plane to form a grid of $w \times h$ cells, giving us a scene representation of dimensionality $w \times h \times T$ for a full sequence.
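As a concrete illustration, the discretization above can be sketched as follows; the ground-plane extent and grid resolution used here are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the grid-world discretization. The extent of the ground
# plane (in metres) and the grid resolution (w, h) are toy values.
W_CELLS, H_CELLS = 32, 24          # grid resolution (w, h)
X_MIN, X_MAX = 0.0, 16.0           # ground-plane extent in metres
Y_MIN, Y_MAX = 0.0, 12.0

def world_to_cell(x, y):
    """Map a continuous ground-plane point (x, y) to a discrete cell index."""
    i = int((x - X_MIN) / (X_MAX - X_MIN) * W_CELLS)
    j = int((y - Y_MIN) / (Y_MAX - Y_MIN) * H_CELLS)
    # Clamp points on the far border into the last cell.
    return min(i, W_CELLS - 1), min(j, H_CELLS - 1)

# A full sequence is then a (w, h, T) occupancy volume.
T = 5
occupancy = np.zeros((W_CELLS, H_CELLS, T))
occupancy[world_to_cell(8.0, 6.0) + (0,)] = 1.0  # a person at t = 0
```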
3.2. People Flow
Given a pair of consecutive multi-view time steps, we define the human flow $f^{t,t+1}$ as follows: for a given location $i$, the flow $f^{t,t+1}_{i,j}$ is the probability that a person in cell $i$ at time $t$ moves to location $j$ at time $t+1$, where $j \in N(i)$ is a neighbor of $i$. Concretely, for each cell in the ground plane we represent people flow by a 9-dimensional vector of probabilities (one dimension per neighbor of that cell). The grid representation and the definition of neighborhood are illustrated in Fig. 2.
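A minimal sketch of this 9-neighbor grid representation, assuming a row-major ordering of the nine offsets (the paper does not specify one):

```python
import numpy as np

# The 9 flow channels per cell correspond to the 8 neighbours plus the
# cell itself (offset (0, 0)); the ordering below is an assumption.
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]  # 9 moves

def neighbors(i, j, w, h):
    """Cells reachable from (i, j) in one time step, staying on the grid."""
    return [(i + di, j + dj) for di, dj in OFFSETS
            if 0 <= i + di < w and 0 <= j + dj < h]

# Flow tensor: flow[i, j, k] = probability that a person in cell (i, j)
# at time t moves along OFFSETS[k] between t and t+1.
w, h = 4, 3
flow = np.zeros((w, h, len(OFFSETS)))
```

Corner and border cells simply have fewer valid neighbors, which is why `neighbors` filters moves that leave the grid.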
To accurately model human motion, the flow needs to respect three constraints.

First, a people-conservation constraint: if a person is present at time $t$, they should be present at time $t+1$ in the same location or in a neighboring one. In other words, if we consider three time steps $I^{t-1}$, $I^t$, and $I^{t+1}$, the sum of the incoming flows into cell $j$ between times $t-1$ and $t$ should be equal to the sum of the outgoing flows between times $t$ and $t+1$. More formally, it reads:

$$\sum_{i \in N(j)} f^{t-1,t}_{i,j} \;=\; x^t_j \;=\; \sum_{k \in N(j)} f^{t,t+1}_{j,k}. \qquad (1)$$

Both sums are equal to $x^t_j$, the probability that there is a person in cell $j$ at time $t$.
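This conservation constraint can be checked numerically on dense flow tensors of shape (w, h, 9); the channel ordering below is an assumption for illustration:

```python
import numpy as np

# Channel k of a flow tensor encodes the move OFFSETS[k] (ordering assumed).
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def incoming(flow, i, j):
    """Sum of flows arriving in cell (i, j), i.e. the left-hand side of Eq. 1."""
    w, h, _ = flow.shape
    total = 0.0
    for k, (di, dj) in enumerate(OFFSETS):
        ni, nj = i - di, j - dj  # source cell whose move (di, dj) lands on (i, j)
        if 0 <= ni < w and 0 <= nj < h:
            total += flow[ni, nj, k]
    return total

def outgoing(flow, i, j):
    """Sum of flows leaving cell (i, j), i.e. the right-hand side of Eq. 1."""
    return flow[i, j].sum()

# Eq. 1 states: incoming(f_prev, i, j) == outgoing(f_next, i, j) == x^t_{ij}.
```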
Second, a non-overlapping constraint: at any time, there should be at most one person in every cell:

$$\forall k, t, \quad \sum_{j \in N(k)} f^{t,t+1}_{k,j} \leq 1. \qquad (2)$$
Finally, a temporal consistency constraint: if we reverse a sequence, the flow should remain the same, with the flow direction flipped:

$$f^{t-1,t}_{i,j} = f^{t,t-1}_{j,i}. \qquad (3)$$
Reconstructing detections from human flow is trivial (Eq. (1)) and has a unique solution; on the other hand, generating flow from detections can have multiple solutions. We therefore introduce the Multi-View FlowNet (MVFlow), trained to generate human flow. By predicting flow instead of detections, our model is able to take advantage of this asymmetric mapping between flow and detections. It learns to predict flow in a weakly supervised manner, using only detection annotations. Enforcing the flow constraints of Eq. (1), Eq. (2) and Eq. (3) also serves as a regularization for the final detections: predictions are temporally consistent and represent natural human motion.
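To make the constraints concrete, here is a sketch of Eq. (1)'s reconstruction together with soft penalties one could derive from Eq. (2) and Eq. (3); the channel ordering and the penalty forms are illustrative assumptions, not the paper's exact training losses:

```python
import numpy as np

# Channel k of a (w, h, 9) flow tensor encodes the move OFFSETS[k] (assumed).
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def detections_from_flow(flow):
    """Unique reconstruction of x^t from flow (Eq. 1): sum of outgoing flows."""
    return flow.sum(axis=-1)                      # (w, h, 9) -> (w, h)

def non_overlap_penalty(flow):
    """Eq. 2: penalise cells whose total outgoing flow exceeds 1."""
    return np.maximum(detections_from_flow(flow) - 1.0, 0.0).sum()

def temporal_consistency_penalty(flow_fwd, flow_bwd):
    """Eq. 3: the flow of the reversed sequence must mirror the forward
    flow, i.e. f^{t,t+1}_{i,j} = f^{t+1,t}_{j,i}."""
    w, h, _ = flow_fwd.shape
    penalty = 0.0
    for i in range(w):
        for j in range(h):
            for k, (di, dj) in enumerate(OFFSETS):
                ni, nj = i + di, j + dj
                if 0 <= ni < w and 0 <= nj < h:
                    # The reversed move (-di, -dj) is channel 8 - k under
                    # the row-major OFFSETS ordering above.
                    penalty += abs(flow_fwd[i, j, k] - flow_bwd[ni, nj, 8 - k])
    return penalty
```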
3.3. Multi-View architecture
In this section we detail the architecture of MVFlow, our
multi-view detection model.
The proposed model consists of 5 steps and takes as input a pair of multi-view frames. Each frame is processed by a ResNet. The resulting features are projected onto the ground plane. Ground features from the same point of view at times $t$ and $t+1$ are aggregated. Afterwards, the spatial aggregation module combines the features from the different points of view into human flow. Detection predictions are reconstructed from the flow for both time steps. The 5 steps are illustrated in Fig. 3. More formally, the model is defined as follows:
$$\binom{I^t}{I^{t+1}} \stackrel{g_{\theta_0}}{\longmapsto} \binom{F^t}{F^{t+1}} \stackrel{C}{\longmapsto} \binom{G^t}{G^{t+1}} \stackrel{c_{\theta_1}}{\longmapsto} G^{t,t+1} \stackrel{s_{\theta_2}}{\longmapsto} f^{t,t+1} \stackrel{\text{rec.}}{\longmapsto} \binom{x^t}{x^{t+1}}, \qquad (4)$$
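A shape-level sketch of this five-step pipeline, with toy stand-ins for the learned modules (the real model uses a ResNet backbone and learned aggregation networks; only the data flow between steps here follows Eq. (4), all module bodies and sizes are assumptions):

```python
import numpy as np

V, W, H = 2, 8, 8        # cameras and image size (toy values)
w, h = 4, 4              # ground-plane grid size (toy values)

def g_theta0(frames):    # per-frame feature extractor (a ResNet in the paper)
    return frames.mean(axis=-1, keepdims=True)       # (V, W, H, 1) toy features

def project(feats):      # projection using calibration C
    return feats[:, :w, :h, :]                       # stand-in: crop, not a homography

def c_theta1(g_t, g_t1):  # temporal aggregation of the two time steps, per view
    return np.concatenate([g_t, g_t1], axis=-1)      # (V, w, h, 2)

def s_theta2(g_pair):    # spatial aggregation across views -> flow probabilities
    fused = g_pair.mean(axis=0)                      # (w, h, 2)
    logits = np.repeat(fused.mean(axis=-1, keepdims=True), 9, axis=-1)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)         # (w, h, 9), sums to 1 per cell

def reconstruct(flow):   # detections = sum of the nine outgoing flows (Eq. 1)
    return flow.sum(axis=-1)                         # (w, h)

I_t  = np.random.rand(V, W, H, 3)
I_t1 = np.random.rand(V, W, H, 3)
flow = s_theta2(c_theta1(project(g_theta0(I_t)), project(g_theta0(I_t1))))
x_t  = reconstruct(flow)
```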