Multi-view Tracking Using Weakly Supervised Human Motion Prediction
Martin Engilberge
EPFL, Lausanne, Switzerland
martin.engilberge@epfl.ch
Weizhe Liu
Tencent XR Vision Labs
weizheliu@tencent.com
Pascal Fua
EPFL, Lausanne, Switzerland
pascal.fua@epfl.ch
Abstract
Multi-view approaches to people-tracking have the po-
tential to better handle occlusions than single-view ones
in crowded scenes. They often rely on the tracking-by-
detection paradigm, which involves detecting people first
and then connecting the detections. In this paper, we argue
that an even more effective approach is to predict people's
motion over time and to infer their presence in individual
frames from these predictions. This makes it possible to enforce
consistency both over time and across views of a single temporal frame. We
validate our approach on the PETS2009 and WILDTRACK
datasets and demonstrate that it outperforms state-of-the-
art methods.
1. Introduction
When it comes to tracking multiple people, tracking-
by-detection [2] has become a standard paradigm and has
proven effective for many applications such as surveillance
or sports player tracking. It involves first detecting the tar-
get objects in individual frames, associating these detec-
tions into short but reliable trajectories known as track-
lets, and then concatenating them into longer trajecto-
ries [40, 23, 29, 64, 31, 41, 50, 46, 30, 56, 19]. The grouping
of detections into full trajectories can also be formulated as
the search for multiple min-cost paths on a graph [9, 58].
More recently, tracking-by-regression [66, 61] has been ad-
vocated as a potential alternative. It readily enables tracking
while being end-to-end differentiable, unlike the detection-
based approaches.
However, these single-view tracking techniques can be
derailed by occlusions and are bound to fragment tracks
when detections are missed. Using multiple cameras is one
way to address this problem, especially in locations such as
sports arenas where such a setup can be installed once and
for all [8, 60, 12]. This can be highly effective but can still
fail when occlusions become severe. This is in part because
detection algorithms typically operate on single frames and
Project code at https://github.com/cvlab-epfl/MVFlow
Figure 1: Predicting human motion. Our model learns to de-
tect people by predicting human flows. It generates the proba-
bilities that a person moves from one location to one of its eight
neighbors or itself, depicted by the yellow grid in the top image.
The white triangles depict detections in the ground plane, while the
green ones denote the predicted locations at the next time step. The
bottom image is the top-view re-projection of the top one; the blue
arrows illustrate the motion predicted by our model.
On both images, the region of interest is overlaid in yellow. People
outside of that region are ignored.
fail to exploit the fact that we have videos that exhibit time
consistency. In other words, if someone is detected in one
frame, chances are they should be found at a neighboring
location in the next frame, as depicted by Fig. 1. Furthermore,
even though people's motion and scale are consistent
across views, that consistency is rarely enforced when fusing
results from different views.
arXiv:2210.10771v1 [cs.CV] 19 Oct 2022
In this paper, we address both these issues by training
networks to detect flows of people across images. Our
model directly leverages temporal consistency using self-
supervision across video frames. Furthermore, it can fuse
information from different cameras while retaining spatial
consistency for human position from different viewpoints.
As a result, we outperform state-of-the-art multi-view tech-
niques [24, 25, 12] on challenging datasets.
2. Related works
Early work on tracking objects in video sequences relied
on model-evolution techniques, which focus on tracking a
single object using gating and Kalman filtering [44]. Because
of their recursive nature, they are prone to errors such
as drift, which are difficult to recover from. These methods
have therefore largely been replaced by tracking-by-detection
techniques, which have proven effective at addressing people-tracking
problems. In this section, we first briefly introduce
previous work on tracking-by-detection and then discuss
previous work on modeling human motion.
2.1. Tracking-by-Detection.
Tracking-by-detection [2] aims to track objects in video
sequences by optimizing a global objective function over
many frames, given frame-wise object detections. These
methods rely on Conditional Random Fields [33, 62, 43],
Belief Propagation [63, 13], Dynamic or Linear Programming [5, 51],
or Network Flow Programming [1, 15]. Some
of these algorithms follow the graph formulation with nodes
as either all the spatial locations where an object can
be present [18, 9, 8] or only those where a detector has
fired [27, 55, 52, 6].
Among these graph-based approaches, the K-Shortest
Paths (KSP) algorithm [9] works on the graph of all po-
tential locations over all time instants, and finds the ground-
plane trajectories that yield the overall minimum cost. This
optimality is achieved at the cost of multiple strong assumptions
about human motion; in particular, it treats all motion
directions as equiprobable. Similar to the KSP algorithm, the
Successive Shortest Paths (SSP) approach [48] links detec-
tions using sequential dynamic programming. [36] extends
this SSP approach with bounded memory and computation
which enables tracking in even longer sequences. The mem-
ory consumption is further reduced in [58] by exploiting the
special structures and properties of the graphs formulated in
multiple objects tracking problems. More recent work [59]
proposes to learn a deep association metric on a large-scale
person re-identification dataset, which enables reliable people
tracking in long video sequences.
Occlusion makes it extremely challenging to achieve reliable
object tracking in long sequences. Some algorithms
address this by leveraging multiple viewpoints: some approaches
first detect people in single images before reprojecting
and matching the detections into a common reference
frame [60, 18]. [4] proposes to directly combine view aggregation
and prediction in a joint CNN/CRF. More recently,
[25] proposed to use spatial transformer networks [26] to
project feature representation in the ground plane resulting
in an end-to-end trainable multi-view detection model. [53]
proposed to combine multiple views using an approximation
of a 3D world coordinate system, projecting features
onto planes at different height levels. Finally, [24] proposed to
use multi-view data augmentation combined with a transformer
architecture to fuse ground-plane features from multiple
points of view, obtaining state-of-the-art results for
multiple-object detection on the WILDTRACK dataset [12].
2.2. Modeling human motion
Modeling human motion as flow when tracking people
has been a concern long before the advent of deep learn-
ing [47, 57, 11, 37, 14, 34, 20, 10, 45, 42, 3, 9]. For ex-
ample, in [9], people tracking is formulated as multi-target
tracking on a grid and gives rise to a linear program that
can be solved efficiently using the K-Shortest Path algo-
rithm [54]. The key to this formulation is to optimize the
people flows from one grid location to another, instead of
the actual number of people in each grid location. In [48],
a people conservation constraint is enforced and the global
solution is found by a greedy algorithm that sequentially in-
stantiates tracks using shortest path computations on a flow
network [65]. Such people conservation constraints have
since been combined with additional ones to further boost
performance. They include appearance constraints [7, 16, 8]
to prevent identity switches, spatiotemporal constraints to
force the trajectories of different objects to be disjoint [22],
and higher-order constraints [11, 14]. More recent work
extends this flow formulation with deep learning [38, 39],
modeling crowds as people flows, which enables reliable
people counting even in dense regions. However,
none of these methods leverage such a people-flow formulation
to address tracking problems with deep neural networks.
These kinds of flow constraints have therefore never
been used in a deep people-tracking context.
3. Approach
Most recent approaches rely on the tracking-by-detection
paradigm. In its simplest form, the detection step is disconnected
from the association step. In this section, we propose
a novel method to bring those two steps closer together. First, we
introduce a detection network predicting people flow in a
weakly supervised manner. Then we show how we modify
existing association algorithms to leverage predicted flows
to generate unambiguous tracks.
Figure 2: Grid flow representation. For each location i,
we predict the probability that a person moves from i
to one of its eight neighbors, or stays in i, at the next time step.
The detection probability at a given location at time t can be
computed by summing either the nine outgoing flows or the nine
flows reaching that location from t-1.
3.1. Formalism
Let us consider a multi-view video sequence $S = \{I^1, I^2, \ldots, I^{T-1}, I^T\}$ consisting of $T$ time steps. Each time step $I^t = \{I^t_1, \ldots, I^t_V\}$ consists of a set of synchronized frames taken by $V$ cameras with overlapping fields of view. For each camera, the calibration $C_v$ is known and contains both intrinsic and extrinsic parameters. Each frame $I^t_v \in (0, 255)^{W \times H \times 3}$ is a color image with spatial size $(W, H)$.

To combine multiple views, we choose to work in the common ground plane. For each frame we define $G^t_v = P(I^t_v, C_v)$ as the projection of frame $I^t_v$ onto the ground plane using the projection function $P$, producing $G^t_v \in (0, 255)^{w \times h \times 3}$ with $(w, h)$ the spatial size of the ground-plane image.

Finally, we adopt a grid-world formalism similar to previous work [9]. At each time step $t$ we discretize the physical ground plane to form a grid of $w \times h$ cells, giving us a scene representation of dimensionality $w \times h \times T$ for a full sequence.
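As a concrete illustration, the discretization above can be sketched as follows; the ground-plane extent and grid resolution used here are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Sketch of the grid-world discretization. The extent of the ground
# plane (in metres) and the grid resolution (w, h) are toy values.
W_CELLS, H_CELLS = 32, 24          # grid resolution (w, h)
X_MIN, X_MAX = 0.0, 16.0           # ground-plane extent in metres
Y_MIN, Y_MAX = 0.0, 12.0

def world_to_cell(x, y):
    """Map a continuous ground-plane point (x, y) to a discrete cell index."""
    i = int((x - X_MIN) / (X_MAX - X_MIN) * W_CELLS)
    j = int((y - Y_MIN) / (Y_MAX - Y_MIN) * H_CELLS)
    # Clamp points on the far border into the last cell.
    return min(i, W_CELLS - 1), min(j, H_CELLS - 1)

# A full sequence is then a (w, h, T) occupancy volume.
T = 5
occupancy = np.zeros((W_CELLS, H_CELLS, T))
occupancy[world_to_cell(8.0, 6.0) + (0,)] = 1.0  # a person at t = 0
```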
3.2. People Flow
Given a pair of consecutive multi-view time steps, we define the human flow $f^{t,t+1}$ as follows: for a given location $i$, the flow $f^{t,t+1}_{i,j}$ is the probability that a person in cell $i$ at time $t$ moves to location $j$ at time $t+1$, where $j \in N(i)$ is a neighbor of $i$. Concretely, for each cell in the ground plane we represent people flow by a 9-dimensional vector of probabilities (one dimension per neighbor of that cell). The grid representation and the definition of neighborhood are illustrated in Fig. 2.
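A minimal sketch of this 9-neighbor grid representation, assuming a row-major ordering of the nine offsets (the paper does not specify one):

```python
import numpy as np

# The 9 flow channels per cell correspond to the 8 neighbours plus the
# cell itself (offset (0, 0)); the ordering below is an assumption.
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]  # 9 moves

def neighbors(i, j, w, h):
    """Cells reachable from (i, j) in one time step, staying on the grid."""
    return [(i + di, j + dj) for di, dj in OFFSETS
            if 0 <= i + di < w and 0 <= j + dj < h]

# Flow tensor: flow[i, j, k] = probability that a person in cell (i, j)
# at time t moves along OFFSETS[k] between t and t+1.
w, h = 4, 3
flow = np.zeros((w, h, len(OFFSETS)))
```

Corner and border cells simply have fewer valid neighbors, which is why `neighbors` filters moves that leave the grid.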
To accurately model human motion, the flow needs to respect three constraints.

First, a people-conservation constraint: if a person is present at time $t$, they should be present at time $t+1$ in the same location or in a neighboring one. In other words, if we consider three time steps $I^{t-1}$, $I^t$, and $I^{t+1}$, the sum of the incoming flows into cell $j$ between times $t-1$ and $t$ should be equal to the sum of the outgoing flows between times $t$ and $t+1$. More formally, it reads:

$$\sum_{i \in N(j)} f^{t-1,t}_{i,j} \;=\; x^t_j \;=\; \sum_{k \in N(j)} f^{t,t+1}_{j,k}. \qquad (1)$$

Both sums are equal to $x^t_j$, the probability that there is a person in cell $j$ at time $t$.
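This conservation constraint can be checked numerically on dense flow tensors of shape (w, h, 9); the channel ordering below is an assumption for illustration:

```python
import numpy as np

# Channel k of a flow tensor encodes the move OFFSETS[k] (ordering assumed).
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def incoming(flow, i, j):
    """Sum of flows arriving in cell (i, j), i.e. the left-hand side of Eq. 1."""
    w, h, _ = flow.shape
    total = 0.0
    for k, (di, dj) in enumerate(OFFSETS):
        ni, nj = i - di, j - dj  # source cell whose move (di, dj) lands on (i, j)
        if 0 <= ni < w and 0 <= nj < h:
            total += flow[ni, nj, k]
    return total

def outgoing(flow, i, j):
    """Sum of flows leaving cell (i, j), i.e. the right-hand side of Eq. 1."""
    return flow[i, j].sum()

# Eq. 1 states: incoming(f_prev, i, j) == outgoing(f_next, i, j) == x^t_{ij}.
```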
Second, a non-overlapping constraint: at any time, there should be at most one person in every cell:

$$\forall k, t, \quad \sum_{j \in N(k)} f^{t,t+1}_{k,j} \leq 1. \qquad (2)$$
Finally, a temporal consistency constraint: if we reverse a sequence, the flow should remain the same, with the flow direction flipped:

$$f^{t-1,t}_{i,j} = f^{t,t-1}_{j,i}. \qquad (3)$$
Reconstructing detections from human flow is trivial (Eq. (1)) and has a unique solution; on the other hand, generating flow from detections can have multiple solutions. We therefore introduce the Multi-View FlowNet (MVFlow), trained to generate human flow. By predicting flow instead of detections, our model is able to take advantage of this asymmetric mapping between flow and detections. It learns to predict flow in a weakly supervised manner, using only detection annotations. Enforcing the flow constraints of Eq. (1), Eq. (2) and Eq. (3) also serves as a regularization for the final detections: predictions are temporally consistent and represent natural human motion.
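To make the constraints concrete, here is a sketch of Eq. (1)'s reconstruction together with soft penalties one could derive from Eq. (2) and Eq. (3); the channel ordering and the penalty forms are illustrative assumptions, not the paper's exact training losses:

```python
import numpy as np

# Channel k of a (w, h, 9) flow tensor encodes the move OFFSETS[k] (assumed).
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def detections_from_flow(flow):
    """Unique reconstruction of x^t from flow (Eq. 1): sum of outgoing flows."""
    return flow.sum(axis=-1)                      # (w, h, 9) -> (w, h)

def non_overlap_penalty(flow):
    """Eq. 2: penalise cells whose total outgoing flow exceeds 1."""
    return np.maximum(detections_from_flow(flow) - 1.0, 0.0).sum()

def temporal_consistency_penalty(flow_fwd, flow_bwd):
    """Eq. 3: the flow of the reversed sequence must mirror the forward
    flow, i.e. f^{t,t+1}_{i,j} = f^{t+1,t}_{j,i}."""
    w, h, _ = flow_fwd.shape
    penalty = 0.0
    for i in range(w):
        for j in range(h):
            for k, (di, dj) in enumerate(OFFSETS):
                ni, nj = i + di, j + dj
                if 0 <= ni < w and 0 <= nj < h:
                    # The reversed move (-di, -dj) is channel 8 - k under
                    # the row-major OFFSETS ordering above.
                    penalty += abs(flow_fwd[i, j, k] - flow_bwd[ni, nj, 8 - k])
    return penalty
```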
3.3. Multi-View architecture
In this section we detail the architecture of MVFlow, our
multi-view detection model.
The proposed model consists of 5 steps and takes as input a pair of multi-view frames. Each frame is processed by a ResNet. The resulting features are projected onto the ground plane. Ground features from the same point of view at times $t$ and $t+1$ are aggregated. Afterwards, the spatial aggregation module combines the features from the different points of view into human flow. Detection predictions are reconstructed from the flow for both time steps. The 5 steps are illustrated in Fig. 3. More formally, the model is defined as follows:
$$\binom{I^t}{I^{t+1}} \stackrel{g_{\theta_0}}{\longmapsto} \binom{F^t}{F^{t+1}} \stackrel{C}{\longmapsto} \binom{G^t}{G^{t+1}} \stackrel{c_{\theta_1}}{\longmapsto} G^{t,t+1} \stackrel{s_{\theta_2}}{\longmapsto} f^{t,t+1} \stackrel{\text{rec.}}{\longmapsto} \binom{x^t}{x^{t+1}}, \qquad (4)$$
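A shape-level sketch of this five-step pipeline, with toy stand-ins for the learned modules (the real model uses a ResNet backbone and learned aggregation networks; only the data flow between steps here follows Eq. (4), all module bodies and sizes are assumptions):

```python
import numpy as np

V, W, H = 2, 8, 8        # cameras and image size (toy values)
w, h = 4, 4              # ground-plane grid size (toy values)

def g_theta0(frames):    # per-frame feature extractor (a ResNet in the paper)
    return frames.mean(axis=-1, keepdims=True)       # (V, W, H, 1) toy features

def project(feats):      # projection using calibration C
    return feats[:, :w, :h, :]                       # stand-in: crop, not a homography

def c_theta1(g_t, g_t1):  # temporal aggregation of the two time steps, per view
    return np.concatenate([g_t, g_t1], axis=-1)      # (V, w, h, 2)

def s_theta2(g_pair):    # spatial aggregation across views -> flow probabilities
    fused = g_pair.mean(axis=0)                      # (w, h, 2)
    logits = np.repeat(fused.mean(axis=-1, keepdims=True), 9, axis=-1)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)         # (w, h, 9), sums to 1 per cell

def reconstruct(flow):   # detections = sum of the nine outgoing flows (Eq. 1)
    return flow.sum(axis=-1)                         # (w, h)

I_t  = np.random.rand(V, W, H, 3)
I_t1 = np.random.rand(V, W, H, 3)
flow = s_theta2(c_theta1(project(g_theta0(I_t)), project(g_theta0(I_t1))))
x_t  = reconstruct(flow)
```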