ViewBirdiformer: Learning to recover ground-plane crowd trajectories
and ego-motion from a single ego-centric view
Mai Nishimura1,2, Shohei Nobuhara2 and Ko Nishino2
Abstract— We introduce a novel learning-based method for
view birdification [1], the task of recovering ground-plane
trajectories of pedestrians in a crowd and of their observer in
the same crowd just from the observed ego-centric video. View
birdification becomes essential for mobile robot navigation and
localization in dense crowds where the static background is hard
to see and reliably track. It is challenging mainly for two
reasons: i) absolute trajectories of pedestrians are entangled with
the movement of the observer which needs to be decoupled from
their observed relative movements in the ego-centric video, and
ii) a crowd motion model describing the pedestrian movement
interactions is specific to the scene yet unknown a priori. For
this, we introduce a Transformer-based network referred to
as ViewBirdiformer which implicitly models the crowd motion
through self-attention and decomposes relative 2D movement
observations onto the ground-plane trajectories of the crowd
and the camera through cross-attention between views. Most
importantly, ViewBirdiformer achieves view birdification in a
single forward pass which opens the door to accurate real-time,
always-on situational awareness. Extensive experimental results
demonstrate that ViewBirdiformer achieves accuracy similar to
or better than state-of-the-art with three orders of magnitude
reduction in execution time.
I. INTRODUCTION
We as human beings have a fairly accurate idea of
the absolute movements of our surroundings in the world
coordinate frame, even when we can only observe their
movements relative to our own within our field of view, such
as when walking in a crowd. Enabling a mobile agent
to maintain a dynamically updated map of surrounding
absolute movements on the ground, solely from observations
collected from its own vantage point, would be of significant
use for various applications including robot navigation [2],
autonomous driving [3], sports analysis [4], and crowd mon-
itoring [5]–[7]. The key challenge lies in the fact that when
the observer (e.g., person or robot) is surrounded by other
dynamic agents, static “background” can hardly be found in
the agent’s field of view. In such scenes, conventional visual
localization methods including SLAM would fail since static
landmarks become untrackable due to frequent occlusions
by pedestrians and the limited, dynamically changing field
of view [1]. External odometry signals such as IMU and
GPS are also often unreliable, and even when they are available,
visual feedback remains essential for robust pose estimation
(imagine walking in a crowd with your eyes closed).
1Mai Nishimura is with OMRON SINIC X Corporation, 5-24-5, Hongo,
Bunkyo-ku, Tokyo, Japan mai.nishimura@sinicx.com
2Shohei Nobuhara and Ko Nishino are with Kyoto
University, Yoshida Honmachi, Sakyo-ku, Kyoto, Japan
{nob,kon}@i.kyoto-u.ac.jp
Nishimura et al. recently introduced this exact task
as view birdification, whose goal is to recover the on-ground
trajectories of a camera and a crowd just from perceived
movements (not appearance) in an ego-centric video [1]1.
They proposed to decompose these two types of trajectories,
one of the pedestrians in the crowd and another of a
person or mobile robot with an ego-view camera, with a
cascaded optimization which alternates between estimating
the displacements of the camera and estimating those
of surrounding pedestrians while constraining the crowd
trajectories with a pre-determined crowd motion model [8],
[9]. This iterative approach suffers from two critical
problems that hinder its practical use. First, the iterative
optimization incurs a large computational cost which
precludes real-time use. Second, the analytical crowd motion
model used as a prior is restrictive and not applicable to
diverse scenes where the crowd motion model is unknown.
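For intuition on where this cost comes from, the following is a deliberately simplified sketch of such a cascaded alternation. It is our own toy illustration, not the optimization of [1]: the camera pose is reduced to a 2D translation, the observations to noiseless relative offsets, and the crowd motion prior to a constant-position term.

```python
import numpy as np

# Toy sketch of a cascaded alternation (simplified illustration only):
# alternate between updating the camera and updating the pedestrians,
# with a stand-in term in place of a hand-crafted crowd motion prior.
def birdify_alternating(rel_obs, prev_pos, n_iters=20, prior_weight=0.1):
    """rel_obs:  (K, 2) pedestrian offsets observed relative to the camera.
    prev_pos: (K, 2) on-ground pedestrian positions at the previous frame."""
    cam = np.zeros(2)                  # current camera position estimate
    pos = prev_pos.copy()              # current pedestrian position estimates
    for _ in range(n_iters):           # costly inner loop, repeated per frame
        # (i) fix the pedestrians, update the camera from the relative offsets
        cam = np.mean(pos - rel_obs, axis=0)
        # (ii) fix the camera, update the pedestrians under the motion prior
        data_term = cam + rel_obs
        pos = (data_term + prior_weight * prev_pos) / (1.0 + prior_weight)
    return cam, pos

# Example: three pedestrians observed from a camera at (1.0, 0.5)
true_cam = np.array([1.0, 0.5])
true_pos = np.array([[2.0, 3.0], [0.0, 1.0], [4.0, -1.0]])
cam_hat, pos_hat = birdify_alternating(true_pos - true_cam, true_pos)
```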
In this paper, we propose ViewBirdiformer, a Transformer-
based view birdification method. Instead of relying on re-
strictive assumptions on the motion of surrounding people
and costly alternating optimization, we define a Transformer-
based network that learns to reconstruct on-ground trajecto-
ries of the surrounding pedestrians and the camera from a
single ego-centric video while simultaneously learning their
motion models. As Fig. 1 depicts, ViewBirdiformer takes in-
image 2D pedestrian movements as inputs, and outputs 2D
pedestrian trajectories and the observer’s ego-motion on the
ground plane. ViewBirdiformer’s multi-head self-attention on the
motion feature embeddings of each pedestrian captures the
local and global interactions among pedestrians. At
the same time, it learns to reconstruct on-ground trajectories
from observed 2D motion in the image with cross-attention
on features coming from different viewpoints.
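For concreteness, below is a minimal PyTorch sketch of an architecture of this kind; the module choices, layer sizes, query parameterization, and output heads are our own assumptions for illustration, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class ViewBirdiformerSketch(nn.Module):
    """Illustrative sketch: self-attention over per-pedestrian motion tokens,
    cross-attention from ground-plane queries to those tokens."""
    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.obs_embed = nn.Linear(2, d_model)     # 2D in-image displacement -> token
        self.self_attn = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)    # pedestrian-pedestrian interactions
        self.cross_attn = nn.TransformerDecoderLayer(
            d_model, n_heads, batch_first=True)    # queries attend to image-motion tokens
        self.ego_query = nn.Parameter(torch.randn(1, 1, d_model))  # camera ego-motion query
        self.ped_head = nn.Linear(d_model, 2)      # per-pedestrian ground-plane displacement
        self.ego_head = nn.Linear(d_model, 3)      # observer motion (dx, dy, dtheta), assumed SE(2)

    def forward(self, obs, ped_queries):
        # obs:         (B, K, 2)       observed 2D pedestrian movements in the image
        # ped_queries: (B, K, d_model) trajectory queries expressed relative to the
        #                              previous ego-motion estimate
        tokens = self.self_attn(self.obs_embed(obs))         # crowd interaction modelling
        queries = torch.cat(
            [self.ego_query.expand(obs.size(0), -1, -1), ped_queries], dim=1)
        decoded = self.cross_attn(queries, tokens)            # view-to-ground decoding
        return self.ego_head(decoded[:, 0]), self.ped_head(decoded[:, 1:])
```

Because the ego-motion and all pedestrian queries are decoded together, a single forward pass yields both outputs, which is what enables the single-pass, real-time inference highlighted in the abstract.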
A key challenge of this data-driven view birdification lies
in the inconsistency of coordinate frames between input and
output movements—the input is 2D in-image movements
relative to ego-motion, but the expected outputs are on-
ground trajectories in absolute coordinates (i.e., independent
of the observer’s motion). ViewBirdiformer resolves this by
introducing two types of queries, i.e., camera ego-motion
and pedestrian trajectory queries, in a multi-task learning
formulation, and by transforming the coordinates of the pedestrian
queries relative to the previous ego-motion estimates.
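As a concrete illustration of this coordinate handling, the snippet below re-expresses absolute ground-plane pedestrian positions in the frame of the previous ego-motion estimate before they are fed back as queries; the SE(2) pose parameterization (x, y, θ) is our assumption.

```python
import torch

def to_observer_frame(ped_world, cam_pose):
    """Express absolute ground-plane pedestrian positions relative to the
    previous ego-motion estimate (illustrative; assumes an SE(2) pose).
    ped_world: (K, 2) positions in absolute ground coordinates
    cam_pose:  (3,)   previous observer pose estimate (x, y, theta)"""
    c, s = torch.cos(cam_pose[2]), torch.sin(cam_pose[2])
    R_wc = torch.stack([torch.stack([c, -s]),
                        torch.stack([s, c])])    # observer -> world rotation
    # world -> observer: remove the translation, then rotate by R_wc^T
    # (right-multiplication because positions are stored as row vectors)
    return (ped_world - cam_pose[:2]) @ R_wc
```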
We thoroughly evaluate the effectiveness of our
method using the view birdification dataset [1] and
1Note that the Bird’s Eye View transform is a completely different problem
as it concerns a single-frame view of the appearance (not the movements)
and cannot reconstruct the camera ego-motion.