ViewBirdiformer: Learning to recover ground-plane crowd trajectories
and ego-motion from a single ego-centric view
Mai Nishimura1,2, Shohei Nobuhara2 and Ko Nishino2
Abstract— We introduce a novel learning-based method for
view birdification [1], the task of recovering ground-plane
trajectories of pedestrians of a crowd and their observer in
the same crowd just from the observed ego-centric video. View
birdification becomes essential for mobile robot navigation and
localization in dense crowds where the static background is hard
to see and reliably track. It is challenging mainly for two
reasons: i) absolute trajectories of pedestrians are entangled with
the movement of the observer which needs to be decoupled from
their observed relative movements in the ego-centric video, and
ii) a crowd motion model describing the pedestrian movement
interactions is specific to the scene yet unknown a priori. For
this, we introduce a Transformer-based network referred to
as ViewBirdiformer, which implicitly models the crowd motion
through self-attention and decomposes relative 2D movement
observations into the ground-plane trajectories of the crowd
and the camera through cross-attention between views. Most
important, ViewBirdiformer achieves view birdification in a
single forward pass, which opens the door to accurate real-time,
always-on situational awareness. Extensive experimental results
demonstrate that ViewBirdiformer achieves accuracy similar to
or better than state-of-the-art with three orders of magnitude
reduction in execution time.
I. INTRODUCTION
We as human beings have a fairly accurate idea of
the absolute movements of our surroundings in the world
coordinate frame, even when we can only observe their
movements relative to our own in our sight, such as
when walking in a crowd. Enabling a mobile agent
to maintain a dynamically updated map of surrounding
absolute movements on the ground, solely from observations
collected from its own vantage point, would be of significant
use for various applications including robot navigation [2],
autonomous driving [3], sports analysis [4], and crowd mon-
itoring [5]–[7]. The key challenge lies in the fact that when
the observer (e.g., person or robot) is surrounded by other
dynamic agents, static “background” can hardly be found in
the agent’s field of view. In such scenes, conventional visual
localization methods including SLAM would fail since static
landmarks become untrackable due to frequent occlusions
by pedestrians and the limited dynamically changing field
of view [1]. External odometry signals such as IMU and
GPS are also often unreliable. Even when they are available,
visual feedback becomes essential for robust pose estimation
(imagine walking in a crowd with closed eyes).
1Mai Nishimura is with OMRON SINIC X Corporation, 5-24-5, Hongo,
Bunkyo-ku, Tokyo, Japan mai.nishimura@sinicx.com
2Shohei Nobuhara and Ko Nishino are with Kyoto
University, Yoshida Honmachi, Sakyo-ku, Kyoto, Japan
{nob,kon}@i.kyoto-u.ac.jp
Nishimura et al. recently introduced this exact task
as view birdification, whose goal is to recover on-ground
trajectories of a camera and a crowd just from perceived
movements (not appearance) in an ego-centric video [1] 1.
They proposed to decompose these two types of trajectories,
one of the pedestrians in the crowd and another of a
person or mobile robot with an ego-view camera, with a
cascaded optimization which alternates between estimating
the displacements of the camera and estimating those
of surrounding pedestrians while constraining the crowd
trajectories with a pre-determined crowd motion model [8],
[9]. This iterative approach suffers from two critical
problems which hinder its practical use. First, its iterative
optimization incurs a large computational cost which
precludes real-time use. Second, the analytical crowd model
as a prior is restricting and not applicable to diverse scenes
where the crowd motion model is unknown.
In this paper, we propose ViewBirdiformer, a Transformer-
based view birdification method. Instead of relying on re-
strictive assumptions on the motion of surrounding people
and costly alternating optimization, we define a Transformer-
based network that learns to reconstruct on-ground trajecto-
ries of the surrounding pedestrians and the camera from a
single ego-centric video while simultaneously learning their
motion models. As Fig. 1 depicts, ViewBirdiformer takes in-
image 2D pedestrian movements as inputs, and outputs 2D
pedestrian trajectories and the observer’s ego-motion on the
ground plane. ViewBirdiformer's multi-head self-attention on the
motion feature embeddings of each pedestrian captures the
local and global interactions of pedestrians. At
the same time, it learns to reconstruct on-ground trajectories
from observed 2D motion in the image with cross-attention
on features coming from different viewpoints.
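As a concrete illustration (not the authors' released code), the following minimal PyTorch sketch shows such an encoder-decoder: self-attention over per-pedestrian motion embeddings in the encoder, and cross-attention from ego-motion and trajectory queries to the encoded in-image motion features in the decoder. The class name, feature dimensions, and the 4D motion parameterization are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' implementation): an encoder applies
# self-attention over per-pedestrian 2D motion embeddings, and a decoder
# applies cross-attention between ego-motion/trajectory queries and the
# encoded in-image motion features. All names and sizes are illustrative.
import torch
import torch.nn as nn


class ViewBirdiformerSketch(nn.Module):
    def __init__(self, d_model=128, nhead=8, num_layers=3):
        super().__init__()
        # embed observed 2D in-image displacements (u, v, du, dv) per pedestrian
        self.motion_embed = nn.Linear(4, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)  # self-attention: crowd interactions
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)  # cross-attention: image -> ground
        self.ego_head = nn.Linear(d_model, 3)    # (theta, tx, ty) of the observer on the ground plane
        self.traj_head = nn.Linear(d_model, 2)   # (x, y) per pedestrian on the ground plane

    def forward(self, in_image_motion, ego_query, ped_queries):
        # in_image_motion: (B, N, 4)  observed 2D movements of N pedestrians
        # ego_query:       (B, 1, d)  learned camera ego-motion query
        # ped_queries:     (B, N, d)  per-pedestrian trajectory queries
        memory = self.encoder(self.motion_embed(in_image_motion))
        queries = torch.cat([ego_query, ped_queries], dim=1)
        decoded = self.decoder(queries, memory)
        ego_motion = self.ego_head(decoded[:, :1])     # observer rotation + translation
        trajectories = self.traj_head(decoded[:, 1:])  # on-ground pedestrian positions
        return ego_motion, trajectories
```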
A key challenge of this data-driven view birdification lies
in the inconsistency of coordinate frames between input and
output movements—the input is 2D in-image movements
relative to ego-motion, but the expected outputs are on-
ground trajectories in absolute coordinates (i.e., independent
of the observer’s motion). ViewBirdiformer resolves this by
introducing two types of queries, i.e., camera ego-motion
and pedestrian trajectory queries, in a multi-task learning
formulation, and by transforming coordinates of pedestrian
queries relative to the previous ego-motion estimates.
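To illustrate this coordinate handling, the short sketch below expresses on-ground pedestrian positions relative to a previous ego-motion estimate R(θ), t on the ground plane. The function name `to_observer_frame` and the exact parameterization are hypothetical and only serve to make the transform explicit.

```python
# Sketch only: express pedestrian positions relative to the previous
# ego-motion estimate (rotation theta, translation t on the ground plane).
# The parameterization and function name are illustrative assumptions.
import numpy as np

def to_observer_frame(ped_xy, theta, t):
    """Map world-frame 2D positions into the observer's frame at the
    previous step, i.e., undo the estimated ego-motion R(theta), t."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # world -> observer: x' = R^T (x - t); applied row-wise via (x - t) @ R
    return (ped_xy - t) @ R

# example: two pedestrians, observer at t = (1, 0) rotated by 30 degrees
ped_xy = np.array([[2.0, 1.0], [0.5, -0.5]])
rel_xy = to_observer_frame(ped_xy, np.deg2rad(30.0), np.array([1.0, 0.0]))
```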
We thoroughly evaluate the effectiveness of our
method using the view birdification dataset [1] and
1Note that the Bird's Eye View transform is a completely different problem
as it concerns a single-frame view of the appearance (not the movements)
and cannot reconstruct the camera ego-motion.
[Fig. 1 graphic: a Transformer Encoder followed by a Transformer Decoder with an Ego-Motion Head outputting R(θτ), tτ and a Trajectory Head outputting xτ; panel (a) shows 2D in-image movements of the observer and pedestrians at τ-1 and τ, panel (b) shows 2D on-ground trajectories over τ-2, τ-1, τ.]
Fig. 1: Given bounding boxes of moving pedestrians in an ego-centric view captured in the crowd, ViewBirdiformer
reconstructs on-ground trajectories of both the observer and the surrounding pedestrians.
also by conducting ablation studies which validate its key
components. The proposed Transformer-based architecture
learns to reconstruct trajectories of the camera and the
crowd while learning their motion models by adaptively
attending to their movement features in the image plane
and on the ground. It enables real-time view birdification
of arbitrary ego-view crowd sequences in a single inference
pass, which leads to three orders of magnitude speedup
from the iterative optimization approach [1]. We show that
the results of ViewBirdiformer can be opportunistically
refined with geometric post-processing, which results in
accuracy similar to or better than the state of the art [1]
while remaining orders of magnitude faster in execution time.
II. RELATED WORK
A. View Birdification
As summarized in Table I, View Birdification [1] is not
the same as bird’s-eye view (BEV) transformation [10]–
[13]. BEV transformation refers to the task of rendering
a 2D top-down view image from an on-ground ego-centric
view; it concerns the appearance of the surroundings as
seen from the top and does not resolve the ego-motion,
i.e., all recovered BEVs are still relative to the observer. View
birdification, in contrast, reconstructs both the observer’s and
surrounding pedestrians’ locations on the ground so that the
relative movements captured in the ego-centric view can be
analyzed in a single world coordinate frame on the ground
(i.e., “birdified”). View birdification thus fundamentally dif-
fers from BEV transform as it is inherently a 3D transform
that accounts for the ego-motion, i.e., the 2D projections of
surrounding people in the 2D ego-view need to be implicitly
or explicitly lifted into 3D and translated to cancel out the
jointly estimated ego-motion of the observer before being
projected down onto the ground-plane. Nishimura et al. intro-
duced a geometric method for view birdification [1], which
explicitly transforms the 2D projected pedestrian movements
into 3D movements on the ground plane with a graph energy
minimization by leveraging analytically expressible crowd
motion models [8]. Our method fundamentally differs from
this in that the transformation from 2D in-image movement
to on-ground motion as well as the on-ground coordination
of pedestrian motion is jointly learned from data.
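For intuition, the sketch below illustrates the underlying geometry that view birdification must resolve, under simplifying assumptions (known intrinsics K, known camera height, a level camera, and pedestrians on a flat ground plane): an observed foot point is lifted onto the ground plane in the camera frame, and the estimated ego-motion is then canceled to place the pedestrian in the world ground-plane frame. This is only an illustration of the geometric relationship, not the optimization of [1] nor our learned model.

```python
# Illustrative geometry only: back-project a pedestrian's foot point onto a
# flat ground plane using known intrinsics K and camera height, then cancel
# the estimated ego-motion (R, t) to obtain an absolute on-ground position.
import numpy as np

def birdify_point(uv, K, cam_height, R_wc, t_wc):
    """uv: pixel of the pedestrian's foot point; R_wc, t_wc: camera-to-world
    pose. Assumes the optical axis is parallel to the ground, the foot point
    lies below the horizon, and the pedestrian stands on a flat ground plane."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])  # viewing ray in camera coords
    # intersect the ray with the ground plane located cam_height below the camera
    scale = cam_height / ray[1]        # y-axis points down in camera coords
    p_cam = scale * ray                # 3D point in the camera frame
    p_world = R_wc @ p_cam + t_wc      # cancel ego-motion: camera -> world
    return p_world[[0, 2]]             # on-ground (x, z) coordinates
```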
TABLE I: View birdification (VB) is the only task that si-
multaneously recovers the absolute trajectories of the camera
and its surrounding pedestrians only from their perceived
movements relative to an observer.
Task            | Input           | Output     | Scenes
                | static  dynamic | ego   traj | a few people  crowd
BEV [10]        |   ✓        ✓    |            |
3D MOT [14]     |   ✓        ✓    |        ✓   |      ✓
SLAM [15]       |   ✓        ✓    |  ✓     ✓   |      ✓
VB (Ours, [1])  |            ✓    |  ✓     ✓   |      ✓           ✓
B. Simultaneous Localization and Mapping (SLAM)
SLAM and its variants inherently rely on the
assumption that the world is static [16]–[18]. Dynamic
objects cause feature points to drift and contaminate the
ego-motion estimate and consequently the 3D reconstruction.
Past methods have made SLAM applicable to dynamic
scenes, “despite” these dynamic objects, by treating them
as outliers [19] or explicitly tracking and filtering them
[20]–[23]. A notable exception is Dynamic Object SLAM
which explicitly incorporates such objects into its geometric
optimization [15], [24], [25]. These methods detect and track
dynamic objects together with static keypoints, but assume
that the dynamic objects in view are rigid and obey a
simple motion model that results in smoothly changing
poses. None of the above methods consider the complex
pedestrian interactions in the crowd [5], [8], [26], [27]. Our
method fundamentally differs from dynamic SLAM in that
it reconstructs both the observer’s ego-motion and the on-
ground trajectories of surrounding dynamic objects without
relying on any static key-point, while also recovering the
interaction between surrounding dynamic objects. In other
words, the movements themselves are the features.
C. 3D Multi-Object Tracking (3D MOT)
3D MOT concerns the detection and tracking of target
objects in a video sequence while estimating their 3D lo-
cations on the ground [28]–[30]. Most recent works aim
to improve tracklet association across frames [29], [31].
These approaches, however, assume a simple motion model
independent of the camera ego-motion [32], which hardly
applies to a dynamic observer in a crowd with complex
interactions with other pedestrians. 3D MOT in a video