ViewBirdiformer: Learning to recover ground-plane crowd trajectories
and ego-motion from a single ego-centric view
Mai Nishimura1,2, Shohei Nobuhara2 and Ko Nishino2
Abstract— We introduce a novel learning-based method for
view birdification [1], the task of recovering ground-plane
trajectories of pedestrians in a crowd and of their observer in
the same crowd just from the observed ego-centric video. View
birdification becomes essential for mobile robot navigation and
localization in dense crowds where the static background is hard
to see and reliably track. It is challenging mainly for two
reasons: i) absolute trajectories of pedestrians are entangled with
the movement of the observer which needs to be decoupled from
their observed relative movements in the ego-centric video, and
ii) a crowd motion model describing the pedestrian movement
interactions is specific to the scene yet unknown a priori. For
this, we introduce a Transformer-based network referred to
as ViewBirdiformer which implicitly models the crowd motion
through self-attention and decomposes relative 2D movement
observations onto the ground-plane trajectories of the crowd
and the camera through cross-attention between views. Most
importantly, ViewBirdiformer achieves view birdification in a
single forward pass which opens the door to accurate real-time,
always-on situational awareness. Extensive experimental results
demonstrate that ViewBirdiformer achieves accuracy similar to
or better than state-of-the-art with three orders of magnitude
reduction in execution time.
I. INTRODUCTION
We as human beings have a fairly accurate idea of
the absolute movements of our surroundings in the world
coordinate frame, even when we can only observe their
movements relative to our own within our field of view, such
as when walking in a crowd. Enabling a mobile agent
to maintain a dynamically updated map of surrounding
absolute movements on the ground, solely from observations
collected from its own vantage point, would be of significant
use for various applications including robot navigation [2],
autonomous driving [3], sports analysis [4], and crowd mon-
itoring [5]–[7]. The key challenge lies in the fact that when
the observer (e.g., person or robot) is surrounded by other
dynamic agents, static “background” can hardly be found in
the agent’s field of view. In such scenes, conventional visual
localization methods including SLAM would fail since static
landmarks become untrackable due to frequent occlusions
by pedestrians and the limited, dynamically changing field
of view [1]. External odometry signals such as IMU and
GPS are also often unreliable, and even when they are available,
visual feedback remains essential for robust pose estimation
(imagine walking in a crowd with your eyes closed).
1Mai Nishimura is with OMRON SINIC X Corporation, 5-24-5, Hongo,
Bunkyo-ku, Tokyo, Japan mai.nishimura@sinicx.com
2Shohei Nobuhara and Ko Nishino are with Kyoto
University, Yoshida Honmachi, Sakyo-ku, Kyoto, Japan
{nob,kon}@i.kyoto-u.ac.jp
Nishimura et al. recently introduced this exact task
as view birdification, whose goal is to recover the on-ground
trajectories of a camera and a crowd just from perceived
movements (not appearance) in an ego-centric video [1]1.
They proposed to decompose these two types of trajectories,
one of the pedestrians in the crowd and another of a
person or mobile robot with an ego-view camera, with a
cascaded optimization which alternates between estimating
the displacements of the camera and estimating those
of surrounding pedestrians while constraining the crowd
trajectories with a pre-determined crowd motion model [8],
[9]. This iterative approach suffers from two critical
problems that hinder its practical use. First, the iterative
optimization incurs a large computational cost which
precludes real-time use. Second, the analytical crowd motion
model used as a prior is restrictive and not applicable to
diverse scenes where the crowd motion model is unknown.
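For intuition on where this cost comes from, the following is a deliberately simplified sketch of such a cascaded alternation. It is our own toy illustration, not the optimization of [1]: the camera pose is reduced to a 2D translation, the observations to noiseless relative offsets, and the crowd motion prior to a constant-position term.

```python
import numpy as np

# Toy sketch of a cascaded alternation (simplified illustration only):
# alternate between updating the camera and updating the pedestrians,
# with a stand-in term in place of a hand-crafted crowd motion prior.
def birdify_alternating(rel_obs, prev_pos, n_iters=20, prior_weight=0.1):
    """rel_obs:  (K, 2) pedestrian offsets observed relative to the camera.
    prev_pos: (K, 2) on-ground pedestrian positions at the previous frame."""
    cam = np.zeros(2)                  # current camera position estimate
    pos = prev_pos.copy()              # current pedestrian position estimates
    for _ in range(n_iters):           # costly inner loop, repeated per frame
        # (i) fix the pedestrians, update the camera from the relative offsets
        cam = np.mean(pos - rel_obs, axis=0)
        # (ii) fix the camera, update the pedestrians under the motion prior
        data_term = cam + rel_obs
        pos = (data_term + prior_weight * prev_pos) / (1.0 + prior_weight)
    return cam, pos

# Example: three pedestrians observed from a camera at (1.0, 0.5)
true_cam = np.array([1.0, 0.5])
true_pos = np.array([[2.0, 3.0], [0.0, 1.0], [4.0, -1.0]])
cam_hat, pos_hat = birdify_alternating(true_pos - true_cam, true_pos)
```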
In this paper, we propose ViewBirdiformer, a Transformer-
based view birdification method. Instead of relying on re-
strictive assumptions on the motion of surrounding people
and costly alternating optimization, we define a Transformer-
based network that learns to reconstruct on-ground trajecto-
ries of the surrounding pedestrians and the camera from a
single ego-centric video while simultaneously learning their
motion models. As Fig. 1 depicts, ViewBirdiformer takes in-
image 2D pedestrian movements as inputs, and outputs 2D
pedestrian trajectories and the observer’s ego-motion on the
ground plane. ViewBirdiformer’s multi-head self-attention on the
motion feature embeddings of each pedestrian captures the
local and global interactions among pedestrians. At
the same time, it learns to reconstruct on-ground trajectories
from observed 2D motion in the image with cross-attention
on features coming from different viewpoints.
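For concreteness, below is a minimal PyTorch sketch of an architecture of this kind; the module choices, layer sizes, query parameterization, and output heads are our own assumptions for illustration, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class ViewBirdiformerSketch(nn.Module):
    """Illustrative sketch: self-attention over per-pedestrian motion tokens,
    cross-attention from ground-plane queries to those tokens."""
    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.obs_embed = nn.Linear(2, d_model)     # 2D in-image displacement -> token
        self.self_attn = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)    # pedestrian-pedestrian interactions
        self.cross_attn = nn.TransformerDecoderLayer(
            d_model, n_heads, batch_first=True)    # queries attend to image-motion tokens
        self.ego_query = nn.Parameter(torch.randn(1, 1, d_model))  # camera ego-motion query
        self.ped_head = nn.Linear(d_model, 2)      # per-pedestrian ground-plane displacement
        self.ego_head = nn.Linear(d_model, 3)      # observer motion (dx, dy, dtheta), assumed SE(2)

    def forward(self, obs, ped_queries):
        # obs:         (B, K, 2)       observed 2D pedestrian movements in the image
        # ped_queries: (B, K, d_model) trajectory queries expressed relative to the
        #                              previous ego-motion estimate
        tokens = self.self_attn(self.obs_embed(obs))         # crowd interaction modelling
        queries = torch.cat(
            [self.ego_query.expand(obs.size(0), -1, -1), ped_queries], dim=1)
        decoded = self.cross_attn(queries, tokens)            # view-to-ground decoding
        return self.ego_head(decoded[:, 0]), self.ped_head(decoded[:, 1:])
```

Because the ego-motion and all pedestrian queries are decoded together, a single forward pass yields both outputs, which is what enables the single-pass, real-time inference highlighted in the abstract.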
A key challenge of this data-driven view birdification lies
in the inconsistency of coordinate frames between input and
output movements—the input is 2D in-image movements
relative to ego-motion, but the expected outputs are on-
ground trajectories in absolute coordinates (i.e., independent
of the observer’s motion). ViewBirdiformer resolves this by
introducing two types of queries, i.e., camera ego-motion
and pedestrian trajectory queries, in a multi-task learning
formulation, and by transforming the coordinates of the pedestrian
queries relative to the previous ego-motion estimates.
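As a concrete illustration of this coordinate handling, the snippet below re-expresses absolute ground-plane pedestrian positions in the frame of the previous ego-motion estimate before they are fed back as queries; the SE(2) pose parameterization (x, y, θ) is our assumption.

```python
import torch

def to_observer_frame(ped_world, cam_pose):
    """Express absolute ground-plane pedestrian positions relative to the
    previous ego-motion estimate (illustrative; assumes an SE(2) pose).
    ped_world: (K, 2) positions in absolute ground coordinates
    cam_pose:  (3,)   previous observer pose estimate (x, y, theta)"""
    c, s = torch.cos(cam_pose[2]), torch.sin(cam_pose[2])
    R_wc = torch.stack([torch.stack([c, -s]),
                        torch.stack([s, c])])    # observer -> world rotation
    # world -> observer: remove the translation, then rotate by R_wc^T
    # (right-multiplication because positions are stored as row vectors)
    return (ped_world - cam_pose[:2]) @ R_wc
```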
We thoroughly evaluate the effectiveness of our
method using the view birdification dataset [1] and
1Note that the Bird’s Eye View transform is a completely different problem
as it concerns a single-frame view of the appearance (not the movements)
and cannot reconstruct the camera ego-motion.