1 Introduction
Placing virtual content in the physical 3D world, persisting it over time, and sharing it
with other users are typical scenarios for Augmented Reality (AR). In order to reliably
overlay virtual content in the real world with pixel-level precision, these scenarios require
AR devices to accurately determine their 6-DoF pose at any point in time. While visual
localization and mapping is one of the most studied problems in computer vision, its
use for AR entails specific challenges and opportunities. First, modern AR devices, such
as mobile phones or the Microsoft HoloLens or MagicLeap One, are often equipped
with multiple cameras and additional inertial or radio sensors. Second, they exhibit
characteristic hand-held or head-mounted motion patterns. The on-device real-time
tracking systems provide spatially-posed sensor streams. However, many AR scenarios
require positioning beyond local tracking, both indoors and outdoors, and robustness to
common temporal changes of appearance and structure. Furthermore, given the plurality
of temporal sensor data, the question is often not whether, but how quickly, the device
can localize at any time to ensure a compelling end-user experience. Finally, as AR adoption
grows, crowd-sourced data captured by users with diverse devices can be mined for
building large-scale maps without a manual and costly scanning effort. Crowd-sourcing
offers great opportunities but poses additional challenges to the robustness of algorithms,
e.g., to enable cross-device localization [21], mapping from incomplete data with low
accuracy [68,8], privacy-preservation of data [74,25,72,26,23], etc.
However, the academic community is mainly driven by benchmarks that are disconnected
from the specifics of AR. They mostly evaluate localization and mapping using single
still images and either lack temporal changes [73,57] or accurate ground truth (GT)
[66,37,77], are restricted to small scenes [6,73,37,84,71] or landmarks [34,69] with
perfect coverage and limited viewpoint variability, or disregard temporal tracking
data or additional visual, inertial, or radio sensors [67,66,77,40,12,76].
Our first contribution is to introduce a large-scale dataset captured using AR devices
in diverse environments, notably a historical building, a multi-story office
building, and part of a city center. The initial data release contains both indoor and
outdoor images with illumination and semantic changes as well as dynamic objects.
Specifically, we collected multi-sensor data streams (images, depth, tracking, IMU, BT,
WiFi) totalling more than 100 hours using head-mounted HoloLens 2 and hand-held
iPhone / iPad devices covering 45’000 square meters over the span of one year (Fig. 1).
Second, we develop a GT pipeline to automatically and accurately register AR trajectories
against large-scale 3D laser scans. Our pipeline does not require any manual
labelling or setup of custom infrastructure (e.g., fiducial markers). Furthermore, the
system robustly handles crowd-sourced data from heterogeneous devices captured over
longer periods of time and can be easily extended to support future devices.
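To make the registration step concrete, the minimal sketch below aligns an AR trajectory to a laser-scan reference by estimating a rigid transform with RANSAC over putative 3D-3D correspondences (e.g., camera centers of frames coarsely localized in the scan). It is an illustration only, not the actual pipeline: the function names, the source of correspondences, and the thresholds are assumptions.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid alignment (Kabsch/Umeyama) mapping src onto dst, both (N, 3)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def register_trajectory(traj_pts, scan_pts, iters=1000, thresh=0.10, seed=0):
    """RANSAC over minimal 3-point samples of hypothetical trajectory-to-scan
    correspondences: traj_pts and scan_pts are matched 3D points, both (N, 3)."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, -1
    for _ in range(iters):
        idx = rng.choice(len(traj_pts), size=3, replace=False)
        R, t = rigid_transform(traj_pts[idx], scan_pts[idx])
        residuals = np.linalg.norm(traj_pts @ R.T + t - scan_pts, axis=1)
        inliers = int((residuals < thresh).sum())
        if inliers > best_inliers:
            best_model, best_inliers = (R, t), inliers
    return best_model, best_inliers
```

In practice such an initial alignment would be refined jointly with the on-device tracking poses; the sketch only conveys the core registration idea.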
Finally, we present a rigorous evaluation of localization and mapping in the context
of AR and provide novel insights for future research. Notably, we show that
the performance of state-of-the-art methods can be drastically improved by considering
additional data streams generally available in AR devices, such as radio signals or
sequence odometry. Thus, future algorithms in the field of AR localization and mapping
should always consider these sensors in their evaluation to show real-world impact.
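As a concrete illustration of the latter point, the hedged sketch below uses radio signals to prune image-retrieval candidates before any visual matching: database images whose WiFi/BT fingerprints share no identifiers with the query can be discarded cheaply. The fingerprint representation and function names are illustrative assumptions, not the benchmark's API.

```python
def radio_similarity(query_fp, ref_fp):
    """Jaccard overlap between two sets of observed WiFi/BT identifiers
    (e.g., access-point MAC addresses) -- hypothetical fingerprint format."""
    if not query_fp or not ref_fp:
        return 0.0
    return len(query_fp & ref_fp) / len(query_fp | ref_fp)

def prune_retrieval_candidates(query_fp, db_fps, visual_scores, min_overlap=0.1):
    """Keep only database images whose radio fingerprint overlaps the query's,
    then rank the survivors by their visual retrieval score."""
    kept = [i for i, fp in enumerate(db_fps)
            if radio_similarity(query_fp, fp) >= min_overlap]
    if not kept:  # fall back to pure visual retrieval if radios never overlap
        kept = list(range(len(db_fps)))
    return sorted(kept, key=lambda i: visual_scores[i], reverse=True)
```

Restricting retrieval in this way shrinks the search space in large or repetitive scenes before any descriptor matching is run, which is one simple way the additional sensor streams can be exploited.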