LaMAR: Benchmarking Localization and Mapping for Augmented Reality

Paul-Edouard Sarlin⋆1, Mihai Dusmanu⋆1, Johannes L. Schönberger2, Pablo Speciale2, Lukas Gruber2, Viktor Larsson†1, Ondrej Miksik2, and Marc Pollefeys1,2

1 Department of Computer Science, ETH Zurich, Switzerland
2 Microsoft Mixed Reality & AI Lab, Zurich, Switzerland
[Figure 1: ground truth from laser scanners aligned with AR-headset and mobile-phone captures (multi-camera rig, image sequences, depth, IMU, radio signals) across the CAB, HGE, and LIN locations, spanning indoor and outdoor areas of 150-200 m with day/night and year-long changes.]
Fig. 1. We revisit localization and mapping in the context of Augmented Reality by introducing LaMAR, a large-scale dataset captured using AR devices (HoloLens 2, iPhone) and laser scanners.
Abstract.
Localization and mapping is the foundational technology for aug-
mented reality (AR) that enables sharing and persistence of digital content in the
real world. While significant progress has been made, researchers are still mostly
driven by unrealistic benchmarks not representative of real-world AR scenarios.
These benchmarks are often based on small-scale datasets with low scene diversity,
captured from stationary cameras, and lack other sensor inputs like inertial, radio,
or depth data. Furthermore, their ground-truth (GT) accuracy is mostly insufficient
to satisfy AR requirements. To close this gap, we introduce LaMAR, a new bench-
mark with a comprehensive capture and GT pipeline that co-registers realistic
trajectories and sensor streams captured by heterogeneous AR devices in large,
unconstrained scenes. To establish an accurate GT, our pipeline robustly aligns the
trajectories against laser scans in a fully automated manner. As a result, we publish
a benchmark dataset of diverse and large-scale scenes recorded with head-mounted
and hand-held AR devices. We extend several state-of-the-art methods to take
advantage of the AR-specific setup and evaluate them on our benchmark. The
results offer new insights on current research and reveal promising avenues for
future work in the field of localization and mapping for AR.
⋆ Equal contribution. † Now at Lund University, Sweden.
arXiv:2210.10770v1 [cs.CV] 19 Oct 2022
1 Introduction
Placing virtual content in the physical 3D world, persisting it over time, and sharing it
with other users are typical scenarios for Augmented Reality (AR). In order to reliably
overlay virtual content in the real world with pixel-level precision, these scenarios require
AR devices to accurately determine their 6-DoF pose at any point in time. While visual
localization and mapping is one of the most studied problems in computer vision, its
use for AR entails specific challenges and opportunities. First, modern AR devices, such
as mobile phones or the Microsoft HoloLens or MagicLeap One, are often equipped
with multiple cameras and additional inertial or radio sensors. Second, they exhibit
characteristic hand-held or head-mounted motion patterns. The on-device real-time
tracking systems provide spatially-posed sensor streams. However, many AR scenarios
require positioning beyond local tracking, both indoors and outdoors, and robustness to
common temporal changes of appearance and structure. Furthermore, given the plurality
of temporal sensor data, the question is often not whether, but how quickly can the device
localize at any time to ensure a compelling end-user experience. Finally, as AR adoption
grows, crowd-sourced data captured by users with diverse devices can be mined for
building large-scale maps without a manual and costly scanning effort. Crowd-sourcing
offers great opportunities but poses additional challenges on the robustness of algorithms,
e.g., to enable cross-device localization [21], mapping from incomplete data with low accuracy [68,8], privacy-preservation of data [74,25,72,26,23], etc.
However, the academic community is mainly driven by benchmarks that are disconnected from the specifics of AR. They mostly evaluate localization and mapping using single still images and either lack temporal changes [73,57] or accurate ground truth (GT) [66,37,77], are restricted to small scenes [6,73,37,84,71] or landmarks [34,69] with perfect coverage and limited viewpoint variability, or disregard temporal tracking data or additional visual, inertial, or radio sensors [67,66,77,40,12,76].
Our first contribution is to introduce a large-scale dataset captured using AR devices in diverse environments, notably a historical building, a multi-story office building, and part of a city center. The initial data release contains both indoor and outdoor images with illumination and semantic changes as well as dynamic objects. Specifically, we collected multi-sensor data streams (images, depth, tracking, IMU, BT, WiFi) totalling more than 100 hours using head-mounted HoloLens 2 and hand-held iPhone/iPad devices covering 45'000 square meters over the span of one year (Fig. 1).
Second, we develop a GT pipeline to automatically and accurately register AR trajectories against large-scale 3D laser scans. Our pipeline does not require any manual labelling or setup of custom infrastructure (e.g., fiducial markers). Furthermore, the system robustly handles crowd-sourced data from heterogeneous devices captured over longer periods of time and can be easily extended to support future devices.
Finally, we present a rigorous evaluation of localization and mapping in the context of AR and provide novel insights for future research. Notably, we show that the performance of state-of-the-art methods can be drastically improved by considering additional data streams generally available in AR devices, such as radio signals or sequence odometry. Thus, future algorithms in the field of AR localization and mapping should always consider these sensors in their evaluation to show real-world impact.
| dataset | scale / density | camera motion | imaging devices | additional sensors | ground truth | accuracy |
|---|---|---|---|---|---|---|
| Aachen [67,66] | ⋆⋆+ / ⋆⋆ | still images | DSLR | — | SfM | >dm |
| Phototourism [34] | +⋆⋆ / ⋆⋆⋆ | still images | DSLR, phone | — | SfM | m |
| San Francisco [14] | ⋆⋆⋆ / ⋆⋆⋆ | still images | DSLR, phone | GNSS | SfM+GNSS | m |
| Cambridge [37] | +⋆⋆ / ⋆⋆ | handheld | mobile | — | SfM | >dm |
| 7Scenes [73] | +⋆⋆ / ⋆⋆⋆ | handheld | mobile | depth | RGB-D | cm |
| RIO10 [84] | +⋆⋆ / ⋆⋆⋆ | handheld | Tango tablet | depth | VIO | >dm |
| InLoc [77] | ⋆++⋆⋆ | still images | panoramas, phone | lidar | manual+lidar | >dm |
| Baidu mall [76] | ⋆+⋆⋆ | still images | DSLR, phone | lidar | manual+lidar | dm |
| Naver Labs [40] | ⋆⋆⋆⋆ | robot-mounted | fisheye, phone | lidar | lidar+SfM | dm |
| NCLT [12] | ⋆⋆⋆⋆ | robot-mounted | wide-angle | lidar, IMU, GNSS | lidar+VIO | dm |
| ADVIO [57] | ⋆⋆+⋆⋆ | handheld | phone, Tango | IMU, depth, GNSS | manual+VIO | m |
| ETH3D [71] | +⋆⋆ / ⋆⋆ | handheld | DSLR, wide-angle | lidar | manual+lidar | mm |
| LaMAR (ours) | ⋆⋆+ / ⋆⋆⋆ (3 locations, 45'000 m²; 100 hours, 40 km) | handheld, head-mounted | phone, headset, backpack, trolley | lidar, IMU, WiFi, BT, depth, infrared | lidar+SfM+VIO, automated | cm |
Table 1. Overview of existing datasets. No dataset besides ours exhibits both short-term appearance and structural changes (due to moving people, weather, or day-night cycles) and long-term changes (due to displaced furniture or construction work).
The LaMAR dataset, benchmark, GT pipeline, and the implementations of baselines integrating additional sensory data are all publicly available at lamar.ethz.ch. We hope that this will spark future research addressing the challenges of AR.
2 Related work
Image-based localization is classically tackled by estimating a camera pose from correspondences established between sparse local features [44,7,60,48] and a 3D Structure-from-Motion (SfM) [68] map of the scene [24,42,65]. This pipeline scales to large scenes using image retrieval [2,33,58,79,11,56,80]. Recently, many of these steps or even the end-to-end pipeline have been successfully learned with neural networks [20,63,22,70,3,50,78,62,89,32,64,43]. Other approaches regress absolute camera pose [37,36,51] or scene coordinates [73,83,47,46,41,9,86,10]. However, all these approaches typically fail whenever there is a lack of context (e.g., limited field-of-view) or the map has repetitive elements. Leveraging the sequential ordering of video frames [49,35] or modelling the problem as a generalized camera [54,29,66,74] can improve results.
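As a concrete illustration of the final step of this classical pipeline, the snippet below estimates a 6-DoF pose with OpenCV's RANSAC-based PnP solver. This is a minimal sketch, not the implementation of any cited method: it assumes 2D-3D matches between query keypoints and SfM map points have already been established by retrieval and feature matching, and all names and thresholds are illustrative.

```python
import cv2
import numpy as np

def localize_query(points_2d, points_3d, K):
    """Estimate a 6-DoF pose from 2D-3D matches of a single, undistorted query image."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        objectPoints=points_3d.astype(np.float64),  # (N, 3) SfM points, world frame
        imagePoints=points_2d.astype(np.float64),   # (N, 2) matched keypoints, pixels
        cameraMatrix=K, distCoeffs=None,
        reprojectionError=3.0,                      # inlier threshold in pixels
        iterationsCount=1000)
    if not ok:
        return None                                 # too few matches or no consensus
    R, _ = cv2.Rodrigues(rvec)                      # world-to-camera rotation
    return R, tvec.ravel(), inliers                 # pose and inlier match indices
```

Sequence- or rig-based localization generalizes this step by estimating a single pose for several frames or cameras jointly, which is one reason the generalized-camera formulations above tend to be more robust.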
Radio-based localization:
Radio signals, such as WiFi and Bluetooth, are spatially bounded (logarithmic decay) [5,38,28] and can thus distinguish similarly looking but spatially distant locations. Their unique identifiers can be uniquely hashed, which makes them computationally attractive compared with high-dimensional image descriptors. Several methods use the signal strength, angle, direction, or time of arrival [52,13,18], but the most popular is model-free, map-based fingerprinting [38,28,39], as it only requires collecting the unique identifiers of nearby radio sources and the received signal strength. GNSS provides absolute 3-DoF positioning but is not applicable indoors and has insufficient accuracy for AR scenarios, especially in urban environments due to multi-pathing, etc.
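To make the fingerprinting idea concrete, here is a minimal, hypothetical sketch (not the benchmark's implementation): a fingerprint maps radio identifiers (e.g., hashed MAC addresses) to received signal strength (RSSI, in dBm), and locations are compared by an RSSI distance over the union of observed identifiers. The missing-signal value and the scoring function are illustrative choices.

```python
def fingerprint_similarity(query, reference, missing_rssi=-100.0):
    """Return a similarity score (higher is better) between two radio fingerprints.

    Each fingerprint is a dict mapping a radio identifier to its RSSI in dBm;
    identifiers seen in only one fingerprint are treated as very weak signals.
    """
    ids = set(query) | set(reference)
    if not ids:
        return float("-inf")
    sq = sum((query.get(i, missing_rssi) - reference.get(i, missing_rssi)) ** 2
             for i in ids)
    return -(sq / len(ids)) ** 0.5  # negative RMSE over RSSI values


def rank_map_locations(query, mapped):
    """Coarse positioning: mapped is a list of (pose, fingerprint) pairs with
    known poses; return them sorted from most to least similar to the query scan."""
    return sorted(mapped, key=lambda pf: fingerprint_similarity(query, pf[1]),
                  reverse=True)
```

Such a coarse ranking can also be used to prune the set of candidate map images before visual retrieval, which is how radio data typically complements image-based localization.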
Datasets and ground-truth:
Many of the existing benchmarks (cf. Tab. 1) are captured in small-scale environments [73,84,19,30], do not contain sequential data [67,34,14,77,76,71,6,69], lack characteristic hand-held/head-mounted motion patterns [66,4,45,87], or their GT is not accurate enough for AR [57,37].
| device | motion type | # cameras | FOV | frequency | resolution | specs | radios | other data | poses |
|---|---|---|---|---|---|---|---|---|---|
| M6 | trolley | 6 | 113° | 1-3 m | 1080p | RGB, sync | WiFi, BT | lidar points + mesh | lidar SLAM |
| VLX | backpack | 4 | 90° | 1-3 m | 1080p | RGB, sync | BT | lidar points + mesh | lidar SLAM |
| HoloLens2 | head-mounted | 4 | 83° | 30 Hz | VGA | gray, GS | WiFi, BT | ToF depth/IR 1 Hz, IMU | head-tracking |
| iPad/iPhone | hand-held | 1 | 64° | 10 Hz | 1080p | RGB, RS, AF | BT | lidar depth 10 Hz, IMU | ARKit |
Table 2. Sensor specifications. Our dataset has visible-light images (global shutter GS, rolling shutter RS, auto-focus AF), depth data (ToF, lidar), radio signals (WiFi, Bluetooth; only partially available on some devices), dense lidar point clouds, and poses with intrinsics from on-device tracking.
None of these datasets contain WiFi or Bluetooth data (Tab. 1). The closest to our work are Naver Labs [40], NCLT [12], and ETH3D [71]. Both Naver Labs [40] and NCLT [12] are less accurate than ours and do not contain AR-specific trajectories or radio data. The Naver Labs dataset [40] also does not contain any outdoor data. ETH3D [71] is highly accurate; however, it is only small-scale and does not contain significant changes or any radio data.
To establish ground truth, many datasets rely on off-the-shelf SfM algorithms [68] for unordered image collections [67,34,37,84,57,76,77]. Pure SfM-based GT generation has limited accuracy [8] and completeness, which biases the evaluations towards scenarios in which visual localization already works well. Other approaches rely on RGB(-D) tracking [84,73], which usually drifts in larger scenes and cannot produce GT in crowd-sourced, multi-device scenarios. Specialized capture rigs pairing an AR device with a more accurate sensor (lidar) [40,12] prevent capturing realistic AR motion patterns. Furthermore, scalability is limited for these approaches, especially if they rely on manual selection of reference images [76], laborious labelling of correspondences [67,77], or placement of fiducial markers [30]. For example, the accuracy of ETH3D [71] is achieved by using a single stationary lidar scan, manual cleaning, and aligning very few images captured by tripod-mounted DSLR cameras. The images thus obtained are not representative of AR devices, and the process cannot scale or take advantage of crowd-sourced data. In contrast, our fully automatic approach does not require any manual labelling or special capture setups, and thus enables light-weight and repeated scanning of large locations.
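For intuition only, the core geometric operation that trajectory-to-scan alignment reduces to, once 3D-3D correspondences are available, is a least-squares rigid fit. The sketch below uses the standard Kabsch/Umeyama solution; it is not the pipeline described in Sec. 4, which performs a robust, fully automated alignment against the laser scans, and all names are illustrative.

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares R, t such that R @ source[i] + t approximates target[i].

    source, target: (N, 3) arrays of corresponding 3D points, e.g. camera
    centers from on-device tracking and their counterparts expressed in the
    laser-scan coordinate frame.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    H = (source - mu_s).T @ (target - mu_t)   # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

In practice, correspondences are noisy and partly wrong, so such a solver is typically wrapped in a robust estimator (e.g., RANSAC) and followed by joint refinement.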
3 Dataset
We first give an overview of the setup and content of our dataset.
Locations:
The initial release of the dataset contains 3 large locations representative of AR use cases: 1) HGE (18'000 m²) is the ground floor of a historical university building composed of multiple large halls and large esplanades on both sides. 2) CAB (12'000 m²) is a multi-floor office building composed of multiple small and large offices, a kitchen, storage rooms, and 2 courtyards. 3) LIN (15'000 m²) is a few blocks of an old town with shops, restaurants, and narrow passages. HGE and CAB contain both indoor and outdoor sections with many symmetric structures. Each location underwent structural changes over the span of a year, e.g., the front of HGE turned into a construction site and the indoor furniture was rearranged. See Fig. 2 and Appendix A for visualizations.
Fig. 2. The locations feature diverse indoor and outdoor spaces.
High-quality meshes, obtained
from lidar, are registered with numerous AR sequences, each shown here as a different color.
Data collection:
We collected data using Microsoft HoloLens 2 and Apple iPad Pro
devices with custom raw sensor recording applications. 10 participants were each given
one device and asked to walk through a common designated area. They were only given
the instructions to freely walk through the environment to visit, inspect, and find their
way around. This yielded diverse camera heights and motion patterns. Their trajectories
were not planned or restricted in any way. Participants visited each location, both during
the day and at night, at different points in time over the course of up to 1 year. In total,
each location is covered by more than 100 sessions of 5 minutes. We did not need to prepare the capture site in any way before recording, which enables easy, barrier-free, crowd-sourced data collection. Each location was also captured two to three times by
NavVis M6 trolley or VLX backpack mapping platforms, which generate textured dense
3D models of the environment using laser scanners and panoramic cameras.
Privacy:
We paid special attention to comply with privacy regulations. Since the dataset
is recorded in public spaces, our pipeline anonymizes all visible faces and licence plates.
Sensors:
We provide details about the recorded sensors in Tab. 2. The HoloLens has a
specialized large field-of-view (FOV) multi-camera tracking rig (low resolution, global
shutter) [82], while the iPad has a single, higher-resolution camera with rolling shutter
and more limited FOV. We also recorded outputs of the real-time AR tracking algorithms
available on each device, which includes relative camera poses and sensor calibration.
All images are undistorted. All sensor data is registered into a common reference frame
with accurate absolute GT poses using the pipeline described in the next section.
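Purely as an illustration of what registration into a common reference frame means for a consumer of the data, a single frame could be represented as sketched below. The schema and field names are hypothetical and do not mirror the released capture format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RegisteredFrame:
    """One image of one device camera, with on-device and ground-truth poses."""
    timestamp_us: int                 # device clock, microseconds
    device_id: str                    # e.g. "hololens2_03" (hypothetical naming)
    camera_id: str                    # which camera of the multi-camera rig
    image_path: str                   # undistorted image on disk
    K: np.ndarray                     # 3x3 intrinsics from on-device calibration
    T_device_cam: np.ndarray          # 4x4 camera-to-device rig extrinsics
    T_world_device_track: np.ndarray  # 4x4 pose from real-time on-device tracking
    T_world_device_gt: np.ndarray     # 4x4 absolute GT pose from the pipeline of Sec. 4
```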
4 Ground-truth generation
The GT estimation process takes as input the raw data from the different sensors. The
entire pipeline is fully automated and does not require any manual alignment or input.
Overview:
We start by aligning different sessions of the laser scanner by using the
images and the 3D lidar point cloud. When registered together, they form the GT
reference map, which accurately captures the structure and appearance of the scene.