1 Introduction
Placing virtual content in the physical 3D world, persisting it over time, and sharing it
with other users are typical scenarios for Augmented Reality (AR). In order to reliably
overlay virtual content in the real world with pixel-level precision, these scenarios require
AR devices to accurately determine their 6-DoF pose at any point in time. While visual
localization and mapping is one of the most studied problems in computer vision, its
use for AR entails specific challenges and opportunities. First, modern AR devices, such
as mobile phones or the Microsoft HoloLens or MagicLeap One, are often equipped
with multiple cameras and additional inertial or radio sensors. Second, they exhibit
characteristic hand-held or head-mounted motion patterns. The on-device real-time
tracking systems provide spatially-posed sensor streams. However, many AR scenarios
require positioning beyond local tracking, both indoors and outdoors, and robustness to
common temporal changes of appearance and structure. Furthermore, given the plurality
of temporal sensor data, the question is often not whether, but how quickly, the device
can localize at any time to ensure a compelling end-user experience. Finally, as AR adoption
grows, crowd-sourced data captured by users with diverse devices can be mined for
building large-scale maps without a manual and costly scanning effort. Crowd-sourcing
offers great opportunities but poses additional challenges to the robustness of algorithms,
e.g., to enable cross-device localization [21], mapping from incomplete data with low
accuracy [68,8], privacy-preservation of data [74,25,72,26,23], etc.
However, the academic community is mainly driven by benchmarks that are disconnected
from the specifics of AR. They mostly evaluate localization and mapping using single
still images and either lack temporal changes [73,57] or accurate ground truth (GT)
[66,37,77], are restricted to small scenes [6,73,37,84,71] or landmarks [34,69] with
perfect coverage and limited viewpoint variability, or disregard temporal tracking
data or additional visual, inertial, or radio sensors [67,66,77,40,12,76].
Our first contribution is to introduce a large-scale dataset captured using AR devices
in diverse environments, notably a historical building, a multi-story office
building, and part of a city center. The initial data release contains both indoor and
outdoor images with illumination and semantic changes as well as dynamic objects.
Specifically, we collected multi-sensor data streams (images, depth, tracking, IMU, BT,
WiFi) totalling more than 100 hours using head-mounted HoloLens 2 and hand-held
iPhone / iPad devices covering 45’000 square meters over the span of one year (Fig. 1).
Second, we develop a GT pipeline to automatically and accurately register AR trajectories
against large-scale 3D laser scans. Our pipeline does not require any manual
labelling or setup of custom infrastructure (e.g., fiducial markers). Furthermore, the
system robustly handles crowd-sourced data from heterogeneous devices captured over
longer periods of time and can be easily extended to support future devices.
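To make the registration step concrete, the minimal sketch below aligns an AR trajectory to a laser-scan reference by estimating a rigid transform with RANSAC over putative 3D-3D correspondences (e.g., camera centers of frames coarsely localized in the scan). It is an illustration only, not the actual pipeline: the function names, the source of correspondences, and the thresholds are assumptions.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid alignment (Kabsch/Umeyama) mapping src onto dst, both (N, 3)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def register_trajectory(traj_pts, scan_pts, iters=1000, thresh=0.10, seed=0):
    """RANSAC over minimal 3-point samples of hypothetical trajectory-to-scan
    correspondences: traj_pts and scan_pts are matched 3D points, both (N, 3)."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, -1
    for _ in range(iters):
        idx = rng.choice(len(traj_pts), size=3, replace=False)
        R, t = rigid_transform(traj_pts[idx], scan_pts[idx])
        residuals = np.linalg.norm(traj_pts @ R.T + t - scan_pts, axis=1)
        inliers = int((residuals < thresh).sum())
        if inliers > best_inliers:
            best_model, best_inliers = (R, t), inliers
    return best_model, best_inliers
```

In practice such an initial alignment would be refined jointly with the on-device tracking poses; the sketch only conveys the core registration idea.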
Finally, we present a rigorous evaluation of localization and mapping in the context
of AR and provide novel insights for future research. Notably, we show that
the performance of state-of-the-art methods can be drastically improved by considering
additional data streams generally available in AR devices, such as radio signals or
sequence odometry. Thus, future algorithms in the field of AR localization and mapping
should always consider these sensors in their evaluation to show real-world impact.
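As a concrete illustration of the latter point, the hedged sketch below uses radio signals to prune image-retrieval candidates before any visual matching: database images whose WiFi/BT fingerprints share no identifiers with the query can be discarded cheaply. The fingerprint representation and function names are illustrative assumptions, not the benchmark's API.

```python
def radio_similarity(query_fp, ref_fp):
    """Jaccard overlap between two sets of observed WiFi/BT identifiers
    (e.g., access-point MAC addresses) -- hypothetical fingerprint format."""
    if not query_fp or not ref_fp:
        return 0.0
    return len(query_fp & ref_fp) / len(query_fp | ref_fp)

def prune_retrieval_candidates(query_fp, db_fps, visual_scores, min_overlap=0.1):
    """Keep only database images whose radio fingerprint overlaps the query's,
    then rank the survivors by their visual retrieval score."""
    kept = [i for i, fp in enumerate(db_fps)
            if radio_similarity(query_fp, fp) >= min_overlap]
    if not kept:  # fall back to pure visual retrieval if radios never overlap
        kept = list(range(len(db_fps)))
    return sorted(kept, key=lambda i: visual_scores[i], reverse=True)
```

Restricting retrieval in this way shrinks the search space in large or repetitive scenes before any descriptor matching is run, which is one simple way the additional sensor streams can be exploited.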