datasets. A majority of ToF-related works leverage rendered data [32,17], par-
ticularly for fusion datasets [1,2]. These datasets enable improvement over con-
ventional approaches, but synthesizing RGB and ToF images accurately is chal-
lenging. A domain gap is introduced as the noise profile and imaging artifacts
are different from the training data. Notable exceptions are Son et al. [44], and
Gao and Fan et al. [39], where an accurate depth camera provides training data
for a lower-quality ToF module. The acquisition is partially automated thanks
to a robotic arm. However, this bulky setup limits the variety of the scenes:
all scenes are captured on the same table, with similar backgrounds across the
dataset. In addition, the use of a single depth camera at a different location from
the ToF module introduces occlusion, with some areas in the ToF image having
no supervision. Moreover, this method tackles only ToF depth estimation, and
the dataset does not feature RGB images.
Multiview Geometry Estimation. Several approaches are capable of ac-
curate depth estimation from multiview images [42], even in dynamic environ-
ments [26,31,24]. Despite their accuracy, incorporating ToF data into these approaches is not straightforward. Scene representations optimized from a set of images [6,21,4] have
recently shown good novel view synthesis and scene geometry reconstruction, in-
cluding to refine depth estimates in the context of multiview stereo [51]. Since the optimization can accept supervision from varied sources, incorporating ToF measurements is straightforward, as sketched below. For this reason, we select a state-of-the-art neural representation that has the advantage of handling heterogeneous resolutions [6] for our training data generation.
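To make this concrete, the following is a minimal sketch of how ToF supervision can enter such an optimization as an extra loss term; the function and variable names (tof_supervised_loss, tof_mask, lambda_tof) are illustrative assumptions, not the cited methods' actual interfaces.

```python
import torch

def tof_supervised_loss(rgb_pred, rgb_gt, depth_pred, tof_depth, tof_mask,
                        lambda_tof=0.1):
    """Photometric loss plus a ToF depth term over a batch of sampled rays.

    rgb_pred, rgb_gt: (N, 3) rendered and observed colors.
    depth_pred:       (N,) expected ray termination depth from the volume.
    tof_depth:        (N,) depth measured by the ToF module for those rays.
    tof_mask:         (N,) bool, True where the ToF measurement is valid.
    """
    photometric = torch.mean((rgb_pred - rgb_gt) ** 2)
    # Penalize only rays with a valid ToF measurement; the mask stands in
    # for an explicit confidence, which raw ToF modules rarely provide.
    residual = (depth_pred - tof_depth)[tof_mask]
    depth_term = (residual.pow(2).mean() if residual.numel() > 0
                  else depth_pred.new_zeros(()))
    return photometric + lambda_tof * depth_term
```

Because both terms are evaluated per sampled ray, the color and ToF images may have different resolutions, which is one reason a representation that handles heterogeneous resolutions is convenient here.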
TöRF [5] renders phasor images from a volume representation to optimize raw ToF image reconstruction. While it efficiently improves NeRF's results and tackles ToF phase wrapping, this approach is not necessary in our context: our device is not prone to phase wrapping, owing to its low illumination range (low power) and to its use of several modulation frequencies.
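For context, a continuous-wave ToF camera modulated at frequency $f$ measures a phase $\varphi \in [0, 2\pi)$ that determines depth only up to an integer number of wraps $k$:
\[
d = \frac{c\,\varphi}{4\pi f} + k\,\frac{c}{2f}, \qquad k \in \mathbb{Z},
\]
so the unambiguous range is $c/(2f)$, about 7.5 m for an illustrative $f = 20$ MHz (not the module's actual specification). A low-power emitter rarely yields usable returns beyond such a range, and combining several modulation frequencies resolves $k$ where it does.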
We also observe that, in the absence of explicit ToF confidence, erroneous ToF measurements tend to be more prevalent in depth maps rendered from a TöRF. Finally, approaches based on ICP registration [18] cannot be applied directly to our data, since depth maps from the low-power ToF module are too noisy for reliable registration.
3 Method
We use an off-the-shelf Samsung Galaxy S20+ smartphone. It has a main camera with a 12MP color sensor and a magnetically mounted 79° lens used for stabilization and focusing, a secondary 12MP color camera with a fixed 120° ultrawide lens, and a 0.3MP ToF system comprising an infrared camera with a fixed 78° lens and an infrared emitter (Fig. 1a). As the ultrawide camera and the ToF module are rigidly fixed,
we calibrate their intrinsics $K_\mathrm{UW}$, $K_\mathrm{ToF}$, extrinsics $[R|t]_\mathrm{UW}$, $[R|t]_\mathrm{ToF}$, and lens distortion parameters using an offline method based on checkerboard corner estimation. We use a checkerboard with similar absorption in the visible spectrum and in infrared, so that its corners are detectable by both the color and ToF cameras.
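As an illustration of this offline step, here is a minimal OpenCV-based sketch; the board dimensions, square size, and flags are assumptions, not our exact pipeline.

```python
import cv2
import numpy as np

BOARD = (9, 6)    # inner corners per row/column (assumed)
SQUARE = 0.025    # checkerboard square size in meters (assumed)

# 3D corner positions in the board frame, shared by all views.
OBJ = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
OBJ[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

def calibrate_pair(uw_images, tof_images):
    """Intrinsics, distortion, and relative pose from paired grayscale
    views of the same checkerboard (ultrawide color vs. ToF IR)."""
    objpts, pts_uw, pts_tof = [], [], []
    crit = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-4)
    for uw, ir in zip(uw_images, tof_images):
        ok_uw, c_uw = cv2.findChessboardCorners(uw, BOARD)
        ok_ir, c_ir = cv2.findChessboardCorners(ir, BOARD)
        if not (ok_uw and ok_ir):
            continue  # keep only views where both cameras see the board
        pts_uw.append(cv2.cornerSubPix(uw, c_uw, (5, 5), (-1, -1), crit))
        pts_tof.append(cv2.cornerSubPix(ir, c_ir, (5, 5), (-1, -1), crit))
        objpts.append(OBJ)
    # Per-camera intrinsics K and distortion coefficients d.
    _, K_uw, d_uw, _, _ = cv2.calibrateCamera(
        objpts, pts_uw, uw_images[0].shape[::-1], None, None)
    _, K_tof, d_tof, _, _ = cv2.calibrateCamera(
        objpts, pts_tof, tof_images[0].shape[::-1], None, None)
    # Relative pose [R|t] between the rigidly mounted cameras.
    _, _, _, _, _, R, t, _, _ = cv2.stereoCalibrate(
        objpts, pts_uw, pts_tof, K_uw, d_uw, K_tof, d_tof,
        uw_images[0].shape[::-1], flags=cv2.CALIB_FIX_INTRINSIC)
    return K_uw, d_uw, K_tof, d_tof, R, t
```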
However, calibrating the floating main camera (subscript FM) is not possible offline, as its pose changes from snapshot to snapshot. OIS intro-