FloatingFusion: Depth from ToF and
Image-stabilized Stereo Cameras
Andreas Meuleman1, Hakyeong Kim1,
James Tompkin2, and Min H. Kim1
1KAIST, South Korea
{ameuleman,hkkim,minhkim}@vclab.kaist.ac.kr
2Brown University, United States
Abstract. High-accuracy per-pixel depth is vital for computational pho-
tography, so smartphones now have multimodal camera systems with
time-of-flight (ToF) depth sensors and multiple color cameras. However,
producing accurate high-resolution depth is still challenging due to the
low resolution and limited active illumination power of ToF sensors. Fus-
ing RGB stereo and ToF information is a promising direction to over-
come these issues, but a key problem remains: to provide high-quality
2D RGB images, the main color sensor’s lens is optically stabilized, re-
sulting in an unknown pose for the floating lens that breaks the geo-
metric relationships between the multimodal image sensors. Leveraging
ToF depth estimates and a wide-angle RGB camera, we design an au-
tomatic calibration technique based on dense 2D/3D matching that can
estimate camera extrinsic, intrinsic, and distortion parameters of a sta-
bilized main RGB sensor from a single snapshot. This lets us fuse stereo
and ToF cues via a correlation volume. For fusion, we apply deep learn-
ing via a real-world training dataset with depth supervision estimated
by a neural reconstruction method. For evaluation, we acquire a test
dataset using a commercial high-power depth camera and show that our
approach achieves higher accuracy than existing baselines.
Keywords: Online camera calibration, 3D imaging, depth estimation,
multi-modal sensor fusion, stereo imaging, time of flight.
1 Introduction
Advances in computational photography allow many applications such as 3D
reconstruction [18], view synthesis [23,43], depth-aware image editing [49,53],
and augmented reality [20,48]. Vital to these algorithms is high-accuracy per-
pixel depth, e.g., to integrate virtual objects by backprojecting high-resolution
camera color into 3D. To this end, smartphones now have camera systems with
multiple sensors, lenses of different focal lengths, and active-illumination time-
of-flight (ToF). For instance, correlation-based ToF provides depth by measuring
the travel time of infrared active illumination with a gated infrared sensor.
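As a rough illustration (a generic four-tap continuous-wave model, not the phone's proprietary pipeline; tap ordering and sign convention are assumptions), depth follows from the measured phase of the returned modulation:

```python
import numpy as np

def tof_depth_from_correlations(c0, c90, c180, c270, f_mod):
    """Recover depth from four-tap correlation ToF samples.

    c0..c270: correlation images captured at 0/90/180/270 degree phase offsets.
    f_mod:    modulation frequency in Hz.
    """
    c_light = 299_792_458.0  # speed of light (m/s)
    # Phase of the returned signal relative to the emitted modulation.
    phase = np.arctan2(c270 - c90, c0 - c180)
    phase = np.mod(phase, 2.0 * np.pi)  # wrap into [0, 2*pi)
    # Amplitude serves as a per-pixel confidence proxy.
    amplitude = 0.5 * np.sqrt((c270 - c90) ** 2 + (c0 - c180) ** 2)
    # One modulation period corresponds to a round trip of c / f_mod.
    depth = c_light * phase / (4.0 * np.pi * f_mod)
    return depth, amplitude
```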
We consider two challenges in providing high-accuracy per-pixel depth: (1)
ToF sensor spatial resolution is orders of magnitude less than that of its com-
patriot color cameras.
Fig. 1: (a) Multi-modal smartphone imaging: a floating main camera with OIS, a ToF emitter and camera, and an ultrawide camera. (b) Reference RGB image. (c) ToF depth reprojected to the reference. (d) Our depth from floating fusion.
RGB spatial resolution has increased dramatically on
smartphones—12–64 million pixels is common—whereas ToF is often 0.05–0.3
million pixels. One might correctly think that fusing depth information from ToF
with depth information from color camera stereo disparity is a good strategy to
increase our depth resolution. Fusion might also help us overcome the low signal-
to-noise ratio in ToF signals that arises from the low-intensity illumination of a
battery-powered device. For fusion, we need to accurately know the geometric
poses of all sensors and lenses in the camera system.
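To make the role of these poses concrete, reprojecting ToF depth into a color view (as in Fig. 1c) already requires all of them. A minimal sketch, assuming undistorted pinhole models and a known ToF-to-RGB rotation R and translation t (names and the nearest-pixel splat are illustrative):

```python
import numpy as np

def reproject_tof_depth(depth_tof, K_tof, K_rgb, R, t, rgb_hw):
    """Splat a low-resolution ToF z-depth map into an RGB camera's image plane."""
    h, w = depth_tof.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    # Unproject ToF pixels to 3D points in the ToF camera frame.
    rays = np.linalg.inv(K_tof) @ pix
    pts_tof = rays * depth_tof.reshape(1, -1)
    # Transform into the RGB camera frame and project.
    pts_rgb = R @ pts_tof + t.reshape(3, 1)
    proj = K_rgb @ pts_rgb
    z = proj[2]
    valid = z > 1e-6  # drop points behind or on the camera plane (and invalid depth)
    x = np.round(proj[0] / np.where(valid, z, 1.0)).astype(int)
    y = np.round(proj[1] / np.where(valid, z, 1.0)).astype(int)
    out = np.full(rgb_hw, np.nan)
    inb = valid & (x >= 0) & (x < rgb_hw[1]) & (y >= 0) & (y < rgb_hw[0])
    out[y[inb], x[inb]] = z[inb]  # nearest-pixel splat; holes remain at the RGB resolution
    return out
```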
This leads to our second challenge: (2) As RGB spatial resolution has in-
creased, smartphones now use optical image stabilization [40,50]: a floating lens
compensates for camera body motion to avoid motion blur during exposure.
Two low-power actuators suspend the lens body vertically and horizontally to
provide a few degrees of in-plane rotation or translation, similar to how a third
actuator translates the lens along the optical axis for focus. The magnetic actua-
tion varies with focus and even with the smartphone’s orientation due to gravity,
and the pose of the stabilizer cannot currently be measured or read out
electronically. As such, we can only use a fusion strategy if we can automatically
optically calibrate the geometry of the floating lens for each exposure taken.
This work proposes a floating fusion algorithm to provide high-accuracy per-pixel depth estimates from an optically-image-stabilized camera, a second RGB
camera, and a ToF camera (Fig. 1). We design an online calibration approach
for the floating lens that uses ToF measurements and dense optical flow match-
ing between the RGB camera pair. This lets us form 2D/3D correspondences to
recover intrinsic, extrinsic, and lens distortion parameters in an absolute manner
(not ‘up to scale’), and for every snapshot. This makes it suitable for dynamic en-
vironments. Then, to fuse multi-modal sensor information, we build a correlation
volume that integrates both ToF and stereo RGB cues, then predict disparity
via a learned function. There are few large multi-modal datasets to train this
function, and synthetic data creation is expensive and retains a domain gap to
the real world. Instead, we capture real-world scenes with multiple views and
optimize a neural radiance field [6] with ToF supervision. The resulting depth
maps have lower noise and higher detail than those of a depth camera, and provide us with high-quality training data. For validation, we build a test dataset
using a Kinect Azure and show that our method outperforms other traditional
and data-driven approaches for snapshot RGB-D imaging.
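A minimal sketch of how a correlation volume can integrate both cues follows; the actual feature extraction, volume layout, and fusion weighting in our pipeline are design choices not reproduced here, and the Gaussian ToF prior is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def fused_correlation_volume(feat_l, feat_r, tof_disp, tof_conf, max_disp):
    """Stereo correlation volume augmented with a ToF-derived disparity prior.

    feat_l, feat_r: [B, C, H, W] rectified left/right feature maps.
    tof_disp:       [B, 1, H, W] ToF depth converted to disparity (via the stereo
                    baseline) and reprojected into the left view; any finite
                    placeholder where ToF is missing.
    tof_conf:       [B, 1, H, W] per-pixel ToF confidence in [0, 1], zero where missing.
    """
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            corr = (feat_l * feat_r).mean(dim=1)
        else:
            # Left pixel x matches right pixel x - d.
            corr = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
            corr = F.pad(corr, (d, 0))
        volume[:, d] = corr
    # ToF unary term: boost disparity hypotheses close to the ToF disparity.
    disp_grid = torch.arange(max_disp, device=feat_l.device).view(1, -1, 1, 1)
    tof_term = torch.exp(-(disp_grid - tof_disp) ** 2 / 2.0)
    volume = volume + tof_conf * tof_term
    return volume  # fed to a learned network that predicts disparity
```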
2 Related Work
ToF and RGB Fusion. Existing data-driven approaches [1,2,36] heavily rely
on synthetic data, creating a domain gap. This is exacerbated when using imper-
fect low-power sensors such as those on mobile phones. In addition, current stereo-ToF
fusion [14,10,15] typically estimates disparity from stereo and ToF separately
before fusion. One approach is to estimate stereo and ToF confidence to merge
the disparity maps [33,1,2,37]. In contrast, our ToF estimates are directly in-
corporated into our disparity pipeline before depth selection. Fusion without
stereo [22] tackles more challenging scenarios than direct ToF depth estimation.
However, Jung et al.’s downsampling process can blur over occlusion edges, producing incorrect low-resolution depth that is difficult to fix after reprojection to finer resolutions.
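In its simplest form, the late-fusion strategy shared by these methods reduces to a confidence-weighted blend of the two disparity maps (an illustrative simplification; each method estimates and combines confidences differently):

```python
import numpy as np

def confidence_merge(disp_stereo, conf_stereo, disp_tof, conf_tof, eps=1e-6):
    """Blend stereo and ToF disparity maps using per-pixel confidence weights."""
    w = conf_stereo + conf_tof + eps
    return (conf_stereo * disp_stereo + conf_tof * disp_tof) / w
```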
Phone and Multi-Sensor Calibration. DiVerdi and Barron [12] tackle per
shot stereo calibration up to scale in the challenging mobile camera environment;
however, absolute calibration is critical for stereo/ToF fusion. We leverage coarse
ToF depth estimates for absolute stereo calibration. Gil et al. [16] estimate two-
view stereo calibration by first estimating a monocular depth map in one image
before optimizing the differentiable projective transformation (DPT) that max-
imizes the consistency between the stereo depth and the monocular depth. The
method refines the DPT parameters, handling camera pose shift after factory
calibration and improving stereo depth quality, but it still requires the initial
transformation to be sufficiently accurate for reasonable stereo depth estima-
tion. In addition, to allow for stable optimization, a lower degree of freedom
model is selected, which can neglect camera distortion and lens shift. Works
tackling calibration with phone and ToF sensors are not common. Gao et al. [15]
use Kinect RGB-D inputs, match RGB keypoints to the other camera, use depth to lift the matched points to 3D, and then solve a PnP problem to find the transformation. Since the method matches sparse keypoints, depth is not guaranteed to be available at each keypoint, which can leave too few usable correspondences. In addition, the method does not refine intrinsic or distortion parameters.
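For concreteness, a minimal sketch of this depth-lifting and PnP strategy (hypothetical names; OpenCV's solvePnPRansac stands in for the actual solver) shows where keypoints without valid depth are discarded:

```python
import cv2
import numpy as np

def extrinsics_from_depth_and_matches(kp_depth_cam, kp_other_cam, depth_map,
                                      K_depth, K_other, dist_other):
    """Lift matched keypoints of the depth camera to 3D and solve PnP
    against the second camera.

    kp_depth_cam, kp_other_cam: N x 2 matched pixel coordinates.
    """
    u = kp_depth_cam[:, 0].round().astype(int)
    v = kp_depth_cam[:, 1].round().astype(int)
    z = depth_map[v, u]
    valid = z > 0  # many keypoints land on pixels with missing depth
    pix = np.concatenate([kp_depth_cam[valid], np.ones((valid.sum(), 1))], axis=1)
    pts3d = (np.linalg.inv(K_depth) @ pix.T * z[valid]).T  # N_valid x 3
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32),
        kp_other_cam[valid].astype(np.float32),
        K_other, dist_other)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```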
Data-Driven ToF Depth Estimation. Numerous works [44,32,17,3,46,39]
attempt to tackle ToF depth estimation via learned approaches. While these
approaches have demonstrated strong capabilities in handling challenging arti-
facts (noise, multi-path interference, or motion), our approach does not strictly
require a dedicated method for ToF depth estimation as we directly merge ToF
samples in our stereo fusion pipeline.
Conventional Datasets. Accurate real-world datasets with ground-truth depth
maps are common for stereo depth estimation [41,34,45]. However, the variety
of fusion systems makes it challenging to acquire large, high-quality, real-world
datasets. A majority of ToF-related works leverage rendered data [32,17], par-
ticularly for fusion datasets [1,2]. These datasets enable improvement over con-
ventional approaches, but synthesizing RGB and ToF images accurately is chal-
lenging. A domain gap is introduced because real-world noise profiles and imaging artifacts differ from those in the synthetic training data. Notable exceptions are Son et al. [44] and
Gao and Fan et al. [39], where an accurate depth camera provides training data
for a lower-quality ToF module. The acquisition is partially automated thanks
to a robotic arm. However, this bulky setup limits the variety of the scenes:
all scenes are captured on the same table, with similar backgrounds across the
dataset. In addition, the use of a single depth camera at a different location from
the ToF module introduces occlusion, with some areas in the ToF image having
no supervision. Finally, this method only tackles ToF depth estimation, and
the dataset does not feature RGB images.
Multiview Geometry Estimation. Several approaches are capable of ac-
curate depth estimation from multiview images [42], even in dynamic environ-
ments [26,31,24]. Despite their accuracy, it is not obvious how to incorporate ToF data into these approaches. Scene representations optimized from a set of images [6,21,4] have
recently shown good novel view synthesis and scene geometry reconstruction, including refining depth estimates in the context of multiview stereo [51]. Since the optimization can accept supervision from varied sources, incorporating ToF measurements is straightforward. For this reason, we select a state-of-the-art neural representation that has the advantage of handling heterogeneous resolutions [6]
for our training data generation. TöRF [5] renders phasor images from a volume representation to optimize raw ToF image reconstruction. While efficiently improving NeRF's results and tackling ToF phase wrapping, this approach is not necessary in our context, as our device is not prone to phase wrapping due to its low illumination range (low power) and its use of several modulation frequencies. We also observe that, in the absence of explicit ToF confidence, erroneous ToF measurements tend to be more prevalent in depth maps rendered from a TöRF. Finally, approaches based on ICP registration [18] cannot be
applied directly to our data since depth maps from the low-power ToF module
are too noisy to be registered through ICP.
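As a sketch of how ToF measurements can supervise such a representation, a depth term can be added to the standard volume-rendering color loss; the tensor shapes, loss form, and weighting below are assumptions for illustration, not the exact objective used for our training data generation:

```python
import torch

def tof_supervised_render_loss(weights, t_vals, rgb_pred, rgb_gt,
                               tof_depth, tof_valid, lambda_d=0.1):
    """Add ToF depth supervision to a radiance-field optimization.

    weights:   [R, S] volume-rendering weights per ray sample.
    t_vals:    [R, S] sample distances along each ray.
    rgb_pred:  [R, 3] rendered colors; rgb_gt: [R, 3] observed colors.
    tof_depth: [R] ToF depth reprojected onto the rays; tof_valid: [R] bool mask.
    """
    color_loss = ((rgb_pred - rgb_gt) ** 2).mean()
    # Expected ray termination distance under the rendering weights.
    expected_depth = (weights * t_vals).sum(dim=-1)  # [R]
    depth_err = (expected_depth - tof_depth).abs()
    depth_loss = (depth_err * tof_valid.float()).sum() / tof_valid.float().sum().clamp(min=1)
    return color_loss + lambda_d * depth_loss
```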
3 Method
We use an off-the-shelf Samsung Galaxy S20+ smartphone. It has a main camera with a 12MP color sensor and a magnetically mounted 79° lens for stabilization and focusing, a secondary 12MP color camera with a fixed ultrawide 120° lens, and a 0.3MP ToF system with an infrared camera behind a fixed 78° lens and an infrared emitter (Fig. 1a). As the ultrawide camera and the ToF module are rigidly fixed, we calibrate their intrinsics K_UW and K_ToF, extrinsics [R|t]_UW and [R|t]_ToF, and lens distortion parameters using an offline method based on checkerboard corner estimation. We use a checkerboard with similar absorption in the visible and infrared spectra.
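A minimal OpenCV sketch of this offline step follows; the board dimensions, square size, helper name, and the use of 8-bit ToF amplitude frames as the infrared input are assumptions rather than our exact implementation:

```python
import cv2
import numpy as np

def calibrate_fixed_pair(uw_images, ir_images, board_size=(9, 6), square=0.03):
    """Offline checkerboard calibration of the ultrawide RGB camera and the
    ToF IR camera; inputs are lists of 8-bit grayscale images."""
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2) * square
    obj_pts, uw_pts, ir_pts = [], [], []
    for uw, ir in zip(uw_images, ir_images):
        ok_uw, c_uw = cv2.findChessboardCorners(uw, board_size)
        ok_ir, c_ir = cv2.findChessboardCorners(ir, board_size)
        if ok_uw and ok_ir:  # keep only boards detected in both cameras
            obj_pts.append(objp)
            uw_pts.append(c_uw)
            ir_pts.append(c_ir)
    size_uw = (uw_images[0].shape[1], uw_images[0].shape[0])
    size_ir = (ir_images[0].shape[1], ir_images[0].shape[0])
    # Per-camera intrinsics and distortion.
    _, K_uw, d_uw, _, _ = cv2.calibrateCamera(obj_pts, uw_pts, size_uw, None, None)
    _, K_ir, d_ir, _, _ = cv2.calibrateCamera(obj_pts, ir_pts, size_ir, None, None)
    # Relative pose between the two rigidly mounted cameras.
    _, _, _, _, _, R, t, _, _ = cv2.stereoCalibrate(
        obj_pts, uw_pts, ir_pts, K_uw, d_uw, K_ir, d_ir, size_uw,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return (K_uw, d_uw), (K_ir, d_ir), (R, t)
```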
However, calibrating the floating main camera (subscript FM) is
not possible offline, as its pose changes from snapshot to snapshot. OIS intro-