datasets. A majority of ToF-related works leverage rendered data [32,17], par-
ticularly for fusion datasets [1,2]. These datasets enable improvement over con-
ventional approaches, but synthesizing RGB and ToF images accurately is chal-
lenging. A domain gap is introduced as the noise profile and imaging artifacts
are different from the training data. Notable exceptions are Son et al. [44], and
Gao and Fan et al. [39], where an accurate depth camera provides training data
for a lower-quality ToF module. The acquisition is partially automated thanks
to a robotic arm. However, this bulky setup limits the variety of the scenes:
all scenes are captured on the same table, with similar backgrounds across the
dataset. In addition, the use of a single depth camera at a different location from
the ToF module introduces occlusion, with some areas in the ToF image having
no supervision. Moreover, this method tackles only ToF depth estimation, and
the dataset does not feature RGB images.
Multiview Geometry Estimation. Several approaches are capable of ac-
curate depth estimation from multiview images [42], even in dynamic environ-
ments [26,31,24]. Despite their accuracy, incorporating ToF data into these approaches is not straightforward. Scene representations optimized from a set of images [6,21,4] have
recently shown good novel view synthesis and scene geometry reconstruction, in-
cluding to refine depth estimates in the context of multiview stereo [51]. Since the optimization can accept supervision from varied sources, incorporating ToF measurements is straightforward, as sketched below. For this reason, we select a state-of-the-art neural representation that has the advantage of handling heterogeneous resolutions [6] for our training data generation.
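To make this concrete, the following is a minimal sketch of how ToF supervision can enter such an optimization as an extra loss term; the function and variable names (tof_supervised_loss, tof_mask, lambda_tof) are illustrative assumptions, not the cited methods' actual interfaces.

```python
import torch

def tof_supervised_loss(rgb_pred, rgb_gt, depth_pred, tof_depth, tof_mask,
                        lambda_tof=0.1):
    """Photometric loss plus a ToF depth term over a batch of sampled rays.

    rgb_pred, rgb_gt: (N, 3) rendered and observed colors.
    depth_pred:       (N,) expected ray termination depth from the volume.
    tof_depth:        (N,) depth measured by the ToF module for those rays.
    tof_mask:         (N,) bool, True where the ToF measurement is valid.
    """
    photometric = torch.mean((rgb_pred - rgb_gt) ** 2)
    # Penalize only rays with a valid ToF measurement; the mask stands in
    # for an explicit confidence, which raw ToF modules rarely provide.
    residual = (depth_pred - tof_depth)[tof_mask]
    depth_term = (residual.pow(2).mean() if residual.numel() > 0
                  else depth_pred.new_zeros(()))
    return photometric + lambda_tof * depth_term
```

Because both terms are evaluated per sampled ray, the color and ToF images may have different resolutions, which is one reason a representation that handles heterogeneous resolutions is convenient here.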
TöRF [5] renders phasor images from a volume representation to optimize raw ToF image reconstruction. While it efficiently improves NeRF's results and tackles ToF phase wrapping, this approach is not necessary in our context: our device is not prone to phase wrapping, owing to its low illumination range (low power) and to its use of several modulation frequencies.
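For context, a continuous-wave ToF camera modulated at frequency $f$ measures a phase $\varphi \in [0, 2\pi)$ that determines depth only up to an integer number of wraps $k$:
\[
d = \frac{c\,\varphi}{4\pi f} + k\,\frac{c}{2f}, \qquad k \in \mathbb{Z},
\]
so the unambiguous range is $c/(2f)$, about 7.5 m for an illustrative $f = 20$ MHz (not the module's actual specification). A low-power emitter rarely yields usable returns beyond such a range, and combining several modulation frequencies resolves $k$ where it does.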
We also observe that, in the absence of explicit ToF confidence, erroneous ToF measurements tend to be more prevalent in depth maps rendered from a TöRF. Finally, approaches based on ICP registration [18] cannot be applied directly to our data, since depth maps from the low-power ToF module are too noisy for reliable registration.
3 Method
We use an off-the-shelf Samsung Galaxy S20+ smartphone. It has a main camera with a 12MP color sensor and a magnetically mounted 79° lens used for stabilization and focusing, a secondary 12MP color camera with a fixed 120° ultrawide lens, and a 0.3MP ToF system comprising an infrared camera with a fixed 78° lens and an infrared emitter (Fig. 1a). As the ultrawide camera and the ToF module are rigidly fixed,
we calibrate their intrinsics $K_\mathrm{UW}$, $K_\mathrm{ToF}$, extrinsics $[R|t]_\mathrm{UW}$, $[R|t]_\mathrm{ToF}$, and lens distortion parameters using an offline method based on checkerboard corner estimation. We use a checkerboard with similar absorption in the visible spectrum and in infrared, so that its corners are detectable by both the color and ToF cameras.
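As an illustration of this offline step, here is a minimal OpenCV-based sketch; the board dimensions, square size, and flags are assumptions, not our exact pipeline.

```python
import cv2
import numpy as np

BOARD = (9, 6)    # inner corners per row/column (assumed)
SQUARE = 0.025    # checkerboard square size in meters (assumed)

# 3D corner positions in the board frame, shared by all views.
OBJ = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
OBJ[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

def calibrate_pair(uw_images, tof_images):
    """Intrinsics, distortion, and relative pose from paired grayscale
    views of the same checkerboard (ultrawide color vs. ToF IR)."""
    objpts, pts_uw, pts_tof = [], [], []
    crit = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-4)
    for uw, ir in zip(uw_images, tof_images):
        ok_uw, c_uw = cv2.findChessboardCorners(uw, BOARD)
        ok_ir, c_ir = cv2.findChessboardCorners(ir, BOARD)
        if not (ok_uw and ok_ir):
            continue  # keep only views where both cameras see the board
        pts_uw.append(cv2.cornerSubPix(uw, c_uw, (5, 5), (-1, -1), crit))
        pts_tof.append(cv2.cornerSubPix(ir, c_ir, (5, 5), (-1, -1), crit))
        objpts.append(OBJ)
    # Per-camera intrinsics K and distortion coefficients d.
    _, K_uw, d_uw, _, _ = cv2.calibrateCamera(
        objpts, pts_uw, uw_images[0].shape[::-1], None, None)
    _, K_tof, d_tof, _, _ = cv2.calibrateCamera(
        objpts, pts_tof, tof_images[0].shape[::-1], None, None)
    # Relative pose [R|t] between the rigidly mounted cameras.
    _, _, _, _, _, R, t, _, _ = cv2.stereoCalibrate(
        objpts, pts_uw, pts_tof, K_uw, d_uw, K_tof, d_tof,
        uw_images[0].shape[::-1], flags=cv2.CALIB_FIX_INTRINSIC)
    return K_uw, d_uw, K_tof, d_tof, R, t
```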
However, calibrating the floating main camera (subscript FM) is not possible offline, as its pose changes from snapshot to snapshot. OIS intro-