In contrast, many researchers have aimed to provide a VR view of the remote scene by applying 3D reconstruction techniques
(Ni et al., 2017; Kohn et al., 2018). For example, Kohn et al. (2018) present an approach using an RGB-D camera. As the
main challenge of reconstruction-based methods is the limited communication bandwidth, Kohn et al. (2018) propose
an object recognition pipeline that replaces detected objects with sparse virtual meshes and discards the dense sensor
data. Pohl et al. (2020) use an RGB-D sensor to construct a VR environment for affordance-based manipulation with a humanoid,
while Liu and Shen (2020) and Puljiz et al. (2020) create augmented reality for a drone and a manipulator, respectively.
Pace et al. (2021) conduct a user study and argue that the point clouds of RGB-D sensors are noisy and inaccurate
(with artifacts), which motivates point cloud pre-processing methods for telepresence applications.
Unlike these works, our approach is based on scene graphs (Section 3.2) with pose estimation, which is an alternative to 3D
reconstruction methods. Finally, the main novelties, summarized in Table 1, are the realization of a VR-based
telepresence system for outdoor environments using multiple sensors jointly. No external sensors or pre-generated maps
are used, while the system deals with the specific challenges of a floating-base manipulation system, i.e., the surface that holds the
robotic arm changes constantly over time, thereby inducing motions of the attached sensors.
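To make this representation concrete, the following is a minimal sketch of a scene graph node; the field names, the 4x4 homogeneous pose convention, and the helper method are illustrative assumptions rather than the exact data structures of our system (Section 3.2).

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SceneNode:
    """One object in the scene graph (illustrative; names are assumptions)."""
    name: str                   # semantic label, e.g. "valve_handle"
    mesh_file: str              # lightweight virtual mesh shown in VR
    T_world_object: np.ndarray  # 6D pose as a 4x4 homogeneous transform
    children: list = field(default_factory=list)

    def add_child(self, node: "SceneNode", T_parent_child: np.ndarray) -> None:
        # Child poses are stored relative to the parent, so updating one
        # estimated pose moves the whole subtree in the VR display.
        node.T_world_object = self.T_world_object @ T_parent_child
        self.children.append(node)
```

Transmitting such nodes instead of dense point clouds is what keeps the bandwidth requirements low compared to reconstruction-based methods.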
Object Pose Estimation
One of the crucial components in the proposed framework is object pose estimation.
This is because we utilize a scene graph representation, which requires the 6D poses of objects for creating a 3D display,
as opposed to a 3D reconstruction of the remote site. As the literature is vast, we refer to He et al. (2021) for
a comprehensive survey. In this work, the main novelty is a working solution for the considered application,
tailored towards realizing the proposed VR system. To this end, two scenarios are discussed below: visual
object pose estimation for objects of known geometry, and a LiDAR-based method for objects of unknown geometry.
If the object is known and accessible a priori, one of the most robust solutions is to use fiducial marker systems. Fiducial
markers, which create artificial features in the scene for pose estimation, are widely used in robotics. Typical use-cases
include creating ground truth (Wang and Olson, 2016), operating in known environments (Malyuta et al.,
2020), simplifying the problem in lieu of sophisticated perception (Laiacker et al., 2016), and calibration
and mapping (Nissler et al., 2018). However, as the aim here is real-time VR creation, this use-case imposes
stringent requirements on run-time, inherent time-delays, and robustness. Therefore, we extend
ARToolKitPlus (Wagner and Schmalstieg, 2007) with an on-board visual-inertial SLAM system.
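As a rough illustration of marker-based pose estimation, the sketch below detects square markers and recovers their 6D poses via Perspective-n-Point. It uses OpenCV's ArUco module (4.7+ API) purely as a stand-in for ARToolKitPlus, and the intrinsics and marker size are placeholder values, not parameters of our system.

```python
import cv2
import numpy as np

# Placeholder intrinsics and marker edge length; in practice these come
# from camera calibration and the deployed marker set.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
DIST = np.zeros(5)
MARKER_LEN = 0.10  # meters

# 3D corners of a square marker in its own frame (z = 0 plane),
# ordered to match the detector's corner output.
OBJ_PTS = 0.5 * MARKER_LEN * np.array(
    [[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]], dtype=np.float32)

def marker_poses(gray: np.ndarray) -> dict:
    """Return {marker_id: (rvec, tvec)} poses in the camera frame."""
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())
    corners, ids, _ = detector.detectMarkers(gray)
    poses = {}
    if ids is None:
        return poses
    for marker_id, img_pts in zip(ids.flatten(), corners):
        # PnP from the four known corners yields the 6D marker pose.
        ok, rvec, tvec = cv2.solvePnP(OBJ_PTS, img_pts.reshape(4, 2), K, DIST)
        if ok:
            poses[int(marker_id)] = (rvec, tvec)
    return poses
```

The run-time and robustness requirements mentioned above concern exactly this detection-plus-PnP loop, which must keep up with the VR frame rate under motion blur and changing illumination.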
For LiDAR, point cloud registration is often used for pose estimation. By finding the transformation between the
current scans and a CAD model of an object, we can obtain the 6D pose of the object. Broadly, point cloud registration
algorithms can be classified as local (Park et al., 2017; Rusinkiewicz and Levoy, 2001; Besl and McKay, 1992) or
global (Zhou et al., 2016), and as model-based (Pomerleau et al., 2015) or learning-based (Wang and Solomon, 2019;
Zhang et al., 2020a). As CAD models of objects are often not available in the given industrial scenario, we combine a DNN-based
detector with the idea of LOAM and pose graphs, in order to obtain robust object pose estimates that cope
with occlusions, moving parts, and viewpoint variations in the scene.
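For reference, the sketch below shows one common global-plus-local registration recipe (FPFH features with RANSAC, refined by point-to-plane ICP), assuming the Open3D library (version 0.12+). It illustrates the classical registration baseline discussed above, not our LOAM-based pipeline; the voxel size and thresholds are arbitrary examples.

```python
import open3d as o3d

def estimate_object_pose(model: o3d.geometry.PointCloud,
                         scan: o3d.geometry.PointCloud,
                         voxel: float = 0.02):
    """Align a model cloud to a segmented scan; returns a 4x4 transform."""
    src = model.voxel_down_sample(voxel)
    tgt = scan.voxel_down_sample(voxel)
    for pcd in (src, tgt):
        pcd.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=30))
    feats = [o3d.pipelines.registration.compute_fpfh_feature(
                 pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=10 * voxel,
                                                           max_nn=100))
             for pcd in (src, tgt)]
    # Global stage: feature matching + RANSAC gives a coarse initial guess.
    coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        src, tgt, feats[0], feats[1], True, 3 * voxel,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False), 3,
        [o3d.pipelines.registration.CorrespondenceCheckerBasedOnDistance(3 * voxel)],
        o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))
    # Local stage: point-to-plane ICP refines the coarse alignment.
    fine = o3d.pipelines.registration.registration_icp(
        src, tgt, voxel, coarse.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return fine.transformation  # 6D pose of the model in the scan frame
```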
Active Learning for Neural Networks
Our motivation stems from field robotic applications of DNN-based
object detectors. Here, the need for labeled data can cause overhead in the development process, especially when
considering a long-term deployment of learning systems in outdoor environments. For example, weather conditions
change with the seasons, so labeled data must be created efficiently. Active learning provides a principled way
to reduce manual annotation by explicitly picking data that are worth being labeled. One way to autonomously assess
the "worth" of an unlabeled sample is to use the uncertainty of DNNs. In the past, for robot perception, active
learning frameworks have used random forests, Gaussian processes, etc. (Narr et al., 2016; Mund et al., 2015), while for
DNNs, MacKay (1992) pioneered an active learning approach based on Bayesian neural networks, i.e., probabilistic
or stochastic DNNs (Gawlikowski et al., 2021), which offer a principled method for uncertainty quantification. Recent
works also address active learning for DNN-based object detectors (Choi et al., 2021; Aghdam et al., 2019),
where the focus is on adapting active learning to existing object detection frameworks. These adaptations include new
acquisition functions (or selection criteria) and how uncertainty estimates are generated.
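As a minimal illustration of an uncertainty-based acquisition function, the sketch below ranks unlabeled samples by predictive entropy. The function names are illustrative; real detector pipelines must additionally aggregate per-box uncertainties into an image-level score, which the cited works handle in different ways.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-sample entropy; `probs` has shape (n_samples, n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` most uncertain samples to send for annotation."""
    return np.argsort(-predictive_entropy(probs))[:budget]
```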
For uncertainty quantification in DNNs, so-called Monte-Carlo dropout (MC-dropout; Gal and Ghahramani, 2016) has
recently gained popularity. The main advantage of MC-dropout is that it is relatively easy to use and scales to large
datasets. However, MC-dropout requires a specific stochastic regularization called dropout (Srivastava et al., 2014); a minimal sketch is given below.
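The sketch assumes a PyTorch classifier that already contains dropout layers and returns per-class scores; the helper name and the number of forward passes are illustrative choices, not prescriptions from the cited works.

```python
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor,
                       n_passes: int = 20):
    """Predictive mean and variance from stochastic forward passes."""
    model.eval()
    # Re-enable only the dropout layers; batch-norm statistics stay frozen.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_passes)])
    return samples.mean(dim=0), samples.var(dim=0)
```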
This dropout requirement limits its use on already well-trained architectures, because current DNN-based object detectors are often