
heuristics like coverage or information gain cannot distinguish between unobserved regions of the
scene and regions for which a reconstruction model is uncertain, and as a result acquire views
which may provide no additional scene understanding. Recently, data-driven approaches have been
introduced both to learn next-best-view policy models (Vasquez-Gomez et al., 2021) and to build 3D
shape reconstruction models that drive these policies (Yang et al., 2018). These data-driven models
often require additional supervision to train, such as 3D shape supervision (Yang et al., 2018) or
depth (Peralta et al., 2020), which is rarely available, difficult to obtain, and no longer required for
any aspect of scene reconstruction when leveraging neural implicit functions (Yu et al., 2021; Chen
et al., 2021).
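As a minimal illustration of what a neural implicit function looks like in this setting, the sketch below maps 3D query points to occupancy probabilities with a tiny MLP. This is an untrained toy with random weights, not any of the cited models; a real reconstruction model would condition on image features and be trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit occupancy function: an MLP mapping a 3D point to an occupancy
# probability in [0, 1]. Weights are random purely to illustrate the shapes.
W1, b1 = rng.standard_normal((3, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 1)), np.zeros(1)

def occupancy(points):
    """points: (n, 3) array of query locations -> (n,) occupancy probabilities."""
    h = np.maximum(points @ W1 + b1, 0.0)                   # ReLU hidden layer
    return (1.0 / (1.0 + np.exp(-(h @ W2 + b2))))[:, 0]     # sigmoid output
```

Because the function is queried pointwise, the scene can be evaluated at arbitrary resolution without storing an explicit voxel grid.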
Figure 1: Uncertainty-driven active vision pipeline: (0) initial view → (1) reconstruction model →
(2) model outputs (scene prediction; uncertainty over views) → (3) acquire new view (the highest-
uncertainty view) → (4) new view of scene.
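The loop in Figure 1 can be sketched as follows. All interfaces here are hypothetical placeholders: `reconstruction_model(images)` is assumed to return a scene prediction plus a per-candidate-view uncertainty score, and `scene.render(view)` stands in for acquiring an image from a pose.

```python
import numpy as np

def active_vision_loop(reconstruction_model, scene, candidate_views, n_views=5):
    """Uncertainty-driven view selection, following Figure 1 (hypothetical API)."""
    images = [scene.render(candidate_views[0])]                 # (0) initial view
    for _ in range(n_views - 1):
        prediction, uncertainty = reconstruction_model(images)  # (1)-(2) predict scene
        best = int(np.argmax(uncertainty))                      # (3) highest-uncertainty view
        images.append(scene.render(candidate_views[best]))      # (4) acquire new view
    return reconstruction_model(images)[0]                      # final scene prediction
```

Greedy selection of the highest-uncertainty candidate at each step keeps the loop simple; each acquired view is fed back into the reconstruction model before the next choice.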
We propose an uncertainty-driven active vision approach whose goal is to choose the sequence of
views which leads to the greatest reduction in reconstruction uncertainty (see Figure 1). The
proposed approach introduces an implicit multi-view reconstruction model to predict occupancy,
leverages the occupancy predictions to estimate uncertainty over unseen views, and defines view
selection policies which seek to maximize the observable model uncertainty. Notably, the
contributed reconstruction model is robust to arbitrary numbers of input views and can be trained
with either full 3D supervision from occupancy values or 2D supervision from renderings.
Moreover, the observable model uncertainty is estimated by extending the volume rendering
formulation to accumulate predicted occupancy probabilities along rays cast into the scene,
enabling the search space of possible views to be explored efficiently. We evaluate our proposed
active vision approach on the simulated ABC (Koch et al., 2019) dataset as well as the challenging,
in-the-wild CO3D (Reizenstein et al., 2021) scene dataset, leveraging up to 5 image perspectives.
Our results demonstrate that: (1) our reconstruction model produces high-quality reconstructions,
improving visibly over the previous state-of-the-art multi-view occupancy method on the ABC
dataset, and, perhaps surprisingly, this improvement persists even when training with only 2D
supervision for larger numbers of input views; (2) our uncertainty-driven active vision approach
achieves notable improvements in shape understanding under volumetric and projection-based
metrics relative to strong baseline selection policies on both the ABC and CO3D datasets; and
(3) performing a gradient-based search over the view selection candidates further improves shape
understanding. The code to reproduce the experiments is provided:
https://github.com/facebookresearch/Uncertainty-Driven-Active-Vision.
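Concretely, the per-ray accumulation of predicted occupancy can be sketched as follows. The entropy-of-accumulated-occupancy score is our illustrative stand-in for the view uncertainty estimator, not necessarily the paper's exact formulation; `ray_uncertainty` and `view_uncertainty` are names introduced here for the sketch.

```python
import numpy as np

def ray_uncertainty(occupancy_probs):
    """Accumulate predicted occupancy along one ray, volume-rendering style.

    occupancy_probs: (n_samples,) per-sample occupancy in [0, 1], near-to-far.
    Returns the binary entropy of the accumulated hit probability, so rays
    whose outcome the model is unsure about score highest.
    """
    p = np.clip(occupancy_probs, 1e-6, 1 - 1e-6)
    # Transmittance: probability the ray reaches each sample unoccluded.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - p[:-1]]))
    hit_prob = np.clip(np.sum(transmittance * p), 1e-6, 1 - 1e-6)
    return -(hit_prob * np.log(hit_prob) + (1 - hit_prob) * np.log(1 - hit_prob))

def view_uncertainty(rays_occupancy):
    """Mean per-ray uncertainty over all rays cast from a candidate view."""
    return float(np.mean([ray_uncertainty(r) for r in rays_occupancy]))
```

Because scoring a candidate view only requires querying the implicit model along its rays, many candidates can be evaluated without acquiring any new images.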
2 RELATED WORK
Traditional active vision methods for 3D reconstruction, limited by a lack of access to contemporary
learning methods and large-scale data, generally focus on identifying views which maximize visibility
of unobserved areas of the scene using a range camera (Pito, 1999; Connolly, 1985; Banta et al., 2000).
Connolly (1985) first proposed to determine views in the scene which would maximize the visibility
of unobserved voxels. Many works then focused on reducing the cost of computing coverage metrics
and increasing the number of candidate views considered (Pito, 1999; Blaer & Allen, 2007; Low &
Lastra, 2006; Vasquez-Gomez et al., 2013). Conversely, other methods computed utility scores
over additional factors, such as view overlap, scan quality, and navigation distance, which could
then be optimized to select views (Massios et al., 1998; Fisher & Sanchiz, 1999; Foissotte et al., 2008;
Vasquez-Gomez et al., 2014). More contemporary next best view methods, especially in the context
of robotics, focused on maximizing information gain as opposed to direct view coverage (Sebastian
et al., 2005; Le et al., 2008; Huber et al., 2012; Krainin et al., 2011; Peng et al., 2020), though this
optimization was performed over models without strong data priors. The absence of data-driven
reconstruction models in these methods means that depth information is required for both reconstruction and