UNCERTAINTY-DRIVEN ACTIVE VISION FOR IMPLICIT
SCENE RECONSTRUCTION
Edward J. Smith 1,2   Michal Drozdzal 1
Derek Nowrouzezahrai 1,2   David Meger 2   Adriana Romero-Soriano 1,2
1 Facebook AI Research   2 McGill University
Correspondence to: edward.smith@mail.mcgill.ca
ABSTRACT
Multi-view implicit scene reconstruction methods have become increasingly
popular due to their ability to represent complex scene details. Recent efforts
have been devoted to improving the representation of input information and to
reducing the number of views required to obtain high quality reconstructions. Yet,
perhaps surprisingly, the study of which views to select to maximally improve
scene understanding remains largely unexplored. We propose an uncertainty-
driven active vision approach for implicit scene reconstruction, which leverages
occupancy uncertainty accumulated across the scene using volume rendering to
select the next view to acquire. To this end, we develop an occupancy-based
reconstruction method which accurately represents scenes using either 2D or 3D
supervision. We evaluate our proposed approach on the ABC dataset and the in-the-wild
CO3D dataset, and show that: (1) we are able to obtain high quality, state-of-the-art
occupancy reconstructions; (2) our perspective-conditioned uncertainty definition is
effective in driving improvements in next best view selection and outperforms strong
baseline approaches; and (3) we can further improve
shape understanding by performing a gradient-based search on the view selection
candidates. Overall, our results highlight the importance of view selection for
implicit scene reconstruction, making it a promising avenue to explore further.
1 INTRODUCTION
Recent advances leveraging implicit neural representations have dramatically increased the capacity
for scene understanding (Mescheder et al., 2019; Park et al., 2019). For example, in the space of
neural rendering, a large number of works have focused on devising new methods to better under-
stand scenes with only 2D supervision (Yu et al., 2021; Yariv et al., 2020), with their widespread
adoption in part due to their ability to express far more complex scenes and details than explicit
counterparts such as meshes or voxels (Wang et al., 2018; Xie et al., 2019). Moreover, given the
potential of these functions to recover scene properties such as geometry, lighting and object se-
mantics, they hold the promise of revolutionizing applications in augmented reality, autonomous
driving and robotics. Current state-of-the-art models, however, may require up to a hundred views to
achieve high quality scene reconstructions. Some efforts have been devoted to drastically reducing
this number by leveraging dataset amortization (Yu et al., 2021; Schwarz et al., 2020; Chen et al.,
2021). Surprisingly though, no work has focused on studying the effect of active view selection on
the scene reconstruction quality under a small budget constraint – e.g., using up to five views.
Active view selection methods aim to manipulate the viewpoint of a camera to choose views that
best improve 3D scene understanding (Connolly, 1985). These methods have traditionally leveraged
heuristics such as maximal coverage (Pito, 1999) or information gain (Krainin et al., 2011) without
data priors to iteratively select the next best view to acquire over depth images and the shape they
directly provide. In the context of contemporary reconstruction methods, without leveraging data,
heuristics like coverage or information gain cannot distinguish between unobserved regions of the
scene and regions for which a reconstruction model is uncertain, and as a result acquire views
which may provide no additional scene understanding. Recently we have witnessed the introduction
of data-driven approaches to learn next best view policy models (Vasquez-Gomez et al., 2021) as
well as 3D shape reconstruction models to drive these policies (Yang et al., 2018). These data-
driven models often require additional information such as 3D shape supervision (Yang et al., 2018)
or depth to train (Peralta et al., 2020), which is rarely available, difficult to obtain and no longer a
requirement for any aspect of scene reconstruction when leveraging neural implicit functions (Yu
et al., 2021; Chen et al., 2021).
Figure 1: Uncertainty-driven active vision pipeline. (0) An initial view is fed to (1) the reconstruction model, which produces (2) model outputs: a scene prediction and uncertainty over candidate views; (3) the view with the highest uncertainty is acquired, yielding (4) a new view of the scene.
We propose an uncertainty-driven active vi-
sion approach whose goal is to choose the se-
quence of views which lead to the highest re-
duction in reconstruction uncertainty (see Fig-
ure 1). The proposed approach introduces
an implicit multi-view reconstruction model
to predict occupancy, leverages the occupancy
predictions to estimate uncertainty over un-
seen views, and defines view selection poli-
cies which seek to maximize the observable
model uncertainty. Notably, the contributed
reconstruction model is robust to arbitrary
numbers of input views, and can be trained
by leveraging either full 3D supervision from
occupancy values or 2D supervision from ren-
derings. Moreover, the observable model un-
certainty is estimated by extending the vol-
ume rendering formulation to accumulate pre-
dicted occupancy probabilities along rays cast
into the scene, enabling the search space of possible views to be efficiently explored. We evalu-
ate our proposed active vision approach on the simulated ABC (Koch et al., 2019) dataset as well
as the challenging, in the wild CO3D (Reizenstein et al., 2021) scene dataset, by leveraging up to
5 image perspectives. Our results demonstrate that: (1) our reconstruction model obtains impres-
sive reconstructions which lead to visible improvement over the previous state-of-the-art multi-view
occupancy method on the ABC dataset, and perhaps surprisingly, this improvement persists even
when training with only 2D supervision for larger numbers of input views; (2) our uncertainty-
driven active vision approach achieves notable improvements in shape understanding under volu-
metric and projection-based metrics relative to strong baseline selection policies on both ABC and
CO3D datasets; and (3) by performing a gradient-based search on the view selection candidates,
we can further improve shape understanding. The code to reproduce the experiments is provided:
https://github.com/facebookresearch/Uncertainty-Driven-Active-Vision.
2 RELATED WORKS
Traditional active vision methods for 3D reconstruction, limited by lack of access to contemporary
learning methods and large scale data, generally focus on identifying views to maximize visibility of
unobserved areas of the scene using a range camera (Pito, 1999; Connolly, 1985; Banta et al., 2000).
Connolly (1985) first proposed to determine views in the scene which would maximize the visibility
of unobserved voxels. Many works then focused on reducing the cost of computing coverage metrics
and increasing the number of candidate views considered (Pito, 1999; Blaer & Allen, 2007; Low &
Lastra, 2006; Vasquez-Gomez et al., 2013). Conversely, other methods computed utility scores
over additional factors such as view overlap, scan quality and navigation distance, which can be
optimized to select views (Massios et al., 1998; Fisher & Sanchiz, 1999; Foissotte et al., 2008;
Vasquez-Gomez et al., 2014). More contemporary next best view methods, especially in the context
of robotics, focused on maximizing information gain as opposed to direct view coverage (Sebastian
et al., 2005; Le et al., 2008; Huber et al., 2012; Krainin et al., 2011; Peng et al., 2020), though this
optimization was over models without strong data priors. The absence of data-driven reconstruction
models in these methods results in depth information being necessary for both reconstruction and view selection, coarse shape predictions relative to learning-based approaches, and view selections which cannot reason over learned shape priors.

Figure 2: Our reconstruction method. (X, Y, Z) is the input 3D position in space, {R_i, T_i} is the set of input image camera parameters, and {R_t, T_t} are the target camera parameters. Per-view image encoders with perceptual feature pooling feed decoders conditioned on positional embeddings, and the resulting per-view features are aggregated by a Deep Sets MLP to predict occupancy; the colour branch conditioned on {R_t, T_t} is optional and used only for 2D supervision.
A small number of active vision methods have also been proposed which make use of deep learn-
ing. Mendoza et al. (2020) trained a deep learning classifier to predict which pose out of a set of
discrete options will best improve a generated point cloud, and Vasquez-Gomez et al. (2021) re-
gressed the pose for a camera which would maximize coverage, though both operate over a range
camera with no learned reconstruction model. Most similar to our setting, Peralta et al. (2020) used
reinforcement learning to select optimal paths for an RGB camera, over a pre-trained reconstruction
algorithm. This involved the training of a reinforcement learning policy on top of the reconstruction
algorithm, both of which required ground truth shape depth information for training, whereas our
method requires no additional learning and can be applied with only 2D supervision. Yang et al.
(2018) proposed a data-driven recurrent reconstruction method with a unified view planner, though
the voxel predictions here are coarse and the learning is performed on-policy, with their reconstruction
model biased towards views selected by their policy, and so it is not directly comparable to ours. Finally, in
the similar setting of active haptic perception, Smith et al. (2021) learn where to touch an object next
in order to best understand its shape, over a learned data-driven reconstruction model, also using
reinforcement learning.
3 METHOD
In our active vision approach, the goal is to choose a sequence of views which lead to the highest
reduction in reconstruction uncertainty, and as a result improve the 3D shape reconstruction accuracy.
An overview of the proposed pipeline is depicted in Figure 1: (1) a pre-trained shape reconstruction
model is fed with an object image; (2) the predicted reconstruction is used to estimate the uncertainty
over the unseen object views; (3) the view with the highest uncertainty is acquired and subsequently
fed to the reconstruction model, which is designed to process an arbitrary number of views. We
begin by describing our proposed reconstruction model, which can be trained by leveraging either
full 3D supervision from occupancy values or 2D supervision from renderings. Afterwards, we will
present our proposed uncertainty-driven next best view selection approach.
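To make the overall loop concrete, the following is a minimal Python sketch of the selection procedure described above. It assumes a pre-trained reconstruction model, a discrete set of candidate camera poses, a view_uncertainty function implementing the criterion of Section 3.2, and an acquire function wrapping the camera or renderer; all of these names are illustrative placeholders rather than part of a released interface.

# Illustrative sketch of the uncertainty-driven active vision loop (Figure 1).
# `model`, `view_uncertainty` and `acquire` are assumed placeholders, not a released API.
def active_reconstruction(model, initial_view, candidate_poses, acquire, view_uncertainty, budget=5):
    views = [initial_view]  # list of (image, camera parameters) pairs acquired so far
    for _ in range(budget - 1):
        # Score each candidate pose by the observable model uncertainty (Section 3.2).
        scores = [view_uncertainty(model, views, pose) for pose in candidate_poses]
        best = max(range(len(scores)), key=lambda i: scores[i])
        # Acquire the highest-uncertainty view and condition the model on it as well.
        views.append((acquire(candidate_poses[best]), candidate_poses[best]))
    return views  # the reconstruction is queried from `model` conditioned on these views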
3.1 RECONSTRUCTION MODEL
Our proposed reconstruction model, depicted in Figure 2, is robust to arbitrary numbers of input
views and produces occupancy predictions. The model takes as input a position in space, (X, Y, Z),
a set of K input images v_i and their corresponding camera parameters (R_i, T_i), and produces
an occupancy prediction. In particular, an image encoder extracts features from each input image
through a large VGG-like CNN (Simonyan & Zisserman, 2014) followed by a perceptual feature
pooling operation (Wang et al., 2018). Then, the features of each image are concatenated with a
positional embedding (Mildenhall et al., 2020) of their corresponding camera parameters and the
input position, and are passed through a series of ResNet Blocks (He et al., 2016). The resulting
camera-position-aware features of each image are aggregated using deep set pooling layers (Zaheer
et al., 2017), allowing for permutation invariant aggregation of features from arbitrary numbers of
views. Finally, a sigmoid activation is applied to produce an occupancy prediction.
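As an illustration of the permutation-invariant aggregation step, the following is a minimal PyTorch sketch of a Deep Sets style occupancy head; the layer sizes, the mean-pooling choice and the module name are assumptions for exposition and do not reproduce the exact released architecture.

import torch
import torch.nn as nn

# Minimal Deep Sets style occupancy head (illustrative; not the exact released architecture).
# `view_feats` are per-view features already fused with positional embeddings of the
# query position (X, Y, Z) and the camera parameters (R_i, T_i).
class DeepSetOccupancyHead(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.per_view = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                      nn.Linear(hidden_dim, hidden_dim))
        self.post_pool = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, 1))

    def forward(self, view_feats):                       # view_feats: (batch, num_views, feat_dim)
        pooled = self.per_view(view_feats).mean(dim=1)   # symmetric pooling over the view axis
        return torch.sigmoid(self.post_pool(pooled))     # occupancy probability in [0, 1]

Because the pooling is a symmetric mean over the view axis, the output is invariant to the ordering of the inputs and accepts any number of views.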
Figure 3: Demonstration of the accumulation of uncertainty along a ray. On the left we display the initial input image to the model and in the middle a new perspective into the scene for a ray, with the ray origin labeled in black and samples along the ray labeled in red. In the graph on the right we highlight the per-sample ground truth occupancy values ô(t), predicted occupancy values o(t), and the resulting uncertainty accumulation function values T_u(t) and accumulated uncertainty u(t).
The reconstruction model is trained using full 3D supervision from ground truth occupancy values through a
combination of intersection over union (IoU) and binary cross entropy (BCE) losses.
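The weighting of the two terms is not specified in this preview, so the sketch below simply sums a standard BCE term with a soft, differentiable IoU term; the equal weighting is an assumption.

import torch
import torch.nn.functional as F

def occupancy_loss(pred_occ, gt_occ, eps=1e-6):
    # pred_occ: predicted occupancy probabilities in [0, 1]; gt_occ: ground truth labels {0, 1} (as floats).
    bce = F.binary_cross_entropy(pred_occ, gt_occ)
    # Soft IoU computed on probabilities so the term remains differentiable.
    intersection = (pred_occ * gt_occ).sum()
    union = (pred_occ + gt_occ - pred_occ * gt_occ).sum()
    iou = 1.0 - intersection / (union + eps)
    return bce + iou  # assumed equal weighting of the two terms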
We extend the model to operate without full 3D supervision and to leverage 2D supervision from
renderings. In this case, the model takes the camera parameters of a target view (R_t, T_t) as additional
input. These parameters are embedded (Mildenhall et al., 2020) and concatenated with the predicted
occupancy and intermediate features from the deep set camera-position-aware feature aggregation.
The result of the concatenation is passed through a series of fully connected layers to predict colour
for the target position. The reconstruction model is trained by leveraging a rendering loss. Along a
ray r = r_o + t·d, where r_o is the ray origin and d is the ray viewing direction, we compute a colour value Ĉ(r) by integrating occupancy o(t) and colour c(t) predictions:

Ĉ(r) = ∫_{t_n}^{t_f} T(t) o(t) c(t) dt,    (1)

where t_n and t_f define the range of integration, and T(t) = exp(−∫_{t_n}^{t} o(s) ds) allows accumulation of colour up to occlusions (Oechsle et al., 2021). The models are trained by randomly selecting
between 1 and 5 input images and minimizing the mean squared error (MSE) between predicted
pixel values along rays and ground truth pixel values in a target image. Further training details are
provided in the Appendix and an architecture diagram for the model is provided in Figure 2.
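For reference, a discretized version of Equation 1 along a single ray can be sketched as follows; the sample-spacing handling and variable names are illustrative.

import torch

def render_colour(occ, colour, t):
    # occ: (N,) occupancy predictions o(t_i); colour: (N, 3) colour predictions c(t_i);
    # t: (N,) increasing sample depths between t_n and t_f.
    dt = torch.diff(t, append=t[-1:] + (t[-1] - t[-2]))       # per-sample spacing
    alpha = occ * dt
    # T(t_i) = exp(-sum_{j<i} o(t_j) dt_j): accumulation of colour up to occlusions.
    transmittance = torch.exp(-(torch.cumsum(alpha, dim=0) - alpha))
    weights = transmittance * occ * dt                         # T(t) o(t) dt
    return (weights[:, None] * colour).sum(dim=0)              # estimated pixel colour Ĉ(r)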
3.2 UNCERTAINTY-DRIVEN NEXT BEST VIEW SELECTION
We start by defining occupancy uncertainty from occupancy predictions. Then, we introduce our
proposed view uncertainty computation, and finally we present the uncertainty-driven policies con-
sidered for the task of next best view selection.
3.2.1 OCCUPANCY UNCERTAINTY
Occupancy prediction, as a binary classification task, is determined by applying a threshold to pre-
dicted network probabilities. For our tasks, we set the threshold for predictions at 0.5. The distance
of predicted probabilities from this decision boundary provides implicit model confidence (Watt
et al., 2020; Wu et al., 2018; Zhou et al., 2012). Scaling this value by two provides a normalized
confidence score for model predictions: 2|0.5 − o(t)|; however, it is well known that binary classification probabilities are poorly calibrated with accuracy (Guo et al., 2017), and so we calibrate this confidence score using an exponent: (2|0.5 − o(t)|)^β, where β ∈ ℝ⁺ is a hyper-parameter which either smooths or exaggerates the distance from the decision boundary (see Figure 4). We then define the uncertainty of an occupancy prediction as follows:

u(o(t)) = 1 − (2|o(t) − 0.5|)^β.    (2)
In the Appendix we demonstrate that our occupancy predictions are initially quite poorly calibrated, and
that by identifying the correct value of β we drastically improve the calibration error.
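Concretely, Equation 2 is a one-line transformation of the predicted probabilities; the sketch below assumes occupancies given as a tensor and β as a positive scalar.

import torch

def occupancy_uncertainty(occ, beta):
    # occ: tensor of predicted occupancy probabilities o(t) in [0, 1]; beta > 0 is the calibration exponent.
    confidence = (2.0 * torch.abs(occ - 0.5)) ** beta   # calibrated confidence
    return 1.0 - confidence                             # uncertainty u(o(t)), Equation 2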
Figure 4: Occupancy confidence for different calibration hyper-parameter values.

Figure 5: Visualization of uncertainty sources (silhouette, depth, and full uncertainty; cols. 3-5) for 3 new views (col. 2) after an initial view (col. 1), with high view uncertainty projected onto the current prediction (col. 6).
3.2.2 VIEW UNCERTAINTY
We decompose the observable uncertainty from a given perspective (view) into two sources: silhou-
ette uncertainty and depth uncertainty. Intuitively, for a given ray through a scene, the uncertainty
associated with silhouette prediction addresses the question “does this ray hit an object?”, and if
the ray has been established to hit an object, the uncertainty associated with the depth of occlusion
addresses the question “where does this ray first hit an object?”.
Silhouette Uncertainty. A ray's silhouette prediction, s(r) ∈ [0, 1], can be resolved using the final accumulation value T(t_f) as follows: s(r) = 1 − T(t_f). As this is an occupancy prediction, we leverage Equation 2 to define silhouette uncertainty, u_sil(r), as follows: u_sil(r) = 1 − (2|s(r) − 0.5|)^{λ_s}.
Depth Uncertainty. For the occupancy of a 3D point in space, we again leverage Equation 2:
u_p(t) = 1 − (2|o(t) − 0.5|)^{λ_u}. Then, we seek to accumulate point uncertainty along a ray. In particular,
we aim to accumulate uncertainty indiscriminately up until the model is highly confident that a
surface has been observed, as this point represents where depth uncertainty has been fully resolved.
We therefore update the accumulation function T(t) in Equation 1 to account for uncertainty as follows: T_u(t) = exp(−∫_{t_n}^{t} 1_s |o(s) − 0.5|^{λ_t} ds), where 1_s is the indicator function for o(s) > 0.5, and λ_t ∈ ℝ⁺ is a smoothing hyper-parameter. This ensures the accumulation of uncertainty along the ray is reduced according to the degree to which the model is positively confident an occlusion has been observed, i.e., o(t) > 0.5.
With an uncertainty definition over the scene and an accumulation function to integrate it up to occlusions, we possess the minimum tools to apply volume rendering. However, we also consider that our model predictions are limited in resolution, which may lead to high uncertainty regions at decision boundaries regardless of shape understanding.
To mitigate this issue, we introduce a rate of change correction, d(t) = 1 − (∇_r o(t))^{λ_d}, where ∇_r o(t) ∈ [0, 1] is the directional derivative of the occupancy prediction along the ray r, and λ_d ∈ ℝ⁺ is a smoothing hyper-parameter. Multiplying u_p(t) by d(t) reduces the defined uncertainty at a point when passing through a tight surface decision boundary, as defined by the rate of change of the occupancy.
We rewrite the volume rendering definition highlighted in Equation 1 to account for depth uncer-
tainty along a ray as:
u_depth(r) = ∫_{t_n}^{t_f} T_u(t) d(t) u_p(t) dt.    (3)
In Figure 3, we highlight how predicted occupancy values along a ray through a scene result in
the accumulation of uncertainty. In this example uncertainty is present due to the model’s lack of
confidence in the exact location of the object’s outer shell.
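A discretized sketch of Equation 3 along a single ray is given below, combining the point uncertainty u_p, the uncertainty-aware accumulation T_u and the rate of change correction d(t); the finite-difference stand-in for the directional derivative, its clamping to [0, 1], and the default hyper-parameter values are assumptions.

import torch

def depth_uncertainty(occ, t, lambda_u=1.0, lambda_t=1.0, lambda_d=1.0):
    # occ: (N,) occupancy predictions o(t_i) along a ray; t: (N,) increasing sample depths.
    dt = torch.diff(t, append=t[-1:] + (t[-1] - t[-2]))
    # Point uncertainty u_p(t) = 1 - (2|o(t) - 0.5|)^{lambda_u}.
    u_p = 1.0 - (2.0 * torch.abs(occ - 0.5)) ** lambda_u
    # T_u(t): accumulation decays only where the model is confidently occupied (o > 0.5).
    decay = (occ > 0.5).float() * torch.abs(occ - 0.5) ** lambda_t * dt
    T_u = torch.exp(-(torch.cumsum(decay, dim=0) - decay))
    # Rate of change correction d(t); a clamped finite difference stands in for the directional derivative.
    grad = torch.zeros_like(occ)
    grad[1:] = torch.clamp(torch.abs(occ[1:] - occ[:-1]) / (dt[:-1] + 1e-8), 0.0, 1.0)
    d_corr = 1.0 - grad ** lambda_d
    # Equation 3: u_depth(r) = ∫ T_u(t) d(t) u_p(t) dt, discretized.
    return (T_u * d_corr * u_p * dt).sum()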
View Uncertainty. We define the uncertainty of a perspective as the average accumulated silhouette
and depth uncertainty from rays cast from it:
u(v) = (1/|v|) Σ_{r∈v} (u_sil(r) + λ) u_depth(r),    (4)
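To close the loop, Equation 4 averages the combined per-ray terms over all rays cast from a candidate view. A minimal sketch, reusing the depth_uncertainty function sketched above and treating λ (here lam) and λ_s as free hyper-parameters with assumed default values, could look as follows.

import torch

def view_uncertainty(ray_occ, ray_t, lambda_s=1.0, lam=0.1):
    # ray_occ / ray_t: lists with one (N_i,) occupancy tensor and depth tensor per ray cast from the view.
    total = 0.0
    for occ, t in zip(ray_occ, ray_t):
        dt = torch.diff(t, append=t[-1:] + (t[-1] - t[-2]))
        s = 1.0 - torch.exp(-(occ * dt).sum())                 # silhouette prediction s(r) = 1 - T(t_f)
        u_sil = 1.0 - (2.0 * torch.abs(s - 0.5)) ** lambda_s   # silhouette uncertainty
        u_dep = depth_uncertainty(occ, t)                      # Equation 3 (sketched above)
        total = total + (u_sil + lam) * u_dep                  # Equation 4 summand
    return total / len(ray_occ)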