UNCERTAINTY-DRIVEN ACTIVE VISION FOR IMPLICIT
SCENE RECONSTRUCTION
Edward J. Smith 1,2   Michal Drozdzal 1
Derek Nowrouzezahrai 1,2   David Meger 2   Adriana Romero-Soriano 1,2
1 Facebook AI Research   2 McGill University
Correspondence to: edward.smith@mail.mcgill.ca
ABSTRACT
Multi-view implicit scene reconstruction methods have become increasingly
popular due to their ability to represent complex scene details. Recent efforts
have been devoted to improving the representation of input information and to
reducing the number of views required to obtain high quality reconstructions. Yet,
perhaps surprisingly, the study of which views to select to maximally improve
scene understanding remains largely unexplored. We propose an uncertainty-
driven active vision approach for implicit scene reconstruction, which leverages
occupancy uncertainty accumulated across the scene using volume rendering to
select the next view to acquire. To this end, we develop an occupancy-based
reconstruction method which accurately represents scenes using either 2D or 3D
supervision. We evaluate our proposed approach on the ABC dataset and the in-the-wild
CO3D dataset, and show that: (1) we are able to obtain high quality, state-of-the-art
occupancy reconstructions; (2) our perspective-conditioned uncertainty definition is
effective in driving improvements in next best view selection and outperforms strong
baseline approaches; and (3) we can further improve
shape understanding by performing a gradient-based search on the view selection
candidates. Overall, our results highlight the importance of view selection for
implicit scene reconstruction, making it a promising avenue to explore further.
1 INTRODUCTION
Recent advances leveraging implicit neural representations have dramatically increased the capacity
for scene understanding (Mescheder et al., 2019; Park et al., 2019). For example, in the space of
neural rendering, a large number of works have focused on devising new methods to better under-
stand scenes with only 2D supervision (Yu et al., 2021; Yariv et al., 2020), with their widespread
adoption in part due to their ability to express far more complex scenes and details than explicit
counterparts such as meshes or voxels (Wang et al., 2018; Xie et al., 2019). Moreover, given the
potential of these functions to recover scene properties such as geometry, lighting and object se-
mantics, they hold the promise of revolutionizing applications in augmented reality, autonomous
driving and robotics. Current state-of-the-art models, however, may require up to a hundred views to
achieve high quality scene reconstructions. Some efforts have been devoted to drastically reducing
this number by leveraging dataset amortization (Yu et al., 2021; Schwarz et al., 2020; Chen et al.,
2021). Surprisingly though, no work has focused on studying the effect of active view selection on
the scene reconstruction quality under a small budget constraint – e.g., using up to five views.
Active view selection methods aim to manipulate the viewpoint of a camera to choose views that
best improve 3D scene understanding (Connolly, 1985). These methods have traditionally leveraged
heuristics such as maximal coverage (Pito, 1999) or information gain (Krainin et al., 2011) without
data priors to iteratively select the next best view to acquire over depth images and the shape they
directly provide. In the context of contemporary reconstruction methods, without leveraging data,
heuristics like coverage or information gain cannot distinguish between unobserved regions of the
scene and regions for which a reconstruction model is uncertain, and as a result acquire views
which may provide no additional scene understanding. Recently we have witnessed the introduction
of data-driven approaches to learn next best view policy models (Vasquez-Gomez et al., 2021) as
well as 3D shape reconstruction models to drive these policies (Yang et al., 2018). These data-
driven models often require additional information such as 3D shape supervision (Yang et al., 2018)
or depth to train (Peralta et al., 2020), which is rarely available, difficult to obtain and no longer a
requirement for any aspect of scene reconstruction when leveraging neural implicit functions (Yu
et al., 2021; Chen et al., 2021).
Figure 1: Uncertainty-driven active vision pipeline. (0) An initial view is fed to (1) the reconstruction model, which produces (2) model outputs: a scene prediction and uncertainty over candidate views; (3) the view with the highest uncertainty is acquired, yielding (4) a new view of the scene.
We propose an uncertainty-driven active vi-
sion approach whose goal is to choose the se-
quence of views which lead to the highest re-
duction in reconstruction uncertainty (see Fig-
ure 1). The proposed approach introduces
an implicit multi-view reconstruction model
to predict occupancy, leverages the occupancy
predictions to estimate uncertainty over un-
seen views, and defines view selection poli-
cies which seek to maximize the observable
model uncertainty. Notably, the contributed
reconstruction model is robust to arbitrary
numbers of input views, and can be trained
by leveraging either full 3D supervision from
occupancy values or 2D supervision from ren-
derings. Moreover, the observable model un-
certainty is estimated by extending the vol-
ume rendering formulation to accumulate pre-
dicted occupancy probabilities along rays cast
into the scene, enabling the search space of possible views to be efficiently explored. We evalu-
ate our proposed active vision approach on the simulated ABC (Koch et al., 2019) dataset as well
as the challenging, in the wild CO3D (Reizenstein et al., 2021) scene dataset, by leveraging up to
5 image perspectives. Our results demonstrate that: (1) our reconstruction model obtains impres-
sive reconstructions which lead to visible improvement over the previous state-of-the-art multi-view
occupancy method on the ABC dataset, and perhaps surprisingly, this improvement persists even
when training with only 2D supervision for larger numbers of input views; (2) our uncertainty-
driven active vision approach achieves notable improvements in shape understanding under volu-
metric and projection-based metrics relative to strong baseline selection policies on both ABC and
CO3D datasets; and (3) by performing a gradient-based search on the view selection candidates,
we can further improve shape understanding. The code to reproduce the experiments is provided:
https://github.com/facebookresearch/Uncertainty-Driven-Active-Vision.
2 RELATED WORKS
Traditional active vision methods for 3D reconstruction, limited by lack of access to contemporary
learning methods and large scale data, generally focus on identifying views to maximize visibility of
unobserved areas of the scene using a range camera (Pito, 1999; Connolly, 1985; Banta et al., 2000).
Connolly (1985) first proposed to determine views in the scene which would maximize the visibility
of unobserved voxels. Many works then focused on reducing the cost of computing coverage metrics
and increasing the number of candidate views considered (Pito, 1999; Blaer & Allen, 2007; Low &
Lastra, 2006; Vasquez-Gomez et al., 2013). Conversely, other methods computed utility scores
over additional factors such as view overlap, scan quality and navigation distance, which can be
optimized to select views (Massios et al., 1998; Fisher & Sanchiz, 1999; Foissotte et al., 2008;
Vasquez-Gomez et al., 2014). More contemporary next best view methods, especially in the context
of robotics, focused on maximizing information gain as opposed to direct view coverage (Sebastian
et al., 2005; Le et al., 2008; Huber et al., 2012; Krainin et al., 2011; Peng et al., 2020), though this
optimization was over models without strong data priors. The absence of data-driven reconstruction
models in these methods results in depth information being necessary for both reconstruction and view selection, coarse shape predictions relative to learning-based approaches, and view selections which cannot reason over learned shape priors.

Figure 2: Our reconstruction method. (X, Y, Z) is the input 3D position in space, {R_i, T_i} is the set of input image camera parameters, and {R_t, T_t} are the target camera parameters. Per-view image encoders with perceptual feature pooling feed decoders conditioned on positional embeddings, and the resulting per-view features are aggregated by a Deep Sets MLP to predict occupancy; the colour branch conditioned on {R_t, T_t} is optional and used only for 2D supervision.
A small number of active vision methods have also been proposed which make use of deep learn-
ing. Mendoza et al. (2020) trained a deep learning classifier to predict which pose out of a set of
discrete options will best improve a generated point cloud, and Vasquez-Gomez et al. (2021) re-
gressed the pose for a camera which would maximize coverage, though both operate over a range
camera with no learned reconstruction model. Most similar to our setting, Peralta et al. (2020) used
reinforcement learning to select optimal paths for an RGB camera, over a pre-trained reconstruction
algorithm. This involved the training of a reinforcement learning policy on top of the reconstruction
algorithm, both of which required ground truth shape depth information for training, whereas our
method requires no additional learning and can be applied with only 2D supervision. Yang et al.
(2018) proposed a data-driven recurrent reconstruction method with a unified view planner, though
the voxel predictions here are coarse and the learning is performed on-policy, with their reconstruction
model biased towards views selected by their policy, and so it is not directly comparable to ours. Finally, in
the similar setting of active haptic perception, Smith et al. (2021) learn where to touch an object next
in order to best understand its shape, over a learned data-driven reconstruction model, also using
reinforcement learning.
3 METHOD
In our active vision approach, the goal is to choose a sequence of views which lead to the highest
reduction in reconstruction uncertainty, and as a result improve the 3D shape reconstruction accuracy.
An overview of the proposed pipeline is depicted in Figure 1: (1) a pre-trained shape reconstruction
model is fed with an object image; (2) the predicted reconstruction is used to estimate the uncertainty
over the unseen object views; (3) the view with the highest uncertainty is acquired and subsequently
fed to the reconstruction model, which is designed to process an arbitrary number of views. We
begin by describing our proposed reconstruction model, which can be trained by leveraging either
full 3D supervision from occupancy values or 2D supervision from renderings. Afterwards, we will
present our proposed uncertainty-driven next best view selection approach.
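To make the overall loop concrete, the following is a minimal Python sketch of the selection procedure described above. It assumes a pre-trained reconstruction model, a discrete set of candidate camera poses, a view_uncertainty function implementing the criterion of Section 3.2, and an acquire function wrapping the camera or renderer; all of these names are illustrative placeholders rather than part of a released interface.

# Illustrative sketch of the uncertainty-driven active vision loop (Figure 1).
# `model`, `view_uncertainty` and `acquire` are assumed placeholders, not a released API.
def active_reconstruction(model, initial_view, candidate_poses, acquire, view_uncertainty, budget=5):
    views = [initial_view]  # list of (image, camera parameters) pairs acquired so far
    for _ in range(budget - 1):
        # Score each candidate pose by the observable model uncertainty (Section 3.2).
        scores = [view_uncertainty(model, views, pose) for pose in candidate_poses]
        best = max(range(len(scores)), key=lambda i: scores[i])
        # Acquire the highest-uncertainty view and condition the model on it as well.
        views.append((acquire(candidate_poses[best]), candidate_poses[best]))
    return views  # the reconstruction is queried from `model` conditioned on these views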
3.1 RECONSTRUCTION MODEL
Our proposed reconstruction model, depicted in Figure 2, is robust to arbitrary numbers of input
views and produces occupancy predictions. The model takes as input a position in space, (X, Y, Z),
a set of K input images v_i and their corresponding camera parameters (R_i, T_i), and produces
an occupancy prediction. In particular, an image encoder extracts features from each input image
through a large VGG-like CNN (Simonyan & Zisserman, 2014) followed by a perceptual feature
pooling operation (Wang et al., 2018). Then, the features of each image are concatenated with a
positional embedding (Mildenhall et al., 2020) of their corresponding camera parameters and the
input position, and are passed through a series of ResNet Blocks (He et al., 2016). The resulting
camera-position-aware features of each image are aggregated using deep set pooling layers (Zaheer
et al., 2017), allowing for permutation invariant aggregation of features from arbitrary numbers of
views. Finally, a sigmoid activation is applied to produce an occupancy prediction.
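As an illustration of the permutation-invariant aggregation step, the following is a minimal PyTorch sketch of a Deep Sets style occupancy head; the layer sizes, the mean-pooling choice and the module name are assumptions for exposition and do not reproduce the exact released architecture.

import torch
import torch.nn as nn

# Minimal Deep Sets style occupancy head (illustrative; not the exact released architecture).
# `view_feats` are per-view features already fused with positional embeddings of the
# query position (X, Y, Z) and the camera parameters (R_i, T_i).
class DeepSetOccupancyHead(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.per_view = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                      nn.Linear(hidden_dim, hidden_dim))
        self.post_pool = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, 1))

    def forward(self, view_feats):                       # view_feats: (batch, num_views, feat_dim)
        pooled = self.per_view(view_feats).mean(dim=1)   # symmetric pooling over the view axis
        return torch.sigmoid(self.post_pool(pooled))     # occupancy probability in [0, 1]

Because the pooling is a symmetric mean over the view axis, the output is invariant to the ordering of the inputs and accepts any number of views.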
Figure 3: Demonstration of the accumulation of uncertainty along a ray. On the left we display the initial input image to the model and in the middle a new perspective into the scene for a ray, with the ray origin labeled in black and samples along the ray labeled in red. In the graph on the right we highlight the per-sample ground truth occupancy values ô(t), predicted occupancy values o(t), and the resulting uncertainty accumulation function values T_u(t) and accumulated uncertainty u(t).
The reconstruction model is trained using full 3D supervision from ground truth occupancy values through a
combination of intersection over union (IoU) and binary cross entropy (BCE) losses.
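The weighting of the two terms is not specified in this preview, so the sketch below simply sums a standard BCE term with a soft, differentiable IoU term; the equal weighting is an assumption.

import torch
import torch.nn.functional as F

def occupancy_loss(pred_occ, gt_occ, eps=1e-6):
    # pred_occ: predicted occupancy probabilities in [0, 1]; gt_occ: ground truth labels {0, 1} (as floats).
    bce = F.binary_cross_entropy(pred_occ, gt_occ)
    # Soft IoU computed on probabilities so the term remains differentiable.
    intersection = (pred_occ * gt_occ).sum()
    union = (pred_occ + gt_occ - pred_occ * gt_occ).sum()
    iou = 1.0 - intersection / (union + eps)
    return bce + iou  # assumed equal weighting of the two terms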
We extend the model to operate without full 3D supervision and to leverage 2D supervision from
renderings. In this case, the model takes the camera parameters of a target view (R_t, T_t) as additional
input. These parameters are embedded (Mildenhall et al., 2020) and concatenated with the predicted
occupancy and intermediate features from the deep set camera-position-aware feature aggregation.
The result of the concatenation is passed through a series of fully connected layers to predict colour
for the target position. The reconstruction model is trained by leveraging a rendering loss. Along a
ray r = r_o + t·d, where r_o is the ray origin and d is the ray viewing direction, we compute a colour value Ĉ(r) by integrating occupancy o(t) and colour c(t) predictions:

Ĉ(r) = ∫_{t_n}^{t_f} T(t) o(t) c(t) dt,    (1)

where t_n and t_f define the range of integration, and T(t) = exp(−∫_{t_n}^{t} o(s) ds) allows accumulation of colour up to occlusions (Oechsle et al., 2021). The models are trained by randomly selecting
between 1 and 5 input images and minimizing the mean squared error (MSE) between predicted
pixel values along rays and ground truth pixel values in a target image. Further training details are
provided in the Appendix and an architecture diagram for the model is provided in Figure 2.
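For reference, a discretized version of Equation 1 along a single ray can be sketched as follows; the sample-spacing handling and variable names are illustrative.

import torch

def render_colour(occ, colour, t):
    # occ: (N,) occupancy predictions o(t_i); colour: (N, 3) colour predictions c(t_i);
    # t: (N,) increasing sample depths between t_n and t_f.
    dt = torch.diff(t, append=t[-1:] + (t[-1] - t[-2]))       # per-sample spacing
    alpha = occ * dt
    # T(t_i) = exp(-sum_{j<i} o(t_j) dt_j): accumulation of colour up to occlusions.
    transmittance = torch.exp(-(torch.cumsum(alpha, dim=0) - alpha))
    weights = transmittance * occ * dt                         # T(t) o(t) dt
    return (weights[:, None] * colour).sum(dim=0)              # estimated pixel colour Ĉ(r)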
3.2 UNCERTAINTY-DRIVEN NEXT BEST VIEW SELECTION
We start by defining occupancy uncertainty from occupancy predictions. Then, we introduce our
proposed view uncertainty computation, and finally we present the uncertainty-driven policies con-
sidered for the task of next best view selection.
3.2.1 OCCUPANCY UNCERTAINTY
Occupancy prediction, as a binary classification task, is determined by applying a threshold to pre-
dicted network probabilities. For our tasks, we set the threshold for predictions at 0.5. The distance
of predicted probabilities from this decision boundary provides implicit model confidence (Watt
et al., 2020; Wu et al., 2018; Zhou et al., 2012). Scaling this value by two provides a normalized
confidence score for model predictions: 2|0.5 − o(t)|; however, it is well known that binary classification probabilities are poorly calibrated with accuracy (Guo et al., 2017), and so we calibrate this confidence score using an exponent: (2|0.5 − o(t)|)^β, where β ∈ ℝ⁺ is a hyper-parameter which either smooths or exaggerates the distance from the decision boundary (see Figure 4). We then define the uncertainty of an occupancy prediction as follows:

u(o(t)) = 1 − (2|o(t) − 0.5|)^β.    (2)
In the Appendix we demonstrate that our occupancy predictions are initially quite poorly calibrated, and
that by identifying the correct value of β we drastically improve the calibration error.
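Concretely, Equation 2 is a one-line transformation of the predicted probabilities; the sketch below assumes occupancies given as a tensor and β as a positive scalar.

import torch

def occupancy_uncertainty(occ, beta):
    # occ: tensor of predicted occupancy probabilities o(t) in [0, 1]; beta > 0 is the calibration exponent.
    confidence = (2.0 * torch.abs(occ - 0.5)) ** beta   # calibrated confidence
    return 1.0 - confidence                             # uncertainty u(o(t)), Equation 2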
Figure 4: Occupancy confidence for different calibration hyper-parameter values.

Figure 5: Visualization of uncertainty sources (silhouette, depth, and full uncertainty; cols. 3-5) for 3 new views (col. 2) after an initial view (col. 1), with high view uncertainty projected onto the current prediction (col. 6).
3.2.2 VIEW UNCERTAINTY
We decompose the observable uncertainty from a given perspective (view) into two sources: silhou-
ette uncertainty and depth uncertainty. Intuitively, for a given ray through a scene, the uncertainty
associated with silhouette prediction addresses the question “does this ray hit an object?”, and if
the ray has been established to hit an object, the uncertainty associated with the depth of occlusion
addresses the question “where does this ray first hit an object?”.
Silhouette Uncertainty. A ray's silhouette prediction, s(r) ∈ [0, 1], can be resolved using the final accumulation value T(t_f) as follows: s(r) = 1 − T(t_f). As this is an occupancy prediction, we leverage Equation 2 to define silhouette uncertainty, u_sil(r), as follows: u_sil(r) = 1 − (2|s(r) − 0.5|)^{λ_s}.
Depth Uncertainty. For the occupancy of a 3D point in space, we again leverage Equation 2:
u_p(t) = 1 − (2|o(t) − 0.5|)^{λ_u}. Then, we seek to accumulate point uncertainty along a ray. In particular,
we aim to accumulate uncertainty indiscriminately up until the model is highly confident that a
surface has been observed, as this point represents where depth uncertainty has been fully resolved.
We therefore update the accumulation function T(t) in Equation 1 to account for uncertainty as follows: T_u(t) = exp(−∫_{t_n}^{t} 1_s |o(s) − 0.5|^{λ_t} ds), where 1_s is the indicator function for o(s) > 0.5, and λ_t ∈ ℝ⁺ is a smoothing hyper-parameter. This ensures the accumulation of uncertainty along the ray is reduced according to the degree to which the model is positively confident an occlusion has been observed, i.e., o(t) > 0.5.
With an uncertainty definition over the scene and an accumulation function to integrate it up to occlusions, we possess the minimum tools to apply volume rendering. However, we also consider that our model predictions are limited in resolution, which may lead to high uncertainty regions at decision boundaries regardless of shape understanding.
To mitigate this issue, we introduce a rate of change correction, d(t) = 1 − (∇_r o(t))^{λ_d}, where ∇_r o(t) ∈ [0, 1] is the directional derivative of the occupancy prediction along the ray r, and λ_d ∈ ℝ⁺ is a smoothing hyper-parameter. Multiplying u_p(t) by d(t) reduces the defined uncertainty at a point when passing through a tight surface decision boundary, as defined by the rate of change of the occupancy.
We rewrite the volume rendering definition highlighted in Equation 1 to account for depth uncer-
tainty along a ray as:
u_depth(r) = ∫_{t_n}^{t_f} T_u(t) d(t) u_p(t) dt.    (3)
In Figure 3, we highlight how predicted occupancy values along a ray through a scene result in
the accumulation of uncertainty. In this example uncertainty is present due to the model’s lack of
confidence in the exact location of the object’s outer shell.
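A discretized sketch of Equation 3 along a single ray is given below, combining the point uncertainty u_p, the uncertainty-aware accumulation T_u and the rate of change correction d(t); the finite-difference stand-in for the directional derivative, its clamping to [0, 1], and the default hyper-parameter values are assumptions.

import torch

def depth_uncertainty(occ, t, lambda_u=1.0, lambda_t=1.0, lambda_d=1.0):
    # occ: (N,) occupancy predictions o(t_i) along a ray; t: (N,) increasing sample depths.
    dt = torch.diff(t, append=t[-1:] + (t[-1] - t[-2]))
    # Point uncertainty u_p(t) = 1 - (2|o(t) - 0.5|)^{lambda_u}.
    u_p = 1.0 - (2.0 * torch.abs(occ - 0.5)) ** lambda_u
    # T_u(t): accumulation decays only where the model is confidently occupied (o > 0.5).
    decay = (occ > 0.5).float() * torch.abs(occ - 0.5) ** lambda_t * dt
    T_u = torch.exp(-(torch.cumsum(decay, dim=0) - decay))
    # Rate of change correction d(t); a clamped finite difference stands in for the directional derivative.
    grad = torch.zeros_like(occ)
    grad[1:] = torch.clamp(torch.abs(occ[1:] - occ[:-1]) / (dt[:-1] + 1e-8), 0.0, 1.0)
    d_corr = 1.0 - grad ** lambda_d
    # Equation 3: u_depth(r) = ∫ T_u(t) d(t) u_p(t) dt, discretized.
    return (T_u * d_corr * u_p * dt).sum()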
View Uncertainty. We define the uncertainty of a perspective as the average accumulated silhouette
and depth uncertainty from rays cast from it:
u(v) = (1/|v|) Σ_{r∈v} (u_sil(r) + λ) u_depth(r),    (4)
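To close the loop, Equation 4 averages the combined per-ray terms over all rays cast from a candidate view. A minimal sketch, reusing the depth_uncertainty function sketched above and treating λ (here lam) and λ_s as free hyper-parameters with assumed default values, could look as follows.

import torch

def view_uncertainty(ray_occ, ray_t, lambda_s=1.0, lam=0.1):
    # ray_occ / ray_t: lists with one (N_i,) occupancy tensor and depth tensor per ray cast from the view.
    total = 0.0
    for occ, t in zip(ray_occ, ray_t):
        dt = torch.diff(t, append=t[-1:] + (t[-1] - t[-2]))
        s = 1.0 - torch.exp(-(occ * dt).sum())                 # silhouette prediction s(r) = 1 - T(t_f)
        u_sil = 1.0 - (2.0 * torch.abs(s - 0.5)) ** lambda_s   # silhouette uncertainty
        u_dep = depth_uncertainty(occ, t)                      # Equation 3 (sketched above)
        total = total + (u_sil + lam) * u_dep                  # Equation 4 summand
    return total / len(ray_occ)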