ses with the lowest loss. This design alleviates the issue
of convergence to local minima and improves efficiency of
search over a more extensive search space.
The gradients of pixel residuals calculated between the
rendered model and the target view are backpropagated
to generate camera pose updates. Unlike iNeRF, where a subsample of a new image is rendered at each iteration, we enable hundreds of thousands of rays to work independently in parallel to accumulate gradient-descent updates per camera pose hypothesis. This design dramatically improves efficiency. Furthermore, we investigate different pixel-based loss
functions to identify which measure of the visual difference between the rendered model and the observed target image best informs the camera pose updates. As shown in the ablation study, the mean absolute percentage error (MAPE) [18] loss exhibits better robustness to disturbances than the alternatives.
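For concreteness, the sketch below shows this parallel per-hypothesis update with a MAPE loss in PyTorch. It is a minimal illustration under assumed names and constants, not our actual implementation: render_rays is a dummy stand-in for a differentiable NeRF renderer, kept trivial so the sketch runs end to end.

import torch

def mape_loss(pred, target, eps=1e-2):
    # Mean absolute percentage error between rendered and observed
    # pixel colors; eps guards against division by zero in dark pixels.
    return (torch.abs(pred - target) / (torch.abs(target) + eps)).mean()

def render_rays(pose_params, num_rays):
    # Stand-in for a differentiable NeRF renderer: a real one would
    # cast num_rays rays per pose hypothesis through the radiance field.
    base = pose_params.sum(-1).reshape(-1, 1, 1)               # (H, 1, 1)
    return torch.sigmoid(base + torch.zeros(pose_params.shape[0], num_rays, 3))

num_hypotheses, num_rays = 64, 2048
# One learnable 6-DoF vector (rotation + translation) per hypothesis.
poses = torch.nn.Parameter(0.1 * torch.randn(num_hypotheses, 6))
optimizer = torch.optim.SGD([poses], lr=1e-2, momentum=0.9)
target_rgb = torch.rand(num_rays, 3)                           # observed pixels

for step in range(100):
    optimizer.zero_grad()
    # Rays of all hypotheses render in one batch, so gradient-descent
    # updates accumulate independently and in parallel per hypothesis.
    pred_rgb = render_rays(poses, num_rays)                    # (H, R, 3)
    loss = mape_loss(pred_rgb, target_rgb.expand_as(pred_rgb))
    loss.backward()
    optimizer.step()

Because each hypothesis's rays contribute gradients only to that hypothesis's own pose parameters, all hypotheses can be updated simultaneously on the GPU.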
In summary, this work makes the following contributions:
• A parallelized, momentum-based optimization method using NeRF models is proposed to estimate 6-DoF poses from monocular RGB input. The object-specific NeRF model does not require pre-training on large datasets.
• Parallelized Monte Carlo sampling is introduced into the pose estimation task, and we show the importance of pixel-based loss function selection for robustness.
• Quantitative demonstration through synthetic and real-world benchmarks that the proposed method has improved generalization and robustness.
II. RELATED WORKS
Neural 3D Scene Representation. Recent works [19]–
[21] have investigated representing 3D scenes implicitly with
neural networks, where coordinates are sampled and fed into
a neural network to produce physical field values across
space and time [22]. NeRF [15] is a milestone approach
demonstrating that neural scene representations are capable of synthesizing photo-realistic views. Since then,
significant effort has been put into pushing the boundaries
of NeRF. Follow-up works have focused on speeding up the
training and inference processes [17], [23], [24], adding support for relighting [25], relaxing the requirement of known
camera poses [26], [27], reducing the number of training
images [28], extending to dynamic scenes [29], and so on.
NeRF also opens up opportunities in the robotics community.
Researchers have proposed to use it to represent scenes for
visuomotor control [30], reconstruct transparent objects [31],
generate training data for pose estimators [32] or dense object
descriptors [33], and model 3D object categories [34]. In this
work, we aim to follow in their footsteps by applying NeRF
directly to the 6-DoF pose estimation task.
Generalizable 6-DoF Pose Estimation. Generalizable
6-DoF pose estimation—not limited to any specific target
or category—from RGB images has been a long-standing
problem in the community. Existing methods tend to share a similar two-phase pipeline: 1) model registration and 2) pose estimation.
Traditional methods [35]–[38] first build a 3D CAD
model via commercial scanners or dense 3D reconstruction
techniques [39], [40]. They resolve the pose by finding
2D-3D correspondences (via hand-designed features like
SIFT [41] or ORB [42]) between the input RGB image
and the registered model. However, creating high-quality 3D models is not easy, and finding correspondences across a
large database (renderings from different viewpoints) can be
time-consuming [35]. More recently, several attempts have
been made to revisit the object-agnostic pose estimation
problem with deep learning. The presumption is that a deep
network pretrained on a large dataset can generalize to find
correspondence between the query image and the registered
model for novel objects. OnePose [43], inspired by visual localization research, proposes to use a graph attention network
to aggregate 2D features from different views during the
registration phase of structure-from-motion [44]. Then the
aggregated 3D descriptor is matched with 2D features from
the query view to solve the PnP problem [45]. Similarly,
OSOP [46] explores solving the PnP problem with a dense
correspondence between the query image and the coordinate
map from a pre-built 3D CAD model. On the other hand,
Gen6D [47] only needs to register the model with a set
of posed images. Following the iterative template matching
idea [48], [49], its network takes as input several neighboring
registered images closest to the predicted pose and repeatedly
refines the result.
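As a concrete illustration of the correspondence-then-PnP step these pipelines share, the sketch below recovers a pose from 2D-3D matches with OpenCV. The correspondences, intrinsics, and image size are made-up placeholders for whatever the matching stage (hand-crafted or learned) produces.

import cv2
import numpy as np

# Hypothetical 2D-3D correspondences: points on the registered 3D
# model (object frame) matched to pixels in the query image. In the
# pipelines above these come from SIFT/ORB or learned descriptors.
object_points = np.random.rand(50, 3).astype(np.float32)
image_points = (np.random.rand(50, 2) * [640, 480]).astype(np.float32)

# Assumed pinhole intrinsics of the query camera.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)  # assume an undistorted image

# RANSAC-based PnP rejects outlier matches while solving for the
# object's 6-DoF pose in the camera frame.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, dist_coeffs)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 matrix
    print("R:\n", R, "\nt:", tvec.ravel())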
While data-driven approaches rely on the generalization of
a large training dataset (usually composed of both synthetic
& real-world data) [47], iNeRF [16] is an on-the-fly optimization approach free of pretraining. Each new object is
first registered by a NeRF model [15], after which iNeRF
can optimize the camera pose on the synthesized photo-
realistic renderings from NeRF. Although iNeRF’s idea
seems promising, there still remain several challenges. The
first is the expensive training cost of a NeRF model, which
may take hours for just one target. Additionally, iNeRF’s pose update strategy is inefficient: the loss gradient is accumulated and backpropagated only after a subsample of a new image has been rendered. Moreover, the
optimization process of a single pose hypothesis is easily
trapped in local minima due to outliers. To deal with the
aforementioned issues, we propose a more efficient and
robust approach leveraging the recent success of Instant
NGP [17]. We re-formulate the camera pose representation as the Cartesian product SO(3) × T(3) and integrate the optimization process into the structure of Instant NGP. We also adopt parallelized Monte Carlo sampling to improve robustness to local minima. Loc-NeRF [50] is a concurrent work that also uses Monte Carlo sampling to improve iNeRF.
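For illustration, one parallelized Monte Carlo resampling step might look like the minimal sketch below; the helper name, keep ratio, and noise scale are assumptions, and the per-hypothesis losses would come from the pixel-based loss described earlier.

import torch

def resample_hypotheses(poses, losses, keep_ratio=0.25, noise=0.05):
    # Keep the pose hypotheses with the lowest loss and respawn the
    # rest as perturbed copies of the survivors (one Monte Carlo step).
    num = poses.shape[0]
    keep = max(1, int(num * keep_ratio))
    best = torch.argsort(losses)[:keep]      # lowest-loss hypotheses
    survivors = poses[best]
    # Respawn the discarded hypotheses near randomly chosen survivors.
    idx = torch.randint(keep, (num - keep,))
    respawned = survivors[idx] + noise * torch.randn(num - keep, poses.shape[1])
    return torch.cat([survivors, respawned], dim=0)

# Usage with stand-in per-hypothesis losses from the renderer:
poses = 0.3 * torch.randn(128, 6)            # 6-DoF hypotheses
losses = torch.rand(128)                     # e.g., per-hypothesis MAPE
poses = resample_hypotheses(poses, losses)

Surviving low-loss hypotheses continue to be optimized, while the respawned ones re-explore the pose space around them.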
III. PRELIMINARIES
NeRF. Given a collection of $N$ RGB images $\{I_i\}_{i=1}^{N}$, $I_i \in [0,1]^{H \times W \times 3}$, with known camera poses $\{T_i\}_{i=1}^{N}$, NeRF [15] learns to represent a scene as a 5D neural radiance field (over spatial location and viewing direction). It can synthesize novel views by querying 5D coordinates along the camera rays and using classic volume rendering techniques to project the output colors and densities into an image.
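In the standard discrete form from [15], the color of a camera ray $\mathbf{r}$ is composited from $K$ samples along the ray with predicted densities $\sigma_k$, colors $\mathbf{c}_k$, and inter-sample distances $\delta_k$:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} T_k \left(1 - e^{-\sigma_k \delta_k}\right)\mathbf{c}_k, \qquad T_k = \exp\Big(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Big),$$

where $T_k$ is the transmittance accumulated up to the $k$-th sample.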