Parallel Inversion of Neural Radiance Fields for Robust Pose Estimation
Yunzhi Lin1,2, Thomas Müller1, Jonathan Tremblay1, Bowen Wen1, Stephen Tyree1,
Alex Evans1, Patricio A. Vela2, Stan Birchfield1
1NVIDIA: {tmueller, jtremblay, bowenw, styree, alexe, sbirchfield}@nvidia.com
2Georgia Institute of Technology: {yunzhi.lin, pvela}@gatech.edu
Fig. 1. Our NeRF-based parallelized optimization method estimates the camera pose from a monocular RGB image of a novel object. The optimization
iteratively updates a set of pose estimates in parallel by backpropagating discrepancies between the observed and rendered image. LEFT: To simplify the
display, we show four camera hypotheses at three iterations: initial (red), after the first resample (blue), and final (green). RIGHT: Corresponding renderings
of estimated poses (in color) overlaid on the observed image (grayscale).
Abstract: We present a parallelized optimization method
based on fast Neural Radiance Fields (NeRF) for estimating the
6-DoF pose of a camera with respect to an object or scene. Given
a single observed RGB image of the target, we can predict
the translation and rotation of the camera by minimizing the
residual between pixels rendered from a fast NeRF model and
pixels in the observed image. We integrate a momentum-based
camera extrinsic optimization procedure into Instant Neural
Graphics Primitives, a recent and exceptionally fast NeRF
implementation. By introducing parallel Monte Carlo sampling into
the pose estimation task, our method overcomes local minima
and improves efficiency in a more extensive search space. We
also show the importance of adopting a more robust pixel-based
loss function to reduce error. Experiments demonstrate that our
method can achieve improved generalization and robustness on
both synthetic and real-world benchmarks.
I. INTRODUCTION
6-DoF pose estimation—predicting the 3D position and
orientation of a camera with respect to an object or scene—
is a fundamental step for many tasks, including some in robot
manipulation and augmented reality. While RGB-D or point
cloud-based methods [1]–[4] have received much attention,
monocular RGB-only approaches [5], [6] have great potential
for wider applicability and for handling certain material prop-
erties that are difficult for depth sensors—such as transparent
or dark surfaces.
Much research in this area has focused on instance-level
object pose estimation [7]–[11]. These methods assume that
a textured 3D model of the target object is available for
training. Such methods have achieved success under different
scenarios but suffer from a lack of scalability. Research
beyond this limitation considers category-level object pose
estimation [6], [12]–[14]. These methods scale better for
real-world applications, since a single trained model works
for a variety of object instances within a known category.
Nevertheless, the effort needed to define and train a model
for each category remains a limitation. For more widespread
generalizability, it is important to be able to easily estimate
poses for arbitrary objects.
Project page: https://pnerfp.github.io
The emergence of Neural Radiance Fields (NeRF) [15] has
the potential to facilitate novel object pose estimation. NeRF
and its variants learn generative models of objects from pose-
annotated image collections, capturing complex 3D structure
and high-fidelity surface details. Recently, iNeRF [16] has
been proposed as an analysis-by-synthesis approach for pose
estimation built on the concept of inverting a NeRF model.
Inspired by iNeRF’s success, this paper further explores the
idea of pose estimation via neural radiance field inversion.
A drawback of NeRF is its computational overhead, which
increases execution time. To overcome this limitation, we
leverage our fast version of NeRF, known as Instant Neural
Graphics Primitives (Instant NGP) [17]. Inverting an Instant
NGP model is significantly faster than inverting a NeRF model.
The structure of Instant NGP admits parallel optimization,
which we leverage to overcome issues with local minima
and thereby achieve greater robustness than is possible with
iNeRF. Similar to iNeRF, our pose estimation requires three
inputs: a single RGB image of the target, an initial rough
pose estimate of the target, and an Instant NGP model trained
from multiple views of the target.
arXiv:2210.10108v2 [cs.CV] 10 Mar 2023
Considering that a single camera pose is vulnerable to
local minima during optimization iterations, we leverage
parallelized Monte Carlo sampling. At adaptive intervals,
camera pose hypotheses are re-sampled around the hypotheses
with the lowest loss. This design alleviates the issue
of convergence to local minima and improves efficiency of
search over a more extensive search space.
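The resampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the 6-vector pose parameterization, the function name `resample_hypotheses`, the `keep_fraction` and `sigma` parameters, and the additive Gaussian perturbation are all hypothetical choices made for the example, and the adaptive scheduling of the resampling intervals is omitted.

```python
import numpy as np

def resample_hypotheses(poses, losses, keep_fraction=0.25, sigma=0.05, rng=None):
    """Resample pose hypotheses around the lowest-loss ones.

    poses:  (K, 6) array; each row is a hypothetical 6-vector pose
            (3 rotation parameters followed by 3 translation parameters).
    losses: (K,) array holding the current loss of each hypothesis.
    Returns a new (K, 6) array of hypotheses.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = poses.shape[0]
    n_keep = max(1, int(k * keep_fraction))

    # Indices of the hypotheses with the lowest loss (the survivors).
    best = np.argsort(losses)[:n_keep]

    # Draw a parent for every remaining slot, then perturb it with
    # Gaussian noise (assumed perturbation model, for illustration only).
    parents = rng.choice(best, size=k - n_keep, replace=True)
    noise = rng.normal(0.0, sigma, size=(k - n_keep, poses.shape[1]))
    children = poses[parents] + noise

    # Survivors are kept unchanged so the best estimate is never lost.
    return np.concatenate([poses[best], children], axis=0)
```

Keeping the survivors unperturbed is one possible design choice; it guarantees that the lowest loss in the population cannot increase as a result of resampling alone.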
The gradients of pixel residuals calculated between the
rendered model and the target view are backpropagated
to generate camera pose updates. Unlike iNeRF, where a
subsample of a new image is rendered at each iteration, we
enable hundreds of thousands of rays to work independently
in parallel to accumulate gradient descent updates per camera
pose hypothesis. This design dramatically improves
efficiency. Furthermore, we investigate different pixel-based loss
functions to identify which approach to quantifying the visual
difference between the rendered model and the observed
target image best informs the camera pose updates. As shown
in the ablation study, the mean absolute percentage error
(MAPE) [18] loss exhibits better robustness to disturbances.
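To make the comparison of pixel-based losses concrete, the sketch below contrasts a plain L2 loss with a MAPE-style loss on a batch of pixel values. It is an illustration only: the function names and the stabilizing constant `eps` are assumptions for this example and are not taken from the implementation used in this work.

```python
import numpy as np

def l2_loss(rendered, observed):
    """Mean squared difference between rendered and observed pixel values."""
    return float(np.mean((rendered - observed) ** 2))

def mape_loss(rendered, observed, eps=1e-2):
    """Mean absolute percentage error.

    Each absolute residual is divided by the magnitude of the observed
    value, so the error is relative rather than absolute. `eps` is an
    assumed small constant that avoids division by zero on black pixels.
    """
    return float(np.mean(np.abs(rendered - observed) / (np.abs(observed) + eps)))

# Example: the same absolute error of 0.1 on a bright and a dark pixel.
observed = np.array([0.9, 0.1])
rendered = observed + 0.1
print("L2:  ", l2_loss(rendered, observed))
print("MAPE:", mape_loss(rendered, observed))
```

Under L2 both pixels contribute equally, whereas under MAPE the residual on the dark pixel is weighted more heavily because it is large relative to the observed intensity.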
In summary, this work makes the following contributions:
• A parallelized, momentum-based optimization method using
  NeRF models is proposed to estimate 6-DoF poses
  from monocular RGB input. The object-specific NeRF
  model does not require pre-training on large datasets.
• Parallelized Monte Carlo sampling is introduced into the
  pose estimation task, and we show the importance of
  pixel-based loss function selection for robustness.
• Quantitative demonstration through synthetic and real-world
  benchmarks that the proposed method has improved
  generalization and robustness.
II. RELATED WORKS
Neural 3D Scene Representation. Recent works [19]–
[21] have investigated representing 3D scenes implicitly with
neural networks, where coordinates are sampled and fed into
a neural network to produce physical field values across
space and time [22]. NeRF [15] is a milestone approach
demonstrating that neural scene representations can
synthesize photo-realistic views. Since then,
significant effort has been put into pushing the boundaries
of NeRF. Follow-up works have focused on speeding up the
training and inference processes [17], [23], [24], adding
support for relighting [25], relaxing the requirement of known
camera poses [26], [27], reducing the number of training
images [28], extending to dynamic scenes [29], and so on.
NeRF also opens up opportunities in the robotics community.
Researchers have proposed to use it to represent scenes for
visuomotor control [30], reconstruct transparent objects [31],
generate training data for pose estimators [32] or dense object
descriptors [33], and model 3D object categories [34]. In this
work, we aim to follow in their footsteps by applying NeRF
directly to the 6-DoF pose estimation task.
Generalizable 6-DoF Pose Estimation. Generalizable
6-DoF pose estimation—not limited to any specific target
or category—from RGB images has been a long-standing
problem in the community. Existing methods tend to share
a similar pipeline of two phases: 1) model registration and
2) pose estimation.
Traditional methods [35]–[38] first build a 3D CAD
model via commercial scanners or dense 3D reconstruction
techniques [39], [40]. They resolve the pose by finding
2D-3D correspondences (via hand-designed features like
SIFT [41] or ORB [42]) between the input RGB image
and the registered model. However, creating high-quality
3D models is not easy, and finding correspondences across a
large database (renderings from different viewpoints) can be
time-consuming [35]. More recently, several attempts have
been made to revisit the object-agnostic pose estimation
problem with deep learning. The presumption is that a deep
network pretrained on a large dataset can generalize to find
correspondences between the query image and the registered
model for novel objects. OnePose [43], inspired by visual
localization research, proposes to use a graph attention network
to aggregate 2D features from different views during the
registration phase of structure-from-motion [44]. Then the
aggregated 3D descriptor is matched with 2D features from
the query view to solve the PnP problem [45]. Similarly,
OSOP [46] explores solving the PnP problem with a dense
correspondence between the query image and the coordinate
map from a pre-built 3D CAD model. On the other hand,
Gen6D [47] only needs to register the model with a set
of posed images. Following the iterative template matching
idea [48], [49], its network takes as input several neighboring
registered images closest to the predicted pose and repeatedly
refines the result.
While data-driven approaches rely on generalizing from
a large training dataset (usually composed of both synthetic
and real-world data) [47], iNeRF [16] is an on-the-fly
optimization approach free of pretraining. Each new object is
first registered by a NeRF model [15], after which iNeRF
can optimize the camera pose on the synthesized
photo-realistic renderings from NeRF. Although iNeRF’s idea
seems promising, several challenges remain. The
first is the expensive training cost of a NeRF model, which
may take hours for just one target. Additionally, iNeRF’s
pose update strategy is inefficient, as the accumulation and
backpropagation of the loss gradient are deferred until
a subsample of a new image has been rendered. Moreover, the
optimization process of a single pose hypothesis is easily
trapped in local minima due to outliers. To deal with the
aforementioned issues, we propose a more efficient and
robust approach leveraging the recent success of Instant
NGP [17]. We re-formulate the camera pose representation
as the Cartesian product SO(3) × T(3) and integrate the
optimization process into the structure of Instant NGP. We
also adopt parallelized Monte Carlo sampling to improve
robustness to local minima. Loc-NeRF [50] is a concurrent
work that also uses Monte Carlo sampling to improve iNeRF.
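As a rough illustration of optimizing a pose represented as SO(3) × T(3), the sketch below updates the rotation and the translation as separate factors. Everything specific here is an assumption for the example rather than the method of this work: the function names, the use of plain heavy-ball momentum as a stand-in for the momentum-based optimizer, the learning rate and momentum coefficient, and the choice to apply the rotation increment by left-multiplication.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: map an axis-angle vector w (3,) to a rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    # Skew-symmetric cross-product matrix of the unit rotation axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def momentum_pose_step(R, t, grad_w, grad_t, vel_w, vel_t, lr=1e-2, beta=0.9):
    """One momentum step on a pose (R, t) in SO(3) x T(3).

    grad_w: (3,) gradient with respect to an axis-angle rotation increment.
    grad_t: (3,) gradient with respect to the translation.
    vel_w, vel_t: momentum buffers carried between iterations.
    """
    vel_w = beta * vel_w + grad_w
    vel_t = beta * vel_t + grad_t
    # The rotation is updated through the exponential map, so the result
    # remains a valid rotation matrix; the translation is updated additively.
    R_new = so3_exp(-lr * vel_w) @ R
    t_new = t - lr * vel_t
    return R_new, t_new, vel_w, vel_t
```

Because the two factors are decoupled, the translation step does not depend on the current rotation, which is the practical difference from an update on the coupled group SE(3).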
III. PRELIMINARIES
NeRF. Given a collection of $N$ RGB images $\{I_i\}_{i=1}^{N}$,
$I_i \in [0,1]^{H \times W \times 3}$, with known camera poses
$\{T_i\}_{i=1}^{N}$, NeRF [15]
learns to represent a scene as 5D neural radiance fields
(spatial location and viewing direction). It can synthesize
novel views by querying 5D coordinates along the camera
rays and use classic volume rendering techniques to project
the output colors and densities into an image.
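The volume rendering step can be illustrated for a single ray with the standard discrete quadrature from NeRF [15], in which each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. The sketch below assumes the densities and colors have already been queried from the model; the function name `render_ray` and the array layout are choices made for this example.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray into a single RGB value.

    sigmas: (S,) volume densities at the S samples, ordered near to far.
    colors: (S, 3) RGB colors at the samples.
    deltas: (S,) distances between adjacent samples.
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance reaching sample i: T_i = prod_{j<i} (1 - alpha_j).
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    # Rendered color: sum_i T_i * alpha_i * c_i.
    return weights @ colors

# Example: an empty first sample followed by an opaque green sample.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(render_ray(np.array([0.0, 1e9]), colors, np.array([1.0, 1.0])))
```

Every operation in this compositing step is differentiable, which is what allows pixel residuals to be backpropagated to the camera pose.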