ses with the lowest loss. This design alleviates the issue
of convergence to local minima and improves efficiency of
search over a more extensive search space.
The gradients of pixel residuals calculated between the
rendered model and the target view are backpropagated
to generate camera pose updates. Unlike iNeRF, where a subsample of a new image is rendered at each iteration, we enable hundreds of thousands of rays to work independently in parallel to accumulate gradient-descent updates per camera pose hypothesis. This design dramatically improves efficiency. Furthermore, we investigate different pixel-based loss
functions to identify which measure of the visual difference between the rendered model and the observed target image best informs the camera pose updates. As shown in the ablation study, the mean absolute percentage error (MAPE) [18] loss exhibits better robustness to disturbances than the alternatives.
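For concreteness, the sketch below shows this parallel per-hypothesis update with a MAPE loss in PyTorch. It is a minimal illustration under assumed names and constants, not our actual implementation: render_rays is a dummy stand-in for a differentiable NeRF renderer, kept trivial so the sketch runs end to end.

import torch

def mape_loss(pred, target, eps=1e-2):
    # Mean absolute percentage error between rendered and observed
    # pixel colors; eps guards against division by zero in dark pixels.
    return (torch.abs(pred - target) / (torch.abs(target) + eps)).mean()

def render_rays(pose_params, num_rays):
    # Stand-in for a differentiable NeRF renderer: a real one would
    # cast num_rays rays per pose hypothesis through the radiance field.
    base = pose_params.sum(-1).reshape(-1, 1, 1)               # (H, 1, 1)
    return torch.sigmoid(base + torch.zeros(pose_params.shape[0], num_rays, 3))

num_hypotheses, num_rays = 64, 2048
# One learnable 6-DoF vector (rotation + translation) per hypothesis.
poses = torch.nn.Parameter(0.1 * torch.randn(num_hypotheses, 6))
optimizer = torch.optim.SGD([poses], lr=1e-2, momentum=0.9)
target_rgb = torch.rand(num_rays, 3)                           # observed pixels

for step in range(100):
    optimizer.zero_grad()
    # Rays of all hypotheses render in one batch, so gradient-descent
    # updates accumulate independently and in parallel per hypothesis.
    pred_rgb = render_rays(poses, num_rays)                    # (H, R, 3)
    loss = mape_loss(pred_rgb, target_rgb.expand_as(pred_rgb))
    loss.backward()
    optimizer.step()

Because each hypothesis's rays contribute gradients only to that hypothesis's own pose parameters, all hypotheses can be updated simultaneously on the GPU.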
In summary, this work makes the following contributions:
• A parallelized, momentum-based optimization method using NeRF models is proposed to estimate 6-DoF poses from monocular RGB input. The object-specific NeRF model does not require pre-training on large datasets.
• Parallelized Monte Carlo sampling is introduced into the pose estimation task, and we show the importance of pixel-based loss function selection for robustness.
• Quantitative demonstration through synthetic and real-world benchmarks that the proposed method has improved generalization and robustness.
II. RELATED WORKS
Neural 3D Scene Representation. Recent works [19]–
[21] have investigated representing 3D scenes implicitly with
neural networks, where coordinates are sampled and fed into
a neural network to produce physical field values across
space and time [22]. NeRF [15] is a milestone approach
demonstrating that neural scene representations are capable of synthesizing photo-realistic views. Since then,
significant effort has been put into pushing the boundaries
of NeRF. Follow-up works have focused on speeding up the
training and inference processes [17], [23], [24], adding support for relighting [25], relaxing the requirement of known
camera poses [26], [27], reducing the number of training
images [28], extending to dynamic scenes [29], and so on.
NeRF also opens up opportunities in the robotics community.
Researchers have proposed to use it to represent scenes for
visuomotor control [30], reconstruct transparent objects [31],
generate training data for pose estimators [32] or dense object
descriptors [33], and model 3D object categories [34]. In this
work, we aim to follow in their footsteps by applying NeRF
directly to the 6-DoF pose estimation task.
Generalizable 6-DoF Pose Estimation. Generalizable
6-DoF pose estimation—not limited to any specific target
or category—from RGB images has been a long-standing
problem in the community. Existing methods tend to share a similar two-phase pipeline: 1) model registration and 2) pose estimation.
Traditional methods [35]–[38] first build a 3D CAD
model via commercial scanners or dense 3D reconstruction
techniques [39], [40]. They resolve the pose by finding
2D-3D correspondences (via hand-designed features like
SIFT [41] or ORB [42]) between the input RGB image
and the registered model. However, creating high-quality 3D models is not easy, and finding correspondences across a
large database (renderings from different viewpoints) can be
time-consuming [35]. More recently, several attempts have
been made to revisit the object-agnostic pose estimation
problem with deep learning. The presumption is that a deep
network pretrained on a large dataset can generalize to find
correspondence between the query image and the registered
model for novel objects. OnePose [43], inspired by visual localization research, proposes to use a graph attention network
to aggregate 2D features from different views during the
registration phase of structure-from-motion [44]. Then the
aggregated 3D descriptor is matched with 2D features from
the query view to solve the PnP problem [45]. Similarly,
OSOP [46] explores solving the PnP problem with a dense
correspondence between the query image and the coordinate
map from a pre-built 3D CAD model. On the other hand,
Gen6D [47] only needs to register the model with a set
of posed images. Following the iterative template matching
idea [48], [49], its network takes as input several neighboring
registered images closest to the predicted pose and repeatedly
refines the result.
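As a concrete illustration of the correspondence-then-PnP step these pipelines share, the sketch below recovers a pose from 2D-3D matches with OpenCV. The correspondences, intrinsics, and image size are made-up placeholders for whatever the matching stage (hand-crafted or learned) produces.

import cv2
import numpy as np

# Hypothetical 2D-3D correspondences: points on the registered 3D
# model (object frame) matched to pixels in the query image. In the
# pipelines above these come from SIFT/ORB or learned descriptors.
object_points = np.random.rand(50, 3).astype(np.float32)
image_points = (np.random.rand(50, 2) * [640, 480]).astype(np.float32)

# Assumed pinhole intrinsics of the query camera.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)  # assume an undistorted image

# RANSAC-based PnP rejects outlier matches while solving for the
# object's 6-DoF pose in the camera frame.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    object_points, image_points, K, dist_coeffs)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 matrix
    print("R:\n", R, "\nt:", tvec.ravel())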
While data-driven approaches rely on the generalization of
a large training dataset (usually composed of both synthetic
& real-world data) [47], iNeRF [16] is an on-the-fly optimization approach free of pretraining. Each new object is
first registered by a NeRF model [15], after which iNeRF
can optimize the camera pose on the synthesized photo-
realistic renderings from NeRF. Although iNeRF’s idea
seems promising, there still remain several challenges. The
first is the expensive training cost of a NeRF model, which
may take hours for just one target. Additionally, iNeRF’s pose update strategy is inefficient: the loss gradient is accumulated and backpropagated only after a subsample of a new image has been rendered. Moreover, the
optimization process of a single pose hypothesis is easily
trapped in local minima due to outliers. To deal with the
aforementioned issues, we propose a more efficient and
robust approach leveraging the recent success of Instant
NGP [17]. We re-formulate the camera pose representation as the Cartesian product SO(3) × T(3) and integrate the optimization process into the structure of Instant NGP. We also adopt parallelized Monte Carlo sampling to improve robustness to local minima. Loc-NeRF [50] is a concurrent work that also uses Monte Carlo sampling to improve iNeRF.
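For illustration, one parallelized Monte Carlo resampling step might look like the minimal sketch below; the helper name, keep ratio, and noise scale are assumptions, and the per-hypothesis losses would come from the pixel-based loss described earlier.

import torch

def resample_hypotheses(poses, losses, keep_ratio=0.25, noise=0.05):
    # Keep the pose hypotheses with the lowest loss and respawn the
    # rest as perturbed copies of the survivors (one Monte Carlo step).
    num = poses.shape[0]
    keep = max(1, int(num * keep_ratio))
    best = torch.argsort(losses)[:keep]      # lowest-loss hypotheses
    survivors = poses[best]
    # Respawn the discarded hypotheses near randomly chosen survivors.
    idx = torch.randint(keep, (num - keep,))
    respawned = survivors[idx] + noise * torch.randn(num - keep, poses.shape[1])
    return torch.cat([survivors, respawned], dim=0)

# Usage with stand-in per-hypothesis losses from the renderer:
poses = 0.3 * torch.randn(128, 6)            # 6-DoF hypotheses
losses = torch.rand(128)                     # e.g., per-hypothesis MAPE
poses = resample_hypotheses(poses, losses)

Surviving low-loss hypotheses continue to be optimized, while the respawned ones re-explore the pose space around them.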
III. PRELIMINARIES
NeRF. Given a collection of $N$ RGB images $\{I_i\}_{i=1}^{N}$, $I_i \in [0,1]^{H \times W \times 3}$, with known camera poses $\{T_i\}_{i=1}^{N}$, NeRF [15] learns to represent a scene as a 5D neural radiance field (over spatial location and viewing direction). It can synthesize novel views by querying 5D coordinates along the camera rays and using classic volume rendering techniques to project the output colors and densities into an image.
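In the standard discrete form from [15], the color of a camera ray $\mathbf{r}$ is composited from $K$ samples along the ray with predicted densities $\sigma_k$, colors $\mathbf{c}_k$, and inter-sample distances $\delta_k$:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} T_k \left(1 - e^{-\sigma_k \delta_k}\right)\mathbf{c}_k, \qquad T_k = \exp\Big(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Big),$$

where $T_k$ is the transmittance accumulated up to the $k$-th sample.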