SC-wLS: Towards Interpretable Feed-forward
Camera Re-localization
Xin Wu1,2*, Hao Zhao1,3*, Shunkai Li4, Yingdian Cao1,2, and Hongbin Zha1,2
(* equal contribution)
1Key Laboratory of Machine Perception (MOE), School of AI, Peking University
2PKU-SenseTime Machine Vision Joint Lab
3Intel Labs China
4Kuaishou Technology
{wuxin1998,zhao-hao,lishunkai,yingdianc}@pku.edu.cn, zha@cis.pku.edu.cn
https://github.com/XinWu98/SC-wLS
Abstract. Visual re-localization aims to recover camera poses in a known environment, which is vital for applications like robotics or augmented reality. Feed-forward absolute camera pose regression methods directly output poses by a network, but suffer from low accuracy. Meanwhile, scene coordinate based methods are accurate, but need iterative RANSAC post-processing, which brings challenges to efficient end-to-end training and inference. In order to have the best of both worlds, we propose a feed-forward method termed SC-wLS that exploits all scene coordinate estimates for weighted least squares pose regression. This differentiable formulation exploits a weight network imposed on 2D-3D correspondences, and requires pose supervision only. Qualitative results demonstrate the interpretability of learned weights. Evaluations on the 7Scenes and Cambridge datasets show significantly improved performance over former feed-forward counterparts. Moreover, our SC-wLS method enables a new capability: self-supervised test-time adaptation of the weight network. Codes and models are publicly available.
Keywords: Camera Re-localization, Differentiable Optimization
Keywords: Camera Re-localization, Differentiable Optimization
1 Introduction
Visual re-localization [10,16,32,38] determines the global 6-DoF poses (i.e., position and orientation) of query RGB images in a known environment. It is a fundamental computer vision problem and has many applications in robotics and augmented reality. Recently, there has been a trend to incorporate deep neural networks into various 3D vision tasks, and to use differentiable formulations that optimize losses of interest to learn result-oriented intermediate representations. Following this trend, many learning-based absolute pose regression (APR) methods [22,8] have been proposed for camera re-localization, which only need a single feed-forward pass to recover poses. However, they treat the neural network as a
(a) input test image (b) reprojection error (c) learned weights
Fig. 1. For input images (a), our network first regresses their scene coordinates, then predicts correspondence-wise weights (c). With these weights, we can use all 2D-3D correspondences for end-to-end differentiable least squares pose estimation. We use re-projection errors (b) to illustrate scene coordinate quality. Our weights select high-quality scene coordinates. A higher color temperature represents a higher value.
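To make the visualization in (b) concrete, the following is a minimal sketch of how such per-pixel reprojection errors can be computed; the function name, tensor shapes, and the world-to-camera pose convention are our assumptions, not code released with the paper.

```python
import torch

def reprojection_error(scene_coords, pixels, T, K):
    """Per-correspondence reprojection error in pixels, as visualized in (b).

    scene_coords: (N, 3) predicted 3D scene coordinates (world frame)
    pixels:       (N, 2) corresponding 2D pixel positions
    T:            (3, 4) world-to-camera pose [R|t]
    K:            (3, 3) camera intrinsics
    """
    n = scene_coords.shape[0]
    homo = torch.cat([scene_coords, torch.ones(n, 1)], dim=1)  # (N, 4)
    cam = homo @ T.T                                  # points in camera frame
    proj = cam @ K.T                                  # apply intrinsics
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-8) # perspective divide
    return (proj - pixels).norm(dim=1)                # (N,) pixel errors
```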
black box and suffer from low accuracy [33]. On the other hand, scene coordinate based methods learn pixel-wise 3D scene coordinates from RGB images and solve camera poses using 2D-3D correspondences by Perspective-n-Point (PnP) [24]. In order to handle outliers in estimated scene coordinates, the random sample consensus (RANSAC) [11] algorithm is usually used for robust fitting. Compared to the feed-forward APR paradigm, scene coordinate based methods achieve state-of-the-art performance on public camera re-localization datasets. However, RANSAC-based post-processing is an iterative procedure conducted on CPUs, which brings engineering challenges for efficient end-to-end training and inference [4,7].
In order to get the best of both worlds, we develop a new feed-forward method based upon the state-of-the-art (SOTA) pipeline DSAC* [7], thus enjoying the strong representation power of scene coordinates. We propose an alternative option to RANSAC that exploits all 3D scene coordinate estimates for weighted least squares pose regression (SC-wLS). The key to SC-wLS is a weight network that treats 2D-3D correspondences as 5D point clouds and learns weights that capture geometric patterns in this 5D space, with only pose supervision. Our learned weights can be used to interpret how much each scene coordinate contributes to the least squares solver.
Our SC-wLS estimates poses using only tensor operators on GPUs, which is similar to APR methods due to the feed-forward nature, but outperforms APR methods due to the usage of scene coordinates. Furthermore, we show that a self-supervised test-time adaptation step that updates the weight network can lead to further performance improvements. This is potentially useful in scenarios like a robot vacuum adapting to specific rooms during standby time. Although we focus on comparisons with APR methods, we also equip SC-wLS with the LM-Refine
post-processing module provided in DSAC* [7] to explore the limits of SC-wLS, and show that it outperforms the SOTA on the outdoor Cambridge dataset.
Our major contributions can be summarized as follows: (1) We propose a new feed-forward camera re-localization method, termed SC-wLS, that learns interpretable scene coordinate quality weights (as in Fig. 1) for weighted least squares pose estimation, with only pose supervision. (2) Our method combines the advantages of two paradigms. It exploits learnt 2D-3D correspondences while still allowing efficient end-to-end training and feed-forward inference in a principled manner. As a result, we achieve significantly better results than APR methods. (3) Our SC-wLS formulation allows test-time adaptation via self-supervised fine-tuning of the weight network with the photometric loss.
2 Related works
Camera re-localization. In the following, we discuss camera re-localization methods from the perspective of map representation.
Representing image databases with global descriptors like thumbnails [16], BoW [39], or learned features [1] is a natural choice for camera re-localization. By retrieving the poses of similar images, localization can be performed at extremely large scale [16,34]. Meanwhile, CNN-based absolute pose regression methods [22,42,8,49] belong to this category, since their final-layer embeddings are also learned global descriptors. They regress camera poses from single images in an end-to-end manner, and recent work primarily focuses on sequential inputs [48] and network structure enhancements [49,37,13]. Although the accuracy of this line of methods is generally low due to intrinsic limitations [33], they are usually compact and fast, enabling pose estimation in a single feed-forward pass.
Maps can also be represented by 3D point clouds [46] with associated 2D descriptors [26] via SfM tools [35]. Given a query image, feature matching establishes sparse 2D-3D correspondences and yields very accurate camera poses with RANSAC-PnP pose optimization [53,32]. The success of these methods heavily depends on the discriminativeness of features and the robustness of matching strategies. Inspired by feature based pipelines, scene coordinate regression learns a 2D-3D correspondence for each pixel, instead of using feature extraction and matching separately. The map is implicitly encoded into network parameters. [27] demonstrates impressive localization performance using stereo initialization and sequence inputs. Recently, [2] showed that the algorithm used to create pseudo ground truth has a significant impact on the relative ranking of the above methods.
Apart from random forest based methods using RGB-D inputs [38,40,28], scene coordinate regression on RGB images is seeing steady progress [4,5,6,7,56]. This line of work lays the foundation for our research. In this scheme, predicted scene coordinates are noisy due to single-view ambiguity and the domain gap during inference. As such, [4,5] use RANSAC and non-linear optimization to deal with outliers, and NG-RANSAC [6] learns correspondence-wise weights to guide RANSAC sampling. [6] conditions weights on RGB images, whose statistics are often influenced by factors like lighting, weather, or even exposure time. Object pose and room layout estimation [23,55,17,18,54,50] can also be addressed with similar representations [3,44].
[Fig. 2 diagram: (A) scene coordinate regression FCN → (B) order-aware filtering network with attention → (C) differentiable weighted least squares, supervised by the ground truth pose]
Fig. 2. The overview of SC-wLS. Firstly, a fully convolutional network (A) regresses pixel-wise scene coordinates from an input RGB image. Scene coordinate predictions are flattened to the shape of N×3, with N being the pixel count. We concatenate them with normalized N×2 2D pixel coordinates, forming N×5 correspondence inputs. The correspondences are fed into the weight learning network (B), producing N×1 weights indicating scene coordinate quality. The architecture of (B) is an order-aware filtering network [52] with graph attention modules [31]. Thirdly, correspondences and weights are sent into a differentiable weighted least squares layer (C), directly outputting camera poses. The scene coordinate ground truth is not used during training.
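As a sketch of how the N×5 input of network (B) can be assembled from the FCN output, consider the following; the function and variable names are hypothetical, and we assume the standard pinhole normalization of pixel coordinates by the inverse intrinsics, as described in Sec. 3.1.

```python
import torch

def build_correspondences(scene_coords, K):
    """Assemble N x 5 correspondences [x, y, z, u, v] for network (B).

    scene_coords: (H, W, 3) scene coordinate prediction from the FCN (A)
    K:            (3, 3) known camera intrinsics
    """
    H, W, _ = scene_coords.shape
    # One homogeneous 2D pixel coordinate per scene coordinate.
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    # Normalize pixel coordinates with the inverse intrinsics.
    norm_uv = (pix @ torch.linalg.inv(K).T)[:, :2]   # (N, 2)
    sc = scene_coords.reshape(-1, 3)                 # (N, 3), flattened
    return torch.cat([sc, norm_uv], dim=1)           # (N, 5)
```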
Differentiable optimization. To enhance their compatibility with deep neural networks and learn result-oriented feature representations, some recent works focus on re-formulating geometric optimization techniques in an end-to-end trainable fashion, for various 3D vision tasks. [20] proposes several standard differentiable optimization layers. [51,30] propose to estimate the fundamental/essential matrix by solving weighted least squares problems with spectral layers. [12] further shows that the eigenvalue switching problem when solving least squares can be avoided by minimizing linear system residuals. [14] develops generic black-box differentiable optimization techniques with implicit declarative nodes.
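To make the residual idea attributed to [12] concrete, here is a minimal sketch under our own assumptions (the names and the exact loss form are illustrative, not taken from any of the cited papers): rather than backpropagating through an eigendecomposition, one penalizes the weighted linear-system residual of the known ground-truth solution vector.

```python
import torch

def residual_loss(X, w, e_gt):
    """Eigendecomposition-free training signal in the spirit of [12].

    X:    (2N, 12) coefficient matrix of the linear system X e = 0
    w:    (N,) predicted, non-negative correspondence weights
    e_gt: (12,) ground-truth solution vector (e.g., Vec(T) of the GT pose)
    """
    w2 = w.repeat_interleave(2)        # each correspondence yields 2 rows
    M = X.T @ (w2.unsqueeze(1) * X)    # X^T diag(w) X, a 12 x 12 matrix
    e = e_gt / e_gt.norm()
    # Zero iff e_gt lies in the null space of the weighted system;
    # gradients flow into w without any eigendecomposition.
    return e @ M @ e
```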
3 Method
Given an RGB image $I$, we aim to find an estimate of the absolute camera pose, consisting of a 3D translation and a 3D rotation, in the world coordinate system. Towards this goal, we exploit the scene coordinate representation. Specifically, for each pixel $i$ with position $p_i$ in an image, we predict the corresponding 3D scene coordinate $s_i$. As illustrated in Fig. 2, we propose an end-to-end trainable deep network that directly calculates global camera poses via weighted least squares. The method is named SC-wLS. Fig. 2-A is a standard fully convolutional network for scene coordinate regression, as used in former works [7]. Our innovation lies in Fig. 2-B/C, as elaborated below.
3.1 Formulation
Given learned 3D scene coordinates and corresponding 2D pixel positions, our goal is to determine the absolute poses of calibrated images taking all correspondences into account. These would inevitably include outliers, and we need to give them proper weights. Ideally, if all outliers are rejected by zero weights, calculating the absolute pose can be formulated as a linear least squares problem. Inspired by [51] (which solves an essential matrix problem instead), we use all of the $N$ 2D-3D correspondences as input and predict $N$ respective weights $w_i$ using a neural network (Fig. 2-B). $w_i$ indicates the uncertainty of each scene coordinate prediction. As such, the ideal least squares problem is turned into a weighted version for pose recovery.
Specifically, the input correspondence $c_i$ to Fig. 2-B is
$$c_i = [x_i, y_i, z_i, u_i, v_i] \quad (1)$$
where $x_i, y_i, z_i$ are the three components of the scene coordinate $s_i$, and $u_i, v_i$ denote the corresponding pixel position. $u_i, v_i$ are generated by normalizing $p_i$ with the known camera intrinsic matrix. The absolute pose is written as a transformation matrix $T \in \mathbb{R}^{3 \times 4}$. It projects scene coordinates to the camera plane as below:
$$\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = T \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix} = \begin{bmatrix} p_1 & p_2 & p_3 & p_4 \\ p_5 & p_6 & p_7 & p_8 \\ p_9 & p_{10} & p_{11} & p_{12} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix} \quad (2)$$
When $N > 6$, the transformation matrix $T$ can be recovered by the Direct Linear Transform (DLT) [15], which converts Eq. 2 into a linear system:
$$X \, \mathrm{Vec}(T) = 0 \quad (3)$$
$\mathrm{Vec}(T)$ is the vectorized $T$. $X \in \mathbb{R}^{2N \times 12}$ is a matrix whose $(2i-1)$-th and $2i$-th rows $X^{(2i-1)}, X^{(2i)}$ are as follows:
$$\begin{aligned} X^{(2i-1)} &= [x_i, y_i, z_i, 1, 0, 0, 0, 0, -u_i x_i, -u_i y_i, -u_i z_i, -u_i] \\ X^{(2i)} &= [0, 0, 0, 0, x_i, y_i, z_i, 1, -v_i x_i, -v_i y_i, -v_i z_i, -v_i] \end{aligned} \quad (4)$$
As such, pose estimation is formulated as a least squares problem. $\mathrm{Vec}(T)$ can be recovered by finding the eigenvector associated with the smallest eigenvalue of $X^\top X$.
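For concreteness, a minimal PyTorch sketch of this DLT step follows; the function names are ours and the paper's released code may differ. It builds the row pairs of $X$ per Eq. 4 and recovers $\mathrm{Vec}(T)$ from the eigenvector of $X^\top X$ with the smallest eigenvalue.

```python
import torch

def dlt_matrix(corr):
    """Stack the two Eq. 4 rows of each correspondence into X (2N x 12).

    corr: (N, 5) rows [x, y, z, u, v] as defined in Eq. 1
    """
    x, y, z, u, v = corr.unbind(dim=1)
    zero, one = torch.zeros_like(x), torch.ones_like(x)
    r1 = torch.stack([x, y, z, one, zero, zero, zero, zero,
                      -u * x, -u * y, -u * z, -u], dim=1)
    r2 = torch.stack([zero, zero, zero, zero, x, y, z, one,
                      -v * x, -v * y, -v * z, -v], dim=1)
    return torch.stack([r1, r2], dim=1).reshape(-1, 12)  # interleaved rows

def dlt_solve(X):
    """Vec(T) = eigenvector of X^T X associated with the smallest eigenvalue."""
    _, eigvecs = torch.linalg.eigh(X.T @ X)  # eigenvalues in ascending order
    return eigvecs[:, 0].reshape(3, 4)
```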
Note that in SC-wLS, each correspondence contributes differently according to $w_i$, so $X^\top X$ can be rewritten as $X^\top \mathrm{diag}(w) X$, and $\mathrm{Vec}(T)$ still corresponds to the eigenvector of its smallest eigenvalue. As the rotation matrix $R$ needs to be orthogonal and have determinant 1, we further refine the DLT results by the generalized Procrustes algorithm [36], which is also differentiable. More details about this post-processing step can be found in the supplementary material.
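Below is a minimal sketch of the weighted variant together with a simple SVD-based orthogonalization; we use a plain nearest-rotation projection as a stand-in for the generalized Procrustes refinement [36], whose exact form is given in the paper's supplementary material.

```python
import torch

def weighted_dlt_solve(X, w):
    """Smallest eigenvector of X^T diag(w) X, i.e., weighted least squares."""
    w2 = w.repeat_interleave(2)          # two rows per correspondence
    M = X.T @ (w2.unsqueeze(1) * X)
    _, eigvecs = torch.linalg.eigh(M)    # differentiable, runs on GPU
    return eigvecs[:, 0].reshape(3, 4)

def refine_pose(T):
    """Map the scale/sign-ambiguous DLT output to a valid pose.

    A simple stand-in for the generalized Procrustes step: remove the
    global scale, fix the sign, then project onto SO(3) via SVD.
    """
    _, S, _ = torch.linalg.svd(T[:, :3])
    T = T / S.mean()                     # remove the global scale of Vec(T)
    if torch.linalg.det(T[:, :3]) < 0:   # resolve the sign ambiguity
        T = -T
    U, _, Vh = torch.linalg.svd(T[:, :3])
    R = U @ Vh                           # nearest rotation (Frobenius norm)
    return R, T[:, 3]
```

Note that the sign of the eigenvector returned by the solver is arbitrary, which is why the determinant check above is needed before projecting onto SO(3).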