SC-wLS: Towards Interpretable Feed-forward
Camera Re-localization
Xin Wu1,2*, Hao Zhao1,3*, Shunkai Li4, Yingdian Cao1,2, and Hongbin Zha1,2
(* equal contribution)
1Key Laboratory of Machine Perception (MOE), School of AI, Peking University
2PKU-SenseTime Machine Vision Joint Lab
3Intel Labs China
4Kuaishou Technology
{wuxin1998,zhao-hao,lishunkai,yingdianc}@pku.edu.cn, zha@cis.pku.edu.cn
https://github.com/XinWu98/SC-wLS
Abstract. Visual re-localization aims to recover camera poses in a known environment, which is vital for applications like robotics or augmented reality. Feed-forward absolute camera pose regression methods directly output poses by a network, but suffer from low accuracy. Meanwhile, scene coordinate based methods are accurate, but need iterative RANSAC post-processing, which brings challenges to efficient end-to-end training and inference. In order to have the best of both worlds, we propose a feed-forward method termed SC-wLS that exploits all scene coordinate estimates for weighted least squares pose regression. This differentiable formulation exploits a weight network imposed on 2D-3D correspondences, and requires pose supervision only. Qualitative results demonstrate the interpretability of learned weights. Evaluations on the 7Scenes and Cambridge datasets show significantly improved performance over former feed-forward counterparts. Moreover, our SC-wLS method enables a new capability: self-supervised test-time adaptation of the weight network. Codes and models are publicly available.
Keywords: Camera Re-localization, Differentiable Optimization
Keywords: Camera Re-localization, Differentiable Optimization
1 Introduction
Visual re-localization [10,16,32,38] determines the global 6-DoF poses (i.e., position and orientation) of query RGB images in a known environment. It is a fundamental computer vision problem and has many applications in robotics and augmented reality. Recently, there has been a trend to incorporate deep neural networks into various 3D vision tasks, and to use differentiable formulations that optimize losses of interest to learn result-oriented intermediate representations. Following this trend, many learning-based absolute pose regression (APR) methods [22,8] have been proposed for camera re-localization, which only need a single feed-forward pass to recover poses. However, they treat the neural network as a
(a) input test image (b) reprojection error (c) learned weights
Fig. 1. For input images (a), our network first regresses their scene coordinates, then predicts correspondence-wise weights (c). With these weights, we can use all 2D-3D correspondences for end-to-end differentiable least squares pose estimation. We use re-projection errors (b) to illustrate scene coordinate quality. Our weights select high-quality scene coordinates. A higher color temperature represents a higher value.
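To make the visualization in (b) concrete, the following is a minimal sketch of how such per-pixel reprojection errors can be computed; the function name, tensor shapes, and the world-to-camera pose convention are our assumptions, not code released with the paper.

```python
import torch

def reprojection_error(scene_coords, pixels, T, K):
    """Per-correspondence reprojection error in pixels, as visualized in (b).

    scene_coords: (N, 3) predicted 3D scene coordinates (world frame)
    pixels:       (N, 2) corresponding 2D pixel positions
    T:            (3, 4) world-to-camera pose [R|t]
    K:            (3, 3) camera intrinsics
    """
    n = scene_coords.shape[0]
    homo = torch.cat([scene_coords, torch.ones(n, 1)], dim=1)  # (N, 4)
    cam = homo @ T.T                                  # points in camera frame
    proj = cam @ K.T                                  # apply intrinsics
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-8) # perspective divide
    return (proj - pixels).norm(dim=1)                # (N,) pixel errors
```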
black box and suffer from low accuracy [33]. On the other hand, scene coordinate based methods learn pixel-wise 3D scene coordinates from RGB images and solve camera poses using 2D-3D correspondences by Perspective-n-Point (PnP) [24]. In order to handle outliers in estimated scene coordinates, the random sample consensus (RANSAC) [11] algorithm is usually used for robust fitting. Compared to the feed-forward APR paradigm, scene coordinate based methods achieve state-of-the-art performance on public camera re-localization datasets. However, RANSAC-based post-processing is an iterative procedure conducted on CPUs, which brings engineering challenges for efficient end-to-end training and inference [4,7].
In order to get the best of both worlds, we develop a new feed-forward method based upon the state-of-the-art (SOTA) pipeline DSAC* [7], thus enjoying the strong representation power of scene coordinates. We propose an alternative option to RANSAC that exploits all 3D scene coordinate estimates for weighted least squares pose regression (SC-wLS). The key to SC-wLS is a weight network that treats 2D-3D correspondences as 5D point clouds and learns weights that capture geometric patterns in this 5D space, with only pose supervision. Our learned weights can be used to interpret how much each scene coordinate contributes to the least squares solver.
Our SC-wLS estimates poses using only tensor operators on GPUs, which is similar to APR methods due to the feed-forward nature, but outperforms APR methods due to the usage of scene coordinates. Furthermore, we show that a self-supervised test-time adaptation step that updates the weight network can lead to further performance improvements. This is potentially useful in scenarios like a robot vacuum adapting to specific rooms during standby time. Although we focus on comparisons with APR methods, we also equip SC-wLS with the LM-Refine
post-processing module provided in DSAC* [7] to explore the limits of SC-wLS, and show that it outperforms the SOTA on the outdoor Cambridge dataset.
Our major contributions can be summarized as follows: (1) We propose a new feed-forward camera re-localization method, termed SC-wLS, that learns interpretable scene coordinate quality weights (as in Fig. 1) for weighted least squares pose estimation, with only pose supervision. (2) Our method combines the advantages of two paradigms. It exploits learnt 2D-3D correspondences while still allowing efficient end-to-end training and feed-forward inference in a principled manner. As a result, we achieve significantly better results than APR methods. (3) Our SC-wLS formulation allows test-time adaptation via self-supervised fine-tuning of the weight network with the photometric loss.
2 Related works
Camera re-localization. In the following, we discuss camera re-localization methods from the perspective of map representation.
Representing image databases with global descriptors like thumbnails [16], BoW [39], or learned features [1] is a natural choice for camera re-localization. By retrieving the poses of similar images, localization can be performed at extremely large scale [16,34]. Meanwhile, CNN-based absolute pose regression methods [22,42,8,49] belong to this category, since their final-layer embeddings are also learned global descriptors. They regress camera poses from single images in an end-to-end manner, and recent work primarily focuses on sequential inputs [48] and network structure enhancements [49,37,13]. Although the accuracy of this line of methods is generally low due to intrinsic limitations [33], they are usually compact and fast, enabling pose estimation in a single feed-forward pass.
Maps can also be represented by 3D point clouds [46] with associated 2D descriptors [26] via SfM tools [35]. Given a query image, feature matching establishes sparse 2D-3D correspondences and yields very accurate camera poses with RANSAC-PnP pose optimization [53,32]. The success of these methods heavily depends on the discriminativeness of features and the robustness of matching strategies. Inspired by feature based pipelines, scene coordinate regression learns a 2D-3D correspondence for each pixel, instead of using feature extraction and matching separately. The map is implicitly encoded into network parameters. [27] demonstrates impressive localization performance using stereo initialization and sequence inputs. Recently, [2] showed that the algorithm used to create pseudo ground truth has a significant impact on the relative ranking of the above methods.
Apart from random forest based methods using RGB-D inputs [38,40,28], scene coordinate regression on RGB images is seeing steady progress [4,5,6,7,56]. This line of work lays the foundation for our research. In this scheme, predicted scene coordinates are noisy due to single-view ambiguity and the domain gap during inference. As such, [4,5] use RANSAC and non-linear optimization to deal with outliers, and NG-RANSAC [6] learns correspondence-wise weights to guide RANSAC sampling. [6] conditions weights on RGB images, whose statistics are often influenced by factors like lighting, weather, or even exposure time. Object pose and room layout estimation [23,55,17,18,54,50] can also be addressed with similar representations [3,44].
[Fig. 2 diagram: (A) scene coordinate regression FCN → (B) order-aware filtering network with attention → (C) differentiable weighted least squares, supervised by the ground truth pose]
Fig. 2. The overview of SC-wLS. Firstly, a fully convolutional network (A) regresses pixel-wise scene coordinates from an input RGB image. Scene coordinate predictions are flattened to the shape of N×3, with N being the pixel count. We concatenate them with normalized N×2 2D pixel coordinates, forming N×5 correspondence inputs. The correspondences are fed into the weight learning network (B), producing N×1 weights indicating scene coordinate quality. The architecture of (B) is an order-aware filtering network [52] with graph attention modules [31]. Thirdly, correspondences and weights are sent into a differentiable weighted least squares layer (C), directly outputting camera poses. The scene coordinate ground truth is not used during training.
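As a sketch of how the N×5 input of network (B) can be assembled from the FCN output, consider the following; the function and variable names are hypothetical, and we assume the standard pinhole normalization of pixel coordinates by the inverse intrinsics, as described in Sec. 3.1.

```python
import torch

def build_correspondences(scene_coords, K):
    """Assemble N x 5 correspondences [x, y, z, u, v] for network (B).

    scene_coords: (H, W, 3) scene coordinate prediction from the FCN (A)
    K:            (3, 3) known camera intrinsics
    """
    H, W, _ = scene_coords.shape
    # One homogeneous 2D pixel coordinate per scene coordinate.
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32),
                          indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    # Normalize pixel coordinates with the inverse intrinsics.
    norm_uv = (pix @ torch.linalg.inv(K).T)[:, :2]   # (N, 2)
    sc = scene_coords.reshape(-1, 3)                 # (N, 3), flattened
    return torch.cat([sc, norm_uv], dim=1)           # (N, 5)
```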
Differentiable optimization. To enhance their compatibility with deep neural networks and learn result-oriented feature representations, some recent works focus on re-formulating geometric optimization techniques in an end-to-end trainable fashion, for various 3D vision tasks. [20] proposes several standard differentiable optimization layers. [51,30] propose to estimate the fundamental/essential matrix by solving weighted least squares problems with spectral layers. [12] further shows that the eigenvalue switching problem when solving least squares can be avoided by minimizing linear system residuals. [14] develops generic black-box differentiable optimization techniques with implicit declarative nodes.
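To make the residual idea attributed to [12] concrete, here is a minimal sketch under our own assumptions (the names and the exact loss form are illustrative, not taken from any of the cited papers): rather than backpropagating through an eigendecomposition, one penalizes the weighted linear-system residual of the known ground-truth solution vector.

```python
import torch

def residual_loss(X, w, e_gt):
    """Eigendecomposition-free training signal in the spirit of [12].

    X:    (2N, 12) coefficient matrix of the linear system X e = 0
    w:    (N,) predicted, non-negative correspondence weights
    e_gt: (12,) ground-truth solution vector (e.g., Vec(T) of the GT pose)
    """
    w2 = w.repeat_interleave(2)        # each correspondence yields 2 rows
    M = X.T @ (w2.unsqueeze(1) * X)    # X^T diag(w) X, a 12 x 12 matrix
    e = e_gt / e_gt.norm()
    # Zero iff e_gt lies in the null space of the weighted system;
    # gradients flow into w without any eigendecomposition.
    return e @ M @ e
```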
3 Method
Given an RGB image $I$, we aim to find an estimate of the absolute camera pose, consisting of a 3D translation and a 3D rotation, in the world coordinate system. Towards this goal, we exploit the scene coordinate representation. Specifically, for each pixel $i$ with position $p_i$ in an image, we predict the corresponding 3D scene coordinate $s_i$. As illustrated in Fig. 2, we propose an end-to-end trainable deep network that directly calculates global camera poses via weighted least squares. The method is named SC-wLS. Fig. 2-A is a standard fully convolutional network for scene coordinate regression, as used in former works [7]. Our innovation lies in Fig. 2-B/C, as elaborated below.
3.1 Formulation
Given learned 3D scene coordinates and corresponding 2D pixel positions, our goal is to determine the absolute poses of calibrated images taking all correspondences into account. These would inevitably include outliers, and we need to give them proper weights. Ideally, if all outliers are rejected by zero weights, calculating the absolute pose can be formulated as a linear least squares problem. Inspired by [51] (which solves an essential matrix problem instead), we use all of the $N$ 2D-3D correspondences as input and predict $N$ respective weights $w_i$ using a neural network (Fig. 2-B). $w_i$ indicates the uncertainty of each scene coordinate prediction. As such, the ideal least squares problem is turned into a weighted version for pose recovery.
Specifically, the input correspondence $c_i$ to Fig. 2-B is
$$c_i = [x_i, y_i, z_i, u_i, v_i] \quad (1)$$
where $x_i, y_i, z_i$ are the three components of the scene coordinate $s_i$, and $u_i, v_i$ denote the corresponding pixel position. $u_i, v_i$ are generated by normalizing $p_i$ with the known camera intrinsic matrix. The absolute pose is written as a transformation matrix $T \in \mathbb{R}^{3 \times 4}$. It projects scene coordinates to the camera plane as below:
$$\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = T \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix} = \begin{bmatrix} p_1 & p_2 & p_3 & p_4 \\ p_5 & p_6 & p_7 & p_8 \\ p_9 & p_{10} & p_{11} & p_{12} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix} \quad (2)$$
When $N > 6$, the transformation matrix $T$ can be recovered by the Direct Linear Transform (DLT) [15], which converts Eq. 2 into a linear system:
$$X \, \mathrm{Vec}(T) = 0 \quad (3)$$
$\mathrm{Vec}(T)$ is the vectorized $T$. $X \in \mathbb{R}^{2N \times 12}$ is a matrix whose $(2i-1)$-th and $2i$-th rows $X^{(2i-1)}, X^{(2i)}$ are as follows:
$$\begin{aligned} X^{(2i-1)} &= [x_i, y_i, z_i, 1, 0, 0, 0, 0, -u_i x_i, -u_i y_i, -u_i z_i, -u_i] \\ X^{(2i)} &= [0, 0, 0, 0, x_i, y_i, z_i, 1, -v_i x_i, -v_i y_i, -v_i z_i, -v_i] \end{aligned} \quad (4)$$
As such, pose estimation is formulated as a least squares problem. $\mathrm{Vec}(T)$ can be recovered by finding the eigenvector associated with the smallest eigenvalue of $X^\top X$.
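For concreteness, a minimal PyTorch sketch of this DLT step follows; the function names are ours and the paper's released code may differ. It builds the row pairs of $X$ per Eq. 4 and recovers $\mathrm{Vec}(T)$ from the eigenvector of $X^\top X$ with the smallest eigenvalue.

```python
import torch

def dlt_matrix(corr):
    """Stack the two Eq. 4 rows of each correspondence into X (2N x 12).

    corr: (N, 5) rows [x, y, z, u, v] as defined in Eq. 1
    """
    x, y, z, u, v = corr.unbind(dim=1)
    zero, one = torch.zeros_like(x), torch.ones_like(x)
    r1 = torch.stack([x, y, z, one, zero, zero, zero, zero,
                      -u * x, -u * y, -u * z, -u], dim=1)
    r2 = torch.stack([zero, zero, zero, zero, x, y, z, one,
                      -v * x, -v * y, -v * z, -v], dim=1)
    return torch.stack([r1, r2], dim=1).reshape(-1, 12)  # interleaved rows

def dlt_solve(X):
    """Vec(T) = eigenvector of X^T X associated with the smallest eigenvalue."""
    _, eigvecs = torch.linalg.eigh(X.T @ X)  # eigenvalues in ascending order
    return eigvecs[:, 0].reshape(3, 4)
```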
Note that in SC-wLS, each correspondence contributes differently according to $w_i$, so $X^\top X$ can be rewritten as $X^\top \mathrm{diag}(w) X$, and $\mathrm{Vec}(T)$ still corresponds to the eigenvector of its smallest eigenvalue. As the rotation matrix $R$ needs to be orthogonal and have determinant 1, we further refine the DLT results by the generalized Procrustes algorithm [36], which is also differentiable. More details about this post-processing step can be found in the supplementary material.
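Below is a minimal sketch of the weighted variant together with a simple SVD-based orthogonalization; we use a plain nearest-rotation projection as a stand-in for the generalized Procrustes refinement [36], whose exact form is given in the paper's supplementary material.

```python
import torch

def weighted_dlt_solve(X, w):
    """Smallest eigenvector of X^T diag(w) X, i.e., weighted least squares."""
    w2 = w.repeat_interleave(2)          # two rows per correspondence
    M = X.T @ (w2.unsqueeze(1) * X)
    _, eigvecs = torch.linalg.eigh(M)    # differentiable, runs on GPU
    return eigvecs[:, 0].reshape(3, 4)

def refine_pose(T):
    """Map the scale/sign-ambiguous DLT output to a valid pose.

    A simple stand-in for the generalized Procrustes step: remove the
    global scale, fix the sign, then project onto SO(3) via SVD.
    """
    _, S, _ = torch.linalg.svd(T[:, :3])
    T = T / S.mean()                     # remove the global scale of Vec(T)
    if torch.linalg.det(T[:, :3]) < 0:   # resolve the sign ambiguity
        T = -T
    U, _, Vh = torch.linalg.svd(T[:, :3])
    R = U @ Vh                           # nearest rotation (Frobenius norm)
    return R, T[:, 3]
```

Note that the sign of the eigenvector returned by the solver is arbitrary, which is why the determinant check above is needed before projecting onto SO(3).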