SC-wLS: Towards Interpretable Feed-forward Camera Re-localization
post-processing module provided in DSAC* [7] to explore the limits of SC-wLS,
and show that it outperforms the state of the art on the outdoor Cambridge dataset.
Our major contributions can be summarized as follows: (1) We propose a
new feed-forward camera re-localization method, termed SC-wLS, that learns
interpretable scene coordinate quality weights (as in Fig. 1) for weighted least
squares pose estimation, with only pose supervision. (2) Our method combines
the advantages of two paradigms. It exploits learnt 2D-3D correspondences while
still allowing efficient end-to-end training and feed-forward inference in a principled
manner. As a result, we achieve significantly better results than APR methods.
(3) Our SC-wLS formulation allows test-time adaptation via self-supervised fine-
tuning of the weight network with the photometric loss.
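To make the weighted least squares formulation concrete, the following is a minimal numpy sketch (a hypothetical illustration, not the paper's implementation): each 2D-3D correspondence contributes two rows to a DLT system, and each row is scaled by the square root of that correspondence's quality weight before solving for the projection matrix.

```python
import numpy as np

def weighted_dlt(pts2d, pts3d, w):
    """Weighted least squares pose: find the 3x4 projection matrix P that
    minimizes sum_i w_i * ||algebraic residual_i||^2, via SVD of a
    row-weighted DLT system. Hypothetical sketch, not the SC-wLS network."""
    rows = []
    for (u, v), X, wi in zip(pts2d, pts3d, w):
        Xh = np.append(X, 1.0)  # homogeneous 3D scene coordinate
        s = np.sqrt(wi)         # sqrt(w_i) so squared residuals carry w_i
        rows.append(s * np.concatenate([Xh, np.zeros(4), -u * Xh]))
        rows.append(s * np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)  # null-space vector, defined up to scale
```

Because rows of unreliable correspondences are scaled down, a network predicting w can suppress noisy scene coordinates softly, without hard inlier/outlier decisions, which is what makes the estimator differentiable end-to-end.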
2 Related work
Camera re-localization. In the following, we discuss camera re-localization
methods from the perspective of map representation.
Representing image databases with global descriptors like thumbnails [16],
BoW [39], or learned features [1] is a natural choice for camera re-localization.
By retrieving the poses of similar images, localization can be performed at
extremely large scale [16,34]. Meanwhile, CNN-based absolute pose regression methods
[22,42,8,49] belong to this category, since their final-layer embeddings are also
learned global descriptors. They regress camera poses from single images in an
end-to-end manner, and recent work primarily focuses on sequential inputs [48]
and network structure enhancement [49,37,13]. Although the accuracy of this
line of methods is generally low due to intrinsic limitations [33], they are usually
compact and fast, enabling pose estimation in a single feed-forward pass.
Maps can also be represented as a 3D point cloud [46] with associated 2D
descriptors [26], built via SfM tools [35]. Given a query image, feature matching
establishes sparse 2D-3D correspondences and yields very accurate camera poses with
RANSAC-PnP pose optimization [53,32]. The success of these methods heavily
depends on the discriminative power of the features and the robustness of the
matching strategies. Inspired by feature-based pipelines, scene coordinate regression learns
a 2D-3D correspondence for each pixel, instead of using feature extraction and
matching separately. The map is implicitly encoded into network parameters.
[27] demonstrates impressive localization performance using stereo initialization
and sequence input. Recently, [2] shows that the algorithm used to create pseudo
ground truth has a significant impact on the relative ranking of the above methods.
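As a concrete illustration of the RANSAC-PnP step used by feature-based pipelines, here is a minimal hypothesize-and-verify sketch in numpy. The helper names are hypothetical, and a simple direct linear transform stands in for the PnP solvers of [53,32]:

```python
import numpy as np

def dlt_pnp(pts2d, pts3d):
    """Minimal DLT: fit a 3x4 projection matrix from >= 6 2D-3D pairs."""
    A = []
    for (u, v), X in zip(pts2d, pts3d):
        Xh = np.append(X, 1.0)
        A.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        A.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)  # null-space vector, up to scale

def reproj_err(P, pts2d, pts3d):
    """Per-correspondence reprojection error (same units as pts2d)."""
    Xh = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    uvw = (P @ Xh.T).T
    return np.linalg.norm(uvw[:, :2] / uvw[:, 2:3] - pts2d, axis=1)

def ransac_pnp(pts2d, pts3d, iters=200, thresh=0.05, seed=0):
    """Hypothesize-and-verify: fit minimal 6-point models, score them by
    reprojection inliers, then refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        idx = rng.choice(len(pts2d), 6, replace=False)
        P = dlt_pnp(pts2d[idx], pts3d[idx])
        inliers = reproj_err(P, pts2d, pts3d) < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    return dlt_pnp(pts2d[best], pts3d[best]), best
```

The contrast with SC-wLS is the verification loop: RANSAC makes hard, sampled inlier/outlier decisions, whereas learned per-correspondence weights replace that loop with a single differentiable solve.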
Apart from random-forest-based methods using RGB-D inputs [38,40,28],
scene coordinate regression on RGB images is seeing steady progress [4,5,6,7,56].
This line of work lays the foundation for our research. In this scheme, predicted
scene coordinates are noisy due to single-view ambiguity and the domain gap at
inference time. As such, [4,5] use RANSAC and non-linear optimization to deal
with outliers, and NG-RANSAC [6] learns correspondence-wise weights to guide
RANSAC sampling. [6] conditions weights on RGB images, whose statistics are
often influenced by factors like lighting, weather, or even exposure time. Object