DeepMLE: A Robust Deep Maximum Likelihood Estimator for
Two-view Structure from Motion
Yuxi Xiao, Li Li, Xiaodi Li and Jian Yao
Abstract— Two-view structure from motion (SfM) is the
cornerstone of 3D reconstruction and visual SLAM (vSLAM).
Many existing end-to-end learning-based methods formulate it as a brute-force regression problem. However, the inadequate utilization of traditional geometric models makes these methods less robust in unseen environments. To improve the generalization capability and robustness of end-to-end two-view SfM networks, we formulate the two-view SfM problem as a
maximum likelihood estimation (MLE) and solve it with the
proposed framework, denoted as DeepMLE. First, we propose deep multi-scale correlation maps to depict the visual similarities of 2D image matches determined by the ego-motion. In
addition, in order to increase the robustness of our framework,
we formulate the likelihood function of the correlations of 2D
image matches as a Gaussian and Uniform mixture distribution
which takes the uncertainty caused by illumination changes,
image noise and moving objects into account. Meanwhile, an
uncertainty prediction module is presented to predict the pixel-
wise distribution parameters. Finally, we iteratively refine the
depth and relative camera pose using the gradient-like infor-
mation to maximize the likelihood function of the correlations.
Extensive experimental results on several datasets prove that
our method significantly outperforms the state-of-the-art end-
to-end two-view SfM approaches in accuracy and generalization
capability.
I. INTRODUCTION
Two-view structure from motion (SfM) aims to recover
the 3D structure (depth map) and motion (relative camera
pose) from two consecutive images. As the cornerstone of
3D reconstruction and visual simultaneous localization and
mapping (vSLAM), this fundamental technique is widely
applied in high-level tasks like autonomous driving [1],
augmented/virtual reality [2], and robotics [3], [4].
Traditional SfM methods usually follow the pipeline of
feature extraction [5], sparse 2D image matching [6], relative camera pose computation [7], [8], [9], and 3D triangulation [10]. The dense depth map is hard to attain in
this pipeline due to the sparsity of 2D image matches.
Therefore, direct methods have been proposed [11], [12], which simultaneously calculate a semi-dense depth map and relative camera pose by minimizing the photometric residuals between two images. However, both feature-based and direct methods struggle to remain robust under variable lighting conditions and in dynamic environments [13].
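The classical pipeline above (calibrated two-view matching, essential-matrix pose recovery, triangulation) can be sketched in NumPy. This is an illustrative implementation rather than any of the cited methods: feature extraction and matching are omitted, the linear eight-point algorithm stands in for the robust five-point estimators of [7], [8], [9], and all function names are our own.

```python
import numpy as np

def essential_from_matches(x1, x2):
    """Linear eight-point estimate of the essential matrix from
    calibrated (normalized) image coordinates x1, x2 of shape (N, 2)."""
    a1 = np.hstack([x1, np.ones((len(x1), 1))])
    a2 = np.hstack([x2, np.ones((len(x2), 1))])
    # Each match contributes one row of the constraint x2^T E x1 = 0.
    A = np.einsum('ni,nj->nij', a2, a1).reshape(len(x1), 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the essential manifold: two equal singular values, one zero.
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

def triangulate(P1, P2, x1, x2):
    """DLT triangulation of one match from two 3x4 projection matrices."""
    A = np.vstack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def pose_from_essential(E, x1, x2):
    """Pick the (R, t) among the four decompositions of E that places
    the triangulated points in front of both cameras (cheirality test)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    best = None
    for R in (U @ W @ Vt, U @ W.T @ Vt):
        for t in (U[:, 2], -U[:, 2]):
            P2 = np.hstack([R, t[:, None]])
            X = np.array([triangulate(P1, P2, a, b) for a, b in zip(x1, x2)])
            # Depth must be positive in both camera frames.
            depth2 = (R @ X.T + t[:, None])[2]
            n_front = np.sum((X[:, 2] > 0) & (depth2 > 0))
            if best is None or n_front > best[0]:
                best = (n_front, R, t, X)
    return best[1], best[2], best[3]
```

Note that the translation is recovered only up to scale; the sketch fixes the scale by returning a unit-norm t, which is exactly the scale ambiguity inherent to two-view SfM.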
To overcome these problems, many deep learning-based
two-view SfM methods have been proposed recently. In
general, learning-based methods can be divided into two
categories. The first class [14], [15], [16] focuses on explicitly learning to match, and the calculation of the depth map and the relative camera pose is left to traditional algorithms [10]. The second one tends to generate a dense monocular depth map and relative camera pose using an end-to-end network [17], [18], [19], [20], [21]. Compared to the first class, the latter benefits from end-to-end training and inference, and it is usually more efficient. However, developing a robust and generalizable end-to-end two-view SfM framework remains a challenging task [22]. Therefore, in this paper, we focus on end-to-end learning-based two-view SfM methods.

*Corresponding Author. The authors are with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei, P.R. China.
To improve the generalization capability and robustness of
end-to-end two-view SfM networks, we propose to integrate
maximum likelihood estimation (MLE) into our designed
architecture. First, we propose to learn the multi-scale corre-
lation volumes as a fine depiction for the correspondence
of pixels. Then, we formulate the likelihood function of
correlations of 2D image matches as a Gaussian and Uniform
mixture distribution, and an uncertainty module is designed
to predict the parameters of the distribution for each pixel.
In this mixture distribution, tiny visual differences or illumi-
nation changes can be approximated by the Gaussian distri-
bution, while the outliers of observations caused by extreme
illumination change, occlusions and moving objects can be
modeled by the Uniform distribution. Last, we solve this
MLE problem by a proposed deep network instead of using
traditional methods. The proposed deep network iteratively
searches the optimal depth map and relative camera pose by
maximizing the likelihood of observations. Thus, we do not
need to manually design regularizers or priors; our model can implicitly learn this information from labeled data. The
main contributions of our work can be summarized as:
• We propose to integrate MLE into DNNs to maximize the likelihood of the correlations of 2D image matches, where the likelihood function is modeled as a Gaussian and Uniform mixture distribution.
• We design an end-to-end DNN that iteratively searches for the optimal estimation to maximize the likelihood using gradient-like information, which solves the proposed MLE problem efficiently.
• The proposed DeepMLE achieves state-of-the-art performance and significantly outperforms existing end-to-end two-view SfM methods in both accuracy and generalization capability.
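The second contribution, iterative refinement driven by gradient-like information, can be illustrated with a toy loop that repeatedly moves the estimate uphill on the observation log-likelihood. In DeepMLE the update is predicted by a learned network; here a finite-difference gradient is substituted purely for illustration, and the function name `refine` is our own.

```python
import numpy as np

def refine(log_likelihood, theta0, lr=0.1, steps=200, eps=1e-5):
    """Toy iterative MLE refinement: repeatedly update the parameters
    (standing in for depth + pose) along a finite-difference gradient
    estimate of the observation log-likelihood."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = eps
            # Central finite difference along coordinate i.
            grad[i] = (log_likelihood(theta + d) - log_likelihood(theta - d)) / (2 * eps)
        theta = theta + lr * grad  # ascend the likelihood
    return theta
```

For a Gaussian likelihood around observations `[1, 2, 3]`, this loop converges to their mean, the classical MLE of a Gaussian location parameter.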
II. RELATED WORKS
A. Traditional Two-view SfM
arXiv:2210.05517v1 [cs.CV] 11 Oct 2022

Traditional methods commonly recover a sparse monocular depth map and relative camera pose from a set of 2D matched image points. Thus, the key of traditional methods is 2D image point matching. To this end, many hand-crafted and
distinctive descriptors (e.g. SIFT [23], ORB [6], SURF [5])
were proposed to find the 2D image matches. Although
these descriptors can attain accurate 2D image matches at keypoints, it is impractical to compute dense 2D image matches due to limitations in computational cost and time [6], [24]. To reduce the computational cost and generate
the semi-dense depth map, DSO [11] and LSD [12] employed
photometric bundle adjustment to recover the semi-dense
depth map and relative camera pose simultaneously. In addi-
tion, in order to alleviate the influence of photometric noise,
DTAM [25] proposed to minimize the photometric residuals
along with a TV-L1 regularization. However, photometric residuals may fail under illumination changes and in low-texture regions.
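The photometric residuals minimized by direct methods can be sketched as follows: each target pixel is back-projected with its depth, transformed by the relative pose, re-projected into the source image, and the intensity difference is taken. This is a simplified illustration (nearest-neighbour sampling, no robust norm or TV-L1 term); the function name and interface are assumptions, not the cited systems' code.

```python
import numpy as np

def photometric_residuals(I_t, I_s, depth, K, R, t):
    """Per-pixel photometric residuals between a target image I_t and a
    source image I_s, given a depth map for I_t, intrinsics K, and the
    relative pose (R, t) from the target to the source frame."""
    h, w = I_t.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix               # unit-depth back-projection
    X = rays * depth.reshape(1, -1)             # 3D points in the target frame
    Xs = R @ X + t[:, None]                     # points in the source frame
    ps = K @ Xs                                 # project into the source image
    us = np.round(ps[0] / ps[2]).astype(int)
    vs = np.round(ps[1] / ps[2]).astype(int)
    valid = (us >= 0) & (us < w) & (vs >= 0) & (vs < h) & (ps[2] > 0)
    r = np.zeros(h * w)
    # Intensity difference where the warped pixel lands inside the source.
    r[valid] = I_t.reshape(-1)[valid] - I_s[vs[valid], us[valid]]
    return r.reshape(h, w)
```

With the correct depth and pose these residuals vanish for a static, constant-illumination scene, which is exactly the photo-consistency assumption that illumination changes and low texture break.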
B. Learning-based Two-view SfM
Generally, existing learning-based methods can be divided into two types [22]: Partial Learning methods and Whole Learning methods. Partial Learning methods leverage the powerful representation ability of DNNs to explicitly learn to match, while most of the remaining calculations are performed by classic geometry methods. Whole Learning methods aim to develop a robust pipeline that is end-to-end during both training and inference.
Partial Learning methods. These methods use the DNNs
to extract deep features and explicitly learn the correspondences of pixels. Such correspondences can be represented as explicit matches, a featuremetric loss, or optical flow. Recent works [26], [16], [27] proposed to apply deep optical flow networks to find the 2D image matches; the relative camera pose can then be directly calculated from the matched points using classical algorithms [8], [9]. After that, an extra depth estimation module is applied in their frameworks to further regress the depth map. Specifically, Wang
et al. [26] proposed a scale-invariant depth module to recover
the depth map. In their module, the scale of predicted relative
camera pose is applied to guide the estimation of the depth
map. However, the works above estimate the relative camera pose and the depth in two separate stages. Mimicking traditional direct methods, BA-Net [21] applied a
DNN to extract dense and distinctive visual features to define
the featuremetric loss, and then simultaneously estimated a
monocular depth map and relative camera pose via bundle
adjustment.
Whole Learning methods. These methods attempt to find
an end-to-end mapping from two-view images to original
3D structure and motion. In general, these methods can be
categorized as self-supervised and supervised methods. The
self-supervised methods [20], [28], [19] usually apply
the photometric residuals to train the DNNs. In general, they
simply apply the U-Net-like convolutional network [29] to
directly regress the depth map and relative camera pose.
However, the photo-consistency assumption, which assumes
the photometric residuals of two matched points should be
small, may be violated by the illumination changes and
inherent image noise. In addition, it is difficult to minimize
the photometric residuals if the intensity gradient is low
and the texture information is poor. Yang et al. [30] esti-
mated the affine brightness transformation for two images
and photometric uncertainty for each pixel to alleviate the
influence of illumination changes. For supervised methods,
the ground truth is applied to train the network. LS-Net [17] applied an LSTM-RNN as an optimizer to minimize the photometric residuals iteratively, taking the Jacobian matrix computed by TensorFlow as the input to regress the update of the depth map and relative camera pose. However, the Jacobian matrix calculation is memory- and time-consuming. DeMoN [31] proposed a coarse-to-fine architecture
and alternately refined the optical flow, depth and motion.
DeepSFM [18] proposed to iteratively and alternately refine
the initial depth map and initial relative camera pose offered
by DeMoN [31]. In contrast, our DeepMLE requires neither an initial depth map nor an initial relative camera pose; moreover, our method benefits from MLE-based inference, which explicitly provides the likelihood of its estimates.
III. METHOD
In this section, we introduce how MLE inference is integrated into the network. To this end, the general maximum likelihood formulation is first introduced in Section III-A, which describes the MLE modelling as a whole. Subsequently, DeepMLE, which implements the MLE model with three sub-modules, is introduced in Section III-B. An overview of DeepMLE is shown in Fig. 1.
A. Maximum Likelihood Formula for Two-view SfM
In the two-view SfM problem, given two consecutive images, the target image $I_t$ and the source image $I_s$, the depth map $D$ of $I_t$ and the relative camera pose $T$ from $I_t$ to $I_s$ can be estimated by maximizing the likelihood of the correlations. In traditional methods, minimizing the photometric residuals also provides a maximum likelihood solution for the depth and relative camera pose estimation [11], [12]. To increase the robustness of the observations, we take the pixel-wise deep correlations as the observations for the estimation $(\hat{D}, \hat{T})$. Meanwhile, we use $C(I_t, I_s)$ (see Section III-B.1) to denote the domain of deep correlation observations, which stores the all-pair correlation values between the target image $I_t$ and the source image $I_s$. For each pixel-wise observation $c$, its value can be looked up in $C(I_t, I_s)$. The aim of our work is to find the optimal depth map $D$ and relative camera pose $T$ that maximize the likelihood of the observations. The definition is formulated as:

$$\{D, T\} = \arg\max_{\hat{D}, \hat{T}} \prod_{i=1}^{V} P(c_i \mid \hat{D}, \hat{T}), \tag{1}$$

where $V$ is the number of pixels in the target image $I_t$.
For each pixel in $I_t$, its correlation observation with its correspondence in $I_s$ can be approximated by a Gaussian distribution when only relatively tiny illumination changes happen, but it more likely follows a Uniform distribution when extreme illumination changes, occlusions, or moving objects occur.
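Under this mixture model, the per-pixel log-likelihood combines an inlier Gaussian term with an outlier Uniform term. The sketch below is an illustrative parameterization, not the paper's exact formulation: `alpha` stands for a per-pixel inlier probability and `c_range` for the width of the correlation interval covered by the Uniform component; in DeepMLE such distribution parameters would come from the uncertainty prediction module.

```python
import numpy as np

def mixture_log_likelihood(c, mu, sigma, alpha, c_range):
    """Log-likelihood of a correlation observation c under a
    Gaussian + Uniform mixture: with probability alpha the observation
    is an inlier (Gaussian around mu with std sigma); otherwise it is
    an outlier, uniform over an interval of width c_range."""
    gauss = np.exp(-0.5 * ((c - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    uniform = 1.0 / c_range
    return np.log(alpha * gauss + (1 - alpha) * uniform)
```

Summing this quantity over all pixels gives the log of the product in Eq. (1), so maximizing it over the depth map and relative pose is exactly the MLE objective; the Uniform floor keeps outlier pixels from dominating the objective, which is what makes the mixture robust.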