DeepMLE A Robust Deep Maximum Likelihood Estimator for Two-view Structure from Motion Yuxi Xiao Li Li Xiaodi Li and Jian Yao

2025-04-24 2 0 2.13MB 8 页 10玖币

侵权投诉

DeepMLE: A Robust Deep Maximum Likelihood Estimator for

Two-view Structure from Motion

Yuxi Xiao, Li Li, Xiaodi Li and Jian Yao∗

Abstract— Two-view structure from motion (SfM) is the

cornerstone of 3D reconstruction and visual SLAM (vSLAM).

Many existing end-to-end learning-based methods usually for-

mulate it as a brute regression problem. However, the inad-

equate utilization of traditional geometry model makes the

model not robust in unseen environments. To improve the

generalization capability and robustness of end-to-end two-view

SfM network, we formulate the two-view SfM problem as a

maximum likelihood estimation (MLE) and solve it with the

proposed framework, denoted as DeepMLE. First, we propose

to take the deep multi-scale correlation maps to depict the visual

similarities of 2D image matches decided by ego-motion. In

addition, in order to increase the robustness of our framework,

we formulate the likelihood function of the correlations of 2D

image matches as a Gaussian and Uniform mixture distribution

which takes the uncertainty caused by illumination changes,

image noise and moving objects into account. Meanwhile, an

uncertainty prediction module is presented to predict the pixel-

wise distribution parameters. Finally, we iteratively reﬁne the

depth and relative camera pose using the gradient-like infor-

mation to maximize the likelihood function of the correlations.

Extensive experimental results on several datasets prove that

our method signiﬁcantly outperforms the state-of-the-art end-

to-end two-view SfM approaches in accuracy and generalization

capability.

I. INTRODUCTION

Two-view structure from motion (SfM) aims to recover

the 3D structure (depth map) and motion (relative camera

pose) from two consecutive images. As the cornerstone of

3D reconstruction and visual simultaneous localization and

mapping (vSLAM), this fundamental techniques is widely

applied in high-level tasks like autonomous driving [1],

augmented/virtual reality [2], and robotics [3], [4].

Traditional SfM methods usually follow the pipeline of

features extraction [5], sparse 2D image matching [6], rel-

ative camera pose computation [7], [8], [9] and 3D trian-

gulation [10]. The dense depth map is hard to attain in

this pipeline due to the sparsity of 2D image matches.

Therefore, the direct methods have been proposed [11], [12],

which simultaneously calculate a semi-dense depth map and

relative camera pose by minimizing the photometric residuals

between two images. However, whatever the feature-based or

the direct methods are hard to remain robust in the lighting

variable condition and dynamic environments [13].

To overcome these problems, many deep learning-based

two-view SfM methods have been proposed recently. In

general, learning-based methods can be divided into two

categories. The ﬁrst class [14], [15], [16] focuses on ex-

plicitly learning to match, and the calculation for the depth

*Corresponding Author.

The authors are with the School of School of Remote Sensing and

Information Engineering, Wuhan University, Wuhan, Hubei, P.R. China.

map and the relative camera pose is left to traditional

algorithms [10]. The second one tends to generate a dense

monocular depth map and relative camera pose using an end-

to-end network [17], [18], [19], [20], [21]. Compared to the

ﬁrst class, the latter beneﬁts from the end-to-end training

and inference, and it is usually more efﬁcient. However,

developing a robust and generalizable, end-to-end two-view

SfM framework remains a challenging task [22]. Therefore,

in this paper, we focus on the discussion of the end-to-end

learning-based two-view SfM methods.

To improve the generalization capability and robustness of

end-to-end two-view SfM networks, we propose to integrate

maximum likelihood estimation (MLE) into our designed

architecture. First, we propose to learn the multi-scale corre-

lation volumes as a ﬁne depiction for the correspondence

of pixels. Then, we formulate the likelihood function of

correlations of 2D image matches as a Gaussian and Uniform

mixture distribution, and an uncertainty module is designed

to predict the parameters of the distribution for each pixel.

In this mixture distribution, tiny visual differences or illumi-

nation changes can be approximated by the Gaussian distri-

bution, while the outliers of observations caused by extreme

illumination change, occlusions and moving objects can be

modeled by the Uniform distribution. Last, we solve this

MLE problem by a proposed deep network instead of using

traditional methods. The proposed deep network iteratively

searches the optimal depth map and relative camera pose by

maximizing the likelihood of observations. Thus, we do not

need to manually design the regularizers or priors. Our model

can implicitly learn these information from labeled data. The

main contributions of our work can be summarized as:

•We propose to integrate the MLE into the DNNs to

maximize the likelihood of correlations of 2D image

matches, where the likelihood function is modeled as

the Gaussian and Uniform mixture distribution.

•We design an end-to-end DNN to iteratively search the

optimal estimation to maximize the likelihood using the

gradient-like information, which solves the proposed

MLE problem efﬁciently.

•The proposed DeepMLE achieves state-of-the-art per-

formance and signiﬁcantly outperforms existing end-

to-end two-view SfM methods in both accuracy and

generalization capability.

II. RELATED WORKS

A. Traditional Two-view SfM

Traditional methods commonly recover a sparse monocu-

lar depth map and relative camera pose by a set of 2D image

arXiv:2210.05517v1 [cs.CV] 11 Oct 2022

matched points. Thus, the key of traditional methods is 2D

images points matching. To this end, a lot of hand-crafted and

distinctive descriptors (e.g. SIFT [23], ORB [6], SURF [5])

were proposed to ﬁnd the 2D image matches. Although

these descriptors are enabled to attain accurate 2D image

matches for keypoints, it is impractical to calculate dense 2D

image matches for the limitations of computational cost and

time [6], [24]. To reduce the computational cost and generate

the semi-dense depth map, DSO [11] and LSD[12] employed

photometric bundle adjustment to recover the semi-dense

depth map and relative camera pose simultaneously. In addi-

tion, in order to alleviate the inﬂuence of photometric noise,

DTAM [25] proposed to minimize the photometric residuals

along with a TV-L1 regularization. However, the photometric

residuals may fail in the conditions of illumination changes

and less texture.

B. Learning-based Two-view SfM

Generally, existing learning-based methods can be divided

into two types [22] including Partial Learning methods

and Whole Learning methods.Partial Learning methods

leverage the powerful representation ability of DNNs

to explicitly learn to match, and the most of rest

calculations are ﬁnished by classic geometry methods.

Whole Learning methods pursue to develop a robust

pipeline that is end-to-end during the training and the in-

ference.

Partial Learning methods. These methods use the DNNs

to extract deep features and explicitly learn the correspon-

dences of pixels. Such correspondences can be represented

as the explicit matches, featuremetric loss or the optical

ﬂow. Recent works [26], [16], [27] proposed to apply the

deep optical ﬂow network to ﬁnd the 2D image matches,

the relative camera pose can be directly calculated from

the matched points using classical algorithms [8], [9]. After

that, an extra depth estimation module is applied in their

framework to further regress the depth map. Specially, Wang

et al. [26] proposed a scale-invariant depth module to recover

the depth map. In their module, the scale of predicted relative

camera pose is applied to guide the estimation of the depth

map. However, the works above can only estimate the relative

camera pose and the depth with two separate stages. Simu-

lating the traditional direct methods, BA-Net [21] applied a

DNN to extract dense and distinctive visual features to deﬁne

the featuremetric loss, and then simultaneously estimated a

monocular depth map and relative camera pose via bundle

adjustment.

Whole Learning methods. These methods attempt to ﬁnd

an end-to-end mapping from two-view images to original

3D structure and motion. In general, these methods can be

categorized as self-supervised and supervised methods. The

self-supervised methods [20], [28], [19], [19] usually apply

the photometric residuals to train the DNNs. In general, they

simply apply the U-Net-like convolutional network [29] to

directly regress the depth map and relative camera pose.

However, the photo-consistency assumption, which assumes

the photometric residuals of two matched points should be

small, may be violated by the illumination changes and

inherent image noise. In addition, it is difﬁcult to minimize

the photometric residuals if the intensity gradient is low

and the texture information is poor. Yang et al. [30] esti-

mated the afﬁne brightness transformation for two images

and photometric uncertainty for each pixel to alleviate the

inﬂuence of illumination changes. For supervised methods,

the ground truth is applied to train the network. LS-Net [17]

applied the LSTM-RNN as an optimizer to minimize the

photometric residuals iteratively, and they took the Jacobian

matrix computed by Tensorﬂow as the input to regress the

update of the depth map and relative camera pose. But

the Jacobian matrix calculation is memory and time con-

suming. DeMoN [31] proposed a coarse-to-ﬁne architecture

and alternately reﬁned the optical ﬂow, depth and motion.

DeepSFM [18] proposed to iteratively and alternately reﬁne

the initial depth map and initial relative camera pose offered

by DeMoN [31]. Differently, our DeepMLE does not require

the initial depth or relative camera pose, and meanwhile,

our method is beneﬁcial from MLE-based inference which

clearly show the likelihood probabilistic of estimations.

III. METHOD

In this section, we are going to introduce how the MLE

inference is integrated into the networks. To this end, the

general maximum likelihood formula is introduced in Sec-

tion III-A ﬁrstly, which demonstrates the process of MLE

modelling as a whole. Subsequently, our DeepMLE is intro-

duced to implement the MLE model by three sub-modules in

Section III-B. The overview of DeepMLE is shown in Fig. 1.

A. Maximum Likelihood Formula for Two-view SfM

In the two-view SfM problem, given two consecutive im-

ages, the target image Itand the source image Is, the depth

map Dof Itand relative camera pose Tfrom Itto Iscan be

estimated by maximizing the likelihood of the correlations.

In traditional methods, minimizing the photometric residuals

also provide a maximum likelihood solution for the depth

and relative camera pose estimation [11], [12]. In order to

increase the robustness of the observations, we take the pixel-

wise deep correlations as the observations for estimation

D,b

T). Meanwhile, we use C(It, Is)(see Section III-B.1)

to denote the domain of deep correlation observations, which

stores the all-pair correlation values in the target image It

and the source image Is. For each pixel-wise observation

c, its value can be looked-up in C(It, Is). The aim of our

work is to ﬁnd the optimal depth map Dand relative camera

pose Tthat can maximize the likelihood of observations. The

deﬁnition is formulated as:

{D,T}= arg max

D,b

i=1 P(ci|b

D,b

T),(1)

where Vis the pixel number of the target image It.

For each pixel in It, its correlation observation with its

correspondence in Iscan be approximated by a Gaussian

distribution when only relatively tiny illumination changes

happen, but it more likely follow a Uniform distribution if the

extreme illumination changes, occlusions and objects moving

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

DeepMLE:ARobustDeepMaximumLikelihoodEstimatorforTwo-viewStructurefromMotionYuxiXiao,LiLi,XiaodiLiandJianYaoAbstractTwo-viewstructurefrommotion(SfM)isthecornerstoneof3DreconstructionandvisualSLAM(vSLAM).Manyexistingend-to-endlearning-basedmethodsusuallyfor-mulateitasabruteregressionproblem.However,...

展开>> 收起<<

DeepMLE A Robust Deep Maximum Likelihood Estimator for Two-view Structure from Motion Yuxi Xiao Li Li Xiaodi Li and Jian Yao.pdf

共8页,预览2页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

DeepMLE A Robust Deep Maximum Likelihood Estimator for Two-view Structure from Motion Yuxi Xiao Li Li Xiaodi Li and Jian Yao

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: