matched points. Thus, the key step of traditional methods is matching 2D image points. To this end, many hand-crafted and distinctive descriptors (e.g., SIFT [23], ORB [6], SURF [5]) were proposed to find 2D image matches. Although these descriptors can attain accurate 2D matches for keypoints, computing dense 2D image matches is impractical due to the limitations of computational cost and time [6], [24]. To reduce the computational cost and generate a semi-dense depth map, DSO [11] and LSD [12] employed photometric bundle adjustment to recover the semi-dense depth map and the relative camera pose simultaneously. In addition, to alleviate the influence of photometric noise, DTAM [25] proposed to minimize the photometric residuals together with a TV-L1 regularization. However, photometric residuals may fail under illumination changes and in low-texture regions.
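Schematically, such an objective combines a photometric data term with a TV-L1 smoothness prior (our simplified notation, not the exact formulation of [25]):
\begin{equation*}
E(D, T) = \sum_{\mathbf{p}} \left| I_t(\mathbf{p}) - I_s\!\left(\pi(\mathbf{p}, D, T)\right) \right| + \lambda \sum_{\mathbf{p}} \left\| \nabla D(\mathbf{p}) \right\|_1,
\end{equation*}
where $\pi(\mathbf{p}, D, T)$ reprojects pixel $\mathbf{p}$ into the source view given depth $D$ and pose $T$, and $\lambda$ balances the regularizer.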
B. Learning-based Two-view SfM
Generally, existing learning-based methods can be divided into two types [22]: Partial Learning methods and Whole Learning methods. Partial Learning methods leverage the powerful representation ability of DNNs to explicitly learn matches, while most of the remaining computation is performed by classical geometric methods. Whole Learning methods pursue a robust pipeline that is end-to-end in both training and inference.
Partial Learning methods. These methods use DNNs to extract deep features and explicitly learn pixel correspondences. Such correspondences can be represented as explicit matches, a featuremetric loss, or optical flow. Recent works [26], [16], [27] proposed to apply deep optical flow networks to find 2D image matches, from which the relative camera pose can be directly calculated using classical algorithms [8], [9]. An extra depth estimation module is then applied in their frameworks to regress the depth map. Specifically, Wang et al. [26] proposed a scale-invariant depth module to recover the depth map, in which the scale of the predicted relative camera pose guides the depth estimation. However, the works above can only estimate the relative camera pose and the depth in two separate stages. Emulating traditional direct methods, BA-Net [21] applied a DNN to extract dense and distinctive visual features that define a featuremetric loss, and then simultaneously estimated a monocular depth map and the relative camera pose via bundle adjustment.
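Schematically, a featuremetric objective of this kind replaces raw intensities with learned features (again our simplified notation; BA-Net's full formulation also learns the optimizer's damping and a basis of depth maps):
\begin{equation*}
E(\hat{D}, \hat{T}) = \sum_{\mathbf{p}} \left\| F_t(\mathbf{p}) - F_s\!\left(\pi(\mathbf{p}, \hat{D}, \hat{T})\right) \right\|_2^2,
\end{equation*}
where $F_t$ and $F_s$ are dense feature maps of the target and source images extracted by the DNN.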
Whole Learning methods. These methods attempt to find an end-to-end mapping from two-view images to the original 3D structure and motion. In general, they can be categorized as self-supervised or supervised methods. The self-supervised methods [20], [28], [19] usually apply photometric residuals to train the DNNs. In general, they simply apply a U-Net-like convolutional network [29] to directly regress the depth map and relative camera pose.
However, the photo-consistency assumption, which requires the photometric residuals of two matched points to be small, may be violated by illumination changes and inherent image noise. In addition, the photometric residuals are difficult to minimize where the intensity gradient is low and the texture information is poor. Yang et al. [30] estimated an affine brightness transformation between the two images and a photometric uncertainty for each pixel to alleviate the influence of illumination changes.
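For concreteness, a minimal sketch of such a warping-based photometric residual is given below (the helper names are hypothetical; real self-supervised pipelines such as [20], [28] add masking, multi-scale terms, and SSIM):

```python
import torch
import torch.nn.functional as F

def photometric_loss(img_t, img_s, depth_t, pose_t2s, K):
    """L1 photometric residual between the target image and the source
    image warped into the target view.

    img_t, img_s: (B, 3, H, W) images; depth_t: (B, 1, H, W) target depth;
    pose_t2s: (B, 4, 4) relative pose from target to source; K: (B, 3, 3).
    `backproject` and `project` are assumed helpers that lift target pixels
    to 3D using depth_t and reproject them into the source view, yielding
    sampling coordinates normalized to [-1, 1].
    """
    points_3d = backproject(depth_t, K)       # (B, 3, H*W), hypothetical helper
    grid = project(points_3d, pose_t2s, K)    # (B, H, W, 2), hypothetical helper
    # Differentiably sample the source image at the reprojected locations.
    img_s_warped = F.grid_sample(img_s, grid, align_corners=True)
    return (img_t - img_s_warped).abs().mean()
```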
the ground truth is applied to train the network. LS-Net [17]
applied the LSTM-RNN as an optimizer to minimize the
photometric residuals iteratively, and they took the Jacobian
matrix computed by Tensorflow as the input to regress the
update of the depth map and relative camera pose. But
the Jacobian matrix calculation is memory and time con-
suming. DeMoN [31] proposed a coarse-to-fine architecture
and alternately refined the optical flow, depth and motion.
DeepSFM [18] proposed to iteratively and alternately refine
the initial depth map and initial relative camera pose offered
by DeMoN [31]. In contrast, our DeepMLE requires neither an initial depth map nor an initial relative camera pose; moreover, our method benefits from MLE-based inference, which explicitly exposes the likelihood probabilities of its estimates.
III. METHOD
In this section, we introduce how MLE inference is integrated into the network. To this end, the general maximum likelihood formulation is first introduced in Section III-A, which describes the MLE modelling process as a whole. Subsequently, in Section III-B, our DeepMLE is introduced, which implements the MLE model with three sub-modules. An overview of DeepMLE is shown in Fig. 1.
A. Maximum Likelihood Formula for Two-view SfM
In the two-view SfM problem, given two consecutive images, the target image $I_t$ and the source image $I_s$, the depth map $D$ of $I_t$ and the relative camera pose $T$ from $I_t$ to $I_s$ can be estimated by maximizing the likelihood of the correlations. In traditional methods, minimizing the photometric residuals also provides a maximum likelihood solution for the depth and relative camera pose estimation [11], [12]. To increase the robustness of the observations, we take the pixel-wise deep correlations as the observations for the estimates
$(\hat{D}, \hat{T})$. Meanwhile, we use $C(I_t, I_s)$ (see Section III-B.1) to denote the domain of deep correlation observations, which stores the all-pair correlation values between the target image $I_t$ and the source image $I_s$. For each pixel-wise observation $c$, its value can be looked up in $C(I_t, I_s)$. The aim of our work is to find the optimal depth map $D$ and relative camera pose $T$ that maximize the likelihood of the observations. The definition is formulated as:
\begin{equation}
\{D, T\} = \arg\max_{\hat{D}, \hat{T}} \prod_{i=1}^{V} P\left(c_i \mid \hat{D}, \hat{T}\right),
\end{equation}
where $V$ is the number of pixels in the target image $I_t$.
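As a concrete illustration, a minimal sketch of how an all-pair correlation volume such as $C(I_t, I_s)$ can be computed from dense feature maps is shown below (the shapes and names are our assumptions; the actual construction is described in Section III-B.1):

```python
import torch

def all_pair_correlation(feat_t: torch.Tensor, feat_s: torch.Tensor) -> torch.Tensor:
    """Compute an all-pair correlation volume between two feature maps.

    feat_t, feat_s: (C, H, W) dense features of the target/source image.
    Returns: (H, W, H, W) volume whose entry [i, j, u, v] is the dot-product
    similarity between target pixel (i, j) and source pixel (u, v).
    """
    C, H, W = feat_t.shape
    # Sum over the channel dimension for every pair of pixel locations.
    corr = torch.einsum("chw,cuv->hwuv", feat_t, feat_s)
    # Scale by sqrt(C) for numerical stability, as is common practice.
    return corr / C**0.5
```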
For each pixel in $I_t$, the correlation observation with its correspondence in $I_s$ can be approximated by a Gaussian distribution when only relatively small illumination changes occur, but it is more likely to follow a uniform distribution under extreme illumination changes, occlusions, and object motion.
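One natural way to combine these two cases, written in our notation as an illustration rather than the paper's exact model, is a per-pixel Gaussian-Uniform mixture:
\begin{equation*}
P\left(c_i \mid \hat{D}, \hat{T}\right) = w_i \, \mathcal{N}\!\left(c_i;\, \mu_i, \sigma_i^2\right) + (1 - w_i) \, \mathcal{U}\!\left(c_{\min}, c_{\max}\right),
\end{equation*}
where $w_i \in [0, 1]$ is an inlier weight, the Gaussian component models the correlation of a well-matched pixel, and the uniform component absorbs outliers caused by occlusions, moving objects, or severe illumination changes.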