
photometric information, including lighting, camera properties, and reflectance properties, are themselves less robust features, especially for unconstrained images. Nowadays, several accurate landmark detection algorithms and methods are available in the literature, each following a different standard King [2009], Zhu and Ramanan [2012], Zou et al. [2020], Li et al. [2017].
Using simple features like landmarks, with efficient use of memory and computation, makes the solution suitable for latency- and energy-constrained embedded devices Chen et al. [2021], Wisth et al. [2021], where the objective is to push the computing closer to where the sensors gather data. The main requirement of such systems is therefore to keep the framework as light as possible.
Utilizing landmark features instead of the whole image opens up new horizons toward point-based statistical methods such as Multi-Dimensional Scaling (MDS) and its variants Kruskal [1978]. Formally, MDS refers to a set of statistical procedures used for exploratory data analysis and dimensionality reduction. It takes as input estimates of similarity among a group of items, or various “indirect” measurements (e.g., perceptual confusions), and the outcome is a “map” that conveys, spatially, the relationships among the items, wherein similar items are located close to one another and dissimilar items are located proportionately further apart Hout et al. [2013]. A detailed description of the MDS approach is given in Sec. 3; for more information, the reader is referred to Ghojogh et al. [2020], Hout et al. [2013]. In general, an MDS solution decomposes a dissimilarity matrix, computed from the input points, to find a mapping into a new space that preserves the configuration of the input points as much as possible Ghojogh et al. [2020].
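For concreteness, classical MDS amounts to double-centering the squared dissimilarities and taking the leading eigenvectors of the resulting matrix. The following minimal NumPy sketch is our own illustration of this step; the function and variable names are ours and do not come from the cited works:

```python
import numpy as np

def classical_mds(dissimilarity, out_dim):
    """Embed n items in `out_dim` dimensions from an n x n pairwise dissimilarity matrix."""
    n = dissimilarity.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J * D^2 * J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dissimilarity ** 2) @ J
    # Keep the largest (non-negative) eigenvalues and scale the eigenvectors accordingly
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:out_dim]
    scales = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scales  # n x out_dim configuration
```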
In this paper, the idea is to use the MDS approach for mapping a number of standard 2D landmarks on a human face,
in a single input image, into the corresponding 3D shape space, by recovering the landmarks’ 3D shape. The main
challenge here is that, to the best of our knowledge, the MDS approach is usually used to reduce the dimensions, not
increase them. In fact, the reason why MDS reduces the dimensionality is that the double-centered matrix it decomposes, built from the usual dissimilarity, namely the Euclidean distances among the D-dimensional input points, has rank bounded by D; therefore, MDS can map the data to at most D dimensions, never more than the dimensionality of the input points.
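This limitation can be checked numerically: with plain Euclidean distances among 2D points, the double-centered matrix decomposed by classical MDS has only two non-negligible eigenvalues, so no genuinely three-dimensional configuration can be extracted. A small illustrative check of our own, with an arbitrary landmark count of 68:

```python
import numpy as np

rng = np.random.default_rng(0)
pts_2d = rng.normal(size=(68, 2))                  # e.g., 68 facial landmarks in 2D
dist_2d = np.linalg.norm(pts_2d[:, None] - pts_2d[None], axis=-1)

n = len(pts_2d)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist_2d ** 2) @ J                  # matrix decomposed by classical MDS
eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]
print(eigvals[:4])                                 # only the first two are non-negligible,
                                                   # so a third dimension cannot be recovered
```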
Instead of using the Euclidean distance among the input 2D landmarks in the MDS approach, consider using a dissimilarity matrix that estimates the 3D Euclidean distance among the landmarks on the corresponding 3D shape. If such a dissimilarity can be learned, the MDS approach can appropriately recover the 3D shape of the input 2D landmarks.
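As a sanity check of this idea (our own illustration, not an experiment from the paper), feeding the exact 3D pairwise distances of a synthetic shape into the classical_mds sketch above reproduces the shape up to a rigid transform and reflection:

```python
import numpy as np

rng = np.random.default_rng(1)
shape_3d = rng.normal(size=(68, 3))                       # hypothetical ground-truth 3D landmarks
dist_3d = np.linalg.norm(shape_3d[:, None] - shape_3d[None], axis=-1)

# Recover a 3D configuration from the 3D distances alone (classical_mds from the sketch above)
recovered = classical_mds(dist_3d, out_dim=3)

# The recovery is defined only up to rotation, translation, and reflection,
# so compare pairwise distances instead of raw coordinates
rec_dist = np.linalg.norm(recovered[:, None] - recovered[None], axis=-1)
print(np.abs(rec_dist - dist_3d).max())                   # ~0 up to numerical precision
```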
We propose to learn such a dissimilarity using a deep learning framework that takes a pair of 2D points on a single image of a human face as input and outputs an estimate of the 3D Euclidean distance between their corresponding 3D locations. The proposed deep dissimilarity thus has a small number of trainable parameters.
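The exact architecture is not specified in this section; the PyTorch sketch below shows one plausible way such a symmetric pairwise regressor could look, assuming the inputs are the 2D coordinates of the two landmarks. All layer sizes and names are our own assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class PairwiseDistanceNet(nn.Module):
    """Hypothetical symmetric regressor: two 2D landmarks -> estimated 3D Euclidean distance."""

    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Softplus())  # distances are non-negative

    def forward(self, p, q):
        # Shared encoder plus summation makes the output invariant to the pair order,
        # i.e., d(p, q) == d(q, p), as required of a symmetric dissimilarity.
        h = self.encoder(p) + self.encoder(q)
        return self.head(h).squeeze(-1)

# Hypothetical usage: p, q are (batch, 2) tensors of 2D landmark coordinates;
# the regression target is the 3D distance between their corresponding 3D points.
model = PairwiseDistanceNet()
loss_fn = nn.MSELoss()
```

Sharing the encoder and summing the two embeddings is one simple way to enforce the order invariance that an MDS dissimilarity requires.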
From another viewpoint, recovering the 3D shape of a set of 2D landmarks in a single image is an ill-posed problem, which requires imposing constraints on the solution space so that only feasible solutions are obtained Engl and Groetsch [2014]. In the case of 3D reconstruction from a single image, a usual way is to use a 3D model to constrain the resulting 3D shape to be a feasible human face, as defined by the model bfm [2009]. This type of constraint comes with the drawback of biasing the solutions toward the mean shape of the used 3D model Aldrian and Smith [2012]. In our proposed framework, instead, the needed constraints on the solution space are provided by the deep learning components and the MDS approach, which do not bias the found solutions toward a specific region of the solution space.
To handle different types of image formation processes, we use an autoencoder that turns any input projection type into a canonical profile view, so that the deep learning dissimilarity can be trained without confusion. This is because we empirically observed that self-occlusion, caused by head pose or complex projection types such as perspective, degrades the performance of the deep neural network that learns the dissimilarity in our framework. For human faces, a simple autoencoder can appropriately turn the input view into such a profile view.
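As a rough sketch of how such a view-normalizing autoencoder could be set up, assuming it operates directly on the flattened 2D landmark coordinates and is trained to regress the same landmarks re-expressed in a profile view (layer sizes and names are our own assumptions, not the paper's):

```python
import torch
import torch.nn as nn

class ViewNormalizingAutoencoder(nn.Module):
    """Hypothetical autoencoder mapping landmarks from an arbitrary projection/pose to a profile view."""

    def __init__(self, num_landmarks=68, latent=32):
        super().__init__()
        d = num_landmarks * 2
        self.encoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(),
                                     nn.Linear(128, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                     nn.Linear(128, d))

    def forward(self, landmarks):                     # (batch, num_landmarks, 2)
        flat = landmarks.flatten(start_dim=1)
        profile = self.decoder(self.encoder(flat))
        return profile.view_as(landmarks)             # landmarks re-expressed in the profile view
```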
Therefore, our contributions in this paper include:
• using the MDS approach to increase the dimensions, for the first time;
• proposing a deep learning symmetric dissimilarity measure to be used in the MDS approach for estimating the 3D shape of 2D landmarks in a single image;
• proposing a low-parameter, unbiased deep learning framework for 3D recovery of the landmark locations in a single image, independent of the type of 2D projection.
The rest of this paper is organized as follows: Section 2 reviews the 3D reconstruction methods related to ours, along with their advantages and drawbacks. In Section 3, we present our proposed method and the concepts needed to understand it. We then report the experimental evaluations in Section 4, followed by the conclusions and future work in Section 5.