DEEP-MDS FRAMEWORK FOR RECOVERING THE 3D SHAPE OF
2D LANDMARKS FROM A SINGLE IMAGE
A PREPRINT
Shima Kamyab
Department of Computer Engineering
Shiraz University
Fars, Iran
sh.kamyab@cse.shirazu.ac.ir
Zohreh Azimifar
Department of Computer Engineering
Shiraz University
Fars, Iran
azimifar@cse.shirazu.ac.ir
October 28, 2022
ABSTRACT
In this paper, a low-parameter deep learning framework utilizing the Non-metric Multi-Dimensional Scaling (NMDS) method is proposed to recover the 3D shape of 2D landmarks on a human face from a single input image. To this end, the NMDS approach is used for the first time to establish a mapping from a 2D landmark space to the corresponding 3D shape space. A deep neural network learns the pairwise dissimilarity among 2D landmarks used by the NMDS approach; its objective is to estimate the pairwise 3D Euclidean distances between the 3D points corresponding to the 2D landmarks on the input image. This scheme results in a symmetric dissimilarity matrix with rank larger than 2, leading the NMDS approach toward appropriately recovering the 3D shape of the corresponding 2D landmarks. For posed images and complex image formation processes, such as perspective projection, which cause occlusion in the input image, we add an autoencoder component to the proposed framework as an occlusion removal part, which turns different input views of the human face into a profile view. The results of a performance evaluation on different synthetic and real-world human face datasets, including the Basel Face Model (BFM), CelebA, CoMA - FLAME, and CASIA-3D, indicate that the proposed framework, despite its small number of training parameters, performs comparably to related state-of-the-art and powerful 3D reconstruction methods from the literature in terms of efficiency and accuracy.
Keywords: Single view 3D human face shape recovery · deep learning human face shape recovery · using multidimensional scaling for 3D shape recovery from a single image
1 Introduction
In general, human face 3D reconstruction plays a fundamental role in many computer vision applications such as face recognition, gaming, etc. Utilizing the 3D face shape in a framework helps increase the framework's accuracy and makes it invariant to pose and/or occlusion changes in the input Sharma and Kumar [2020], Sadeghzadeh and Ebrahimnezhad [2020], Komal and Malhotrauthor [2020]. On the other hand, 3D reconstruction is a computationally demanding process, which is a limitation in some environments such as embedded systems.
One effective solution to reduce the complexity of a 3D reconstruction framework is to use geometric features, called landmarks, instead of the whole input image. Landmarks are well-known and efficient features which are applicable in
many computer vision tasks, such as camera calibration Bartl et al. [2020], Rehder et al. [2017], mensuration Naqvi
et al. [2020], image registration Zhang et al. [2020], Bhavana [2020], scene reconstruction Nouduri et al. [2020], Ngo
et al. [2021], object recognition Choi and Song [2020], Bocchi et al. [2020], and motion analysis Bandini et al. [2020],
Malti [2021]. The main advantage of landmark-based approaches is that they are less sensitive to lighting conditions or other radiometric variations Rohr [2001]. As stated in Tian et al. [2018], 2D geometric information, like landmarks and contours, is sometimes more effective than photometric stereo-based methods for 3D reconstruction in the wild, while
photometric information, including lighting, camera properties, and reflectance properties, is itself a less robust feature, especially for unconstrained images. Nowadays, several accurate landmark detection algorithms and methods in the literature follow different standards King [2009], Zhu and Ramanan [2012], Zou et al. [2020], Li et al. [2017].
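As a concrete illustration, the sketch below extracts the widely used 68-point 2D landmark set with the dlib toolkit of King [2009]; the pretrained model file name and the image path are placeholders assumed for this example rather than details taken from the paper.

```python
# Minimal 2D facial landmark extraction with dlib (King [2009]).
# The model file and image path below are placeholders for this sketch.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard pretrained 68-point model

img = dlib.load_rgb_image("face.jpg")     # single input image
faces = detector(img, 1)                  # upsample once to catch smaller faces
shape = predictor(img, faces[0])          # landmarks of the first detected face

landmarks_2d = np.array([[shape.part(i).x, shape.part(i).y] for i in range(shape.num_parts)])
print(landmarks_2d.shape)                 # (68, 2): the kind of 2D landmark set used as input here
```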
Using simple features like landmarks, with efficient use of memory and computation, makes the solution suitable for low-latency, energy-efficient embedded devices Chen et al. [2021], Wisth et al. [2021], where the objective is to push the computing closer to where the sensors gather data. Therefore, the main requirement of these systems is to keep the frameworks as light as possible.
Instead of the whole image, utilizing landmark features opens up new horizons toward point-based statistical methods like Multi-Dimensional Scaling (MDS) variants Kruskal [1978]. Formally, the MDS approach refers to a set of statistical procedures used for exploratory data analysis and dimensionality reduction. It takes as input estimates of similarity among a group of items, or various “indirect” measurements (e.g., perceptual confusions), and the outcome is a “map” that conveys, spatially, the relationships among items, wherein similar items are located close to one another and dissimilar items are located proportionately further apart Hout et al. [2013]. A detailed description of the MDS approach is given in Sec. 3; for more information, the reader is referred to Ghojogh et al. [2020], Hout et al. [2013]. In general, the MDS solution involves decomposing a dissimilarity matrix, computed from the input points, to find a mapping into a new space that preserves the input points’ configuration as much as possible Ghojogh et al. [2020].
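To make this decomposition concrete, the following is a minimal NumPy sketch of classical (metric) MDS from a precomputed distance matrix, following the textbook procedure summarized in Ghojogh et al. [2020]; it is an illustration only, not code from the paper.

```python
# Classical (metric) MDS: double-center the squared distances and embed the
# points using the leading eigenvectors of the resulting Gram matrix.
import numpy as np

def classical_mds(dist, n_components=2):
    """dist: (n, n) symmetric matrix of pairwise distances."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J             # doubly-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    scales = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scales            # (n, n_components) configuration

# The recovered configuration reproduces the input distances up to a rigid transform.
pts = np.random.rand(10, 2)
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
embedding = classical_mds(dist, n_components=2)   # shape (10, 2)
```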
In this paper, the idea is to use the MDS approach to map a number of standard 2D landmarks on a human face, in a single input image, into the corresponding 3D shape space by recovering the landmarks’ 3D shape. The main challenge here is that, to the best of our knowledge, the MDS approach is usually used to reduce the dimensions, not increase them. In fact, the reason why MDS reduces the dimensions is that the rank of the dissimilarity it uses, usually selected as the Euclidean distances among the input points having D dimensions, is one less than the dimension of the input points, and therefore MDS can map the data to at most D-1 dimensions.
Instead of using the Euclidean distance among the input 2D landmarks in the MDS approach, consider using a dissimilarity matrix which gives an estimate of the 3D Euclidean distance among the landmarks on the corresponding 3D shape. If there is a way to learn such a dissimilarity, the MDS approach can appropriately recover the 3D shape of the input 2D landmarks.
We propose to learn such a dissimilarity using a deep learning framework that takes a pair of 2D points on a single image of a human face as input and outputs an estimate of the 3D Euclidean distance between their corresponding 3D locations. The proposed deep learning dissimilarity therefore has a small number of training parameters.
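The sketch below illustrates this idea under stated assumptions: a small network (an illustrative stand-in, not the paper's exact architecture) maps a pair of 2D landmark coordinates to an estimated 3D Euclidean distance, the predictions are symmetrized into a dissimilarity matrix, and non-metric MDS with three output dimensions recovers a 3D configuration.

```python
# Sketch of the overall pipeline: a learned pairwise dissimilarity feeding NMDS.
# The network is an illustrative stand-in, not the paper's exact architecture.
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import MDS

class PairwiseDistanceNet(nn.Module):
    """Maps a pair of 2D landmarks (4 inputs) to an estimated 3D Euclidean distance."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),   # distances are non-negative
        )

    def forward(self, p, q):                       # p, q: (..., 2) landmark coordinates
        return self.net(torch.cat([p, q], dim=-1)).squeeze(-1)

def recover_3d(landmarks_2d, model):
    """landmarks_2d: (n, 2) array -> (n, 3) configuration recovered via NMDS."""
    pts = torch.as_tensor(landmarks_2d, dtype=torch.float32)
    n = pts.shape[0]
    with torch.no_grad():                          # predict all pairwise dissimilarities
        d = model(pts[:, None, :].expand(n, n, 2),
                  pts[None, :, :].expand(n, n, 2)).numpy()
    d = 0.5 * (d + d.T)                            # enforce a symmetric dissimilarity matrix
    np.fill_diagonal(d, 0.0)
    nmds = MDS(n_components=3, metric=False, dissimilarity="precomputed", random_state=0)
    return nmds.fit_transform(d)                   # 3D shape, up to a rigid transform

# Usage (shapes only; the network would first be trained on 2D-landmark pairs
# with their ground-truth 3D distances as targets):
# shape_3d = recover_3d(np.random.rand(68, 2), PairwiseDistanceNet())
```

The Softplus output layer is just one simple way to keep the predicted dissimilarities non-negative in this sketch; the actual architecture and training objective are described in Sec. 3.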
From another viewpoint, recovering the 3D shape of a set of 2D landmarks in a single image is an ill-posed problem, which requires imposing constraints on the solution space to obtain only feasible solutions for the problem at hand Engl and Groetsch [2014]. In the case of 3D reconstruction from a single image, a usual way is to use a 3D model to constrain the resulting 3D shape to be a feasible human face, as defined by the model bfm [2009]. This type of constraint comes with the drawback of obtaining solutions biased toward the mean shape of the used 3D model Aldrian and Smith [2012]. In our proposed framework, instead, the needed constraints on the solution space are provided by the deep learning components and the MDS approach, which do not bias the found solutions toward a specific region of the solution space.
In the case of different types of image formation processes, we use an autoencoder to turn any input projection type into a profile view, so that the deep learning dissimilarity can be trained without any confusion. This is because we empirically observed that the self-occlusion caused by posed inputs or complex projection types, like perspective projection, misleads the deep neural network when learning the dissimilarity in our framework. In the case of human faces, a simple autoencoder can appropriately turn the input view into an appropriate profile view.
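As a rough illustration of such a component, the sketch below shows a small autoencoder operating directly on the flattened 2D landmark vector and trained to output the landmarks of the corresponding profile view; the layer sizes, the landmark count, and the choice to work on landmark vectors rather than images are assumptions made for this example, not the paper's exact design.

```python
# Illustrative view-normalizing autoencoder: maps the flattened 2D landmarks of
# an arbitrarily posed/projected face to the landmarks of a canonical profile view.
import torch
import torch.nn as nn

class ViewNormalizingAutoencoder(nn.Module):
    def __init__(self, n_landmarks=68, latent=16):
        super().__init__()
        d = 2 * n_landmarks
        self.encoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x):          # x: (batch, 2 * n_landmarks), arbitrary-view landmarks
        return self.decoder(self.encoder(x))

# Assumed training setup: for each sample, the target is the profile-view landmark
# vector of the same subject, so the reconstruction loss learns the view mapping.
# model = ViewNormalizingAutoencoder()
# loss = nn.functional.mse_loss(model(posed_landmarks), profile_landmarks)
```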
Therefore, our contributions in this paper include:
• using the MDS approach to increase the dimensions, for the first time.
• proposing a deep learning symmetric dissimilarity measure to be used in the MDS approach for estimating the 3D shape of 2D landmarks in a single image.
• proposing a low-parameter, unbiased deep learning framework for 3D recovery of the landmark locations in a single image, independent of the type of 2D projection.
The rest of this paper is organized as follows: Section 2 reviews the 3D reconstruction methods related to ours, along with their advantages and drawbacks. In Section 3, we describe our proposed method and the concepts needed to understand it. We then report the experimental evaluations in Section 4, followed by the conclusions and future work, discussed in Section 5.
2 Related Works
A large body of research in the field of 3D reconstruction has utilized spatial geometric features, like landmarks, as a
cue and has reported more precise and accurate solutions. This section reviews the state-of-the-art and the most recent
single view human face 3D reconstruction methods that utilize landmark depth estimation in their process.
The approaches presented in Zhao et al. [2017], Chinaev et al. [2018], Aldrian and Smith [2012], Moniz et al. [2018] are state-of-the-art 3D reconstruction methods which use landmark depth estimation. In Zhao et al. [2017], assuming the images are the result of orthographic projection, only the depth (i.e., the third dimension) of the input 2D landmarks is estimated using a Deep Neural Network (DNN). For other projection types, like perspective projection, the work in Zhao et al. [2017] therefore performs poorly. In Chinaev et al. [2018], a low-parameter deep learning
framework for model-based 3D reconstruction from a single image is proposed. In this framework, the coefficients of a
3D Morphable Model (3DMM) are learned from a single input image using a CNN. In model-based frameworks, the solution is usually biased toward the 3DMM mean shape. In Aldrian and Smith [2012], a closed-form solution is proposed for estimating the coefficients of a 3DMM from a single input image, using different assumptions about the
image formation process. This method is also a model-based framework with the orthographic projection assumption.
In Moniz et al. [2018], an unsupervised DNN is proposed to replace the face in a given image with a face from another image. To do this, a number of landmarks are considered to establish correspondence between the two images. Their depth, i.e., the third dimension, assuming orthographic projection, is then estimated using a DNN, along with a matrix for transforming the resulting 3D landmarks and aligning them with the target face pose. In this work (Moniz et al. [2018]), the landmarks’ depth is obtained along with the transformation matrix as an intermediate output of a DNN, and may not constitute a feasible landmark set by itself.
Among the recent deep learning frameworks for 3D human face reconstruction, we can name Li et al. [2021], which proposes a model-based 3D reconstruction of the human face shape in two steps: coarse shape estimation and adding more details to the resulting face. In the coarse shape estimation phase, a deep neural network is used to estimate the 3DMM parameters from a set of landmarks on the input image. The estimated shape is then fine-tuned in the second phase using high-frequency features obtained by a GAN. The high number of parameters and the model-based scheme may be the drawbacks of such frameworks. In Wu et al. [2019], another model-based method, with a model fitting algorithm for 3D facial expression reconstruction from a single image, is proposed. A cascaded regression framework is designed to estimate the parameters of the 3DMM from a single image. In this framework, 2D landmarks are detected and used to initialize the 3D shape and the mapping matrices. At each iteration, residuals between the current 3DMM parameters and the ground truth are estimated and used to update the 3D shapes. In Hu et al. [2021], a self-supervised
3D reconstruction framework is proposed which considers the feature representation of some landmarks in both 2D and
3D images and establishes consistency among the landmarks in all images. Using landmarks in this framework can further improve the reconstruction quality in local regions via self-supervised classification of the landmarks. In Cai
et al. [2021], a 3D reconstruction framework is proposed for caricature images. It learns a parametric model on the 3D
caricature faces and trains a DNN for regressing the 3D face shape based on the learned nonlinear parametric model.
The proposed DNN results in a number of detected landmarks on each input caricature face image. RingNet Sanyal et al. [2019] is a model-based single-image 3D human face reconstruction framework without the need
for 2D-3D supervision. The RingNet output is based on the FLAME model Li et al. [2017]. RingNet leverages multiple
images of a person and automatically detects 2D face features. It uses a novel loss that encourages the face shapes to be
similar when the identity is the same and different for different people. In our experiments, we used this method as a
recent unsupervised 3D landmark shape recovery method.
In Tian et al. [2018], a human face 3D reconstruction framework is proposed whose input includes only landmarks.
This framework reconstructs a 3D face shape of the frontal pose and natural expression from multiple unconstrained
images of the subject via cascade regression in the 2D/3D shape space. Using the landmarks in this framework, the 3D face shape is progressively updated by a set of cascade regressors, which are learned offline on a training set of unconstrained face images and their corresponding 3D shapes. In Gong et al. [2015], it is stated that in a dense
3D reconstruction framework, knowing the depth of some landmarks on the input face is necessary to prevent the
degradation of prediction accuracy. Therefore, a two-stage method is proposed which first estimates landmarks’ depth
and then uses them within a deformation algorithm to reconstruct a precise 3D shape. In this work, however, the landmark depth estimation requires two input images, which are not available in many cases.
Despite their good performance in single view human face reconstruction applications, the methods reviewed above
may suffer from a large number of parameters or be biased toward a mean shape. Our proposed method for 3D shape
recovery of 2D landmarks could be incorporated as the landmark depth estimation component in these methods to
increase their efficiency. Our proposed deep learning framework aims to overcome the above problems with a low-parameter design, incorporating a novel use of the analytic MDS method.