DEEP-MDS FRAMEWORK FOR RECOVERING THE 3D SHAPE OF
2D LANDMARKS FROM A SINGLE IMAGE
A PREPRINT
Shima Kamyab
Department of Computer Engineering
Shiraz University
Fars, Iran
sh.kamyab@cse.shirazu.ac.ir
Zohreh Azimifar
Department of Computer Engineering
Shiraz University
Fars, Iran
azimifar@cse.shirazu.ac.ir
October 28, 2022
ABSTRACT
In this paper, a low-parameter deep learning framework utilizing the Non-metric Multi-Dimensional Scaling (NMDS) method is proposed to recover the 3D shape of 2D landmarks on a human face from a single input image. To this end, the NMDS approach is used for the first time to establish a mapping from a 2D landmark space to the corresponding 3D shape space. A deep neural network learns the pairwise dissimilarity among 2D landmarks used by the NMDS approach; its objective is to estimate the pairwise 3D Euclidean distances between the 3D points corresponding to the 2D landmarks on the input image. This scheme results in a symmetric dissimilarity matrix with rank larger than 2, leading the NMDS approach toward appropriately recovering the 3D shape of the corresponding 2D landmarks. For posed images and complex image formation processes, such as perspective projection, which cause occlusion in the input image, we add an autoencoder component to the proposed framework as an occlusion removal part, which turns different input views of the human face into a profile view. The results of a performance evaluation on different synthetic and real-world human face datasets, including the Basel Face Model (BFM), CelebA, CoMA - FLAME, and CASIA-3D, indicate that the proposed framework, despite its small number of training parameters, performs comparably to related state-of-the-art and powerful 3D reconstruction methods from the literature in terms of efficiency and accuracy.
Keywords: Single view 3D human face shape recovery · deep learning human face shape recovery · using multidimensional scaling for 3D shape recovery from a single image
1 Introduction
In general, human face 3D reconstruction plays a fundamental role in many computer vision applications such as face recognition, gaming, etc. Utilizing the 3D face shape in a framework helps increase the framework's accuracy and makes it invariant to pose and/or occlusion changes in the input Sharma and Kumar [2020], Sadeghzadeh and Ebrahimnezhad [2020], Komal and Malhotrauthor [2020]. On the other hand, 3D reconstruction is a computationally demanding process, which is a limitation in some environments such as embedded systems.
One effective solution to reduce the complexity of a 3D reconstruction framework is to use geometric features, called landmarks, instead of the whole input image. Landmarks are well-known and efficient features which are applicable in
many computer vision tasks, such as camera calibration Bartl et al. [2020], Rehder et al. [2017], mensuration Naqvi
et al. [2020], image registration Zhang et al. [2020], Bhavana [2020], scene reconstruction Nouduri et al. [2020], Ngo
et al. [2021], object recognition Choi and Song [2020], Bocchi et al. [2020], and motion analysis Bandini et al. [2020],
Malti [2021]. The main advantage of landmark-based approaches is that they are less sensitive to lighting conditions or other radiometric variations Rohr [2001]. As stated in Tian et al. [2018], 2D geometric information, like landmarks and contours, is sometimes more effective than photometric stereo-based methods for 3D reconstruction in the wild, while
photometric information, including lighting, camera properties, and reflectance properties, is itself a less robust feature, especially for unconstrained images. Nowadays, several accurate landmark detection algorithms and methods in the literature follow different standards King [2009], Zhu and Ramanan [2012], Zou et al. [2020], Li et al. [2017].
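As a concrete illustration, the sketch below extracts the widely used 68-point 2D landmark set with the dlib toolkit of King [2009]; the pretrained model file name and the image path are placeholders assumed for this example rather than details taken from the paper.

```python
# Minimal 2D facial landmark extraction with dlib (King [2009]).
# The model file and image path below are placeholders for this sketch.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard pretrained 68-point model

img = dlib.load_rgb_image("face.jpg")     # single input image
faces = detector(img, 1)                  # upsample once to catch smaller faces
shape = predictor(img, faces[0])          # landmarks of the first detected face

landmarks_2d = np.array([[shape.part(i).x, shape.part(i).y] for i in range(shape.num_parts)])
print(landmarks_2d.shape)                 # (68, 2): the kind of 2D landmark set used as input here
```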
Using simple features like landmarks, with efficient use of memory and computation, makes the solution suitable for low-latency, energy-efficient embedded devices Chen et al. [2021], Wisth et al. [2021], where the objective is to push the computing closer to where the sensors gather data. Therefore, the main requirement of these systems is to keep the frameworks as light as possible.
Instead of the whole image, utilizing landmark features opens up new horizons toward point-based statistical methods like Multi-Dimensional Scaling (MDS) variants Kruskal [1978]. Formally, the MDS approach refers to a set of statistical procedures used for exploratory data analysis and dimensionality reduction. It takes as input estimates of similarity among a group of items, or various “indirect” measurements (e.g., perceptual confusions), and the outcome is a “map” that conveys, spatially, the relationships among items, wherein similar items are located close to one another and dissimilar items are located proportionately further apart Hout et al. [2013]. A detailed description of the MDS approach is given in Sec. 3; for more information, the reader is referred to Ghojogh et al. [2020], Hout et al. [2013]. In general, the MDS solution involves decomposing a dissimilarity matrix, computed from the input points, to find a mapping into a new space that preserves the input points’ configuration as much as possible Ghojogh et al. [2020].
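To make this decomposition concrete, the following is a minimal NumPy sketch of classical (metric) MDS from a precomputed distance matrix, following the textbook procedure summarized in Ghojogh et al. [2020]; it is an illustration only, not code from the paper.

```python
# Classical (metric) MDS: double-center the squared distances and embed the
# points using the leading eigenvectors of the resulting Gram matrix.
import numpy as np

def classical_mds(dist, n_components=2):
    """dist: (n, n) symmetric matrix of pairwise distances."""
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (dist ** 2) @ J             # doubly-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    scales = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scales            # (n, n_components) configuration

# The recovered configuration reproduces the input distances up to a rigid transform.
pts = np.random.rand(10, 2)
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
embedding = classical_mds(dist, n_components=2)   # shape (10, 2)
```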
In this paper, the idea is to use the MDS approach to map a number of standard 2D landmarks on a human face, in a single input image, into the corresponding 3D shape space by recovering the landmarks’ 3D shape. The main challenge here is that, to the best of our knowledge, the MDS approach is usually used to reduce the dimensions, not increase them. In fact, the reason why MDS reduces the dimensions is that the rank of the dissimilarity it uses, usually selected as the Euclidean distances among the input points having D dimensions, is one less than the dimension of the input points, and therefore MDS can map the data to at most D-1 dimensions.
Instead of using the Euclidean distance among the input 2D landmarks in the MDS approach, consider using a dissimilarity matrix which gives an estimate of the 3D Euclidean distance among the landmarks on the corresponding 3D shape. If there is a way to learn such a dissimilarity, the MDS approach can appropriately recover the 3D shape of the input 2D landmarks.
We propose to learn such a dissimilarity using a deep learning framework that takes a pair of 2D points on a single image of a human face as input and outputs an estimate of the 3D Euclidean distance between their corresponding 3D locations. The proposed deep learning dissimilarity therefore has a small number of training parameters.
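The sketch below illustrates this idea under stated assumptions: a small network (an illustrative stand-in, not the paper's exact architecture) maps a pair of 2D landmark coordinates to an estimated 3D Euclidean distance, the predictions are symmetrized into a dissimilarity matrix, and non-metric MDS with three output dimensions recovers a 3D configuration.

```python
# Sketch of the overall pipeline: a learned pairwise dissimilarity feeding NMDS.
# The network is an illustrative stand-in, not the paper's exact architecture.
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import MDS

class PairwiseDistanceNet(nn.Module):
    """Maps a pair of 2D landmarks (4 inputs) to an estimated 3D Euclidean distance."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),   # distances are non-negative
        )

    def forward(self, p, q):                       # p, q: (..., 2) landmark coordinates
        return self.net(torch.cat([p, q], dim=-1)).squeeze(-1)

def recover_3d(landmarks_2d, model):
    """landmarks_2d: (n, 2) array -> (n, 3) configuration recovered via NMDS."""
    pts = torch.as_tensor(landmarks_2d, dtype=torch.float32)
    n = pts.shape[0]
    with torch.no_grad():                          # predict all pairwise dissimilarities
        d = model(pts[:, None, :].expand(n, n, 2),
                  pts[None, :, :].expand(n, n, 2)).numpy()
    d = 0.5 * (d + d.T)                            # enforce a symmetric dissimilarity matrix
    np.fill_diagonal(d, 0.0)
    nmds = MDS(n_components=3, metric=False, dissimilarity="precomputed", random_state=0)
    return nmds.fit_transform(d)                   # 3D shape, up to a rigid transform

# Usage (shapes only; the network would first be trained on 2D-landmark pairs
# with their ground-truth 3D distances as targets):
# shape_3d = recover_3d(np.random.rand(68, 2), PairwiseDistanceNet())
```

The Softplus output layer is just one simple way to keep the predicted dissimilarities non-negative in this sketch; the actual architecture and training objective are described in Sec. 3.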
From another viewpoint, recovering the 3D shape of a set of 2D landmarks in a single image is an ill-posed problem, which requires imposing constraints on the solution space to obtain only feasible solutions for the problem at hand Engl and Groetsch [2014]. In the case of 3D reconstruction from a single image, a usual way is to use a 3D model to constrain the resulting 3D shape to be a feasible human face, as defined by the model bfm [2009]. This type of constraint comes with the drawback of obtaining solutions biased toward the mean shape of the used 3D model Aldrian and Smith [2012]. In our proposed framework, instead, the needed constraints on the solution space are provided by the deep learning components and the MDS approach, which do not bias the found solutions toward a specific region of the solution space.
In the case of different types of image formation processes, we use an autoencoder to turn any input projection type into a profile view, so that the deep learning dissimilarity can be trained without any confusion. This is because we empirically observed that the self-occlusion caused by posed inputs or complex projection types, like perspective projection, misleads the deep neural network when learning the dissimilarity in our framework. In the case of human faces, a simple autoencoder can appropriately turn the input view into an appropriate profile view.
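As a rough illustration of such a component, the sketch below shows a small autoencoder operating directly on the flattened 2D landmark vector and trained to output the landmarks of the corresponding profile view; the layer sizes, the landmark count, and the choice to work on landmark vectors rather than images are assumptions made for this example, not the paper's exact design.

```python
# Illustrative view-normalizing autoencoder: maps the flattened 2D landmarks of
# an arbitrarily posed/projected face to the landmarks of a canonical profile view.
import torch
import torch.nn as nn

class ViewNormalizingAutoencoder(nn.Module):
    def __init__(self, n_landmarks=68, latent=16):
        super().__init__()
        d = 2 * n_landmarks
        self.encoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x):          # x: (batch, 2 * n_landmarks), arbitrary-view landmarks
        return self.decoder(self.encoder(x))

# Assumed training setup: for each sample, the target is the profile-view landmark
# vector of the same subject, so the reconstruction loss learns the view mapping.
# model = ViewNormalizingAutoencoder()
# loss = nn.functional.mse_loss(model(posed_landmarks), profile_landmarks)
```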
Therefore, our contributions in this paper include:
• using the MDS approach to increase the dimensions, for the first time.
• proposing a deep learning symmetric dissimilarity measure to be used in the MDS approach for estimating the 3D shape of 2D landmarks in a single image.
• proposing a low-parameter, unbiased deep learning framework for 3D recovery of the landmark locations in a single image, independent of the type of 2D projection.
The rest of this paper is organized as follows: Section 2 reviews the 3D reconstruction methods related to ours, along with their advantages and drawbacks. In Section 3, we describe our proposed method and the concepts needed to understand it. We then report the experimental evaluations in Section 4, followed by the conclusions and future work, discussed in Section 5.
2 Related Works
A large body of research in the field of 3D reconstruction has utilized spatial geometric features, like landmarks, as a
cue and has reported more precise and accurate solutions. This section reviews the state-of-the-art and the most recent
single view human face 3D reconstruction methods that utilize landmark depth estimation in their process.
The approaches presented in Zhao et al. [2017], Chinaev et al. [2018], Aldrian and Smith [2012], Moniz et al. [2018] are state-of-the-art 3D reconstruction methods which use landmark depth estimation. In Zhao et al. [2017], assuming the images are the result of orthographic projection, only the depth (i.e., the third dimension) of the input 2D landmarks is estimated using a Deep Neural Network (DNN). For other projection types, like perspective projection, the work in Zhao et al. [2017] therefore performs poorly. In Chinaev et al. [2018], a low-parameter deep learning
framework for model-based 3D reconstruction from a single image is proposed. In this framework, the coefficients of a
3D Morphable Model (3DMM) are learned from a single input image using a CNN. In model-based frameworks, the solution is usually biased toward the 3DMM mean shape. In Aldrian and Smith [2012], a closed-form solution is proposed for estimating the coefficients of a 3DMM from a single input image, using different assumptions about the
image formation process. This method is also a model-based framework with the orthographic projection assumption.
In Moniz et al. [2018], an unsupervised DNN is proposed to replace the face in a given image with a face from another image. To do this, a number of landmarks are considered to establish correspondence between the two images. Their depth, i.e., the third dimension, assuming orthographic projection, is then estimated using a DNN, along with a matrix for transforming the resulting 3D landmarks and aligning them with the target face pose. In this work (Moniz et al. [2018]), the landmarks’ depth is obtained along with the transformation matrix as an intermediate output of a DNN, and may not constitute a feasible landmark set by itself.
Among the recent deep learning frameworks for 3D human face reconstruction, we can name Li et al. [2021], which proposes a model-based 3D reconstruction of the human face shape in two steps: coarse shape estimation and adding more details to the resulting face. In the coarse shape estimation phase, a deep neural network is used to estimate the 3DMM parameters from a set of landmarks on the input image. The estimated shape is then fine-tuned in the second phase using high-frequency features obtained by a GAN. The high number of parameters and the model-based scheme may be the drawbacks of such frameworks. In Wu et al. [2019], another model-based method, with a model fitting algorithm for 3D facial expression reconstruction from a single image, is proposed. A cascaded regression framework is designed to estimate the parameters of the 3DMM from a single image. In this framework, 2D landmarks are detected and used to initialize the 3D shape and the mapping matrices. At each iteration, residuals between the current 3DMM parameters and the ground truth are estimated and used to update the 3D shapes. In Hu et al. [2021], a self-supervised
3D reconstruction framework is proposed which considers the feature representation of some landmarks in both 2D and
3D images and establishes consistency among the landmarks in all images. Using landmarks in this framework can further improve the reconstruction quality in local regions via self-supervised classification of the landmarks. In Cai
et al. [2021], a 3D reconstruction framework is proposed for caricature images. It learns a parametric model on the 3D
caricature faces and trains a DNN for regressing the 3D face shape based on the learned nonlinear parametric model.
The proposed DNN results in a number of detected landmarks on each input caricature face image. RingNet Sanyal et al. [2019] is a model-based single-image 3D human face reconstruction framework without the need
for 2D-3D supervision. The RingNet output is based on the FLAME model Li et al. [2017]. RingNet leverages multiple
images of a person and automatically detects 2D face features. It uses a novel loss that encourages the face shapes to be
similar when the identity is the same and different for different people. In our experiments, we used this method as a
recent unsupervised 3D landmark shape recovery method.
In Tian et al. [2018], a human face 3D reconstruction framework is proposed whose input includes only landmarks.
This framework reconstructs a 3D face shape of the frontal pose and natural expression from multiple unconstrained
images of the subject via cascade regression in the 2D/3D shape space. Using the landmarks in this framework, the 3D face shape is progressively updated by a set of cascade regressors, which are learned offline on a training set of unconstrained face images and their corresponding 3D shapes. In Gong et al. [2015], it is stated that in a dense
3D reconstruction framework, knowing the depth of some landmarks on the input face is necessary to prevent the
degradation of prediction accuracy. Therefore, a two-stage method is proposed which first estimates landmarks’ depth
and then uses them within a deformation algorithm to reconstruct a precise 3D shape. In this work, however, the landmark depth estimation requires two input images, which are not available in many cases.
Despite their good performance in single view human face reconstruction applications, the methods reviewed above
may suffer from a large number of parameters or be biased toward a mean shape. Our proposed method for 3D shape
recovery of 2D landmarks could be incorporated as the landmark depth estimation component in these methods to
increase their efficiency. Our proposed deep learning framework aims to overcome the above problems with a low-parameter design, incorporating a novel use of the analytic MDS method.