network models in recognition and prediction tasks (Al Chanti and Caplier, 2018; Tran et al., 2018; Ali et al., 2021, 2022), researchers have proposed a variety of one-step methods (Vondrick et al., 2016; Jang et al., 2018; Wang et al., 2020a,b) that use fractionally strided 3D convolutions. Videos generated by these methods show better spatio-temporal consistency, but lower video quality, more noise and distortions, and more identity-preservation issues than those of two-step methods. We claim this is due to the high complexity of the three tasks combined in a single network (learning 1. the spatial representation, 2. the temporal representation, and 3. identity preservation). This requires a large network of potentially high complexity and a large amount of data, which significantly increases the difficulty of model optimization.
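As a concrete illustration of what such one-step generation looks like, the sketch below (PyTorch) expands a single latent code into a short spatio-temporal volume using fractionally strided (transposed) 3D convolutions. The layer sizes, channel counts, and output resolution are illustrative assumptions, not the configuration of any of the cited methods.

```python
# Minimal sketch of a one-step video generator built from fractionally strided
# (transposed) 3D convolutions; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

one_step_generator = nn.Sequential(
    # latent (N, 100, 1, 1, 1) -> (N, 256, 2, 4, 4)
    nn.ConvTranspose3d(100, 256, kernel_size=(2, 4, 4)), nn.ReLU(),
    # each layer upsamples time and space jointly
    nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # (4, 8, 8)
    nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # (8, 16, 16)
    nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1), nn.Tanh(),     # (16, 32, 32)
)

z = torch.randn(1, 100, 1, 1, 1)   # single latent code
video = one_step_generator(z)      # -> (1, 3, 16, 32, 32): 16 frames of 32x32
```

Because spatial appearance, temporal dynamics, and identity must all emerge from this single stack, such generators tend to require large capacity and large amounts of training data, which motivates the decomposition proposed next.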
To address the low quality, noise, and limited identity preservation of one-step methods, we propose encoding the input image into two codes in the latent space using two feature extractors (EId, an identity feature extractor, and ES, a spatial feature extractor). We also suggest exploiting the high performance of state-of-the-art facial recognition systems by utilizing a pre-trained facial recognition feature extractor as our identity encoder EId. This provides identity-related features that help with identity preservation. In addition, the use of a pre-trained feature extractor allows the optimization process to be applied only to the other encoder ES, which extracts the remaining spatial features needed to maintain sufficiently good quality while reconstructing the facial expression video.
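To make the two-encoder design concrete, the following is a minimal PyTorch sketch, not our actual implementation: EId is a frozen pre-trained backbone (a generic ImageNet ResNet-18 is used here as a stand-in for the pre-trained face recognition model), ES is the trainable spatial encoder, and a decoder combines both codes with a one-hot expression label to reconstruct a frame. Latent dimensions, layer sizes, and the omission of the temporal component are assumptions made for brevity.

```python
# Minimal sketch of the two-encoder generator idea; architectures, latent sizes,
# and the identity backbone are illustrative assumptions only.
import torch
import torch.nn as nn
from torchvision import models

class SpatialEncoder(nn.Module):
    """Trainable encoder ES extracting spatial (non-identity) features."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class TwoEncoderGenerator(nn.Module):
    """Generator with a frozen identity encoder EId and a trainable ES."""
    def __init__(self, num_classes=6, latent_dim=128, id_dim=512):
        super().__init__()
        # Placeholder for the pre-trained face recognition backbone (EId);
        # an ImageNet ResNet-18 stands in for it here.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()              # keep the 512-d feature vector
        self.e_id = backbone
        for p in self.e_id.parameters():         # EId stays frozen
            p.requires_grad = False
        self.e_s = SpatialEncoder(latent_dim)    # ES is optimized
        # Decoder maps [identity code, spatial code, expression label] to a frame.
        self.decoder = nn.Sequential(
            nn.Linear(id_dim + latent_dim + num_classes, 8 * 8 * 128), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),    # 32 -> 64
        )

    def forward(self, img, expr_onehot):
        z_id = self.e_id(img)    # identity code (frozen, no gradient updates)
        z_s = self.e_s(img)      # spatial code (trainable)
        z = torch.cat([z_id, z_s, expr_onehot], dim=1)
        return self.decoder(z)

# Usage: generate one 64x64 frame conditioned on an expression class.
gen = TwoEncoderGenerator()
img = torch.randn(1, 3, 64, 64)               # dummy input face image
label = torch.zeros(1, 6); label[0, 1] = 1.0  # one-hot expression class
frame = gen(img, label)                       # -> (1, 3, 64, 64)
```

Freezing EId means the encoder-side optimization affects only ES, which reflects the strategy described above.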
In summary, our contributions include the following aspects:
1. We propose a conditional GAN, with a single generator and a single discriminator, that generates a dynamic facial expression video, frame by frame, corresponding to the desired class of expressions. The generated videos present a realistic appearance and preserve the identity of the input image.
2. We investigate the influence of utilizing two encoders EId and ES, where EId is a facial identity feature extractor and ES is a spatial feature extractor.
3. We exploit the high potential of state-of-the-art facial recognition systems. We use a pre-trained face recognition model as our generator encoder EId, which ensures strongly identity-related features. This aims to facilitate the task of the decoder by providing meaningful and structured features.
4. We thoroughly evaluate our model, quantitatively and qualitatively, on two public facial expression benchmarks: the MUG facial expression database and the Oulu-CASIA NIR&VIS facial expression database. We compare it with recent state-of-the-art approaches: VGAN (Vondrick et al., 2016), MoCoGAN (Tulyakov et al., 2018), ImaGINator (Wang et al., 2020b) and the approach of Otberdout et al. (2019).
2. Related Work
Static Facial Expression Generation − Facial expression synthesis was initially achieved through the use of traditional methods such as geometric interpolation (Pighin et al., 2006), parameterization (Raouzaiou et al., 2002) and morphing (Beier and Neely, 1992). These methods show success on avatars, but they fall short when dealing with real faces, and they are unable to generalize a flow of movement to all human faces due to the high complexity of natural human expressions and the variety of identity-specific characteristics. To overcome these limitations, neural network based methods have been applied to facial expression generation, including RBMs (Zeiler et al., 2011), DBNs (Sabzevari et al., 2010) and Variational Auto-Encoders (Kingma and Welling, 2014). These methods learn acceptable data representations and a better flow between different data distributions compared to prior methods, but they face problems such as a lack of precision in controlling facial expressions, low resolution, and blurry generated images.
With the advent of GANs, multiple GAN extensions have been dedicated to facial expression generation. (Makhzani et al., 2015) and (Zhou and Shi, 2017) combine the adversarial concept with auto-encoders to present Adversarial Auto-Encoders. (Zhu et al., 2017) propose a conditional GAN, namely CycleGAN, that uses a cycle-consistency loss to preserve the key attributes (identity) of the data. (Choi et al., 2018) address the inefficiency of creating a new generator for each type of transformation by proposing an architecture that can handle different transformations between different datasets. (Wang et al., 2018b) suggest exploiting the U-Net architecture as the GAN generator in order to increase the quality and resolution of generated images. US-GAN (Akram and Khan, 2021) uses a skip connection, called the ultimate skip connection, that links the input image with the output of the model, which allows the model to focus on learning expression-related details only. The model outputs the sum of the input image and the generated expression details, thereby improving identity preservation, but it displays artifacts in expression-related areas (mouth, nose, eyes). The studies above established the task of generating classes of facial expressions (sad, happy, angry, etc.), but in reality, the intensity of the expression also affects the understanding of the emotional state of the person.
ExprGAN (Ding et al., 2017) used an expression controller module to control the expression intensity continuously from weak to strong. Methods like GANimation (Pumarola et al., 2018) and EF-GAN (Wu et al., 2020) used Action Units (AUs) to condition the generation process, which offers more diversity in the generated expressions. Other methods such as G2-GAN (Song et al., 2018) and GC-GAN (Qiao et al., 2018) exploited facial geometry as a condition to control facial expression synthesis. The objective of the latter models is to take as input a facial image and facial landmarks, in the form of binary images or landmark coordinates, and then learn to generate a realistic face image with the same identity and the target expression. (Kollias et al., 2020) and (Bai et al., 2022) utilize labels from the 2D Valence-Arousal space, in which valence indicates how positive or negative an emotion is and arousal indicates the strength of the emotion's activation (Russell, 1980), to guide the facial expression generation process, enhancing the variety and controllability of the generated expressions. All these approaches and others have established the task of facial