Facial Expression Video Generation Based-On Spatio-temporal Convolutional GAN:
FEV-GAN
Hamza Bouzid a,∗∗, Lahoucine Ballihi a
a LRIT-CNRST URAC 29, Mohammed V University in Rabat, Faculty of Sciences, Rabat, Morocco.
∗∗ Corresponding author. E-mail: hamza-bouzid@um5r.ac.ma (Hamza Bouzid), lahoucine.ballihi@fsr.um5.ac.ma (Lahoucine Ballihi).
ABSTRACT
Facial expression generation has always been an intriguing task for scientists and researchers all over
the globe. In this context, we present our novel approach for generating videos of the six basic facial
expressions. Starting from a single neutral facial image and a label indicating the desired facial ex-
pression, we aim to synthesize a video of the given identity performing the specified facial expression.
Our approach, referred to as FEV-GAN (Facial Expression Video GAN), is based on Spatio-temporal
Convolutional GANs, which are known to model both content and motion in the same network. Pre-
vious methods based on such a network have shown a good ability to generate coherent videos with
smooth temporal evolution. However, they still suffer from low image quality and low identity preser-
vation capability. In this work, we address this problem by using a generator composed of two image
encoders. The first one is pre-trained for facial identity feature extraction and the second for spatial
feature extraction. We have qualitatively and quantitatively evaluated our model on two international
facial expression benchmark databases: MUG and Oulu-CASIA NIR&VIS. The experimental results
analysis demonstrates the effectiveness of our approach in generating videos of the six basic facial ex-
pressions while preserving the input identity. The analysis also proves that the use of both identity and
spatial features enhances the decoder's ability to better preserve the identity and generate high-quality
videos. The code and the pre-trained model will soon be made publicly available.
1. Introduction
Facial expressions have always been considered one of the
essential tools for human interaction. Integrating the ability to
recognize and synthesize facial expressions into machines pro-
vides a natural and smooth interaction, which opens the door to
many exciting new applications in different fields, including the
movie industry, e-commerce, and even in the medical field. Mo-
tivated by this, researchers have studied facial expression recog-
nition and have already reached a high level of precision, while
facial expression generation has been more demanding and less
studied in the state of the art. Recently, with the success of Gen-
erative Adversarial Networks (GANs) (Goodfellow et al.,2014)
in data generation, in particular image generation, the task of
generating facial expressions has seen tremendous progress.
However, dynamic facial expression synthesis is even
less studied, due to the difficulty of the tasks: 1) learning the
dataset distribution (facial structure, background), 2) repre-
senting a natural and smooth evolution of facial expressions
(Temporal representation), and 3) preserving the same input
identity. To address the high complexity of these three tasks,
most existing methods tend to treat facial expression genera-
tion as a two-step process: one step for the low-dimensional
temporal generation (motion) and the other for the spatial
generation (content). Such methods (Tulyakov et al., 2018;
Wang et al., 2018a; Otberdout et al., 2019) are mostly based on
1) generating motion as codes in a latent space, and then
2) combining these codes with the input image embedding to generate
frames individually, through the use of an Image-to-Image
translation network. These methods are efficient at learning
facial structure and preserving identity, but they are flawed
when it comes to modeling spatio-temporal consistency and
appearance retention, because the frames are generated
independently of one another.
Motivated by the success of deep spatio-temporal neural
network models in recognition and prediction tasks (Al Chanti
and Caplier,2018;Tran et al.,2018;Ali et al.,2021,2022),
researchers have proposed a variety of one-step methods
(Vondrick et al.,2016;Jang et al.,2018;Wang et al.,2020a,b)
that use fractionally strided 3D convolutions. Videos gener-
ated by these kinds of methods show more spatio-temporal
consistency, but lower video quality, more noise and
distortions, and more identity preservation issues compared to
two-step methods. We claim this is due to the high complexity
of the three tasks combined in a single network (learning 1. the
spatial representation, 2. the temporal representation, 3. identity
preservation). This requires a large network with high potential
complexity and a large amount of data, which significantly
increases the difficulty of model optimization.
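To make this one-step mechanism concrete, the following is a minimal sketch, assuming PyTorch, of how fractionally strided (i.e. transposed) spatio-temporal convolutions upsample a latent code jointly in time and space into a short video tensor. The layer and kernel sizes are illustrative only and do not correspond to any of the cited architectures.

    import torch
    import torch.nn as nn

    # Stack of fractionally strided 3D convolutions: after the first layer,
    # each block doubles the temporal and spatial resolution of its input.
    decoder = nn.Sequential(
        nn.ConvTranspose3d(256, 128, kernel_size=(2, 4, 4)),              # (B,256,1,1,1) -> (B,128,2,4,4)
        nn.BatchNorm3d(128), nn.ReLU(),
        nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # -> (B,64,4,8,8)
        nn.BatchNorm3d(64), nn.ReLU(),
        nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # -> (B,32,8,16,16)
        nn.BatchNorm3d(32), nn.ReLU(),
        nn.ConvTranspose3d(32, 3, kernel_size=4, stride=2, padding=1),    # -> (B,3,16,32,32)
        nn.Tanh(),                                                        # 16-frame RGB clip in [-1, 1]
    )

    z = torch.randn(4, 256, 1, 1, 1)  # latent codes for a batch of 4 videos
    video = decoder(z)                # appearance and motion produced by one network
    print(video.shape)                # torch.Size([4, 3, 16, 32, 32])

Because a single network of this kind is responsible for appearance, motion and identity at once, it typically requires large training sets and careful optimization, which is precisely the difficulty discussed above.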
To solve the issues of low quality, noise and weak identity
preservation faced by one-step methods, we propose
encoding the input image into two codes in the latent space,
using two feature extractors (EId, an identity feature extractor,
and ES, a spatial feature extractor). We also suggest exploiting the
high performance of state-of-the-art facial recognition systems,
by utilizing a pre-trained facial recognition feature extractor
as our identity encoder EId. This provides identity-related
features that help with identity preservation. In addition, the
use of a pre-trained feature extractor allows for applying the
optimization process only to the other encoder ES, which is used
to extract complementary spatial features in order to maintain
sufficiently good quality while reconstructing the facial expression video.
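As a minimal sketch of this design choice, assuming PyTorch and a hypothetical load_pretrained_identity_encoder() helper standing in for a pre-trained facial recognition feature extractor, the identity encoder EId is frozen while the spatial encoder ES remains trainable:

    import torch
    import torch.nn as nn

    def load_pretrained_identity_encoder() -> nn.Module:
        # Hypothetical placeholder for a pre-trained facial recognition
        # feature extractor (loaded from an actual checkpoint in practice).
        return nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 128) identity code
        )

    E_Id = load_pretrained_identity_encoder()
    for p in E_Id.parameters():        # freeze: only E_S (and the decoder) are optimized
        p.requires_grad = False
    E_Id.eval()

    E_S = nn.Sequential(               # trainable spatial encoder
        nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> (B, 128) spatial code
    )

    I = torch.randn(8, 3, 64, 64)      # batch of neutral input images
    F_Id, F_S = E_Id(I), E_S(I)        # identity and spatial feature codes

Only the parameters of ES (and of the video decoder) would then be passed to the optimizer.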
In summary, our contributions include the following aspects:
1. We propose a conditional GAN, with a single generator
and a single discriminator, that generates in a single step
a dynamic facial expression video corresponding to the
desired class of expressions. The generated videos present
a realistic appearance and preserve the identity of the in-
put image.
2. We investigate the influence of utilizing two encoders EId
and ES, where EId is a facial identity feature extractor and
ES is a spatial feature extractor.
3. We exploit the high potential of state-of-the-art facial
recognition systems. We use a pre-trained face recogni-
tion model as our generator encoder EId, which en-
sures strongly identity-related features. This aims to facil-
itate the task of the decoder by providing meaningful and
structured features.
4. We thoroughly evaluate our model, quantitatively and qualita-
tively, on two public facial expression benchmarks: the MUG
facial expression database and the Oulu-CASIA NIR&VIS
facial expression database. We compare it with re-
cent state-of-the-art approaches: VGAN (Vondrick et al.,
2016), MoCoGAN (Tulyakov et al., 2018), ImaGINator
(Wang et al., 2020b) and (Otberdout et al., 2019).
2. Related Work
Static Facial Expression Generation. Facial expression
synthesis was initially achieved through the use of tradi-
tional methods, such as geometric interpolation (Pighin et al.,
2006), parameterization (Raouzaiou et al., 2002), morphing
(Beier and Neely, 1992), etc. These methods show success on
avatars, but they fall short when dealing with real faces, and
they are unable to generalize a flow of movement for all hu-
man faces due to the high complexity of natural human expres-
sions and the variety of identity-specific characteristics. To face
these limitations, neural network based methods have been ap-
plied to facial expression generation, including RBMs (Zeiler
et al., 2011), DBNs (Sabzevari et al., 2010) and Variational
Auto-Encoders (Kingma and Welling, 2014), etc. These meth-
ods learn acceptable data representations and better flow be-
tween different data distributions compared to prior methods,
but they face problems such as the lack of precision in con-
trolling facial expressions, low resolution and blurry generated
images.
With the appearance of GANs, many of their exten-
sions have been dedicated to facial expression generation.
(Makhzani et al., 2015) and (Zhou and Shi, 2017) exploit the
adversarial training concept with auto-encoders to present Adversarial
Auto-Encoders. (Zhu et al., 2017) propose a conditional
GAN, namely CycleGAN, that uses a Cycle-Consistency Loss
to preserve the key attributes (identity) of the data. (Choi et al.,
2018) address the inefficiency of creating a new generator for
each type of transformation by proposing an architecture that
can handle different transformations between different datasets.
(Wang et al., 2018b) suggest exploiting the U-Net architecture
as the GAN generator in order to increase the quality and reso-
lution of generated images. US-GAN (Akram and Khan, 2021)
uses a skip connection, called the ultimate skip connection,
that links the input image with the output of the model, which
allows the model to focus on learning expression-related details
only. The model outputs the addition of the input image and the
generated expression details, thereby improving identity
preservation, but displaying artifacts in areas related to the
expression (mouth, nose, eyes). The studies above established
the task of generating classes of facial expressions (sad, happy,
angry, etc.), but in reality, the intensity of the expression
also influences the understanding of the emotional state of the person.
ExprGAN (Ding et al.,2017) used an expression controller
module to control the expression intensity continuously from
weak to strong. Methods like GANimation (Pumarola et al.,
2018) and EF-GAN (Wu et al., 2020) used Action Units (AUs) to
condition the generation process, which offers
more diversity in the generated expressions. Other methods
such as G2-GAN (Song et al.,2018) and GC-GAN (Qiao et al.,
2018) exploited Facial Geometry as a condition to control the
facial expression synthesis. The objective of the latter models
is to take as input a facial image and facial landmarks in the
form of binary images or landmark coordinates, then learn to
generate a realistic face image with the same identity and the
target expression. (Kollias et al.,2020) and (Bai et al.,2022)
utilize labels from the 2D Valence-Arousal Space, in which
valence indicates how positive or negative an emotion is and
arousal indicates the intensity of its activation (Russell, 1980),
to guide the process of facial expression generation, enhancing
the variety and the control of the generated expressions. All
these approaches and others have established the task of facial
expression generation, but have not considered the dynamicity
of these expressions.
Dynamic Facial Expression Generation. Facial expres-
sions are naturally dynamic actions that contain more infor-
mation and details than a single pose, e.g. the speed of fa-
cial expression transformation, head movements when display-
ing the expression, etc. This information can be significant
in understanding the emotional state of a person. To achieve
this, methods like (Ha et al.,2020;Tang et al.,2021;Li et al.,
2021;Tu et al.,2021;Vowels et al.,2021) focus on facial ex-
pression transfer, in which the facial expression is transferred
from a driver to a target face, while aiming to preserve the
target identity even in situations where the facial characteris-
tics of the driver differ widely from those of the target. In
other methods, the motion is generated separately as codes in
the latent space; these codes are then fed to the generator in
order to generate frames of the video individually. For exam-
ple, MoCoGAN (Tulyakov et al.,2018) decomposes the video
into content and motion information, where the video motion
is learned by a gated recurrent unit (GRU) and the video frames are gen-
erated sequentially by a GAN. RV-GAN (Gupta et al.,2022)
uses transposed (upsampling instead of downsampling) con-
volutional LSTMs as the GAN generator to generate frames indi-
vidually. However, the results of both models present content
and motion artifacts, and both can only be applied to
previously seen identities and a finite number of expressions. In
(Fan et al.,2019), the principle of MoCoGAN is extended by
adding an encoder that helps preserve the input identity, and
a coefficient to control the degree of the expression continu-
ously. The authors of (Wang et al.,2018a) utilize a Multi-Mode
Recurrent Landmark Generator to learn to generate varied se-
quences of facial landmarks of the same category (e.g. differ-
ent ways to smile), which are later translated to video frames. In (Ot-
berdout et al.,2019), the authors exploit a conditional version
of manifold-valued Wasserstein GAN to model the facial land-
marks motion as curves encoded as points on a hypersphere.
The W-GAN learns the distribution of facial expression dy-
namics of different classes, from which new facial expression
motions are synthesized and transformed to videos by Texture-
GAN. Other works have investigated guiding facial expression
generation by speech audio data, such as (Chen et al.,2020;
Guo et al.,2021;Wang et al.,2022;Liang et al.,2022), or by
a combination of audio and facial landmark information, like
(Wang et al., 2021; Wu et al., 2021; Sinha et al., 2022). All
the methods mentioned above generate a single frame per
time step, which lowers the dependency between the video
frames and causes a lack of spatio-temporal consistency.
In contrast, methods like VGAN (Vondrick et al., 2016),
G3AN (Wang et al., 2020a) and ImaGINator (Wang et al.,
2020b) use a single step for the
generation of the whole facial expression video, by employing
fractionally strided spatio-temporal convolutions to simultane-
ously generate both appearance and motion. VGAN de-
composes the generated videos into two parts, a static section
(the background) and a dynamic section, which imposes the
use of a generator composed of two streams, one generating the
background and the other the foreground, whose outputs are combined
to produce the whole video. G3AN aims to model appearance
and motion in a disentangled manner. This is accomplished by
decomposing appearance and motion in a three-stream Genera-
tor, where the main stream models spatio-temporal consistency,
while the two auxiliary streams enhance the main stream with
multi-scale appearance and motion features, respectively. Both
VGAN and G3AN are unconditional models that start from a
Gaussian noise input, causing a lack of identity preservation
and of control over the generated expression. In order to avoid
these problems, ImaGINator uses a blend of an auto-encoder ar-
chitecture and a spatio-temporal fusion mechanism, where the
low-level spatial features in the encoder are sent directly to
the decoder (the same concept as in U-Net (Ronneberger et al.,
2015)). It also uses two discriminators: one processes the whole
video and the other processes it frame by frame. Videos gen-
erated by these kinds of methods show more spatio-temporal
consistency but lower video quality, more noise and weaker iden-
tity preservation compared to two-step methods.
Motivated by the above discussion, we present a novel one-
step approach for facial expression generation based on frac-
tionally strided spatio-temporal convolutions. The rest of the
paper is organized as follows. In Section 3, we introduce our
new FEV-GAN model. Section 4 shows the experimental set-
tings and the quantitative and qualitative analysis of the model.
Section 5 concludes the paper and provides perspectives for fu-
ture research.
3. Proposed Approach
As stated in the introduction, our main aim is to establish
a model that generates dynamic facial expression videos from
appearance information and expression category. Thus, we for-
mulate our goal as learning a function G : {I, L} → Ŷ, where I
is the input image, L is the label vector and Ŷ is the generated
video.
To achieve this objective, we propose a framework consist-
ing of the following components: a generator network G built
on an encoder-decoder architecture. The encoders EId and ES
take as input a single image I and extract identity features FId
and spatial features FS. The decoder Gdec utilizes the extracted
features (FId, FS) and a label L to generate a realistic video Ŷ.
Finally, a discriminator D assists the learning of the generator
for both appearance and expression category. The overview of
our approach is shown in Fig. 1.
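As an illustration of this formulation, the sketch below (PyTorch assumed; layer sizes, feat_dim and num_classes are illustrative choices, not the exact FEV-GAN architecture, and L is assumed here to be a one-hot vector of the six basic expressions) shows how a generator of this kind could combine FId, FS and the label L and decode them into a video Ŷ with spatio-temporal transposed convolutions:

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        def __init__(self, e_id: nn.Module, e_s: nn.Module,
                     feat_dim: int = 128, num_classes: int = 6):
            super().__init__()
            self.e_id, self.e_s = e_id, e_s           # identity and spatial encoders
            self.decoder = nn.Sequential(             # G_dec: 3D transposed convolutions
                nn.ConvTranspose3d(2 * feat_dim + num_classes, 128, (2, 4, 4)),
                nn.BatchNorm3d(128), nn.ReLU(),
                nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),
                nn.BatchNorm3d(64), nn.ReLU(),
                nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1),
                nn.Tanh(),
            )

        def forward(self, image: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
            f_id = self.e_id(image)                      # F_Id: identity features
            f_s = self.e_s(image)                        # F_S: spatial features
            code = torch.cat([f_id, f_s, label], dim=1)  # condition on the expression label L
            code = code.view(code.size(0), -1, 1, 1, 1)  # seed for spatio-temporal upsampling
            return self.decoder(code)                    # Y_hat: a (B, 3, T, H, W) video

Here e_id and e_s stand for any image encoders producing feat_dim-dimensional codes, such as the EId and ES modules sketched in the introduction; the discriminator D (not shown) would score real and generated videos conditioned on the same label.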
3.1. FEV-GAN Model Description
In the following, the architecture of our network is described,
and details on the generator G and the discriminator D are
provided.
Generator (EId, ES, Gdec): As shown in Fig. 1, our generator
consists of three networks: a pre-trained image identity encoder
EId, a randomly initialized encoder ES and a video decoder
Gdec. The encoder EId is the feature extractor of the well-known
state-of-the-art facial recognition model VGG-FACE (Parkhi et al.,
2015). It takes a (64 × 64 × 3) RGB image I as input and