network models in recognition and prediction tasks (Al Chanti and Caplier, 2018; Tran et al., 2018; Ali et al., 2021, 2022), researchers have proposed a variety of one-step methods (Vondrick et al., 2016; Jang et al., 2018; Wang et al., 2020a,b) that use fractionally strided 3D convolutions. Videos generated by these methods show better spatio-temporal consistency, but lower video quality, more noise and distortions, and more identity-preservation issues than those of two-step methods. We claim this is due to the high complexity of the three tasks combined in a single network (learning 1. the spatial representation, 2. the temporal representation, and 3. identity preservation). This requires a large network of potentially high complexity and a large amount of data, which significantly increases the difficulty of model optimization.
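As a concrete illustration of what such one-step generation looks like, the sketch below (PyTorch) expands a single latent code into a short spatio-temporal volume using fractionally strided (transposed) 3D convolutions. The layer sizes, channel counts, and output resolution are illustrative assumptions, not the configuration of any of the cited methods.

```python
# Minimal sketch of a one-step video generator built from fractionally strided
# (transposed) 3D convolutions; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

one_step_generator = nn.Sequential(
    # latent (N, 100, 1, 1, 1) -> (N, 256, 2, 4, 4)
    nn.ConvTranspose3d(100, 256, kernel_size=(2, 4, 4)), nn.ReLU(),
    # each layer upsamples time and space jointly
    nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # (4, 8, 8)
    nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # (8, 16, 16)
    nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1), nn.Tanh(),     # (16, 32, 32)
)

z = torch.randn(1, 100, 1, 1, 1)   # single latent code
video = one_step_generator(z)      # -> (1, 3, 16, 32, 32): 16 frames of 32x32
```

Because spatial appearance, temporal dynamics, and identity must all emerge from this single stack, such generators tend to require large capacity and large amounts of training data, which motivates the decomposition proposed next.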
To address the low quality, noise, and limited identity preservation of one-step methods, we propose encoding the input image into two codes in the latent space using two feature extractors (EId, an identity feature extractor, and ES, a spatial feature extractor). We also suggest exploiting the high performance of state-of-the-art facial recognition systems by utilizing a pre-trained facial recognition feature extractor as our identity encoder EId. This provides identity-related features that help with identity preservation. In addition, the use of a pre-trained feature extractor allows the optimization process to be applied only to the other encoder ES, which extracts the remaining spatial features needed to maintain sufficiently good quality while reconstructing the facial expression video.
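To make the two-encoder design concrete, the following is a minimal PyTorch sketch, not our actual implementation: EId is a frozen pre-trained backbone (a generic ImageNet ResNet-18 is used here as a stand-in for the pre-trained face recognition model), ES is the trainable spatial encoder, and a decoder combines both codes with a one-hot expression label to reconstruct a frame. Latent dimensions, layer sizes, and the omission of the temporal component are assumptions made for brevity.

```python
# Minimal sketch of the two-encoder generator idea; architectures, latent sizes,
# and the identity backbone are illustrative assumptions only.
import torch
import torch.nn as nn
from torchvision import models

class SpatialEncoder(nn.Module):
    """Trainable encoder ES extracting spatial (non-identity) features."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class TwoEncoderGenerator(nn.Module):
    """Generator with a frozen identity encoder EId and a trainable ES."""
    def __init__(self, num_classes=6, latent_dim=128, id_dim=512):
        super().__init__()
        # Placeholder for the pre-trained face recognition backbone (EId);
        # an ImageNet ResNet-18 stands in for it here.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()              # keep the 512-d feature vector
        self.e_id = backbone
        for p in self.e_id.parameters():         # EId stays frozen
            p.requires_grad = False
        self.e_s = SpatialEncoder(latent_dim)    # ES is optimized
        # Decoder maps [identity code, spatial code, expression label] to a frame.
        self.decoder = nn.Sequential(
            nn.Linear(id_dim + latent_dim + num_classes, 8 * 8 * 128), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),    # 32 -> 64
        )

    def forward(self, img, expr_onehot):
        z_id = self.e_id(img)    # identity code (frozen, no gradient updates)
        z_s = self.e_s(img)      # spatial code (trainable)
        z = torch.cat([z_id, z_s, expr_onehot], dim=1)
        return self.decoder(z)

# Usage: generate one 64x64 frame conditioned on an expression class.
gen = TwoEncoderGenerator()
img = torch.randn(1, 3, 64, 64)               # dummy input face image
label = torch.zeros(1, 6); label[0, 1] = 1.0  # one-hot expression class
frame = gen(img, label)                       # -> (1, 3, 64, 64)
```

Freezing EId means the encoder-side optimization affects only ES, which reflects the strategy described above.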
In summary, our contributions include the following aspects:
1. We propose a conditional GAN, with a single generator and a single discriminator, that generates a dynamic facial expression video, frame by frame, corresponding to the desired class of expressions. The generated videos present a realistic appearance and preserve the identity of the input image.
2. We investigate the influence of utilizing two encoders EId and ES, where EId is a facial identity feature extractor and ES is a spatial feature extractor.
3. We exploit the high potential of state-of-the-art facial recognition systems. We use a pre-trained face recognition model as our generator encoder EId, which ensures strongly identity-related features. This aims to facilitate the task of the decoder by providing meaningful and structured features.
4. We thoroughly evaluate our model, quantitatively and qualitatively, on two public facial expression benchmarks: the MUG facial expression database and the Oulu-CASIA NIR&VIS facial expression database. We compare it with recent state-of-the-art approaches: VGAN (Vondrick et al., 2016), MoCoGAN (Tulyakov et al., 2018), ImaGINator (Wang et al., 2020b) and the approach of Otberdout et al. (2019).
2. Related Work
Static Facial Expression Generation − Facial expression synthesis was initially achieved through the use of traditional methods such as geometric interpolation (Pighin et al., 2006), parameterization (Raouzaiou et al., 2002) and morphing (Beier and Neely, 1992). These methods show success on avatars, but they fall short when dealing with real faces, and they are unable to generalize a flow of movement to all human faces due to the high complexity of natural human expressions and the variety of identity-specific characteristics. To overcome these limitations, neural network based methods have been applied to facial expression generation, including RBMs (Zeiler et al., 2011), DBNs (Sabzevari et al., 2010) and Variational Auto-Encoders (Kingma and Welling, 2014). These methods learn acceptable data representations and a better flow between different data distributions compared to prior methods, but they face problems such as a lack of precision in controlling facial expressions, low resolution, and blurry generated images.
With the advent of GANs, multiple GAN extensions have been dedicated to facial expression generation. (Makhzani et al., 2015) and (Zhou and Shi, 2017) combine the adversarial concept with auto-encoders to present Adversarial Auto-Encoders. (Zhu et al., 2017) propose a conditional GAN, namely CycleGAN, that uses a cycle-consistency loss to preserve the key attributes (identity) of the data. (Choi et al., 2018) address the inefficiency of creating a new generator for each type of transformation by proposing an architecture that can handle different transformations between different datasets. (Wang et al., 2018b) suggest exploiting the U-Net architecture as the GAN generator in order to increase the quality and resolution of generated images. US-GAN (Akram and Khan, 2021) uses a skip connection, called the ultimate skip connection, that links the input image with the output of the model, which allows the model to focus on learning expression-related details only. The model outputs the sum of the input image and the generated expression details, thereby improving identity preservation, but it displays artifacts in expression-related areas (mouth, nose, eyes). The studies above established the task of generating classes of facial expressions (sad, happy, angry, etc.), but in reality, the intensity of the expression also affects the understanding of the emotional state of the person.
ExprGAN (Ding et al., 2017) used an expression controller module to control the expression intensity continuously from weak to strong. Methods like GANimation (Pumarola et al., 2018) and EF-GAN (Wu et al., 2020) used Action Units (AUs) to condition the generation process, which offers more diversity in the generated expressions. Other methods such as G2-GAN (Song et al., 2018) and GC-GAN (Qiao et al., 2018) exploited facial geometry as a condition to control facial expression synthesis. The objective of the latter models is to take as input a facial image and facial landmarks, in the form of binary images or landmark coordinates, and then learn to generate a realistic face image with the same identity and the target expression. (Kollias et al., 2020) and (Bai et al., 2022) utilize labels from the 2D Valence-Arousal space, in which valence indicates how positive or negative an emotion is and arousal indicates the strength of the emotion's activation (Russell, 1980), to guide the facial expression generation process, enhancing the variety and controllability of the generated expressions. All these approaches and others have established the task of facial