
special stream network to obtain better classification results.
The central idea of GGViT is to add the embedding of the whole
face image to each local part of the face image, so that global
information about the whole face can guide the stream network
responsible for that local region. To address the problem of
image quality, we design an image quality block that extracts
image quality information and imposes different constraints
on the final classification results according to that quality,
enhancing the generalization ability of GGViT. The prediction
results of the multi-stream networks are then recombined
by the fusion attention block of GGViT, which pays more
attention to the predictions of the streams that discriminate
better at the corresponding image quality.
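To make this concrete, the following is a minimal PyTorch sketch of the global-guided multi-stream idea described above. All names here (GGViTSketch, make_vit, quality_block, fusion_attention), the additive global guidance, and the softmax fusion are our illustrative reading of the description, not the authors' code.

```python
import torch
import torch.nn as nn

def make_vit(embed_dim):
    # Stand-in for a ViT encoder that returns a (B, embed_dim) [CLS]
    # embedding; in practice this could be a pretrained ViT backbone.
    return nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

class GGViTSketch(nn.Module):
    def __init__(self, embed_dim=768, num_parts=4, num_classes=2):
        super().__init__()
        self.global_vit = make_vit(embed_dim)  # whole-face stream
        self.local_vits = nn.ModuleList(
            make_vit(embed_dim) for _ in range(num_parts))
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, num_classes) for _ in range(num_parts))
        # Image quality block: predicts a quality score from the
        # whole-face embedding (assumed form of the quality branch).
        self.quality_block = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Fusion attention block: scores each stream so that streams
        # which discriminate better receive more weight.
        self.fusion_attention = nn.Linear(embed_dim, 1)

    def forward(self, whole_face, parts):
        g = self.global_vit(whole_face)  # global embedding, (B, D)
        # Add the whole-face embedding to each local stream's embedding,
        # letting global information guide the local networks.
        feats = [vit(p) + g for vit, p in zip(self.local_vits, parts)]
        logits = torch.stack(
            [h(f) for h, f in zip(self.heads, feats)], dim=1)  # (B, P, C)
        attn = torch.softmax(torch.stack(
            [self.fusion_attention(f) for f in feats], dim=1), dim=1)
        fused = (attn * logits).sum(dim=1)  # fused prediction, (B, C)
        quality = self.quality_block(g)     # feeds the quality constraint
        return fused, quality
```

With a batch of whole-face crops and four part crops, fused can be trained with a standard classification loss, while quality would feed the quality-dependent constraint mentioned above.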
We conducted extensive experiments on the FF++[6] dataset,
and the results show that GGViT achieves state-of-the-art
performance on FF++ and improves markedly in scenarios with
different compression rates.
In summary, we make three major contributions in this
paper:
• We use ViT models to design a multi-stream deep learning
network to detect facial reenactment in videos.
• We propose a loss function that performs well across
networks trained at different compression rates.
• We conduct extensive experiments on the proposed model,
and our method achieves state-of-the-art results on
mainstream datasets.
II. RELATED WORK
In recent years, face forgery has received increasing attention
due to its wide range of applications. Correspondingly, face
forgery detection has also become a popular research field.
In this section, we will briefly review the evolution of face
reenactment technology and the progress of corresponding
face reenactment detection methods.
A. Face Reenactment Generation Techniques
Face reenactment refers to transferring the source facial
expression to the target face without changing the identity
of the target. Existing methods can be roughly divided into
those using three-dimensional models and those based on
GANs[7]. Suwajanakorn[8] produced photorealistic results
by using audio to drive lip movements, combining proper
3-D pose with high-quality lip textures and mouth shapes.
Volker[9] derived a morphable face model by transforming the
shape and texture of example faces into a vector space represen-
tation. Face2Face[10] effectively transferred the expressions
of the target and source faces through a transfer matrix, taking
into account the details of mouth opening, and re-rendered and
synthesized the faces with the changed expressions.
GAN-based methods require a large number of paired
images for training. Jin[11] directly used CycleGAN[12] to
exchange expressions between faces of different identities,
capturing details of facial expressions and head poses to generate
transformation videos with higher consistency and stability.
ReenactGAN[13] used a mapping in the latent space to
transfer facial movements and expressions from an arbitrary
person's monocular video input to a target person's video in
real time.

Fig. 2. In real application scenarios, the obtained faces are not aligned.
When such a picture is divided into four parts, the network responsible for
detecting the eye region may receive only half an eye, while the nose
sometimes appears entirely at the lower left, as in (a), and sometimes at
the lower right, as in (b), which increases the detection difficulty of each
stream network.
B. Face Reenactment Detection Techniques
Traditional face forgery detection obtains hand-crafted
features from face images for discrimination: it uses
LBP[14], SIFT[15], and other local descriptors to extract im-
age features, and then applies SVM[16] or other methods for
classification. As forged pictures and videos began to be
produced in large numbers, deep neural networks became the
main tool for extracting features and judging the traces left
by forgery. Matern[17] built on this by taking into
account differences in physical features such as eyes and teeth.
Exploiting the characteristics of forged videos, dynamic
information such as dynamic texture[3], twitching[18],
blinking[4], and muscle movement[19] has been used
to assist identification. With the development of deep learning,
multi-stream networks were gradually adopted to distinguish
forged images, with each stream attending to different reference factors.
Zhou[20] took local noise residuals and camera characteristics
as a second stream. Atoum[21] proposed a novel two-stream
CNN-based approach, which takes the local features and
global depth of images as the inputs of the two networks. Masi[22]
used a two-stream network to suppress the influence of simple
facial information on the network output. Kumar[2] achieved
good results by dividing faces into multiple parts and using a
multi-stream network to focus on local forgeries.
III. METHODS
In this section, we first state our design motivation and
briefly introduce our framework. As mentioned earlier, the
differences between real and reenactment images are subtle
and often localized. Moreover, face reenactment operations
usually leave traces at the canthus, chin, cheeks,
and other parts of the face. Therefore, it is very beneficial to divide
the face into different partial pictures and train specialized
networks to find the corresponding reenactment features of
each facial part.
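As a concrete illustration, the sketch below builds multi-stream inputs by a naive quadrant split of a face crop into four local images plus the whole face. The helper name split_face and the exact crop layout are assumptions for illustration, not the paper's precise cropping scheme; as the next paragraph notes, this naive split is fragile when faces are not aligned.

```python
import torch

def split_face(face: torch.Tensor):
    """face: (B, C, H, W) face crop -> whole face and four quadrant parts."""
    _, _, h, w = face.shape
    parts = [
        face[:, :, : h // 2, : w // 2],  # upper-left  (around the left eye)
        face[:, :, : h // 2, w // 2 :],  # upper-right (around the right eye)
        face[:, :, h // 2 :, : w // 2],  # lower-left  (cheek / mouth corner)
        face[:, :, h // 2 :, w // 2 :],  # lower-right (cheek / chin)
    ]
    return face, parts

faces = torch.randn(8, 3, 224, 224)   # a batch of face crops
whole, parts = split_face(faces)      # inputs for the five streams
```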
However, the face videos obtained from social networks in
actual scenes are unlikely to contain aligned faces. Therefore, if
the face is partially converted into four local images, the face