GGViT: Multistream Vision Transformer Network in
Face2Face Facial Reenactment Detection
Haotian Wu1,2, Peipei Wang3, Xin Wang1, Ji Xiang1, Rui Gong1
1Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
3National Computer Network Emergency Response Technical Team/
Coordination Center of China (CNCERT/CC), Beijing, China
Email: {wuhaotian}@iie.ac.cn, {wangpeipei}@cert.org.cn, {wangxin,xiangji,gongrui}@iie.ac.cn
Abstract—Detecting manipulated facial images and videos on social networks has become an urgent problem. The compression applied to videos on social media destroys some of the pixel details that could be used to detect forgeries, so it is crucial to detect manipulated faces in videos of different quality. We propose a new multi-stream network architecture named GGViT, which utilizes global information to improve the generalization ability of the model: the embedding of the whole face extracted by a ViT guides each stream network. Extensive experiments show that our proposed model achieves state-of-the-art classification accuracy on the FF++ dataset and improves greatly in scenarios with different compression rates. Accuracy in the Raw/C23, Raw/C40 and C23/C40 settings is increased by 24.34%, 15.08% and 10.14%, respectively.
I. INTRODUCTION
A large number of videos and pictures are transmitted through social networks all the time, and faced with so much information, people do not have enough time to verify the authenticity of each picture and video. Facial reenactment is a common form of face forgery. In the past, such forged videos were usually produced by splicing or synthesis; these methods not only take a lot of time but also yield results that people can easily recognize. The development of deep learning has lowered the barrier to this technology, so a large number of forged videos can be produced and spread quickly. To address this problem, many previous studies [1][2] designed different models to extract features from these videos for classification. On this basis, facial movement information in the videos was also taken into account [3][4]. Although these methods achieve excellent classification results when the training and test videos share the same quality, the varying video quality encountered in the wild poses a great challenge to them.
In order to reduce the pressure on data storage and improve transmission speed, social media platforms usually compress the videos they deliver to varying degrees. Compression reduces picture quality, so forged videos lose some of the forgery details that could otherwise be easily distinguished, and it also makes it more difficult for existing methods to classify these forged videos (Fig. 1). Although great progress has been made when all videos share the same quality, identifying forged videos across different
Fig. 1. Compression reduces image quality, making forgery traces harder to observe. The four rows of forged images show traces at the edge of the cheek, the mouth, the chin and the triangular area around the mouth, which become more difficult to detect as the compression rate increases. The columns correspond to compression parameters 0 (C0, no compression), 23 (C23) and 40 (C40), respectively.
compression rates remains a necessary measure for protecting the information security of social media.
In this work, we present a new multi-stream network architecture based on ViT (Vision Transformer), named GGViT (Global Guidance Vision Transformer). Our approach mainly addresses the challenge of classifying facial reenactment accurately across different video quality settings. To tackle this challenge, inspired by Kumar [2] and the Vision Transformer [5], we propose a multi-stream network in which the face to be detected is divided into multiple parts, and the same part of every face is classified by a dedicated stream network to obtain better classification results.
The central idea of GGViT is to add the embedding of the whole face image to the embeddings of the partial face images, so that global information about the whole face can guide the stream network responsible for each local region. To account for image quality, we design an image quality block that extracts quality information and imposes different constraints on the final classification result according to that quality, which enhances the generalization ability of GGViT. The predictions of the stream networks are then recombined by the fusion attention block of GGViT, which pays more attention to the predictions of the stream networks that discriminate better for the corresponding image quality.
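To make this design concrete, the following sketch shows one possible way to organize the GGViT forward pass in PyTorch. It is a minimal interpretation of the description above, not the authors' implementation: the use of timm's vit_base_patch16_224 backbone, the names GGViTSketch, quality_block and fusion_score, the additive way the global embedding guides each stream, and the softmax-weighted fusion of per-stream logits are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; any ViT backbone producing a pooled embedding would do


class GGViTSketch(nn.Module):
    """Illustrative multi-stream ViT with global guidance (not the authors' code)."""

    def __init__(self, num_parts=4, embed_dim=768, num_classes=2):
        super().__init__()
        # One ViT encodes the whole face (global guidance), one ViT per facial part.
        # num_classes=0 makes timm models return pooled embeddings instead of logits.
        self.global_vit = timm.create_model("vit_base_patch16_224", num_classes=0)
        self.part_vits = nn.ModuleList(
            [timm.create_model("vit_base_patch16_224", num_classes=0) for _ in range(num_parts)]
        )
        # Per-stream classification heads over the guided part embeddings.
        self.stream_heads = nn.ModuleList(
            [nn.Linear(embed_dim, num_classes) for _ in range(num_parts)]
        )
        # Hypothetical image quality block: summarizes the whole-face embedding
        # into a quality embedding used to condition the fusion step.
        self.quality_block = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )
        # Hypothetical fusion attention block: scores each stream conditioned on
        # the quality embedding and fuses the per-stream predictions.
        self.fusion_score = nn.Linear(embed_dim * 2, 1)

    def forward(self, whole_face, parts):
        # whole_face: (B, 3, 224, 224); parts: list of num_parts tensors of the
        # same shape (local crops are assumed to be resized to 224x224 beforehand).
        g = self.global_vit(whole_face)            # (B, D) global face embedding
        q = self.quality_block(g)                  # (B, D) image quality embedding
        feats, logits = [], []
        for vit, head, part in zip(self.part_vits, self.stream_heads, parts):
            f = vit(part) + g                      # global embedding guides each stream
            feats.append(f)
            logits.append(head(f))                 # per-stream real/fake logits
        feats = torch.stack(feats, dim=1)          # (B, P, D)
        logits = torch.stack(logits, dim=1)        # (B, P, C)
        # Attention over streams, conditioned on image quality.
        cond = torch.cat([feats, q.unsqueeze(1).expand_as(feats)], dim=-1)
        weights = self.fusion_score(cond).softmax(dim=1)   # (B, P, 1)
        fused = (weights * logits).sum(dim=1)               # (B, C) fused prediction
        return fused, logits
```

Under this reading, the whole-face embedding is simply added to each part embedding before its stream head, and the quality-conditioned attention decides how much each stream contributes to the fused prediction; the paper's actual blocks may differ in detail.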
We conducted a large number of experiments on the FF++ [6] dataset, and the results show that GGViT achieves state-of-the-art performance on FF++ and improves greatly in scenarios with different compression rates.
In summary, we make three major contributions in this paper:
• We use ViT models to design a multi-stream deep learning network to detect facial reenactment in videos.
• We propose a loss function that performs well across different compression rates.
• We conduct a large number of experiments on the proposed model, and our method reaches state-of-the-art performance on mainstream datasets.
II. RELATED WORK
In recent years, face forgery has received increasing attention due to its wide range of applications, and face forgery detection has correspondingly become a popular research field. In this section, we briefly review the evolution of face reenactment techniques and the progress of the corresponding detection methods.
A. Face Reenactment Generation Techniques
Face reenactment refers to transferring the source face's expression to the target face without changing the target's identity. Existing approaches can be roughly divided into methods based on three-dimensional models and methods based on GANs [7]. Suwajanakorn [8] produced photorealistic results by using audio to drive lip movements, combining a proper 3-D pose with high-quality lip textures and mouth shapes. Volker [9] derived a morphable face model by transforming the shape and texture of example faces into a vector space representation. Face2Face [10] effectively transfers expressions between the source and target faces through a transfer matrix, takes the details of mouth opening into account, and re-renders and synthesizes the face with the changed expression.
GAN-based methods require a large number of paired images for training. Jin [11] directly used CycleGAN [12] to exchange expressions between faces of different identities, capturing the details of facial expressions and head poses to generate transformed videos with higher consistency and stability. ReenactGAN [13] used a mapping in the latent space to transfer facial movements and expressions from an arbitrary
Fig. 2. In real application scenarios, the obtained faces are not aligned. After such a picture is divided into four parts, the stream network responsible for the eye region may receive only half an eye, while the nose sometimes appears entirely in the lower-left part, as in (a), and sometimes in the lower-right part, as in (b), which increases the detection difficulty of each stream network.
person's monocular video input to a target person's video in real time.
B. Face Reenactment Detection Techniques
Traditional face forgery detection extracts handcrafted features from face images for discrimination: local descriptors such as LBP [14] and SIFT [15] are used to obtain image features, which are then classified with an SVM [16] or other methods. Now that forged pictures and videos are produced in large numbers, deep neural networks are mainly used to extract features and judge the traces left by forgery. Matern [17] built on this by taking into account differences in physical features such as eyes and teeth. Exploiting the characteristics of forged videos, dynamic information such as dynamic texture [3], twitching [18], blinking [4] and muscle movement [19] is also used to assist identification. With the development of deep learning, multi-stream networks that consider different reference factors have gradually been used to distinguish forged images. Zhou [20] took local noise residuals and camera characteristics as a second stream. Atoum [21] proposed a novel two-stream CNN-based approach, which takes local features and the global depth of images as the inputs of two networks. Masi [22] used a two-stream network to suppress the influence of plain facial information on the network output. Kumar [2] achieved good results by dividing faces into multiple parts and using a multi-stream network to focus on local forgeries.
III. METHODS
In this section, we will first state our design motivation and
briefly introduce our framework. As mentioned earlier, the
differences between real and reenactment images are subtle
and often localized. Moreover, the operation method of face
reenactment usually leaves traces in the canthus, chin, cheek
and other parts. Therefore, it is very beneficial to divide
the face into different partial pictures and train specialized
networks to find the corresponding reenactment feature of each
similar facial part.
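As a simplified illustration, the snippet below shows how such a four-way partition of a cropped face could be implemented in PyTorch. The function name, the quadrant layout and the resizing step are assumptions made for illustration rather than the paper's exact preprocessing.

```python
import torch
import torch.nn.functional as F


def split_face_into_quadrants(face, out_size=224):
    """Split a cropped face tensor (C, H, W) into four local images
    (top-left, top-right, bottom-left, bottom-right), each resized back to
    out_size x out_size so every stream network sees the same resolution."""
    _, h, w = face.shape
    quads = [
        face[:, : h // 2, : w // 2],   # top-left
        face[:, : h // 2, w // 2:],    # top-right
        face[:, h // 2:, : w // 2],    # bottom-left
        face[:, h // 2:, w // 2:],     # bottom-right
    ]
    return [
        F.interpolate(q.unsqueeze(0), size=(out_size, out_size),
                      mode="bilinear", align_corners=False).squeeze(0)
        for q in quads
    ]


# Example: a 224x224 face crop yields four 224x224 local images.
parts = split_face_into_quadrants(torch.rand(3, 224, 224))
```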
However, in real application scenarios, the face videos obtained from social networks are unlikely to contain aligned faces. Therefore, if such a face image is simply cut into four local images, the facial content falling into each local image varies from frame to frame (see Fig. 2), which increases the detection difficulty of each stream network.