
special stream network to obtain better classification results.
The central idea of GGViT is to add the embedding of the whole
face image to each local part of the face image, so that global
information about the whole face can guide the stream network
responsible for that local region. To address the problem of
image quality, we design an image quality block that extracts
image quality information and imposes different constraints
on the final classification results according to that quality,
enhancing the generalization ability of GGViT. The prediction
results of the multi-stream networks are then recombined
by the fusion attention block of GGViT, which pays more
attention to the predictions of the streams that discriminate
better at the corresponding image quality.
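To make this concrete, the following is a minimal PyTorch sketch of the global-guided multi-stream idea described above. All names here (GGViTSketch, make_vit, quality_block, fusion_attention), the additive global guidance, and the softmax fusion are our illustrative reading of the description, not the authors' code.

```python
import torch
import torch.nn as nn

def make_vit(embed_dim):
    # Stand-in for a ViT encoder that returns a (B, embed_dim) [CLS]
    # embedding; in practice this could be a pretrained ViT backbone.
    return nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

class GGViTSketch(nn.Module):
    def __init__(self, embed_dim=768, num_parts=4, num_classes=2):
        super().__init__()
        self.global_vit = make_vit(embed_dim)  # whole-face stream
        self.local_vits = nn.ModuleList(
            make_vit(embed_dim) for _ in range(num_parts))
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, num_classes) for _ in range(num_parts))
        # Image quality block: predicts a quality score from the
        # whole-face embedding (assumed form of the quality branch).
        self.quality_block = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Fusion attention block: scores each stream so that streams
        # which discriminate better receive more weight.
        self.fusion_attention = nn.Linear(embed_dim, 1)

    def forward(self, whole_face, parts):
        g = self.global_vit(whole_face)  # global embedding, (B, D)
        # Add the whole-face embedding to each local stream's embedding,
        # letting global information guide the local networks.
        feats = [vit(p) + g for vit, p in zip(self.local_vits, parts)]
        logits = torch.stack(
            [h(f) for h, f in zip(self.heads, feats)], dim=1)  # (B, P, C)
        attn = torch.softmax(torch.stack(
            [self.fusion_attention(f) for f in feats], dim=1), dim=1)
        fused = (attn * logits).sum(dim=1)  # fused prediction, (B, C)
        quality = self.quality_block(g)     # feeds the quality constraint
        return fused, quality
```

With a batch of whole-face crops and four part crops, fused can be trained with a standard classification loss, while quality would feed the quality-dependent constraint mentioned above.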
We conducted extensive experiments on the FF++[6] dataset,
and the results show that GGViT achieves state-of-the-art
performance on FF++ and improves markedly in scenarios with
different compression rates.
In summary, we make three major contributions in this
paper:
• We use ViT models to design a multi-stream deep learning
network to detect facial reenactment in videos.
• We propose a loss function that performs well across
networks trained at different compression rates.
• We conduct extensive experiments on the proposed model,
and our method achieves state-of-the-art results on
mainstream datasets.
II. RELATED WORK
In recent years, face forgery has received increasing attention
due to its wide range of applications. Correspondingly, face
forgery detection has also become a popular research field.
In this section, we will briefly review the evolution of face
reenactment technology and the progress of corresponding
face reenactment detection methods.
A. Face Reenactment Generation Techniques
Face reenactment refers to transferring the source facial
expression to the target face without changing the identity
of the target. Existing methods can be roughly divided into
those using three-dimensional models and those based on
GANs[7]. Suwajanakorn[8] produced photorealistic results
by using audio to drive lip movements, combining proper
3-D pose with high-quality lip textures and mouth shapes.
Volker[9] derived a morphable face model by transforming the
shape and texture of example faces into a vector space represen-
tation. Face2Face[10] effectively transferred the expressions
of the target and source faces through a transfer matrix, taking
into account the details of mouth opening, and re-rendered and
synthesized the faces with the changed expressions.
GAN-based methods require a large number of paired
images for training. Jin[11] directly used CycleGAN[12] to
exchange expressions between faces of different identities,
capturing details of facial expressions and head poses to generate
transformation videos with higher consistency and stability.
ReenactGAN[13] used a mapping in the latent space to
transfer facial movements and expressions from an arbitrary
person's monocular video input to a target person's video in
real time.

Fig. 2. In real application scenarios, the obtained faces are not aligned.
When such a picture is divided into four parts, the network responsible for
detecting the eye region may receive only half an eye, while the nose
sometimes appears entirely at the lower left, as in (a), and sometimes at
the lower right, as in (b), which increases the detection difficulty of each
stream network.
B. Face Reenactment Detection Techniques
Traditional face forgery detection obtains hand-crafted
features from face images for discrimination: it uses
LBP[14], SIFT[15], and other local descriptors to extract im-
age features, and then applies SVM[16] or other methods for
classification. As forged pictures and videos began to be
produced in large numbers, deep neural networks became the
main tool for extracting features and judging the traces left
by forgery. Matern[17] built on this by taking into
account differences in physical features such as eyes and teeth.
Exploiting the characteristics of forged videos, dynamic
information such as dynamic texture[3], twitching[18],
blinking[4], and muscle movement[19] has been used
to assist identification. With the development of deep learning,
multi-stream networks were gradually adopted to distinguish
forged images, with each stream attending to different reference factors.
Zhou[20] took local noise residuals and camera characteristics
as a second stream. Atoum[21] proposed a novel two-stream
CNN-based approach, which takes the local features and
global depth of images as the inputs of the two networks. Masi[22]
used a two-stream network to suppress the influence of simple
facial information on the network output. Kumar[2] achieved
good results by dividing faces into multiple parts and using a
multi-stream network to focus on local forgeries.
III. METHODS
In this section, we first state our design motivation and
briefly introduce our framework. As mentioned earlier, the
differences between real and reenactment images are subtle
and often localized. Moreover, face reenactment operations
usually leave traces at the canthus, chin, cheeks,
and other parts of the face. Therefore, it is very beneficial to divide
the face into different partial pictures and train specialized
networks to find the corresponding reenactment features of
each facial part.
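As a concrete illustration, the sketch below builds multi-stream inputs by a naive quadrant split of a face crop into four local images plus the whole face. The helper name split_face and the exact crop layout are assumptions for illustration, not the paper's precise cropping scheme; as the next paragraph notes, this naive split is fragile when faces are not aligned.

```python
import torch

def split_face(face: torch.Tensor):
    """face: (B, C, H, W) face crop -> whole face and four quadrant parts."""
    _, _, h, w = face.shape
    parts = [
        face[:, :, : h // 2, : w // 2],  # upper-left  (around the left eye)
        face[:, :, : h // 2, w // 2 :],  # upper-right (around the right eye)
        face[:, :, h // 2 :, : w // 2],  # lower-left  (cheek / mouth corner)
        face[:, :, h // 2 :, w // 2 :],  # lower-right (cheek / chin)
    ]
    return face, parts

faces = torch.randn(8, 3, 224, 224)   # a batch of face crops
whole, parts = split_face(faces)      # inputs for the five streams
```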
However, the face videos obtained from social networks in
actual scenes are unlikely to contain aligned faces. Therefore, if
the face is partially converted into four local images, the face