Aggregating Layers for Deepfake Detection

Amir Jevnisek
School of Electrical Engineering
Tel-Aviv University
Tel-Aviv, Israel
amirjevn@mail.tau.ac.il
Shai Avidan
School of Electrical Engineering
Tel-Aviv University
Tel-Aviv, Israel
avidan@eng.tau.ac.il
Abstract—The increasing popularity of facial manipulation
(Deepfakes) and synthetic face creation raises the need to develop
robust forgery detection solutions. Crucially, most work in this
domain assumes that the Deepfakes in the test set come from
the same Deepfake algorithms that were used for training the
network. This is not how things work in practice. Instead, we
consider the case where the network is trained on one Deepfake
algorithm, and tested on Deepfakes generated by another algo-
rithm. Typically, supervised techniques follow a pipeline of visual
feature extraction from a deep backbone, followed by a binary
classification head. Instead, our algorithm aggregates features
extracted across all layers of one backbone network to detect
a fake. We evaluate our approach on two domains of interest:
Deepfake detection and synthetic image detection, and find that
we achieve state-of-the-art (SOTA) results.
I. INTRODUCTION
High quality facial manipulations are no longer within the
purview of the research community. Facial Forgery (Deep-
fakes) tools are widespread and available to all. The Reface
App [4], for example, replaces one's face with Captain Jack
Sparrow's1 face from a few images of the target face, all
within a couple of seconds. While this is an example of
an entertaining use of facial manipulations, the use of this
technology might pose threats to privacy, democracy and na-
tional security [8]. Therefore, it is clear that forgery detection
algorithms are needed.
It is common [26] to categorize facial manipulations into
four families of manipulations: i) entire face synthesis, ii)
identity swap, iii) attribute manipulation, and iv) expression
swap, which we will refer to as facial reenactment. In entire
face synthesis, random noise serves as the input to a system
that generates a fully synthesized face image. Identity swap is
a family of methods in which a source face is blended into a
target face image; the outcome combines the target's context
with the source's identity. Attribute manipulation takes one
facial attribute, such as "wears eyeglasses" or "hair color",
and changes it. Facial reenactment, on the other hand, preserves
both the context and the identity but transfers the gestures
made by an "actor" (source) video onto the target.
Most research in the field assumes that the training set and
test set come from the same distribution. That is, a collection
of Deepfake images, created by a number of Deepfake algo-
rithms, is randomly split into train and test sets, and the goal of
1Fictional character from the Pirates of the Caribbean movie series.
the Deepfake detector is to correctly distinguish fake images
from real.
We argue that this is not how a deepfake detector will be
used in practice. In practice, the detector will be trained on
Deepfake images produced by one algorithm and will have to
detect Deepfake images produced by a yet-to-be-developed,
unknown Deepfake algorithm. This is the setting of this
paper.
A straightforward approach to Deepfake detection is to
rely on some backbone neural network (e.g., ResNet, VGG,
EfficientNet) with a binary classification head. This assumes
that data propagates through the network until it reaches the
classification head that determines if the image is real or fake.
We improve the performance of the backbone network
by aggregating information from all layers of the network.
Specifically, we use skip connections from every layer of the
network to the fully-connected classification head. This way,
various features, corresponding to different receptive fields in
the image plane, are all used, at once, by the classification
head.
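As a sketch of this idea (the class name, channel counts, and head sizes below are our own illustrative choices, not the paper's exact architecture), pooled features from every backbone stage can be concatenated and fed to one fully-connected head:

```python
import torch
import torch.nn as nn

class LayerAggregationHead(nn.Module):
    """Illustrative only: concatenate globally pooled features from
    every backbone stage and classify real vs. fake from all of them
    at once, instead of using the last layer alone."""

    def __init__(self, stage_channels, hidden=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # one vector per stage
        self.fc = nn.Sequential(
            nn.Linear(sum(stage_channels), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single real/fake logit
        )

    def forward(self, stage_features):
        # stage_features: list of (B, C_i, H_i, W_i) tensors, one per layer.
        pooled = [self.pool(f).flatten(1) for f in stage_features]
        return self.fc(torch.cat(pooled, dim=1))
```

Each stage contributes a fixed-size vector regardless of its spatial resolution, so features with different receptive fields reach the classifier directly rather than only through the final layer.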
Comparing the performance of Deepfake detectors is usually
done using the Average Precision (AP) metric. However, when
the detector is trained on a set coming from one Deepfake
algorithm and tested on a set coming from a different Deepfake
algorithm, we need a way to rank the competing algorithms. To
this end, we suggest using a popular measure, the Coefficient
of Variation (CoV) of Average Precision scores, to measure the
performance of the various algorithms on different datasets.
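The CoV itself is simply the standard deviation of the per-dataset AP scores divided by their mean; a minimal sketch (the function name is ours, and we use the population standard deviation — a sample-based variant would differ slightly):

```python
import statistics

def coefficient_of_variation(ap_scores):
    """CoV of a set of Average Precision scores: lower means the
    detector's performance is more stable across datasets."""
    mean = statistics.mean(ap_scores)
    return statistics.pstdev(ap_scores) / mean
```

A detector with identical AP on every dataset has CoV 0, so ranking by CoV rewards cross-dataset consistency rather than peak performance on any single benchmark.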
We use these measures to report results on both synthetic
image detection and Deepfake detection on standard
datasets. In summary, our contributions are threefold:
• A new architecture to fine-tune a backbone network, using
its layers.
• Works on both Deepfake and Synthetic Image detection.
• SOTA overall performance for cross-dataset generalization
(training on one Deepfake or synthetic image model, and
testing on another).
II. RELATED WORKS
A. Forgery Detection Techniques
Forgery detection techniques can be roughly divided into
three categories. The first class is based on Spatial/Frequency
methods. These methods are based on some well-engineered
cues that are extracted from the image.
arXiv:2210.05478v1 [cs.CV] 11 Oct 2022
Such cues have been thoroughly investigated in [6], while more face-specific (i.e.,
physiological) cues are discussed in [10]. To this category
we can also attribute multi-task methods that attempt to find
CNN artifacts [32] in images or inconsistency in non-localized
features [30].
In the frequency domain, [15] use a DCT-coefficient re-
arrangement block and a feature extraction block to mine
frequency cues in a data-driven manner. On top of that,
their main contribution is a loss that promotes intra-class
compactness: it encourages pristine images to lie close to a
center point, while manipulated faces are pushed away from
it by at least a margin. [27] takes the
route of data augmentation to guide a detector to refine and
enlarge its attention. They compute the Top-N sensitive facial
regions using a gradient-based method. They then occlude
these regions with random integers, and the resulting image
is fed back to the model with the same label. This process
allows the model to mine for features it ignored before.
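The single-center idea of [15] can be sketched roughly as follows (this is our own simplified toy version, not their exact loss; the function name and margin value are illustrative):

```python
import torch

def single_center_margin_loss(feats, labels, center, margin=0.5):
    """Pull pristine embeddings (label 0) toward one center point and
    push manipulated ones (label 1) at least `margin` away from it."""
    d = (feats - center).norm(dim=1)      # distance of each embedding to the center
    loss_real = d[labels == 0].mean()     # compactness term for pristine images
    loss_fake = torch.clamp(margin - d[labels == 1], min=0.0).mean()
    return loss_real + loss_fake
```

Fakes already farther than the margin contribute zero loss, so the objective concentrates on compacting the pristine class rather than spreading fakes arbitrarily.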
The second category exploits temporal features. [17] extract
visual and temporal features using both CNN and RNN. CNNs
are used to extract visual cues from each frame and RNNs
are used to aggregate features from all regions in all frames.
[21] and [9] use facial landmarks extracted from a sequence
of frames to distinguish real from fake videos. It follows that
these methods measure their performance at the video level.
[21] use the facial landmark locations and velocities as
inputs to two RNNs, whose outputs are, in turn, aggregated
into a final prediction score. [9] use a backbone pretrained on
an auxiliary task: visual speech recognition (lipreading). The
features extracted from every frame of the video are then
aggregated by a temporal network that produces the final
real/fake classification verdict.
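The common shape of these temporal pipelines can be sketched as below (names and sizes are our illustrative assumptions, not any specific paper's architecture): per-frame feature vectors are summarized by a recurrent aggregator that emits one video-level verdict.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Toy sketch of the CNN+RNN recipe: per-frame feature vectors
    (e.g., from a CNN backbone) are summarized by a GRU whose last
    hidden state yields a single video-level real/fake logit."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, frame_feats):       # (batch, frames, feat_dim)
        _, h = self.rnn(frame_feats)      # h: (1, batch, hidden)
        return self.fc(h[-1])             # (batch, 1) video-level logit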
The third category is based on anomaly detection. The goal
of such methods is to train solely on pristine images and
output a normality score of the query image. One can think
of the normality score as a measure of how real the query
image is or, put another way, the score can be thought as the
inverse of an “out-of-distribution” score. Kahlid et al. [13] use
both reconstruction loss and latent space distance to predict
real from fake images. [5] map the real and fake classes to
Gaussian distributions and measure the distribution alignment
distance.
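A toy autoencoder illustrates the anomaly-detection recipe (our own minimal sketch, not the actual model of [13] or [5]): train on pristine faces only, then score queries by how well they reconstruct.

```python
import torch
import torch.nn as nn

class NormalityScorer(nn.Module):
    """Trained on pristine faces only; at test time the negative
    reconstruction error acts as the normality score (higher means
    more "real", since fakes should reconstruct poorly)."""

    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):                 # x: (batch, dim) face features
        recon = self.dec(torch.relu(self.enc(x)))
        return -((x - recon) ** 2).mean(dim=1)  # (batch,) normality scores
```

Because training never sees fakes, no assumption is made about which manipulation will appear at test time, which is what makes this family attractive for the cross-algorithm setting.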
B. Manipulations
Image manipulations are at the core of computer vision
tasks. They range from image enhancement to image splicing
and blending which can be applied to any natural image. Facial
image manipulations for identity swapping are closely related
to the aforementioned tasks. Deepfakes [1] and FaceSwap [3]
are examples of identity swapping techniques based on deep
learning techniques. On the other hand, Face2Face [25] and
NeuralTextures [24] are examples of Facial Reenactment tech-
niques. Some datasets do not state the manipulation technique
(or techniques) used to generate the manipulations.
Fig. 1: Examples of fake images from FaceForensics++.
Top to bottom and left to right: Deepfakes, FaceSwap,
Face2Face and NeuralTextures.
C. Deepfake Datasets
FaceForensics++ [19] is a common benchmark for fake
detection of human faces. This dataset consists of four manip-
ulations: Deepfakes [1], Face2Face [25], FaceSwap [3] and
NeuralTextures [24] applied to a set of 1,000 pristine videos.
The videos are available in three compression modes: Raw
(c0), High Quality (c23) and Low Quality (c40).
D. Synthetic Images
We consider synthetic facial images created by fully gen-
erative models. We use three types of such generative meth-
ods: Generative Adversarial methods, Flow-based generative
methods and Gaussian Mixture Models (GMMs). Generative
adversarial methods map noise from some distribution to an
image that looks like a face. The building blocks of these
methods are a generative component, that is responsible for
the image creation, and a discriminative component, that helps
the generator to train by distinguishing pristine from synthetic
images. Datasets of synthetic faces differ in the method used
to create synthetic images and the dataset of original images
that they were adversarially trained against.
Progressive-GAN (PGAN) [11] is a method that is trained
gradually: it first builds low-resolution images of synthetic
faces and, as training evolves, adds new layers to the generator
and discriminator, which in turn add fine details to the
resulting images.
StyleGAN [12] is essentially an alternative generator
architecture for generative adversarial networks.
While traditional generators feed the latent representations into