Aggregating Layers for Deepfake Detection

Amir Jevnisek
School of Electrical Engineering
Tel-Aviv University
Tel-Aviv, Israel
amirjevn@mail.tau.ac.il
Shai Avidan
School of Electrical Engineering
Tel-Aviv University
Tel-Aviv, Israel
avidan@eng.tau.ac.il
Abstract—The increasing popularity of facial manipulation
(Deepfakes) and synthetic face creation raises the need to develop
robust forgery detection solutions. Crucially, most work in this
domain assumes that the Deepfakes in the test set come from
the same Deepfake algorithms that were used for training the
network. This is not how things work in practice. Instead, we
consider the case where the network is trained on one Deepfake
algorithm, and tested on Deepfakes generated by another algo-
rithm. Typically, supervised techniques follow a pipeline of visual
feature extraction from a deep backbone, followed by a binary
classification head. Instead, our algorithm aggregates features
extracted across all layers of one backbone network to detect
a fake. We evaluate our approach on two domains of interest:
Deepfake detection and synthetic image detection, and find that
we achieve state-of-the-art (SOTA) results.
I. INTRODUCTION
High quality facial manipulations are no longer within the
purview of the research community. Facial Forgery (Deep-
fakes) tools are widespread and available to all. The Reface
App [4], for example, replaces one's face with Captain Jack
Sparrow's1 face from a few images of the target face, all
within a couple of seconds. While this is an example of
an entertaining use of facial manipulations, the use of this
technology might pose threats to privacy, democracy and na-
tional security [8]. Therefore, it is clear that forgery detection
algorithms are needed.
It is common [26] to categorize facial manipulations into
four families of manipulations: i) entire face synthesis, ii)
identity swap, iii) attribute manipulation, and iv) expression
swap, which we will refer to as facial reenactment. In entire
face synthesis, random noise serves as the input to a system
that generates a fully synthesized face image. Identity swap is
a family of methods in which a source face is blended into a
target face image; the outcome combines the target's context
with the source's identity. Attribute manipulation takes one
facial attribute, such as "wears eyeglasses" or "hair color",
and changes it. Facial reenactment, on the other hand, preserves
both the context and the identity but transfers the gestures
made by an "actor" (source) video onto the target.
Most research in the field assumes that the training set and
test set come from the same distribution. That is, a collection
of Deepfake images, created by a number of Deepfake algo-
rithms, is randomly split into train and test sets, and the goal of
1Fictional character from the Pirates of the Caribbean movie series.
the Deepfake detector is to correctly distinguish fake images
from real.
We argue that this is not how a deepfake detector will be
used in practice. In practice, the detector will be trained on
Deepfake images produced by one algorithm and will have to
detect Deepfake images produced by a yet-to-be-developed,
unknown Deepfake algorithm. This is the setting of this
paper.
A straightforward approach to Deepfake detection is to
rely on some backbone neural network (e.g., ResNet, VGG,
EfficientNet) with a binary classification head. This assumes
that data propagates through the network until it reaches the
classification head that determines if the image is real or fake.
We improve the performance of the backbone network
by aggregating information from all layers of the network.
Specifically, we use skip connections from every layer of the
network to the fully-connected classification head. This way,
various features, corresponding to different receptive fields in
the image plane, are all used, at once, by the classification
head.
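As a sketch of this idea (the class name, channel counts, and head sizes below are our own illustrative choices, not the paper's exact architecture), pooled features from every backbone stage can be concatenated and fed to one fully-connected head:

```python
import torch
import torch.nn as nn

class LayerAggregationHead(nn.Module):
    """Illustrative only: concatenate globally pooled features from
    every backbone stage and classify real vs. fake from all of them
    at once, instead of using the last layer alone."""

    def __init__(self, stage_channels, hidden=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # one vector per stage
        self.fc = nn.Sequential(
            nn.Linear(sum(stage_channels), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single real/fake logit
        )

    def forward(self, stage_features):
        # stage_features: list of (B, C_i, H_i, W_i) tensors, one per layer.
        pooled = [self.pool(f).flatten(1) for f in stage_features]
        return self.fc(torch.cat(pooled, dim=1))
```

Each stage contributes a fixed-size vector regardless of its spatial resolution, so features with different receptive fields reach the classifier directly rather than only through the final layer.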
Comparing the performance of Deepfake detectors is usually
done using the Average Precision (AP) metric. However, when
the detector is trained on a set coming from one Deepfake
algorithm and tested on a set coming from a different Deepfake
algorithm, we need a way to rank the competing algorithms. To
this end, we suggest using a popular measure, the Coefficient
of Variation (CoV) of Average Precision scores, to measure the
performance of the various algorithms on different datasets.
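The CoV itself is simply the standard deviation of the per-dataset AP scores divided by their mean; a minimal sketch (the function name is ours, and we use the population standard deviation — a sample-based variant would differ slightly):

```python
import statistics

def coefficient_of_variation(ap_scores):
    """CoV of a set of Average Precision scores: lower means the
    detector's performance is more stable across datasets."""
    mean = statistics.mean(ap_scores)
    return statistics.pstdev(ap_scores) / mean
```

A detector with identical AP on every dataset has CoV 0, so ranking by CoV rewards cross-dataset consistency rather than peak performance on any single benchmark.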
We use these measures to report results on both synthetic
image detection and Deepfake detection on standard
datasets. In summary, our contributions are threefold:
• A new architecture to fine-tune a backbone network, using
its layers.
• Works on both Deepfake and Synthetic Image detection.
• SOTA overall performance for cross-dataset generalization
(training on one Deepfake or synthetic image model, and
testing on another).
II. RELATED WORKS
A. Forgery Detection Techniques
Forgery detection techniques can be roughly divided into
three categories. The first class is based on Spatial/Frequency
methods. These methods are based on some well-engineered
cues that are extracted from the image.
arXiv:2210.05478v1 [cs.CV] 11 Oct 2022
Such cues have been thoroughly investigated in [6], while more face-specific (i.e.,
physiological) cues are discussed in [10]. To this category
we can also attribute multi-task methods that attempt to find
CNN artifacts [32] in images or inconsistency in non-localized
features [30].
In the frequency domain, [15] use a DCT-coefficient re-
arrangement block and a feature extraction block to mine
frequency cues in a data-driven manner. On top of that,
their main contribution is a loss that promotes intra-class
compactness: it encourages pristine images to lie close to a
center point, while manipulated faces are pushed away from
it by at least a margin. [27] takes the
route of data augmentation to guide a detector to refine and
enlarge its attention. They compute the Top-N sensitive facial
regions using a gradient-based method. They then occlude
these regions with random integers, and the resulting image
is fed back to the model with the same label. This process
allows the model to mine for features it ignored before.
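The single-center idea of [15] can be sketched roughly as follows (this is our own simplified toy version, not their exact loss; the function name and margin value are illustrative):

```python
import torch

def single_center_margin_loss(feats, labels, center, margin=0.5):
    """Pull pristine embeddings (label 0) toward one center point and
    push manipulated ones (label 1) at least `margin` away from it."""
    d = (feats - center).norm(dim=1)      # distance of each embedding to the center
    loss_real = d[labels == 0].mean()     # compactness term for pristine images
    loss_fake = torch.clamp(margin - d[labels == 1], min=0.0).mean()
    return loss_real + loss_fake
```

Fakes already farther than the margin contribute zero loss, so the objective concentrates on compacting the pristine class rather than spreading fakes arbitrarily.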
The second category exploits temporal features. [17] extract
visual and temporal features using both CNN and RNN. CNNs
are used to extract visual cues from each frame and RNNs
are used to aggregate features from all regions in all frames.
[21] and [9] use facial landmarks extracted from a sequence
of frames to distinguish real from fake videos. It follows that
these methods measure their performance at the video level.
[21] use the facial landmark locations and velocities as
inputs to two RNNs, whose outputs are, in turn, aggregated
into a final prediction score. [9] use a backbone pretrained on
an auxiliary task: visual speech recognition (lipreading). The
features extracted from every frame of the video are then
aggregated by a temporal network that produces the final
real/fake classification verdict.
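The common shape of these temporal pipelines can be sketched as below (names and sizes are our illustrative assumptions, not any specific paper's architecture): per-frame feature vectors are summarized by a recurrent aggregator that emits one video-level verdict.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Toy sketch of the CNN+RNN recipe: per-frame feature vectors
    (e.g., from a CNN backbone) are summarized by a GRU whose last
    hidden state yields a single video-level real/fake logit."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, frame_feats):       # (batch, frames, feat_dim)
        _, h = self.rnn(frame_feats)      # h: (1, batch, hidden)
        return self.fc(h[-1])             # (batch, 1) video-level logit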
The third category is based on anomaly detection. The goal
of such methods is to train solely on pristine images and
output a normality score of the query image. One can think
of the normality score as a measure of how real the query
image is or, put another way, the score can be thought as the
inverse of an “out-of-distribution” score. Kahlid et al. [13] use
both reconstruction loss and latent space distance to predict
real from fake images. [5] map the real and fake classes to
Gaussian distributions and measure the distribution alignment
distance.
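A toy autoencoder illustrates the anomaly-detection recipe (our own minimal sketch, not the actual model of [13] or [5]): train on pristine faces only, then score queries by how well they reconstruct.

```python
import torch
import torch.nn as nn

class NormalityScorer(nn.Module):
    """Trained on pristine faces only; at test time the negative
    reconstruction error acts as the normality score (higher means
    more "real", since fakes should reconstruct poorly)."""

    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):                 # x: (batch, dim) face features
        recon = self.dec(torch.relu(self.enc(x)))
        return -((x - recon) ** 2).mean(dim=1)  # (batch,) normality scores
```

Because training never sees fakes, no assumption is made about which manipulation will appear at test time, which is what makes this family attractive for the cross-algorithm setting.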
B. Manipulations
Image manipulations are at the core of computer vision
tasks. They range from image enhancement to image splicing
and blending which can be applied to any natural image. Facial
image manipulations for identity swapping are closely related
to the aforementioned tasks. Deepfakes [1] and FaceSwap [3]
are examples of identity swapping techniques based on deep
learning techniques. On the other hand, Face2Face [25] and
NeuralTextures [24] are examples of Facial Reenactment tech-
niques. Some datasets do not state the manipulation technique
(or techniques) used to generate the manipulations.
Fig. 1: Examples of fake images from FaceForensics++.
Top to bottom and left to right: Deepfakes, FaceSwap,
Face2Face and NeuralTextures.
C. Deepfake Datasets
FaceForensics++ [19] is a common benchmark for fake
detection of human faces. This dataset consists of four manip-
ulations: Deepfakes [1], Face2Face [25], FaceSwap [3] and
NeuralTextures [24] applied to a set of 1,000 pristine videos.
The videos are available in three compression modes: Raw
(c0), High Quality (c23) and Low Quality (c40).
D. Synthetic Images
We consider synthetic facial images created by fully gen-
erative models. We use three types of such generative meth-
ods: Generative Adversarial methods, Flow-based generative
methods and Gaussian Mixture Models (GMMs). Generative
adversarial methods map noise from some distribution to an
image that looks like a face. The building blocks of these
methods are a generative component, that is responsible for
the image creation, and a discriminative component, that helps
the generator to train by distinguishing pristine from synthetic
images. Datasets of synthetic faces differ in the method used
to create synthetic images and the dataset of original images
that they were adversarially trained against.
Progressive-GAN (PGAN) [11] is a method that is trained
gradually: it first builds low-resolution images of synthetic
faces and, as training evolves, adds new layers to the generator
and discriminator, which in turn add fine details to the
resulting images.
StyleGAN [12] is essentially an alternative generator
architecture for generative adversarial networks.
While traditional generators feed the latent representations into