Aggregating Layers for Deepfake Detection
Amir Jevnisek
School of Electrical Engineering
Tel-Aviv University
Tel-Aviv, Israel
amirjevn@mail.tau.ac.il
Shai Avidan
School of Electrical Engineering
Tel-Aviv University
Tel-Aviv, Israel
avidan@eng.tau.ac.il
Abstract—The increasing popularity of facial manipulation
(Deepfakes) and synthetic face creation raises the need for
robust forgery detection solutions. Crucially, most work in this
domain assumes that the Deepfakes in the test set come from
the same Deepfake algorithms that were used to train the
network. This is not how things work in practice. Instead, we
consider the case where the network is trained on one Deepfake
algorithm and tested on Deepfakes generated by another.
Typically, supervised techniques follow a pipeline of visual
feature extraction from a deep backbone, followed by a binary
classification head. In contrast, our algorithm aggregates features
extracted across all layers of one backbone network to detect
a fake. We evaluate our approach on two domains of interest,
Deepfake detection and synthetic image detection, and find that
we achieve state-of-the-art (SOTA) results.
I. INTRODUCTION
High quality facial manipulations are no longer within the
purview of the research community. Facial forgery (Deepfake)
tools are widespread and available to all. The Reface App [4],
for example, replaces one's face with Captain Jack Sparrow's1
from a few images of the target face, within a couple of
seconds. While this is an example of an entertaining use of
facial manipulation, the technology might pose threats to
privacy, democracy, and national security [8]. Therefore, it is
clear that forgery detection algorithms are needed.
It is common [26] to categorize facial manipulations into
four families of manipulations: i) entire face synthesis, ii)
identity swap, iii) attribute manipulation, and iv) expression
swap, which we refer to as facial reenactment. In entire face
synthesis, random noise serves as input to a system and a
fully synthesized face image is generated. Identity swap is a
family of methods in which a source face is blended into a
target face image; the outcome combines the target's context
with the source's identity. Attribute manipulation takes one
facial attribute, such as "wears eyeglasses" or "hair color",
and changes it. Facial reenactment, on the other hand, preserves
both the context and the identity, but replaces the target's
gestures with those made by an "actor" in a source video.
Most research in the field assumes that the training set and
test set come from the same distribution. That is, a collection
of Deepfake images, created by a number of Deepfake algo-
rithms, is randomly split into train and test sets, and the goal
of the Deepfake detector is to correctly distinguish fake images
from real ones.
1Fictional character from the Pirates of the Caribbean movie series.
We argue that this is not how a Deepfake detector will be
used in practice. In practice, the detector will be trained on
Deepfake images produced by one algorithm and will have to
detect Deepfake images produced by a yet-to-be-developed,
unknown Deepfake algorithm. This is the setting of this
paper.
A straightforward approach to Deepfake detection is to
rely on some backbone neural network (e.g., ResNet, VGG,
EfficientNet) with a binary classification head. Here, data
propagates through the network until it reaches the
classification head, which determines whether the image is real or fake.
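As a concrete illustration, this standard pipeline can be sketched as follows; the feature-map shape, the global average pooling, and the random head weights are illustrative assumptions, not details of any specific backbone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final-stage feature map of a backbone: (channels, H, W).
final_features = rng.standard_normal((512, 7, 7))

# Only the last layer's features reach the head: pool them to one vector.
pooled = final_features.mean(axis=(1, 2))        # shape (512,)

# Binary classification head (random placeholder weights).
w = rng.standard_normal(512)
logit = float(pooled @ w)
p_fake = 1.0 / (1.0 + np.exp(-logit))            # sigmoid: probability of "fake"
```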
We improve the performance of the backbone network
by aggregating information from all layers of the network.
Specifically, we use skip connections from every layer of the
network to the fully-connected classification head. This way,
various features, corresponding to different receptive fields in
the image plane, are all used at once by the classification
head.
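A minimal sketch of this aggregation, assuming hypothetical feature-map shapes for three backbone stages (the per-stage pooling and the random head weights are our illustrative choices, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from three backbone stages: (channels, H, W).
# Deeper stages have more channels and larger receptive fields.
feature_maps = [
    rng.standard_normal((64, 56, 56)),
    rng.standard_normal((128, 28, 28)),
    rng.standard_normal((256, 14, 14)),
]

# Global-average-pool every stage to a fixed-size vector, then concatenate,
# so the head sees features from all receptive fields at once.
pooled = [fm.mean(axis=(1, 2)) for fm in feature_maps]
aggregated = np.concatenate(pooled)      # shape (64 + 128 + 256,) = (448,)

# Fully-connected binary head over the aggregated features
# (random placeholder weights).
w = rng.standard_normal(aggregated.shape[0])
logit = float(aggregated @ w)
p_fake = 1.0 / (1.0 + np.exp(-logit))    # sigmoid: probability of "fake"
```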
Comparing the performance of Deepfake detectors is usu-
ally done using the Average Precision (AP) metric. However,
when the detector is trained on a set coming from one Deepfake
algorithm and tested on a set coming from a different Deepfake
algorithm, we need a way to rank the competing algorithms. To
this end, we suggest using a popular measure, the Coefficient
of Variation (CoV) of Average Precision scores, to measure the
performance of the various algorithms on different datasets.
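For concreteness, the CoV of a detector's AP scores is simply their standard deviation divided by their mean; the AP values below are made up for illustration:

```python
import numpy as np

def coefficient_of_variation(ap_scores):
    """CoV = std / mean; a lower value means the detector's accuracy
    is more consistent across unseen Deepfake algorithms."""
    scores = np.asarray(ap_scores, dtype=float)
    return scores.std() / scores.mean()

# Hypothetical AP scores of one detector on four unseen Deepfake test sets.
aps = [0.95, 0.80, 0.70, 0.85]
cov = coefficient_of_variation(aps)
```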
We use these measures to report results on both synthetic
image detection and Deepfake detection on standard
datasets. In summary, our contributions are threefold:
• A new architecture to fine-tune a backbone network,
using its layers.
• A method that works on both Deepfake and synthetic
image detection.
• SOTA overall performance for cross-dataset generaliza-
tion (training on one Deepfake or synthetic image model,
and testing on another).
II. RELATED WORKS
A. Forgery Detection Techniques
Forgery detection techniques can be roughly divided into
three categories. The first class is based on spatial/frequency
methods, which rely on well-engineered cues that are
extracted from the image. Such cues have been
arXiv:2210.05478v1 [cs.CV] 11 Oct 2022