allowing quick screening of suspicious samples.
The problem of DeepFakes should be considered at scale: the
volume of material that may require analysis qualifies it as a
big data task. New legal regulations are being introduced
(e.g. the European Union's Strengthened Code of Practice on
Disinformation [15]) that obligate content providers and hosts
to implement self-regulatory standards to counteract
disinformation, including DeepFakes. This means that further
research on DeepFake detection should aim to decrease the
computational complexity of detection methods while maintaining
high performance. The most natural application of those methods is
in online multimedia platforms, where 720,000 hours of new
video are added daily [16]. Such platforms are often used for
spreading fake news, and lightweight detection methods can
help spot potentially forged materials.
In addition, such lightweight methods make it possible for
average citizens to detect forged materials themselves, since
the computational and memory requirements of running the
detection method are reduced. Note that many such citizens may
not be able to use GPUs to execute the detection.
We point out that, due to the small quantity of source
material available for an average citizen, they will likely be
targeted with less sophisticated (hence less convincing)
forgeries. This means that smaller models, despite not
achieving state-of-the-art results, are still capable of
detecting such manipulations. Such topics were previously
raised in [17]-[19].
Our contributions in this paper include:
• proposing a new neural network architecture, SpecRNet,
• comparing the performance of SpecRNet to its inspiration, RawNet2 [14], and to one of the leading architectures for Audio DeepFake Detection, LCNN [11], on the WaveFake dataset,
• a time evaluation on both CPU and GPU with respect to different batch sizes,
• an evaluation in three unique settings: data scarcity, limited attacks, and short utterances.
II. RELATED WORK
Detection of audio DeepFakes is a more recent task
than its visual equivalent. For this reason, the number of
both audio datasets and detection methods is much smaller.
ASVspoof [20] is one of the most important challenges
regarding the detection of spoofed audio. The 2021 edition
of this biennial challenge introduced a new subset: along
with logical access (LA) and physical access (PA), there is
now a speech DeepFake (DF) subset. This new subset was not
yet available when this paper was written. FakeAVCeleb [13]
is an example of a multi-modal dataset: its samples are
composed of both visual and audio manipulations. While the
visual part of the dataset was created using many methods, its
audio counterpart was generated using only one approach,
RTVC [4]. We therefore did not select it as our evaluation
set, due to its small variety of generated samples in relation
to other datasets. WaveFake [21] is, to this day, the largest
audio DeepFake detection dataset. It consists of about 120,000
samples, generated using 8 spoofing methods. Due to its
volume, the number of supported languages (English and
Japanese), and its variety of generation methods, it became
our choice of benchmark data.
Audio DeepFake detection, despite being less renowned than its
visual equivalent, is an ever-growing branch of authenticity
verification. Due to its resemblance to audio spoofing, some
spoofing detection methods have been adapted to this field as
well. Detection approaches can be divided into classical and
deep-learning-based methods. Gaussian Mixture Models (GMMs) [10]
are among the most prominent examples of the classical
approaches. Nowadays, however, the majority of methods are based
on deep learning (DL). One of the reasons is that classical
approaches like GMMs tend to require a separate model for each
attack type, which decreases the scalability and flexibility of
such solutions. DL-based methods can be further divided on the
basis of the audio representation they use. Methods like [14],
[22] process raw signal information. Solutions like [11], [12],
[23] rely on spectrogram representations of audio (front-ends).
Mel-frequency cepstral coefficients (MFCCs) and linear-frequency
cepstral coefficients (LFCCs) are among the most popular
front-ends. In addition, visual DeepFake detection methods based
on spectrogram features have lately been adapted to audio
DeepFake detection [13].
III. SPECRNET
In this section, we present a novel architecture, SpecRNet.
Its backbone is inspired by RawNet2 [14]. Contrary to its
predecessor, which operates on a one-dimensional raw audio
signal, it processes two-dimensional spectrogram information,
in particular linear frequency cepstral coefficients (LFCCs).
This decision was inspired by recent works that show a
significant increase in the performance of spectrogram-based
models in relation to architectures based on raw signals [21].
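To make the input representation concrete, the following is a
minimal sketch of an LFCC front-end using torchaudio; the number
of coefficients and the STFT parameters (n_fft, hop_length,
win_length) are illustrative assumptions, not a prescribed
configuration.

    # A minimal sketch of an LFCC front-end; parameter values
    # below are illustrative assumptions.
    import torch
    import torchaudio

    SAMPLE_RATE = 16_000  # assumed sampling rate

    lfcc_frontend = torchaudio.transforms.LFCC(
        sample_rate=SAMPLE_RATE,
        n_lfcc=40,  # cepstral coefficients per frame (assumption)
        speckwargs={"n_fft": 512, "hop_length": 160, "win_length": 400},
    )

    waveform = torch.randn(1, SAMPLE_RATE * 4)  # stand-in for 4 s of audio
    features = lfcc_frontend(waveform)          # shape: (1, n_lfcc, time_frames)

The resulting two-dimensional feature map is what the network
described below consumes.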
SpecRNet takes as input an LFCC representation of an audio
signal. The network's input is first normalized using
two-dimensional batch normalization [24], followed by the SeLU
activation function. The model is then composed of 3 residual
blocks containing 2 convolutional layers, each preceded by a
batch normalization layer and a LeakyReLU activation function.
Only the first residual block skips the normalization and
activation layers at its beginning, as the input has already
been processed by the batch normalization and SeLU layers. We
also utilize an additional convolution layer with a kernel size
of 1 on the residual identity path. This convolution layer
synchronizes the number of channels between the identity and
main paths. The blocks, similarly to RawNet2, come in two kinds
that differ in the number of convolution channels. Note that,
due to the constant number of channels (64) in the third
residual block, no synchronization between the identity and main
paths is required, so the additional convolution layer is not
applied in this case. Next, each block is followed by a
two-dimensional max pooling layer, an FMS attention block [25],
and another two-dimensional max pooling layer. The forward pass
through the residual block and the FMS attention layer is
presented in Figure 1.
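The following is a minimal PyTorch sketch of one such stage
(residual block, max pooling, FMS attention, max pooling).
The 3x3 kernel size, the LeakyReLU slope of 0.3, the channel
count of 32, and the multiplicative-plus-additive FMS variant
(carried over from RawNet2) are assumptions rather than an
exact specification.

    import torch
    import torch.nn as nn

    class FMS2D(nn.Module):
        """Filter-wise feature map scaling (FMS) [25], adapted to 2D
        feature maps; the multiplicative-plus-additive variant is an
        assumption carried over from RawNet2."""

        def __init__(self, channels: int):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)   # global average per channel
            self.fc = nn.Linear(channels, channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            s = torch.sigmoid(self.fc(self.pool(x).flatten(1)))
            s = s[:, :, None, None]               # (B, C, 1, 1) scaling factors
            return x * s + s                      # scale and shift each filter map

    class ResBlock2D(nn.Module):
        """Residual block: two 3x3 convolutions, each preceded by batch
        normalization and LeakyReLU; the first block in the network
        skips the leading normalization/activation pair."""

        def __init__(self, in_ch: int, out_ch: int, first: bool = False):
            super().__init__()
            self.first = first
            self.pre = nn.Sequential(nn.BatchNorm2d(in_ch), nn.LeakyReLU(0.3))
            self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.mid = nn.Sequential(nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.3))
            self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
            # 1x1 convolution on the identity path, applied only when the
            # channel counts of the identity and main paths differ
            self.downsample = (
                nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else None
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            identity = x if self.downsample is None else self.downsample(x)
            out = x if self.first else self.pre(x)
            out = self.conv2(self.mid(self.conv1(out)))
            return out + identity

    # One stage of SpecRNet: residual block -> max pool -> FMS -> max pool.
    stage = nn.Sequential(
        ResBlock2D(in_ch=1, out_ch=32, first=True),
        nn.MaxPool2d(kernel_size=2),
        FMS2D(32),
        nn.MaxPool2d(kernel_size=2),
    )

    lfcc_batch = torch.randn(8, 1, 40, 400)  # (batch, channel, n_lfcc, frames)
    out = stage(lfcc_batch)                  # -> (8, 32, 10, 100)

Each max pooling step halves both spectrogram dimensions, so the
two pooling layers surrounding the FMS block reduce the feature
map by a factor of four per stage.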