
proposed approach, Section IV illustrates the dataset and the
experimental setup, while Section V reports the final accuracy
on different types of datasets. Final conclusions are drawn in
Section VI.
II. RELATED WORKS
Generative speech approaches can mainly be divided into two branches, i.e., text-to-speech (TTS) and voice conversion (VC) algorithms. The former starts from a textual representation and aims at producing the corresponding waveform, while the latter modifies a signal to change the perceived identity of the speaker in the audio.
Early TTS approaches were based on waveform concatenation [15], [16], where diphones from large datasets are stitched together seamlessly. More recently, researchers have started to design techniques that produce audio features from text representations using an acoustic model (usually a hidden Markov model) [17], [18]; these features are then processed with a vocoder synthesizer such as STRAIGHT [19], WORLD [20] or VOCAINE [21] to produce the corresponding waveform. To improve upon this, neural networks have also been used to substitute either the acoustic model [22] or the vocoder [23], [24], later leading to the first end-to-end TTS generation algorithms [25], [26].
On the other hand, VC pipelines usually extract an intermediate representation of the audio signal (feature extraction step), which is then mapped to a representation that matches the target characteristics (feature mapping step) and is finally used to obtain the output waveform (reconstruction step).
Most feature extraction techniques are based on pitch synchronous overlap and add (PSOLA) [27], which represents the input as the parameters required by a vocoder synthesizer to reproduce it. This is a useful intermediate characterization of the signal because it allows performing the reconstruction step with a vocoder, which is convenient since these algorithms are well tested and efficient. The mapping function, on the other hand, is usually implemented with parallel training methods, relying on a Gaussian mixture model [28] or on neural networks [29], [30]. The mapping can also be performed by means of Generative Adversarial Networks (GANs), since the task is similar to image-to-image translation and similar techniques can therefore be adopted [31], [32].
Classic audio forgery detection algorithms usually perform classification by relying on hand-crafted features such as Constant-Q Cepstral Coefficients [33], the Log Magnitude Spectrum, phase-based features like Group Delay [34], and Linear Frequency Cepstral Coefficients (LFCCs) or MFCCs [35]. More discriminative representations have recently been proposed by exploiting the bicoherence matrix [36], long-short term features computed in an autoregressive manner [37], environmental cues [6], and even emotions [7].
Also in this case, neural-network-based techniques have proven very effective. Some examples are [5], where frequency representations of the signals are fed to simple convolutional neural networks (CNNs), and [3], where convolutional filters are used only for feature extraction while a recurrent neural network is exploited for classification. Some approaches have also been applied directly to the raw input signal (i.e., in the time domain) [38]. In particular, RawNet2 [38] has achieved impressive results both for synthetic speech detection and for user identification. For this reason, it has been proposed as the baseline for the ASVspoof 2021 challenge [14], i.e., the challenge in which the dataset considered in this paper for training and testing was proposed.
III. FIRST DIGIT FEATURES FOR SYNTHETIC AUDIO TRACKS
The first digit law has proved very effective in the detection of multiply compressed data [39]–[41]. More recently, its effectiveness in detecting GAN-generated images has also been shown [12]. Following this trend, it is possible to verify that a synthetic signal generated by a set of FIR filters with limited support fits Benford's law with a different accuracy with respect to a natural signal.
Audio waveforms $x(t)$ are represented in the frequency domain by computing the MFCC coefficients $m_w(f)$, where $f$ is the considered frequency and $w$ is the index of the frame. This representation has already proved very effective in highlighting the most meaningful frequency components of audio signals and in detecting forged waveforms [35]. Since the original samples in the considered dataset sometimes contain long sequences of zeros (which result in zero-valued MFCC coefficients) and since computing FD statistics requires processing non-zero signals, zero values were removed from the input data. This operation does not compromise the final results, since such zero sequences were verified to occur in both the training and test sets, as well as in both natural and synthetic audio.
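As a rough illustration of this preprocessing step, the sketch below extracts the MFCC matrix after discarding zero-valued samples; it assumes the librosa library, and the sampling rate, number of coefficients, and function name are illustrative choices rather than values taken from our setup.

```python
import librosa

def nonzero_mfcc(path, sr=16000, n_mfcc=20):
    # Load the track and drop zero-valued samples, which would otherwise
    # produce zero MFCC coefficients and break the first-digit statistics.
    x, _ = librosa.load(path, sr=sr)
    x = x[x != 0]
    # MFCC matrix m_w(f): one row per cepstral coefficient f, one column per frame w.
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)
```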
In order to obtain rich features that can highlight irregularities
in the data, MFCC coefficients were quantized with different
step values $\Delta$ as
\begin{equation}
m_{w,\Delta}(f) = \frac{m_w(f)}{\Delta} . \tag{1}
\end{equation}
At this point, first digits were computed on $m_{w,\Delta}(f)$ as
\begin{equation}
d_{w,\Delta}(f) = \left\lfloor \frac{|m_{w,\Delta}(f)|}{b^{\lfloor \log_b |m_{w,\Delta}(f)| \rfloor}} \right\rfloor \tag{2}
\end{equation}
where $b$ is the considered integer representation base (e.g., 10 for decimal).
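To make these two steps concrete, the following sketch applies the quantization of Eq. (1) and the first-digit extraction of Eq. (2) to an MFCC matrix using NumPy; the function name and the default base are illustrative, and zero coefficients are assumed to have been removed beforehand.

```python
import numpy as np

def first_digits(mfcc, delta, base=10):
    # Eq. (1): quantize the MFCC coefficients with step delta.
    m_q = mfcc / delta
    # Eq. (2): keep the leading digit of each (non-zero) quantized coefficient
    # in the chosen base, e.g. 3 for 0.0347 when base = 10.
    mag = np.abs(m_q)
    exponent = np.floor(np.log(mag) / np.log(base))
    return np.floor(mag / base ** exponent).astype(int)
```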
For each distinct cepstral coefficient and for each quantization step, we computed the probability mass function
\begin{equation}
p_{f,\Delta}(d) = \sum_{w=1}^{n_w} \frac{\mathbb{1}_d\!\left(d_{w,\Delta}(f)\right)}{n_w} \tag{3}
\end{equation}
where $\mathbb{1}_d\!\left(d_{w,\Delta}(f)\right)$ is the indicator function for digit $d$, and $n_w$ is the number of windows in the signal, whose value depends on the duration of the audio and on the window overlap.
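A minimal sketch of this step is given below: for one cepstral coefficient $f$ and one quantization step $\Delta$, it turns the sequence of first digits over the $n_w$ frames into the empirical p.m.f. of Eq. (3). Restricting the digits to $1, \dots, b-1$ is an assumption that holds once zero-valued coefficients have been removed.

```python
import numpy as np

def first_digit_pmf(digits_f, base=10):
    # digits_f holds d_{w,delta}(f) for the n_w frames of one coefficient f.
    n_w = len(digits_f)
    # Eq. (3): relative frequency of each possible first digit d = 1, ..., base-1.
    return np.array([np.sum(digits_f == d) for d in range(1, base)]) / n_w
```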
Several previous studies show that this p.m.f. can be
approximated by the generalized Benford’s law, i.e.,
\begin{equation}
\hat{p}_{f,\Delta}(d) = \beta \log_b\!\left(1 + \frac{1}{\gamma + d^{\delta}}\right) \tag{4}
\end{equation}
and the approximation accuracy varies considerably depending on whether bona fide or forged data is considered [10], [11]. As a matter