The Sound of Silence: Efficiency of First Digit
Features in Synthetic Audio Detection
Daniele Mari, Federica Latora, Simone Milani
Department of Information Engineering, University of Padova, Padova, Italy
{daniele.mari,federica.latora,simone.milani}@dei.unipd.it
Abstract—The recent integration of generative neural strategies and audio processing techniques has fostered the widespread adoption of synthetic speech synthesis and transformation algorithms. This capability proves to be harmful in many legal and informative processes (news, biometric authentication, audio evidence in courts, etc.). Thus, the development of efficient detection algorithms is both crucial and challenging due to the heterogeneity of forgery techniques.
This work investigates the discriminative role of silenced parts in synthetic speech detection and shows how first digit statistics extracted from MFCC coefficients can efficiently enable robust detection. The proposed procedure is computationally lightweight and effective across many different generation algorithms since it does not rely on large neural detection architectures, and it obtains an accuracy above 90% on most of the classes of the ASVSpoof dataset.
Index Terms—synthetic speech detection, first digit statistics,
fake audio detection, silenced signal analysis, Random Forest
I. INTRODUCTION
Fake human speech recordings have recently proved significantly harmful with respect to misinformation, the spread of fake news, fraud, and identity spoofing [1]. Such results are the outcome of the recent evolution of computing facilities and deepfake technologies, which have allowed the generation of increasingly credible synthetic images, videos, and speech audio signals. As a matter of fact, this development has urged the need for accurate fake audio detection strategies that help human listeners discriminate fraudulent audio samples from bonafide ones.
Several fake audio detection strategies have been proposed in the literature, targeting different types of acoustic features that are present in a real signal and, at the same time, are difficult to synthesize.
Traditional methods rely on estimating fake audio peculiarities from audio transform coefficients like MFCCs or LPC [2]. More recently, such coefficients have been replaced by learned feature representations generated with CNN or RNN architectures [3]. Some other strategies rely on the effects of the physical acquisition environment on the signal (e.g., reverberation, noise, etc.) [4]–[6] or on prosodic and emotional characteristics [7]. Other solutions rely on statistics and symmetry properties of speech signals [8]. Among these, it is worth mentioning the First Digit (FD) statistics computed on signal transform coefficients [9]–[11], whose applications have been widely exploited on other multimedia content [12].
Although these solutions aim at detecting the peculiar characteristics of a fake audio signal, more recent works [13] have highlighted how synthetic speech algorithms prove effective on spoken parts but fail at generating realistic silence. The work by Muller et al. shows that the length of the trailing silenced parts¹ in synthetic speech samples from the ASVSpoof dataset [14] has different statistics with respect to bonafide samples. Indeed, removing such parts dramatically reduces the detection performance of most algorithms.
The current paper investigates this phenomenon in more depth by analyzing the discriminative potential of silenced parts in the ASVSpoof dataset. More precisely, we show that FD statistics prove effective in discriminating fake audio samples since they capture irregularities in the silenced parts between the different words of the speech. Tests were run on the full audio sequences, on the silenced parts, and on the voiced segments (regardless of their lengths). Experimental results show that the statistical characteristics of silence are a discriminative feature, since the performance on silent sections matches the detection performance on the full sequence (while the same does not hold for voiced sections). This implies that silence extraction is no longer needed, avoiding parameter tuning and arbitrary set-ups. The final performance proves higher than that of previous state-of-the-art approaches, at a limited computational cost.
It is possible to summarize the novel contributions of the current paper as follows.
• We analyzed the role of silent parts in detection, showing that most of the classification accuracy derives from the difficulty of synthesizing statistically-realistic silence intervals.
• We evaluated the efficiency of MFCC FD statistics in detecting fake audio samples generated by a set of heterogeneous algorithms. Such features proved extremely useful in highlighting the statistics of silenced parts.
• We designed a lightweight classifier whose accuracy is competitive with that of more complex detectors.
The code developed to produce the results presented in this work can be found at https://github.com/Dan8991/The-Sound-Of-Silence.
The rest of the paper is organized as follows. Section II overviews the audio forgery detection algorithms that have been proposed in the literature. Section III describes the proposed approach, Section IV illustrates the dataset and the experimental setup, while Section V reports the final accuracy on different types of datasets. Final conclusions are drawn in Section VI.

¹ Silent intervals at the beginning and at the end of the audio sequence.
II. RELATED WORKS
Generative audio speech approaches can mainly be divided into two branches, i.e., text-to-speech (TTS) and voice conversion (VC) algorithms. The former starts from a textual representation and aims at producing the corresponding waveform, while the latter modifies the signal to change the perceived identity of the speaker in the audio.
Early TTS approaches were based on waveform concatenation [15], [16], where diphones from large datasets are concatenated seamlessly. More recently, researchers have started to design techniques that produce audio features from text representations using an acoustic model (usually a hidden Markov model) [17], [18]; these features are then processed with a vocoder synthesizer such as STRAIGHT [19], WORLD [20], or VOCAINE [21] to produce the corresponding waveform. To improve upon this, neural networks have also been used to substitute either the acoustic model [22] or the vocoder [23], [24], later leading to the first end-to-end TTS generation algorithms [25], [26].
On the other hand, VC pipelines usually extract an intermediate representation of the audio signal (feature extraction step); this is then mapped to a representation that matches the target characteristics (feature mapping step), which is finally used to obtain the final waveform (reconstruction step). Most feature extraction techniques are based on pitch-synchronous overlap and add (PSOLA) [27], which represents the input as the parameters required by a vocoder synthesizer to reproduce it. This is a useful intermediate characterization of the signal because it allows performing the reconstruction with a vocoder, which is convenient since these algorithms are well tested and efficient. The mapping function, on the other hand, is usually implemented with parallel training methods using a Gaussian mixture model [28] or neural networks [29], [30]. The mapping can also be performed by means of Generative Adversarial Networks (GANs), since the task is similar to image-to-image translation, allowing similar techniques to be adopted [31], [32].
Classic audio forgery detection algorithms usually perform classification by relying on hand-crafted features such as Constant-Q Cepstral Coefficients [33], the Log Magnitude Spectrum, phase-based features like Group Delay [34], and Linear Frequency Cepstral Coefficients (LFCC) or MFCCs [35]. More discriminative representations have recently been proposed by exploiting the bicoherence matrix [36], long-short term features computed in an autoregressive manner [37], environmental cues [6], and even emotions [7].
Also in this case, neural-network-based techniques have proven very effective. Some examples are [5], where the frequency representations of the signals are fed to simple convolutional neural networks (CNNs), and [3], where the convolutional filters are used only for feature extraction while a recurrent neural network is exploited for classification. Some approaches have also been applied directly to the raw input signal (i.e., in the time domain) [38]. In particular, RawNet2 [38] has achieved impressive results both for synthetic speech detection and for user identification. For this reason, it has been proposed as the baseline for the ASVSpoof 2021 challenge [14], i.e., the challenge where the dataset considered in this paper for training and testing was proposed.
III. FIRST DIGIT FEATURES FOR SYNTHETIC AUDIO TRACKS
The first digit law has proved very effective in the detection of multiply compressed data [39]–[41]. More recently, its effectiveness in detecting GAN-generated images has also been shown [12]. Following this trend, it is possible to verify that any synthetic signal generated by a set of FIR filters with limited support fits Benford's law with a different accuracy with respect to a natural signal.
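For reference, the standard Benford law $p(d) = \log_{10}(1 + 1/d)$ (the special case $\beta = 1$, $\gamma = 0$, $\delta = 1$ of the generalized law in eq. (4) below) predicts strongly non-uniform first-digit probabilities, as this quick numeric check illustrates:

```python
import numpy as np

# First-digit probabilities under the standard Benford law in base 10:
# p(d) = log10(1 + 1/d) for d = 1, ..., 9.
d = np.arange(1, 10)
p = np.log10(1 + 1 / d)
for digit, prob in zip(d, p):
    print(digit, round(float(prob), 3))
# 1 0.301, 2 0.176, 3 0.125, 4 0.097, 5 0.079,
# 6 0.067, 7 0.058, 8 0.051, 9 0.046
```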
Audio waveforms $x(t)$ are represented in the frequency domain by computing the MFCC coefficients $m_w(f)$, where $f$ is the considered frequency and $w$ is the index of the frame. This representation has already proved very effective in highlighting the more meaningful frequency elements in audio signals and in detecting forged waveforms [35].
Since the original samples in the considered dataset sometimes contain long sequences of zeros (which result in zero-valued MFCC coefficients) and since computing FD statistics requires processing non-zero values, zero values were removed from the input data. This operation does not compromise the final results because such zero runs were verified to occur in both the training and test sets, as well as in both natural and synthetic audio.
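As a minimal sketch of this front end, the following snippet extracts MFCCs with librosa after discarding exact-zero samples; the parameter values (e.g., the number of coefficients) are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import librosa
import numpy as np

def extract_mfccs(path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return the MFCC matrix of shape (n_mfcc, n_frames) for an audio file."""
    x, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    # Drop exact-zero samples: long zero runs produce zero-valued MFCCs,
    # on which first digit statistics are undefined.
    x = x[x != 0]
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)
```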
In order to obtain rich features that can highlight irregularities in the data, the MFCC coefficients were quantized with different step values $\Delta$ as
$m_{w,\Delta}(f) = \left\lfloor \frac{m_w(f)}{\Delta} \right\rfloor.$  (1)
At this point, first digits were computed on $m_{w,\Delta}(f)$ as
$d_{w,\Delta}(f) = \left\lfloor \frac{|m_{w,\Delta}(f)|}{b^{\lfloor \log_b |m_{w,\Delta}(f)| \rfloor}} \right\rfloor$  (2)
where $b$ is the considered integer representation base (e.g., 10 for decimal).
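A NumPy sketch of eqs. (1)–(2) follows; the uniform floor quantizer mirrors the notation above, and zero-valued quantized coefficients are discarded since their first digit is undefined.

```python
import numpy as np

def first_digits(mfcc: np.ndarray, delta: float = 1.0, b: int = 10) -> np.ndarray:
    """First digit (base b) of each non-zero quantized MFCC coefficient."""
    m = np.floor(mfcc / delta)           # eq. (1): quantization with step delta
    m = m[m != 0]                        # first digits are undefined for zero
    mag = np.abs(m)
    exponent = np.floor(np.log(mag) / np.log(b))
    return np.floor(mag / b ** exponent).astype(int)  # eq. (2): leading digit
```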
For each distinct cepstral coefficient and for each quantization step, we computed the probability mass function
$p_{f,\Delta}(d) = \frac{1}{n_w} \sum_{w=1}^{n_w} \mathbb{1}_d(d_{w,\Delta}(f))$  (3)
where $\mathbb{1}_d(d_{w,\Delta}(f))$ is the indicator function for digit $d$, and $n_w$ is the number of windows in the signal, whose value depends on the duration of the audio and on the window overlap.
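A possible NumPy implementation of eq. (3), estimating the empirical first-digit p.m.f. for one cepstral coefficient and one quantization step:

```python
import numpy as np

def fd_pmf(digits: np.ndarray, b: int = 10) -> np.ndarray:
    """Empirical p.m.f. of eq. (3) over the possible first digits 1, ..., b-1."""
    counts = np.bincount(digits, minlength=b)[1:b]  # first digits lie in {1, ..., b-1}
    return counts / counts.sum()
```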
Several previous studies show that this p.m.f. can be approximated by the generalized Benford's law, i.e.,
$\hat{p}_{f,\Delta}(d) = \beta \log_b \left( 1 + \frac{1}{\gamma + d^{\delta}} \right)$  (4)
and the approximation accuracy varies considerably between bonafide and forged data [10], [11]. As a matter
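To make the fitting step concrete, here is a sketch that fits eq. (4) to an empirical p.m.f. with SciPy; using the fitted parameters and the residual fitting error as classifier features is one plausible choice, and may differ from the exact feature set used in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def gen_benford(d, beta, gamma, delta, b=10):
    """Generalized Benford law of eq. (4), evaluated at digit(s) d."""
    return beta * np.log1p(1.0 / (gamma + d ** delta)) / np.log(b)

def fit_benford(pmf: np.ndarray, b: int = 10):
    """Fit eq. (4) to an empirical first-digit p.m.f.; return parameters and MSE."""
    d = np.arange(1, b, dtype=float)
    params, _ = curve_fit(gen_benford, d, pmf, p0=[1.0, 1.0, 1.0], maxfev=10000)
    mse = float(np.mean((gen_benford(d, *params) - pmf) ** 2))
    return params, mse
```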