
proposed approach, Section IV illustrates the dataset and the
experimental setup, while Section V reports the final accuracy
on different types of datasets. Final conclusions are drawn in
Section VI.
II. RELATED WORKS
Generative speech approaches can mainly be divided into two branches, i.e., text-to-speech (TTS) and voice conversion (VC) algorithms. The former starts from a textual representation and aims at producing the corresponding waveform, while the latter modifies a signal to change the perceived identity of the speaker in the audio.
Early TTS approaches were based on waveform concatenation [15], [16], where diphones from large datasets are stitched together seamlessly. More recently, researchers have started to design techniques that produce audio features from text representations using an acoustic model (usually a hidden Markov model) [17], [18]; these features are then processed with a vocoder synthesizer such as STRAIGHT [19], WORLD [20] or VOCAINE [21] to produce the corresponding waveform. To improve upon this, neural networks have also been used to substitute either the acoustic model [22] or the vocoder [23], [24], later leading to the first end-to-end TTS generation algorithms [25], [26].
On the other hand, VC pipelines usually extract an intermediate representation of the audio signal (feature extraction step), which is then mapped to a representation that matches the target characteristics (feature mapping step) and is finally used to obtain the output waveform (reconstruction step).
Most feature extraction techniques are based on pitch synchronous overlap and add (PSOLA) [27], which represents the input as the parameters required by a vocoder synthesizer to reproduce it. This is a useful intermediate characterization of the signal because it allows performing the reconstruction step with a vocoder, which is convenient since these algorithms are well tested and efficient. The mapping function, on the other hand, is usually implemented with parallel training methods, relying on a Gaussian mixture model [28] or on neural networks [29], [30]. The mapping can also be performed by means of Generative Adversarial Networks (GANs), since the task is similar to image-to-image translation and similar techniques can therefore be adopted [31], [32].
Classic audio forgery detection algorithms usually perform classification by relying on hand-crafted features such as Constant-Q Cepstral Coefficients [33], the Log Magnitude Spectrum, phase-based features like Group Delay [34], and Linear Frequency Cepstral Coefficients (LFCCs) or MFCCs [35]. More discriminative representations have recently been proposed by exploiting the bicoherence matrix [36], long-short term features computed in an autoregressive manner [37], environmental cues [6], and even emotions [7].
Also in this case, neural-network-based techniques have proven very effective. Some examples are [5], where frequency representations of the signals are fed to simple convolutional neural networks (CNNs), and [3], where convolutional filters are used only for feature extraction while a recurrent neural network is exploited for classification. Some approaches have also been applied directly to the raw input signal (i.e., in the time domain) [38]. In particular, RawNet2 [38] has achieved impressive results both for synthetic speech detection and for user identification. For this reason, it has been proposed as the baseline for the ASVspoof 2021 challenge [14], i.e., the challenge in which the dataset considered in this paper for training and testing was proposed.
III. FIRST DIGIT FEATURES FOR SYNTHETIC AUDIO TRACKS
The first digit law has proved very effective in the detection of multiply compressed data [39]–[41]. More recently, its effectiveness in detecting GAN-generated images has also been shown [12]. Following this trend, it is possible to verify that a synthetic signal generated by a set of FIR filters with limited support fits Benford's law with a different accuracy with respect to a natural signal.
Audio waveforms $x(t)$ are represented in the frequency domain by computing the MFCC coefficients $m_w(f)$, where $f$ is the considered frequency and $w$ is the index of the frame. This representation has already proved very effective in highlighting the most meaningful frequency components of audio signals and in detecting forged waveforms [35]. Since the original samples in the considered dataset sometimes contain long sequences of zeros (which result in zero-valued MFCC coefficients) and since computing FD statistics requires processing non-zero signals, zero values were removed from the input data. This operation does not compromise the final results, since such zero sequences were verified to occur in both the training and test sets, as well as in both natural and synthetic audio.
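As a rough illustration of this preprocessing step, the sketch below extracts the MFCC matrix after discarding zero-valued samples; it assumes the librosa library, and the sampling rate, number of coefficients, and function name are illustrative choices rather than values taken from our setup.

```python
import librosa

def nonzero_mfcc(path, sr=16000, n_mfcc=20):
    # Load the track and drop zero-valued samples, which would otherwise
    # produce zero MFCC coefficients and break the first-digit statistics.
    x, _ = librosa.load(path, sr=sr)
    x = x[x != 0]
    # MFCC matrix m_w(f): one row per cepstral coefficient f, one column per frame w.
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)
```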
In order to obtain rich features that can highlight irregularities
in the data, MFCC coefficients were quantized with different
step values $\Delta$ as
\begin{equation}
m_{w,\Delta}(f) = \frac{m_w(f)}{\Delta} . \tag{1}
\end{equation}
At this point, first digits were computed on $m_{w,\Delta}(f)$ as
\begin{equation}
d_{w,\Delta}(f) = \left\lfloor \frac{|m_{w,\Delta}(f)|}{b^{\lfloor \log_b |m_{w,\Delta}(f)| \rfloor}} \right\rfloor \tag{2}
\end{equation}
where $b$ is the considered integer representation base (e.g., 10 for decimal).
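To make these two steps concrete, the following sketch applies the quantization of Eq. (1) and the first-digit extraction of Eq. (2) to an MFCC matrix using NumPy; the function name and the default base are illustrative, and zero coefficients are assumed to have been removed beforehand.

```python
import numpy as np

def first_digits(mfcc, delta, base=10):
    # Eq. (1): quantize the MFCC coefficients with step delta.
    m_q = mfcc / delta
    # Eq. (2): keep the leading digit of each (non-zero) quantized coefficient
    # in the chosen base, e.g. 3 for 0.0347 when base = 10.
    mag = np.abs(m_q)
    exponent = np.floor(np.log(mag) / np.log(base))
    return np.floor(mag / base ** exponent).astype(int)
```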
For each distinct cepstral coefficient and for each quantization step, we computed the probability mass function
\begin{equation}
p_{f,\Delta}(d) = \sum_{w=1}^{n_w} \frac{\mathbb{1}_d\!\left(d_{w,\Delta}(f)\right)}{n_w} \tag{3}
\end{equation}
where $\mathbb{1}_d\!\left(d_{w,\Delta}(f)\right)$ is the indicator function for digit $d$, and $n_w$ is the number of windows in the signal, whose value depends on the duration of the audio and on the window overlap.
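A minimal sketch of this step is given below: for one cepstral coefficient $f$ and one quantization step $\Delta$, it turns the sequence of first digits over the $n_w$ frames into the empirical p.m.f. of Eq. (3). Restricting the digits to $1, \dots, b-1$ is an assumption that holds once zero-valued coefficients have been removed.

```python
import numpy as np

def first_digit_pmf(digits_f, base=10):
    # digits_f holds d_{w,delta}(f) for the n_w frames of one coefficient f.
    n_w = len(digits_f)
    # Eq. (3): relative frequency of each possible first digit d = 1, ..., base-1.
    return np.array([np.sum(digits_f == d) for d in range(1, base)]) / n_w
```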
Several previous studies show that this p.m.f. can be
approximated by the generalized Benford’s law, i.e.,
\begin{equation}
\hat{p}_{f,\Delta}(d) = \beta \log_b\!\left(1 + \frac{1}{\gamma + d^{\delta}}\right) \tag{4}
\end{equation}
and the approximation accuracy varies considerably depending on whether bona fide or forged data is considered [10], [11]. As a matter