allowing quick screening of suspicious samples.
The problem of DeepFakes should be considered at scale: the
volume of material that may require analysis qualifies it as a
big data task. New legal regulations are being introduced
(e.g. the European Union's Strengthened Code of Practice on
Disinformation [15]) that obligate content providers and hosts
to implement self-regulatory standards to counteract
disinformation, including DeepFakes. This means that further
research on DeepFake detection should aim to decrease the
computational complexity of detection methods while maintaining
high performance. The most natural application of those methods is
in online multimedia platforms, where 720,000 hours of new
video are added daily [16]. Such platforms are often used for
spreading fake news, and lightweight detection methods can
help spot potentially forged materials.
In addition, such lightweight methods make it possible for
average citizens to detect forged materials themselves, since
the computational and memory requirements of running the
detection method are reduced. Note that many such citizens may
not be able to use GPUs to execute the detection.
We point out that, due to the small quantity of source
material available for an average citizen, they will likely be
targeted with less sophisticated (hence less convincing)
forgeries. This means that smaller models, despite not
achieving state-of-the-art results, are still capable of
detecting such manipulations. Such topics were previously
raised in [17]-[19].
Our contributions in this paper include:
• proposing a new neural network architecture, SpecRNet,
• comparing the performance of SpecRNet to its inspiration, RawNet2 [14], and to one of the leading architectures for Audio DeepFake Detection, LCNN [11], on the WaveFake dataset,
• a time evaluation on both CPU and GPU with respect to different batch sizes,
• an evaluation in three unique settings: data scarcity, limited attacks, and short utterances.
II. RELATED WORK
Detection of audio DeepFakes is a more recent task
than its visual equivalent. For this reason, the number of
both audio datasets and detection methods is much smaller.
ASVspoof [20] is one of the most important challenges
regarding the detection of spoofed audio. The 2021 edition
of this biennial challenge introduced a new subset: along
with logical access (LA) and physical access (PA), there is
now a speech DeepFake (DF) subset. This new subset was not
yet available when this paper was written. FakeAVCeleb [13]
is an example of a multi-modal dataset: its samples are
composed of both visual and audio manipulations. While the
visual part of the dataset was created using many methods, its
audio counterpart was generated using only one approach,
RTVC [4]. We therefore did not select it as our evaluation
set, due to its small variety of generated samples in relation
to other datasets. WaveFake [21] is, to this day, the largest
audio DeepFake detection dataset. It consists of about 120,000
samples, generated using 8 spoofing methods. Due to its
volume, the number of supported languages (English and
Japanese), and its variety of generation methods, it became
our choice of benchmark data.
Audio DeepFake detection, despite being less renowned than its
visual equivalent, is an ever-growing branch of authenticity
verification. Due to its resemblance to audio spoofing, some
spoofing detection methods have been adapted to this field as
well. Detection approaches can be divided into classical and
deep-learning-based methods. Gaussian Mixture Models (GMMs) [10]
are among the most prominent examples of the classical
approaches. Nowadays, however, the majority of methods are based
on deep learning (DL). One of the reasons is that classical
approaches like GMMs tend to require a separate model for each
attack type, which decreases the scalability and flexibility of
such solutions. DL-based methods can be further divided on the
basis of the audio representation they use. Methods like [14],
[22] process raw signal information. Solutions like [11], [12],
[23] rely on spectrogram representations of audio (front-ends).
Mel-frequency cepstral coefficients (MFCCs) and linear-frequency
cepstral coefficients (LFCCs) are among the most popular
front-ends. In addition, visual DeepFake detection methods based
on spectrogram features have lately been adapted to audio
DeepFake detection [13].
III. SPECRNET
In this section, we present a novel architecture, SpecRNet.
Its backbone is inspired by RawNet2 [14]. Contrary to its
predecessor, which operates on a one-dimensional raw audio
signal, it processes two-dimensional spectrogram information,
in particular linear frequency cepstral coefficients (LFCCs).
This decision was inspired by recent works that show a
significant increase in the performance of spectrogram-based
models in relation to architectures based on raw signals [21].
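To make the input representation concrete, the following is a
minimal sketch of an LFCC front-end using torchaudio; the number
of coefficients and the STFT parameters (n_fft, hop_length,
win_length) are illustrative assumptions, not a prescribed
configuration.

    # A minimal sketch of an LFCC front-end; parameter values
    # below are illustrative assumptions.
    import torch
    import torchaudio

    SAMPLE_RATE = 16_000  # assumed sampling rate

    lfcc_frontend = torchaudio.transforms.LFCC(
        sample_rate=SAMPLE_RATE,
        n_lfcc=40,  # cepstral coefficients per frame (assumption)
        speckwargs={"n_fft": 512, "hop_length": 160, "win_length": 400},
    )

    waveform = torch.randn(1, SAMPLE_RATE * 4)  # stand-in for 4 s of audio
    features = lfcc_frontend(waveform)          # shape: (1, n_lfcc, time_frames)

The resulting two-dimensional feature map is what the network
described below consumes.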
SpecRNet takes as input an LFCC representation of an audio
signal. The network's input is first normalized using
two-dimensional batch normalization [24], followed by the SeLU
activation function. The model is then composed of 3 residual
blocks containing 2 convolutional layers, each preceded by a
batch normalization layer and a LeakyReLU activation function.
Only the first residual block skips the normalization and
activation layers at its beginning, as the input has already
been processed by the batch normalization and SeLU layers. We
also utilize an additional convolution layer with a kernel size
of 1 on the residual identity path. This convolution layer
synchronizes the number of channels between the identity and
main paths. The blocks, similarly to RawNet2, come in two kinds
that differ in the number of convolution channels. Note that,
due to the constant number of channels (64) in the third
residual block, no synchronization between the identity and main
paths is required, so the additional convolution layer is not
applied in this case. Next, each block is followed by a
two-dimensional max pooling layer, an FMS attention block [25],
and another two-dimensional max pooling layer. The forward pass
through the residual block and the FMS attention layer is
presented in Figure 1.
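The following is a minimal PyTorch sketch of one such stage
(residual block, max pooling, FMS attention, max pooling).
The 3x3 kernel size, the LeakyReLU slope of 0.3, the channel
count of 32, and the multiplicative-plus-additive FMS variant
(carried over from RawNet2) are assumptions rather than an
exact specification.

    import torch
    import torch.nn as nn

    class FMS2D(nn.Module):
        """Filter-wise feature map scaling (FMS) [25], adapted to 2D
        feature maps; the multiplicative-plus-additive variant is an
        assumption carried over from RawNet2."""

        def __init__(self, channels: int):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)   # global average per channel
            self.fc = nn.Linear(channels, channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            s = torch.sigmoid(self.fc(self.pool(x).flatten(1)))
            s = s[:, :, None, None]               # (B, C, 1, 1) scaling factors
            return x * s + s                      # scale and shift each filter map

    class ResBlock2D(nn.Module):
        """Residual block: two 3x3 convolutions, each preceded by batch
        normalization and LeakyReLU; the first block in the network
        skips the leading normalization/activation pair."""

        def __init__(self, in_ch: int, out_ch: int, first: bool = False):
            super().__init__()
            self.first = first
            self.pre = nn.Sequential(nn.BatchNorm2d(in_ch), nn.LeakyReLU(0.3))
            self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.mid = nn.Sequential(nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.3))
            self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
            # 1x1 convolution on the identity path, applied only when the
            # channel counts of the identity and main paths differ
            self.downsample = (
                nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else None
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            identity = x if self.downsample is None else self.downsample(x)
            out = x if self.first else self.pre(x)
            out = self.conv2(self.mid(self.conv1(out)))
            return out + identity

    # One stage of SpecRNet: residual block -> max pool -> FMS -> max pool.
    stage = nn.Sequential(
        ResBlock2D(in_ch=1, out_ch=32, first=True),
        nn.MaxPool2d(kernel_size=2),
        FMS2D(32),
        nn.MaxPool2d(kernel_size=2),
    )

    lfcc_batch = torch.randn(8, 1, 40, 400)  # (batch, channel, n_lfcc, frames)
    out = stage(lfcc_batch)                  # -> (8, 32, 10, 100)

Each max pooling step halves both spectrogram dimensions, so the
two pooling layers surrounding the FMS block reduce the feature
map by a factor of four per stage.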