SpecRNet: Towards Faster and More Accessible
Audio DeepFake Detection
Piotr Kawa, Marcin Plata and Piotr Syga
Department of Artificial Intelligence
Wrocław University of Science and Technology
Wrocław, Poland
{piotr.kawa,marcin.plata,piotr.syga}@pwr.edu.pl
©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
Abstract—Audio DeepFakes are utterances generated with the
use of deep neural networks. They are highly misleading and pose
a threat due to use in fake news, impersonation, or extortion.
In this work, we focus on increasing accessibility to the audio
DeepFake detection methods by providing SpecRNet, a neural
network architecture characterized by a quick inference time
and low computational requirements. Our benchmark shows
that SpecRNet, requiring up to about 40% less time to process
an audio sample, provides performance comparable to LCNN
architecture — one of the best audio DeepFake detection models.
Such a method can not only be used by online multimedia services
to verify a large bulk of content uploaded daily but also, thanks to
its low requirements, by average citizens to evaluate materials on
their devices. In addition, we provide benchmarks in three unique
settings that confirm the correctness of our model. They reflect
scenarios of low–resource datasets, detection on short utterances
and limited attacks benchmark in which we take a closer look
at the influence of particular attacks on given architectures.
Index Terms—DeepFake Detection, Speech Processing, Neural
Networks, Deep Learning
I. INTRODUCTION
DeepFakes refer to a branch of algorithms for manipulating audio–visual content. They utilize deep neural networks to create highly convincing spoofs of biometric features such as the face or voice, and are among the most popular methods of automatically generating or manipulating audio–visual content. The original approach, introduced in 2017, made it possible to create a video in which the face of the original individual was swapped with the face of another person. This phenomenon is also present in the field of audio. The term "Audio DeepFake" covers solutions that create artificially modified speech. These solutions can either generate new utterances using Text–To–Speech (TTS) [1]–[3] and Voice Cloning [4], [5] methods, or modify existing utterances so that they sound like another speaker — Voice Conversion [6], [7]. More recent architectures not only synthesize speech but also model proper intonation, stress, and rhythm. Such tampered material is of high quality and is highly misleading; DeepFake utterances often achieve mean opinion scores close to those of bona fide utterances.
This work is partially supported by Polish National Science Centre —
project UMO-2018/29/B/ST6/02969.
DeepFakes have many malicious applications and can harm various spheres of life. Fake news is one of the most prominent and dangerous examples — e.g. inciting conflicts by depicting politicians saying bogus things. Tampered samples can also be used to bypass speaker recognition (authorization) systems (e.g. voice recognition in banking systems), which are increasingly popular due to directives like [8]. Furthermore, DeepFake–based impersonation is also used for extortion [9].
The process of creating DeepFake audio is simple — there exist many well–documented open–source toolkits that require only a consumer–grade computer. In addition, methods such as Real Time Voice Cloning (RTVC) [4] can clone a voice using only a few seconds of the voice signal. The results are less sophisticated than DeepFakes created with state–of–the–art methods using hours of source material. Nonetheless, they can still be used to deceive people — especially when combined with a low–quality Internet connection and other factors that decrease audio quality and therefore make manipulations even more difficult to spot.
Audio spoofing is a problem from the field of speech processing that is similar to the concept of DeepFakes. It covers voice synthesis, voice conversion, and replay attacks. The key difference between DeepFakes and spoofing is the target each method aims to deceive. While DeepFakes are mainly used to trick humans into believing an utterance is authentic, spoofing methods aim to convince automatic speaker verification systems that the person being verified is the owner of a particular voice. In addition, DeepFakes always generate new material, whereas audio spoofing can be created from existing samples, e.g. via replaying or merging various audio chunks.
The constantly rising threat of DeepFakes, caused by ever more advanced methods, their increasing popularity, and their ease of creation, has induced the scientific community to research methods of determining whether a given utterance is pristine or generated. The introduced solutions vary both in the classification methods (e.g. Gaussian Mixture Models (GMMs) [10] or deep neural networks [11]–[14]) and in the audio representation they rely on.
Our primary motivation is to decrease the computational
requirements and inference time of audio DF detection. For
this purpose, we introduce SpecRNet — a novel spectrogram–based model inspired by the RawNet2 [14] backbone. Less demanding architectures can have a wide range of applications, allowing quick screening of suspicious samples.
arXiv:2210.06105v1 [cs.SD] 12 Oct 2022
The problem of DeepFakes should be considered at scale — the volume of potentially affected materials can qualify it as a big data task. New legal regulations are being introduced (e.g. the European Union's Strengthened Code of Practice on Disinformation [15]) that obligate content providers and hosts to implement self–regulatory standards to counteract disinformation, including DeepFakes. This means that further research in the DeepFake detection area should aim to decrease the computational complexity of the methods while providing high performance. The most natural application of those methods is in online multimedia platforms, where 720,000 hours of new video are uploaded daily [16]. Such platforms are often used for spreading fake news, and lightweight detection methods can be handy to spot potentially forged materials.
In addition, such lightweight methods make it possible for an average citizen to detect forged materials — this is achieved by reducing the computational and memory requirements needed to run the detection method. Note that many such citizens may not be able to use GPUs to execute the detection. We point out that, due to the small quantity of the source material, the average citizen will likely be targeted with less sophisticated (hence less convincing) forgeries. This means that smaller models, despite not achieving state–of–the–art results, are still capable of detecting such manipulations. Such topics were previously raised in [17]–[19].
Our contribution in this paper includes:
• proposing a new neural network architecture — SpecRNet,
• comparing the performance of SpecRNet to its inspiration — RawNet2 [14] — and to one of the leading architectures for audio DeepFake detection — LCNN [11] — on the WaveFake dataset,
• time evaluation on both CPU and GPU with respect to different batch sizes,
• evaluation in three unique settings — data scarcity, limited attacks and short utterances.
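The paper's exact timing protocol is not detailed in this section; as a rough, framework-agnostic sketch of how per-sample inference time can be measured across batch sizes, consider the following. Here `model_fn` and `make_batch` are hypothetical placeholders (not part of the paper), and the warm-up and repeat counts are arbitrary choices.

```python
import time

def benchmark(model_fn, make_batch, batch_sizes=(1, 8, 32), repeats=20, warmup=3):
    """Measure mean per-sample inference time for several batch sizes.

    `model_fn` runs one forward pass on a batch; `make_batch` builds an
    input batch of the requested size. Both are illustrative stand-ins.
    """
    results = {}
    for bs in batch_sizes:
        batch = make_batch(bs)
        for _ in range(warmup):              # warm-up runs, discarded
            model_fn(batch)
        start = time.perf_counter()
        for _ in range(repeats):
            model_fn(batch)
        elapsed = time.perf_counter() - start
        results[bs] = elapsed / (repeats * bs)   # seconds per sample
    return results

# toy stand-in "model": sums each feature vector in the batch
timings = benchmark(lambda b: [sum(x) for x in b],
                    lambda bs: [[0.0] * 1000 for _ in range(bs)])
print(sorted(timings))                           # → [1, 8, 32]
```

Larger batches typically amortize per-call overhead, which is why the per-sample time is reported rather than the raw per-batch time.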
II. RELATED WORK
Detection of audio DeepFakes is a more recent task
than its visual equivalent. For this reason, the number of
both audio datasets and detection methods is much smaller.
ASVspoof [20] is one of the most important challenges regarding the detection of spoofed audio. The 2021 edition of this biennial challenge brought a new subset — along with logical access (LA) and physical access (PA), there now exists a speech DeepFake (DF) subset. This new subset was not
yet available when this paper was written. FakeAVCeleb [13] is an example of a multi–modal dataset — its samples are composed of both visual and audio manipulations. While the visual part of the dataset was created using many methods, its audio part was generated using only one approach — RTVC [4]. Thus, we did not select it as the evaluation set, due to the small variety of generated samples in relation
to other datasets. WaveFake [21] is, to date, the largest audio DeepFake detection dataset. It consists of about 120,000 samples, and its generated samples were created using 8 spoofing methods. Due to its volume, its number of supported languages (English and Japanese), and its variety of generation methods, it became our choice of benchmark data.
Audio DeepFake detection, despite lesser renown than its visual equivalent, is an ever–growing branch of validity verification. Due to its resemblance to audio spoofing, some spoofing detection methods were adapted to this field as well. Detection approaches can be divided into classical and deep–learning–based methods. Gaussian Mixture Models (GMMs) [10] are among the most prominent examples of the classical approaches.
However, nowadays, the majority of methods are based on deep learning (DL). One reason is that classical approaches like GMMs tend to require a separate model for each attack type, which decreases the scalability and flexibility of such solutions. DL–based methods are further divided on the basis of the audio representation used. Methods like [14], [22] process raw signal information. Solutions like [11], [12], [23] rely on spectrogram representations of audio (front–ends). Mel–frequency cepstral coefficients (MFCCs) and linear–frequency cepstral coefficients (LFCCs) are among the most popular front–ends. In addition, methods of visual DeepFake detection, based on spectrogram features, were lately adapted to audio DeepFake detection [13].
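LFCCs differ from MFCCs mainly in the spacing of the filterbank: linear in frequency rather than mel-scaled. As an illustration of the extraction pipeline (power spectrogram → linear triangular filterbank → log → DCT-II) — not the paper's exact front–end, and with arbitrary frame, hop, and filter counts — LFCC features can be sketched in plain NumPy:

```python
import numpy as np

def lfcc(signal, n_fft=512, hop=160, n_filters=20, n_coeffs=20):
    """Toy LFCC extraction. All parameter values here are illustrative,
    not the ones used in the paper."""
    # framed, windowed power spectrum (squared magnitude of the STFT)
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (T, n_fft//2+1)

    # triangular filters spaced LINEARLY in frequency (vs. mel for MFCC)
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    fbank = np.zeros((n_filters, n_bins))
    k = np.arange(n_bins)
    for m in range(1, n_filters + 1):
        l, c, r = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1] = np.clip(
            np.minimum((k - l) / (c - l), (r - k) / (r - c)), 0, None)

    logfb = np.log(power @ fbank.T + 1e-10)              # (T, n_filters)

    # DCT-II matrix decorrelates the log-filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * n[:n_coeffs, None])
    return logfb @ dct.T                                 # (T, n_coeffs)

feats = lfcc(np.random.randn(16000))                     # 1 s at 16 kHz
print(feats.shape)                                       # → (97, 20)
```

Swapping the linear `edges` for mel-spaced ones would turn this sketch into an MFCC extractor; everything else stays the same.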
III. SPECRNET
In this section, we present a novel architecture — SpecRNet.
Its backbone is inspired by RawNet2 [14]. Contrary to its predecessor, which operates on a one–dimensional raw audio signal, it processes two–dimensional spectrogram information — in particular, linear frequency cepstral coefficients (LFCCs). This decision was inspired by recent works showing a significant increase in performance of spectrogram–based models in relation to architectures based on raw signals [21].
SpecRNet takes as input the LFCC representation of an audio signal. The network's input is first normalized using two–dimensional batch normalization [24], followed by the SeLU activation function. The model is then composed of 3 residual blocks containing 2 convolutional layers, each preceded by a batch normalization layer and a LeakyReLU activation function. Only the first residual block skips the normalization and activation layers at the beginning, as its input has already been processed by the batch normalization and SeLU layers. We also utilize an additional convolution layer with a kernel size of 1 on the residual identity path; it synchronizes the number of channels between the identity and main paths. The blocks, similarly to RawNet2's, come in two kinds differing in the number of convolution channels. Note that, due to the constant number of channels (64) in the third residual block, no synchronization between the identity and main paths is required, so the additional convolution layer is not applied in this case. Next, each block is succeeded by a two–dimensional max pooling layer, an FMS attention block [25], and another two–dimensional max pooling layer. The forward pass through the residual block and the FMS attention layer is presented in Figure 1.
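As a rough illustration of the data flow described above, the sketch below implements a heavily simplified residual block followed by FMS attention in NumPy. It is a stand-in, not the actual SpecRNet layers: the "convolutions" are reduced to 1×1 channel mixings, batch normalization is omitted, and all weights are random; only the channel–synchronizing identity projection and the multiply–and–add FMS scaling follow the description.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fms(x, w, b):
    """Feature Map Scaling (FMS) sketch: a per-channel scale derived from
    globally average-pooled features; following RawNet2, the scale both
    multiplies and is added to the feature map."""
    pooled = x.mean(axis=(1, 2))             # (C,) global average pool
    s = sigmoid(w @ pooled + b)              # (C,) attention weights
    return x * s[:, None, None] + s[:, None, None]

def residual_block(x, conv1, conv2, proj=None):
    """Simplified residual block: two 1x1 channel mixings with a
    LeakyReLU in between; `proj` synchronizes channel counts on the
    identity path when the block changes the number of channels."""
    identity = x if proj is None else np.einsum('oc,chw->ohw', proj, x)
    h = np.einsum('oc,chw->ohw', conv1, x)
    h = np.where(h > 0, h, 0.01 * h)         # LeakyReLU, slope 0.01
    h = np.einsum('oc,chw->ohw', conv2, h)
    return identity + h

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8, 8))          # (channels, freq, time)
y = residual_block(x, rng.standard_normal((64, 32)),
                   rng.standard_normal((64, 64)),
                   proj=rng.standard_normal((64, 32)))   # 32 -> 64 channels
y = fms(y, rng.standard_normal((64, 64)), np.zeros(64))
print(y.shape)                               # → (64, 8, 8)
```

When the input and output channel counts match (as in the third residual block, fixed at 64 channels), `proj=None` reproduces the case where no identity-path convolution is applied.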