Adaptive re-calibration of channel-wise features for
Adversarial Audio Classification
Vardhan Dongre[1], Abhinav Reddy Thimma[1], Nikhitha Reddeddy[1]
[1] University of Illinois Urbana-Champaign
Abstract—DeepFake audio, unlike DeepFake images and videos, has been a less studied topic, and the existing solutions for synthetic speech classification either use complex networks or do not generalize to different kinds of synthetic speech. Through this work, we perform a comparative analysis
of different proposed models for synthetic speech detection
including End2End and ResNet-based models against synthetic
speech generated using Text to Speech and Vocoder systems
like WaveNet [8], WaveRNN [4], Tacotron, and WaveGlow. We also experimented with Squeeze-and-Excitation (SE) blocks in our ResNet models and found that this combination achieved better performance. In addition to the analysis, we propose a
combination of Linear frequency cepstral coefficients (LFCC)
and Mel Frequency cepstral coefficients (MFCC) using the
attentional feature fusion technique to create better input features
which can help even simpler models generalize well on synthetic
speech classification tasks. Our best models (ResNet-based, using feature fusion) were trained on the Fake or Real (FoR) dataset and achieved 95% test accuracy on the FoR data, and an average of 90% accuracy on samples we generated using different generative models.
I. INTRODUCTION
Speech synthesis and spoofing attacks have become prevalent in recent years because of the development of generative models that can synthesize speech of such high quality that humans cannot distinguish it from real speech. The spoofing attacks
include either replay attacks, where the speaker’s voice is
recorded and used in a different context, or generated speech
attacks, where a text to speech or voice conversion system is
able to generate new voice samples. In this work, we focus on generated speech attacks from both text to speech and voice conversion systems, since this area is changing rapidly with newer neural network-based generative models. We compare the performance of different existing and proposed models against a
variety of generated speech samples created using WaveNet
[8], WaveRNN [4], Tacotron&WaveGlow [10], and FastSpeech
[7].
The speaker verification community has been able to come
up with innovative models to tackle the problem described
above. A majority of models in the synthetic speech detection
domain fall under two categories: Traditional models and End
to End systems [Fig 1]. Traditional systems try to tackle
the problem in two phases: the first is feature extraction and the second is building a classifier based on the extracted features. End to End systems skip over the feature
extraction phase and build models which take in the raw
audio samples as input and give out a classification result.
Traditional models with specifically curated features have been shown to produce promising results in this domain, but the newer End to End systems are not far behind and are able to achieve similar performance without much focus on features. End to End systems are a more recent approach to the problem, and during our initial analysis they had issues with noisy audio. Another major issue with the existing systems is their limited ability to generalize to speech synthesized using newer generative models. We therefore went ahead with a traditional architecture using ResNets and explored the impact of feature fusion on features extracted from speech.
For audio classification tasks, features like Mel-Spectrogram, MFCC, LFCC, CQT, and CQCC have been shown to produce good results. However, LFCC and MFCC are the features we focused on specifically for building the classifier. MFCC has been used in a variety of speech recognition applications and has proven effective at capturing the lower-frequency coefficients of human speech. LFCC, on the other hand, was added to introduce a measure of the higher-frequency coefficients which might be prevalent in synthesized speech. We focused on fusing LFCC and MFCC and settled on the Attentional Feature Fusion (AFF) technique [1]. AFF has shown promising results in fusing and scaling features of inconsistent semantics. We built a total of 8 models using 3 combinations of features: only LFCC, only MFCC, and a combination of LFCC and MFCC. In these 8 models, we experimented with 2 different ResNet structures, ResNet34 and ResNet50, and compared the results with pre-trained versions of the ResNets. We also introduced Squeeze-and-Excitation (SE) blocks into our model architecture and found that they improved the performance of our models by exploiting the inter-channel dependencies for our binary classification task.
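To make the channel-wise re-calibration concrete, the following is a minimal sketch of a Squeeze-and-Excitation block in PyTorch. The reduction ratio of 16 and the placement inside a residual block are assumptions taken from the original SE formulation, not necessarily the exact configuration used in our models.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: re-calibrates channel-wise feature maps."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average over the time-frequency axes
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),  # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # re-scale each channel of the input feature map
```

Inside a ResNet block, the SE output would replace the block's raw feature map before the residual addition, so that informative channels are emphasized and less useful ones suppressed.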
For this problem, we used the Fake or Real normalized
dataset which has an even distribution of samples between
genders (male and female) and classes (fake and natural).
This dataset was primarily used for training our models. We
evaluated the trained models on the Fake or Real dataset's test
samples and also against all of the generated audio samples.
To augment the test set, we also verified the performance of
our models against the ASVspoof 2019 dataset [9].
Our primary contribution in this work is the introduction of the attentional feature fusion block, which has been effective in the image domain, into the speech domain to leverage the combination of different extracted features. The paper
is organized as follows: Section 2 discusses the background of feature engineering, feature fusion, and existing model architecture. Section 3 focuses on our experimentation setup, baseline, and proposed models. In Section 4, we discuss the results from our testing and compare the different models we experimented with. Section 5 concludes the paper.

Fig. 1. Workflow Pipeline
The code for this work can be found at https://github-dev.cs.illinois.edu/athimma2/deepfake-audio-classifier
Fig. 2. A multi-scale channel attention module used inside AFF
Fig. 3. The Attentional feature fusion block
II. BACKGROUND
A. Feature Engineering
Feature engineering is an essential component of learning algorithms: the performance of ML models depends heavily on how we represent the feature vector. As a result, significant time and effort are spent designing preprocessing pipelines and data transformations. In the audio domain, audio usually exists as digital files, with .wav, .mp3, .wma, .aac, and .flac among the common formats. The major audio features extracted from them are either timbral texture features or rhythmic content features. In this work, our focus has been only on timbral texture features, specifically MFCC and LFCC. A common practice when building deep learning frameworks in the audio domain is to convert the audio into a spectrogram, a compact representation of an audio waveform that has undergone a Fourier transform. Mel-frequency cepstral coefficients (MFCC) are a cepstral representation of the audio that has been widely used in automatic speaker recognition and vocoder systems. Introduced in the 1980s, they have remained a standard choice ever since, proving robust as inputs for training deep learning models on high-level audio tasks. Linear-frequency cepstral coefficients (LFCC) are an alternative to MFCC that has also been used as a go-to feature for training models. The difference between LFCC and MFCC lies in the filter banks used to transform the audio: MFCC uses a Mel filter bank, while LFCC uses a linear filter bank. Several studies have shown the two features to be comparable, although MFCC remains the dominant choice in both speaker and speech recognition.
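As an illustration of how the two feature sets can be extracted in practice, the sketch below uses torchaudio; the sample rate, number of coefficients, FFT settings, and file name are placeholder assumptions rather than the exact configuration used in our experiments.

```python
import torchaudio

SAMPLE_RATE = 16000   # assumed sample rate
N_COEFFS = 40         # assumed number of cepstral coefficients

# MFCC: DCT of log energies from a Mel-spaced filter bank
mfcc_tf = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=N_COEFFS,
    melkwargs={"n_fft": 512, "hop_length": 256, "n_mels": 64},
)

# LFCC: same pipeline, but with a linearly spaced filter bank
lfcc_tf = torchaudio.transforms.LFCC(
    sample_rate=SAMPLE_RATE,
    n_lfcc=N_COEFFS,
    speckwargs={"n_fft": 512, "hop_length": 256},
)

waveform, sr = torchaudio.load("sample.wav")  # placeholder path
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

mfcc = mfcc_tf(waveform)  # shape: (channels, n_mfcc, frames)
lfcc = lfcc_tf(waveform)  # shape: (channels, n_lfcc, frames)
```

The two outputs have the same frame structure, which makes them natural candidates for the fusion mechanism described next.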
B. Feature Fusion
Many times, a single feature representation obtained from the data is insufficient to convey the necessary details of the underlying distribution of the natural process, so a common approach in feature engineering is to fuse together features obtained from a single data source through different methods. This combination of features can be a simple concatenation or an informed mathematical combination function. In the image domain, fusing features obtained from different layers is not a novel idea; image pyramids are one example of this attempt. However, while combining such cross-layer features, a common problem we run into is scale variance. In our framework we decided not to rely on simple concatenation or summation of features but rather to include a scale-invariant fusion mechanism that has recently been introduced in the image domain. Attentional feature fusion (AFF) [1] [Fig 3] is a trainable mechanism that can fuse together features obtained across long and short skip connections without running into issues of scale and semantic inconsistency.
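Below is a minimal PyTorch sketch of the multi-scale channel attention module (Fig. 2) and the resulting AFF block (Fig. 3), following the formulation of Dai et al. [1]; the channel count and reduction ratio are illustrative assumptions, not necessarily the dimensions used in our fusion layer.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention: combines a global and a local context branch."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        inter = channels // reduction
        # Local branch: point-wise convolutions preserve the time-frequency resolution
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, inter, kernel_size=1),
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global branch: global average pooling summarises each channel
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, inter, kernel_size=1),
            nn.BatchNorm2d(inter),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcasting adds the (B, C, 1, 1) global summary to the (B, C, H, W) local map
        return self.sigmoid(self.local_att(x) + self.global_att(x))

class AFF(nn.Module):
    """Attentional feature fusion: a soft, learnable selection between two feature maps."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.attention = MSCAM(channels, reduction)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        w = self.attention(x + y)     # fusion weights conditioned on both inputs
        return w * x + (1.0 - w) * y  # e.g. x = MFCC-derived features, y = LFCC-derived features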