Adaptive Re-calibration of Channel-wise Features for Adversarial Audio Classification
Vardhan Dongre[1], Abhinav Reddy Thimma[1], Nikhitha Reddeddy[1]
[1] University of Illinois Urbana-Champaign
Abstract—DeepFake audio, unlike DeepFake images and videos, has been a less studied topic, and the solutions that exist for synthetic speech classification either use complex networks or do not generalize to different kinds of synthetic speech. Through this work, we perform a comparative analysis of different proposed models for synthetic speech detection, including End2End and ResNet-based models, against synthetic speech generated using Text to Speech and Vocoder systems such as WaveNet [8], WaveRNN [4], Tacotron, and WaveGlow. We also experimented with Squeeze Excitation (SE) blocks in our ResNet models and found that the combination achieved better performance. In addition to the analysis, we propose a combination of Linear Frequency Cepstral Coefficients (LFCC) and Mel Frequency Cepstral Coefficients (MFCC) using the attentional feature fusion technique to create better input features that can help even simpler models generalize well on synthetic speech classification tasks. Our best models (ResNet-based with feature fusion) were trained on the Fake or Real (FoR) dataset and achieved 95% test accuracy on the FoR test data, and an average of 90% accuracy on samples we generated using different generative models.
I. INTRODUCTION
Speech synthesis and spoofing attacks have become prevalent in recent years because of the development of generative models that are able to synthesize speech of such high quality that humans cannot distinguish it from natural speech. The spoofing attacks
include either replay attacks, where the speaker’s voice is
recorded and used in a different context, or generated speech
attacks, where a text to speech or voice conversion system is
able to generate new voice samples. In this work, we focus
on the generated speech attacks from both text to speech and
voice conversion systems because this is an area that is rapidly
changing because of newer neural network-based generative
models. We compare the performance of existing and proposed models against a variety of generated speech samples created using WaveNet [8], WaveRNN [4], Tacotron & WaveGlow [10], and FastSpeech [7].
The speaker verification community has been able to come
up with innovative models to tackle the problem described
above. A majority of models in the synthetic speech detection
domain fall under two categories: Traditional models and End
to End systems [Fig 1]. Traditional systems try to tackle
the problem in two phases: the first is feature extraction and the next is building a classifier based on the
extracted features. End to End systems skip over the feature
extraction phase and build models which take in the raw
audio samples as input and give out a classification result.
Traditional models with specifically curated features have been
shown to produce promising results in this domain, but the
newer End to End systems are not far behind and are able to achieve similar performance without much focus on features.
End to End systems are a more recent solution to the problem and, during our initial analysis, they had issues with noisy audio. Another major issue with the existing systems is their limited ability to generalize to speech synthesized using newer
generative models. Thus we went ahead with a traditional
architecture using ResNets and explored the impact of feature
fusion on features extracted from speech.
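As a rough illustration of the feature-extraction phase of such a traditional pipeline, the sketch below computes MFCC and LFCC feature maps from a waveform with torchaudio. The file name, sample rate, number of coefficients, and STFT settings here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of the feature-extraction phase (illustrative settings,
# not our exact experimental configuration).
import torch
import torchaudio

SAMPLE_RATE = 16000   # assumed sample rate
N_COEFFS = 40         # assumed number of cepstral coefficients

mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=N_COEFFS,
    melkwargs={"n_fft": 512, "hop_length": 160, "n_mels": 64},
)
lfcc = torchaudio.transforms.LFCC(
    sample_rate=SAMPLE_RATE,
    n_lfcc=N_COEFFS,
    speckwargs={"n_fft": 512, "hop_length": 160},
)

waveform, sr = torchaudio.load("sample.wav")                     # (channels, time)
waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

mfcc_feats = mfcc(waveform)   # (channels, n_mfcc, frames): mel-scale, low-frequency emphasis
lfcc_feats = lfcc(waveform)   # (channels, n_lfcc, frames): linear-scale filter bank
```

These two cepstral feature maps are the inputs that the fusion step discussed below operates on.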
For audio classification tasks, features like Mel-Spectrogram, MFCC, LFCC, CQT, and CQCC have been shown to produce good results. However, LFCC and MFCC are the features we focused on specifically for building the classifier. MFCC has been used in a variety of speech recognition applications and has been effective at capturing the lower-frequency coefficients of human speech. LFCC, on the other hand, was added to capture higher-frequency coefficients, which may be prevalent in synthesized speech. We focused on fusing both LFCC and MFCC and settled on using the Attentional Feature Fusion (AFF)
technique [1]. AFF has shown promising results in fusing and
scaling features of inconsistent semantics. We built a total
of 8 models using 3 combinations of features: only LFCC,
only MFCC, and a combination of LFCC and MFCC. In
these 8 models, we experimented with 2 different ResNet
structures: ResNet34 and ResNet50, and compared the results
with pre-trained versions of ResNets. We also introduced
Squeeze Excitation (SE) blocks into our model architecture and found that they improved the performance of our models by enhancing the inter-channel dependencies for our binary classification task.
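To make the fusion step concrete, the sketch below shows a simplified attentional fusion module in the spirit of AFF [1]: a channel-attention branch with a global (squeeze-and-excitation-style) context and a local point-wise context produces a soft weight M, and the two feature maps are combined as M * X_mfcc + (1 - M) * X_lfcc. The layer sizes, reduction ratio, and class name are illustrative assumptions; the multi-scale channel attention module of the AFF paper is the reference design, and the global branch here mirrors the same channel-wise re-calibration idea as the SE blocks in our ResNets.

```python
import torch
import torch.nn as nn

class SimpleAttentionalFusion(nn.Module):
    """Simplified AFF-style fusion of two equally shaped feature maps.

    Channel attention mixes a global (squeeze-and-excitation-like) branch
    with a local point-wise branch; the resulting weights decide, per
    channel and position, how much of each input to keep.
    (Illustrative sketch, not the exact module used in our models.)
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Local context: point-wise convolutions applied at every position.
        self.local = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global context: squeeze (global average pool) then excite.
        self.global_ = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_mfcc: torch.Tensor, x_lfcc: torch.Tensor) -> torch.Tensor:
        s = x_mfcc + x_lfcc                      # initial integration
        m = self.sigmoid(self.local(s) + self.global_(s))
        return m * x_mfcc + (1.0 - m) * x_lfcc   # attention-weighted fusion


# Usage: fuse two (batch, 1, n_coeffs, frames) cepstral feature maps
# before feeding the result to the ResNet backbone.
fusion = SimpleAttentionalFusion(channels=1)
fused = fusion(torch.randn(8, 1, 40, 400), torch.randn(8, 1, 40, 400))
```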
For the problem, we used the Fake or Real normalized
dataset which has an even distribution of samples between
genders (male and female) and classes (fake and natural).
This dataset was primarily used for training our models. We
evaluated the trained models on the Fake or Real dataset's test
samples and also against all of the generated audio samples.
To augment the test set, we also verified the performance of
our models against the ASVSpoof 2019 dataset [9].
Our primary contribution through this work is the introduction of the attentional feature fusion block, which has been effective in the image domain, into the speech domain to leverage
the combination of different extracted features. The paper
is organized as follows: Section 2 discusses the background