Adaptive Re-calibration of Channel-wise Features for Adversarial Audio Classification
Vardhan Dongre[1], Abhinav Reddy Thimma[1], Nikhitha Reddeddy[1]
[1] University of Illinois Urbana-Champaign
Abstract—DeepFake audio, unlike DeepFake images and videos, has been a less studied topic, and the solutions that exist for synthetic speech classification either use complex networks or do not generalize to different kinds of synthetic speech. Through this work, we perform a comparative analysis of different proposed models for synthetic speech detection, including End2End and ResNet-based models, against synthetic speech generated using Text to Speech and Vocoder systems such as WaveNet [8], WaveRNN [4], Tacotron, and WaveGlow. We also experimented with Squeeze Excitation (SE) blocks in our ResNet models and found that the combination achieved better performance. In addition to the analysis, we propose a combination of Linear Frequency Cepstral Coefficients (LFCC) and Mel Frequency Cepstral Coefficients (MFCC) using the attentional feature fusion technique to create better input features that can help even simpler models generalize well on synthetic speech classification tasks. Our best models (ResNet-based with feature fusion) were trained on the Fake or Real (FoR) dataset and achieved 95% test accuracy on the FoR test data, and an average of 90% accuracy on samples we generated using different generative models.
I. INTRODUCTION
Speech synthesis and spoofing attacks have become prevalent in recent years because of the development of generative models that are able to synthesize speech of such high quality that humans cannot distinguish it from natural speech. The spoofing attacks
include either replay attacks, where the speaker’s voice is
recorded and used in a different context, or generated speech
attacks, where a text to speech or voice conversion system is
able to generate new voice samples. In this work, we focus
on the generated speech attacks from both text to speech and
voice conversion systems because this is an area that is rapidly
changing because of newer neural network-based generative
models. We compare the performance of existing and proposed models against a variety of generated speech samples created using WaveNet [8], WaveRNN [4], Tacotron & WaveGlow [10], and FastSpeech [7].
The speaker verification community has been able to come
up with innovative models to tackle the problem described
above. A majority of models in the synthetic speech detection
domain fall under two categories: Traditional models and End
to End systems [Fig 1]. Traditional systems try to tackle
the problem in two phases: the first is feature extraction and the next is building a classifier based on the
extracted features. End to End systems skip over the feature
extraction phase and build models which take in the raw
audio samples as input and give out a classification result.
Traditional models with specifically curated features have been
shown to produce promising results in this domain, but the
newer End to End systems are not far behind and are able to achieve similar performance without much focus on features.
End to End systems are a more recent solution to the problem and, during our initial analysis, they had issues with noisy audio. Another major issue with the existing systems is their limited ability to generalize to speech synthesized using newer
generative models. Thus we went ahead with a traditional
architecture using ResNets and explored the impact of feature
fusion on features extracted from speech.
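As a rough illustration of the feature-extraction phase of such a traditional pipeline, the sketch below computes MFCC and LFCC feature maps from a waveform with torchaudio. The file name, sample rate, number of coefficients, and STFT settings here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of the feature-extraction phase (illustrative settings,
# not our exact experimental configuration).
import torch
import torchaudio

SAMPLE_RATE = 16000   # assumed sample rate
N_COEFFS = 40         # assumed number of cepstral coefficients

mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=N_COEFFS,
    melkwargs={"n_fft": 512, "hop_length": 160, "n_mels": 64},
)
lfcc = torchaudio.transforms.LFCC(
    sample_rate=SAMPLE_RATE,
    n_lfcc=N_COEFFS,
    speckwargs={"n_fft": 512, "hop_length": 160},
)

waveform, sr = torchaudio.load("sample.wav")                     # (channels, time)
waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

mfcc_feats = mfcc(waveform)   # (channels, n_mfcc, frames): mel-scale, low-frequency emphasis
lfcc_feats = lfcc(waveform)   # (channels, n_lfcc, frames): linear-scale filter bank
```

These two cepstral feature maps are the inputs that the fusion step discussed below operates on.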
For audio classification tasks, features like Mel-Spectrogram, MFCC, LFCC, CQT, and CQCC have been shown to produce good results. However, LFCC and MFCC are the features we focused on specifically for building the classifier. MFCC has been used in a variety of speech recognition applications and has been effective at capturing the lower-frequency coefficients of human speech. LFCC, on the other hand, was added to capture higher-frequency coefficients, which may be prevalent in synthesized speech. We focused on fusing both LFCC and MFCC and settled on using the Attentional Feature Fusion (AFF)
technique [1]. AFF has shown promising results in fusing and
scaling features of inconsistent semantics. We built a total
of 8 models using 3 combinations of features: only LFCC,
only MFCC, and a combination of LFCC and MFCC. In
these 8 models, we experimented with 2 different ResNet
structures: ResNet34 and ResNet50, and compared the results
with pre-trained versions of ResNets. We also introduced
Squeeze Excitation (SE) blocks into our model architecture and found that they improved the performance of our models by enhancing the inter-channel dependencies for our binary classification task.
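To make the fusion step concrete, the sketch below shows a simplified attentional fusion module in the spirit of AFF [1]: a channel-attention branch with a global (squeeze-and-excitation-style) context and a local point-wise context produces a soft weight M, and the two feature maps are combined as M * X_mfcc + (1 - M) * X_lfcc. The layer sizes, reduction ratio, and class name are illustrative assumptions; the multi-scale channel attention module of the AFF paper is the reference design, and the global branch here mirrors the same channel-wise re-calibration idea as the SE blocks in our ResNets.

```python
import torch
import torch.nn as nn

class SimpleAttentionalFusion(nn.Module):
    """Simplified AFF-style fusion of two equally shaped feature maps.

    Channel attention mixes a global (squeeze-and-excitation-like) branch
    with a local point-wise branch; the resulting weights decide, per
    channel and position, how much of each input to keep.
    (Illustrative sketch, not the exact module used in our models.)
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Local context: point-wise convolutions applied at every position.
        self.local = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global context: squeeze (global average pool) then excite.
        self.global_ = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x_mfcc: torch.Tensor, x_lfcc: torch.Tensor) -> torch.Tensor:
        s = x_mfcc + x_lfcc                      # initial integration
        m = self.sigmoid(self.local(s) + self.global_(s))
        return m * x_mfcc + (1.0 - m) * x_lfcc   # attention-weighted fusion


# Usage: fuse two (batch, 1, n_coeffs, frames) cepstral feature maps
# before feeding the result to the ResNet backbone.
fusion = SimpleAttentionalFusion(channels=1)
fused = fusion(torch.randn(8, 1, 40, 400), torch.randn(8, 1, 40, 400))
```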
For the problem, we used the Fake or Real normalized
dataset which has an even distribution of samples between
genders (male and female) and classes (fake and natural).
This dataset was primarily used for training our models. We
evaluated the trained models on the Fake or Real dataset's test
samples and also against all of the generated audio samples.
To augment the test set, we also verified the performance of
our models against the ASVSpoof 2019 dataset [9].
Our primary contribution through this work is the introduction of the attentional feature fusion block, which has been effective in the image domain, into the speech domain to leverage
the combination of different extracted features. The paper
is organized as follows: Section 2 discusses the background