literature to refer to several fields, including music, speech, prosody, and language [8]. Acoustic scene analysis is a method used to extract acoustic environmental information. The technique attempts to model the soundscape as a series of layers, each representing the temporal variations of specific properties (e.g., intensity, fundamental frequency, etc.). Acoustic scenes are then analyzed using spectral clustering, which finds "typical" or "normal" patterns within an acoustic scene that can be used for recognition purposes [8].
Nonetheless, the concept of ASA has received considerable attention in the years following its introduction, and the ASA principle has been applied to a much broader spectrum of auditory stimuli, including non-speech sounds. In addition, new
theories based on Computational Auditory Scene Analysis
(CASA) were developed to explain how humans extract
speech and music signals from noise or reverberation [9]. The field of computational study known as "speech segregation" concerns the processing of signals in which one or more sources generate sounds that are picked up by a microphone. One goal is to separate the speech signal from noise signals, since any noise can interfere with understanding what is being said [10]. With a single microphone, the mixture of sources is captured at a single point; because a second microphone cannot be placed close to the location of the original one, the two sources cannot be differentiated [11]. Blind source separation minimizes errors caused by noise and other unwanted interference compared to conventional sound separation methods. The main idea behind this method is to use two microphones and cancel out the unwanted signals captured by both [11]. In recent years, source separation methods have been extensively investigated in speech processing to obtain clean speech signals from mixtures of multiple talkers, because the dynamic range of a real-world signal (in particular, human speech) is very wide. The use of an overlapping set of independent sources has been proposed as an effective way of capturing the various sources and their respective signals [12].
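As a purely illustrative sketch (not the setup described in [11] or [12]), the following snippet separates a synthetic two-microphone mixture with FastICA from scikit-learn; the toy source signals and the mixing matrix are assumptions made only for the example.

```python
# Illustrative two-microphone blind source separation sketch using FastICA.
# The toy sources and the 2x2 mixing matrix are assumptions for the demo.
import numpy as np
from sklearn.decomposition import FastICA

fs = 16000                                          # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)
speech_like = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise_like = np.sign(np.sin(2 * np.pi * 50 * t))    # toy interference source

sources = np.c_[speech_like, noise_like]            # shape: (n_samples, 2)
mixing = np.array([[1.0, 0.6],                      # assumed microphone mixing
                   [0.4, 1.0]])
observed = sources @ mixing.T                       # two simulated microphone signals

ica = FastICA(n_components=2, random_state=0)
estimated = ica.fit_transform(observed)             # recovered sources (up to scale/order)
```

Note that ICA of this kind recovers the sources only up to permutation and scaling, which is why practical systems add a selection step to decide which output carries the speech.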
CASA systems come in two types: data-driven and prediction-driven [13]. The data-driven method relies only on the attributes of the input signal, so a system with this architecture is driven entirely by input-signal features. It is called bottom-up because higher-level representations are built from the low-level data collected from the signals. On the contrary, the prediction-driven approach defines a top-down system [14]. This architecture is based on predictions of future outputs; in other words, the system predicts the next attribute of the signal. Therefore, top-down approaches rely on high-level features, whereas bottom-up approaches rely on low-level features. The data-driven approach has more stability but less adaptability and flexibility than the prediction-driven approach, which in many cases is less stable but more adaptable and flexible.
Meddis and O'Mard [15] developed some of the most efficient pitch estimation models. However, since these are multi-channel models, they are not suitable for speech separation applications such as hearing aids, which require a single-channel pitch estimation algorithm. The ITU-T G.1204 and G.1205 standards require that hearing aids be capable of separating speakers in a room [16]. This can be done by identifying the dominant speaker among several speakers. The pitch estimation algorithm assigns a score to each speaker's contribution to the acoustically mixed signal and, coupled with speech recognition, contributes to the identification of the dominant speaker.
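For illustration only, a single-channel pitch estimate of the kind such a system needs can be obtained with a short-time autocorrelation analysis; the sketch below is a generic textbook-style estimator, not the algorithm of [15] or of the cited standards, and its search range and voicing threshold are assumptions.

```python
# Generic single-channel autocorrelation pitch estimator (illustrative only).
# The F0 search range and the voicing threshold are assumed values.
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0, voicing_threshold=0.3):
    """Return an F0 estimate in Hz for one frame, or None if it looks unvoiced."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags >= 0
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    if lag_max >= len(ac) or ac[0] == 0:
        return None
    peak_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    if ac[peak_lag] < voicing_threshold * ac[0]:                    # weak periodicity
        return None
    return fs / peak_lag
```

Applied frame by frame, such an estimator yields a pitch track whose per-speaker scores can then be compared to pick the dominant talker.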
In CASA systems, time-frequency decomposition is performed using auditory filters whose bandwidth increases quasi-logarithmically with center frequency. Because an effective speech separation algorithm, or one of its derivatives, is defined as a two-dimensional spectral ratio between two time-frequency signals, it cannot be given by a simple closed-form formula. The main difficulty in these algorithms is to estimate the parameters for which the resulting signal is best suited to subsequent Minimum Mean Square Error (MMSE) filtering, and vice versa. The auditory filters are derived from psychophysical observations of the auditory periphery, and an auditory filter bank is used to imitate cochlear filtering. Two such filter banks are common: the gammatone filter bank and the Short-Time Fourier Transform (STFT) based filter bank [6]. The STFT filter bank is more efficient, as it exploits the high-resolution capabilities of Digital Signal Processing (DSP) hardware [17].
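To make the two decompositions concrete, the sketch below computes ERB-spaced center frequencies (on Glasberg and Moore's ERB-rate scale) for a gammatone-style bank alongside a uniform-resolution STFT grid via SciPy; the channel count, frequency range, frame length, and white-noise test signal are illustrative assumptions rather than values used in [6] or [17].

```python
# Sketch of the two time-frequency front-ends discussed above (illustrative).
import numpy as np
from scipy.signal import stft

def erb_rate(f_hz):
    """Glasberg & Moore ERB-rate value for a frequency in Hz."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def inverse_erb_rate(erb):
    """Frequency in Hz corresponding to an ERB-rate value."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_center_frequencies(fmin=80.0, fmax=8000.0, n_channels=32):
    """Center frequencies spaced uniformly on the ERB-rate scale (quasi-logarithmic in Hz)."""
    return inverse_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))

fs = 16000
x = np.random.randn(fs)                       # 1 s placeholder signal
cf = gammatone_center_frequencies()           # dense at low frequencies, sparse at high
f, t, X = stft(x, fs=fs, nperseg=512)         # uniformly spaced STFT bins (31.25 Hz apart)
```

The contrast is visible in the spacing: the ERB-based center frequencies crowd the low-frequency region, whereas the STFT bins are uniformly spaced across the band.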
Hamsa et al. proposed and implemented a wavelet packet transform (WPT) based filter bank for segregating noise and emotional speech data [4].
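As a hedged illustration of how a WPT-based filter bank can be realized in practice, the snippet below decomposes a signal into frequency-ordered subbands with PyWavelets; the decomposition depth and the db4 mother wavelet are assumptions, not the configuration reported in [4].

```python
# Illustrative wavelet packet decomposition used as a filter bank (PyWavelets).
# The 'db4' wavelet and 4-level depth are assumed for the example.
import numpy as np
import pywt

fs = 16000
x = np.random.randn(fs)                                    # 1 s placeholder signal

wp = pywt.WaveletPacket(data=x, wavelet="db4", mode="symmetric", maxlevel=4)
subbands = wp.get_level(4, order="freq")                   # 16 frequency-ordered subbands
energies = [float(np.sum(node.data ** 2)) for node in subbands]
```

Subband energies of this kind can then serve as input features for a segregation or emotion-related classification stage.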
Emotion attribute projection (EAP) and linear fusion were used by Bao et al. [18] to analyze speech, design a recognition system for speaker identification in emotional speaking conditions, and validate the system through evaluation on real data. The findings were that linear fusion improved the EAP-based emotion recognizer for mental well-being under emotional speaking conditions. Shahin et al. focused on improving the performance of techniques for voice identification in emotional speaking conditions [19]. Their studies include improving speaker identification performance using hand-crafted features with models such as Hidden Markov Models (HMMs), Second-Order Circular Hidden Markov Models (CHMM2s), and Supra-segmental Hidden Markov Models (SPHMMs). Each of these models achieved only average speaker identification performance, with SPHMMs attaining the highest rate of 69.1%, followed by CHMM2s and HMMs with 66.4% and 61.4%, respectively [20]. For improved results, the authors used and assessed a hybrid Gaussian Mixture Model-Deep Neural Network (GMM-DNN) classifier and obtained an average speaker identification rate of 76.8% [20]. Nassif et al. improved the results of the GMM-DNN model by adding a suitable CASA-based noise reduction pre-processing module [21].
In this paper, we have designed and applied a more coherent and less complex model than existing models for speech segregation and identification of an unknown speaker in emotional and noisy talking conditions. The proposed algorithm utilizes pre-trained deep learning approaches for speech segregation, feature extraction, and classification. The state-of-the-art model used onset-offset-based segmentation and classification for dominant voice segregation, in which the pitch of target and interference