Speaker Identication from emotional and noisy speech data
using learned voice segregation and Speech VGG
Shibani Hamsa1, Ismail Shahin2, Member, IEEE, Youssef Iraqi3, Senior Member, IEEE,
Ali Bou Nassif4, Member, IEEE, Ernesto Damiani1, Senior Member, IEEE, and Naoufel Werghi1, Senior Member, IEEE
1Center for Cyber-Physical Systems (C2PS), Dept. of ECE, Khalifa University, Abu Dhabi, UAE.
2Dept. of Electrical Engineering, University of Sharjah, Sharjah, UAE.
3School of Computer Science, Mohammed VI Polytechnic University, Morocco.
4Dept. of Computer Engineering, University of Sharjah, UAE.
Corresponding author: Shibani Hamsa (email: 100050116@ku.ac.ae).
Speech signals are subjected to more acoustic interference and emotional factors than other signals. Noisy, emotion-riddled speech data is a challenge for real-time speech processing applications. It is essential to find an effective way to segregate the dominant signal from other external influences. An ideal system should have the capacity to accurately recognize required auditory events from a complex scene taken in an unfavorable situation. This paper proposes a novel approach to speaker identification in unfavorable conditions such as emotion and interference, using a pre-trained Deep Neural Network mask and Speech VGG. The proposed model obtained superior performance over the recent literature on English and Arabic emotional speech data, reporting average speaker identification rates of 85.2%, 87.0%, and 86.6% on the Ryerson audio-visual dataset (RAVDESS), the Speech Under Simulated and Actual Stress (SUSAS) dataset, and the Emirati-accented Speech dataset (ESD), respectively.
Index Terms—Deep Neural Network; Emotional talking conditions; Feature extraction; Noise reduction; Speaker identification; Speech segregation.
I. Introduction
The human auditory system can handle complex auditory scenes and is efficient enough to precisely distinguish the various auditory events. There are many different terms to describe this phenomenon, such as segregation by type or frequency, but they all share one important quality: they are efficient. The human ear can attend to only so many different types of sound at once, yet it can still distinguish them with great accuracy, yielding a perceived soundscape that could be described as balanced and harmonious. This work discusses the challenges that most real-time human-machine interactive audio systems face when handling complex auditory scenes. The auditory scene is a series of sounds whose signal-to-noise ratio varies. We must process this, as our auditory experience does not provide these signals in isolation, and isolated sounds are non-existent in the real world. The cocktail party effect [1] demonstrates this problem: at a party full of noise and chatter, we attempt to segregate the necessary speech signals from all the other noises in the auditory field [2].
The goal of this work is to propose a way to tackle these challenges using machine learning and data mining techniques, where pre-processing steps generate an appropriate feature representation for a collaborative filtering/learning algorithm, which subsequently leads to a new sound event classifier. Deep learning is a form of machine learning that uses artificial neural networks to model high-level abstractions in data [3], [4]. The closed-form solution [5] allows one to learn the structure of data without requiring costly supervised or unsupervised pre-processing. Deep Neural Networks differ from standard approaches in how they are learned: rather than relying on hand-crafted features [6], a deep neural network can simply be exposed to large amounts of data and will learn features automatically. In this work, we have used machine learning models for dominant signal extraction and speaker identification from emotional and noisy speech data.
In recent years, we have seen many real-time human-machine interactions that mainly focus on audio. In this work, we have focused on designing and implementing a model suitable for identifying an unknown speaker in emotional and noisy real application situations. The proposed speaker identification model, designed on a deep learning platform, has been evaluated in noisy, stressful, and emotionally challenging talking environments to ensure the system's robustness in real applications [7]. The proposed model performs on par with or better than the best previous state-of-the-art models on most of the evaluation metrics. In addition, it is robust to various acoustic distortions and interference. Finally, we evaluated the effectiveness of the proposed system through its performance on various evaluation metrics.
The rest of the paper is organized as follows. A literature review is given in Section II. The system description is explained in Section III. Experimental results are described in Section IV, and finally, the conclusion is given in Section V.
II. LITERATURE REVIEW
Auditory Scene Analysis (ASA) is based on a theory that describes how the brain processes sounds through networks of neurons. The term "ASA" has been used in the literature to refer to several fields, including music, speech, prosody, and language [8].
Acoustic scene analysis is a method used to extract acoustic environmental information. The technique attempts to model the soundscape as a series of layers, each representing the temporal variations of specific properties (e.g., intensity, fundamental frequency, etc.). Acoustic scenes are then analyzed using spectral clustering, which finds "typical" or "normal" patterns within an acoustic scene that can be used for recognition purposes [8]. Nonetheless, the concept of ASA has received considerable attention in the years following its introduction. The ASA principle was applied to a much broader spectrum of auditory stimuli, including non-speech sounds. In addition, new theories based on Computational Auditory Scene Analysis (CASA) were developed to explain how humans extract speech and music signals from noise or reverberation [9].
The field of computational study known as "speech segregation" refers to the processing of signals in which one or more sources are generating sounds that are being detected by a microphone. One goal is to separate the speech signal from noise signals, since any noise can interfere with making out what someone is saying [10]. For a single microphone, the speech source is located at the microphone's center. Since we cannot place a second microphone close to the location of the original one, we cannot differentiate between the two sources [11]. Blind source separation minimizes errors caused by noise and other unwanted interference when compared to conventional sound separation methods. The main idea behind this method is to use two microphones and cancel out signals from both [11]. In recent years, source separation methods have been extensively investigated in speech processing to obtain clean speech signals from a mixture of multiple talkers. This is because the dynamic range of a real-world signal (in particular human speech) is very wide. The use of an overlapping independent set of sources has been proposed as an effective method for capturing the various sources and their respective signals [12].
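As a concrete illustration of the masking view of single-channel segregation (not the method proposed in this paper), the following sketch estimates a soft time-frequency ratio mask, here from an oracle noise recording purely for demonstration, applies it to the STFT of a noisy mixture, and resynthesizes the dominant speech; the sampling rate and frame settings are assumptions.

# Minimal sketch of time-frequency ratio masking for single-channel
# segregation (illustrative only; not the system described in this paper).
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                   # assumed sampling rate
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)              # stand-ins for real recordings
noise = 0.5 * rng.standard_normal(fs)
mixture = clean + noise

# Time-frequency decomposition of the mixture and the (oracle) components.
_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, N = stft(noise, fs=fs, nperseg=512)
_, _, X = stft(mixture, fs=fs, nperseg=512)

# Soft ratio mask: fraction of energy in each T-F unit attributed to speech.
mask = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)

# Apply the mask to the mixture and resynthesize the dominant (speech) signal.
_, speech_est = istft(mask * X, fs=fs, nperseg=512)
print(speech_est.shape)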
CASA systems come in two types: data-driven and prediction-driven [13]. The data-driven method is based only on the attributes of the input signal, and systems that use this architecture rely on the input signal features. It is called bottom-up since higher-level representations are built up from the low-level data collected from the signals. In contrast, the prediction-driven approach defines a top-down system [14]. This architecture is based on predictions of future outputs; in other words, the system predicts the next attribute of a signal. Therefore, top-down approaches are based on high-level features, whereas bottom-up approaches are based on low-level features. In many cases, the data-driven approach has more stability but less adaptability and flexibility than the prediction-driven approach.
Meddis and O'Mard [15] prepared the most efficient pitch estimation models. Since they used multi-channel models, they are not suitable for speech separation applications such as hearing aids, which require a single-channel pitch estimation algorithm. The ITU-T G.1204 and G.1205 standards require that hearing aids [16] be capable of separating speakers in a room. This can be done by identifying the dominant speaker among several speakers. The pitch estimation algorithm provides a score for each speaker's contribution to the signal mixed in the acoustic domain and, coupled with speech recognition, contributes to the identification of the dominant speaker.
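For reference, the kind of single-channel pitch estimate such systems rely on can be sketched with a simple autocorrelation search; this is an illustration only, not the Meddis and O'Mard model, and the sampling rate, frame length, and search range below are assumed values.

# Minimal sketch of single-channel autocorrelation pitch estimation
# (illustrative only; not the Meddis/O'Mard multi-channel model).
import numpy as np

def estimate_f0(frame: np.ndarray, fs: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Return a rough F0 estimate (Hz) for one voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)                        # plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Example: a synthetic 120 Hz "voiced" frame of 40 ms at 16 kHz.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 120.0 * t)
print(round(estimate_f0(frame, fs), 1))      # prints a value close to 120 Hz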
In CASA systems, time-frequency decomposition is performed using auditory filters whose bandwidths increase quasi-logarithmically with center frequency. Since the effective speech separation algorithm, or one of its derivatives, is defined as a 2D spectral ratio of two time-frequency signals, it cannot be given by a simple formula. The main difficulty in these algorithms is to estimate the parameters for which the resulting signal is most suited for subsequent Minimum Mean Square Error (MMSE) filtering, and vice versa. These filters are derived from psychophysical observations of the auditory periphery. An auditory filter bank is used to imitate cochlear filtering. There are two such filter banks: the Gammatone filter bank and the Short-Time Fourier Transform (STFT) based filter bank [6]. The STFT filter bank is more efficient as it utilizes the high-resolution capabilities of Digital Signal Processing (DSP) hardware [17]. Hamsa et al. proposed and implemented a wavelet packet transform (WPT) based filter bank for segregating noise and emotional speech data [4].
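To give a concrete picture of this cochlea-inspired decomposition, the following sketch builds a small gammatone-style filter bank with quasi-logarithmically spaced center frequencies and applies it by FIR convolution; the number of channels, filter length, and sampling rate are illustrative assumptions rather than the settings used in the cited systems.

# Minimal sketch of a gammatone-style auditory filter bank (illustrative
# parameters; not the exact filter bank used in the cited CASA systems).
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Hz) of a filter centered at fc (Hz)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Impulse response of a fourth-order gammatone filter centered at fc."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)                       # common bandwidth scaling
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

fs = 16000
# Quasi-logarithmic center frequencies between 100 Hz and 6 kHz (assumed range).
centres = np.geomspace(100.0, 6000.0, num=16)

rng = np.random.default_rng(0)
x = rng.standard_normal(fs)                  # stand-in for one second of speech

# Decompose the signal into 16 band-limited channels.
channels = np.stack([np.convolve(x, gammatone_ir(fc, fs), mode="same") for fc in centres])
print(channels.shape)                        # (16, 16000)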
Emotion attribute projection (EAP) and linear fusion were used by Bao et al. [18] to analyze speech, design a recognition system for speaker identification in emotional speaking conditions, and validate the system through evaluation on real data. The findings were that linear fusion improved the EAP-based emotion recognizer for mental well-being in emotional speaking conditions. Shahin et al. focused on improving the performance of techniques for voice identification in emotional speaking conditions [19]. Their studies include improving speaker identification performance based on hand-crafted features and classifiers such as Hidden Markov Models (HMMs), Second-Order Circular Hidden Markov Models (CHMM2s), and Suprasegmental Hidden Markov Models (SPHMMs). Each of these models achieved average speaker identification performance, with the highest being SPHMMs at 69.1%, followed by CHMM2s and HMMs with 66.4% and 61.4%, respectively [20]. For improved results, the authors used and assessed a hybrid Gaussian Mixture Model-Deep Neural Network (GMM-DNN) classifier and obtained an average speaker identification rate of 76.8% [20]. Nassif et al. improved the results of the GMM-DNN model by adding a suitable CASA-based noise reduction pre-processing module [21].
In this paper, we have designed and applied a more coherent and less complex model than the existing models for speech segregation and identification of the unknown speaker in emotional and noisy talking conditions. The proposed algorithm utilizes pre-trained deep learning approaches for speech segregation, feature extraction, and classification. The state-of-the-art model used onset-offset-based segmentation and classification for dominant voice segregation, in which the pitch of target and interference