literature to refer to several fields, including music, speech, prosody, and language [8]. Acoustic scene analysis is a method used to extract acoustic environmental information. The technique attempts to model the soundscape as a series of layers, each representing the temporal variations of specific properties (e.g., intensity, fundamental frequency, etc.). Acoustic scenes are then analyzed using spectral clustering, which finds "typical" or "normal" patterns within an acoustic scene that can be used for recognition purposes [8].
Nonetheless, the concept of ASA has received considerable attention in the years following its introduction, and the ASA principle has been applied to a much broader spectrum of auditory stimuli, including non-speech sounds. In addition, new
theories based on Computational Auditory Scene Analysis
(CASA) were developed to explain how humans extract
speech and music signals from noise or reverberation [9]. The field of computational study known as "speech segregation" concerns the processing of signals in which one or more sources generate sounds that are picked up by a microphone. One goal is to separate the speech signal from noise signals, since any noise can interfere with understanding what is being said [10]. With a single microphone, the mixture of sources is captured at a single point; because a second microphone cannot be placed close to the location of the original one, the two sources cannot be differentiated [11]. Blind source separation minimizes errors caused by noise and other unwanted interference compared to conventional sound separation methods. The main idea behind this method is to use two microphones and cancel out the unwanted signals captured by both [11]. In recent years, source separation methods have been extensively investigated in speech processing to obtain clean speech signals from mixtures of multiple talkers, because the dynamic range of a real-world signal (in particular, human speech) is very wide. The use of an overlapping set of independent sources has been proposed as an effective way of capturing the various sources and their respective signals [12].
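As a purely illustrative sketch (not the setup described in [11] or [12]), the following snippet separates a synthetic two-microphone mixture with FastICA from scikit-learn; the toy source signals and the mixing matrix are assumptions made only for the example.

```python
# Illustrative two-microphone blind source separation sketch using FastICA.
# The toy sources and the 2x2 mixing matrix are assumptions for the demo.
import numpy as np
from sklearn.decomposition import FastICA

fs = 16000                                          # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)
speech_like = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise_like = np.sign(np.sin(2 * np.pi * 50 * t))    # toy interference source

sources = np.c_[speech_like, noise_like]            # shape: (n_samples, 2)
mixing = np.array([[1.0, 0.6],                      # assumed microphone mixing
                   [0.4, 1.0]])
observed = sources @ mixing.T                       # two simulated microphone signals

ica = FastICA(n_components=2, random_state=0)
estimated = ica.fit_transform(observed)             # recovered sources (up to scale/order)
```

Note that ICA of this kind recovers the sources only up to permutation and scaling, which is why practical systems add a selection step to decide which output carries the speech.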
CASA systems come in two types: data-driven and prediction-driven [13]. The data-driven method relies only on the attributes of the input signal, so a system with this architecture is driven entirely by input-signal features. It is called bottom-up because higher-level representations are built from the low-level data collected from the signals. On the contrary, the prediction-driven approach defines a top-down system [14]. This architecture is based on predictions of future outputs; in other words, the system predicts the next attribute of the signal. Therefore, top-down approaches rely on high-level features, whereas bottom-up approaches rely on low-level features. The data-driven approach has more stability but less adaptability and flexibility than the prediction-driven approach, which in many cases is less stable but more adaptable and flexible.
Meddis and O'Mard [15] developed some of the most efficient pitch estimation models. However, since these are multi-channel models, they are not suitable for speech separation applications such as hearing aids, which require a single-channel pitch estimation algorithm. The ITU-T G.1204 and G.1205 standards require that hearing aids be capable of separating speakers in a room [16]. This can be done by identifying the dominant speaker among several speakers. The pitch estimation algorithm assigns a score to each speaker's contribution to the acoustically mixed signal and, coupled with speech recognition, contributes to the identification of the dominant speaker.
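For illustration only, a single-channel pitch estimate of the kind such a system needs can be obtained with a short-time autocorrelation analysis; the sketch below is a generic textbook-style estimator, not the algorithm of [15] or of the cited standards, and its search range and voicing threshold are assumptions.

```python
# Generic single-channel autocorrelation pitch estimator (illustrative only).
# The F0 search range and the voicing threshold are assumed values.
import numpy as np

def estimate_pitch(frame, fs, fmin=60.0, fmax=400.0, voicing_threshold=0.3):
    """Return an F0 estimate in Hz for one frame, or None if it looks unvoiced."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags >= 0
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    if lag_max >= len(ac) or ac[0] == 0:
        return None
    peak_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    if ac[peak_lag] < voicing_threshold * ac[0]:                    # weak periodicity
        return None
    return fs / peak_lag
```

Applied frame by frame, such an estimator yields a pitch track whose per-speaker scores can then be compared to pick the dominant talker.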
In CASA systems, time-frequency decomposition is performed using auditory filters whose bandwidth increases quasi-logarithmically with center frequency. Because an effective speech separation algorithm, or one of its derivatives, is defined as a two-dimensional spectral ratio between two time-frequency signals, it cannot be given by a simple closed-form formula. The main difficulty in these algorithms is to estimate the parameters for which the resulting signal is best suited to subsequent Minimum Mean Square Error (MMSE) filtering, and vice versa. The auditory filters are derived from psychophysical observations of the auditory periphery, and an auditory filter bank is used to imitate cochlear filtering. Two such filter banks are common: the gammatone filter bank and the Short-Time Fourier Transform (STFT) based filter bank [6]. The STFT filter bank is more efficient, as it exploits the high-resolution capabilities of Digital Signal Processing (DSP) hardware [17].
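To make the two decompositions concrete, the sketch below computes ERB-spaced center frequencies (on Glasberg and Moore's ERB-rate scale) for a gammatone-style bank alongside a uniform-resolution STFT grid via SciPy; the channel count, frequency range, frame length, and white-noise test signal are illustrative assumptions rather than values used in [6] or [17].

```python
# Sketch of the two time-frequency front-ends discussed above (illustrative).
import numpy as np
from scipy.signal import stft

def erb_rate(f_hz):
    """Glasberg & Moore ERB-rate value for a frequency in Hz."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def inverse_erb_rate(erb):
    """Frequency in Hz corresponding to an ERB-rate value."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_center_frequencies(fmin=80.0, fmax=8000.0, n_channels=32):
    """Center frequencies spaced uniformly on the ERB-rate scale (quasi-logarithmic in Hz)."""
    return inverse_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))

fs = 16000
x = np.random.randn(fs)                       # 1 s placeholder signal
cf = gammatone_center_frequencies()           # dense at low frequencies, sparse at high
f, t, X = stft(x, fs=fs, nperseg=512)         # uniformly spaced STFT bins (31.25 Hz apart)
```

The contrast is visible in the spacing: the ERB-based center frequencies crowd the low-frequency region, whereas the STFT bins are uniformly spaced across the band.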
Hamsa et al. proposed and implemented a wavelet packet transform (WPT) based filter bank for segregating noise and emotional speech data [4].
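As a hedged illustration of how a WPT-based filter bank can be realized in practice, the snippet below decomposes a signal into frequency-ordered subbands with PyWavelets; the decomposition depth and the db4 mother wavelet are assumptions, not the configuration reported in [4].

```python
# Illustrative wavelet packet decomposition used as a filter bank (PyWavelets).
# The 'db4' wavelet and 4-level depth are assumed for the example.
import numpy as np
import pywt

fs = 16000
x = np.random.randn(fs)                                    # 1 s placeholder signal

wp = pywt.WaveletPacket(data=x, wavelet="db4", mode="symmetric", maxlevel=4)
subbands = wp.get_level(4, order="freq")                   # 16 frequency-ordered subbands
energies = [float(np.sum(node.data ** 2)) for node in subbands]
```

Subband energies of this kind can then serve as input features for a segregation or emotion-related classification stage.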
Emotion attribute projection (EAP) and linear fusion were used by Bao et al. [18] to analyze speech, design a recognition system for speaker identification in emotional speaking conditions, and validate the system through evaluation on real data. The findings were that linear fusion improved the EAP-based emotion recognizer for mental well-being under emotional speaking conditions. Shahin et al. focused on improving the performance of techniques for voice identification in emotional speaking conditions [19]. Their studies include improving speaker identification performance using hand-crafted features with models such as Hidden Markov Models (HMMs), Second-Order Circular Hidden Markov Models (CHMM2s), and Supra-segmental Hidden Markov Models (SPHMMs). Each of these models achieved only average speaker identification performance, with SPHMMs attaining the highest rate of 69.1%, followed by CHMM2s and HMMs with 66.4% and 61.4%, respectively [20]. For improved results, the authors used and assessed a hybrid Gaussian Mixture Model-Deep Neural Network (GMM-DNN) classifier and obtained an average speaker identification rate of 76.8% [20]. Nassif et al. improved the results of the GMM-DNN model by adding a suitable CASA-based noise reduction pre-processing module [21].
In this paper, we have designed and applied a more coherent and less complex model than existing models for speech segregation and identification of an unknown speaker in emotional and noisy talking conditions. The proposed algorithm utilizes pre-trained deep learning approaches for speech segregation, feature extraction, and classification. The state-of-the-art model used onset-offset-based segmentation and classification for dominant voice segregation, in which the pitch of target and interference