Submitted for publication to the Journal of Audio Engineering Society on October 20th, 2022.
Enhanced Fuzzy Decomposition of Sound Into Sines, Transients and Noise
Preprint, compiled December 1, 2022
Leonardo Fierro and Vesa Välimäki
leonardo.fierro@aalto.fi | vesa.valimaki@aalto.fi
Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
Abstract
The decomposition of sounds into sines, transients, and noise is a long-standing research problem in audio
processing. The current solutions for this three-way separation detect either horizontal and vertical structures or
anisotropy and orientations in the spectrogram to identify the properties of each spectral bin and classify it as
sinusoidal, transient, or noise. This paper proposes an enhanced three-way decomposition method based on
fuzzy logic, enabling soft masking while preserving the perfect reconstruction property. The proposed method
allows each spectral bin to simultaneously belong to two classes, sine and noise or transient and noise. Results of
a subjective listening test against three other techniques are reported, showing that the proposed decomposition
yields a better or comparable quality. The main improvement appears in transient separation, which enjoys little
or no loss of energy or leakage from the other components and performs well for test signals presenting strong
transients. The audio quality of the separation is shown to depend on the complexity of the input signal for all
tested methods. The proposed method helps improve the quality of various audio processing applications. A
successful implementation over a state-of-the-art time-scale modification method is reported as an example.
1 INTRODUCTION
Decomposing an audio signal into its sinusoidal, transient, and noise (STN) components has been drawing research interest for over two decades [1, 2, 3, 4]. It is a widely used tool in a variety of audio processing applications, ranging from beat tracking [5] and tonality estimation [6] to reduction of spectral complexity in cochlear implants [7] and to virtual bass enhancement [8]. The STN separation is also helpful in time-scale modification [1, 9, 10], where it has been combined with the notion of fuzzy logic in order to improve [4, 11] or evaluate the audio quality [12]. In all these audio applications, it is helpful to process sine, transient, and noise components independently of each other. This paper proposes improvements to the fuzzy STN decomposition of audio signals.
The STN separation relies on the assumption that any audio signal can be described as a linear combination of three independent actors: tonal content (sines), impulsive events (transients), and a residual part (noise) that does not belong to either one of the other two classes and adds nuance to the sound. Historically, additive synthesis modeled any sound as a sum of sinusoidal components [13, 14, 15]. Serra and Smith expanded the additive synthesis method by introducing the noise class, which was obtained as a residual after a sinusoidal model was subtracted from the original signal [16]. In the resulting method, called spectral modeling synthesis [16], the frequency, amplitude, and phase of the sinusoidal components were estimated from the short-time Fourier transform (STFT) using a method similar to the McAulay-Quatieri algorithm [17].
The three-way decomposition was first introduced by Verma et al. [1, 18, 19], who showed that including a third component for transients was greatly beneficial in the context of signal analysis and synthesis, as it avoided the smearing of transients, which was a weakness in sines + noise models. Levine and Smith also showed that the adaptiveness of the STN model made it suitable for audio compression and for pitch- and time-scale modification [20].
Fitzgerald discovered that it was possible to decompose an audio signal into its sinusoidal and transient components by using spectral masks extracted via horizontal and vertical median filtering of the STFT [21]. Driedger et al. [2] reintroduced the three-way separation by updating Fitzgerald’s method: the noise component could be obtained by retrieving spurious information after extracting the other two components with median filtering. Füg et al. [3] proposed a follow-up method involving the use of structure tensors (ST) to find predominant orientation angles and anisotropy in the time-frequency signal representation, showing an improvement in the separation quality for sounds with vibrato.
Other recent approaches for sines–transients separation include a kernel additive matrix [22], non-negative matrix factorization [23], improved sinusoidal modeling [24, 25], and neural networks [26]. However, these methods do not involve a third class for the noise component, hence they are not discussed further in this paper.
It should be noted that the STN decomposition does not directly relate to traditional source separation, which usually aims at retrieving musical instruments in a sound mixture [27] or speech from noisy background sources [28]. According to the STN formulation, even strongly percussive sources, such as drums, will have a sinusoidal and a noise component (unless they are perfect, synthetic pulses) and, similarly, strongly harmonic sources, such as the violin, will hold, in addition to a sinusoidal part, a transient component in their attack and a noise component to describe the nuances, such as the bowing noise.
arXiv:2210.14041v2 [eess.AS] 30 Nov 2022
While both Driedger et al. [2] and Füg et al. [3] applied hard binary masks to define the sinusoidal, transient, and noise classes, Damskägg and Välimäki [4] introduced the concept of fuzzy logic in the context of time-scale modification. The fuzzy classification (FZ) allows spectral bins to simultaneously contribute to the three classes, providing a more refined basis for the three-way separation [4]. This decomposition method was then extended to objective evaluation by Fierro and Välimäki [12] and improved by Moliner et al. [8] to allow perfect reconstruction by ensuring that the three soft spectral masks sum up to unity.
This work proposes a novel way to estimate fuzzy soft masks for STN decomposition of audio signals. The proposed method allows for intermediate classifications of the spectral bins between two components: sines vs. noise and transients vs. noise. This two-stage decomposition is shown to improve the overall sound quality of the separated components, in particular for transients. The masks ensure perfect reconstruction and are optimized for each class to have a large constant region followed by a fast but smooth transition to the adjacent class. The transition slope is refined for both decomposition stages to provide the best separation quality.
The rest of this paper is structured as follows. Sec. 2 discusses previous three-way separation techniques. Sec. 3 introduces the new STN decomposition method, which optimally extracts the sinusoidal and transient components. Sec. 4 evaluates the proposed method against three previous techniques. Sec. 5 applies the proposed method to time-scale modification, and Sec. 6 concludes.
2 RELATED WORK
This section summarizes three previous STN decomposition methods based on a spectrogram representation of the input signal. A spectrogram $X$ is a $K$-by-$M$ matrix representing the time-frequency behavior of audio signal $x$. Each element $X(m,k)$ is computed using the STFT:
$$X(m,k) = \sum_{n=0}^{L-1} x(n + mH)\, w(n)\, e^{-j\omega_k n}, \quad (1)$$
where $n$ is the sample index, $m = 0, 1, 2, \ldots, M-1$ is the temporal frame index, $k = 0, 1, 2, \ldots, K-1$ is the spectral bin index, $w$ is the analysis window, $H$ is the hop size, $L$ is the window length in samples, which is assumed to be even, $j$ is the imaginary unit, and $\omega_k$ is the normalized central frequency of the $k$th spectral bin.
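Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the Hann window, the default values of `L` and `H`, the choice $\omega_k = 2\pi k / L$, and the restriction to the $K = L/2 + 1$ non-negative-frequency bins are all assumptions made here.

```python
import numpy as np

def stft(x, L=1024, H=256):
    """Return X[m, k] as in Eq. (1): frame index m on axis 0, bin index k on axis 1."""
    w = np.hanning(L)                    # analysis window w(n); Hann is an assumption
    M = 1 + (len(x) - L) // H            # number of complete frames that fit in x
    # Frame m holds x(n + mH) * w(n) for n = 0, ..., L-1
    frames = np.stack([x[m * H : m * H + L] * w for m in range(M)])
    # rfft computes sum_n frame[n] * exp(-j * 2*pi*k*n / L) for k = 0, ..., L/2,
    # i.e. Eq. (1) with omega_k = 2*pi*k / L (assumed bin frequencies)
    return np.fft.rfft(frames, axis=1)
```

With this layout, `np.abs(stft(x))[m, k]` is the magnitude $|X(m,k)|$ used by the methods discussed in the rest of this section.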
2.1 Harmonic–Percussive–Residual Separation
The Harmonic–Percussive–Residual (HPR) separation [2] builds upon the Harmonic–Percussive (HP) method for sines–transients decomposition [21]. Fitzgerald noted that since sinusoids form flat lines in time direction in the spectrogram and, vice versa, impulsive events appear as flat lines in the frequency direction, they can be detected (suppressed) using a median filter [21]. Fig. 1 shows the spectrogram of a signal consisting of a mixture of violin and castanets, whose time- and frequency-direction ridges are noticeable.
Figure 1: Spectrogram of a test signal consisting of the castanets and the violin playing simultaneously. [Axes: time (s) vs. frequency (kHz); color scale in dB.]
Horizontal (time-oriented) and vertical (frequency-oriented) median filtering can be applied to the spectrogram $X(m,k)$ to highlight the desired component and suppress the other [21]:
$$X_h(m,k) = \mathrm{med}\left[\, |X(m - \tfrac{L_h}{2} + 1, k)|, \ldots, |X(m + \tfrac{L_h}{2}, k)| \,\right] \quad (2)$$
and
$$X_v(m,k) = \mathrm{med}\left[\, |X(m, k - \tfrac{L_v}{2} + 1)|, \ldots, |X(m, k + \tfrac{L_v}{2})| \,\right], \quad (3)$$
where $\mathrm{med}[\cdot]$ is the median function, and $X_h$ and $X_v$ are the resulting horizontally- and vertically-enhanced magnitude spectrograms, respectively. Parameters $L_h$ and $L_v$ are the median filter lengths (in samples) in the time and frequency directions, respectively.
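The two median filters of Eqs. (2) and (3) might be sketched as follows. The helper names `running_median` and `enhance`, the default filter lengths, and the edge-replication padding at the spectrogram borders are assumptions for illustration; the text above does not specify how the borders are handled.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def running_median(A, length, axis):
    """Median over `length` neighbours along `axis`, with edge-replicated borders."""
    # For even `length`, the window spans offsets -length/2 + 1 ... +length/2,
    # matching the index range written in Eqs. (2) and (3).
    before, after = (length - 1) // 2, length // 2
    pad = [(0, 0)] * A.ndim
    pad[axis] = (before, after)
    padded = np.pad(A, pad, mode="edge")          # border handling is an assumption
    windows = sliding_window_view(padded, length, axis=axis)
    return np.median(windows, axis=-1)

def enhance(X, Lh=8, Lv=8):
    """Return (Xh, Xv) for a spectrogram X[m, k] (frames on axis 0, bins on axis 1)."""
    mag = np.abs(X)
    Xh = running_median(mag, Lh, axis=0)          # Eq. (2): median along time
    Xv = running_median(mag, Lv, axis=1)          # Eq. (3): median along frequency
    return Xh, Xv
```

A stationary partial (constant magnitude along time) survives the time-direction median and is removed by the frequency-direction one, while the opposite holds for an impulse that spans all bins of a single frame.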
Matrices $X_h$ and $X_v$ are then used to extract the tonalness $R_s$ and transientness $R_t$ matrices with the following elements [21]:
$$R_s(m,k) = \frac{X_h(m,k)}{X_h(m,k) + X_v(m,k)} \quad (4)$$
and
$$R_t(m,k) = 1 - R_s(m,k) = \frac{X_v(m,k)}{X_h(m,k) + X_v(m,k)}, \quad (5)$$
respectively. Fitzgerald [21] used $R_s$ and $R_t$ directly as spectral masks, whereas Driedger et al. [2] later introduced a controllable separation factor $\beta$ and a third class (noise) to describe those parts of the sound that are neither sines nor transients.
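As a small illustration of Eqs. (4) and (5), the sketch below computes both matrices from $X_h$ and $X_v$. The function name is hypothetical, and assigning $R_s = R_t = 0.5$ to bins where both filtered magnitudes are zero is an assumption added here to avoid a 0/0 division; the source does not discuss that case.

```python
import numpy as np

def tonalness_transientness(Xh, Xv):
    """Return (Rs, Rt) from the median-filtered magnitude spectrograms."""
    Xh = np.asarray(Xh, dtype=float)
    Xv = np.asarray(Xv, dtype=float)
    den = Xh + Xv
    Rs = np.full_like(den, 0.5)                # fallback for silent bins (assumption)
    np.divide(Xh, den, out=Rs, where=den > 0)  # Eq. (4) wherever it is defined
    Rt = 1.0 - Rs                              # Eq. (5)
    return Rs, Rt
```

By construction, `Rs + Rt` equals one in every bin, which is what allows Fitzgerald's two-way masks to reconstruct the input exactly.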
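From Eqs. (4) and (5), a set of hard spectral masks $S$ (sinusoidal), $T$ (transient), and $N$ (noise) can be derived as follows [2]:
$$S(m,k) = \begin{cases} 1, & \text{if } R_s(m,k)/R_t(m,k) > \beta \\ 0, & \text{otherwise,} \end{cases} \quad (6)$$
$$T(m,k) = \begin{cases} 1, & \text{if } R_t(m,k)/R_s(m,k) > \beta \\ 0, & \text{otherwise,} \end{cases} \quad (7)$$
and
$$N(m,k) = 1 - S(m,k) - T(m,k). \quad (8)$$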
Figure 2: Hard masks for transients (T), noise (N), and sines (S), as used in the HPR method [2], for separation factor $\beta = 2.5$. [Horizontal axis: tonalness $R_s$.]
Their relationship for a chosen $\beta$ is shown in Fig. 2. The spectral masks are then imposed on $X(m,k)$ to retrieve the three desired spectral components:
$$X_s = S \odot X, \quad X_t = T \odot X, \quad X_n = N \odot X, \quad (9)$$
where $\odot$ represents the Hadamard product, or element-wise multiplication.
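The hard masks of Eqs. (6)-(8) and their application in Eq. (9) can be sketched as below. The function names are hypothetical, and the ratio tests are rewritten as products (an equivalent form for positive denominators) purely to avoid dividing by zero when $R_t$ or $R_s$ vanishes.

```python
import numpy as np

def hard_masks(Rs, Rt, beta=2.5):
    """Binary masks of Eqs. (6)-(8) from tonalness Rs and transientness Rt."""
    # Rs > beta * Rt is equivalent to Rs / Rt > beta when Rt > 0
    S = (Rs > beta * Rt).astype(float)   # Eq. (6)
    T = (Rt > beta * Rs).astype(float)   # Eq. (7)
    N = 1.0 - S - T                      # Eq. (8)
    return S, T, N

def apply_masks(X, S, T, N):
    """Element-wise (Hadamard) masking of the complex spectrogram, Eq. (9)."""
    return S * X, T * X, N * X
```

For $\beta \geq 1$ a bin cannot satisfy both conditions at once, so the three masks are mutually exclusive and the masked spectrograms add back up to $X$.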
It has been observed that the quality of the HPR separation varies considerably for the sinusoidal and the transient components depending on the choice of the analysis window length $L$ [29, 30, 2]. A large window length $L$ for the STFT, ensuring sufficient frequency resolution but poor time resolution, results in a faultless extraction of sines but a low-quality transient output; conversely, a smaller value of $L$ leads to a better extraction of the transient component but a worse description of sines.
To overcome the time-frequency limitation, Driedger et al. [2] divided the decomposition process into two cascaded iterations. In the first stage, a longer analysis window is applied to extract the sinusoidal component [2], while transients and noise remain mixed together:
$$x_s = \mathrm{ISTFT}\left[ S_1 \odot X \right], \quad (10)$$
$$x_\mathrm{res} = \mathrm{ISTFT}\left[ (T_1 + N_1) \odot X \right], \quad (11)$$
where ISTFT is the inverse STFT. Subsequently, the residual from the first stage is separated again with shorter windowing, leading to the final decomposition [2]:
$$x_t = \mathrm{ISTFT}\left[ T_2 \odot X_\mathrm{res} \right], \quad (12)$$
$$x_n = \mathrm{ISTFT}\left[ (S_2 + N_2) \odot X_\mathrm{res} \right]. \quad (13)$$
The noise signal $x_n$ will also contain residuals of sine components, unless they were perfectly separated in the first stage. Fig. 3 shows the separated STN components of the example audio signal used above, obtained with the HPR method.
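As a consistency check on Eqs. (10)-(13), the following sketch shows why the three output signals add back up to the input. It assumes, beyond what is stated above, that the ISTFT is linear and exactly inverts the STFT for both window lengths, that $X_\mathrm{res}$ denotes the short-window STFT of $x_\mathrm{res}$, and that the masks of each stage sum to one as in Eq. (8).

```latex
\begin{align*}
x_s + x_\mathrm{res} &= \mathrm{ISTFT}\left[(S_1 + T_1 + N_1) \odot X\right]
                      = \mathrm{ISTFT}[X] = x, \\
x_t + x_n &= \mathrm{ISTFT}\left[(T_2 + S_2 + N_2) \odot X_\mathrm{res}\right]
           = \mathrm{ISTFT}[X_\mathrm{res}] = x_\mathrm{res}, \\
\Rightarrow\quad x &= x_s + x_t + x_n .
\end{align*}
```

Under these assumptions the cascade changes only how the energy is distributed among the three components, not their sum.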
2.2 STN Separation Based on Structure Tensor
Füg et al. [3] noted that sounds exhibiting vibrato, which carry tonal information and are perceived as sines, do not present strictly horizontal structures in the spectrogram. This results in a leakage of energy between different spectral components. The strictness of the median filtering can be overcome using a structure tensor, a widely used tool in image processing, to obtain a measure of the frequency change rate and local anisotropies in the spectrogram, which will then be used as features to define the spectral masks [3].
The structure tensor matrix is obtained from the partial derivatives of the spectrogram with respect to time and frequency, and the orientation angles $\alpha$ and the anisotropy $C$ of the spectral bins are computed from the eigenvalues and the eigenvectors of such a matrix, as described in [3]. The instantaneous frequency change rate $R$ is computed for each bin from the orientation angles:
$$R(m,k) = \frac{f_s^2}{HM} \tan[\alpha(m,k)], \quad (14)$$
where $f_s$ is the sample rate. The spectral masks are then obtained as follows:
$$S(m,k) = \begin{cases} 1, & \text{if } |R(m,k)| \leq r_s \wedge C(m,k) > c \\ 0, & \text{otherwise,} \end{cases} \quad (15)$$
$$T(m,k) = \begin{cases} 1, & \text{if } |R(m,k)| \geq r_t \wedge C(m,k) > c \\ 0, & \text{otherwise,} \end{cases} \quad (16)$$
where $c$ is the anisotropy threshold and $r_s$ and $r_t$ are the frequency rate thresholds for the sinusoidal and the transient component, respectively. The noise mask is computed as described in Eq. (8), and the spectral components are then derived as in Eq. (9).
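Given the orientation angles and anisotropy, the thresholding in Eqs. (14)-(16) reduces to a few array comparisons, sketched below. The computation of `alpha` and `C` from the structure tensor is not shown, since it follows [3]; the function name and the requirement $r_s < r_t$ (so that the sine and transient masks never overlap) are assumptions, and `M` is used exactly as it appears in Eq. (14).

```python
import numpy as np

def st_masks(alpha, C, fs, H, M, r_s, r_t, c):
    """Hard STN masks from orientation angles `alpha` (radians) and anisotropy `C`."""
    R = fs**2 / (H * M) * np.tan(alpha)                 # Eq. (14): frequency change rate
    anisotropic = C > c                                 # enough local structure in the bin
    S = ((np.abs(R) <= r_s) & anisotropic).astype(float)   # Eq. (15)
    T = ((np.abs(R) >= r_t) & anisotropic).astype(float)   # Eq. (16)
    N = 1.0 - S - T                                     # Eq. (8); assumes r_s < r_t
    return S, T, N
```

Bins with low anisotropy, or with a change rate between the two thresholds, fall into the noise class.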
Fig. 4 shows the separated STN components of the example audio signal using the ST method. Some differences can be observed in comparison to the separation results of the HPR method in Fig. 3, such as some holes in the transient events at low and middle frequencies in Fig. 4(b).
2.3 Fuzzy Separation
Damskägg and Välimäki [4] introduced the concept of fuzzy classification of the spectral bins, which corresponds to a non-binary classification using continuous values between 0 and 1. This method was later extended by Moliner et al. [8] to ensure perfect reconstruction, i.e. all masks summing up to unity. In [8], a third membership function for noisiness $R_n$ is derived from Eqs. (4) and (5):
$$R_n(m,k) = 1 - \sqrt{|R_s(m,k) - R_t(m,k)|}. \quad (17)$$
The soft spectral masks are computed as
$$S(m,k) = R_s(m,k) - \tfrac{1}{2} R_n(m,k), \quad (18)$$
$$T(m,k) = R_t(m,k) - \tfrac{1}{2} R_n(m,k), \quad (19)$$
and
$$N(m,k) = 1 - S(m,k) - T(m,k) = R_n(m,k). \quad (20)$$
Their relationship is shown in Fig. 5. The spectral masks are once again imposed on $X(m,k)$ to obtain the spectral components using the Hadamard product, as in Eq. (9).
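The soft masks of Eqs. (17)-(20) depend only on the tonalness, since $R_t = 1 - R_s$ by Eq. (5). A minimal sketch, with a hypothetical function name, is:

```python
import numpy as np

def fuzzy_masks(Rs):
    """Soft STN masks of Eqs. (17)-(20) from the tonalness Rs (values in [0, 1])."""
    Rs = np.asarray(Rs, dtype=float)
    Rt = 1.0 - Rs                            # Eq. (5)
    Rn = 1.0 - np.sqrt(np.abs(Rs - Rt))      # Eq. (17): noisiness
    S = Rs - 0.5 * Rn                        # Eq. (18)
    T = Rt - 0.5 * Rn                        # Eq. (19)
    N = Rn                                   # Eq. (20)
    return S, T, N
```

Because $S + T = R_s + R_t - R_n = 1 - R_n$, the three masks sum to one in every bin, which is the perfect reconstruction property mentioned above. A bin with $R_s = 0.5$ is assigned entirely to noise, whereas $R_s = 1$ and $R_s = 0$ give a pure sine and a pure transient, respectively.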
Fig. 6 shows the separated STN components of the example signal using the fuzzy masks. The results are again slightly different from those obtained with the two previous techniques, presented in Figs. 3 and 4. One apparent feature is the leakage of energy from the other components to the transient component at frequencies below about 5 kHz, shown in Fig. 6(b).