Submitted for publication to the Journal of Audio Engineering Society on October 20th, 2022.

Enhanced Fuzzy Decomposition of Sound Into Sines, Transients and Noise



Preprint, compiled December 1, 2022

Leonardo Fierro and Vesa Välimäki

leonardo.fierro@aalto.fi | vesa.valimaki@aalto.fi

Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland

Abstract

The decomposition of sounds into sines, transients, and noise is a long-standing research problem in audio processing. The current solutions for this three-way separation detect either horizontal and vertical structures or anisotropy and orientations in the spectrogram to identify the properties of each spectral bin and classify it as sinusoidal, transient, or noise. This paper proposes an enhanced three-way decomposition method based on fuzzy logic, enabling soft masking while preserving the perfect reconstruction property. The proposed method allows each spectral bin to simultaneously belong to two classes, sine and noise or transient and noise. Results of a subjective listening test against three other techniques are reported, showing that the proposed decomposition yields a better or comparable quality. The main improvement appears in transient separation, which enjoys little or no loss of energy or leakage from the other components and performs well for test signals presenting strong transients. The audio quality of the separation is shown to depend on the complexity of the input signal for all tested methods. The proposed method helps improve the quality of various audio processing applications. A successful implementation over a state-of-the-art time-scale modification method is reported as an example.

1 INTRODUCTION

Decomposing an audio signal into its sinusoidal, transient, and noise (STN) components has been drawing research interest for over two decades [ ]. It is a widely used tool in a variety of audio processing applications, ranging from beat tracking [ ] and tonality estimation [ ] to reduction of spectral complexity in cochlear implants [ ] and to virtual bass enhancement [ ]. The STN separation is also helpful in time-scale modification [1,9,10], where it has been combined with the notion of fuzzy logic in order to improve [ ] or evaluate the audio quality [ ]. In all these audio applications, it is helpful to process sine, transient, and noise components independently of each other. This paper proposes improvements to the fuzzy STN decomposition of audio signals.

The STN separation relies on the assumption that any audio signal can be described as a linear combination of three independent components: tonal content (sines), impulsive events (transients), and a residual part (noise) that belongs to neither of the other two classes and adds nuance to the sound. Historically, additive synthesis modeled any sound as a sum of sinusoidal components [ ]. Serra and Smith expanded the additive synthesis method by introducing the noise class, which was obtained as a residual after a sinusoidal model was subtracted from the original signal [ ]. In the resulting method, called spectral modeling synthesis [ ], the frequency, amplitude, and phase of the sinusoidal components were estimated from the short-time Fourier transform (STFT) using a method similar to the McAulay-Quatieri algorithm [17].

The three-way decomposition was first introduced by Verma et al. [ ], who showed that including a third component for transients was greatly beneficial in the context of signal analysis and synthesis, as it avoided the smearing of transients, which was a weakness in sines + noise models. Levine and Smith also showed that the adaptiveness of the STN model made it suitable for audio compression and for pitch- and time-scale modification [20].

Fitzgerald discovered that it was possible to decompose an audio signal into its sinusoidal and transient components by using spectral masks extracted via horizontal and vertical median filtering of the STFT [ ]. Driedger et al. [ ] reintroduced the three-way separation by updating Fitzgerald's method: the noise component could be obtained by retrieving spurious information after extracting the other two components with median filtering. Füg et al. [ ] proposed a follow-up method involving the use of structure tensors (ST) to find predominant orientation angles and anisotropy in the time-frequency signal representation, showing an improvement in the separation quality for sounds with vibrato.

Other recent approaches for sines-transients separation include a kernel additive matrix [ ], non-negative matrix factorization [ ], improved sinusoidal modeling [ ], and neural networks [ ]. However, these methods do not involve a third class for the noise component, hence they are not discussed further in this paper.

It should be noted that the STN decomposition does not directly relate to traditional source separation, which usually aims at retrieving musical instruments in a sound mixture [ ] or speech from noisy background sources [ ]. According to the STN formulation, even strongly percussive sources, such as drums, will have a sinusoidal and a noise component (unless they are perfect, synthetic pulses). Similarly, strongly harmonic sources, such as the violin, will hold, in addition to a sinusoidal part, a transient component in their attack and a noise component that describes nuances such as the bowing noise.

arXiv:2210.14041v2 [eess.AS] 30 Nov 2022


While both Driedger et al. [ ] and Füg et al. [ ] applied hard binary masks to define the sinusoidal, transient, and noise classes, Damskägg and Välimäki [ ] introduced the concept of fuzzy logic in the context of time-scale modification. The fuzzy classification (FZ) allows spectral bins to simultaneously contribute to the three classes, providing a more refined basis for the three-way separation [ ]. This decomposition method was then extended to objective evaluation by Fierro and Välimäki [12] and improved by Moliner et al. [ ] to allow perfect reconstruction by ensuring that the three soft spectral masks sum up to unity.

This work proposes a novel way to estimate fuzzy soft masks for STN decomposition of audio signals. The proposed method allows for intermediate classifications of the spectral bins between two components: sines vs. noise and transients vs. noise. This two-stage decomposition is shown to improve the overall sound quality of the separated components, in particular for transients. The masks ensure perfect reconstruction and are optimized for each class to have a large constant region followed by a fast but smooth transition to the adjacent class. The transition slope is refined for both decomposition stages to provide the best separation quality.

The rest of this paper is structured as follows. Sec. 2 discusses previous three-way separation techniques. Sec. 3 introduces the new STN decomposition method, which optimally extracts the sinusoidal and transient components. Sec. 4 evaluates the proposed method against three previous techniques. Sec. 5 applies the proposed method to time-scale modification, and Sec. 6 concludes.

2 RELATED WORK

This section summarizes three previous STN decomposition methods based on a spectrogram representation of the input signal. A spectrogram $X$ is an $M$-by-$K$ matrix representing the time-frequency behavior of an audio signal $x$. Each element $X(m,k)$ is computed using the STFT:

$$X(m,k) = \sum_{n=0}^{L-1} x(n + mH)\, w(n)\, e^{-j\omega_k n}, \tag{1}$$

where $n$ is the sample index, $m = 0, \ldots, M-1$ is the temporal frame index, $k = 0, \ldots, K-1$ is the spectral bin index, $w$ is the analysis window, $H$ is the hop size, $L$ is the window length in samples, which is assumed to be even, $j$ is the imaginary unit, and $\omega_k$ is the normalized central frequency of the $k$th spectral bin.
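As an illustration of Eq. (1), the following minimal sketch evaluates the sum directly with NumPy. It is not taken from the paper: the function name `stft`, the choice of $K = L$ bins with $\omega_k = 2\pi k / L$, and the absence of padding at the signal edges are all assumptions made for this example.

```python
import numpy as np

def stft(x, w, H):
    """Direct evaluation of Eq. (1) for every frame m and bin k.

    Assumptions (not from the paper): K = L bins with omega_k = 2*pi*k/L,
    and only frames that fit entirely inside x are computed (no padding).
    """
    L = len(w)                          # window length in samples
    M = 1 + (len(x) - L) // H           # number of complete frames
    n = np.arange(L)                    # sample index inside a frame
    k = np.arange(L)                    # spectral bin index
    kernel = np.exp(-2j * np.pi * np.outer(n, k) / L)   # e^{-j omega_k n}
    frames = np.stack([x[m * H : m * H + L] * w for m in range(M)])
    return frames @ kernel              # M-by-K spectrogram X(m, k)
```

In practice an FFT routine such as `np.fft.fft` applied to the windowed frames gives the same result at lower cost; the explicit kernel is used here only to mirror the terms of Eq. (1).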

2.1 Harmonic–Percussive–Residual Separation

The Harmonic–Percussive–Residual (HPR) separation [ ] builds upon the Harmonic–Percussive (HP) method for sines-transients decomposition [ ]. Fitzgerald noted that since sinusoids form flat lines in the time direction in the spectrogram and, vice versa, impulsive events appear as flat lines in the frequency direction, they can be detected (suppressed) using a median filter [ ]. Fig. 1 shows the spectrogram of a signal consisting of a mixture of violin and castanets, whose time- and frequency-direction ridges are noticeable.

Figure 1: Spectrogram of a test signal consisting of the castanets and the violin playing simultaneously. (Axes: time in s, frequency in kHz; color scale from -80 to 0 dB.)

Horizontal (time-oriented) and vertical (frequency-oriented) median filtering can be applied to the spectrogram $X(m,k)$ to highlight the desired component and suppress the other [21]:

$$X_h(m,k) = \mathrm{med}\left( \left|X\!\left(m - \tfrac{L_h}{2} + 1, k\right)\right|, \ldots, \left|X\!\left(m + \tfrac{L_h}{2}, k\right)\right| \right) \tag{2}$$

and

$$X_v(m,k) = \mathrm{med}\left( \left|X\!\left(m, k - \tfrac{L_v}{2} + 1\right)\right|, \ldots, \left|X\!\left(m, k + \tfrac{L_v}{2}\right)\right| \right), \tag{3}$$

where $\mathrm{med}(\cdot)$ is the median function, and $X_h$ and $X_v$ are the resulting horizontally- and vertically-enhanced magnitude spectrograms, respectively. Parameters $L_h$ and $L_v$ are the median filter lengths (in samples) in the time and frequency directions, respectively.

Matrices $X_h$ and $X_v$ are then used to extract the tonalness $R_s$ and transientness $R_t$ matrices with the following elements [21]:

$$R_s(m,k) = \frac{X_h(m,k)}{X_h(m,k) + X_v(m,k)} \tag{4}$$

and

$$R_t(m,k) = 1 - R_s(m,k) = \frac{X_v(m,k)}{X_h(m,k) + X_v(m,k)}, \tag{5}$$

respectively. Fitzgerald [ ] used $R_s$ and $R_t$ directly as spectral masks, whereas Driedger et al. [ ] later introduced a controllable separation factor $\beta$ and a third class (noise) to describe those parts of the sound that are neither sines nor transients.
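The median filtering of Eqs. (2) and (3) and the tonalness and transientness of Eqs. (4) and (5) can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' implementation: the helper names, the edge-replicating padding, and the value 0.5 assigned to bins where both filtered spectrograms are zero are assumptions. The spectrogram is assumed to be stored as an $M$-by-$K$ array, frames along axis 0 and bins along axis 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter(A, length, axis):
    """Running median of `length` values along `axis`.

    For even `length` the window spans i - length/2 + 1 ... i + length/2,
    matching the index range of Eqs. (2)-(3). Edge padding is an assumption.
    """
    pad = [(0, 0)] * A.ndim
    pad[axis] = ((length - 1) // 2, length // 2)
    Ap = np.pad(A, pad, mode="edge")
    return np.median(sliding_window_view(Ap, length, axis=axis), axis=-1)

def tonalness_transientness(X, Lh, Lv):
    """Return (Rs, Rt) of Eqs. (4)-(5) for an M-by-K spectrogram X."""
    mag = np.abs(X)
    Xh = median_filter(mag, Lh, axis=0)   # along time (frame index m)
    Xv = median_filter(mag, Lv, axis=1)   # along frequency (bin index k)
    den = Xh + Xv
    # Assumption: bins with Xh = Xv = 0 get the neutral value Rs = 0.5.
    Rs = np.divide(Xh, den, out=np.full_like(den, 0.5), where=den > 0)
    return Rs, 1.0 - Rs
```

For a toy spectrogram with one line along time and one line along frequency, the first yields $R_s$ close to 1 and the second $R_t$ close to 1, which is the behavior described above.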

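From Eqs. (4) and (5), a set of hard spectral masks $S$ (sinusoidal), $T$ (transient), and $N$ (noise) can be derived as follows [2]:

$$S(m,k) = \begin{cases} 1, & \text{if } R_s(m,k)/R_t(m,k) > \beta \\ 0, & \text{otherwise,} \end{cases} \tag{6}$$

$$T(m,k) = \begin{cases} 1, & \text{if } R_t(m,k)/R_s(m,k) > \beta \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

and

$$N(m,k) = 1 - S(m,k) - T(m,k). \tag{8}$$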


Figure 2: Hard masks for transients (T), noise (N), and sines (S) as a function of tonalness $R_s$, as used in the HPR method [2], for separation factor $\beta = 2.5$.

Their relationship for a chosen $\beta$ is shown in Fig. 2. The spectral masks are then imposed on $X(m,k)$ to retrieve the three desired spectral components:

$$X_s = S \odot X, \quad X_t = T \odot X, \quad X_n = N \odot X, \tag{9}$$

where $\odot$ represents the Hadamard product, or element-wise multiplication.
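A minimal sketch of the hard masks of Eqs. (6)-(8) and their application in Eq. (9) is given below. The function names are hypothetical, and the ratio tests are rewritten as products (for example $R_s > \beta R_t$) to avoid dividing by zero, which is an implementation choice of this example and equivalent to the original conditions whenever the denominator is positive.

```python
import numpy as np

def hpr_hard_masks(Rs, beta=2.5):
    """Hard masks S, T, N of Eqs. (6)-(8) from the tonalness Rs."""
    Rt = 1.0 - Rs                          # Eq. (5)
    S = (Rs > beta * Rt).astype(float)     # Rs / Rt > beta
    T = (Rt > beta * Rs).astype(float)     # Rt / Rs > beta
    N = 1.0 - S - T                        # Eq. (8)
    return S, T, N

def apply_masks(X, S, T, N):
    """Eq. (9): element-wise (Hadamard) product of each mask with X."""
    return S * X, T * X, N * X
```

With $\beta = 2.5$, a bin is labeled sinusoidal when $R_s > \beta/(1+\beta) \approx 0.71$ and transient when $R_s < 1/(1+\beta) \approx 0.29$; everything in between goes to the noise class. Because the three masks sum to one, the three masked spectrograms sum back to $X$.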

It has been observed that the quality of the HPR separation varies considerably for the sinusoidal and the transient components depending on the choice of the analysis window length $L$ [ ]. A large window length $L$ for the STFT, ensuring sufficient frequency resolution but poor time resolution, results in a faultless extraction of sines but a low-quality transient output; conversely, a smaller value of $L$ leads to a better extraction of the transient component but a worse description of sines.

To overcome the time-frequency limitation, Driedger et al. [ ] divided the decomposition process into two cascaded iterations [ ]. In the first stage, a longer analysis window is applied to extract the sinusoidal component [ ], while transients and noise remain mixed together:

$$x_s = \mathrm{ISTFT}\left[ S_1 \odot X \right], \tag{10}$$

$$x_{\mathrm{res}} = \mathrm{ISTFT}\left[ (T_1 + N_1) \odot X \right], \tag{11}$$

where ISTFT is the inverse STFT and the subscript 1 marks the masks of the first stage. Subsequently, the residual from the first stage is separated again with shorter windowing, leading to the final decomposition [2]:

$$x_t = \mathrm{ISTFT}\left[ T_2 \odot X_{\mathrm{res}} \right], \tag{12}$$

$$x_n = \mathrm{ISTFT}\left[ (S_2 + N_2) \odot X_{\mathrm{res}} \right]. \tag{13}$$

The noise signal $x_n$ will also contain residuals of sine components, unless they were perfectly separated in the first stage.

Fig. 3 shows the separated STN components of the example audio signal used above, obtained with the HPR method.
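The two-stage cascade of Eqs. (10)-(13) can be sketched end to end as follows. This is a simplified illustration under several assumptions that are not from the paper: a periodic Hann window with hop $L/2$ (so that plain overlap-add inverts the STFT), zero-padding by one window length at both ends, the same median filter lengths in both stages, and hypothetical default values for all parameters.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def stft(x, L):
    """STFT with a periodic Hann window of even length L and hop L/2."""
    H = L // 2
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(L) / L)
    xp = np.pad(x, (L, L))                       # zero-pad both ends
    M = 1 + (len(xp) - L) // H
    frames = np.stack([xp[m * H : m * H + L] * w for m in range(M)])
    return np.fft.rfft(frames, axis=1)

def istft(X, L, n_out):
    """Overlap-add inverse; exact here because the hopped windows sum to one."""
    H = L // 2
    frames = np.fft.irfft(X, n=L, axis=1)
    y = np.zeros(H * (len(frames) - 1) + L)
    for m, frame in enumerate(frames):
        y[m * H : m * H + L] += frame
    return y[L : L + n_out]                      # drop the padding

def median_filter(A, length, axis):
    pad = [(0, 0)] * A.ndim
    pad[axis] = ((length - 1) // 2, length // 2)
    Ap = np.pad(A, pad, mode="edge")
    return np.median(sliding_window_view(Ap, length, axis=axis), axis=-1)

def hard_masks(X, Lh, Lv, beta):
    """Eqs. (6)-(8); Rs/Rt > beta is tested as Xh > beta * Xv."""
    mag = np.abs(X)
    Xh = median_filter(mag, Lh, axis=0)          # time direction
    Xv = median_filter(mag, Lv, axis=1)          # frequency direction
    S = (Xh > beta * Xv).astype(float)
    T = (Xv > beta * Xh).astype(float)
    return S, T, 1.0 - S - T

def two_stage_hpr(x, L1=2048, L2=256, Lh=8, Lv=8, beta=2.5):
    n = len(x)
    X = stft(x, L1)                              # stage 1: long window
    S1, T1, N1 = hard_masks(X, Lh, Lv, beta)
    xs = istft(S1 * X, L1, n)                    # Eq. (10)
    xres = istft((T1 + N1) * X, L1, n)           # Eq. (11)
    Xres = stft(xres, L2)                        # stage 2: short window
    S2, T2, N2 = hard_masks(Xres, Lh, Lv, beta)
    xt = istft(T2 * Xres, L2, n)                 # Eq. (12)
    xn = istft((S2 + N2) * Xres, L2, n)          # Eq. (13)
    return xs, xt, xn
```

Since the masks of each stage sum to one and the inverse transform is linear, the three outputs of this sketch add up to the input signal. How cleanly each component is separated depends on the parameter values, which are placeholders here.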

2.2 STN Separation Based on Structure Tensor

Füg et al. [ ] noted that sounds exhibiting vibrato, which carry tonal information and are perceived as sines, do not present strictly horizontal structures in the spectrogram. This results in a leakage of energy between different spectral components. The strictness of the median filtering can be overcome using a structure tensor, a widely used tool in image processing, to obtain a measure of the frequency change rate and local anisotropies in the spectrogram, which are then used as features to define the spectral masks [3].

The structure tensor matrix is obtained from the partial derivatives of the spectrogram with respect to time and frequency, and the orientation angles $\alpha$ and the anisotropy $C$ of the spectral bins are computed from the eigenvalues and the eigenvectors of such a matrix, as described in [ ]. The instantaneous frequency change rate $R$ is computed for each bin from the orientation angles:

$$R(m,k) = \frac{f_s^2}{HM} \tan[\alpha(m,k)], \tag{14}$$

where $f_s$ is the sample rate. The spectral masks are then obtained as follows:

$$S(m,k) = \begin{cases} 1, & \text{if } |R(m,k)| \le r_s \wedge C(m,k) > c \\ 0, & \text{otherwise,} \end{cases} \tag{15}$$

$$T(m,k) = \begin{cases} 1, & \text{if } |R(m,k)| \ge r_t \wedge C(m,k) > c \\ 0, & \text{otherwise,} \end{cases} \tag{16}$$

where $c$ is the anisotropy threshold and $r_s$ and $r_t$ are the frequency rate thresholds for the sinusoidal and the transient component, respectively. The noise mask is computed as described in Eq. (8) and the spectral components are then derived as in Eq. (9).
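Given a frequency change rate $R$ and an anisotropy $C$ for every bin, the masks of Eqs. (15), (16), and (8) reduce to a few comparisons, as in the sketch below. The function name is hypothetical and the computation of $R$ and $C$ from the structure tensor is not shown. The masks are disjoint only if $r_s < r_t$, which is assumed here.

```python
import numpy as np

def st_masks(R, C, r_s, r_t, c):
    """Hard masks from frequency change rate R and anisotropy C.

    Assumes r_s < r_t so that no bin is labeled both sine and transient.
    """
    anisotropic = C > c                                       # clear local orientation
    S = ((np.abs(R) <= r_s) & anisotropic).astype(float)      # Eq. (15)
    T = ((np.abs(R) >= r_t) & anisotropic).astype(float)      # Eq. (16)
    N = 1.0 - S - T                                           # Eq. (8)
    return S, T, N
```

Bins with low anisotropy, and anisotropic bins whose change rate falls between the two thresholds, end up in the noise mask.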

Fig. 4 shows the separated STN components of the example audio signal using the ST method. Some differences can be observed in comparison to the separation results of the HPR method in Fig. 3, such as some holes in the transient events at low and middle frequencies in Fig. 4(b).

2.3 Fuzzy Separation

Damskägg and Välimäki [ ] introduced the concept of fuzzy classification of the spectral bins, which corresponds to a non-binary classification using continuous values between 0 and 1. This method was later extended by Moliner et al. [ ] to ensure perfect reconstruction, i.e., all masks summing up to unity. In [ ], a third membership function for noisiness $R_n$ is derived from Eqs. (4) and (5):

$$R_n(m,k) = 1 - \sqrt{|R_s(m,k) - R_t(m,k)|}. \tag{17}$$

The soft spectral masks are computed as

$$S(m,k) = R_s(m,k) - \tfrac{1}{2} R_n(m,k), \tag{18}$$

$$T(m,k) = R_t(m,k) - \tfrac{1}{2} R_n(m,k), \tag{19}$$

and

$$N(m,k) = 1 - S(m,k) - T(m,k) = R_n(m,k). \tag{20}$$

Their relationship is shown in Fig. 5. The spectral masks are once again imposed on $X(m,k)$ to obtain the spectral components using the Hadamard product, as in Eq. (9).
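The fuzzy masks of Eqs. (17)-(20) can be written in a few lines, which also makes the perfect reconstruction property easy to check numerically. The function name is an assumption of this sketch, and the tonalness $R_s$ is taken as given.

```python
import numpy as np

def fuzzy_masks(Rs):
    """Soft masks S, T, N of Eqs. (18)-(20) from the tonalness Rs."""
    Rt = 1.0 - Rs                             # Eq. (5)
    Rn = 1.0 - np.sqrt(np.abs(Rs - Rt))       # Eq. (17), noisiness
    S = Rs - 0.5 * Rn                         # Eq. (18)
    T = Rt - 0.5 * Rn                         # Eq. (19)
    N = Rn                                    # Eq. (20)
    return S, T, N
```

Since $R_s + R_t = 1$, the sum $S + T + N = R_s + R_t - R_n + R_n$ equals one for every bin. At the extremes, $R_s = 1$ gives a pure sine bin, $R_s = 0$ a pure transient bin, and $R_s = 0.5$ a pure noise bin.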

Fig. 6 shows the separated STN components of the example signal using the fuzzy masks. The results are again slightly different from those obtained with the two previous techniques, presented in Figs. 3 and 4. One apparent feature is the leakage of energy from the other components to the transient component at frequencies below about 5 kHz, shown in Fig. 6(b).
