
Submitted for publication to the Journal of the Audio
Engineering Society on October 20th, 2022.
Enhanced Fuzzy Decomposition of Sound Into
Sines, Transients and Noise
Preprint, compiled December 1, 2022
Leonardo Fierro and Vesa Välimäki
leonardo.fierro@aalto.fi | vesa.valimaki@aalto.fi
Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
Abstract
The decomposition of sounds into sines, transients, and noise is a long-standing research problem in audio
processing. The current solutions for this three-way separation detect either horizontal and vertical structures or
anisotropy and orientations in the spectrogram to identify the properties of each spectral bin and classify it as
sinusoidal, transient, or noise. This paper proposes an enhanced three-way decomposition method based on
fuzzy logic, enabling soft masking while preserving the perfect reconstruction property. The proposed method
allows each spectral bin to simultaneously belong to two classes, sine and noise or transient and noise. Results of
a subjective listening test against three other techniques are reported, showing that the proposed decomposition
yields a better or comparable quality. The main improvement appears in transient separation, which enjoys little
or no loss of energy or leakage from the other components and performs well for test signals presenting strong
transients. The audio quality of the separation is shown to depend on the complexity of the input signal for all
tested methods. The proposed method helps improve the quality of various audio processing applications. A
successful implementation over a state-of-the-art time-scale modification method is reported as an example.
1 INTRODUCTION
Decomposing an audio signal into its sinusoidal, transient, and
noise (STN) components has been drawing research interest for
over two decades [1, 2, 3, 4]. It is a widely used tool in a variety
of audio processing applications, ranging from beat tracking [5]
and tonality estimation [6] to reduction of spectral complexity
in cochlear implants [7] and to virtual bass enhancement [8].
The STN separation is also helpful in time-scale modification
[1, 9, 10], where it has been combined with the notion of fuzzy
logic in order to improve [4, 11] or evaluate the audio quality
[12]. In all these audio applications, it is helpful to process
sine, transient, and noise components independently of each
other. This paper proposes improvements to the fuzzy STN
decomposition of audio signals.
The STN separation relies on the assumption that any audio
signal can be described as a linear combination of three independent
components: tonal content (sines), impulsive events (transients),
and a residual part (noise) that belongs to neither of
the other two classes and adds nuance to the sound. Historically,
additive synthesis modeled any sound as a sum of sinusoidal
components [13, 14, 15]. Serra and Smith expanded the additive
synthesis method by introducing the noise class, which was
obtained as a residual after a sinusoidal model was subtracted
from the original signal [16]. In the resulting method, called
spectral modeling synthesis [16], the frequency, amplitude, and
phase of the sinusoidal components were estimated from the
short-time Fourier transform (STFT) using a method similar to
the McAulay-Quatieri algorithm [17].
The three-way decomposition was first introduced by Verma et
al. [1, 18, 19], who showed that including a third component for
transients was greatly beneficial in the context of signal analysis
and synthesis, as it avoided the smearing of transients, which
was a weakness in sines + noise models. Levine and Smith
also showed that the adaptiveness of the STN model made it
suitable for audio compression and for pitch- and time-scale
modification [20].
Fitzgerald discovered that it was possible to decompose an audio
signal into its sinusoidal and transient components by using
spectral masks extracted via horizontal and vertical median
filtering of the STFT [21]. Driedger et al. [2] reintroduced the
three-way separation by updating Fitzgerald’s method: the noise
component could be obtained by retrieving spurious information
after extracting the other two components with median filtering.
Füg et al. [3] proposed a follow-up method involving the use of
structure tensors (ST) to find predominant orientation angles and
anisotropy in the time-frequency signal representation, showing
an improvement in the separation quality for sounds with
vibrato.
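
The median-filtering idea described above can be illustrated with a
minimal Python sketch. It is not the implementation of [21] or [2]; it
is a single-stage simplification that assumes NumPy and SciPy. The
function name `median_stn_sketch`, the STFT size, the filter lengths
`len_t` and `len_f`, the threshold `beta`, and the binary thresholding
rule are all illustrative assumptions, not values taken from the cited
works.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft


def median_stn_sketch(x, fs, nfft=1024, len_t=17, len_f=17, beta=2.0):
    """Split x into (sines, transients, noise) with hard spectral masks.

    All parameter defaults are illustrative assumptions.
    """
    # STFT with SciPy defaults (Hann window, 50% overlap); X is (freq, time).
    _, _, X = stft(x, fs=fs, nperseg=nfft)
    mag = np.abs(X)

    # Median along time keeps horizontal ridges (stable partials).
    horiz = median_filter(mag, size=(1, len_t))
    # Median along frequency keeps vertical ridges (broadband onsets).
    vert = median_filter(mag, size=(len_f, 1))

    eps = np.finfo(float).eps
    # A bin is "sine" or "transient" only if one structure clearly dominates.
    mask_s = horiz / (vert + eps) > beta
    mask_t = vert / (horiz + eps) >= beta
    # Everything that is neither clearly horizontal nor vertical is residual.
    mask_n = ~(mask_s | mask_t)

    parts = []
    for mask in (mask_s, mask_t, mask_n):
        _, y = istft(X * mask, fs=fs, nperseg=nfft)
        parts.append(y[: len(x)])  # drop the zero padding added by stft
    return tuple(parts)
```

Calling `median_stn_sketch(x, fs)` on a mono floating-point signal
returns the three time-domain components in the order sines,
transients, noise. For `beta` of at least 1 the three binary masks are
disjoint and cover every bin, so the components sum back to the input
up to numerical precision. Hard masks of this kind assign each bin to
exactly one class.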
Other recent approaches for sines–transients separation include
a kernel additive matrix [22], non-negative matrix factorization
[23], improved sinusoidal modeling [24, 25], and neural
networks [26]. However, these methods do not involve a third class
for the noise component, hence they are not discussed further in
this paper.
It should be noted that the STN decomposition does not directly
relate to traditional source separation, which usually aims at
retrieving musical instruments in a sound mixture [27] or speech
from noisy background sources [28]. According to the STN
formulation, even strongly percussive sources, such as drums,
will have sinusoidal and noise components (unless they are
perfect, synthetic pulses) and, similarly, strongly harmonic
sources, such as the violin, will hold, in addition to a sinusoidal
part, a transient component in their attack and a noise component
to describe the nuances, such as the bowing noise.
arXiv:2210.14041v2 [eess.AS] 30 Nov 2022