methods require the number of microphones to be larger than the number of sources and cannot decompose waveforms.
Another challenge of sound field decomposition is to separate signals in the sound field from different directions. Beamforming [31, 32, 21, 22] is a technique to filter signals in specific beams.
Conventional beamforming methods include the minimum variance distortionless response (MVDR) beamformer [33]. Recently, neural network-based beamforming methods have been proposed [21, 22]. However, those beamforming methods do not predict the locations of sources. Previous neural network-based beamforming methods also focused on speech and were not trained on general sounds.
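For reference, the MVDR beamformer chooses weights $\mathbf{w}$ that minimize the output power $\mathbf{w}^{H}\mathbf{\Phi}\mathbf{w}$, where $\mathbf{\Phi}$ is the noise covariance matrix, subject to the distortionless constraint $\mathbf{w}^{H}\mathbf{d} = 1$ in the steering direction $\mathbf{d}$. This yields the standard closed-form solution

$$\mathbf{w}_{\text{MVDR}} = \frac{\mathbf{\Phi}^{-1}\mathbf{d}}{\mathbf{d}^{H}\mathbf{\Phi}^{-1}\mathbf{d}}.$$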
Sound field decomposition is also related to the source separation problem, which deep neural networks have been applied to address [18, 14, 34, 35, 15].
However, many source separation systems do not separate correlated waveforms from different directions. Recently, unsupervised source separation methods [36–38] were proposed to separate unseen sources. Still, those methods do not separate highly correlated sources and do not predict the directions of sources.
The conventional way of sound field decomposition requires numerous measurements to capture a sound field. First-order Ambisonics (FOA) [39–41] and high-order Ambisonics (HOA) [42, 43] were proposed to record sound fields. Ambisonics provides a truncated spherical harmonic decomposition of a sound field. A $K$-th order Ambisonics recording requires $(K+1)^2$ channels, so a sound field is hard to record and process when $K$ is large. Moreover, accurate reproduction in a head-sized volume up to 20 kHz would require an order $K$ of more than 30.
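As a quick illustration of this scaling (a sketch, not part of the original text), the channel count grows quadratically with the order:

    # Channels required by K-th order Ambisonics: each order k contributes
    # 2k + 1 spherical harmonics, so sum_{k=0}^{K} (2k + 1) = (K + 1)^2.
    def ambisonics_channels(order: int) -> int:
        return (order + 1) ** 2

    print(ambisonics_channels(1))   # 4   channels (FOA)
    print(ambisonics_channels(4))   # 25  channels
    print(ambisonics_channels(30))  # 961 channels (head-sized volume up to 20 kHz)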
Recently, DOANet [44] was proposed to predict a pseudo-spectrum, but it does not predict waveforms. Still, there is a lack of work on neural network-based sound field decomposition that can achieve super-resolution of a sound field.
In this work, we propose a NeSD framework to address the sound field decomposition problem. The NeSD approach is inspired by neural radiance fields (NeRF) [45] for view synthesis in computer vision. NeSD has the advantage of predicting the locations of an arbitrary number of wideband, non-stationary, and moving sources in a sound field. NeSD supports any microphone array layout, including uniform and non-uniform arrays such as planar, spherical, or other irregular arrays, and also supports microphones with different directivity patterns. NeSD can separate correlated signals in a sound field, can decompose a sound field with arbitrary spatial resolution, and can achieve better directivity than FOA and HOA methods. In training, the inputs to a NeSD system include the signals of arbitrary microphone arrays, the positions of all microphones, and arbitrary queried directions on a sphere. In inference, all sound field directions are input to the trained NeSD in mini-batches to predict the waveforms and the presence probabilities of sources in a sound field, as sketched below.
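A minimal sketch of this querying pattern (illustrative only; `nesd`, `x`, and `q` are hypothetical names for a trained model, the microphone waveforms of Section 2, and the microphone positions):

    import numpy as np

    # Grid of queried directions covering the sphere: azimuth θ ∈ [0, 2π),
    # polar angle φ ∈ [0, π], following the conventions of Section 2.
    azimuths = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
    polars = np.linspace(0.0, np.pi, 37)
    directions = np.stack(
        np.meshgrid(azimuths, polars, indexing="ij"), axis=-1
    ).reshape(-1, 2)

    # Query the trained model with directions in mini-batches.
    waveforms, presences = [], []
    for i in range(0, len(directions), 64):
        batch = directions[i : i + 64]
        s_hat, p_hat = nesd(x, q, batch)  # waveforms [B, T], presence probs [B]
        waveforms.append(s_hat)
        presences.append(p_hat)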
This work is organized as follows. Section 2 introduces the sound field decomposition problem. Section 3 introduces our proposed NeSD framework. Section 4 presents experiments. Section 5 concludes this work.
2 Problem Statement
The signals recorded from a microphone capsule are denoted as $\mathbf{x} = \{x_1(t), ..., x_M(t)\}$, where $M$ is the number of microphones in the capsule. The $m$-th microphone signal is $x_m(t) \in \mathbb{R}^T$, where $T$ is the number of samples in an audio segment. The microphone capsule can be of any type, such as an Ambisonics or circular capsule.
The coordinate of the $m$-th microphone in the spherical coordinate system is $q_m(t)$, where $q_m(t) = \{r_m(t), \theta_m(t), \varphi_m(t)\}$ denotes the distance, azimuthal angle, and polar angle of the microphone. The position information of all microphones is $\mathbf{q} = \{q_1(t), ..., q_M(t)\}$. Note that $q_m(t)$ is a time-dependent variable to reflect moving microphones. For a static microphone, all of $r_m(t)$, $\theta_m(t)$, and $\varphi_m(t)$ in $q_m(t)$ have constant values.
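For concreteness (a standard convention, not stated in the original), with the polar angle $\varphi_m$ measured from the positive $z$-axis, these spherical coordinates map to Cartesian positions as

$$p_{m,x} = r_m \sin\varphi_m \cos\theta_m, \quad p_{m,y} = r_m \sin\varphi_m \sin\theta_m, \quad p_{m,z} = r_m \cos\varphi_m.$$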
We denote a continuous sound field as $s = s(\Omega, t) \in \mathbb{S}^2 \times \mathbb{R}^T$, where $\mathbb{S}^2$ is a sphere. Each direction $\Omega$ is described by an azimuth angle $\theta \in [0, 2\pi)$ and a polar angle $\varphi \in [0, \pi]$. Sound field decomposition is the task of estimating $s$ from the microphone signals $\mathbf{x}$ and the microphone positions $\mathbf{q}$:

$$\hat{s}(\Omega, t) = f_{\mathbb{S}^2}(\mathbf{x}, \mathbf{q}), \tag{1}$$

where $f_{\mathbb{S}^2}(\cdot, \cdot)$ is a sound field decomposition mapping and $\hat{s}(\Omega, t)$ is the estimated waveform.
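A shape-level sketch of this interface (array shapes and values are illustrative assumptions, not from the original):

    import numpy as np

    M, T = 4, 48000            # microphones, samples per segment
    x = np.zeros((M, T))       # microphone signals x_m(t)
    q = np.zeros((M, T, 3))    # positions q_m(t) = (r, θ, φ), time-dependent

    # One queried direction Ω = (θ, φ) on the sphere; the mapping f_{S^2}
    # estimates the waveform ŝ(Ω, t) of shape [T] arriving from Ω.
    omega = np.array([np.pi / 2, np.pi / 4])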