Neural Sound Field Decomposition with
Super-resolution of Sound Direction
Qiuqiang Kong1, Shilei Liu1, Junjie Shi1, Xuzhou Ye1
Yin Cao2, Qiaoxi Zhu3, Yong Xu4, Yuxuan Wang1
1ByteDance, Shanghai, China
2University of Surrey, Guildford, UK
3University of Technology Sydney, Sydney, Australia
4Tencent AI Lab, Bellevue, USA
1{kongqiuqiang, liushilei.666, shijunjie, yexuzhou, wangyuxuan.11}@bytedance.com
2yin.cao@surrey.ac.uk, 3qiaoxi.zhu@uts.edu.au, 4lucayongxu@tencent.com
Abstract
Sound field decomposition predicts waveforms in arbitrary directions using signals from a limited number of microphones as inputs. Sound field decomposition is fundamental to downstream tasks, including source localization, source separation, and spatial audio reproduction. Conventional sound field decomposition methods such as Ambisonics have limited spatial decomposition resolution. This paper proposes a learning-based Neural Sound field Decomposition (NeSD) framework to allow sound field decomposition with fine spatial direction resolution, using recordings from microphone capsules of a few microphones at arbitrary positions. The inputs of a NeSD system include microphone signals, microphone positions, and queried directions. The outputs of a NeSD include the waveform and the presence probability of a queried position. We model the NeSD systems respectively with different neural networks, including fully connected, time delay, and recurrent neural networks. We show that the NeSD systems outperform conventional Ambisonics and DOANet methods in sound field decomposition and source localization on speech, music, and sound events datasets. Demos are available at https://www.youtube.com/watch?v=0GIr6doj3BQ.
1 Introduction
A sound field of a spatial region may contain sound waves propagating from different directions. Sound field decomposition [1–4] decomposes wave fields from arbitrary directions from signals recorded by microphone arrays. This work introduces a learning-based neural sound field decomposition (NeSD) approach to predict what, where, and when sounds occur in a recording. The NeSD system can be used as pre-processing for downstream tasks, such as sound localization and direction of arrival estimation [5–10], sound event detection [11–13], source separation [14–18], beamforming [19–22], sound field reproduction [23–25], and augmented reality (AR) and virtual reality (VR) [26–28].
A sound field consists of sounds coming from different directions. For example, musicians in a band may have different locations on a stage. Multiple speakers may have different locations in a room. Sound localization [6], also called direction of arrival (DOA) estimation, is a task to predict the locations of sources. Previous DOA methods include parametric-based methods such as time difference of arrival (TDOA) [29] and multiple signal classification (MUSIC) [30]. Recently, neural networks have been introduced to address the DOA problem, such as convolutional recurrent neural networks (CRNNs) [5, 8] and two-stage estimation methods [7]. However, many conventional DOA
methods require the number of microphones to be larger than the number of sources and cannot decompose waveforms.
Another challenge of sound field decomposition is to separate signals coming from different directions in the sound field. Beamforming [31, 32, 21, 22] is a technique to filter signals in specific beams. Conventional beamforming methods include the minimum variance distortionless response (MVDR) [33]. Recently, neural network-based beamforming methods have been proposed [21, 22]. However, those beamforming methods do not predict the localization of sources. Previous neural network-based beamforming methods focused on speech and were not trained on general sounds. Sound field decomposition is also related to the source separation problem, where deep neural networks have been applied [18, 14, 34, 35, 15]. However, many source separation systems do not separate correlated waveforms from different directions. Recently, unsupervised source separation methods [36–38] were proposed to separate unseen sources. Still, those methods do not separate highly correlated sources and do not predict the directions of sources.
The conventional way of sound field decomposition requires numerous measurements to capture a sound field. First-order Ambisonics (FOA) [39–41] and high-order Ambisonics (HOA) [42, 43] were proposed to record sound fields. Ambisonics provides a truncated spherical harmonic decomposition of a sound field. A $K$-th order Ambisonics recording requires $(K+1)^2$ channels to record a sound field. A sound field is hard to record and process when $K$ is large. Moreover, accurate reproduction in a head-sized volume up to 20 kHz would require an order $K$ of more than 30, that is, more than $(30+1)^2 = 961$ channels. Recently, DOANet [44] was proposed to predict a pseudo-spectrum, but it does not predict waveforms. Still, there is a lack of work on neural network-based sound field decomposition that can achieve super-resolution of a sound field.
In this work, we propose a NeSD framework to address the sound field decomposition problem. The NeSD approach is inspired by the neural radiance fields (NeRF) [45] for view synthesis in computer vision. NeSD has the advantage of predicting the locations of wideband, non-stationary, moving, and arbitrarily many sources in a sound field. NeSD supports any microphone array layout, including uniform or non-uniform arrays such as planar, spherical, or other irregular arrays. NeSD also allows the microphones to have different directivity patterns. NeSD can separate correlated signals in a sound field. NeSD can decompose a sound field with arbitrary spatial resolution and can achieve better directivity than FOA and HOA methods. In training, the inputs to a NeSD system include the signals of an arbitrary microphone array, the positions of all microphones, and arbitrary queried directions on a sphere. In inference, all sound field directions are input to the trained NeSD in mini-batches to predict the waveforms and the presence probabilities of sources in a sound field.
This work is organized as follows. Section 2 introduces the sound field decomposition problem.
Section 3 introduces our proposed NeSD framework. Section 4 shows experiments. Section 5
concludes this work.
2 Problem Statement
The signals recorded from a microphone capsule are denoted as $\mathbf{x} = \{x_1(t), ..., x_M(t)\}$, where $M$ is the number of microphones in the capsule. The $m$-th microphone signal is $x_m(t) \in \mathbb{R}^T$, where $T$ is the number of samples in an audio segment. The microphone capsule can be of any type, such as an Ambisonics or a circular capsule.
The coordinate of the $m$-th microphone in the spherical coordinate system is $q_m(t)$, where $q_m(t) = \{r_m(t), \theta_m(t), \varphi_m(t)\}$ denotes the distance, azimuthal angle, and polar angle of the microphone. The position information of all microphones is $\mathbf{q} = \{q_1(t), ..., q_M(t)\}$. Note that $q_m(t)$ is a time-dependent variable to reflect moving microphones. For a static microphone, all of $r_m(t)$, $\theta_m(t)$, and $\varphi_m(t)$ in $q_m(t)$ have constant values.
We denote a continuous sound field as $s = s(\Omega, t) \in \mathbb{S}^2 \times \mathbb{R}^T$, where $\mathbb{S}^2$ is a sphere. Each direction $\Omega$ is described by an azimuth angle $\theta \in [0, 2\pi)$ and a polar angle $\varphi \in [0, \pi]$. Sound field decomposition is a task to estimate $s$ from the microphone signals $\mathbf{x}$ and the microphone positions $\mathbf{q}$:

$$\hat{s}(\Omega, t) = f_{\mathbb{S}^2}(\mathbf{x}, \mathbf{q}), \quad (1)$$

where $f_{\mathbb{S}^2}(\cdot, \cdot)$ is a sound field decomposition mapping and $\hat{s}(\Omega, t)$ is the estimated waveform.
3 Neural Sound Field Decomposition (NeSD)
We introduce the training, the data creation, the NeSD architecture, the hard example mining, and the
inference of NeSD in this section.
3.1 Empirical Risk
Directly modeling $f_{\mathbb{S}^2}$ is intractable because the output dimension $\mathbb{S}^2 \times \mathbb{R}^T$ is infinite. Instead, we propose a mapping $f$ to predict $\hat{s}(\Omega, t)$ conditioned on a direction $\Omega$, for all $\Omega \in \mathbb{S}^2$:

$$\hat{s}(\Omega, t) = f(\mathbf{x}, \mathbf{q}, \Omega). \quad (2)$$
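For intuition, the mapping in (2) can be read as a function that takes the microphone waveforms, the microphone positions, and a batch of queried directions, and returns one waveform and one presence probability per queried direction. Below is a minimal PyTorch sketch of this interface using a fully connected backbone; the class name NeSDSketch, the tensor shapes, and the two-layer network are illustrative assumptions, not the architecture used in our experiments.

```python
import torch
import torch.nn as nn


class NeSDSketch(nn.Module):
    """Minimal sketch of the mapping f(x, q, Omega) in Eq. (2).

    Shapes (illustrative assumptions):
      mic_wavs:  (B, M, T)  waveforms of the M microphones, T samples each
      mic_pos:   (B, M, 3)  (r, azimuth, polar) of each microphone
      query_dir: (B, Q, 2)  (azimuth, polar) of the Q queried directions
    Returns:
      wav_hat:   (B, Q, T)  estimated waveform for each queried direction
      prob_hat:  (B, Q)     presence probability for each queried direction
    """

    def __init__(self, n_mics: int, n_samples: int, hidden: int = 256):
        super().__init__()
        # Conditioning vector: flattened signals + flattened positions + one direction.
        in_dim = n_mics * n_samples + n_mics * 3 + 2
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.wav_head = nn.Linear(hidden, n_samples)  # waveform of the queried direction
        self.prob_head = nn.Linear(hidden, 1)         # presence probability

    def forward(self, mic_wavs, mic_pos, query_dir):
        B, Q, _ = query_dir.shape
        cond = torch.cat([mic_wavs.flatten(1), mic_pos.flatten(1)], dim=1)  # (B, D)
        cond = cond.unsqueeze(1).expand(B, Q, -1)                           # repeat for each query
        h = self.backbone(torch.cat([cond, query_dir], dim=2))              # (B, Q, hidden)
        wav_hat = self.wav_head(h)                                          # (B, Q, T)
        prob_hat = torch.sigmoid(self.prob_head(h)).squeeze(2)              # (B, Q)
        return wav_hat, prob_hat
```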
We model $f$ in (2) by using a neural network. We denote the risk $l$ between the estimated sound field $\hat{s}$ and the oracle sound field $s$ as:

$$l = \mathbb{E}_{s \sim p_{\text{field}}(s)} \, \mathbb{E}_{\Omega \sim p_{\text{sphere}}(\Omega)} \, \mathbb{E}_{t \sim p_{\text{time}}(t)} \, d(s(\Omega, t), \hat{s}(\Omega, t)), \quad (3)$$

where $p_{\text{field}}$, $p_{\text{sphere}}$, and $p_{\text{time}}$ are the distributions of the sound field $s$, the direction $\Omega$, and the time $t$, respectively. The loss function is denoted by $d(\cdot, \cdot)$. Equation (3) shows that the risk $l$ consists of expectations over $p_{\text{field}}$, $p_{\text{sphere}}$, and $p_{\text{time}}$. However, directly optimizing the risk (3) is intractable. To address this problem, we propose to minimize the empirical risk in mini-batches:
$$l_{\text{batch}} = \sum_{\substack{b=1 \\ s \sim p_{\text{field}}(s)}}^{B} \; \sum_{\substack{q=1 \\ \Omega \sim p_{\text{sphere}}(\Omega)}}^{Q} \; \sum_{t=1}^{T} d(s(\Omega, t), \hat{s}(\Omega, t)), \quad (4)$$
where $B$ is the number of sound fields sampled from a dataset in a mini-batch and $Q$ is the number of directions sampled on a sphere. In each mini-batch, sound field signals $s$ are sampled from $p_{\text{field}}(s)$ and directions $\Omega$ are sampled from $p_{\text{sphere}}(\Omega)$. By this means, the optimization of (4) becomes tractable. The risk (4) is differentiable with respect to the learnable parameters of $f$, so that the learnable parameters can be optimized by gradient-based methods.
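As a rough illustration of how (4) is evaluated during training, the sketch below draws one mini-batch of sound fields and queried directions and accumulates the loss. It reuses the NeSDSketch module above, uses a mean L1 waveform loss as a stand-in for $d(\cdot, \cdot)$, and the sampling helpers sample_sound_fields and sample_directions are hypothetical placeholders for the data creation described in Section 3.2.

```python
import torch


def empirical_risk(model, sample_sound_fields, sample_directions,
                   batch_size=8, n_queries=16):
    """Sketch of the mini-batch empirical risk in Eq. (4).

    sample_sound_fields(batch_size) is assumed to return
    (mic_wavs, mic_pos, oracle_fn), where oracle_fn evaluates the oracle
    sound field s(Omega, t) at a batch of queried directions;
    sample_directions(batch_size, n_queries) draws directions from p_sphere.
    """
    mic_wavs, mic_pos, oracle_fn = sample_sound_fields(batch_size)  # s ~ p_field
    query_dir = sample_directions(batch_size, n_queries)            # Omega ~ p_sphere
    target = oracle_fn(query_dir)                                   # s(Omega, t), shape (B, Q, T)

    wav_hat, _ = model(mic_wavs, mic_pos, query_dir)                # estimated waveforms
    return torch.abs(wav_hat - target).mean()                       # d(., .): mean L1 error


# One gradient-based update on the risk (Section 3.1):
#   loss = empirical_risk(model, sample_sound_fields, sample_directions)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```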
3.2 Create Microphone Sound Field Signals
The optimization of (4) requires paired microphone signals $\mathbf{x}$, microphone position information $\mathbf{q}$, and sound field signals $s$ for training. However, it is impossible to obtain oracle sound field signals $s$ in real scenes. To address this problem, we propose to create microphone and sound field signals from point sources.
We create $I$ far-field point sources $\{a_i(t)\}_{i=1}^{I}$ from $I$ randomly sampled directions $\{\Omega_i\}_{i=1}^{I}$. Each $a_i(t)$ is a randomly selected monophonic audio segment from an audio dataset such as a speech or a music dataset. In a free field without room reverberation, the sound field $s$ can be created by:
$$s(\Omega, t) = \begin{cases} a_i(t), & \Omega = \Omega_i \\ 0, & \Omega \in \mathbb{S}^2 \setminus \{\Omega_i\}_{i=1}^{I}. \end{cases} \quad (5)$$

In (5), a sound field $s$ only has non-zero values in the directions containing point sources.

Figure 1: Point source signals arriving at the microphones.
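Because the target in (5) is non-zero only at the $I$ sampled source directions, it can be stored sparsely. The following is a minimal NumPy sketch of this construction under the free-field assumption above; the helper names and the dictionary representation are our own illustrative choices.

```python
import numpy as np


def create_sound_field_target(sources, directions):
    """Sketch of Eq. (5): the oracle sound field equals a_i(t) at direction
    Omega_i and is zero at every other direction on the sphere.

    sources:    list of I mono waveforms, each of shape (T,)
    directions: list of I (azimuth, polar) tuples sampled on the sphere
    """
    assert len(sources) == len(directions)
    # Sparse representation: only the source directions are stored.
    return {tuple(direction): waveform
            for direction, waveform in zip(directions, sources)}


def query_target(field, direction, n_samples):
    """Evaluate the sparse target at an arbitrary queried direction."""
    return field.get(tuple(direction), np.zeros(n_samples))
```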
Next, we create microphone signals $\mathbf{x}$. First, all $I$ point source signals are propagated to all $M$ microphones. We denote $a_{i,m}^{(\text{delay})}(t)$ as the $i$-th point source signal arriving at the $m$-th microphone:

$$a_{i,m}^{(\text{delay})}(t) = a_i(t - \tau_{i,m}), \quad (6)$$
where $\tau_{i,m}$ is the propagation time. The propagation time $\tau_{i,m}$ can be calculated from the speed of sound $c$, the distance $r_m$ from the $m$-th microphone to the origin, and the included angle $\Omega$ between the $i$-th point source and the $m$-th microphone, as shown in Fig. 1:

$$\tau_{i,m} = \frac{r_m \cos \Omega}{c}.$$
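The sketch below simulates the microphone signals implied by Eq. (6) under the same far-field, free-field assumption, shifting each source by $\tau_{i,m} = r_m \cos\Omega / c$. Rounding the delay to an integer number of samples, the circular shift, and the omnidirectional microphones are simplifications made for brevity.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s


def included_angle(source_dir, mic_dir):
    """Angle between a far-field source direction and a microphone direction,
    each given as (azimuth theta, polar phi)."""
    (theta_s, phi_s), (theta_m, phi_m) = source_dir, mic_dir
    u = np.array([np.sin(phi_s) * np.cos(theta_s),
                  np.sin(phi_s) * np.sin(theta_s),
                  np.cos(phi_s)])
    v = np.array([np.sin(phi_m) * np.cos(theta_m),
                  np.sin(phi_m) * np.sin(theta_m),
                  np.cos(phi_m)])
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))


def simulate_mic_signals(sources, source_dirs, mic_positions, sample_rate):
    """Sketch of Eq. (6): each microphone signal sums all point sources,
    each shifted by its propagation time tau_{i,m} = r_m * cos(Omega) / c.

    sources:       list of I mono waveforms, each of shape (T,)
    source_dirs:   list of I (azimuth, polar) source directions
    mic_positions: list of M (r, azimuth, polar) microphone positions
    """
    n_samples = len(sources[0])
    mic_signals = []
    for r_m, theta_m, phi_m in mic_positions:
        x_m = np.zeros(n_samples)
        for a_i, dir_i in zip(sources, source_dirs):
            omega = included_angle(dir_i, (theta_m, phi_m))
            tau = r_m * np.cos(omega) / SPEED_OF_SOUND   # propagation time in seconds
            shift = int(round(tau * sample_rate))        # integer-sample approximation
            x_m += np.roll(a_i, shift)                   # circular shift as a simplification
        mic_signals.append(x_m)
    return np.stack(mic_signals)                          # shape (M, T)
```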