Neural Sound Field Decomposition with
Super-resolution of Sound Direction
Qiuqiang Kong1, Shilei Liu1, Junjie Shi1, Xuzhou Ye1
Yin Cao2, Qiaoxi Zhu3, Yong Xu4, Yuxuan Wang1
1ByteDance, Shanghai, China
2University of Surrey, Guildford, UK
3University of Technology Sydney, Sydney, Australia
4Tencent AI Lab, Bellevue, USA
1{kongqiuqiang, liushilei.666, shijunjie, yexuzhou, wangyuxuan.11}@bytedance.com
2yin.cao@surrey.ac.uk, 3qiaoxi.zhu@uts.edu.au, 4lucayongxu@tencent.com
Abstract
Sound field decomposition predicts waveforms in arbitrary directions using signals from a limited number of microphones as inputs. Sound field decomposition is fundamental to downstream tasks, including source localization, source separation, and spatial audio reproduction. Conventional sound field decomposition methods such as Ambisonics have limited spatial decomposition resolution. This paper proposes a learning-based Neural Sound field Decomposition (NeSD) framework to allow sound field decomposition with fine spatial direction resolution, using recordings from microphone capsules of a few microphones at arbitrary positions. The inputs of a NeSD system include microphone signals, microphone positions, and queried directions. The outputs of a NeSD include the waveform and the presence probability of a queried position. We model the NeSD systems respectively with different neural networks, including fully connected, time delay, and recurrent neural networks. We show that the NeSD systems outperform conventional Ambisonics and DOANet methods in sound field decomposition and source localization on speech, music, and sound events datasets. Demos are available at https://www.youtube.com/watch?v=0GIr6doj3BQ.
1 Introduction
A sound field of a spatial region may contain sound waves propagating from different directions. Sound field decomposition [1–4] decomposes wave fields from arbitrary directions from signals recorded by microphone arrays. This work introduces a learning-based neural sound field decomposition (NeSD) approach to predict what, where, and when sounds occur in a recording. The NeSD system can be used as pre-processing for downstream tasks, such as sound localization and direction of arrival estimation [5–10], sound event detection [11–13], source separation [14–18], beamforming [19–22], sound field reproduction [23–25], and augmented reality (AR) and virtual reality (VR) [26–28].
A sound field consists of sounds coming from different directions. For example, musicians in a band may have different locations on a stage. Multiple speakers may have different locations in a room. Sound localization [6], also called direction of arrival (DOA) estimation, is a task to predict the locations of sources. Previous DOA methods include parametric-based methods such as time difference of arrival (TDOA) [29] and multiple signal classification (MUSIC) [30]. Recently, neural networks have been introduced to address the DOA problem, such as convolutional recurrent neural networks (CRNNs) [5, 8] and two-stage estimation methods [7]. However, many conventional DOA
methods require the number of microphones to be larger than the number of sources and cannot decompose waveforms.
Another challenge of sound field decomposition is to separate signals coming from different directions in the sound field. Beamforming [31, 32, 21, 22] is a technique to filter signals in specific beams. Conventional beamforming methods include the minimum variance distortionless response (MVDR) [33]. Recently, neural network-based beamforming methods have been proposed [21, 22]. However, those beamforming methods do not predict the localization of sources. Previous neural network-based beamforming methods focused on speech and were not trained on general sounds. Sound field decomposition is also related to the source separation problem, where deep neural networks have been applied [18, 14, 34, 35, 15]. However, many source separation systems do not separate correlated waveforms from different directions. Recently, unsupervised source separation methods [36–38] were proposed to separate unseen sources. Still, those methods do not separate highly correlated sources and do not predict the directions of sources.
The conventional way of sound field decomposition requires numerous measurements to capture a sound field. First-order Ambisonics (FOA) [39–41] and high-order Ambisonics (HOA) [42, 43] were proposed to record sound fields. Ambisonics provides a truncated spherical harmonic decomposition of a sound field. A $K$-th order Ambisonics recording requires $(K+1)^2$ channels to record a sound field. A sound field is hard to record and process when $K$ is large. Moreover, accurate reproduction in a head-sized volume up to 20 kHz would require an order $K$ of more than 30, that is, more than $(30+1)^2 = 961$ channels. Recently, DOANet [44] was proposed to predict a pseudo-spectrum, but it does not predict waveforms. Still, there is a lack of work on neural network-based sound field decomposition that can achieve super-resolution of a sound field.
In this work, we propose a NeSD framework to address the sound field decomposition problem. The NeSD approach is inspired by the neural radiance fields (NeRF) [45] for view synthesis in computer vision. NeSD has the advantage of predicting the locations of wideband, non-stationary, moving, and arbitrarily many sources in a sound field. NeSD supports any microphone array layout, including uniform or non-uniform arrays such as planar, spherical, or other irregular arrays. NeSD also allows the microphones to have different directivity patterns. NeSD can separate correlated signals in a sound field. NeSD can decompose a sound field with arbitrary spatial resolution and can achieve better directivity than FOA and HOA methods. In training, the inputs to a NeSD system include the signals of an arbitrary microphone array, the positions of all microphones, and arbitrary queried directions on a sphere. In inference, all sound field directions are input to the trained NeSD in mini-batches to predict the waveforms and the presence probabilities of sources in a sound field.
This work is organized as follows. Section 2 introduces the sound field decomposition problem.
Section 3 introduces our proposed NeSD framework. Section 4 shows experiments. Section 5
concludes this work.
2 Problem Statement
The signals recorded from a microphone capsule are denoted as $\mathbf{x} = \{x_1(t), ..., x_M(t)\}$, where $M$ is the number of microphones in the capsule. The $m$-th microphone signal is $x_m(t) \in \mathbb{R}^T$, where $T$ is the number of samples in an audio segment. The microphone capsule can be of any type, such as an Ambisonics or a circular capsule.
The coordinate of the $m$-th microphone in the spherical coordinate system is $q_m(t)$, where $q_m(t) = \{r_m(t), \theta_m(t), \varphi_m(t)\}$ denotes the distance, azimuthal angle, and polar angle of the microphone. The position information of all microphones is $\mathbf{q} = \{q_1(t), ..., q_M(t)\}$. Note that $q_m(t)$ is a time-dependent variable to reflect moving microphones. For a static microphone, all of $r_m(t)$, $\theta_m(t)$, and $\varphi_m(t)$ in $q_m(t)$ have constant values.
We denote a continuous sound field as $s = s(\Omega, t) \in \mathbb{S}^2 \times \mathbb{R}^T$, where $\mathbb{S}^2$ is a sphere. Each direction $\Omega$ is described by an azimuth angle $\theta \in [0, 2\pi)$ and a polar angle $\varphi \in [0, \pi]$. Sound field decomposition is a task to estimate $s$ from the microphone signals $\mathbf{x}$ and the microphone positions $\mathbf{q}$:

$$\hat{s}(\Omega, t) = f_{\mathbb{S}^2}(\mathbf{x}, \mathbf{q}), \quad (1)$$

where $f_{\mathbb{S}^2}(\cdot, \cdot)$ is a sound field decomposition mapping and $\hat{s}(\Omega, t)$ is the estimated waveform.
3 Neural Sound Field Decomposition (NeSD)
We introduce the training, the data creation, the NeSD architecture, the hard example mining, and the
inference of NeSD in this section.
3.1 Empirical Risk
Directly modeling $f_{\mathbb{S}^2}$ is intractable because the output dimension $\mathbb{S}^2 \times \mathbb{R}^T$ is infinite. Instead, we propose a mapping $f$ to predict $\hat{s}(\Omega, t)$ conditioned on a direction $\Omega$, for all $\Omega \in \mathbb{S}^2$:

$$\hat{s}(\Omega, t) = f(\mathbf{x}, \mathbf{q}, \Omega). \quad (2)$$
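For intuition, the mapping in (2) can be read as a function that takes the microphone waveforms, the microphone positions, and a batch of queried directions, and returns one waveform and one presence probability per queried direction. Below is a minimal PyTorch sketch of this interface using a fully connected backbone; the class name NeSDSketch, the tensor shapes, and the two-layer network are illustrative assumptions, not the architecture used in our experiments.

```python
import torch
import torch.nn as nn


class NeSDSketch(nn.Module):
    """Minimal sketch of the mapping f(x, q, Omega) in Eq. (2).

    Shapes (illustrative assumptions):
      mic_wavs:  (B, M, T)  waveforms of the M microphones, T samples each
      mic_pos:   (B, M, 3)  (r, azimuth, polar) of each microphone
      query_dir: (B, Q, 2)  (azimuth, polar) of the Q queried directions
    Returns:
      wav_hat:   (B, Q, T)  estimated waveform for each queried direction
      prob_hat:  (B, Q)     presence probability for each queried direction
    """

    def __init__(self, n_mics: int, n_samples: int, hidden: int = 256):
        super().__init__()
        # Conditioning vector: flattened signals + flattened positions + one direction.
        in_dim = n_mics * n_samples + n_mics * 3 + 2
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.wav_head = nn.Linear(hidden, n_samples)  # waveform of the queried direction
        self.prob_head = nn.Linear(hidden, 1)         # presence probability

    def forward(self, mic_wavs, mic_pos, query_dir):
        B, Q, _ = query_dir.shape
        cond = torch.cat([mic_wavs.flatten(1), mic_pos.flatten(1)], dim=1)  # (B, D)
        cond = cond.unsqueeze(1).expand(B, Q, -1)                           # repeat for each query
        h = self.backbone(torch.cat([cond, query_dir], dim=2))              # (B, Q, hidden)
        wav_hat = self.wav_head(h)                                          # (B, Q, T)
        prob_hat = torch.sigmoid(self.prob_head(h)).squeeze(2)              # (B, Q)
        return wav_hat, prob_hat
```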
We model $f$ in (2) by using a neural network. We denote the risk $l$ between the estimated sound field $\hat{s}$ and the oracle sound field $s$ as:

$$l = \mathbb{E}_{s \sim p_{\text{field}}(s)} \, \mathbb{E}_{\Omega \sim p_{\text{sphere}}(\Omega)} \, \mathbb{E}_{t \sim p_{\text{time}}(t)} \, d(s(\Omega, t), \hat{s}(\Omega, t)), \quad (3)$$

where $p_{\text{field}}$, $p_{\text{sphere}}$, and $p_{\text{time}}$ are the distributions of the sound field $s$, the direction $\Omega$, and the time $t$, respectively. The loss function is denoted by $d(\cdot, \cdot)$. Equation (3) shows that the risk $l$ consists of expectations over $p_{\text{field}}$, $p_{\text{sphere}}$, and $p_{\text{time}}$. However, directly optimizing the risk (3) is intractable. To address this problem, we propose to minimize the empirical risk in mini-batches:
$$l_{\text{batch}} = \sum_{\substack{b=1 \\ s \sim p_{\text{field}}(s)}}^{B} \; \sum_{\substack{q=1 \\ \Omega \sim p_{\text{sphere}}(\Omega)}}^{Q} \; \sum_{t=1}^{T} d(s(\Omega, t), \hat{s}(\Omega, t)), \quad (4)$$
where $B$ is the number of sound fields sampled from a dataset in a mini-batch and $Q$ is the number of directions sampled on a sphere. In each mini-batch, sound field signals $s$ are sampled from $p_{\text{field}}(s)$ and directions $\Omega$ are sampled from $p_{\text{sphere}}(\Omega)$. By this means, the optimization of (4) becomes tractable. The risk (4) is differentiable with respect to the learnable parameters of $f$, so that the learnable parameters can be optimized by gradient-based methods.
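As a rough illustration of how (4) is evaluated during training, the sketch below draws one mini-batch of sound fields and queried directions and accumulates the loss. It reuses the NeSDSketch module above, uses a mean L1 waveform loss as a stand-in for $d(\cdot, \cdot)$, and the sampling helpers sample_sound_fields and sample_directions are hypothetical placeholders for the data creation described in Section 3.2.

```python
import torch


def empirical_risk(model, sample_sound_fields, sample_directions,
                   batch_size=8, n_queries=16):
    """Sketch of the mini-batch empirical risk in Eq. (4).

    sample_sound_fields(batch_size) is assumed to return
    (mic_wavs, mic_pos, oracle_fn), where oracle_fn evaluates the oracle
    sound field s(Omega, t) at a batch of queried directions;
    sample_directions(batch_size, n_queries) draws directions from p_sphere.
    """
    mic_wavs, mic_pos, oracle_fn = sample_sound_fields(batch_size)  # s ~ p_field
    query_dir = sample_directions(batch_size, n_queries)            # Omega ~ p_sphere
    target = oracle_fn(query_dir)                                   # s(Omega, t), shape (B, Q, T)

    wav_hat, _ = model(mic_wavs, mic_pos, query_dir)                # estimated waveforms
    return torch.abs(wav_hat - target).mean()                       # d(., .): mean L1 error


# One gradient-based update on the risk (Section 3.1):
#   loss = empirical_risk(model, sample_sound_fields, sample_directions)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```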
3.2 Create Microphone Sound Field Signals
The optimization of (4) requires paired microphone signals $\mathbf{x}$, microphone position information $\mathbf{q}$, and sound field signals $s$ for training. However, it is impossible to obtain oracle sound field signals $s$ in real scenes. To address this problem, we propose to create microphone and sound field signals from point sources.
We create $I$ far-field point sources $\{a_i(t)\}_{i=1}^{I}$ from $I$ randomly sampled directions $\{\Omega_i\}_{i=1}^{I}$. Each $a_i(t)$ is a randomly selected monophonic audio segment from an audio dataset such as a speech or a music dataset. In a free field without room reverberation, the sound field $s$ can be created by:
$$s(\Omega, t) = \begin{cases} a_i(t), & \Omega = \Omega_i \\ 0, & \Omega \in \mathbb{S}^2 \setminus \{\Omega_i\}_{i=1}^{I}. \end{cases} \quad (5)$$

In (5), a sound field $s$ only has non-zero values in the directions containing point sources.

Figure 1: Point source signals arriving at the microphones.
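Because the target in (5) is non-zero only at the $I$ sampled source directions, it can be stored sparsely. The following is a minimal NumPy sketch of this construction under the free-field assumption above; the helper names and the dictionary representation are our own illustrative choices.

```python
import numpy as np


def create_sound_field_target(sources, directions):
    """Sketch of Eq. (5): the oracle sound field equals a_i(t) at direction
    Omega_i and is zero at every other direction on the sphere.

    sources:    list of I mono waveforms, each of shape (T,)
    directions: list of I (azimuth, polar) tuples sampled on the sphere
    """
    assert len(sources) == len(directions)
    # Sparse representation: only the source directions are stored.
    return {tuple(direction): waveform
            for direction, waveform in zip(directions, sources)}


def query_target(field, direction, n_samples):
    """Evaluate the sparse target at an arbitrary queried direction."""
    return field.get(tuple(direction), np.zeros(n_samples))
```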
Next, we create microphone signals $\mathbf{x}$. First, all $I$ point source signals are propagated to all $M$ microphones. We denote $a_{i,m}^{(\text{delay})}(t)$ as the $i$-th point source signal arriving at the $m$-th microphone:

$$a_{i,m}^{(\text{delay})}(t) = a_i(t - \tau_{i,m}), \quad (6)$$
where $\tau_{i,m}$ is the propagation time. The propagation time $\tau_{i,m}$ can be calculated from the speed of sound $c$, the distance $r_m$ from the $m$-th microphone to the origin, and the included angle $\Omega$ between the $i$-th point source and the $m$-th microphone, as shown in Fig. 1:

$$\tau_{i,m} = \frac{r_m \cos \Omega}{c}.$$
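The sketch below simulates the microphone signals implied by Eq. (6) under the same far-field, free-field assumption, shifting each source by $\tau_{i,m} = r_m \cos\Omega / c$. Rounding the delay to an integer number of samples, the circular shift, and the omnidirectional microphones are simplifications made for brevity.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s


def included_angle(source_dir, mic_dir):
    """Angle between a far-field source direction and a microphone direction,
    each given as (azimuth theta, polar phi)."""
    (theta_s, phi_s), (theta_m, phi_m) = source_dir, mic_dir
    u = np.array([np.sin(phi_s) * np.cos(theta_s),
                  np.sin(phi_s) * np.sin(theta_s),
                  np.cos(phi_s)])
    v = np.array([np.sin(phi_m) * np.cos(theta_m),
                  np.sin(phi_m) * np.sin(theta_m),
                  np.cos(phi_m)])
    return np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))


def simulate_mic_signals(sources, source_dirs, mic_positions, sample_rate):
    """Sketch of Eq. (6): each microphone signal sums all point sources,
    each shifted by its propagation time tau_{i,m} = r_m * cos(Omega) / c.

    sources:       list of I mono waveforms, each of shape (T,)
    source_dirs:   list of I (azimuth, polar) source directions
    mic_positions: list of M (r, azimuth, polar) microphone positions
    """
    n_samples = len(sources[0])
    mic_signals = []
    for r_m, theta_m, phi_m in mic_positions:
        x_m = np.zeros(n_samples)
        for a_i, dir_i in zip(sources, source_dirs):
            omega = included_angle(dir_i, (theta_m, phi_m))
            tau = r_m * np.cos(omega) / SPEED_OF_SOUND   # propagation time in seconds
            shift = int(round(tau * sample_rate))        # integer-sample approximation
            x_m += np.roll(a_i, shift)                   # circular shift as a simplification
        mic_signals.append(x_m)
    return np.stack(mic_signals)                          # shape (M, T)
```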