Deep Learning Based Stage-wise
Two-dimensional Speaker Localization
with Large Ad-hoc Microphone Arrays
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen Zhang,
Xiao-Lei Zhang, Senior Member, IEEE, and Xuelong Li, Fellow, IEEE
Abstract—While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays, where an ad-hoc microphone array is composed of randomly distributed microphone nodes, each of which is equipped with a traditional array. Specifically, we first employ convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to obtain 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods, and the proposed node selection further refines performance. The real-world dataset used in the experiments, named Libri-adhoc-nodes10, is newly recorded and described for the first time in this paper; it is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.
Index Terms—Two-dimensional speaker localization, ad-hoc
microphone array, deep learning, triangulation, clustering.
I. INTRODUCTION
Speaker localization aims to estimate speaker positions using speech signals recorded by microphones. It finds
wide applications in sound event detection and localization
[1], speaker separation [2]–[4] and diarization [5]–[7], etc.
A. Motivation and challenges
Speaker localization in adverse acoustic environments with
strong reverberation and noise interference is challenging.
Conventional speaker localization requires obtaining the di-
rections of speech sources, also known as direction-of-arrival
(DOA) estimation. Representative methods include multiple
signal classification (MUSIC) [8] and steered response power
with phase transform (SRP-PHAT) [9].
Shupei Liu and Linfeng Feng contributed equally to this work.
Xiao-Lei Zhang is the corresponding author.
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen
Zhang and Xiao-Lei Zhang are with the School of Marine Science and
Technology, Northwestern Polytechnical University, Xi’an 710072, China
(e-mail: shupei.liu@mail.nwpu.edu.cn; fenglinfeng@mail.nwpu.edu.cn;
gongyj@mail.nwpu.edu.cn; liangchengdong@mail.nwpu.edu.cn;
chen7zhang@mail.nwpu.edu.cn; xiaolei.zhang@nwpu.edu.cn).
Xuelong Li is with the Institute of Artificial Intelligence (TeleAI), China
Telecom Corp Ltd, 31 Jinrong Street, Beijing 100033, P. R. China (e-mail:
li@nwpu.edu.cn).
Recently, with the rapid development of deep-learning-
based speech separation and enhancement [10], deep-learning-
based DOA estimation has received increasing attention [11]–
[19]. Some methods utilize deep models to estimate noise-
robust variables that are then fed into conventional DOA
estimators [12]. Other methods formulate DOA estimation
as a classification problem of azimuth classes [11]. Spatial
acoustic features like generalized cross correlation [11], phase
spectrograms [13], spatial pseudo-spectrum [14], and circu-
lar harmonic features [19] are frequently extracted as input
to deep models. Convolutional neural networks (CNNs) are popular in the study of DOA estimation [13], [14], [16], [18]. Following the above directions, many generalized problem settings have been explored [14]–[17], exhibiting improved performance over conventional approaches.
However, in many applications, obtaining a speaker’s 2-
dimensional (2D) or 3-dimensional coordinate is more helpful
than merely obtaining the DOA. Ad-hoc microphone arrays
may be able to address the problem. An ad-hoc microphone
array is a group of randomly distributed cooperative micro-
phone nodes, each of which contains a traditional microphone
array, like a uniform linear array. The advantages of ad-hoc microphone arrays are that (i) they can be easily deployed and spread widely in the real world by organizing online devices, and (ii) they can reduce the occurrence probability of far-field speech signal processing [20]. As for the sound-source localization problem, analogous to prior investigations such as [21]–[23], whether an ad-hoc array can substantially outperform traditional fixed arrays by exploiting a large number of nodes needs deeper investigation.
Conventional sound source localization approaches based on
ad-hoc microphone arrays primarily employ signal processing
methods, as described in [24]. Recent progress in speaker
localization leverages deep learning in conjunction with dis-
tributed microphone nodes [25]–[31], which is the focus of this
paper. For example, [26] utilizes multiple deep-learning-based nodes to directly predict 2D speaker coordinates. Alternatively,
[25], [27], [30], [31] formulate indoor localization as a spa-
tial grid classification problem. [28] derives 2D coordinates
through triangulation of two distributed nodes. [29] feeds DOA
estimates from each node into a deep neural network (DNN)
to obtain the final speaker location.
While these pioneering works highlight the potential of deep learning techniques, their investigations have been limited to small numbers of nodes (e.g., two) and additional
[Fig. 1: Diagram of the proposed 2-dimensional speaker localization method based on deep learning. Each of the N microphone arrays produces a phase map that is fed to a convolutional neural network; the pipeline then runs feature extraction and DOA estimation, node selection, triangulation, and clustering to output the 2D estimate $(\hat{x}, \hat{y})$.]
constraints. These constraints include fixed node position-
ing for both training and testing in the same room [25]–
[27], [29], [31]. Alternatively, [28] mandates identical spatial
node patterns for training and testing, making it difficult
to maximize the flexibility of ad-hoc arrays. Additionally,
[30] is tailored for scenarios where each node consists of a
single microphone, precluding integration with prevalent DOA
estimation techniques.
B. Framework of the proposed method
In pursuit of the flexibility and advantages of ad-hoc arrays, this paper introduces a deep-learning-based 2D speaker localization method leveraging large-scale ad-hoc microphone arrays. The framework of the proposed method is shown in Fig. 1. Specifically, it comprises a feature extraction module, a DOA estimation module, a node selection algorithm, and a triangulation and clustering method. The DOA estimation module provides speaker directions. The node selection algorithm selects ad-hoc nodes that yield highly reliable DOA estimates. The triangulation module yields a rough 2D speaker location from any two randomly selected ad-hoc nodes. Finally, the clustering algorithm clusters all rough speaker locations and takes the cluster center as the final, accurate speaker location.
C. Goals and contributions
The novelty and contributions of the proposed method lie in the following:
• We have proposed a stage-wise deep-learning-based 2D sound source localization method, described in Section I-B. It does not require the ad-hoc nodes to be at fixed positions. It is a stage-wise framework, which is flexible in incorporating many advanced techniques in DOA estimation, node selection strategies, and clustering. Finally, it bridges the gap between conventional signal processing methods and recent deep learning methods.
• We have employed an advanced classification-based DOA estimation algorithm that is free of quantization errors. The backbone network is a CNN, where a mask layer is used to enhance the robustness of the DOA estimation. Furthermore, to improve the accuracy of the DOA estimation of the CNN-based classification model, we incorporate a quantization-error-free soft label encoding and decoding strategy.
• We have recorded a real-world dataset named Libri-adhoc-nodes10. It is a 432-hour collection of replayed speech from the “test-clean” subset of the Librispeech corpus [32], where an ad-hoc microphone array with 10 nodes was placed in an office and a conference room, respectively. Each node is a linear array of four microphones. For each room, 4 array configurations with 10 distinct speaker positions per configuration were designed.
Experimental results on both simulated data and real-world
data demonstrate the superiority of the proposed method
over existing approaches. Moreover, the models trained on
simulated data perform well on real-world test data.
This paper is organized as follows. Section II describes
the DOA estimation algorithm based on CNN at each single
ad-hoc node. Section III describes how to integrate the DOA estimates of all ad-hoc nodes into a 2D position estimate. Section IV describes the collected Libri-adhoc-nodes10 dataset. Section V demonstrates the advantages
of the proposed method on both simulated and real-world
data. Section VI discusses some limitations of the proposed
method both theoretically and empirically. Finally, Section VII
concludes our findings.
II. CNN-BASED DOA ESTIMATION AT EACH SINGLE
AD-HOC NODE
In this section, we first describe the CNN backbone net-
works in Section II-A. Then, we discuss the permutation
ambiguity problem of multi-source localization training in
Section II-B. Section II-C introduces our solution, named
unbiased label distribution encoding, to the quantization error
problem. Finally, in Section II-D, we describe the soft decod-
ing, which transforms the DNN output to a DOA estimate.
TABLE I: Architecture of the CNN-MLC [13].

Layer name | Structure          | Output size
-----------|--------------------|------------
Input      | —                  | 1×4×256
Conv-1     | 2×1, Stride=(1, 1) | 4×3×256
Conv-2     | 2×3, Stride=(1, 1) | 16×2×256
Conv-3     | 2×3, Stride=(1, 1) | 32×1×256
Flatten    | —                  | 8192
Linear-1   | —                  | 512
Linear-2   | —                  | 512
Linear-3   | —                  | L+1
A. Backbone networks
This subsection describes two backbone networks for the
multi-speaker DOA estimation problem. The first one is a
modified classic CNN-based multi-label classification (CNN-
MLC) network [13]. The second one is a recent CNN-based
masking (CNN-Mask) network [33].
1) CNN-MLC: Consider a room with an ad-hoc microphone array of $N$ nodes and $B$ speakers, where each node comprises a conventional array of $M$ microphones.

The short-time Fourier transform (STFT) of a speech recording at the $i$-th microphone of an ad-hoc node is $Y_i(t,f) = A_i(t,f)e^{j\phi_i(t,f)}$, where $Y_i(t,f)$ is the STFT at the $t$-th frame and $f$-th frequency bin, $A_i(t,f)$ and $\phi_i(t,f)$ are the magnitude and phase components of the STFT respectively, $i \in \{1, \ldots, M\}$, $t \in \{1, \ldots, T\}$, and $f \in \{1, \ldots, F\}$, where $T$ is the number of time frames of the recording and $F$ is the number of frequency bins. We group the phase spectrograms of all microphones of the node into an $M \times F \times T$ matrix, denoted as $\boldsymbol{\Phi}$, i.e., $\boldsymbol{\Phi} = [\phi_i(t,f)]_{i,f,t} \in \mathbb{R}^{M \times F \times T}$.
The deep-learning-based DOA estimation is formulated as a classification problem over $L+1$ azimuth angles spanning the full azimuth range, where the multi-speaker DOA problem is formulated as a multi-label classification (MLC) problem [13]. Table I describes the architecture of the CNN-MLC, which is a modified version of [13].

After feeding $\boldsymbol{\Phi}$ into the CNN-MLC, the predicted distribution of the CNN-MLC can be represented as $\hat{\boldsymbol{\rho}} \in [0,1]^{L+1}$, where $\hat{\rho}_l$ denotes the probability of the speaker being in the $l$-th azimuth class of the DOA, $l \in \{0, \ldots, L\}$. This can be formulated as:

$$\hat{\boldsymbol{\rho}} = \mathrm{CNN}(\boldsymbol{\Phi}) \tag{1}$$

where the $B$ classes with the highest probabilities are the estimated DOAs of the $B$ speakers.
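As a concrete reference, the following is a minimal PyTorch sketch of the CNN-MLC backbone following the layer sizes in Table I. The ReLU activations, zero-padding along the frequency axis, and the per-frame input layout are assumptions chosen so that the output sizes in the table are reproduced.

```python
import torch
import torch.nn as nn

class CNNMLC(nn.Module):
    """Sketch of the CNN-MLC backbone of Table I; num_classes = L + 1."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            # Input: (batch, 1, M=4 microphones, F=256 frequency bins)
            nn.Conv2d(1, 4, kernel_size=(2, 1)), nn.ReLU(),                    # -> 4x3x256
            nn.Conv2d(4, 16, kernel_size=(2, 3), padding=(0, 1)), nn.ReLU(),   # -> 16x2x256
            nn.Conv2d(16, 32, kernel_size=(2, 3), padding=(0, 1)), nn.ReLU(),  # -> 32x1x256
            nn.Flatten(),                                                      # -> 8192
            nn.Linear(8192, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_classes), nn.Sigmoid(),  # multi-label azimuth posteriors
        )

    def forward(self, phase_map: torch.Tensor) -> torch.Tensor:
        return self.net(phase_map)
```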
2) CNN-Mask: Inspired by [33], we designed a mask layer for both single and multiple speakers, implemented with bidirectional long short-term memory (BiLSTM) layers. Table II outlines the CNN architecture incorporating the mask layer. We replaced the original second dense layer of CNN-MLC with $B$ parallel BiLSTM layers, which results in the CNN-Mask backbone network. The $B$ BiLSTM layers take the sigmoid function as their activations and are designed to learn $B$ ratio masks, as detailed below.

We denote the embedding features produced by the first dense layer as $\mathbf{E} \in \mathbb{R}^{T \times D}$, where $D$ is the embedding dimension, and $\mathbf{E}$ comprises features from direct sounds,
TABLE II: Architecture of the CNN-Mask [33].

Layer name | Structure          | Output size
-----------|--------------------|------------
Input      | —                  | 1×4×256
Conv-1     | 2×1, Stride=(1, 1) | 4×3×256
Conv-2     | 2×3, Stride=(1, 1) | 16×2×256
Conv-3     | 2×3, Stride=(1, 1) | 32×1×256
Flatten    | —                  | 8192
Linear-1   | —                  | 512
Mask       | —                  | 512
Linear-3   | —                  | L+1
reverberations, and noises. We aim to implicitly isolate the direct-sound features, which are represented as ratio masks:

$$\{\mathbf{W}_b\}_{b=1}^{B} = \mathrm{Sep}(\mathbf{E}) \tag{2}$$

where $\mathbf{W}_b \in [0,1]^{T \times D}$ represents the ratio mask for speaker $b$, and $\mathrm{Sep}(\cdot)$ denotes the mask layer. Consequently, the embedding feature of the direct sound of speaker $b$, denoted as $\mathbf{e}_b \in \mathbb{R}^{D}$, can be recovered by applying the mask $\mathbf{W}_b$ to $\mathbf{E}$ through element-wise multiplication:

$$\mathbf{e}_b = \frac{\sum_{t=1}^{T} \mathbf{W}_b \odot \mathbf{E}}{\sum_{t=1}^{T} \mathbf{W}_b} \tag{3}$$

which is further processed through a dense layer to derive the predicted distribution $\hat{\boldsymbol{\rho}}_b$ for speaker $b$:

$$\hat{\boldsymbol{\rho}}_b = \mathrm{Dense}(\mathbf{e}_b) \tag{4}$$

where $\mathrm{Dense}(\cdot)$ is composed of a linear layer with the softmax activation.
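Below is a minimal sketch of the mask layer $\mathrm{Sep}(\cdot)$ and the pooling of Eqs. (2)-(4). Only the tensor shapes follow the text; the hidden size of $D/2$ per BiLSTM direction and the default class count are assumptions.

```python
import torch
import torch.nn as nn

class MaskLayer(nn.Module):
    """Sketch of Sep(.) and Eqs. (2)-(4): B parallel BiLSTMs with sigmoid
    outputs learn the ratio masks, followed by masked pooling over time."""
    def __init__(self, dim: int = 512, num_speakers: int = 2, num_classes: int = 361):
        super().__init__()
        self.mask_nets = nn.ModuleList(
            nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
            for _ in range(num_speakers)
        )
        self.dense = nn.Linear(dim, num_classes)  # Dense(.) of Eq. (4)

    def forward(self, e: torch.Tensor) -> list[torch.Tensor]:
        # e: (batch, T, D) embedding features from the first dense layer
        rho_hats = []
        for lstm in self.mask_nets:
            w, _ = lstm(e)
            w = torch.sigmoid(w)                     # ratio mask W_b in [0,1]^{T x D}, Eq. (2)
            e_b = (w * e).sum(dim=1) / w.sum(dim=1)  # masked average over time, Eq. (3)
            rho_hats.append(torch.softmax(self.dense(e_b), dim=-1))  # Eq. (4)
        return rho_hats
```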
B. Permutation ambiguity
Training the CNN-Mask backbone network when $B > 1$
involves speaker separation, which encounters the permutation
ambiguity problem. This subsection describes two ways to
address the issue.
1) Permutation invariant training: Intuitively, this can be
addressed using permutation invariant training (PIT) [34]. We
briefly outline PIT as follows:
$$\mathcal{L}_{\mathrm{PIT}} = \min_{\psi \in \Psi} \sum_{b=1}^{B} \mathcal{L}\left(\hat{\boldsymbol{\rho}}_b, \boldsymbol{\rho}_{\psi(b)}\right) \tag{5}$$

where $\boldsymbol{\rho}_b$ denotes the label distribution for speaker $b$, $\mathcal{L}$ stands for a loss function, $\Psi$ is a set encompassing all permutations of the $B$ speakers, and $\psi$ represents an individual permutation, with $\psi(b)$ indicating the $b$-th speaker in the permutation $\psi$.
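A minimal sketch of Eq. (5) follows, assuming the per-speaker loss $\mathcal{L}$ is passed in as a callable (e.g., a cross-entropy or KL divergence between the predicted and label distributions):

```python
from itertools import permutations

import torch

def pit_loss(preds, labels, loss_fn):
    """Permutation invariant training, Eq. (5): evaluate the total loss under
    every permutation of the B label distributions and keep the minimum.
    preds, labels: lists of B tensors, each of shape (batch, L + 1)."""
    totals = [
        sum(loss_fn(pred, labels[b]) for pred, b in zip(preds, perm))
        for perm in permutations(range(len(labels)))
    ]
    return torch.stack(totals).min()
```

Since the number of permutations grows factorially with $B$, this exhaustive search is only practical for small speaker counts.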
2) Location-based training: An alternative training method to address the permutation ambiguity problem in multi-channel scenarios is location-based training (LBT) [3]. LBT arranges the $B$ speakers in the order of their DOAs. For example, for a linear array with an azimuth range of $[0^\circ, 180^\circ]$, suppose speaker 1 is located at the $30^\circ$ angle, speaker 2 at $90^\circ$, and speaker 3 at $60^\circ$; then the speaker sequence becomes $\{1, 3, 2\}$, which overcomes the speaker ambiguity problem. A minimal sketch of this ordering follows.
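Here is a minimal sketch of the LBT ordering, under the assumption that labels are assigned in ascending azimuth order:

```python
def lbt_order(doas: list[float]) -> list[int]:
    """Sort speaker indices by DOA so that labels are always assigned in
    ascending azimuth order, removing the permutation ambiguity."""
    return sorted(range(len(doas)), key=lambda b: doas[b])

# Speakers at 30, 90, and 60 degrees -> order [0, 2, 1],
# i.e. the speaker sequence {1, 3, 2} from the example above.
print(lbt_order([30.0, 90.0, 60.0]))
```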
In this paper, we apply LBT to the model training of CNN-
Mask. Specifically, suppose the original label list is denoted