Deep Learning Based Stage-wise
Two-dimensional Speaker Localization
with Large Ad-hoc Microphone Arrays
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen Zhang,
Xiao-Lei Zhang, Senior Member, IEEE, and Xuelong Li, Fellow, IEEE
Abstract—While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays, where an ad-hoc microphone array is composed of randomly distributed microphone nodes, each of which is equipped with a traditional array. Specifically, we first employ convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to obtain 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods, and the proposed node selection further refines performance. The real-world dataset used in the experiments, named Libri-adhoc-nodes10, is newly recorded and described for the first time in this paper; it is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.
Index Terms—Two-dimensional speaker localization, ad-hoc
microphone array, deep learning, triangulation, clustering.
I. INTRODUCTION
Speaker localization aims to estimate speaker positions using speech signals recorded by microphones. It finds
wide applications in sound event detection and localization
[1], speaker separation [2]–[4] and diarization [5]–[7], etc.
A. Motivation and challenges
Speaker localization in adverse acoustic environments with
strong reverberation and noise interference is challenging.
Conventional speaker localization requires obtaining the di-
rections of speech sources, also known as direction-of-arrival
(DOA) estimation. Representative methods include multiple
signal classification (MUSIC) [8] and steered response power
with phase transform (SRP-PHAT) [9].
Shupei Liu and Linfeng Feng contributed equally to this work.
Xiao-Lei Zhang is the corresponding author.
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen
Zhang and Xiao-Lei Zhang are with the School of Marine Science and
Technology, Northwestern Polytechnical University, Xi’an 710072, China
(e-mail: shupei.liu@mail.nwpu.edu.cn; fenglinfeng@mail.nwpu.edu.cn;
gongyj@mail.nwpu.edu.cn; liangchengdong@mail.nwpu.edu.cn;
chen7zhang@mail.nwpu.edu.cn; xiaolei.zhang@nwpu.edu.cn).
Xuelong Li is with the Institute of Artificial Intelligence (TeleAI), China
Telecom Corp Ltd, 31 Jinrong Street, Beijing 100033, P. R. China (e-mail:
li@nwpu.edu.cn).
Recently, with the rapid development of deep-learning-
based speech separation and enhancement [10], deep-learning-
based DOA estimation has received increasing attention [11]–
[19]. Some methods utilize deep models to estimate noise-
robust variables that are then fed into conventional DOA
estimators [12]. Other methods formulate DOA estimation
as a classification problem of azimuth classes [11]. Spatial
acoustic features like generalized cross correlation [11], phase
spectrograms [13], spatial pseudo-spectrum [14], and circu-
lar harmonic features [19] are frequently extracted as input
to deep models. Convolutional neural networks (CNNs) are popular in the study of DOA estimation [13], [14], [16], [18]. Following the above directions, many generalized problem settings have been explored [14]–[17], exhibiting improved performance over conventional approaches.
However, in many applications, obtaining a speaker’s 2-
dimensional (2D) or 3-dimensional coordinate is more helpful
than merely obtaining the DOA. Ad-hoc microphone arrays
may be able to address the problem. An ad-hoc microphone
array is a group of randomly distributed cooperative micro-
phone nodes, each of which contains a traditional microphone
array, like a uniform linear array. The advantages of ad-hoc microphone arrays are that (i) they can be easily deployed and spread widely in the real world by organizing online devices, and (ii) they can reduce the occurrence probability of far-field speech signal processing [20]. As for the sound-source localization problem, analogous to prior investigations such as [21]–[23], whether an ad-hoc array can substantially outperform traditional fixed arrays by exploiting a large number of nodes needs deeper investigation.
Conventional sound source localization approaches based on
ad-hoc microphone arrays primarily employ signal processing
methods, as described in [24]. Recent progress in speaker
localization leverages deep learning in conjunction with dis-
tributed microphone nodes [25]–[31], which is the focus of this
paper. For example, [26] utilizes multiple deep-learning-based nodes to directly predict 2D speaker coordinates. Alternatively,
[25], [27], [30], [31] formulate indoor localization as a spa-
tial grid classification problem. [28] derives 2D coordinates
through triangulation of two distributed nodes. [29] feeds DOA
estimates from each node into a deep neural network (DNN)
to obtain the final speaker location.
While these pioneering works highlight the potential of deep learning techniques, their investigations have been limited to small numbers of nodes (e.g., two) and additional
[Fig. 1: Diagram of the proposed 2-dimensional speaker localization method based on deep learning. Each of the N microphone arrays produces a phase map that is fed to a convolutional neural network; the pipeline then runs feature extraction and DOA estimation, node selection, triangulation, and clustering to output the 2D estimate $(\hat{x}, \hat{y})$.]
constraints. These constraints include fixed node position-
ing for both training and testing in the same room [25]–
[27], [29], [31]. Alternatively, [28] mandates identical spatial
node patterns for training and testing, making it difficult
to maximize the flexibility of ad-hoc arrays. Additionally,
[30] is tailored for scenarios where each node consists of a
single microphone, precluding integration with prevalent DOA
estimation techniques.
B. Framework of the proposed method
In pursuit of the flexibility and advantages of ad-hoc arrays, this paper introduces a deep-learning-based 2D speaker localization method leveraging large-scale ad-hoc microphone arrays. The framework of the proposed method is shown in Fig. 1. Specifically, it comprises a feature extraction module, a DOA estimation module, a node selection algorithm, and a triangulation and clustering method. The DOA estimation module provides speaker directions. The node selection algorithm selects ad-hoc nodes that yield highly reliable DOA estimates. The triangulation module yields a rough 2D speaker location from any two randomly selected ad-hoc nodes. Finally, the clustering algorithm clusters all rough speaker locations and takes the cluster center as the final, accurate speaker location.
C. Goals and contributions
The novelty and contributions of the proposed method lie in the following:
• We have proposed a stage-wise deep-learning-based 2D sound source localization method, described in Section I-B. It does not require the ad-hoc nodes to be at fixed positions. It is a stage-wise framework, which is flexible in incorporating many advanced techniques in DOA estimation, node selection strategies, and clustering. Finally, it bridges the gap between conventional signal processing methods and recent deep learning methods.
• We have employed an advanced classification-based DOA estimation algorithm that is free of quantization errors. The backbone network is a CNN, where a mask layer is used to enhance the robustness of the DOA estimation. Furthermore, to improve the accuracy of the DOA estimation of the CNN-based classification model, we incorporate a quantization-error-free soft label encoding and decoding strategy.
• We have recorded a real-world dataset named Libri-adhoc-nodes10. It is a 432-hour collection of replayed speech from the “test-clean” subset of the Librispeech corpus [32], where an ad-hoc microphone array with 10 nodes was placed in an office and a conference room, respectively. Each node is a linear array of four microphones. For each room, 4 array configurations with 10 distinct speaker positions per configuration were designed.
Experimental results on both simulated data and real-world
data demonstrate the superiority of the proposed method
over existing approaches. Moreover, the models trained on
simulated data perform well on real-world test data.
This paper is organized as follows. Section II describes
the DOA estimation algorithm based on CNN at each single
ad-hoc node. Section III describes how to integrate the DOA estimates of all ad-hoc nodes into a 2D position estimate. Section IV describes the collected Libri-adhoc-nodes10 dataset. Section V demonstrates the advantages
of the proposed method on both simulated and real-world
data. Section VI discusses some limitations of the proposed
method both theoretically and empirically. Finally, Section VII
concludes our findings.
II. CNN-BASED DOA ESTIMATION AT EACH SINGLE
AD-HOC NODE
In this section, we first describe the CNN backbone net-
works in Section II-A. Then, we discuss the permutation
ambiguity problem of multi-source localization training in
Section II-B. Section II-C introduces our solution, named
unbiased label distribution encoding, to the quantization error
problem. Finally, in Section II-D, we describe the soft decod-
ing, which transforms the DNN output to a DOA estimate.
TABLE I: Architecture of the CNN-MLC [13].

Layer name | Structure          | Output size
-----------|--------------------|------------
Input      | —                  | 1×4×256
Conv-1     | 2×1, Stride=(1, 1) | 4×3×256
Conv-2     | 2×3, Stride=(1, 1) | 16×2×256
Conv-3     | 2×3, Stride=(1, 1) | 32×1×256
Flatten    | —                  | 8192
Linear-1   | —                  | 512
Linear-2   | —                  | 512
Linear-3   | —                  | L+1
A. Backbone networks
This subsection describes two backbone networks for the
multi-speaker DOA estimation problem. The first one is a
modified classic CNN-based multi-label classification (CNN-
MLC) network [13]. The second one is a recent CNN-based
masking (CNN-Mask) network [33].
1) CNN-MLC: Consider a room with an ad-hoc microphone array of $N$ nodes and $B$ speakers, where each node comprises a conventional array of $M$ microphones.

The short-time Fourier transform (STFT) of a speech recording at the $i$-th microphone of an ad-hoc node is $Y_i(t,f) = A_i(t,f)e^{j\phi_i(t,f)}$, where $Y_i(t,f)$ is the STFT at the $t$-th frame and $f$-th frequency bin, $A_i(t,f)$ and $\phi_i(t,f)$ are the magnitude and phase components of the STFT respectively, $i \in \{1, \ldots, M\}$, $t \in \{1, \ldots, T\}$, and $f \in \{1, \ldots, F\}$, where $T$ is the number of time frames of the recording and $F$ is the number of frequency bins. We group the phase spectrograms of all microphones of the node into an $M \times F \times T$ matrix, denoted as $\boldsymbol{\Phi}$, i.e., $\boldsymbol{\Phi} = [\phi_i(t,f)]_{i,f,t} \in \mathbb{R}^{M \times F \times T}$.
The deep-learning-based DOA estimation is formulated as a classification problem over $L+1$ azimuth angles spanning the full azimuth range, where the multi-speaker DOA problem is formulated as a multi-label classification (MLC) problem [13]. Table I describes the architecture of the CNN-MLC, which is a modified version of [13].

After feeding $\boldsymbol{\Phi}$ into the CNN-MLC, the predicted distribution of the CNN-MLC can be represented as $\hat{\boldsymbol{\rho}} \in [0,1]^{L+1}$, where $\hat{\rho}_l$ denotes the probability of the speaker being in the $l$-th azimuth class of the DOA, $l \in \{0, \ldots, L\}$. This can be formulated as:

$$\hat{\boldsymbol{\rho}} = \mathrm{CNN}(\boldsymbol{\Phi}) \tag{1}$$

where the $B$ classes with the highest probabilities are the estimated DOAs of the $B$ speakers.
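As a concrete reference, the following is a minimal PyTorch sketch of the CNN-MLC backbone following the layer sizes in Table I. The ReLU activations, zero-padding along the frequency axis, and the per-frame input layout are assumptions chosen so that the output sizes in the table are reproduced.

```python
import torch
import torch.nn as nn

class CNNMLC(nn.Module):
    """Sketch of the CNN-MLC backbone of Table I; num_classes = L + 1."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            # Input: (batch, 1, M=4 microphones, F=256 frequency bins)
            nn.Conv2d(1, 4, kernel_size=(2, 1)), nn.ReLU(),                    # -> 4x3x256
            nn.Conv2d(4, 16, kernel_size=(2, 3), padding=(0, 1)), nn.ReLU(),   # -> 16x2x256
            nn.Conv2d(16, 32, kernel_size=(2, 3), padding=(0, 1)), nn.ReLU(),  # -> 32x1x256
            nn.Flatten(),                                                      # -> 8192
            nn.Linear(8192, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_classes), nn.Sigmoid(),  # multi-label azimuth posteriors
        )

    def forward(self, phase_map: torch.Tensor) -> torch.Tensor:
        return self.net(phase_map)
```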
2) CNN-Mask: Inspired by [33], we designed a mask layer for both single and multiple speakers, implemented with bidirectional long short-term memory (BiLSTM) layers. Table II outlines the CNN architecture incorporating the mask layer. We replaced the original second dense layer of CNN-MLC with $B$ parallel BiLSTM layers, which results in the CNN-Mask backbone network. The $B$ BiLSTM layers take the sigmoid function as their activations and are designed to learn $B$ ratio masks, as detailed below.

We denote the embedding features produced by the first dense layer as $\mathbf{E} \in \mathbb{R}^{T \times D}$, where $D$ is the embedding dimension, and $\mathbf{E}$ comprises features from direct sounds,
TABLE II: Architecture of the CNN-Mask [33].

Layer name | Structure          | Output size
-----------|--------------------|------------
Input      | —                  | 1×4×256
Conv-1     | 2×1, Stride=(1, 1) | 4×3×256
Conv-2     | 2×3, Stride=(1, 1) | 16×2×256
Conv-3     | 2×3, Stride=(1, 1) | 32×1×256
Flatten    | —                  | 8192
Linear-1   | —                  | 512
Mask       | —                  | 512
Linear-3   | —                  | L+1
reverberations, and noises. We aim to implicitly isolate the direct-sound features, which are represented as ratio masks:

$$\{\mathbf{W}_b\}_{b=1}^{B} = \mathrm{Sep}(\mathbf{E}) \tag{2}$$

where $\mathbf{W}_b \in [0,1]^{T \times D}$ represents the ratio mask for speaker $b$, and $\mathrm{Sep}(\cdot)$ denotes the mask layer. Consequently, the embedding feature of the direct sound of speaker $b$, denoted as $\mathbf{e}_b \in \mathbb{R}^{D}$, can be recovered by applying the mask $\mathbf{W}_b$ to $\mathbf{E}$ through element-wise multiplication:

$$\mathbf{e}_b = \frac{\sum_{t=1}^{T} \mathbf{W}_b \odot \mathbf{E}}{\sum_{t=1}^{T} \mathbf{W}_b} \tag{3}$$

which is further processed through a dense layer to derive the predicted distribution $\hat{\boldsymbol{\rho}}_b$ for speaker $b$:

$$\hat{\boldsymbol{\rho}}_b = \mathrm{Dense}(\mathbf{e}_b) \tag{4}$$

where $\mathrm{Dense}(\cdot)$ is composed of a linear layer with the softmax activation.
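Below is a minimal sketch of the mask layer $\mathrm{Sep}(\cdot)$ and the pooling of Eqs. (2)-(4). Only the tensor shapes follow the text; the hidden size of $D/2$ per BiLSTM direction and the default class count are assumptions.

```python
import torch
import torch.nn as nn

class MaskLayer(nn.Module):
    """Sketch of Sep(.) and Eqs. (2)-(4): B parallel BiLSTMs with sigmoid
    outputs learn the ratio masks, followed by masked pooling over time."""
    def __init__(self, dim: int = 512, num_speakers: int = 2, num_classes: int = 361):
        super().__init__()
        self.mask_nets = nn.ModuleList(
            nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
            for _ in range(num_speakers)
        )
        self.dense = nn.Linear(dim, num_classes)  # Dense(.) of Eq. (4)

    def forward(self, e: torch.Tensor) -> list[torch.Tensor]:
        # e: (batch, T, D) embedding features from the first dense layer
        rho_hats = []
        for lstm in self.mask_nets:
            w, _ = lstm(e)
            w = torch.sigmoid(w)                     # ratio mask W_b in [0,1]^{T x D}, Eq. (2)
            e_b = (w * e).sum(dim=1) / w.sum(dim=1)  # masked average over time, Eq. (3)
            rho_hats.append(torch.softmax(self.dense(e_b), dim=-1))  # Eq. (4)
        return rho_hats
```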
B. Permutation ambiguity
Training the CNN-Mask backbone network when $B > 1$
involves speaker separation, which encounters the permutation
ambiguity problem. This subsection describes two ways to
address the issue.
1) Permutation invariant training: Intuitively, this can be
addressed using permutation invariant training (PIT) [34]. We
briefly outline PIT as follows:
$$\mathcal{L}_{\mathrm{PIT}} = \min_{\psi \in \Psi} \sum_{b=1}^{B} \mathcal{L}\left(\hat{\boldsymbol{\rho}}_b, \boldsymbol{\rho}_{\psi(b)}\right) \tag{5}$$

where $\boldsymbol{\rho}_b$ denotes the label distribution for speaker $b$, $\mathcal{L}$ stands for a loss function, $\Psi$ is a set encompassing all permutations of the $B$ speakers, and $\psi$ represents an individual permutation, with $\psi(b)$ indicating the $b$-th speaker in the permutation $\psi$.
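A minimal sketch of Eq. (5) follows, assuming the per-speaker loss $\mathcal{L}$ is passed in as a callable (e.g., a cross-entropy or KL divergence between the predicted and label distributions):

```python
from itertools import permutations

import torch

def pit_loss(preds, labels, loss_fn):
    """Permutation invariant training, Eq. (5): evaluate the total loss under
    every permutation of the B label distributions and keep the minimum.
    preds, labels: lists of B tensors, each of shape (batch, L + 1)."""
    totals = [
        sum(loss_fn(pred, labels[b]) for pred, b in zip(preds, perm))
        for perm in permutations(range(len(labels)))
    ]
    return torch.stack(totals).min()
```

Since the number of permutations grows factorially with $B$, this exhaustive search is only practical for small speaker counts.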
2) Location-based training: An alternative training method to address the permutation ambiguity problem in multi-channel scenarios is location-based training (LBT) [3]. LBT arranges the $B$ speakers in the order of their DOAs. For example, for a linear array with an azimuth range of $[0^\circ, 180^\circ]$, suppose speaker 1 is located at the $30^\circ$ angle, speaker 2 at $90^\circ$, and speaker 3 at $60^\circ$; then the speaker sequence becomes $\{1, 3, 2\}$, which overcomes the speaker ambiguity problem. A minimal sketch of this ordering follows.
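Here is a minimal sketch of the LBT ordering, under the assumption that labels are assigned in ascending azimuth order:

```python
def lbt_order(doas: list[float]) -> list[int]:
    """Sort speaker indices by DOA so that labels are always assigned in
    ascending azimuth order, removing the permutation ambiguity."""
    return sorted(range(len(doas)), key=lambda b: doas[b])

# Speakers at 30, 90, and 60 degrees -> order [0, 2, 1],
# i.e. the speaker sequence {1, 3, 2} from the example above.
print(lbt_order([30.0, 90.0, 60.0]))
```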
In this paper, we apply LBT to the model training of CNN-
Mask. Specifically, suppose the original label list is denoted