Deep Learning Based Stage-wise
Two-dimensional Speaker Localization
with Large Ad-hoc Microphone Arrays
Shupei Liu†, Linfeng Feng†, Yijun Gong, Chengdong Liang, Chen Zhang,
Xiao-Lei Zhang, Senior Member, IEEE, and Xuelong Li, Fellow, IEEE
Abstract—While deep-learning-based speaker localization has
shown advantages in challenging acoustic environments, it often
yields only direction-of-arrival (DOA) cues rather than precise
two-dimensional (2D) coordinates. To address this, we propose a
novel deep-learning-based 2D speaker localization method lever-
aging ad-hoc microphone arrays, where an ad-hoc microphone
array is composed of randomly distributed microphone nodes,
each of which is equipped with a traditional array. Specifically,
we first employ convolutional neural networks at each node
to estimate speaker directions. Then, we integrate these DOA
estimates using triangulation and clustering techniques to get 2D
speaker locations. To further boost the estimation accuracy, we
introduce a node selection algorithm that retains only the
most reliable nodes. Extensive experiments on both simulated
and real-world data demonstrate that our approach significantly
outperforms conventional methods. The proposed node selection
further refines performance. The real-world dataset used in the
experiments, named Libri-adhoc-nodes10, is a newly recorded
corpus described for the first time in this paper. It is available
at https://github.com/Liu-sp/Libri-adhoc-nodes10.
Index Terms—Two-dimensional speaker localization, ad-hoc
microphone array, deep learning, triangulation, clustering.
I. INTRODUCTION
Speaker localization aims to estimate speaker positions
using speech signals recorded by microphones. It finds
wide applications in sound event detection and localization
[1], speaker separation [2]–[4] and diarization [5]–[7], etc.
A. Motivation and challenges
Speaker localization in adverse acoustic environments with
strong reverberation and noise interference is challenging.
Conventional speaker localization requires obtaining the di-
rections of speech sources, also known as direction-of-arrival
(DOA) estimation. Representative methods include multiple
signal classification (MUSIC) [8] and steered response power
with phase transform (SRP-PHAT) [9].
Shupei Liu and Linfeng Feng contributed equally to this work.
Xiao-Lei Zhang is the corresponding author.
Shupei Liu, Linfeng Feng, Yijun Gong, Chengdong Liang, Chen
Zhang and Xiao-Lei Zhang are with the School of Marine Science and
Technology, Northwestern Polytechnical University, Xi’an 710072, China
(e-mail: shupei.liu@mail.nwpu.edu.cn; fenglinfeng@mail.nwpu.edu.cn;
gongyj@mail.nwpu.edu.cn; liangchengdong@mail.nwpu.edu.cn;
chen7zhang@mail.nwpu.edu.cn; xiaolei.zhang@nwpu.edu.cn).
Xuelong Li is with the Institute of Artificial Intelligence (TeleAI), China
Telecom Corp Ltd, 31 Jinrong Street, Beijing 100033, P. R. China (e-mail:
li@nwpu.edu.cn).
Recently, with the rapid development of deep-learning-
based speech separation and enhancement [10], deep-learning-
based DOA estimation has received increasing attention [11]–
[19]. Some methods utilize deep models to estimate noise-
robust variables that are then fed into conventional DOA
estimators [12]. Other methods formulate DOA estimation
as a classification problem of azimuth classes [11]. Spatial
acoustic features like generalized cross correlation [11], phase
spectrograms [13], spatial pseudo-spectrum [14], and circu-
lar harmonic features [19] are frequently extracted as input
to deep models. Convolutional neural networks (CNNs) are
popular in the study of DOA estimation [13], [14], [16],
[18]. Along these directions, many generalization issues
have been explored [14]–[17], with improved performance
over conventional approaches.
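To make one of the spatial features above concrete, the following is a minimal NumPy sketch of generalized cross-correlation with phase transform (GCC-PHAT), written as a generic textbook implementation; the function name and the two-channel toy signal are our own illustration, not necessarily the exact variant used in the cited works:

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000, max_tau=None):
    """GCC-PHAT between two channels.

    Returns the estimated time delay (in seconds) of `sig` relative
    to `ref`, plus the cross-correlation curve that is often used as
    an input feature for DOA-estimation networks.
    """
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                  # phase transform: whiten magnitudes
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center so index `max_shift` corresponds to zero delay.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau, cc

# Toy example: the first channel lags the second by 5 samples.
np.random.seed(0)
fs = 16000
x = np.random.randn(1024)
y = np.concatenate((np.zeros(5), x))[:1024]
tau, cc = gcc_phat(y, x, fs=fs)
```

The phase transform discards magnitude information and keeps only inter-channel phase, which is what makes the resulting delay estimate comparatively robust to reverberation.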
However, in many applications, obtaining a speaker's two-
dimensional (2D) or three-dimensional coordinates is more helpful
than merely obtaining the DOA. Ad-hoc microphone arrays
may be able to address the problem. An ad-hoc microphone
array is a group of randomly distributed cooperative micro-
phone nodes, each of which contains a traditional microphone
array, such as a uniform linear array. The advantages of ad-
hoc microphone arrays are that (i) they can be deployed easily
and widely in the real world by organizing online devices,
and (ii) they can reduce the probability of far-field speech
signal processing [20]. For the sound-source localization
problem, whether an ad-hoc array can substantially outperform
traditional fixed arrays by exploiting a large number of nodes,
as suggested by prior investigations such as [21]–[23], needs
deep investigation.
Conventional sound source localization approaches based on
ad-hoc microphone arrays primarily employ signal processing
methods, as described in [24]. Recent progress in speaker
localization leverages deep learning in conjunction with dis-
tributed microphone nodes [25]–[31], which is the focus of this
paper. For example, [26] utilizes multiple deep-learning-based
nodes to directly predict 2D speaker coordinates. Alternatively,
[25], [27], [30], [31] formulate indoor localization as a spa-
tial grid classification problem. [28] derives 2D coordinates
by triangulating the DOA estimates of two distributed nodes.
[29] feeds the DOA estimates from each node into a deep
neural network (DNN) to obtain the final speaker location.
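To illustrate the triangulation step shared by these methods and by our approach, the following is a minimal least-squares ray-intersection sketch. The `triangulate` helper is our own illustration, not the implementation of [28], and azimuths are assumed to be measured counter-clockwise from the positive x-axis:

```python
import numpy as np

def triangulate(node_pos, azimuths_deg):
    """Least-squares intersection of DOA rays from multiple nodes.

    node_pos:     (N, 2) array of node coordinates.
    azimuths_deg: N azimuth estimates in degrees.
    Returns the 2D point minimizing the sum of squared perpendicular
    distances to all rays.
    """
    node_pos = np.asarray(node_pos, dtype=float)
    theta = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    d = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # unit directions
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, di in zip(node_pos, d):
        P = np.eye(2) - np.outer(di, di)  # projector onto the ray's normal
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

# Two nodes at (0, 0) and (4, 0) both observing a source at (2, 3).
nodes = [(0.0, 0.0), (4.0, 0.0)]
angles = [np.degrees(np.arctan2(3.0, 2.0)),
          np.degrees(np.arctan2(3.0, -2.0))]
src = triangulate(nodes, angles)
```

With more than two nodes the system is overdetermined, and the least-squares solution averages out moderately noisy DOA estimates; clustering the pairwise ray intersections is an alternative way to reject outlier nodes.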
While these pioneering works highlight the potential of
deep learning techniques, their investigations have been lim-
ited to small numbers of nodes (e.g., two nodes) and additional
arXiv:2210.10265v2 [eess.AS] 1 Apr 2024