Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using
Permutation-Free Loss Function
Qing Wang1, Hang Chen1, Ya Jiang1, Zhe Wang1, Yuyang Wang1, Jun Du1∗, Chin-Hui Lee2
1University of Science and Technology of China, Hefei, China
2Georgia Institute of Technology, Atlanta, GA, USA
jundu@ustc.edu.cn
Abstract
In this paper, we propose a deep learning based multi-speaker
direction of arrival (DOA) estimation method that uses audio and
visual signals together with a permutation-free loss function. We
first collect a dataset for multi-modal sound source localization
(SSL) in which both audio and visual signals are recorded in
real-life home TV scenarios. We then propose a novel spatial
annotation method that produces the ground-truth DOA of each
speaker from the video data by transforming between the camera
coordinate system and the pixel coordinate system according to
the pin-hole camera model. With the spatial location information
serving as an additional input alongside the acoustic features,
multi-speaker DOA estimation can be solved as a classification
task of active speaker detection. The label permutation problem
that arises in multi-speaker tasks is also avoided, since the
location of each speaker is given as input. Experiments conducted
on both simulated and real data show that the proposed
audio-visual DOA estimation model outperforms the audio-only
DOA estimation model by a large margin.
Index Terms: sound source localization, DOA estimation,
audio-visual fusion, pin-hole camera model
1. Introduction
Sound source localization (SSL) aims to estimate the positions
of one or more sound sources relative to the position of the
recording microphone array. In most cases, we are interested in
the direction of arrival (DOA) of each sound source, hence most
SSL methods focus on estimating azimuth and elevation angles.
Effective SSL is of great importance in many applications,
including automatic speech recognition (ASR) [6], tele-conferencing
[3], robot audition [1, 2], and hearing aids [4].
Most previous studies on SSL have considered the audio modality
alone. Conventional SSL methods, such as generalized
cross-correlation with phase transform (GCC-PHAT) [8],
steered response power with phase transform (SRP-PHAT) [9],
estimation of signal parameters via rotational invariance tech-
nique (ESPRIT) [10], and multiple signal classification (MU-
SIC) [11], were based on signal processing techniques and usu-
ally performed poorly in noisy and reverberant environments.
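As a concrete illustration of one of these conventional techniques, the following is a minimal GCC-PHAT sketch for a single microphone pair; it is our own example (the function name, the far-field azimuth conversion in the closing comment, and the numerical constants are assumptions for illustration), not code from the cited works.

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the time difference of arrival (TDOA) between two channels
    with GCC-PHAT: the cross-power spectrum is whitened by its magnitude so
    that only phase information drives the location of the correlation peak."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # TDOA in seconds

# Far-field conversion of TDOA to azimuth for a pair spaced d meters apart:
# theta = arcsin(clip(343.0 * tau / d, -1, 1)). In strong noise and
# reverberation the correlation peak is easily corrupted, which is the
# weakness that motivates the DNN-based methods discussed next.
```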
Deep neural network (DNN)-based SSL methods have been proposed
in recent years and have been shown to outperform conventional
SSL methods due to their strong regression capability [12].
Grumiaux et al. [7] provided a thorough survey of the deep
learning based audio SSL literature. The output
strategy for DOA estimation can be divided into two categories:
classification and regression. Convolutional recurrent neural
networks (CRNN) were proposed for DOA estimation of multi-
ple sources by using a classification strategy in [14, 15]. Some
other works [13, 16] tried to solve the SSL problem as a regres-
sion task by directly estimating either Cartesian coordinates or
spherical coordinates. Tang et al. [17] demonstrated that a regression
model achieved better performance than a classification model.
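To make the two output strategies concrete, the minimal sketch below contrasts a classification head over a discretized azimuth grid with a regression head that outputs a Cartesian direction vector; the feature dimension, the 10-degree grid, and the layer sizes are our own assumptions for illustration, not the configurations used in [13-17].

```python
import torch
import torch.nn as nn

class ClassificationDOAHead(nn.Module):
    """Posterior over a discretized azimuth grid (36 classes of 10 degrees);
    a multi-label sigmoid output allows several simultaneous sources."""
    def __init__(self, feat_dim=512, num_classes=36):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        return torch.sigmoid(self.fc(feat))

class RegressionDOAHead(nn.Module):
    """Directly regress a unit direction vector in Cartesian coordinates."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 3)

    def forward(self, feat):
        v = self.fc(feat)
        return v / (v.norm(dim=-1, keepdim=True) + 1e-8)  # project onto the unit sphere
```

Note that with multiple simultaneous sources, a regression head must still resolve the assignment between predicted and reference directions during training, which is one origin of the label permutation problem discussed below.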
By hearing and seeing, the human brain is able to perceive its
surroundings and extract complementary information. Intelligent
devices equipped with audio-visual sensors are expected to
achieve similar goals. The fusion of audio and video modalities has
shown promising results in many areas, e.g. acoustic scene clas-
sification [19], speech enhancement [18], and active speaker de-
tection [20]. The literature on audio-visual localization is sparse
compared to the large number of studies for sound source local-
ization [7]. Most of these works [22, 23, 21] mainly focused
on localizing sound sources in video clips rather than estimating
the DOA of sound sources. In [24], the authors first proposed a
DNN architecture for audio-visual multi-
speaker DOA estimation by simulating visual features. Promis-
ing results were observed in [24] when at most two speakers
existed; however, the performance of localizing more than two
speakers remained unknown. Berghi et al. [25] proposed a teacher-
student model to perform active speaker detection and local-
ization with the ‘teacher’ network generating pseudo-labels and
the ‘student’ network localizing speakers.
Most previous works only consider localizing one or two
concurrent speakers, and the existing audio-visual datasets
[26, 27, 25] are of limited size. In this paper, we propose a novel
audio-visual DOA estimation approach for the multi-speaker
scenario based on the MISP2021-AVSR corpus [28], a large-scale
audio-visual Chinese conversational corpus that contains 141
hours of audio and video data with up to six concurrent speakers.
To avoid the expensive and time-consuming cost of manual
annotation, we propose to produce the ground-truth DOA of each
speaker from the video data and camera calibration. We then solve
the multi-speaker DOA estimation problem as active speaker
detection, with the ground-truth DOA serving as a complementary
input to the acoustic features. The label permutation problem that
arises in multi-speaker tasks is thus avoided, since the location of
each speaker is given as input.
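To illustrate the idea, the following is a minimal sketch under our own assumptions (the feature dimensions, the sin/cos encoding of the candidate direction, and the network depth are illustrative, not the exact model of Section 2): the acoustic feature is paired with one speaker's video-derived location at a time, and the network only has to decide whether that speaker is currently active, so the output order is fixed by the input and no permutation-invariant loss is required.

```python
import torch
import torch.nn as nn

class ActiveSpeakerDOASketch(nn.Module):
    """Hedged sketch of the permutation-free formulation: the candidate DOA
    of each speaker (obtained from video) is an *input*, and the model emits
    a single active/inactive probability for that candidate."""
    def __init__(self, audio_dim=512, loc_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + loc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, audio_feat, candidate_doa):
        # audio_feat: (batch, audio_dim) shared acoustic embedding
        # candidate_doa: (batch, 2) = [sin(azimuth), cos(azimuth)] of one speaker
        x = torch.cat([audio_feat, candidate_doa], dim=-1)
        return torch.sigmoid(self.net(x))

# Usage sketch: query the same acoustic feature once per visible speaker with
# that speaker's video-derived azimuth; an ordinary per-speaker binary
# cross-entropy then replaces a permutation-invariant training objective.
```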
2. Proposed Method
In this section, we describe our proposed approach for multi-speaker
DOA estimation using audio and video data. The real data
are recorded in a home TV scenario with several people sitting
and chatting in Chinese. Since people remain seated in this
scenario, our study focuses on estimating the azimuth angle only.
We first introduce how to generate DOA labels from the video
clips, and then describe the proposed multi-modal DOA (MDOA)
and audio-only DOA (ADOA) estimation models for the
multi-speaker situation.
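As a rough preview of the label-generation idea (a hedged sketch, not the exact annotation pipeline): under the pin-hole camera model, a speaker detected at horizontal pixel coordinate u maps to an azimuth relative to the camera's optical axis as

```latex
% Minimal sketch of azimuth recovery under the pin-hole camera model.
% f_x: horizontal focal length in pixels, c_x: horizontal principal point,
% both from camera calibration; u: pixel column of the detected speaker.
\theta = \arctan\!\left(\frac{u - c_x}{f_x}\right)
```

Any fixed offset or rotation between the camera and the microphone array would require an additional correction on top of this sketch.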