Deep Learning Based Audio-Visual Multi-Speaker DOA Estimation Using
Permutation-Free Loss Function
Qing Wang1, Hang Chen1, Ya Jiang1, Zhe Wang1, Yuyang Wang1, Jun Du1∗, Chin-Hui Lee2
1University of Science and Technology of China, Hefei, China
2Georgia Institute of Technology, Atlanta, GA, USA
jundu@ustc.edu.cn
Abstract
In this paper, we propose a deep learning based multi-speaker
direction of arrival (DOA) estimation method that uses audio and
visual signals together with a permutation-free loss function. We
first collect a dataset for multi-modal sound source localization
(SSL) in which both audio and visual signals are recorded in
real-life home TV scenarios. We then propose a novel spatial
annotation method that produces the ground-truth DOA of each
speaker from the video data by transforming between the camera
coordinate system and the pixel coordinate system according to
the pin-hole camera model. With the spatial location information
serving as an additional input alongside the acoustic features,
multi-speaker DOA estimation can be solved as a classification
task of active speaker detection. The label permutation problem
that arises in multi-speaker tasks is also avoided, since the
location of each speaker is given as input. Experiments conducted
on both simulated and real data show that the proposed
audio-visual DOA estimation model outperforms the audio-only
DOA estimation model by a large margin.
Index Terms: sound source localization, DOA estimation,
audio-visual fusion, pin-hole camera model
1. Introduction
Sound source localization (SSL) aims to estimate the positions
of one or more sound sources relative to the position of the
recording microphone array. In most cases, we are interested in
the direction of arrival (DOA) of each sound source, hence most
SSL methods focus on estimating azimuth and elevation angles.
Effective SSL is of great importance in many applications,
including automatic speech recognition (ASR) [6], tele-conferencing
[3], robot audition [1, 2], and hearing aids [4].
Most previous studies on SSL have considered the audio modality
alone. Conventional SSL methods, such as generalized
cross-correlation with phase transform (GCC-PHAT) [8],
steered response power with phase transform (SRP-PHAT) [9],
estimation of signal parameters via rotational invariance tech-
nique (ESPRIT) [10], and multiple signal classification (MU-
SIC) [11], were based on signal processing techniques and usu-
ally performed poorly in noisy and reverberant environments.
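As a concrete illustration of one of these conventional techniques, the following is a minimal GCC-PHAT sketch for a single microphone pair; it is our own example (the function name, the far-field azimuth conversion in the closing comment, and the numerical constants are assumptions for illustration), not code from the cited works.

```python
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the time difference of arrival (TDOA) between two channels
    with GCC-PHAT: the cross-power spectrum is whitened by its magnitude so
    that only phase information drives the location of the correlation peak."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                      # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # TDOA in seconds

# Far-field conversion of TDOA to azimuth for a pair spaced d meters apart:
# theta = arcsin(clip(343.0 * tau / d, -1, 1)). In strong noise and
# reverberation the correlation peak is easily corrupted, which is the
# weakness that motivates the DNN-based methods discussed next.
```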
Deep neural network (DNN)-based SSL methods have been proposed
in recent years and have been shown to outperform conventional
SSL methods due to their strong regression capability [12].
Grumiaux et al. [7] provided a thorough survey of the deep
learning based audio SSL literature. The output
strategy for DOA estimation can be divided into two categories:
classification and regression. Convolutional recurrent neural
networks (CRNN) were proposed for DOA estimation of multi-
ple sources by using a classification strategy in [14, 15]. Some
other works [13, 16] tried to solve the SSL problem as a regres-
sion task by directly estimating either Cartesian coordinates or
spherical coordinates. Tang et al. [17] demonstrated that a regression
model achieved better performance than a classification model.
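To make the two output strategies concrete, the minimal sketch below contrasts a classification head over a discretized azimuth grid with a regression head that outputs a Cartesian direction vector; the feature dimension, the 10-degree grid, and the layer sizes are our own assumptions for illustration, not the configurations used in [13-17].

```python
import torch
import torch.nn as nn

class ClassificationDOAHead(nn.Module):
    """Posterior over a discretized azimuth grid (36 classes of 10 degrees);
    a multi-label sigmoid output allows several simultaneous sources."""
    def __init__(self, feat_dim=512, num_classes=36):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat):
        return torch.sigmoid(self.fc(feat))

class RegressionDOAHead(nn.Module):
    """Directly regress a unit direction vector in Cartesian coordinates."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 3)

    def forward(self, feat):
        v = self.fc(feat)
        return v / (v.norm(dim=-1, keepdim=True) + 1e-8)  # project onto the unit sphere
```

Note that with multiple simultaneous sources, a regression head must still resolve the assignment between predicted and reference directions during training, which is one origin of the label permutation problem discussed below.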
By hearing and seeing, the human brain is able to perceive its
surroundings and extract complementary information. Intelligent
devices equipped with audio-visual sensors are expected to
achieve similar goals. The fusion of audio and video modalities has
shown promising results in many areas, e.g. acoustic scene clas-
sification [19], speech enhancement [18], and active speaker de-
tection [20]. The literature on audio-visual localization is sparse
compared to the large number of studies for sound source local-
ization [7]. Most of these works [22, 23, 21] mainly focused
on localizing sound sources in video clips rather than estimating
the DOA of sound sources. In [24], the authors first proposed a
DNN architecture for audio-visual multi-
speaker DOA estimation by simulating visual features. Promis-
ing results were observed in [24] when at most two speakers
existed; however, the performance of localizing more than two
speakers remained unknown. Berghi et al. [25] proposed a teacher-
student model to perform active speaker detection and local-
ization with the ‘teacher’ network generating pseudo-labels and
the ‘student’ network localizing speakers.
Most previous works only consider localizing one or two
concurrent speakers, and the existing audio-visual datasets
[26, 27, 25] are of limited size. In this paper, we propose a novel
audio-visual DOA estimation approach for the multi-speaker
scenario based on the MISP2021-AVSR corpus [28], a large-scale
audio-visual Chinese conversational corpus that contains 141
hours of audio and video data with up to six concurrent speakers.
To avoid the expensive and time-consuming cost of manual
annotation, we propose to produce the ground-truth DOA of each
speaker from the video data and camera calibration. We then solve
the multi-speaker DOA estimation problem as active speaker
detection, with the ground-truth DOA serving as a complementary
input to the acoustic features. The label permutation problem that
arises in multi-speaker tasks is thus avoided, since the location of
each speaker is given as input.
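To illustrate the idea, the following is a minimal sketch under our own assumptions (the feature dimensions, the sin/cos encoding of the candidate direction, and the network depth are illustrative, not the exact model of Section 2): the acoustic feature is paired with one speaker's video-derived location at a time, and the network only has to decide whether that speaker is currently active, so the output order is fixed by the input and no permutation-invariant loss is required.

```python
import torch
import torch.nn as nn

class ActiveSpeakerDOASketch(nn.Module):
    """Hedged sketch of the permutation-free formulation: the candidate DOA
    of each speaker (obtained from video) is an *input*, and the model emits
    a single active/inactive probability for that candidate."""
    def __init__(self, audio_dim=512, loc_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + loc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, audio_feat, candidate_doa):
        # audio_feat: (batch, audio_dim) shared acoustic embedding
        # candidate_doa: (batch, 2) = [sin(azimuth), cos(azimuth)] of one speaker
        x = torch.cat([audio_feat, candidate_doa], dim=-1)
        return torch.sigmoid(self.net(x))

# Usage sketch: query the same acoustic feature once per visible speaker with
# that speaker's video-derived azimuth; an ordinary per-speaker binary
# cross-entropy then replaces a permutation-invariant training objective.
```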
2. Proposed Method
In this section, we describe our proposed approach for multi-speaker
DOA estimation using audio and video data. The real data
are recorded in a home TV scenario with several people sitting
and chatting in Chinese. Since people remain seated in this
scenario, our study focuses on estimating the azimuth angle only.
We first introduce how to generate DOA labels from the video
clips, and then describe the proposed multi-modal DOA (MDOA)
and audio-only DOA (ADOA) estimation models for the
multi-speaker situation.
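As a rough preview of the label-generation idea (a hedged sketch, not the exact annotation pipeline): under the pin-hole camera model, a speaker detected at horizontal pixel coordinate u maps to an azimuth relative to the camera's optical axis as

```latex
% Minimal sketch of azimuth recovery under the pin-hole camera model.
% f_x: horizontal focal length in pixels, c_x: horizontal principal point,
% both from camera calibration; u: pixel column of the detected speaker.
\theta = \arctan\!\left(\frac{u - c_x}{f_x}\right)
```

Any fixed offset or rotation between the camera and the microphone array would require an additional correction on top of this sketch.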