EXPLOITING SPATIAL INFORMATION WITH THE INFORMED COMPLEX-VALUED
SPATIAL AUTOENCODER FOR TARGET SPEAKER EXTRACTION
Annika Briegleb, Mhd Modar Halimeh*, Walter Kellermann
Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
{annika.briegleb, mhd.m.halimeh, walter.kellermann}@fau.de
ABSTRACT
In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, for neural spatial filtering it has been shown that joint spectro-spatial filtering is more beneficial. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) to the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is capable of learning pronounced spatial selectivity patterns, and we show that the results depend significantly on the training target and on the reference signal used when computing various evaluation metrics.
Index Terms— speaker extraction, spectro-spatial filtering, training targets, DNN
1. INTRODUCTION
While neural networks have represented the state of the art for single-channel audio signal enhancement for some time now, they have only recently moved into the focus of multichannel audio signal enhancement and, hence, spatial filtering. There have been several approaches to guide spatial filters, i.e., beamformers, by estimating intermediate quantities with neural networks [1–5]. Other approaches construct a beamformer by estimating its weights with a neural network [6–11], or replace the beamforming process altogether by a neural network that directly estimates the clean speech signal [12, 13]. The first approach stays within the conventional definitions of beamformers, whereas the second and third approaches exploit the nonlinear processing performed by the neural network and are denoted as neural spatial filters.
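To contrast the neural approaches above with the conventional definition of a beamformer, the following is a minimal sketch of a frequency-domain delay-and-sum beamformer, the simplest fixed spatial filter. The function name, array geometry, and sign convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def delay_and_sum_weights(mic_positions, doa_deg, freqs, c=343.0):
    """Frequency-domain delay-and-sum weights for a 2D far-field source.

    mic_positions: (M, 2) array of microphone coordinates in metres.
    doa_deg: direction of arrival in degrees (azimuth).
    freqs: (F,) array of frequencies in Hz.
    Returns a (F, M) array of complex beamformer weights.
    """
    doa = np.deg2rad(doa_deg)
    direction = np.array([np.cos(doa), np.sin(doa)])
    delays = mic_positions @ direction / c           # (M,) relative delays in s
    phases = 2j * np.pi * np.outer(freqs, delays)    # (F, M) phase corrections
    # Compensate the inter-channel delays, then average across channels.
    return np.exp(phases) / mic_positions.shape[0]

# Applying the weights to one multichannel STFT frame X of shape (F, M):
# Y = np.sum(delay_and_sum_weights(pos, 30.0, f_axis) * X, axis=-1)
```

Such weights are fixed per look direction; the neural spatial filters discussed in this paper instead produce time-varying, signal-dependent weights.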
In this paper, we focus on those neural spectro-spatial filters that estimate the beamformer weights by a neural network. Such neural spectro-spatial filters learn a spatially selective pattern for signal denoising in scenarios where only one speech source is active [6, 14]. In this contribution, we extend one such neural spectro-spatial filter, the Complex-valued Spatial Autoencoder (COSPA) [6], to the problem of target speaker extraction (TSE) from a mixture of speakers by informing it about the target speaker's direction of arrival (DoA) via a low-cost extension of the network (cf. Sec. 2.1.1). Furthermore,
[* M. M. Halimeh is now with the Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany. This work has been accepted to IEEE ICASSP 2023.]
we explicitly exploit the provided multichannel information by replacing two-dimensional (2D) with three-dimensional (3D) convolutional layers at the beginning of the network (cf. Sec. 2.1.2). We show that these extensions make it possible to identify the target speaker and enhance its signal in the presence of interfering speakers, rendering the proposed informed COSPA (iCOSPA) a flexible spatial filter. For reverberant scenarios, several options exist for the target signal used to train the neural filter. We examine how the spatial filtering capability is affected by different target signals, and also show how the evaluation metrics depend on the choice of reference, i.e., clean signal, used for their computation.
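One simple way to inform a network of the target speaker's DoA is to append a constant angle encoding to the input feature maps. The sketch below illustrates this idea only; the exact conditioning mechanism used by iCOSPA is described in Sec. 2.1.1, and the function name, tensor layout, and sin/cos encoding here are assumptions for illustration.

```python
import numpy as np

def append_doa_features(x, doa_deg):
    """Append a simple DoA encoding to multichannel TF features.

    x: real-valued feature tensor of shape (C, T, F), e.g. stacked
       real/imaginary parts of the microphone STFTs.
    doa_deg: target speaker's direction of arrival in degrees.
    Returns an array of shape (C + 2, T, F) with two constant feature
    maps holding cos(doa) and sin(doa), broadcast over time-frequency.
    """
    doa = np.deg2rad(doa_deg)
    c, t, f = x.shape
    cos_map = np.full((1, t, f), np.cos(doa), dtype=x.dtype)
    sin_map = np.full((1, t, f), np.sin(doa), dtype=x.dtype)
    return np.concatenate([x, cos_map, sin_map], axis=0)
```

Encoding the angle as a sin/cos pair avoids the discontinuity at 0°/360° that a raw angle value would introduce.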
We present the proposed method and discuss the training target
signal in Sec. 2. In Sec. 3.1, we detail our experimental setup and
present and discuss the corresponding results in Sec. 3.2. Sec. 4
concludes the paper.
2. PROVIDING SPATIAL INFORMATION FOR COSPA
In the following, we briefly introduce the COSPA framework and explain how it is modified to exploit spatial information for TSE (Sec. 2.1). A discussion of the spatial selectivity obtained with appropriate target signals for multichannel processing follows in Sec. 2.2.
2.1. Extension of COSPA
We consider a signal with M microphone channels, captured by an arbitrary microphone array, where the signal at microphone m in time-frequency bin $(\tau, f)$ is given by

$$X_m(\tau, f) = D_m(\tau, f) + \sum_{i=1}^{I} U_{i,m}(\tau, f) + N_m(\tau, f). \quad (1)$$

$D_m(\tau, f) = H_m(\tau, f)\, S(\tau, f)$ denotes the desired speaker's signal at microphone m, based on the acoustic transfer function $H_m(\tau, f)$ from the desired speaker to the m-th microphone. $U_{i,m}$ denotes the contribution of interfering speaker i, $i = 1, \dots, I$, at microphone m, and $N_m$ is additional low-level sensor noise. The goal is to suppress the interfering speakers and to extract the source signal S, or a reverberant image of it, with minimal distortions. Thus, dereverberation is not explicitly addressed in this paper. We allow different target speakers and speaker positions across utterances, but assume that the speakers' identities and positions remain static within one utterance.
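The signal model of Eq. (1) can be assembled directly in NumPy. The random placeholders below are hypothetical stand-ins for STFT-domain quantities that would in practice come from measured or simulated signals; the dimensions are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
M, I, T, F = 4, 2, 50, 129   # mics, interferers, time frames, frequency bins

def crandn(shape):
    """Hypothetical helper: complex-valued random placeholder array."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

S = crandn((T, F))            # desired source signal S(tau, f)
H = crandn((M, T, F))         # acoustic transfer functions H_m(tau, f)
U = crandn((I, M, T, F))      # interferer images U_{i,m}(tau, f)
N = 0.01 * crandn((M, T, F))  # low-level sensor noise N_m(tau, f)

D = H * S[None]               # desired speaker image at each microphone
X = D + U.sum(axis=0) + N     # Eq. (1): desired + interferers + noise
```

The broadcasting over the microphone axis mirrors the per-channel structure of Eq. (1): every microphone observes the same source signal S through its own transfer function.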
In [6], COSPA was introduced for multichannel denoising. The framework consists of an encoder, which uses a subnetwork denoted CRUNet to estimate a single-channel mask that is applied to all input channels and subsequently compresses the features; a compandor, which effectuates multichannel processing, i.e., allows each channel to be processed differently; and a decoder, which outputs an individual mask $M_m$ for each channel. In this paper, we adapt COSPA to the problem of TSE by adding DoA information.
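Since the decoder outputs one complex-valued mask per channel, the enhanced signal can be obtained by a filter-and-sum operation over the masked channels. The sketch below shows this combination step only; the function name is illustrative, and whether COSPA sums or otherwise combines the masked channels is an assumption here rather than a detail stated in this section.

```python
import numpy as np

def apply_channel_masks(X, masks):
    """Apply an individual complex-valued mask to each channel and sum.

    X: multichannel STFT, shape (M, T, F).
    masks: complex masks, one per channel, shape (M, T, F).
    Returns the single-channel enhanced STFT of shape (T, F),
    computed as sum_m M_m(tau, f) * X_m(tau, f).
    """
    return np.sum(masks * X, axis=0)
```

Because the masks are complex-valued, they can adjust both magnitude and phase per channel, which is what allows the network to act as a spatial filter rather than a purely spectral one.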
arXiv:2210.15512v2 [eess.AS] 14 Mar 2023