EXPLOITING SPATIAL INFORMATION WITH THE INFORMED COMPLEX-VALUED
SPATIAL AUTOENCODER FOR TARGET SPEAKER EXTRACTION
Annika Briegleb, Mhd Modar Halimeh*, Walter Kellermann
Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
{annika.briegleb, mhd.m.halimeh, walter.kellermann}@fau.de
ABSTRACT
In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, for neural spatial filtering it has been shown that joint spectro-spatial filtering is more beneficial. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) to the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is capable of learning pronounced spatial selectivity patterns, and we show that the results depend significantly on the training target and on the reference signal used when computing various evaluation metrics.
Index Terms— speaker extraction, spectro-spatial filtering, training targets, DNN
1. INTRODUCTION
While neural networks have represented the state of the art for single-channel audio signal enhancement for some time now, they have only recently moved into the focus of multichannel audio signal enhancement and, hence, spatial filtering. There have been several approaches to guide spatial filters, i.e., beamformers, by estimating intermediate quantities with neural networks [1–5]. Other approaches construct a beamformer by estimating its weights with a neural network [6–11], or replace the beamforming process altogether by a neural network that directly estimates the clean speech signal [12, 13]. The first approach stays within the conventional definitions of beamformers, whereas the second and third approaches exploit the nonlinear processing performed by the neural network and are denoted as neural spatial filters.
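To contrast the neural approaches above with the conventional definition of a beamformer, the following is a minimal sketch of a frequency-domain delay-and-sum beamformer, the simplest fixed spatial filter. The function name, array geometry, and sign convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def delay_and_sum_weights(mic_positions, doa_deg, freqs, c=343.0):
    """Frequency-domain delay-and-sum weights for a 2D far-field source.

    mic_positions: (M, 2) array of microphone coordinates in metres.
    doa_deg: direction of arrival in degrees (azimuth).
    freqs: (F,) array of frequencies in Hz.
    Returns a (F, M) array of complex beamformer weights.
    """
    doa = np.deg2rad(doa_deg)
    direction = np.array([np.cos(doa), np.sin(doa)])
    delays = mic_positions @ direction / c           # (M,) relative delays in s
    phases = 2j * np.pi * np.outer(freqs, delays)    # (F, M) phase corrections
    # Compensate the inter-channel delays, then average across channels.
    return np.exp(phases) / mic_positions.shape[0]

# Applying the weights to one multichannel STFT frame X of shape (F, M):
# Y = np.sum(delay_and_sum_weights(pos, 30.0, f_axis) * X, axis=-1)
```

Such weights are fixed per look direction; the neural spatial filters discussed in this paper instead produce time-varying, signal-dependent weights.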
In this paper, we focus on those neural spectro-spatial filters that estimate the beamformer weights by a neural network. Such neural spectro-spatial filters learn a spatially selective pattern for signal denoising in scenarios where only one speech source is active [6, 14]. In this contribution, we extend one such neural spectro-spatial filter, the Complex-valued Spatial Autoencoder (COSPA) [6], to the problem of target speaker extraction (TSE) from a mixture of speakers by informing it about the target speaker's direction of arrival (DoA) via a low-cost extension of the network (cf. Sec. 2.1.1). Furthermore,
[* M. M. Halimeh is now with the Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany. This work has been accepted to IEEE ICASSP 2023.]
we explicitly exploit the provided multichannel information by replacing two-dimensional (2D) with three-dimensional (3D) convolutional layers at the beginning of the network (cf. Sec. 2.1.2). We show that these extensions make it possible to identify the target speaker and enhance its signal in the presence of interfering speakers, rendering the proposed informed COSPA (iCOSPA) a flexible spatial filter. For reverberant scenarios, several options exist for the target signal used to train the neural filter. We examine how the spatial filtering capability is affected by different target signals, and also show how the evaluation metrics depend on the choice of reference, i.e., clean signal, used for their computation.
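One simple way to inform a network of the target speaker's DoA is to append a constant angle encoding to the input feature maps. The sketch below illustrates this idea only; the exact conditioning mechanism used by iCOSPA is described in Sec. 2.1.1, and the function name, tensor layout, and sin/cos encoding here are assumptions for illustration.

```python
import numpy as np

def append_doa_features(x, doa_deg):
    """Append a simple DoA encoding to multichannel TF features.

    x: real-valued feature tensor of shape (C, T, F), e.g. stacked
       real/imaginary parts of the microphone STFTs.
    doa_deg: target speaker's direction of arrival in degrees.
    Returns an array of shape (C + 2, T, F) with two constant feature
    maps holding cos(doa) and sin(doa), broadcast over time-frequency.
    """
    doa = np.deg2rad(doa_deg)
    c, t, f = x.shape
    cos_map = np.full((1, t, f), np.cos(doa), dtype=x.dtype)
    sin_map = np.full((1, t, f), np.sin(doa), dtype=x.dtype)
    return np.concatenate([x, cos_map, sin_map], axis=0)
```

Encoding the angle as a sin/cos pair avoids the discontinuity at 0°/360° that a raw angle value would introduce.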
We present the proposed method and discuss the training target
signal in Sec. 2. In Sec. 3.1, we detail our experimental setup and
present and discuss the corresponding results in Sec. 3.2. Sec. 4
concludes the paper.
2. PROVIDING SPATIAL INFORMATION FOR COSPA
In the following, we briefly introduce the COSPA framework and explain how it is modified to exploit spatial information for TSE (Sec. 2.1). A discussion of the spatial selectivity obtained with appropriate target signals for multichannel processing follows in Sec. 2.2.
2.1. Extension of COSPA
We consider a signal with M microphone channels, captured by an arbitrary microphone array, where the signal at microphone m in time-frequency bin $(\tau, f)$ is given by

$$X_m(\tau, f) = D_m(\tau, f) + \sum_{i=1}^{I} U_{i,m}(\tau, f) + N_m(\tau, f). \quad (1)$$

$D_m(\tau, f) = H_m(\tau, f)\, S(\tau, f)$ denotes the desired speaker's signal at microphone m, based on the acoustic transfer function $H_m(\tau, f)$ from the desired speaker to the m-th microphone. $U_{i,m}$ denotes the contribution of interfering speaker i, $i = 1, \dots, I$, at microphone m, and $N_m$ is additional low-level sensor noise. The goal is to suppress the interfering speakers and to extract the source signal S, or a reverberant image of it, with minimal distortions. Thus, dereverberation is not explicitly addressed in this paper. We allow different target speakers and speaker positions across utterances, but assume that the speakers' identities and positions remain static within one utterance.
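The signal model of Eq. (1) can be assembled directly in NumPy. The random placeholders below are hypothetical stand-ins for STFT-domain quantities that would in practice come from measured or simulated signals; the dimensions are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
M, I, T, F = 4, 2, 50, 129   # mics, interferers, time frames, frequency bins

def crandn(shape):
    """Hypothetical helper: complex-valued random placeholder array."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

S = crandn((T, F))            # desired source signal S(tau, f)
H = crandn((M, T, F))         # acoustic transfer functions H_m(tau, f)
U = crandn((I, M, T, F))      # interferer images U_{i,m}(tau, f)
N = 0.01 * crandn((M, T, F))  # low-level sensor noise N_m(tau, f)

D = H * S[None]               # desired speaker image at each microphone
X = D + U.sum(axis=0) + N     # Eq. (1): desired + interferers + noise
```

The broadcasting over the microphone axis mirrors the per-channel structure of Eq. (1): every microphone observes the same source signal S through its own transfer function.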
In [6], COSPA was introduced for multichannel denoising. The framework consists of an encoder, which uses a subnetwork denoted CRUNet to estimate a single-channel mask that is applied to all input channels and subsequently compresses the features; a compandor, which effectuates multichannel processing, i.e., allows each channel to be processed differently; and a decoder, which outputs an individual mask $M_m$ for each channel. In this paper, we adapt COSPA to the problem of TSE by adding DoA information.
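Since the decoder outputs one complex-valued mask per channel, the enhanced signal can be obtained by a filter-and-sum operation over the masked channels. The sketch below shows this combination step only; the function name is illustrative, and whether COSPA sums or otherwise combines the masked channels is an assumption here rather than a detail stated in this section.

```python
import numpy as np

def apply_channel_masks(X, masks):
    """Apply an individual complex-valued mask to each channel and sum.

    X: multichannel STFT, shape (M, T, F).
    masks: complex masks, one per channel, shape (M, T, F).
    Returns the single-channel enhanced STFT of shape (T, F),
    computed as sum_m M_m(tau, f) * X_m(tau, f).
    """
    return np.sum(masks * X, axis=0)
```

Because the masks are complex-valued, they can adjust both magnitude and phase per channel, which is what allows the network to act as a spatial filter rather than a purely spectral one.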
arXiv:2210.15512v2 [eess.AS] 14 Mar 2023