QUANTITATIVE EVIDENCE ON OVERLOOKED ASPECTS OF ENROLLMENT SPEAKER
EMBEDDINGS FOR TARGET SPEAKER SEPARATION
Xiaoyu Liu, Xu Li, Joan Serrà
Dolby Laboratories
ABSTRACT
Single channel target speaker separation (TSS) aims at extracting
a speaker’s voice from a mixture of multiple talkers given an en-
rollment utterance of that speaker. A typical deep learning TSS
framework consists of an upstream model that obtains enrollment
speaker embeddings and a downstream model that performs the sep-
aration conditioned on the embeddings. In this paper, we look into
several important but overlooked aspects of the enrollment embed-
dings, including the suitability of the widely used speaker identifica-
tion embeddings, the introduction of the log-mel filterbank and self-
supervised embeddings, and the embeddings’ cross-dataset general-
ization capability. Our results show that the speaker identification
embeddings could lose relevant information due to a sub-optimal
metric, training objective, or common pre-processing. In contrast,
both the filterbank and the self-supervised embeddings preserve the
integrity of the speaker information, but the former consistently out-
performs the latter in a cross-dataset evaluation. The competitive
separation and generalization performance of the previously over-
looked filterbank embedding is consistent across our study, which
calls for future research on better upstream features.
Index Terms—Target speaker separation, speaker embedding,
filterbank, speaker identification, self-supervised learning.
1. INTRODUCTION
Single channel target speaker separation (TSS) is the task of sepa-
rating a speaker’s voice from interfering talkers given a pre-recorded
enrollment utterance that characterizes that speaker (the target
speaker). A deep learning-based TSS framework typically con-
sists of an upstream speaker embedding model and a downstream
separation model, with the latter conditioned on the enrollment em-
beddings from the former, which act as target speaker references. For
the upstream, existing research generally makes two design choices: (i)
to use utterance- or frame-level embeddings, and (ii) to pre-train the
embedding model as a separate module or fine-tune the embeddings
together with the downstream. For utterance-level embeddings,
many systems employ speaker identification (SID) networks pre-
trained for a low equal error rate (EER) to extract a summary vector
from the entire enrollment utterance [1–7]. On the other hand,
frame-level embeddings take advantage of attention algorithms to
align each mixture frame with the most informative enrollment
frames [8–10]. For both (i) and (ii), fine-tuning has been reported to
outperform pre-trained embeddings, presumably because joint training
captures more relevant speaker information for TSS [10, 11].
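The two upstream choices above can be sketched numerically. The toy arrays, dimensions, and the mean/attention summaries below are our own illustrative stand-ins, not the models studied in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical frame-level enrollment embeddings: T frames, D dims.
frame_emb = rng.normal(size=(200, 256))

# (i) Utterance-level: summarize the whole enrollment into one vector,
# here with a simple temporal mean (real systems use SID pooling layers).
utt_emb = frame_emb.mean(axis=0)              # shape (256,)

# (ii) Frame-level: align one mixture frame with the most informative
# enrollment frames via scaled dot-product attention.
mix_frame = rng.normal(size=(256,))           # query derived from the mixture
scores = frame_emb @ mix_frame / np.sqrt(256)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax over enrollment frames
attended = weights @ frame_emb                # shape (256,)
```

In (i) every mixture frame is conditioned on the same summary vector, whereas in (ii) each mixture frame gets its own weighting of the enrollment frames.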
Despite much research along those choices, there are still im-
portant but overlooked aspects of enrollment embeddings that re-
quire further attention. For the widely used SID embeddings, the
effects of a low EER, commonly used pre-processing, and data aug-
mentation on the TSS quality are unclear. In addition, we look at
two new embeddings not explored for TSS before: log-mel filter-
bank (FBANK) and self-supervised learning (SSL). FBANK, as a
simple signal processing method, has been ignored as an enrollment
option in previous literature. SSL denotes a class of powerful models that
learn problem-agnostic speech features from unlabelled data [12–
14], and we hypothesize that such broader information (compared to
SID) could benefit TSS enrollment. Note that, unlike [15], which
uses SSL as the input mixture features for blind speaker separation,
we limit SSL to offline processing of the enrollment utterance, since
TSS often requires real-time low-complexity processing for the mix-
tures [2–5]. Finally, we consider a cross-dataset evaluation to assess
the generalization of the enrollment embeddings [16], which is an-
other important but overlooked aspect in previous TSS research.
Our work studies pre-trained utterance- and frame-level embed-
dings, as well as fine-tuned frame-level embeddings. Under each
category, FBANK, SID, and SSL features are investigated in detail.
Specifically, we provide answers to the following open questions:
• Does a lower EER mean better separation? An upstream SID
network with a low EER is a natural choice for TSS [2, 5, 10], but
we show that EER is an unreliable metric for the success of TSS.
• Does feature normalization (FN) in SID improve TSS? FN, a
common pre-processing in SID [17–21], normalizes the recording
channel characteristics by subtracting the per-band mean from the
input FBANK to the SID systems. We show that FN hurts TSS.
• Does data augmentation in SID improve TSS? Augmenting
training data by more speakers and distortions often benefits
SID [19], but we find that such embeddings may not benefit TSS
much under both clean and noisy enrollment conditions.
• Which one is better, SSL or SID embeddings? SSL encodes
more speech information than SID, but we show that, to take
advantage of SSL, frame-level embeddings are preferred over
utterance-level ones, and that the studied SID embeddings do not
benefit from frame-level information.
• Does a more powerful SSL model yield better TSS? By comparing
two SSL models, we find that the more powerful one performs
only marginally better.
• Are fine-tuned embeddings better than the pre-trained ones?
To answer this question, we fine-tune the pre-trained SSL and SID
models and only observe improvements by the SSL model.
• How does each embedding compare with FBANK? Remarkably,
the performance of the simple FBANK is close to, or in some cases
better than, that of the other studied embeddings.
• How generalizable are the embeddings to different test sets?
We show that FBANK generalizes competitively among various
upstream features. We also observe that the pre-trained and fine-
tuned SSL features could suffer from overfitting.
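The mean-subtraction FN examined in the second question above can be illustrated with a short numpy sketch. The function name, array sizes, and synthetic features here are our own assumptions, used only to show the per-band operation:

```python
import numpy as np

def feature_normalization(fbank):
    """Feature normalization (FN) as described for SID pre-processing:
    subtract the per-band mean over time from the log-mel filterbank,
    removing stationary recording-channel characteristics.
    fbank: array of shape (num_frames, num_bands)."""
    return fbank - fbank.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
# Toy log-mel features: 100 frames, 40 mel bands (hypothetical sizes).
fbank = rng.normal(loc=3.0, scale=1.0, size=(100, 40))
normalized = feature_normalization(fbank)
# After FN, each band has (near-)zero mean over time.
print(np.allclose(normalized.mean(axis=0), 0.0))  # True
```

Note that this subtraction discards any per-band offsets, which is exactly the kind of information loss the paper argues can hurt TSS.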
With this extensive study, we hope to provide insight and a prac-
tical guide on speaker embeddings for TSS enrollment.
arXiv:2210.12635v2 [cs.SD] 26 Oct 2022