QUANTITATIVE EVIDENCE ON OVERLOOKED ASPECTS OF ENROLLMENT SPEAKER
EMBEDDINGS FOR TARGET SPEAKER SEPARATION
Xiaoyu Liu, Xu Li, Joan Serrà
Dolby Laboratories
ABSTRACT
Single channel target speaker separation (TSS) aims at extracting
a speaker’s voice from a mixture of multiple talkers given an en-
rollment utterance of that speaker. A typical deep learning TSS
framework consists of an upstream model that obtains enrollment
speaker embeddings and a downstream model that performs the sep-
aration conditioned on the embeddings. In this paper, we look into
several important but overlooked aspects of the enrollment embed-
dings, including the suitability of the widely used speaker identifica-
tion embeddings, the introduction of the log-mel filterbank and self-
supervised embeddings, and the embeddings’ cross-dataset general-
ization capability. Our results show that the speaker identification
embeddings could lose relevant information due to a sub-optimal
metric, training objective, or common pre-processing. In contrast,
both the filterbank and the self-supervised embeddings preserve the
integrity of the speaker information, but the former consistently out-
performs the latter in a cross-dataset evaluation. The competitive
separation and generalization performance of the previously over-
looked filterbank embedding is consistent across our study, which
calls for future research on better upstream features.
Index Terms—Target speaker separation, speaker embedding,
filterbank, speaker identification, self-supervised learning.
1. INTRODUCTION
Single channel target speaker separation (TSS) is the task of sepa-
rating a speaker’s voice from interfering talkers given a pre-recorded
enrollment utterance that characterizes that speaker (the target
speaker). A deep learning-based TSS framework typically con-
sists of an upstream speaker embedding model and a downstream
separation model, with the latter conditioned on the enrollment em-
beddings from the former, which act as target speaker references. For
the upstream, existing research generally makes two design choices: (i)
to use utterance- or frame-level embeddings, and (ii) to pre-train the
embedding model as a separate module or fine-tune the embeddings
together with the downstream. For utterance-level embeddings,
many systems employ speaker identification (SID) networks pre-
trained for a low equal error rate (EER) to extract a summary vector
from the entire enrollment utterance [1–7]. On the other hand,
frame-level embeddings take advantage of attention algorithms to
align each mixture frame with the most informative enrollment
frames [8–10]. For both (i) and (ii), fine-tuning has been reported to
outperform pre-trained embeddings, presumably because joint training
captures more relevant speaker information for TSS [10, 11].
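The two upstream choices above can be sketched numerically. The toy arrays, dimensions, and the mean/attention summaries below are our own illustrative stand-ins, not the models studied in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical frame-level enrollment embeddings: T frames, D dims.
frame_emb = rng.normal(size=(200, 256))

# (i) Utterance-level: summarize the whole enrollment into one vector,
# here with a simple temporal mean (real systems use SID pooling layers).
utt_emb = frame_emb.mean(axis=0)              # shape (256,)

# (ii) Frame-level: align one mixture frame with the most informative
# enrollment frames via scaled dot-product attention.
mix_frame = rng.normal(size=(256,))           # query derived from the mixture
scores = frame_emb @ mix_frame / np.sqrt(256)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax over enrollment frames
attended = weights @ frame_emb                # shape (256,)
```

In (i) every mixture frame is conditioned on the same summary vector, whereas in (ii) each mixture frame gets its own weighting of the enrollment frames.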
Despite much research along those choices, there are still im-
portant but overlooked aspects of enrollment embeddings that re-
quire further attention. For the widely used SID embeddings, the
effects of a low EER, commonly used pre-processing, and data aug-
mentation on the TSS quality are unclear. In addition, we look at
two new embeddings not explored for TSS before: log-mel filter-
bank (FBANK) and self-supervised learning (SSL). FBANK, as a
simple signal processing method, has been ignored as an enrollment
option in previous literature. SSL denotes a class of powerful models that
learn problem-agnostic speech features from unlabelled data [12–
14], and we hypothesize that such broader information (compared to
SID) could benefit TSS enrollment. Note that, unlike [15], which
uses SSL as the input mixture features for blind speaker separation,
we limit SSL to offline processing of the enrollment utterance, since
TSS often requires real-time low-complexity processing for the mix-
tures [2–5]. Finally, we consider a cross-dataset evaluation to assess
the generalization of the enrollment embeddings [16], which is an-
other important but overlooked aspect in previous TSS research.
Our work studies pre-trained utterance- and frame-level embed-
dings, as well as fine-tuned frame-level embeddings. Under each
category, FBANK, SID, and SSL features are investigated in detail.
Specifically, we provide answers to the following open questions:
• Does a lower EER mean better separation? An upstream SID
network with a low EER is a natural choice for TSS [2, 5, 10], but
we show that EER is an unreliable metric for the success of TSS.
• Does feature normalization (FN) in SID improve TSS? FN, a
common pre-processing in SID [17–21], normalizes the recording
channel characteristics by subtracting the per-band mean from the
input FBANK to the SID systems. We show that FN hurts TSS.
• Does data augmentation in SID improve TSS? Augmenting
training data by more speakers and distortions often benefits
SID [19], but we find that such embeddings may not benefit TSS
much under both clean and noisy enrollment conditions.
• Which one is better, SSL or SID embeddings? SSL encodes
more speech information than SID, but we show that, to take
advantage of SSL, frame-level embeddings are preferred over
utterance-level ones, and that the studied SID embeddings do not
benefit from frame-level information.
• Does a more powerful SSL model yield better TSS? By comparing
two SSL models, we find that the more powerful one performs
only marginally better.
• Are fine-tuned embeddings better than the pre-trained ones?
To answer this question, we fine-tune the pre-trained SSL and SID
models and only observe improvements by the SSL model.
• How does each embedding compare with FBANK? Remarkably,
the performance of the simple FBANK is close to, or in some cases
better than, that of the other studied embeddings.
• How generalizable are the embeddings to different test sets?
We show that FBANK generalizes competitively among various
upstream features. We also observe that the pre-trained and fine-
tuned SSL features could suffer from overfitting.
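The mean-subtraction FN examined in the second question above can be illustrated with a short numpy sketch. The function name, array sizes, and synthetic features here are our own assumptions, used only to show the per-band operation:

```python
import numpy as np

def feature_normalization(fbank):
    """Feature normalization (FN) as described for SID pre-processing:
    subtract the per-band mean over time from the log-mel filterbank,
    removing stationary recording-channel characteristics.
    fbank: array of shape (num_frames, num_bands)."""
    return fbank - fbank.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
# Toy log-mel features: 100 frames, 40 mel bands (hypothetical sizes).
fbank = rng.normal(loc=3.0, scale=1.0, size=(100, 40))
normalized = feature_normalization(fbank)
# After FN, each band has (near-)zero mean over time.
print(np.allclose(normalized.mean(axis=0), 0.0))  # True
```

Note that this subtraction discards any per-band offsets, which is exactly the kind of information loss the paper argues can hurt TSS.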
With this extensive study, we hope to provide insight and a prac-
tical guide on speaker embeddings for TSS enrollment.
arXiv:2210.12635v2 [cs.SD] 26 Oct 2022