
IN SEARCH OF STRONG EMBEDDING EXTRACTORS FOR SPEAKER DIARISATION
Jee-weon Jung1, Hee-Soo Heo1, Bong-Jin Lee1, Jaesung Huh2,
Andrew Brown2, Youngki Kwon1, Shinji Watanabe3, Joon Son Chung4
1Naver Corporation, South Korea
2Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
3Carnegie Mellon University, Pittsburgh, PA, USA
4Korea Advanced Institute of Science and Technology, South Korea
ABSTRACT
Speaker embedding extractors (EEs), which map input audio to a
speaker discriminant latent space, are of paramount importance in
speaker diarisation. However, adopting EEs for diarisation poses
several challenges, of which we tackle two.
First, evaluation is not straightforward because the features
required for better performance differ between speaker verification
and diarisation. We show that better performance on widely adopted
speaker verification evaluation protocols does not lead to better
diarisation performance. Second, embedding extractors have not been
trained on utterances that contain multiple speakers. Such inputs are
inevitable in speaker diarisation because of overlapped speech
and speaker changes, and they degrade performance. To mitigate
the first problem, we generate speaker verification evaluation proto-
cols that mimic the diarisation scenario better. We propose two data
augmentation techniques to alleviate the second problem, making
embedding extractors aware of overlapped speech or speaker change
input. One technique generates overlapped speech segments, and the
other generates segments where two speakers utter sequentially. Ex-
tensive experimental results using three state-of-the-art speaker em-
bedding extractors demonstrate that both proposed approaches are
effective.
Index Terms—speaker diarisation, speaker verification, data
augmentation, evaluation protocol
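The two augmentation ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `gain`, the `change_at` parameter, and the use of plain Python lists in place of waveform arrays are all assumptions made for illustration, and the abstract does not state how mixing levels, change points, or training labels are chosen.

```python
def overlap_mix(primary, secondary, gain=0.5):
    """Overlapped speech: add a scaled second speaker on top of the first.

    `gain` is a hypothetical mixing weight, not a value from the paper.
    """
    n = min(len(primary), len(secondary))
    return [primary[i] + gain * secondary[i] for i in range(n)]


def sequential_concat(first, second, length, change_at):
    """Speaker change: `first` talks until `change_at`, then `second` takes over.

    `change_at` (in samples) is a hypothetical parameter for this sketch.
    """
    if not 0 < change_at < length:
        raise ValueError("change_at must fall strictly inside the segment")
    if len(first) < change_at or len(second) < length - change_at:
        raise ValueError("source waveforms are too short for this segment")
    return first[:change_at] + second[:length - change_at]


# Toy waveforms standing in for two single-speaker utterances.
speaker_a = [1.0] * 8
speaker_b = [-1.0] * 8

mixed = overlap_mix(speaker_a, speaker_b, gain=0.5)
switched = sequential_concat(speaker_a, speaker_b, length=8, change_at=5)
```

In this toy case, `mixed` holds both speakers in every sample, while `switched` holds five samples of the first speaker followed by three of the second.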
1. INTRODUCTION
Speaker diarisation, which solves the problem of “who spoke
when”, is widely used for many applications [1,2]. It separates
a multi-speaker audio input into single-speaker segments and assigns
speaker labels. In most recent work, a speaker diarisation system is
either a combination of sub-systems, such as end-point detection,
speaker embedding extraction, and clustering [3–9], or an end-to-end
deep neural network [10–16]. In this work, we focus on the former.
When composing a speaker
diarisation system based upon sub-systems, the speaker embedding
extractor (EE), which maps an utterance to a latent space where
speakers can be discriminated, plays the most critical role.
Fig. 1. Correlation between EERs and DERs using three different
EEs. (a) Baseline, VoxCeleb1-O; (b) proposed evaluation protocol.
The five points for each EE correspond to the five training
configurations described in Section 5.5. DERs are calculated on the
VoxConverse test set [17]. (a): EERs are calculated on the VoxCeleb1-O
test set [18]. EERs and DERs do not have a positive correlation even
though both measures are related to speaker discrimination and both
datasets are from YouTube videos. (b): EERs are calculated on the
proposed evaluation protocol, described in Section 3. Correlation is
higher than in (a).
In this study, we tackle two problems regarding EEs when used for
speaker diarisation. One is an issue that we raise, and the other is
well known from previous studies [19–21]. We first raise the issue
that evaluating an EE for speaker diarisation is difficult. The
straightforward approach would be to calculate the diarisation error
rate (DER) of a speaker diarisation system using each EE. However,
this is time-consuming and can also be affected by other
sub-processes, such as clustering. Thus, an EE that demonstrates a
low equal error rate (EER), a metric for speaker verification, on a
widely adopted evaluation protocol is typically adopted as an
alternative.
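For readers unfamiliar with the EER, the following minimal sketch shows one common way to approximate it from a list of verification trial scores: sweep the decision threshold and report the point where the false acceptance rate and the false rejection rate are closest. The function name and the toy scores are assumptions made for illustration; this is not the evaluation code used in this work.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # False rejection: target trials scored below the threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        # False acceptance: non-target trials scored at or above it.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (hypothetical scores, not data from the paper).
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

With these toy trials, one target score falls below one non-target score, so both error rates meet at one third. Production toolkits usually interpolate along the ROC curve instead of picking the nearest threshold, which matters when the trial list is small.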
arXiv:2210.14682v1 [cs.SD] 26 Oct 2022