
IN SEARCH OF STRONG EMBEDDING EXTRACTORS FOR SPEAKER DIARISATION
Jee-weon Jung1, Hee-Soo Heo1, Bong-Jin Lee1, Jaesung Huh2,
Andrew Brown2, Youngki Kwon1, Shinji Watanabe3, Joon Son Chung4
1Naver Corporation, South Korea
2Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
3Carnegie Mellon University, Pittsburgh, PA, USA
4Korea Advanced Institute of Science and Technology, South Korea
ABSTRACT
Speaker embedding extractors (EEs), which map input audio to a
speaker discriminant latent space, are of paramount importance in
speaker diarisation. However, adopting EEs for diarisation poses
several challenges, of which we tackle two key problems.
First, the evaluation is not straightforward because the features
required for better performance differ between speaker verification
and diarisation. We show that better performance on widely adopted
speaker verification evaluation protocols does not lead to better di-
arisation performance. Second, embedding extractors have not seen
utterances in which multiple speakers exist. These inputs are in-
evitably present in speaker diarisation because of overlapped speech
and speaker changes; they degrade the performance. To mitigate
the first problem, we generate speaker verification evaluation proto-
cols that mimic the diarisation scenario better. We propose two data
augmentation techniques to alleviate the second problem, making
embedding extractors aware of overlapped speech or speaker change
input. One technique generates overlapped speech segments, and the
other generates segments where two speakers utter sequentially. Ex-
tensive experimental results using three state-of-the-art speaker em-
bedding extractors demonstrate that both proposed approaches are
effective.
Index Terms: speaker diarisation, speaker verification, data
augmentation, evaluation protocol
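The two augmentation ideas named in the abstract can be illustrated with a minimal NumPy sketch. The abstract states only that one technique generates overlapped speech segments and the other generates segments where two speakers utter sequentially; the function names, the `gain_b` parameter, the `change_point` parameter, and the truncation to the shorter input are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def overlap_mix(wav_a, wav_b, gain_b=1.0):
    """Simulate overlapped speech by summing two single-speaker waveforms.

    Hypothetical helper: both inputs are truncated to the shorter length,
    and gain_b scales the second speaker (an assumed knob).
    """
    wav_a = np.asarray(wav_a, dtype=float)
    wav_b = np.asarray(wav_b, dtype=float)
    n = min(len(wav_a), len(wav_b))
    return wav_a[:n] + gain_b * wav_b[:n]


def speaker_change(wav_a, wav_b, change_point):
    """Simulate a speaker change: speaker A up to change_point, then speaker B.

    Hypothetical helper: the output has the length of the shorter input.
    """
    wav_a = np.asarray(wav_a, dtype=float)
    wav_b = np.asarray(wav_b, dtype=float)
    n = min(len(wav_a), len(wav_b))
    if not 0 <= change_point <= n:
        raise ValueError("change_point must lie within the segment")
    return np.concatenate([wav_a[:change_point], wav_b[change_point:n]])


if __name__ == "__main__":
    # Toy constant "waveforms" standing in for real audio samples.
    a = np.ones(8)
    b = 2.0 * np.ones(8)
    print(overlap_mix(a, b))
    print(speaker_change(a, b, change_point=3))
```

How the speaker label of such an augmented segment is chosen during training is not stated in this part of the paper, so the sketch leaves it out.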
1. INTRODUCTION
Speaker diarisation, which solves the problem of “who spoke
when”, is widely used in many applications [1,2]. It separates
a multi-speaker audio input into single-speaker segments and assigns
speaker labels. In the majority of recent works, a speaker
diarisation system consists of either a combination of sub-systems,
such as end-point detection, speaker embedding extraction, and
clustering [3-9], or an end-to-end deep neural network [10-16]. In
this work, we focus on the former. When composing a speaker
diarisation system from sub-systems, the speaker embedding
extractor (EE), which maps an utterance to a latent space where
speakers can be discriminated, plays the most critical role.
Fig. 1. Correlation between EERs and DERs using three different
EEs. Panels: (a) Baseline, VoxCeleb1-O; (b) Proposed evaluation
protocol. The five points for each EE correspond to the five training
configurations described in Section 5.5. DERs are calculated on the
VoxConverse test set [17]. (a): EERs are calculated on the VoxCeleb1-O
test set [18]. EERs and DERs do not have a positive correlation even
though both measures are related to speaker discrimination, and both
datasets are from YouTube videos. (b): EERs are calculated on the
proposed evaluation protocol, described in Section 3. The correlation
is higher than in (a).
In this study, we tackle two problematic phenomena regarding
EEs when used for speaker diarisation. One is an issue that we
raise, and the other is well known from previous studies [19-21].
We first raise the issue that evaluating an EE for speaker
diarisation is difficult. The straightforward approach would be to
calculate the diarisation error rate (DER) of a speaker diarisation
system using each EE. However, this is time-consuming and can also be
affected by other sub-processes, such as clustering. Thus, an EE that
demonstrates a low equal error rate (EER), a metric for speaker
verification, on a widely adopted evaluation protocol is typically
adopted as an alternative.
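For readers less familiar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of one common way to estimate it from trial scores, assuming that higher scores mean "same speaker" and that a trial is accepted when its score is at or above the threshold. The function name and the toy trials are illustrative and are not taken from the paper's evaluation code.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the EER from verification trial scores.

    scores: similarity per trial (higher means more likely same speaker).
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    if labels.all() or not labels.any():
        raise ValueError("need both target and non-target trials")

    target = scores[labels]
    nontarget = scores[~labels]

    # Candidate thresholds: every observed score, plus +inf ("reject all").
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # False acceptance: non-target trial accepted (score >= threshold).
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # False rejection: target trial rejected (score < threshold).
    frr = np.array([(target < t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    # Six toy trials: three target and three non-target.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
    toy_labels = [1, 1, 0, 1, 0, 0]
    print(equal_error_rate(toy_scores, toy_labels))
```

With few trials, the two error curves may not cross exactly, which is why the sketch averages them at the closest point; larger trial lists or interpolation give a smoother estimate.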
arXiv:2210.14682v1 [cs.SD] 26 Oct 2022