TSUP Speaker Diarization System for Conversational Short-phrase Speaker
Diarization Challenge
Bowen Pang1, Huan Zhao1, Gaosheng Zhang2, Xiaoyue Yang2, Yang Sun2, Li Zhang1, Qing Wang1,
Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University (NPU), Xi’an, China
2Shenzhen Transsion Holding Limited
{zhaohuan, pangbowen}@mail.nwpu.edu.cn, {gaosheng.zhang, xiaoyue.yang,
yang.sun}@transsion.com, lxie@nwpu.edu.cn
Abstract
This paper describes the TSUP team's submission to the ISCSLP 2022 conversational short-phrase speaker diarization (CSSD) challenge, which particularly focuses on short-phrase conversations with a new evaluation metric called conversational diarization error rate (CDER). In this challenge, we explore three typical kinds of speaker diarization systems: spectral clustering (SC) based diarization, target-speaker voice activity detection (TS-VAD), and end-to-end neural diarization (EEND). Our major findings are summarized as follows. First, the SC approach is favored over the other two approaches under the new CDER metric. Second, hyperparameter tuning is essential to CDER for all three types of speaker diarization systems; specifically, CDER becomes smaller when the sub-segment length is set longer. Finally, multi-system fusion through DOVER-LAP worsens the CDER metric on the challenge data. Our submitted SC system eventually ranks third in the challenge.
Index Terms: speaker diarization, spectral clustering, TS-
VAD, EEND
1. Introduction
Speaker diarization aims to determine “who spoke when” in an
audio stream that may contain an unknown number of speak-
ers [2, 22]. It is an indispensable task in multimedia information
retrieval, speaker turn analysis and audio processing [32, 3]. In
particular, speaker diarization has the potential to significantly
improve automatic speech recognition (ASR) accuracy in multi-
speaker conversation scenarios [36, 40].
Clustering-based methods, which are composed of multiple independently-optimized modules including voice activity detection (VAD), speech segmentation, speaker embedding extraction and speaker clustering [19, 15], have dominated speaker diarization for many years. Although these systems have shown superior performance in several speaker diarization challenges [36, 35], such clustering-based methods have two apparent defects: they cannot properly deal with speaker overlap, and the independent optimization of the different sub-modules may lead to sub-optimal performance [25, 31, 21].
To deal with the overlapping speech problem particularly,
speech separation can be used as a pre-processing step [36]
and target-speaker voice activity detection (TS-VAD) can be
adopted as a post-processing step [16]. Particularly, the use of
TS-VAD has led the speaker diarization system to achieve state-
of-the-art (SOTA) performance in several open challenges [35,
33]. However, TS-VAD has a drawback that the oracle informa-
tion about the maximum number of speakers in an audio stream
has to be known in advance.
Recently, end-to-end neural diarization (EEND) [9, 10] has
been proposed to deal with both overlapping speech as well as to
directly optimize a diarization system via diarization error min-
imization. Specifically, EEND treats speaker diarization as a
classification task and estimates speech activities of all speakers
jointly frame-by-frame. To solve the permutation problem [1],
Fujita et al. introduced a permutation-free scheme [14, 39] par-
ticularly into the training objective function. Thus the EEND
system is trained in an end-to-end fashion under the objective
function that provides minimal diarization errors.
To advance the speaker diarization performance in conver-
sational short-phrase scenario, ISCSLP2022 specifically held a
CSSD challenge with a new conversation dataset. In conversations like agent-customer telephone calls, the speech utterance from each side is usually very brief and is sometimes limited to only a short phrase, which results in frequent speaker
changes. Speaker diarization in such a scenario poses partic-
ular challenges to both current systems as well as evaluation
metrics. Different from the previous speaker diarization chal-
lenges [26, 4], the evaluation metric of the CSSD challenge is
the so-called conversational diarization error rate (CDER) [5]
rather than the typical diarization error rate (DER). Although DER has been used as a standard metric for speaker diarization for a long time, it fails to give enough emphasis to short conversational phrases, which last only a short time but whose accurate discrimination plays a vital role in downstream tasks such as speech recognition and understanding. Different from DER, which is calculated at the time-duration level, CDER is computed regardless of utterance length, and all types of mistakes are equally reflected in the evaluation metric [5]. Under this measurement, short phrases and long sentences have identical importance.
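The contrast can be made concrete with a toy Python sketch that scores the same three utterances under a time-weighted (DER-like) and an utterance-weighted (CDER-like) criterion. This is a simplified illustration for intuition only, not the official CSSD scoring tool; the utterance format and helper names are hypothetical.

```python
# Toy illustration of why DER and CDER can disagree; NOT the official
# CSSD scoring script. Each utterance is (start, end, ref_spk, hyp_spk).
def toy_der(utts):
    """Time-weighted error: wrongly attributed duration / total duration."""
    total = sum(e - s for s, e, _, _ in utts)
    wrong = sum(e - s for s, e, ref, hyp in utts if ref != hyp)
    return wrong / total

def toy_cder(utts):
    """Utterance-weighted error: every utterance counts equally."""
    wrong = sum(1 for _, _, ref, hyp in utts if ref != hyp)
    return wrong / len(utts)

# One long utterance labeled correctly, two short back-channels mislabeled.
utts = [
    (0.0, 10.0, "A", "A"),   # 10 s, correct
    (10.0, 10.5, "B", "A"),  # 0.5 s, wrong
    (10.5, 11.0, "A", "B"),  # 0.5 s, wrong
]
print(toy_der(utts))   # about 0.091: short errors barely move DER
print(toy_cder(utts))  # about 0.667: the same errors dominate CDER
```

The two mislabeled back-channels cost only 1 s out of 11 s under the duration-based view, but two utterances out of three under the utterance-based view.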
Considering the specific task and the new evaluation metric, in this paper we make a comparative study of three typical speaker diarization approaches: spectral clustering (SC), TS-VAD and EEND. Our major findings are twofold. First, SC is preferred under the new CDER metric, because the frame-level prediction nature of TS-VAD and EEND may lead to more errors on short segments, which has a small impact on DER but a significant impact on CDER, since CDER treats long and short utterances equally. Second, hyperparameter tuning is essential to CDER for all three types of speaker diarization systems; specifically, CDER is reduced when the sub-segment length is set longer. Finally, our submitted spectral clustering system manages to win third place in the challenge.
arXiv:2210.14653v1 [cs.SD] 26 Oct 2022
2. System Description
In the CSSD challenge, we explore several systems including
SC, TS-VAD and EEND, which are summarized in Figure 1. In
addition, we explore DOVER-LAP [24] to fuse the RTTM
outputs inferred from the above three systems to get the final
speaker diarization results.
2.1. Spectral Clustering System
A clustering-based speaker diarization system generally con-
sists of VAD, speaker embedding extractor and clustering mod-
ules. In general, the VAD module can be simple energy-based
or neural network based. Thanks to the recent advances in neural speaker verification, various types of speaker embedding extractors, including x-vectors [28], ResNet [13] and ECAPA-TDNN [7], can be considered to extract discriminative speaker
embeddings from audio segments. As for the clustering step,
agglomerative hierarchical clustering (AHC) [17] and spectral
clustering [30] are the two typical methods.
In this challenge, we implement a ResNet-LSTM based
VAD module similar to [33] using the allowed training data
and its performance on the dev and test sets is comparable with
that achieved by the TDNN-based VAD in Kaldi [23] which is
trained using more data. The timestamp labels during training
are generated from the transcripts of the training set. In our
VAD module, the ResNet structure first extracts frame-level feature maps. Then the feature map of the current frame is concatenated with the feature maps of the previous and next frames, and the concatenated feature map goes through a statistics pooling layer. Finally, two BLSTM layers and a linear layer are used to obtain the speech probability of the current frame.
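The context concatenation and per-frame statistics pooling described above can be sketched in NumPy as follows. The feature dimensions and the use of mean-plus-standard-deviation pooling are illustrative assumptions; the paper does not specify exact sizes.

```python
import numpy as np

# Schematic of the VAD front-end: concatenate each frame's feature map
# with its previous and next neighbors (edge-padded), then apply
# per-frame statistics pooling (mean and std over the context window).
def add_context(frames):
    """frames: (T, D) -> (T, 3, D) stacking previous/current/next frames."""
    prev = np.vstack([frames[:1], frames[:-1]])  # shift right, repeat edge
    nxt = np.vstack([frames[1:], frames[-1:]])   # shift left, repeat edge
    return np.stack([prev, frames, nxt], axis=1)

def stats_pool(ctx):
    """(T, 3, D) -> (T, 2*D): per-frame mean and std over the context."""
    return np.concatenate([ctx.mean(axis=1), ctx.std(axis=1)], axis=1)

T, D = 100, 64  # hypothetical frame count and feature dimension
frames = np.random.randn(T, D).astype(np.float32)
pooled = stats_pool(add_context(frames))
print(pooled.shape)  # (100, 128)
```

The pooled (T, 2*D) sequence would then feed the BLSTM layers and the final linear layer that outputs per-frame speech probabilities.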
The structure of our speaker embedding extractor is
ResNet34 [34], which is used to extract utterance-level embeddings after speaker segmentation. Finally, we adopt the spectral clustering algorithm to cluster the embeddings by speaker identity. Spectral clustering has been widely adopted in speaker diarization. The conventional AHC approach is highly time-consuming, as its processing time for an audio stream depends on the number of initial segments; SC is thus introduced to mitigate this problem.
In detail, spectral clustering is a graph-based clustering algorithm. After scoring pairs of sub-segment embeddings obtained from speaker segmentation, we get a similarity matrix. Given the similarity matrix S, SC finds a partition of the graph such that the edges between different groups have very low weights. Our implementation of SC is summarized in Algorithm 1.
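The steps of Algorithm 1 can be sketched in NumPy as follows. The threshold alpha, the farthest-point k-means initialization, and the toy similarity matrix are illustrative choices, not the tuned settings of our system.

```python
import numpy as np

# Minimal sketch of Algorithm 1: normalized-Laplacian spectral
# clustering with a threshold-based estimate of the cluster count.
def spectral_cluster(S, alpha=0.5, iters=50):
    S = S.astype(float).copy()
    n = len(S)
    np.fill_diagonal(S, 0.0)                       # step 1: zero the diagonal
    d = S.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - d_is[:, None] * S * d_is[None, :]  # step 2: L_norm
    vals, vecs = np.linalg.eigh(L)                 # step 3: ascending order
    k = max(1, int((vals < alpha).sum()))          # step 4: count small eigenvalues
    P = vecs[:, :k]                                # step 5: k smallest eigenvectors
    # step 6: k-means on the rows of P (farthest-point init, Lloyd updates)
    centers = [P[0]]
    for _ in range(1, k):
        dist = np.min([((P - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(P[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((P[:, None] - centers[None]) ** 2).sum(2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = P[labels == j].mean(axis=0)
    return labels                                  # step 7: one label per embedding

# Toy similarity matrix: two well-separated groups of three segments.
S = np.zeros((6, 6))
S[:3, :3] = 0.9
S[3:, 3:] = 0.9
labels = spectral_cluster(S, alpha=0.5)
print(labels)
```

With this block-structured input, two eigenvalues of the normalized Laplacian are near zero, so k is estimated as 2 and the two groups are recovered.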
2.2. Target-speaker Voice Activity Detection
A TS-VAD system [16] aims to handle overlapped speech in speaker diarization by adopting target-speaker embeddings to identify the specific speakers within the overlapped speech. The target-speaker embeddings are estimated from the initial diarization results of the clustering-based system.
Our TS-VAD system follows the structure in [34], which
consists of a ResNet front-end and a detection back-end. We
use a deep speaker embedding model to extract frame-level embeddings, instead of the original TS-VAD [16] which takes four CNN
layers to process the acoustic features.

Algorithm 1 Spectral clustering
Input: the similarity matrix S ∈ R^(n×n).
Output: the clustering label of every segment embedding.
1: Set the diagonal elements of S to 0.
2: Compute the normalized Laplacian L_norm of S.
3: Compute the eigenvalues and eigenvectors of L_norm.
4: Determine the number of clusters k: set a threshold α and count the number of eigenvalues below α as k.
5: Concatenate the eigenvectors v_1, ..., v_k corresponding to the k smallest eigenvalues as columns into a matrix P ∈ R^(n×k).
6: For i = 1, ..., n, let r_i be the i-th row of P. Cluster the row vectors r_1, ..., r_n with the k-means algorithm.
7: return the label A_i ∈ {1, ..., k} of each embedding, for i = 1, ..., n.

For each audio, we aim to detect the speech segments of each speaker. First, we use a ResNet34 network with the same structure as the speaker embedding model to extract frame-level embeddings. Unlike the speaker embedding model, which applies statistics
pooling to project variable-length utterances into fixed-length speaker representation embeddings, our TS-VAD applies the same pooling to each frame combined with its adjacent frames to obtain frame-level embeddings. Then, the frame-level embeddings are concatenated with the target-speaker embeddings. Next, a Transformer encoder extracts the detection state of each speaker. After that, a BLSTM layer processes these detection states, which are concatenated together to model the relationship between speakers. Finally, a linear layer with a sigmoid function is used to estimate the speech probability of each speaker at every time step. By analyzing and processing these speech probabilities, we can obtain the speech segments of each speaker.
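The final step, turning per-frame speech probabilities into segments, can be sketched as thresholding followed by merging of short gaps. The threshold, frame shift, and minimum-gap values below are hypothetical, not our tuned settings.

```python
# Sketch of post-processing one speaker's per-frame speech
# probabilities into (start, end) segments in seconds.
def probs_to_segments(probs, frame_shift=0.01, threshold=0.5, min_gap=0.3):
    """probs: list of per-frame speech probabilities -> [(start, end)]."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * frame_shift          # a speech region begins
        elif p < threshold and start is not None:
            segments.append((start, i * frame_shift))
            start = None                      # the region ends
    if start is not None:
        segments.append((start, len(probs) * frame_shift))
    # Merge adjacent segments separated by a gap shorter than min_gap.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged

# 0.3 s of speech, a 0.1 s dip, then 0.3 s of speech: the dip is merged.
segs = probs_to_segments([0.9] * 30 + [0.1] * 10 + [0.9] * 30)
print(segs)  # one merged segment covering the whole 0.7 s
```

In practice such smoothing also interacts with the CDER metric, since spurious short segments are penalized as heavily as long ones.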
2.3. End-to-End Neural Diarization
EEND adopts a single neural network to obtain the final diarization result directly from the input audio features. EEND is composed of an encoder block and a binary classification block. Given a sequence generated from an audio signal, the encoder block extracts features containing diarization information, and the classification block estimates a two-dimensional sequence expressing the probability of each speaker speaking at each frame. We employ permutation invariant training (PIT) [9] to traverse all speaker orders and optimize the minimal loss, because changing the speaker order does not affect the final result in the diarization task. Eventually, each frame is classified into one of three cases: non-speech, one speaker, or overlap.
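The PIT objective can be sketched as follows: compute the binary cross-entropy under every permutation of the speaker order and keep the smallest one. This is an illustration of the idea with a toy two-speaker example, not our training code.

```python
import itertools
import numpy as np

# Minimal sketch of permutation-invariant training (PIT) for EEND:
# the loss is the minimum BCE over all speaker-order permutations.
def pit_bce(pred, ref):
    """pred, ref: (T, num_speakers) arrays of probabilities / 0-1 labels."""
    eps = 1e-7
    best = np.inf
    n_spk = ref.shape[1]
    for perm in itertools.permutations(range(n_spk)):
        p = np.clip(pred[:, perm], eps, 1 - eps)  # reorder predicted speakers
        bce = -(ref * np.log(p) + (1 - ref) * np.log(1 - p)).mean()
        best = min(best, bce)
    return best

ref = np.array([[1, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
swapped = ref[:, ::-1].copy()  # same activities with the speaker order swapped
print(pit_bce(swapped, ref))   # near zero: PIT matches the swapped order
```

Because the speaker order carries no meaning, a prediction that is correct up to a column permutation incurs (near) zero loss.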
In this challenge, we attempt two kinds of model structures: residual auxiliary EEND (RX-EEND) [41] and speaker-wise chain EEND (SC-EEND) [11]. In RX-EEND, each encoder block is enriched by a residual connection to restrict the gradient to a reasonable range. In addition, the output tensors of all encoder blocks except the last are aggregated to calculate an extra auxiliary loss, gathering more diarization information. In SC-EEND, an LSTM layer models how one speaker's past activity affects the other speakers, since in a real conversation a person indeed tends to stay silent while the other is talking. RX-EEND improves the structure of the network, while SC-EEND better models the application scenario.
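The RX-EEND structure described above can be illustrated schematically: a residual connection around each encoder block, with the outputs of all blocks except the last aggregated for the auxiliary loss. The random linear "encoder", the tanh nonlinearity, and the mean aggregation below are stand-ins for the actual Transformer encoder blocks and loss; the dimensions are hypothetical.

```python
import numpy as np

# Schematic of RX-EEND's residual encoder stack and auxiliary aggregation.
rng = np.random.default_rng(0)
T, D, n_blocks = 50, 32, 4            # hypothetical sequence length / width
weights = [rng.standard_normal((D, D)) * 0.01 for _ in range(n_blocks)]

x = rng.standard_normal((T, D))
intermediate = []
for i, W in enumerate(weights):
    x = x + np.tanh(x @ W)            # residual connection around each block
    if i < n_blocks - 1:              # all blocks except the last
        intermediate.append(x)

# Aggregate intermediate outputs for the extra auxiliary loss.
aux_input = np.mean(intermediate, axis=0)
print(x.shape, aux_input.shape)
```

The residual path keeps gradients in a reasonable range as depth grows, and the aggregated intermediate outputs give the auxiliary loss access to diarization information from every layer.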