
manages to win third place in the challenge.
2. System Description
In the CSSD challenge, we explore several systems, including SC, TS-VAD and EEND, which are summarized in Figure 1. In addition, we explore DOVER-LAP [24] to fuse the RTTM outputs of the above three systems into the final speaker diarization results.
2.1. Spectral Clustering System
A clustering-based speaker diarization system generally consists of VAD, speaker embedding extraction and clustering modules. The VAD module can be a simple energy-based detector or a neural network. Thanks to recent advances in neural speaker verification, various types of speaker embedding extractors, including x-vectors [28], ResNet [13] and ECAPA-TDNN [7], can be used to extract discriminative speaker embeddings from audio segments. As for the clustering step, agglomerative hierarchical clustering (AHC) [17] and spectral clustering [30] are the two typical methods.
In this challenge, we implement a ResNet-LSTM based VAD module similar to [33] using the allowed training data; its performance on the dev and test sets is comparable with that of the TDNN-based VAD in Kaldi [23], which is trained on more data. The timestamp labels used in training are generated from the transcripts of the training set. In our VAD module, a ResNet first extracts a frame-level feature map. The feature map of the current frame is then concatenated with the feature maps of the previous and next frames, and the concatenated feature map goes through a statistics pooling layer. Finally, two BLSTM layers and a linear layer produce the speech probability of the current frame, as sketched below.
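A minimal PyTorch sketch of this VAD structure, assuming a one-frame context on each side; the truncated ResNet trunk, hidden sizes and pooling placement are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn


class ResNetLSTMVAD(nn.Module):
    """Sketch of the ResNet-LSTM VAD; all sizes are illustrative assumptions."""

    def __init__(self, feat_dim=80, channels=32, lstm_hidden=128):
        super().__init__()
        # Stand-in for the ResNet trunk; a full ResNet would stack
        # residual blocks here.
        self.resnet = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        stats_dim = 2 * channels * feat_dim  # mean + std doubles the dim
        self.blstm = nn.LSTM(stats_dim, lstm_hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, feats):                    # feats: (B, T, F)
        x = self.resnet(feats.unsqueeze(1))      # (B, C, T, F)
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)
        # Concatenate each frame with its previous and next frames
        # (boundary frames simply wrap around in this sketch).
        ctx = torch.stack([torch.roll(x, 1, dims=1), x,
                           torch.roll(x, -1, dims=1)], dim=2)
        # Statistics pooling over the three-frame context.
        stats = torch.cat([ctx.mean(dim=2), ctx.std(dim=2)], dim=-1)
        out, _ = self.blstm(stats)               # two BLSTM layers
        # Per-frame speech probability.
        return torch.sigmoid(self.linear(out)).squeeze(-1)  # (B, T)
```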
Our speaker embedding extractor is a ResNet34 [34], which extracts utterance-level embeddings after speaker segmentation. Finally, we adopt the spectral clustering algorithm to group the embeddings by speaker identity. Spectral clustering has been widely adopted in speaker diarization: the conventional AHC approach is highly time-consuming, since its processing time for an audio stream depends on the number of initial segments, and SC is introduced to mitigate this problem.
In detail, spectral clustering is a graph-based clustering algorithm. By scoring pairs of sub-segment embeddings after speaker segmentation, we obtain a similarity matrix. Given the similarity matrix S, SC finds a partition of the graph such that the edges between different groups have very low weights. Our implementation of SC is summarized in Algorithm 1, and a code sketch follows the algorithm.
2.2. Target-speaker Voice Activity Detection
A TS-VAD system [16] aims to handle overlapped speech in speaker diarization: it adopts target-speaker embeddings to identify the specific speakers within the overlapped speech. The target-speaker embeddings are estimated from the initial diarization results of the clustering-based system.
Our TS-VAD system follows the structure in [34], which consists of a ResNet front-end and a detection back-end. Instead of the four CNN layers that the original TS-VAD [16] uses to process the acoustic features, we adopt a deep speaker embedding model to extract frame-level embeddings. For an input recording, we aim to detect the speech segments of each speaker. First, a ResNet34 network with the same structure as the speaker embedding model extracts the frame-level embeddings.
Algorithm 1 Spectral clustering
Input: The similarity matrix S ∈ R^{n×n}
Output: The clustering label of every segment embedding
1: Set the diagonal elements of S to 0.
2: Compute the normalized Laplacian L_norm of S.
3: Compute the eigenvalues and eigenvectors of L_norm.
4: Compute the number of clusters k: we set a threshold α and take the number of eigenvalues below α as k.
5: Stack the eigenvectors v_1, ..., v_k corresponding to the k smallest eigenvalues as the columns of a matrix P ∈ R^{n×k}.
6: For i = 1, ..., n, let r_i be the i-th row of P. Cluster the row vectors r_1, ..., r_n with the k-means algorithm.
7: return The label A_i of each embedding, where A_i ∈ {1, ..., k}
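To make Algorithm 1 concrete, the following NumPy/scikit-learn sketch also includes the cosine scoring of sub-segment embeddings described above; the threshold value and the symmetric normalized Laplacian variant are assumptions rather than our tuned settings.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans


def spectral_clustering(embeddings, alpha=0.5):
    """Sketch of Algorithm 1; alpha is a hypothetical threshold value."""
    # Cosine similarity between L2-normalized sub-segment embeddings.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T
    np.fill_diagonal(S, 0.0)                              # step 1
    # Step 2: symmetric normalized Laplacian I - D^{-1/2} S D^{-1/2}
    # (degrees are clipped to stay positive with cosine similarities).
    d = np.maximum(S.sum(axis=1), 1e-10)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_norm = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # Step 3: eigenvalues in ascending order and their eigenvectors.
    eigvals, eigvecs = eigh(L_norm)
    # Step 4: the number of clusters k = #eigenvalues below alpha.
    k = max(1, int(np.sum(eigvals < alpha)))
    # Step 5: eigenvectors of the k smallest eigenvalues as columns of P.
    P = eigvecs[:, :k]
    # Step 6: k-means on the rows of P, one label per embedding.
    return KMeans(n_clusters=k, n_init=10).fit_predict(P)
```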
Unlike the speaker embedding model, which applies statistics pooling to project variable-length utterances into fixed-length speaker representations, our TS-VAD applies the same pooling to each frame combined with its adjacent frames to obtain frame-level embeddings. The frame-level embeddings are then concatenated with the target-speaker embeddings. Next, a Transformer encoder extracts a detection state for each speaker. A BLSTM layer then processes these detection states, concatenated together, to model the relationship between speakers. Finally, a linear layer with a sigmoid function produces the speech probability of each speaker at every time step. By thresholding and post-processing these probabilities, we obtain the speech segments of each speaker.
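A minimal PyTorch sketch of this detection back-end, assuming a fixed maximum number of speakers; the model dimension, attention heads and layer counts are illustrative, not our exact configuration.

```python
import torch
import torch.nn as nn


class TSVADBackend(nn.Module):
    """Sketch of the TS-VAD detection back-end; all sizes are
    illustrative assumptions."""

    def __init__(self, frame_dim=256, spk_dim=256, num_spk=4, d_model=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim + spk_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        # Extracts a detection state per speaker (weights shared).
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # BLSTM over the concatenated per-speaker detection states.
        self.blstm = nn.LSTM(num_spk * d_model, d_model,
                             batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * d_model, num_spk)

    def forward(self, frame_emb, spk_emb):
        # frame_emb: (B, T, frame_dim) from the ResNet34 front-end.
        # spk_emb:   (B, num_spk, spk_dim) target-speaker embeddings.
        T = frame_emb.shape[1]
        states = []
        for s in range(spk_emb.shape[1]):
            # Pair every frame embedding with one target speaker.
            tgt = spk_emb[:, s:s + 1, :].expand(-1, T, -1)
            x = self.proj(torch.cat([frame_emb, tgt], dim=-1))
            states.append(self.encoder(x))
        joint = torch.cat(states, dim=-1)        # (B, T, num_spk*d_model)
        out, _ = self.blstm(joint)
        # Speech probability of each speaker at every time step.
        return torch.sigmoid(self.linear(out))  # (B, T, num_spk)
```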
2.3. End-to-End Neural Diarization
EEND adopts a single neural network to obtain the final diarization result directly from the input audio features. EEND is composed of an encoder block and a binary classification block. Given a feature sequence generated from an audio signal, the encoder block extracts features containing diarization information, and the classification block estimates a two-dimensional sequence expressing the probability that each speaker speaks at each frame. We employ permutation invariant training (PIT) [9], which traverses all speaker orders and optimizes the minimal loss, since changing the speaker order does not affect the final result in the diarization task. Eventually, each frame is classified into one of the following three cases: no speaker, one speaker, or overlapping speakers.
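For clarity, a minimal PIT sketch over the frame-wise activity probabilities is given below; the brute-force search over permutations is practical only for small speaker counts.

```python
import itertools

import torch
import torch.nn.functional as F


def pit_bce_loss(pred, label):
    """Permutation-invariant BCE: score every speaker order and keep the
    minimum. pred and label have shape (B, T, num_spk); pred holds
    post-sigmoid probabilities."""
    num_spk = pred.shape[-1]
    per_perm = []
    for perm in itertools.permutations(range(num_spk)):
        permuted = label[:, :, list(perm)]       # reorder reference speakers
        loss = F.binary_cross_entropy(pred, permuted, reduction="none")
        per_perm.append(loss.mean(dim=(1, 2)))   # one loss per utterance
    # Minimum over permutations, chosen independently per utterance.
    return torch.stack(per_perm).min(dim=0).values.mean()
```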
In this challenge, we attempt two kinds of model structures: residual auxiliary EEND (RX-EEND) [41] and speaker-wise chain EEND (SC-EEND) [11]. In RX-EEND, each encoder block is enriched by a residual connection to restrict the gradient to a reasonable range. In addition, the output tensors of the encoder blocks, with the exception of the last block, are aggregated to calculate an extra auxiliary loss that gathers more diarization information. In SC-EEND, an LSTM layer models the dependency between speakers, i.e., whether a speaker who spoke in the past affects the other speakers, since in a real conversation one party indeed tends to be temporarily silent while the other is talking. RX-EEND improves the structure of the network, while SC-EEND better models the application scenario.
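One reading of the RX-EEND design is sketched in PyTorch below; the shared classifier head, the per-block collection of auxiliary logits, and all layer sizes are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn


class RXEncoderStack(nn.Module):
    """Sketch of the RX-EEND encoder: an extra residual connection around
    each block, plus intermediate outputs for the auxiliary loss."""

    def __init__(self, d_model=256, num_blocks=4, num_spk=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                       batch_first=True)
            for _ in range(num_blocks))
        self.classifier = nn.Linear(d_model, num_spk)

    def forward(self, x):                        # x: (B, T, d_model)
        aux_logits = []
        for i, block in enumerate(self.blocks):
            x = x + block(x)                     # residual connection
            if i < len(self.blocks) - 1:
                # Outputs of all blocks except the last feed the
                # auxiliary diarization loss.
                aux_logits.append(self.classifier(x))
        return self.classifier(x), aux_logits    # main and auxiliary outputs
```

Under this sketch, the total training loss would combine the PIT loss on the final output with a weighted sum of PIT losses on the auxiliary outputs.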