
manages to win third place in the challenge.
2. System Description
In the CSSD challenge, we explore several systems, including SC, TS-VAD and EEND, which are summarized in Figure 1. In addition, we explore DOVER-LAP [24] to fuse the RTTM outputs of the above three systems into the final speaker diarization results.
2.1. Spectral Clustering System
A clustering-based speaker diarization system generally consists of VAD, speaker embedding extraction and clustering modules. The VAD module can be a simple energy-based detector or a neural network. Thanks to recent advances in neural speaker verification, various types of speaker embedding extractors, including x-vectors [28], ResNet [13] and ECAPA-TDNN [7], can be used to extract discriminative speaker embeddings from audio segments. As for the clustering step, agglomerative hierarchical clustering (AHC) [17] and spectral clustering [30] are the two typical methods.
In this challenge, we implement a ResNet-LSTM based VAD module similar to [33] using the allowed training data; its performance on the dev and test sets is comparable with that of the TDNN-based VAD in Kaldi [23], which is trained on more data. The timestamp labels used in training are generated from the transcripts of the training set. In our VAD module, a ResNet first extracts a frame-level feature map. The feature map of the current frame is then concatenated with the feature maps of the previous and next frames, and the concatenated feature map goes through a statistics pooling layer. Finally, two BLSTM layers and a linear layer produce the speech probability of the current frame, as sketched below.
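A minimal PyTorch sketch of this VAD structure, assuming a one-frame context on each side; the truncated ResNet trunk, hidden sizes and pooling placement are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn


class ResNetLSTMVAD(nn.Module):
    """Sketch of the ResNet-LSTM VAD; all sizes are illustrative assumptions."""

    def __init__(self, feat_dim=80, channels=32, lstm_hidden=128):
        super().__init__()
        # Stand-in for the ResNet trunk; a full ResNet would stack
        # residual blocks here.
        self.resnet = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        stats_dim = 2 * channels * feat_dim  # mean + std doubles the dim
        self.blstm = nn.LSTM(stats_dim, lstm_hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, feats):                    # feats: (B, T, F)
        x = self.resnet(feats.unsqueeze(1))      # (B, C, T, F)
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)
        # Concatenate each frame with its previous and next frames
        # (boundary frames simply wrap around in this sketch).
        ctx = torch.stack([torch.roll(x, 1, dims=1), x,
                           torch.roll(x, -1, dims=1)], dim=2)
        # Statistics pooling over the three-frame context.
        stats = torch.cat([ctx.mean(dim=2), ctx.std(dim=2)], dim=-1)
        out, _ = self.blstm(stats)               # two BLSTM layers
        # Per-frame speech probability.
        return torch.sigmoid(self.linear(out)).squeeze(-1)  # (B, T)
```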
Our speaker embedding extractor is a ResNet34 [34], which extracts utterance-level embeddings after speaker segmentation. Finally, we adopt the spectral clustering algorithm to group the embeddings by speaker identity. Spectral clustering has been widely adopted in speaker diarization: the conventional AHC approach is highly time-consuming, since its processing time for an audio stream depends on the number of initial segments, and SC is introduced to mitigate this problem.
In detail, spectral clustering is a graph-based clustering algorithm. By scoring pairs of sub-segment embeddings after speaker segmentation, we obtain a similarity matrix. Given the similarity matrix S, SC finds a partition of the graph such that the edges between different groups have very low weights. Our implementation of SC is summarized in Algorithm 1, and a code sketch follows the algorithm.
2.2. Target-speaker Voice Activity Detection
A TS-VAD system [16] aims to handle overlapped speech in speaker diarization: it adopts target-speaker embeddings to identify the specific speakers within the overlapped speech. The target-speaker embeddings are estimated from the initial diarization results of the clustering-based system.
Our TS-VAD system follows the structure in [34], which consists of a ResNet front-end and a detection back-end. Instead of the four CNN layers that the original TS-VAD [16] uses to process the acoustic features, we adopt a deep speaker embedding model to extract frame-level embeddings. For an input recording, we aim to detect the speech segments of each speaker. First, a ResNet34 network with the same structure as the speaker embedding model extracts the frame-level embeddings.
Algorithm 1 Spectral clustering
Input: The similarity matrix S ∈ R^{n×n}
Output: The clustering label of every segment embedding
1: Set the diagonal elements of S to 0.
2: Compute the normalized Laplacian L_norm of S.
3: Compute the eigenvalues and eigenvectors of L_norm.
4: Compute the number of clusters k: we set a threshold α and take the number of eigenvalues below α as k.
5: Stack the eigenvectors v_1, ..., v_k corresponding to the k smallest eigenvalues as the columns of a matrix P ∈ R^{n×k}.
6: For i = 1, ..., n, let r_i be the i-th row of P. Cluster the row vectors r_1, ..., r_n with the k-means algorithm.
7: return The label A_i of each embedding, where A_i ∈ {1, ..., k}
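To make Algorithm 1 concrete, the following NumPy/scikit-learn sketch also includes the cosine scoring of sub-segment embeddings described above; the threshold value and the symmetric normalized Laplacian variant are assumptions rather than our tuned settings.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans


def spectral_clustering(embeddings, alpha=0.5):
    """Sketch of Algorithm 1; alpha is a hypothetical threshold value."""
    # Cosine similarity between L2-normalized sub-segment embeddings.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T
    np.fill_diagonal(S, 0.0)                              # step 1
    # Step 2: symmetric normalized Laplacian I - D^{-1/2} S D^{-1/2}
    # (degrees are clipped to stay positive with cosine similarities).
    d = np.maximum(S.sum(axis=1), 1e-10)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_norm = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # Step 3: eigenvalues in ascending order and their eigenvectors.
    eigvals, eigvecs = eigh(L_norm)
    # Step 4: the number of clusters k = #eigenvalues below alpha.
    k = max(1, int(np.sum(eigvals < alpha)))
    # Step 5: eigenvectors of the k smallest eigenvalues as columns of P.
    P = eigvecs[:, :k]
    # Step 6: k-means on the rows of P, one label per embedding.
    return KMeans(n_clusters=k, n_init=10).fit_predict(P)
```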
Unlike the speaker embedding model, which applies statistics pooling to project variable-length utterances into fixed-length speaker representations, our TS-VAD applies the same pooling to each frame combined with its adjacent frames to obtain frame-level embeddings. The frame-level embeddings are then concatenated with the target-speaker embeddings. Next, a Transformer encoder extracts a detection state for each speaker. A BLSTM layer then processes these detection states, concatenated together, to model the relationship between speakers. Finally, a linear layer with a sigmoid function produces the speech probability of each speaker at every time step. By thresholding and post-processing these probabilities, we obtain the speech segments of each speaker.
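A minimal PyTorch sketch of this detection back-end, assuming a fixed maximum number of speakers; the model dimension, attention heads and layer counts are illustrative, not our exact configuration.

```python
import torch
import torch.nn as nn


class TSVADBackend(nn.Module):
    """Sketch of the TS-VAD detection back-end; all sizes are
    illustrative assumptions."""

    def __init__(self, frame_dim=256, spk_dim=256, num_spk=4, d_model=256):
        super().__init__()
        self.proj = nn.Linear(frame_dim + spk_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        # Extracts a detection state per speaker (weights shared).
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # BLSTM over the concatenated per-speaker detection states.
        self.blstm = nn.LSTM(num_spk * d_model, d_model,
                             batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * d_model, num_spk)

    def forward(self, frame_emb, spk_emb):
        # frame_emb: (B, T, frame_dim) from the ResNet34 front-end.
        # spk_emb:   (B, num_spk, spk_dim) target-speaker embeddings.
        T = frame_emb.shape[1]
        states = []
        for s in range(spk_emb.shape[1]):
            # Pair every frame embedding with one target speaker.
            tgt = spk_emb[:, s:s + 1, :].expand(-1, T, -1)
            x = self.proj(torch.cat([frame_emb, tgt], dim=-1))
            states.append(self.encoder(x))
        joint = torch.cat(states, dim=-1)        # (B, T, num_spk*d_model)
        out, _ = self.blstm(joint)
        # Speech probability of each speaker at every time step.
        return torch.sigmoid(self.linear(out))  # (B, T, num_spk)
```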
2.3. End-to-End Neural Diarization
EEND adopts a single neural network to obtain the final diarization result directly from the input audio features. EEND is composed of an encoder block and a binary classification block. Given a feature sequence generated from an audio signal, the encoder block extracts features containing diarization information, and the classification block estimates a two-dimensional sequence expressing the probability that each speaker speaks at each frame. We employ permutation invariant training (PIT) [9], which traverses all speaker orders and optimizes the minimal loss, since changing the speaker order does not affect the final result in the diarization task. Eventually, each frame is classified into one of the following three cases: no speaker, one speaker, or overlapping speakers.
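For clarity, a minimal PIT sketch over the frame-wise activity probabilities is given below; the brute-force search over permutations is practical only for small speaker counts.

```python
import itertools

import torch
import torch.nn.functional as F


def pit_bce_loss(pred, label):
    """Permutation-invariant BCE: score every speaker order and keep the
    minimum. pred and label have shape (B, T, num_spk); pred holds
    post-sigmoid probabilities."""
    num_spk = pred.shape[-1]
    per_perm = []
    for perm in itertools.permutations(range(num_spk)):
        permuted = label[:, :, list(perm)]       # reorder reference speakers
        loss = F.binary_cross_entropy(pred, permuted, reduction="none")
        per_perm.append(loss.mean(dim=(1, 2)))   # one loss per utterance
    # Minimum over permutations, chosen independently per utterance.
    return torch.stack(per_perm).min(dim=0).values.mean()
```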
In this challenge, we attempt two kinds of model structures: residual auxiliary EEND (RX-EEND) [41] and speaker-wise chain EEND (SC-EEND) [11]. In RX-EEND, each encoder block is enriched by a residual connection to restrict the gradient to a reasonable range. In addition, the output tensors of the encoder blocks, with the exception of the last block, are aggregated to calculate an extra auxiliary loss that gathers more diarization information. In SC-EEND, an LSTM layer models the dependency between speakers, i.e., whether a speaker who spoke in the past affects the other speakers, since in a real conversation one party indeed tends to be temporarily silent while the other is talking. RX-EEND improves the structure of the network, while SC-EEND better models the application scenario.
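One reading of the RX-EEND design is sketched in PyTorch below; the shared classifier head, the per-block collection of auxiliary logits, and all layer sizes are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn


class RXEncoderStack(nn.Module):
    """Sketch of the RX-EEND encoder: an extra residual connection around
    each block, plus intermediate outputs for the auxiliary loss."""

    def __init__(self, d_model=256, num_blocks=4, num_spk=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                       batch_first=True)
            for _ in range(num_blocks))
        self.classifier = nn.Linear(d_model, num_spk)

    def forward(self, x):                        # x: (B, T, d_model)
        aux_logits = []
        for i, block in enumerate(self.blocks):
            x = x + block(x)                     # residual connection
            if i < len(self.blocks) - 1:
                # Outputs of all blocks except the last feed the
                # auxiliary diarization loss.
                aux_logits.append(self.classifier(x))
        return self.classifier(x), aux_logits    # main and auxiliary outputs
```

Under this sketch, the total training loss would combine the PIT loss on the final output with a weighted sum of PIT losses on the auxiliary outputs.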