HIGHLY EFFICIENT REAL-TIME STREAMING AND
FULLY ON-DEVICE SPEAKER DIARIZATION WITH
MULTI-STAGE CLUSTERING
Quan Wang⋆   Yiling Huang⋆   Han Lu   Guanlong Zhao   Ignacio Lopez Moreno
Google LLC, USA    ⋆Equal contribution
{quanw,yilinghuang,luha,guanlongzhao,elnota}@google.com
ABSTRACT
While recent research advances in speaker diarization mostly focus on improving
the quality of diarization results, there is also an increasing interest in improving the
efficiency of diarization systems. In this paper, we demonstrate that a multi-stage
clustering strategy that uses different clustering algorithms for input of different
lengths can address multi-faceted challenges of on-device speaker diarization
applications. Specifically, a fallback clusterer is used to handle short-form inputs;
a main clusterer is used to handle medium-length inputs; and a pre-clusterer is
used to compress long-form inputs before they are processed by the main clusterer.
Both the main clusterer and the pre-clusterer can be configured with an upper
bound on the computational complexity to adapt to devices with different resource
constraints. This multi-stage clustering strategy is critical for streaming on-device
speaker diarization systems, where the budgets of CPU, memory and battery are
tight.
1 INTRODUCTION
Research advances in the speaker diarization community have been tackling various challenges in
the past decade. Different clustering algorithms including agglomerative hierarchical clustering
(AHC) [1], K-means [2] and spectral clustering [3] have been explored. Specifically, many works
have been proposed to further improve spectral clustering algorithms for speaker diarization,
such as auto-tune [4], speaker turn constraints [5], and multi-scale segmentation [6]. To better
leverage training datasets that are annotated with timestamped speaker labels, supervised diarization
approaches have been explored, including UIS-RNN [7], DNC [8], and EEND [9, 10]. Approaches
such as TS-VAD [11] and EEND-SS [12] have been proposed to solve the speech separation and
diarization problems jointly [13, 14]. Various other advances are described and discussed in literature
reviews and tutorials [15, 16].
Despite these advances, another challenge that prevents speaker diarization systems from being
widely used in production environments is the efficiency of the system, a topic that has received
relatively little attention in the speaker diarization community. In this paper, we are particularly interested in
streaming on-device speaker diarization for mobile phones, such as annotating speaker labels in a
recording session [17]. The requirements for such applications are multi-faceted:
1. The diarization system must perform well on audio data from multiple domains. Since
supervised diarization algorithms such as UIS-RNN [7] and EEND [9] are highly dependent
on the domain of the training data, and often suffer from insufficient training data, we prefer
to use unsupervised clustering algorithms in such applications.

2. The diarization system must perform well on input audio of variable lengths, from a few
seconds to a few hours.

3. The system must work in a streaming fashion, producing real-time speaker labels while
audio is being recorded by the microphone on the device.

4. The diarization system must be optimized to be highly efficient, to work within the limited
budget of CPU, memory and power. In particular, the computational cost of the clustering
algorithm must be upper bounded to avoid out-of-memory (OOM) or excessive battery drain
issues on mobile devices, even if the input audio is hours long.
To meet these requirements, we propose a multi-stage clustering strategy that combines AHC and
spectral clustering, and leverages the strengths of both algorithms. Based on our experiments, with a
pre-defined threshold, AHC is good at distinguishing a single speaker from multiple speakers in short-
form audio, but often ends up with too many small clusters, especially for long-form speech. Spectral
clustering is good at estimating the number of speakers with the eigen-gap approach, but usually
under the assumptions that there are at least two different speakers and that there are sufficient data points
to be clustered. At the same time, the computational cost of spectral clustering is too high for
long-form speech, mostly due to the calculation and eigen-decomposition of the Laplacian matrix.
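To make the eigen-gap criterion concrete, the snippet below sketches how the number of speakers can be estimated from speaker embeddings. It assumes a cosine-similarity affinity matrix and an unnormalized Laplacian, and omits the affinity refinement steps used in practical spectral clustering pipelines; it is an illustration, not the exact configuration of our system.

```python
import numpy as np

def estimate_num_speakers(embeddings, max_speakers=8):
    """Estimate the speaker count via the eigen-gap of a graph Laplacian (sketch)."""
    # Cosine-similarity affinity matrix.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = normed @ normed.T

    # Unnormalized graph Laplacian: D - A.
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity

    # Eigenvalues in ascending order; the largest gap among the smallest
    # eigenvalues indicates the number of clusters (speakers).
    eigvals = np.linalg.eigvalsh(laplacian)
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1
```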
Based on the above observations, we use different clustering algorithms for inputs of different lengths.
When the input audio is short, we use AHC to avoid the insufficient-data problem of spectral clustering;
when the input audio is of medium length, we use spectral clustering for a better speaker count estimate,
and to eliminate the dependency on a pre-defined AHC threshold; when the input audio is long, we first use
AHC as a pre-clusterer to compress the inputs into hierarchical cluster centroids, then use spectral
clustering to further cluster these centroids. We enforce an upper bound on the number of inputs
to the AHC pre-clusterer by caching and re-using previous AHC centroids, such that the overall
computational cost of the entire diarization system is always bounded.
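The routing logic can be sketched as follows. The clusterer interfaces and the length thresholds (short_max and main_max) are hypothetical placeholders rather than our actual configuration, and the centroid caching that bounds the AHC pre-clusterer is omitted.

```python
def multi_stage_cluster(embeddings, fallback_clusterer, main_clusterer,
                        pre_clusterer, short_max=50, main_max=512):
    """Route inputs of different lengths to different clusterers (sketch).

    fallback_clusterer: thresholded AHC for short-form inputs.
    main_clusterer: spectral clustering with eigen-gap speaker counting.
    pre_clusterer: AHC that compresses inputs to (centroids, assignment).
    short_max / main_max: hypothetical length thresholds.
    """
    n = len(embeddings)
    if n <= short_max:
        # Short-form: AHC with a pre-defined threshold decides 1 vs. more speakers.
        return fallback_clusterer(embeddings)
    if n <= main_max:
        # Medium-length: spectral clustering estimates the speaker count directly.
        return main_clusterer(embeddings)
    # Long-form: compress to hierarchical cluster centroids, spectral-cluster the
    # centroids, then propagate the centroid labels back to every embedding.
    centroids, assignment = pre_clusterer(embeddings, max_clusters=main_max)
    centroid_labels = main_clusterer(centroids)
    return [centroid_labels[a] for a in assignment]
```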
2 BASELINE SYSTEM
Our speaker diarization system is largely built on top of the Turn-to-Diarize system described
in [5], which consists of a speaker turn detection model, a speaker encoder model, and unsupervised
clustering.
2.1 FEATURE FRONTEND
We used a shared feature frontend for the speaker turn detection model and the speaker encoder model.
This frontend first applies automatic gain control [18] to the input audio, then extracts 32ms-long
Hanning-windowed frames with a step of 10ms. For each frame, 128-dimensional log Mel-filterbank
energies (LFBE) are computed in the range between 125Hz and 7500Hz. These filterbank energies are
then stacked by 4 frames and subsampled by 3 frames, resulting in final features of 512 dimensions
with a frame rate of 30ms. These features are then filtered by a CLDNN-based Voice Activity
Detection (VAD) model [19] before being fed into the speaker turn detection and speaker encoder
models.
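As an illustration of the stacking and subsampling step, here is a small sketch assuming the 128-dimensional LFBE frames at a 10ms step are already computed; the AGC, windowing, and VAD stages are omitted, and the padding at the sequence tail is an arbitrary choice for the sketch.

```python
import numpy as np

def stack_and_subsample(lfbe, stack=4, subsample=3):
    """Stack consecutive LFBE frames and subsample in time (sketch).

    lfbe: [num_frames, 128] log Mel-filterbank energies at a 10ms step.
    Returns roughly [num_frames / subsample, 128 * stack] features at a 30ms step.
    """
    num_frames, _ = lfbe.shape
    # Concatenate each frame with the following (stack - 1) frames.
    padded = np.pad(lfbe, ((0, stack - 1), (0, 0)), mode="edge")
    stacked = np.concatenate(
        [padded[i : i + num_frames] for i in range(stack)], axis=1
    )  # [num_frames, 128 * stack]
    # Keep every `subsample`-th stacked frame: 10ms * 3 = 30ms frame rate.
    return stacked[::subsample]
```

With stack=4 and subsample=3 this turns 10ms/128-dim frames into 30ms/512-dim features, matching the numbers above.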
2.2 SPEAKER TURN DETECTION
2.2.1 MODEL ARCHITECTURE
The speaker turn detection model is a Transformer Transducer (T-T) [20] trained to output automatic
speech recognition (ASR) transcripts augmented with a special token <st> to represent a speaker
turn. The Transformer Transducer architecture includes an audio encoder, a label encoder, and a joint
network that produces the final output distribution over all possible tokens.
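Downstream, the <st> tokens are what turn the ASR output into speaker turn boundaries. The following is a hypothetical sketch of that conversion, assuming the decoder exposes (token, emission time) pairs; the tuple format and function name are illustrative, not the system's actual API.

```python
def speaker_turn_segments(decoded, total_duration_s):
    """Split an utterance into speaker turns from decoded (token, time_s) pairs.

    decoded: list of (token, emission_time_seconds) from the T-T decoder,
             where the special token "<st>" marks a speaker turn boundary.
    Returns a list of (start_s, end_s) segments, one per speaker turn.
    """
    boundaries = [t for token, t in decoded if token == "<st>"]
    starts = [0.0] + boundaries
    ends = boundaries + [total_duration_s]
    # Drop empty or degenerate segments.
    return [(s, e) for s, e in zip(starts, ends) if e > s]
```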
The audio encoder has 15 layers of Transformer blocks. Each block has a left context of 32 frames and no
right context. The hyper-parameters for each repeated block can be found in Table 1. We also use a
stacking layer after the second transformer block to change the frame rate from 30ms to 90ms, and an
unstacking layer after the 13th transformer block to change the frame rate from 90ms back to 30ms,
to speed up the audio encoder as proposed in [21].
We use an LSTM-based label encoder that has a single layer of 128 dimensions.
For the joint network, we have a projection layer that projects the audio encoder output to 256
dimensions. At its output, the joint network produces a distribution over 75 possible graphemes1
with a softmax layer. For optimization, we follow the same hyper-parameters described in [20].

1 https://github.com/google/speaker-id/blob/master/publications/Multi-stage/graphemes.syms
Table 1: Hyper-parameters of a Transformer block for the audio encoder of the speaker turn detection model.

Input feature projection     256
Dense layer 1                1024
Dense layer 2                256
Number of attention heads    8
Head dimension               64
Dropout ratio                0.1
This T-T model has 13M parameters in total, and is trained with the token-level loss introduced
in [22], with the following hyper-parameters of the loss function: weight for word errors α = 1;
weight for speaker turn false accepts β = 100; weight for speaker turn false rejects γ = 100; weight
for RNN-T loss λ = 0.03; and weight for customized minimum edit distance alignment k = 1.1.
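The precise token-level loss is defined in [22]; as a rough sketch of how the listed weights could combine, assuming a simple weighted sum (an assumption on our part, not the exact equation from [22]):

L ≈ α · (word errors) + β · (<st> false accepts) + γ · (<st> false rejects) + λ · L_RNN-T,

where the error counts are derived from the customized minimum edit distance alignment controlled by k.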
2.2.2 TRAINING DATA
The training data for the speaker turn detection model include the Fisher [23] training subset, the Callhome
American English [24] training subset, the AMI training subset, the ICSI training subset2, 4545 hours of
internal long-form videos, and 80 hours of internal simulated business meeting recordings.
2.3 SPEAKER ENCODER
2.3.1 MODEL ARCHITECTURE
The speaker encoder is a text-independent speaker recognition model trained with the generalized
end-to-end extended-set softmax (GE2E-XS) loss [25, 26]. It consists of 12 conformer [27] layers,
each with 144 dimensions, followed by an attentive temporal pooling module [28]. The speaker encoder
model has 7.4M parameters in total, and the architecture is illustrated in Fig. 1.
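As a rough illustration of pooling frame-level features into a single utterance-level embedding, here is a generic attention-weighted pooling sketch; the actual attentive temporal pooling module of [28] is designed for streaming operation and differs in its details, and the parameter shapes below are assumptions.

```python
import numpy as np

def attention_pool(frames, w, v):
    """Generic attention-weighted pooling over frame-level features (sketch).

    frames: [T, D] frame-level conformer outputs.
    w: [D, A] and v: [A] are learned attention parameters (shapes assumed).
    Returns a single [D] embedding as an attention-weighted mean over time.
    """
    scores = np.tanh(frames @ w) @ v           # [T] unnormalized attention scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # softmax over the time axis
    return weights @ frames                    # [D] pooled embedding
```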
To reduce the length of the sequence to be clustered, the speaker encoder produces only one embedding
(i.e. d-vector) for each speaker turn, or one every 6 seconds if a single turn is longer than that. To achieve
high diarization quality and streaming operation at the same time, we run spectral clustering in an online
fashion: every time we obtain a new speaker embedding, we run spectral clustering on the entire
sequence of all existing embeddings. This means it is possible to correct previously predicted speaker
labels in a later clustering step.
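A minimal sketch of this online loop is shown below; spectral_cluster stands in for the clustering stage (in practice, the multi-stage strategy of this paper) and is a placeholder name, not an actual API.

```python
def streaming_diarize(embedding_stream, spectral_cluster):
    """Re-cluster the full embedding history whenever a new d-vector arrives.

    embedding_stream: iterable of d-vectors (one per speaker turn, or per 6s).
    spectral_cluster: callable mapping a list of embeddings to speaker labels.
    Yields the complete label sequence after each update, so labels assigned
    to earlier segments may be revised by later clustering steps.
    """
    history = []
    for embedding in embedding_stream:
        history.append(embedding)
        yield spectral_cluster(history)
```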
2.3.2 TRAINING DATA
The training data for the speaker encoder model are the same as the data used in [29] and [30]. It is a
mixture [31] consisting mostly of vendor-collected speech queries from different language varieties3
using devices such as laptops and cell phones, as well as public training datasets including LibriVox [32],
CN-Celeb [33] and LDC sourced data (i.e. Fisher [34], Mixer 4/5 [35], and TIMIT [36]).
The language distribution is shown in Table 2.3.2.
During training, we apply data augmentation techniques based on noise and room simulation
effects [37, 38, 39]. Similar augmentation techniques [40, 41, 42, 43, 44] were previously used for
speaker recognition. Noise is added to the training utterances with an SNR ranging from 3dB to
15dB. The signal amplitude is also varied from a scale of 0.01x to 1.2x.
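A sketch of the noise mixing and amplitude scaling is shown below; room simulation is omitted, and everything beyond the stated SNR and amplitude ranges (function name, mono float waveforms, noise at least as long as the speech) is an assumption of the sketch.

```python
import numpy as np

def augment(speech, noise, rng=None):
    """Add noise at a random SNR in [3, 15] dB, then rescale the amplitude.

    speech, noise: 1-D float waveforms; noise is assumed to be at least as
    long as speech. Room simulation effects are not included in this sketch.
    """
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(3.0, 15.0)
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / scaled_noise_power matches the SNR.
    noise_scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = speech + noise_scale * noise[: len(speech)]
    # Random amplitude scaling between 0.01x and 1.2x.
    return rng.uniform(0.01, 1.2) * noisy
```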
2 https://github.com/kaldi-asr/kaldi/tree/master/egs/icsi
3 The 72 languages in the training data include: Afrikaans, Akan, Albanian, Arabic, Assamese, Basque, Bengali,
Bulgarian, Cantonese, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French,
Galician, German, Greek, Gujarati, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Italian,
Japanese, Kannada, Kazakh, Kinyarwanda, Korean, Lithuanian, Macedonian, Malagasy, Malay, Malayalam,
Mandarin Chinese (Simplified), Mandarin Chinese (Traditional), Marathi, Mongolian, Norwegian, Odia, Oromo,
Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Serbian, Sesotho, Sindhi, Slovak, Spanish, Swedish,
Tamil, Telugu, Thai, Tibetan, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Yoruba, Zulu.