HIGHLY EFFICIENT REAL-TIME STREAMING AND
FULLY ON-DEVICE SPEAKER DIARIZATION WITH
MULTI-STAGE CLUSTERING
Quan Wang⋆   Yiling Huang⋆   Han Lu   Guanlong Zhao   Ignacio Lopez Moreno
Google LLC, USA    ⋆Equal contribution
{quanw,yilinghuang,luha,guanlongzhao,elnota}@google.com
ABSTRACT
While recent research advances in speaker diarization mostly focus on improving
the quality of diarization results, there is also an increasing interest in improving the
efficiency of diarization systems. In this paper, we demonstrate that a multi-stage
clustering strategy that uses different clustering algorithms for input of different
lengths can address multi-faceted challenges of on-device speaker diarization
applications. Specifically, a fallback clusterer is used to handle short-form inputs;
a main clusterer is used to handle medium-length inputs; and a pre-clusterer is
used to compress long-form inputs before they are processed by the main clusterer.
Both the main clusterer and the pre-clusterer can be configured with an upper
bound on the computational complexity to adapt to devices with different resource
constraints. This multi-stage clustering strategy is critical for streaming on-device
speaker diarization systems, where the budgets of CPU, memory and battery are
tight.
1 INTRODUCTION
Research advances in the speaker diarization community have been tackling various challenges in
the past decade. Different clustering algorithms including agglomerative hierarchical clustering
(AHC) [1], K-means [2] and spectral clustering [3] have been explored. Specifically, many works
have been proposed to further improve spectral clustering algorithms for speaker diarization,
such as auto-tune [4], speaker turn constraints [5], and multi-scale segmentation [6]. To better
leverage training datasets that are annotated with timestamped speaker labels, supervised diarization
approaches have been explored, including UIS-RNN [7], DNC [8], and EEND [9, 10]. Approaches
such as TS-VAD [11] and EEND-SS [12] have been proposed to solve the speech separation and
diarization problems jointly [13, 14]. Various other advances are described and discussed in literature
reviews and tutorials [15, 16].
Despite these advances, another challenge that prevents speaker diarization systems from being
widely used in production environments is the efficiency of the system, a topic that has received
relatively little attention in the speaker diarization community. In this paper, we are particularly interested in
streaming on-device speaker diarization for mobile phones, such as annotating speaker labels in a
recording session [17]. The requirements for such applications are multi-faceted:
1. The diarization system must perform well on audio data from multiple domains. Since
supervised diarization algorithms such as UIS-RNN [7] and EEND [9] are highly dependent
on the domain of the training data, and often suffer from insufficient training data, we prefer
to use unsupervised clustering algorithms in such applications.

2. The diarization system must perform well on input audio of variable lengths, from a few
seconds to a few hours.

3. The system must work in a streaming fashion, producing real-time speaker labels while
audio is being recorded by the microphone on the device.

4. The diarization system must be optimized to be highly efficient, to work within the limited
budget of CPU, memory and power. In particular, the computational cost of the clustering
algorithm must be upper bounded to avoid out-of-memory (OOM) or excessive battery drain
issues on mobile devices, even if the input audio is hours long.
To meet these requirements, we propose a multi-stage clustering strategy that combines AHC and
spectral clustering, and leverages the strengths of both algorithms. Based on our experiments, with a
pre-defined threshold, AHC is good at distinguishing a single speaker from multiple speakers in short-
form audio, but often ends up with too many small clusters, especially for long-form speech. Spectral
clustering is good at estimating the number of speakers with the eigen-gap approach, but usually
under the assumptions that there are at least two different speakers and that there are sufficient data points
to be clustered. At the same time, the computational cost of spectral clustering is too high for
long-form speech, mostly due to the calculation and eigen-decomposition of the Laplacian matrix.
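To make the eigen-gap criterion concrete, the snippet below sketches how the number of speakers can be estimated from speaker embeddings. It assumes a cosine-similarity affinity matrix and an unnormalized Laplacian, and omits the affinity refinement steps used in practical spectral clustering pipelines; it is an illustration, not the exact configuration of our system.

```python
import numpy as np

def estimate_num_speakers(embeddings, max_speakers=8):
    """Estimate the speaker count via the eigen-gap of a graph Laplacian (sketch)."""
    # Cosine-similarity affinity matrix.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = normed @ normed.T

    # Unnormalized graph Laplacian: D - A.
    degree = np.diag(affinity.sum(axis=1))
    laplacian = degree - affinity

    # Eigenvalues in ascending order; the largest gap among the smallest
    # eigenvalues indicates the number of clusters (speakers).
    eigvals = np.linalg.eigvalsh(laplacian)
    gaps = np.diff(eigvals[: max_speakers + 1])
    return int(np.argmax(gaps)) + 1
```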
Based on the above observations, we use different clustering algorithms for inputs of different lengths.
When the input audio is short, we use AHC to avoid the insufficient-data problem of spectral clustering;
when the input audio is of medium length, we use spectral clustering for a better speaker count estimate,
and to eliminate the dependency on a pre-defined AHC threshold; when the input audio is long, we first use
AHC as a pre-clusterer to compress the inputs into hierarchical cluster centroids, then use spectral
clustering to further cluster these centroids. We enforce an upper bound on the number of inputs
to the AHC pre-clusterer by caching and re-using previous AHC centroids, such that the overall
computational cost of the entire diarization system is always bounded.
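The routing logic can be sketched as follows. The clusterer interfaces and the length thresholds (short_max and main_max) are hypothetical placeholders rather than our actual configuration, and the centroid caching that bounds the AHC pre-clusterer is omitted.

```python
def multi_stage_cluster(embeddings, fallback_clusterer, main_clusterer,
                        pre_clusterer, short_max=50, main_max=512):
    """Route inputs of different lengths to different clusterers (sketch).

    fallback_clusterer: thresholded AHC for short-form inputs.
    main_clusterer: spectral clustering with eigen-gap speaker counting.
    pre_clusterer: AHC that compresses inputs to (centroids, assignment).
    short_max / main_max: hypothetical length thresholds.
    """
    n = len(embeddings)
    if n <= short_max:
        # Short-form: AHC with a pre-defined threshold decides 1 vs. more speakers.
        return fallback_clusterer(embeddings)
    if n <= main_max:
        # Medium-length: spectral clustering estimates the speaker count directly.
        return main_clusterer(embeddings)
    # Long-form: compress to hierarchical cluster centroids, spectral-cluster the
    # centroids, then propagate the centroid labels back to every embedding.
    centroids, assignment = pre_clusterer(embeddings, max_clusters=main_max)
    centroid_labels = main_clusterer(centroids)
    return [centroid_labels[a] for a in assignment]
```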
2 BASELINE SYSTEM
Our speaker diarization system is largely built on top of the Turn-to-Diarize system described
in [5], which consists of a speaker turn detection model, a speaker encoder model, and unsupervised
clustering.
2.1 FEATURE FRONTEND
We used a shared feature frontend for the speaker turn detection model and the speaker encoder model.
This frontend first applies automatic gain control [18] to the input audio, then extracts 32ms-long
Hanning-windowed frames with a step of 10ms. For each frame, 128-dimensional log Mel-filterbank
energies (LFBE) are computed in the range between 125Hz and 7500Hz. These filterbank energies are
then stacked by 4 frames and subsampled by 3 frames, resulting in final features of 512 dimensions
with a frame rate of 30ms. These features are then filtered by a CLDNN-based Voice Activity
Detection (VAD) model [19] before being fed into the speaker turn detection and speaker encoder
models.
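As an illustration of the stacking and subsampling step, here is a small sketch assuming the 128-dimensional LFBE frames at a 10ms step are already computed; the AGC, windowing, and VAD stages are omitted, and the padding at the sequence tail is an arbitrary choice for the sketch.

```python
import numpy as np

def stack_and_subsample(lfbe, stack=4, subsample=3):
    """Stack consecutive LFBE frames and subsample in time (sketch).

    lfbe: [num_frames, 128] log Mel-filterbank energies at a 10ms step.
    Returns roughly [num_frames / subsample, 128 * stack] features at a 30ms step.
    """
    num_frames, _ = lfbe.shape
    # Concatenate each frame with the following (stack - 1) frames.
    padded = np.pad(lfbe, ((0, stack - 1), (0, 0)), mode="edge")
    stacked = np.concatenate(
        [padded[i : i + num_frames] for i in range(stack)], axis=1
    )  # [num_frames, 128 * stack]
    # Keep every `subsample`-th stacked frame: 10ms * 3 = 30ms frame rate.
    return stacked[::subsample]
```

With stack=4 and subsample=3 this turns 10ms/128-dim frames into 30ms/512-dim features, matching the numbers above.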
2.2 SPEAKER TURN DETECTION
2.2.1 MODEL ARCHITECTURE
The speaker turn detection model is a Transformer Transducer (T-T) [20] trained to output automatic
speech recognition (ASR) transcripts augmented with a special token <st> to represent a speaker
turn. The Transformer Transducer architecture includes an audio encoder, a label encoder, and a joint
network that produces the final output distribution over all possible tokens.
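Downstream, the <st> tokens are what turn the ASR output into speaker turn boundaries. The following is a hypothetical sketch of that conversion, assuming the decoder exposes (token, emission time) pairs; the tuple format and function name are illustrative, not the system's actual API.

```python
def speaker_turn_segments(decoded, total_duration_s):
    """Split an utterance into speaker turns from decoded (token, time_s) pairs.

    decoded: list of (token, emission_time_seconds) from the T-T decoder,
             where the special token "<st>" marks a speaker turn boundary.
    Returns a list of (start_s, end_s) segments, one per speaker turn.
    """
    boundaries = [t for token, t in decoded if token == "<st>"]
    starts = [0.0] + boundaries
    ends = boundaries + [total_duration_s]
    # Drop empty or degenerate segments.
    return [(s, e) for s, e in zip(starts, ends) if e > s]
```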
The audio encoder has 15 layers of Transformer blocks. Each block has a left context of 32 frames and no
right context. The hyper-parameters for each repeated block can be found in Table 1. We also use a
stacking layer after the second transformer block to change the frame rate from 30ms to 90ms, and an
unstacking layer after the 13th transformer block to change the frame rate from 90ms back to 30ms,
to speed up the audio encoder as proposed in [21].
We use an LSTM-based label encoder that has a single layer of 128 dimensions.
For the joint network, we have a projection layer that projects the audio encoder output to 256
dimensions. At its output, the joint network produces a distribution over 75 possible graphemes1
with a softmax layer. For optimization, we follow the same hyper-parameters described in [20].

1 https://github.com/google/speaker-id/blob/master/publications/Multi-stage/graphemes.syms
Table 1: Hyper-parameters of a Transformer block for the audio encoder of the speaker turn detection model.

Input feature projection     256
Dense layer 1                1024
Dense layer 2                256
Number of attention heads    8
Head dimension               64
Dropout ratio                0.1
This T-T model has 13M parameters in total, and is trained with the token-level loss introduced
in [22], with the following hyper-parameters of the loss function: weight for word errors α = 1;
weight for speaker turn false accepts β = 100; weight for speaker turn false rejects γ = 100; weight
for RNN-T loss λ = 0.03; and weight for customized minimum edit distance alignment k = 1.1.
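The precise token-level loss is defined in [22]; as a rough sketch of how the listed weights could combine, assuming a simple weighted sum (an assumption on our part, not the exact equation from [22]):

L ≈ α · (word errors) + β · (<st> false accepts) + γ · (<st> false rejects) + λ · L_RNN-T,

where the error counts are derived from the customized minimum edit distance alignment controlled by k.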
2.2.2 TRAINING DATA
The training data for the speaker turn detection model include the Fisher [23] training subset, the Callhome
American English [24] training subset, the AMI training subset, the ICSI training subset2, 4545 hours of
internal long-form videos, and 80 hours of internal simulated business meeting recordings.
2.3 SPEAKER ENCODER
2.3.1 MODEL ARCHITECTURE
The speaker encoder is a text-independent speaker recognition model trained with the generalized
end-to-end extended-set softmax (GE2E-XS) loss [25, 26]. It consists of 12 conformer [27] layers,
each with 144 dimensions, followed by an attentive temporal pooling module [28]. The speaker encoder
model has 7.4M parameters in total, and the architecture is illustrated in Fig. 1.
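As a rough illustration of pooling frame-level features into a single utterance-level embedding, here is a generic attention-weighted pooling sketch; the actual attentive temporal pooling module of [28] is designed for streaming operation and differs in its details, and the parameter shapes below are assumptions.

```python
import numpy as np

def attention_pool(frames, w, v):
    """Generic attention-weighted pooling over frame-level features (sketch).

    frames: [T, D] frame-level conformer outputs.
    w: [D, A] and v: [A] are learned attention parameters (shapes assumed).
    Returns a single [D] embedding as an attention-weighted mean over time.
    """
    scores = np.tanh(frames @ w) @ v           # [T] unnormalized attention scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()          # softmax over the time axis
    return weights @ frames                    # [D] pooled embedding
```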
To reduce the length of the sequence to be clustered, the speaker encoder produces only one embedding
(i.e. d-vector) for each speaker turn, or one every 6 seconds if a single turn is longer than that. To achieve
high diarization quality and streaming operation at the same time, we run spectral clustering in an online
fashion: every time we obtain a new speaker embedding, we run spectral clustering on the entire
sequence of all existing embeddings. This means it is possible to correct previously predicted speaker
labels in a later clustering step.
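A minimal sketch of this online loop is shown below; spectral_cluster stands in for the clustering stage (in practice, the multi-stage strategy of this paper) and is a placeholder name, not an actual API.

```python
def streaming_diarize(embedding_stream, spectral_cluster):
    """Re-cluster the full embedding history whenever a new d-vector arrives.

    embedding_stream: iterable of d-vectors (one per speaker turn, or per 6s).
    spectral_cluster: callable mapping a list of embeddings to speaker labels.
    Yields the complete label sequence after each update, so labels assigned
    to earlier segments may be revised by later clustering steps.
    """
    history = []
    for embedding in embedding_stream:
        history.append(embedding)
        yield spectral_cluster(history)
```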
2.3.2 TRAINING DATA
The training data for the speaker encoder model are the same as the data used in [29] and [30]. It is a
mixture [31] consisting mostly of vendor-collected speech queries from different language varieties3
using devices such as laptops and cell phones, as well as public training datasets including LibriVox [32],
CN-Celeb [33] and LDC sourced data (i.e. Fisher [34], Mixer 4/5 [35], and TIMIT [36]).
The language distribution is shown in Table 2.3.2.
During training, we apply data augmentation techniques based on noise and room simulation
effects [37, 38, 39]. Similar augmentation techniques [40, 41, 42, 43, 44] were previously used for
speaker recognition. Noise is added to the training utterances with an SNR ranging from 3dB to
15dB. The signal amplitude is also varied from a scale of 0.01x to 1.2x.
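A sketch of the noise mixing and amplitude scaling is shown below; room simulation is omitted, and everything beyond the stated SNR and amplitude ranges (function name, mono float waveforms, noise at least as long as the speech) is an assumption of the sketch.

```python
import numpy as np

def augment(speech, noise, rng=None):
    """Add noise at a random SNR in [3, 15] dB, then rescale the amplitude.

    speech, noise: 1-D float waveforms; noise is assumed to be at least as
    long as speech. Room simulation effects are not included in this sketch.
    """
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(3.0, 15.0)
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / scaled_noise_power matches the SNR.
    noise_scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = speech + noise_scale * noise[: len(speech)]
    # Random amplitude scaling between 0.01x and 1.2x.
    return rng.uniform(0.01, 1.2) * noisy
```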
2 https://github.com/kaldi-asr/kaldi/tree/master/egs/icsi
3 The 72 languages in the training data include: Afrikaans, Akan, Albanian, Arabic, Assamese, Basque, Bengali,
Bulgarian, Cantonese, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French,
Galician, German, Greek, Gujarati, Haitian, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Italian,
Japanese, Kannada, Kazakh, Kinyarwanda, Korean, Lithuanian, Macedonian, Malagasy, Malay, Malayalam,
Mandarin Chinese (Simplified), Mandarin Chinese (Traditional), Marathi, Mongolian, Norwegian, Odia, Oromo,
Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Serbian, Sesotho, Sindhi, Slovak, Spanish, Swedish,
Tamil, Telugu, Thai, Tibetan, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Yoruba, Zulu.