MUTUAL LEARNING OF SINGLE- AND MULTI-CHANNEL
END-TO-END NEURAL DIARIZATION
Shota Horiguchi1, Yuki Takashima1, Shinji Watanabe2, Paola García3
1Hitachi, Ltd., Japan
2Carnegie Mellon University, USA
3Johns Hopkins University, USA
ABSTRACT
Due to the high performance of multi-channel speech processing,
we can use the outputs from a multi-channel model as teacher labels
when training a single-channel model with knowledge distillation.
Conversely, it is also known that single-channel speech data
can benefit multi-channel models by mixing it with multi-channel
speech data during training or by using it for model pretraining. This
paper focuses on speaker diarization and proposes to conduct the
above bi-directional knowledge transfer alternately. We first intro-
duce an end-to-end neural diarization model that can handle both
single- and multi-channel inputs. Using this model, we alternately
conduct i) knowledge distillation from a multi-channel model to a
single-channel model and ii) finetuning from the distilled single-
channel model to a multi-channel model. Experimental results on
two-speaker data show that the proposed method mutually improved
single- and multi-channel speaker diarization performances.
Index Terms— Speaker diarization, EEND, multi-channel,
knowledge distillation, transfer learning, mutual learning
1. INTRODUCTION
Speech processing that is robust to noisy and reverberant environments
or to the presence of multiple speakers expands the practicality of speech
applications. While single-channel solutions for such conditions
are widely studied, multi-channel approaches have shown promising
performance in various speech applications such as speech recog-
nition [1, 2], speech separation [3, 4], speaker recognition [5], and
speaker diarization [6, 7, 8]. In particular, multi-channel processing
based on distributed microphones rather than microphone-array de-
vices is attracting much attention for its high versatility [8, 9, 10, 11].
Since multi-channel speech processing is powerful, its outputs
are sometimes used as teacher labels when training a single-channel
model, which is known as knowledge distillation or teacher-student
learning [12, 13]. On the other hand, it has been reported that
single-channel data is still useful in training multi-channel models,
e.g., single-channel pretraining [14, 15, 16] and simultaneous use of
single- and multi-channel data [17, 8, 16]. This can be because the
information captured by single- and multi-channel models is dif-
ferent. For example, when considering speech separation or speaker
diarization, single-channel methods must rely on speaker charac-
teristics, while multi-channel methods can use spatial information
additionally (or even only). Another study demonstrated that incor-
porating spectral and spatial information boosts speech separation
performance [18]. Let us consider a multi-channel model that can
also handle single-channel inputs. Using single-channel data to
train such a multi-channel model avoids falling into local minima
that rely too heavily on spatial information and allows the model
to benefit more from speaker characteristics [8]. Here a research
question arises—does iterative knowledge distillation from a multi-
channel to a single-channel model and finetuning from a single-channel
to a multi-channel model improve the performance of both single- and
multi-channel speech processing?

[Fig. 1: Mutual learning of single- and multi-channel EEND.]
Given that question as motivation, this paper proposes a mu-
tual learning method of single- and multi-channel end-to-end neural
diarization (EEND), illustrated in Fig. 1. We focus specifically on
speaker diarization here, but the method can be applied to other
speech processing tasks such as speech recognition and separa-
tion. We first introduce a co-attention-based multi-channel EEND
model invariant to the number and geometry of microphones. The
multi-channel model is designed to be identical to the conventional
Transformer-based single-channel EEND given single-channel in-
puts. We conduct the following processes iteratively: i) distilling
the knowledge from multi-channel EEND to single-channel EEND
(Fig. 1 left) and ii) finetuning from the distilled single-channel
EEND to multi-channel EEND (Fig. 1 right)¹. We demonstrate that
the proposed method mutually improves both single- and multi-
channel speaker diarization performance.

¹The proposed method can also be applied to two multi-channel
models, i.e., u-channel and v-channel models with u < v.
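To make the alternating procedure concrete, the following is a minimal Python sketch of the training loop in Fig. 1; `distill_fn` and `finetune_fn` are hypothetical placeholders for the two training stages, not part of any released implementation.

```python
def mutual_learning(single_model, multi_model, distill_fn, finetune_fn, rounds=2):
    """Alternate i) multi-to-single knowledge distillation and
    ii) single-to-multi finetuning, as illustrated in Fig. 1."""
    for _ in range(rounds):
        # i) Train the single-channel student to mimic the frame-wise
        # speech-activity posteriors of the multi-channel teacher.
        single_model = distill_fn(student=single_model, teacher=multi_model)
        # ii) Initialize the multi-channel model from the distilled
        # single-channel weights and finetune it on multi-channel data;
        # this works because the two architectures coincide for
        # single-channel inputs (Sec. 1).
        multi_model = finetune_fn(init_model=single_model)
    return single_model, multi_model
```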
2. RELATED WORK
2.1. Speaker diarization
Speaker diarization is the task of determining who is speaking when
in input audio. It has long been conducted by clustering speaker em-
beddings extracted from speech segments [19], but recently end-to-
end methods have attracted much attention, such as EEND [20, 8],
target-speaker voice activity detection [7], and recurrent selective
hearing networks [6]. One reason is that optimization is simple be-
cause everything is handled within a single model. Since the one-model
approach also makes it possible to apply knowledge transfer tech-
niques such as knowledge distillation and finetuning to the entire
network easily, this paper focuses on end-to-end methods, especially
EEND.
2.2. Knowledge distillation in speech applications
Knowledge distillation or teacher-student learning is a scheme to
train a student model to mimic a well-trained teacher model [12, 13].
It is widely used in speech applications such as speech recognition
[21] and separation [22]. One typical use case is knowledge distil-
lation between different network architectures: a large model to a
small model [21, 23], a normal model to a binarized model [22], an
ensemble of models to a single model [24], and a high-latency model
to a streaming model [25].
The other type of knowledge distillation, which we focus on in
this paper, is based on different inputs, while the network architec-
tures are not necessarily different. In some studies on unsupervised
domain adaptation of speech recognition [26] and speaker verifica-
tion [27], a far-field model is trained with knowledge distillation by
using a close-talk model as a teacher, both of which take single-
channel signals as inputs. Another series of studies leverages multi-
channel signals; a student model is trained so that its output for
noisy input features is close to the teacher model's output for
enhanced input features [28, 29]. Here, the enhanced
features are calculated from a multi-channel signal using beamform-
ing and the input to the model is still single-channel; thus, it is not ap-
plicable to speaker diarization where there is more than one speaker
to be enhanced. In the context of continuous speech separation [30],
a student VarArray model [11] that takes fewer channels is trained to
produce outputs similar to those of a teacher model [31]. This pa-
per, in contrast, tackles multi- to single-channel knowledge distilla-
tion with an end-to-end model rather than multi- to multi-channel
knowledge distillation as in [31].
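For the diarization setting of this paper, one plausible form of such an input-based distillation objective is sketched below in PyTorch, assuming the teacher's sigmoid posteriors are used directly as soft labels for the student; this illustrates the general technique rather than the exact loss used in any of the cited works.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits):
    """BCE between the student's posteriors (from, e.g., single-channel
    input) and the teacher's soft labels (from multi-channel input).
    Both tensors have shape (S, T), following the notation of Sec. 3."""
    with torch.no_grad():
        soft_labels = torch.sigmoid(teacher_logits)  # teacher posteriors in (0, 1)
    return F.binary_cross_entropy_with_logits(student_logits, soft_labels)
```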
3. FORMULATION OF SINGLE-CHANNEL END-TO-END
NEURAL DIARIZATION
To facilitate the explanation of the proposed method, in this section
we first formulate single-channel EEND and Transformer encoders
contained in it. For simplicity, we omit the bias parameters of each
fully-connected layer from the formulation.
3.1. Overview
As a single-channel model, we used attractor-based EEND (EEND-
EDA) [20], in which speaker-wise speech activities are calculated
from speaker-wise attractors and frame-wise embeddings. Input
$T$-length, $F$-dimensional frame-wise acoustic features $X \in \mathbb{R}^{F \times T}$ are
first converted using a position-wise fully-connected layer parame-
terized by $W_0 \in \mathbb{R}^{D \times F}$ and layer normalization as

$$E^{(0)} = \mathrm{LayerNorm}\left(W_0 X\right) \in \mathbb{R}^{D \times T}. \tag{1}$$

The resulting frame-wise embeddings are further converted using
$N$ stacked Transformer encoders [32] without positional encodings.
The transition in the $n$-th encoder layer for $1 \leq n \leq N$ is denoted as

$$E^{(n)} = \mathrm{TransformerEncoder}\left(E^{(n-1)}\right) \in \mathbb{R}^{D \times T}. \tag{2}$$
[Fig. 2: The architectures of the encoders used in this paper: (a) Transformer encoder, (b) co-attention encoder. D: the dimensionality of embeddings, T: sequence length, C: the number of channels.]
Then, $S$ speakers' speech activities $Y$ are estimated based on inner
products between the frame-wise embeddings from the last encoder
$E^{(N)}$ and speaker-wise attractors $B$ as

$$B = \mathrm{EDA}\left(E^{(N)}\right) \in (-1, 1)^{D \times S}, \tag{3}$$

$$Z = B^{\mathsf{T}} E^{(N)} \in \mathbb{R}^{S \times T}, \tag{4}$$

$$Y = \sigma\left(Z\right) \in (0, 1)^{S \times T}, \tag{5}$$

where $\mathrm{EDA}$ is an encoder-decoder-based attractor calculation mod-
ule, $(\cdot)^{\mathsf{T}}$ denotes the matrix transpose, and $\sigma(\cdot)$ is an element-wise sig-
moid operation. Note that the inner products $Z$ between the embed-
dings and attractors become the logits of the speech activities $Y$.
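As a reference point, here is a minimal PyTorch sketch of the forward pass in (1)-(5). Tensors are batch- and time-major, i.e., transposed relative to the $D \times T$ notation above; the EDA module is reduced to a bare LSTM encoder-decoder (omitting details of [20] such as frame-order shuffling); and all hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleChannelEEND(nn.Module):
    """A minimal sketch of Eqs. (1)-(5); not a faithful reproduction of [20]."""

    def __init__(self, feat_dim=345, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Eq. (1): position-wise fully-connected layer + layer normalization.
        self.embed = nn.Sequential(nn.Linear(feat_dim, d_model),
                                   nn.LayerNorm(d_model))
        # Eq. (2): N stacked Transformer encoders; no positional encodings
        # are added anywhere in this module.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, n_layers)
        # EDA, simplified to an LSTM encoder-decoder that emits one
        # attractor per decoder step (LSTM outputs lie in (-1, 1),
        # matching the codomain in Eq. (3)).
        self.eda_enc = nn.LSTM(d_model, d_model, batch_first=True)
        self.eda_dec = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x, n_speakers):
        # x: (B, T, F) frame-wise acoustic features.
        emb = self.encoders(self.embed(x))                   # (B, T, D)
        _, state = self.eda_enc(emb)
        zeros = emb.new_zeros(emb.size(0), n_speakers, emb.size(2))
        attractors, _ = self.eda_dec(zeros, state)           # Eq. (3): (B, S, D)
        logits = torch.bmm(attractors, emb.transpose(1, 2))  # Eq. (4): Z = B^T E
        return torch.sigmoid(logits)                         # Eq. (5): (B, S, T)
```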
The speech activities are optimized to minimize the permutation-
free loss, which is defined as

$$\mathcal{L}_{\mathrm{BCE}}\left(\Theta; X, \tilde{Y}\right) = \frac{1}{TS} \min_{P} \mathrm{BCE}\left(\tilde{Y}, PY\right), \tag{6}$$

where $P \in \{0, 1\}^{S \times S}$ denotes an $S \times S$ permutation matrix, $\tilde{Y} \in
\{0, 1\}^{S \times T}$ is the groundtruth speech activities, and $\mathrm{BCE}(\cdot, \cdot)$ is the
summation of the element-wise binary cross entropy. $\Theta$ is the set of
parameters of the network.
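Eq. (6) translates directly into code; below is a sketch for a single example with logits $Z$ and binary targets, where the brute-force search over all $S!$ permutations is cheap for the two-speaker setting considered in the experiments.

```python
import itertools

import torch
import torch.nn.functional as F

def permutation_free_bce(logits, target):
    """Eq. (6): minimum BCE over all speaker permutations.
    logits: (S, T) inner products Z; target: (S, T) in {0, 1}."""
    n_speakers = logits.size(0)
    losses = [
        # Permuting the rows of the predictions corresponds to PY.
        F.binary_cross_entropy_with_logits(logits[list(perm)], target)
        for perm in itertools.permutations(range(n_speakers))
    ]
    # reduction='mean' (the default) already divides by T * S.
    return torch.stack(losses).min()
```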
3.2. Detailed formulation of Transformer encoder
Given an input embedding sequence $E_{\mathrm{in}} \in \mathbb{R}^{D \times T}$, each Trans-
former encoder layer in (2) converts it into $E_{\mathrm{out}} \in \mathbb{R}^{D \times T}$ as follows:

$$E' = \mathrm{LayerNorm}\left(E_{\mathrm{in}} + \mathrm{MA}\left(E_{\mathrm{in}}, E_{\mathrm{in}}, E_{\mathrm{in}}; \Phi\right)\right) \in \mathbb{R}^{D \times T}, \tag{7}$$

$$E_{\mathrm{out}} = \mathrm{LayerNorm}\left(E' + \mathrm{FFN}\left(E'; \Psi\right)\right), \tag{8}$$

where $\Phi$ and $\Psi$ are sets of parameters, and $\mathrm{MA}$ and $\mathrm{FFN}$ denote
multi-head scaled dot-product attention and a feed-forward network
that consists of two fully-connected layers, respectively. Note that
we omit the layer index $(n)$ for simplicity. The processes above are
denoted as $\mathrm{TransformerEncoder}(\cdot)$ in (2).
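The post-LN arrangement of (7)-(8) corresponds to the following minimal PyTorch layer (batch-first tensors; sizes are illustrative):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Post-LN Transformer encoder layer following Eqs. (7)-(8)."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          batch_first=True)  # MA(.; Phi)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))   # FFN(.; Psi)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, e_in):
        # e_in: (B, T, D); query, key, and value are all E_in.
        attn_out, _ = self.attn(e_in, e_in, e_in, need_weights=False)
        e_prime = self.norm1(e_in + attn_out)                # Eq. (7)
        return self.norm2(e_prime + self.ffn(e_prime))       # Eq. (8)
```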