MUTUAL LEARNING OF SINGLE- AND MULTI-CHANNEL
END-TO-END NEURAL DIARIZATION
Shota Horiguchi1, Yuki Takashima1, Shinji Watanabe2, Paola García3
1Hitachi, Ltd., Japan
2Carnegie Mellon University, USA
3Johns Hopkins University, USA
ABSTRACT
Due to the high performance of multi-channel speech processing,
we can use the outputs from a multi-channel model as teacher labels
when training a single-channel model with knowledge distillation.
Conversely, it is also known that single-channel speech data
can benefit multi-channel models by mixing it with multi-channel
speech data during training or by using it for model pretraining. This
paper focuses on speaker diarization and proposes to conduct the
above bi-directional knowledge transfer alternately. We first intro-
duce an end-to-end neural diarization model that can handle both
single- and multi-channel inputs. Using this model, we alternately
conduct i) knowledge distillation from a multi-channel model to a
single-channel model and ii) finetuning from the distilled single-
channel model to a multi-channel model. Experimental results on
two-speaker data show that the proposed method mutually improved
single- and multi-channel speaker diarization performances.
Index Terms— Speaker diarization, EEND, multi-channel,
knowledge distillation, transfer learning, mutual learning
1. INTRODUCTION
Speech processing that is robust to noisy and reverberant environments
or to the presence of multiple speakers expands the practicality of speech
applications. While single-channel solutions for such conditions
are widely studied, multi-channel approaches have shown promising
performance in various speech applications such as speech recog-
nition [1, 2], speech separation [3, 4], speaker recognition [5], and
speaker diarization [6, 7, 8]. In particular, multi-channel processing
based on distributed microphones rather than microphone-array de-
vices is attracting much attention for its high versatility [8, 9, 10, 11].
Since multi-channel speech processing is powerful, its outputs
are sometimes used as teacher labels when training a single-channel
model, which is known as knowledge distillation or teacher-student
learning [12, 13]. On the other hand, it has been reported that
single-channel data is still useful in training multi-channel models,
e.g., single-channel pretraining [14, 15, 16] and simultaneous use of
single- and multi-channel data [17, 8, 16]. This can be because the
information captured by single- and multi-channel models is dif-
ferent. For example, when considering speech separation or speaker
diarization, single-channel methods must rely on speaker charac-
teristics, while multi-channel methods can use spatial information
additionally (or even only). Another study demonstrated that incor-
porating spectral and spatial information boosts speech separation
performance [18]. Let us consider a multi-channel model that can
also handle single-channel inputs. Using single-channel data to
train such a multi-channel model avoids falling into local minima
that rely too heavily on spatial information and allows the model
to benefit more from speaker characteristics [8]. Here a research
question arises—does iterative knowledge distillation from a multi-
channel to a single-channel model and finetuning from a single-channel
to a multi-channel model improve the performance of both single- and
multi-channel speech processing?

[Fig. 1: Mutual learning of single- and multi-channel EEND.]
Given that question as motivation, this paper proposes a mu-
tual learning method of single- and multi-channel end-to-end neural
diarization (EEND), illustrated in Fig. 1. We focus specifically on
speaker diarization here, but the method can be applied to other
speech processing tasks such as speech recognition and separa-
tion. We first introduce a co-attention-based multi-channel EEND
model invariant to the number and geometry of microphones. The
multi-channel model is designed to be identical to the conventional
Transformer-based single-channel EEND given single-channel in-
puts. We conduct the following processes iteratively: i) distilling
the knowledge from multi-channel EEND to single-channel EEND
(Fig. 1 left) and ii) finetuning from the distilled single-channel
EEND to multi-channel EEND (Fig. 1 right)¹. We demonstrate that
the proposed method mutually improves both single- and multi-
channel speaker diarization performance.

¹The proposed method can also be applied to two multi-channel
models, i.e., u-channel and v-channel models with u < v.
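To make the alternating procedure concrete, the following is a minimal Python sketch of the training loop in Fig. 1; `distill_fn` and `finetune_fn` are hypothetical placeholders for the two training stages, not part of any released implementation.

```python
def mutual_learning(single_model, multi_model, distill_fn, finetune_fn, rounds=2):
    """Alternate i) multi-to-single knowledge distillation and
    ii) single-to-multi finetuning, as illustrated in Fig. 1."""
    for _ in range(rounds):
        # i) Train the single-channel student to mimic the frame-wise
        # speech-activity posteriors of the multi-channel teacher.
        single_model = distill_fn(student=single_model, teacher=multi_model)
        # ii) Initialize the multi-channel model from the distilled
        # single-channel weights and finetune it on multi-channel data;
        # this works because the two architectures coincide for
        # single-channel inputs (Sec. 1).
        multi_model = finetune_fn(init_model=single_model)
    return single_model, multi_model
```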
2. RELATED WORK
2.1. Speaker diarization
Speaker diarization is the task of determining who is speaking when
in input audio. It has long been conducted by clustering speaker em-
beddings extracted from speech segments [19], but recently end-to-
end methods have attracted much attention, such as EEND [20, 8],
target-speaker voice activity detection [7], and recurrent selective
hearing networks [6]. One reason is that optimization is simple be-
cause everything is handled within a single model. Since the one-model
approach also makes it possible to apply knowledge transfer tech-
niques such as knowledge distillation and finetuning to the entire
network easily, this paper focuses on end-to-end methods, especially
EEND.
2.2. Knowledge distillation in speech applications
Knowledge distillation or teacher-student learning is a scheme to
train a student model to mimic a well-trained teacher model [12, 13].
It is widely used in speech applications such as speech recognition
[21] and separation [22]. One typical use case is knowledge distil-
lation between different network architectures: a large model to a
small model [21, 23], a normal model to a binarized model [22], an
ensemble of models to a single model [24], and a high-latency model
to a streaming model [25].
The other type of knowledge distillation, which we focus on in
this paper, is based on different inputs, while the network architec-
tures are not necessarily different. In some studies on unsupervised
domain adaptation of speech recognition [26] and speaker verifica-
tion [27], a far-field model is trained with knowledge distillation by
using a close-talk model as a teacher, both of which take single-
channel signals as inputs. Another series of studies leverages multi-
channel signals; a student model is trained so that its output for
noisy input features is close to the teacher model's output for
enhanced input features [28, 29]. Here, the enhanced
features are calculated from a multi-channel signal using beamform-
ing and the input to the model is still single-channel; thus, it is not ap-
plicable to speaker diarization where there is more than one speaker
to be enhanced. In the context of continuous speech separation [30],
a student VarArray model [11] that takes fewer channels is trained to
produce outputs similar to those of a teacher model [31]. This pa-
per, in contrast, tackles multi- to single-channel knowledge distilla-
tion with an end-to-end model rather than multi- to multi-channel
knowledge distillation as in [31].
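For the diarization setting of this paper, one plausible form of such an input-based distillation objective is sketched below in PyTorch, assuming the teacher's sigmoid posteriors are used directly as soft labels for the student; this illustrates the general technique rather than the exact loss used in any of the cited works.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits):
    """BCE between the student's posteriors (from, e.g., single-channel
    input) and the teacher's soft labels (from multi-channel input).
    Both tensors have shape (S, T), following the notation of Sec. 3."""
    with torch.no_grad():
        soft_labels = torch.sigmoid(teacher_logits)  # teacher posteriors in (0, 1)
    return F.binary_cross_entropy_with_logits(student_logits, soft_labels)
```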
3. FORMULATION OF SINGLE-CHANNEL END-TO-END
NEURAL DIARIZATION
To facilitate the explanation of the proposed method, in this section
we first formulate single-channel EEND and Transformer encoders
contained in it. For simplicity, we omit the bias parameters of each
fully-connected layer from the formulation.
3.1. Overview
As a single-channel model, we used attractor-based EEND (EEND-
EDA) [20], in which speaker-wise speech activities are calculated
from speaker-wise attractors and frame-wise embeddings. Input
$T$-length, $F$-dimensional frame-wise acoustic features $X \in \mathbb{R}^{F \times T}$ are
first converted using a position-wise fully-connected layer parame-
terized by $W_0 \in \mathbb{R}^{D \times F}$ and layer normalization as

$$E^{(0)} = \mathrm{LayerNorm}\left(W_0 X\right) \in \mathbb{R}^{D \times T}. \tag{1}$$

The resulting frame-wise embeddings are further converted using
$N$ stacked Transformer encoders [32] without positional encodings.
The transition in the $n$-th encoder layer for $1 \leq n \leq N$ is denoted as

$$E^{(n)} = \mathrm{TransformerEncoder}\left(E^{(n-1)}\right) \in \mathbb{R}^{D \times T}. \tag{2}$$
[Fig. 2: The architectures of the encoders used in this paper: (a) Transformer encoder, (b) co-attention encoder. D: the dimensionality of embeddings, T: sequence length, C: the number of channels.]
Then, $S$ speakers' speech activities $Y$ are estimated based on inner
products between the frame-wise embeddings from the last encoder
$E^{(N)}$ and speaker-wise attractors $B$ as

$$B = \mathrm{EDA}\left(E^{(N)}\right) \in (-1, 1)^{D \times S}, \tag{3}$$

$$Z = B^{\mathsf{T}} E^{(N)} \in \mathbb{R}^{S \times T}, \tag{4}$$

$$Y = \sigma\left(Z\right) \in (0, 1)^{S \times T}, \tag{5}$$

where $\mathrm{EDA}$ is an encoder-decoder-based attractor calculation mod-
ule, $(\cdot)^{\mathsf{T}}$ denotes the matrix transpose, and $\sigma(\cdot)$ is an element-wise sig-
moid operation. Note that the inner products $Z$ between the embed-
dings and attractors become the logits of the speech activities $Y$.
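As a reference point, here is a minimal PyTorch sketch of the forward pass in (1)-(5). Tensors are batch- and time-major, i.e., transposed relative to the $D \times T$ notation above; the EDA module is reduced to a bare LSTM encoder-decoder (omitting details of [20] such as frame-order shuffling); and all hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleChannelEEND(nn.Module):
    """A minimal sketch of Eqs. (1)-(5); not a faithful reproduction of [20]."""

    def __init__(self, feat_dim=345, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # Eq. (1): position-wise fully-connected layer + layer normalization.
        self.embed = nn.Sequential(nn.Linear(feat_dim, d_model),
                                   nn.LayerNorm(d_model))
        # Eq. (2): N stacked Transformer encoders; no positional encodings
        # are added anywhere in this module.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoders = nn.TransformerEncoder(layer, n_layers)
        # EDA, simplified to an LSTM encoder-decoder that emits one
        # attractor per decoder step (LSTM outputs lie in (-1, 1),
        # matching the codomain in Eq. (3)).
        self.eda_enc = nn.LSTM(d_model, d_model, batch_first=True)
        self.eda_dec = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x, n_speakers):
        # x: (B, T, F) frame-wise acoustic features.
        emb = self.encoders(self.embed(x))                   # (B, T, D)
        _, state = self.eda_enc(emb)
        zeros = emb.new_zeros(emb.size(0), n_speakers, emb.size(2))
        attractors, _ = self.eda_dec(zeros, state)           # Eq. (3): (B, S, D)
        logits = torch.bmm(attractors, emb.transpose(1, 2))  # Eq. (4): Z = B^T E
        return torch.sigmoid(logits)                         # Eq. (5): (B, S, T)
```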
The speech activities are optimized to minimize the permutation-
free loss, which is defined as

$$\mathcal{L}_{\mathrm{BCE}}\left(\Theta; X, \tilde{Y}\right) = \frac{1}{TS} \min_{P} \mathrm{BCE}\left(\tilde{Y}, PY\right), \tag{6}$$

where $P \in \{0, 1\}^{S \times S}$ denotes an $S \times S$ permutation matrix, $\tilde{Y} \in
\{0, 1\}^{S \times T}$ is the groundtruth speech activities, and $\mathrm{BCE}(\cdot, \cdot)$ is the
summation of the element-wise binary cross entropy. $\Theta$ is the set of
parameters of the network.
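Eq. (6) translates directly into code; below is a sketch for a single example with logits $Z$ and binary targets, where the brute-force search over all $S!$ permutations is cheap for the two-speaker setting considered in the experiments.

```python
import itertools

import torch
import torch.nn.functional as F

def permutation_free_bce(logits, target):
    """Eq. (6): minimum BCE over all speaker permutations.
    logits: (S, T) inner products Z; target: (S, T) in {0, 1}."""
    n_speakers = logits.size(0)
    losses = [
        # Permuting the rows of the predictions corresponds to PY.
        F.binary_cross_entropy_with_logits(logits[list(perm)], target)
        for perm in itertools.permutations(range(n_speakers))
    ]
    # reduction='mean' (the default) already divides by T * S.
    return torch.stack(losses).min()
```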
3.2. Detailed formulation of Transformer encoder
Given an input embedding sequence $E_{\mathrm{in}} \in \mathbb{R}^{D \times T}$, each Trans-
former encoder layer in (2) converts it into $E_{\mathrm{out}} \in \mathbb{R}^{D \times T}$ as follows:

$$E' = \mathrm{LayerNorm}\left(E_{\mathrm{in}} + \mathrm{MA}\left(E_{\mathrm{in}}, E_{\mathrm{in}}, E_{\mathrm{in}}; \Phi\right)\right) \in \mathbb{R}^{D \times T}, \tag{7}$$

$$E_{\mathrm{out}} = \mathrm{LayerNorm}\left(E' + \mathrm{FFN}\left(E'; \Psi\right)\right), \tag{8}$$

where $\Phi$ and $\Psi$ are sets of parameters, and $\mathrm{MA}$ and $\mathrm{FFN}$ denote
multi-head scaled dot-product attention and a feed-forward network
that consists of two fully-connected layers, respectively. Note that
we omit the layer index $(n)$ for simplicity. The processes above are
denoted as $\mathrm{TransformerEncoder}(\cdot)$ in (2).
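The post-LN arrangement of (7)-(8) corresponds to the following minimal PyTorch layer (batch-first tensors; sizes are illustrative):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Post-LN Transformer encoder layer following Eqs. (7)-(8)."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          batch_first=True)  # MA(.; Phi)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))   # FFN(.; Psi)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, e_in):
        # e_in: (B, T, D); query, key, and value are all E_in.
        attn_out, _ = self.attn(e_in, e_in, e_in, need_weights=False)
        e_prime = self.norm1(e_in + attn_out)                # Eq. (7)
        return self.norm2(e_prime + self.ffn(e_prime))       # Eq. (8)
```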