
MFCCA: MULTI-FRAME CROSS-CHANNEL ATTENTION FOR MULTI-SPEAKER ASR IN
MULTI-PARTY MEETING SCENARIO
Fan Yu1, Shiliang Zhang, Pengcheng Guo1, Yuhao Liang1, Zhihao Du, Yuxiao Lin2, Lei Xie1∗
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi’an, China
2College of Computer Science and Technology, Zhejiang University, Hangzhou, China
* Lei Xie is the corresponding author.
ABSTRACT
Recently, cross-channel attention, which better leverages multi-channel signals from a microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay with which the microphones of an array receive the same sound, we propose a multi-frame cross-channel attention, which models cross-channel information between adjacent frames to exploit the complementarity of both frame-wise and channel-wise knowledge. In addition, we propose a multi-layer convolutional mechanism to fuse the multi-channel output and a channel masking strategy to combat the channel-number mismatch between training and inference. Experiments on AliMeeting, a real-world meeting corpus, show that our proposed model outperforms the single-channel model by 31.7% and 37.0% CER reduction on the Eval and Test sets, respectively. Moreover, with comparable model parameters and training data, our proposed model achieves new state-of-the-art (SOTA) performance on the AliMeeting corpus, compared with the top-ranking systems in the ICASSP 2022 M2MeT challenge, a recently held multi-channel multi-speaker ASR challenge.
Index Terms—Multi-speaker ASR, multi-channel, cross-
channel attention, AliMeeting, M2MeT
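For illustration, the following is a minimal PyTorch sketch of the multi-frame cross-channel attention idea summarized in the abstract: each channel's frame-level representation (query) attends over the representations of all channels within a small window of adjacent frames (keys and values). The tensor layout, window size, and single-head formulation are simplifying assumptions and do not reproduce the exact model described in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFrameCrossChannelAttention(nn.Module):
    """Each channel's frame-level representation (query) attends over the
    representations of all channels within a small window of adjacent
    frames (keys/values), combining frame-wise and channel-wise cues."""

    def __init__(self, d_model: int, context: int = 2):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.context = context  # adjacent frames on each side (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, d_model)
        b, c, t, d = x.shape
        q = self.q_proj(x).permute(0, 2, 1, 3)            # (b, t, c, d)
        # collect a window of adjacent frames per channel for keys/values
        p = self.context
        padded = F.pad(x, (0, 0, p, p))                   # pad the time axis
        win = padded.unfold(2, 2 * p + 1, 1)              # (b, c, t, d, 2p+1)
        win = win.permute(0, 2, 1, 4, 3).reshape(b, t, c * (2 * p + 1), d)
        k, v = self.k_proj(win), self.v_proj(win)         # (b, t, c*(2p+1), d)
        att = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
        y = self.out(att @ v)                             # (b, t, c, d)
        return y.permute(0, 2, 1, 3)                      # (b, c, t, d)
```

A single attention head is shown; multi-head splitting, layer normalization, and the subsequent multi-channel fusion and channel masking are omitted here.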
1. INTRODUCTION
Multi-speaker automatic speech recognition (ASR) aims to transcribe speech that contains multiple speakers, ideally transcribing overlapped speech correctly as well. It is
an essential task of rich transcription in multi-party meet-
ings [1, 2, 3]. In recent years, with the advances of deep
learning, many end-to-end neural multi-speaker ASR ap-
proaches have been proposed [4, 5, 6] and promising results
have been achieved on synthetic multi-speaker datasets, e.g.,
LibriCSS [7]. However, transcribing real-world meetings
is far more challenging due to entangled difficulties such as overlapping speech, a conversational speaking style, an unknown number of speakers, and far-field speech signals with noise and
reverberation. Recently, two challenges – Multi-channel
Multi-party Meeting Transcription (M2MeT) [8, 9] and Mul-
timodal Information based Speech Processing (MISP) [10] –
have made available valuable real-world multi-talker speech
datasets to benchmark multi-speaker ASR towards real con-
ditions and applications.
In real-world applications, a microphone array is usually adopted for far-field speech recording scenarios, including those in M2MeT and MISP, where beamforming is a
common algorithm to leverage spatial information for multi-
channel speech enhancement. With the help of deep neural
networks, time-frequency mask-based beamforming [11, 12, 13, 14, 15] has shown superior performance in various
multi-speaker benchmarks, such as AMI [16], CHiME [17,
18] and M2MeT [8, 9]. The mask estimation network needs
to be trained with signal-level criteria on simulated data where reference speech is required. However, simulated data has a clear gap from real-world data, and optimizing signal-level criteria does not necessarily lead to a lower word error rate (WER). To alleviate such mismatch, joint
optimization of multi-channel front-end and ASR has been
proposed [19, 20, 21, 22, 23, 24]. Under the joint learning
framework, the whole system can be optimized with the ultimate ASR loss function using real-world data without clean reference signals.
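For context, the following is a minimal NumPy sketch of one common way an estimated time-frequency mask is turned into beamforming weights (MVDR with mask-weighted spatial covariances); the function name, the mask estimator, and the MVDR variant are illustrative assumptions and do not correspond to the front-end of any particular cited system.

```python
import numpy as np

def mvdr_from_masks(stft, speech_mask, noise_mask):
    """stft: (channels, frames, freqs) complex STFT of the array signal.
    speech_mask / noise_mask: (frames, freqs) values in [0, 1], e.g. the
    output of a neural mask estimator (not shown).  Returns the enhanced
    single-channel STFT of shape (frames, freqs)."""
    C, T, F = stft.shape
    enhanced = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                  # (C, T)
        # mask-weighted spatial covariance matrices of speech and noise
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / T   # (C, C)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / T
        phi_n += 1e-6 * np.eye(C)                          # regularization
        # steering vector approximated by the principal eigenvector of phi_s
        _, vecs = np.linalg.eigh(phi_s)
        steer = vecs[:, -1]
        # MVDR weights: w = phi_n^-1 d / (d^H phi_n^-1 d)
        num = np.linalg.solve(phi_n, steer)
        w = num / (steer.conj() @ num)
        enhanced[:, f] = w.conj() @ X                      # w^H x per frame
    return enhanced
```

Under the joint optimization discussed above, such a front-end would be made differentiable and trained through the ASR loss rather than a signal-level criterion.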
The attention mechanism has been recently introduced to
neural beamforming [24, 25], which performs recursive non-
linear beamforming on the data represented in a latent space.
Specifically, cross-channel attention has been proposed to
directly leverage multi-channel signals in a neural speech
recognition system [26, 27]. Impressively, such an approach
can bypass the complicated front-end formalization and inte-
grate beamforming and acoustic modeling into an end-to-end
neural solution. This cross-channel attention approach takes
the frame-wise multi-channel signal as input and learns global
correlations between sequences of different channels, which
can be easily depicted as mapping each channel representation (query) with a set of channel-averaged representations