MFCCA: MULTI-FRAME CROSS-CHANNEL ATTENTION FOR MULTI-SPEAKER ASR IN
MULTI-PARTY MEETING SCENARIO
Fan Yu1, Shiliang Zhang, Pengcheng Guo1, Yuhao Liang1, Zhihao Du, Yuxiao Lin2, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi’an, China
2College of Computer Science and Technology, Zhejiang University, Hangzhou, China
* Lei Xie is the corresponding author.
ABSTRACT
Recently, cross-channel attention, which better leverages
multi-channel signals from a microphone array, has shown
promising results in the multi-party meeting scenario. Cross-
channel attention focuses on either learning global correla-
tions between sequences of different channels or exploiting
fine-grained channel-wise information effectively at each
time step. Considering the delay of microphone array receiv-
ing sound, we propose a multi-frame cross-channel attention,
which models cross-channel information between adjacent
frames to exploit the complementarity of both frame-wise and
channel-wise knowledge. Besides, we also propose a multi-
layer convolutional mechanism to fuse the multi-channel
output and a channel masking strategy to combat the channel
number mismatch problem between training and inference.
Experiments on AliMeeting, a real-world meeting corpus, reveal
that our proposed model outperforms the single-channel model
by 31.7% and 37.0% relative CER reduction on the Eval and Test sets, respectively.
Moreover, with comparable model parameters and training
data, our proposed model achieves a new SOTA performance
on the AliMeeting corpus, compared with the top-ranking
systems in the ICASSP 2022 M2MeT challenge, a recently
held multi-channel multi-speaker ASR challenge.
Index Terms— Multi-speaker ASR, multi-channel, cross-
channel attention, AliMeeting, M2MeT
1. INTRODUCTION
Multi-speaker automatic speech recognition (ASR) aims to
transcribe speech that contains multiple speakers, including
overlapped speech. It is
an essential task of rich transcription in multi-party meet-
ings [1, 2, 3]. In recent years, with the advances of deep
learning, many end-to-end neural multi-speaker ASR ap-
proaches have been proposed [4, 5, 6] and promising results
have been achieved on synthetic multi-speaker datasets, e.g.,
LibriCSS [7]. However, transcribing real-world meetings
is far more challenging due to entangled difficulties such as
overlapping speech, conversational speaking style, unknown
number of speakers, far-field speech signals with noise and
reverberation. Recently, two challenges – Multi-channel
Multi-party Meeting Transcription (M2MeT) [8, 9] and Mul-
timodal Information based Speech Processing (MISP) [10] –
have made available valuable real-world multi-talker speech
datasets to benchmark multi-speaker ASR towards real con-
ditions and applications.
In real-world applications, a microphone array is usually
adopted for far-field speech recording scenarios, includ-
ing those in M2MeT and MISP, where beamforming is a
common algorithm to leverage spatial information for multi-
channel speech enhancement. With the help of deep neural
networks, time-frequency mask-based beamforming [11, 12,
13, 14, 15] has shown superior performance in various
multi-speaker benchmarks, such as AMI [16], CHiME [17,
18] and M2MeT [8, 9]. The mask estimation network needs
to be trained with signal-level criteria on the simulated data
where the reference speech is required. However, simulated
data has a clear gap with real-world data, and optimizing
signal-level criteria does not necessarily lead to a lower word
error rate (WER). To alleviate this mismatch, joint
optimization of multi-channel front-end and ASR has been
proposed [19, 20, 21, 22, 23, 24]. Under the joint learning
framework, the whole system can be optimized directly with
the final ASR loss, using real-world data without clean refer-
ence signals.
The attention mechanism has been recently introduced to
neural beamforming [24, 25], which performs recursive non-
linear beamforming on the data represented in a latent space.
Specifically, cross-channel attention has been proposed to
directly leverage multi-channel signals in a neural speech
recognition system [26, 27]. Impressively, such an approach
can bypass the complicated front-end formalization and inte-
grate beamforming and acoustic modeling into an end-to-end
neural solution. This cross-channel attention approach takes
the frame-wise multi-channel signal as input and learns global
correlations between sequences of different channels, which
can be easily depicted as mapping each channel represen-
tation (query) with a set of channel-average representation
(key-value) pairs to an output [26, 27], namely frame-level
cross-channel attention (FLCCA). Meanwhile, channel-level
cross-channel attention (CLCCA) has recently achieved re-
markable performance on speech separation [28, 29] and
speaker diarization [30, 31] tasks, even leading a system to
win first place in the speaker diarization track of the M2MeT
challenge [31]. Compared with FLCCA, CLCCA is com-
puted along the channel dimension: the representations of
each channel are combined with those of the other chan-
nels at each time step [28], which functions similarly to
beamforming.
From our point of view, FLCCA and CLCCA can be
complementary in capturing temporal and spatial informa-
tion. FLCCA is less capable of extracting fine-grained
channel-wise patterns, since directly averaging the channel
representations may dilute the information of individual
channels. CLCCA, on the other hand, only focuses on
leveraging spatial diversity and capturing inter-channel
correlations at each time step, without considering the
temporal context of the different
channels. Thus, in this paper, we exploit the complemen-
tarity between frame-level and channel-level cross-channel
attention and propose a multi-frame cross-channel attention
(MFCCA) by modeling both channel-wise and frame-wise
information simultaneously. Direction-of-arrival (DOA) es-
timation [32], which uses the delays with which the micro-
phones in an array receive a signal to estimate the sound
source direction from phase differences, has been widely
used for speech enhancement. Inspired by the intuition be-
hind DOA, our proposed method pays more attention to the
channel context of adjacent frames in order to model both
frame-wise and channel-wise dependencies.
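To convey the intuition only (the exact MFCCA operator is defined later in the paper), the following simplified single-head PyTorch sketch lets each channel-frame query attend over the channels of a small window of adjacent frames; the window size F, the tensor layout, and all names are assumptions made purely for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class MultiFrameCrossChannelSketch(nn.Module):
    # Illustrative single-head sketch: each (channel, frame) query attends over the
    # channels of the surrounding 2F+1 frames, mixing frame- and channel-wise cues.
    def __init__(self, d_model: int, context: int = 2):
        super().__init__()
        self.context = context                     # F adjacent frames on each side (assumed)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x_bar: torch.Tensor) -> torch.Tensor:
        # x_bar: (B, C, T, D) multi-channel features
        B, C, T, D = x_bar.shape
        F = self.context
        # Pad along time and gather a (2F+1)-frame window for every frame.
        padded = nn.functional.pad(x_bar, (0, 0, F, F))                # (B, C, T+2F, D)
        windows = padded.unfold(dimension=2, size=2 * F + 1, step=1)   # (B, C, T, D, 2F+1)
        windows = windows.permute(0, 2, 1, 4, 3).reshape(B, T, C * (2 * F + 1), D)
        q = self.q_proj(x_bar).permute(0, 2, 1, 3)                     # (B, T, C, D)
        k, v = self.k_proj(windows), self.v_proj(windows)              # (B, T, C*(2F+1), D)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale     # (B, T, C, C*(2F+1))
        out = torch.matmul(torch.softmax(scores, dim=-1), v)           # (B, T, C, D)
        return out.permute(0, 2, 1, 3)                                 # back to (B, C, T, D)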
We build our MFCCA-based multi-channel ASR within
an attention-based encoder-decoder (AED) structure [33].
Moreover, the multi-channel outputs of the encoder are
aggregated by multi-layer convolution to gradually reduce
the channel dimension. Although cross-channel attention is
independent of the number and geometry of the microphones,
it suffers from the well-known performance degradation that
occurs when the number of microphones is reduced [30, 28]. In order to
combat this issue, we propose a channel masking strategy.
By randomly masking several channels from the original
multi-channel input during training, our MFCCA approach
becomes more stable and robust to the arbitrary number of
channels.
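As a loose illustration of this strategy, the snippet below zeros out a random subset of channels for each training sample. It is only a minimal sketch under the assumption that "masking" means zeroing the selected channels; the paper's exact masking scheme may differ, and the names (x_bar, max_masked) are hypothetical.

import torch

def random_channel_mask(x_bar: torch.Tensor, max_masked: int) -> torch.Tensor:
    # Randomly zero out up to `max_masked` channels of a (B, C, T, D) batch during
    # training, so the model sees a varying effective number of channels.
    B, C, _, _ = x_bar.shape
    x_masked = x_bar.clone()
    for b in range(B):
        n_mask = int(torch.randint(0, max_masked + 1, (1,)))
        if n_mask > 0:
            idx = torch.randperm(C)[:n_mask]    # channels dropped for this sample
            x_masked[b, idx] = 0.0
    return x_masked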
To the best of our knowledge, we are the first to leverage
cross-channel attention on a real meeting corpus – AliMeet-
ing – to examine its ability in multi-speaker ASR in meet-
ing scenarios. Experiments on the AliMeeting corpus show
that our proposed multi-channel multi-speaker ASR model
outperforms the single-channel multi-speaker ASR model by
31.7% and 37.0% relative CER reduction on Eval and Test
sets, respectively. Moreover, with comparable model param-
eters and amount of training data, our proposed model
achieves 16.1% and 17.5% CER on the Eval and Test sets,
surpassing the best system in the M2MeT challenge and es-
tablishing a new SOTA on the AliMeeting corpus.
2. FROM SINGLE-CHANNEL TO CROSS-CHANNEL
ATTENTION
In this section, we first review the multi-headed self-attention
commonly used in single-channel cases and then introduce
the frame-level and channel-level cross-channel attentions,
respectively. A single-channel feature input is defined as
$\mathbf{X}$, while a $C$-channel input is formulated as
$\bar{\mathbf{X}} = [\mathbf{X}_0, \cdots, \mathbf{X}_{C-1}]$.
2.1. Single-channel attention
Single-channel attention, which is a standard self-attention
structure, adopts multi-headed scaled dot-product attention to
learn the contextual information within a single channel of the speech
signal, as shown in Fig. 1a. The output of a single-channel
attention for the i-th head is calculated as
$$
\begin{aligned}
\mathbf{Q}^{sc}_i &= \mathbf{X}\mathbf{W}^{sc,q}_i + (\mathbf{b}^{sc,q}_i)^{T} \in \mathbb{R}^{T \times D}, \\
\mathbf{K}^{sc}_i &= \mathbf{X}\mathbf{W}^{sc,k}_i + (\mathbf{b}^{sc,k}_i)^{T} \in \mathbb{R}^{T \times D}, \\
\mathbf{V}^{sc}_i &= \mathbf{X}\mathbf{W}^{sc,v}_i + (\mathbf{b}^{sc,v}_i)^{T} \in \mathbb{R}^{T \times D}, \\
\mathbf{H}^{sc}_i &= \mathrm{Softmax}\!\left(\frac{\mathbf{Q}^{sc}_i(\mathbf{K}^{sc}_i)^{T}}{\sqrt{D}}\right)\mathbf{V}^{sc}_i \in \mathbb{R}^{T \times D},
\end{aligned}
\tag{1}
$$
where $\mathrm{Softmax}(\cdot)$ is the column-wise softmax function, and $\mathbf{W}^{sc,*}_i$ and $\mathbf{b}^{sc,*}_i$ are the learnable weight and bias parameters for the $i$-th head, respectively.
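To make Eq. (1) concrete, the following PyTorch sketch implements a multi-headed scaled dot-product self-attention layer with biased Q/K/V projections. It is an illustrative re-implementation under assumed shapes (batch B, frames T, model dimension split across heads) and includes the standard head-concatenation and output projection not written out in Eq. (1); it is not the authors' code.

import torch
import torch.nn as nn

class SingleChannelAttention(nn.Module):
    # Minimal multi-headed scaled dot-product self-attention, following Eq. (1).
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Linear layers hold the learnable weights W^{sc,*} and biases b^{sc,*}.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)   # standard multi-head output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) single-channel features
        B, T, _ = x.shape
        def split(t):  # (B, T, d_model) -> (B, n_heads, T, d_head)
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_head ** 0.5)
        h = torch.matmul(torch.softmax(scores, dim=-1), v)   # (B, H, T, d_head)
        h = h.transpose(1, 2).reshape(B, T, -1)              # concatenate heads
        return self.out_proj(h)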
2.2. Frame-level cross-channel attention
Frame-level cross-channel attention [26, 27] learns not only
the contextual information between time frames but also spa-
tial information across channels, as shown in Fig. 1b. The i-th
head of FLCCA is calculated as
$$
\begin{aligned}
\mathbf{Q}^{fl}_i &= \bar{\mathbf{X}}\mathbf{W}^{fl,q}_i + (\mathbf{b}^{fl,q}_i)^{T} \in \mathbb{R}^{C \times T \times D}, \\
\mathbf{K}^{fl}_i &= \bar{\mathbf{X}}'\mathbf{W}^{fl,k}_i + (\mathbf{b}^{fl,k}_i)^{T} \in \mathbb{R}^{C \times T \times D}, \\
\mathbf{V}^{fl}_i &= \bar{\mathbf{X}}'\mathbf{W}^{fl,v}_i + (\mathbf{b}^{fl,v}_i)^{T} \in \mathbb{R}^{C \times T \times D}, \\
\mathbf{H}^{fl}_i &= \mathrm{Softmax}\!\left(\frac{\mathbf{Q}^{fl}_i(\mathbf{K}^{fl}_i)^{T}}{\sqrt{D}}\right)\mathbf{V}^{fl}_i \in \mathbb{R}^{C \times T \times D},
\end{aligned}
\tag{2}
$$
where $\bar{\mathbf{X}}' = [\bar{\mathbf{X}}'_0, \cdots, \bar{\mathbf{X}}'_{C-1}]$, and $\bar{\mathbf{X}}'_c$ is the average of all channels except for the $c$-th channel, calculated as $\bar{\mathbf{X}}'_c = (\sum_{n, n \neq c} \bar{\mathbf{X}}_n)/(C-1) \in \mathbb{R}^{T \times D}$. $\mathbf{W}^{fl,*}_i$ and $\mathbf{b}^{fl,*}_i$ are the learnable weight and bias parameters, respectively.
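As a rough illustration of Eq. (2), the sketch below builds the leave-one-out channel average $\bar{\mathbf{X}}'$ and applies a single attention head per channel. Names such as x_bar are assumptions, and the multi-head bookkeeping of the full FLCCA layer [26, 27] is omitted for brevity.

import torch
import torch.nn as nn

class FrameLevelCrossChannelAttention(nn.Module):
    # Single-head sketch of FLCCA (Eq. (2)): queries come from each channel,
    # keys/values come from the average of the other channels.
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x_bar: torch.Tensor) -> torch.Tensor:
        # x_bar: (B, C, T, d_model) multi-channel features
        B, C, T, D = x_bar.shape
        # X'_c = average of all channels except channel c (leave-one-out mean).
        total = x_bar.sum(dim=1, keepdim=True)              # (B, 1, T, D)
        x_prime = (total - x_bar) / (C - 1)                 # (B, C, T, D)
        q = self.q_proj(x_bar)                              # queries from each channel
        k = self.k_proj(x_prime)                            # keys from other-channel average
        v = self.v_proj(x_prime)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale   # (B, C, T, T)
        return torch.matmul(torch.softmax(scores, dim=-1), v)        # (B, C, T, D)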