
MFCCA: MULTI-FRAME CROSS-CHANNEL ATTENTION FOR MULTI-SPEAKER ASR IN
MULTI-PARTY MEETING SCENARIO
Fan Yu1, Shiliang Zhang, Pengcheng Guo1, Yuhao Liang1, Zhihao Du, Yuxiao Lin2, Lei Xie1∗
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi’an, China
2College of Computer Science and Technology, Zhejiang University, Hangzhou, China
* Lei Xie is the corresponding author.
ABSTRACT
Recently, cross-channel attention, which better leverages multi-channel signals from a microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay with which the microphones of an array receive the same sound, we propose a multi-frame cross-channel attention, which models cross-channel information between adjacent frames to exploit the complementarity of both frame-wise and channel-wise knowledge. In addition, we propose a multi-layer convolutional mechanism to fuse the multi-channel output and a channel masking strategy to combat the channel-number mismatch between training and inference. Experiments on AliMeeting, a real-world meeting corpus, show that our proposed model outperforms the single-channel model by 31.7% and 37.0% CER reduction on the Eval and Test sets, respectively. Moreover, with comparable model parameters and training data, our proposed model achieves new state-of-the-art (SOTA) performance on the AliMeeting corpus, compared with the top-ranking systems in the ICASSP 2022 M2MeT challenge, a recently held multi-channel multi-speaker ASR challenge.
Index Terms—Multi-speaker ASR, multi-channel, cross-
channel attention, AliMeeting, M2MeT
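For illustration, the following is a minimal PyTorch sketch of the multi-frame cross-channel attention idea summarized in the abstract: each channel's frame-level representation (query) attends over the representations of all channels within a small window of adjacent frames (keys and values). The tensor layout, window size, and single-head formulation are simplifying assumptions and do not reproduce the exact model described in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFrameCrossChannelAttention(nn.Module):
    """Each channel's frame-level representation (query) attends over the
    representations of all channels within a small window of adjacent
    frames (keys/values), combining frame-wise and channel-wise cues."""

    def __init__(self, d_model: int, context: int = 2):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.context = context  # adjacent frames on each side (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, d_model)
        b, c, t, d = x.shape
        q = self.q_proj(x).permute(0, 2, 1, 3)            # (b, t, c, d)
        # collect a window of adjacent frames per channel for keys/values
        p = self.context
        padded = F.pad(x, (0, 0, p, p))                   # pad the time axis
        win = padded.unfold(2, 2 * p + 1, 1)              # (b, c, t, d, 2p+1)
        win = win.permute(0, 2, 1, 4, 3).reshape(b, t, c * (2 * p + 1), d)
        k, v = self.k_proj(win), self.v_proj(win)         # (b, t, c*(2p+1), d)
        att = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)
        y = self.out(att @ v)                             # (b, t, c, d)
        return y.permute(0, 2, 1, 3)                      # (b, c, t, d)
```

A single attention head is shown; multi-head splitting, layer normalization, and the subsequent multi-channel fusion and channel masking are omitted here.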
1. INTRODUCTION
Multi-speaker automatic speech recognition (ASR) aims to transcribe speech that contains multiple speakers, ideally transcribing overlapped speech correctly as well. It is
an essential task of rich transcription in multi-party meet-
ings [1, 2, 3]. In recent years, with the advances of deep
learning, many end-to-end neural multi-speaker ASR ap-
proaches have been proposed [4, 5, 6] and promising results
have been achieved on synthetic multi-speaker datasets, e.g.,
LibriCSS [7]. However, transcribing real-world meetings
is far more challenging due to entangled difficulties such as overlapping speech, a conversational speaking style, an unknown number of speakers, and far-field speech signals with noise and
reverberation. Recently, two challenges – Multi-channel
Multi-party Meeting Transcription (M2MeT) [8, 9] and Mul-
timodal Information based Speech Processing (MISP) [10] –
have made available valuable real-world multi-talker speech
datasets to benchmark multi-speaker ASR towards real con-
ditions and applications.
In real-world applications, a microphone array is usually adopted for far-field speech recording scenarios, including those in M2MeT and MISP, where beamforming is a
common algorithm to leverage spatial information for multi-
channel speech enhancement. With the help of deep neural
networks, time-frequency mask-based beamforming [11, 12, 13, 14, 15] has shown superior performance in various
multi-speaker benchmarks, such as AMI [16], CHiME [17,
18] and M2MeT [8, 9]. The mask estimation network needs
to be trained with signal-level criteria on simulated data where reference speech is required. However, simulated data has a clear gap from real-world data, and optimizing signal-level criteria does not necessarily lead to a lower word error rate (WER). To alleviate such mismatch, joint
optimization of multi-channel front-end and ASR has been
proposed [19, 20, 21, 22, 23, 24]. Under the joint learning
framework, the whole system can be optimized with the ultimate ASR loss function using real-world data without clean reference signals.
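For context, the following is a minimal NumPy sketch of one common way an estimated time-frequency mask is turned into beamforming weights (MVDR with mask-weighted spatial covariances); the function name, the mask estimator, and the MVDR variant are illustrative assumptions and do not correspond to the front-end of any particular cited system.

```python
import numpy as np

def mvdr_from_masks(stft, speech_mask, noise_mask):
    """stft: (channels, frames, freqs) complex STFT of the array signal.
    speech_mask / noise_mask: (frames, freqs) values in [0, 1], e.g. the
    output of a neural mask estimator (not shown).  Returns the enhanced
    single-channel STFT of shape (frames, freqs)."""
    C, T, F = stft.shape
    enhanced = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                  # (C, T)
        # mask-weighted spatial covariance matrices of speech and noise
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / T   # (C, C)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / T
        phi_n += 1e-6 * np.eye(C)                          # regularization
        # steering vector approximated by the principal eigenvector of phi_s
        _, vecs = np.linalg.eigh(phi_s)
        steer = vecs[:, -1]
        # MVDR weights: w = phi_n^-1 d / (d^H phi_n^-1 d)
        num = np.linalg.solve(phi_n, steer)
        w = num / (steer.conj() @ num)
        enhanced[:, f] = w.conj() @ X                      # w^H x per frame
    return enhanced
```

Under the joint optimization discussed above, such a front-end would be made differentiable and trained through the ASR loss rather than a signal-level criterion.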
The attention mechanism has been recently introduced to
neural beamforming [24, 25], which performs recursive non-
linear beamforming on the data represented in a latent space.
Specifically, cross-channel attention has been proposed to
directly leverage multi-channel signals in a neural speech
recognition system [26, 27]. Impressively, such an approach
can bypass the complicated front-end formalization and inte-
grate beamforming and acoustic modeling into an end-to-end
neural solution. This cross-channel attention approach takes
the frame-wise multi-channel signal as input and learns global
correlations between sequences of different channels, which
can be easily depicted as mapping each channel representation (query) with a set of channel-averaged representations