
TRIDENTSE: GUIDING SPEECH ENHANCEMENT WITH 32 GLOBAL TOKENS
Dacheng Yin1∗, Zhiyuan Zhao2, Chuanxin Tang2, Zhiwei Xiong1, Chong Luo2
1University of Science and Technology of China, Hefei, China
2Microsoft Research Asia, Beijing, China
ABSTRACT
In this paper, we present TridentSE, a novel architecture for speech
enhancement that efficiently captures both global information and
local details. TridentSE maintains a T-F-bin-level representation to
capture details, and uses a small number of global tokens to process
the global information. Information is propagated between the local
and global representations through cross-attention modules. To capture
both inter- and intra-frame information, the global tokens are divided
into two groups, processed along the time and frequency axes
respectively. A metric discriminator is further employed to guide our
model toward higher perceptual quality.
Even with significantly lower computational cost, TridentSE outperforms
a variety of previous speech enhancement methods, achieving a PESQ of
3.47 on the VoiceBank+DEMAND dataset and 3.44 on the DNS no-reverb test
set. Visualization shows that the global tokens learn diverse and
interpretable global patterns.
Index Terms—Speech enhancement, global representation
1. INTRODUCTION
Speech enhancement (SE) aims to improve the quality of speech
when it is contaminated with noise. In the deep learning era, speech
enhancement techniques have also made great progress. One line
of research is the time-domain methods [1, 2, 3], which process
speech directly in waveform domain. Another line of research is the
frequency-domain methods [4, 5, 6], which process speech in the T-
F spectrogram domain. Our method belongs to the second category,
and the objective of this research is to design an effective frequency-
domain method for single-channel speech enhancement.
For a frequency-domain SE method, the input is the time-frequency
(T-F) spectrogram, and previous research [7] has shown that it is
better to use a T-F mask, rather than the T-F values themselves, as the
immediate prediction target. SE therefore solves a dense classification
or prediction problem, where dense means that each T-F bin has a
corresponding prediction output. The T-F-bin-level details, especially
the phase structure, have become increasingly important with the
development of masking methods [8, 9, 10, 11]. This requires the SE
network to faithfully capture local details. On the other hand,
previous work indicates that a good SE network is inseparable from an
understanding of global (long-range) information along both the
frequency axis [4] and the time axis [12]. In short, an SE network
needs to learn both local details and global information.
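The advantage of mask prediction can be seen in a one-bin toy example (the numbers below are hypothetical, not from the paper): a complex ratio mask jointly corrects the magnitude and the phase of each noisy T-F bin.

```python
# Hypothetical complex STFT values for a single T-F bin (illustrative only).
noisy = 0.8 + 0.6j   # noisy-speech bin
clean = 0.5 + 0.1j   # clean-speech bin

# The ideal complex ratio mask for this bin is the element-wise quotient;
# at inference time the network must predict the mask from the noisy input alone.
mask = clean / noisy

# Applying the mask rescales and rotates the noisy bin onto the clean one,
# correcting magnitude and phase simultaneously.
recovered = noisy * mask
print(abs(recovered - clean))  # ~0.0
```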
Simultaneous learning of these two types of information is a
non-trivial problem. Existing frequency-domain SE methods either
adopt a cylindrical network structure [4, 6, 12] or a U-shaped struc-
ture [13, 5]. In the first category, the feature map maintains its orig-
inal T-F resolution as it is transformed by the SE network. While
*Work done during internship at Microsoft Research Asia.
dense local information is naturally processed in each T-F bin, sparse
global information is also aggregated by each T-F bin without much
coordination, which is computationally inefficient. In the second
category, the feature map is gradually down-sampled during feature
transformation. At its smallest size, computing global semantic
information becomes affordable. Then, the transformed feature map
is gradually up-sampled to its original size. Skip connections are
adopted to connect two layers with the same feature size before and
after the computation on the smallest feature map, or in other words,
merge the low-level and high-level features. Notably, the full-resolution
feature is not merged with high-level features until the end
of the network. This limits the network's information-fusion ability
compared to the cylindrical architecture, which processes both levels
of information at each layer.
In this work, we propose a third network structure for the SE
task. It is formed by a main network, which maintains a full-
resolution feature map, and two companion branches, each of which
only keeps 16 global tokens. As the three-branch network architecture
resembles a trident, we name our SE network TridentSE. The main
network is responsible for computing the dense low-level details
and the companion branches handle the global information. We
distinguish between temporal tokens and frequency tokens, both of
which are initially extracted from the original feature map by a
cross-attention operation. After each processing unit, they inject the
global temporal and frequency information back into the main network
through the same cross-attention operation.
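The extract-and-inject mechanism can be sketched with plain scaled dot-product cross-attention. This is a simplified sketch: the learned projection matrices, multi-head structure, and normalization of the actual model are omitted, the sizes are illustrative, and the T-F map is flattened rather than processed per axis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (M, N) attention logits
    return softmax(scores, axis=-1) @ keys_values   # (M, d) attended output

rng = np.random.default_rng(0)
T, F, d, n_tok = 100, 257, 64, 16                   # illustrative sizes
feat = rng.standard_normal((T * F, d))              # flattened T-F feature map
tokens = rng.standard_normal((n_tok, d))            # global tokens of one branch

# Extraction: tokens query the dense T-F features to gather global information.
tokens = tokens + cross_attend(tokens, feat)
# Injection: every T-F bin queries the tokens to receive global context.
feat = feat + cross_attend(feat, tokens)
```

Because the token count is fixed and small, both steps are linear in the number of T-F bins, unlike bin-to-bin global aggregation.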
Employing two dedicated branches to compute high-level se-
mantic information brings notable benefits. Compared with the
cylindrical network structure, TridentSE significantly reduces the
computational redundancy of global information. Compared with
the U-shaped network structure, TridentSE is able to perform long-
range computation from the very beginning of the network. The
cross-attention-based fusion is also more powerful than the simple
addition operation usually adopted in skip connections.
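The magnitude of this saving can be illustrated with a back-of-the-envelope cost model; the sizes and the attention-style cost accounting below are our own illustration, not figures from the paper.

```python
# Hypothetical cost model: attention-style global aggregation over a
# T x F spectrogram with d-dimensional features and k global tokens.
T, F, d, k = 100, 257, 64, 16
N = T * F                                   # number of T-F bins

full_self_attention = N * N * d             # every bin attends to every bin
token_cross_attention = 2 * N * k * d       # extract into k tokens + inject back

ratio = full_self_attention / token_cross_attention   # = N / (2 * k)
print(round(ratio))  # roughly 800x fewer multiply-adds under this model
```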
Experimental results show that TridentSE achieves higher en-
hancement quality with lower computational complexity compared
with various previous methods. Through visualization, we confirm
that the global tokens learn diverse and interpretable global patterns.
2. METHOD
2.1. Overview of TridentSE
We adopt the T-F masking framework in [12] to estimate the clean
waveform from the input noisy signal. Fig. 1 shows the overall
architecture of the proposed T-F mask prediction network, TridentSE. It
is composed of three components: the encoder, the backbone, and
the decoder. The encoder extracts local time-frequency features from
the input spectrogram, and the decoder decodes the time-frequency
representation into the complex ratio mask Mc ∈ R^{T×F×2}. Specif-
arXiv:2210.12995v1 [eess.AS] 24 Oct 2022