TRIDENTSE: GUIDING SPEECH ENHANCEMENT WITH 32 GLOBAL TOKENS
Dacheng Yin1, Zhiyuan Zhao2, Chuanxin Tang2, Zhiwei Xiong1, Chong Luo2
1University of Science and Technology of China, Hefei, China
2Microsoft Research Asia, Beijing, China
ABSTRACT
In this paper, we present TridentSE, a novel architecture for speech
enhancement that efficiently captures both global information and
local details. TridentSE maintains a T-F bin level representation to
capture details and uses a small number of global tokens to process
the global information. Information is propagated between the local
and the global representations through cross-attention modules. To
capture both inter- and intra-frame information, the global tokens are
divided into two groups that process along the time axis and the
frequency axis, respectively. A metric discriminator is further
employed to guide our model toward higher perceptual quality. Even
with significantly lower computational cost, TridentSE outperforms a
variety of previous speech enhancement methods, achieving a PESQ of
3.47 on the VoiceBank+DEMAND dataset and a PESQ of 3.44 on the DNS
no-reverb test set. Visualization shows that the global tokens learn
diverse and interpretable global patterns.
Index Terms: Speech enhancement, global representation
1. INTRODUCTION
Speech enhancement (SE) aims to improve the quality of speech that is
contaminated with noise. In the deep learning era, speech enhancement
techniques have made great progress. One line of research is the
time-domain methods [1, 2, 3], which process speech directly in the
waveform domain. Another line of research is the frequency-domain
methods [4, 5, 6], which process speech in the T-F spectrogram
domain. Our method belongs to the second category, and the objective
of this research is to design an effective frequency-domain method
for single-channel speech enhancement.
For a frequency-domain SE method, the input is the time-frequency
(T-F) spectrogram, and previous research [7] has shown that it is
better to use a T-F mask instead of the T-F values as the immediate
prediction target. Therefore, SE solves a dense classification or
prediction problem, where dense means that each T-F bin has a
corresponding prediction output. The T-F bin level details,
especially the phase structure, have become increasingly important
with the development of masking methods [8, 9, 10, 11]. This requires
the SE network to faithfully capture the local details. On the other
hand, previous work indicates that a good SE network also depends on
understanding global (long-range) information along both the
frequency axis [4] and the time axis [12]. In short, an SE network
needs to learn both local details and global information.
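To make the dense, per-bin nature of mask prediction concrete, the following is a minimal sketch of how a predicted mask could be applied to a noisy spectrogram. It assumes a complex ratio mask stored as (real, imaginary) pairs in a T x F x 2 layout, the shape given for the decoder output in Section 2.1. The function name `apply_complex_mask` and the nested-list data layout are illustrative choices, not taken from the paper.

```python
def apply_complex_mask(noisy, mask):
    """Multiply every T-F bin of `noisy` by its complex mask value.

    noisy: list of T frames, each a list of F complex numbers.
    mask:  list of T frames, each a list of F (real, imag) pairs,
           i.e. a T x F x 2 layout (assumed for illustration).
    """
    if len(noisy) != len(mask):
        raise ValueError("noisy and mask must have the same number of frames")
    enhanced = []
    for frame, mask_frame in zip(noisy, mask):
        if len(frame) != len(mask_frame):
            raise ValueError("each frame needs one mask value per frequency bin")
        # Complex multiplication rescales the magnitude and rotates the
        # phase of each bin, so the mask can correct phase as well.
        enhanced.append(
            [x * complex(re, im) for x, (re, im) in zip(frame, mask_frame)]
        )
    return enhanced
```

Because the mask has one complex value per bin, any error in a single bin directly distorts the output at that time and frequency, which is why the local details matter.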
Simultaneous learning of these two types of information is a
non-trivial problem. Existing frequency-domain SE methods adopt
either a cylindrical network structure [4, 6, 12] or a U-shaped
structure [13, 5]. In the first category, the feature map maintains
its original T-F resolution as it is transformed by the SE network.
While dense local information is naturally processed in each T-F bin,
sparse global information is also aggregated by each T-F bin without
much coordination, which is computationally inefficient. In the
second category, the feature map is gradually down-sampled during
feature transformation. At its smallest size, it is affordable to
compute the global semantic information. Then, the transformed
feature map is gradually up-sampled to its original size. Skip
connections are adopted to connect two layers with the same feature
size before and after the computation on the smallest feature map,
or in other words, to merge the low-level and high-level features.
Notably, the full-resolution feature is not merged with high-level
features until the end of the network. This limits the network's
information fusion ability compared to the cylindrical architecture,
which processes both levels of information at each layer.
*Work done during internship at Microsoft Research Asia.
In this work, we propose a third network structure for the SE task.
It is formed by a main network, which maintains a full-resolution
feature map, and two companion branches, each of which keeps only 16
global tokens. As the three-branch network architecture resembles a
trident, we name our SE network TridentSE. The main network is
responsible for computing the dense low-level details, and the
companion branches handle the global information. We differentiate
temporal tokens and frequency tokens, both of which are initially
extracted from the original feature map by a cross-attention
operation. After each processing unit, they inject the global
temporal and frequency information back into the main network by the
same cross-attention operation.
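The extract-and-inject pattern described above can be sketched with plain dot-product cross-attention along one axis. This is a simplified illustration, not the paper's implementation. It assumes a single attention head, no learned query, key, or value projections, and a residual addition when injecting. The names `cross_attention`, `extract_tokens`, and `inject_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Each query vector attends over `context` (used as keys and values).

    Assumption: single head, scaled dot-product scores, no projections.
    """
    dim = len(context[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


def extract_tokens(token_queries, features):
    """Global tokens query the full-resolution features along one axis."""
    return cross_attention(token_queries, features)


def inject_tokens(features, tokens):
    """Each local feature queries the tokens; the result is added back.

    The residual addition is an assumption made for this sketch.
    """
    update = cross_attention(features, tokens)
    return [[f + u for f, u in zip(feat, upd)] for feat, upd in zip(features, update)]
```

In this sketch a sequence of feature vectors stands for one axis of the feature map, for example all frames at a fixed frequency bin for the temporal tokens, or all frequency bins of a single frame for the frequency tokens.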
Employing two dedicated branches to compute high-level semantic
information brings notable benefits. Compared with the cylindrical
network structure, TridentSE significantly reduces the computational
redundancy of global information. Compared with the U-shaped network
structure, TridentSE is able to perform long-range computation from
the very beginning of the network. The cross-attention-based fusion
is also more powerful than the simple addition operation usually
adopted in skip connections.
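A rough back-of-envelope count suggests where the savings could come from. The sketch below compares the number of attention scores computed along one axis in two cases. It assumes, purely for illustration, that a cylindrical network gathers global context with full self-attention over a sequence, and that the token route costs one extraction pass plus one injection pass. Neither assumption is a measurement from the paper. The sequence length of 400 is hypothetical, and 16 is the per-branch token count stated above.

```python
def self_attention_scores(seq_len):
    """Pairwise scores when every position attends to every position."""
    return seq_len * seq_len


def token_routed_scores(seq_len, num_tokens):
    """Scores when global context is routed through a few tokens.

    Extraction: num_tokens queries over seq_len positions.
    Injection:  seq_len queries over num_tokens tokens.
    """
    return 2 * seq_len * num_tokens


# Hypothetical sequence length; 16 tokens per branch as stated in the text.
full = self_attention_scores(400)       # 160000
routed = token_routed_scores(400, 16)   # 12800
```

Under these assumptions the token route grows linearly with the sequence length, while full self-attention grows quadratically. The count ignores the cost of processing the tokens themselves and of the local operations in the main network.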
Experimental results show that TridentSE achieves higher enhancement
quality with lower computational complexity than various previous
methods. Through visualization, we confirm that the global tokens
learn diverse and interpretable global patterns.
2. METHOD
2.1. Overview of TridentSE
We adopt the T-F masking framework in [12] to estimate the clean
waveform from the noisy input signal. Fig. 1 shows the overall
architecture of the proposed T-F mask prediction network TridentSE. It
is composed of three components: the encoder, the backbone, and the
decoder. The encoder extracts local time-frequency features from the
input spectrogram, and the decoder decodes the time-frequency
representation into the complex ratio mask
$M_c \in \mathbb{R}^{T \times F \times 2}$. Specif-
arXiv:2210.12995v1 [eess.AS] 24 Oct 2022