
TRIDENTSE: GUIDING SPEECH ENHANCEMENT WITH 32 GLOBAL TOKENS
Dacheng Yin1∗, Zhiyuan Zhao2, Chuanxin Tang2, Zhiwei Xiong1, Chong Luo2
1University of Science and Technology of China, Hefei, China
2Microsoft Research Asia, Beijing, China
ABSTRACT
In this paper, we present TridentSE, a novel architecture for speech
enhancement that efficiently captures both global information and
local details. TridentSE maintains a T-F-bin-level representation to
capture details, and uses a small number of global tokens to process
the global information. Information is propagated between the local
and global representations through cross-attention modules. To capture
both inter- and intra-frame information, the global tokens are divided
into two groups, processed along the time and frequency axes
respectively. A metric discriminator is further employed to guide our
model toward higher perceptual quality.
Even with significantly lower computational cost, TridentSE outperforms
a variety of previous speech enhancement methods, achieving a PESQ of
3.47 on the VoiceBank+DEMAND dataset and 3.44 on the DNS no-reverb test
set. Visualization shows that the global tokens learn diverse and
interpretable global patterns.
Index Terms—Speech enhancement, global representation
1. INTRODUCTION
Speech enhancement (SE) aims to improve the quality of speech
when it is contaminated with noise. In the deep learning era, speech
enhancement techniques have also made great progress. One line
of research is the time-domain methods [1, 2, 3], which process
speech directly in waveform domain. Another line of research is the
frequency-domain methods [4, 5, 6], which process speech in the T-
F spectrogram domain. Our method belongs to the second category,
and the objective of this research is to design an effective frequency-
domain method for single-channel speech enhancement.
For a frequency-domain SE method, the input is the time-frequency
(T-F) spectrogram, and previous research [7] has shown that it is
better to use a T-F mask, rather than the T-F values themselves, as the
immediate prediction target. SE therefore solves a dense classification
or prediction problem, where dense means that each T-F bin has a
corresponding prediction output. The T-F-bin-level details, especially
the phase structure, have become increasingly important with the
development of masking methods [8, 9, 10, 11]. This requires the SE
network to faithfully capture local details. On the other hand,
previous work indicates that a good SE network is inseparable from an
understanding of global (long-range) information along both the
frequency axis [4] and the time axis [12]. In short, an SE network
needs to learn both local details and global information.
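The advantage of mask prediction can be seen in a one-bin toy example (the numbers below are hypothetical, not from the paper): a complex ratio mask jointly corrects the magnitude and the phase of each noisy T-F bin.

```python
# Hypothetical complex STFT values for a single T-F bin (illustrative only).
noisy = 0.8 + 0.6j   # noisy-speech bin
clean = 0.5 + 0.1j   # clean-speech bin

# The ideal complex ratio mask for this bin is the element-wise quotient;
# at inference time the network must predict the mask from the noisy input alone.
mask = clean / noisy

# Applying the mask rescales and rotates the noisy bin onto the clean one,
# correcting magnitude and phase simultaneously.
recovered = noisy * mask
print(abs(recovered - clean))  # ~0.0
```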
Simultaneous learning of these two types of information is a
non-trivial problem. Existing frequency-domain SE methods either
adopt a cylindrical network structure [4, 6, 12] or a U-shaped struc-
ture [13, 5]. In the first category, the feature map maintains its orig-
inal T-F resolution as it is transformed by the SE network. While
*Work done during internship at Microsoft Research Asia.
dense local information is naturally processed in each T-F bin, sparse
global information is also aggregated by each T-F bin without much
coordination, which is computationally inefficient. In the second
category, the feature map is gradually down-sampled during feature
transformation. At its smallest size, computing global semantic
information becomes affordable. Then, the transformed feature map
is gradually up-sampled to its original size. Skip connections are
adopted to connect two layers with the same feature size before and
after the computation on the smallest feature map, or in other words,
merge the low-level and high-level features. Notably, the full-resolution
feature is not merged with high-level features until the end
of the network. This limits the network's information-fusion ability
compared to the cylindrical architecture, which processes both levels
of information at each layer.
In this work, we propose a third network structure for the SE
task. It is formed by a main network, which maintains a full-
resolution feature map, and two companion branches, each of which
only keeps 16 global tokens. As the three-branch network architecture
resembles a trident, we name our SE network TridentSE. The main
network is responsible for computing the dense low-level details
and the companion branches handle the global information. We
distinguish between temporal tokens and frequency tokens, both of
which are initially extracted from the original feature map by a
cross-attention operation. After each processing unit, they inject the
global temporal and frequency information back into the main network
through the same cross-attention operation.
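The extract-and-inject mechanism can be sketched with plain scaled dot-product cross-attention. This is a simplified sketch: the learned projection matrices, multi-head structure, and normalization of the actual model are omitted, the sizes are illustrative, and the T-F map is flattened rather than processed per axis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (M, N) attention logits
    return softmax(scores, axis=-1) @ keys_values   # (M, d) attended output

rng = np.random.default_rng(0)
T, F, d, n_tok = 100, 257, 64, 16                   # illustrative sizes
feat = rng.standard_normal((T * F, d))              # flattened T-F feature map
tokens = rng.standard_normal((n_tok, d))            # global tokens of one branch

# Extraction: tokens query the dense T-F features to gather global information.
tokens = tokens + cross_attend(tokens, feat)
# Injection: every T-F bin queries the tokens to receive global context.
feat = feat + cross_attend(feat, tokens)
```

Because the token count is fixed and small, both steps are linear in the number of T-F bins, unlike bin-to-bin global aggregation.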
Employing two dedicated branches to compute high-level se-
mantic information brings notable benefits. Compared with the
cylindrical network structure, TridentSE significantly reduces the
computational redundancy of global information. Compared with
the U-shaped network structure, TridentSE is able to perform long-
range computation from the very beginning of the network. The
cross-attention-based fusion is also more powerful than the simple
addition operation usually adopted in skip connections.
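The magnitude of this saving can be illustrated with a back-of-the-envelope cost model; the sizes and the attention-style cost accounting below are our own illustration, not figures from the paper.

```python
# Hypothetical cost model: attention-style global aggregation over a
# T x F spectrogram with d-dimensional features and k global tokens.
T, F, d, k = 100, 257, 64, 16
N = T * F                                   # number of T-F bins

full_self_attention = N * N * d             # every bin attends to every bin
token_cross_attention = 2 * N * k * d       # extract into k tokens + inject back

ratio = full_self_attention / token_cross_attention   # = N / (2 * k)
print(round(ratio))  # roughly 800x fewer multiply-adds under this model
```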
Experimental results show that TridentSE achieves higher en-
hancement quality with lower computational complexity compared
with various previous methods. Through visualization, we confirm
that the global tokens learn diverse and interpretable global patterns.
2. METHOD
2.1. Overview of TridentSE
We adopt the T-F masking framework in [12] to estimate the clean
waveform from the input noisy signal. Fig. 1 shows the overall
architecture of the proposed T-F mask prediction network, TridentSE. It
is composed of three components: the encoder, the backbone, and
the decoder. The encoder extracts local time-frequency features from
the input spectrogram, and the decoder decodes the time-frequency
representation into the complex ratio mask Mc ∈ R^{T×F×2}. Specif-
arXiv:2210.12995v1 [eess.AS] 24 Oct 2022