SAN: A ROBUST END-TO-END ASR MODEL ARCHITECTURE

Zeping Min*1, Qian Ge*1, Guanhua Huang*2

1Peking University

2University of Science and Technology of China

ABSTRACT

In this paper, we propose a novel Siamese Adversarial Network (SAN) architecture for automatic speech recognition, which aims to solve the difficulty of fuzzy audio recognition. Specifically, SAN constructs two sub-networks to differentiate the audio feature input and then introduces a loss to unify the output distributions of these sub-networks. Adversarial learning enables the network to capture more essential acoustic features and helps the model achieve better performance when encountering fuzzy audio input. We conduct numerical experiments with the SAN model on several datasets for the automatic speech recognition task. All experimental results show that the siamese adversarial net significantly reduces the character error rate (CER). Specifically, we achieve a new state-of-the-art 4.37 CER without a language model on the AISHELL-1 dataset, which corresponds to a relative CER reduction of around 5%. To demonstrate the generality of the siamese adversarial net, we also conduct experiments on the phoneme recognition task, which likewise show the superiority of the siamese adversarial network.

Index Terms: Automatic speech recognition, adversarial learning, fuzzy audio, siamese net

1. INTRODUCTION

Automatic speech recognition (ASR) is a task with a wide range of application scenarios and a long history. Before deep learning became popular, HMM-GMM [1] models were widely used in the automatic speech recognition community. In the HMM-GMM architecture, each input frame corresponds to a label category, and the labels need repeated iterations to ensure more accurate alignment. With the development of deep learning, automatic speech recognition entered the end-to-end era. Recently, researchers have presented many end-to-end deep learning speech recognition methods [2, 3, 4, 5], which achieve better performance and are easier to train.

However, these models still suffer from the fuzzy audio problem. For example, the words "wood" and "world" have very similar pronunciations. They may become even more difficult to distinguish once the various noises of a real-world scene are added, and hence the model may confuse them. Most previous works add an additional language model to the decoder to deal with the fuzzy audio problem [4, 6, 7, 8]. The language model performs well when fuzzy audio is easy to distinguish semantically. For instance, for audio with the transcription "There are billions of people in the world.", the raw model output may assign a high probability to both "There are billions of people in the world." and "There are billions of people in the wood." In this case, it is easy to eliminate the latter, interfering option by using the language model.

*Equal contribution

Unfortunately, this does not work all the time. When the interfering option is also semantically meaningful, the language model cannot help eliminate the wrong option and may even mislead the output, as with "I like the wood." and "I like the world." Both are semantically meaningful, but "I like the world." is more common. For an input waveform with the ground-truth transcription "I like the wood.", the speech recognition model may also assign a high probability to the transcription "I like the world." Further, the language model may also vote for the wrong option "I like the world." As a result, the model is prone to making mistakes on such audio input.
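The failure mode described above can be illustrated with a toy rescoring example. This is only a sketch: it assumes the language model is combined with the acoustic model through a log-linear score (often called shallow fusion), which the text does not state is the method used in [4, 6, 7, 8]. The probabilities, the `lm_weight` value, and the function names are invented for illustration.

```python
import math


def fused_score(am_prob, lm_prob, lm_weight):
    """Log-linear combination of acoustic and language model probabilities."""
    return math.log(am_prob) + lm_weight * math.log(lm_prob)


def pick(hypotheses, lm_weight):
    """Return the text of the hypothesis with the highest fused score."""
    best = max(hypotheses, key=lambda h: fused_score(h["am"], h["lm"], lm_weight))
    return best["text"]


# Hypothetical probabilities for audio whose ground truth is "I like the wood."
# The acoustic model slightly prefers the correct option, but the language
# model considers "world" the more common continuation.
hypotheses = [
    {"text": "I like the wood.", "am": 0.52, "lm": 0.02},
    {"text": "I like the world.", "am": 0.48, "lm": 0.10},
]

without_lm = pick(hypotheses, lm_weight=0.0)  # acoustic model only
with_lm = pick(hypotheses, lm_weight=0.3)     # acoustic model plus language model
print(without_lm)
print(with_lm)
```

With these invented numbers the acoustic model alone selects the correct transcription, while adding the language model flips the decision to the more common but wrong sentence.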

In this paper, inspired by [9], we propose a novel siamese adversarial net (SAN) architecture for automatic speech recognition, which aims to solve the difficulty of recognizing fuzzy audio. In detail, SAN consists of two weight-sharing sub-networks, which use dropout layers to mix different noise into the acoustic features, so that the acoustic features seen by the two sub-networks differ. A Kullback–Leibler (KL) divergence loss is then used to minimize the discrepancy between the output distributions of the two sub-networks, which pushes the model to learn the essential acoustic features and helps it deal with fuzzy audio input. In our experiments, we achieve a new state-of-the-art 4.37 CER on the AISHELL-1 dataset, a relative CER reduction of around 5% over the previous best result. In summary, our contributions are as follows:

• We propose a novel siamese adversarial net (SAN) architecture that addresses the difficulty of recognizing fuzzy audio through adversarial learning with two sub-networks.

• We fill a gap in the literature: few works address fuzzy audio recognition in the acoustic model itself.
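The two-sub-network idea described in the introduction (shared weights, different dropout noise, a KL term between the two outputs) can be sketched on a toy classifier as follows. This is a minimal illustration, not the authors' implementation: the single linear layer, the symmetric form of the KL term, the averaging of the two cross-entropy terms, and the names `san_style_loss`, `rate`, and `alpha` are all assumptions that this section does not specify.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(features, rate, rng):
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def forward(weights, features, rate, rng):
    """One sub-network pass: the same weights, but a fresh dropout mask."""
    noisy = dropout(features, rate, rng)
    logits = [sum(w * x for w, x in zip(row, noisy)) for row in weights]
    return softmax(logits)


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, smoothed by eps."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def san_style_loss(weights, features, label, rate=0.1, alpha=1.0, rng=None):
    """Cross-entropy on both passes plus a weighted symmetric KL consistency term."""
    if rng is None:
        rng = random.Random(0)
    p1 = forward(weights, features, rate, rng)  # first sub-network
    p2 = forward(weights, features, rate, rng)  # second sub-network, new mask
    ce = -(math.log(p1[label]) + math.log(p2[label])) / 2
    consistency = (kl(p1, p2) + kl(p2, p1)) / 2
    return ce + alpha * consistency, ce, consistency


# Toy usage: 3 input features, 2 output classes, invented weights.
weights = [[0.5, -0.2, 0.1], [0.1, 0.3, -0.4]]
features = [1.0, 2.0, 3.0]
total, ce, consistency = san_style_loss(weights, features, label=0, rate=0.5)
print(total, ce, consistency)
```

When `rate` is 0 the two passes are identical and the consistency term vanishes, so the extra loss only acts when the dropout noise makes the two sub-network outputs disagree.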

arXiv:2210.15285v1 [cs.SD] 27 Oct 2022
