SAN: A ROBUST END-TO-END ASR MODEL ARCHITECTURE
Zeping Min*¹, Qian Ge*¹, Guanhua Huang*²
¹Peking University
²University of Science and Technology of China
ABSTRACT
In this paper, we propose a novel Siamese Adversarial Network (SAN)
architecture for automatic speech recognition (ASR) that targets the
difficulty of recognizing fuzzy audio. Specifically, SAN constructs two
sub-networks that make the audio feature input differ and then introduces
a loss that unifies the output distributions of these sub-networks.
Adversarial learning enables the network to capture more essential
acoustic features and helps the model perform better on fuzzy audio
input. We conduct numerical experiments with the SAN model on several
ASR datasets. All experimental results show that the siamese adversarial
net significantly reduces the character error rate (CER). In particular,
we achieve a new state-of-the-art CER of 4.37 without a language model on
the AISHELL-1 dataset, a relative CER reduction of around 5%. To
demonstrate the generality of the siamese adversarial net, we also
conduct experiments on the phoneme recognition task, which likewise show
its superiority.
Index Terms: automatic speech recognition, adversarial learning, fuzzy audio, siamese net
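The two metrics quoted in the abstract can be made concrete with a minimal sketch. It assumes the usual definition of CER as character-level edit distance divided by reference length; the function names and the baseline value of 4.60 are hypothetical, chosen only so that the arithmetic lands near the "around 5%" figure, and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline` (both are error rates)."""
    return (baseline - new) / baseline


# Toy example using the confusable pair discussed in the introduction.
print(cer("world", "wood"))
# Hypothetical baseline of 4.60 versus the reported 4.37.
print(relative_reduction(4.60, 4.37))
```

A relative reduction compares the change to the baseline error rate, so a small absolute change in CER can still be a noticeable relative gain.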
1. INTRODUCTION
Automatic speech recognition (ASR) is a task with a wide range of
application scenarios and a long history. Before deep learning became
popular, HMM-GMM models [1] were widely used in the ASR community. In the
HMM-GMM architecture, each input frame corresponds to a label category,
and the labels need repeated iterations to ensure more accurate
alignment. With the development of deep learning, ASR entered the
end-to-end era. Recently, researchers have presented many end-to-end deep
learning speech recognition methods [2, 3, 4, 5], which achieve better
performance and are easier to train.
However, these models still suffer from the fuzzy audio problem. For
example, the words "wood" and "world" have very similar pronunciations.
They may become even harder to distinguish once the various noises of a
real-world scene are added, and hence the model may confuse them. Most
previous works add a language model to the decoder to deal with the fuzzy
audio problem [4, 6, 7, 8]. The language model performs well when fuzzy
audio is easy to distinguish semantically. For instance, for audio with
the transcription "There are billions of people in the world.", the raw
model output may assign a high probability to both "There are billions of
people in the world." and "There are billions of people in the wood." The
language model easily eliminates the latter, interfering option.

*Equal contribution

Unfortunately, this does not always work. When the interfering option is
also semantically meaningful, the language model cannot help eliminate
the wrong option and may even mislead the output. Consider "I like the
wood." and "I like the world.": both are semantically meaningful, but "I
like the world." is more common. For an input waveform whose ground-truth
transcription is "I like the wood.", the speech recognition model may
also assign a high probability to "I like the world.", and the language
model may likewise vote for the wrong option "I like the world." As a
result, the model is prone to making mistakes on such audio input.
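The two cases above can be illustrated with a small sketch of language-model rescoring. It assumes a simple log-linear combination of the acoustic and language model scores (often called shallow fusion), which the text does not specify; all probabilities, the weight of 0.3, and the function names are hypothetical values picked only to reproduce the behaviour described.

```python
import math


def fused_score(acoustic_prob: float, lm_prob: float, lm_weight: float) -> float:
    """Log-linear fusion: acoustic log-probability plus weighted LM log-probability."""
    return math.log(acoustic_prob) + lm_weight * math.log(lm_prob)


def pick(hypotheses: dict, lm_weight: float) -> str:
    """Return the transcription with the highest fused score.

    `hypotheses` maps text -> (acoustic probability, LM probability).
    """
    return max(hypotheses, key=lambda t: fused_score(*hypotheses[t], lm_weight))


# Case 1: ground truth ends in "world". The acoustic model is nearly undecided,
# and the LM strongly prefers the semantically sensible option (made-up numbers).
case_1 = {
    "There are billions of people in the world.": (0.48, 0.90),
    "There are billions of people in the wood.": (0.52, 0.10),
}

# Case 2: ground truth is "I like the wood.". Both options are meaningful,
# but the LM prefers the more common one (made-up numbers).
case_2 = {
    "I like the wood.": (0.55, 0.30),
    "I like the world.": (0.45, 0.70),
}

for name, case in (("case 1", case_1), ("case 2", case_2)):
    print(name, "| acoustic only:", pick(case, lm_weight=0.0),
          "| with LM:", pick(case, lm_weight=0.3))
```

With these illustrative numbers, the language model corrects the first case but overturns a correct acoustic decision in the second, which is the failure mode that motivates handling fuzzy audio inside the acoustic model.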
In this paper, inspired by [9], we propose a novel siamese adversarial
net (SAN) architecture for automatic speech recognition, which targets
the difficulty of recognizing fuzzy audio. In detail, SAN consists of two
weight-shared sub-networks, which use dropout layers to mix different
noise into the acoustic features so that the acoustic features of the two
sub-networks differ. A Kullback–Leibler (KL) divergence loss is then used
to minimize the discrepancy between the output distributions of the two
sub-networks, which pushes the model to learn the essential acoustic
features and helps it deal with fuzzy audio input. In our experiments, we
achieve a new state-of-the-art CER of 4.37 on the AISHELL-1 dataset, a
relative CER reduction of around 5% over the previous best result. In
summary, our contributions are as follows:
• We propose a novel siamese adversarial net (SAN) architecture that
  addresses the difficulty of recognizing fuzzy audio through adversarial
  learning with two sub-networks.
• We fill a gap in the literature: few works address fuzzy audio
  recognition in the acoustic model itself.
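The idea of two weight-shared sub-networks tied together by a KL term can be sketched in plain Python. This is only an illustration under stated assumptions: the text says that dropout makes the two sub-networks see different features and that a KL divergence brings their output distributions together, but the symmetric form of the KL term, the cross-entropy task loss, the weight `alpha`, and the toy single-layer classifier are all assumptions made here for a runnable example.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(features, rate, rng):
    """Inverted dropout: zero each feature with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def forward(weights, features, rate, rng):
    """One sub-network pass: a fresh dropout mask, then the shared linear layer."""
    noisy = dropout(features, rate, rng)
    logits = [sum(w * x for w, x in zip(row, noisy)) for row in weights]
    return softmax(logits)


def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def san_loss(weights, features, target, rate=0.1, alpha=1.0, rng=None):
    """Return (total, task_loss, consistency) for one example.

    Both passes use the SAME weights; only the dropout masks differ.
    The symmetric KL and the weighting by `alpha` are assumptions.
    """
    if rng is None:
        rng = random.Random(0)
    p1 = forward(weights, features, rate, rng)
    p2 = forward(weights, features, rate, rng)
    task_loss = -0.5 * (math.log(p1[target]) + math.log(p2[target]))
    consistency = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return task_loss + alpha * consistency, task_loss, consistency


# Toy 2-class model over 3 features (hypothetical numbers).
toy_weights = [[0.5, -0.2, 0.1], [0.1, 0.3, -0.4]]
toy_features = [1.0, 2.0, 3.0]
print(san_loss(toy_weights, toy_features, target=0, rate=0.5))
```

When the dropout rate is zero the two passes coincide and the consistency term vanishes, so the extra loss only acts on the disagreement that the noise introduces.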
arXiv:2210.15285v1 [cs.SD] 27 Oct 2022