A Policy-based Approach to the SpecAugment
Method for Low Resource E2E ASR
Rui Li§, Guodong Ma§, Dexin Zhao†§, Ranran Zeng†, Xiaoyu Li† and Hao Huang∗‡
School of Information Science and Engineering, Xinjiang University, Urumqi, China
†China Telecom Beijing Research Institute, Beijing, China
‡Xinjiang Provincial Key Laboratory of Multi-lingual Information Technology, Urumqi, China
∗Corresponding author; E-mail: hwanghao@gmail.com
§Equal contributions
Abstract—SpecAugment is a very effective data augmentation method for both HMM-based and E2E-based automatic speech recognition (ASR) systems; in particular, it also works in low-resource scenarios. However, SpecAugment masks the spectrum in the time or frequency domain according to a fixed augmentation policy, which may bring relatively little data diversity to low-resource ASR. In this paper, we propose a policy-based SpecAugment (Policy-SpecAugment) method to alleviate this problem. The idea is to replace the fixed policy with an augmentation-select policy and an augmentation-parameter changing policy. These policies are learned from the validation-set loss obtained under each augmentation strategy, with the aim of encouraging the model to learn the more diverse data it currently needs. In experiments, we evaluate the effectiveness of our approach in a low-resource scenario, i.e., the 100-hour LibriSpeech task. The results and analysis show that the above issue is clearly alleviated by our proposal. Moreover, compared with the state-of-the-art SpecAugment, the proposed Policy-SpecAugment achieves a relative WER reduction of more than 10% on the test/dev-clean sets, more than 5% on the test/dev-other sets, and an absolute WER reduction of more than 1% on all test sets.
I. INTRODUCTION
Recently, end-to-end (E2E) automatic speech recognition (ASR) [1]–[6] based on neural networks has achieved large improvements. Meanwhile, E2E ASR simplifies system construction by establishing a direct mapping from the acoustic feature sequences to the modeling-unit sequences. With the emergence of E2E ASR, researchers [7]–[22] have explored different E2E ASR scenarios, focusing partly on data augmentation and training strategies because such models are data-hungry and prone to over-fitting. However, most existing work on data augmentation simply explores a fixed way to bring more abundant data to the model. For example, the state-of-the-art SpecAugment method [7] applies three spectrum-disturbance strategies to the input speech spectrum features under a fixed augmentation policy. For a given amount of training data, a fixed augmentation policy may saturate quickly, so it brings little additional data diversity during model learning. We therefore believe that, when SpecAugment is used, the information brought to the model can be further enriched. In addition, when the model is in different learning states, it may need to learn data produced by different combinations of augmentation policies, yet the masking strategies of SpecAugment are completely random and unrelated to the model state.
Based on the above discussion, we propose a policy-based SpecAugment, named Policy-SpecAugment, to improve the performance of low-resource end-to-end (E2E) ASR systems. In Policy-SpecAugment, during the model learning stage we calculate the validation-set loss under the action of each augmentation strategy. This loss represents how well the model trained up to the previous epoch fits data produced by the corresponding augmentation strategy, and thus reflects which strategy should be emphasized at the current training epoch. The losses are then used to calculate the probabilities of the augmentation-select policy and the factor of the augmentation-parameter changing policy, which together encourage the augmentation method to produce the diverse data the ASR model needs. The specific details are introduced in Section IV. For a fair comparison, the augmentation strategies we use are consistent with SpecAugment, including time masking, frequency masking and time warping; we briefly review these three classic and effective strategies in Section III. On the 100-hour LibriSpeech task [23], we use ESPnet1 [24] to validate our Policy-SpecAugment. The experimental results show that, compared with the state-of-the-art SpecAugment, the proposed Policy-SpecAugment achieves a relative WER reduction of more than 10% on the test/dev-clean sets, a relative WER reduction of more than 5% on the test/dev-other sets, and an absolute WER reduction of more than 1% on all test sets.
The paper is organized as follows. Section II reviews related prior work. Section III briefly describes the three data augmentation methods used in SpecAugment, and Section IV presents the proposed policy-based SpecAugment method in detail. Section V describes the experimental setup and results. After that, we provide further analysis in Section VI and conclude in Section VII.
II. RELATED WORKS
Data augmentation is a method of increasing the diversity of training data [25]–[27] to prevent model over-fitting. Currently, there are many data augmentation methods for improving ASR system performance. In [28], audio speed perturbation was proposed, which augments the training data with speed-varied copies. In [29], room impulse responses were used to simulate far-field data. In [30], speech synthesis methods are used to augment the data. In addition, our recently proposed Phone Masking Training (PMT) [8] alleviates the impact of phonetic reduction in Uyghur ASR by simulating phonetic-reduction data. Finally, the state-of-the-art SpecAugment [7] has proven to be a very effective method, operating on the input speech spectrum features of the ASR model with three augmentation strategies: time masking, frequency masking and time warping.

[Fig. 1: Policy-SpecAugment augmentation method, where TW denotes Time Warping, FM denotes Frequency Masking and TM denotes Time Masking.]
However, the data augmentation methods mentioned above all enrich the training data under a fixed augmentation policy, which does not adapt the diversity of the data to the state of model learning. We believe this fixed scheme limits the data diversity that augmentation can provide. To alleviate this issue, the recent SapAugment [31] proposes to use the loss value of each training sample to select the augmentation operation applied to that sample. The intuition is that perturbing a hard sample with a strong augmentation may make it too hard to learn from, whereas a sample with a low training loss should be perturbed more strongly to provide robustness to a variety of conditions. SapAugment [31] shows a clear performance advantage over SpecAugment, but it introduces more strategies (such as CutMix [32]) than SpecAugment uses. In addition, we believe that strategy selection should focus on the strategies themselves rather than on individual data samples.
III. AUGMENTATION STRATEGY
Our motivation is to construct a policy that allows the model, as it is trained, to fully exploit the benefits of different data augmentation methods. For a fair comparison, we use the same data augmentation methods as SpecAugment in our proposed Policy-SpecAugment. The following briefly introduces the three data augmentation strategies used in SpecAugment; please refer to [7] for details.
A. Time Masking
Time masking masks the input features in the time domain. The mask width is chosen from a uniform distribution from 0 to the time mask parameter T.
B. Frequency Masking
Frequency masking masks the input spectrum features in the frequency domain. The mask width is chosen from a uniform distribution from 0 to the frequency mask parameter F.
C. Time Warping
In time warping, given the input speech spectrum features, a random point along the horizontal line passing through the center of the spectrum is warped either to the left or to the right by a distance chosen from a uniform distribution.
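For concreteness, the following is a minimal NumPy sketch of the two masking strategies under the uniform-width description above; it is an illustration, not SpecAugment's reference implementation, and time warping is omitted because it relies on a toolkit-specific sparse image warp.

```python
import numpy as np

def time_mask(spec, T=40, num_masks=1):
    """Mask random spans along the time axis of a (time, freq) spectrogram.

    T is the time mask parameter: each mask width is drawn from U(0, T).
    """
    spec = spec.copy()
    n_time = spec.shape[0]
    for _ in range(num_masks):
        t = np.random.randint(0, T + 1)               # mask width ~ U(0, T)
        t0 = np.random.randint(0, max(1, n_time - t)) # random start position
        spec[t0:t0 + t, :] = 0.0                      # zero out the span
    return spec

def freq_mask(spec, F=30, num_masks=1):
    """Mask random bands along the frequency axis, analogous to time_mask."""
    spec = spec.copy()
    n_freq = spec.shape[1]
    for _ in range(num_masks):
        f = np.random.randint(0, F + 1)               # band width ~ U(0, F)
        f0 = np.random.randint(0, max(1, n_freq - f))
        spec[:, f0:f0 + f] = 0.0
    return spec
```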
IV. POLICY-BASED SPECAUGMENT
A. Random-SpecAugment
Before introducing Policy-SpecAugment, we first describe Random-SpecAugment. Briefly, each training sample is subjected to exactly one augmentation strategy, chosen uniformly at random. Although this very simple Random-SpecAugment performs worse than SpecAugment in our experiments, it clearly outperforms the baseline without any augmentation, which motivated us to propose Policy-SpecAugment.
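As an illustrative sketch (reusing the time_mask and freq_mask helpers from Section III; a time_warp function would join the candidate set in the same way), Random-SpecAugment reduces to one uniform draw per training sample:

```python
import random

def random_specaugment(spec, strategies=(time_mask, freq_mask)):
    """Apply exactly one augmentation strategy, chosen uniformly at random."""
    augment = random.choice(strategies)  # equal probability for each strategy
    return augment(spec)
```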
B. Prob-SpecAugment
Building on Random-SpecAugment, and to replace its uniform random selection, we propose a probability-based SpecAugment (Prob-SpecAugment). The idea is that, at each iteration of the training phase, we calculate N loss values (Loss_i, i = 1, ..., N) on the validation set, each obtained with the corresponding augmentation technique applied, where N denotes the total number of augmentation strategies. We consider the normalized losses as the probability distribution of the augmentation-select policy.
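A minimal sketch of this selection rule follows, assuming a hypothetical helper valid_loss(model, strategy) that evaluates the validation-set loss with the given strategy applied; the helper is illustrative, while the normalization matches the text above.

```python
import numpy as np

def strategy_probabilities(model, strategies, valid_loss):
    """Turn per-strategy validation losses into selection probabilities.

    A higher loss on strategy i means the previous-epoch model fits that
    kind of augmented data worse, so strategy i should be sampled more often.
    valid_loss(model, strategy) is an assumed helper, not a library API.
    """
    losses = np.array([valid_loss(model, s) for s in strategies])
    return losses / losses.sum()  # p_i = Loss_i / sum_j Loss_j

# At each training iteration, one strategy is then drawn per sample, e.g.:
# probs = strategy_probabilities(model, strategies, valid_loss)
# idx = np.random.choice(len(strategies), p=probs)
```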