

A Policy-based Approach to the SpecAugment Method for Low Resource E2E ASR

Rui Li∗§, Guodong Ma∗§, Dexin Zhao†§, Ranran Zeng†, Xiaoyu Li† and Hao Huang∗‡

∗School of Information Science and Engineering, Xinjiang University, Urumqi, China

†China Telecom Beijing Research Institute, Beijing, China

‡Xinjiang Provincial Key Laboratory of Multi-lingual Information Technology, Urumqi, China

Corresponding author; E-mail: hwanghao@gmail.com

§Equal contributions

Abstract—SpecAugment is a very effective data augmentation method for both HMM-based and E2E-based automatic speech recognition (ASR) systems, and it also works in low-resource scenarios. However, SpecAugment masks the spectrum in the time or frequency domain under a fixed augmentation policy, which may bring relatively little data diversity to low-resource ASR. In this paper, we propose a policy-based SpecAugment (Policy-SpecAugment) method to alleviate this problem. The idea is to replace the fixed policy with an augmentation-select policy and an augmentation-parameter changing policy. These policies are learned from the loss on the validation set obtained under each corresponding augmentation strategy. The aim is to encourage the model to learn from more diverse data of the kind it currently needs. In experiments, we evaluate the effectiveness of our approach in a low-resource scenario, i.e., the 100-hour LibriSpeech task. The results and analysis show that our proposal clearly alleviates the above issue. In addition, compared with the state-of-the-art SpecAugment, the proposed Policy-SpecAugment achieves a relative WER reduction of more than 10% on the Test/Dev-clean sets, more than 5% on the Test/Dev-other sets, and an absolute WER reduction of more than 1% on all test sets.

I. INTRODUCTION

Recently, end-to-end (E2E) automatic speech recognition (ASR) [1]–[6] based on neural networks has achieved large improvements. Meanwhile, E2E ASR simplifies system construction by establishing a direct mapping from acoustic feature sequences to modeling-unit sequences. With the emergence of E2E ASR, researchers [7]–[22] have explored different E2E ASR scenarios, partly focusing on data augmentation and training strategies because E2E models are data-hungry and prone to over-fitting. However, most existing work on data augmentation only explores a fixed way of bringing more abundant data to the model. For example, the state-of-the-art SpecAugment method [7] applies three spectrum perturbation strategies to the input speech spectrum features under a fixed augmentation policy. With a given amount of training data, a fixed augmentation policy may tend to stabilize quickly, so that it does not bring much data diversity during the model learning stage. Therefore, we believe that the information SpecAugment brings to the model can be further enriched. In addition, when the model is in different learning states, it may need to learn from data produced by different combinations of augmentation strategies. However, the masking strategies of SpecAugment are completely random and are not related to the model state.

Based on the above discussion, we propose a policy-based SpecAugment, named Policy-SpecAugment, to improve the performance of low-resource end-to-end (E2E) ASR systems. In the proposed method, we calculate the loss on the validation set under each augmentation strategy during the model learning stage. Each loss value represents how well the model trained up to the previous epoch fits the corresponding augmentation strategy, and thus reflects which augmentation strategy should be learned in the current training epoch. The losses are then used to calculate the probabilities of the augmentation-select policy and the factor of the augmentation-parameter changing policy, which aim to encourage the augmentation method to produce the varied data needed by the ASR model. The specific details are introduced in Section IV. For a fair comparison, the augmentation strategies we use are the same as those in SpecAugment: time masking, frequency masking and time warping. We briefly describe these three classic and effective augmentation strategies in Section III. We use ESPnet1 [24] to verify Policy-SpecAugment on the 100-hour LibriSpeech task [23]. The experimental results show that, compared with the state-of-the-art SpecAugment, the proposed Policy-SpecAugment achieves a relative WER reduction of more than 10% on the test/dev-clean sets, a relative WER reduction of more than 5% on the test/dev-other sets, and an absolute WER reduction of more than 1% on all test sets.

The paper is organized as follows. Section II reviews related work. Section III briefly describes the three data augmentation methods used in SpecAugment. Section IV presents the proposed policy-based SpecAugment method in detail. Section V describes the experimental setup and results. We perform further analysis in Section VI and conclude in Section VII.

II. RELATED WORKS

Data augmentation is a method of increasing the diversity of training data [25]–[27] to prevent the model from over-fitting. Currently, there are many data augmentation methods for improving ASR system performance. In [28], audio speed perturbation was proposed, which augments the training data with speed-varied copies. In [29], room impulse responses were proposed to simulate far-field data. In [30], speech synthesis methods are used to augment the data. In addition, our recently proposed Phone Masking Training (PMT) [8] alleviates the impact of phonetic reduction in Uyghur ASR by simulating phonetic reduction data. Finally, the state-of-the-art SpecAugment [7] has proven to be a very effective method. It operates on the input speech spectrum features of the ASR model using three augmentation strategies: time masking, frequency masking and time warping.

arXiv:2210.08520v1 [cs.SD] 16 Oct 2022

Fig. 1: Policy-SpecAugment augmentation method, where TW denotes Time Warping, FM denotes Frequency Masking and TM denotes Time Masking.

However, the data augmentation methods mentioned above all enrich the training data under a fixed augmentation policy, which does not select the diversity of data based on the learning state of the model. We believe that this fixed way reduces the data diversity that the augmentation method can bring. To alleviate this issue, the recent SapAugment [31] proposes to use the loss value of each training sample to select the augmentation operation applied to that sample. The intuition is that perturbing a hard sample with a strong augmentation may make it too hard to learn from, whereas a sample with low training loss should be perturbed by a stronger augmentation to provide more robustness to a variety of conditions. SapAugment [31] shows a clear performance advantage over SpecAugment. However, SapAugment introduces more augmentation strategies than SpecAugment (such as CutMix [32]). In addition, we believe that the selection should focus on the augmentation strategy rather than on individual data samples.

III. AUGMENTATION STRATEGY

Our motivation is to construct a policy that allows the model to fully learn the benefits of different data augmentation methods as it is trained. For a fair comparison, our proposed Policy-SpecAugment uses the same data augmentation methods as SpecAugment. The following is a brief introduction to the three data augmentation strategies used in SpecAugment. Please refer to [7] for the details.

A. Time Masking

Time masking masks the input features in the time domain. The mask size is first chosen from a uniform distribution from 0 to the time mask parameter T.
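
The time masking step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the features are assumed to be a plain list of frames (each frame a list of frequency values), masked frames are assumed to be set to zero, and the start frame is assumed to be drawn uniformly from the valid range. The function name `time_mask` is hypothetical.

```python
import random


def time_mask(spec, T, rng):
    """Zero out t consecutive frames of spec, with t drawn uniformly from {0, ..., T}.

    spec is a list of frames; each frame is a list of frequency values.
    Zero as the mask value and the uniform start frame are assumptions.
    """
    num_frames = len(spec)
    t = rng.randint(0, min(T, num_frames))   # mask width (bounds are inclusive)
    t0 = rng.randint(0, num_frames - t)      # index of the first masked frame
    out = [frame[:] for frame in spec]       # copy so the input is left intact
    for i in range(t0, t0 + t):
        out[i] = [0.0] * len(out[i])
    return out, t0, t


rng = random.Random(0)
spec = [[1.0] * 4 for _ in range(20)]        # 20 frames x 4 frequency bins
masked, t0, t = time_mask(spec, T=5, rng=rng)
print(f"masked frames [{t0}, {t0 + t})")
```

Returning `t0` and `t` alongside the masked features is only a convenience for inspecting which frames were masked.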

B. Frequency Masking

Frequency masking masks the input spectrum features in the frequency domain. The mask size is first chosen from a uniform distribution from 0 to the frequency mask parameter F.
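
A matching sketch for frequency masking is given below, under the same assumptions as before: list-of-frames features, zero as the mask value, and a uniformly drawn start channel. The name `freq_mask` is hypothetical, and the sketch should not be read as the implementation used in [7] or in this paper.

```python
import random


def freq_mask(spec, F, rng):
    """Zero out f consecutive frequency bins in every frame, f drawn uniformly from {0, ..., F}.

    spec is a list of frames; each frame is a list of frequency values.
    Zero as the mask value and the uniform start bin are assumptions.
    """
    num_bins = len(spec[0]) if spec else 0
    f = rng.randint(0, min(F, num_bins))     # mask width (bounds are inclusive)
    f0 = rng.randint(0, num_bins - f)        # index of the first masked bin
    out = [frame[:] for frame in spec]       # copy so the input is left intact
    for frame in out:
        frame[f0:f0 + f] = [0.0] * f         # same bins are masked in all frames
    return out, f0, f


rng = random.Random(0)
spec = [[1.0] * 8 for _ in range(10)]        # 10 frames x 8 frequency bins
masked, f0, f = freq_mask(spec, F=3, rng=rng)
print(f"masked bins [{f0}, {f0 + f})")
```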

C. Time Warping

For time warping, given input speech spectrum features, a random point on the horizontal line passing through the center of the spectrum is warped either to the left or to the right by a distance chosen from a uniform distribution.
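
To make the warping idea concrete, the sketch below uses a simplified piecewise-linear warp along the time axis with linear interpolation between neighbouring frames. This is a swapped-in simplification chosen for readability and may differ from the warping procedure actually used in [7] or in this paper. The parameter `W` (maximum warp distance, in frames), the function name `time_warp`, and the rule for choosing the anchor frame are assumptions.

```python
import random


def time_warp(spec, W, rng):
    """Simplified time warp: move one anchor frame by w frames, w uniform in [-W, W].

    Frames before and after the anchor are stretched or compressed linearly,
    and output frames are linearly interpolated from neighbouring input frames.
    """
    tau = len(spec)
    if W <= 0 or tau <= 2 * W + 2:
        return [frame[:] for frame in spec]  # too short to warp safely
    w0 = rng.randint(W + 1, tau - W - 2)     # anchor frame in the input
    w = rng.randint(-W, W)                   # signed warp distance
    dst = w0 + w                             # anchor position in the output
    out = []
    for j in range(tau):
        if j <= dst:
            src = j * w0 / dst               # left segment [0, dst] -> [0, w0]
        else:                                # right segment -> [w0, tau - 1]
            src = w0 + (j - dst) * (tau - 1 - w0) / (tau - 1 - dst)
        lo = min(int(src), tau - 1)
        hi = min(lo + 1, tau - 1)
        a = src - lo                         # interpolation weight
        out.append([(1 - a) * x + a * y for x, y in zip(spec[lo], spec[hi])])
    return out


rng = random.Random(0)
spec = [[float(i)] for i in range(30)]       # a ramp makes the warp easy to see
warped = time_warp(spec, W=5, rng=rng)
print([round(frame[0], 2) for frame in warped[:6]])
```

The first and last frames are left in place by construction, so only the interior of the utterance is stretched or compressed.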

IV. POLICY-BASED SPECAUGMENT

A. Random-SpecAugment

Before introducing Policy-SpecAugment, we present Random-SpecAugment. Briefly, each training sample is augmented by exactly one augmentation strategy, chosen uniformly at random. Although this very simple Random-SpecAugment performs worse than SpecAugment in our experiments, it clearly outperforms the baseline without any augmentation method, which motivates us to propose Policy-SpecAugment.
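
The uniform selection in Random-SpecAugment can be illustrated with a few lines of Python. The strategy names are placeholders standing in for the three operations of Section III, and `choose_strategy_uniform` is a hypothetical helper, not code from the paper.

```python
import random
from collections import Counter

# Placeholder labels for the three strategies of Section III.
STRATEGIES = ("time_warping", "frequency_masking", "time_masking")


def choose_strategy_uniform(rng):
    """Pick exactly one augmentation strategy, each with probability 1/3."""
    return rng.choice(STRATEGIES)


rng = random.Random(0)
# One draw per training sample; the counts should be roughly equal.
counts = Counter(choose_strategy_uniform(rng) for _ in range(9000))
print(counts)
```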

B. Prob-SpecAugment

Building on Random-SpecAugment, we replace the uniform random selection with a probability-based SpecAugment (Prob-SpecAugment). The idea is that, at each iteration of the training phase, we calculate N loss values (Loss_i, i = 1, ..., N) on the validation set, each obtained by applying the corresponding augmentation strategy, where N denotes the total number of augmentation strategies. We consider the normalized loss as the probability distribution
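
One possible reading of the normalization step is sketched below. The available text does not state which normalization is used, so dividing each loss by the sum of all losses, p_i = Loss_i / (Loss_1 + ... + Loss_N), is an assumption, as are the non-negativity of the losses, the uniform fallback for an all-zero loss vector, and the helper names `losses_to_probs` and `choose_strategy`.

```python
import random


def losses_to_probs(losses):
    """Turn non-negative validation losses into selection probabilities.

    Assumed normalization: p_i = Loss_i / sum_j Loss_j.
    Falls back to a uniform distribution if all losses are zero.
    """
    total = sum(losses)
    if total <= 0:
        return [1.0 / len(losses)] * len(losses)
    return [loss / total for loss in losses]


def choose_strategy(strategies, losses, rng):
    """Sample one strategy; a higher validation loss gives a higher probability."""
    probs = losses_to_probs(losses)
    return rng.choices(strategies, weights=probs, k=1)[0]


strategies = ["time_warping", "frequency_masking", "time_masking"]
val_losses = [2.0, 1.0, 1.0]                 # made-up numbers for illustration only
print(losses_to_probs(val_losses))
print(choose_strategy(strategies, val_losses, random.Random(0)))
```

Under this reading, a strategy that the current model fits poorly (high validation loss) is sampled more often in the next round of training.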
