G-AUGMENT: SEARCHING FOR THE META-STRUCTURE
OF DATA AUGMENTATION POLICIES FOR ASR
Gary Wang, Ekin D. Cubuk, Andrew Rosenberg,
Shuyang Cheng†∗, Ron J. Weiss‡∗,
Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le and Daniel S. Park
Google Inc., †Pony.ai, ‡Meta Inc.
∗Work done while at Google.
ABSTRACT
Data augmentation is a ubiquitous technique used to provide
robustness to automatic speech recognition (ASR) training.
However, even as so much of the ASR training process has
become automated and more “end-to-end,” the data augmentation
policy (what augmentation functions to use, and how
to apply them) remains hand-crafted. We present G(raph)-
Augment, a technique to define the augmentation space as
directed acyclic graphs (DAGs) and search over this space
to optimize the augmentation policy itself. We show that
given the same computational budget, policies produced by
G-Augment are able to perform better than SpecAugment
policies obtained by random search on fine-tuning tasks on
CHiME-6 and AMI. G-Augment is also able to establish
a new state-of-the-art ASR performance on the CHiME-6
evaluation set (30.7% WER). We further demonstrate that
G-Augment policies show better transfer properties across
warm-start to cold-start training and model size compared to
random-searched SpecAugment policies.
Index Terms— Speech Recognition, Data Augmentation
1. INTRODUCTION
Data augmentation [1, 2, 3] is an important component of
deep learning and has been demonstrated to be crucial for
training deep networks on a wide range of tasks, including
automatic speech recognition (ASR) [4, 5, 6, 7, 8, 9, 10, 11].
While methods for automatically optimizing augmenta-
tion policies have been introduced and studied [12, 13, 14],
previous studies made certain structural assumptions about
how data augmentations are applied. For example, augmentation
searches for images typically assume a hierarchy, where
certain augmentations are assumed to always be applied in
addition to other augmentation operations. The same has been
true for ASR, where assumptions about the meta-structure of
the augmentation are made before searching over parameters
of the augmentations themselves [15].
While such accumulated heuristics are effective for addressing
tasks that have been studied extensively before with
a set of well-known augmentations, when confronted with
a new task or with a new set of augmentations, one needs
to re-establish the heuristics for designing a good augmentation
scheme. For example, in [16], the authors discovered
that SpecAugment [11] did not compose well with multi-style
training augmentation [4, 17], and found that they needed to
ensemble the augmentations to benefit from both.
In this work, we address this problem with a scheme we refer
to as G(raph)-Augment, where a stochastic augmentation
policy is parameterized by a directed acyclic graph (DAG)
whose edges are labeled by sampling probabilities and augmentation
parameters. By simultaneously searching for the
graph structure and the parameters that label the graph, we
are able to optimize not only the parameters of the
individual augmentations, but also how those augmentations
are applied. We utilize 17 ASR augmentations in our
search space, details of which can be found in section 3.3.
We use an evolutionary algorithm [18, 19] to optimize these
graphs based on the dev-set performance of the augmentation.
The search is conducted on two “warm-start” tasks, where
we pre-train a Conformer [20] RNN-T [21] model on the
SpeechStew [22] dataset and fine-tune on the CHiME-6 [23]
and AMI [24] corpora. For the AMI task, we remove the
AMI portion of the SpeechStew dataset for pre-training. We
compare the performance of the best discovered G-Augment
policy against the best SpecAugment policy found using random
search with the same computational budget.¹ By doing
so, we arrive at the following results:
• The best G-Augment policy discovered outperforms
the best SpecAugment policy on both tasks.
• The G-Augment policies exhibit better transfer properties
across warm-start to cold-start training and model size
than the SpecAugment policies.
• By adapting the G-Augment policy for training a very
large (1B parameter) Conformer [20] model pre-trained [25,
26, 27] on YouTube and SpeechStew, we achieve
state-of-the-art performance on CHiME-6.

¹Naively, one may deem the comparison between a random search and
a genetic algorithm to be unfair. We, however, must note that the search
space size of G-Augment is much larger than that of SpecAugment (by a
factor of 10^50) in this work, which justifies the comparison in our view.
While we have limited our scope to ASR in this work, G-
Augment is a general framework that can be applied to any
task where augmentation is utilized.
2. RELATED WORKS
Data augmentation is an effective method for improving the
generalization performance of deep learning models. Domain-specific
augmentations have been utilized in a variety
of domains, from image classification [2] to speech recognition
[11]. More recently, automated data augmentation
methods have been utilized to increase the efficacy of data
augmentation for 2D image classification, where a policy is
(meta-)learned using a large search space of possible data
augmentations [13, 28, 29]. Automated data augmentation
methods have been made more efficient for image classification
[14, 30, 31] and have been extended to other modalities
in vision [32, 33]. Some efforts to apply automated augmentation
search to ASR tasks have also shown success [15, 34].
While these previous attempts learned augmentation parameters
such as application probability and distortion magnitude
from the data, they used a manually chosen augmentation
structure. For example, AutoAugment learned 25 sub-policies,
each with two layers of augmentation. Decisions
such as whether flips and cropping should be applied were
made manually, specific to the dataset [14, 30, 31]. For example,
Cutout [35] was always applied after the AutoAugment
layers on CIFAR-10 [36], but not on reduced SVHN [37]. In
this work, we fully automate the optimization of the data augmentation
policy: the graph structure of the policy and the parameters
of the individual augmentations that make up the policy are
learned jointly. In addition, we demonstrate positive
transfer of augmentation policies from the search task to
set-ups with different training dynamics and model sizes. To
our knowledge, our work is the first example of an automated
data augmentation approach that outperforms manually designed
policies, such as SpecAugment, in speech recognition.
Methods for searching over graph spaces have been extensively
investigated in the context of neural architecture search
[38, 39]. While we choose to use a relatively simple evolutionary
algorithm [18, 19] to search over augmentation policies
in this particular work, a variety of methods have been
employed for such searches in the literature [38, 39, 40, 41,
42], an extensive list of which can be found in [43].
3. G-AUGMENT
To search for both how the augmentations are applied and the
parameters of the augmentations themselves, we parameterize
an augmentation policy as a graph with labeled nodes and
edges. Here we describe the details of this parameterization
and the algorithm we employ to search over this space.
3.1. Search Space
We parameterize an augmentation policy by a directed acyclic
graph (DAG), consisting of a single input node, a single output
node, and a number N of ensemble nodes. The input node
has only outgoing edges, which can connect to ensemble nodes.
The output node connects to a single ensemble node via an
edge, which passes its state to the output. Each node represents
an augmented state of the data, while the edges represent
the augmentations themselves.
Each ensemble node of the graph takes two inputs (Figure
3.1). We denote one of the incoming edges the left edge and
the other the right edge of a given ensemble node for convenience.
We denote the node connected to the tail of the
left/right edge as the left/right input, respectively. The incoming
edges are labeled by sampling probabilities p_l and p_r that
sum to unity, and quadruples a_l and a_r that represent augmentations.
The state of a given node is obtained by applying
the augmentations a_l/a_r to the left/right inputs and sampling
them with probability p_l/p_r, respectively. In other words,

$$(\text{node state}) = \begin{cases} a_l(\text{left input}) & \text{with probability } p_l, \\ a_r(\text{right input}) & \text{with probability } p_r. \end{cases} \tag{1}$$
We require the graph to be directed and all directed paths to
trace back to the input. This is enforced by assigning indices
1, ..., N to the ensemble nodes, the index 0 to the input node,
and selecting the left/right input indices of node n from
[0, n − 1]. The output node is always connected to node N.
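For concreteness, a minimal Python sketch of this parameterization follows. The class and field names are our own illustrative choices, not the implementation used in this work.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Edge:
    source: int        # index of the node this edge draws its input from
    prob: float        # sampling probability (p_l or p_r)
    augment: Callable  # augmentation function labeling the edge

@dataclass
class EnsembleNode:
    left: Edge         # left.prob + right.prob == 1
    right: Edge

def random_policy(num_nodes: int,
                  sample_augment: Callable[[], Callable]) -> List[EnsembleNode]:
    """Sample a random policy graph. Node n may only draw its left/right
    inputs from nodes 0..n-1 (node 0 being the input node), which
    guarantees acyclicity; the output node always reads from node N."""
    nodes = []
    for n in range(1, num_nodes + 1):
        p_left = random.random()
        nodes.append(EnsembleNode(
            left=Edge(random.randrange(n), p_left, sample_augment()),
            right=Edge(random.randrange(n), 1.0 - p_left, sample_augment()),
        ))
    return nodes
```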
The augmentation policy is applied to an input by stochastically
back-propagating through the graph. Given an input,
the augmentation to be applied is determined by
starting at the output node and back-propagating through the
graph, randomly selecting the path to travel based on the
selection probabilities of the edges. The path connecting the
input node to the output node sampled this way represents an
augmentation obtained by sequentially composing the augmentations
encountered along the path. The probability of a particular
path being selected is the product of the selection probabilities
of its edges. This process is depicted in Figure 3.1
along with pseudo-code. Any AutoAugment [13] policy or
RandAugment [14] policy is representable by such a graph.
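Continuing the hypothetical sketch above, the sampling procedure can be written as follows; each call draws a fresh path, so a single policy yields a different composite augmentation per training example.

```python
def sample_path_augmentation(nodes: List[EnsembleNode]) -> Callable:
    """Back-propagate from the output node (connected to node N) to the
    input node (node 0), sampling one incoming edge per visited node,
    and return the composition of the augmentations along the path."""
    path = []
    current = len(nodes)  # the output node is always connected to node N
    while current != 0:   # stop once the input node is reached
        node = nodes[current - 1]
        edge = node.left if random.random() < node.left.prob else node.right
        path.append(edge.augment)
        current = edge.source
    path.reverse()        # apply augmentations from input toward output

    def apply(x):
        for fn in path:
            x = fn(x)
        return x
    return apply
```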
To make the search space uniform, we represent an augmentation
by a quadruple a = (t, q, x_1, x_2), where t is a string
denoting the type of augmentation, q is the application probability
(not to be confused with the sampling probability), and
x_1 and x_2 are the strength parameters for the augmentation.
The x_i take 11 discrete integer values from 0 to 10.
We employ 17 augmentations, which we list in section 3.3.
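As a purely hypothetical illustration of how such a quadruple could be realized (the toy time-masking operation and the linear strength mapping below are our own assumptions, not the calibrations used in this work):

```python
def time_mask(x: list, s1: float, s2: float) -> list:
    """Toy time mask on a list of frames: zero out a random span whose
    width grows with strength s1 (s2 is unused in this toy example)."""
    width = int(s1 * 0.3 * len(x))
    if width == 0:
        return x
    start = random.randrange(len(x) - width + 1)
    return x[:start] + [0] * width + x[start + width:]

# Hypothetical registry mapping type strings t to implementations.
AUGMENT_FNS = {"time_mask": time_mask}

def make_augment(t: str, q: float, x1: int, x2: int) -> Callable:
    """Realize a quadruple a = (t, q, x1, x2): apply augmentation type t
    with application probability q, mapping the discrete strengths
    x1, x2 in {0, ..., 10} linearly onto [0, 1]."""
    assert 0 <= x1 <= 10 and 0 <= x2 <= 10
    def augment(x):
        if random.random() >= q:   # pass through with probability 1 - q
            return x
        return AUGMENT_FNS[t](x, x1 / 10.0, x2 / 10.0)
    return augment
```

Plugged together, random_policy(4, lambda: make_augment("time_mask", 0.5, random.randint(0, 10), 0)) samples a complete policy from which sample_path_augmentation draws stochastic augmentations.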