G-AUGMENT: SEARCHING FOR THE META-STRUCTURE
OF DATA AUGMENTATION POLICIES FOR ASR
Gary Wang, Ekin D. Cubuk, Andrew Rosenberg,
Shuyang Cheng†∗, Ron J. Weiss‡∗,
Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le and Daniel S. Park
Google Inc., †Pony.ai, ‡Meta Inc.
∗Work done while at Google.
ABSTRACT
Data augmentation is a ubiquitous technique used to provide
robustness to automatic speech recognition (ASR) training.
However, even as so much of the ASR training process has
become automated and more “end-to-end,” the data augmentation
policy (what augmentation functions to use, and how
to apply them) remains hand-crafted. We present G(raph)-
Augment, a technique to define the augmentation space as
directed acyclic graphs (DAGs) and search over this space
to optimize the augmentation policy itself. We show that
given the same computational budget, policies produced by
G-Augment are able to perform better than SpecAugment
policies obtained by random search on fine-tuning tasks on
CHiME-6 and AMI. G-Augment is also able to establish
a new state-of-the-art ASR performance on the CHiME-6
evaluation set (30.7% WER). We further demonstrate that
G-Augment policies show better transfer properties across
warm-start to cold-start training and model size compared to
random-searched SpecAugment policies.
Index Terms— Speech Recognition, Data Augmentation
1. INTRODUCTION
Data augmentation [1, 2, 3] is an important component of
deep learning and has been demonstrated to be crucial for
training deep networks on a wide range of tasks, including
automatic speech recognition (ASR) [4, 5, 6, 7, 8, 9, 10, 11].
While methods for automatically optimizing augmenta-
tion policies have been introduced and studied [12, 13, 14],
previous studies made certain structural assumptions about
how data augmentations are applied. For example, augmentation
searches for images typically assume a hierarchy, where
certain augmentations are assumed to always be applied in
addition to other augmentation operations. The same has been
true for ASR, where assumptions about the meta-structure of
the augmentation are made before searching over parameters
of the augmentations themselves [15].
While such accumulated heuristics are effective for addressing
tasks that have been studied extensively before with
a set of well-known augmentations, when confronted with
a new task or with a new set of augmentations, one needs
to re-establish the heuristics for designing a good augmentation
scheme. For example, in [16], the authors discovered
that SpecAugment [11] did not compose well with multi-style
training augmentation [4, 17], and found that they needed to
ensemble the augmentations to benefit from both.
In this work, we address this problem with a scheme we refer
to as G(raph)-Augment, where a stochastic augmentation
policy is parameterized by a directed acyclic graph (DAG)
whose edges are labeled by sampling probabilities and augmentation
parameters. By simultaneously searching for the
graph structure and the parameters that label the graph, we
are able to optimize not only the parameters of the
individual augmentations, but also how those augmentations
are applied. We utilize 17 ASR augmentations in our
search space, details of which can be found in section 3.3.
We use an evolutionary algorithm [18, 19] to optimize these
graphs based on the dev-set performance of the augmentation.
The search is conducted on two “warm-start” tasks, where
we pre-train a Conformer [20] RNN-T [21] model on the
SpeechStew [22] dataset and fine-tune on the CHiME-6 [23]
and AMI [24] corpora. For the AMI task, we remove the
AMI portion of the SpeechStew dataset for pre-training. We
compare the performance of the best discovered G-Augment
policy against the best SpecAugment policy found using random
search with the same computational budget.¹ By doing
so, we arrive at the following results:
• The best G-Augment policy discovered outperforms
the best SpecAugment policy on both tasks.
• The G-Augment policies exhibit better transfer properties
across warm-start to cold-start training and model size
than the SpecAugment policies.
• By adapting the G-Augment policy for training a very
large (1B parameter) Conformer [20] model pre-trained [25,
26, 27] on YouTube and SpeechStew, we achieve
state-of-the-art performance on CHiME-6.

¹Naively, one may deem the comparison between a random search and
a genetic algorithm to be unfair. We, however, must note that the search
space size of G-Augment is much larger than that of SpecAugment (by a
factor of 10^50) in this work, which justifies the comparison in our view.
While we have limited our scope to ASR in this work, G-
Augment is a general framework that can be applied to any
task where augmentation is utilized.
2. RELATED WORKS
Data augmentation is an effective method for improving the
generalization performance of deep learning models. Domain-specific
augmentations have been utilized in a variety
of domains, from image classification [2] to speech recognition
[11]. More recently, automated data augmentation
methods have been utilized to increase the efficacy of data
augmentation for 2D image classification, where a policy is
(meta-)learned using a large search space of possible data
augmentations [13, 28, 29]. Automated data augmentation
methods have been made more efficient for image classification
[14, 30, 31] and have been extended to other modalities
in vision [32, 33]. Some efforts to apply automated augmentation
search to ASR tasks have also shown success [15, 34].
While these previous attempts learned augmentation parameters
such as application probability and distortion magnitude
from the data, they used a manually chosen augmentation
structure. For example, AutoAugment learned 25 sub-policies,
each with two layers of augmentation. Decisions
such as whether flips and cropping should be applied were
made manually, specific to the dataset [14, 30, 31]. For example,
Cutout [35] was always applied after the AutoAugment
layers on CIFAR-10 [36], but not on reduced SVHN [37]. In
this work, we fully automate the optimization of the data augmentation
policy: the graph structure of the policy and the parameters
of the individual augmentations that make up the policy are
learned jointly. In addition, we demonstrate positive
transfer of augmentation policies from the search task to
set-ups with different training dynamics and model sizes. To
our knowledge, our work is the first example of an automated
data augmentation approach that outperforms manually designed
policies, such as SpecAugment, in speech recognition.
Methods for searching over graph spaces have been extensively
investigated in the context of neural architecture search
[38, 39]. While we choose to use a relatively simple evolutionary
algorithm [18, 19] to search over augmentation policies
in this particular work, a variety of methods have been
employed for such searches in the literature [38, 39, 40, 41,
42], an extensive list of which can be found in [43].
3. G-AUGMENT
To search for both how the augmentations are applied and the
parameters of the augmentations themselves, we parameterize
an augmentation policy as a graph with labeled nodes and
edges. Here we describe the details of this parameterization
and the algorithm we employ to search over this space.
3.1. Search Space
We parameterize an augmentation policy by a directed acyclic
graph (DAG), consisting of a single input node, a single output
node, and a number N of ensemble nodes. The input node
has only outgoing edges, which can connect to ensemble nodes.
The output node connects to a single ensemble node via an
edge, which passes its state to the output. Each node represents
an augmented state of the data, while the edges represent
the augmentations themselves.
Each ensemble node of the graph takes two inputs (Figure
3.1). We denote one of the incoming edges the left edge and
the other the right edge of a given ensemble node for convenience.
We denote the node connected to the tail of the
left/right edge as the left/right input, respectively. The incoming
edges are labeled by sampling probabilities p_l and p_r that
sum to unity, and quadruples a_l and a_r that represent augmentations.
The state of a given node is obtained by applying
the augmentations a_l/a_r to the left/right inputs and sampling
them with probability p_l/p_r, respectively. In other words,

$$(\text{node state}) = \begin{cases} a_l(\text{left input}) & \text{with probability } p_l, \\ a_r(\text{right input}) & \text{with probability } p_r. \end{cases} \tag{1}$$
We require the graph to be directed and all directed paths to
trace back to the input. This is enforced by assigning indices
1, ..., N to the ensemble nodes, the index 0 to the input node,
and selecting the left/right input indices of node n from
[0, n − 1]. The output node is always connected to node N.
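For concreteness, a minimal Python sketch of this parameterization follows. The class and field names are our own illustrative choices, not the implementation used in this work.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Edge:
    source: int        # index of the node this edge draws its input from
    prob: float        # sampling probability (p_l or p_r)
    augment: Callable  # augmentation function labeling the edge

@dataclass
class EnsembleNode:
    left: Edge         # left.prob + right.prob == 1
    right: Edge

def random_policy(num_nodes: int,
                  sample_augment: Callable[[], Callable]) -> List[EnsembleNode]:
    """Sample a random policy graph. Node n may only draw its left/right
    inputs from nodes 0..n-1 (node 0 being the input node), which
    guarantees acyclicity; the output node always reads from node N."""
    nodes = []
    for n in range(1, num_nodes + 1):
        p_left = random.random()
        nodes.append(EnsembleNode(
            left=Edge(random.randrange(n), p_left, sample_augment()),
            right=Edge(random.randrange(n), 1.0 - p_left, sample_augment()),
        ))
    return nodes
```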
The augmentation policy is applied to an input by stochastically
back-propagating through the graph. Given an input,
the augmentation to be applied is determined by
starting at the output node and back-propagating through the
graph, randomly selecting the path to travel based on the
selection probabilities of the edges. The path connecting the
input node to the output node sampled this way represents an
augmentation obtained by sequentially composing the augmentations
encountered along the path. The probability of a particular
path being selected is the product of the selection probabilities
of its edges. This process is depicted in Figure 3.1
along with pseudo-code. Any AutoAugment [13] policy or
RandAugment [14] policy is representable by such a graph.
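Continuing the hypothetical sketch above, the sampling procedure can be written as follows; each call draws a fresh path, so a single policy yields a different composite augmentation per training example.

```python
def sample_path_augmentation(nodes: List[EnsembleNode]) -> Callable:
    """Back-propagate from the output node (connected to node N) to the
    input node (node 0), sampling one incoming edge per visited node,
    and return the composition of the augmentations along the path."""
    path = []
    current = len(nodes)  # the output node is always connected to node N
    while current != 0:   # stop once the input node is reached
        node = nodes[current - 1]
        edge = node.left if random.random() < node.left.prob else node.right
        path.append(edge.augment)
        current = edge.source
    path.reverse()        # apply augmentations from input toward output

    def apply(x):
        for fn in path:
            x = fn(x)
        return x
    return apply
```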
To make the search space uniform, we represent an augmentation
by a quadruple a = (t, q, x_1, x_2), where t is a string
denoting the type of augmentation, q is the application probability
(not to be confused with the sampling probability), and
x_1 and x_2 are the strength parameters for the augmentation.
The x_i take 11 discrete integer values from 0 to 10.
We employ 17 augmentations, which we list in section 3.3.
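As a purely hypothetical illustration of how such a quadruple could be realized (the toy time-masking operation and the linear strength mapping below are our own assumptions, not the calibrations used in this work):

```python
def time_mask(x: list, s1: float, s2: float) -> list:
    """Toy time mask on a list of frames: zero out a random span whose
    width grows with strength s1 (s2 is unused in this toy example)."""
    width = int(s1 * 0.3 * len(x))
    if width == 0:
        return x
    start = random.randrange(len(x) - width + 1)
    return x[:start] + [0] * width + x[start + width:]

# Hypothetical registry mapping type strings t to implementations.
AUGMENT_FNS = {"time_mask": time_mask}

def make_augment(t: str, q: float, x1: int, x2: int) -> Callable:
    """Realize a quadruple a = (t, q, x1, x2): apply augmentation type t
    with application probability q, mapping the discrete strengths
    x1, x2 in {0, ..., 10} linearly onto [0, 1]."""
    assert 0 <= x1 <= 10 and 0 <= x2 <= 10
    def augment(x):
        if random.random() >= q:   # pass through with probability 1 - q
            return x
        return AUGMENT_FNS[t](x, x1 / 10.0, x2 / 10.0)
    return augment
```

Plugged together, random_policy(4, lambda: make_augment("time_mask", 0.5, random.randint(0, 10), 0)) samples a complete policy from which sample_path_augmentation draws stochastic augmentations.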