MAESTRO-U: LEVERAGING JOINT SPEECH-TEXT REPRESENTATION LEARNING FOR ZERO SUPERVISED SPEECH ASR

Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

Google, Inc.
ABSTRACT
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero-supervised-speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages, we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.
Index Terms: Speech-text representation learning, zero resource, massively multilingual zero-supervised-speech ASR
1. INTRODUCTION
The last few years have seen the emergence of two major directions of research towards improving low-resource ASR quality. The first direction uses multilingual models to leverage the large amounts of supervised (manually transcribed) speech available for high-resource languages to improve quality on low-resource languages [2–5]. The second direction utilizes self-supervised pre-training on large amounts of unlabeled speech [6–9], unlabeled text [10, 11] or both [12–14] to complement the relatively small amounts of transcribed data available for these languages. An extreme example of the low-resource setting is learning ASR without the availability of any (in-language) transcribed resources (zero-supervised-speech ASR). In this work, we explore the possibility of using jointly learnt speech and text representations [13, 14] to expand ASR to languages lacking transcribed speech resources. (We thank Gary Wang, Jesse Emond, Charles Yoon, Zhong Meng and Kevin Hu for many discussions and infrastructure-related assistance.)
The zero-supervised-speech setting has previously been explored in several works [15–18]. However, most prior research on unsupervised ASR either learns models for phoneme recognition (implicitly assuming a model for phoneme-to-grapheme conversion), or assumes the availability of grapheme-to-phoneme (G2P) models for text augmentation. The construction of a G2P model requires at least as much expert human knowledge and effort as speech transcription; as such, G2P models are unavailable for many of the world's languages. In many zero-resource settings, the lack of a lexicon can double the unit error rate [19]. In the ZeroSpeech 2021 challenge [20], researchers explored the ability of models to learn language models from raw speech with no textual resources. These models were evaluated on their ability to learn the phonetics, lexicon, syntax and semantic structures of the language.
In this work, we define a practical setting in line with real-world constraints, assuming the availability of unlabeled speech and text (graphemes) in all 102 languages under consideration, and the availability of supervised speech in 52 of these languages. Given these resources, we attempt to improve end-to-end ASR quality on the remaining 50 zero-supervised-speech languages. We establish that a joint speech-text representation learning model, Maestro [14], fails to perform well on this zero-supervised-speech task, reaching a Character Error Rate (CER) of 54.2% averaged over the 50 languages. To improve joint speech and text representation learning for this setting, we propose the following:
• Building on the FLEURS benchmark [21], we define a massively multilingual zero-supervised-speech ASR task motivated by real-world constraints, with the goal of expanding the set of languages covered by ASR models.
• We propose several improvements to the Maestro model described in [1], namely, the use of language embeddings and adapters to learn better mappings across speech and text in languages sharing writing systems (Section 3.2), and the use of byte-level text representations to enable better transfer to script-unique zero-supervised-speech languages (Section 3.3); a minimal sketch of byte-level targets follows this list.
• We analyze and compare the role of different text injection strategies, including phonemized text and byte-level text representations, to understand the role of shared vocabularies in zero-supervised-speech ASR (Section 3.3).
• We conduct ablations of the components used in representation learning to understand the role of our proposed techniques and those proposed in [14], including the importance of the learnt duration model and consistency losses.
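To make the byte-level representation concrete, the following is a minimal sketch of our own (not the paper's implementation): graphemes from any script, including scripts absent from the supervised languages, decompose into UTF-8 byte IDs drawn from a fixed 256-symbol vocabulary, so no language-specific output units are needed.

```python
# Minimal sketch of byte-level text targets. This is an illustrative helper,
# not the authors' implementation: every grapheme, whatever its script,
# decomposes into UTF-8 bytes from a fixed 256-symbol vocabulary.
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 byte IDs in [0, 255]."""
    return list(text.encode("utf-8"))

def byte_ids_to_text(ids: list[int]) -> str:
    """Decode byte IDs back to text, dropping malformed sequences."""
    return bytes(ids).decode("utf-8", errors="ignore")

# A grapheme unique to an unseen script still yields in-vocabulary targets:
print(text_to_byte_ids("ക"))  # Malayalam letter KA -> [224, 180, 149]
```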
Fig. 1. Maestro-U training for zero-supervised-speech ASR. The zero-supervised-speech ASR task is defined in Section 2. The speech, shared and text encoders are described in Section 3.1. The use of bytes, language ID and residual adapters is described in Sections 3.2 and 3.3.
The work proposed in this paper results in a final zero-supervised-speech average CER of 30.8%, a 43% relative reduction over Maestro [1]. To the best of our knowledge, this is the first demonstration that competitive ASR performance can be achieved for an unseen language using no language resources other than unspoken text and untranscribed speech.
2. FLEURS ZERO SUPERVISED SPEECH ASR
We define our massively multilingual zero-supervised-speech ASR task building on the FLEURS benchmark [21]. FLEURS is a publicly available, multi-way parallel dataset of just 10 hours of read speech in each of 102 languages spanning 7 geo-groups, which can be used as a benchmark task for ASR. Of the 102 languages present in the FLEURS benchmark, we choose 52 to serve as our supervised languages (Group A), while the remaining 50 are utilized in a zero-supervised-speech setting (Group B). In order to understand zero-supervised-speech performance across all geo-groups, we balance the number of languages in Groups A and B from each geo-group, as shown in Table 1; a sketch of this balanced split follows.
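As an illustration only (the paper's actual assignment is fixed in Table 1), a per-geo-group balanced split could be drawn as follows; `lang_to_geo` is a hypothetical mapping from language code to geo-group.

```python
import random
from collections import defaultdict

# Illustrative sketch of a per-geo-group balanced split into supervised
# (Group A) and zero-supervised-speech (Group B) languages. The paper's
# actual split is given in Table 1; this only shows the balancing idea.
def balanced_split(lang_to_geo: dict[str, str], seed: int = 0):
    rng = random.Random(seed)
    by_geo = defaultdict(list)
    for lang, geo in lang_to_geo.items():
        by_geo[geo].append(lang)
    group_a, group_b = [], []
    for langs in by_geo.values():
        langs = sorted(langs)
        rng.shuffle(langs)
        mid = (len(langs) + 1) // 2  # Group A takes the extra language
        group_a.extend(langs[:mid])
        group_b.extend(langs[mid:])
    return group_a, group_b
```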
In addition to FLEURS, following [13, 14], we also include supervised speech and unlabeled speech from the MLS [4], VoxPopuli [22], CommonVoice [23] and Babel [24] datasets when available. While mC4 [25] is a good text resource for injection, it contains noisy data that can hurt ASR quality [26]. Therefore, we cleaned this text further using the language-ID and wordlist-based approaches described in [27]; the filtering step is sketched below.
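A hedged sketch of such a filter, in the spirit of [27] (the names `detect_language` and `wordlists`, and the 0.5 threshold, are illustrative stand-ins, not the actual tooling used):

```python
# Hypothetical sentence filter combining language ID and a wordlist check.
# `detect_language` is any language-ID classifier; `wordlists` maps a
# language code to a set of known in-language words.
def keep_sentence(sentence: str, lang: str, detect_language,
                  wordlists: dict[str, set[str]],
                  min_in_vocab: float = 0.5) -> bool:
    """Keep a sentence only if the language ID agrees with the target
    language and enough of its words appear in the wordlist."""
    if detect_language(sentence) != lang:
        return False
    words = sentence.split()
    if not words:
        return False
    known = sum(w.lower() in wordlists.get(lang, set()) for w in words)
    return known / len(words) >= min_in_vocab
```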
To understand the graphemic overlap between the supervised languages L^{(A)} and the zero-supervised-speech languages L^{(B)}, and its effect on ASR performance, we define the unseen grapheme ratio γ(l) of a language l in Group B w.r.t. the Group A languages in Equation 1:

\gamma(l) = 1 - \frac{\bigl|V(l) \cap \bigcup_{k=1}^{n} V\bigl(L^{(A)}_{k}\bigr)\bigr|}{|V(l)|}, \quad l \in L^{(B)} \tag{1}

where V(l) denotes the grapheme vocabulary of the language l, which can be obtained from any text resource for l. In this work, we obtain V(l) from the FLEURS release in [21].
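Equation 1 translates directly into code. Here is a minimal sketch, where `V` is a hypothetical mapping from language code to its grapheme set:

```python
# Unseen grapheme ratio (Equation 1): the fraction of language l's grapheme
# vocabulary not covered by any supervised (Group A) language. `V` maps a
# language code to its grapheme set; `group_a` lists the Group A languages.
def unseen_grapheme_ratio(l: str, group_a: list[str],
                          V: dict[str, set[str]]) -> float:
    covered = set().union(*(V[k] for k in group_a))
    return 1.0 - len(V[l] & covered) / len(V[l])

# Example: a language whose script is absent from Group A has ratio 1.0.
V = {"xx": {"ა", "ბ"}, "en": set("abc"), "fr": set("abcé")}
print(unseen_grapheme_ratio("xx", ["en", "fr"], V))  # -> 1.0
```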
3. PROPOSED METHOD: MAESTRO-U
In this work, we pursue the idea of expanding an ASR model to new languages while requiring zero supervised speech, using only text and untranscribed speech. This is done via text injection using the previously proposed Maestro model [14], together with a series of innovations to handle unseen scripts and promote multilingual knowledge transfer. Figure 1 summarizes the Maestro-U training process.
3.1. Text injection using Maestro
Zero- or few-shot approaches require training models that can implicitly map one sequence to another. This has been achieved for several text style transfer and MT tasks by training cross-lingual models with GANs [28] or with self-supervised pre-training, and for ASR by mapping audio to phonemes with GANs [18]. Recent work on speech-text pre-training, like mSLAM [13] and Maestro [14], has demonstrated that it is possible to learn shared representations of speech and text in the same model.

Maestro was proposed in [14] to address the speech-text representation learning problem by first aligning text to speech using an RNN-T decoder and then training a text encoder. The resultant text encoder can be used to map unspoken text into this aligned shared space and learn from it. When learning from untranscribed speech data, we use a contrastive loss on the speech encoder outputs and a masked language model (MLM) loss on the shared encoder outputs, similar to W2v-BERT [29]. When learning from paired speech and text, the text encoder uses the RNN-T model to generate alignments between the text targets and the speech encoder output. The Resampler and Refiner layers replicate the initially learned text embeddings to match the duration of the speech embeddings using this alignment information, and a Mean-Squared Error (MSE) training objective is used to enforce consistency between the resultant speech and text representations; this objective is sketched below.
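The consistency objective can be sketched as follows. This is a minimal PyTorch-style illustration under our own naming, not Maestro's actual code; `durations` holds the per-token frame counts obtained from the RNN-T alignment.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the paired speech-text consistency objective:
# token embeddings are replicated to the speech frame rate using the
# per-token durations from the RNN-T alignment, then matched against the
# speech encoder outputs with an MSE loss.
def consistency_loss(text_emb: torch.Tensor,   # [num_tokens, dim]
                     durations: torch.Tensor,  # [num_tokens], frames per token
                     speech_emb: torch.Tensor  # [num_frames, dim]
                     ) -> torch.Tensor:
    upsampled = text_emb.repeat_interleave(durations, dim=0)
    assert upsampled.shape == speech_emb.shape, "durations must sum to num_frames"
    return F.mse_loss(upsampled, speech_emb)
```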
When learning from unspoken text, speech-text alignment information is unavailable. Therefore, Maestro uses durations predicted by a duration prediction model, in a fashion similar to speech synthesis [30]. This model is trained to predict the duration of each token. The predicted duration on unspoken text is subsequently used to upsample the learned text embeddings to match the speech frame rate, and an RNN-T loss is applied over the resultant upsampled text representations; this unspoken-text path is sketched below.
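A hedged sketch of that unspoken-text path follows; the names `duration_predictor`, `shared_encoder` and `rnnt_loss` are illustrative stand-ins for the corresponding components, not Maestro's API.

```python
import torch

# Illustrative unspoken-text training step: predict per-token durations,
# upsample the text embeddings to the speech frame rate, and score the
# upsampled sequence against the token targets with an RNN-T loss.
def unspoken_text_step(text_emb: torch.Tensor,   # [num_tokens, dim]
                       token_ids: torch.Tensor,  # [num_tokens]
                       duration_predictor,       # module: [T, dim] -> [T] floats
                       shared_encoder,           # module: [frames, dim] -> [frames, dim]
                       rnnt_loss) -> torch.Tensor:
    durations = duration_predictor(text_emb).round().long().clamp(min=1)
    upsampled = text_emb.repeat_interleave(durations, dim=0)
    encoded = shared_encoder(upsampled)
    return rnnt_loss(encoded, token_ids)
```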