
MAESTRO-U: LEVERAGING JOINT SPEECH-TEXT REPRESENTATION LEARNING FOR
ZERO SUPERVISED SPEECH ASR
Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang,
Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen
Google, Inc.
ABSTRACT
Training state-of-the-art Automatic Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero-supervised-speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and using language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages, we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.
Index Terms—Speech-text representation learning, zero resource, massively multilingual zero-supervised-speech ASR
1. INTRODUCTION
The last few years have seen the emergence of two major directions of research towards improving low-resource ASR quality. The first direction uses multilingual models to leverage the large amounts of supervised (manually transcribed) speech available for high-resource languages to improve quality on low-resource languages [2–5]. The second direction utilizes self-supervised pre-training on large amounts of unlabeled speech [6–9], unlabeled text [10, 11], or both [12–14] to complement the relatively small amounts of transcribed data available for these languages.
An extreme example of the low-resource setting is learning ASR without the availability of any (in-language) transcribed resources (zero-supervised-speech ASR). In this work, we explore the possibility of using jointly learnt speech and text representations [13, 14] to expand ASR to languages lacking transcribed speech resources.
The zero-supervised-speech setting has previously been explored in several works [15–18]. However, most prior research on unsupervised ASR either learns models for phoneme recognition (implicitly assuming a model for phoneme-to-grapheme conversion), or assumes the availability of grapheme-to-phoneme (G2P) models for text augmentation. The construction of a G2P model requires at least as much expert human knowledge and effort as speech transcription, so such models are unavailable for many of the world's languages. In many zero-resource settings, the lack of a lexicon can double the unit error rate [19]. In the ZeroSpeech 2021 challenge [20], researchers explored the ability of models to learn language models from raw speech with no textual resources. These models were evaluated on their ability to learn the phonetics, lexicon, syntax and semantic structures of the language.
In this work, we define a practical setting in line with real-world constraints, assuming the availability of unlabeled speech and text (graphemes) in all 102 languages under consideration, and the availability of supervised speech in 52 of these languages. Given these resources, we attempt to improve end-to-end ASR quality on the remaining 50 zero-supervised-speech languages. We establish that a joint speech-text representation learning model, Maestro [14], fails to perform well on this zero-supervised-speech task, reaching a Character Error Rate (CER) of 54.2% averaged over the 50 languages. To improve joint speech and text representation learning for this setting, we propose the following:
• Building on the FLEURS benchmark [21], we define a massively multilingual zero-supervised-speech ASR task motivated by real-world constraints, with the goal of expanding the set of languages covered by ASR models.
• We propose several improvements to the Maestro model described in [1], namely, the use of language embeddings and adapters to learn better mappings (Section 3.2) across speech and text in languages sharing writing systems, and the use of byte-level text representations to enable better transfer to script-unique zero-supervised-speech languages (Section 3.3); a sketch of these two ideas follows this list.
• We analyze and compare the role of different text injection strategies, including phonemized text and byte-level text representations, to understand the role of shared vocabularies in zero-supervised-speech ASR (Section 3.3).
• We conduct ablations of the components used in representation learning to understand the role of our proposed techniques and those proposed in [14], including the importance of the learnt duration model and consistency losses.
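
To make the byte-level text representations, language embeddings, and adapters concrete, the following is a minimal sketch in plain NumPy. It is our illustration rather than the Maestro-U implementation: the names (byte_tokenize, LANG_IDS, adapter), the 512-dimensional embedding size, and the randomly initialized parameters are assumptions; in the actual model such parameters would be learned jointly with the speech encoder.

    import numpy as np

    # Hypothetical three-language inventory; the real task covers 102 FLEURS locales.
    LANG_IDS = {"hi_in": 0, "ta_in": 1, "am_et": 2}

    VOCAB_SIZE = 256 + 2   # the 256 possible UTF-8 byte values plus <pad> and <eos>
    PAD_ID, EOS_ID = 256, 257
    EMBED_DIM = 512        # assumed model dimension

    def byte_tokenize(text):
        # Map text to UTF-8 byte IDs: every script shares one fixed 256-symbol
        # vocabulary, so no per-language grapheme inventory is needed.
        return list(text.encode("utf-8")) + [EOS_ID]

    rng = np.random.default_rng(0)
    byte_embed = rng.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype(np.float32)
    lang_embed = rng.normal(size=(len(LANG_IDS), EMBED_DIM)).astype(np.float32)

    def embed_text(text, lang):
        # Byte embeddings plus a learned language embedding that tells the
        # shared text encoder which language the byte sequence comes from.
        ids = byte_tokenize(text)
        return byte_embed[ids] + lang_embed[LANG_IDS[lang]]

    def adapter(h, w_down, w_up):
        # Residual bottleneck adapter: a small correction added back onto
        # the shared hidden states h of shape (T, EMBED_DIM).
        return h + np.maximum(h @ w_down, 0.0) @ w_up

    h = embed_text("தமிழ்", "ta_in")   # 5 Tamil code points -> 15 bytes + <eos>
    w_down = rng.normal(size=(EMBED_DIM, 64)).astype(np.float32)
    w_up = rng.normal(size=(64, EMBED_DIM)).astype(np.float32)
    print(h.shape, adapter(h, w_down, w_up).shape)   # (16, 512) (16, 512)

Because the byte vocabulary is fixed at 256 symbols no matter how many scripts are covered, a zero-supervised-speech language with a unique writing system still shares its entire text vocabulary with the supervised languages; this is the property exploited for transfer in Section 3.3.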
Thanks to Gary Wang, Jesse Emond, Charles Yoon, Zhong Meng and Kevin Hu for many discussions and infrastructure-related assistance.