MAESTRO-U: LEVERAGING JOINT SPEECH-TEXT REPRESENTATION LEARNING FOR ZERO SUPERVISED SPEECH ASR

Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

Google, Inc.
ABSTRACT
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero-supervised-speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages, we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.
Index Terms: Speech-text representation learning, zero resource, massively multilingual zero-supervised-speech ASR
1. INTRODUCTION
The last few years have seen the emergence of two major directions of research towards improving low-resource ASR quality. The first direction uses multilingual models to leverage the large amounts of supervised (manually transcribed) speech available for high-resource languages to improve quality on low-resource languages [2–5]. The second direction utilizes self-supervised pre-training on large amounts of unlabeled speech [6–9], unlabeled text [10, 11] or both [12–14] to complement the relatively small amounts of transcribed data available for these languages. An extreme example of the low-resource setting is learning ASR without the availability of any (in-language) transcribed resources (zero-supervised-speech ASR). In this work, we explore the possibility of using jointly learnt speech and text representations [13, 14] to expand ASR to languages lacking transcribed speech resources. (We thank Gary Wang, Jesse Emond, Charles Yoon, Zhong Meng and Kevin Hu for many discussions and infrastructure-related assistance.)
The zero-supervised-speech setting has previously been explored in several works [15–18]. However, most prior research on unsupervised ASR either learns models for phoneme recognition (implicitly assuming a model for phoneme-to-grapheme conversion), or assumes the availability of grapheme-to-phoneme (G2P) models for text augmentation. The construction of a G2P model requires at least as much expert human knowledge and effort as speech transcription; as such, G2P models are unavailable for many of the world's languages. In many zero-resource settings, the lack of a lexicon can double the unit error rate [19]. In the ZeroSpeech 2021 challenge [20], researchers explored the ability of models to learn language models from raw speech with no textual resources. These models were evaluated on their ability to learn the phonetics, lexicon, syntax and semantic structures of the language.
In this work, we define a practical setting in line with real-world constraints, assuming the availability of unlabeled speech and text (graphemes) in all 102 languages under consideration, and the availability of supervised speech in 52 of these languages. Given these resources, we attempt to improve end-to-end ASR quality on the remaining 50 zero-supervised-speech languages. We establish that a joint speech-text representation learning model, Maestro [14], fails to perform well on this zero-supervised-speech task, reaching a Character Error Rate (CER) of 54.2% averaged over the 50 languages. To improve joint speech and text representation learning for this setting, we propose the following:
• Building on the FLEURS benchmark [21], we define a massively multilingual zero-supervised-speech ASR task motivated by real-world constraints, with the goal of expanding the set of languages covered by ASR models.
• We propose several improvements to the Maestro model described in [1], namely, the use of language embeddings and adapters to learn better mappings across speech and text in languages sharing writing systems (Section 3.2), and the use of byte-level text representations to enable better transfer to script-unique zero-supervised-speech languages (Section 3.3); a minimal sketch of byte-level targets follows this list.
• We analyze and compare the role of different text injection strategies, including phonemized text and byte-level text representations, to understand the role of shared vocabularies in zero-supervised-speech ASR (Section 3.3).
• We conduct ablations of the components used in representation learning to understand the role of our proposed techniques and those proposed in [14], including the importance of the learnt duration model and consistency losses.
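To make the byte-level representation concrete, the following is a minimal sketch of our own (not the paper's implementation): graphemes from any script, including scripts absent from the supervised languages, decompose into UTF-8 byte IDs drawn from a fixed 256-symbol vocabulary, so no language-specific output units are needed.

```python
# Minimal sketch of byte-level text targets. This is an illustrative helper,
# not the authors' implementation: every grapheme, whatever its script,
# decomposes into UTF-8 bytes from a fixed 256-symbol vocabulary.
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 byte IDs in [0, 255]."""
    return list(text.encode("utf-8"))

def byte_ids_to_text(ids: list[int]) -> str:
    """Decode byte IDs back to text, dropping malformed sequences."""
    return bytes(ids).decode("utf-8", errors="ignore")

# A grapheme unique to an unseen script still yields in-vocabulary targets:
print(text_to_byte_ids("ക"))  # Malayalam letter KA -> [224, 180, 149]
```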
Fig. 1. Maestro-U training for zero-supervised-speech ASR. The zero-supervised-speech ASR task is defined in Section 2. The speech, shared and text encoders are described in Section 3.1. The use of bytes, language ID and residual adapters is described in Sections 3.2 and 3.3.
The work proposed in this paper results in a final zero-supervised-speech average CER of 30.8%, a 43% relative reduction over Maestro [1]. To the best of our knowledge, this is the first demonstration that competitive ASR performance can be achieved for an unseen language using no language resources other than unspoken text and untranscribed speech.
2. FLEURS ZERO SUPERVISED SPEECH ASR
We define our massively multilingual zero-supervised-speech ASR task building on the FLEURS benchmark [21]. FLEURS is a publicly available, multi-way parallel dataset of just 10 hours of read speech in each of 102 languages spanning 7 geo-groups, which can be used as a benchmark task for ASR. Of the 102 languages present in the FLEURS benchmark, we choose 52 to serve as our supervised languages (Group A), while the remaining 50 are utilized in a zero-supervised-speech setting (Group B). In order to understand zero-supervised-speech performance across all geo-groups, we balance the number of languages in Groups A and B from each geo-group, as shown in Table 1; a sketch of this balanced split follows.
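As an illustration only (the paper's actual assignment is fixed in Table 1), a per-geo-group balanced split could be drawn as follows; `lang_to_geo` is a hypothetical mapping from language code to geo-group.

```python
import random
from collections import defaultdict

# Illustrative sketch of a per-geo-group balanced split into supervised
# (Group A) and zero-supervised-speech (Group B) languages. The paper's
# actual split is given in Table 1; this only shows the balancing idea.
def balanced_split(lang_to_geo: dict[str, str], seed: int = 0):
    rng = random.Random(seed)
    by_geo = defaultdict(list)
    for lang, geo in lang_to_geo.items():
        by_geo[geo].append(lang)
    group_a, group_b = [], []
    for langs in by_geo.values():
        langs = sorted(langs)
        rng.shuffle(langs)
        mid = (len(langs) + 1) // 2  # Group A takes the extra language
        group_a.extend(langs[:mid])
        group_b.extend(langs[mid:])
    return group_a, group_b
```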
In addition to FLEURS, following [13, 14], we also include supervised speech and unlabeled speech from the MLS [4], VoxPopuli [22], CommonVoice [23] and Babel [24] datasets when available. While mC4 [25] is a good text resource for injection, it contains noisy data that can hurt ASR quality [26]. Therefore, we cleaned this text further using the language-ID and wordlist-based approaches described in [27]; the filtering step is sketched below.
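A hedged sketch of such a filter, in the spirit of [27] (the names `detect_language` and `wordlists`, and the 0.5 threshold, are illustrative stand-ins, not the actual tooling used):

```python
# Hypothetical sentence filter combining language ID and a wordlist check.
# `detect_language` is any language-ID classifier; `wordlists` maps a
# language code to a set of known in-language words.
def keep_sentence(sentence: str, lang: str, detect_language,
                  wordlists: dict[str, set[str]],
                  min_in_vocab: float = 0.5) -> bool:
    """Keep a sentence only if the language ID agrees with the target
    language and enough of its words appear in the wordlist."""
    if detect_language(sentence) != lang:
        return False
    words = sentence.split()
    if not words:
        return False
    known = sum(w.lower() in wordlists.get(lang, set()) for w in words)
    return known / len(words) >= min_in_vocab
```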
To understand the graphemic overlap between the supervised languages L^{(A)} and the zero-supervised-speech languages L^{(B)}, and its effect on ASR performance, we define the unseen grapheme ratio γ(l) of a language l in Group B w.r.t. the Group A languages in Equation 1:

\gamma(l) = 1 - \frac{\bigl|V(l) \cap \bigcup_{k=1}^{n} V\bigl(L^{(A)}_{k}\bigr)\bigr|}{|V(l)|}, \quad l \in L^{(B)} \tag{1}

where V(l) denotes the grapheme vocabulary of the language l, which can be obtained from any text resource for l. In this work, we obtain V(l) from the FLEURS release in [21].
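Equation 1 translates directly into code. Here is a minimal sketch, where `V` is a hypothetical mapping from language code to its grapheme set:

```python
# Unseen grapheme ratio (Equation 1): the fraction of language l's grapheme
# vocabulary not covered by any supervised (Group A) language. `V` maps a
# language code to its grapheme set; `group_a` lists the Group A languages.
def unseen_grapheme_ratio(l: str, group_a: list[str],
                          V: dict[str, set[str]]) -> float:
    covered = set().union(*(V[k] for k in group_a))
    return 1.0 - len(V[l] & covered) / len(V[l])

# Example: a language whose script is absent from Group A has ratio 1.0.
V = {"xx": {"ა", "ბ"}, "en": set("abc"), "fr": set("abcé")}
print(unseen_grapheme_ratio("xx", ["en", "fr"], V))  # -> 1.0
```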
3. PROPOSED METHOD: MAESTRO-U
In this work, we pursue the idea of expanding an ASR model to new languages while requiring zero supervised speech, using only text and untranscribed speech. This is done via text injection using the previously proposed Maestro model [14], together with a series of innovations to handle unseen scripts and promote multilingual knowledge transfer. Figure 1 summarizes the Maestro-U training process.
3.1. Text injection using Maestro
Zero- or few-shot approaches require training models that can implicitly map one sequence to another. This has been achieved for several text style transfer and MT tasks by training cross-lingual models with GANs [28] or with self-supervised pre-training, and for ASR by mapping audio to phonemes with GANs [18]. Recent work on speech-text pre-training, like mSLAM [13] and Maestro [14], has demonstrated that it is possible to learn shared representations of speech and text in the same model.

Maestro was proposed in [14] to address the speech-text representation learning problem by first aligning text to speech using an RNN-T decoder and then training a text encoder. The resultant text encoder can be used to map unspoken text into this aligned shared space and learn from it. When learning from untranscribed speech data, we use a contrastive loss on the speech encoder outputs and a masked language model (MLM) loss on the shared encoder outputs, similar to W2v-BERT [29]. When learning from paired speech and text, the text encoder uses the RNN-T model to generate alignments between the text targets and the speech encoder output. The Resampler and Refiner layers replicate the initially learned text embeddings to match the duration of the speech embeddings using this alignment information, and a Mean-Squared Error (MSE) training objective is used to enforce consistency between the resultant speech and text representations; this objective is sketched below.
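The consistency objective can be sketched as follows. This is a minimal PyTorch-style illustration under our own naming, not Maestro's actual code; `durations` holds the per-token frame counts obtained from the RNN-T alignment.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the paired speech-text consistency objective:
# token embeddings are replicated to the speech frame rate using the
# per-token durations from the RNN-T alignment, then matched against the
# speech encoder outputs with an MSE loss.
def consistency_loss(text_emb: torch.Tensor,   # [num_tokens, dim]
                     durations: torch.Tensor,  # [num_tokens], frames per token
                     speech_emb: torch.Tensor  # [num_frames, dim]
                     ) -> torch.Tensor:
    upsampled = text_emb.repeat_interleave(durations, dim=0)
    assert upsampled.shape == speech_emb.shape, "durations must sum to num_frames"
    return F.mse_loss(upsampled, speech_emb)
```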
When learning from unspoken text, speech-text alignment information is unavailable. Therefore, Maestro uses durations predicted by a duration prediction model, in a fashion similar to speech synthesis [30]. This model is trained to predict the duration of each token. The predicted duration on unspoken text is subsequently used to upsample the learned text embeddings to match the speech frame rate, and an RNN-T loss is applied over the resultant upsampled text representations; this unspoken-text path is sketched below.
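A hedged sketch of that unspoken-text path follows; the names `duration_predictor`, `shared_encoder` and `rnnt_loss` are illustrative stand-ins for the corresponding components, not Maestro's API.

```python
import torch

# Illustrative unspoken-text training step: predict per-token durations,
# upsample the text embeddings to the speech frame rate, and score the
# upsampled sequence against the token targets with an RNN-T loss.
def unspoken_text_step(text_emb: torch.Tensor,   # [num_tokens, dim]
                       token_ids: torch.Tensor,  # [num_tokens]
                       duration_predictor,       # module: [T, dim] -> [T] floats
                       shared_encoder,           # module: [frames, dim] -> [frames, dim]
                       rnnt_loss) -> torch.Tensor:
    durations = duration_predictor(text_emb).round().long().clamp(min=1)
    upsampled = text_emb.repeat_interleave(durations, dim=0)
    encoded = shared_encoder(upsampled)
    return rnnt_loss(encoded, token_ids)
```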