VIRTUOSO: MASSIVE MULTILINGUAL SPEECH-TEXT JOINT
SEMI-SUPERVISED LEARNING FOR TEXT-TO-SPEECH
Takaaki Saeki1,3*, Heiga Zen1, Zhehuai Chen2, Nobuyuki Morioka1, Gary Wang2,
Yu Zhang2, Ankur Bapna2, Andrew Rosenberg2, Bhuvana Ramabhadran2
1Google, Japan; 2Google, USA; 3The University of Tokyo, Japan
takaaki_saeki@ipc.i.u-tokyo.ac.jp, {heigazen,zhehuai}@google.com
ABSTRACT
This paper proposes Virtuoso, a massively multilingual speech–text
joint semi-supervised learning framework for text-to-speech synthe-
sis (TTS) models. Existing multilingual TTS systems typically support tens
of languages, a small fraction of the thousands of languages in the world.
One difficulty in scaling multilingual TTS to hundreds of languages is
collecting high-quality speech–text paired data for low-resource
languages. This study extends Maestro, a speech–text joint
pretraining framework for automatic speech recognition (ASR), to
speech generation tasks. To train a TTS model from various types
of speech and text data, different training schemes are designed to
handle supervised (paired TTS and ASR data) and unsupervised
(untranscribed speech and unspoken text) datasets. Experimental
evaluation shows that 1) multilingual TTS models trained with Virtuoso
achieve significantly better naturalness and intelligibility than baseline
models in seen languages, and 2) they can synthesize reasonably
intelligible and natural-sounding speech for unseen languages
where no high-quality paired TTS data is available.
Index Terms— Multilingual text-to-speech synthesis, massive
multilingual pretraining, speech–text semi-supervised joint learning.
1. INTRODUCTION
With the remarkable progress of neural text-to-speech synthesis
(TTS) methods, current multilingual TTS systems can synthesize
human-like high-quality speech in multiple languages. Early work
on multilingual TTS focused on building a TTS system for rich-
resource languages. For example, Zen et al. [1] built a multilingual
HMM-based statistical parametric speech synthesis (SPSS) system from
five Western European languages, and Li and Zen [2] developed a neural
network-based multilingual SPSS system from six Western European
languages. Recently, the research community has started scaling
multilingual TTS to tens of languages. He et al. [3] proposed a
multilingual Byte2Speech TTS model trained on 900 hours of speech data
in 43 languages. However, scaling to hundreds of lan-
guages is still highly challenging due to the difficulty in collecting a
large amount of high-quality paired TTS data for low-resource lan-
guages [3]. To cover thousands of languages, this paper aims to
develop a technology that can scale multilingual TTS to hundreds of
languages by using diverse speech and text data.
Semi-supervised and self-supervised learning have proven effective
for a wide range of speech and natural language processing
tasks. Massive multilingual speech pretraining [4] has shown re-
markable performance for downstream speech recognition tasks such
as multilingual ASR and speech translation. Recently, it has been
*This work was carried out while the first author was an intern at Google, Japan, in 2022.
extended to multimodal speech–text joint pretraining [5, 6] using
speech-text pairs, untranscribed speech, and unspoken text. Although
various approaches of massively multilingual self/semi-supervised
learning have been attempted for speech recognition tasks, they have
not been fully explored for multilingual speech generation tasks.
This paper proposes Virtuoso, a massive multilingual speech–
text joint pretraining framework based on self-supervised and semi-
supervised learning. It extends Maestro [6], a speech–text semi-
supervised pretraining framework for ASR, to speech generation
tasks. Virtuoso allows us to pretrain a multilingual TTS model using
unsupervised (untranscribed speech and unspoken text) and super-
vised (paired TTS and ASR data) datasets with training schemes
designed for them, which will allow the model to scale to hundreds
of languages. This work has the following contributions:
• Proposing massive multilingual semi-supervised pretraining for TTS.
It leverages different training schemes for “paired ASR”, “paired TTS”,
“untranscribed speech”, and “unspoken text” data to train a single
TTS model.
• Zero-shot TTS, where decent-quality TTS can be achieved for
languages not included in the “paired TTS” data.
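The first contribution can be pictured as a single training loop that routes each kind of batch to its own objective. The sketch below is a hypothetical illustration only: the loss names and grouping are placeholders invented here, not the paper's actual objectives.

```python
# Hypothetical sketch: routing each data type to its training scheme.
# All loss names below are illustrative placeholders, not Virtuoso's
# actual objectives.

LOSS_SCHEMES = {
    "paired_tts": ["speech_reconstruction", "duration"],    # supervised TTS
    "paired_asr": ["asr_decoder", "modality_alignment"],    # supervised ASR
    "untranscribed_speech": ["speech_self_supervised"],     # speech-only
    "unspoken_text": ["text_masked_lm"],                    # text-only
}

def active_losses(batch_type: str) -> list:
    """Return the loss terms applied to a batch of the given data type."""
    if batch_type not in LOSS_SCHEMES:
        raise ValueError("unknown data type: " + batch_type)
    return LOSS_SCHEMES[batch_type]

# A single shared model is updated with whichever losses match the batch,
# which is how both supervised and unsupervised data feed one TTS model.
for batch_type in ("paired_tts", "unspoken_text"):
    print(batch_type, "->", active_losses(batch_type))
```

The point of the dispatch is that no data type is wasted: every batch, paired or not, contributes gradients to the same shared parameters.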
2. RELATED WORK
Large-scale self-/semi-supervised speech pretraining has been ac-
tively studied and applied to various downstream recognition tasks.
In addition to speech-only pretraining [7–9], there are multimodal ap-
proaches such as TTS-based text injection [10] and speech–text joint
pretraining [5, 11–13]. Maestro [6] performs the modality matching
of speech and text embedding to learn speech-aware text representa-
tion and vice versa. Virtuoso extends Maestro to speech generation
tasks by adding a speech decoder on top of Maestro's shared encoder.
There have been prior studies on joint training of ASR and TTS
to improve ASR [14, 15], to obtain alignments [16], and to scale
ASR for low-resource settings [17, 18]. Virtuoso also jointly learns
ASR and TTS models, where its shared encoder learns speech–text
representation for both recognition and generation tasks.
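The shared-encoder idea can be sketched as two modality-specific front-ends that map speech frames and text tokens into one embedding space, on top of which a shared encoder and task heads (an ASR decoder, a speech decoder) would sit. All dimensions and projections below are invented for illustration; they are not the paper's actual model sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the paper).
D = 8            # shared embedding dimension
T_SPEECH = 20    # number of speech frames
T_TEXT = 5       # number of text tokens

# Modality-specific front-ends project each input into the same D-dim
# space, so a single shared encoder can serve both modalities.
W_speech = rng.standard_normal((40, D))   # e.g. 40-dim filterbank frames
W_text = rng.standard_normal((30, D))     # e.g. 30-entry token embedding table

speech_feats = rng.standard_normal((T_SPEECH, 40))
text_ids = rng.integers(0, 30, size=T_TEXT)

speech_emb = speech_feats @ W_speech      # (T_SPEECH, D)
text_emb = W_text[text_ids]               # (T_TEXT, D)

# Both sequences now live in the shared space; recognition (ASR) and
# generation (TTS) heads can both be trained on top of it.
assert speech_emb.shape[1] == text_emb.shape[1] == D
```

Because both modalities land in the same space, an alignment loss between paired speech and text embeddings (as in Maestro's modality matching) becomes a simple distance in that space.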
While most of the existing studies on multilingual TTS [2,19–22]
have focused on a limited number of rich-resource languages, some
studies have investigated low-resource languages [23, 24]. Some
previous work has used a byte sequence [3, 25] as input text tokens
to eliminate per-language modules for phoneme inputs and to learn
linguistic representations shared across multiple languages. The
prior work most similar to this paper is Byte2Speech [3], where a
multilingual TTS model mapping a byte sequence to a mel-spectrogram
was trained on 900 hours of paired TTS data covering 43 languages
from 109 speakers. Virtuoso also uses graphemes or bytes
arXiv:2210.15447v2 [cs.SD] 15 Mar 2023