TOWARDS HIGH-QUALITY NEURAL TTS FOR LOW-RESOURCE LANGUAGES BY
LEARNING COMPACT SPEECH REPRESENTATIONS
Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng
The Chinese University of Hong Kong, Hong Kong SAR, China
Xiaohongshu Inc., Shanghai, China
{hguo,xxwu,luhui,hmmeng}@se.cuhk.edu.hk,fenglongxie@xiaohongshu.com
ABSTRACT
This paper aims to enhance low-resource TTS by reducing
training data requirements using compact speech representa-
tions. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is
trained to learn the representation, MSMCR, and decode it to
waveforms. Subsequently, we train the multi-stage predictor
to predict MSMCRs from the text for TTS synthesis. More-
over, we optimize the training strategy by leveraging more
audio to learn MSMCRs better for low-resource languages.
It selects audio from other languages using a speaker simi-
larity metric to augment the training set, and applies transfer
learning to improve training quality. In MOS tests, the pro-
posed system significantly outperforms FastSpeech and VITS
in standard and low-resource scenarios, showing lower data
requirements. The proposed training strategy also effectively
enhances MSMCRs in waveform reconstruction, and further
improves TTS performance, winning 77% of the votes in the
preference test for low-resource TTS with only 15 minutes of
paired data.
Index Terms— Compact Representations, MSMC-TTS,
VQ-GAN, GAN, Low-Resource TTS
1. INTRODUCTION
Text-to-Speech (TTS) technologies have been widely applied
to serve all people around the world in intelligent speech in-
teractions, such as speech translation [1, 2], human-machine
interactions and conversations [3], etc. However, it remains
difficult for regions using minority (or even endangered) lan-
guages to achieve satisfactory TTS performance, due to the
lack of training data in these languages. Hence, seeking
practical approaches to address this data sparsity issue has
become increasingly crucial for low-resource TTS.
Recent works on this topic mostly concentrate on leveraging
more data from other domains to compensate for the lack of
target data. For example, some works [4, 5, 6] aim to build a
TTS dataset using crowd-sourced or automatic methods for
data collection and transcription. However, the obtained
dataset may have low recording quality and low naturalness,
which makes it difficult to achieve performance comparable
to systems trained on standard TTS datasets. Therefore, some
works consider using well-designed datasets in other lan-
guages to enhance TTS for low-resource languages, e.g. via
cross-lingual transfer learning [7, 8] and multi-lingual TTS [9].
(Footnote: Work performed during the first author's internship at Xiaohongshu.)
Besides leveraging more data, we can also tackle this
problem by reducing the training data requirement. This
paper proposes learning compact speech representations to
enhance low-resource TTS from this perspective. The speech
waveform, as a long sequence with much redundant infor-
mation, is hard to predict from the text directly without a
powerful model and sufficient data [10]. Hence, acoustic fea-
tures with higher compactness, i.e. shorter length and fewer
parameters, are usually used in TTS systems. They can be
well-predicted from the text, and converted to high-fidelity
waveforms via a vocoder. The compact representation effec-
tively reduces the requirement for paired data to train acoustic
models. Hence, for low-resource languages with less paired
data, we can learn a more compact speech representation to
further reduce the data requirement.
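To make "compactness" concrete, the following back-of-the-envelope sketch compares how many values per second of speech an acoustic model must predict under different representations; the sample rate, frame rates, and codebook counts below are illustrative assumptions, not the configuration used in this paper:

```python
# Values to predict per second of speech; all numbers below are
# illustrative assumptions, not this paper's actual configuration.
SAMPLE_RATE = 16_000
wav_values = SAMPLE_RATE                      # raw waveform: 16,000 samples

FRAME_RATE = 80                               # 12.5 ms hop -> 80 frames/s
N_MELS = 80
mel_values = FRAME_RATE * N_MELS              # mel spectrogram: 6,400 floats

STAGE_RATES = [50, 25]                        # tokens/s at two time resolutions
N_CODEBOOKS = 4                               # codebooks per stage
token_values = sum(r * N_CODEBOOKS for r in STAGE_RATES)  # 300 integers

print(wav_values, mel_values, token_values)   # 16000 6400 300
```

The fewer (and more discrete) the values the acoustic model must predict, the less paired text-audio data it needs to learn the mapping from text, which is the premise this paper builds on.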
MSMC-TTS [11] has shown great potential in this regard.
It trains a Multi-Stage Multi-Codebook (MSMC) VQ-VAE
to compress the waveform into the compact representation,
MSMCR, i.e. a set of discrete sequences with different time
resolutions. The representation can be predicted from the text
by a multi-stage predictor, and converted to the waveform via
a neural vocoder. In this paper, we first integrate the autoen-
coder and the neural vocoder into one model, MSMC-VQ-
GAN, for system simplification and joint optimization. More-
over, to learn better MSMCRs for low-resource languages, we
also optimize the training strategy by leveraging more high-
quality audio from other languages to train MSMC-VQ-GAN.
This strategy first augments the training set by selecting ut-
terances from other languages with high speaker similarity
to the target speaker of the low-resource language, and then
trains the model with transfer learning to enhance the train-
ing quality. Finally, we conduct experiments
to compare the proposed system with other mainstream TTS
systems under different scenarios, and evaluate the effect of
the proposed training strategy on the proposed system.
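As a rough illustration of the multi-stage, multi-codebook idea, the NumPy sketch below quantizes a feature sequence at two time resolutions: a coarse stage quantizes a downsampled sequence with several codebooks (product quantization over sub-vectors), and a fine stage quantizes the full-rate residual conditioned on the upsampled coarse stage. This is a simplified sketch under assumed shapes and random codebooks, not the authors' trained MSMC-VQ-GAN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mq_quantize(x, codebooks):
    """Multi-codebook (product) quantization: split each feature
    vector into equal sub-vectors and snap each sub-vector to the
    nearest codeword in its own codebook."""
    parts = np.split(x, len(codebooks), axis=-1)
    quantized = []
    for part, cb in zip(parts, codebooks):
        dist = ((part[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        quantized.append(cb[dist.argmin(axis=1)])
    return np.concatenate(quantized, axis=-1)

T, D, K = 64, 8, 16                      # frames, feature dim, codebook size
feats = rng.normal(size=(T, D))          # stand-in for encoder features

# Stage 1: coarse tokens at 1/4 of the time resolution
coarse = feats.reshape(T // 4, 4, D).mean(axis=1)
stage1_cbs = [rng.normal(size=(K, D // 2)) for _ in range(2)]
q_coarse = mq_quantize(coarse, stage1_cbs)

# Stage 2: quantize the full-rate residual, conditioned on the
# upsampled coarse stage
upsampled = np.repeat(q_coarse, 4, axis=0)            # back to (T, D)
stage2_cbs = [rng.normal(size=(K, D // 2)) for _ in range(2)]
q_fine = upsampled + mq_quantize(feats - upsampled, stage2_cbs)

print(q_coarse.shape, q_fine.shape)      # (16, 8) (64, 8)
```

The resulting representation is then just the codeword indices at each stage: a few short integer sequences per utterance, which is what makes the target easy to predict from text with limited paired data.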
arXiv:2210.15131v1 [cs.SD] 27 Oct 2022