TOWARDS HIGH-QUALITY NEURAL TTS FOR LOW-RESOURCE LANGUAGES BY
LEARNING COMPACT SPEECH REPRESENTATIONS
Haohan Guo, Fenglong Xie, Xixin Wu, Hui Lu, Helen Meng
The Chinese University of Hong Kong, Hong Kong SAR, China
Xiaohongshu Inc., Shanghai, China
{hguo,xxwu,luhui,hmmeng}@se.cuhk.edu.hk,fenglongxie@xiaohongshu.com
ABSTRACT
This paper aims to enhance low-resource TTS by reducing
training data requirements using compact speech representa-
tions. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is
trained to learn the representation, MSMCR, and decode it to
waveforms. Subsequently, we train the multi-stage predictor
to predict MSMCRs from the text for TTS synthesis. More-
over, we optimize the training strategy by leveraging more
audio to learn MSMCRs better for low-resource languages.
It selects audio from other languages using a speaker simi-
larity metric to augment the training set, and applies transfer
learning to improve training quality. In MOS tests, the pro-
posed system significantly outperforms FastSpeech and VITS
in standard and low-resource scenarios, showing lower data
requirements. The proposed training strategy also effectively
enhances MSMCRs in waveform reconstruction, and further
improves TTS performance, winning 77% of the votes in the
preference test for low-resource TTS with only 15 minutes of
paired data.
Index Terms— Compact Representations, MSMC-TTS,
VQ-GAN, GAN, Low-Resource TTS
1. INTRODUCTION
Text-to-Speech (TTS) technologies have been widely applied
to serve all people around the world in intelligent speech in-
teractions, such as speech translation [1, 2], human-machine
interactions and conversations [3], etc. However, it remains
difficult for regions using minority (or even endangered) lan-
guages to achieve satisfactory TTS performance, due to the
lack of training data in these languages. Hence, seeking
practical approaches to address this data sparsity issue has
become increasingly crucial for low-resource TTS.
Recent works on this topic mostly concentrate on leveraging
more data from other domains to compensate for the lack of
target data. For example, some works [4, 5, 6] aim to build a
TTS dataset using crowd-sourced or automatic methods for
data collection and transcription. However, the obtained
dataset may have low recording quality and low naturalness,
which makes it difficult to achieve performance comparable
to systems trained on standard TTS datasets. Therefore, some
works consider using well-designed datasets in other lan-
guages to enhance TTS for low-resource languages, e.g. via
cross-lingual transfer learning [7, 8] and multi-lingual TTS [9].
(Footnote: Work performed during the first author's internship at Xiaohongshu.)
Besides leveraging more data, we can also tackle this
problem by reducing the training data requirement. This
paper proposes learning compact speech representations to
enhance low-resource TTS from this perspective. The speech
waveform, as a long sequence with much redundant infor-
mation, is hard to predict from the text directly without a
powerful model and sufficient data [10]. Hence, acoustic fea-
tures with higher compactness, i.e. shorter length and fewer
parameters, are usually used in TTS systems. They can be
well-predicted from the text, and converted to high-fidelity
waveforms via a vocoder. The compact representation effec-
tively reduces the requirement for paired data to train acoustic
models. Hence, for low-resource languages with less paired
data, we can learn a more compact speech representation to
further reduce the data requirement.
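To make "compactness" concrete, the following back-of-the-envelope sketch compares how many values per second of speech an acoustic model must predict under different representations; the sample rate, frame rates, and codebook counts below are illustrative assumptions, not the configuration used in this paper:

```python
# Values to predict per second of speech; all numbers below are
# illustrative assumptions, not this paper's actual configuration.
SAMPLE_RATE = 16_000
wav_values = SAMPLE_RATE                      # raw waveform: 16,000 samples

FRAME_RATE = 80                               # 12.5 ms hop -> 80 frames/s
N_MELS = 80
mel_values = FRAME_RATE * N_MELS              # mel spectrogram: 6,400 floats

STAGE_RATES = [50, 25]                        # tokens/s at two time resolutions
N_CODEBOOKS = 4                               # codebooks per stage
token_values = sum(r * N_CODEBOOKS for r in STAGE_RATES)  # 300 integers

print(wav_values, mel_values, token_values)   # 16000 6400 300
```

The fewer (and more discrete) the values the acoustic model must predict, the less paired text-audio data it needs to learn the mapping from text, which is the premise this paper builds on.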
MSMC-TTS [11] has shown great potential in this regard.
It trains a Multi-Stage Multi-Codebook (MSMC) VQ-VAE
to compress the waveform into the compact representation,
MSMCR, i.e. a set of discrete sequences with different time
resolutions. The representation can be predicted from the text
by a multi-stage predictor, and converted to the waveform via
a neural vocoder. In this paper, we first integrate the autoen-
coder and the neural vocoder into one model, MSMC-VQ-
GAN, for system simplification and joint optimization. More-
over, to learn better MSMCRs for low-resource languages, we
also optimize the training strategy by leveraging more high-
quality audio from other languages to train MSMC-VQ-GAN.
This strategy first augments the training set by selecting ut-
terances from other languages with high speaker similarity
to the target speaker of the low-resource language, and then
trains the model with transfer learning to enhance the train-
ing quality. Finally, we conduct experiments
to compare the proposed system with other mainstream TTS
systems under different scenarios, and evaluate the effect of
the proposed training strategy on the proposed system.
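As a rough illustration of the multi-stage, multi-codebook idea, the NumPy sketch below quantizes a feature sequence at two time resolutions: a coarse stage quantizes a downsampled sequence with several codebooks (product quantization over sub-vectors), and a fine stage quantizes the full-rate residual conditioned on the upsampled coarse stage. This is a simplified sketch under assumed shapes and random codebooks, not the authors' trained MSMC-VQ-GAN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mq_quantize(x, codebooks):
    """Multi-codebook (product) quantization: split each feature
    vector into equal sub-vectors and snap each sub-vector to the
    nearest codeword in its own codebook."""
    parts = np.split(x, len(codebooks), axis=-1)
    quantized = []
    for part, cb in zip(parts, codebooks):
        dist = ((part[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        quantized.append(cb[dist.argmin(axis=1)])
    return np.concatenate(quantized, axis=-1)

T, D, K = 64, 8, 16                      # frames, feature dim, codebook size
feats = rng.normal(size=(T, D))          # stand-in for encoder features

# Stage 1: coarse tokens at 1/4 of the time resolution
coarse = feats.reshape(T // 4, 4, D).mean(axis=1)
stage1_cbs = [rng.normal(size=(K, D // 2)) for _ in range(2)]
q_coarse = mq_quantize(coarse, stage1_cbs)

# Stage 2: quantize the full-rate residual, conditioned on the
# upsampled coarse stage
upsampled = np.repeat(q_coarse, 4, axis=0)            # back to (T, D)
stage2_cbs = [rng.normal(size=(K, D // 2)) for _ in range(2)]
q_fine = upsampled + mq_quantize(feats - upsampled, stage2_cbs)

print(q_coarse.shape, q_fine.shape)      # (16, 8) (64, 8)
```

The resulting representation is then just the codeword indices at each stage: a few short integer sequences per utterance, which is what makes the target easy to predict from text with limited paired data.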
arXiv:2210.15131v1 [cs.SD] 27 Oct 2022