Low-Resource Multilingual and Zero-Shot Multispeaker TTS
Florian Lux and Julia Koch and Ngoc Thang Vu
University of Stuttgart
florian.lux@ims.uni-stuttgart.de
Abstract
While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language agnostic meta learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We show the success of our proposed approach in terms of intelligibility, naturalness and similarity to the target speaker using objective metrics as well as human studies, and we provide our code and trained models open source.
1 Introduction
The applications of modern TTS systems are omnipresent and bring major benefits to a very diverse range of tasks. For example, low-resource TTS can be used to revitalize and conserve languages with diminishing numbers of speakers (Pine et al., 2022). Other recent applications go in the direction of protecting the privacy of a speaker by exchanging their voice for a different voice while not affecting the content of what is said (Meyer et al., 2022). Even in literary studies, TTS systems can be applied to investigate perceptual aspects of poetry reading (Koch et al., 2022). However, while the first of those examples can be done with just a single speaker, the latter two require the TTS system to be able to exchange the voice of the utterance that is produced, which usually requires large amounts of clean multispeaker data. The same requirement exists for many other such applications, which can also be seen in the rising interest of the research community in voice-cloning technologies (Wu et al., 2022; Casanova et al., 2022; Neekhara et al., 2021; Hemati and Borth, 2021; Cooper et al., 2020). The communities of speakers of low-resource languages are thus mostly locked out of many of the applications that modern TTS enables. For many such languages, like the Taa language, which is famous for its 83 click sounds, or the Yoruba language, in which tones bear so much meaning that the language can be mostly whistled, it would be extremely difficult to collect the required amounts of data, and transfer learning to such unique languages is very challenging. Still, we believe that a single model that speaks many languages with any voice can exhibit strong generalizing properties and is a promising first step towards fixing these inequalities.
In this work, we ask the following question: Can a multilingual TTS system be used to achieve zero-shot multispeaker TTS in a low-resource scenario? Our approach is to use crosslingual knowledge sharing to enable 1) finetuning a TTS system on just 5 minutes of data in an unseen language from an unseen branch of the phylogenetic tree of languages and 2) transferring zero-shot multispeaker capabilities from the pretraining languages to the unseen language. To achieve this, we propose changes to a TTS encoder to better handle multilingual data and disentangle languages from speakers. Further, we show that the LAML pretraining procedure (Finn et al., 2017; Lux and Vu, 2022) can also be used to train general speaker-conditioned models. To verify the effectiveness of our contributions, we train models on just 5 minutes of German and Russian while excluding all Germanic and Slavic languages, respectively, from the pretraining. We choose a simulated low-resource scenario over an actual low-resource scenario in order to get more reliable evaluations using both objective measures as well as human studies. Furthermore, we show that models trained with this approach not only serve
as a basis for low-resource finetuning with greatly reduced data needs, but can also be used without finetuning as strong multispeaker and multilingual models. We train a model on 12 languages simultaneously and show that it can transfer speaker identities across all languages, even those in which it has only seen a single speaker during training.
All of our code, as well as the trained multilingual model, is available open source at https://github.com/DigitalPhonetics/IMS-Toucan. An interactive demo (https://huggingface.co/spaces/Flux9665/IMS-Toucan) and a demo with pre-generated audios (https://multilingualtoucan.github.io/) are also available.
2 Related Work
2.1 Zero-Shot Multispeaker TTS
Zero-shot multispeaker TTS was first attempted in (Arik et al., 2018). The idea of using an external speaker encoder as a conditioning signal was further explored by (Jia et al., 2018). (Cooper et al., 2020) attempted to close the quality gap between seen and unseen speakers in zero-shot multispeaker TTS using more informative embeddings. With the use of attentive speaker embeddings for more general speaking style encoding (Wang et al., 2018; Choi et al., 2020), as well as different decoding approaches in the acoustic space, such as generative flows (Casanova et al., 2021), further attempts have been made at closing the quality gap between seen and unseen speakers. This is, however, still not a fully solved task. Furthermore, zero-shot multispeaker TTS requires a large amount of high quality data featuring many different speakers to cover a variety of voice properties.
2.2 Low-Resource TTS
In some languages, even single-speaker TTS is not feasible due to a severe lack of available high-quality training data. Attempts at enabling TTS for seen speakers in low-resource scenarios have been made by (Azizah et al., 2020; Xu et al., 2020; Chen et al., 2019) through the use of transfer learning from multilingual data, which comes with a set of problems due to the mismatch in the input space (i.e., different sets of phonemes) when using multiple languages. Training a model jointly on multiple languages to share knowledge across languages has been attempted by (He et al., 2021;
de Korte et al., 2020; Yang and He, 2020). One solution to the problem of sharing knowledge across different phoneme sets is the use of articulatory features, which has been proposed in (Staib et al., 2020; Wells et al., 2021; Lux and Vu, 2022).
2.3 Multilingual Multispeaker TTS
The task of multilingual (not even considering low-resource languages) zero-shot multispeaker TTS is mostly unexplored. YourTTS (Casanova et al., 2022) claims to be the first work on zero-shot speaker transfer across multiple languages and was developed concurrently to this work. At the time of writing, only a preprint is available, so our comparison to their model and methods may differ from a later version. YourTTS reports results similar to ours on high-resource languages using the VITS architecture (Kim et al., 2021) with a set of modifications to handle multilingual data. The authors find that their model does not perform as well with unseen voices in languages for which it has only seen single speaker training data. Due to its low-resource focused design, our approach does not exhibit this problem while being conceptually simpler. YourTTS shows that just one minute of data suffices to achieve very good results when adapting to a new speaker in a known language. This is consistent with our results; however, we go one step further and show that 5 minutes of data is enough to adapt not only to a new speaker, but also to a new language. Also consistent with their results, we see that the speaker embedding learns to attribute noisy training data to certain speakers, so not all speakers perform equally well. Ideally, we would want to also disentangle the noise modeling from the speakers and languages. The GST approach (Wang et al., 2018) has shown that disentangling noise from speakers is possible; it is, however, not trivial to also disentangle languages, since language properties are relevant not only to the decoder but also to the encoder.
Finally, combining the task of zero-shot multispeaker TTS with the task of low-resource TTS has, to the best of our knowledge, only been attempted once, in a very recent approach that was developed concurrently to ours (Azizah and Jatmiko, 2022). Their system uses a multi-stage transfer learning process that starts from a single speaker system, which is then expanded with a pretrained speaker encoder. They add the required components for speaker and language conditioning and apply finetuning to only those parts of the architecture. The main difference between our system and theirs is that we train the full architecture jointly on the high-resource source domain using the LAML pretraining procedure.

Figure 1: Overview of the encoder design. All of the projections project to the same dimensionality, which we chose to be 384. Round corners mean trainable. Conformer blocks include relative positional encoding.
3 Proposed Method
3.1 System Architecture
Due to its elegant solution to the one-to-many problem of speech synthesis, we choose FastSpeech 2 (Ren et al., 2020) as the basis for our method. There is, however, no reason why this procedure should not work in conjunction with any comparable architecture, making the approach mostly model agnostic. We use the Conformer architecture (Gulati et al., 2020) in both the encoder and the decoder. This is the same as the basic implementation in the IMS Toucan toolkit (Lux et al., 2021), which is in turn based on the ESPnet toolkit (Hayashi et al., 2020, 2021).
To handle the zero-shot multispeaker task, we condition the TTS system on an ensemble of pretrained speaker embedding functions, consisting of ECAPA-TDNN (Desplanques et al., 2020) and X-Vector (Snyder et al., 2018) models trained on VoxCeleb 1 and 2 (Nagrani et al., 2019, 2017; Chung et al., 2018) using the SpeechBrain toolkit (Ravanelli et al., 2021), as suggested in (Meyer et al., 2022). Consistent with (Jia et al., 2018), we find that the best ability to produce speech from voices unseen during training is achieved when injecting the speaker embeddings into the output of the encoder. First, we bottleneck the speaker embeddings and apply the SoftSign function, as suggested in (Gibiansky et al., 2017). Then we concatenate them to the encoder's hidden state and project them back to the size of the encoder's hidden state. At inference time, a speaker embedding extracted from a reference audio can be used to make the synthesis speak in the voice of the reference speaker. An important trick we found is to add layer normalization right after the embedding is injected into the hidden state. This does not affect the synthesis of speakers seen during training, but it helps with unseen speakers.
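To make this conditioning step concrete, the following is a minimal PyTorch sketch of the injection (bottleneck, SoftSign, concatenation, projection, layer normalization), not the exact toolkit implementation: the 384-dimensional hidden size follows Figure 1, while the ensemble embedding size (assuming a 192-dimensional ECAPA-TDNN vector concatenated with a 512-dimensional X-Vector) and the bottleneck size are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SpeakerConditioning(nn.Module):
    """Sketch: inject a pretrained speaker embedding into the encoder output."""

    def __init__(self, spk_emb_dim=704, hidden_dim=384, bottleneck_dim=64):
        super().__init__()
        # compress the ensemble speaker embedding and squash it with SoftSign,
        # following the suggestion from Deep Voice 2 (Gibiansky et al., 2017)
        self.bottleneck = nn.Sequential(
            nn.Linear(spk_emb_dim, bottleneck_dim),
            nn.Softsign(),
        )
        # project [encoder state ; speaker vector] back to the hidden size
        self.projection = nn.Linear(hidden_dim + bottleneck_dim, hidden_dim)
        # layer normalization right after the injection helps with unseen speakers
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, encoder_states, spk_emb):
        # encoder_states: (batch, time, hidden_dim), spk_emb: (batch, spk_emb_dim)
        speaker = self.bottleneck(spk_emb)
        speaker = speaker.unsqueeze(1).expand(-1, encoder_states.size(1), -1)
        conditioned = torch.cat([encoder_states, speaker], dim=-1)
        return self.norm(self.projection(conditioned))
```

At inference time, `spk_emb` would simply be the ensemble embedding extracted from a reference recording of the target voice.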
In order to disentangle the languages from the speakers, we add an embedding for the language of the current sample along the sequence axis to the phoneme embedding sequence at the start of the encoder. This fits well with the intuition of a TTS encoder dealing with the text and the decoder dealing with the speech, since the text processing should not rely on speaker information, as a text does not have an inherent speaker. So we infuse the language information at the text stage and the speaker information at the speech stage of the model's information flow. Since, unlike the number of possible voices, the number of languages in the world is finite, we simply use an embedding lookup table to get language embeddings, which receive their meaning purely through backpropagation during training. A text-based language embedding could allow for zero-shot language adaptation, which we plan to investigate in the future. An overview of the multilingual multispeaker encoder is shown in Figure 1.
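A minimal sketch of this language conditioning is given below. It assumes one plausible reading of the description above, namely that the looked-up language vector is broadcast over the sequence axis and summed onto every phoneme embedding; the table size and the 384-dimensional embedding size are illustrative, not taken from the implementation.

```python
import torch
import torch.nn as nn


class LanguageConditioning(nn.Module):
    """Sketch: add a learned language embedding to the phoneme embedding sequence."""

    def __init__(self, num_languages=12, hidden_dim=384):
        super().__init__()
        # plain lookup table; the entries receive their meaning purely
        # through backpropagation during multilingual training
        self.language_table = nn.Embedding(num_languages, hidden_dim)

    def forward(self, phoneme_embeddings, language_id):
        # phoneme_embeddings: (batch, time, hidden_dim), language_id: (batch,)
        lang = self.language_table(language_id).unsqueeze(1)  # (batch, 1, hidden)
        # broadcast along the sequence axis so every position is conditioned
        return phoneme_embeddings + lang
```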
To transform the spectrograms that the FastSpeech 2 based synthesis produces into a waveform, we make use of the HiFi-GAN architecture (Kong et al., 2020) as implemented in the IMS Toucan toolkit (Lux et al., 2021). As shown in (Liu et al., 2021), neural vocoders can perform super-resolution as well as spectrogram inversion. We apply the same trick to transform the 16 kHz spectrograms the synthesis produces into 48 kHz waveforms.
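As a rough sketch of the bookkeeping behind this trick (the hop length and upsampling factors below are assumptions, not values taken from the paper): a HiFi-GAN generator that inverts 16 kHz spectrograms directly into 48 kHz audio only needs its per-stage upsampling factors to multiply to three times the analysis hop length.

```python
# Sanity check for a vocoder doing spectrogram inversion and
# 16 kHz -> 48 kHz super-resolution in a single step.
# Hop length and upsample rates are illustrative assumptions.

HOP_LENGTH_16KHZ = 256            # samples per spectrogram frame at 16 kHz
SR_IN, SR_OUT = 16_000, 48_000    # analysis rate vs. output waveform rate

# each input frame must be expanded to this many output samples
samples_per_frame_out = HOP_LENGTH_16KHZ * SR_OUT // SR_IN   # 768

# hypothetical per-stage upsampling factors of the HiFi-GAN generator
upsample_rates = [8, 6, 4, 4]

product = 1
for rate in upsample_rates:
    product *= rate
assert product == samples_per_frame_out, "upsample rates must multiply to 768"
```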
3.2 Input Representation
To make the use of multilingual data with only partially overlapping phoneme sets easier, we represent the inputs to our system as articulatory feature vectors rather than identity-based vectors, as introduced in (Lux and Vu, 2022). On top of this, we add an additional mechanism to deal with the
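To illustrate the articulatory input representation, the following is a small sketch using the panphon library (the toolkit's actual featurization pipeline may differ): each IPA segment is mapped to a vector of articulatory features, so that sounds shared between languages end up with overlapping representations.

```python
import panphon

ft = panphon.FeatureTable()

# IPA for an example word; in a full pipeline the IPA would come from a
# grapheme-to-phoneme front end (e.g. espeak-ng) rather than being hard-coded
ipa = "haloː"

# one vector of articulatory features (+1 / 0 / -1 entries) per IPA segment,
# replacing a language-specific phoneme-identity lookup
vectors = ft.word_to_vector_list(ipa, numeric=True)
print(len(vectors), "segments with", len(vectors[0]), "features each")
```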