Low-Resource Multilingual and Zero-Shot Multispeaker TTS
Florian Lux and Julia Koch and Ngoc Thang Vu
University of Stuttgart
florian.lux@ims.uni-stuttgart.de
Abstract
While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world's over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language agnostic meta learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We show the success of our proposed approach in terms of intelligibility, naturalness and similarity to the target speaker using objective metrics as well as human studies, and we provide our code and trained models open source.
1 Introduction
The applications of modern TTS systems are omnipresent and bring major benefits to a very diverse range of tasks. For example, low-resource TTS can be used to revitalize and conserve languages with diminishing numbers of speakers (Pine et al., 2022). Other recent applications go in the direction of protecting the privacy of a speaker by exchanging their voice for a different voice while not affecting the content of what is said (Meyer et al., 2022). Even in literary studies, TTS systems can be applied to investigate perceptual aspects of poetry reading (Koch et al., 2022). However, while the first of those examples can be done with just a single speaker, the latter two require the TTS system to be able to exchange the voice of the utterance that is produced, which usually requires large amounts of clean multispeaker data. The same requirement exists for many other such applications, which can also be seen in the rising interest of the research community in voice-cloning technologies (Wu et al., 2022; Casanova et al., 2022; Neekhara et al., 2021; Hemati and Borth, 2021; Cooper et al., 2020). The communities of speakers of low-resource languages are thus mostly locked out of many of the applications that modern TTS enables. For many such languages, like the Taa language, which is famous for its 83 click sounds, or the Yoruba language, in which tones bear so much meaning that the language can be mostly whistled, it would be extremely difficult to collect the required amounts of data, and transfer learning to such unique languages is very challenging. Still, we believe that a single model that speaks many languages with any voice can exhibit strong generalizing properties and is a promising first step towards fixing these inequalities.
In this work, we ask the following question: Can a multilingual TTS system be used to achieve zero-shot multispeaker TTS in a low-resource scenario? Our approach is to use crosslingual knowledge sharing to enable 1) finetuning a TTS system on just 5 minutes of data in an unseen language from an unseen branch of the phylogenetic tree of languages and 2) transferring zero-shot multispeaker capabilities from the pretraining languages to the unseen language. To achieve this, we propose changes to a TTS encoder to better handle multilingual data and disentangle languages from speakers. Further, we show that the LAML pretraining procedure (Finn et al., 2017; Lux and Vu, 2022) can also be used to train general speaker-conditioned models. To verify the effectiveness of our contributions, we train models on just 5 minutes of German and Russian while excluding all Germanic and Slavic languages, respectively, from the pretraining. We choose a simulated low-resource scenario over an actual low-resource scenario in order to get more reliable evaluations using both objective measures as well as human studies. Furthermore, we show that models trained with this approach not only serve
as a basis for low-resource finetuning with greatly reduced data needs, but can also be used without finetuning as strong multispeaker and multilingual models. We train a model on 12 languages simultaneously and show that it can transfer speaker identities across all languages, even those in which it has only seen a single speaker during training.
All of our code, as well as the trained multilingual model, is available open source at https://github.com/DigitalPhonetics/IMS-Toucan. An interactive demo (https://huggingface.co/spaces/Flux9665/IMS-Toucan) and a demo with pre-generated audios (https://multilingualtoucan.github.io/) are also available.
2 Related Work
2.1 Zero-Shot Multispeaker TTS
Zero-shot multispeaker TTS was first attempted in (Arik et al., 2018). The idea of using an external speaker encoder as a conditioning signal was further explored by (Jia et al., 2018). (Cooper et al., 2020) attempted to close the quality gap between seen and unseen speakers in zero-shot multispeaker TTS using more informative embeddings. With the use of attentive speaker embeddings for more general speaking style encoding (Wang et al., 2018; Choi et al., 2020), as well as different decoding approaches in the acoustic space, such as generative flows (Casanova et al., 2021), further attempts have been made at closing the quality gap between seen and unseen speakers. This is, however, still not a fully solved task. Furthermore, zero-shot multispeaker TTS requires a large amount of high quality data featuring many different speakers to cover a variety of voice properties.
2.2 Low-Resource TTS
In some languages, even single-speaker TTS is not feasible due to a severe lack of available high-quality training data. Attempts at enabling TTS for seen speakers in low-resource scenarios have been made by (Azizah et al., 2020; Xu et al., 2020; Chen et al., 2019) through the use of transfer learning from multilingual data, which comes with a set of problems due to the mismatch in the input space (i.e., different sets of phonemes) when using multiple languages. Training a model jointly on multiple languages to share knowledge across languages has been attempted by (He et al., 2021;
de Korte et al., 2020; Yang and He, 2020). One solution to the problem of sharing knowledge across different phoneme sets is the use of articulatory features, which has been proposed in (Staib et al., 2020; Wells et al., 2021; Lux and Vu, 2022).
2.3 Multilingual Multispeaker TTS
The task of multilingual (not even considering low-resource languages) zero-shot multispeaker TTS is mostly unexplored. YourTTS (Casanova et al., 2022) claims to be the first work on zero-shot speaker transfer across multiple languages and was developed concurrently to this work. At the time of writing, only a preprint is available, so our comparison to their model and methods may differ from a later version. YourTTS reports results similar to ours on high-resource languages using the VITS architecture (Kim et al., 2021) with a set of modifications to handle multilingual data. The authors find that their model does not perform as well with unseen voices in languages for which it has only seen single speaker training data. Due to its low-resource focused design, our approach does not exhibit this problem while being conceptually simpler. YourTTS shows that just one minute of data suffices to achieve very good results when adapting to a new speaker in a known language. This is consistent with our results; however, we go one step further and show that 5 minutes of data is enough to adapt not only to a new speaker, but also to a new language. Also consistent with their results, we see that the speaker embedding learns to attribute noisy training data to certain speakers, so not all speakers perform equally well. Ideally, we would want to also disentangle the noise modeling from the speakers and languages. The GST approach (Wang et al., 2018) has shown that disentangling noise from speakers is possible; it is, however, not trivial to also disentangle languages, since language properties are relevant not only to the decoder but also to the encoder.
Finally, combining the task of zero-shot multispeaker TTS with the task of low-resource TTS has, to the best of our knowledge, only been attempted once, in a very recent approach that was developed concurrently to ours (Azizah and Jatmiko, 2022). Their system uses a multi-stage transfer learning process that starts from a single speaker system, which is then expanded with a pretrained speaker encoder. They add the required components for speaker and language conditioning and apply finetuning to only those parts of the architecture. The main difference between our system and theirs is that we train the full architecture jointly on the high-resource source domain using the LAML pretraining procedure.

Figure 1: Overview of the encoder design. All of the projections project to the same dimensionality, which we chose to be 384. Round corners mean trainable. Conformer blocks include relative positional encoding.
3 Proposed Method
3.1 System Architecture
Due to its elegant solution to the one-to-many problem of speech synthesis, we choose FastSpeech 2 (Ren et al., 2020) as the basis for our method. There is, however, no reason why this procedure should not work in conjunction with any comparable architecture, making the approach mostly model agnostic. We use the Conformer architecture (Gulati et al., 2020) in both the encoder and the decoder. This is the same as the basic implementation in the IMS Toucan toolkit (Lux et al., 2021), which is in turn based on the ESPnet toolkit (Hayashi et al., 2020, 2021).
To handle the zero-shot multispeaker task, we condition the TTS system on an ensemble of pretrained speaker embedding functions, consisting of ECAPA-TDNN (Desplanques et al., 2020) and X-Vector (Snyder et al., 2018) models trained on VoxCeleb 1 and 2 (Nagrani et al., 2019, 2017; Chung et al., 2018) using the SpeechBrain toolkit (Ravanelli et al., 2021), as suggested in (Meyer et al., 2022). Consistent with (Jia et al., 2018), we find that the best ability to produce speech from voices unseen during training is achieved when injecting the speaker embeddings into the output of the encoder. First, we bottleneck the speaker embeddings and apply the SoftSign function, as suggested in (Gibiansky et al., 2017). Then we concatenate them to the encoder's hidden state and project them back to the size of the encoder's hidden state. At inference time, a speaker embedding extracted from a reference audio can be used to make the synthesis speak in the voice of the reference speaker. An important trick we found is to add layer normalization right after the embedding is injected into the hidden state. This does not affect the synthesis of speakers seen during training, but it helps with unseen speakers.
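To make this conditioning step concrete, the following is a minimal PyTorch sketch of the injection (bottleneck, SoftSign, concatenation, projection, layer normalization), not the exact toolkit implementation: the 384-dimensional hidden size follows Figure 1, while the ensemble embedding size (assuming a 192-dimensional ECAPA-TDNN vector concatenated with a 512-dimensional X-Vector) and the bottleneck size are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SpeakerConditioning(nn.Module):
    """Sketch: inject a pretrained speaker embedding into the encoder output."""

    def __init__(self, spk_emb_dim=704, hidden_dim=384, bottleneck_dim=64):
        super().__init__()
        # compress the ensemble speaker embedding and squash it with SoftSign,
        # following the suggestion from Deep Voice 2 (Gibiansky et al., 2017)
        self.bottleneck = nn.Sequential(
            nn.Linear(spk_emb_dim, bottleneck_dim),
            nn.Softsign(),
        )
        # project [encoder state ; speaker vector] back to the hidden size
        self.projection = nn.Linear(hidden_dim + bottleneck_dim, hidden_dim)
        # layer normalization right after the injection helps with unseen speakers
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, encoder_states, spk_emb):
        # encoder_states: (batch, time, hidden_dim), spk_emb: (batch, spk_emb_dim)
        speaker = self.bottleneck(spk_emb)
        speaker = speaker.unsqueeze(1).expand(-1, encoder_states.size(1), -1)
        conditioned = torch.cat([encoder_states, speaker], dim=-1)
        return self.norm(self.projection(conditioned))
```

At inference time, `spk_emb` would simply be the ensemble embedding extracted from a reference recording of the target voice.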
In order to disentangle the languages from the speakers, we add an embedding for the language of the current sample along the sequence axis to the phoneme embedding sequence at the start of the encoder. This fits well with the intuition of a TTS encoder dealing with the text and the decoder dealing with the speech, since the text processing should not rely on speaker information, as a text does not have an inherent speaker. So we infuse the language information at the text stage and the speaker information at the speech stage of the model's information flow. Since, unlike the number of possible voices, the number of languages in the world is finite, we simply use an embedding lookup table to get language embeddings, which receive their meaning purely through backpropagation during training. A text-based language embedding could allow for zero-shot language adaptation, which we plan to investigate in the future. An overview of the multilingual multispeaker encoder is shown in Figure 1.
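A minimal sketch of this language conditioning is given below. It assumes one plausible reading of the description above, namely that the looked-up language vector is broadcast over the sequence axis and summed onto every phoneme embedding; the table size and the 384-dimensional embedding size are illustrative, not taken from the implementation.

```python
import torch
import torch.nn as nn


class LanguageConditioning(nn.Module):
    """Sketch: add a learned language embedding to the phoneme embedding sequence."""

    def __init__(self, num_languages=12, hidden_dim=384):
        super().__init__()
        # plain lookup table; the entries receive their meaning purely
        # through backpropagation during multilingual training
        self.language_table = nn.Embedding(num_languages, hidden_dim)

    def forward(self, phoneme_embeddings, language_id):
        # phoneme_embeddings: (batch, time, hidden_dim), language_id: (batch,)
        lang = self.language_table(language_id).unsqueeze(1)  # (batch, 1, hidden)
        # broadcast along the sequence axis so every position is conditioned
        return phoneme_embeddings + lang
```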
To transform the spectrograms that the FastSpeech 2 based synthesis produces into a waveform, we make use of the HiFi-GAN architecture (Kong et al., 2020) as implemented in the IMS Toucan toolkit (Lux et al., 2021). As shown in (Liu et al., 2021), neural vocoders can perform super-resolution as well as spectrogram inversion. We apply the same trick to transform the 16 kHz spectrograms the synthesis produces into 48 kHz waveforms.
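As a rough sketch of the bookkeeping behind this trick (the hop length and upsampling factors below are assumptions, not values taken from the paper): a HiFi-GAN generator that inverts 16 kHz spectrograms directly into 48 kHz audio only needs its per-stage upsampling factors to multiply to three times the analysis hop length.

```python
# Sanity check for a vocoder doing spectrogram inversion and
# 16 kHz -> 48 kHz super-resolution in a single step.
# Hop length and upsample rates are illustrative assumptions.

HOP_LENGTH_16KHZ = 256            # samples per spectrogram frame at 16 kHz
SR_IN, SR_OUT = 16_000, 48_000    # analysis rate vs. output waveform rate

# each input frame must be expanded to this many output samples
samples_per_frame_out = HOP_LENGTH_16KHZ * SR_OUT // SR_IN   # 768

# hypothetical per-stage upsampling factors of the HiFi-GAN generator
upsample_rates = [8, 6, 4, 4]

product = 1
for rate in upsample_rates:
    product *= rate
assert product == samples_per_frame_out, "upsample rates must multiply to 768"
```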
3.2 Input Representation
To make the use of multilingual data with only partially overlapping phoneme sets easier, we represent the inputs to our system as articulatory feature vectors rather than identity-based vectors, as introduced in (Lux and Vu, 2022). On top of this, we add an additional mechanism to deal with the
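To illustrate the articulatory input representation, the following is a small sketch using the panphon library (the toolkit's actual featurization pipeline may differ): each IPA segment is mapped to a vector of articulatory features, so that sounds shared between languages end up with overlapping representations.

```python
import panphon

ft = panphon.FeatureTable()

# IPA for an example word; in a full pipeline the IPA would come from a
# grapheme-to-phoneme front end (e.g. espeak-ng) rather than being hard-coded
ipa = "haloː"

# one vector of articulatory features (+1 / 0 / -1 entries) per IPA segment,
# replacing a language-specific phoneme-identity lookup
vectors = ft.word_to_vector_list(ipa, numeric=True)
print(len(vectors), "segments with", len(vectors[0]), "features each")
```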