
as a basis for low-resource finetuning with greatly reduced data need, they can also be used without finetuning as strong multispeaker and multilingual models. We train a model on 12 languages simultaneously and show that it can transfer speaker identities across all languages, even those for which it has only seen a single speaker during training. All of our code, as well as the trained multilingual model, is available as open source.¹ An interactive demo² and a demo with pre-generated audio samples³ are also available.

¹ https://github.com/DigitalPhonetics/IMS-Toucan
² https://huggingface.co/spaces/Flux9665/IMS-Toucan
³ https://multilingualtoucan.github.io/
2 Related Work
2.1 Zero-Shot Multispeaker TTS
Zero-shot multispeaker TTS was first attempted by Arik et al. (2018). The idea of using an external speaker encoder as a conditioning signal was further explored by Jia et al. (2018). Cooper et al. (2020) attempted to close the quality gap between seen and unseen speakers in zero-shot multispeaker TTS using more informative embeddings. Further attempts at closing this gap have been made using attentive speaker embeddings for more general speaking-style encoding (Wang et al., 2018; Choi et al., 2020) as well as different decoding approaches in the acoustic space, such as generative flows (Casanova et al., 2021). This task is, however, still not fully solved. Furthermore, zero-shot multispeaker TTS requires a large amount of high-quality data featuring many different speakers to cover a variety of voice properties.
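The mechanism shared by these approaches is conditioning the acoustic model on a fixed-size embedding produced by an external speaker encoder. Below is a minimal PyTorch sketch of this idea under simple assumptions: the module name SpeakerConditionedEncoder, the dimensions, and the strategy of projecting the speaker embedding and adding it to every encoder state are illustrative, not taken from any of the cited systems.

```python
import torch
import torch.nn as nn


class SpeakerConditionedEncoder(nn.Module):
    """Toy sketch: condition phoneme encoder states on an external speaker embedding."""

    def __init__(self, phoneme_dim=62, hidden_dim=256, speaker_embedding_dim=192):
        super().__init__()
        # Stand-in for the real phoneme encoder (e.g. Transformer/Conformer blocks).
        self.phoneme_encoder = nn.Sequential(
            nn.Linear(phoneme_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Project the externally produced speaker embedding into the encoder's hidden space.
        self.speaker_projection = nn.Linear(speaker_embedding_dim, hidden_dim)

    def forward(self, phoneme_features, speaker_embedding):
        # phoneme_features: (batch, time, phoneme_dim)
        # speaker_embedding: (batch, speaker_embedding_dim), e.g. from a d-vector model
        encoded = self.phoneme_encoder(phoneme_features)
        speaker_bias = self.speaker_projection(speaker_embedding).unsqueeze(1)
        # Broadcast the projected speaker identity over all time steps.
        return encoded + speaker_bias


if __name__ == "__main__":
    model = SpeakerConditionedEncoder()
    phonemes = torch.randn(2, 50, 62)      # dummy phoneme feature vectors
    speakers = torch.randn(2, 192)         # dummy external speaker embeddings
    print(model(phonemes, speakers).shape)  # torch.Size([2, 50, 256])
```

Because the speaker representation is supplied externally rather than learned per speaker, the same conditioning pathway can, in principle, be driven by an embedding of a voice never seen during training, which is what makes the zero-shot setting possible.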
2.2 Low-Resource TTS
In some languages, even single-speaker TTS is not feasible due to the severe lack of available high-quality training data. Attempts at enabling TTS on seen speakers in low-resource scenarios have been made by Azizah et al. (2020), Xu et al. (2020), and Chen et al. (2019) through transfer learning from multilingual data, which comes with a set of problems due to the mismatch in the input space (i.e., different phoneme sets) when using multiple languages. Training a model jointly on multiple languages to share knowledge across them has been attempted by He et al. (2021),
de Korte et al. (2020), and Yang and He (2020). One solution to the problem of sharing knowledge across different phoneme sets is the use of articulatory features, as proposed by Staib et al. (2020), Wells et al. (2021), and Lux and Vu (2022).
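To illustrate why articulatory features help with mismatched phoneme inventories, the sketch below maps phonemes from two languages into a shared articulatory feature space. The tiny hand-written feature table and its feature choices are purely illustrative (full-coverage tables exist in phonology toolkits) and are not the feature set used in the cited works.

```python
# Minimal illustration: two languages with partly disjoint phoneme inventories
# still share the dimensions of a common articulatory feature space.
# Feature values are hand-picked for illustration only.

ARTICULATORY_FEATURES = {
    #        (voiced, nasal, labial/rounded, front, open)
    "p": (0, 0, 1, 0, 0),  # voiceless bilabial plosive
    "b": (1, 0, 1, 0, 0),  # voiced bilabial plosive
    "m": (1, 1, 1, 0, 0),  # bilabial nasal
    "a": (1, 0, 0, 0, 1),  # open front vowel
    "y": (1, 0, 1, 1, 0),  # close front rounded vowel (absent in English)
}


def phonemes_to_features(phonemes):
    """Map a phoneme sequence to articulatory feature vectors shared across languages."""
    return [ARTICULATORY_FEATURES[p] for p in phonemes]


# An English-like and a German-like sequence use the same input space, even though
# the phoneme "y" never occurs in English training data; a model trained on English
# still receives meaningful (partially overlapping) features for it.
print(phonemes_to_features(["m", "a", "p"]))
print(phonemes_to_features(["b", "y"]))
```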
2.3 Multilingual Multispeaker TTS
The task of multilingual (even without considering low-resource languages) zero-shot multispeaker TTS is mostly unexplored. YourTTS (Casanova et al., 2022) claims to be the first work on zero-shot speaker transfer across multiple languages and was developed concurrently with this work. At the time of writing, only a preprint is available, so our comparison to their model and methods may differ from a later version. YourTTS reports results similar to ours on high-resource languages using the VITS architecture (Kim et al., 2021) with a set of modifications to handle multilingual data. The authors find that their model does not perform as well with unseen voices in languages for which it has only seen single-speaker training data. Due to its low-resource-focused design, our approach does not exhibit this problem while being conceptually simpler. YourTTS is shown to achieve very good results with just one minute of data when adapting to a new speaker in a known language. This is consistent with our results; however, we go one step further and show that 5 minutes of data are enough to adapt not only to a new speaker, but also to a new language. Also consistent with their results, we observe that the speaker embedding learns to attribute noisy training data to certain speakers, so not all speakers perform equally well. Ideally, we would also want to disentangle noise modeling from speakers and languages. The GST approach (Wang et al., 2018) has shown that disentangling noise from speakers is possible; however, it is not trivial to also disentangle languages, since language properties are relevant not only to the decoder but also to the encoder.
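For context, the following is a minimal sketch of a GST-style layer in the spirit of Wang et al. (2018): a reference encoding attends over a bank of learned style tokens, yielding a style embedding that can absorb factors such as recording conditions. The module and dimension names are illustrative assumptions, not taken from any of the cited implementations.

```python
import torch
import torch.nn as nn


class GlobalStyleTokens(nn.Module):
    """Toy GST-style layer: attention over a bank of learned style tokens."""

    def __init__(self, num_tokens=10, token_dim=256, reference_dim=128):
        super().__init__()
        # Learnable style tokens; individual tokens can come to represent
        # factors such as noise or recording conditions.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_projection = nn.Linear(reference_dim, token_dim)

    def forward(self, reference_encoding):
        # reference_encoding: (batch, reference_dim), e.g. a pooled reference-mel encoding
        query = self.query_projection(reference_encoding)            # (batch, token_dim)
        scores = query @ self.tokens.T / self.tokens.shape[1] ** 0.5  # (batch, num_tokens)
        weights = torch.softmax(scores, dim=-1)
        # The style embedding is a weighted sum of the learned tokens.
        return weights @ self.tokens


if __name__ == "__main__":
    gst = GlobalStyleTokens()
    reference = torch.randn(4, 128)
    print(gst(reference).shape)  # torch.Size([4, 256])
```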
Finally, combining the task of zero-shot multispeaker TTS with the task of low-resource TTS has, to the best of our knowledge, only been attempted once, in a very recent approach that was developed concurrently with ours (Azizah and Jatmiko, 2022). Their system uses a multi-stage transfer learning process that starts from a single-speaker system, which is then expanded with a pre-trained speaker encoder. They add the required components for speaker and language conditioning