
MID-ATTRIBUTE SPEAKER GENERATION USING OPTIMAL-TRANSPORT-BASED
INTERPOLATION OF GAUSSIAN MIXTURE MODELS
Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari
The University of Tokyo, Japan.
ABSTRACT
In this paper, we propose a method for intermediating multiple
speakers’ attributes and diversifying their voice characteristics in
“speaker generation,” an emerging task that aims to synthesize
a nonexistent speaker’s natural-sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) conditioned on speaker attributes. Although this method
enables the sampling of various speakers from the speaker-attribute-
aware GMMs, it is not yet clear whether the learned distributions can
represent speakers with an intermediate attribute (i.e., mid-attribute).
To this end, we propose an optimal-transport-based method that in-
terpolates the learned GMMs to generate nonexistent speakers with
mid-attribute (e.g., gender-neutral) voices. We empirically validate
our method and evaluate the naturalness of synthetic speech and
the controllability of two speaker attributes: gender and language
fluency. The evaluation results show that our method can control the
generated speakers’ attributes by a continuous scalar value without
statistically significant degradation of speech naturalness.
Index Terms— speech synthesis, cross-lingual speech synthesis, multi-speaker speech synthesis, speaker generation
1. INTRODUCTION
Despite the improved quality of synthetic speech through deep neu-
ral network (DNN)-based text-to-speech (TTS) [1]–[3], diversifying
speakers’ voices remains challenging. One approach to increase
speaker diversity is multi-speaker TTS [4], in which a single TTS
model reproduces the voice characteristics of speakers included in
a multi-speaker corpus. However, the training requires sufficient
speech data for each speaker to achieve high-quality TTS. Although
few-shot speaker adaptation [5]–[8] and zero-shot speaker encod-
ing [6], [9]–[12] can reproduce a target speaker’s voice using only
a few utterances of the speaker, they still need an existent speaker’s
speech data. Some work has attempted to generate nonexistent
speakers from a trained multi-speaker TTS model to deal with the
difficulty in collecting speech data of existent speakers [6], [13],
[14]. Recently, Stanton et al. [15] defined this task as “speaker generation,” whose purpose is to synthesize nonexistent speakers’ natural-sounding voices for practical applications such as audiobook readers and video production.
Stanton et al. [15] proposed TacoSpawn as a method for the speaker generation task. TacoSpawn jointly learns two DNNs: a
multi-speaker TTS model and an encoder that defines the parametric
distributions of speaker embeddings. The former generates a target speaker’s mel-spectrogram from the input text and the speaker embedding. The latter learns the distributions of speaker embeddings as Gaussian mixture models (GMMs) for each “speaker attribute” (or “speaker metadata” [15]) representing the attributes (e.g., gender) of a specific speaker. The combination of these two models achieves TTS of not only existent speakers’ voices but also nonexistent ones’ by sampling new embeddings from the speaker-attribute-aware GMM.
The parametric distributions of speaker embeddings learned by the conventional TacoSpawn method can potentially be used to synthesize more diverse speakers’ voices. For example, we can transform or interpolate the speaker embedding distributions to define a new distribution for speaker generation. In other words, at present, TacoSpawn only handles speakers with categorical attributes; however, by combining the distributions of individual attributes, it becomes possible to handle speakers with non-categorical attributes. Such a “mid-attribute speaker generation” method would extend the application range of TTS technologies, e.g., creating gender-neutral voices for communication that reduces gender bias, or language-fluency-controllable TTS for computer-assisted language learning [16].
In this paper, we propose a method for intermediating multiple
speaker attributes by means of optimal-transport-based interpolation
of GMMs. Our method first computes the weighted barycenter of a
set of GMMs [17], in which each GMM corresponds to one categor-
ical speaker attribute. Then, it defines a new distribution using the
weighted barycenter for sampling nonexistent speakers with a mid-
attribute controlled by interpolation weights. One can define such a
mid-attribute GMM by estimating its parameters from the interpola-
tion weights representing intermediate speaker attributes. However,
this simple method is not guaranteed to find the best path for interpolating the learned GMMs, since the ordering of mixture components in the GMMs is arbitrary. In contrast, optimal transport theory
supports the smooth interpolation of multiple distinct distributions
and fits our focus on the interpolation of multiple speaker-attribute-
aware GMMs. We empirically validate our method by evaluating the
naturalness of synthetic speech and the controllability of two speaker
attributes: gender and language fluency. The evaluation results show
that our method can control the generated speakers’ attributes by
a continuous scalar value without statistically significant degrada-
tion of speech naturalness. The speech samples are available on our
project page1.
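To make the core operation concrete, the following is a minimal sketch of the two-GMM case of this optimal-transport-based interpolation, assuming diagonal-covariance GMMs and the POT (Python Optimal Transport) library; the function and variable names are hypothetical, and this is an illustration in the spirit of the barycenter computation in [17], not our exact implementation. It pairs the mixture components of two attribute-specific GMMs by solving a discrete optimal transport problem whose cost is the squared 2-Wasserstein distance between Gaussian components, then interpolates each matched pair along its Wasserstein geodesic.

    import numpy as np
    import ot  # POT: Python Optimal Transport

    def interpolate_gmms(w0, mu0, s0, w1, mu1, s1, t):
        """Interpolate two diagonal-covariance GMMs at weight t in [0, 1].
        w*: (K,) mixture weights (assumed normalized); mu*, s*: (K, D)
        per-component means and standard deviations. Returns the weights,
        means, and stds of the mid-attribute GMM."""
        K0, K1 = len(w0), len(w1)
        # Squared 2-Wasserstein distance between each pair of Gaussian
        # components; for diagonal covariances this reduces to
        # ||mu_i - mu_j||^2 + ||s_i - s_j||^2.
        C = np.zeros((K0, K1))
        for i in range(K0):
            for j in range(K1):
                C[i, j] = np.sum((mu0[i] - mu1[j]) ** 2) + np.sum((s0[i] - s1[j]) ** 2)
        # Optimal coupling between the two sets of mixture weights.
        T = ot.emd(w0, w1, C)
        # Every non-negligible coupling entry becomes one component of the
        # interpolated GMM, moved along the geodesic between its paired Gaussians.
        w, mu, s = [], [], []
        for i in range(K0):
            for j in range(K1):
                if T[i, j] > 1e-12:
                    w.append(T[i, j])
                    mu.append((1 - t) * mu0[i] + t * mu1[j])
                    s.append((1 - t) * s0[i] + t * s1[j])
        return np.array(w), np.array(mu), np.array(s)

Note that the interpolated GMM can have up to K0 × K1 components, one per nonzero entry of the optimal coupling; matching components through the coupling rather than by index is what avoids the arbitrary component-ordering problem of naive parameter-wise interpolation.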
2. RELATED WORK
2.1. Speaker embedding prior of TacoSpawn
TacoSpawn [15] learns the speaker embedding distribution as a
speaker-attribute-aware GMM and uses the learned distribution to
generate a nonexistent speaker’s voice. Let D and K be the dimensionality of the speaker embeddings and the number of mixture components, respectively.
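As a concrete illustration of this sampling step, the following is a minimal sketch assuming the learned GMM has diagonal covariances; the function and parameter names are hypothetical, not TacoSpawn’s actual interface. It draws one D-dimensional speaker embedding by first picking a mixture component and then sampling from the corresponding Gaussian.

    import numpy as np

    def sample_speaker_embedding(weights, means, stds, rng=None):
        """weights: (K,) mixture weights; means, stds: (K, D) per-component
        Gaussian parameters of an attribute-conditioned speaker-embedding GMM."""
        rng = rng or np.random.default_rng()
        k = rng.choice(len(weights), p=weights)  # pick a mixture component
        return rng.normal(means[k], stds[k])     # sample from N(mu_k, diag(std_k^2))

Feeding such a sampled embedding to the multi-speaker TTS model yields speech in a nonexistent speaker’s voice.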
1 https://sarulab-speech.github.io/demo_mid-attribute-speaker-generation

This work is supported by JSPS KAKENHI 21H04900 (practical experiment) and Moonshot R&D Grant Number JPMJPS2011 (algorithm development). We also appreciate Takaaki Saeki and Yuta Matsunaga of the University of Tokyo for their help.