
MID-ATTRIBUTE SPEAKER GENERATION USING OPTIMAL-TRANSPORT-BASED
INTERPOLATION OF GAUSSIAN MIXTURE MODELS
Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari
The University of Tokyo, Japan.
ABSTRACT
In this paper, we propose a method for intermediating multiple
speakers’ attributes and diversifying their voice characteristics in
“speaker generation,” an emerging task that aims to synthesize
a nonexistent speaker’s natural-sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) conditioned on speaker attributes. Although this method
enables the sampling of various speakers from the speaker-attribute-
aware GMMs, it is not yet clear whether the learned distributions can
represent speakers with an intermediate attribute (i.e., mid-attribute).
To this end, we propose an optimal-transport-based method that in-
terpolates the learned GMMs to generate nonexistent speakers with
mid-attribute (e.g., gender-neutral) voices. We empirically validate
our method and evaluate the naturalness of synthetic speech and
the controllability of two speaker attributes: gender and language
fluency. The evaluation results show that our method can control the
generated speakers’ attributes by a continuous scalar value without
statistically significant degradation of speech naturalness.
Index Terms— speech synthesis, cross-lingual speech synthesis, multi-speaker speech synthesis, speaker generation
1. INTRODUCTION
Despite the improved quality of synthetic speech through deep neu-
ral network (DNN)-based text-to-speech (TTS) [1]–[3], diversifying
speakers’ voices remains challenging. One approach to increase
speaker diversity is multi-speaker TTS [4], in which a single TTS
model reproduces the voice characteristics of speakers included in
a multi-speaker corpus. However, the training requires sufficient
speech data for each speaker to achieve high-quality TTS. Although
few-shot speaker adaptation [5]–[8] and zero-shot speaker encod-
ing [6], [9]–[12] can reproduce a target speaker’s voice using only
a few utterances of the speaker, they still need an existent speaker’s
speech data. Some work has attempted to generate nonexistent
speakers from a trained multi-speaker TTS model to deal with the
difficulty in collecting speech data of existent speakers [6], [13],
[14]. Recently, Stanton et al. [15] defined this task as “speaker generation,” whose purpose is to synthesize nonexistent speakers’ natural-sounding voices for practical applications such as audiobook readers and video production.
Stanton et al. [15] proposed TacoSpawn as a method for the speaker generation task. TacoSpawn jointly learns two DNNs: a
multi-speaker TTS model and an encoder that defines the parametric
distributions of speaker embeddings. The former generates a target speaker’s mel-spectrogram from the input text and the speaker embedding. The latter learns the distributions of speaker embeddings as Gaussian mixture models (GMMs) for each “speaker attribute” (or “speaker metadata” [15]) representing the attributes (e.g., gender) of a specific speaker. The combination of these two models achieves TTS of not only existent speakers’ voices but also nonexistent ones’ by sampling new embeddings from the speaker-attribute-aware GMM.
The parametric distributions of speaker embeddings learned by the conventional TacoSpawn method can potentially be used to synthesize more diverse speakers’ voices. For example, we can transform or interpolate the speaker embedding distributions to define a new distribution for speaker generation. In other words, at present, TacoSpawn only handles speakers with categorical attributes; however, by combining the distributions of individual attributes, it becomes possible to handle speakers with non-categorical attributes. Such a “mid-attribute speaker generation” method would extend the application range of TTS technologies, e.g., creating gender-neutral voices for communication that reduces gender bias, or language-fluency-controllable TTS for computer-assisted language learning [16].
In this paper, we propose a method for intermediating multiple
speaker attributes by means of optimal-transport-based interpolation
of GMMs. Our method first computes the weighted barycenter of a
set of GMMs [17], in which each GMM corresponds to one categor-
ical speaker attribute. Then, it defines a new distribution using the
weighted barycenter for sampling nonexistent speakers with a mid-
attribute controlled by interpolation weights. One can define such a
mid-attribute GMM by estimating its parameters from the interpola-
tion weights representing intermediate speaker attributes. However,
this simple method is not guaranteed to find the best path for interpolating the learned GMMs, since the ordering of mixture components in the GMMs is arbitrary. In contrast, optimal transport theory
supports the smooth interpolation of multiple distinct distributions
and fits our focus on the interpolation of multiple speaker-attribute-
aware GMMs. We empirically validate our method by evaluating the
naturalness of synthetic speech and the controllability of two speaker
attributes: gender and language fluency. The evaluation results show
that our method can control the generated speakers’ attributes by
a continuous scalar value without statistically significant degrada-
tion of speech naturalness. The speech samples are available on our
project page1.
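To make the core operation concrete, the following is a minimal sketch of the two-GMM case of this optimal-transport-based interpolation, assuming diagonal-covariance GMMs and the POT (Python Optimal Transport) library; the function and variable names are hypothetical, and this is an illustration in the spirit of the barycenter computation in [17], not our exact implementation. It pairs the mixture components of two attribute-specific GMMs by solving a discrete optimal transport problem whose cost is the squared 2-Wasserstein distance between Gaussian components, then interpolates each matched pair along its Wasserstein geodesic.

    import numpy as np
    import ot  # POT: Python Optimal Transport

    def interpolate_gmms(w0, mu0, s0, w1, mu1, s1, t):
        """Interpolate two diagonal-covariance GMMs at weight t in [0, 1].
        w*: (K,) mixture weights (assumed normalized); mu*, s*: (K, D)
        per-component means and standard deviations. Returns the weights,
        means, and stds of the mid-attribute GMM."""
        K0, K1 = len(w0), len(w1)
        # Squared 2-Wasserstein distance between each pair of Gaussian
        # components; for diagonal covariances this reduces to
        # ||mu_i - mu_j||^2 + ||s_i - s_j||^2.
        C = np.zeros((K0, K1))
        for i in range(K0):
            for j in range(K1):
                C[i, j] = np.sum((mu0[i] - mu1[j]) ** 2) + np.sum((s0[i] - s1[j]) ** 2)
        # Optimal coupling between the two sets of mixture weights.
        T = ot.emd(w0, w1, C)
        # Every non-negligible coupling entry becomes one component of the
        # interpolated GMM, moved along the geodesic between its paired Gaussians.
        w, mu, s = [], [], []
        for i in range(K0):
            for j in range(K1):
                if T[i, j] > 1e-12:
                    w.append(T[i, j])
                    mu.append((1 - t) * mu0[i] + t * mu1[j])
                    s.append((1 - t) * s0[i] + t * s1[j])
        return np.array(w), np.array(mu), np.array(s)

Note that the interpolated GMM can have up to K0 × K1 components, one per nonzero entry of the optimal coupling; matching components through the coupling rather than by index is what avoids the arbitrary component-ordering problem of naive parameter-wise interpolation.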
2. RELATED WORK
2.1. Speaker embedding prior of TacoSpawn
TacoSpawn [15] learns the speaker embedding distribution as a
speaker-attribute-aware GMM and uses the learned distribution to
generate a nonexistent speaker’s voice. Let D and K be the dimensionality of the speaker embeddings and the number of mixture components, respectively.
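As a concrete illustration of this sampling step, the following is a minimal sketch assuming the learned GMM has diagonal covariances; the function and parameter names are hypothetical, not TacoSpawn’s actual interface. It draws one D-dimensional speaker embedding by first picking a mixture component and then sampling from the corresponding Gaussian.

    import numpy as np

    def sample_speaker_embedding(weights, means, stds, rng=None):
        """weights: (K,) mixture weights; means, stds: (K, D) per-component
        Gaussian parameters of an attribute-conditioned speaker-embedding GMM."""
        rng = rng or np.random.default_rng()
        k = rng.choice(len(weights), p=weights)  # pick a mixture component
        return rng.normal(means[k], stds[k])     # sample from N(mu_k, diag(std_k^2))

Feeding such a sampled embedding to the multi-speaker TTS model yields speech in a nonexistent speaker’s voice.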
1 https://sarulab-speech.github.io/demo_mid-attribute-speaker-generation

This work is supported by JSPS KAKENHI 21H04900 (practical experiment) and Moonshot R&D Grant Number JPMJPS2011 (algorithm development). We also appreciate Takaaki Saeki and Yuta Matsunaga of the University of Tokyo for their help.