Adversarial Speaker-Consistency Learning Using
Untranscribed Speech Data for Zero-Shot
Multi-Speaker Text-to-Speech
Byoung Jin Choi, Myeonghun Jeong, Minchan Kim, Sung Hwan Mun, Nam Soo Kim
Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, Korea
E-mail: {bjchoi, mhjeong, mckim, shmun}@hi.snu.ac.kr, nkim@snu.ac.kr Tel/Fax: +82-02-884-1824
Abstract—Several recently proposed text-to-speech (TTS) models have achieved human-level quality in single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice from a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), remains a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift problem that arises when generating speech for a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). The proposed method first generates an additional speech sample for a query speaker drawn from external untranscribed datasets at each training iteration. The model then learns to consistently generate speech of the same speaker as the corresponding speaker embedding vector by employing an adversarial learning scheme. The experimental results show that the proposed method outperforms the baseline in terms of quality and speaker similarity in ZSM-TTS.
I. INTRODUCTION
The performance of neural text-to-speech (TTS) models has improved dramatically in terms of speech quality in recent years. While most innovations in the TTS field have centered on enhancing the quality of single-speaker and multi-speaker models trained with a pre-defined speaker set, deployment in field applications requires TTS systems with various capabilities. One popular demand is instant speaker adaptation to build a personalized TTS system. Personalized TTS aims to analyze and control the underlying speech factors to imitate the user's voice characteristics. Nonetheless, these factors are not rigorously defined in a scientific manner and are generally known to be entangled, which makes it difficult to control each component.

In personalized TTS, the main objective is to adapt to a new speaker's voice characteristics with limited available data. To meet this demand, zero-shot multi-speaker TTS (ZSM-TTS), a sub-branch of research under the umbrella of speaker adaptation, has recently gained enormous attention from researchers. ZSM-TTS seeks to train a multi-speaker TTS model which, given a reference utterance, can generate a speech sample in the voice of a new speaker not present in the training dataset, without further finetuning the model.
Some previous works have tackled ZSM-TTS by attaching a speaker encoder, pre-trained for speaker verification, to existing TTS models [3], [4], [5], [6]. Meanwhile, an effective style modeling method was proposed by [7], where a bank of style vectors and their weights are learned in an unsupervised manner. On the other hand, [8] exploits a meta-learning approach by utilizing an episodic training scheme with phoneme and style discriminators. Although these approaches focus on improving speaker embedding extraction, the conditioning scheme, and the training method, they disregard the fact that the number of speakers in current TTS training datasets is far smaller than the total population, which is insufficient for learning the entire speaker space. This directly results in poor generalization to unseen speakers at inference, leading to unsatisfactory performance.
The main challenge of ZSM-TTS is the speaker domain shift problem, which arises when a speaker outside of the training dataset must be inferred properly. It is commonly known that such models exhibit a strong bias towards the speakers in the training dataset. To overcome this challenge, we propose to train a TTS model with adversarial speaker-consistency learning (ASCL). The ASCL scheme generates an additional speech sample for a query speaker obtained from external untranscribed audio datasets, which are readily available from various sources. The generated sample is then used for adversarial training against a newly proposed speaker-consistency discriminator, as sketched below. Training in this way exposes the model to a larger speaker pool than the limited training dataset, hence inducing better speaker generalization for ZSM-TTS.
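A minimal PyTorch-style sketch of one ASCL training step is given below. The module names (tts, spk_encoder, sc_disc) and the least-squares GAN losses are our own illustrative assumptions, not the paper's exact implementation; the point is that the discriminator judges whether a speech sample is consistent with its conditioning speaker embedding.

```python
# Sketch of one ASCL training step (hypothetical module names).
import torch
import torch.nn.functional as F

def ascl_step(tts, spk_encoder, sc_disc, text, wav_real, wav_untranscribed,
              opt_g, opt_d):
    # Query speaker embedding from an external untranscribed utterance;
    # no transcript is needed, since only the speaker identity is used.
    with torch.no_grad():
        e_query = spk_encoder(wav_untranscribed)
        e_real = spk_encoder(wav_real)

    # Generate an additional speech sample for the query speaker.
    wav_fake = tts(text, e_query)

    # -- Discriminator: does the speech match the conditioning embedding? --
    d_real = sc_disc(wav_real, e_real)            # matched real pair -> 1
    d_fake = sc_disc(wav_fake.detach(), e_query)  # generated pair    -> 0
    loss_d = F.mse_loss(d_real, torch.ones_like(d_real)) \
           + F.mse_loss(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # -- Generator: fool the discriminator, i.e. keep generated speech
    #    consistent with the query speaker embedding. --
    d_gen = sc_disc(wav_fake, e_query)
    loss_g = F.mse_loss(d_gen, torch.ones_like(d_gen))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```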
The proposed method directly addresses the aforementioned speaker domain shift problem by expanding the speaker pool for ZSM-TTS. The ASCL scheme is built on the variational inference TTS (VITS) architecture [9] and the inverse transformation capability of its normalizing flow module, illustrated below.
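To make this invertibility concrete, the following is a generic affine coupling layer, a common normalizing-flow building block. It is a sketch under our own assumptions, not the exact flow used in VITS: the forward pass maps data to the latent space for likelihood training, and the exact inverse maps latent samples back for generation.

```python
# Generic affine coupling layer: exactly invertible by construction.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels, hidden=256):
        super().__init__()
        assert channels % 2 == 0
        # Conditioner predicts a log-scale and shift for one half of the
        # input from the other, untouched half.
        self.net = nn.Sequential(
            nn.Linear(channels // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        yb = xb * torch.exp(log_s) + t
        # Return transformed tensor and log|det J| for the flow likelihood.
        return torch.cat([xa, yb], dim=-1), log_s.sum(dim=-1)

    def inverse(self, y):
        # Exact inverse: the untouched half reproduces the same log_s, t.
        ya, yb = y.chunk(2, dim=-1)
        log_s, t = self.net(ya).chunk(2, dim=-1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=-1)
```

Because the inverse is exact rather than learned, the same module can be run in both directions, which is what a VITS-style model relies on when synthesizing speech conditioned on a new speaker embedding.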
We demonstrate the effectiveness of ASCL by comparing it with the baseline using subjective and objective scores. Our results show that the proposed method outperforms the baseline in terms of speech quality and speaker similarity.
Our contributions are twofold:
1) We propose adversarial speaker-consistency learning
(ASCL), a novel way to train a TTS model to address
the speaker domain shift problem.