
STREAMING VOICE CONVERSION VIA INTERMEDIATE BOTTLENECK FEATURES AND
NON-STREAMING TEACHER GUIDANCE
Yuanzhe Chen∗, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang
Qiao Tian, Yuping Wang, Yuxuan Wang
Speech, Audio & Music Intelligence (SAMI), ByteDance
ABSTRACT
Streaming voice conversion (VC) is the task of converting
the voice of one person to another in real time. Previous
streaming VC methods use phonetic posteriorgrams (PPGs)
extracted from automatic speech recognition (ASR) systems
to represent speaker-independent information. However,
PPGs lack the prosody and vocalization information of the
source speaker, and streaming PPGs contain undesired leaked
timbre of the source speaker. In this paper, we propose to
use intermediate bottleneck features (IBFs) to replace PPGs.
VC systems trained with IBFs retain more prosody and vo-
calization information of the source speaker. Furthermore,
we propose a non-streaming teacher guidance (TG) frame-
work that addresses the timbre leakage problem. Experiments
show that our proposed IBFs and the TG framework achieve
a state-of-the-art streaming VC naturalness of 3.85, a content
consistency of 3.77, and a timbre similarity of 3.77 under a
future receptive field of 160 ms which significantly outper-
form previous streaming VC systems.
Index Terms—voice conversion (VC), streaming VC, in-
termediate bottleneck features (IBFs), teacher guidance (TG).
1. INTRODUCTION
Voice conversion (VC) is the task of converting the voice of
one person to another while keeping the content consistent.
VC is in strong demand in audio editing and voice interac-
tion scenarios. In the past, much research focused on
non-streaming VC, such as AutoVC [1]. Non-streaming VC
methods require users to input a complete utterance be-
fore returning the result. In contrast, streaming VC
continuously converts the voice of one person to another
in real time. In recent years, streaming VC has found many
applications, such as livestreaming, avatars, and
real-time communication (RTC).
Due to the difficulty of acquiring parallel corpora con-
taining paired utterances between different speakers, current
VC methods are mostly based on non-parallel corpora, such
as auto-encoder based VC [1, 2], phonetic posteriorgrams
(PPGs) based VC [3, 4], and self-supervised representation
based VC [5]. These methods use an encoder to extract
speaker-independent representations, such as the content of
the input speech, and a decoder conditioned on the input
speaker ID to reconstruct the input audio. Recent streaming
VC methods include using a streaming automatic speech
recognition (ASR) encoder to extract PPGs [6, 7] and
designing causal model structures for VC [8, 9, 10].
∗ chenyuanzhe@bytedance.com
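This encoder-decoder recipe can be sketched in a few lines. The code below is a toy illustration, not the architecture of any cited system: the dimensions, the linear/tanh layers, and the speaker-embedding table are all hypothetical stand-ins, chosen only to show how content features extracted from source speech are combined with a speaker-ID condition before reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, for illustration only.
N_FRAMES, N_MELS, D_CONTENT, D_SPK, N_SPEAKERS = 100, 80, 256, 64, 10

def content_encoder(mel, w_enc):
    """Map source mel frames to speaker-independent content features
    (the role played by PPGs or other bottleneck representations)."""
    return np.tanh(mel @ w_enc)                    # (N_FRAMES, D_CONTENT)

def decoder(content, spk_emb, w_dec):
    """Reconstruct mel frames from content features conditioned on a
    speaker embedding, which is broadcast to every frame."""
    cond = np.concatenate(
        [content, np.tile(spk_emb, (content.shape[0], 1))], axis=1)
    return cond @ w_dec                            # (N_FRAMES, N_MELS)

# Randomly initialised toy parameters.
w_enc = rng.normal(size=(N_MELS, D_CONTENT)) * 0.1
w_dec = rng.normal(size=(D_CONTENT + D_SPK, N_MELS)) * 0.1
spk_table = rng.normal(size=(N_SPEAKERS, D_SPK))   # speaker ID -> embedding

mel = rng.normal(size=(N_FRAMES, N_MELS))          # stand-in for source mels
content = content_encoder(mel, w_enc)
recon = decoder(content, spk_table[3], w_dec)      # condition on speaker ID 3

# Training typically minimises a self-reconstruction loss such as L1:
loss = np.abs(recon - mel).mean()
```

At inference time, conversion is obtained by swapping the speaker embedding (e.g. `spk_table[7]` instead of `spk_table[3]`) while keeping the content features fixed.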
An ideal representation extracted from the encoder
should contain the complete content and prosody, but not
the timbre, of the source speaker. Current representations
fall short in three ways. First, the receptive fields of
streaming ASR systems are limited to the past, so their
recognition accuracy is worse than that of non-streaming
ASR systems [11]. Second, PPGs only represent a finite set
of phoneme classes and lack paralinguistic information
such as prosody and vocalization [12]. Third, because the
streaming ASR encoder has a limited receptive field, its
output features contain the undesired timbre of the source
speaker, which is referred to as timbre leakage. At inference
time, timbre leakage results in VC output containing the
timbres of both the source and target speakers.
To address these problems, we first propose to use inter-
mediate bottleneck features (IBFs) to replace PPGs as source
speaker representations, improving the content robustness of
the VC system. IBFs contain more fine-grained pronunci-
ation and paralinguistic information than PPGs, and sig-
nificantly improve pronunciation accuracy, vocalization,
and prosody retention in both non-streaming and streaming
VC systems. Second, we propose a non-streaming teacher
guidance (TG) framework, which uses a pre-trained
non-streaming VC model as a teacher to generate parallel
training data across different source and target speakers.
A student model then learns from these parallel data in-
stead of using a self-reconstruction loss. We show that the
TG framework forces the streaming VC system to ignore
the timbre of the source speaker, which significantly
mitigates the timbre leakage problem.
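The data flow of the TG framework can be illustrated with a minimal sketch. Everything below is a hypothetical placeholder: the teacher and student are reduced to simple array maps, since this chunk does not specify their architectures. What the sketch shows is only the training-target switch: the student regresses the teacher's target-speaker output rather than reconstructing the source, so reproducing the source timbre is never rewarded.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_MELS = 100, 80

src_mel = rng.normal(size=(N_FRAMES, N_MELS))  # source-speaker utterance
tgt_spk = rng.normal(size=(N_MELS,)) * 0.1     # target-speaker embedding

def teacher_vc(mel, spk):
    """Placeholder for the pre-trained non-streaming teacher: it sees the
    whole utterance and emits frames in the target speaker's timbre."""
    return np.tanh(mel) + spk

def student_vc(mel, spk, w):
    """Placeholder for the streaming student: here a causal per-frame map,
    so each output frame depends only on the current input frame."""
    return mel @ w + spk

# Step 1: the teacher generates pseudo-parallel data offline, i.e.
# (source-speaker input, same-content target-speaker output) pairs.
parallel_target = teacher_vc(src_mel, tgt_spk)

# Step 2: the student is trained to match the teacher's output instead of
# reconstructing the source input.
w = np.eye(N_MELS)                             # toy student parameter
student_out = student_vc(src_mel, tgt_spk, w)
tg_loss = np.abs(student_out - parallel_target).mean()

# Contrast: the self-reconstruction loss used without TG, whose target
# still carries the source speaker's timbre.
recon_loss = np.abs(student_out - src_mel).mean()
```

Because `parallel_target` carries only the target speaker's timbre, gradients through `tg_loss` push the student away from copying source timbre, which is the mechanism behind the leakage mitigation described above.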
The rest of this paper is organized as follows. Section 2
introduces our proposed IBFs based VC system. Section 3
introduces our proposed non-streaming TG training strategy.
Section 4 shows experiments. Section 5 concludes this paper.
arXiv:2210.15158v1 [eess.AS] 27 Oct 2022