STREAMING VOICE CONVERSION VIA INTERMEDIATE BOTTLENECK FEATURES AND
NON-STREAMING TEACHER GUIDANCE
Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang
Qiao Tian, Yuping Wang, Yuxuan Wang
Speech, Audio & Music Intelligence (SAMI), ByteDance
ABSTRACT
Streaming voice conversion (VC) is the task of converting
the voice of one person to another in real-time. Previous
streaming VC methods use phonetic posteriorgrams (PPGs)
extracted from automatic speech recognition (ASR) systems
to represent speaker-independent information. However,
PPGs lack the prosody and vocalization information of the
source speaker, and streaming PPGs contain undesired leaked
timbre of the source speaker. In this paper, we propose to
use intermediate bottleneck features (IBFs) to replace PPGs.
VC systems trained with IBFs retain more prosody and vo-
calization information of the source speaker. Furthermore,
we propose a non-streaming teacher guidance (TG) frame-
work that addresses the timbre leakage problem. Experiments
show that our proposed IBFs and the TG framework achieve
a state-of-the-art streaming VC naturalness of 3.85, a content
consistency of 3.77, and a timbre similarity of 3.77 under a
future receptive field of 160 ms, significantly outperforming
previous streaming VC systems.
Index Terms: voice conversion (VC), streaming VC,
intermediate bottleneck features (IBFs), teacher guidance (TG).
1. INTRODUCTION
Voice conversion (VC) is the task of converting the voice of
one person to another while keeping the content consistent.
There is strong demand for VC in audio editing and voice
interaction scenarios. In the past, much research focused on
non-streaming VC, such as AutoVC [1]. Non-streaming VC
methods require users to provide a complete utterance before
returning the result. In contrast, streaming VC continuously
converts the voice of one person to another in real time. In
recent years, streaming VC has found many applications, such
as livestreaming, avatars, and real-time communication (RTC).
Because parallel corpora containing paired utterances from
different speakers are difficult to acquire, current VC
methods are mostly based on non-parallel corpora. Examples
include auto-encoder based VC [1, 2], phonetic posteriorgrams
(PPGs) based VC [3, 4], and self-supervised representation
based VC [5]. These methods use an encoder to extract
speaker-independent representations, such as the content of
the input speech, and use a decoder conditioned on the input
speaker ID to reconstruct the input audio.
Recent streaming VC methods include using a streaming
automatic speech recognition (ASR) encoder to extract PPGs
[6, 7] and designing causal model structures for VC [8, 9, 10].
* chenyuanzhe@bytedance.com
An ideal representation extracted from the encoder
should contain the complete content and prosody but not
the timbre of the source speaker. However, PPGs extracted
by streaming ASR encoders fall short of this in three ways.
First, the receptive fields of streaming ASR systems are
limited to the past, so the recognition accuracy of streaming
ASR systems is worse than that of non-streaming ASR systems
[11]. Second, PPGs only represent a finite set of phoneme
classes and lack paralinguistic information such as prosody
and vocalization [12]. Third, because the streaming ASR
encoder has a limited receptive field, its output contains
the undesired timbre of the source speaker, which is referred
to as timbre leakage. At inference time, timbre leakage
results in the VC output containing the timbres of both the
source and target speakers.
To address these problems, we first propose to use
intermediate bottleneck features (IBFs) to replace PPGs as
source speaker representations to improve the content
robustness of the VC system. IBFs contain more fine-grained
pronunciation and paralinguistic information than PPGs. IBFs
can significantly improve pronunciation accuracy,
vocalization, and prosody retention in both non-streaming VC
and streaming VC systems. Second, we propose a non-streaming
teacher guidance (TG) framework, which uses a pre-trained
non-streaming VC model as a teacher to generate parallel
training data for different source and target speakers.
We then train a student model on these parallel data
instead of using a self-reconstruction loss. Our analysis
indicates that the TG framework forces the streaming VC
system to ignore the timbre of the source speaker, which
significantly mitigates the timbre leakage problem.
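The data flow of the TG framework described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names build_parallel_pairs, toy_teacher, and mse are hypothetical, features are toy lists of floats rather than real spectrograms, skipping pairs where the target equals the source speaker is an assumption, and mean squared error is only a placeholder for whatever loss the student is actually trained with.

```python
from typing import Callable, List, Sequence, Tuple

Frames = List[List[float]]  # toy acoustic features: frames x bins


def build_parallel_pairs(
    utterances: Sequence[Tuple[int, Frames]],
    target_speakers: Sequence[int],
    teacher: Callable[[Frames, int], Frames],
) -> List[Tuple[Frames, int, Frames]]:
    """Return (source frames, target speaker, teacher output) triples."""
    pairs = []
    for source_speaker, frames in utterances:
        for target in target_speakers:
            if target == source_speaker:
                continue  # assumption: keep only cross-speaker pairs
            # The teacher sees the whole utterance (non-streaming).
            pairs.append((frames, target, teacher(frames, target)))
    return pairs


def mse(predicted: Frames, reference: Frames) -> float:
    """Placeholder student loss against the teacher's converted output."""
    total = sum(
        (p - r) ** 2
        for p_row, r_row in zip(predicted, reference)
        for p, r in zip(p_row, r_row)
    )
    count = sum(len(row) for row in reference)
    return total / count


def toy_teacher(frames: Frames, target: int) -> Frames:
    """Stand-in for a pre-trained non-streaming VC model (hypothetical)."""
    return [[value + target for value in row] for row in frames]


utterances = [(0, [[0.0, 1.0]]), (1, [[2.0, 3.0]])]
pairs = build_parallel_pairs(utterances, [0, 1], toy_teacher)
```

In this sketch the student's training target is the teacher's output for a different speaker rather than the source audio itself, so copying source timbre into the output can only increase the loss. This is the intuition behind the claim that TG discourages timbre leakage.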
The rest of this paper is organized as follows. Section 2
introduces our proposed IBFs based VC system. Section 3
introduces our proposed non-streaming TG training strategy.
Section 4 shows experiments. Section 5 concludes this paper.
arXiv:2210.15158v1 [eess.AS] 27 Oct 2022