
STREAMING VOICE CONVERSION VIA INTERMEDIATE BOTTLENECK FEATURES AND
NON-STREAMING TEACHER GUIDANCE
Yuanzhe Chen∗, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang
Qiao Tian, Yuping Wang, Yuxuan Wang
Speech, Audio & Music Intelligence (SAMI), ByteDance
ABSTRACT
Streaming voice conversion (VC) is the task of converting
the voice of one person to another in real time. Previous
streaming VC methods use phonetic posteriorgrams (PPGs)
extracted from automatic speech recognition (ASR) systems
to represent speaker-independent information. However,
PPGs lack the prosody and vocalization information of the
source speaker, and streaming PPGs contain undesired leaked
timbre of the source speaker. In this paper, we propose to
use intermediate bottleneck features (IBFs) to replace PPGs.
VC systems trained with IBFs retain more prosody and vo-
calization information of the source speaker. Furthermore,
we propose a non-streaming teacher guidance (TG) frame-
work that addresses the timbre leakage problem. Experiments
show that our proposed IBFs and the TG framework achieve
a state-of-the-art streaming VC naturalness of 3.85, a content
consistency of 3.77, and a timbre similarity of 3.77 under a
future receptive field of 160 ms which significantly outper-
form previous streaming VC systems.
Index Terms—voice conversion (VC), streaming VC, in-
termediate bottleneck features (IBFs), teacher guidance (TG).
1. INTRODUCTION
Voice conversion (VC) is the task of converting the voice of
one person to another while keeping the content consistent.
VC is in strong demand in audio editing and voice interac-
tion scenarios. In the past, much research focused on
non-streaming VC, such as AutoVC [1]. Non-streaming VC
methods require users to input a complete utterance be-
fore returning the result. In contrast, streaming VC
continuously converts the voice of one person to another
in real time. In recent years, streaming VC has found many
applications, such as livestreaming, avatars, and
real-time communication (RTC).
Due to the difficulty of acquiring parallel corpora con-
taining paired utterances between different speakers, current
VC methods are mostly based on non-parallel corpora, such
as auto-encoder based VC [1, 2], phonetic posteriorgrams
(PPGs) based VC [3, 4], and self-supervised representation
based VC [5]. These methods use an encoder to extract
speaker-independent representations, such as the content of
the input speech, and a decoder conditioned on the input
speaker ID to reconstruct the input audio. Recent streaming
VC methods include using a streaming automatic speech
recognition (ASR) encoder to extract PPGs [6, 7] and
designing causal model structures for VC [8, 9, 10].
∗ chenyuanzhe@bytedance.com
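This encoder-decoder recipe can be sketched in a few lines. The code below is a toy illustration, not the architecture of any cited system: the dimensions, the linear/tanh layers, and the speaker-embedding table are all hypothetical stand-ins, chosen only to show how content features extracted from source speech are combined with a speaker-ID condition before reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, for illustration only.
N_FRAMES, N_MELS, D_CONTENT, D_SPK, N_SPEAKERS = 100, 80, 256, 64, 10

def content_encoder(mel, w_enc):
    """Map source mel frames to speaker-independent content features
    (the role played by PPGs or other bottleneck representations)."""
    return np.tanh(mel @ w_enc)                    # (N_FRAMES, D_CONTENT)

def decoder(content, spk_emb, w_dec):
    """Reconstruct mel frames from content features conditioned on a
    speaker embedding, which is broadcast to every frame."""
    cond = np.concatenate(
        [content, np.tile(spk_emb, (content.shape[0], 1))], axis=1)
    return cond @ w_dec                            # (N_FRAMES, N_MELS)

# Randomly initialised toy parameters.
w_enc = rng.normal(size=(N_MELS, D_CONTENT)) * 0.1
w_dec = rng.normal(size=(D_CONTENT + D_SPK, N_MELS)) * 0.1
spk_table = rng.normal(size=(N_SPEAKERS, D_SPK))   # speaker ID -> embedding

mel = rng.normal(size=(N_FRAMES, N_MELS))          # stand-in for source mels
content = content_encoder(mel, w_enc)
recon = decoder(content, spk_table[3], w_dec)      # condition on speaker ID 3

# Training typically minimises a self-reconstruction loss such as L1:
loss = np.abs(recon - mel).mean()
```

At inference time, conversion is obtained by swapping the speaker embedding (e.g. `spk_table[7]` instead of `spk_table[3]`) while keeping the content features fixed.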
An ideal representation extracted from the encoder
should contain the complete content and prosody, but not
the timbre, of the source speaker. Current representations
fall short in three ways. First, the receptive fields of
streaming ASR systems are limited to the past, so their
recognition accuracy is worse than that of non-streaming
ASR systems [11]. Second, PPGs only represent a finite set
of phoneme classes and lack paralinguistic information
such as prosody and vocalization [12]. Third, because the
streaming ASR encoder has a limited receptive field, its
output features contain the undesired timbre of the source
speaker, which is referred to as timbre leakage. At inference
time, timbre leakage results in VC output containing the
timbres of both the source and target speakers.
To address these problems, we first propose to use inter-
mediate bottleneck features (IBFs) to replace PPGs as source
speaker representations, improving the content robustness of
the VC system. IBFs contain more fine-grained pronunci-
ation and paralinguistic information than PPGs, and sig-
nificantly improve pronunciation accuracy, vocalization,
and prosody retention in both non-streaming and streaming
VC systems. Second, we propose a non-streaming teacher
guidance (TG) framework, which uses a pre-trained
non-streaming VC model as a teacher to generate parallel
training data across different source and target speakers.
A student model then learns from these parallel data in-
stead of using a self-reconstruction loss. We show that the
TG framework forces the streaming VC system to ignore
the timbre of the source speaker, which significantly
mitigates the timbre leakage problem.
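The data flow of the TG framework can be illustrated with a minimal sketch. Everything below is a hypothetical placeholder: the teacher and student are reduced to simple array maps, since this chunk does not specify their architectures. What the sketch shows is only the training-target switch: the student regresses the teacher's target-speaker output rather than reconstructing the source, so reproducing the source timbre is never rewarded.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_MELS = 100, 80

src_mel = rng.normal(size=(N_FRAMES, N_MELS))  # source-speaker utterance
tgt_spk = rng.normal(size=(N_MELS,)) * 0.1     # target-speaker embedding

def teacher_vc(mel, spk):
    """Placeholder for the pre-trained non-streaming teacher: it sees the
    whole utterance and emits frames in the target speaker's timbre."""
    return np.tanh(mel) + spk

def student_vc(mel, spk, w):
    """Placeholder for the streaming student: here a causal per-frame map,
    so each output frame depends only on the current input frame."""
    return mel @ w + spk

# Step 1: the teacher generates pseudo-parallel data offline, i.e.
# (source-speaker input, same-content target-speaker output) pairs.
parallel_target = teacher_vc(src_mel, tgt_spk)

# Step 2: the student is trained to match the teacher's output instead of
# reconstructing the source input.
w = np.eye(N_MELS)                             # toy student parameter
student_out = student_vc(src_mel, tgt_spk, w)
tg_loss = np.abs(student_out - parallel_target).mean()

# Contrast: the self-reconstruction loss used without TG, whose target
# still carries the source speaker's timbre.
recon_loss = np.abs(student_out - src_mel).mean()
```

Because `parallel_target` carries only the target speaker's timbre, gradients through `tg_loss` push the student away from copying source timbre, which is the mechanism behind the leakage mitigation described above.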
The rest of this paper is organized as follows. Section 2
introduces our proposed IBFs based VC system. Section 3
introduces our proposed non-streaming TG training strategy.
Section 4 shows experiments. Section 5 concludes this paper.
arXiv:2210.15158v1 [eess.AS] 27 Oct 2022