
FREEVC: TOWARDS HIGH-QUALITY TEXT-FREE ONE-SHOT VOICE CONVERSION
Jingyi Li1,2, Weiping Tu1,2,∗, Li Xiao1,2
1National Engineering Research Center for Multimedia Software, School of Computer Science,
Wuhan University, Wuhan 430072, China
2Hubei Key Laboratory of Multimedia and Network Communication Engineering,
Wuhan University, Wuhan 430072, China
∗Corresponding author.
ABSTRACT
Voice conversion (VC) can be achieved by first extracting source content information and target speaker information, and then reconstructing the waveform from them. However, current approaches typically either extract content information contaminated by leaked speaker information, or demand a large amount of annotated data for training. Besides, the quality of the reconstructed waveform can be degraded by the mismatch between the conversion model and the vocoder. In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction, and propose strategies for extracting clean content information without text annotation. We disentangle content information by imposing an information bottleneck on WavLM features, and propose spectrogram-resize based data augmentation to improve the purity of the extracted content information. Experimental results show that the proposed method outperforms the latest VC models trained with annotated data and has greater robustness.
Index Terms— voice conversion, self-supervised learning, information bottleneck, data augmentation
1. INTRODUCTION
Voice conversion (VC) is a technique that alters the voice of a source speaker towards a target style, such as speaker identity [1], prosody [2], and emotion [3], while keeping the linguistic content unchanged. In this paper, we focus on speaker identity conversion under the one-shot setting, i.e., given only one utterance of the target speaker as reference.
A typical approach to one-shot voice conversion is to disentangle content information and speaker information from the source and target speech, respectively, and then use them to reconstruct the converted speech [4]. As a result, the quality of the converted speech relies on (1) the disentanglement ability and (2) the reconstruction ability of the VC model.
Based on how a VC system disentangles content informa-
tion, we can categorize current VC approaches into text-based
VC and text-free VC. A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract a phonetic posteriorgram (PPG) as the content representation [5] [6]. Some researchers have also resorted to leveraging shared linguistic knowledge from a text-to-speech (TTS) model [7] [8]. However, these approaches require an extensive amount of annotated data to train the ASR or TTS model. Data annotation is costly, and the accuracy and granularity of the annotation, e.g., phoneme level versus grapheme level, affect model performance. To avoid these concerns, text-free approaches that learn to extract content information without the guidance of text annotation have been explored. Typical text-free approaches include the information bottleneck [4], vector quantization [9], instance normalization [10], etc. However, their performance generally lags behind that of text-based approaches [11]. This can be attributed to the fact that the content information they extract is more prone to leakage of source speaker information.
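As an informal illustration of the information-bottleneck idea mentioned above (a minimal sketch, not the FreeVC architecture; the single linear layer and the 1024/192 dimensions are assumptions), projecting high-dimensional SSL features into a much narrower space limits the capacity available to carry speaker identity alongside content:

    import torch
    import torch.nn as nn

    class BottleneckExtractor(nn.Module):
        """Toy information bottleneck: squeeze high-dimensional SSL features
        (e.g. 1024-dim WavLM frames) into a narrow latent so the representation
        has little room left for speaker identity."""

        def __init__(self, ssl_dim: int = 1024, bottleneck_dim: int = 192):
            super().__init__()
            self.down = nn.Linear(ssl_dim, bottleneck_dim)

        def forward(self, ssl_feats: torch.Tensor) -> torch.Tensor:
            # ssl_feats: (batch, frames, ssl_dim) -> (batch, frames, bottleneck_dim)
            return self.down(ssl_feats)

    # Usage: content = BottleneckExtractor()(torch.randn(1, 120, 1024))

A bottleneck alone does not guarantee that speaker information is removed, which is one reason text-free methods often pair it with additional constraints such as vector quantization or data augmentation.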
Many VC systems adopt a two-stage reconstruction pipeline [6] [4]. In the first stage, a conversion model converts the source acoustic features into the target speaker's voice; in the second stage, a vocoder transforms the converted features into a waveform. The two models are usually trained separately. However, the acoustic features predicted by the conversion model follow a different distribution from the ground-truth features the vocoder is trained on, which come from real speech. This feature mismatch problem, which also exists in TTS, can degrade the quality of the reconstructed waveform [12]. VITS [13] is a one-stage model that can perform both TTS and VC. By connecting the models of the two stages through the latent variables of a conditional variational autoencoder (CVAE), the feature mismatch is reduced. By adopting adversarial training, the quality of the reconstructed waveform is further improved. However, VITS is a text-based model and is limited to many-to-many VC, i.e., both the source and target speakers must be seen during training.
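To make the mismatch concrete, the following toy sketch (placeholder modules invented for illustration; not taken from the paper or from VITS) traces a two-stage pipeline in which the vocoder is fitted to ground-truth features but receives predicted ones at conversion time:

    import torch
    import torch.nn as nn

    # Hypothetical two-stage pipeline used only to illustrate the feature mismatch.
    conversion_model = nn.GRU(input_size=80, hidden_size=80, batch_first=True)  # stage 1: predicts mel-like features
    vocoder_stub = nn.Linear(80, 256)  # stage 2 stand-in: a real vocoder would upsample features to a waveform

    source_mel = torch.randn(1, 100, 80)             # (batch, frames, mel bins)
    predicted_mel, _ = conversion_model(source_mel)  # stage-1 output at conversion time

    # The vocoder was trained on ground-truth mels from real speech; predicted_mel
    # follows a slightly different distribution, and that gap degrades the waveform.
    # An end-to-end model such as VITS sidesteps this by tying both stages together
    # through the latent variables of a CVAE and training them jointly.
    waveform_frames = vocoder_stub(predicted_mel)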
In this paper, we propose a text-free one-shot VC system named FreeVC, which adopts the framework of VITS for its excellent reconstruction ability, but learns to disentangle content information without the need for text annotation. The recent success of speech self-supervised learning (SSL)