
FREEVC: TOWARDS HIGH-QUALITY TEXT-FREE ONE-SHOT VOICE CONVERSION
Jingyi Li1,2, Weiping Tu1,2,∗, Li Xiao1,2
1National Engineering Research Center for Multimedia Software, School of Computer Science,
Wuhan University, Wuhan 430072, China
2Hubei Key Laboratory of Multimedia and Network Communication Engineering,
Wuhan University, Wuhan 430072, China
∗Corresponding author.
ABSTRACT
Voice conversion (VC) can be achieved by first extracting source content information and target speaker information, and then reconstructing the waveform from them. However, current approaches typically either extract content information contaminated by leaked speaker information, or demand a large amount of annotated data for training. Besides, the quality of the reconstructed waveform can be degraded by the mismatch between the conversion model and the vocoder. In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction, and propose strategies for extracting clean content information without text annotation. We disentangle content information by imposing an information bottleneck on WavLM features, and propose spectrogram-resize based data augmentation to improve the purity of the extracted content information. Experimental results show that the proposed method outperforms the latest VC models trained with annotated data and has greater robustness.
Index Terms— voice conversion, self-supervised learning, information bottleneck, data augmentation
1. INTRODUCTION
Voice conversion (VC) is a technique that alters the voice of a source speaker towards a target style, such as speaker identity [1], prosody [2], and emotion [3], while keeping the linguistic content unchanged. In this paper, we focus on speaker identity conversion under the one-shot setting, i.e., given only one utterance of the target speaker as reference.
A typical approach to one-shot voice conversion is to disentangle content information and speaker information from the source and target speech, respectively, and then use them to reconstruct the converted speech [4]. As a result, the quality of the converted speech relies on (1) the disentanglement ability and (2) the reconstruction ability of the VC model.
Based on how a VC system disentangles content informa-
tion, we can categorize current VC approaches into text-based
VC and text-free VC. A popular text-based VC approach is to use an automatic speech recognition (ASR) model to extract a phonetic posteriorgram (PPG) as the content representation [5] [6]. Some researchers have also resorted to leveraging shared linguistic knowledge from a text-to-speech (TTS) model [7] [8]. However, these approaches require an extensive amount of annotated data to train the ASR or TTS model. Data annotation is costly, and the accuracy and granularity of the annotation, e.g., phoneme level versus grapheme level, affect model performance. To avoid these concerns, text-free approaches that learn to extract content information without the guidance of text annotation have been explored. Typical text-free approaches include the information bottleneck [4], vector quantization [9], instance normalization [10], etc. However, their performance generally lags behind that of text-based approaches [11]. This can be attributed to the fact that the content information they extract is more prone to leakage of source speaker information.
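As an informal illustration of the information-bottleneck idea mentioned above (a minimal sketch, not the FreeVC architecture; the single linear layer and the 1024/192 dimensions are assumptions), projecting high-dimensional SSL features into a much narrower space limits the capacity available to carry speaker identity alongside content:

    import torch
    import torch.nn as nn

    class BottleneckExtractor(nn.Module):
        """Toy information bottleneck: squeeze high-dimensional SSL features
        (e.g. 1024-dim WavLM frames) into a narrow latent so the representation
        has little room left for speaker identity."""

        def __init__(self, ssl_dim: int = 1024, bottleneck_dim: int = 192):
            super().__init__()
            self.down = nn.Linear(ssl_dim, bottleneck_dim)

        def forward(self, ssl_feats: torch.Tensor) -> torch.Tensor:
            # ssl_feats: (batch, frames, ssl_dim) -> (batch, frames, bottleneck_dim)
            return self.down(ssl_feats)

    # Usage: content = BottleneckExtractor()(torch.randn(1, 120, 1024))

A bottleneck alone does not guarantee that speaker information is removed, which is one reason text-free methods often pair it with additional constraints such as vector quantization or data augmentation.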
Many VC systems adopt a two-stage reconstruction pipeline [6] [4]. In the first stage, a conversion model converts the source acoustic features into the target speaker's voice; in the second stage, a vocoder transforms the converted features into a waveform. The two models are usually trained separately. However, the acoustic features predicted by the conversion model follow a different distribution from the ground-truth features the vocoder is trained on, which come from real speech. This feature mismatch problem, which also exists in TTS, can degrade the quality of the reconstructed waveform [12]. VITS [13] is a one-stage model that can perform both TTS and VC. By connecting the models of the two stages through the latent variables of a conditional variational autoencoder (CVAE), the feature mismatch is reduced. By adopting adversarial training, the quality of the reconstructed waveform is further improved. However, VITS is a text-based model and is limited to many-to-many VC, i.e., both the source and target speakers must be seen during training.
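To make the mismatch concrete, the following toy sketch (placeholder modules invented for illustration; not taken from the paper or from VITS) traces a two-stage pipeline in which the vocoder is fitted to ground-truth features but receives predicted ones at conversion time:

    import torch
    import torch.nn as nn

    # Hypothetical two-stage pipeline used only to illustrate the feature mismatch.
    conversion_model = nn.GRU(input_size=80, hidden_size=80, batch_first=True)  # stage 1: predicts mel-like features
    vocoder_stub = nn.Linear(80, 256)  # stage 2 stand-in: a real vocoder would upsample features to a waveform

    source_mel = torch.randn(1, 100, 80)             # (batch, frames, mel bins)
    predicted_mel, _ = conversion_model(source_mel)  # stage-1 output at conversion time

    # The vocoder was trained on ground-truth mels from real speech; predicted_mel
    # follows a slightly different distribution, and that gap degrades the waveform.
    # An end-to-end model such as VITS sidesteps this by tying both stages together
    # through the latent variables of a CVAE and training them jointly.
    waveform_frames = vocoder_stub(predicted_mel)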
In this paper, we propose a text-free one-shot VC system named FreeVC, which adopts the framework of VITS for its excellent reconstruction ability, but learns to disentangle content information without the need for text annotation. The recent success of speech self-supervised learning (SSL)