MetaSpeech: Speech Effects Switch Along with
Environment for Metaverse
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Abstract—Metaverse expands the physical world to a new
dimension, and the physical environment and the Metaverse environment
can be directly connected and entered. Voice is an indispensable
communication medium in both the real world and the Metaverse, and
fusing the voice with environment effects is important for user
immersion in the Metaverse. In this paper, we propose a voice-conversion-based
method for converting speech to carry a target environment effect.
The proposed method, named MetaSpeech, introduces an environment effect
module containing an effect extractor to extract the environment
information and an effect encoder to encode the environment
effect condition; a gradient reversal layer is used for
adversarial training to preserve the speech content and speaker
information while disentangling the environmental effect. Experimental
results on the public LJSpeech dataset with four environment effects
show that the proposed model can perform the specified environment
effect conversion and outperforms baseline methods from the voice
conversion task.
Index Terms—metaverse, environment effect, audio effect,
voice conversion, room impulse response
I. INTRODUCTION
Metaverse [1]–[3] is the expansion of the real world into the
virtual world. It is a digitalization and virtualization of the
physical world, enabling people to carry out their daily work
and entertainment in a more convenient way. People can enter
the Metaverse at any time from different locations and instantly
join the same meeting room to start a work discussion on time,
which eliminates the physical constraints of the real world and the
limits of scarce conference rooms [4]. We can also carry out activities
in the Metaverse that are impossible in the real world; for example,
it is possible to achieve instantaneous movement free from
geographical constraints and production and creation free from
material constraints.
Although the Metaverse can break many constraints of the real
world, it must also provide a realistic experience
and immersion [5]. People can interact and walk around in
different environments of the Metaverse, such as leaving a work
meeting in a conference room, entering a
gallery to enjoy art paintings, or entering a concert
hall to enjoy a large-scale symphony. Although the transition
between scenes can be treated as instant transfer, immersion
requires users to experience the different atmospheres of
different environments, in both the visual space and the sound.
Corresponding author: Jianzong Wang, jzwang@188.com.
Voice effects [6]–[8] in different environments are perceived
differently by listeners, depending on the size of
the space and the materials of the environment. Two methods
are usually used to add environmental sound effects to a
recorded vocal in a virtual scene [9], [10]: one is
parameter calculation, and the other is manual tuning.
Parameter calculation derives the reflection structure of
the sound from the materials and distances of the physical
environment and convolves the resulting comb-shaped filter with the
vocal signal. Manual tuning [11] adjusts or boosts certain
components of the audio based on experience. For
different scenes in the Metaverse, different sound effects need to
be designed to differentiate the playback audio, such
as enhancing vocals, compensating bass, widening the surround image,
and creating artificial reverberation, so that surround
feeling, vocal clarity, presence, and other auditory effects are
enhanced.
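As a minimal sketch of the parameter-calculation approach (not taken from the paper), the environment effect can be approximated by convolving a dry vocal with a room impulse response; the file names below are placeholders used only for illustration.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

# Load a dry (studio) vocal and a room impulse response (RIR).
# File names are illustrative placeholders.
speech, sr = sf.read("dry_speech.wav")
rir, sr_rir = sf.read("concert_hall_rir.wav")
assert sr == sr_rir, "speech and RIR must share a sample rate"

# Convolve the dry vocal with the RIR to simulate the environment,
# then renormalize to avoid clipping.
wet = fftconvolve(speech, rir, mode="full")
wet = wet / (np.max(np.abs(wet)) + 1e-8)

sf.write("speech_with_room_effect.wav", wet, sr)
```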
Audio effects based on modulation or time variation are implemented
with audio processors, and many such effects can be built from delay
lines and digital filters [12]. Recently, deep models have achieved
strong performance on many tasks [13], [14] and have also been
applied to the generation of audio effects; convolutional and
recurrent neural networks have been combined to model audio special
effects. Both handcrafted and learned audio effects can be applied
directly to a vocal to achieve a specific environmental effect,
but this requires a clean vocal recorded in a studio or soundproof
room, which is not available to users who want to enter the
Metaverse from anywhere.
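As a toy illustration (not from the paper) of how a delay line yields an audio effect, the following sketch adds a feedback echo to a signal; the delay time and feedback gain are arbitrary example values.

```python
import numpy as np

def feedback_echo(x: np.ndarray, sr: int,
                  delay_s: float = 0.25, gain: float = 0.5) -> np.ndarray:
    """Simple delay-line effect: y[n] = x[n] + gain * y[n - D]."""
    d = int(delay_s * sr)           # delay length in samples
    y = np.copy(x).astype(np.float64)
    for n in range(d, len(y)):
        y[n] += gain * y[n - d]     # feedback from the delay line
    return y / (np.max(np.abs(y)) + 1e-8)
```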
In this paper, we propose a framework of effect conversion
that removes the source effect and replaces it with a
specific target effect. For the effect conversion, we disentangle
the speech and the environment into two separate representations.
Given reference speech recorded in the target environment, we
extract the environment latent and fuse it with the source speech
representation to decode generated speech that carries only the
target environment effect. To enhance the naturalness of the
generated speech, a variance adaptor is applied to the latent
representation.
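Conceptually, the inference procedure described above can be summarized by the following sketch; module names such as mel_encoder and effect_extractor are illustrative, not the paper's exact API.

```python
def convert_effect(source_mel, reference_mel, model, alpha=1.0):
    """Replace the source environment effect with that of the reference.

    Schematic sketch of the described pipeline; `model` is assumed to
    bundle the trained sub-networks.
    """
    # 1) Encode the source speech; adversarial training is meant to keep
    #    this representation free of environment information.
    content = model.mel_encoder(source_mel)

    # 2) Extract the environment (effect) latent from reference speech
    #    recorded in the target environment.
    effect_spectrum = model.effect_extractor(reference_mel)

    # 3) Fuse the content with the scaled effect condition, refine with
    #    the variance adaptor, and decode the converted mel spectrum.
    condition = model.effect_encoder(effect_spectrum) * alpha
    hidden = model.variance_adaptor(content + condition)
    return model.mel_decoder(hidden)
```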
Our contributions can be summarized as follows: 1) for the Metaverse, we
propose a speech effect switching method built on a framework
of effect conversion; 2) the environment effect is disentangled
and modeled as a latent representation produced by an effect
extractor; 3) a variance adaptor is introduced to enhance the
naturalness of the generated speech.
II. RELATED WORKS
The effect conversion task is similar to the voice conversion
(VC) task [15]–[21] in terms of spectrum conversion: both
need to convert the speech toward a target. In voice
conversion, however, only the linguistic content needs to be preserved,
whereas environment effect conversion must preserve both the
content and the timbre of the source speaker. To some
extent, the two tasks can be treated as the same if "content"
is taken to include not just the spoken text but also the
speaker information.
Voice conversion models can be categorized into
three main classes: GAN-based models [22]–[25],
VAE-based models [26]–[28], and encoder-decoder-based
models [29]–[31]. Kaneko et al. [22] applied CycleGAN
to the voice conversion task, with the cycle-consistency loss
removing the need for a parallel dataset. This method performs
comparably to a parallel VC method; however,
the generated speech still has a large gap from real speech.
Kaneko et al. [32] later proposed an enhanced version, CycleGAN-VC2,
which incorporates three new improvements to the
generator, the discriminator, and the objective, respectively. However,
both methods operate on mel-cepstrum conversion,
cannot be applied directly to the mel spectrum or linear spectrum,
and cannot model aperiodicity
information.
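For reference, a minimal sketch (not the authors' code) of how a CycleGAN-style cycle-consistency loss can be formed between two generators on unpaired spectra is shown below; g_ab and g_ba are assumed mel-to-mel mapping networks.

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_ab, g_ba, mel_a, mel_b):
    """L1 cycle loss used by CycleGAN-style VC (schematic, unpaired data).

    g_ab converts domain A -> B and g_ba converts B -> A; both take and
    return mel-feature tensors of the same shape.
    """
    recon_a = g_ba(g_ab(mel_a))   # A -> B -> A
    recon_b = g_ab(g_ba(mel_b))   # B -> A -> B
    return F.l1_loss(recon_a, mel_a) + F.l1_loss(recon_b, mel_b)
```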
GAN-based VC methods, however, are difficult to train and
generalize poorly to out-of-set speakers.
VAEs, on the other hand, are easier to train, and VAE-based
models [26], [33] can also perform conversion on non-parallel
corpora. By conditioning on a speaker embedding
as an additional input, a CVAE [27] can achieve conversion to a
specific target speaker. However, a CVAE alone often
suffers from over-smoothed output and cannot guarantee
distribution matching. Qian et al. [29] propose an autoencoder
with a carefully designed bottleneck to disentangle content and
speaker style; with only a self-reconstruction loss, it achieves
distribution-matching style transfer and can perform zero-shot
voice conversion.
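A schematic sketch of this bottleneck-autoencoder idea is shown below; the layer types, dimensions, and module definitions are illustrative assumptions, not the configuration of [29].

```python
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    """Autoencoder VC sketch: a narrow bottleneck squeezes out speaker
    style; the target speaker embedding is re-injected at the decoder."""

    def __init__(self, mel_dim=80, spk_dim=256, bottleneck_dim=32):
        super().__init__()
        self.content_encoder = nn.GRU(mel_dim, bottleneck_dim, batch_first=True)
        self.decoder = nn.GRU(bottleneck_dim + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, mel_dim)

    def forward(self, mel, target_spk_emb):
        # Narrow bottleneck: forces the encoder to drop speaker style.
        content, _ = self.content_encoder(mel)            # (B, T, bottleneck_dim)
        spk = target_spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        hidden, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(hidden)                            # reconstructed / converted mel
```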
For the environment effect switching task, many environments
can be created in the Metaverse, and the environment from which a
user enters may vary. Motivated by these voice conversion methods,
we propose to disentangle the effect and perform any-to-any
conversion of environment effects.
III. METHOD
In this section, we first give an overview of the proposed
method and then describe its main components in detail.
We also introduce the training and inference process of the
proposed method for speech environment effect conversion.
A. Model Overview
As shown in Figure 1, the main modules are a mel encoder,
an environment effect module, a variance adaptor, and a mel decoder.
The mel encoder is built from convolutional layers with 1-dimensional
convolution along the time axis. The variance adaptor is based on the
work in [34]; it contains a pitch predictor and an energy predictor
that estimate the pitch and energy to enhance the naturalness
of the generated target speech.
Fig. 1. The framework of environment effect conversion. (The diagram shows the mel encoder, the environment effect module with its effect extractor, an effect predictor behind a gradient reversal layer trained with a CE loss, an MAE loss on the effect spectrum, and an effect encoder whose effect condition is scaled by α; the variance adaptor; and the mel decoder. The effect extractor is drawn as three ConvDown and three ConvUp layers applied to the reference mel spectrum.)
The mel decoder generates the mel spectrum from the latent variable;
a feed-forward Transformer is used as the mel decoder. The
environment effect module is the core of the speech environment
effect conversion: it contains an effect extractor, an effect
encoder, and an effect predictor that enhances the extracted effect
in an adversarial way. The environment effect module is
described in detail in Section III-B.
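A minimal sketch of a variance adaptor with pitch and energy predictors, in the spirit described above, is given below; the layer sizes, kernel widths, and quantization scheme are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts a frame-level scalar (pitch or energy) from the hidden sequence."""
    def __init__(self, hidden_dim=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, h):                         # h: (B, T, hidden_dim)
        x = self.conv(h.transpose(1, 2)).transpose(1, 2)
        return self.proj(x).squeeze(-1)           # (B, T)

class VarianceAdaptor(nn.Module):
    """Adds predicted pitch and energy back into the hidden sequence."""
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden_dim)
        self.energy_predictor = VariancePredictor(hidden_dim)
        self.pitch_embed = nn.Embedding(n_bins, hidden_dim)
        self.energy_embed = nn.Embedding(n_bins, hidden_dim)
        self.n_bins = n_bins

    def _bucketize(self, v):
        # Map normalized values in [0, 1] to embedding indices (toy quantization).
        v = torch.clamp(v, 0.0, 1.0)
        return (v * (self.n_bins - 1)).long()

    def forward(self, h):
        pitch = self.pitch_predictor(h)
        energy = self.energy_predictor(h)
        h = h + self.pitch_embed(self._bucketize(pitch))
        h = h + self.energy_embed(self._bucketize(energy))
        return h, pitch, energy
```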
B. Environment Effect Module
In this section, we introduce the environment effect module.
An effect extractor and an effect predictor with a gradient
reversal layer are used to strengthen the representation of the
effect spectrum. Finally, the controllable effect spectrum is
embedded as an effect condition and added to the speech content
for the environment effect conversion.
1) Effect Extractor: The effect extractor aims to disentangle
the effect spectrum y0 from the reference mel spectrum y.
As shown on the right of Figure 1, we propose to use a U-Net
architecture for the effect spectrum extractor. It has
three convolutional down layers and three convolutional
up layers; both the up and down layers use 1-dimensional
convolution, and each convolutional layer is followed by a batch
normalization layer and a ReLU activation. The U-Net is trained
jointly with the effect spectrum predictor, whose classifier
enables end-to-end gradient propagation without requiring an
explicit label for the target effect spectrum y0.
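A rough sketch of such a 1-D U-Net effect extractor is given below; the channel sizes and the skip-connection scheme are assumptions for illustration, and no temporal down-sampling is applied in this sketch.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv1d -> BatchNorm1d -> ReLU, as described in the text."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
    )

class EffectExtractor(nn.Module):
    """1-D U-Net over the time axis: three down blocks, three up blocks."""
    def __init__(self, mel_dim=80, ch=(128, 256, 512)):
        super().__init__()
        self.down1 = conv_block(mel_dim, ch[0])
        self.down2 = conv_block(ch[0], ch[1])
        self.down3 = conv_block(ch[1], ch[2])
        self.up3 = conv_block(ch[2], ch[1])
        self.up2 = conv_block(ch[1] * 2, ch[0])     # concat skip from down2
        self.up1 = conv_block(ch[0] * 2, mel_dim)   # concat skip from down1

    def forward(self, ref_mel):                     # ref_mel: (B, T, mel_dim)
        x = ref_mel.transpose(1, 2)                 # -> (B, mel_dim, T)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        u3 = self.up3(d3)
        u2 = self.up2(torch.cat([u3, d2], dim=1))
        u1 = self.up1(torch.cat([u2, d1], dim=1))
        return u1.transpose(1, 2)                   # effect spectrum (B, T, mel_dim)
```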
2) Effect Encoder: The effect encoder combines the source
speech content and the extracted reference effect spectrum
with a controllable factor α to generate an effect condition
for speech in the target environment. The effect encoder
is built from a convolution layer with padding and dilation
both set to 1. To satisfy the same-length constraint during
training, we use two kinds of paired data: in one, the source and
reference spectra are identical; in the other, they share the same
environment and are padded to a fixed maximum length.
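A minimal sketch of how the controllable factor α might scale the encoded effect condition before it is added to the content representation is shown below; the layer sizes are illustrative assumptions.

```python
import torch.nn as nn

class EffectEncoder(nn.Module):
    """Encodes the extracted effect spectrum into an effect condition,
    scaled by a controllable factor alpha."""
    def __init__(self, mel_dim=80, hidden_dim=256):
        super().__init__()
        # Single 1-D convolution with padding=1 and dilation=1, as in the text.
        self.conv = nn.Conv1d(mel_dim, hidden_dim, kernel_size=3,
                              padding=1, dilation=1)

    def forward(self, effect_spectrum, content, alpha=1.0):
        # effect_spectrum: (B, T, mel_dim), content: (B, T, hidden_dim)
        cond = self.conv(effect_spectrum.transpose(1, 2)).transpose(1, 2)
        return content + alpha * cond    # effect-conditioned representation
```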
3) Adversarial Classifier: We set two adversarial classifiers
in the environment effect module. One operates on the mel encoder
output through a gradient reversal layer so that the mel encoder
representation carries no environment effect information. The other
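For completeness, a standard gradient reversal layer (a common implementation pattern, not the authors' code) can be written in PyTorch as follows.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    in the backward pass, so the preceding encoder is trained to remove
    the information the downstream classifier can predict."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: effect_logits = classifier(grad_reverse(encoder_output)),
# followed by a cross-entropy loss on the environment effect label.
```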