MetaSpeech: Speech Effects Switch Along with
Environment for Metaverse
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Abstract—Metaverse expands the physical world to a new
dimension, and the physical environment and the Metaverse environment
can be directly connected and entered. Voice is an indispensable
communication medium in both the real world and the Metaverse, and
fusing the voice with environment effects is important for user
immersion in the Metaverse. In this paper, we propose a voice-conversion-based
method for converting speech to carry a target environment effect.
The proposed method, named MetaSpeech, introduces an environment effect
module containing an effect extractor to extract the environment
information and an effect encoder to encode the environment
effect condition; a gradient reversal layer is used for
adversarial training to preserve the speech content and speaker
information while disentangling the environmental effect. Experimental
results on the public LJSpeech dataset with four environment effects
show that the proposed model can perform the specified environment
effect conversion and outperforms baseline methods from the voice
conversion task.
Index Terms—metaverse, environment effect, audio effect,
voice conversion, room impulse response
I. INTRODUCTION
Metaverse [1]–[3] is the expansion of the real world into the
virtual world. It is a digitalization and virtualization of the
physical world, enabling people to carry out their daily work
and entertainment in a more convenient way. People can enter
the Metaverse at any time from different locations and instantly
join the same meeting room to start a work discussion on time,
which eliminates the physical constraints of the real world and the
limits of scarce conference rooms [4]. We can also carry out activities
in the Metaverse that are impossible in the real world; for example,
it is possible to achieve instantaneous movement free from
geographical constraints and production and creation free from
material constraints.
Although the Metaverse can break many constraints of the real
world, it must also provide a realistic experience
and immersion [5]. People can interact and walk around in
different environments of the Metaverse, such as leaving a work
meeting in a conference room, entering a
gallery to enjoy art paintings, or entering a concert
hall to enjoy a large-scale symphony. Although the transition
between scenes can be treated as instant transfer, immersion
requires users to experience the different atmospheres of
different environments, in both the visual space and the sound.
Corresponding author: Jianzong Wang, jzwang@188.com.
Voice effects [6]–[8] in different environments are perceived
differently by listeners, depending on the size of
the space and the materials of the environment. Two methods
are usually used to add environmental sound effects to a
recorded vocal in a virtual scene [9], [10]: one is
parameter calculation, and the other is manual tuning.
Parameter calculation derives the reflection structure of
the sound from the materials and distances of the physical
environment and convolves the resulting comb-shaped filter with the
vocal signal. Manual tuning [11] adjusts or boosts certain
components of the audio based on experience. For
different scenes in the Metaverse, different sound effects need to
be designed to differentiate the playback audio, such
as enhancing vocals, compensating bass, widening the surround image,
and creating artificial reverberation, so that surround
feeling, vocal clarity, presence, and other auditory effects are
enhanced.
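As a minimal sketch of the parameter-calculation approach (not taken from the paper), the environment effect can be approximated by convolving a dry vocal with a room impulse response; the file names below are placeholders used only for illustration.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

# Load a dry (studio) vocal and a room impulse response (RIR).
# File names are illustrative placeholders.
speech, sr = sf.read("dry_speech.wav")
rir, sr_rir = sf.read("concert_hall_rir.wav")
assert sr == sr_rir, "speech and RIR must share a sample rate"

# Convolve the dry vocal with the RIR to simulate the environment,
# then renormalize to avoid clipping.
wet = fftconvolve(speech, rir, mode="full")
wet = wet / (np.max(np.abs(wet)) + 1e-8)

sf.write("speech_with_room_effect.wav", wet, sr)
```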
Audio effects based on modulation or time variation are implemented
with audio processors, and many such effects can be built from delay
lines and digital filters [12]. Recently, deep models have achieved
strong performance on many tasks [13], [14] and have also been
applied to the generation of audio effects; convolutional and
recurrent neural networks have been combined to model audio special
effects. Both handcrafted and learned audio effects can be applied
directly to a vocal to achieve a specific environmental effect,
but this requires a clean vocal recorded in a studio or soundproof
room, which is not available to users who want to enter the
Metaverse from anywhere.
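As a toy illustration (not from the paper) of how a delay line yields an audio effect, the following sketch adds a feedback echo to a signal; the delay time and feedback gain are arbitrary example values.

```python
import numpy as np

def feedback_echo(x: np.ndarray, sr: int,
                  delay_s: float = 0.25, gain: float = 0.5) -> np.ndarray:
    """Simple delay-line effect: y[n] = x[n] + gain * y[n - D]."""
    d = int(delay_s * sr)           # delay length in samples
    y = np.copy(x).astype(np.float64)
    for n in range(d, len(y)):
        y[n] += gain * y[n - d]     # feedback from the delay line
    return y / (np.max(np.abs(y)) + 1e-8)
```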
In this paper, we propose a framework of effect conversion
that removes the source effect and replaces it with a
specific target effect. For the effect conversion, we disentangle
the speech and the environment into two separate representations.
Given reference speech recorded in the target environment, we
extract the environment latent and fuse it with the source speech
representation to decode generated speech that carries only the
target environment effect. To enhance the naturalness of the
generated speech, a variance adaptor is applied to the latent
representation.
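Conceptually, the inference procedure described above can be summarized by the following sketch; module names such as mel_encoder and effect_extractor are illustrative, not the paper's exact API.

```python
def convert_effect(source_mel, reference_mel, model, alpha=1.0):
    """Replace the source environment effect with that of the reference.

    Schematic sketch of the described pipeline; `model` is assumed to
    bundle the trained sub-networks.
    """
    # 1) Encode the source speech; adversarial training is meant to keep
    #    this representation free of environment information.
    content = model.mel_encoder(source_mel)

    # 2) Extract the environment (effect) latent from reference speech
    #    recorded in the target environment.
    effect_spectrum = model.effect_extractor(reference_mel)

    # 3) Fuse the content with the scaled effect condition, refine with
    #    the variance adaptor, and decode the converted mel spectrum.
    condition = model.effect_encoder(effect_spectrum) * alpha
    hidden = model.variance_adaptor(content + condition)
    return model.mel_decoder(hidden)
```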
Our contributions can be summarized as follows: 1) for the Metaverse, we
propose a speech effect switching method built on a framework
of effect conversion; 2) the environment effect is disentangled
and modeled as a latent representation produced by an effect
extractor; 3) a variance adaptor is introduced to enhance the
naturalness of the generated speech.
II. RELATED WORKS
The effect conversion task is similar to the voice conversion
(VC) task [15]–[21] in terms of spectrum conversion: both
need to convert the speech toward a target. In voice
conversion, however, only the linguistic content needs to be preserved,
whereas environment effect conversion must preserve both the
content and the timbre of the source speaker. To some
extent, the two tasks can be treated as the same if "content"
is taken to include not just the spoken text but also the
speaker information.
Voice conversion models can be categorized into
three main classes: GAN-based models [22]–[25],
VAE-based models [26]–[28], and encoder-decoder-based
models [29]–[31]. Kaneko et al. [22] applied CycleGAN
to the voice conversion task, with the cycle-consistency loss
removing the need for a parallel dataset. This method performs
comparably to a parallel VC method; however,
the generated speech still has a large gap from real speech.
Kaneko et al. [32] later proposed an enhanced version, CycleGAN-VC2,
which incorporates three new improvements to the
generator, the discriminator, and the objective, respectively. However,
both methods operate on mel-cepstrum conversion,
cannot be applied directly to the mel spectrum or linear spectrum,
and cannot model aperiodicity
information.
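For reference, a minimal sketch (not the authors' code) of how a CycleGAN-style cycle-consistency loss can be formed between two generators on unpaired spectra is shown below; g_ab and g_ba are assumed mel-to-mel mapping networks.

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_ab, g_ba, mel_a, mel_b):
    """L1 cycle loss used by CycleGAN-style VC (schematic, unpaired data).

    g_ab converts domain A -> B and g_ba converts B -> A; both take and
    return mel-feature tensors of the same shape.
    """
    recon_a = g_ba(g_ab(mel_a))   # A -> B -> A
    recon_b = g_ab(g_ba(mel_b))   # B -> A -> B
    return F.l1_loss(recon_a, mel_a) + F.l1_loss(recon_b, mel_b)
```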
GAN-based VC methods, however, are difficult to train and
generalize poorly to out-of-set speakers.
VAEs, on the other hand, are easier to train, and VAE-based
models [26], [33] can also perform conversion on non-parallel
corpora. By conditioning on a speaker embedding
as an additional input, a CVAE [27] can achieve conversion to a
specific target speaker. However, a CVAE alone often
suffers from over-smoothed output and cannot guarantee
distribution matching. Qian et al. [29] propose an autoencoder
with a carefully designed bottleneck to disentangle content and
speaker style; with only a self-reconstruction loss, it achieves
distribution-matching style transfer and can perform zero-shot
voice conversion.
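A schematic sketch of this bottleneck-autoencoder idea is shown below; the layer types, dimensions, and module definitions are illustrative assumptions, not the configuration of [29].

```python
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    """Autoencoder VC sketch: a narrow bottleneck squeezes out speaker
    style; the target speaker embedding is re-injected at the decoder."""

    def __init__(self, mel_dim=80, spk_dim=256, bottleneck_dim=32):
        super().__init__()
        self.content_encoder = nn.GRU(mel_dim, bottleneck_dim, batch_first=True)
        self.decoder = nn.GRU(bottleneck_dim + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, mel_dim)

    def forward(self, mel, target_spk_emb):
        # Narrow bottleneck: forces the encoder to drop speaker style.
        content, _ = self.content_encoder(mel)            # (B, T, bottleneck_dim)
        spk = target_spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        hidden, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(hidden)                            # reconstructed / converted mel
```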
For the environment effect switching task, many environments
can be created in the Metaverse, and the environment from which a
user enters may vary. Motivated by these voice conversion methods,
we propose to disentangle the effect and perform any-to-any
conversion of environment effects.
III. METHOD
In this section, we first give an overview of the proposed
method and then describe its main components in detail.
We also introduce the training and inference process of the
proposed method for speech environment effect conversion.
A. Model Overview
As shown in Figure 1, the main modules are a mel encoder,
an environment effect module, a variance adaptor, and a mel decoder.
The mel encoder is built from convolutional layers with 1-dimensional
convolution along the time axis. The variance adaptor is based on the
work in [34]; it contains a pitch predictor and an energy predictor
that estimate the pitch and energy to enhance the naturalness
of the generated target speech.
Fig. 1. The framework of environment effect conversion. (The diagram shows the mel encoder, the environment effect module with its effect extractor, an effect predictor behind a gradient reversal layer trained with a CE loss, an MAE loss on the effect spectrum, and an effect encoder whose effect condition is scaled by α; the variance adaptor; and the mel decoder. The effect extractor is drawn as three ConvDown and three ConvUp layers applied to the reference mel spectrum.)
The mel decoder generates the mel spectrum from the latent variable;
a feed-forward Transformer is used as the mel decoder. The
environment effect module is the core of the speech environment
effect conversion: it contains an effect extractor, an effect
encoder, and an effect predictor that enhances the extracted effect
in an adversarial way. The environment effect module is
described in detail in Section III-B.
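A minimal sketch of a variance adaptor with pitch and energy predictors, in the spirit described above, is given below; the layer sizes, kernel widths, and quantization scheme are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts a frame-level scalar (pitch or energy) from the hidden sequence."""
    def __init__(self, hidden_dim=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, h):                         # h: (B, T, hidden_dim)
        x = self.conv(h.transpose(1, 2)).transpose(1, 2)
        return self.proj(x).squeeze(-1)           # (B, T)

class VarianceAdaptor(nn.Module):
    """Adds predicted pitch and energy back into the hidden sequence."""
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.pitch_predictor = VariancePredictor(hidden_dim)
        self.energy_predictor = VariancePredictor(hidden_dim)
        self.pitch_embed = nn.Embedding(n_bins, hidden_dim)
        self.energy_embed = nn.Embedding(n_bins, hidden_dim)
        self.n_bins = n_bins

    def _bucketize(self, v):
        # Map normalized values in [0, 1] to embedding indices (toy quantization).
        v = torch.clamp(v, 0.0, 1.0)
        return (v * (self.n_bins - 1)).long()

    def forward(self, h):
        pitch = self.pitch_predictor(h)
        energy = self.energy_predictor(h)
        h = h + self.pitch_embed(self._bucketize(pitch))
        h = h + self.energy_embed(self._bucketize(energy))
        return h, pitch, energy
```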
B. Environment Effect Module
In this section, we introduce the environment effect module.
An effect extractor and an effect predictor with a gradient
reversal layer are used to strengthen the representation of the
effect spectrum. Finally, the controllable effect spectrum is
embedded as an effect condition and added to the speech content
for the environment effect conversion.
1) Effect Extractor: The effect extractor aims to disentangle
the effect spectrum y0 from the reference mel spectrum y.
As shown on the right of Figure 1, we propose to use a U-Net
architecture for the effect spectrum extractor. It has
three convolutional down layers and three convolutional
up layers; both the up and down layers use 1-dimensional
convolution, and each convolutional layer is followed by a batch
normalization layer and a ReLU activation. The U-Net is trained
jointly with the effect spectrum predictor, whose classifier
enables end-to-end gradient propagation without requiring an
explicit label for the target effect spectrum y0.
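A rough sketch of such a 1-D U-Net effect extractor is given below; the channel sizes and the skip-connection scheme are assumptions for illustration, and no temporal down-sampling is applied in this sketch.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv1d -> BatchNorm1d -> ReLU, as described in the text."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
    )

class EffectExtractor(nn.Module):
    """1-D U-Net over the time axis: three down blocks, three up blocks."""
    def __init__(self, mel_dim=80, ch=(128, 256, 512)):
        super().__init__()
        self.down1 = conv_block(mel_dim, ch[0])
        self.down2 = conv_block(ch[0], ch[1])
        self.down3 = conv_block(ch[1], ch[2])
        self.up3 = conv_block(ch[2], ch[1])
        self.up2 = conv_block(ch[1] * 2, ch[0])     # concat skip from down2
        self.up1 = conv_block(ch[0] * 2, mel_dim)   # concat skip from down1

    def forward(self, ref_mel):                     # ref_mel: (B, T, mel_dim)
        x = ref_mel.transpose(1, 2)                 # -> (B, mel_dim, T)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        u3 = self.up3(d3)
        u2 = self.up2(torch.cat([u3, d2], dim=1))
        u1 = self.up1(torch.cat([u2, d1], dim=1))
        return u1.transpose(1, 2)                   # effect spectrum (B, T, mel_dim)
```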
2) Effect Encoder: The effect encoder combines the source
speech content and the extracted reference effect spectrum
with a controllable factor α to generate an effect condition
for speech in the target environment. The effect encoder
is built from a convolution layer with padding and dilation
both set to 1. To satisfy the same-length constraint during
training, we use two kinds of paired data: in one, the source and
reference spectra are identical; in the other, they share the same
environment and are padded to a fixed maximum length.
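A minimal sketch of how the controllable factor α might scale the encoded effect condition before it is added to the content representation is shown below; the layer sizes are illustrative assumptions.

```python
import torch.nn as nn

class EffectEncoder(nn.Module):
    """Encodes the extracted effect spectrum into an effect condition,
    scaled by a controllable factor alpha."""
    def __init__(self, mel_dim=80, hidden_dim=256):
        super().__init__()
        # Single 1-D convolution with padding=1 and dilation=1, as in the text.
        self.conv = nn.Conv1d(mel_dim, hidden_dim, kernel_size=3,
                              padding=1, dilation=1)

    def forward(self, effect_spectrum, content, alpha=1.0):
        # effect_spectrum: (B, T, mel_dim), content: (B, T, hidden_dim)
        cond = self.conv(effect_spectrum.transpose(1, 2)).transpose(1, 2)
        return content + alpha * cond    # effect-conditioned representation
```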
3) Adversarial Classifier: We set two adversarial classifiers
in the environment effect module. One operates on the mel encoder
output through a gradient reversal layer so that the mel encoder
representation carries no environment effect information. The other
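For completeness, a standard gradient reversal layer (a common implementation pattern, not the authors' code) can be written in PyTorch as follows.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    in the backward pass, so the preceding encoder is trained to remove
    the information the downstream classifier can predict."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: effect_logits = classifier(grad_reverse(encoder_output)),
# followed by a cross-entropy loss on the environment effect label.
```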