MetaSpeech: Speech Effects Switch Along with Environment for Metaverse
Xulong Zhang, Jianzong Wang∗, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
Abstract—The Metaverse expands the physical world into a new dimension, and the physical environment and the Metaverse environment can be directly connected and entered. Voice is an indispensable communication medium in both the real world and the Metaverse, and fusing the voice with environment effects is important for user immersion in the Metaverse. In this paper, we propose a voice-conversion-based method to convert speech to a target environment effect. The proposed method, named MetaSpeech, introduces an environment effect module containing an effect extractor, which extracts the environment information, and an effect encoder, which encodes the environment effect condition; a gradient reversal layer is used for adversarial training to keep the speech content and speaker information while disentangling the environment effect. Experiment results on the public LJSpeech dataset with four environment effects show that the proposed model can complete the specified environment effect conversion and outperforms baseline methods from the voice conversion task.
Index Terms—metaverse, environment effect, audio effect,
voice conversion, room impulse response
I. INTRODUCTION
The Metaverse [1]–[3] is the expansion of the real world into the virtual world. It digitalizes and virtualizes the physical world so that people can carry out their daily work and entertainment in a more convenient way. People can enter the Metaverse at any time from different locations and instantly join the same meeting room to start a work discussion on time, which eliminates the physical constraints of the real world and the limits of scarce conference rooms [4]. We can also carry out activities in the Metaverse that are unrealistic in the real world; for example, instantaneous movement free from geographical constraints, and production and creation free from material constraints.
Although the Metaverse can break many constraints of the real world, it must also provide a realistic experience and immersion [5]. People can interact and walk around in different environments of the Metaverse, such as finishing a work meeting in a conference room, entering a gallery to enjoy art paintings, or entering a concert hall to enjoy a large-scale symphony. Although the transition between different scenes can be treated as an instant transfer, immersion requires users to experience different atmospheres in different environments, including the visual space and the sound.
∗Corresponding author: Jianzong Wang, jzwang@188.com.
Voice effects [6]–[8] in different environments bring different perceptions to listeners, depending on the size of the space and the materials of the environment. Two methods are usually used to add environmental sound effects to a recorded vocal in a virtual scene [9], [10]: parameter calculation and manual tuning. Parameter calculation derives the reflection structure of the sound from the materials and distances in the physical environment, and convolves the resulting comb-like filter with the vocal. Manual tuning [11] adjusts or boosts certain audio components based on experience. For different scenes in the Metaverse, different sound effects need to be designed to differentiate the playback audio, such as enhancing vocals, compensating bass, widening the surround image, and creating artificial reverberation, so that surround feeling, vocal presence, and other auditory effects are enhanced.
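As a minimal sketch of the parameter-calculation approach, the reflections of the target space can be summarized as a room impulse response (RIR) and convolved with a dry vocal. The helper function and the normalization step below are illustrative assumptions; only the convolution itself reflects the method described above.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_room_effect(dry_speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry (anechoic) vocal with a room impulse response.

    dry_speech: mono waveform of shape (T,)
    rir: impulse response encoding the reflections of the target space
    """
    wet = fftconvolve(dry_speech, rir, mode="full")[: len(dry_speech)]
    # Rescale so the added reflections do not clip the output.
    return wet / max(np.abs(wet).max(), 1e-8)
```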
Modulation-based and time-varying audio effects are implemented with audio processors, and many of them can be built from delay lines and digital filters [12]. Recently, deep models have achieved strong performance on many tasks [13], [14] and have also been applied to the generation of audio effects, for example by combining convolutional and recurrent neural networks to model audio effects. Both handcrafted and learned audio effects can be applied directly to the vocal to achieve a specific environmental effect, but this requires a clean vocal recorded in a studio or soundproof room, which is not easy to obtain for users who want to access the Metaverse from anywhere.
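As an illustration of the delay-line idea (the delay length and feedback gain below are assumed values for the example, not taken from [12]), a feedback comb filter y[n] = x[n] + g * y[n - D] already yields a simple echo or reverberation-like effect:

```python
import numpy as np

def feedback_comb(x: np.ndarray, delay: int, gain: float) -> np.ndarray:
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay].

    A basic delay-line building block for echo and artificial
    reverberation; |gain| < 1 keeps the filter stable.
    """
    y = x.astype(np.float64).copy()
    for n in range(delay, len(y)):
        y[n] += gain * y[n - delay]
    return y

# Example: a 50 ms echo at a 16 kHz sampling rate with moderate feedback.
# echoed = feedback_comb(speech, delay=800, gain=0.5)
```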
In this paper, we propose an effect conversion framework to remove the source effect and replace it with a specific target effect. For the effect conversion, we disentangle speech and environment into two separate representations: given a reference utterance recorded in the target environment, we extract the environment latent and fuse it with the source speech representation to decode generated speech carrying only the target environment effect. To enhance the naturalness of the generated speech, a variance adaptor is applied to the latent representation.
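A minimal sketch of the gradient reversal layer that drives the adversarial disentanglement mentioned in the abstract (PyTorch-style; the scaling constant lambd and the surrounding module names are illustrative assumptions, not the exact implementation):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the
    backward pass, so the upstream encoder is trained adversarially."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: an environment classifier on top of the content features,
# trained through the reversed gradient so that the content representation
# is pushed to discard environment information.
# env_logits = env_classifier(grad_reverse(content_features))
```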
Our contributions can be summarized as follows: 1) for the Metaverse, we propose a speech effect switching method based on an effect conversion framework; 2) the environment effect is disentangled and modeled as a latent representation produced by an effect extractor; 3) a variance adaptor is proposed to enhance the naturalness of the generated speech.