MIXED-EVC: MIXED EMOTION SYNTHESIS AND CONTROL IN VOICE CONVERSION
Kun Zhou1, Berrak Sisman2, Carlos Busso2, Bin Ma1, Haizhou Li3,4
1Speech Lab of DAMO Academy, Alibaba Group, Singapore
2The University of Texas at Dallas, United States of America
3The Chinese University of Hong Kong, Shenzhen, China 4National University of Singapore, Singapore
ABSTRACT
Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: the nuanced rendering of mixed emotions and enhanced control over emotional expression. To achieve this, we propose a novel EVC framework, Mixed-EVC, which leverages only discrete emotion training labels. We construct an attribute vector that encodes the relationships among these discrete emotions; it is predicted using a ranking-based support vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC framework. Mixed-EVC not only learns to characterize the input emotional style but also quantifies its relevance to other emotions during training. As a result, users can assign these attributes to achieve their desired rendering of mixed emotions. Objective and subjective evaluations confirm the effectiveness of our approach in terms of mixed emotion synthesis and control, while surpassing traditional baselines in the conversion of discrete emotions from one to another.
Index Terms—Emotional voice conversion, mixed emotions
1. INTRODUCTION
Human speech often encompasses a blend of emotions, resulting in complex emotional expressions, as evidenced in prior studies [1, 2, 3]. Emotional voice conversion (EVC) aims to manipulate the emotional state of a spoken utterance while keeping the speaker identity and linguistic content unchanged [4]. This paper represents a progressive step in the field of EVC, with a unique focus on infusing a quantifiable mixed-emotion rendering into the human voice. The primary objective is to enhance the naturalness of human-computer interaction [5], for example, by enriching the emotional responses within a dialogue system [6, 7].
EVC poses unique challenges due to the complex structure of emotions [8]. People use many different words to describe the emotions they feel; by one estimate, a human may experience nearly 34,000 distinct emotions [9]. To understand how these emotions correlate with each other, scientists analyze them in a valence-arousal space [8]. The evidence for the valence-arousal view comes from statistical analyses of how people report their feelings [10]. However, analysis from a valence-arousal view has not always been able to clearly distinguish between emotions [11]. Plutchik's emotion wheel [12] provides a more straightforward way to describe emotions: eight primary emotions (anger, fear, sadness, disgust, surprise, anticipation, trust, and joy) are arranged in a wheel, and a diverse range of emotions can be produced either by changing the intensity of a primary emotion or by combining primary emotions (see the sketch after this paragraph). Although preliminary studies [13, 14] have explored synthesizing mixed emotions for text-to-speech systems, we observe a lack of study on mixed emotion synthesis in the EVC literature, with existing studies mostly focusing on the conversion between discrete emotions. In this research, we draw inspiration from the emotion wheel theory and introduce an approach that employs voice conversion techniques to manipulate human emotions into a mixed emotional state.
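As a minimal, illustrative sketch of this mixing idea (the emotion set follows Plutchik's wheel, while the weights and the "disappointment" example are assumptions for demonstration, not values produced by any model), a mixed emotion can be represented as a weighted combination over the eight primary emotions:

    # Illustrative sketch only: a mixed emotion as a weighted combination
    # over Plutchik's eight primary emotions. The weights are arbitrary
    # examples, not learned values.
    import numpy as np

    PRIMARY = ["anger", "fear", "sadness", "disgust",
               "surprise", "anticipation", "trust", "joy"]

    def mix(weights):
        """Return an 8-dim vector holding each primary emotion's degree in [0, 1]."""
        v = np.zeros(len(PRIMARY))
        for emotion, w in weights.items():
            v[PRIMARY.index(emotion)] = w
        return v

    # e.g., "disappointment" is often described as sadness blended with surprise
    print(mix({"sadness": 0.7, "surprise": 0.3}))
    # -> [0.  0.  0.7 0.  0.3 0.  0.  0. ]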
Speech emotions are inherently supra-segmental and intricate, involving multiple acoustic cues such as voice quality, pitch, energy, and speaking rate [15]. Addressing these complexities in EVC calls for modeling both spectral and prosodic variations at the same time, leading to the study of sequence-to-sequence (seq2seq) architectures for EVC [16, 17, 18, 19]. To learn emotion information, existing EVC frameworks mostly leverage pre-defined discrete emotion labels as supervision for emotion training, either learning a translation model between emotion pairs [20, 21, 22, 23, 24] or disentangling emotional elements with auto-encoders [25, 26, 27]. Emotion conversion is then achieved by assigning an emotion label or by transferring style from an emotional reference utterance. These methods preclude learning richer style descriptions of emotions and instead produce stereotypical emotional patterns [28]. Consequently, they confine emotions within specific categories, posing challenges when it comes to examining the connections between different emotional states, encompassing the entirety of human emotions, and creating a mixed emotional profile. This paper aims to fill these gaps.
This paper presents the first investigation of mixed emotion synthesis and control for EVC, denoted as “Mixed-EVC”, which aims to address two challenges: (1) how to describe and quantify the combination of emotions; and (2) how to assess the produced mixed outcomes. Mixed-EVC introduces a novel approach that explicitly quantifies and encodes the relationships among discrete emotions into an attribute vector, and it distinguishes itself by infusing diverse emotional behaviors into the human voice while leveraging only limited discrete emotion labels. Mixed-EVC also allows users to intuitively and quantifiably control the emotion rendering through categorical classes, offering a more user-friendly alternative to manipulating continuous emotional attributes. Our key contributions can be outlined as follows:
• We propose a novel formulation to quantify the mixture of emotions. We construct a ranking function for each pair of emotions, where the ranking values represent the degree of relevance with respect to an emotion (a minimal training sketch for such a ranker follows this list);
• During training, the attribute can be precisely predicted by the ranking function, guiding the decoder to quantify the relevance between the input emotional style and all other emotions. At run time, users can define these attributes to generate various emotional mixtures;
• We design evaluation metrics to assess the effectiveness of our approach in terms of synthesizing mixed emotions and enabling control over the emotional rendering.
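A minimal sketch of how such a pairwise ranking function could be trained is given below, using the classic reduction of ranking to classification over feature differences. The 64-dimensional features, the data layout, and the use of scikit-learn's LinearSVC are illustrative assumptions rather than the implementation adopted in this work:

    # Minimal RankSVM sketch. Assumptions: utterance-level acoustic features
    # are precomputed 64-dim vectors, and scikit-learn's LinearSVC stands in
    # for the ranking SVM; this is illustrative, not the system's code.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_ranker(X_high, X_low, C=1.0):
        """Learn a linear ranking function w for one emotion.

        X_high: features of utterances that should rank high (the target emotion).
        X_low:  features of utterances that should rank low (other emotions).
        Pairwise reduction: classify difference vectors x_high - x_low as +1.
        """
        diffs, labels = [], []
        for xh in X_high:
            for xl in X_low:
                diffs.append(xh - xl); labels.append(+1)
                diffs.append(xl - xh); labels.append(-1)
        svm = LinearSVC(C=C).fit(np.asarray(diffs), labels)
        return svm.coef_.ravel()  # relevance score of a feature x is w @ x

    # Toy usage with random stand-in features:
    rng = np.random.default_rng(0)
    w = train_ranker(rng.normal(size=(20, 64)), rng.normal(size=(30, 64)))
    relevance = w @ rng.normal(size=64)  # one entry of the attribute vector

In practice, the relevance scores from the per-emotion rankers would be collected into the attribute vector that conditions the seq2seq decoder.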