MIXED-EVC: MIXED EMOTION SYNTHESIS AND CONTROL IN VOICE CONVERSION
Kun Zhou1, Berrak Sisman2, Carlos Busso2, Bin Ma1, Haizhou Li3,4
1Speech Lab of DAMO Academy, Alibaba Group, Singapore
2The University of Texas at Dallas, United States of America
3The Chinese University of Hong Kong, Shenzhen, China 4National University of Singapore, Singapore
ABSTRACT
Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: the nuanced rendering of mixed emotions and enhanced control over emotional expression. To achieve this, we propose a novel EVC framework, Mixed-EVC, which leverages only discrete emotion training labels. We construct an attribute vector that encodes the relationships among these discrete emotions; it is predicted using a ranking-based support vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC framework. Mixed-EVC not only learns to characterize the input emotional style but also quantifies its relevance to other emotions during training. As a result, users can assign these attributes to achieve their desired rendering of mixed emotions. Objective and subjective evaluations confirm the effectiveness of our approach in terms of mixed emotion synthesis and control, while surpassing traditional baselines in the conversion of discrete emotions from one to another.
Index Terms—Emotional voice conversion, mixed emotions
1. INTRODUCTION
Human speech often encompasses a blend of emotions, resulting in complex emotional expressions, as evidenced in prior studies [1, 2, 3]. Emotional voice conversion (EVC) aims to manipulate the emotional state of a spoken utterance while keeping the speaker identity and linguistic content unchanged [4]. This paper represents a progressive step in the field of EVC, with a unique focus on infusing a quantifiable mixed-emotion rendering into the human voice. The primary objective is to enhance the naturalness of human-computer interaction [5], for example, by enriching the emotional responses within a dialogue system [6, 7].
EVC poses unique challenges due to the complex structure of emotions [8]. People use many different words to describe the emotions they feel; by one estimate, a human may experience nearly 34,000 distinct emotions [9]. To understand how these emotions correlate with each other, scientists analyze them in a valence-arousal space [8]. The evidence for the valence-arousal view comes from statistical analyses of how people report their feelings [10]. However, analysis from a valence-arousal view has not always been able to clearly distinguish between emotions [11]. Plutchik's emotion wheel [12] provides a more straightforward way to describe emotions: eight primary emotions (anger, fear, sadness, disgust, surprise, anticipation, trust, and joy) are arranged in a wheel, and a diverse range of emotions can be produced either by changing the intensity of a primary emotion or by combining primary emotions (see the sketch after this paragraph). Although preliminary studies [13, 14] have explored synthesizing mixed emotions for text-to-speech systems, we observe a lack of study on mixed emotion synthesis in the EVC literature, with existing studies mostly focusing on the conversion between discrete emotions. In this research, we draw inspiration from the emotion wheel theory and introduce an approach that employs voice conversion techniques to manipulate human emotions into a mixed emotional state.
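As a minimal, illustrative sketch of this mixing idea (the emotion set follows Plutchik's wheel, while the weights and the "disappointment" example are assumptions for demonstration, not values produced by any model), a mixed emotion can be represented as a weighted combination over the eight primary emotions:

    # Illustrative sketch only: a mixed emotion as a weighted combination
    # over Plutchik's eight primary emotions. The weights are arbitrary
    # examples, not learned values.
    import numpy as np

    PRIMARY = ["anger", "fear", "sadness", "disgust",
               "surprise", "anticipation", "trust", "joy"]

    def mix(weights):
        """Return an 8-dim vector holding each primary emotion's degree in [0, 1]."""
        v = np.zeros(len(PRIMARY))
        for emotion, w in weights.items():
            v[PRIMARY.index(emotion)] = w
        return v

    # e.g., "disappointment" is often described as sadness blended with surprise
    print(mix({"sadness": 0.7, "surprise": 0.3}))
    # -> [0.  0.  0.7 0.  0.3 0.  0.  0. ]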
Speech emotions are inherently supra-segmental and intricate, involving multiple acoustic cues such as voice quality, pitch, energy, and speaking rate [15]. Addressing these complexities in EVC calls for modeling both spectral and prosodic variations at the same time, leading to the study of sequence-to-sequence (seq2seq) architectures for EVC [16, 17, 18, 19]. To learn emotion information, existing EVC frameworks mostly leverage pre-defined discrete emotion labels as supervision for emotion training, either learning a translation model between emotion pairs [20, 21, 22, 23, 24] or disentangling emotional elements with auto-encoders [25, 26, 27]. Emotion conversion is then achieved by assigning an emotion label or by transferring style from an emotional reference utterance. These methods preclude learning richer style descriptions of emotions and instead produce stereotypical emotional patterns [28]. Consequently, they confine emotions within specific categories, posing challenges when it comes to examining the connections between different emotional states, encompassing the entirety of human emotions, and creating a mixed emotional profile. This paper aims to fill these gaps.
This paper presents the first investigation of mixed emotion synthesis and control for EVC, denoted as “Mixed-EVC”, which aims to address two challenges: (1) how to describe and quantify the combination of emotions; and (2) how to assess the produced mixed outcomes. Mixed-EVC introduces a novel approach that explicitly quantifies and encodes the relationships among discrete emotions into an attribute vector, and it distinguishes itself by infusing diverse emotional behaviors into the human voice while leveraging only limited discrete emotion labels. Mixed-EVC also allows users to intuitively and quantifiably control the emotion rendering through categorical classes, offering a more user-friendly alternative to manipulating continuous emotional attributes. Our key contributions can be outlined as follows:
• We propose a novel formulation to quantify the mixture of emotions. We construct a ranking function for each pair of emotions, where the ranking values represent the degree of relevance with respect to an emotion (a minimal training sketch for such a ranker follows this list);
• During training, the attribute can be precisely predicted by the ranking function, guiding the decoder to quantify the relevance between the input emotional style and all other emotions. At run time, users can define these attributes to generate various emotional mixtures;
• We design evaluation metrics to assess the effectiveness of our approach in terms of synthesizing mixed emotions and enabling control over the emotional rendering.
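A minimal sketch of how such a pairwise ranking function could be trained is given below, using the classic reduction of ranking to classification over feature differences. The 64-dimensional features, the data layout, and the use of scikit-learn's LinearSVC are illustrative assumptions rather than the implementation adopted in this work:

    # Minimal RankSVM sketch. Assumptions: utterance-level acoustic features
    # are precomputed 64-dim vectors, and scikit-learn's LinearSVC stands in
    # for the ranking SVM; this is illustrative, not the system's code.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_ranker(X_high, X_low, C=1.0):
        """Learn a linear ranking function w for one emotion.

        X_high: features of utterances that should rank high (the target emotion).
        X_low:  features of utterances that should rank low (other emotions).
        Pairwise reduction: classify difference vectors x_high - x_low as +1.
        """
        diffs, labels = [], []
        for xh in X_high:
            for xl in X_low:
                diffs.append(xh - xl); labels.append(+1)
                diffs.append(xl - xh); labels.append(-1)
        svm = LinearSVC(C=C).fit(np.asarray(diffs), labels)
        return svm.coef_.ravel()  # relevance score of a feature x is w @ x

    # Toy usage with random stand-in features:
    rng = np.random.default_rng(0)
    w = train_ranker(rng.normal(size=(20, 64)), rng.normal(size=(30, 64)))
    relevance = w @ rng.normal(size=64)  # one entry of the attribute vector

In practice, the relevance scores from the per-emotion rankers would be collected into the attribute vector that conditions the seq2seq decoder.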