
FCTALKER: FINE AND COARSE GRAINED CONTEXT MODELING FOR EXPRESSIVE
CONVERSATIONAL SPEECH SYNTHESIS
Yifan Hu1, Rui Liu1,∗, Guanglai Gao1, Haizhou Li2
1Inner Mongolia University, China 2The Chinese University of Hong Kong, Shenzhen, China
hyfwalker@163.com, liurui imu@163.com, csggl@imu.edu.cn, haizhouli@cuhk.edu.cn
ABSTRACT
Conversational Text-to-Speech (TTS) aims to synthesize an
utterance with the right linguistic and affective prosody in a
conversational context. In prior work, the correlation between
the current utterance and the dialogue history at the utterance
level was used to improve the expressiveness of synthesized speech.
However, the fine-grained information in the dialogue history
at the word level also has an important impact on the prosodic
expression of an utterance, which has not been well studied
in prior work. Therefore, we propose a novel expressive
conversational TTS model, termed FCTalker, that learns
fine- and coarse-grained context dependencies simultaneously
during speech generation. Specifically, FCTalker
includes fine- and coarse-grained encoders to exploit
word- and utterance-level context dependencies. To model
the word-level dependencies between an utterance and its
dialogue history, the fine-grained dialogue encoder is built
on top of a dialogue BERT model. The experimental results
show that the proposed method outperforms all baselines
and generates more expressive speech that is contextually
appropriate. We release the source code at:
https://github.com/walker-hyf/FCTalker
Index Terms—Conversational Text-to-Speech (TTS),
Fine and Coarse Grained, Context, Expressive
1. INTRODUCTION
In conversational Text-to-Speech (TTS), we take the
interaction history between two speakers into account and
generate expressive speech for a target speaker [1, 2]. This
technique is highly demanded in the deployment of intelligent
agents [3, 4].
With the advent of deep learning, neural TTS [5–8],
e.g. Tacotron [5, 6] and FastSpeech [7, 8] based models, has
achieved remarkable improvements over traditional statistical
parametric speech synthesis methods [9, 10] in terms of
speech quality. However, the prosodic rendering of neural
TTS in a conversational context remains a challenge.

*: Corresponding author.
This research was funded by the High-level Talents Introduction Project
of Inner Mongolia University (No. 10000-22311201/002) and the Young
Scientists Fund of the National Natural Science Foundation of China (NSFC)
(No. 62206136).

A: Is that trophy yours?
B: Yes, we just won first place in the basketball game.
A: Wow, that's awesome! Congratulations!

Fig. 1. An example of word-level context dependencies in
a conversation, in which the blue words in the dialogue
history have a direct effect on the prosodic expression of the
orange words in the current utterance.
The attempts at conversational TTS can be traced back to
the HMM era [11–13]. These systems make use of rich textual
information, such as dialogue acts [11] and extended context [13],
for expressive speech generation. However, these approaches
are limited by the need for manual annotation and by inadequate
dialogue representations. In the context of neural
TTS, Guo et al. [1] proposed a conversational context encoder
based on the Tacotron2 model to extract utterance-level prosody-
related information from the dialogue history. Cong et al. [2]
proposed a context-aware acoustic model that predicts
an utterance-level acoustic embedding according to the di-
alogue history. Mitsui et al. [14] exploited utterance-level
BERT encodings to predict conversational speaking styles with
spontaneous behavior during TTS synthesis. These studies
have advanced the state of the art in conversational TTS.
However, they did not exploit word-level information in
the dialogue history for the prosody rendering of the current
utterance.
Speech prosody is rendered at various segmental levels,
from syllable and lexical word to sentence [15, 16]. As shown in
Fig. 1, the blue words "trophy" and "won" are strong indicators
that determine the prosodic expression of the final response.
We also find that fine-grained, token-level information
plays a significant role in conversation-related studies, such
as multi-turn dialogue generation [16], conversational emotion
recognition [15], dialogue state tracking [17, 18], conversation
intent classification [19], etc. These studies simultaneously model
hierarchical contextual semantic dependencies, i.e. word- and
sentence-level, between the current utterance and its conversa-
tional history, and achieve performance gains.
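The hierarchical idea behind such models can be illustrated with a minimal numerical sketch. The following toy example (random embeddings and a simple dot-product attention and mean-pooling fusion; it is an assumption for illustration, not the authors' actual encoder) shows the two granularities: each word of the current utterance attends over all history words (fine-grained), while each history utterance is mean-pooled into a single vector (coarse-grained), and both views are fused with the current words.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Hypothetical word embeddings: two history utterances and the current one.
history_words = [rng.normal(size=(5, d)), rng.normal(size=(7, d))]
current_words = rng.normal(size=(4, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Fine-grained view: each current word attends over all history words.
hist_flat = np.concatenate(history_words, axis=0)      # (12, d)
scores = current_words @ hist_flat.T / np.sqrt(d)      # (4, 12)
fine_context = softmax(scores) @ hist_flat             # (4, d)

# Coarse-grained view: one summary vector per history utterance.
coarse_context = np.stack([u.mean(axis=0) for u in history_words])  # (2, d)

# Fuse both views with the current words (here: plain concatenation of
# the per-word fine context and a broadcast coarse summary).
fused = np.concatenate(
    [current_words,
     fine_context,
     np.tile(coarse_context.mean(axis=0), (len(current_words), 1))],
    axis=1)
print(fused.shape)  # (4, 24)
```

In a real system the attention and pooling would be learned (e.g. with a pretrained dialogue BERT providing the word embeddings), but the shapes and information flow are the same: word-level context adds a per-word signal, utterance-level context adds a global one.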
arXiv:2210.15360v1 [cs.CL] 27 Oct 2022