FCTALKER: FINE AND COARSE GRAINED CONTEXT MODELING FOR EXPRESSIVE
CONVERSATIONAL SPEECH SYNTHESIS
Yifan Hu1, Rui Liu1,*, Guanglai Gao1, Haizhou Li2
1Inner Mongolia University, China 2The Chinese University of Hong Kong, Shenzhen, China
hyfwalker@163.com, liurui_imu@163.com, csggl@imu.edu.cn, haizhouli@cuhk.edu.cn
ABSTRACT
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. The correlation between the current utterance and the dialogue history at the utterance level has been used to improve the expressiveness of synthesized speech. However, fine-grained, word-level information in the dialogue history also has an important impact on the prosodic expression of an utterance, and it has not been well studied in prior work. Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, which learns fine- and coarse-grained context dependencies simultaneously during speech generation. Specifically, FCTalker includes fine- and coarse-grained encoders that exploit word- and utterance-level context dependencies. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. The experimental results show that the proposed method outperforms all baselines and generates more expressive speech that is contextually appropriate. We release the source code at: https://github.com/walker-hyf/FCTalker
Index Terms: Conversational Text-to-Speech (TTS), Fine and Coarse Grained, Context, Expressive
1. INTRODUCTION
In conversational Text-to-Speech (TTS), we take the interaction history between two speakers into account and generate expressive speech for a target speaker [1, 2]. This technique is in high demand in the deployment of intelligent agents [3, 4].
With the advent of deep learning, neural TTS [5–8], e.g. Tacotron [5, 6] and FastSpeech [7, 8] based models, has achieved remarkable performance over traditional statistical parametric speech synthesis methods [9, 10] in terms of speech quality. However, the prosodic rendering of neural TTS in a conversational context remains a challenge.
*: Corresponding author.
This research was funded by the High-level Talents Introduction Project
of Inner Mongolia University (No. 10000-22311201/002) and the Young
Scientists Fund of the National Natural Science Foundation of China (NSFC)
(No. 62206136).
A: Is that trophy yours?
B: Yes, we just won first place in the basketball game.
A: Wow, that's awesome! Congratulations!
Fig. 1. An example of word-level context dependencies in a conversation, in which the blue words in the conversation history have a direct effect on the prosodic expression of the orange words in the current utterance.
The attempts at conversational TTS can be traced back to the HMM era [11–13]. They make use of rich textual information, such as dialogue acts [11] and extended context [13], for expressive speech generation. However, these approaches are limited by the need for manual annotation and by inadequate dialogue representations. In the context of neural TTS, Guo et al. [1] proposed a conversation context encoder based on the Tacotron2 model to extract utterance-level prosody-related information from the dialogue history. Cong et al. [2] proposed a context-aware acoustic model that predicts the utterance-level acoustic embedding according to the dialogue history. Mitsui et al. [14] exploited utterance-level BERT encodings to predict conversational speaking styles with spontaneous behavior during TTS synthesis. These studies have advanced the state of the art in conversational TTS. However, they did not exploit the word-level information in the dialogue history for the prosody rendering of the current utterance.
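To make this coarse-grained, utterance-level strategy concrete, the sketch below shows one common way such context encoders are built; it is not the authors' released code, and all module choices and dimensions are assumptions made here for illustration. Each history utterance is pooled into a single sentence vector, and a GRU summarizes the utterance sequence into one conversation-level embedding that would condition the acoustic model.

```python
import torch
import torch.nn as nn

class UtteranceContextEncoder(nn.Module):
    """Coarse-grained context encoder: one vector per history utterance,
    summarized into a single conversation-level conditioning embedding.
    Dimensions and module choices are illustrative, not from the paper."""

    def __init__(self, word_dim=768, utt_dim=256, ctx_dim=256):
        super().__init__()
        self.utt_proj = nn.Linear(word_dim, utt_dim)   # word space -> utterance space
        self.ctx_gru = nn.GRU(utt_dim, ctx_dim, batch_first=True)

    def forward(self, history_word_embs):
        # history_word_embs: list of [T_i, word_dim] tensors, one per history
        # utterance (e.g., word embeddings from a frozen BERT encoder).
        utt_vecs = [self.utt_proj(w.mean(dim=0)) for w in history_word_embs]
        utt_seq = torch.stack(utt_vecs).unsqueeze(0)   # [1, N_utt, utt_dim]
        _, h_n = self.ctx_gru(utt_seq)                 # h_n: [1, 1, ctx_dim]
        return h_n.squeeze(0).squeeze(0)               # conversation context vector

if __name__ == "__main__":
    # Two history utterances with 6 and 9 words each.
    history = [torch.randn(6, 768), torch.randn(9, 768)]
    ctx = UtteranceContextEncoder()(history)
    print(ctx.shape)  # torch.Size([256])
```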
Speech prosody is rendered at various segmental levels, from syllable and lexical word to sentence [15, 16]. As shown in Fig. 1, the blue words "trophy" and "won" are strong indicators that determine the prosodic expression of the final response. We also find that fine-grained, token-level information plays a significant role in conversation-related studies, such as multi-turn dialogue generation [16], conversational emotion recognition [15], dialogue state tracking [17, 18], and conversation intent classification [19]. These studies simultaneously model hierarchical contextual semantic dependencies, i.e. at the word and sentence levels, between the current utterance and its conversational history, and achieve performance gains.
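As one plausible way to add the missing word-level link (purely illustrative here, not the paper's exact architecture), the sketch below lets each word of the current utterance attend over all words of the dialogue history, e.g. embeddings produced by a BERT-style encoder, so that history tokens such as "trophy" and "won" can directly influence the representation of each word in the response. The resulting word features could then be fused with an utterance-level context vector like the one sketched above; all names and hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn

class WordLevelContextAttention(nn.Module):
    """Fine-grained context module: each word of the current utterance
    attends over all words in the dialogue history. The fusion scheme and
    hyper-parameters are illustrative assumptions, not the paper's spec."""

    def __init__(self, word_dim=768, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(word_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(word_dim)

    def forward(self, cur_words, hist_words):
        # cur_words:  [1, T_cur, word_dim]  current-utterance word embeddings
        # hist_words: [1, T_hist, word_dim] concatenated history word embeddings
        attended, weights = self.cross_attn(cur_words, hist_words, hist_words)
        fused = self.norm(cur_words + attended)   # residual fusion
        return fused, weights  # weights show which history words matter per response word

if __name__ == "__main__":
    cur = torch.randn(1, 7, 768)     # e.g., "Wow, that's awesome! Congratulations!"
    hist = torch.randn(1, 15, 768)   # word embeddings of both history turns
    out, attn = WordLevelContextAttention()(cur, hist)
    print(out.shape, attn.shape)     # [1, 7, 768] [1, 7, 15]
```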