
FCTALKER: FINE AND COARSE GRAINED CONTEXT MODELING FOR EXPRESSIVE
CONVERSATIONAL SPEECH SYNTHESIS
Yifan Hu1, Rui Liu1,∗, Guanglai Gao1, Haizhou Li2
1Inner Mongolia University, China 2The Chinese University of Hong Kong, Shenzhen, China
hyfwalker@163.com, liurui imu@163.com, csggl@imu.edu.cn, haizhouli@cuhk.edu.cn
ABSTRACT
Conversational Text-to-Speech (TTS) aims to synthesize an
utterance with the right linguistic and affective prosody in a
conversational context. In prior work, the correlation between
the current utterance and the dialogue history at the utterance
level was used to improve the expressiveness of synthesized speech.
However, the fine-grained information in the dialogue history
at the word level also has an important impact on the prosodic
expression of an utterance, which has not been well studied
in prior work. Therefore, we propose a novel expressive
conversational TTS model, termed FCTalker, that learns
fine- and coarse-grained context dependencies simultaneously
during speech generation. Specifically, FCTalker
includes fine- and coarse-grained encoders to exploit
word- and utterance-level context dependencies. To model
the word-level dependencies between an utterance and its
dialogue history, the fine-grained dialogue encoder is built
on top of a dialogue BERT model. The experimental results
show that the proposed method outperforms all baselines
and generates more expressive speech that is contextually
appropriate. We release the source code at:
https://github.com/walker-hyf/FCTalker
Index Terms—Conversational Text-to-Speech (TTS),
Fine and Coarse Grained, Context, Expressive
1. INTRODUCTION
In conversational Text-to-Speech (TTS), we take the
interaction history between two speakers into account and
generate expressive speech for a target speaker [1, 2]. This
technique is highly demanded in the deployment of intelligent
agents [3, 4].
With the advent of deep learning, neural TTS [5–8],
e.g. Tacotron [5, 6] and FastSpeech [7, 8] based models, has
achieved remarkable improvements over traditional statistical
parametric speech synthesis methods [9, 10] in terms of
speech quality. However, the prosodic rendering of neural
TTS in a conversational context remains a challenge.

*: Corresponding author.
This research was funded by the High-level Talents Introduction Project
of Inner Mongolia University (No. 10000-22311201/002) and the Young
Scientists Fund of the National Natural Science Foundation of China (NSFC)
(No. 62206136).

A: Is that trophy yours?
B: Yes, we just won first place in the basketball game.
A: Wow, that's awesome! Congratulations!

Fig. 1. An example of word-level context dependencies in
a conversation, in which the blue words in the dialogue
history have a direct effect on the prosodic expression of the
orange words in the current utterance.
The attempts at conversational TTS can be traced back to
the HMM era [11–13]. These systems make use of rich textual
information, such as dialogue acts [11] and extended context [13],
for expressive speech generation. However, these approaches
are limited by the need for manual annotation and by inadequate
dialogue representations. In the context of neural
TTS, Guo et al. [1] proposed a conversational context encoder
based on the Tacotron2 model to extract utterance-level prosody-
related information from the dialogue history. Cong et al. [2]
proposed a context-aware acoustic model that predicts
an utterance-level acoustic embedding according to the di-
alogue history. Mitsui et al. [14] exploited utterance-level
BERT encodings to predict conversational speaking styles with
spontaneous behavior during TTS synthesis. These studies
have advanced the state of the art in conversational TTS.
However, they did not exploit word-level information in
the dialogue history for the prosody rendering of the current
utterance.
Speech prosody is rendered at various segmental levels,
from syllable and lexical word to sentence [15, 16]. As shown in
Fig. 1, the blue words "trophy" and "won" are strong indicators
that determine the prosodic expression of the final response.
We also find that fine-grained, token-level information
plays a significant role in conversation-related studies, such
as multi-turn dialogue generation [16], conversational emotion
recognition [15], dialogue state tracking [17, 18], conversation
intent classification [19], etc. These studies simultaneously model
hierarchical contextual semantic dependencies, i.e. word- and
sentence-level, between the current utterance and its conversa-
tional history, and achieve performance gains.
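The hierarchical idea behind such models can be illustrated with a minimal numerical sketch. The following toy example (random embeddings and a simple dot-product attention and mean-pooling fusion; it is an assumption for illustration, not the authors' actual encoder) shows the two granularities: each word of the current utterance attends over all history words (fine-grained), while each history utterance is mean-pooled into a single vector (coarse-grained), and both views are fused with the current words.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Hypothetical word embeddings: two history utterances and the current one.
history_words = [rng.normal(size=(5, d)), rng.normal(size=(7, d))]
current_words = rng.normal(size=(4, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Fine-grained view: each current word attends over all history words.
hist_flat = np.concatenate(history_words, axis=0)      # (12, d)
scores = current_words @ hist_flat.T / np.sqrt(d)      # (4, 12)
fine_context = softmax(scores) @ hist_flat             # (4, d)

# Coarse-grained view: one summary vector per history utterance.
coarse_context = np.stack([u.mean(axis=0) for u in history_words])  # (2, d)

# Fuse both views with the current words (here: plain concatenation of
# the per-word fine context and a broadcast coarse summary).
fused = np.concatenate(
    [current_words,
     fine_context,
     np.tile(coarse_context.mean(axis=0), (len(current_words), 1))],
    axis=1)
print(fused.shape)  # (4, 24)
```

In a real system the attention and pooling would be learned (e.g. with a pretrained dialogue BERT providing the word embeddings), but the shapes and information flow are the same: word-level context adds a per-word signal, utterance-level context adds a global one.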
arXiv:2210.15360v1 [cs.CL] 27 Oct 2022