
REDPEN: REGION- AND REASON-ANNOTATED DATASET OF UNNATURAL SPEECH
Kyumin Park1, Keon Lee1,2,∗, Daeyoung Kim1, Dongyeop Kang3
1School of Computing, KAIST, Republic of Korea
2KRAFTON Inc., Republic of Korea
3University of Minnesota, USA
{pkm9403, kimd}@kaist.ac.kr, keonlee@krafton.com, dongyeop@umn.edu
∗This work was done while Keon Lee was an M.S. student at KAIST.
ABSTRACT
Even with recent advances in speech synthesis models, the evaluation of such models relies purely on human judgement as a single naturalness score, such as the Mean Opinion Score (MOS). A score-based metric gives no further information about which parts of the speech are unnatural or why human judges believe they are unnatural. We present a novel speech dataset, RedPen, with human annotations of unnatural speech regions and their corresponding reasons. RedPen consists of 180 synthesized speech samples whose unnatural regions are annotated by crowd workers; these regions are then reasoned about and categorized by error type, such as voice trembling and background noise. We find that our dataset explains unnatural speech regions better than model-driven unnaturalness prediction. Our analysis also shows that each model exhibits a different distribution of error types. In sum, our dataset demonstrates that various error regions and error types lie under a single naturalness score. We believe that our dataset will shed light on the evaluation and development of more interpretable speech models. Our dataset will be publicly available upon acceptance.
Index Terms—Speech synthesis, Evaluation, Mean opinion score
1. INTRODUCTION
Naturalness, i.e., whether output speech sounds like a human, is one of the most influential factors in evaluating speech models [1]. The naturalness of speech models is often quantified by human evaluation with single-score metrics such as the Mean Opinion Score (MOS), or the human judgement is estimated by deep learning-based predictors [2, 3, 4].
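For reference, the MOS of an utterance is simply the arithmetic mean of the scores assigned by individual raters, typically on a five-point scale:

$$\mathrm{MOS} = \frac{1}{N}\sum_{i=1}^{N} s_i, \qquad s_i \in \{1, \dots, 5\},$$

where $s_i$ is the naturalness score given by the $i$-th of $N$ raters. Everything a rater noticed about the utterance is thus collapsed into this single number.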
When scoring naturalness, human judges listen to each audio clip and assign it a single score. However, a single, unified score conveys only the overall naturalness of the entire speech; it does not reveal where the unnaturalness occurs or why the score was given. In fact, several factors affect human perception of naturalness, such as speech style [5] or acoustic features including pitch and energy [6, 7].

Fig. 1: A sample annotation of unnatural regions. There is a discrepancy between the human-annotated unnatural regions (red), with their reason categories, and the salient regions predicted by the model (green).
To better interpret the model's behavior and find such factors contributing to the final naturalness, various interpretation methods for deep learning models have been studied [8, 9]. However, a recent study shows that these saliency-based measurements cannot fully represent human perception in detecting linguistic styles [10]. This motivates us to collect humans' actual perceptions of speech naturalness and to propose them as interpretable measurements for speech evaluation.
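As a rough illustration of such saliency-based measurements, the following is a minimal sketch of gradient-based saliency for a MOS predictor; the function name, the assumption of a differentiable predictor `model` mapping a mel spectrogram of shape (1, T, n_mels) to a scalar score, and the aggregation choice are all illustrative, not the exact setup of [8, 9]:

    import torch

    def mos_saliency(model, mel):
        # Gradient-based saliency: how strongly each time frame of the
        # input influences the predicted MOS. `model` is assumed to be a
        # differentiable MOS predictor; `mel` has shape (1, T, n_mels).
        mel = mel.clone().detach().requires_grad_(True)
        score = model(mel)                 # predicted MOS (scalar)
        score.sum().backward()             # gradients of score w.r.t. input
        # Aggregate gradient magnitude over mel bins -> per-frame saliency.
        saliency = mel.grad.abs().sum(dim=-1).squeeze(0)  # shape (T,)
        return saliency / (saliency.max() + 1e-8)         # normalize to [0, 1]

Frames with high saliency are what such a method would flag as driving the score; the point above is that these need not coincide with the regions humans actually perceive as unnatural.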
In this paper, we introduce RedPen, a region- and reason-annotated dataset of synthesized speech. We developed an annotation tool that asks people to mark unnatural regions in each audio clip. From these annotations, we additionally categorize each annotated region into common error types in speech synthesis, as shown in Figure 1. Our analyses find that our region annotations capture unnatural regions better than the interpretations predicted by a MOS prediction model. Moreover, the reasons for unnaturalness vary across speech synthesis models. In our human evaluation, our dataset is judged to provide better explanations of naturalness than the previous system. Through our dataset, we show evidence that several factors lie under a single naturalness score; a hypothetical annotation record is sketched below.
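To make the dataset structure concrete, a RedPen-style annotation record might look like the following; the field names, time units, and model name are illustrative assumptions, not the released schema:

    # Hypothetical annotation record; field names and values are
    # illustrative assumptions, not the released schema.
    annotation = {
        "audio_id": "sample_0001.wav",   # synthesized utterance
        "model": "tts_model_a",          # synthesis system (hypothetical)
        "mos": 3.2,                      # overall naturalness score
        "unnatural_regions": [
            {"start_sec": 1.40, "end_sec": 1.95, "reason": "voice trembling"},
            {"start_sec": 3.10, "end_sec": 3.60, "reason": "background noise"},
        ],
    }

The reason labels here are drawn from the error types named above (e.g., voice trembling, background noise).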