REDPEN: REGION- AND REASON-ANNOTATED DATASET OF UNNATURAL SPEECH
Kyumin Park1, Keon Lee12, Daeyoung Kim1, Dongyeop Kang3
1School of Computing, KAIST, Republic of Korea
2KRAFTON Inc., Republic of Korea
3University of Minnesota, US
{pkm9403, kimd}@kaist.ac.kr, keonlee@krafton.com, dongyeop@umn.edu
ABSTRACT
Even with recent advances in speech synthesis models, the
evaluation of such models is based purely on human judgement,
expressed as a single naturalness score such as the Mean Opinion
Score (MOS). Such a score-based metric gives no further
information about which parts of the speech are unnatural
or why human judges believe they are unnatural. We present
a novel speech dataset, RedPen, with human annotations of
unnatural speech regions and their corresponding reasons.
RedPen consists of 180 synthesized speech samples with unnatural
regions annotated by crowd workers; each region is
then assigned a reason and categorized by error type, such as voice
trembling and background noise. We find that our dataset
explains unnatural speech regions better than
model-driven unnaturalness prediction. Our analysis also
shows that each model exhibits different error types.
In summary, our dataset shows that various error regions
and types can underlie a single naturalness score.
We believe that our dataset will shed light on
the evaluation and development of more interpretable speech
models in the future. Our dataset will be publicly available
upon acceptance.
Index Terms: Speech synthesis, Evaluation, Mean opinion score
1. INTRODUCTION
Naturalness, that is, whether output speech sounds like a human,
is one of the most influential factors in evaluating
speech models [1]. The naturalness of speech models is often
quantified by human evaluation with single-score metrics
such as the Mean Opinion Score (MOS), or human judgement
is estimated by a deep learning-based predictor [2, 3, 4].
When scoring naturalness, human judges listen to each
audio sample and give it a single score. However, a single,
unified score conveys only the overall naturalness of the
entire utterance; it does not indicate where or why the score
was determined. In fact, several factors affect human
perception of naturalness, such as speech style [5] or
acoustic features including pitch and energy [6, 7].

This work was done when Keon Lee was with KAIST as an MS student.

Fig. 1: A sample annotation of unnatural regions. The
human-annotated unnatural regions (red), shown with their reason
categories, differ from the salient regions predicted
by the model (green).

To better interpret the model’s behavior and find the factors
that contribute to the final naturalness, various interpretation
methods for deep learning models have been studied [8,
9]. However, a recent study shows that these saliency-based
measurements cannot fully represent human perception in
detecting linguistic styles [10]. This motivates us to collect
actual human perceptions of speech naturalness and propose
them as interpretable measurements for speech evaluation.
In this paper, we introduce RedPen, a region- and reason-
annotated dataset of synthesized speech. We developed a
tool that asks people to annotate unnatural regions in each
audio sample. From the annotations, we additionally categorize each
annotated region into common error types in speech synthesis,
as shown in Figure 1. Our analyses find that our region
annotation represents unnatural regions better than the
interpretation predicted from the MOS prediction model. Also,
the reasons for unnaturalness vary across speech synthesis
models. In our human evaluation, our dataset is judged
to provide better explanations of naturalness than the
previous system. Through our dataset, we show evidence
that several factors underlie a single naturalness score.
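The contrast between a single naturalness score and region- and reason-level annotation can be illustrated with a minimal sketch. The record layout below (the Region class and its field names) is hypothetical and is not the released RedPen format; the ratings, timestamps, and reason labels are made-up values for illustration only. The only fixed fact is that MOS is the arithmetic mean of the listeners' ratings.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Region:
    """One annotated unnatural span (hypothetical layout, not the RedPen schema)."""
    start_sec: float
    end_sec: float
    reason: str  # e.g. "voice trembling" or "background noise"


def mean_opinion_score(ratings):
    """Return the MOS, i.e. the arithmetic mean of the listeners' ratings."""
    return mean(ratings)


def reason_counts(regions):
    """Count how many annotated regions fall under each reason category."""
    return Counter(region.reason for region in regions)


# Made-up values for one utterance; not taken from the dataset.
ratings = [3, 4, 3, 2]
regions = [
    Region(0.8, 1.3, "voice trembling"),
    Region(2.1, 2.6, "background noise"),
    Region(3.0, 3.4, "voice trembling"),
]

mos = mean_opinion_score(ratings)   # one number for the whole utterance
counts = reason_counts(regions)     # where-and-why detail the score hides
print(mos)
print(dict(counts))
```

In this sketch the score collapses the utterance into one value, while the region list keeps the locations and reasons that a listener would otherwise have to report separately.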
arXiv:2210.14406v1 [eess.AS] 26 Oct 2022