
REDPEN: REGION- AND REASON-ANNOTATED DATASET OF UNNATURAL SPEECH
Kyumin Park1, Keon Lee1,2,∗, Daeyoung Kim1, Dongyeop Kang3
1School of Computing, KAIST, Republic of Korea
2KRAFTON Inc., Republic of Korea
3University of Minnesota, USA
{pkm9403, kimd}@kaist.ac.kr, keonlee@krafton.com, dongyeop@umn.edu
∗This work was done while Keon Lee was an M.S. student at KAIST.
ABSTRACT
Even with recent advances in speech synthesis models, the evaluation of such models relies purely on human judgement as a single naturalness score, such as the Mean Opinion Score (MOS). A score-based metric gives no further information about which parts of the speech are unnatural or why human judges believe they are unnatural. We present a novel speech dataset, RedPen, with human annotations of unnatural speech regions and their corresponding reasons. RedPen consists of 180 synthesized speech samples whose unnatural regions are annotated by crowd workers; these regions are then reasoned about and categorized by error type, such as voice trembling and background noise. We find that our dataset explains unnatural speech regions better than model-driven unnaturalness prediction. Our analysis also shows that each model exhibits a different distribution of error types. In sum, our dataset demonstrates that various error regions and error types lie under a single naturalness score. We believe that our dataset will shed light on the evaluation and development of more interpretable speech models. Our dataset will be publicly available upon acceptance.
Index Terms—Speech synthesis, Evaluation, Mean opinion score
1. INTRODUCTION
Naturalness, i.e., whether output speech sounds like a human, is one of the most influential factors in evaluating speech models [1]. The naturalness of speech models is often quantified by human evaluation with single-score metrics such as the Mean Opinion Score (MOS), or the human judgement is estimated by deep learning-based predictors [2, 3, 4].
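For reference, the MOS of an utterance is simply the arithmetic mean of the scores assigned by individual raters, typically on a five-point scale:

$$\mathrm{MOS} = \frac{1}{N}\sum_{i=1}^{N} s_i, \qquad s_i \in \{1, \dots, 5\},$$

where $s_i$ is the naturalness score given by the $i$-th of $N$ raters. Everything a rater noticed about the utterance is thus collapsed into this single number.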
When scoring naturalness, human judges listen to each audio clip and assign it a single score. However, a single, unified score conveys only the overall naturalness of the entire speech; it does not reveal where the unnaturalness occurs or why the score was given. In fact, several factors affect human perception of naturalness, such as speech style [5] or acoustic features including pitch and energy [6, 7].

Fig. 1: A sample annotation of unnatural regions. There is a discrepancy between the human-annotated unnatural regions (red), with their reason categories, and the salient regions predicted by the model (green).
To better interpret the model's behavior and find such factors contributing to the final naturalness, various interpretation methods for deep learning models have been studied [8, 9]. However, a recent study shows that these saliency-based measurements cannot fully represent human perception in detecting linguistic styles [10]. This motivates us to collect humans' actual perceptions of speech naturalness and to propose them as interpretable measurements for speech evaluation.
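As a rough illustration of such saliency-based measurements, the following is a minimal sketch of gradient-based saliency for a MOS predictor; the function name, the assumption of a differentiable predictor `model` mapping a mel spectrogram of shape (1, T, n_mels) to a scalar score, and the aggregation choice are all illustrative, not the exact setup of [8, 9]:

    import torch

    def mos_saliency(model, mel):
        # Gradient-based saliency: how strongly each time frame of the
        # input influences the predicted MOS. `model` is assumed to be a
        # differentiable MOS predictor; `mel` has shape (1, T, n_mels).
        mel = mel.clone().detach().requires_grad_(True)
        score = model(mel)                 # predicted MOS (scalar)
        score.sum().backward()             # gradients of score w.r.t. input
        # Aggregate gradient magnitude over mel bins -> per-frame saliency.
        saliency = mel.grad.abs().sum(dim=-1).squeeze(0)  # shape (T,)
        return saliency / (saliency.max() + 1e-8)         # normalize to [0, 1]

Frames with high saliency are what such a method would flag as driving the score; the point above is that these need not coincide with the regions humans actually perceive as unnatural.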
In this paper, we introduce RedPen, a region- and reason-annotated dataset of synthesized speech. We developed an annotation tool that asks people to mark unnatural regions in each audio clip. From these annotations, we additionally categorize each annotated region into common error types in speech synthesis, as shown in Figure 1. Our analyses find that our region annotations capture unnatural regions better than the interpretations predicted by a MOS prediction model. Moreover, the reasons for unnaturalness vary across speech synthesis models. In our human evaluation, our dataset is judged to provide better explanations of naturalness than the previous system. Through our dataset, we show evidence that several factors lie under a single naturalness score; a hypothetical annotation record is sketched below.
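To make the dataset structure concrete, a RedPen-style annotation record might look like the following; the field names, time units, and model name are illustrative assumptions, not the released schema:

    # Hypothetical annotation record; field names and values are
    # illustrative assumptions, not the released schema.
    annotation = {
        "audio_id": "sample_0001.wav",   # synthesized utterance
        "model": "tts_model_a",          # synthesis system (hypothetical)
        "mos": 3.2,                      # overall naturalness score
        "unnatural_regions": [
            {"start_sec": 1.40, "end_sec": 1.95, "reason": "voice trembling"},
            {"start_sec": 3.10, "end_sec": 3.60, "reason": "background noise"},
        ],
    }

The reason labels here are drawn from the error types named above (e.g., voice trembling, background noise).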