
A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE GENERATION WITH LINGUISTIC MODIFICATION

Anupama Chingacham, Vera Demberg, Dietrich Klakow

Saarland Informatics Campus, Saarland University, Germany

© Copyright 2023 IEEE. Published in the 2022 IEEE Spoken Language Technology Workshop (SLT) (SLT 2022), scheduled for 9-12 January 2023 in D... Qatar. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

ABSTRACT

In noisy environments, speech can be hard for humans to understand. Spoken dialog systems can help to enhance the intelligibility of their output, either by modifying the speech synthesis (e.g., imitating Lombard speech) or by optimizing the language generation. Here we focus on the second type of approach, by which an intended message is realized with words that are more intelligible in a specific noisy environment. By conducting a speech perception experiment, we created a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing. We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB. Our analysis of the data shows that the intelligibility differences between paraphrases are mainly driven by noise-robust acoustic cues. Furthermore, we propose an intelligibility-aware paraphrase ranking model, which outperforms baseline models with a relative improvement of 31.37% at SNR -5 dB.

Index Terms—noise-adaptive speech, paraphrases

1. INTRODUCTION

Over the past decade, speech-based interfaces have become an increasingly common mode of human-machine interaction. Today, spoken dialog systems (SDS) are part of several systems such as those used for medical assistance, language learning, navigation and so on. To improve the performance of SDS in the noisy conditions of daily life, earlier studies have largely focused on speech enhancement for better automatic speech recognition (ASR). There is considerably less work on speech synthesis techniques that improve human recognition in noise. However, for SDS to behave in a more human-like way, speech synthesis needs to adapt to noisy conditions.

Prior work has shown that acoustic modifications can alter the intelligibility of speech uttered in an adverse listening condition [1]. Synthesis of Lombard speech [2, 3], vowel space expansion [4], speech rate reduction and insertion of additional pauses [5] are some of the existing algorithmic solutions to reduce the impact of noise on the intelligibility of synthesized speech. However, earlier studies have also highlighted the counter-productive effect of signal distortions introduced by some noise-reduction techniques [1, 6]. On the other hand, linguistic modifications are seldom leveraged by SDS to improve utterance intelligibility, even though it is well known that speech perception in noise is significantly influenced by linguistic characteristics such as predictability [7, 8], word familiarity [9], neighborhood density [10], syntactic structure [11, 12] and word order [13]. In particular, it was shown earlier that different types of noise affect some speech sounds more than others [14, 15, 16, 17]. This opens the possibility of specifically choosing lexical items that are less affected by the interference of a specific type of noise.

To this end, we propose an alternative strategy based on linguistic modifications to improve speech perception in noise. More precisely, we utilize the potential of sentential paraphrases to represent the meaning of a message using lexical forms which exhibit better noise-robustness. One of the earlier approaches to utilizing linguistic forms to reduce word misperceptions in noise consisted of modeling phoneme confusions and pre-selecting less confusable words [18]. Although that model predicts the position of potential confusions in short phrases (which are formed from a closed vocabulary), the applicability of this approach to conversational data has not yet been studied. Rational strategies like lexical/phrasal repetitions [1, 19] and insertion of clarification expressions [20] have also shown the possibility of improving speech perception in noise without acoustic modifications. However, the scope of such template-based strategies is limited, as they may lead to the generation of less natural-sounding and monotonous utterances. Compared to those earlier attempts, our work is more closely related to the study on rephrasing-based intelligibility enhancement [21]. Zhang et al. (2013) focused on the development of an objective measure to distinguish phrases based on their intelligibility in noise.

In this paper, we concentrate on studying the impact of paraphrasing on utterance intelligibility at different levels of a single noise type. The current work is inspired by the earlier finding that lexical intelligibility in noise can be improved by replacing a word with its noise-robust synonym [22]. While this is an interesting finding, a sentence intelligibility improvement strategy based solely on lexical replacements is constrained by the availability of synonyms that fit the context of a given utterance. Hence, a paraphrase generation model was employed to include more generic types of sentential paraphrases. Speech perception experiments were conducted at three different levels of babble noise: SNR 5 dB, 0 dB, and -5 dB. We collected data from 90 native English speakers regarding their comprehension of 900 paraphrase pairs in the presence of babble noise. To date, this constitutes the largest available corpus of its kind.¹

arXiv:2210.10252v1 [cs.CL] 19 Oct 2022

Fig. 1. Architecture of the proposed solution to generate noise-adaptive utterances using linguistic modifications.
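The three listening conditions (SNR 5 dB, 0 dB and -5 dB) can be illustrated with a minimal sketch of mixing a speech signal with noise at a target signal-to-noise ratio. This is an assumption about a standard procedure, not the authors' stimulus pipeline: the function names `rms`, `mix_at_snr` and `measure_snr_db` are hypothetical, and the sine wave and uniform random samples merely stand in for recorded speech and babble noise.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech after scaling the noise to the requested SNR (dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR_dB = 20 * log10(rms(speech) / rms(gain * noise)); solve for gain.
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and an (already scaled) noise signal."""
    return 20 * math.log10(rms(speech) / rms(noise))


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder signals (assumption): a sine "speech" and uniform random "noise".
    speech = [math.sin(0.05 * i) for i in range(16000)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]
    for target in (5, 0, -5):
        mixed = mix_at_snr(speech, noise, target)
        scaled_noise = [m - s for m, s in zip(mixed, speech)]
        print(target, measure_snr_db(speech, scaled_noise))
```

A lower SNR means a larger noise gain, so at -5 dB the noise carries more energy than the speech itself.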

Further, we investigated the influence of both linguistic and acoustic cues on intelligibility differences between paraphrases in noise. We utilized the speech intelligibility metric STOI [23] to capture the amount of acoustic cues that survived the energetic masking in a noise-contaminated utterance. This metric also indicates the potential of listening-in-the-dips [24], as noise-robust acoustic cues capture the glimpses of the actual speech. Additionally, a pre-trained language model was used for estimating the predictability offered by linguistic cues in an utterance. Our modeling experiments reveal that the impact of paraphrasing on utterance intelligibility increases as the noise level increases. We also found that the observed gain in intelligibility is mainly introduced by paraphrases with noise-robust acoustic cues. For instance, consider the following paraphrases (s1, s2), which are similar in linguistic predictability yet distinct in intelligibility in the presence of babble noise:

s1: it never hurts to have some kind of a grounding in law.
s2: it doesn’t hurt to have some kind of legal education.

More concretely, at SNR 0 dB, listeners perceived s2 1.2 times better than s1. Further analysis of this distinction in intelligibility showed that more acoustic cues survived the energetic masking in s2 than in s1. Subsequently, our final step consists of demonstrating that it is possible to automatically predict which paraphrase in a pair will be more robust to noise (see Section 6). As shown in Figure 1, such ranking models could further be deployed in the language generation module of SDS to generate noise-adaptive utterances without signal distortions. Here, we assume that the noise in the user’s environment is either known a priori or can be estimated.
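To make a comparison such as "s2 was perceived 1.2 times better than s1" concrete, the sketch below ranks a paraphrase pair from listener transcripts. It rests on assumptions that this section does not confirm: intelligibility is approximated here as the fraction of reference words that a listener reproduced, and the function names `tokenize`, `word_recognition_rate`, `mean_intelligibility` and `rank_pair` are hypothetical. It is not the ranking model of Section 6, which predicts the ranking without listener data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_recognition_rate(reference, transcript):
    """Fraction of reference words that also occur in the listener's transcript."""
    ref = tokenize(reference)
    if not ref:
        raise ValueError("reference sentence has no words")
    heard = Counter(tokenize(transcript))
    hits = 0
    for word in ref:
        if heard[word] > 0:  # each transcribed word can match only once
            hits += 1
            heard[word] -= 1
    return hits / len(ref)


def mean_intelligibility(reference, transcripts):
    """Average recognition rate over all listeners of one sentence."""
    rates = [word_recognition_rate(reference, t) for t in transcripts]
    return sum(rates) / len(rates)


def rank_pair(s1, transcripts1, s2, transcripts2):
    """Return the more intelligible sentence and the ratio better/worse."""
    score1 = mean_intelligibility(s1, transcripts1)
    score2 = mean_intelligibility(s2, transcripts2)
    better, high, low = (s1, score1, score2) if score1 >= score2 else (s2, score2, score1)
    if low == 0:
        ratio = float("inf") if high > 0 else 1.0
    else:
        ratio = high / low
    return better, ratio
```

Calling `rank_pair` with the transcripts collected for each sentence of a pair returns the better-perceived paraphrase together with the ratio of the two mean scores.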

¹ Experiment data is released with an open-source license at: https://github.com/SFB1102/A4-ParaphrasesinNoise.git

s1: They seem to give more of just the facts than opinions.
s2: They give more information than opinions.
s3: They seem to give more facts than opinions.

s1: You never hear about it really in the big ones.
s2: You don’t hear much about it in the big ones.
s3: In the big ones you don’t hear about it.

s1: It was a very close game and hard fought game.
s2: The game was close and hard fought.
s3: It was a very close game.

Table 1. A few examples of (s1, s2, s3) in the PiN dataset.
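Section 2 describes how each triplet is expanded into three pairs and how about 300 triplets are randomly split into three groups of 100, one per listening condition. The sketch below illustrates that bookkeeping under stated assumptions: the helper names `triplet_to_pairs` and `split_by_condition`, the fixed random seed and the mapping of groups to SNR values are illustrative choices, not details taken from the paper.

```python
import itertools
import random


def triplet_to_pairs(triplet):
    """Expand (s1, s2, s3) into the pairs (s1, s2), (s1, s3), (s2, s3)."""
    return list(itertools.combinations(triplet, 2))


def split_by_condition(triplets, conditions=(5, 0, -5), seed=0):
    """Randomly assign triplets to equally sized groups, one per SNR condition.

    Triplets left over when the count is not divisible by the number of
    conditions are dropped (an assumption made for this sketch).
    """
    shuffled = list(triplets)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // len(conditions)
    return {
        snr: shuffled[i * size:(i + 1) * size]
        for i, snr in enumerate(conditions)
    }
```

With 300 triplets this yields 100 triplets per condition and 3 x 300 = 900 paraphrase pairs in total, which matches the figures reported for the dataset.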

2. SENTENTIAL PARAPHRASES

Paraphrases are phrases or sentences which represent similar semantics using different wording. However, the notion comes with the difficulty that two different sentences rarely have exactly the same meaning in all contexts; hence paraphrases, especially at the sentence level [25], typically only approximate the original meaning. On the one hand, generating sentential paraphrases which are exactly equivalent in semantics leads to trivial patterns such as word order changes or minimal lexical substitutions among paraphrases [26]. This, however, can mean that such paraphrases differ only minimally in their intelligibility in noise. On the other hand, generation of non-trivial paraphrases introduces better lexical/syntactic diversity, and may hence have larger effects on intelligibility, but this in turn demands more scrutiny for semantic similarity [27].

In this paper, we hence explore the effect of paraphrases that approximate semantic equivalence instead of strict semantic equivalence. In order to include a large variety of paraphrases, stimulus sentences were generated using a pre-trained text generation model [28, 29] which was fine-tuned on several paraphrase datasets such as Quora Question Pairs and PAWS [30], among others. As input sentences to the paraphrasing model, we selected a list of short sentences (10-12 words) from the dialogue corpus Switchboard [31]. After paraphrase generation, we employed automatic filtering to select the top two paraphrases for each input sentence, based on a semantic similarity score [32]. This resulted in a list of paraphrase triplets (s1, s2, s3), consisting of different paraphrase types formed by lexical replacements, changes in syntactic structure, etc.
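The automatic filtering step can be sketched as a top-k selection over generated candidates. Note the assumptions: the semantic similarity scorer of [32] is not reproduced here, so a simple token-overlap (Jaccard) score serves as a placeholder, and the names `jaccard_similarity` and `top_k_paraphrases` are hypothetical. Any scorer with the same two-argument signature could be passed in instead.

```python
def jaccard_similarity(a, b):
    """Placeholder similarity: token-set overlap (NOT the scorer used in [32])."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def top_k_paraphrases(source, candidates, k=2, similarity=jaccard_similarity):
    """Keep the k candidates most similar to the source sentence.

    Exact duplicates and copies of the source are discarded first, so the
    result forms a triplet (source, best, second best) when k == 2.
    """
    unique = [
        c for c in dict.fromkeys(candidates)
        if c.lower() != source.lower()
    ]
    ranked = sorted(unique, key=lambda c: similarity(source, c), reverse=True)
    return ranked[:k]
```

Keeping the scorer as a parameter separates the ranking logic from the choice of similarity measure, which the manual verification step can later override.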

Since existing paraphrasing models lack domain knowledge of spoken data, a manual selection was performed to ensure the quality of the generated paraphrases in terms of semantic equivalence. Every paraphrase triplet was converted to three pairs: (s1, s2), (s2, s3) and (s1, s3). Then, every paraphrase pair was verified for closeness in semantics. We identified about 300 triplets that exhibited approximate semantic similarity in all three pairs. Those triplets were randomly split into three groups of 100 (one for each listening environment). Hereafter, we refer to this dataset as paraphrases in noise (PiN). Table 1 lists a few samples in PiN. To ensure that the sentential paraphrases in the PiN dataset are
