A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE
GENERATION WITH LINGUISTIC MODIFICATION
Anupama Chingacham, Vera Demberg, Dietrich Klakow
Saarland Informatics Campus, Saarland University, Germany
©Copyright 2023 IEEE. Published in the 2022 IEEE Spoken Language Technology Workshop (SLT) (SLT 2022), scheduled for 9-12 January 2023 in Doha,
Qatar. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating
new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained
from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA.
Telephone: + Intl. 908-562-3966.
ABSTRACT
In noisy environments, speech can be hard to understand for humans. Spoken dialog systems can help to enhance the intelligibility of their output, either by modifying the speech synthesis (e.g., imitating Lombard speech) or by optimizing the language generation. We here focus on the second type of approach, by which an intended message is realized with words that are more intelligible in a specific noisy environment. By conducting a speech perception experiment, we created a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing. We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB. Our analysis of the data shows that the intelligibility differences between paraphrases are mainly driven by noise-robust acoustic cues. Furthermore, we propose an intelligibility-aware paraphrase ranking model, which outperforms baseline models with a relative improvement of 31.37% at SNR -5 dB.
Index Terms: noise-adaptive speech, paraphrases
1. INTRODUCTION
Over the past decade, speech-based interfaces have become an increasingly common mode of human-machine interaction. Today, spoken dialog systems (SDS) are part of many applications, such as those used for medical assistance, language learning, and navigation. To improve the performance of SDS in the noisy conditions of daily life, earlier studies have largely focused on speech enhancement for better automatic speech recognition (ASR), whereas there is considerably less work on speech synthesis techniques that improve human recognition in noise. However, to make SDS behaviour more human-like, speech synthesis needs to be adaptive to the noisy conditions.
Prior work has shown that acoustic modifications can alter the intelligibility of speech uttered in adverse listening conditions [1]. Synthesis of Lombard speech [2, 3], vowel space expansion [4], speech rate reduction, and insertion of additional pauses [5] are some of the existing algorithmic solutions for reducing the impact of noise on the intelligibility of synthesized speech. However, earlier studies have also highlighted the counter-productive effect of signal distortions introduced by some noise-reduction techniques [1, 6]. On the other hand, linguistic modifications are seldom leveraged by SDS to improve utterance intelligibility, even though it is well known that speech perception in noise is significantly influenced by linguistic characteristics such as predictability [7, 8], word familiarity [9], neighborhood density [10], syntactic structure [11, 12], word order [13], etc. In particular, it has been shown that different types of noise affect some speech sounds more than others [14, 15, 16, 17]. This opens up the possibility of specifically choosing lexical items that are less affected by the interference of a specific type of noise.
To this end, we propose an alternative strategy based on linguistic modifications to improve speech perception in noise. More precisely, we utilize the potential of sentential paraphrases to represent the meaning of a message using lexical forms which exhibit better noise-robustness. One of the earlier approaches to utilizing linguistic forms to reduce word misperceptions in noise consisted of modeling phoneme confusions and pre-selecting less confusable words [18]. Although the proposed model predicts the position of potential confusions in short phrases (formed from a closed vocabulary), the applicability of this approach to conversational data has not yet been studied. Rational strategies like lexical/phrasal repetitions [1, 19] and insertion of clarification expressions [20] have also shown the possibility of improving speech perception in noise without acoustic modifications. However, the scope of such template-based strategies is limited, as they may lead to the generation of less natural-sounding and monotonous utterances. Compared to those earlier attempts, our work is more closely related to the study on rephrasing-based intelligibility enhancement [21]: Zhang et al. (2013) focused on developing an objective measure to distinguish phrases based on their intelligibility in noise.
In this paper, we concentrate on studying the impact of paraphrasing on utterance intelligibility at different levels of a given noise type. The current work is inspired by the earlier finding that lexical intelligibility in noise can be improved by replacing a word with a noise-robust synonym [22]. While this is an interesting finding, a sentence intelligibility improvement strategy based solely on lexical replacements is constrained by the availability of synonyms that fit the context of a given utterance. Hence, a paraphrase generation model was employed to include more generic types of sentential paraphrases.
Fig. 1. Architecture of the proposed solution to generate noise-adaptive utterances using linguistic modifications.
Speech perception experiments were conducted at
three different levels of babble noise: 5 dB, 0 dB, and -5 dB.
We collected data from 90 native English speakers regarding
their comprehension of 900 paraphrase pairs in the presence
of babble noise. To date, this constitutes the largest available
corpus of its kind.1
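For reference, mixing speech with babble at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the desired level. The sketch below is a minimal illustration of that step under assumed conventions (NumPy arrays of equal sampling rate, a hypothetical mix_at_snr helper); it is not the exact stimulus-preparation code used in the experiment.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to a speech signal at a target SNR (in dB)."""
    noise = np.resize(noise, speech.shape)   # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)          # average speech power
    p_noise = np.mean(noise ** 2)            # average noise power
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Stimuli for the three listening environments would then be produced as:
# for snr in (5, 0, -5):
#     stimulus = mix_at_snr(speech, babble, snr)
```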
Further, we investigated the influence of both linguistic and acoustic cues on intelligibility differences between paraphrases in noise. We utilized the speech intelligibility metric STOI [23] to capture the amount of acoustic cues that survive energetic masking in a noise-contaminated utterance. This metric also indicates the potential for listening in the dips [24], as noise-robust acoustic cues capture glimpses of the actual speech. Additionally, a pre-trained language model was used to estimate the predictability offered by linguistic cues in an utterance. Our modeling experiments reveal that the impact of paraphrasing on utterance intelligibility increases as the noise level increases. Also, we found that the observed gain in intelligibility is mainly introduced by paraphrases with noise-robust acoustic cues.
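As a concrete illustration of these two feature types, the sketch below scores one utterance with STOI (acoustic cues surviving the noise) and with language-model perplexity (linguistic predictability). It is a minimal sketch, not the feature pipeline used in this work; the pystoi and GPT-2 choices and the placeholder file names are illustrative assumptions.

```python
# Minimal sketch: score a noisy utterance acoustically (STOI) and linguistically (LM perplexity).
import soundfile as sf
import torch
from pystoi import stoi
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

clean, fs = sf.read("utterance_clean.wav")      # reference speech (placeholder file name)
noisy, _ = sf.read("utterance_in_babble.wav")   # same utterance mixed with babble, equal length

# STOI compares the degraded signal against the clean reference (higher = more intelligible).
acoustic_score = stoi(clean, noisy, fs, extended=False)

# Perplexity under a pre-trained LM as a proxy for linguistic predictability (lower = more predictable).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
enc = tokenizer("it doesn't hurt to have some kind of legal education.", return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
linguistic_score = torch.exp(loss).item()

print(f"STOI: {acoustic_score:.3f} | LM perplexity: {linguistic_score:.1f}")
```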
For instance, consider the following paraphrases (s1, s2), which are similar in linguistic predictability yet distinct in intelligibility in the presence of babble noise:
s1: it never hurts to have some kind of a grounding in law.
s2: it doesn't hurt to have some kind of legal education.
More concretely, at SNR 0 dB, listeners perceived s2 1.2 times better than s1. Further analysis of this difference in intelligibility showed that more acoustic cues survived the energetic masking in s2 than in s1.
Subsequently, our final step consists of demonstrating that it is possible to automatically predict which of a pair of paraphrases will be more robust to noise (see Section 6). As shown in Figure 1, such ranking models could further be deployed in the language generation module of an SDS to generate noise-adaptive utterances without signal distortions. Here, we assume that the noise in the user's environment is either known in advance or can be estimated.
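One simple way to realize such a ranker is to train a pairwise classifier on feature differences between the two paraphrases (e.g., STOI in the target noise, language-model score) and predict which member of the pair is more intelligible. The sketch below is a minimal pairwise-ranking baseline under these assumptions; it is not the ranking model proposed in the paper, and the toy feature values and the scikit-learn classifier are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each training pair is described by feature differences (s1 minus s2),
# e.g. [stoi_diff, lm_logprob_diff]; the label is 1 if listeners perceived
# s1 better than s2, else 0. The values below are toy placeholders.
X_train = np.array([
    [ 0.08,  0.3],
    [-0.05, -0.1],
    [ 0.02, -0.4],
    [-0.11,  0.2],
])
y_train = np.array([1, 0, 1, 0])

ranker = LogisticRegression().fit(X_train, y_train)

def rank_pair(features_s1, features_s2):
    """Return 0 if s1 is predicted to be more intelligible in noise, else 1."""
    diff = np.asarray(features_s1) - np.asarray(features_s2)
    return 0 if ranker.predict(diff.reshape(1, -1))[0] == 1 else 1

# Example: choose between two candidate realizations of the same message.
print(rank_pair([0.61, -42.0], [0.55, -44.5]))
```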
1Experiment data is released with an open-source license at:
https://github.com/SFB1102/A4-ParaphrasesinNoise.git
s1: They seem to give more of just the facts than opinions.
s2: They give more information than opinions.
s3: They seem to give more facts than opinions.

s1: You never hear about it really in the big ones.
s2: You don't hear much about it in the big ones.
s3: In the big ones you don't hear about it.

s1: It was a very close game and hard fought game.
s2: The game was close and hard fought.
s3: It was a very close game.

Table 1. A few examples of (s1, s2, s3) triplets in the PiN dataset.
2. SENTENTIAL PARAPHRASES
Paraphrases are phrases/sentences which represent similar semantics using different wording. However, this notion comes with the difficulty that two different sentences rarely have exactly the same meaning in all contexts; hence paraphrases, especially at the sentence level [25], typically only approximate the original meaning. On the one hand, generating sentential paraphrases which are exactly equivalent in semantics leads to trivial patterns such as word order changes or minimal lexical substitutions among paraphrases [26]. This, however, can mean that there is only a minimal difference in intelligibility in noise between such paraphrases. On the other hand, generation of non-trivial paraphrases introduces more lexical/syntactic diversity, and may hence have larger effects on intelligibility, but this in turn demands more scrutiny of semantic similarity [27].
In this paper, we hence explore the effect of paraphrases that approximate semantic equivalence rather than requiring strict semantic equivalence. In order to include a large variety of paraphrases, stimulus sentences were generated using a pre-trained text generation model [28, 29] which was fine-tuned on several paraphrase datasets such as Quora Question Pairs, PAWS [30], etc. As input sentences to the paraphrasing model, we selected a list of short sentences (10-12 words) from the dialogue corpus Switchboard [31]. After paraphrase generation, we employed automatic filtering to select the top two paraphrases for each input sentence, based on a semantic similarity score [32]. This resulted in a list of paraphrase triplets (s1, s2, s3), consisting of different paraphrase types formed by lexical replacements, changes in syntactic structure, etc. Since existing paraphrasing models lack domain knowledge of spoken data, a manual selection was performed to ensure the quality of the generated paraphrases in terms of semantic equivalence. Every paraphrase triplet was converted to three pairs: (s1, s2), (s2, s3) and (s1, s3). Then, every paraphrase pair was verified for closeness in semantics.
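The following sketch illustrates this generation-and-filtering step with off-the-shelf libraries. It is only an assumed reconstruction, not the pipeline used in this work: the paraphrase model name is a placeholder, and the sentence-transformers model used for the similarity score is an illustrative choice.

```python
# Minimal sketch: generate candidate paraphrases for a Switchboard sentence
# and keep the two most semantically similar ones, forming one (s1, s2, s3) triplet.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, util

PARAPHRASER = "your-finetuned-paraphrase-model"   # placeholder, not the model used in the paper
tokenizer = AutoTokenizer.from_pretrained(PARAPHRASER)
generator = AutoModelForSeq2SeqLM.from_pretrained(PARAPHRASER)
scorer = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative similarity model

source = "it never hurts to have some kind of a grounding in law."
inputs = tokenizer(source, return_tensors="pt")
outputs = generator.generate(**inputs, num_beams=10, num_return_sequences=10, max_length=32)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Rank candidates by cosine similarity to the source sentence and keep the top two.
src_emb = scorer.encode(source, convert_to_tensor=True)
cand_embs = scorer.encode(candidates, convert_to_tensor=True)
sims = util.cos_sim(src_emb, cand_embs)[0]
top_two = [candidates[int(i)] for i in sims.argsort(descending=True)[:2]]
triplet = (source, *top_two)
```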
We identified about 300 triplets that exhibited approximate semantic similarity in all three pairs. Those triplets were randomly split into three groups of 100 (one for each listening environment). Hereafter, we refer to this dataset as paraphrases in noise (PiN). Table 1 lists a few samples from PiN. To ensure that the sentential paraphrases in the PiN dataset are