
A DATA-DRIVEN INVESTIGATION OF NOISE-ADAPTIVE UTTERANCE
GENERATION WITH LINGUISTIC MODIFICATION
Anupama Chingacham, Vera Demberg, Dietrich Klakow
Saarland Informatics Campus, Saarland University, Germany
©Copyright 2023 IEEE. Published in the 2022 IEEE Spoken Language Technology Workshop (SLT) (SLT 2022), scheduled for 9-12 January 2023 in Doha,
Qatar. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating
new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained
from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA.
Telephone: + Intl. 908-562-3966.
ABSTRACT
In noisy environments, speech can be hard to understand
for humans. Spoken dialog systems can help to enhance the
intelligibility of their output, either by modifying the speech
synthesis (e.g., imitating Lombard speech) or by optimizing
the language generation. Here we focus on the second
approach, in which an intended message is realized with
words that are more intelligible in a specific noisy environment.
By conducting a speech perception experiment, we created
a dataset of 900 paraphrases in babble noise, perceived
by native English speakers with normal hearing. We find
that careful selection of paraphrases can improve intelligibil-
ity by 33% at SNR -5 dB. Our analysis of the data shows that
the intelligibility differences between paraphrases are mainly
driven by noise-robust acoustic cues. Furthermore, we pro-
pose an intelligibility-aware paraphrase ranking model, which
outperforms baseline models with a relative improvement of
31.37% at SNR -5 dB.
Index Terms—noise-adaptive speech, paraphrases
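As a reading aid for the SNR values quoted above, the following minimal sketch shows what presenting speech at a given signal-to-noise ratio means in terms of signal power. It is an illustration under stated assumptions, not the stimulus-generation procedure used in this work: white Gaussian noise stands in for babble noise, a sine tone stands in for speech, and the names `mean_power`, `mix_at_snr`, and `measure_snr_db` are hypothetical.

```python
import math
import random


def mean_power(signal):
    """Average power (mean squared amplitude) of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def measure_snr_db(speech, noise):
    """SNR in dB, defined as 10 * log10(P_speech / P_noise)."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` to reach `target_snr_db` relative to `speech`, then add it.

    Both inputs are equal-length sequences of float samples.
    Returns the mixture and the scaled noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # Solve target_snr_db = 10 * log10(p_speech / p_target) for p_target.
    p_target = p_speech / (10 ** (target_snr_db / 10))
    gain = math.sqrt(p_target / p_noise)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals (assumptions): a 220 Hz tone as "speech", Gaussian noise as "babble".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, -5.0)
achieved = measure_snr_db(speech, scaled_noise)
```

At SNR -5 dB the noise carries roughly 3.16 times the power of the speech, which is why this condition is a demanding one for listeners.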
1. INTRODUCTION
Over the past decade, speech-based interfaces have become
an increasingly common mode of human-machine interac-
tion. Today, spoken dialog systems (SDS) are part of several
systems such as those used for medical assistance, language
learning, navigation, and so on. To improve the performance
of SDS in the noisy conditions of daily life, earlier studies
have largely focused on speech enhancement for better automatic
speech recognition (ASR). There is considerably
less work on speech synthesis techniques that improve human
recognition in noise. Yet for SDS to behave in a more human-like
way, speech synthesis needs to adapt to
noisy conditions.
Prior work has shown that acoustic modifications can al-
ter the intelligibility of speech uttered in an adverse listen-
ing condition [1]. Synthesis of Lombard speech [2, 3], vowel
space expansion [4], speech rate reduction and insertion of
additional pauses [5] are some of the existing algorithmic
solutions to reduce the noise impact on the intelligibility of
synthesized speech. However, earlier studies have also highlighted
the counter-productive effect of signal distortions introduced
by some noise-reduction techniques [1, 6]. On the
other hand, linguistic modifications are seldom leveraged by
SDS to improve utterance intelligibility, even though it is
well known that speech perception in noise is significantly
influenced by linguistic characteristics such as predictability
[7, 8], word familiarity [9], neighborhood density [10],
syntactic structure [11, 12], and word order [13]. In particular,
it was shown earlier that different types of noise affect
some speech sounds more than others [14, 15, 16, 17]. This
opens the possibility to specifically choose lexical items that
are less affected by the interference of a specific type of noise.
To this end, we propose an alternative strategy based on linguistic
modifications to improve speech perception in noise.
More precisely, we utilize the potential of sentential paraphrases
to represent the meaning of a message using lexical
forms that exhibit better noise-robustness. One of the earlier
approaches to utilizing linguistic forms to reduce word
misperceptions in noise consisted of modeling phoneme confusions
and pre-selecting less confusable words [18]. Although
the model proposed there predicts the position of potential
confusions in short phrases (which are formed by a closed
vocabulary), the applicability of this approach to conversational
data has not yet been studied. Rational strategies like
lexical/phrasal repetitions [1, 19] and insertion of clarification
expressions [20] have also shown the possibility of improving
speech perception in noise without acoustic modifications.
However, the scope of such template-based strategies
is limited, as they may lead to the generation of less natural-sounding
and monotonous utterances. Compared to those earlier
attempts, our work is more closely related to the study on
rephrasing-based intelligibility enhancement [21]. Zhang et
al. (2013) focused on the development of an objective measure
to distinguish phrases based on their intelligibility in noise.
In this paper, we concentrate on studying the impact of
paraphrasing on utterance intelligibility at different levels of a
given noise type. The current work is inspired by the earlier finding
that lexical intelligibility in noise can be improved by replac-
ing a word with its noise-robust synonym [22]. While this
is an interesting finding, a sentence intelligibility improve-
ment strategy solely based on lexical replacements is con-
strained by the availability of synonyms that fit the context of
a given utterance. Hence, a paraphrase generation model was
employed to include more generic types of sentential para-
arXiv:2210.10252v1 [cs.CL] 19 Oct 2022