BLIND SIGNAL DEREVERBERATION FOR MACHINE SPEECH RECOGNITION
Samik Sadhu¹, Hynek Hermansky¹,²
¹Center for Language and Speech Processing, Johns Hopkins University, USA
²Human Language Technology Center of Excellence, Johns Hopkins University, USA
ABSTRACT
We present a method to remove unknown convolutive noise
introduced to speech by reverberations of recording
environments, utilizing some amount of training speech data
from the reverberant environment and any available
non-reverberant speech data. Using Fourier transforms
computed over long temporal windows, which ideally cover the
entire room impulse response, we convert room-induced
convolution to addition in the log spectral domain. Next, we
compute a spectral normalization vector from statistics
gathered over reverberated as well as clean speech in the log
spectral domain. During operation, this normalization vector
is used to alleviate reverberation in complex speech spectra
recorded under the same reverberant conditions. The
dereverberated complex speech spectra are then used to
compute complex FDLP-spectrograms for use in automatic
speech recognition.
Index Terms—blind dereverberation, robust speech
recognition
1. INTRODUCTION
In many speech-to-text (STT) applications, the
message-carrying speech signal $s(t)$ is corrupted by room
reverberations $n(t)$, yielding the reverberated signal
$o(t) = s(t) * n(t)$, where $*$ is the convolution operator
and $t$ denotes time. According to the convolution theorem
of the Fourier transform, convolution in the time domain
turns into multiplication in the spectral domain, and thereby
into addition in the log spectral domain, as shown in
equation 1.
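$$
\begin{aligned}
\log \mathcal{F}(o(t)) &= \log \mathcal{F}(s(t) * n(t)) \\
&= \log\big(\mathcal{F}(s(t)) \times \mathcal{F}(n(t))\big) \\
&= \log \mathcal{F}(s(t)) + \log \mathcal{F}(n(t))
\end{aligned}
\tag{1}
$$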
Here $\mathcal{F}$ indicates the Fourier transform operator.
Thus, for known $n(t)$, the original signal $s(t)$ could
easily be recovered as
$$
s(t) = \mathcal{F}^{-1}\big(\exp\big(\log \mathcal{F}(o(t)) - \log \mathcal{F}(n(t))\big)\big) \tag{2}
$$
where $\mathcal{F}^{-1}$ is the inverse Fourier transform
operator.
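As an illustration of equation 2, the following is a minimal NumPy sketch, assuming a known impulse response. The signals are synthetic stand-ins (white noise for speech, an exponentially decaying random sequence for the room), and the lengths `T` and `S` are hypothetical choices, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T, S = 400, 64                      # hypothetical signal and impulse-response lengths
s = rng.standard_normal(T)          # stand-in for the clean speech s
n = rng.standard_normal(S) * np.exp(-np.arange(S) / 10.0)  # toy decaying room response
o = np.convolve(s, n)               # reverberated signal, length T + S - 1

# Zero-pad both DFTs to the full linear-convolution length so that
# the DFT of o equals the product of the DFTs of s and n.
L = T + S - 1
log_O = np.log(np.fft.fft(o, L))    # complex (principal-value) log spectrum
log_N = np.log(np.fft.fft(n, L))

# Equation 2: subtract the log spectra, exponentiate, invert.
s_hat = np.fft.ifft(np.exp(log_O - log_N)).real[:T]

max_err = np.max(np.abs(s_hat - s))
print(f"max reconstruction error: {max_err:.2e}")
```

Because the exponential undoes the logarithm bin by bin, this single-utterance recovery is unaffected by phase wrapping; it is equivalent to dividing the two spectra.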
However, a few practical issues arise in this analysis.
• Even though $n(t)$ has infinite duration, in digital signal
processing it can only be represented as a finite-length
discrete-time signal.
• In order to use equation 2 with digital signals, we need
equal-length discrete Fourier transforms of the digitized
versions of the signals $o(t)$ and $n(t)$.
• As we shall show, arithmetic operations done in the log
spectral domain need phase unwrapping operations to
remove phase ambiguity.
• $n(t)$ is typically not known.
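The phase-ambiguity point in the list above can be illustrated with a small NumPy sketch. It uses random toy spectra (an assumption for illustration only) and compares the principal-value log of a product with the sum of principal-value logs: the real parts (log magnitudes) agree, while the imaginary parts (phases) agree only up to integer multiples of 2π.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 256                                      # hypothetical DFT length
S_f = np.fft.fft(rng.standard_normal(L))     # toy "speech" spectrum
N_f = np.fft.fft(rng.standard_normal(L))     # toy "room" spectrum

lhs = np.log(S_f * N_f)                      # principal-value log of the product
rhs = np.log(S_f) + np.log(N_f)              # sum of principal-value logs

# Log magnitudes add exactly (up to rounding).
real_gap = np.max(np.abs(lhs.real - rhs.real))

# Phases differ by whole turns: (rhs - lhs) / (2*pi) is an integer per bin.
wraps = (rhs.imag - lhs.imag) / (2 * np.pi)
num_wrapped_bins = int(np.count_nonzero(np.round(wraps)))

print(f"log-magnitude gap: {real_gap:.2e}")
print(f"bins with a 2*pi phase offset: {num_wrapped_bins} of {L}")
```

These per-bin integer offsets are harmless for a single subtraction followed by an exponential, but they do not cancel once log spectra are averaged over many utterances, which is why averaging requires unwrapped phase.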
2. PROPOSED TECHNIQUE
Suppose that the infinitely long impulse response $n(t)$ can
be approximated by its truncated digital version
$n' = \{n'_k\}_{k=1}^{S}$. Even though $n'$ is unknown, it
can be assumed to be a constant vector of real numbers.
Assume a digitized observed reverberated speech utterance
$o = \{o_k\}_{k=1}^{S+T-1}$, obtained by convolving a source
signal $s = \{s_k\}_{k=1}^{T}$ with the truncated impulse
response $n'$. An appropriate number of zeros can be appended
to the signals to make them sequences of uniform length,
leading to the equation
$$
\log \mathcal{F}(o) = \log \mathcal{F}(s) + \log \mathcal{F}(n') \tag{3}
$$
where $\mathcal{F}$ is the discrete Fourier transform
operator. Since $n'$ is assumed to be a constant vector, so
is $\log \mathcal{F}(n')$. Computing the expected values on
both sides of equation 3, we get
$$
\mathbb{E} \log \mathcal{F}(o) = \mathbb{E} \log \mathcal{F}(s) + \log \mathcal{F}(n') \tag{4}
$$
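The real (log-magnitude) part of equation 4 can be checked numerically. The sketch below is an illustration under stated assumptions: white-noise stand-ins for clean utterances, a toy decaying impulse response `n` playing the role of the truncated room response, hypothetical sizes, and expectations replaced by averages over the same set of utterances before and after reverberation.

```python
import numpy as np

rng = np.random.default_rng(2)
T, S, num_utts = 300, 40, 50        # hypothetical sizes
L = T + S - 1                       # common zero-padded DFT length

# Fixed toy room response (the constant vector in equations 3 and 4).
n = rng.standard_normal(S) * np.exp(-np.arange(S) / 8.0)
log_mag_N = np.log(np.abs(np.fft.fft(n, L)))

log_mag_S, log_mag_O = [], []
for _ in range(num_utts):
    s = rng.standard_normal(T)      # stand-in for one clean utterance
    o = np.convolve(s, n)           # its reverberated version, length L
    log_mag_S.append(np.log(np.abs(np.fft.fft(s, L))))
    log_mag_O.append(np.log(np.abs(np.fft.fft(o, L))))

# Empirical means over utterances, one value per frequency bin.
mean_S = np.mean(log_mag_S, axis=0)
mean_O = np.mean(log_mag_O, axis=0)

# Real part of equation 4: mean log|F(o)| = mean log|F(s)| + log|F(n)|.
gap = np.max(np.abs(mean_O - (mean_S + log_mag_N)))
print(f"largest deviation from equation 4 (log magnitude): {gap:.2e}")
```

With paired utterances the identity holds to rounding error in every bin. The imaginary (phase) part is deliberately left out here, since averaging phases is only meaningful after unwrapping.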
Thus, an estimate of the unchanging log spectrum of the
truncated room impulse response $n'$ can simply be obtained
as
$$
\log \mathcal{F}(n') = \mathbb{E} \log \mathcal{F}(o) - \mathbb{E} \log \mathcal{F}(s) \tag{5}
$$
Equation 5 forms the basis of our algorithm, where the
expected values are replaced with empirical means computed
over a finite number of speech utterances to obtain an
estimate $\phi$ of the log spectrum of the room impulse
response, $\log \mathcal{F}(n')$. This estimate can be used
to normalize the log spectrum of the observed speech and so
estimate the clean speech as
$$
\hat{s} = \mathcal{F}^{-1}\big(\exp\big(\log \mathcal{F}(o) - \phi\big)\big) \tag{6}
$$
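Equations 5 and 6 can be sketched as follows. This is a simplified illustration, not the implementation used in this paper: it estimates only the log-magnitude part of $\phi$ and keeps the observed phase, whereas the method described here operates on complex spectra with phase unwrapping. The data are synthetic (white noise for speech, a toy impulse response `n`), the clean and reverberant training sets contain different utterances, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, S, num_utts = 300, 40, 200       # hypothetical sizes
L = T + S - 1                       # common zero-padded DFT length

# Toy room response: unknown in practice, used here only to simulate data.
n = rng.standard_normal(S) * np.exp(-np.arange(S) / 8.0)

def mean_log_mag(utterances):
    """Empirical mean of log-magnitude spectra over a list of utterances."""
    return np.mean([np.log(np.abs(np.fft.fft(u, L))) for u in utterances], axis=0)

# Unpaired training data: clean utterances and different reverberated ones.
clean_train = [rng.standard_normal(T) for _ in range(num_utts)]
reverb_train = [np.convolve(rng.standard_normal(T), n) for _ in range(num_utts)]

# Equation 5 with empirical means (log-magnitude part only).
phi = mean_log_mag(reverb_train) - mean_log_mag(clean_train)

# Equation 6 on a held-out utterance; the observed phase is kept unchanged.
s_test = rng.standard_normal(T)
O = np.fft.fft(np.convolve(s_test, n), L)
S_hat = np.exp(np.log(np.abs(O)) - phi) * np.exp(1j * np.angle(O))
s_hat = np.fft.ifft(S_hat).real     # magnitude-normalized time signal, length L

# Log-magnitude error with respect to the clean spectrum, before and after.
log_mag_S = np.log(np.abs(np.fft.fft(s_test, L)))
err_before = np.mean(np.abs(np.log(np.abs(O)) - log_mag_S))
err_after = np.mean(np.abs(np.log(np.abs(S_hat)) - log_mag_S))
print(f"mean log-magnitude error before: {err_before:.3f}, after: {err_after:.3f}")
```

In this toy setting the clean and reverberant sets share the same source statistics, so the two empirical means differ mainly by the room term. With real speech, the quality of $\phi$ depends on how well the clean data match the speech recorded in the reverberant environment.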
arXiv:2210.00117v1 [eess.AS] 30 Sep 2022