BLIND SIGNAL DEREVERBERATION FOR MACHINE SPEECH RECOGNITION
Samik Sadhu1, Hynek Hermansky1,2
1Center for Language and Speech Processing, Johns Hopkins University, USA
2Human Language Technology Center of Excellence, Johns Hopkins University, USA
ABSTRACT
We present a method to remove unknown convolutive noise introduced to
speech by reverberations of recording environments, utilizing some
amount of training speech data from the reverberant environment and any
available non-reverberant speech data. Using Fourier transforms computed
over long temporal windows, which ideally cover the entire room impulse
response, we convert room-induced convolution to addition in the log
spectral domain. Next, we compute a spectral normalization vector from
statistics gathered over reverberated as well as clean speech in the log
spectral domain. During operation, this normalization vector is used to
alleviate reverberation in complex speech spectra recorded under the
same reverberant conditions. The dereverberated complex speech spectra
are then used to compute complex FDLP-spectrograms for use in automatic
speech recognition.
Index Terms: blind dereverberation, robust speech recognition
1. INTRODUCTION
In many speech-to-text (STT) applications, the message-carrying speech
signal $s(t)$ is corrupted by room reverberations $n(t)$, yielding the
reverberated signal $o(t) = s(t) * n(t)$, where $*$ is the convolution
operator and $t$ denotes time. According to the convolution theorem of
the Fourier transform, convolution in the time domain turns into
multiplication in the spectral domain and thereby into addition in the
log spectral domain, as shown in equation 1.
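$$
\begin{aligned}
\log \mathcal{F}(o(t)) &= \log \mathcal{F}(s(t) * n(t)) \\
&= \log\left(\mathcal{F}(s(t)) \times \mathcal{F}(n(t))\right) \\
&= \log \mathcal{F}(s(t)) + \log \mathcal{F}(n(t))
\end{aligned}
\tag{1}
$$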
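Here $\mathcal{F}$ denotes the Fourier transform operator. Thus, for
known $n(t)$, the original signal $s(t)$ could easily be recovered as

$$
s(t) = \mathcal{F}^{-1}\left(\exp\left(\log \mathcal{F}(o(t)) - \log \mathcal{F}(n(t))\right)\right), \tag{2}
$$

where $\mathcal{F}^{-1}$ is the inverse Fourier transform operator.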
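As an illustration of equation 2 for the idealized case of a known impulse response, the following is a minimal NumPy sketch. The function name `deconvolve_known_response`, the toy noise signal, and the decaying impulse response are assumptions made for this example and are not taken from the paper. Both signals are zero-padded to a common DFT length so that the product of the spectra corresponds to linear convolution.

```python
import numpy as np

def deconvolve_known_response(o, n, n_fft=None):
    """Recover s from o = s * n (linear convolution) when n is known (Eq. 2)."""
    o = np.asarray(o, dtype=float)
    n = np.asarray(n, dtype=float)
    if n_fft is None:
        n_fft = len(o)  # long enough to hold the full linear convolution
    # Zero-pad both signals to the same DFT length.
    O = np.fft.fft(o, n_fft)
    N = np.fft.fft(n, n_fft)
    # Complex logs: real part is log magnitude, imaginary part is phase.
    log_s = np.log(O) - np.log(N)
    return np.fft.ifft(np.exp(log_s)).real

rng = np.random.default_rng(0)
s = rng.standard_normal(64)      # toy "clean" signal (assumption)
n = 0.6 ** np.arange(8)          # toy decaying impulse response (assumption)
o = np.convolve(s, n)            # observed signal, length 64 + 8 - 1 = 71
s_hat = deconvolve_known_response(o, n)[:len(s)]
```

In this sketch a single subtraction of principal-value complex logarithms is enough, because the exponential maps any 2π phase offset back to the same spectrum. The sketch also assumes that neither spectrum has an exactly zero bin.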
However, a few practical issues arise in this analysis.
• Even though $n(t)$ has infinite duration, in digital signal
processing it can only be represented as a finite-length discrete-time
signal.
• In order to use equation 2 with digital signals, we need equal-length
discrete Fourier transforms of the digitized versions of the signals
$o(t)$ and $n(t)$.
• As we shall show, arithmetic operations done in the log spectral
domain need phase unwrapping to remove phase ambiguity.
• $n(t)$ is typically not known.
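The phase ambiguity mentioned in the list above can be made concrete with a small numerical check. The sketch below uses a toy noise signal and a toy impulse response, both assumptions for illustration only. It compares the log magnitudes and the principal-value phases of the spectra on both sides of equation 1.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(1024)    # toy source signal (assumption)
n = 0.6 ** np.arange(8)          # toy impulse response (assumption)
n_fft = len(s) + len(n) - 1      # room for the full linear convolution

S = np.fft.fft(s, n_fft)
N = np.fft.fft(n, n_fft)
O = np.fft.fft(np.convolve(s, n), n_fft)

# Log magnitudes add exactly, up to rounding.
mag_gap = np.log(np.abs(O)) - (np.log(np.abs(S)) + np.log(np.abs(N)))

# Principal-value phases add only modulo 2*pi.
phase_gap = np.angle(O) - (np.angle(S) + np.angle(N))
wraps = np.round(phase_gap / (2 * np.pi))   # whole number of 2*pi jumps per bin
residual = phase_gap - 2 * np.pi * wraps    # what is left after removing the jumps
```

Here `mag_gap` and `residual` are numerically zero, while `wraps` is nonzero in some frequency bins. Such bin-dependent 2π offsets do no harm in a single subtraction, but they do not average out when log spectra are averaged over many utterances, which is one way to see why unwrapping matters for the statistics used later.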
2. PROPOSED TECHNIQUE
Suppose that the infinitely long impulse response $n(t)$ can be
approximated by its truncated digital version $n' = \{n'_k\}_{k=1}^{S}$.
Even though $n'$ is unknown, it can be assumed to be a constant vector
of real numbers.
Assume a digitized observed reverberated speech utterance
$o = \{o_k\}_{k=1}^{S+T-1}$, obtained by convolving a source signal
$s = \{s_k\}_{k=1}^{T}$ with $n'$. An appropriate number of zeros can be
appended to the signals to make them sequences of uniform length,
leading to the equation

$$
\log \mathcal{F}(o) = \log \mathcal{F}(s) + \log \mathcal{F}(n'), \tag{3}
$$

where $\mathcal{F}$ is the discrete Fourier transform operator. Since
$n'$ is assumed to be a constant vector, so is $\log \mathcal{F}(n')$.
Computing the expected values on both sides of equation 3, we get
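$$
E[\log \mathcal{F}(o)] = E[\log \mathcal{F}(s)] + \log \mathcal{F}(n'). \tag{4}
$$

Thus, an estimate of the unchanging logarithmic spectrum of the room
impulse response can simply be obtained as

$$
\log \mathcal{F}(n') = E[\log \mathcal{F}(o)] - E[\log \mathcal{F}(s)]. \tag{5}
$$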
Equation 5 forms the basis of our algorithm, where the expected values
are replaced with empirical averages computed over a finite number of
speech utterances to obtain an estimate $\phi$ of the log spectrum of
the room impulse response $\log \mathcal{F}(n')$. This estimate can be
used to normalize the log spectrum of the observed speech to estimate
the clean speech as

$$
\hat{s} = \mathcal{F}^{-1}\left(\exp\left(\log \mathcal{F}(o) - \phi\right)\right). \tag{6}
$$
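Equations 5 and 6 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`log_spectrum`, `estimate_phi`, `dereverberate`), the toy noise "utterances", and the decaying impulse response are all hypothetical. The use of `np.unwrap` along the frequency axis is only a simple stand-in for whatever phase unwrapping procedure the full method uses, which this section does not describe.

```python
import numpy as np

def log_spectrum(x, n_fft):
    """Complex log DFT: log magnitude + 1j * phase unwrapped along frequency."""
    X = np.fft.fft(x, n_fft)                 # zero-pads x to n_fft samples
    # np.unwrap is an assumed, naive choice of unwrapping.
    return np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))

def estimate_phi(reverberant, clean, n_fft):
    """Eq. 5 with expectations replaced by averages over utterances."""
    mean_o = np.mean([log_spectrum(o, n_fft) for o in reverberant], axis=0)
    mean_s = np.mean([log_spectrum(s, n_fft) for s in clean], axis=0)
    return mean_o - mean_s

def dereverberate(o, phi, n_fft):
    """Eq. 6: subtract phi in the log spectral domain and invert."""
    return np.fft.ifft(np.exp(log_spectrum(o, n_fft) - phi)).real

# Toy data (all values hypothetical): 20 noise "utterances", short decaying response.
rng = np.random.default_rng(0)
T, S = 256, 16
n_fft = T + S - 1                            # uniform length after zero-padding
room = 0.7 ** np.arange(S)
clean = [rng.standard_normal(T) for _ in range(20)]
reverberant = [np.convolve(s, room) for s in clean]

phi = estimate_phi(reverberant, clean, n_fft)
true_log_mag = np.log(np.abs(np.fft.fft(room, n_fft)))
max_log_mag_error = np.max(np.abs(phi.real - true_log_mag))
```

The demo reuses the same clean utterances on both sides only so that the log-magnitude part of `phi` can be compared against the known response. In the setting described above, the clean statistics may come from any available non-reverberant speech. The comparison is limited to log magnitude because naive unwrapping can leave 2π offsets that differ from utterance to utterance, so the imaginary part of `phi` from this sketch should not be relied upon.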
arXiv:2210.00117v1 [eess.AS] 30 Sep 2022