EFFICIENT ACOUSTIC FEATURE TRANSFORMATION IN
MISMATCHED ENVIRONMENTS USING A GUIDED-GAN
Walter Heymans a,b, Marelie H. Davel a,b,c, and Charl van Heerden d
aFaculty of Engineering, North-West University, South Africa
https://engineering.nwu.ac.za/must
bCentre for Artificial Intelligence Research (CAIR), South Africa
cNational Institute for Theoretical and Computational Sciences (NITheCS), South Africa
dSaigen, South Africa
ABSTRACT
We propose a new framework to improve automatic speech recognition (ASR) systems
in resource-scarce environments using a generative adversarial network (GAN) operating
on acoustic input features. The GAN is used to enhance the features of mismatched
data prior to decoding, or can optionally be used to fine-tune the acoustic model. We
achieve improvements that are comparable to multi-style training (MTR), but at a lower
computational cost. With less than one hour of data, an ASR system trained on good quality
data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative
word error rate (WER). Experiments demonstrate that the framework can be very useful in
under-resourced environments where training data and computational resources are limited.
The GAN does not require parallel training data, because it utilises a baseline acoustic model
to provide an additional loss term that guides the generator to create acoustic features that
are better classified by the baseline.
Index Terms: speech recognition, generative adversarial networks, mismatched data, resource-scarce
environments
1 Introduction
Automatic speech recognition (ASR) systems generally struggle to perform well when there is a mismatch
between the data that was used to train the ASR system and new data that is decoded. This is especially true in
under-resourced environments, where it is often prohibitively expensive to collect enough transcribed training
data to cover even the most common application environments.
For some under-resourced languages, there are ASR systems that achieve low word error rates (WERs)
on audio similar to what the systems were trained with. However, when such a system is used to decode
This is a preprint - the final authenticated publication is available online at:
https://doi.org/10.1016/j.specom.2022.07.002
© 2022. This manuscript version is made available under the CC-BY-NC-ND 4.0 license:
https://creativecommons.org/licenses/by-nc-nd/4.0
E-mail addresses: walterheymans07@gmail.com (W. Heymans), marelie.davel@nwu.ac.za (M.H. Davel),
charl@saigen.com (C. van Heerden)
arXiv:2210.00721v3 [cs.SD] 6 Oct 2022
mismatched data, the WER is significantly higher. Some of the challenging aspects that lead to mismatched
conditions include background noise, reverberation, encoding artifacts and microphone distortion. In cases
where it is possible to collect a small dataset of training data that better matches the test conditions, it is
usually very expensive to retrain an entire system to include the new data. This effect is magnified when new
mismatched data regularly needs to be decoded. When computational resources are limited, it is not feasible to
regularly retrain models or have multiple models that run in parallel.
To be useful in many different, constantly changing environments, a good baseline ASR system requires
an efficient adaptation technique. An attractive solution is feature compensation
because the adaptation technique can be replaced without changing the acoustic model [1]. The features
of noisy audio are changed or enhanced to increase the accuracy of the acoustic model [2]. Methods that
are used as front-end processing techniques include masking time and frequency components [3], speech
enhancement [4], deep denoising autoencoders [5] and generative adversarial network (GAN) based feature
enhancement [6, 7]. An alternative to feature compensation is model compensation, where the acoustic
model is adapted in some way. Model compensation usually provides better performance gains than feature
compensation, but at a much higher computational cost [1].
Multi-style training (MTR) is one of the most popular and well-established techniques used to address
mismatch by allowing the acoustic model to learn robust representations of the data. MTR aims to transform
the training data to be more representative of the testing data. New training datasets are created from an
existing set by adding a series of MTR styles. These can include changing the speed and volume [8], speech
style [9] or sampling rate [10], adding time and frequency distortions [11] or background noise, and simulating
reverberation [12]. The styles are typically chosen without knowledge of the testing conditions and must
still be able to handle a wide variety of mismatch. In addition, the number of styles that are added must be
considered, because the computational cost of training an ASR system increases significantly with each style
that is added.
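As a concrete illustration of the style-based augmentation described above, the following sketch applies two common MTR styles, speed and volume perturbation [8], to a raw waveform. The function names and parameter values are ours, chosen for illustration; they are not taken from the paper or any specific toolkit.

```python
import numpy as np

def perturb_speed(waveform, factor):
    """Resample a 1-D waveform by linear interpolation to change its speed.

    factor > 1 speeds the audio up (fewer samples); factor < 1 slows it down.
    """
    n_out = int(round(len(waveform) / factor))
    old_idx = np.linspace(0, len(waveform) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(waveform)), waveform)

def perturb_volume(waveform, gain_db):
    """Scale the waveform by a gain given in decibels."""
    return waveform * (10.0 ** (gain_db / 20.0))

def make_mtr_copies(waveform, speeds=(0.9, 1.0, 1.1), gains_db=(-6.0, 0.0, 6.0)):
    """Create one perturbed copy per (speed, gain) style pair, as in MTR."""
    return [perturb_volume(perturb_speed(waveform, s), g)
            for s in speeds for g in gains_db]

# One second of a 440 Hz tone at 16 kHz stands in for a training utterance.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
copies = make_mtr_copies(audio)
```

Note that each added style multiplies the amount of training data, which is exactly the computational cost the paper highlights: nine styles here would make the training set roughly nine times larger.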
Recently, a number of GAN-based feature enhancement techniques have been proposed to improve ASR
robustness to noise and reverberation [6, 7, 13]. A GAN framework uses two networks that compete in a
min-max game and improve by learning from each other [14]. A generator network learns to create new
samples using random noise as input. Another network, called the discriminator, is used to determine if the
sample came from the true data distribution or the generator. The output of the discriminator is maximised
for real samples and minimised for generated samples. The generator is trained using the output from the
discriminator to improve the quality of the samples it generates. In ASR, the generator uses speech features or
embeddings as input and aims to produce new features that are more robust to noise or reverberation [6, 7].
GAN-based feature enhancement techniques can achieve very good performance improvements when a clean
trained baseline ASR system is used [6, 7, 13].
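The min-max game described above can be illustrated numerically with a fixed toy discriminator on scalar samples. The distributions, weights and values below are invented purely for illustration; a real GAN would of course learn both networks by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A fixed toy discriminator on scalars: D(x) = sigmoid(w*x + b).
w, b = 2.0, -1.0
def discriminator(x):
    return sigmoid(w * x + b)

real = rng.normal(loc=2.0, scale=0.5, size=1000)    # samples from the true distribution
fake = rng.normal(loc=-2.0, scale=0.5, size=1000)   # generator samples, far from real

# Discriminator objective [14]: maximise log D(real) + log(1 - D(fake)),
# i.e. output high on real samples and low on generated ones.
d_value = np.mean(np.log(discriminator(real))) + np.mean(np.log(1.0 - discriminator(fake)))

# Generator loss (non-saturating form): minimise -log D(fake); it stays
# large while the discriminator confidently rejects the generated samples.
g_loss = -np.mean(np.log(discriminator(fake)))
```

Because the two sample sets are well separated, this discriminator scores real samples above 0.5 and fakes below it, giving the generator a large loss to reduce.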
In this work, we investigate the use of GANs to transform acoustic features of mismatched audio in under-
resourced environments. We want to improve the performance on a new mismatched test set with a limited
computational budget, provided that an ASR system trained on good quality data already exists. Our GAN is
trained to transform acoustic features of mismatched audio to be better classified by the acoustic model. Our
technique can be applied to most acoustic features including Mel-frequency cepstral coefficients (MFCCs),
filter-banks and speaker-adapted transforms like feature-space maximum likelihood linear regression (fMLLR).
For the purposes of this paper, we focus specifically on WAV49-encoded audio [15, 16], a full-rate GSM codec
with a compression ratio of 10:1, often used by South African call centres to store large volumes of telephone
calls. Call centre speech analytics is an important application of ASR, and due to the large volumes of audio
generated by call centres, some form of compression is often encountered. This is why we specifically focus
on noisy speech (also typical of call centre data) encoded with WAV49.
2 Related work
GANs can improve both deep neural network hidden Markov model (DNN-HMM) [13] and end-to-end ASR
systems [6, 7] by creating indistinguishable representations between clean and noisy audio samples. A clean
corpus is augmented by adding noise or room impulse responses to create a new noisy corpus. GANs used
for DNN-HMM ASR systems normally operate on acoustic features (log-power spectra, MFCCs) [13], and
on log-Mel filterbank spectra or embeddings when used for end-to-end systems [6, 7]. The generator maps
the noisy features to the same representation as the corresponding clean features. The discriminator provides
feedback to the generator using the clean and noisy features as input. An L1 or mean squared error loss
between clean and noisy samples is minimised, which assists the generator with the mapping [6, 7]. This only
works when the noisy set is created from the clean set, because the loss requires both samples to be of the same
frames in a recording. All these approaches assume that the test conditions (noise or reverberation) are
known, and they are only as effective as the conditions that can be simulated from a clean dataset. Still,
GANs such as these that are used to improve features or embeddings have been shown to outperform data
augmentation and DNN-based enhancement techniques [6, 13].
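The paired mapping loss these methods rely on can be sketched as follows. The arrays and noise scale are illustrative stand-ins, assuming features have already been extracted frame-by-frame from parallel clean and noisy recordings; the key requirement is that row i of both matrices describes the same frame.

```python
import numpy as np

rng = np.random.default_rng(2)

# Paired features: each noisy frame corresponds to the same frame of the
# clean recording, which is only possible when the noisy set was created
# by corrupting the clean set.
clean = rng.normal(size=(8, 40))                        # clean acoustic features
enhanced = clean + rng.normal(scale=0.1, size=(8, 40))  # generator output

l1_loss = np.abs(enhanced - clean).mean()        # L1 mapping loss
mse_loss = ((enhanced - clean) ** 2).mean()      # mean squared error variant
```

When no clean twin of the noisy data exists, as in the mismatched scenario this paper targets, neither loss can be computed, which is exactly the gap the Guided-GAN addresses.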
These approaches are designed primarily for noise robustness and dereverberation, not necessarily mismatched
audio. It is extremely difficult to accurately produce a noisy dataset using only clean audio if the conditions
of the testing data are unknown. Sometimes it is possible to collect a small training dataset with matched
conditions to the new test set. In this scenario, the previous GAN-based enhancement techniques cannot be
used, because a parallel training set (clean version) does not exist.
An approach different from feature enhancement is to use a GAN to fine-tune an end-to-end ASR system [17].
The entire ASR model is treated as a generator network. The ASR system is trained normally until convergence
occurs. After training, a discriminator network is tasked to determine if the transcriptions came from the ASR
system or are ground truth. Since the ASR model is already trained, the output is very similar to the ground
truth, which makes the task very difficult for the discriminator. The GAN can improve output transcriptions of
an ASR model that has already converged. Adding the GAN-based fine-tuning technique yields consistent
WER improvements [17].
Previous work focused on noisy and reverberant speech, with little attention paid to compressed audio. None
of the GAN-based enhancement techniques are designed to work with a newly transcribed dataset that does
not have a corresponding clean version. We are not aware of any research that utilises GANs for acoustic
feature transformation in under-resourced ASR or call centre environments.
3 Guided-GAN
We use a GAN to improve an ASR system on mismatched (noisy and compressed) audio. Our approach can
use any GAN loss function, but changes the loss in order to utilise an existing ASR system to guide the GAN
training process. We refer to this architecture as a Guided-GAN.
Two datasets are used during GAN training. The first set can be a clean dataset or the same data used to
train the acoustic model (known conditions). The second set, which is used as input for the generator, is a
noisy or mismatched dataset that does not come from the same distribution as the clean set. Figure 1 shows a
diagram of the Guided-GAN training process. Acoustic features of the mismatched audio are given to the
generator as input. The generator then creates a new vector of the same dimension, which is used as input
to both a pre-trained DNN acoustic model and the discriminator. The negative log-likelihood (NLL) loss of
the acoustic model is added to the generator’s loss to assist it in generating realistic samples of the correct
class. The discriminator receives both clean (or known-condition) features and generated features as input; its
output is maximised for the former and minimised for the latter. The goal of the GAN is to reduce the performance difference
between data with known conditions and a new mismatched test set, for which a small transcribed training set
is available. When the system is used to evaluate a test set, only the generator network is used to transform the
noisy features before handing them over to the downstream acoustic model.
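As a rough illustration of the guiding idea, the following sketch combines a WGAN-style discriminator term with the NLL of a frozen acoustic model, as in Eq. (2) below. The linear "acoustic model", the feature and senone dimensions, and the value of λ are hypothetical stand-ins for the trained DNN described above, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Hypothetical frozen acoustic model: a single linear layer scoring 100
# senone classes from 40-d features (a stand-in for the trained DNN).
W_am = rng.normal(size=(40, 100)) * 0.1

def acoustic_nll(features, labels):
    """Mean negative log-likelihood of the senone labels under the frozen model."""
    logp = log_softmax(features @ W_am)
    return -logp[np.arange(len(labels)), labels].mean()

batch = rng.normal(size=(8, 40))       # generator output for 8 frames
labels = rng.integers(0, 100, size=8)  # senone labels of the noisy frames
d_scores = rng.normal(size=8)          # discriminator outputs on the batch

lam = 1.0
# Guided generator loss: WGAN term plus the acoustic-model guide term.
gen_loss = -d_scores.mean() + lam * acoustic_nll(batch, labels)
```

Only the guide term depends on the frozen acoustic model, so the baseline system never needs to be retrained; gradients flow through it into the generator alone.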
3.1 Loss functions
Any GAN loss function can be extended by adding the guiding term to the generator’s loss. We use three
different loss functions and select the one that is best suited for our approach. The first is an adapted version
of a spectral normalisation GAN (SN-GAN) [18] in which we extend the generator loss function with an
additional term. Specifically, we use the labels of the noisy sample to calculate the NLL loss of the cleaned
sample. This value is added to the loss function to guide the generator to create features that are more likely to
be classified correctly by the target acoustic model. An SN-GAN uses the same loss functions as a Wasserstein
GAN (WGAN), but uses spectral normalisation instead of weight clipping in the discriminator to satisfy the
Lipschitz constraint [18]. The standard WGAN loss for the generator is given by:

$$\mathcal{L}_G = -\mathbb{E}_{\hat{x} \sim p_g}[D(\hat{x})], \qquad (1)$$

where $p_g$ is the model distribution implicitly defined by $\hat{x} = G(\tilde{x})$, with $\tilde{x}$ the noisy input features, and $D(\hat{x})$
is the output of the discriminator, which can be a positive or negative value. We add the NLL loss to Eq. (1):

$$\mathcal{L}_G = -\mathbb{E}_{\hat{x} \sim p_g}[D(\hat{x})] - \lambda \cdot \mathbb{E}_{\hat{x}, \tilde{y} \sim p_{\text{data}}}\left[\log p_{\text{am}}(\tilde{y} \mid \hat{x})\right], \qquad (2)$$
where $\lambda$ is a hyperparameter, $p_{\text{am}}$ is the probability defined by the trained acoustic model, and $\tilde{y}$ is the senone
class label of the noisy features. The WGAN loss function for the discriminator is given by:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_d}[D(x)] + \mathbb{E}_{\hat{x} \sim p_g}[D(\hat{x})], \qquad (3)$$
where $p_d$ is the real distribution defined by the clean samples $x$. Spectral normalisation is used in the
discriminator to ensure that the Lipschitz constraint is satisfied [18]. We also compare the SN-GAN loss
functions to the standard non-saturating GAN (NS-GAN) [14] and the WGAN with gradient penalty (WGAN-GP) [19].
The NS-GAN loss function we use for the generator is:

$$\mathcal{L}_G^{\text{NS-GAN}} = -\mathbb{E}_{\hat{x} \sim p_g}[\log D(\hat{x})] - \lambda \cdot \mathbb{E}_{\hat{x}, \tilde{y} \sim p_{\text{data}}}\left[\log p_{\text{am}}(\tilde{y} \mid \hat{x})\right] \qquad (4)$$
with a corresponding discriminator loss function:

$$\mathcal{L}_D^{\text{NS-GAN}} = -\mathbb{E}_{x \sim p_d}[\log D(x)] - \mathbb{E}_{\hat{x} \sim p_g}[\log(1 - D(\hat{x}))]. \qquad (5)$$
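The sign conventions of the two discriminator losses, Eqs. (3) and (5), can be checked numerically. The score values below are arbitrary illustrations of a discriminator that already separates clean from generated features; both losses shrink as that separation improves.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy discriminator scores: higher for clean features than generated ones.
d_clean = np.array([2.0, 1.5, 2.5])   # D(x) for x ~ p_d
d_fake = np.array([-1.0, -2.0, 0.5])  # D(x_hat) for x_hat ~ p_g

# WGAN discriminator loss, Eq. (3): unbounded critic scores, no sigmoid.
l_d_wgan = -d_clean.mean() + d_fake.mean()

# NS-GAN discriminator loss, Eq. (5): scores squashed through a sigmoid
# so each expectation is a cross-entropy term.
l_d_nsgan = (-np.mean(np.log(sigmoid(d_clean)))
             - np.mean(np.log(1.0 - sigmoid(d_fake))))
```

The WGAN loss is unbounded below, which is why a Lipschitz constraint (spectral normalisation here) is needed on the critic, while the NS-GAN loss is bounded below by zero.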
Figure 1: Diagram of a Guided-GAN training process.