mismatched data, the WER is significantly higher. Challenging aspects that lead to mismatched conditions include background noise, reverberation, encoding artifacts and microphone distortion. Even when it is possible to collect a small training dataset that better matches the test conditions, retraining an entire system to include the new data is usually very expensive. This effect is magnified when new mismatched data regularly needs to be decoded. When computational resources are limited, it is not feasible to regularly retrain models or to run multiple models in parallel.
To remain useful in many different, constantly changing environments, a good baseline ASR system requires an efficient adaptation technique. An attractive solution is feature compensation, because the adaptation technique can be replaced without changing the acoustic model [1]. The features of noisy audio are transformed or enhanced to increase the accuracy of the acoustic model [2]. Front-end processing techniques of this kind include masking time and frequency components [3], speech enhancement [4], deep denoising autoencoders [5] and generative adversarial network (GAN) based feature enhancement [6, 7]. An alternative to feature compensation is model compensation, where the acoustic model itself is adapted. Model compensation usually provides larger performance gains than feature compensation, but at a much higher computational cost [1].
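To make the idea concrete, the following is a minimal sketch, not taken from any of the cited systems, of feature compensation as a front-end: a small denoising autoencoder, in the spirit of [5], that maps noisy filter-bank frames to enhanced frames while the acoustic model is left untouched. The dimensions, architecture and random stand-in data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal denoising-autoencoder front-end; all sizes are assumptions.
class FeatureEnhancer(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, noisy_feats):
        return self.net(noisy_feats)

enhancer = FeatureEnhancer()
optimiser = torch.optim.Adam(enhancer.parameters(), lr=1e-3)

noisy = torch.randn(32, 40)  # stand-in for noisy filter-bank frames
clean = torch.randn(32, 40)  # stand-in for parallel clean frames

# Train the front-end to reconstruct clean features from noisy ones;
# the acoustic model itself never changes.
optimiser.zero_grad()
loss = nn.functional.mse_loss(enhancer(noisy), clean)
loss.backward()
optimiser.step()
```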
Multi-style training (MTR) is one of the most popular and well-established techniques used to address mismatch by allowing the acoustic model to learn robust representations of the data. MTR aims to transform the training data to be more representative of the testing data. New training datasets are created from an existing set by applying a series of MTR styles. These can include changing the speed and volume [8], speech style [9] or sampling rate [10], adding time and frequency distortions [11] or background noise, and simulating reverberation [12]. The styles are typically chosen without knowledge of the testing conditions and must still enable the model to handle a wide variety of mismatch. In addition, the number of styles must be chosen carefully, because the computational cost of training an ASR system increases significantly with each added style.
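As an illustration of one common MTR style, the following sketch (our own, not from the cited work) mixes background noise into clean training speech at a randomly chosen signal-to-noise ratio (SNR); the arrays and signal length are stand-ins.

```python
import numpy as np

# Mix background noise into speech at a requested SNR. `speech` and
# `noise` are 1-D sample arrays; `noise` is assumed to be at least as
# long as `speech`.
def add_noise_at_snr(speech, noise, snr_db):
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power)
    # equals the requested SNR after mixing.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: corrupt an utterance at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for one second of speech
noise = rng.standard_normal(16000)   # stand-in for a noise recording
noisy = add_noise_at_snr(speech, noise, rng.uniform(5, 20))
```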
Recently, a number of GAN-based feature enhancement techniques have been proposed to improve ASR robustness to noise and reverberation [6, 7, 13]. A GAN framework uses two networks that compete in a min-max game and improve by learning from each other [14]. A generator network learns to create new samples using random noise as input. A second network, the discriminator, determines whether a sample came from the true data distribution or from the generator. The output of the discriminator is maximised for real samples and minimised for generated samples. The generator is trained using the output of the discriminator to improve the quality of the samples it generates. In ASR, the generator takes speech features or embeddings as input and aims to produce new features that are more robust to noise or reverberation [6, 7]. GAN-based feature enhancement techniques can achieve substantial performance improvements when a clean-trained baseline ASR system is used [6, 7, 13].
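Formally, the min-max game described above corresponds to the objective introduced in [14],
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right],
\]
where $D(x)$ is the probability the discriminator assigns to $x$ being a real sample and $G(z)$ is a sample generated from random noise $z$.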
In this work, we investigate the use of GANs to transform the acoustic features of mismatched audio in under-resourced environments. We want to improve performance on a new mismatched test set with a limited computational budget, provided that an ASR system trained on good-quality data already exists. Our GAN is trained to transform the acoustic features of mismatched audio so that they are better classified by the acoustic model. The technique can be applied to most acoustic features, including Mel-frequency cepstral coefficients (MFCCs), filter-banks and speaker-adapted transforms such as feature-space maximum likelihood linear regression (fMLLR). For the purposes of this paper, we focus specifically on WAV49-encoded audio [15, 16], a full-rate GSM codec with a compression ratio of 10:1 that is often used by South African call centres to store large volumes of telephone calls. Call centre speech analytics is an important application of ASR, and due to the large volumes of audio generated by call centres, some form of compression is often encountered. For this reason, we focus specifically on noisy speech (also typical of call centre data) encoded with WAV49.
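For illustration, the following is a minimal, hypothetical sketch of the kind of training loop described above: a generator transforms features of mismatched (e.g. WAV49-encoded) audio, a discriminator separates transformed features from good-quality features, and a frozen pre-trained acoustic model supplies a senone classification loss. All networks, dimensions and data here are placeholder assumptions, not our actual system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_senones, batch = 40, 2000, 32

# Placeholder networks: generator, discriminator and a frozen
# stand-in for the pre-trained acoustic model.
G = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
acoustic_model = nn.Linear(feat_dim, num_senones)
for p in acoustic_model.parameters():
    p.requires_grad = False

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

mismatched = torch.randn(batch, feat_dim)  # features of mismatched audio
clean = torch.randn(batch, feat_dim)       # features of good-quality audio
senone_targets = torch.randint(0, num_senones, (batch,))
real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

# Discriminator step: real features -> 1, transformed features -> 0.
opt_d.zero_grad()
d_loss = (F.binary_cross_entropy_with_logits(D(clean), real)
          + F.binary_cross_entropy_with_logits(D(G(mismatched).detach()), fake))
d_loss.backward()
opt_d.step()

# Generator step: fool the discriminator while also lowering the frozen
# acoustic model's senone classification loss on the transformed features.
opt_g.zero_grad()
enhanced = G(mismatched)
g_loss = (F.binary_cross_entropy_with_logits(D(enhanced), real)
          + F.cross_entropy(acoustic_model(enhanced), senone_targets))
g_loss.backward()
opt_g.step()
```

The acoustic-model loss term is what ties the transformation to recognition accuracy, rather than to feature-level similarity alone.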