mismatched data, the WER is significantly higher. Challenging aspects that lead to mismatched conditions include background noise, reverberation, encoding artifacts and microphone distortion. Even when it is possible to collect a small training dataset that better matches the test conditions, retraining an entire system to include the new data is usually very expensive. This effect is magnified when new mismatched data regularly needs to be decoded. When computational resources are limited, it is not feasible to regularly retrain models or to run multiple models in parallel.
To remain useful in many different, constantly changing environments, a good baseline ASR system requires an efficient adaptation technique. An attractive solution is feature compensation, because the adaptation technique can be replaced without changing the acoustic model [1]. The features of noisy audio are transformed or enhanced to increase the accuracy of the acoustic model [2]. Front-end processing techniques of this kind include masking time and frequency components [3], speech enhancement [4], deep denoising autoencoders [5] and generative adversarial network (GAN) based feature enhancement [6, 7]. An alternative to feature compensation is model compensation, where the acoustic model itself is adapted. Model compensation usually provides larger performance gains than feature compensation, but at a much higher computational cost [1].
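To make the idea concrete, the following is a minimal sketch, not taken from any of the cited systems, of feature compensation as a front-end: a small denoising autoencoder, in the spirit of [5], that maps noisy filter-bank frames to enhanced frames while the acoustic model is left untouched. The dimensions, architecture and random stand-in data are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal denoising-autoencoder front-end; all sizes are assumptions.
class FeatureEnhancer(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, noisy_feats):
        return self.net(noisy_feats)

enhancer = FeatureEnhancer()
optimiser = torch.optim.Adam(enhancer.parameters(), lr=1e-3)

noisy = torch.randn(32, 40)  # stand-in for noisy filter-bank frames
clean = torch.randn(32, 40)  # stand-in for parallel clean frames

# Train the front-end to reconstruct clean features from noisy ones;
# the acoustic model itself never changes.
optimiser.zero_grad()
loss = nn.functional.mse_loss(enhancer(noisy), clean)
loss.backward()
optimiser.step()
```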
Multi-style training (MTR) is one of the most popular and well-established techniques used to address mismatch by allowing the acoustic model to learn robust representations of the data. MTR aims to transform the training data to be more representative of the testing data. New training datasets are created from an existing set by applying a series of MTR styles. These can include changing the speed and volume [8], speech style [9] or sampling rate [10], adding time and frequency distortions [11] or background noise, and simulating reverberation [12]. The styles are typically chosen without knowledge of the testing conditions and must still enable the model to handle a wide variety of mismatch. In addition, the number of styles must be chosen carefully, because the computational cost of training an ASR system increases significantly with each added style.
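As an illustration of one common MTR style, the following sketch (our own, not from the cited work) mixes background noise into clean training speech at a randomly chosen signal-to-noise ratio (SNR); the arrays and signal length are stand-ins.

```python
import numpy as np

# Mix background noise into speech at a requested SNR. `speech` and
# `noise` are 1-D sample arrays; `noise` is assumed to be at least as
# long as `speech`.
def add_noise_at_snr(speech, noise, snr_db):
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power)
    # equals the requested SNR after mixing.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: corrupt an utterance at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for one second of speech
noise = rng.standard_normal(16000)   # stand-in for a noise recording
noisy = add_noise_at_snr(speech, noise, rng.uniform(5, 20))
```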
Recently, a number of GAN-based feature enhancement techniques have been proposed to improve ASR robustness to noise and reverberation [6, 7, 13]. A GAN framework uses two networks that compete in a min-max game and improve by learning from each other [14]. A generator network learns to create new samples using random noise as input. A second network, the discriminator, determines whether a sample came from the true data distribution or from the generator. The output of the discriminator is maximised for real samples and minimised for generated samples. The generator is trained using the output of the discriminator to improve the quality of the samples it generates. In ASR, the generator takes speech features or embeddings as input and aims to produce new features that are more robust to noise or reverberation [6, 7]. GAN-based feature enhancement techniques can achieve substantial performance improvements when a clean-trained baseline ASR system is used [6, 7, 13].
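Formally, the min-max game described above corresponds to the objective introduced in [14],
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right],
\]
where $D(x)$ is the probability the discriminator assigns to $x$ being a real sample and $G(z)$ is a sample generated from random noise $z$.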
In this work, we investigate the use of GANs to transform the acoustic features of mismatched audio in under-resourced environments. We want to improve performance on a new mismatched test set with a limited computational budget, provided that an ASR system trained on good-quality data already exists. Our GAN is trained to transform the acoustic features of mismatched audio so that they are better classified by the acoustic model. The technique can be applied to most acoustic features, including Mel-frequency cepstral coefficients (MFCCs), filter-banks and speaker-adapted transforms such as feature-space maximum likelihood linear regression (fMLLR). For the purposes of this paper, we focus specifically on WAV49-encoded audio [15, 16], a full-rate GSM codec with a compression ratio of 10:1 that is often used by South African call centres to store large volumes of telephone calls. Call centre speech analytics is an important application of ASR, and due to the large volumes of audio generated by call centres, some form of compression is often encountered. For this reason, we focus specifically on noisy speech (also typical of call centre data) encoded with WAV49.
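For illustration, the following is a minimal, hypothetical sketch of the kind of training loop described above: a generator transforms features of mismatched (e.g. WAV49-encoded) audio, a discriminator separates transformed features from good-quality features, and a frozen pre-trained acoustic model supplies a senone classification loss. All networks, dimensions and data here are placeholder assumptions, not our actual system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_senones, batch = 40, 2000, 32

# Placeholder networks: generator, discriminator and a frozen
# stand-in for the pre-trained acoustic model.
G = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
acoustic_model = nn.Linear(feat_dim, num_senones)
for p in acoustic_model.parameters():
    p.requires_grad = False

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

mismatched = torch.randn(batch, feat_dim)  # features of mismatched audio
clean = torch.randn(batch, feat_dim)       # features of good-quality audio
senone_targets = torch.randint(0, num_senones, (batch,))
real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

# Discriminator step: real features -> 1, transformed features -> 0.
opt_d.zero_grad()
d_loss = (F.binary_cross_entropy_with_logits(D(clean), real)
          + F.binary_cross_entropy_with_logits(D(G(mismatched).detach()), fake))
d_loss.backward()
opt_d.step()

# Generator step: fool the discriminator while also lowering the frozen
# acoustic model's senone classification loss on the transformed features.
opt_g.zero_grad()
enhanced = G(mismatched)
g_loss = (F.binary_cross_entropy_with_logits(D(enhanced), real)
          + F.cross_entropy(acoustic_model(enhanced), senone_targets))
g_loss.backward()
opt_g.step()
```

The acoustic-model loss term is what ties the transformation to recognition accuracy, rather than to feature-level similarity alone.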